5.7. Composite Cluster Switching, Failover and Recovery

Switching a dataservice transfers the Master role from one cluster to another, usually in a different datacenter site. As a result, the original Master becomes a Relay. If the master dataservice is offline, the master dataservice within a composite cluster can instead be forced to fail over to the slave dataservice.

Switching the master dataservice performs the following steps:

  1. Set the master node to offline state. New connections to the master are rejected, and writes to the master are stopped.

  2. On the relay in the target cluster, switch the datasource offline. New connections are rejected, stopping reads on this relay.

  3. Kill any outstanding client connections to the master data source, except those belonging to the tungsten account.

  4. Send a heartbeat transaction from the old master to the new master, and wait until this transaction has been received. Once it has been received, the THL on the master and the slave is up to date.

  5. Perform the switch:

    • Configure all remaining replicators offline.

    • Configure the target cluster relay node as the new master.

    • Set the new master to the online state.

    • New connections to the master are permitted.

  6. Configure the old master to be a relay datasource.

  7. Configure the slaves in the primary site to use the new master datasource.

  8. Configure the slaves in the slave site to use the new relay datasource.

  9. Update the connector configurations and enable client connections to connect to the masters and slaves.
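
The role changes in the steps above can be illustrated with a small, self-contained model. This is only a sketch for reasoning about the sequence, not Tungsten Cluster code: the Cluster class, the switch_composite function, and the cluster names "east" and "west" are all hypothetical, and the heartbeat and connector steps are reduced to log messages.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cluster:
    """Hypothetical model of one cluster inside a composite dataservice."""
    name: str
    role: str                       # "master" or "relay"
    online: bool = True
    upstream: Optional[str] = None  # name of the cluster this one replicates from


def switch_composite(clusters: Dict[str, Cluster], target: str) -> List[str]:
    """Move the master role to `target` and return a log of the steps taken."""
    old = next(c for c in clusters.values() if c.role == "master")
    new = clusters[target]
    if new is old:
        raise ValueError(f"{target} is already the master cluster")

    log = []
    # Steps 1-3: take both ends offline so no new writes or reads arrive.
    old.online = False
    new.online = False
    log.append(f"{old.name} and {new.name} set offline")
    # Step 4: the heartbeat wait is only recorded here, not simulated.
    log.append("heartbeat received, THL up to date")
    # Step 5: promote the target relay and bring it online.
    new.role, new.upstream, new.online = "master", None, True
    log.append(f"{new.name} promoted to master")
    # Step 6: the old master becomes a relay of the new master.
    old.role, old.upstream, old.online = "relay", new.name, True
    log.append(f"{old.name} reconfigured as relay of {new.name}")
    return log


clusters = {
    "east": Cluster("east", "master"),
    "west": Cluster("west", "relay", upstream="east"),
}
steps_taken = switch_composite(clusters, "west")
```

After the call, "west" holds the master role and "east" is a relay that replicates from "west", mirroring steps 5 and 6.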

The switching process is monitored by Tungsten Cluster. If the process fails, either because of a timeout or because a recoverable error occurs, the switch operation is rolled back, returning the dataservice to its original configuration. This ensures that the dataservice remains operational. In some circumstances, when performing a manual switch, the command may need to be repeated to ensure that the requested switch operation completes.
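
The rollback behavior described above follows a common pattern: record the original configuration, apply the steps, and restore the record if any step fails. The sketch below shows that pattern in general terms. It is an assumption about the shape of the logic, not the product's implementation; SwitchError, run_with_rollback, and the configuration keys are hypothetical names.

```python
import copy


class SwitchError(Exception):
    """Hypothetical error raised when a step times out or fails recoverably."""


def run_with_rollback(config, steps):
    """Apply each step to `config`; restore the original on SwitchError.

    Returns True if every step succeeded, False if the switch was rolled back.
    """
    snapshot = copy.deepcopy(config)  # the original configuration
    try:
        for step in steps:
            step(config)
    except SwitchError:
        # Put the original configuration back so the service stays usable.
        config.clear()
        config.update(snapshot)
        return False
    return True


def promote_west(config):
    config["master"] = "west"


def failing_step(config):
    raise SwitchError("timed out waiting for heartbeat")


config = {"master": "east", "relay": "west"}
succeeded = run_with_rollback(config, [promote_west, failing_step])
```

Here the second step fails, so the function returns False and config still names "east" as the master. A caller could then repeat the operation, as the text notes may be necessary for a manual switch.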

The process takes a finite amount of time to complete. The exact timing and duration depend on the state, health, and database activity of the dataservice, and in particular on how up to date the slave being promoted is compared to the master. After a delay period, the switch takes place regardless of the current status.
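
One way to picture the dependence on replication progress is a bounded wait: poll the promoted slave's applied position until it reaches the master's position or a deadline passes, then continue either way. The sketch below is an illustration under that assumption only. The function name, its parameters, and the use of a sequence number as the position are hypothetical and do not describe how Tungsten Cluster measures progress internally.

```python
import time


def wait_for_catch_up(get_applied_seqno, target_seqno, timeout_s, poll_s=0.1):
    """Poll until the applied position reaches the target or the delay expires.

    Returns True if the slave caught up in time, False if the delay elapsed
    first. The caller proceeds in both cases.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_applied_seqno() >= target_seqno:
            return True
        time.sleep(poll_s)
    # Final check after the delay period has elapsed.
    return get_applied_seqno() >= target_seqno


# A slave that catches up on the third poll.
positions = iter([1, 2, 5, 5])
caught_up = wait_for_catch_up(lambda: next(positions), 5, timeout_s=1.0, poll_s=0.0)

# A slave that never catches up within the delay period.
stalled = wait_for_catch_up(lambda: 0, 5, timeout_s=0.05, poll_s=0.01)
```

A slave that is nearly current returns quickly, while a lagging slave consumes the whole delay period, which is why the total duration varies with database activity.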