6.7. Composite Cluster Switching, Failover and Recovery

Switching a dataservice transfers the Active role from one cluster to another, usually at a different datacenter site. As a side effect, the original Primary node becomes a Relay node. If the Active dataservice is offline, the Active role within a composite cluster can instead be forced to fail over to the Passive dataservice.

Switching the Active dataservice performs the following steps:

  1. Set the Primary node to the offline state. New connections to the Primary are rejected, and writes to the Primary are stopped.

  2. On the relay in the target cluster, switch the datasource offline. New connections are rejected, stopping reads on this relay.

  3. Kill any outstanding client connections to the Primary data source, except those belonging to the tungsten account.

  4. Send a heartbeat transaction between the old Primary and the new Primary, and wait until this transaction has been received. Once it has been received, the THL on both nodes is up to date.

  5. Perform the switch:

    • Take all remaining replicators offline.

    • Configure the target cluster relay node as the new Primary.

    • Set the new Primary to the online state.

    • New connections to the Primary are permitted.

  6. Configure the old Primary to be a relay datasource.

  7. Configure the Replicas in the primary site to use the new Primary datasource.

  8. Configure the Replicas in the Replica site to use the new relay datasource.

  9. Update the connector configurations and enable client connections to connect to the Primaries and Replicas.
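The role changes in the steps above can be sketched as a small in-memory model. This is an illustration only, not Tungsten Cluster code: the `Node` class, the `switch_active` function, and the node names are hypothetical. The sketch also assumes that "primary site" in step 7 means the cluster that holds the Primary after the switch, and it omits the connection handling and the heartbeat wait in steps 3 and 4.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one datasource in a cluster."""
    name: str
    role: str          # "primary", "relay", or "replica"
    source: str = ""   # node this one replicates from; empty for the Primary
    online: bool = True


def switch_active(active, passive):
    """Move the Active role from `active` to `passive` (lists of Node).

    Returns (new_active, new_passive).
    """
    old_primary = next(n for n in active if n.role == "primary")
    target = next(n for n in passive if n.role == "relay")

    # Steps 1-2: take the Primary and the target relay offline.
    old_primary.online = False
    target.online = False

    # Step 5: take the remaining nodes offline, then promote the relay.
    for node in active + passive:
        node.online = False
    target.role, target.source = "primary", ""
    target.online = True

    # Step 6: the old Primary becomes a relay fed by the new Primary.
    old_primary.role, old_primary.source = "relay", target.name

    # Step 7: Replicas in the newly Active cluster follow the new Primary.
    for node in passive:
        if node.role == "replica":
            node.source = target.name

    # Step 8: Replicas in the newly Passive cluster follow the new relay.
    for node in active:
        if node.role == "replica":
            node.source = old_primary.name

    # Step 9 (simplified): bring every node back online for clients.
    for node in active + passive:
        node.online = True
    return passive, active


# Example with two hypothetical two-node clusters.
east = [Node("db1", "primary"), Node("db2", "replica", "db1")]
west = [Node("db4", "relay", "db1"), Node("db5", "replica", "db4")]
new_active, new_passive = switch_active(east, west)
for node in new_active + new_passive:
    print(node.name, node.role, node.source or "-")
```

In this model the Replicas keep the same upstream node by name; only the roles of the two upstream nodes are exchanged, which is why the old Primary ends up as the relay for its own site.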

The switching process is monitored by Tungsten Cluster. If the process fails, either because of a timeout or because a recoverable error occurs, the switch operation is rolled back, returning the dataservice to its original configuration. This ensures that the dataservice remains operational. In some circumstances, when performing a manual switch, the command may need to be repeated to ensure that the requested switch operation completes.
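The rollback behavior can be pictured with a snapshot-and-restore pattern. The following is a minimal sketch under stated assumptions, not the mechanism Tungsten Cluster uses internally: `SwitchError`, `switch_with_rollback`, and `switch_with_retry` are hypothetical names, and the configuration is reduced to a plain dictionary.

```python
import copy


class SwitchError(Exception):
    """Hypothetical recoverable failure, for example a timeout."""


def switch_with_rollback(config, steps):
    """Apply each step to `config` in order; restore the original on failure.

    Returns True if every step succeeded, False if the change was rolled back.
    """
    snapshot = copy.deepcopy(config)
    try:
        for step in steps:
            step(config)
    except SwitchError:
        # Put the configuration back exactly as it was before the switch.
        config.clear()
        config.update(snapshot)
        return False
    return True


def switch_with_retry(config, steps, attempts=2):
    """Repeat the switch, as an operator might repeat a manual command."""
    for _ in range(attempts):
        if switch_with_rollback(config, steps):
            return True
    return False


def promote_west(config):
    config["west"] = "active"
    config["east"] = "passive"


def fail_with_timeout(config):
    raise SwitchError("heartbeat not received in time")


roles = {"east": "active", "west": "passive"}
switch_with_rollback(roles, [promote_west, fail_with_timeout])
print(roles)  # the original roles are restored after the failed attempt
```

Because the snapshot is taken before the first step runs, a failure at any point leaves the dictionary in its pre-switch state, which mirrors the guarantee that the dataservice stays operational.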

The switch takes a finite amount of time to complete. The exact timing and duration depend on the state, health, and database activity of the dataservice, and in particular on how up to date the Replica being promoted is compared to the Primary. After a delay period, the switch takes place regardless of the current status.