8.4.2. Switch and Failover Steps for Local Clusters

The steps described below apply both to manually initiated commands and to commands initiated by the Tungsten Manager rules as part of an automated recovery scenario.
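For reference, a manually initiated switch is typically issued from the cctrl shell, while a failover is normally triggered by the manager rules themselves. A minimal sketch, assuming a cluster named alpha and a hypothetical target host2:

    shell> cctrl
    [LOGICAL] /alpha > switch to host2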

Important

A failure in any of the steps below will result in a rollback of the operation to the starting state.

Once a switch or failover operation for a local cluster is triggered, the following steps are taken:

  1. FAILOVER ONLY: THE CURRENT PRIMARY WILL BE MARKED AS FAILED. TUNGSTEN WILL NOT ALLOW ANY CONNECTIONS TO A FAILED PRIMARY. APPLICATIONS WILL APPEAR TO HANG.

  2. FAILOVER ONLY: Put the cluster into maintenance mode. Automatic policy mode will be restored after the failover operation is complete or if the operation is rolled back.

    SWITCH ONLY: Record the current policy mode and then put the cluster into maintenance mode if it isn't already. The recorded policy mode will be restored after the switch operation is complete or if the operation is rolled back.

  3. Verify that the manager running on the target is operational.

  4. Verify that the replicator on the target is in the online state.

  5. SWITCH ONLY: Verify that the manager on the source is operational.

  6. SWITCH ONLY: Verify that the replicator on the source is in the online state.

  7. SWITCH ONLY: Set the source datasource to the offline state. This operation performs the following steps; by the time it completes, all managers and connectors have an updated copy of the primary/relay datasource showing its state as offline. (A manual equivalent of this step and the maintenance-mode step is shown in the example after this list.)

    • Update the on-disk datasource properties files, stored in cluster-home/conf/cluster/<cluster-name>/datasource/<datasource-name>, on all managers simultaneously.

    • Update the same datasource on all connectors, simultaneously, by calling the router gateway within each manager. Note that this call is synchronous and will not return until ALL connectors have suspended all new requests to connect to the datasource and have closed all active connections for it. For this reason, completion of the datasource offline call can be delayed, depending on how long the connectors wait before closing active connections.

  8. SWITCH ONLY: AT THIS POINT NO MORE CONNECTIONS TO THE PRIMARY ARE POSSIBLE. APPLICATIONS WILL APPEAR TO HANG IN PROXY MODE.

  9. Set the target datasource to the offline state. This performs the same set of steps as for the source datasource, above.

At this point in the switch operation, both the source and the target datasource are in the offline state, there should be no active connections to the underlying database servers for these datasources, and no new connections are being allowed to either datasource. At the connector level, if an application requests a new connection to the primary or to a replica, the call will hang until the switch operation is complete. NOTE: the only replica that is put into the offline state is the target. Any other replicas remain available to handle read operations although, because the primary datasource is offline, replication will start to lag.
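For illustration, the maintenance-mode and datasource-offline steps above correspond to the following cctrl commands when performed by hand; the cluster name alpha and host name host1 are hypothetical:

    shell> cctrl
    [LOGICAL] /alpha > set policy maintenance
    [LOGICAL] /alpha > datasource host1 offline

The matching set policy automatic and datasource host1 online commands restore normal operation once the work is complete.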

The next set of steps ensures that the target database server has all of the transactions that have been written to the current source database server.

  1. SWITCH ONLY: Perform a replicator purge operation on the source. This operation identifies any database server threads that may still be running on the source database server, even after the application-level connections have been closed, and kills those threads. Experience shows that such threads can linger and, if they were not killed, transactions could 'leak' through to the current source and be lost.

  2. SWITCH ONLY: Flush all transactions from the source to the target. This involves performing a replicator flush operation, which returns the sequence number of the flush transaction, and then polling the replicator on the target for that sequence number. This call will block indefinitely, waiting for the flush sequence number (see the sketch after this list).

  3. Put the replicator for the target into the offline state.

  4. SWITCH ONLY: Put the replicator for the source into the offline state.

  5. SWITCH ONLY: Set the source datasource to the replica role.

  6. Set the target datasource to the primary or relay role.

  7. Set the target replicator to the primary or relay role.

  8. Set the source replicator to the replica role.

  9. FAILOVER ONLY: Set the source datasource to the shunned state.

  10. Put the replicator for the target into the online state.

  11. Put the datasource for the target into the online state.

  12. AT THIS POINT, THE NEW PRIMARY OR RELAY IS AVAILABLE TO APPLICATIONS.

  13. SWITCH ONLY: Put the source datasource, which is now a replica, into the online state.

  14. Iterate through the remaining replicas in the cluster, if any, and reconfigure them to point at the new primary/relay.

    NOTE: This step also includes reconfiguring a relay replicator on another site if the service on which the switch or failover occurred is a primary service of a composite cluster.

  15. Issue a replicator heartbeat. This writes a transaction on the primary, which will propagate to all replicas. We do not wait for this heartbeat event to propagate (see the examples after this list).
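As a sketch of the purge and flush-and-wait steps above when performed by hand with trepctl, assuming hypothetical hosts host1 (source) and host2 (target); the sequence number shown is illustrative, and the purge command will prompt for confirmation:

    # Kill any lingering non-Tungsten sessions on the source database server:
    shell> trepctl -host host1 purge

    # Flush the source; the reported sequence number marks the flush transaction:
    shell> trepctl -host host1 flush
    Master log is synchronized with database at log sequence number: 3193

    # Block until the target has applied that sequence number:
    shell> trepctl -host host2 wait -applied 3193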
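The role changes and the final heartbeat likewise have command-line equivalents, sketched below with the same hypothetical hosts; in normal operation these steps are driven by the manager rather than issued by hand, and they require the replicators to be offline, as in steps 3 and 4 above:

    # Promote the target replicator and demote the source replicator:
    shell> trepctl -host host2 setrole -role master
    shell> trepctl -host host1 setrole -role slave -uri thl://host2/

    # Write a heartbeat transaction on the new primary; propagation is not waited on:
    shell> trepctl -host host2 heartbeat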