Composite Cluster Switching, Failover and Recovery
Switching a dataservice transfers the Active role from one cluster to another, usually in a different datacenter site. As a side effect, the original Primary node becomes a Relay node. If the Active dataservice is offline, the Active role within a composite cluster can instead be forced to fail over to the Passive dataservice.
Switching the Active dataservice performs the following steps:
- Set the Primary node to offline state. New connections to the Primary are rejected, and writes to the Primary are stopped.
- On the relay in the target cluster, switch the datasource offline. New connections are rejected, stopping reads on this relay.
- Kill any outstanding client connections to the Primary data source, except those belonging to the tungsten account.
- Send a heartbeat transaction between the old Primary and the new Primary, and wait until this transaction has been received. Once received, the THL on Primary and Replica are up to date.
- Perform the switch:
- Configure all remaining replicators offline.
- Configure the target cluster relay node as the new Primary.
- Set the new Primary to the online state.
- New connections to the Primary are permitted.
- Configure the old Primary to be a relay datasource.
- Configure the Replicas in the primary site to use the new Primary datasource.
- Configure the Replicas in the Replica site to use the new relay datasource.
- Update the connector configurations and enable client connections to connect to the Primaries and Replicas.
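The role changes in the steps above can be pictured with a small toy model. This is only an illustrative sketch, not how Tungsten Cluster implements the switch: the `Node` class and `switch_roles` function are hypothetical names, and the model assumes exactly one Primary and one relay in the composite cluster.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cluster: str
    role: str               # "master", "relay" or "slave"
    source: Optional[str]   # node this one replicates from; None for the Primary


def switch_roles(nodes: List[Node]) -> List[Node]:
    """Return the topology after a composite switch (illustrative only)."""
    old_primary = next(n for n in nodes if n.role == "master")
    relay = next(n for n in nodes if n.role == "relay")
    result = []
    for n in nodes:
        if n is relay:
            # The relay in the target cluster becomes the new Primary.
            result.append(Node(n.name, n.cluster, "master", None))
        elif n is old_primary:
            # The old Primary becomes a relay fed by the new Primary.
            result.append(Node(n.name, n.cluster, "relay", relay.name))
        else:
            # Replicas keep following the head node of their own cluster,
            # which is now either the new Primary or the new relay.
            result.append(Node(n.name, n.cluster, "slave", n.source))
    return result
```

With the example hosts used later in this section, passing in db1 as master and db4 as relay yields db4 as master and db1 as a relay sourcing from db4, while each Replica keeps the same local source.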
The switching process is monitored by Tungsten Cluster. If the process fails, either because of a timeout or because a recoverable error occurs, the switch operation is rolled back, returning the dataservice to its original configuration. This ensures that the dataservice remains operational. In some circumstances, when performing a manual switch, the command may need to be repeated to ensure the requested switch operation completes.
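The rollback behaviour described here follows a general savepoint pattern: record the configuration, attempt the steps, and restore the recorded copy on failure. The sketch below illustrates that pattern only. It is not Tungsten Cluster code, and `run_with_rollback` and the step callables are hypothetical names chosen for the example.

```python
import copy


def run_with_rollback(config, steps):
    """Apply steps in order; return the saved copy if any step fails.

    Returns a (configuration, succeeded) pair. The caller's `config`
    object is never modified.
    """
    savepoint = copy.deepcopy(config)   # snapshot taken before any change
    working = copy.deepcopy(config)
    try:
        for step in steps:
            step(working)               # each step mutates the working copy
    except (TimeoutError, RuntimeError):
        # A timeout or recoverable error: hand back the original configuration.
        return savepoint, False
    return working, True
```

A caller that receives `False` can simply repeat the operation, which mirrors the advice above that a manual switch may need to be issued again.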
The process takes a finite amount of time to complete. The exact duration depends on the state, health, and database activity of the dataservice, and in particular on how up to date the Replica being promoted is compared to the Primary. After a delay period, the switch takes place regardless of the current status.
Composite Cluster Site Switch
These steps only apply to Composite Active/Passive clusters.
Our example cluster has two sites, alpha and beta. They are both members of composite cluster global. Site alpha
has hosts db1, db2 and db3. Site beta has hosts db4, db5 and db6.
When working with composite clusters, cctrl connects by default to the cluster service associated with the node
on which you launched the tool. You can move between clusters with the use command, as shown in the examples below.
shell> cctrl
Tungsten Clustering 8.0.4 Build 132
alpha: session established, encryption=false, authentication=false
jgroups: unencrypted, database: unencrypted
[LOGICAL] /alpha > use global
[LOGICAL] /global > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
alpha:COORDINATOR[db2:AUTOMATIC:ONLINE]
beta:COORDINATOR[db5:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=2, active=0) |
|connector@db2[75463](ONLINE, created=2, active=0) |
|connector@db3[43981](ONLINE, created=2, active=0) |
|connector@db4[59575](ONLINE, created=2, active=0) |
|connector@db5[59581](ONLINE, created=2, active=0) |
|connector@db6[59847](ONLINE, created=2, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite master:ONLINE) |
|STATUS [OK] [2025/01/28 11:35:18 AM UTC] |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite slave:ONLINE) |
|STATUS [OK] [2025/01/28 11:35:13 AM UTC] |
+---------------------------------------------------------------------------------+
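For scripted checks, the composite entries in the DATASOURCES block of this ls output can be picked out with a regular expression. The following is a minimal sketch that assumes the text layout shown in this section. `parse_composite` is a hypothetical helper, not part of cctrl, and the pattern may need adjusting for other releases.

```python
import re

# Matches lines such as "|alpha(composite master:ONLINE) |" and
# "|beta(composite master:SHUNNED(FAILSAFE_SHUN)) |".
COMPOSITE_RE = re.compile(
    r"^\|(?P<name>\w+)\(composite (?P<role>\w+):"
    r"(?P<state>\w+)(?:\((?P<reason>[\w-]+)\))?\)"
)


def parse_composite(ls_output):
    """Return one dict per composite data source found in `ls` text.

    Each dict has the keys name, role, state and reason; reason is None
    when the state carries no parenthesised qualifier.
    """
    found = []
    for line in ls_output.splitlines():
        match = COMPOSITE_RE.match(line.strip())
        if match:
            found.append(match.groupdict())
    return found
```

Applied to the output above, this would report alpha with role master and beta with role slave, both in the ONLINE state.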
Composite Active Dataservice (Primary) - alpha
[LOGICAL] /global > use alpha
[LOGICAL] /alpha > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=0, active=0) |
|connector@db2[75463](ONLINE, created=0, active=0) |
|connector@db3[43981](ONLINE, created=0, active=0) |
|connector@db4[59575](ONLINE, created=0, active=0) |
|connector@db5[59581](ONLINE, created=0, active=0) |
|connector@db6[59847](ONLINE, created=0, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(master:ONLINE, progress=0, THL latency=1.061) |
|STATUS [OK] [2025/01/28 11:35:02 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=0, latency=1.691) |
|STATUS [OK] [2025/01/28 11:35:00 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(slave:ONLINE, progress=0, latency=1.382) |
|STATUS [OK] [2025/01/28 11:35:02 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
Composite Passive Dataservice (DR) - beta
[LOGICAL] /alpha > use beta
[LOGICAL] /beta > ls
COORDINATOR[db5:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=2, active=0) |
|connector@db2[75463](ONLINE, created=2, active=0) |
|connector@db3[43981](ONLINE, created=2, active=0) |
|connector@db4[59575](ONLINE, created=2, active=0) |
|connector@db5[59581](ONLINE, created=2, active=0) |
|connector@db6[59847](ONLINE, created=2, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|db4(relay:ONLINE, progress=0, latency=1.000) |
|STATUS [OK] [2025/01/28 11:35:02 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=relay, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db5(slave:ONLINE, progress=0, latency=1.515) |
|STATUS [OK] [2025/01/28 11:35:12 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db4, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=6, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db6(slave:ONLINE, progress=0, latency=1.592) |
|STATUS [OK] [2025/01/28 11:35:12 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db4, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=6, active=0) |
+---------------------------------------------------------------------------------+
Manually switch the composite Primary role to the other site:
[LOGICAL] /beta > use global
[LOGICAL] /global > switch
Savepoint switch_0(cluster=global, source=db2, created=2025/01/28 13:53:00 UTC) created
SELECTED SLAVE: 'beta@global'
FLUSHING TRANSACTIONS THROUGH 'db1@alpha'
REPLICATOR 'db1' IS NOW USING MASTER CONNECT URI 'thl://db4:2112/'
composite data source 'beta@global' is now OFFLINE
PUT THE NEW MASTER 'beta@global' ONLINE
PUT THE PRIOR MASTER 'alpha@global' ONLINE AS A SLAVE
REVERT POLICY: MAINTENANCE => AUTOMATIC
SWITCH TO 'beta@global' WAS SUCCESSFUL
[LOGICAL] /global > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
alpha:COORDINATOR[db2:AUTOMATIC:ONLINE]
beta:COORDINATOR[db5:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=2, active=0) |
|connector@db2[75463](ONLINE, created=2, active=0) |
|connector@db3[43981](ONLINE, created=2, active=0) |
|connector@db4[59575](ONLINE, created=2, active=0) |
|connector@db5[59581](ONLINE, created=2, active=0) |
|connector@db6[59847](ONLINE, created=2, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite slave:ONLINE) |
|STATUS [OK] [2025/01/28 01:53:13 PM UTC] |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite master:ONLINE) |
|STATUS [OK] [2025/01/28 01:53:13 PM UTC] |
+---------------------------------------------------------------------------------+
Composite Cluster Site Failover (Forced Switch)
In the event the Active site goes down and a graceful manual switch is not possible, the composite Active role can be failed over to the Passive
cluster using cctrl. The failover command performs the forced switch operation. It will try to update the configuration of the failed data
service, but will not fail if that update is unsuccessful.
In this example, hosts db4 (the composite Primary), db5 and db6 in cluster beta have been shut down. To force
dataservice alpha to become the composite Primary, log in to a node in that cluster and start cctrl:
shell> cctrl -multi
Tungsten Clustering 8.0.4 Build 132
alpha: session established
[LOGICAL] / > use global
[LOGICAL] /global > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
alpha:COORDINATOR[db2:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=2, active=0) |
|connector@db2[75463](ONLINE, created=2, active=0) |
|connector@db3[43981](ONLINE, created=2, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite slave:ONLINE) |
|STATUS [OK] [2025/01/28 01:53:13 PM UTC] |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite master:SHUNNED(FAILSAFE_SHUN)) |
|STATUS [SHUNNED] [2025/01/28 02:21:56 PM UTC] |
+---------------------------------------------------------------------------------+
Mark the beta data service as failed to prevent further actions:
[LOGICAL] /global > datasource beta fail
WARNING: This is an expert-level command:
Incorrect use may cause data corruption
or make the cluster unavailable.
Do you want to continue? (y/n)> y
COMPOSITE DATA SOURCE 'beta' IS NOW IN THE FAILED STATE
[LOGICAL] /global > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
alpha:COORDINATOR[db2:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=2, active=0) |
|connector@db2[75463](ONLINE, created=2, active=0) |
|connector@db3[43981](ONLINE, created=2, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite slave:ONLINE) |
|STATUS [OK] [2025/01/28 01:53:13 PM UTC] |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite master:FAILED(MANUALLY-FAILED)) |
|STATUS [CRITICAL] [2025/01/28 02:29:53 PM UTC] |
|REASON[MANUALLY-FAILED] |
+---------------------------------------------------------------------------------+
Issue the failover command to force the alpha dataservice to become the composite Primary:
[LOGICAL] /global > failover
WARNING: DATA SERVICE 'beta' IS NOT AVAILABLE. CANNOT GET STATE
EXCEPTION=Unable to continue with command because no manager is available in service 'beta'.
Savepoint failover_1(cluster=global, source=db2, created=2025/01/28 14:35:37 UTC) created
SELECTED SLAVE: 'alpha@global'
ENSURING THAT WE CATCH UP WITH THE MOST ADVANCED RELAY
composite data source 'alpha@global' is now OFFLINE
PUT THE NEW MASTER 'alpha@global' ONLINE
REVERT POLICY: MAINTENANCE => AUTOMATIC
FAILOVER TO 'alpha@global' WAS SUCCESSFUL
[LOGICAL] /global > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
alpha:COORDINATOR[db2:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=2, active=0) |
|connector@db2[75463](ONLINE, created=2, active=0) |
|connector@db3[43981](ONLINE, created=2, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite master:ONLINE) |
|STATUS [OK] [2025/01/28 02:36:15 PM UTC] |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite master:SHUNNED(MANUAL-FAILOVER)) |
|STATUS [SHUNNED] [2025/01/28 02:35:46 PM UTC] |
+---------------------------------------------------------------------------------+
Composite Cluster Site Recovery
When the lost site is returned to operation and all Tungsten services have been restarted, the cluster will, if at all possible, attempt automatic recovery, returning the failed cluster as a slave dataservice with all nodes online. Automatic recovery is only attempted when the clusters are in the AUTOMATIC policy mode.
If the nodes cannot be recovered, the first step in recovering the SHUNNED dataservice is to re-provision the nodes if the data has gotten out of sync. See "Provision or Reprovision a Replica" for more information.
Once the failed site has been restored, the shunned (superseded) dataservice can be brought back online using cctrl. The recover command
performs this operation, reporting its progress as it runs.
...
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite master:ONLINE) |
|STATUS [OK] [2025/01/28 02:36:15 PM UTC] |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite master:SHUNNED(SUPERSEDED)) |
|STATUS [SHUNNED] [2025/01/28 02:39:41 PM UTC] |
+---------------------------------------------------------------------------------+
...
Use the recover command to bring the SHUNNED dataservice back online as a composite Replica:
[LOGICAL] /global > recover
IDENTIFIED DATASOURCE 'beta@global' FOR RECOVERY
COULD NOT IDENTIFY ACTIVE PRIMARY FOR SERVICE 'beta'
ATTEMPTING TO IDENTIFY A FAILED PRIMARY FOR 'beta'
PHYSICAL DATA SERVICE 'beta' DOES NOT HAVE AN ACTIVE RELAY
FORCING THE PHYSICAL RELAY TO BE 'db4'
DATASOURCE 'db4@beta' IS NOW A RELAY
RECOVERED 2 DATA SOURCES IN SERVICE 'beta'
composite data source 'beta@global' role is now slave
composite data source 'beta' is now OFFLINE
REVERT SET POLICY AUTOMATIC
RECOVERY OF COMPOSITE SERVICE 'global' IS COMPLETE
[LOGICAL] /global > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
alpha:COORDINATOR[db2:AUTOMATIC:ONLINE]
beta:COORDINATOR[db4:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[43475](ONLINE, created=2, active=0) |
|connector@db2[75463](ONLINE, created=2, active=0) |
|connector@db3[43981](ONLINE, created=2, active=0) |
|connector@db4[2062](ONLINE, created=2, active=0) |
|connector@db5[2079](ONLINE, created=2, active=0) |
|connector@db6[2080](ONLINE, created=2, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite master:ONLINE) |
|STATUS [OK] [2025/01/28 02:36:15 PM UTC] |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite slave:ONLINE) |
|STATUS [OK] [2025/01/28 02:40:06 PM UTC] |
+---------------------------------------------------------------------------------+
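After a recovery it can be useful to confirm that the composite cluster is back in the expected shape: one composite master ONLINE, and every other composite data source ONLINE as well. The sketch below expresses that check over (name, role, state) tuples read from the ls output. `composite_problems` is a hypothetical helper, and the rule it encodes is an assumption drawn from the examples in this section rather than a documented health check.

```python
def composite_problems(sources):
    """Return a list of human-readable problems; an empty list means healthy.

    `sources` is a list of (name, role, state) tuples, for example
    [("alpha", "master", "ONLINE"), ("beta", "slave", "ONLINE")].
    """
    problems = []
    online_masters = [
        name for name, role, state in sources
        if role == "master" and state == "ONLINE"
    ]
    if len(online_masters) != 1:
        problems.append(
            f"expected 1 online composite master, found {len(online_masters)}"
        )
    for name, role, state in sources:
        if state != "ONLINE":
            problems.append(f"{name} is {state}")
    return problems
```

For the state shown directly after the failover, where beta is still listed as a SHUNNED composite master, the function would flag beta. For the final output above it returns an empty list.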
Composite Cluster Relay Recovery
If the Relay node in a Composite cluster should ever point to the incorrect Primary node, you can perform the following procedure to re-point the replicator to the desired Primary node.
For example, say we have a composite cluster global, with nodes db1, db2 and db3 in alpha and db4, db5 and db6 in beta. db1
is the Primary and db4 is the relay.
In the output below, the relay node db4 shows that its replicator is using db2 as the Primary instead of db1:
+---------------------------------------------------------------------------------+
|db4(relay:ONLINE, progress=5, latency=0.352) |
|STATUS [OK] [2025/01/28 02:39:49 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=relay, master=db2, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=8108, active=0) |
+---------------------------------------------------------------------------------+
Use the replicator command to adjust the relay source:
shell> cctrl
Tungsten Clustering 8.0.4 Build 132
beta: session established
[LOGICAL] /beta > set policy maintenance
[LOGICAL] /beta > replicator db4 offline
[LOGICAL] /beta > replicator db4 relay alpha/db1
[LOGICAL] /beta > set policy automatic
[LOGICAL] /beta > ls
+---------------------------------------------------------------------------------+
|db4(relay:ONLINE, progress=5, latency=0.352) |
|STATUS [OK] [2025/01/28 02:39:49 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=relay, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
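A relay that points at the wrong Primary can also be detected from the ls text rather than by eye. The sketch below maps each node to the master reported on its REPLICATOR line. It assumes the box layout shown in this section, and `replicator_sources` is a hypothetical helper rather than a cctrl feature.

```python
import re

# Header line of a data source box, e.g.
# "|db4(relay:ONLINE, progress=5, latency=0.352) |"
NODE_RE = re.compile(r"^\|(?P<node>[\w.-]+)\((?P<role>\w+):")
# Replicator detail line, e.g.
# "| REPLICATOR(role=relay, master=db2, state=ONLINE) |"
REPL_RE = re.compile(
    r"REPLICATOR\(role=(?P<role>\w+)"
    r"(?:, master=(?P<master>[\w.-]+))?, state=(?P<state>\w+)\)"
)


def replicator_sources(ls_output):
    """Map each node name to the master its replicator reports.

    The value is None for a node whose REPLICATOR line has no master=
    field, as is the case for the Primary itself.
    """
    sources, current = {}, None
    for line in ls_output.splitlines():
        line = line.strip()
        head = NODE_RE.match(line)
        if head:
            current = head.group("node")
            continue
        repl = REPL_RE.search(line)
        if repl and current:
            sources[current] = repl.group("master")
    return sources
```

In the scenario above, comparing `replicator_sources(text).get("db4")` against the expected Primary db1 would show db2 before the fix and db1 afterwards.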