Tungsten Clustering

Switching Primary Hosts

The Primary host within a dataservice can be switched, either automatically, or manually. Automatic switching occurs when the dataservice is in the AUTOMATIC policy mode, and a failure in the underlying datasource has been identified. The automatic process is designed to keep the dataservice running without requiring manual intervention.

Manual switching of the Primary can be performed during maintenance operations, for example during an upgrade or dataserver modification. In this situation, the Primary must be manually taken out of service, but without affecting the rest of the dataservice. By switching the Primary to another datasource in the dataservice, the original Primary can be put offline, or shunned, while maintenance occurs. Once the maintenance has been completed, the datasource can be re-enabled, and either remain as a Replica, or switched back as the Primary datasource.

Switching a datasource, whether automatically or manually, occurs while the dataservice is running, and without affecting the operation of the dataservice as a whole. Client application connections through Tungsten Connector are automatically reassigned to the datasources in the dataservice, and application operation will be unaffected by the change. Switching the datasource manually requires a single command that performs all of the required steps, monitoring and managing the switch process.

Switching the Primary, manually or automatically, performs the following steps within the dataservice:

1. Set the Primary node to the offline state. New connections to the Primary are rejected, and writes to the Primary are stopped.
2. On the Replica that will be promoted, switch the datasource offline. New connections are rejected, stopping reads on this Replica.
3. Kill any outstanding client connections to the Primary datasource, except those belonging to the tungsten account.
4. Send a heartbeat transaction between the Primary and the Replica, and wait until this transaction has been received. Once received, the THL on the Primary and the Replica are up to date.
5. Perform the switch:
   - Configure all remaining replicators offline.
   - Configure the selected Replica as the new Primary.
   - Set the new Primary to the online state.
   - New connections to the Primary are permitted.
6. Configure the remaining Replicas to use the new Primary as the Primary datasource.
7. Update the connector configurations and enable client connections to connect to the Primary and Replicas.
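
The order of the steps above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the Datasource class and the switch_primary function are hypothetical names invented for illustration, not part of Tungsten Cluster, and the connection kill, heartbeat, and connector steps are merely recorded in a log rather than performed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Datasource:
    """Toy stand-in for a datasource (hypothetical, not a Tungsten type)."""
    name: str
    role: str                      # "master" or "slave", as in cctrl output
    state: str = "ONLINE"
    master: Optional[str] = None   # host this Replica replicates from


def switch_primary(cluster: Dict[str, Datasource], target: str) -> List[str]:
    """Apply the documented switch order to the toy model; return a log."""
    old = next(ds for ds in cluster.values() if ds.role == "master")
    new = cluster[target]
    if new.role != "slave":
        raise ValueError(f"{target} is not a Replica")
    log = []

    # Take the Primary offline, then the Replica that will be promoted.
    old.state = "OFFLINE"
    log.append(f"{old.name}: offline, writes stopped")
    new.state = "OFFLINE"
    log.append(f"{new.name}: offline, reads stopped")

    # Connection kill and heartbeat are only recorded in this model.
    log.append(f"{old.name}: outstanding client connections killed")
    log.append(f"heartbeat from {old.name} received on {new.name}")

    # Perform the switch: everything offline, then promote the target.
    for ds in cluster.values():
        ds.state = "OFFLINE"
    new.role, new.master, new.state = "master", None, "ONLINE"
    log.append(f"{new.name}: promoted to Primary and online")

    # Point every other datasource at the new Primary.
    for ds in cluster.values():
        if ds is not new:
            ds.role, ds.master, ds.state = "slave", new.name, "ONLINE"
    log.append(f"remaining datasources now replicate from {new.name}")

    # Connector reconfiguration is only recorded in this model.
    log.append("connectors updated, client connections enabled")
    return log


cluster = {
    "db1": Datasource("db1", "master"),
    "db2": Datasource("db2", "slave", master="db1"),
    "db3": Datasource("db3", "slave", master="db1"),
}
for entry in switch_primary(cluster, "db3"):
    print(entry)
```

In this model the prior Primary ends up as a Replica of the promoted node, which matches what the manual switch output reports for the prior master.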

The switching process is monitored by Tungsten Cluster. If the process fails, either because of a timeout or because a recoverable error occurs, the switch operation is rolled back, returning the dataservice to its original configuration. This ensures that the dataservice remains operational. In some circumstances, when performing a manual switch, the command may need to be repeated to ensure the requested switch operation completes.
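
The rollback and repeat behaviour described above can be sketched generically. The following is an illustration only, assuming a configuration held in a plain dictionary; SwitchFailed, run_with_rollback, and switch_with_retry are hypothetical names and do not reflect how Tungsten Cluster implements its rollback internally.

```python
import copy
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]
Step = Callable[[Config], None]


class SwitchFailed(Exception):
    """Hypothetical recoverable error raised by a switch step."""


def run_with_rollback(config: Config, steps: List[Step]) -> Tuple[Config, bool]:
    """Run steps on a working copy; keep the original if any step fails."""
    working = copy.deepcopy(config)
    try:
        for step in steps:
            step(working)
    except (TimeoutError, SwitchFailed):
        return config, False       # rolled back: original is untouched
    return working, True


def switch_with_retry(config: Config, steps: List[Step],
                      attempts: int = 2) -> Tuple[Config, bool]:
    """Repeat the whole operation, as a manual switch may need repeating."""
    for _ in range(attempts):
        config, ok = run_with_rollback(config, steps)
        if ok:
            return config, True
    return config, False


def promote_db3(cfg: Config) -> None:
    cfg["primary"] = "db3"


calls = {"count": 0}


def flaky_heartbeat(cfg: Config) -> None:
    # Fails on the first call only, to exercise the rollback path.
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("heartbeat not received in time")


result, ok = switch_with_retry({"primary": "db1"}, [promote_db3, flaky_heartbeat])
print(result, ok)
```

Working on a copy means a failed attempt never leaves a half-applied configuration behind, which is the property the rollback is meant to provide.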

The process takes a finite amount of time to complete. The exact duration depends on the state, health, and database activity of the dataservice, and in particular on how up to date the Replica being promoted is compared to the Primary. After a delay period, the switch takes place regardless of the current status.

Automatic Primary Failover

When the dataservice policy mode is AUTOMATIC, the dataservice will automatically fail over the Primary host when the existing Primary is identified as having failed or become unavailable.

For example, when the Primary host db1 becomes unavailable because of a network problem, the dataservice automatically switches to db3. The dataservice status is updated accordingly, showing the automatically shunned db1:

[LOGICAL:EXPERT] /alpha > ls

COORDINATOR[db1:AUTOMATIC:ONLINE]

ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[7435](ONLINE, created=2, active=0)                                 |
|connector@db2[7472](ONLINE, created=2, active=0)                                 |
|connector@db3[7468](ONLINE, created=2, active=0)                                 |
+---------------------------------------------------------------------------------+

DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILED-OVER-TO-db3), progress=8, THL latency=0.981)           |
|STATUS [SHUNNED] [2025/01/27 01:51:23 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=master, state=DEGRADED)                                        |
|  DATASERVER(state=STOPPED)                                                      |
|  CONNECTIONS(created=4, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=8, latency=1.004)                                     |
|STATUS [OK] [2025/01/27 01:51:40 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(master:ONLINE, progress=10, THL latency=0.380)                               |
|STATUS [OK] [2025/01/27 01:51:27 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=master, state=ONLINE)                                          |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=2, active=0)                                               |
+---------------------------------------------------------------------------------+

The status for the original Primary (db1) identifies the datasource as shunned, and the FAILED-OVER-TO-db3 annotation indicates which datasource was promoted to Primary.
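
When this status needs to be checked from a script, the datasource header lines can be picked out of the ls output with a regular expression. The sketch below assumes the header format is exactly as in the sample above; parse_datasources and current_primary are hypothetical helper names, and the sample text is trimmed from the output shown earlier.

```python
import re
from typing import Dict, Optional

# Matches header lines such as:
#   |db1(master:SHUNNED(FAILED-OVER-TO-db3), progress=8, THL latency=0.981) |
HEADER = re.compile(
    r"^\|([^(|\s]+)\((\w+):(\w+)(?:\(FAILED-OVER-TO-([^)]+)\))?,"
)


def parse_datasources(ls_output: str) -> Dict[str, Dict[str, Optional[str]]]:
    """Return role, state, and failover target for each datasource header."""
    found = {}
    for line in ls_output.splitlines():
        match = HEADER.match(line.strip())
        if match:
            name, role, state, target = match.groups()
            found[name] = {
                "role": role,
                "state": state,
                "failed_over_to": target,   # None unless a failover occurred
            }
    return found


def current_primary(datasources: Dict[str, Dict[str, Optional[str]]]) -> Optional[str]:
    """Name of the datasource reported as an ONLINE master, if any."""
    for name, info in datasources.items():
        if info["role"] == "master" and info["state"] == "ONLINE":
            return name
    return None


sample = """\
|connector@db1[7435](ONLINE, created=2, active=0) |
|db1(master:SHUNNED(FAILED-OVER-TO-db3), progress=8, THL latency=0.981) |
|STATUS [SHUNNED] [2025/01/27 01:51:23 PM UTC] |
|  REPLICATOR(role=master, state=DEGRADED) |
|db2(slave:ONLINE, progress=8, latency=1.004) |
|db3(master:ONLINE, progress=10, THL latency=0.380) |
"""

datasources = parse_datasources(sample)
print(datasources["db1"]["failed_over_to"])
print(current_primary(datasources))
```

Connector, STATUS, and indented service lines do not match the pattern, so only the three datasource headers are returned.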

An automatic failover can be triggered by using the datasource fail command:

[LOGICAL:EXPERT] /alpha > datasource db1 fail

This triggers the automatic failover sequence, and simulates what would happen if the specified host failed.

If db1 becomes available again, the datasource is not automatically added back to the dataservice; it must be explicitly re-added. The status of the dataservice once db1 returns is shown below:

[LOGICAL:EXPERT] /alpha > ls
...

+---------------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILED-OVER-TO-db3), progress=8, THL latency=0.981)           |
|STATUS [SHUNNED] [2025/01/27 01:51:23 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=master, state=DEGRADED)                                        |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=4, active=0)                                               |
+---------------------------------------------------------------------------------+
...

To re-add db1, use the datasource recover command. Because db1 was previously the Primary, the command verifies that the server is available, configures the node as a Replica of the newly promoted Primary, and re-enables the services:

[LOGICAL:EXPERT] /alpha > datasource db1 recover
RECOVERING DATASOURCE 'db1@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'db1'
Verified that DB server notification 'db1' is in state 'ONLINE'
DATA SERVER 'db1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'db1@alpha' TO A SLAVE USING 'db3@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'db1@alpha' FROM 'master' TO 'slave'
RECOVERY OF 'db1@alpha' WAS SUCCESSFUL

If the command is successful, then the node should be up and running as a Replica of the new Primary.

The recovery process can fail if the THL data and dataserver contents do not match, for example when statements have been executed on a Replica. For information on recovering from failures that recover cannot fix, see "Replica Datasource Extended Recovery".
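
A script that wraps the recovery can confirm the outcome by looking for the final success line in the command output. The following sketch assumes the message wording shown in the transcript above. The source does not show what a failed recovery prints, so the helper (summarize_recover, a hypothetical name) simply treats a missing success line as "not recovered".

```python
import re
from typing import Dict, Optional, Union

ROLE_LINE = re.compile(
    r"^RECOVERING '([^'@]+)@[^']+' TO A SLAVE USING '([^'@]+)@[^']+' AS THE MASTER$"
)
SUCCESS_LINE = re.compile(r"^RECOVERY OF '([^'@]+)@[^']+' WAS SUCCESSFUL$")


def summarize_recover(output: str) -> Dict[str, Union[Optional[str], bool]]:
    """Extract the recovered host, its new Primary, and the success flag."""
    summary: Dict[str, Union[Optional[str], bool]] = {
        "datasource": None,
        "new_primary": None,
        "successful": False,
    }
    for line in output.splitlines():
        line = line.strip()
        role = ROLE_LINE.match(line)
        if role:
            summary["datasource"] = role.group(1)
            summary["new_primary"] = role.group(2)
        success = SUCCESS_LINE.match(line)
        if success:
            summary["datasource"] = success.group(1)
            summary["successful"] = True
    return summary


transcript = """\
RECOVERING DATASOURCE 'db1@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'db1'
DATA SERVER 'db1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'db1@alpha' TO A SLAVE USING 'db3@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'db1@alpha' FROM 'master' TO 'slave'
RECOVERY OF 'db1@alpha' WAS SUCCESSFUL
"""

print(summarize_recover(transcript))
```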

Manual Primary Switch

In a single dataservice configuration, the Primary can be switched manually between nodes within the dataservice using cctrl. The switch command performs the switch operation and reports its progress as it runs.

[LOGICAL:EXPERT] /alpha > switch
SET POLICY: AUTOMATIC => MAINTENANCE
EVALUATING SLAVE: db3(stored=14, applied=14, latency=0.682, datasource-group-id=0)
EVALUATING SLAVE: db2(stored=14, applied=14, latency=0.686, datasource-group-id=0)
SELECTED SLAVE: db3@alpha
Savepoint switch_2(cluster=alpha, source=db2, created=2025/01/27 13:54:48 UTC) created
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'db1@alpha'
PURGED A TOTAL OF 0 ACTIVE SESSIONS ON MASTER 'db1@alpha'
FLUSH TRANSACTIONS ON CURRENT MASTER 'db1@alpha'
PUT THE NEW MASTER 'db3@alpha' ONLINE
PUT THE PRIOR MASTER 'db1@alpha' ONLINE AS A SLAVE
SWITCH TO 'db3@alpha' WAS SUCCESSFUL

By default, switch chooses the most up to date Replica within the dataservice (db3 in the above example), but an explicit Replica can also be selected:

[LOGICAL:EXPERT] /alpha > switch to db2
SET POLICY: AUTOMATIC => MAINTENANCE
EVALUATING SLAVE: db2(stored=22, applied=22, latency=0.974, datasource-group-id=0)
SELECTED SLAVE: db2@alpha
Savepoint switch_4(cluster=alpha, source=db2, created=2025/01/27 13:56:07 UTC) created
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'db1@alpha'
PURGED A TOTAL OF 0 ACTIVE SESSIONS ON MASTER 'db1@alpha'
FLUSH TRANSACTIONS ON CURRENT MASTER 'db1@alpha'
PUT THE NEW MASTER 'db2@alpha' ONLINE
PUT THE PRIOR MASTER 'db1@alpha' ONLINE AS A SLAVE
SWITCH TO 'db2@alpha' WAS SUCCESSFUL

In this example, the Primary was switched specifically to the node db2.
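
The EVALUATING SLAVE lines in the switch output carry the figures used to compare candidates. The sketch below parses those lines and picks the candidate with the highest applied value, breaking ties on the lowest latency. That ordering is an assumption made for illustration: the source only says the most up to date Replica is chosen, and the rule here merely agrees with the first sample above. parse_candidates and pick_candidate are hypothetical helper names.

```python
import re
from typing import Dict, List, Optional, Union

EVAL_LINE = re.compile(r"^EVALUATING SLAVE: ([^(]+)\((.*)\)$")

Candidate = Dict[str, Union[str, int, float]]


def parse_candidates(output: str) -> List[Candidate]:
    """Collect name, applied, and latency from each EVALUATING SLAVE line."""
    candidates: List[Candidate] = []
    for line in output.splitlines():
        match = EVAL_LINE.match(line.strip())
        if not match:
            continue
        # Fields look like "stored=14, applied=14, latency=0.682, ..."
        fields = dict(pair.split("=", 1) for pair in match.group(2).split(", "))
        candidates.append({
            "name": match.group(1),
            "applied": int(fields["applied"]),
            "latency": float(fields["latency"]),
        })
    return candidates


def pick_candidate(output: str) -> Optional[str]:
    """Assumed rule: highest applied value first, then lowest latency."""
    candidates = parse_candidates(output)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: (c["applied"], -c["latency"]))
    return str(best["name"])


switch_output = """\
SET POLICY: AUTOMATIC => MAINTENANCE
EVALUATING SLAVE: db3(stored=14, applied=14, latency=0.682, datasource-group-id=0)
EVALUATING SLAVE: db2(stored=14, applied=14, latency=0.686, datasource-group-id=0)
SELECTED SLAVE: db3@alpha
"""

print(pick_candidate(switch_output))
```

Comparing the result with the SELECTED SLAVE line is a simple way to check whether the assumed rule matches what a given cluster actually does.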
