Maintenance on the dataservice, for example updating the MySQL configuration file, can be performed using a similar sequence to that shown in Section 6.15, “Performing Database or OS Maintenance”, except that you must also restart the corresponding Tungsten Replicator service after the main Tungsten Cluster service has been placed back online.
For example, to perform maintenance on the east service:
Put the dataservice into MAINTENANCE mode. This ensures that Tungsten Cluster will not attempt to automatically recover the service.
cctrl [east]> set policy maintenance
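If required, you can confirm the policy change with ls; the coordinator line at the top of the output should now show MAINTENANCE (the coordinator hostname shown below is illustrative, and the remainder of the output has been omitted):
cctrl [east]> ls
COORDINATOR[east1:MAINTENANCE:ONLINE]
...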
Shun the first Replica datasource so that maintenance can be performed on the host.
cctrl [east]> datasource east1 shun
Perform the updates, such as updating my.cnf, changing schemas, or performing other maintenance.
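For example, a configuration change might look like the following; the file location and the setting shown are illustrative only, so edit whichever options your maintenance actually requires:
shell> sudo vi /etc/my.cnf
# illustrative change only - e.g. adjust the InnoDB buffer pool size
# innodb_buffer_pool_size = 8G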
If MySQL configuration has been modified, restart the MySQL service:
cctrl [east]> service host/mysql restart
Bring the host back into the dataservice:
cctrl [east]> datasource host recover
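For example, if the Replica shunned earlier was east1, the command would be:
cctrl [east]> datasource east1 recover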
Perform a switch so that the Primary becomes a Replica and can then be shunned and have the necessary maintenance performed:
cctrl [east]> switch
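Issued without arguments, switch promotes the most appropriate Replica automatically. If you want to control which Replica becomes the new Primary, you can name it explicitly; for example, to promote east2 (a hypothetical Replica in this dataservice):
cctrl [east]> switch to east2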
Repeat the previous steps to shun the host, perform maintenance, and then switch again until all the hosts have been updated.
Set the policy back to automatic:
cctrl> set policy automatic
On each host in the other region, manually restart the Tungsten Replicator service, which will have gone offline when MySQL was restarted:
shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -host host -service east online
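For example, assuming the other region's cluster contains a host named west1 (an illustrative name only), the replicator on that host would be brought back online and its state verified as follows:
shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -host west1 -service east online
shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -host west1 -service east status | grep state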
In the event of a replication fault, the standard cctrl, trepctl and other utility commands in Chapter 9, Command-line Tools can be used to bring the dataservice back into operation. All the tools are safe to use.
If you have to perform any updates or modifications to the stored MySQL data, ensure binary logging has been disabled for your session before running any commands:
mysql> SET SESSION SQL_LOG_BIN=0;
This prevents the statements and operations from reaching the binary log, so that they will not be replicated to the other hosts.
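For example, a manual data fix might be wrapped as follows; the UPDATE statement, schema, and table names are purely illustrative:
mysql> SET SESSION SQL_LOG_BIN=0;
mysql> UPDATE mydb.mytable SET status = 'fixed' WHERE id = 100;
mysql> SET SESSION SQL_LOG_BIN=1;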
In a composite multi-master topology, a switch or a failover not only promotes a Replica to be the new Primary, but also requires the cross-site communications to be reconfigured. This process therefore assumes that cross-site communication is online and working. In some situations, cross-site communication may be down, or cross-site replication may be in an OFFLINE:ERROR state - for example, a DDL or DML statement that worked in the local cluster may have failed to apply in the remote cluster.
If a switch or failover occurs and the process is unable to reconfigure the cross-site replicators, the local switch will still succeed; however, the associated cross-site services will be placed into a SHUNNED(SUBSERVICE-SWITCH-FAILED) state. This section explains how to recover from this situation.
The examples are based on a 2-cluster topology, named NYC and LONDON, with a composite dataservice named GLOBAL.
The clusters are configured with the following dataservers:
NYC : db1 (Primary), db2 (Replica), db3 (Replica)
LONDON: db4 (Primary), db5 (Replica), db6 (Replica)
The cross-site replicators in both clusters are in an OFFLINE:ERROR state due to failing DDL. A switch was then issued, promoting db3 as the new Primary in NYC and db5 as the new Primary in LONDON.
When the cluster enters a state where the cross-site services are in an error, output from cctrl will look like the following:
shell> cctrl -expert -multi
[LOGICAL:EXPERT] / > use london_from_nyc
london_from_nyc: session established, encryption=false, authentication=false
[LOGICAL:EXPERT] /london_from_nyc > ls

COORDINATOR[db6:AUTOMATIC:ONLINE]

ROUTERS:
+-----------------------------------------------------------------------------------+
|connector@db1[26248](ONLINE, created=0, active=0)                                   |
|connector@db2[14906](ONLINE, created=0, active=0)                                   |
|connector@db3[15035](ONLINE, created=0, active=0)                                   |
|connector@db4[27813](ONLINE, created=0, active=0)                                   |
|connector@db5[4379](ONLINE, created=0, active=0)                                    |
|connector@db6[2098](ONLINE, created=0, active=0)                                    |
+-----------------------------------------------------------------------------------+

DATASOURCES:
+-----------------------------------------------------------------------------------+
|db5(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.219)             |
|STATUS [SHUNNED] [2018/03/15 10:27:24 AM UTC]                                       |
+-----------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                             |
|  REPLICATOR(role=relay, master=db3, state=ONLINE)                                  |
|  DATASERVER(state=ONLINE)                                                          |
|  CONNECTIONS(created=0, active=0)                                                  |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
|db4(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.252)             |
|STATUS [SHUNNED] [2018/03/15 10:27:25 AM UTC]                                       |
+-----------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                             |
|  REPLICATOR(role=slave, master=db5, state=ONLINE)                                  |
|  DATASERVER(state=ONLINE)                                                          |
|  CONNECTIONS(created=0, active=0)                                                  |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
|db6(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.279)             |
|STATUS [SHUNNED] [2018/03/15 10:27:25 AM UTC]                                       |
+-----------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                             |
|  REPLICATOR(role=slave, master=db4, state=ONLINE)                                  |
|  DATASERVER(state=ONLINE)                                                          |
|  CONNECTIONS(created=0, active=0)                                                  |
+-----------------------------------------------------------------------------------+
In the above example, you can see that all services are in the SHUNNED(SUBSERVICE-SWITCH-FAILED) state and that only partial reconfiguration has happened. The replicators for db4 and db6 should be Replicas of db5, while db5 has been correctly reconfigured to use the new Primary in NYC, db3. The actual state of the cluster in each scenario may be different, depending upon the cause of the loss of cross-site communication. Using the steps below, apply the actions that relate to your own cluster state; if in any doubt, always contact Continuent Support for assistance.
The first step is to ensure that the initial replication errors have been resolved and that the replicators are in an ONLINE state. The steps required to resolve the replicators will depend on the reason for the error; for further guidance on resolving these issues, see Chapter 6, Operations Guide.
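For example, the state of the cross-site replicator for a sub-service can be confirmed with trepctl before continuing; the command below assumes the sub-service is named london_from_nyc as in this example and that trepctl is in your path (use the full path appropriate to your installation if it is not):
shell> trepctl -service london_from_nyc status | grep -E 'state|pendingError'
Once the underlying problem has been fixed, the replicator can be brought online if it has not recovered automatically:
shell> trepctl -service london_from_nyc online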
From one node, connect into cctrl at the expert level:
shell> cctrl -expert -multi
Next, connect to the cross-site subservice, in this example, london_from_nyc
cctrl> use london_from_nyc
Next, place the service into Maintenance Mode
cctrl> set policy maintenance
Enable override of commands issued
cctrl> set force true
Bring the relay datasource online
cctrl> datasource db5 online
If you need to change the source for the relay replicator so that it points to the correct, new, Primary in the remote cluster, take the replicator offline. If the relay source is already correct, skip ahead to the step below for altering the remaining Replica replicators.
cctrl> replicator db5 offline
Change the source of the relay replicator
cctrl> replicator db5 relay nyc/db3
Bring the replicator online
cctrl> replicator db5 online
For each datasource that requires the replicator altering, issue the following commands:
cctrl> replicator datasource offline
cctrl> replicator datasource slave db5
cctrl> replicator datasource online
For example:
cctrl> replicator db4 offline
cctrl> replicator db4 slave db5
cctrl> replicator db4 online
Once all replicators are using the correct source, we can then bring the cluster back online:
cctrl> cluster welcome
Some of the datasources may still be in the SHUNNED state, so for each of those, you can then issue the following:
cctrl> datasource datasource online
For example:
cctrl> datasource db4 online
Once all nodes are online, we can then return the cluster to automatic
cctrl> set policy automatic
Repeat this process for the other cross-site subservice, if required.
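For example, if the reverse subservice follows the same naming convention it would be named nyc_from_london (a name assumed from the convention used above); the same sequence of steps would then be applied to it:
cctrl> use nyc_from_london
cctrl> set policy maintenance
(repeat the datasource and replicator recovery steps above for this subservice)
cctrl> set policy automatic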