Recover a failed Replica
While the cluster is operating in AUTOMATIC policy mode, the managers will in most cases attempt to recover a failed Replica automatically.
If automatic recovery fails, further investigation is required to establish the cause of the failure. Once the fault has been resolved and the host is viable
(i.e. no data corruption) and available again, it can be recovered back into Replica mode using either the recover command or the
single-datasource command datasource db1 recover:
[LOGICAL:EXPERT] /alpha > recover
RECOVERING DATASERVICE 'alpha
SET POLICY: AUTOMATIC => MAINTENANCE
FOUND PHYSICAL DATASOURCE TO RECOVER: 'db1@alpha'
RECOVERING DATASOURCE 'db1@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'db1'
Verified that DB server notification 'db1' is in state 'ONLINE'
DATA SERVER 'db1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'db1@alpha' TO A SLAVE USING 'db3@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'db1@alpha' FROM 'master' TO 'slave'
RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
REVERT POLICY: MAINTENANCE => AUTOMATIC
RECOVERED 1 DATA SOURCES IN SERVICE 'alpha'
or:
[LOGICAL:EXPERT] /alpha > datasource db1 recover
RECOVERING DATASOURCE 'db1@alpha'
STARTING SERVICE 'db1/mysql'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'db1'
Verified that DB server notification 'db1' is in state 'ONLINE'
DATA SERVER 'db1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'db1@alpha' TO A SLAVE USING 'db3@alpha' AS THE MASTER
RECOVERY OF 'db1@alpha' WAS SUCCESSFUL
The single recover command will attempt to recover all the Replica resources in the cluster, bringing them all online and back into service. The
command operates on all shunned or failed Replicas, and only works if there is an active Primary available.
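The eligibility rule in the paragraph above can be modelled in a few lines. This is a minimal sketch of that rule only, not of the manager's actual logic: the Datasource type and the recover_targets function are hypothetical names, and the role and state strings are assumed to match what cctrl displays. It reads the paragraph literally, so it does not model the case of a failed former Primary being demoted.

```python
from typing import List, NamedTuple


class Datasource(NamedTuple):
    """Hypothetical record holding the fields shown by cctrl for one datasource."""
    name: str
    role: str   # "master" or "slave"
    state: str  # e.g. "ONLINE", "SHUNNED", "FAILED"


def recover_targets(datasources: List[Datasource]) -> List[str]:
    """Return the Replicas a cluster-wide recover would act on, per the text."""
    has_active_primary = any(
        d.role == "master" and d.state == "ONLINE" for d in datasources
    )
    if not has_active_primary:
        return []  # recover only works when an active Primary is available
    return [
        d.name
        for d in datasources
        if d.role == "slave" and d.state in ("SHUNNED", "FAILED")
    ]


cluster = [
    Datasource("db1", "slave", "FAILED"),
    Datasource("db2", "slave", "ONLINE"),
    Datasource("db3", "master", "ONLINE"),
]
print(recover_targets(cluster))
```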
In some cases, the datasource may show as ONLINE and the recover command has no effect, failing with the
following error:
The datasource 'db1' is not FAILED or SHUNNED and cannot be recovered.
Checking the datasource status in cctrl shows that the replicator service has failed, even though the datasource is reported as ONLINE:
+---------------------------------------------------------------------------------+
|db1(slave:ONLINE, progress=-1, latency=-1.000) |
|STATUS [OK] [2025/01/27 01:59:33 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db3, state=SUSPECT) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=4, active=0) |
+---------------------------------------------------------------------------------+
In this case, the datasource can be manually shunned, which will then enable the recover command to operate and bring the node back into operation.
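The mismatch described above (datasource ONLINE, replicator not ONLINE) can be spotted by scanning saved ls output. The following is a minimal sketch, assuming the output has been captured as text and follows the box layout shown in this section; find_shun_candidates and the two regular expressions are illustrative names, not part of cctrl.

```python
from __future__ import annotations

import re

# Datasource header row, e.g. "|db1(slave:ONLINE, progress=-1, latency=-1.000) |"
HEADER = re.compile(r"^\|\s*(?P<name>[\w.-]+)\((?P<role>\w+):(?P<state>\w+)")
# Replicator row, e.g. "| REPLICATOR(role=slave, master=db3, state=SUSPECT) |"
REPLICATOR = re.compile(r"REPLICATOR\(.*?state=(?P<state>\w+)\)")


def find_shun_candidates(ls_output: str) -> list[str]:
    """Return datasources reported ONLINE whose replicator is not ONLINE."""
    candidates = []
    current = None  # (name, datasource state) of the block being read
    for line in ls_output.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = (header["name"], header["state"])
            continue
        replicator = REPLICATOR.search(line)
        if replicator and current:
            name, ds_state = current
            if ds_state == "ONLINE" and replicator["state"] != "ONLINE":
                candidates.append(name)
            current = None  # wait for the next datasource header
    return candidates


SAMPLE = """\
|db1(slave:ONLINE, progress=-1, latency=-1.000) |
|STATUS [OK] [2025/01/27 01:59:33 PM UTC] |
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db3, state=SUSPECT) |
| DATASERVER(state=ONLINE) |
"""
print(find_shun_candidates(SAMPLE))  # ['db1']
```

A datasource returned by this check is a candidate for a manual shun followed by recover; the decision to shun should still be made by an operator.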
Recover a Replica from manually shunned state
A Replica that has been manually shunned can be added back to the dataservice using the datasource recover command:
[LOGICAL:EXPERT] /alpha > ls
...
DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(slave:SHUNNED(MANUALLY-SHUNNED), progress=41, latency=59020.057) |
|STATUS [SHUNNED] [2025/01/28 08:35:36 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db3, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
...
[LOGICAL:EXPERT] /alpha > datasource db1 recover
RECOVERING DATASOURCE 'db1@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'db1'
Verified that DB server notification 'db1' is in state 'ONLINE'
DATA SERVER 'db1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'db1@alpha' TO A SLAVE USING 'db3@alpha' AS THE MASTER
RECOVERY OF 'db1@alpha' WAS SUCCESSFUL
[LOGICAL:EXPERT] /alpha > ls
...
DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(slave:ONLINE, progress=41, latency=59020.000) |
|STATUS [OK] [2025/01/28 08:36:37 AM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db3, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
...
Provision or Reprovision a Replica
In the event that you cannot get the Replica to recover using the datasource recover command, you can reprovision the Replica from another Replica
within your dataservice using the tprovision command.
The tprovision command performs three operations automatically:
- Performs a backup of a remote Replica
- Copies the backup to the current host
- Restores the backup
When using tprovision you must be logged in to the Replica that has failed or that you want to reprovision. You cannot reprovision a Replica remotely.
To use tprovision:
- Log in to the failed Replica.
- Select the active Replica within the dataservice that you want to use to reprovision the failed Replica. You may use the Primary but this will impact performance on that host. If you use MyISAM tables the operation will create some locking in order to get a consistent snapshot.
- Run tprovision, specifying the source you have selected:
shell> tprovision -s db2 -m xtrabackup
2025/01/27 15:59:15 | INFO Started script tprovision
2025/01/27 15:59:20 | INFO xtrabackup version is 8.0.35-31
2025/01/27 15:59:20 | INFO MySQL version is 8.0.40
2025/01/27 15:59:20 | INFO
source = db2
method = xtrabackup
parallel threads = 4
port = 22
topology = CLUSTER
...
...
Verified that DB server notification 'db1' is in state 'ONLINE'
DATA SERVER 'db1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'db1@alpha' TO A SLAVE USING 'db3@alpha' AS THE MASTER
RECOVERY OF 'db1@alpha' WAS SUCCESSFUL
[LOGICAL] /alpha >
Exiting...
2025/01/27 15:59:48 | INFO Script ended
tprovision handles the cluster status, backup, restore, and repositioning of the replication stream so that the restored Replica is ready to
start operating again.
For a full explanation of using tprovision, see "The tprovision Command".
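The source-selection step in the procedure above (prefer an active Replica, use the Primary only as a fallback) can be expressed as a small helper. This is a sketch under assumptions: choose_provision_source is a hypothetical function, and the statuses mapping is assumed to have been filled in by hand or by a parser from cctrl ls output.

```python
from __future__ import annotations


def choose_provision_source(
    statuses: dict[str, tuple[str, str]], failed_host: str
) -> str | None:
    """Pick a host to pass to tprovision -s.

    statuses maps host name -> (role, state), e.g. {"db2": ("slave", "ONLINE")}.
    Returns None when no other ONLINE host exists.
    """
    online = {
        host: role
        for host, (role, state) in statuses.items()
        if host != failed_host and state == "ONLINE"
    }
    # Prefer a Replica: backing up the Primary impacts performance on that host.
    replicas = sorted(h for h, role in online.items() if role == "slave")
    if replicas:
        return replicas[0]
    primaries = sorted(h for h, role in online.items() if role == "master")
    return primaries[0] if primaries else None


statuses = {
    "db1": ("slave", "SHUNNED"),
    "db2": ("slave", "ONLINE"),
    "db3": ("master", "ONLINE"),
}
print(choose_provision_source(statuses, failed_host="db1"))  # db2
```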
When using a Multi-Site/Active-Active topology, the additional replicator must be put offline before restoring data and put back online after completion:
shell> mm_trepctl offline
shell> tprovision -s db2 -m xtrabackup
shell> mm_trepctl online
shell> mm_trepctl status
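The ordering of these four commands matters, so it can help to keep them together in one wrapper. The sketch below is illustrative only: the function names are hypothetical, the commands are exactly those shown above, and it assumes mm_trepctl and tprovision are on the PATH of the host being reprovisioned. It defaults to a dry run that only prints the commands.

```python
from __future__ import annotations

import subprocess


def multisite_reprovision_commands(
    source: str, method: str = "xtrabackup"
) -> list[list[str]]:
    """Return the command sequence from the text, in the required order."""
    return [
        ["mm_trepctl", "offline"],                   # stop the additional replicator
        ["tprovision", "-s", source, "-m", method],  # backup, copy, restore
        ["mm_trepctl", "online"],                    # resume the additional replicator
        ["mm_trepctl", "status"],                    # confirm the final state
    ]


def run_sequence(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("shell>", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError and stops at the first failure
            subprocess.run(cmd, check=True)


run_sequence(multisite_reprovision_commands("db2"))
```

Stopping at the first failure is a deliberate choice in this sketch: if the restore step fails, the additional replicator is left offline rather than being brought online against a partially restored database.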
Replica Datasource Extended Recovery
If the current Replica will not recover but the replicator state and sequence number are valid, the Replica can be forced back into the Replica state. This applies when the Replica is pointing to the wrong Primary, or still mistakenly has the Primary role when it should be a Replica.
For example, in the output from ls in cctrl below, the replicator on db2 is mistakenly identified as the Primary, even
though db1 is correctly operating as the Primary.
Note that in the scenario shown below, because the db2 host is in a SHUNNED state and its replicator is
OFFLINE, it is isolated from applications and is not operating in the incorrectly labelled role. db1 is the only functional
Primary node in the cluster.
COORDINATOR[db3:AUTOMATIC:ONLINE]
ROUTERS:
+----------------------------------------------------------------------------+
|connector@db1[2096](ONLINE, created=2, active=0) |
|connector@db2[2092](ONLINE, created=2, active=0) |
|connector@db3[2107](ONLINE, created=2, active=0) |
+----------------------------------------------------------------------------+
DATASOURCES:
+----------------------------------------------------------------------------+
|db1(master:ONLINE, progress=43, THL latency=0.073) |
|STATUS [OK] [2025/01/28 08:39:09 AM UTC] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(MANUALLY-SHUNNED), progress=-1, latency=-1.000) |
|STATUS [SHUNNED] [2025/01/28 08:42:11 AM UTC] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=OFFLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:ONLINE, progress=43, latency=0.319) |
|STATUS [OK] [2025/01/28 08:42:14 AM UTC] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
The datasource db2 can be brought back online using this sequence:
- Enable set force true mode:
[LOGICAL:EXPERT] /alpha > set force true
FORCE: true
- Switch the replicator offline:
[LOGICAL:EXPERT] /alpha > replicator db2 offline
Replicator 'db2' is now OFFLINE
- Set the replicator to replica operation:
[LOGICAL:EXPERT] /alpha > replicator db2 slave
Replicator 'db2' is now a slave of replicator 'db1'
In some instances you may need to explicitly specify which node is your Primary when you configure the Replica; appending the Primary hostname to the command specifies the Primary host to use:
[LOGICAL:EXPERT] /alpha > replicator db2 slave db1
Replicator 'db2' is now a slave of replicator 'db1'
- Switch the replicator service online:
[LOGICAL:EXPERT] /alpha > replicator db2 online
Replicator 'db2' is now ONLINE
- Ensure the datasource is correctly configured as a Replica:
[LOGICAL:EXPERT] /alpha > datasource db2 slave
Datasource 'db2' now has role 'slave'
- Recover the Replica back to the dataservice:
[LOGICAL:EXPERT] /alpha > datasource db2 recover
DataSource 'db2' is now OFFLINE
Datasource db2 should now be back in the dataservice as a working datasource.
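Because the order of the forced-recovery steps is easy to get wrong, the sequence can be generated for a given host before it is typed at the cctrl expert prompt. This is a minimal sketch: force_replica_commands is a hypothetical helper that only builds the command strings shown above and does not talk to the cluster.

```python
from __future__ import annotations


def force_replica_commands(replica: str, primary: str | None = None) -> list[str]:
    """Build the cctrl commands that force a datasource back to the Replica role.

    When primary is given, it is appended to the 'replicator ... slave'
    command to name the Primary explicitly, as described in the text.
    """
    set_role = f"replicator {replica} slave"
    if primary:
        set_role += f" {primary}"
    return [
        "set force true",
        f"replicator {replica} offline",
        set_role,
        f"replicator {replica} online",
        f"datasource {replica} slave",
        f"datasource {replica} recover",
    ]


for command in force_replica_commands("db2", primary="db1"):
    print(command)
```

Printing the commands instead of executing them keeps an operator in the loop, which suits a procedure that relies on set force true.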
Similar processes can be used to force a datasource back into the primary role if a switch or recover operation failed to set the role properly.
If the recover command fails, there are a number of solutions that may bring the dataservice back to the normal operational state. The exact method
will depend on whether there are other active Replicas (from which a backup can be taken), whether recent backups of the Replica are available, and the reasons for the
original failure. Some potential solutions include:
- If there is a recent backup of the failed Replica, restore the Replica using that backup. The latest backup can be restored using "operations-restore".
- If there is no recent backup, but you have another Replica from which you can recover the failed Replica, rebuild the node using a backup from that Replica. See "operations-restore-otherreplica".