A Replica that failed but has since become available again can be returned to Replica mode using the recover command:
[LOGICAL:EXPERT] /alpha > recover
FOUND PHYSICAL DATASOURCE TO RECOVER: 'host2@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host2'
DATA SERVER 'host2' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host2@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
DataSource 'host2' is now OFFLINE
RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
RECOVERED 1 DATA SOURCES IN SERVICE 'alpha'
The recover command will attempt to recover all the Replica resources in the cluster, bringing them all online and back into service. The command operates on all shunned or failed Replicas, and only works if there is an active Primary available.
To recover a single datasource back into the dataservice, use the explicit form:
[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL
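The two forms of the command shown above can be told apart with a small helper. This is only a sketch: recover_command is a hypothetical function name, and the commented-out pipe assumes that cctrl accepts commands on standard input.

```shell
# Print the cctrl command that recovers every failed or shunned Replica
# (no argument) or one named datasource (one argument).
recover_command() {
    if [ "$#" -eq 0 ]; then
        echo "recover"
    else
        echo "datasource $1 recover"
    fi
}

recover_command          # bulk form
recover_command host1    # explicit single-datasource form

# Assumption: cctrl reads commands from standard input.
# recover_command host1 | cctrl -expert
```

Printing the command first makes it easy to review what will be sent to the cluster before running it.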
In some cases, the datasource may show as ONLINE and the recover command does not bring the datasource online, particularly with the following error:
The datasource 'host1' is not FAILED or SHUNNED and cannot be recovered.
Checking the datasource status in cctrl shows that the replicator service has failed, but the datasource still shows as online:
+----------------------------------------------------------------------------+
|host1 (slave:ONLINE, progress=-1, latency=-1.000)                           |
|STATUS [OK] [2013/06/24 12:42:06 AM BST]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, Master=host1, state=SUSPECT)                       |
|  DATASERVER(state=ONLINE)                                                  |
+----------------------------------------------------------------------------+
In this case, the datasource can be manually shunned, which will then enable the recover command to operate and bring the node back into operation.
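As a rough sketch of that workaround, the helper below prints the shun command followed by the recover command for one datasource. The function name shun_then_recover is hypothetical, and the commented-out pipe assumes that cctrl accepts commands on standard input.

```shell
# Print the cctrl commands that manually shun a datasource and then
# recover it, in the order they should be issued.
shun_then_recover() {
    host="$1"
    printf 'datasource %s shun\n' "$host"
    printf 'datasource %s recover\n' "$host"
}

# Review the commands before running them on a cluster node.
shun_then_recover host1

# Assumption: cctrl reads commands from standard input.
# shun_then_recover host1 | cctrl -expert
```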
If the Replica cannot be recovered using the datasource recover command, you can reprovision it from another Replica within your dataservice using the tprovision command.
The tprovision command performs three operations automatically:
Performs a backup of a remote Replica
Copies the backup to the current host
Restores the backup
When using tprovision (or tungsten_provision_slave) you must be logged in to the Replica that has failed or that you want to reprovision. You cannot reprovision a Replica remotely.
To use tprovision:
Log in to the failed Replica.
Select the active Replica within the dataservice that you want to use to reprovision the failed Replica. You may use the Primary but this will impact performance on that host. If you use MyISAM tables the operation will create some locking in order to get a consistent snapshot.
Run tprovision specifying the source you have selected:
shell> tprovision --source=host2
NOTE >> Put alpha replication service offline
NOTE >> Create a mysqldump backup of host2 in /opt/continuent/backups/provision_mysqldump_2013-11-21_09-31_52
NOTE >> host2 >> Create mysqldump in /opt/continuent/backups/provision_mysqldump_2013-11-21_09-31_52/provision.sql.gz
NOTE >> Load the mysqldump file
NOTE >> Put the alpha replication service online
NOTE >> Clear THL and relay logs for the alpha replication service
The default backup service for the host will be used; mysqldump can be used by specifying the --mysqldump option.
tprovision handles the cluster status, backup, restore, and repositioning of the replication stream so that the restored Replica is ready to start operating again.
When using a Multi-Site/Active-Active topology the additional replicator must be put offline before restoring data and put online after completion.
shell> mm_trepctl offline
shell> tprovision --source=host2
shell> mm_trepctl online
shell> mm_trepctl status
For more information on using tprovision see Section 9.27, “The tprovision Script”.
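The Multi-Site/Active-Active sequence above could be wrapped in a small script so that the steps always run in the same order and stop on the first failure. This is a sketch only: the run and reprovision_multisite functions and the DRY_RUN variable are hypothetical names, and the script defaults to printing the commands rather than executing them.

```shell
# DRY_RUN=1 (the default in this sketch) only prints each command.
# Set DRY_RUN=0 on the failed Replica to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Take the additional replicator offline, reprovision from the chosen
# source, then bring the replicator back online and show its status.
reprovision_multisite() {
    source_host="$1"
    run mm_trepctl offline || return 1
    run tprovision --source="$source_host" || return 1
    run mm_trepctl online || return 1
    run mm_trepctl status
}

reprovision_multisite host2
```

The script must be run on the Replica being reprovisioned, for the same reason that tprovision itself cannot be run remotely.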
A Replica that has been manually shunned can be added back to the dataservice using the datasource recover command:
[LOGICAL:EXPERT] /alpha > datasource host3 recover
DataSource 'host3' is now OFFLINE
In AUTOMATIC policy mode, the Replica will automatically be recovered from OFFLINE to ONLINE mode.
In MANUAL or MAINTENANCE policy mode, the datasource must be manually switched to the online state:
[LOGICAL:EXPERT] /alpha > datasource host3 online
Setting server for data source 'host3' to READ-ONLY
+----------------------------------------------------------------------------+
|host3 |
+----------------------------------------------------------------------------+
|Variable_name Value |
|read_only ON |
+----------------------------------------------------------------------------+
DataSource 'host3@alpha' is now ONLINE
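The policy-dependent behaviour described above can be summarised as a helper that prints the cctrl commands needed for a given policy mode. This is a sketch under the assumption that the policy mode is already known to the caller; readd_commands is a hypothetical function name.

```shell
# Print the cctrl commands needed to return a manually shunned Replica
# to service, given the current policy mode of the dataservice.
readd_commands() {
    host="$1"
    policy="$2"
    printf 'datasource %s recover\n' "$host"
    case "$policy" in
        AUTOMATIC)
            # The Replica moves from OFFLINE to ONLINE automatically.
            ;;
        MANUAL|MAINTENANCE)
            printf 'datasource %s online\n' "$host"
            ;;
        *)
            echo "unknown policy mode: $policy" >&2
            return 1
            ;;
    esac
}

readd_commands host3 AUTOMATIC
readd_commands host3 MANUAL
```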
A Replica can be forced back into the Replica state if it will not recover even though the replicator state and sequence number are valid, and it is either pointing to the wrong Primary or still mistakenly has the Primary role when it should be a Replica.
For example, in the output from ls in cctrl below, host2 is mistakenly identified as the Primary, even though host1 is correctly operating as the Primary.
COORDINATOR[host1:AUTOMATIC:ONLINE]
ROUTERS:
+----------------------------------------------------------------------------+
|connector@host1[1848](ONLINE, created=0, active=0)                          |
|connector@host2[4098](ONLINE, created=0, active=0)                          |
|connector@host3[4087](ONLINE, created=0, active=0)                          |
+----------------------------------------------------------------------------+
DATASOURCES:
+----------------------------------------------------------------------------+
|host1(Master:ONLINE, progress=23, THL latency=0.198)                        |
|STATUS [OK] [2013/05/30 11:29:44 AM BST]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=Master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=0, active=0)                                          |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|host2(slave:SHUNNED(MANUALLY-SHUNNED), progress=-1, latency=-1.000)         |
|STATUS [SHUNNED] [2013/05/30 11:23:15 AM BST]                               |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=Master, state=OFFLINE)                                    |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=0, active=0)                                          |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|host3(slave:ONLINE, progress=23, latency=178877.000)                        |
|STATUS [OK] [2013/05/30 11:33:15 AM BST]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, Master=host1, state=ONLINE)                        |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=0, active=0)                                          |
+----------------------------------------------------------------------------+
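A mismatch like the one on host2 can be spotted by comparing the role in each datasource header with the role reported on its REPLICATOR line. The filter below is a sketch that assumes the ls output keeps the layout shown above; find_role_mismatch is a hypothetical function name, and the commented-out pipe assumes that cctrl accepts commands on standard input.

```shell
# Read cctrl "ls" output on standard input and print one line for every
# datasource whose replicator role differs from its datasource role.
find_role_mismatch() {
    awk '
        # Datasource header, e.g. "|host2(slave:SHUNNED(...), ..."
        match($0, /^[|][^ (|]+ ?[(][A-Za-z]+:/) {
            split(substr($0, 2, RLENGTH - 2), part, / ?[(]/)
            host = part[1]
            ds_role = part[2]
            next
        }
        # Replicator line, e.g. "| REPLICATOR(role=Master, state=OFFLINE)"
        /REPLICATOR[(]role=/ && match($0, /role=[A-Za-z]+/) {
            rep_role = substr($0, RSTART + 5, RLENGTH - 5)
            if (tolower(rep_role) != tolower(ds_role))
                printf "%s: datasource=%s replicator=%s\n", host, ds_role, rep_role
        }
    '
}

# Assumption: cctrl reads commands from standard input.
# echo ls | cctrl | find_role_mismatch
```

Connector and STATUS lines are ignored because they do not match the datasource header pattern, so only the role comparison for each host is reported.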
The datasource host2 can be brought back online using this sequence:
Enable force mode:
[LOGICAL:EXPERT] /alpha > set force true
FORCE: true
Shun the datasource:
[LOGICAL:EXPERT] /alpha > datasource host2 shun
DataSource 'host2' set to SHUNNED
Switch the replicator offline:
[LOGICAL:EXPERT] /alpha > replicator host2 offline
Replicator 'host2' is now OFFLINE
Set the replicator to Replica operation:
[LOGICAL:EXPERT] /alpha > replicator host2 slave
Replicator 'host2' is now a slave of replicator 'host1'
In some instances you may need to explicitly specify which node is your Primary when you configure the Replica; appending the Primary hostname to the command specifies the Primary host to use:
[LOGICAL:EXPERT] /alpha > replicator host2 slave host1
Replicator 'host2' is now a slave of replicator 'host1'
Switch the replicator service online:
[LOGICAL:EXPERT] /alpha > replicator host2 online
Replicator 'host2' is now ONLINE
Ensure the datasource is correctly configured as a Replica:
[LOGICAL:EXPERT] /alpha > datasource host2 slave
Datasource 'host2' now has role 'slave'
Recover the Replica back to the dataservice:
[LOGICAL:EXPERT] /alpha > datasource host2 recover
DataSource 'host2' is now OFFLINE
Datasource host2 should now be back in the dataservice as a working datasource.
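The sequence above can be captured as a helper that prints the commands in order for a given Replica and Primary. This is a sketch only: force_replica_commands is a hypothetical function name, it always uses the form of the slave command that names the Primary explicitly, and the commented-out pipe assumes that cctrl accepts commands on standard input.

```shell
# Print the cctrl commands that force a datasource back into the
# Replica role, in the order given in the procedure.
force_replica_commands() {
    replica="$1"
    primary="$2"
    cat <<EOF
set force true
datasource $replica shun
replicator $replica offline
replicator $replica slave $primary
replicator $replica online
datasource $replica slave
datasource $replica recover
EOF
}

# Review the commands before running them on a cluster node.
force_replica_commands host2 host1

# Assumption: cctrl reads commands from standard input.
# force_replica_commands host2 host1 | cctrl -expert
```

Because each step depends on the previous one succeeding, it is safer to check the output of every command than to run the whole list unattended.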
Similar processes can be used to force a datasource back into the Primary role if a switch or recover operation failed to set the role properly.
If the recover command fails, there are a number of solutions that may bring the dataservice back to the normal operational state. The exact method will depend on whether there are other active Replicas (from which a backup can be taken), whether recent backups of the Replica are available, and the reasons for the original failure. Some potential solutions include:
If there is a recent backup of the failed Replica, restore the Replica using that backup. The latest backup can be restored as described in Section 6.11, “Restoring a Backup”.
If there is no recent backup, but you have another Replica from which you can recover the failed Replica, the node should be rebuilt using a backup from that Replica. See Section 6.11.3, “Restoring from Another Replica”.