5.6.1. Recover a failed slave

A slave that has failed but has since become available again can be recovered back into slave mode using the recover command:

[LOGICAL:EXPERT] /alpha > recover
FOUND PHYSICAL DATASOURCE TO RECOVER: 'host2@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host2'
DATA SERVER 'host2' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host2@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
DataSource 'host2' is now OFFLINE
RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
RECOVERED 1 DATA SOURCES IN SERVICE 'alpha'

The recover command will attempt to recover all the slave resources in the cluster, bringing them all online and back into service. The command operates on all shunned or failed slaves, and only works if there is an active master available.

To recover a single datasource back into the dataservice, use the explicit form:

[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL

In some cases, the datasource may still show as ONLINE, and the recover command will refuse to bring it back into operation, reporting the following error:

The datasource 'host1' is not FAILED or SHUNNED and cannot be recovered.

Checking the datasource status in cctrl, the replicator service has failed, but the datasource still shows as online:

+----------------------------------------------------------------------------+
|host1 (slave:ONLINE, progress=-1, latency=-1.000)                           |
|STATUS [OK] [2013/06/24 12:42:06 AM BST]                                    |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=slave, master=host1, state=SUSPECT)                        |
| DATASERVER(state=ONLINE)                                                   |
+----------------------------------------------------------------------------+
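The underlying replicator state can be confirmed directly on the affected node using trepctl. The output below is illustrative rather than taken from the scenario above; the exact state reported will depend on the nature of the failure:

```
shell> trepctl status | grep -E 'role|state'
role            : slave
state           : OFFLINE:ERROR
```

A replicator reported as SUSPECT by the manager will typically show an offline or error state when queried directly in this way.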

In this case, the datasource can be manually shunned; the recover command will then be able to operate and bring the node back into service.
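Continuing the example above, the sequence is a datasource shun followed by recover; the responses shown here are based on the shun and recover output illustrated elsewhere in this section:

```
[LOGICAL:EXPERT] /alpha > datasource host1 shun
DataSource 'host1' set to SHUNNED
[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL
```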

5.6.1.1. Provision or Reprovision a Slave

In the event that you cannot get the slave to recover using the datasource recover command, you can re-provision the slave from another slave within your dataservice using the tungsten_provision_slave command.

The command performs three operations automatically:

  1. Performs a backup of a remote slave

  2. Copies the backup to the current host

  3. Restores the backup

Warning

When using tungsten_provision_slave you must be logged in to the slave that has failed or that you want to reprovision. You cannot reprovision a slave remotely.

To use tungsten_provision_slave:

  1. Log in to the failed slave.

  2. Select the active slave within the dataservice that you want to use to reprovision the failed slave. You may use the master, but this will impact performance on that host. If you use MyISAM tables, the operation will create some locking in order to get a consistent snapshot.

  3. Run tungsten_provision_slave specifying the source you have selected:

    shell> tungsten_provision_slave --source=host2
      NOTE  >> Put alpha replication service offline
      NOTE  >> Create a mysqldump backup of host2 »
      in /opt/continuent/backups/provision_mysqldump_2013-11-21_09-31_52
      NOTE  >> host2 >> Create mysqldump in »
      /opt/continuent/backups/provision_mysqldump_2013-11-21_09-31_52/provision.sql.gz
      NOTE  >> Load the mysqldump file
      NOTE  >> Put the alpha replication service online
      NOTE  >> Clear THL and relay logs for the alpha replication service

    The default backup service for the host will be used; mysqldump can be used by specifying the --mysqldump option.

    tungsten_provision_slave handles the cluster status, backup, restore, and repositioning of the replication stream so that the restored slave is ready to start operating again.

Important

When using a Multisite/Multimaster topology, the additional replicator must be put offline before restoring data and put online after completion:

shell> mm_trepctl offline
shell> tungsten_provision_slave --source=host2
shell> mm_trepctl online
shell> mm_trepctl status

For more information on using tungsten_provision_slave, see Section 8.26, “The tungsten_provision_slave Script”.

5.6.1.2. Recover a slave from manually shunned state

A slave that has been manually shunned can be added back to the dataservice using the datasource recover command:

[LOGICAL:EXPERT] /alpha > datasource host3 recover
DataSource 'host3' is now OFFLINE

In AUTOMATIC policy mode, the slave will automatically be recovered from OFFLINE to ONLINE mode.

In MANUAL or MAINTENANCE policy mode, the datasource must be manually switched to the online state:

[LOGICAL:EXPERT] /alpha > datasource host3 online
Setting server for data source 'host3' to READ-ONLY
+----------------------------------------------------------------------------+
|host3                                                                       |
+----------------------------------------------------------------------------+
|Variable_name  Value                                                        |
|read_only  ON                                                               |
+----------------------------------------------------------------------------+
DataSource 'host3@alpha' is now ONLINE

5.6.1.3. Slave Datasource Extended Recovery

If the current slave will not recover, but the replicator state and sequence number are valid, and the slave is either pointing to the wrong master or mistakenly still has the master role when it should be a slave, then the slave can be forced back into the slave state.

For example, in the output from ls in cctrl below, host2 is mistakenly identified as the master, even though host1 is correctly operating as the master.

COORDINATOR[host1:AUTOMATIC:ONLINE]

ROUTERS:
+----------------------------------------------------------------------------+
|connector@host1[1848](ONLINE, created=0, active=0)                          |
|connector@host2[4098](ONLINE, created=0, active=0)                          |
|connector@host3[4087](ONLINE, created=0, active=0)                          |
+----------------------------------------------------------------------------+

DATASOURCES:
+----------------------------------------------------------------------------+
|host1(master:ONLINE, progress=23, THL latency=0.198)                        |
|STATUS [OK] [2013/05/30 11:29:44 AM BST]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=0, active=0)                                          |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|host2(slave:SHUNNED(MANUALLY-SHUNNED), progress=-1, latency=-1.000)         |
|STATUS [SHUNNED] [2013/05/30 11:23:15 AM BST]                               |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=OFFLINE)                                    |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=0, active=0)                                          |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|host3(slave:ONLINE, progress=23, latency=178877.000)                        |
|STATUS [OK] [2013/05/30 11:33:15 AM BST]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=host1, state=ONLINE)                        |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=0, active=0)                                          |
+----------------------------------------------------------------------------+

The datasource host2 can be brought back online using this sequence:

  1. Enable force mode using set force:

    [LOGICAL:EXPERT] /alpha > set force true
    FORCE: true
  2. Shun the datasource:

    [LOGICAL:EXPERT] /alpha > datasource host2 shun
    DataSource 'host2' set to SHUNNED
  3. Switch the replicator offline:

    [LOGICAL:EXPERT] /alpha > replicator host2 offline
    Replicator 'host2' is now OFFLINE
  4. Set the replicator to slave operation:

    [LOGICAL:EXPERT] /alpha > replicator host2 slave
    Replicator 'host2' is now a slave of replicator 'host1'

    In some instances you may need to explicitly specify which node is the master when you configure the slave; appending the master hostname to the command specifies the master host to use:

    [LOGICAL:EXPERT] /alpha > replicator host2 slave host1
    Replicator 'host2' is now a slave of replicator 'host1'
  5. Switch the replicator service online:

    [LOGICAL:EXPERT] /alpha > replicator host2 online
    Replicator 'host2' is now ONLINE
  6. Ensure the datasource is correctly configured as a slave:

    [LOGICAL:EXPERT] /alpha > datasource host2 slave
    Datasource 'host2' now has role 'slave'
  7. Recover the slave back to the dataservice:

    [LOGICAL:EXPERT] /alpha > datasource host2 recover
    DataSource 'host2' is now OFFLINE

Datasource host2 should now be back in the dataservice as a working datasource.

Similar processes can be used to force a datasource back into the master role if a switch or recover operation failed to set the role properly.
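As a sketch, the equivalent master-role sequence mirrors the slave steps above, substituting master for slave. The master forms of these commands are inferred here from the slave syntax rather than shown elsewhere in this section, so verify them against your cctrl version before use:

```
[LOGICAL:EXPERT] /alpha > set force true
[LOGICAL:EXPERT] /alpha > datasource host2 shun
[LOGICAL:EXPERT] /alpha > replicator host2 offline
[LOGICAL:EXPERT] /alpha > replicator host2 master
[LOGICAL:EXPERT] /alpha > replicator host2 online
[LOGICAL:EXPERT] /alpha > datasource host2 master
[LOGICAL:EXPERT] /alpha > datasource host2 recover
```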

If the recover command fails, there are a number of solutions that may bring the dataservice back to the normal operational state. The exact method will depend on whether there are other active slaves (from which a backup can be taken) or recent backups of the slave are available, and the reasons for the original failure. Some potential solutions include