5.6.2. Recover a failed master

When a master datasource is automatically failed over in AUTOMATIC policy mode, the datasource can be brought back into the dataservice as a slave by using the recover command:

[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'host1@alpha' FROM 'master' TO 'slave'
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL

The recovered datasource will be added back to the dataservice as a slave.

5.6.2.1. Recover when there are no masters

When no master is available, because of a master failover or a multiple-host failure, there are two options. The first is to use the recover master using command, which sets the master to the specified host and tries to automatically recover all the remaining nodes in the dataservice. The second is to manually set the master host and then recover the remaining datasources manually.

  • Using recover master using

    Warning

    This command should only be used in urgent scenarios where the most up to date master can be identified. If there are multiple failures or mismatches between masters and slaves, the command may not be able to recover all services, but will always result in an active master being configured.

    This command performs two distinct actions: first it calls set master to select the new master, and then it calls datasource recover on each of the remaining slaves. This attempts to recover the entire dataservice by switching the master and reconfiguring the slaves to work with the new master.

    To use the command, first examine the state of the dataservice and choose which datasource is the most up to date or canonical. For example, within the following output, each datasource has the same sequence number (progress=17), so any datasource could potentially be used as the master:

    [LOGICAL] /alpha > ls
    
    COORDINATOR[host1:AUTOMATIC:ONLINE]
    
    ROUTERS:
    +----------------------------------------------------------------------------+
    |connector@host1[18450](ONLINE, created=0, active=0)                         |
    |connector@host2[8877](ONLINE, created=0, active=0)                          |
    |connector@host3[8895](ONLINE, created=0, active=0)                          |
    +----------------------------------------------------------------------------+
    
    DATASOURCES:
    +----------------------------------------------------------------------------+
    |host1(master:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure),        |
    |progress=17, THL latency=0.565)                                             |
    |STATUS [OK] [2013/11/04 04:39:28 PM GMT]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=master, state=ONLINE)                                     |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure),         |
    |progress=17, latency=1.003)                                                 |
    |STATUS [OK] [2013/11/04 04:39:51 PM GMT]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, master=host1, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host3(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure),         |
    |progress=17, latency=1.273)                                                 |
    |STATUS [OK] [2013/10/26 06:30:26 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, master=host1, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+

    Once a host has been chosen, call the recover master using command, specifying the full servicename and hostname of the chosen datasource:

    [LOGICAL] /alpha > recover master using alpha/host1
    
    This command is generally meant to help in the recovery of a data service
    that has data sources shunned due to a fail-safe shutdown of the service or
    under other circumstances where you wish to force a specific data source to become
    the primary. Be forewarned that if you do not exercise care when using this command
    you may lose data permanently or otherwise make your data service unusable.
    Do you want to continue? (y/n)> y
    DATA SERVICE 'alpha' DOES NOT HAVE AN ACTIVE PRIMARY. CAN PROCEED WITH 'RECOVER USING'
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
    DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
    DataSource 'host1' is now OFFLINE
    DATASOURCE 'host1@alpha' IS NOW A MASTER
    FOUND PHYSICAL DATASOURCE TO RECOVER: 'host2@alpha'
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host2'
    DATA SERVER 'host2' IS NOW AVAILABLE FOR CONNECTIONS
    RECOVERING 'host2@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
    DataSource 'host2' is now OFFLINE
    RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
    FOUND PHYSICAL DATASOURCE TO RECOVER: 'host3@alpha'
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host3'
    DATA SERVER 'host3' IS NOW AVAILABLE FOR CONNECTIONS
    RECOVERING 'host3@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
    DataSource 'host3' is now OFFLINE
    RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
    RECOVERED 2 DATA SOURCES IN SERVICE 'alpha'

    You will be prompted to confirm that you wish to use the selected host as the new master. cctrl then sets the new master and recovers the remaining slaves.

    If this operation fails, you can try the manual process, using set master and proceeding to recover each slave manually.

  • Using set master

    The set master command forcibly sets the master to the specified host. It should only be used in the situation where no master is currently available within the dataservice, and recovery has failed. This command performs only one operation, and that is to explicitly set the new master to the specified host.

    Warning

    set master is an expert-level command and may lead to data loss if the wrong master is used. Because of this, cctrl must be forced to execute the command by using set force true. The command will not be executed otherwise.

    To use the command, pick the most up to date master, or the host that you want to use as the master within your dataservice, then issue the command:

    [LOGICAL] /alpha > set master host3
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host3'
    DATA SERVER 'host3' IS NOW AVAILABLE FOR CONNECTIONS
    DataSource 'host3' is now OFFLINE
    DATASOURCE 'host3@alpha' IS NOW A MASTER

    This does not recover the remaining slaves within the cluster; these must be recovered manually, either by following Section 5.6.1, “Recover a failed slave”, or, if this is not possible, Section 5.6.1.1, “Provision or Reprovision a Slave”.
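
Both options above depend on comparing the progress (sequence number) values reported by ls. As a minimal sketch, assuming the ls output has already been captured into a string by some means not covered here, the following Python helper lists the datasources that share the highest sequence number. The function names parse_progress and pick_candidates are hypothetical and are not part of cctrl; the parsing relies only on the output layout shown in this section and may need adjusting for other versions.

```python
import re

# First line of a datasource box in `ls` output, for example:
# |host2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure),
DATASOURCE_RE = re.compile(
    r"^\s*\|(?P<host>[\w.-]+)\((?P<role>master|slave|relay):(?P<state>[A-Z]+)"
)
# The progress value may be on the same line or on the following line.
PROGRESS_RE = re.compile(r"progress=(?P<seqno>-?\d+)")


def parse_progress(ls_output):
    """Return {host: (role, seqno)} for each datasource found in `ls` output."""
    result = {}
    current = None  # datasource seen, but its progress value not yet found
    for line in ls_output.splitlines():
        match = DATASOURCE_RE.match(line)
        if match:
            current = (match.group("host"), match.group("role"))
        progress = PROGRESS_RE.search(line)
        if progress and current:
            host, role = current
            result[host] = (role, int(progress.group("seqno")))
            current = None
    return result


def pick_candidates(datasources):
    """Return the hosts sharing the highest sequence number, sorted by name."""
    if not datasources:
        return []
    highest = max(seqno for _, seqno in datasources.values())
    return sorted(
        host for host, (_, seqno) in datasources.items() if seqno == highest
    )
```

Applied to the ls output shown earlier, where every datasource reports progress=17, pick_candidates returns all three hosts, which matches the observation that any of them could be used as the master. The helper only suggests candidates; the final choice remains a manual decision.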

5.6.2.2. Recover a shunned master

When a master datasource fails in MANUAL policy mode and the node has been failed over, the node can be added back to the dataservice once the datasource becomes available again. Use the recover command, which enables the host as a slave:

[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'host1@alpha' FROM 'master' TO 'slave'
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL

The recovered master will be added back to the dataservice as a slave.

5.6.2.3. Manually Failing over a Master in MAINTENANCE policy mode

If the dataservice is in MAINTENANCE mode when the master fails, automatic recovery cannot sensibly make the decision about which node should be used as the master. In that case, the dataservice must be manually reconfigured.

In the sample below, host1 is the current master, and host2 is a slave. To manually update and switch host1 to be the slave and host2 to be the master:

  1. Shun the failed master (host1) and set the replicator offline:

    [LOGICAL:EXPERT] /alpha > datasource host1 shun
    DataSource 'host1' set to SHUNNED
    [LOGICAL:EXPERT] /alpha > replicator host1 offline
    Replicator 'host1' is now OFFLINE
  2. Shun the slave host2 and set the replicator to the offline state:

    [LOGICAL:EXPERT] /alpha > datasource host2 shun
    DataSource 'host2' set to SHUNNED
    [LOGICAL:EXPERT] /alpha > replicator host2 offline
    Replicator 'host2' is now OFFLINE
  3. Configure host2 as the master within the replicator service:

    [LOGICAL:EXPERT] /alpha > replicator host2 master
  4. Set the replicator on host2 online:

    [LOGICAL:EXPERT] /alpha > replicator host2 online
  5. Welcome host2 back into the dataservice and then set it online:

    [LOGICAL:EXPERT] /alpha > datasource host2 welcome
    [LOGICAL:EXPERT] /alpha > datasource host2 online
  6. Switch the replicator on host1 to be a slave of host2:

    [LOGICAL:EXPERT] /alpha > replicator host1 slave host2
    Replicator 'host1' is now a slave of replicator 'host2'
  7. Set the replicator on host1 online:

    [LOGICAL:EXPERT] /alpha > replicator host1 online
    Replicator 'host1' is now ONLINE
  8. Switch the datasource role for host1 to be in slave mode:

    [LOGICAL:EXPERT] /alpha > datasource host1 slave
    Datasource 'host1' now has role 'slave'
  9. Now that the configuration and roles for the host have been updated, the datasource can be added back to the dataservice and then put online:

    [LOGICAL:EXPERT] /alpha > datasource host1 recover
    DataSource 'host1' is now OFFLINE
    [LOGICAL:EXPERT] /alpha > datasource host1 online
    Setting server for data source 'host1' to READ-ONLY
    +----------------------------------------------------------------------------+
    |host1                                                                       |
    +----------------------------------------------------------------------------+
    |Variable_name  Value                                                        |
    |read_only  ON                                                               |
    +----------------------------------------------------------------------------+
    DataSource 'host1@alpha' is now ONLINE
  10. With the dataservice in AUTOMATIC policy mode, the datasource will be placed online, which can be verified with ls:

    [LOGICAL:EXPERT] /alpha > ls
    
    COORDINATOR[host3:AUTOMATIC:ONLINE]
    
    ROUTERS:
    +----------------------------------------------------------------------------+
    |connector@host1[19869](ONLINE, created=0, active=0)                         |
    |connector@host2[28116](ONLINE, created=0, active=0)                         |
    |connector@host3[1533](ONLINE, created=0, active=0)                          |
    +----------------------------------------------------------------------------+
    
    DATASOURCES:
    +----------------------------------------------------------------------------+
    |host1(slave:ONLINE, progress=156325, latency=725.737)                       |
    |STATUS [OK] [2013/05/14 01:06:08 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, master=host2, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host2(master:ONLINE, progress=156325, THL latency=0.606)                    |
    |STATUS [OK] [2013/05/14 12:53:41 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=master, state=ONLINE)                                     |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host3(slave:ONLINE, progress=156325, latency=1.642)                         |
    |STATUS [OK] [2013/05/14 12:53:41 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, master=host2, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
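
Because the order of the steps above matters, it can help to generate the command sequence for a given pair of hosts before typing it into cctrl. The following Python function is a minimal sketch that simply reproduces the commands from steps 1 to 10 in order. The name manual_switch_plan is hypothetical, and the function does not run anything; it only returns the text of each command for review.

```python
def manual_switch_plan(old_master, new_master):
    """Return the cctrl commands from the numbered steps, in order.

    old_master is the failed master (host1 in the example) and
    new_master is the slave to be promoted (host2 in the example).
    """
    if old_master == new_master:
        raise ValueError("old and new master must be different hosts")
    return [
        # Steps 1 and 2: shun both datasources and take replicators offline
        f"datasource {old_master} shun",
        f"replicator {old_master} offline",
        f"datasource {new_master} shun",
        f"replicator {new_master} offline",
        # Steps 3 to 5: promote the new master and bring it online
        f"replicator {new_master} master",
        f"replicator {new_master} online",
        f"datasource {new_master} welcome",
        f"datasource {new_master} online",
        # Steps 6 to 9: reconfigure the old master as a slave and recover it
        f"replicator {old_master} slave {new_master}",
        f"replicator {old_master} online",
        f"datasource {old_master} slave",
        f"datasource {old_master} recover",
        f"datasource {old_master} online",
        # Step 10: verify the result
        "ls",
    ]


if __name__ == "__main__":
    for command in manual_switch_plan("host1", "host2"):
        print(command)
```

Each command should still be entered and checked one at a time, since a failure at any step changes what should be done next.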

5.6.2.4. Failing over a master

When a master datasource fails in MANUAL policy mode, the datasource must be manually failed over to an active datasource, either by selecting the most up to date slave automatically:

[LOGICAL:EXPERT] /alpha > failover

Or to an explicit host:

[LOGICAL:EXPERT] /alpha > failover to host2
SELECTED SLAVE: host2@alpha
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'host1@alpha'
SHUNNING PREVIOUS MASTER 'host1@alpha'
PUT THE NEW MASTER 'host2@alpha' ONLINE
RECONFIGURING SLAVE 'host3@alpha' TO POINT TO NEW MASTER 'host2@alpha'
FAILOVER TO 'host2' WAS COMPLETED

For the failover command to work, the following conditions must be met:

  • There must be a master or relay in the SHUNNED or FAILED state.

  • There must be at least one slave in the ONLINE state.

If there is not already a SHUNNED or FAILED master and a failover must be forced, use datasource shun on the master, or failover to a specific slave.
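
The two conditions for failover listed above can be expressed as a simple check. The sketch below is illustrative only: it assumes the role and state of each datasource have already been read from ls output, and the function name failover_blockers is hypothetical rather than part of cctrl.

```python
def failover_blockers(datasources):
    """Return the unmet failover conditions as a list of messages.

    datasources is an iterable of (role, state) pairs, for example
    ("master", "SHUNNED") or ("slave", "ONLINE"). An empty result
    means both conditions described in this section are met.
    """
    pairs = list(datasources)
    blockers = []
    # Condition 1: a master or relay in the SHUNNED or FAILED state
    if not any(
        role in ("master", "relay") and state in ("SHUNNED", "FAILED")
        for role, state in pairs
    ):
        blockers.append("no master or relay in the SHUNNED or FAILED state")
    # Condition 2: at least one slave in the ONLINE state
    if not any(role == "slave" and state == "ONLINE" for role, state in pairs):
        blockers.append("no slave in the ONLINE state")
    return blockers
```

For example, a dataservice with a master that is still ONLINE reports the first condition as unmet, which corresponds to the case where datasource shun must be used on the master before failover.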