6.6.2. Recover a failed Primary

When a Primary datasource is automatically failed over in AUTOMATIC policy mode, the datasource can be brought back into the dataservice as a Replica by using the recover command:

[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'host1@alpha' FROM 'Master' TO 'slave'
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL

The recovered datasource will be added back to the dataservice as a Replica.

6.6.2.1. Recover when there are no Primaries

When there are no Primaries available, either due to a failover of a Primary or a multiple-host failure, there are two options available. The first is to use the recover Master using command, which sets the Primary to the specified host and attempts to automatically recover all of the remaining nodes in the dataservice. The second is to manually set the Primary host and then recover the remaining datasources manually.

  • Using recover Master using

    Warning

    This command should only be used in urgent scenarios where the most up-to-date Primary can be identified. If there are multiple failures or mismatches between Primaries and Replicas, the command may not be able to recover all services, but it will always result in an active Primary being configured.

    This command performs two distinct actions: first it calls set Master to select the new Primary, and then it calls datasource recover on each of the remaining Replicas. This attempts to recover the entire dataservice by switching the Primary and reconfiguring the Replicas to work with the new Primary.

    To use the command, first examine the state of the dataservice and choose the datasource that is the most up-to-date or canonical. For example, within the following output, each datasource has the same sequence number, so any datasource could potentially be used as the Primary:

    [LOGICAL] /alpha > ls
    
    COORDINATOR[host1:AUTOMATIC:ONLINE]
    
    ROUTERS:
    +----------------------------------------------------------------------------+
    |connector@host1[18450](ONLINE, created=0, active=0)                         |
    |connector@host2[8877](ONLINE, created=0, active=0)                          |
    |connector@host3[8895](ONLINE, created=0, active=0)                          |
    +----------------------------------------------------------------------------+
    
    DATASOURCES:
    +----------------------------------------------------------------------------+
    |host1(Master:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure),        |
    |progress=17, THL latency=0.565)                                             |
    |STATUS [OK] [2013/11/04 04:39:28 PM GMT]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=Master, state=ONLINE)                                     |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure),         |
    |progress=17, latency=1.003)                                                 |
    |STATUS [OK] [2013/11/04 04:39:51 PM GMT]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, Master=host1, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host3(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure),         |
    |progress=17, latency=1.273)                                                 |
    |STATUS [OK] [2013/10/26 06:30:26 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, Master=host1, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+

    Once a host has been chosen, call the recover Master using command specifying the full servicename and hostname of the chosen datasource:

    [LOGICAL] /alpha > recover Master using alpha/host1
    
    This command is generally meant to help in the recovery of a data service
    that has data sources shunned due to a fail-safe shutdown of the service or
    under other circumstances where you wish to force a specific data source to become
    the primary. Be forewarned that if you do not exercise care when using this command
    you may lose data permanently or otherwise make your data service unusable.
    Do you want to continue? (y/n)> y
    DATA SERVICE 'alpha' DOES NOT HAVE AN ACTIVE PRIMARY. CAN PROCEED WITH 'RECOVER USING'
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
    DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
    DataSource 'host1' is now OFFLINE
    DATASOURCE 'host1@alpha' IS NOW A MASTER
    FOUND PHYSICAL DATASOURCE TO RECOVER: 'host2@alpha'
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host2'
    DATA SERVER 'host2' IS NOW AVAILABLE FOR CONNECTIONS
    RECOVERING 'host2@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
    DataSource 'host2' is now OFFLINE
    RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
    FOUND PHYSICAL DATASOURCE TO RECOVER: 'host3@alpha'
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host3'
    DATA SERVER 'host3' IS NOW AVAILABLE FOR CONNECTIONS
    RECOVERING 'host3@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
    DataSource 'host3' is now OFFLINE
    RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
    RECOVERED 2 DATA SOURCES IN SERVICE 'alpha'

    You will be prompted to confirm that you want to use the selected host as the new Primary. cctrl then proceeds to set the new Primary and recover the remaining Replicas.

    If this operation fails, you can fall back to the manual process: use set Master to choose the Primary and then recover each Replica manually, as described in the next item and sketched after this list.

  • Using set Master

    The set Master command forcibly sets the Primary to the specified host. It should only be used when no Primary is currently available within the dataservice and recovery has failed. The command performs only one operation: it explicitly sets the new Primary to the specified host.

    Warning

    Using set Master is an expert-level command and may lead to data loss if the wrong Primary is used. Because of this, cctrl must be forced to execute the command by first issuing set force true. The command will not be executed otherwise.

    To use the command, pick the most up-to-date host, or the host that you want to use as the Primary within your dataservice, and then issue the command:

    [LOGICAL] /alpha > set Master host3
    VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host3'
    DATA SERVER 'host3' IS NOW AVAILABLE FOR CONNECTIONS
    DataSource 'host3' is now OFFLINE
    DATASOURCE 'host3@alpha' IS NOW A MASTER

    This does not recover the remaining Replicas within the cluster; these must be recovered manually. This can be achieved either by following Section 6.6.1, “Recover a failed Replica”, or, if this is not possible, by following Section 6.6.1.1, “Provision or Reprovision a Replica”.
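
As a rough sketch of the manual path, assume host1 has been chosen as the new Primary and host2 and host3 are the remaining Replicas in the alpha service (hostnames are illustrative); the sequence combines the commands described above:

[LOGICAL] /alpha > set force true
[LOGICAL] /alpha > set Master host1
[LOGICAL] /alpha > datasource host2 recover
[LOGICAL] /alpha > datasource host3 recover

Each Replica should report a successful recovery; any Replica that cannot be recovered this way should be reprovisioned as described in Section 6.6.1.1, “Provision or Reprovision a Replica”.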

6.6.2.2. Recover a shunned Primary

When a Primary datasource fails in MANUAL policy mode and has been failed over, the original node can be added back to the dataservice once it becomes available again by using the recover command, which rejoins the host as a Replica:

[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'host1@alpha' FROM 'Master' TO 'slave'
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL

The recovered Primary will be added back to the dataservice as a Replica.

6.6.2.3. Manually Failing over a Primary in MAINTENANCE policy mode

If the dataservice is in MAINTENANCE mode when the Primary fails, automatic recovery cannot sensibly make the decision about which node should be used as the Primary. In that case, the datasource service must be manually reconfigured.

In the sample below, host1 is the current Primary, and host2 is a Replica. To manually update and switch host1 to be the Replica and host2 to be the Primary:

  1. Shun the failed Primary (host1) and set the replicator offline:

    [LOGICAL:EXPERT] /alpha > datasource host1 shun
    DataSource 'host1' set to SHUNNED
    [LOGICAL:EXPERT] /alpha > replicator host1 offline
    Replicator 'host1' is now OFFLINE
  2. Shun the Replica host2 and set the replicator to the offline state:

    [LOGICAL:EXPERT] /alpha > datasource host2 shun
    DataSource 'host2' set to SHUNNED
    [LOGICAL:EXPERT] /alpha > replicator host2 offline
    Replicator 'host2' is now OFFLINE
  3. Configure host2 as the Primary within the replicator service:

    [LOGICAL:EXPERT] /alpha > replicator host2 Master
  4. Set the replicator on host2 online:

    [LOGICAL:EXPERT] /alpha > replicator host2 online
  5. Welcome the datasource on host2 back into the dataservice and then set it online:

    [LOGICAL:EXPERT] /alpha > datasource host2 welcome
    [LOGICAL:EXPERT] /alpha > datasource host2 online
  6. Switch the replicator on host1 to be a Replica of host2:

    [LOGICAL:EXPERT] /alpha > replicator host1 slave host2
    Replicator 'host1' is now a slave of replicator 'host2'
  7. Set the replicator on host1 online:

    [LOGICAL:EXPERT] /alpha > replicator host1 online
    Replicator 'host1' is now ONLINE
  8. Switch the datasource role for host1 to be in Replica mode:

    [LOGICAL:EXPERT] /alpha > datasource host1 slave
    Datasource 'host1' now has role 'slave'
  9. Now that the configuration and roles for the host have been updated, the datasource can be added back to the dataservice and then put online:

    [LOGICAL:EXPERT] /alpha > datasource host1 recover
    DataSource 'host1' is now OFFLINE
    [LOGICAL:EXPERT] /alpha > datasource host1 online
    Setting server for data source 'host1' to READ-ONLY
    +----------------------------------------------------------------------------+
    |host1                                                                       |
    +----------------------------------------------------------------------------+
    |Variable_name  Value                                                        |
    |read_only  ON                                                               |
    +----------------------------------------------------------------------------+
    DataSource 'host1@alpha' is now ONLINE
  10. With the dataservice in AUTOMATIC policy mode, the datasource will be placed online, which can be verified with ls:

    [LOGICAL:EXPERT] /alpha > ls
    
    COORDINATOR[host3:AUTOMATIC:ONLINE]
    
    ROUTERS:
    +----------------------------------------------------------------------------+
    |connector@host1[19869](ONLINE, created=0, active=0)                         |
    |connector@host2[28116](ONLINE, created=0, active=0)                         |
    |connector@host3[1533](ONLINE, created=0, active=0)                          |
    +----------------------------------------------------------------------------+
    
    DATASOURCES:
    +----------------------------------------------------------------------------+
    |host1(slave:ONLINE, progress=156325, latency=725.737)                       |
    |STATUS [OK] [2013/05/14 01:06:08 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, Master=host2, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host2(Master:ONLINE, progress=156325, THL latency=0.606)                    |
    |STATUS [OK] [2013/05/14 12:53:41 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=Master, state=ONLINE)                                     |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+
    
    +----------------------------------------------------------------------------+
    |host3(slave:ONLINE, progress=156325, latency=1.642)                         |
    |STATUS [OK] [2013/05/14 12:53:41 PM BST]                                    |
    +----------------------------------------------------------------------------+
    |  MANAGER(state=ONLINE)                                                     |
    |  REPLICATOR(role=slave, Master=host2, state=ONLINE)                        |
    |  DATASERVER(state=ONLINE)                                                  |
    |  CONNECTIONS(created=0, active=0)                                          |
    +----------------------------------------------------------------------------+

6.6.2.4. Failing over a Primary

When a Primary datasource fails in MANUAL policy mode, the datasource must be manually failed over to an active datasource, either by letting the cluster select the most up-to-date Replica automatically:

[LOGICAL:EXPERT] /alpha > failover

Or to an explicit host:

[LOGICAL:EXPERT] /alpha > failover to host2
SELECTED SLAVE: host2@alpha
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'host1@alpha'
SHUNNING PREVIOUS MASTER 'host1@alpha'
PUT THE NEW MASTER 'host2@alpha' ONLINE
RECONFIGURING SLAVE 'host3@alpha' TO POINT TO NEW MASTER 'host2@alpha'
FAILOVER TO 'host2' WAS COMPLETED

For the failover command to work, the following conditions must be met:

  • There must be a Primary or relay in the SHUNNED or FAILED state.

  • There must be at least one Replica in the ONLINE state.

If there is not already a SHUNNED or FAILED Primary and a failover must be forced, use datasource shun on the Primary first, or fail over to a specific Replica.
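
As a minimal sketch, assuming host1 is the current Primary and host2 is the chosen Replica (hostnames are illustrative), a forced failover would look like this:

[LOGICAL:EXPERT] /alpha > datasource host1 shun
[LOGICAL:EXPERT] /alpha > failover to host2

Shunning the Primary first satisfies the precondition above, allowing the failover command to proceed.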

6.6.2.5. Split-Brain Discussion

A split-brain occurs when a cluster that normally has a single write Primary ends up with two writable Primaries.

This means that some writes which should go to the “real” Primary are sent to a different node which was promoted to write Primary by mistake.

Once that happens, some writes exist on one Primary and not the other, creating two broken Primaries. Merging the two data sets is impossible, leading to a full restore, which is clearly NOT desirable.

A split-brain scenario is therefore to be strongly avoided.

A situation like this is most often encountered when there is a network partition of some sort, especially with the nodes spread over multiple availability zones in a single region of a cloud deployment.

This would potentially result in all nodes being isolated, without a clear majority within the voting quorum.

A poorly-designed cluster could elect more than one Primary under these conditions, leading to the split-brain scenario.

Because no node can safely claim a majority in this situation, the default action of a Tungsten Cluster is to SHUN all of the nodes.

Shunning ALL of the nodes means that no client traffic is processed by any node; both reads and writes are blocked.

When this happens, it is up to a human administrator to select the proper Primary and recover the cluster.
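
In this case the procedure in Section 6.6.2.1, “Recover when there are no Primaries” applies. As a minimal sketch, assuming host1 in the alpha service has been identified as the most up-to-date node (service and host names are illustrative):

[LOGICAL] /alpha > recover Master using alpha/host1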

For more information, please see Section 6.6.2, “Recover a failed Primary”.