When a Primary datasource is automatically failed over in AUTOMATIC policy mode, the datasource can be brought back into the dataservice as a Replica by using the recover command:
[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'host1@alpha' FROM 'Master' TO 'slave'
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL
The recovered datasource will be added back to the dataservice as a Replica.
When there are no Primaries available, due to a failover of a Primary or a multiple host failure, there are two options. The first is to use the recover Master using command, which sets the Primary to the specified host and tries to automatically recover all the remaining nodes in the dataservice. The second is to manually set the Primary host and then recover the remaining datasources manually.
Using recover Master using
This command should only be used in urgent scenarios where the most up-to-date Primary can be identified. If there are multiple failures or mismatches between Primaries and Replicas, the command may not be able to recover all services, but it will always result in an active Primary being configured.
This command performs two distinct actions: first it calls set Master to select the new Primary, and then it calls datasource recover on each of the remaining Replicas. This attempts to recover the entire dataservice by switching the Primary and reconfiguring the Replicas to work with the new Primary.
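For reference, these two stages are roughly equivalent to running the following commands by hand (the hostnames here are illustrative, and set Master additionally requires set force true when issued directly):
[LOGICAL] /alpha > set Master host1
[LOGICAL] /alpha > datasource host2 recover
[LOGICAL] /alpha > datasource host3 recover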
To use the command, first examine the state of the dataservice and choose the datasource that is the most up-to-date or canonical. For example, within the following output, each datasource has the same sequence number, so any datasource could potentially be used as the Primary:
[LOGICAL] /alpha > ls
COORDINATOR[host1:AUTOMATIC:ONLINE]
ROUTERS:
+----------------------------------------------------------------------------+
|connector@host1[18450](ONLINE, created=0, active=0) |
|connector@host2[8877](ONLINE, created=0, active=0) |
|connector@host3[8895](ONLINE, created=0, active=0) |
+----------------------------------------------------------------------------+
DATASOURCES:
+----------------------------------------------------------------------------+
|host1(Master:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure), |
|progress=17, THL latency=0.565) |
|STATUS [OK] [2013/11/04 04:39:28 PM GMT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=Master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|host2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure), |
|progress=17, latency=1.003) |
|STATUS [OK] [2013/11/04 04:39:51 PM GMT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, Master=host1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|host3(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure), |
|progress=17, latency=1.273) |
|STATUS [OK] [2013/10/26 06:30:26 PM BST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, Master=host1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
Once a host has been chosen, call the recover Master using command, specifying the full service name and hostname of the chosen datasource:
[LOGICAL] /alpha > recover Master using alpha/host1
This command is generally meant to help in the recovery of a data service
that has data sources shunned due to a fail-safe shutdown of the service or
under other circumstances where you wish to force a specific data source to become
the primary. Be forewarned that if you do not exercise care when using this command
you may lose data permanently or otherwise make your data service unusable.
Do you want to continue? (y/n)> y
DATA SERVICE 'alpha' DOES NOT HAVE AN ACTIVE PRIMARY. CAN PROCEED WITH 'RECOVER USING'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
DataSource 'host1' is now OFFLINE
DATASOURCE 'host1@alpha' IS NOW A MASTER
FOUND PHYSICAL DATASOURCE TO RECOVER: 'host2@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host2'
DATA SERVER 'host2' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host2@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
DataSource 'host2' is now OFFLINE
RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
FOUND PHYSICAL DATASOURCE TO RECOVER: 'host3@alpha'
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host3'
DATA SERVER 'host3' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host3@alpha' TO A SLAVE USING 'host1@alpha' AS THE MASTER
DataSource 'host3' is now OFFLINE
RECOVERY OF DATA SERVICE 'alpha' SUCCEEDED
RECOVERED 2 DATA SOURCES IN SERVICE 'alpha'
You will be prompted to confirm that you wish to use the selected host as the new Primary. cctrl then proceeds to set the new Primary and recover the remaining Replicas.
If this operation fails, you can fall back to the manual process: use set Master, and then proceed to recover each Replica manually.
Using set Master
The set Master command forcibly sets the Primary to the specified host. It should only be used when no Primary is currently available within the dataservice and recovery has failed. The command performs a single operation: it explicitly sets the new Primary to the specified host.
Using set Master is an expert-level command and may lead to data loss if the wrong Primary is used. Because of this, cctrl must be forced to execute the command by first issuing set force true; the command will not be executed otherwise.
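Force mode can be enabled from the cctrl prompt before issuing set Master, for example:
[LOGICAL] /alpha > set force true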
To use the command, pick the most up-to-date Primary, or the host that you want to use as the Primary within your dataservice, then issue the command:
[LOGICAL] /alpha > set Master host3
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host3'
DATA SERVER 'host3' IS NOW AVAILABLE FOR CONNECTIONS
DataSource 'host3' is now OFFLINE
DATASOURCE 'host3@alpha' IS NOW A MASTER
This does not recover the remaining Replicas within the cluster; these must be recovered manually. This can be achieved either by using Section 6.6.1, “Recover a failed Replica”, or, if this is not possible, by using Section 6.6.1.1, “Provision or Reprovision a Replica”.
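For example, assuming host1 and host2 are the remaining Replicas after host3 was set as the Primary, each can be recovered in turn with the datasource recover command:
[LOGICAL] /alpha > datasource host1 recover
[LOGICAL] /alpha > datasource host2 recover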
When a Primary datasource fails in MANUAL policy mode and the node has been failed over, once the datasource becomes available again the node can be added back to the dataservice by using the recover command, which enables the host as a Replica:
[LOGICAL:EXPERT] /alpha > datasource host1 recover
VERIFYING THAT WE CAN CONNECT TO DATA SERVER 'host1'
DATA SERVER 'host1' IS NOW AVAILABLE FOR CONNECTIONS
RECOVERING 'host1@alpha' TO A SLAVE USING 'host2@alpha' AS THE MASTER
SETTING THE ROLE OF DATASOURCE 'host1@alpha' FROM 'Master' TO 'slave'
RECOVERY OF 'host1@alpha' WAS SUCCESSFUL
The recovered Primary will be added back to the dataservice as a Replica.
MAINTENANCE policy mode
If the dataservice is in MAINTENANCE mode when the Primary fails, automatic recovery cannot sensibly make the decision about which node should be used as the Primary. In that case, the datasource service must be manually reconfigured.
In the sample below, host1 is the current Primary and host2 is a Replica. To manually update and switch host1 to be the Replica and host2 to be the Primary:
Shun the failed Primary (host1) and set the replicator offline:
[LOGICAL:EXPERT] /alpha > datasource host1 shun
DataSource 'host1' set to SHUNNED
[LOGICAL:EXPERT] /alpha > replicator host1 offline
Replicator 'host1' is now OFFLINE
Shun the Replica (host2) and set the replicator to the offline state:
[LOGICAL:EXPERT] /alpha > datasource host2 shun
DataSource 'host2' set to SHUNNED
[LOGICAL:EXPERT] /alpha > replicator host2 offline
Replicator 'host2' is now OFFLINE
Configure host2 as the Primary within the replicator service:
[LOGICAL:EXPERT] /alpha > replicator host2 Master
Set the replicator on host2 online:
[LOGICAL:EXPERT] /alpha > replicator host2 online
Recover the datasource host2 and then set it online:
[LOGICAL:EXPERT] /alpha >datasource host2 welcome
[LOGICAL:EXPERT] /alpha >datasource host2 online
Switch the replicator on host1 to be a Replica of host2:
[LOGICAL:EXPERT] /alpha > replicator host1 slave host2
Replicator 'host1' is now a slave of replicator 'host2'
Switch the replicator online:
[LOGICAL:EXPERT] /alpha > replicator host1 online
Replicator 'host1' is now ONLINE
Switch the datasource role for host1 to be in Replica mode:
[LOGICAL:EXPERT] /alpha > datasource host1 slave
Datasource 'host1' now has role 'slave'
The configuration and roles for the host have now been updated; the datasource can be added back to the dataservice and then put online:
[LOGICAL:EXPERT] /alpha > datasource host1 recover
DataSource 'host1' is now OFFLINE
[LOGICAL:EXPERT] /alpha > datasource host1 online
Setting server for data source 'host1' to READ-ONLY
+----------------------------------------------------------------------------+
|host1                                                                       |
+----------------------------------------------------------------------------+
|Variable_name Value                                                         |
|read_only ON                                                                |
+----------------------------------------------------------------------------+
DataSource 'host1@alpha' is now ONLINE
With the dataservice in AUTOMATIC policy mode, the datasource will be placed online, which can be verified with ls:
[LOGICAL:EXPERT] /alpha > ls
COORDINATOR[host3:AUTOMATIC:ONLINE]
ROUTERS:
+----------------------------------------------------------------------------+
|connector@host1[19869](ONLINE, created=0, active=0) |
|connector@host2[28116](ONLINE, created=0, active=0) |
|connector@host3[1533](ONLINE, created=0, active=0) |
+----------------------------------------------------------------------------+
DATASOURCES:
+----------------------------------------------------------------------------+
|host1(slave:ONLINE, progress=156325, latency=725.737) |
|STATUS [OK] [2013/05/14 01:06:08 PM BST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, Master=host2, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|host2(Master:ONLINE, progress=156325, THL latency=0.606) |
|STATUS [OK] [2013/05/14 12:53:41 PM BST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=Master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|host3(slave:ONLINE, progress=156325, latency=1.642) |
|STATUS [OK] [2013/05/14 12:53:41 PM BST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, Master=host2, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
When a Primary datasource fails in MANUAL policy mode, the datasource must be manually failed over to an active datasource, either by selecting the most up-to-date Replica automatically:
[LOGICAL:EXPERT] /alpha > failover
Or to an explicit host:
[LOGICAL:EXPERT] /alpha > failover to host2
SELECTED SLAVE: host2@alpha
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'host1@alpha'
SHUNNING PREVIOUS MASTER 'host1@alpha'
PUT THE NEW MASTER 'host2@alpha' ONLINE
RECONFIGURING SLAVE 'host3@alpha' TO POINT TO NEW MASTER 'host2@alpha'
FAILOVER TO 'host2' WAS COMPLETED
For the failover command to work, there must be a SHUNNED or FAILED Primary within the dataservice. If there is not already a SHUNNED or FAILED Primary and a failover must be forced, either use datasource shun on the Primary, or failover to a specific Replica.
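For example, assuming host1 is the current Primary and the failover must be forced, the Primary can be shunned first and an explicit failover issued (hostnames are illustrative):
[LOGICAL:EXPERT] /alpha > datasource host1 shun
[LOGICAL:EXPERT] /alpha > failover to host2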
A split-brain occurs when a cluster that normally has a single write Primary ends up with two writable Primaries.
This means that some writes which should go to the “real” Primary are sent to a different node that was promoted to write Primary by mistake.
Once that happens, some writes exist on one Primary and not the other, creating two broken Primaries. Merging the two data sets is impossible, leading to a full restore, which is clearly NOT desirable.
A split-brain scenario is therefore to be strongly avoided.
A situation like this is most often encountered when there is a network partition of some sort, especially with the nodes spread over multiple availability zones in a single region of a cloud deployment.
A network partition could leave all nodes isolated, without a clear majority within the voting quorum, and a poorly-designed cluster could elect more than one Primary under these conditions, leading to the split-brain scenario.
Because of this risk, the default action of a Tungsten Cluster during such a partition is to SHUN all of the nodes.
Shunning ALL of the nodes means that no client traffic is processed by any node; both reads and writes are blocked.
When this happens, it is up to a human administrator to select the proper Primary and recover the cluster.
For more information, please see Section 6.6.2, “Recover a failed Primary”.