All datasources will be in one of a number of states that indicate their current operational status.
ONLINE State
A datasource in the ONLINE state is considered to be operating normally, with replication, connector and other traffic being handled as normal.
OFFLINE State
A datasource in the OFFLINE state does not accept connections through the connector for either reads or writes.
When the dataservice is in the AUTOMATIC policy mode, a datasource in the OFFLINE state is automatically recovered and placed into the ONLINE state. If this operation fails, the datasource remains in the OFFLINE state.
When the dataservice is in MAINTENANCE or MANUAL policy mode, the datasource will remain in the OFFLINE state until it is explicitly switched to the ONLINE state.
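The policy-driven behavior above can be summarized as a small decision function. This is a minimal sketch for illustration only; the names `PolicyMode` and `next_state` are hypothetical and not part of the actual Manager implementation:

```python
from enum import Enum

class PolicyMode(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"
    MAINTENANCE = "maintenance"

def next_state(policy, recovery_succeeds):
    """Return the state an OFFLINE datasource ends up in under each policy."""
    if policy is PolicyMode.AUTOMATIC:
        # The manager attempts recovery itself; if recovery fails the
        # datasource simply remains OFFLINE.
        return "ONLINE" if recovery_succeeds else "OFFLINE"
    # MAINTENANCE and MANUAL modes require an explicit operator action
    # (e.g. `datasource <host> online` in cctrl) to leave OFFLINE.
    return "OFFLINE"
```

The key point the sketch captures is that only AUTOMATIC mode ever changes the state without operator intervention.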
FAILED State
When a datasource fails, for example when one of the services for the datasource stops responding or fails outright, the datasource is placed into the FAILED state. In the example below, the underlying dataserver has failed:
+----------------------------------------------------------------------------+
|host3(slave:FAILED(DATASERVER 'host3@alpha' STOPPED),                       |
|progress=154146, latency=31.419)                                            |
|STATUS [CRITICAL] [2013/05/10 11:51:42 PM BST]                              |
|REASON[DATASERVER 'host3@alpha' STOPPED]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=host1, state=ONLINE)                        |
|  DATASERVER(state=STOPPED)                                                 |
|  CONNECTIONS(created=208, active=0)                                        |
+----------------------------------------------------------------------------+
For a FAILED datasource, the recover command within cctrl can be used to attempt to return the datasource to an operational state. If this fails, the underlying fault must be identified and addressed before the datasource can be recovered.
SHUNNED State
A SHUNNED datasource implies that the datasource is OFFLINE. Unlike the OFFLINE state, however, a SHUNNED datasource is not automatically recovered.
A datasource in the SHUNNED state is not connected to, or actively part of, the dataservice. Individual services can be reconfigured and restarted. Operating system updates and any other maintenance can be carried out while a host is in the SHUNNED state without affecting the other members of the dataservice.
Datasources can be shunned manually or automatically. The current reason for the SHUNNED state is indicated in the status output. For example, in the sample below, the node host3 was manually shunned for maintenance reasons:
...
+----------------------------------------------------------------------------+
|host3(slave:SHUNNED(MANUALLY-SHUNNED), progress=157454, latency=1.000)      |
|STATUS [SHUNNED] [2013/05/14 05:12:52 PM BST]                               |
...
SHUNNED States
A SHUNNED node can have a number of different sub-states depending on certain actions or events that have happened within the cluster. These are as follows:
SHUNNED(DRAIN-CONNECTIONS)
SHUNNED(FAILSAFE_SHUN)
SHUNNED(MANUALLY-SHUNNED)
SHUNNED(CONFLICTS-WITH-COMPOSITE-MASTER)
SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure)
SHUNNED(SUBSERVICE-SWITCH-FAILED)
SHUNNED(FAILED-OVER-TO-db2)
SHUNNED(SET-RELAY)
SHUNNED(FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER…)
SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)
Below are various examples, with possible troubleshooting steps and solutions where applicable.
Please THINK before you issue ANY commands. These are examples ONLY, and are not to be followed blindly, because every situation is different.
The DRAIN-CONNECTIONS state means that the datasource [NODE|CLUSTER] drain [timeout] command has completed successfully, and the node or cluster is now SHUNNED as requested.
The datasource drain command prevents new connections to the specified datasource while leaving ongoing connections untouched. If a timeout (in seconds) is given, ongoing connections are severed after the timeout expires. This command returns immediately, whether or not a timeout is given. Under the hood, this command puts the datasource into the SHUNNED state, with lastShunReason set to DRAIN-CONNECTIONS. This feature is available as of version 7.0.2.
cctrl> use world
cctrl> ls
+---------------------------------------------------------------------------------+
|emea(composite master:ONLINE, global progress=21269, max latency=8.997)          |
|STATUS [OK] [2023/01/17 09:11:36 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  emea(master:ONLINE, progress=21, max latency=8.997)                            |
|  emea_from_usa(relay:ONLINE, progress=21248, max latency=3.000)                 |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|usa(composite master:SHUNNED(DRAIN-CONNECTIONS), global progress=21, max         |
|latency=2.217)                                                                   |
|STATUS [SHUNNED] [2023/01/19 08:05:02 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  usa(master:SHUNNED, progress=-1, max latency=-1.000)                           |
|  usa_from_emea(relay:ONLINE, progress=21, max latency=2.217)                    |
+---------------------------------------------------------------------------------+
cctrl> use usa
cctrl> ls
+---------------------------------------------------------------------------------+
|db16-demo.continuent.com(master:SHUNNED(DRAIN-CONNECTIONS), progress=-1, THL     |
|latency=-1.000)                                                                  |
|STATUS [SHUNNED] [2023/01/19 08:05:02 PM UTC][SSL]                               |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=master, state=OFFLINE)                                         |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db17-demo.continuent.com(slave:SHUNNED(DRAIN-CONNECTIONS), progress=-1,          |
|latency=-1.000)                                                                  |
|STATUS [SHUNNED] [2023/01/19 08:05:03 PM UTC][SSL]                               |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db16-demo.continuent.com, state=OFFLINE)         |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db18-demo.continuent.com(slave:SHUNNED(DRAIN-CONNECTIONS), progress=-1,          |
|latency=-1.000)                                                                  |
|STATUS [SHUNNED] [2023/01/19 08:05:02 PM UTC][SSL]                               |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db16-demo.continuent.com, state=OFFLINE)         |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
cctrl> use world
cctrl> datasource usa welcome
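The drain semantics described above can be modelled as a toy state machine. This is only an illustrative sketch under stated assumptions; the `DataSource` class and its methods are hypothetical, not the actual Manager implementation:

```python
class DataSource:
    """Toy model of the datasource drain behaviour."""

    def __init__(self):
        self.state = "ONLINE"
        self.last_shun_reason = None
        self.active_connections = set()

    def connect(self, conn_id):
        # A SHUNNED datasource refuses all new connections.
        if self.state == "SHUNNED":
            raise ConnectionRefusedError("datasource is shunned")
        self.active_connections.add(conn_id)

    def drain(self, timeout=None):
        # New connections are refused from this point on...
        self.state = "SHUNNED"
        self.last_shun_reason = "DRAIN-CONNECTIONS"
        # ...while existing connections survive unless a timeout was given.
        # (In the real cluster the severing happens asynchronously after
        # `timeout` seconds; the drain command itself returns immediately.)
        if timeout is not None:
            self.active_connections.clear()
```

The important contract the sketch shows: without a timeout, draining never touches connections that are already established.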
The FAILSAFE_SHUN state means that there was a complete network partition, such that none of the nodes were able to communicate with each other. Database writes are blocked to prevent a split-brain from happening.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE_SHUN), progress=56747909871, THL                |
|latency=12.157)                                                             |
|STATUS [OK] [2021/09/25 01:09:04 PM CDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=374639937, active=0)                                  |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE_SHUN), progress=-1, latency=-1.000)              |
|STATUS [OK] [2021/09/15 11:58:05 PM CDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=OFFLINE)                         |
|  DATASERVER(state=STOPPED)                                                 |
|  CONNECTIONS(created=70697946, active=0)                                   |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:SHUNNED(FAILSAFE_SHUN), progress=56747909871, latency=12.267)     |
|STATUS [OK] [2021/09/25 01:09:21 PM CDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                          |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=168416988, active=0)                                  |
+----------------------------------------------------------------------------+
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
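At its core the fail-safe behaviour is a majority (quorum) test: a side of a partition that cannot see a majority of the cluster must stop serving writes. The function below is a simplified sketch of that rule, not the actual Manager quorum algorithm:

```python
def failsafe_decision(reachable_members, total_members):
    """Decide whether this side of a partition may keep accepting writes.

    A manager that cannot see a strict majority of the cluster must assume
    a network partition and shun writes rather than risk split-brain.
    """
    if reachable_members * 2 > total_members:
        return "ONLINE"        # majority visible: safe to keep serving
    return "FAILSAFE_SHUN"     # no majority: block writes on this side

# In a complete partition of a 3-node cluster every node sees only itself,
# so all three nodes independently arrive at FAILSAFE_SHUN.
```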
The MANUALLY-SHUNNED state means that an administrator has issued the datasource {NODE|CLUSTER} shun command using cctrl or the REST API, resulting in the specified node or cluster being SHUNNED.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(MANUALLY-SHUNNED), progress=15969982, THL                |
|latency=0.531)                                                              |
|STATUS [SHUNNED] [2014/01/17 02:57:19 PM MST]                               |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=4204, active=23)                                      |
+----------------------------------------------------------------------------+
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
The CONFLICTS-WITH-COMPOSITE-MASTER state means that there is already an active primary in the cluster, and this primary therefore cannot be brought online.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(CONFLICTS-WITH-COMPOSITE-MASTER),                        |
|progress=25475128064, THL latency=0.010)                                    |
|STATUS [SHUNNED] [2015/04/11 02:35:24 PM PDT]                               |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=2568, active=0)                                       |
+----------------------------------------------------------------------------+
The FAILSAFE AFTER Shunned by fail-safe procedure state means that the Manager voting quorum encountered an unrecoverable problem and shut down database writes to prevent a split-brain situation.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE AFTER Shunned by fail-safe                      |
|procedure), progress=96723577, THL latency=0.779)                           |
|STATUS [OK] [2014/03/22 01:12:35 AM EDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=135, active=0)                                        |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe                       |
|procedure), progress=96723575, latency=0.788)                               |
|STATUS [OK] [2014/03/31 04:52:39 PM EDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                          |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=28, active=0)                                         |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db5(slave:SHUNNED:ARCHIVE (FAILSAFE AFTER Shunned by                        |
|fail-safe procedure), progress=96723581, latency=0.905)                     |
|STATUS [OK] [2014/03/22 01:13:58 AM EDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                          |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=23, active=0)                                         |
+----------------------------------------------------------------------------+
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
The SUBSERVICE-SWITCH-FAILED state means that the cluster tried to switch the Primary role to another node in response to an admin request, but was unable to do so due to a failure at the sub-service level in a Composite Active/Active (CAA) cluster.
+---------------------------------------------------------------------------------+
|db1(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668586,                   |
|latency=1.197)                                                                   |
|STATUS [SHUNNED] [2021/01/14 10:20:33 AM UTC][SSL]                               |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=relay, master=db4, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668586,                   |
|latency=1.239)                                                                   |
|STATUS [SHUNNED] [2021/01/14 10:20:39 AM UTC][SSL]                               |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668591,                   |
|latency=0.501)                                                                   |
|STATUS [SHUNNED] [2021/01/14 10:20:36 AM UTC][SSL]                               |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=pip-db1, state=ONLINE)                           |
|  DATASERVER(state=ONLINE)                                                       |
+---------------------------------------------------------------------------------+
cctrl> use {SUBSERVICE-NAME-HERE}
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
The FAILED-OVER-TO-{nodename} state means that the cluster automatically and successfully invoked a failover from one node to another. The fact that there appear to be two masters is completely normal after a failover; it indicates that the cluster should be manually recovered once the node which failed has been fixed.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILED-OVER-TO-db2), progress=248579111,                 |
|THL latency=0.296)                                                          |
|STATUS [SHUNNED] [2016/01/23 02:15:16 AM CST]                               |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=108494736, active=0)                                  |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(master:ONLINE, progress=248777065, THL latency=0.650)                   |
|STATUS [OK] [2016/01/23 02:15:24 AM CST]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=3859635, active=591)                                  |
+----------------------------------------------------------------------------+
cctrl> recover
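The "two masters" appearance after an automatic failover can be modelled as a simple bookkeeping step: the failed node keeps its master role label but is shunned with a reason recording where the role went. The function below is a hypothetical sketch of that bookkeeping, not the actual failover procedure:

```python
def record_failover(states, old_master, new_master):
    """Return the per-node state strings after an automatic failover.

    The failed master is shunned with a FAILED-OVER-TO-<node> reason while
    the promoted node serves traffic; both still carry the `master` role
    label, which is why two masters appear until a manual recover rejoins
    the old node as a slave.
    """
    updated = dict(states)
    updated[old_master] = "master:SHUNNED(FAILED-OVER-TO-%s)" % new_master
    updated[new_master] = "master:ONLINE"
    return updated
```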
The SET-RELAY state means that the cluster was in the middle of a switch which failed to complete, either in a Composite Active/Passive (CAP) cluster or in a Composite Active/Active (CAA) sub-service.
+---------------------------------------------------------------------------------+
|db1(relay:SHUNNED(SET-RELAY), progress=-1, latency=-1.000)                       |
|STATUS [SHUNNED] [2022/08/05 08:13:03 AM PDT]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=relay, master=db4, state=SUSPECT)                              |
|  DATASERVER(state=ONLINE)                                                       |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=14932,                     |
|latency=0.000)                                                                   |
|STATUS [SHUNNED] [2022/08/05 06:13:36 AM PDT]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=14932,                     |
|latency=0.000)                                                                   |
|STATUS [SHUNNED] [2022/08/05 06:13:38 AM PDT]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
+---------------------------------------------------------------------------------+
cctrl> use {PASSIVE-SERVICE-NAME-HERE}
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
The FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER state means that the cluster tried to automatically fail the Primary role over to another node but was unable to do so.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER       |
| FOR DATASOURCE 'db1'. CHECK COORDINATOR MANAGER LOG),                      |
| progress=21179013, THL latency=4.580)                                      |
|STATUS [SHUNNED] [2020/04/10 01:40:17 PM CDT]                               |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=294474815, active=0)                                  |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=21179013, latency=67.535)                        |
|STATUS [OK] [2020/04/02 09:42:42 AM CDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                          |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=22139851, active=1)                                   |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:ONLINE, progress=21179013, latency=69.099)                        |
|STATUS [OK] [2020/04/07 10:20:20 AM CDT]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                          |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=66651718, active=7)                                   |
+----------------------------------------------------------------------------+
The CANNOT-SYNC-WITH-HOME-SITE state is a composite-level state which means that the sites were unable to see each other at some point in time. This scenario may need a manual recovery at the composite level for the cluster to heal.
From the usa side:
emea(composite master:SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)

From the emea side:
usa(composite master:SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)

cctrl compositeSvc> recover