Maintenance on the dataservice, for example updating the MySQL configuration file, can be performed using a similar sequence to that shown in Section 6.15, “Performing Database or OS Maintenance”, except that you must also restart the corresponding Tungsten Replicator service after the main Tungsten Cluster service has been placed back online.
For example, to perform maintenance on the east service:
Put the dataservice into MAINTENANCE mode. This ensures that Tungsten Cluster will not attempt to automatically recover the service.
cctrl [east]> set policy maintenance
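If required, you can confirm the policy change with ls; the coordinator line at the top of the output should now show MAINTENANCE (the coordinator hostname shown below is illustrative, and the remainder of the output has been omitted):
cctrl [east]> ls
COORDINATOR[east1:MAINTENANCE:ONLINE]
...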
Shun the first Replica datasource so that maintenance can be performed on the host.
cctrl [east]> datasource east1 shun
Perform the updates, such as updating my.cnf, changing schemas, or performing other maintenance.
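For example, a configuration change might look like the following; the file location and the setting shown are illustrative only, so edit whichever options your maintenance actually requires:
shell> sudo vi /etc/my.cnf
# illustrative change only - e.g. adjust the InnoDB buffer pool size
# innodb_buffer_pool_size = 8G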
If MySQL configuration has been modified, restart the MySQL service:
cctrl [east]> service host/mysql restart
Bring the host back into the dataservice:
cctrl [east]> datasource host recover
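For example, if the Replica shunned earlier was east1, the command would be:
cctrl [east]> datasource east1 recover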
Perform a switch so that the Primary becomes a Replica and can then be shunned and have the necessary maintenance performed:
cctrl [east]> switch
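Issued without arguments, switch promotes the most appropriate Replica automatically. If you want to control which Replica becomes the new Primary, you can name it explicitly; for example, to promote east2 (a hypothetical Replica in this dataservice):
cctrl [east]> switch to east2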
Repeat the previous steps to shun the host, perform maintenance, and then switch again until all the hosts have been updated.
Set the policy back to automatic:
cctrl> set policy automatic
On each host in the other region, manually restart the Tungsten Replicator service, which will have gone offline when MySQL was restarted:
shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -host host -service east online
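For example, assuming the other region's cluster contains a host named west1 (an illustrative name only), the replicator on that host would be brought back online and its state verified as follows:
shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -host west1 -service east online
shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -host west1 -service east status | grep state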
In the event of a replication fault, the standard cctrl, trepctl and other utility commands in Chapter 9, Command-line Tools can be used to bring the dataservice back into operation. All the tools are safe to use.
If you have to perform any updates or modifications to the stored MySQL data, ensure binary logging has been disabled for your session before running any commands:
mysql> SET SESSION SQL_LOG_BIN=0;
This prevents the statements and operations from reaching the binary log, so that they will not be replicated to the other hosts.
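For example, a manual data fix might be wrapped as follows; the UPDATE statement, schema, and table names are purely illustrative:
mysql> SET SESSION SQL_LOG_BIN=0;
mysql> UPDATE mydb.mytable SET status = 'fixed' WHERE id = 100;
mysql> SET SESSION SQL_LOG_BIN=1;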
In a composite multi-master topology, a switch or a failover not only promotes a Replica to be the new Primary, but also requires the cross-site communications to be reconfigured. This process therefore assumes that cross-site communication is online and working. In some situations, cross-site communication may be down, or cross-site replication may be in an OFFLINE:ERROR state - for example, a DDL or DML statement that worked in the local cluster may have failed to apply in the remote cluster.
If a switch or failover occurs and the process is unable to reconfigure the cross-site replicators, the local switch will still succeed; however, the associated cross-site services will be placed into a SHUNNED(SUBSERVICE-SWITCH-FAILED) state. This section explains how to recover from this situation.
The examples are based on a 2-cluster topology, named NYC and LONDON, with a composite dataservice named GLOBAL.
The clusters are configured with the following dataservers:
NYC : db1 (Primary), db2 (Replica), db3 (Replica)
LONDON: db4 (Primary), db5 (Replica), db6 (Replica)
The cross-site replicators in both clusters are in an OFFLINE:ERROR state due to failing DDL. A switch was then issued, promoting db3 as the new Primary in NYC and db5 as the new Primary in LONDON.
When the cluster enters a state where the cross-site services are in an error, output from cctrl will look like the following:
shell> cctrl -expert -multi
[LOGICAL:EXPERT] / > use london_from_nyc
london_from_nyc: session established, encryption=false, authentication=false
[LOGICAL:EXPERT] /london_from_nyc > ls

COORDINATOR[db6:AUTOMATIC:ONLINE]

ROUTERS:
+-----------------------------------------------------------------------------------+
|connector@db1[26248](ONLINE, created=0, active=0)                                   |
|connector@db2[14906](ONLINE, created=0, active=0)                                   |
|connector@db3[15035](ONLINE, created=0, active=0)                                   |
|connector@db4[27813](ONLINE, created=0, active=0)                                   |
|connector@db5[4379](ONLINE, created=0, active=0)                                    |
|connector@db6[2098](ONLINE, created=0, active=0)                                    |
+-----------------------------------------------------------------------------------+

DATASOURCES:
+-----------------------------------------------------------------------------------+
|db5(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.219)             |
|STATUS [SHUNNED] [2018/03/15 10:27:24 AM UTC]                                       |
+-----------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                             |
|  REPLICATOR(role=relay, master=db3, state=ONLINE)                                  |
|  DATASERVER(state=ONLINE)                                                          |
|  CONNECTIONS(created=0, active=0)                                                  |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
|db4(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.252)             |
|STATUS [SHUNNED] [2018/03/15 10:27:25 AM UTC]                                       |
+-----------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                             |
|  REPLICATOR(role=slave, master=db5, state=ONLINE)                                  |
|  DATASERVER(state=ONLINE)                                                          |
|  CONNECTIONS(created=0, active=0)                                                  |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
|db6(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.279)             |
|STATUS [SHUNNED] [2018/03/15 10:27:25 AM UTC]                                       |
+-----------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                             |
|  REPLICATOR(role=slave, master=db4, state=ONLINE)                                  |
|  DATASERVER(state=ONLINE)                                                          |
|  CONNECTIONS(created=0, active=0)                                                  |
+-----------------------------------------------------------------------------------+
In the above example, you can see that all services are in the SHUNNED(SUBSERVICE-SWITCH-FAILED) state and that only partial reconfiguration has happened. The replicators for db4 and db6 should be Replicas of db5, while db5 has been correctly reconfigured to use the new Primary in NYC, db3. The actual state of the cluster in each scenario may be different, depending upon the cause of the loss of cross-site communication. Using the steps below, apply the actions that relate to your own cluster state; if in any doubt, always contact Continuent Support for assistance.
The first step is to ensure that the initial replication errors have been resolved and that the replicators are in an ONLINE state. The steps required to resolve the replicators will depend on the reason for the error; for further guidance on resolving these issues, see Chapter 6, Operations Guide.
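For example, the state of the cross-site replicator for a sub-service can be confirmed with trepctl before continuing; the command below assumes the sub-service is named london_from_nyc as in this example and that trepctl is in your path (use the full path appropriate to your installation if it is not):
shell> trepctl -service london_from_nyc status | grep -E 'state|pendingError'
Once the underlying problem has been fixed, the replicator can be brought online if it has not recovered automatically:
shell> trepctl -service london_from_nyc online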
From one node, connect into cctrl at the expert level:
shell> cctrl -expert -multi
Next, connect to the cross-site subservice, in this example, london_from_nyc
cctrl> use london_from_nyc
Next, place the service into Maintenance Mode
cctrl> set policy maintenance
Enable override of commands issued
cctrl> set force true
Bring the relay datasource online
cctrl> datasource db5 online
If you need to change the source for the relay replicator so that it points to the correct, new, Primary in the remote cluster, take the replicator offline. If the relay source is already correct, skip ahead to the step below for altering the remaining Replica replicators.
cctrl> replicator db5 offline
Change the source of the relay replicator
cctrl> replicator db5 relay nyc/db3
Bring the replicator online
cctrl> replicator db5 online
For each datasource that requires the replicator altering, issue the following commands:
cctrl> replicator datasource offline
cctrl> replicator datasource slave db5
cctrl> replicator datasource online
For example:
cctrl> replicator db4 offline
cctrl> replicator db4 slave db5
cctrl> replicator db4 online
Once all replicators are using the correct source, we can then bring the cluster back online:
cctrl> cluster welcome
Some of the datasources may still be in the SHUNNED state, so for each of those, you can then issue the following:
cctrl> datasource datasource online
For example:
cctrl> datasource db4 online
Once all nodes are online, we can then return the cluster to automatic
cctrl> set policy automatic
Repeat this process for the other cross-site subservice, if required.
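For example, if the reverse subservice follows the same naming convention it would be named nyc_from_london (a name assumed from the convention used above); the same sequence of steps would then be applied to it:
cctrl> use nyc_from_london
cctrl> set policy maintenance
(repeat the datasource and replicator recovery steps above for this subservice)
cctrl> set policy automatic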