3.4.7. Dataserver maintenance

To perform maintenance on the dataservice, for example to update the MySQL configuration file, can be achieved in a similar sequence to that shown in Section 6.15, “Performing Database or OS Maintenance”, except that you must also restart the corresponding Tungsten Replicator service after the main Tungsten Cluster service has been placed back online.

For example, to perform maintenance on the east service:

  1. Put the dataservice into MAINTENANCE mode. This ensures that Tungsten Cluster will not attempt to automatically recover the service.

    cctrl [east]> set policy maintenance
  2. Shun the first Replica datasource so that maintenance can be performed on the host.

    cctrl [east]> datasource east1 shun
  3. Perform the updates, such as updating my.cnf, changing schemas, or performing other maintenance.

  4. If MySQL configuration has been modified, restart the MySQL service:

    cctrl [east]> service host/mysql restart
  5. Bring the host back into the dataservice:

    cctrl [east]> datasource host recover
  6. Perform a switch so that the Primary becomes a Replica and can then be shunned and have the necessary maintenance performed:

    cctrl [east]> switch
  7. Repeat the previous steps to shun the host, perform maintenance, and then switch again until all the hosts have been updated.

  8. Set the policy back to automatic:

    cctrl> set policy automatic
  9. On each host in the other region, manually restart the Tungsten Replicator service, which will have gone offline when MySQL was restarted:

    shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -host host -service east online

3.4.7.1. Fixing Replication Errors

In the event of a replication fault, the standard cctrl, trepctl and other utility commands in Chapter 9, Command-line Tools can be used to bring the dataservice back into operation. All the tools are safe to use.

If you have to perform any updates or modifications to the stored MySQL data, ensure binary logging has been disabled using:

mysql> SET SESSION SQL_LOG_BIN=0;

Before running any commands. This prevents statements and operations reaching the binary log so that the operations will not be replicated to other hosts.

3.4.7.1.1. Recovering Cross Site Services

In a cmm_name; topology, a switch or a failover not only promotes a Replica to be a new Primary, but also will require the ability to reconfigure cross-site communications. This process therefore assumes that cross-site communication is online and working. In some situations, it may be possible that cross-site communication is down, or for some reason cross-site replication is in an OFFLINE:ERROR state - for example a DDL or DML statement that worked in the local cluster may have failed to apply in the remote.

If a switch or failover occurs and the process is unable to reconfigure the cross-site replicators, the local switch will still succeed, however the associated cross-site services will be placed into a SHUNNED(SUBSERVICE-SWITCH-FAILED) state.

The guide explains how to recover from this situation.

  • The examples are based on a 2-cluster topology, named NYC and LONDON and the composite dataservice named GLOBAL.

  • The cluster is configured with the following dataservers:

    • NYC : db1 (Primary), db2 (Replica), db3 (Replica)

    • LONDON: db4 (Primary), db5 (Replica), db6 (Replica)

  • The cross site replicators in both clusters are in an OFFLINE:ERROR state due to failing DDL.

  • A switch was then issued, promoting db3 as the new Primary in NYC and db5 as the new Primary in LONDON

When the cluster enters a state where the cross-site services are in an error, output from cctrl will look like the following:

shell> cctrl -expert -multi
[LOGICAL:EXPERT] / > use london_from_nyc
london_from_nyc: session established, encryption=false, authentication=false
[LOGICAL:EXPERT] /london_from_nyc > ls
COORDINATOR[db6:AUTOMATIC:ONLINE]
 
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db1[26248](ONLINE, created=0, active=0)                                |
|connector@db2[14906](ONLINE, created=0, active=0)                                |
|connector@db3[15035](ONLINE, created=0, active=0)                                |
|connector@db4[27813](ONLINE, created=0, active=0)                                |
|connector@db5[4379](ONLINE, created=0, active=0)                                 |
|connector@db6[2098](ONLINE, created=0, active=0)                                 |
+---------------------------------------------------------------------------------+
 
DATASOURCES:
+---------------------------------------------------------------------------------+
|db5(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.219)          |
|STATUS [SHUNNED] [2018/03/15 10:27:24 AM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=relay, master=db3, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db4(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.252)          |
|STATUS [SHUNNED] [2018/03/15 10:27:25 AM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db5, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db6(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6, latency=0.279)          |
|STATUS [SHUNNED] [2018/03/15 10:27:25 AM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db4, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+

In the above example, you can see that all services are in the SHUNNED(SUBSERVICE-SWITCH-FAILED) state, and partial reconfiguration has happened.

The Replicators for db4 and db6 should be Replicas of db5, db5 has correctly configured to the new Primary in nyc, db3. The actual state of the cluster in each scenario maybe different depending upon the cause of the loss of cross-site communication. Using the steps below, apply the necessary actions that relate to your own cluster state, if in any doubt always contact Continuent Support for assistance.

  1. The first step is to ensure the initial replication errors have been resolved and that the replicators are in an online state, the steps to resolve the replicators will depend on the reason for the error, for further guidance on resolving these issues, see Chapter 6, Operations Guide.

  2. From one node, connect into cctrl at the expert level:

    shell> cctrl -expert -multi
  3. Next, connect to the cross-site subservice, in this example, london_from_nyc

    cctrl> use london_from_nyc
  4. Next, place the service into Maintenance Mode

    cctrl> set policy maintenance
  5. Enable override of commands issued

    cctrl> set force true
  6. Bring the relay datasource online

    cctrl> datasource db5 online
  7. If you need to change the source for the relay replicator to the correct, new, Primary in the remote cluster, take the replicator offline. If the relay source is correct, then move on to step 10

    cctrl> replicator db5 offline
  8. Change the source of the relay replicator

    cctrl> replicator db5 relay nyc/db3
  9. Bring the replicator online

    cctrl> replicator db5 online
  10. For each datasource that requires the replicator altering, issue the following commands:

    cctrl> replicator datasource offline
    cctrl> replicator datasource slave db5
    cctrl> replicator datasource online

    For example:

    cctrl> replicator db4 offline
    cctrl> replicator db4 slave db5
    cctrl> replicator db4 online
  11. Once all replicators are using the correct source, we can then bring the cluster back

    cctrl> cluster welcome
  12. Some of the datasources may still be in the SHUNNED state, so for each of those, you can then issue the following

    cctrl> datasource datasource online

    For example:

    cctrl> datasource db4 online
  13. Once all nodes are online, we can then return the cluster to automatic

    cctrl> set policy automatic
  14. Repeat this process for the other cross-site subservice if required

Direct link video.