5.8.1. Identifying a Transaction Mismatch

When a mismatch occurs, the replicator service will indicate that there was a problem applying a transaction on the slave. The replication process stops applying changes to the slave when the first transaction fails to be applied to the slave. This prevents multiple-statements from failing

Within cctrl the status of the datasource will be marked as DIMINISHED, and the replicator state as SUSPECT:

LOGICAL] /alpha > ls

COORDINATOR[host3:AUTOMATIC:ONLINE]
...
+----------------------------------------------------------------------------+
|host2(slave:ONLINE, progress=-1, latency=-1.000)                            |
|STATUS [DIMINISHED] [2013/06/26 10:14:12 AM BST]                            |
|REASON[FAILED TO RECOVER REPLICATOR 'host2']                                |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=host1, state=SUSPECT)                       |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=0, active=0)                                          |
+----------------------------------------------------------------------------+
...

More detailed information about the status and the statement that failed can be obtained within cctrl using the replicator command:

[LOGICAL] /alpha > replicator host2 status
+----------------------------------------------------------------------------+
|replicator host2 status                                                     |
+----------------------------------------------------------------------------+
|            appliedLastEventId:    NONE                                     |
|              appliedLastSeqno:    -1                                       |
|                appliedLatency:    -1.0                                     |
|                      channels:    -1                                       |
|                   clusterName:    firstcluster                             |
|                currentEventId:    NONE                                     |
|             currentTimeMillis:    1372238640236                            |
|                dataServerHost:    host2                                    |
|                    extensions:                                             |
|             latestEpochNumber:    -1                                       |
|              masterConnectUri:    thl://host1/                             |
|               masterListenUri:    thl://host2:2112/                        |
|            maximumStoredSeqNo:    -1                                       |
|            minimumStoredSeqNo:    -1                                       |
|               offlineRequests:    NONE                                     |
|pendingError:    Event application failed: seqno=120 fragno=0               |
|message=java.sql.SQLException: Statement failed on slave but succeeded on   |
|master                                                                      |
|              pendingErrorCode:    NONE                                     |
|           pendingErrorEventId:    mysql-bin.000012:0000000000012967;0      |
|             pendingErrorSeqno:    120                                      |
|pendingExceptionMessage:    java.sql.SQLException: Statement failed on      |
|slave but succeeded on master                                               |
|insert into messages values (0,'Trial message','Jack','Jill',now())         |
|                pipelineSource:    UNKNOWN                                  |
|               relativeLatency:    -1.0                                     |
|            resourcePrecedence:    99                                       |
|                       rmiPort:    10000                                    |
|                          role:    slave                                    |
|                     seqnoType:    java.lang.Long                           |
|                   serviceName:    firstcluster                             |
|                   serviceType:    unknown                                  |
|             simpleServiceName:    firstcluster                             |
|                      siteName:    default                                  |
|                      sourceId:    host2                                    |
|                         state:    OFFLINE:ERROR                            |
|            timeInStateSeconds:    587.806                                  |
|               transitioningTo:                                             |
|                 uptimeSeconds:    61371.957                                |
|                       version:    Continuent Tungsten 4.0.8 build 116      |
+----------------------------------------------------------------------------+

The trepsvc.log log file will also contain the error information about the failed statement. For example:

...
INFO   | jvm 1    | 2013/06/26 10:14:12 | 2013-06-26 10:14:12,423 [firstcluster - 
    q-to-dbms-0] INFO  pipeline.SingleThreadStageTask Performing emergency 
    rollback of applied changes
INFO   | jvm 1    | 2013/06/26 10:14:12 | 2013-06-26 10:14:12,424 [firstcluster - 
    q-to-dbms-0] INFO  pipeline.SingleThreadStageTask Dispatching error event: 
    Event application failed: seqno=120 fragno=0 message=java.sql.SQLException: 
    Statement failed on slave but succeeded on master
INFO   | jvm 1    | 2013/06/26 10:14:12 | 2013-06-26 10:14:12,424 [firstcluster - 
    pool-2-thread-1] ERROR management.OpenReplicatorManager Received error notification, 
    shutting down services :
INFO   | jvm 1    | 2013/06/26 10:14:12 | Event application failed: seqno=120 fragno=0 
    message=java.sql.SQLException: Statement failed on slave but succeeded on master
INFO   | jvm 1    | 2013/06/26 10:14:12 | insert into messages values (0,'Trial message',
    'Jack','Jill',now())
INFO   | jvm 1    | 2013/06/26 10:14:12 | com.continuent.tungsten.replicator.applier.ApplierException:
    java.sql.SQLException: Statement failed on slave but succeeded on master
...

Once the error or problem has been found, the exact nature of the error should be determined so that a resolution can be identified:

  1. Identify the reason for the failure by examining the full error message. Common causes are:

    • Duplicate primary key

      A row or statement is being inserted or updated that already has the same insert ID or would generate the same insert ID for tables that have auto increment enabled. The insert ID can be identified from the output of the transaction using thl. Check the slave to identify the faulty row. To correct this problem you will either need to skip the transaction or delete the offending row from the slave dataserver.

      The error will normally be identified due to the following error message when viewing the current replicator status, for example:

      [LOGICAL] /alpha > replicator host3 status
      ...
      pendingError           : Event application failed: seqno=10 fragno=0 »
          message=java.sql.SQLException: Statement failed on slave but succeeded on master
      pendingErrorCode       : NONE
      pendingErrorEventId    : mysql-bin.000032:0000000000001872;0
      pendingErrorSeqno      : 10
      pendingExceptionMessage: java.sql.SQLException: Statement failed on slave but succeeded on master
                               insert into myent values (0,'Test Message')
      ...

      The error can be generated when an insert or update has taken place on the slave rather than on the master.

      To resolve this issue, check the full THL for the statement that failed. The information is provided in the error message, but full examination of the THL can help with identification of the full issue. For example, to view the THL for the sequence number:

      shell> thl list -seqno 10
      SEQ# = 10 / FRAG# = 0 (last frag)
      - TIME = 2014-01-09 16:47:40.0
      - EPOCH# = 1
      - EVENTID = mysql-bin.000032:0000000000001872;0
      - SOURCEID = host1
      - METADATA = [mysql_server_id=1;dbms_type=mysql;service=firstcluster;shard=test]
      - TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
      - SQL(0) = SET INSERT_ID = 2
      - OPTIONS = [##charset = UTF-8, autocommit = 1, sql_auto_is_null = 0, foreign_key_checks = 1, » 
           unique_checks = 1, sql_mode = '', character_set_client = 33, collation_connection = 33, »
           collation_server = 8]
      - SCHEMA = test
      - SQL(1) = insert into myent values (0,'Test Message')

      In this example, an INSERT operation is inserting a new row. The generated insert ID is also shown (in line 9, SQL(0))... Check the destination database and determine the what the current value of the corresponding row:

      mysql> select * from myent where id = 2;
      +----+---------------+
      | id | msg           |
      +----+---------------+
      |  2 | Other Message |
      +----+---------------+
      1 row in set (0.00 sec)

      The actual row values are different, which means that either value may be correct. In complex data structures, there may be multiple statements or rows that trigger this error if following data also relies on this value.

      For example, if multiple rows have been inserted on the slave, multiple transactions may be affected. In this scenario, checking multiple sequence numbers from the THL will highlight this information.

    • Missing table or schema

      If a table or database is missing, this should be reported in the detailed error message. For example:

      Caused by: java.sql.SQLSyntaxErrorException: Unable to switch to database »
          'contacts'Error was: Unknown database 'contacts'

      This error can be caused when maintenance has occurred, a table has failed to be initialized properly, or the

    • Incompatible table or schema

      A modified table structure on the slave can cause application of the transaction to fail if there are missing or different column specifications for the table data.

      This particular error can be generated when changes to the table definition have been made, perhaps during a maintenance window.

      Check the table definition on the master and slave and ensure they match.

  2. Choose a resolution method:

    Depending on the data structure and environment, resolution can take one of the following forms:

    • Skip the transaction on the slave

      If the data on the slave is considered correct, or the data in both tables is the same or similar, the transaction from the master to the slave can be skipped. This process involves placing the replicator online and specifying one or more transactions to be skipped or ignored. At the end of this process, the replicator should be in the ONLINE state.

      For more information on skipping single or multiple transactions, see Section 5.8.2, “Skipping Transactions”.

    • Delete the offending row or rows on the slave

      If the data on the master is considered canonical, then the data on the slave can be removed, and the replicator placed online.

      Warning

      Deleting data on the slave may cause additional problems if the data is used by other areas of your application, relations to foreign tables.

      For example:

      mysql> delete from myent where id = 2;
      Query OK, 1 row affected (0.01 sec)

      Now place the replicator online and check the status:

      [LOGICAL] /alpha > replicator host3 online
    • Restore or reprovision the slave

      If the transaction cannot be skipped, or the data safely deleted or modified, and only a single slave is affected, a backup of an existing, working, slave can be taken and restored to the broken slave.

      To perform a backup and restore, see Section 5.9, “Creating a Backup”, or Section 5.10, “Restoring a Backup”. To reprovision a slave from the master or another slave, see tungsten_provision_slave.