Managing Transaction Failures
Inconsistencies between a Primary and Replica dataserver can occur for a number of reasons, including:
An update or insertion has occurred on the Replica independently of the Primary. This situation can occur if updates are allowed on a Replica that is acting as a read-only Replica for scale out, or in the event of running management or administration scripts on the Replica.
A switch or failover operation has lead to inconsistencies. This can happen if client applications are still writing to the Replica or Primary at the point of the switch.
A database failure causes a database or table to become corrupted.
When a failure to apply transactions occurs, the problem must be resolved, either by skipping or ignoring the transaction, or fixing and updating the underlying database so that the transaction can be applied.
When a failure occurs, replication is stopped immediately at the first transaction that caused the problem, but it may not be the only transaction and this may require extensive examination of the pending transactions to determine what caused the original database failure and then to fix and address the error and restart replication.
Identifying a Transaction Mismatch
When a mismatch occurs, the replicator service will indicate that there was a problem applying a transaction on the Replica. The replication process stops applying changes to the Replica when the first transaction fails to be applied to the Replica. This prevents multiple-statements from failing.
Within cctrl the status of the datasource will be marked as FAILED, and the replicator state as SUSPECT:
LOGICAL] /alpha > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
...
+---------------------------------------------------------------------------------+
|db2(slave:FAILED(REPLICATOR IS IN THE 'SUSPECT' STATE), progress=-1, |
|latency=-1.000) |
|STATUS [CRITICAL] [2025/01/31 10:54:26 AM UTC] |
|REASON[REPLICATOR IS IN THE 'SUSPECT' STATE] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db3, state=SUSPECT) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=2, active=0) |
+---------------------------------------------------------------------------------+
...
More detailed information about the status and the statement that failed can be obtained in a number of ways, either within cctrl using the
replicator command:
[LOGICAL] /nyc > replicator db2 status
+---------------------------------------------------------------------------------+
|replicator db2 status |
+---------------------------------------------------------------------------------+
| appliedLastEventId: NONE |
| appliedLastGoodLatency: -1.0 |
| appliedLastGoodSeqno: -1 |
| appliedLastSeqno: -1 |
| appliedLatency: -1.0 |
| autoRecoveryEnabled: false |
| autoRecoveryTotal: 0 |
| channels: -1 |
| clusterName: nyc |
| currentEventId: NONE |
| currentTimeMillis: 1738320971988 |
| dataServerHost: db2 |
| extensions: |
| host: db2 |
| latestEpochNumber: -1 |
| masterConnectUri: thl://db3:2112/ |
| masterListenUri: thl://db2:2112/ |
| maximumStoredSeqNo: -1 |
| minimumStoredSeqNo: -1 |
| offlineRequests: NONE |
|pendingError: Event application failed: seqno=18 fragno=0 |
|message=java.sql.SQLException: Statement failed on slave but succeeded on |
|master |
| pendingErrorCode: NONE |
| pendingErrorEventId: mysql-bin.000003:0000000000001389;2010 |
| pendingErrorSeqno: 18 |
|pendingExceptionMessage: java.sql.SQLException: Statement failed on slave |
|but succeeded on master |
|insert into regions values (1, "EMEA") |
| pipelineSource: UNKNOWN |
| relativeLatency: -1.0 |
| resourceJdbcDriver: org.drizzle.jdbc.DrizzleDriver |
|resourceJdbcUrl: |
|jdbc:mysql:thin://db2:13306/${DBNAME}?jdbcCompliantTruncation=false&zeroDateT |
|imeBehavior=convertToNull&tinyInt1isBit=false&allowMultiQueries=true&yearIsDa |
|teType=false |
| resourceVendor: mysql |
| rmiPort: 10000 |
| role: slave |
| seqnoType: java.lang.Long |
| serviceName: nyc |
| serviceType: unknown |
| simpleServiceName: nyc |
| siteName: default |
| sourceId: db2 |
| state: OFFLINE:ERROR |
| timeInStateSeconds: 114.834 |
| timezone: GMT |
| transitioningTo: |
| uptimeSeconds: 1811.377 |
| useSSLConnection: false |
| version: Tungsten Replicator 8.0.4 Build 132 |
+---------------------------------------------------------------------------------+
By checking the replication status with trepctl, the pendingError and pendingExceptionMessage will show the error
indicating the failure to insert the statement. For example:
shell> trepctl status
...
pendingError : Event application failed: seqno=18 fragno=0 message=java.sql.SQLException: Statement failed on slave but succeeded on master
pendingErrorCode : NONE
pendingErrorEventId : mysql-bin.000003:0000000000001389;2010
pendingErrorSeqno : 18
pendingExceptionMessage: java.sql.SQLException: Statement failed on slave but succeeded on master
insert into regions values (1, "EMEA")
...
The trepsvc.log log file will also contain the error information about the failed statement. For example:
...
2025/01/31 10:54:17.194 | com.continuent.tungsten.replicator.applier.ApplierException: java.sql.SQLException: Statement failed on slave but succeeded on master
2025/01/31 10:54:17.194 | at com.continuent.tungsten.replicator.applier.MySQLDrizzleApplier.applyStatementData(MySQLDrizzleApplier.java:162) ~[tungsten-replicator.jar:?]
2025/01/31 10:54:17.194 | at com.continuent.tungsten.replicator.applier.JdbcApplier.apply(JdbcApplier.java:1789) ~[tungsten-replicator.jar:?]
2025/01/31 10:54:17.194 | at com.continuent.tungsten.replicator.applier.ApplierWrapper.apply(ApplierWrapper.java:101) ~[tungsten-replicator.jar:?]
2025/01/31 10:54:17.194 | at com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask.apply(SingleThreadStageTask.java:1032) [tungsten-replicator.jar:?]
2025/01/31 10:54:17.194 | at com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask.runTask(SingleThreadStageTask.java:634) [tungsten-replicator.jar:?]
2025/01/31 10:54:17.195 | at com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask.run(SingleThreadStageTask.java:207) [tungsten-replicator.jar:?]
2025/01/31 10:54:17.195 | at java.lang.Thread.run(Thread.java:750) [?:1.8.0_432]
2025/01/31 10:54:17.195 | Caused by: java.sql.SQLException: Statement failed on slave but succeeded on master
2025/01/31 10:54:17.195 | at com.continuent.tungsten.replicator.applier.MySQLDrizzleApplier.applyStatementData(MySQLDrizzleApplier.java:124) ~[tungsten-replicator.jar:?]
2025/01/31 10:54:17.195 | ... 6 more
2025/01/31 10:54:17.195 | Caused by: java.sql.SQLIntegrityConstraintViolationException: Duplicate entry '1' for key 'regions.PRIMARY'
2025/01/31 10:54:17.195 | at org.drizzle.jdbc.internal.SQLExceptionMapper.get(SQLExceptionMapper.java:92) ~[drizzle-jdbc-1.5.20-SNAPSHOT.jar:?]
2025/01/31 10:54:17.195 | at org.drizzle.jdbc.DrizzleStatement.executeBatch(DrizzleStatement.java:951) ~[drizzle-jdbc-1.5.20-SNAPSHOT.jar:?]
2025/01/31 10:54:17.195 | at com.continuent.tungsten.replicator.applier.MySQLDrizzleApplier.applyStatementData(MySQLDrizzleApplier.java:110) ~[tungsten-replicator.jar:?]
2025/01/31 10:54:17.195 | ... 6 more
Once the error or problem has been found, the exact nature of the error should be determined so that a resolution can be identified:
Identify the reason for the failure by examining the full error message. Common causes are:
Duplicate primary key
A row or statement is being inserted or updated that already has the same insert ID or would generate the same insert ID for tables that have auto increment enabled. The insert ID can be identified from the output of the transaction using
thl. Check the Replica to identify the faulty row. To correct this problem you will either need to skip the transaction or delete the offending row from the Replica dataserver.In the above example, the error can be identified through the methods described. The log file output will also show the error direct from the database, and in this case that error is
Duplicate entry '1' for key 'regions.PRIMARY'The error can be generated when an insert or update has taken place on the Replica rather than on the Primary.
This can also be a common error is misconfigured Composite Active/Active clusters where generated sequence offsets have not been set correctly or where the application is not active/active aware and writes happen in both clusters using the same application generated ID causing PK violations.
To resolve this issue, check the full THL for the statement that failed. The information is provided in the error message, but full examination of the THL can help with identification of the full issue. The sequence number of the statement causing error can be found in the
pendingErrorSeqnofield.
In our example, this is 18. For example, to view the THL for the sequence number:shell> thl list -seqno 18SEQ# = 18 / FRAG# = 0 (last frag)- FILE = thl.data.0000000001- TIME = 2025-01-31 10:53:33.0- EPOCH# = 15- EVENTID = mysql-bin.000003:0000000000001389;2010- SOURCEID = db3- METADATA = [mysql_server_id=12;mysql_thread_id=2010;dbms_type=mysql;tz_aware=true;service=nyc;shard=hr]- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent- OPTIONS = [##charset = UTF-8, autocommit = 1, sql_auto_is_null = 0, foreign_key_checks = 1, unique_checks = 1, »sql_mode = 'NO_ENGINE_SUBSTITUTION,ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,IGNORE_SPACE,»NO_ZERO_DATE,NO_ZERO_IN_DATE', character_set_client = 33, collation_connection = 33, collation_server = 255]- SCHEMA = hr- SQL(0) = insert into regions values (1, "EMEA")In this example, an
INSERToperation is inserting a new row and the ID has been explicity defined (1). Check the destination database and determine the what the current value of the corresponding row:mysql> select * from regions where region_id = 1;+-----------+-------------+| region_id | region_name |+-----------+-------------+| 1 | EMEA |+-----------+-------------+1 row in set (0.00 sec)The actual row values are identical, which means that it is likely this row was inserted into the replica database directly by mistake.
Missing table or schema
If a table or database is missing, this should be reported in the detailed error message. For example:
Caused by: java.sql.SQLSyntaxErrorException: Unable to switch to database »'contacts'Error was: Unknown database 'contacts'Incompatible table or schema
A modified table structure on the Replica can cause application of the transaction to fail if there are missing or different column specifications for the table data.
This particular error can be generated when changes to the table definition have been made, perhaps during a maintenance window.
Check the table definition on the Primary and Replica and ensure they match.
Choose a resolution method:
Depending on the data structure and environment, resolution can take one of the following forms:
Skip the transaction on the Replica
If the data on the Replica is considered correct, or the data in both tables is the same or similar, the transaction from the Primary to the Replica can be skipped. This process involves placing the replicator online and specifying one or more transactions to be skipped or ignored. At the end of this process, the replicator should be in the
ONLINEstate.For more information on skipping single or multiple transactions, see "Skipping Transactions".
Delete the offending row or rows on the Replica
If the data on the Primary is considered canonical, then the data on the Replica can be removed, and the replicator placed online.
WarningDeleting data on the Replica may cause additional problems if the data is used by other areas of your application, or has relations to foreign tables.
For example:
mysql> delete from regions where region_id = 1;Query OK, 1 row affected (0.01 sec)Now place the replicator online and check the status:
shell> trepctl onlineor, within
cctrl:[LOGICAL] /alpha > replicator host3 onlineRestore or reprovision the Replica
If the transaction cannot be skipped, or the data safely deleted or modified, and only a single Replica is affected, a backup of an existing, working, Replica can be taken and restored to the broken Replica.
The
tprovisioncommand automates this process.To perform a backup and restore, see "Creating a Backup", or "Restoring a Backup".
To reprovision a Replica from the Primary or another Replica, see "The tprovision Command".
Skipping Transactions
When a failure caused by a mismatch or failure to apply one or more transactions, the transaction(s) can be skipped. Transactions can either be skipped one at a time, through a specific range, or a list of single and range specifications.
Skipping over events can easily lead to Replica inconsistencies and later replication errors. Care should be taken to ensure that the transaction(s) can be safely skipped without causing problems. See "operations-transactions-ident".
Skipping a Single Transaction
If the error was caused by only a single statement or transaction, the transaction can be skipped using
trepctl online:shell> trepctl online -skip-seqno 10The individual transaction will be skipped, and the next transaction (11), will be applied to the destination database.
Skipping a Transaction Range
If there is a range of statements that need to be skipped, specify a range by defining the lower and upper limits:
shell> trepctl online -skip-seqno 10-20This skips all of the transaction within the specified range, and then applies the next transaction (21) to the destination database.
Skipping Multiple Transactions
If there are transactions mixed in with others that need to be skipped, the specification can include single transactions and ranges by separating each element with a comma:
shell> trepctl online -skip-seqno 10,12-14,16,19-20In this example, only the transactions 11, 15, 17 and 18 would be applied to the target database. Replication would then continue from transaction 21.
Regardless of the method used to skip single or multiple transactions, the status of the replicator should be checked to ensure that replication is online.