Tungsten Clustering

Failover Response when MySQL Server Fails

The Manager and Replicator behave in concert when MySQL dies on the Primary node. When this happens, the replicator is unable to update the trep_commit_seqno table any longer, and therefore must either abort extraction or continue extracting without recording the extracted position into the database.

By default, the Manager will delay failover until all remaining events have been extracted from the binary logs on the failing Primary node as a way to protect data integrity.

This behavior is configurable via the following setting, shown with the default of false:

property=replicator.store.thl.stopOnDBError=false

Failover will only continue once:

all available events are completely read from the binary logs on the Primary node
all events have reached the Replicas

When property=replicator.store.thl.stopOnDBError=true, then the Replicator will stop extracting once it is unable to update the trep_commit_seqno table in MySQL, and the Manager will perform the failover without waiting, at the risk of possible data loss due to leaving binlog events behind. All such situations are logged.

For use cases where failover speed is more important than data accuracy, those NOT willing to wait for long failover can set property=replicator.store.thl.stopOnDBError=true and still use tungsten_find_orphaned to manually analyze and perform the data recovery.

For more information, please see "The tungsten_find_orphaned Command".