8.3. Tungsten Manager Failover Tuning

There are currently three discrete faults that can cause a failover of a Primary:

  • Database server failure - failover will occur 20 seconds after the initial detection.

    --property=policy.liveness.dbping.fail.threshold=1

    The Tungsten Manager is unable to connect to the database server and receives an I/O error. If the database server cannot respond to a TCP connect request after the configured number of attempts, it is flagged as STOPPED, which initiates the failover.

    This would mean, literally, that the process for the database server is gone and cannot respond to a TCP connect request. In this case, by default, the manager will try one more time, 10 seconds after the initial I/O error is detected, and after that 20 second interval has elapsed, will flag the database server as being in the STOPPED state. This, in turn, initiates the failover.
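
    For example, to tolerate a longer database outage before a failover is triggered, the threshold can be raised; a value of 4 (illustrative only) would extend the interval to (4 + 1) * 10 = 50 seconds. The setting is passed in the same way as the default shown above:

    --property=policy.liveness.dbping.fail.threshold=4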

  • Host failure - failover will occur 30 seconds after the initial detection.

    --property=policy.liveness.hostPing.fail.threshold=2

    The host on which the Primary database server is running is 'gone'. The first indication that the Primary host is gone could be that the manager on that host no longer appears in the group of managers, one of which runs on each database server host. It could also be that the managers on the hosts other than the Primary do not see a 'heartbeat' message from the Primary manager. In circumstances like these, the remaining managers will, over a 30 second interval, once every 10 seconds, attempt to establish definitively that the Primary host is indeed either gone or completely unreachable via the network. If this is established, the remaining managers in the group will establish a quorum and the coordinator of that group will initiate the failover.
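
    For example, to tolerate longer transient network interruptions before declaring the Primary host gone, the threshold can be raised; a value of 5 (illustrative only) would extend the interval to (5 + 1) * 10 = 60 seconds:

    --property=policy.liveness.hostPing.fail.threshold=5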

  • A replicator failure, if --property=policy.fence.masterReplicator is set to true, will cause a failover 70 seconds after the initial detection.

    --property=policy.fence.masterReplicator.threshold=6

    Depending on how you have the manager configured, a Primary replicator failure can also initiate a failover. A specific manager property (--property=policy.fence.masterReplicator=true) tells the manager to 'fence' a Primary replicator that goes into either a failed or a stopped state. The manager will then try to recover the Primary replicator to an online state and, if the Primary replicator has not recovered after an interval of 70 seconds, a failover will be initiated. BY DEFAULT, THIS BEHAVIOR IS TURNED OFF. Most customers prefer to keep a fully functional Primary running, even if replication fails, rather than have a failover occur.
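
    For example, to enable this behavior and shorten the window to (2 + 1) * 10 = 30 seconds, both properties would be set (the threshold value here is illustrative only):

    --property=policy.fence.masterReplicator=true
    --property=policy.fence.masterReplicator.threshold=2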

Important

The interval of time from the first detection of a fault until a failover occurs is configurable in 10-second increments. The listed default failover intervals are derived from the value of the corresponding 'threshold' property in the manager properties file (tungsten-manager/conf/manager.properties): interval = (threshold + 1) * 10 seconds
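
For example, the default thresholds above map to the listed failover intervals as follows:

    policy.liveness.dbping.fail.threshold=1     (1 + 1) * 10 = 20 seconds
    policy.liveness.hostPing.fail.threshold=2   (2 + 1) * 10 = 30 seconds
    policy.fence.masterReplicator.threshold=6   (6 + 1) * 10 = 70 seconds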

Additionally, there are multiple ways to influence the behavior of the cluster AFTER a failover has been invoked. Below are some of the key variables:

  • Behavior when MySQL is not available but the binary logs are - wait for the Replicator to finish extracting the binary logs or not?

    --property=replicator.store.thl.stopOnDBError=false

    The Manager and Replicator behave in concert when MySQL dies on the Primary node. When this happens, the replicator is unable to update the trep_commit_seqno table any longer, and therefore must either abort extraction or continue extracting without recording the extracted position into the database.

    The default of false means that the Manager will delay failover until all remaining events have been extracted from the binary logs on the failing Primary node as a way to protect data integrity.

    Failover will only continue once:

    • all available events are completely read from the binary logs on the Primary node

    • all events have reached the Replicas

    When --property=replicator.store.thl.stopOnDBError=true, the Replicator will stop extracting as soon as it is unable to update the trep_commit_seqno table in MySQL, and the Manager will perform the failover without waiting, at the risk of possible data loss from binlog events left behind. All such situations are logged.

    For use cases where failover speed is more important than data accuracy, and waiting for a long failover is not acceptable, set replicator.store.thl.stopOnDBError=true and use tungsten_find_orphaned to manually analyze the situation and perform the data recovery. For more information, please see Section 9.28, “The tungsten_find_orphaned Command”.
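
    For example, to favor failover speed over completeness of extraction, accepting the risks described above, the property would be set at configuration time:

    --property=replicator.store.thl.stopOnDBError=true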

  • Replica THL apply wait time before failover - how long to wait, in seconds, for a Replica to finish applying all stored THL to the database before failing over to it.

    --property=manager.failover.thl.apply.wait.timeout=0

    During a failover, the manager will wait until the Replica that is the candidate for promotion to Primary has applied all stored THL events before promoting that node to Primary.

    The default value is 0, which means "wait indefinitely until all stored THL events are applied".

    Warning

    Any value other than zero (0) invites data loss, because once the Replica is promoted to Primary, any unapplied events remaining in the stored THL will be ignored, and therefore lost.

    Whenever a failover occurs, the Replica with the most events stored in its local THL is selected, so that when those events are eventually applied, the data is as close to that of the original Primary as possible, with the fewest events missed.

    That is usually, but not always, the most up-to-date Replica, which is the one with the most events applied.
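
    For example, to cap the wait at three minutes rather than waiting indefinitely, accepting the data-loss risk described in the warning above, a value such as the following (illustrative only) could be used:

    --property=manager.failover.thl.apply.wait.timeout=180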

  • Replica latency check - how far behind, in seconds, is each Replica? If a Replica is too far behind, it is not used for failover.

    --property=policy.slave.promotion.latency.threshold=900

    The policy.slave.promotion.latency.threshold=900 option is the "maximum Replica latency" - the maximum number of seconds a Replica may lag behind the Primary and still qualify as a candidate for failover. The default is 900 seconds (15 minutes).
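
    For example, to require that candidate Replicas be no more than five minutes behind the Primary, the threshold could be lowered (value illustrative only):

    --property=policy.slave.promotion.latency.threshold=300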