7.9. Deploying Automatic Replicator Recovery

Automatic recovery enables the replicator to go back ONLINE after a transient failure that occurs during either the ONLINE or GOING-ONLINE:SYNCHRONIZING state and that would otherwise move the replicator to the OFFLINE state. For example, connection failures or restarts of the MySQL service cause the replicator to go OFFLINE. With autorecovery enabled, the replicator attempts to go ONLINE again to keep the service running. Failures outside of these states do not trigger autorecovery.

Autorecovery operates by scheduling an attempt to go back online after a transient failure. If autorecovery is enabled, the process works as follows:

  1. If a failure is identified, the replicator attempts to go back online after a specified delay. The delay allows the replicator time to decide whether autorecovery should be attempted. For example, if the MySQL server restarts, the delay gives time for the MySQL server to come back online before the replicator goes back online.

  2. Recovery is attempted a configurable number of times. This prevents the replicator from continually attempting to go online within a service that has a more serious failure. If the replicator fails to go ONLINE within the configurable reset interval, then the replicator will go to the OFFLINE state.

  3. If the replicator remains in the ONLINE state for a configurable period of time, then the automatic recovery is deemed to have succeeded. If the autorecovery fails, then the autorecovery attempts counter is incremented by one.
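The attempt counting described in the steps above can be sketched as a small shell model. This is only an illustration of the documented behavior, not the replicator's implementation; the names MAX_ATTEMPTS, RESET_INTERVAL, attempts, DECISION, and on_failure are hypothetical, and the model simplifies by counting every recovery attempt.

```shell
# Simplified, hypothetical model of the autorecovery attempt counter.
MAX_ATTEMPTS=3      # stands in for --auto-recovery-max-attempts
RESET_INTERVAL=60   # seconds; stands in for --auto-recovery-reset-interval
attempts=0          # recovery attempts made since the last reset
DECISION=""         # result of the last on_failure call

# on_failure SECONDS_ONLINE
# Called when the replicator fails; the argument is how long it had been
# ONLINE. Sets DECISION to "retry" (go back online after the delay) or
# "offline" (stop trying and stay OFFLINE).
on_failure() {
    seconds_online=$1
    # Staying ONLINE for the whole reset interval means the previous
    # recovery succeeded, so the counter starts again from zero.
    if [ "$seconds_online" -ge "$RESET_INTERVAL" ]; then
        attempts=0
    fi
    if [ "$attempts" -lt "$MAX_ATTEMPTS" ]; then
        attempts=$((attempts + 1))
        DECISION=retry
    else
        DECISION=offline
    fi
}
```

In this model, repeated failures shortly after each recovery exhaust the attempts and end in "offline", while a failure that follows a long ONLINE period is treated as a fresh failure trigger.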

The configurable parameters are set using tpm within the static properties for the replicator:

  • --auto-recovery-max-attempts

    Sets the maximum number of attempts to automatically recover from any single failure trigger. This prevents the autorecovery mechanism from continually attempting autorecovery. The current number of attempts is reset if the replicator remains online for the configured reset period.

  • --auto-recovery-delay-interval

    The delay between entering the OFFLINE state and attempting autorecovery. On servers that are busy, that use some form of network or HA solution, or that have high MySQL restart/startup times, set this value high enough to give the underlying services time to start up again after a failure.

  • --auto-recovery-reset-interval

    The period that the replicator must remain in the ONLINE state after an autorecovery for the recovery to be deemed to have succeeded. The number of attempts for autorecovery is reset to 0 (zero) if the replicator stays up for this period of time.

Autorecovery is enabled only when the --auto-recovery-max-attempts parameter is set to a non-zero value.

To enable autorecovery:

shell> tpm update alpha --auto-recovery-max-attempts=5
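The three parameters can also be set together. The sketch below only assembles and prints the command for review rather than running it. The values are illustrative, and treating the two intervals as plain numbers of seconds is an assumption; check the tpm reference for the accepted format before applying.

```shell
# Illustrative values only; the interval format (seconds) is an assumption.
SERVICE=alpha
MAX_ATTEMPTS=5
DELAY_INTERVAL=30    # wait before each recovery attempt
RESET_INTERVAL=60    # time ONLINE before the attempt counter resets

# Build the command line from the three options named in this section.
cmd="tpm update $SERVICE"
cmd="$cmd --auto-recovery-max-attempts=$MAX_ATTEMPTS"
cmd="$cmd --auto-recovery-delay-interval=$DELAY_INTERVAL"
cmd="$cmd --auto-recovery-reset-interval=$RESET_INTERVAL"

# Print the command for review instead of executing it.
echo "$cmd"
```

Keeping the values in variables makes it easier to adjust the delay for servers with long MySQL startup times without retyping the option names.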

The autorecovery status can be monitored within trepsvc.log and through the autoRecoveryEnabled and autoRecoveryTotal fields in the output of trepctl status. For example:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
...
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
...

The above output indicates that autorecovery is disabled. The autoRecoveryTotal field counts the number of times autorecovery has completed since the replicator started.
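For scripted monitoring, the two fields can be pulled out of the status output with a small filter. The following is a minimal sketch that assumes the "NAME : VALUE" layout shown above; the function name autorecovery_field is hypothetical, and the sample text stands in for real trepctl output.

```shell
# autorecovery_field NAME
# Reads `trepctl status` style output on stdin and prints the value of the
# field called NAME. Assumes one "name : value" pair per line.
autorecovery_field() {
    awk -F':' -v name="$1" '
        { key = $1; gsub(/[ \t]+/, "", key) }      # strip padding from the name
        key == name {
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)       # trim spaces around the value
            print val
        }
    '
}

# Sample input copied from the status output shown in this section.
sample='autoRecoveryEnabled    : false
autoRecoveryTotal      : 0'

printf '%s\n' "$sample" | autorecovery_field autoRecoveryEnabled
```

Against a live replicator the same filter would be used as `trepctl status | autorecovery_field autoRecoveryTotal`.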