8.8. Deploying Automatic Replicator Recovery
Version Support: 2.2.1 or later
Automatic recovery enables the replicator to go back
ONLINE in the event of a transient
failure that is triggered during either the
ONLINE or GOING-ONLINE:SYNCHRONIZING state and that
would otherwise trigger a change of state to
OFFLINE. For example, connection
failures, or restarts of the MySQL service, trigger the replicator to go
OFFLINE. With autorecovery enabled,
the replicator will attempt to go
ONLINE again to keep the service
running. Failures outside of these states will not trigger autorecovery.
Autorecovery operates by scheduling an attempt to go back online after a
transient failure. If autorecovery is enabled, the process works as follows:
If a failure is identified, the replicator attempts to go back online
after a specified delay. The delay allows the replicator time to decide
whether autorecovery should be attempted. For example, if the MySQL
server restarts, the delay gives time for the MySQL server to come back
online before the replicator goes back online.
Recovery is attempted a configurable number of times. This prevents the
replicator from continually attempting to go online within a service
that has a more serious failure. If the replicator fails to go
ONLINE within the configurable
reset interval, then the replicator will go to the
OFFLINE state.
If the replicator remains in the
ONLINE state for a configurable
period of time, then the automatic recovery is deemed to have succeeded.
If the autorecovery fails, then the autorecovery attempts counter is
incremented by one.
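The steps above can be sketched as a small shell simulation. This is an
illustrative model only, not Tungsten code: the function names on_failure and
on_stable, the action variable, and the choice to count each attempt at the
moment it is scheduled are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical model of the autorecovery policy; none of these names
# come from the replicator itself.

max_attempts=2   # stands in for the configured maximum number of attempts
attempts=0       # current autorecovery attempts counter
action=""        # result of the most recent decision

# Called when a transient failure would send the replicator OFFLINE.
on_failure() {
    if [ "$max_attempts" -eq 0 ]; then
        action="stay OFFLINE (autorecovery disabled)"
    elif [ "$attempts" -ge "$max_attempts" ]; then
        action="stay OFFLINE (attempts exhausted)"
    else
        attempts=$((attempts + 1))
        action="wait for delay, then go ONLINE (attempt $attempts of $max_attempts)"
    fi
}

# Called once the replicator has stayed ONLINE for the whole reset interval.
on_stable() {
    attempts=0
    action="recovery succeeded, counter reset"
}

# Walk through two retries, an exhausted third failure, and a reset.
on_failure; echo "$action"
on_failure; echo "$action"
on_failure; echo "$action"
on_stable;  echo "$action"
on_failure; echo "$action"
```

In this model the third failure leaves the replicator OFFLINE because the
counter has reached the maximum, while a stable period resets the counter so
that a later failure is retried again.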
The configurable parameters are set using tpm within the
static properties for the replicator:
--auto-recovery-max-attempts
Sets the maximum number of attempts to automatically recover from any
single failure trigger. This prevents the autorecovery mechanism from
continually attempting autorecovery. The current number of attempts is
reset if the replicator remains online for the configured reset period.
--auto-recovery-delay-interval
The delay between entering the
OFFLINE state and attempting
autorecovery. On servers that are busy, use some form of network or HA
solution, or have high MySQL restart/startup times, this value should be
configured accordingly to give the underlying services time to start up
again after failure.
--auto-recovery-reset-interval
The duration, after a successful autorecovery has been completed, that the
replicator must remain in the
ONLINE state for the recovery
process to be deemed to have succeeded. The number of attempts for
autorecovery is reset to 0 (zero) if the replicator stays up for this
period of time.
Autorecovery is enabled only when the
--auto-recovery-max-attempts parameter is
set to a non-zero value. For example:
tpm update alpha --auto-recovery-max-attempts=5
The autorecovery status can be monitored within
trepsvc.log and through the
autoRecoveryEnabled and autoRecoveryTotal parameters output by
trepctl status. For example:
Processing status command...
autoRecoveryEnabled : false
autoRecoveryTotal : 0
The above output indicates that the autorecovery service is disabled. The
autoRecoveryTotal is a count of the number of times
autorecovery has been completed since the replicator was started.
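For scripted monitoring, the two fields can be pulled out of the status text
with a short shell helper. The sketch below is an assumption-laden example
rather than a supported tool: status_field is a hypothetical function name, the
"name : value" line layout is inferred from the sample output, and the sample
text is embedded so the snippet runs without a replicator. In practice the
input would be piped from the status command instead.

```shell
#!/bin/sh
# Hypothetical helper: read "name : value" lines on stdin and print the
# value of the field named in $1. Assumes the value contains no colon.
status_field() {
    awk -F' *: *' -v key="$1" '
        { k = $1; sub(/^[ \t]+/, "", k) }   # tolerate leading indentation
        k == key { print $2; exit }
    '
}

# Sample text copied from the status output in this section.
sample='Processing status command...
autoRecoveryEnabled : false
autoRecoveryTotal : 0'

enabled=$(printf '%s\n' "$sample" | status_field autoRecoveryEnabled)
total=$(printf '%s\n' "$sample" | status_field autoRecoveryTotal)

echo "enabled=$enabled total=$total"
```

A monitoring script could compare the extracted total against a previously
stored value to detect that an autorecovery has taken place since the last
check.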