Automatic recovery enables the replicator to go back ONLINE after a transient failure that occurs during either the ONLINE or GOING-ONLINE:SYNCHRONIZING state and that would otherwise trigger a change of state to OFFLINE. For example, connection failures or restarts of the MySQL service cause the replicator to go OFFLINE. With autorecovery enabled, the replicator attempts to go back ONLINE to keep the service running. Failures outside of these states do not trigger autorecovery.
Autorecovery operates by scheduling an attempt to go back online after a transient failure. If autorecovery is enabled, the process works as follows:
If a failure is identified, the replicator attempts to go back online after a specified delay. The delay allows the replicator time to decide whether autorecovery should be attempted. For example, if the MySQL server restarts, the delay gives time for the MySQL server to come back online before the replicator goes back online.
Recovery is attempted a configurable number of times. This prevents the replicator from continually attempting to go online within a service that has a more serious failure. If the replicator fails to go ONLINE within the configurable reset interval, the replicator goes to the OFFLINE state.
If the replicator remains in the ONLINE state for a configurable period of time, the automatic recovery is deemed to have succeeded. If the autorecovery fails, the autorecovery attempts counter is incremented by one.
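The counter behaviour described above can be sketched as a small shell function. This is a toy model for illustration only, not the replicator's own logic: the function name simulate_autorecovery and its arguments are hypothetical, the delay before each attempt is not modelled, and each uptime value stands for how many seconds the replicator stayed ONLINE after one recovery attempt.

```shell
# Toy model of the autorecovery attempts counter (hypothetical helper,
# not part of the replicator or of tpm/trepctl).
# Usage: simulate_autorecovery MAX_ATTEMPTS RESET_INTERVAL UPTIME...
# Each UPTIME is the number of seconds the replicator stayed ONLINE
# after one recovery attempt before failing again.
simulate_autorecovery() {
    max_attempts=$1
    reset_interval=$2
    shift 2
    attempts=0
    for uptime in "$@"; do
        if [ "$uptime" -ge "$reset_interval" ]; then
            # Stayed ONLINE for the reset interval: recovery succeeded,
            # so the attempts counter goes back to zero.
            attempts=0
        else
            # Failed again too soon: count one failed attempt.
            attempts=$((attempts + 1))
        fi
        if [ "$attempts" -ge "$max_attempts" ]; then
            # Limit reached: stop retrying and stay OFFLINE.
            echo "OFFLINE after $attempts failed attempts"
            return 1
        fi
    done
    echo "failed attempts counted: $attempts"
}
```

With a limit of 3 attempts and a 30-second reset interval, simulate_autorecovery 3 30 5 5 5 reports that the replicator stays OFFLINE after three failed attempts, while simulate_autorecovery 3 30 5 60 5 ends with a single counted failure because the 60-second uptime reset the counter.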
The configurable parameters are set using tpm within the static properties for the replicator:
--auto-recovery-max-attempts
Sets the maximum number of attempts to automatically recover from any single failure trigger. This prevents the autorecovery mechanism from continually attempting to autorecover. The current number of attempts is reset if the replicator remains online for the configured reset period.
--auto-recovery-delay-interval
The delay between entering the OFFLINE state and attempting autorecovery. On servers that are busy, that use some form of network or HA solution, or that have high MySQL restart/startup times, this value should be configured accordingly to give the underlying services time to start up again after a failure.
--auto-recovery-reset-interval
The length of time that the replicator must remain in the ONLINE state after an autorecovery for the recovery process to be deemed to have succeeded. The number of attempts for autorecovery is reset to 0 (zero) if the replicator stays up for this period of time.
Autorecovery is enabled only when the --auto-recovery-max-attempts parameter is set to a non-zero value.
To enable:
shell> tpm update alpha --auto-recovery-max-attempts=5
The autorecovery status can be monitored within trepsvc.log and through the autoRecoveryEnabled and autoRecoveryTotal parameters output by trepctl. For example:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
...
autoRecoveryEnabled : false
autoRecoveryTotal : 0
...
The above output indicates that the autorecovery service is disabled. The autoRecoveryTotal value is a count of the number of times autorecovery has completed since the replicator started.
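For scripted monitoring, the two fields can be pulled out of the status output with a small filter. The sketch below assumes the "name : value" layout shown in the example above; status_field is a hypothetical helper name, not a trepctl feature.

```shell
# Print the value of one field from `trepctl status`-style output read
# on stdin. status_field is a hypothetical helper, not part of trepctl.
status_field() {
    awk -v key="$1" '
        {
            # Split each line at the first colon only, so values that
            # themselves contain colons are kept whole.
            i = index($0, ":")
            if (i == 0) next
            name = substr($0, 1, i - 1)
            value = substr($0, i + 1)
            # Trim surrounding whitespace from both parts.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            # Report only the first matching field.
            if (name == key && !found) { print value; found = 1 }
        }'
}
```

For example, trepctl status | status_field autoRecoveryTotal would print only the recovery count, which a monitoring job could compare against its previous reading.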