Tungsten Clustering

Replicator Fencing

Tungsten Cluster can be configured to handle replication failures automatically and to fence the failed replicator, so that the fault does not affect the rest of the cluster or cause problems for applications operating against it. By default, the cluster takes no specific action other than registering the replicator failure, so that the node is identified as being in the DIMINISHED or CRITICAL state.

This behavior can be changed so that the failed replicator is fenced, with the configuration set separately for Primary and Replica replicators. When fencing is enabled, a Replica node is placed into the OFFLINE state, while a failure on a Primary node triggers a failover.

Fencing a Replica Node Due to a Replication Fault

To place the replicator into the OFFLINE state when the replicator stops or raises an error, set the policy.fence.slaveReplicator property to true in the cluster configuration through tpm:

[defaults]
...
property=policy.fence.slaveReplicator=true

The delay before the fencing operation takes place is set with the policy.fence.slaveReplicator.threshold parameter. The value is multiplied by 10 to give the delay in seconds, so a setting of 6 implies a delay of 60 seconds. The delay allows transient errors, such as network failures, to clear without automatically fencing the Replica.

[defaults]
...
property=policy.fence.slaveReplicator.threshold=6
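
As a quick check of the arithmetic, the relationship between the threshold value and the resulting delay can be sketched in Python. This is an illustrative helper only: the function and constant names are hypothetical and not part of Tungsten, and the only fact taken from the text is that the threshold is multiplied by 10 to give a delay in seconds.

```python
# Illustrative sketch only: these names are hypothetical, not Tungsten APIs.
SECONDS_PER_THRESHOLD_UNIT = 10  # from the text: the value is multiplied by 10


def fence_delay_seconds(threshold: int) -> int:
    """Return the fencing delay, in seconds, implied by a threshold value."""
    return threshold * SECONDS_PER_THRESHOLD_UNIT


def threshold_for_delay(seconds: int) -> int:
    """Return the smallest threshold whose delay is at least `seconds`."""
    # Ceiling division, so that a 95-second target rounds up to a threshold of 10.
    return -(-seconds // SECONDS_PER_THRESHOLD_UNIT)


if __name__ == "__main__":
    print(fence_delay_seconds(6))   # 60
    print(threshold_for_delay(90))  # 9
```

Working backwards from a target delay in this way can help when choosing a threshold: a larger value tolerates longer transient faults, at the cost of leaving a genuinely failed Replica unfenced for longer.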

Once a Replica has been fenced, the fenced state is cleared automatically when the replicator returns to the ONLINE state. When this is detected, the node is placed back into the ONLINE state.

Fencing Primary Replicators

In the event of a Primary replicator failure, the fencing operation places the datasource into the FAILED state, triggering an automatic failover (see "Automatic Primary Failover"). Because fencing the replicator triggers a failover, this configuration should be enabled only if it is critical for your business that replication errors or stops trigger an operation as significant as a failover.

To enable fencing of the Primary node due to replication faults, use the policy.fence.masterReplicator configuration property when configuring the cluster:

[defaults]
...
property=policy.fence.masterReplicator=true

The delay before the fencing operation takes place can be configured using the policy.fence.masterReplicator.threshold property. The default value is 6, or 60 seconds.

[defaults]
...
property=policy.fence.masterReplicator.threshold=6

When the replicator is identified as available again, the Primary datasource is not automatically placed back into the ONLINE state. Instead, the failed datasource must be explicitly recovered using the recover or datasource host recover commands.
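
The outcomes described on this page can be summarised as a small decision function. The Python below is a sketch of the documented behavior for reference, not Tungsten code: the Role enum, the function name, and the returned strings are all hypothetical, and only the states and recovery rules come from the text above.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical labels for the two replicator roles discussed above."""
    PRIMARY = "primary"
    REPLICA = "replica"


def fence_outcome(role: Role, fencing_enabled: bool) -> str:
    """Summarise what happens when a replicator stops or raises an error."""
    if not fencing_enabled:
        # Default: no fencing action; the failure is only registered.
        return "node reported as DIMINISHED or CRITICAL"
    if role is Role.REPLICA:
        # Replica fencing clears itself once the replicator is ONLINE again.
        return "datasource OFFLINE; cleared automatically when replicator returns ONLINE"
    # Primary fencing triggers failover and needs an explicit recover.
    return "datasource FAILED; failover triggered; explicit recover required"


if __name__ == "__main__":
    for role in Role:
        for enabled in (False, True):
            print(role.value, enabled, "->", fence_outcome(role, enabled))
```

The asymmetry is the main point to note: a fenced Replica recovers without intervention, whereas a fenced Primary stays out of service until it is explicitly recovered.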