Tungsten Clustering

Best Practices for Proper Cluster Failovers

What are the best practices for ensuring the cluster always behaves as expected? Are there any reasons for a cluster NOT to fail over? If so, what are they?

Here are three common reasons that a cluster might not fail over properly:

  • Policy Not Automatic
    • BEST PRACTICE: Ensure the cluster policy is AUTOMATIC unless you specifically need it to be otherwise
  • Complete Network Partition
    • If the nodes are unable to communicate cluster-wide, then all nodes will go into a FAILSAFE-SHUN mode to protect the data from a split-brain situation.
    • BEST PRACTICE: Ensure that all nodes are able to see each other via the required network ports
  • No Available Replica
    • BEST PRACTICE: Ensure there is at least one ONLINE node that is not in STANDBY or ARCHIVE mode
    • BEST PRACTICE: Ensure that the Manager is running on all nodes
    • BEST PRACTICE: Ensure all Replicators are either ONLINE or GOING-ONLINE:SYNCHRONIZING
      • SOLUTION: Use the check_tungsten_online command to verify that both the Replicator and the Manager are ONLINE on each node
    • BEST PRACTICE: Ensure the Replicator's applied latency on each replica is below the threshold (900 seconds by default)
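
The checklist above can be read as a set of conditions that must all hold before a failover can succeed. The Python sketch below is one way to express that logic. It is illustrative only: the `Node` record, its field names, the role strings, and the `failover_blockers` helper are assumptions made for this example, not part of Tungsten's tooling, and the snapshot would have to be populated from whatever monitoring you already run. Network reachability is not covered, since it cannot be judged from a single node's snapshot.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-node snapshot. Field names and role strings are
# illustrative assumptions, not Tungsten's actual output format.
@dataclass
class Node:
    name: str
    role: str               # assumed values: "master", "replica", "standby", "archive"
    manager_running: bool
    replicator_state: str   # assumed spelling, e.g. "ONLINE", "OFFLINE:ERROR"
    applied_latency: float  # seconds

LATENCY_THRESHOLD = 900.0   # default threshold named in the checklist
OK_STATES = {"ONLINE", "GOING-ONLINE:SYNCHRONIZING"}


def failover_blockers(policy: str, nodes: List[Node],
                      threshold: float = LATENCY_THRESHOLD) -> List[str]:
    """Return human-readable reasons why a failover might not happen."""
    problems = []

    # Checklist item 1: the policy must be AUTOMATIC.
    if policy.upper() != "AUTOMATIC":
        problems.append(f"policy is {policy}, not AUTOMATIC")

    # Checklist item 3: Manager and Replicator health on every node.
    for n in nodes:
        if not n.manager_running:
            problems.append(f"{n.name}: Manager is not running")
        if n.replicator_state not in OK_STATES:
            problems.append(f"{n.name}: Replicator is {n.replicator_state}")
        if n.role == "replica" and n.applied_latency >= threshold:
            problems.append(f"{n.name}: applied latency {n.applied_latency}s "
                            f"is not under {threshold}s")

    # At least one fully healthy replica (not standby or archive) must exist.
    candidates = [
        n for n in nodes
        if n.role == "replica"
        and n.manager_running
        and n.replicator_state == "ONLINE"
        and n.applied_latency < threshold
    ]
    if not candidates:
        problems.append("no eligible replica available for promotion")

    return problems


if __name__ == "__main__":
    snapshot = [
        Node("db1", "master", True, "ONLINE", 0.0),
        Node("db2", "replica", True, "ONLINE", 1200.0),
    ]
    for reason in failover_blockers("AUTOMATIC", snapshot):
        print(reason)
```

An empty result means none of the listed blockers were detected in the snapshot; it does not by itself guarantee that a failover will succeed.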