Tungsten Clustering

Best Practices for Proper Cluster Failovers

What are the best practices for ensuring the cluster always behaves as expected? Are there any reasons for a cluster NOT to fail over? If so, what are they?

Here are three common reasons that a cluster might not fail over properly:

Policy Not Automatic
- BEST PRACTICE: Ensure the cluster policy is AUTOMATIC unless you specifically need it to be otherwise
  - SOLUTION: Use the check_tungsten_policy command to verify the policy status

Complete Network Partition
- If the nodes cannot communicate cluster-wide, all nodes go into FAILSAFE-SHUN mode to protect the data from a split-brain situation.
- BEST PRACTICE: Ensure that all nodes are able to see each other via the required network ports
  - SOLUTION: Verify that all required ports are open between all nodes, both local and remote (see "Network Port Requirements")
  - SOLUTION: Use the check_tungsten_online command to check the DataSource State on each node
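The port verification above can be sketched as a small shell loop. This is only an illustration under stated assumptions: it assumes an nc (netcat) that supports the -z and -w options, and the function names probe and check_ports, the node names, and the port placeholders are hypothetical. Take the real port list from "Network Port Requirements".

```shell
#!/bin/sh
# Probe one TCP port on one host.
# Assumption: an nc (netcat) that supports -z (connect without
# sending data) and -w (timeout in seconds).
probe() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Print every host:port pair that cannot be reached.
# Usage: check_ports "host1 host2" "port1 port2"
# Returns 0 if every pair is reachable, 1 otherwise.
check_ports() {
    failed=0
    for host in $1; do
        for port in $2; do
            if ! probe "$host" "$port"; then
                echo "UNREACHABLE $host:$port"
                failed=1
            fi
        done
    done
    return "$failed"
}

# Placeholder values only: substitute your own node names and the
# ports listed under "Network Port Requirements".
# check_ports "db1 db2 db3" "<port> <port>"
```

Because the requirement is that all nodes can see each other, a check like this would need to be run from every node, not just one, so that each direction of each connection is tested.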

No Available Replica
- BEST PRACTICE: Ensure there is at least one ONLINE node that is not in STANDBY or ARCHIVE mode
  - SOLUTION: Use the check_tungsten_online command to check the DataSource State on each node
- BEST PRACTICE: Ensure that the Manager is running on all nodes
  - SOLUTION: Use the check_tungsten_services command to verify that the Tungsten processes are running on each node
- BEST PRACTICE: Ensure all Replicators are either ONLINE or GOING ONLINE:SYNCHRONIZING
  - SOLUTION: Use the check_tungsten_online command to verify that the Replicator (and Manager) is ONLINE on each node
- BEST PRACTICE: Ensure the replication applied latency is under the threshold (default: 900 seconds)
  - SOLUTION: Use the check_tungsten_latency command to check the latency on each node
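The checks named in this article could be run together on each node with a wrapper along the following lines. This is a rough sketch, not a supported tool. It assumes the check_tungsten_* scripts follow the Nagios plugin exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), and the options shown for check_tungsten_services and check_tungsten_latency, including the threshold values, are illustrative and should be confirmed against your installed version. The run_check helper and the worst variable are hypothetical names.

```shell
#!/bin/sh
# Highest exit status seen so far across all checks.
worst=0

# Run one check, print "[status] command: output", and record the
# highest exit status seen.
# Assumption: Nagios-style exit codes (0=OK, 1=WARNING, 2=CRITICAL,
# 3=UNKNOWN); anything outside 0-2 is recorded as 3.
run_check() {
    if output=$("$@" 2>&1); then rc=0; else rc=$?; fi
    echo "[$rc] $*: $output"
    case "$rc" in
        0|1|2) ;;
        *) rc=3 ;;
    esac
    if [ "$rc" -gt "$worst" ]; then
        worst=$rc
    fi
    return 0
}

# Only attempt the real checks where the Tungsten tools are on the PATH.
if command -v check_tungsten_policy >/dev/null 2>&1; then
    run_check check_tungsten_policy
    run_check check_tungsten_online
    # Options below are assumptions; confirm them for your version.
    run_check check_tungsten_services -r
    run_check check_tungsten_latency -w 600 -c 900
    echo "highest exit status: $worst"
fi
```

A non-zero final status only indicates that at least one check needs attention; the individual output lines show which condition (policy, DataSource state, services, or latency) is the one that could block a failover.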