Best Practices for Proper Cluster Failovers
What are the best practices for ensuring the cluster always behaves as expected? Are there any reasons for a cluster NOT to fail over? If so, what are they?
Here are three common reasons that a cluster might not fail over properly:
- Policy Not Automatic
  - BEST PRACTICE: Ensure the cluster policy is AUTOMATIC unless you specifically need it to be otherwise
    - SOLUTION: Use the check_tungsten_policy command to verify the policy status
- Complete Network Partition
  - If the nodes are unable to communicate cluster-wide, then all nodes will go into a FAILSAFE-SHUN mode to protect the data from a split-brain situation.
  - BEST PRACTICE: Ensure that all nodes are able to see each other via the required network ports
    - SOLUTION: Verify that all required ports are open between all nodes local and remote - see "Network Port Requirements"
    - SOLUTION: Use the check_tungsten_online command to check the DataSource State on each node
- No Available Replica
  - BEST PRACTICE: Ensure there is at least one ONLINE node that is not in STANDBY or ARCHIVE mode
    - SOLUTION: Use the check_tungsten_online command to check the DataSource State on each node
  - BEST PRACTICE: Ensure that the Manager is running on all nodes
    - SOLUTION: Use the check_tungsten_services command to verify that the Tungsten processes are running on each node
  - BEST PRACTICE: Ensure all Replicators are either ONLINE or GOING ONLINE:SYNCHRONIZING
    - SOLUTION: Use the check_tungsten_online command to verify that the Replicator (and Manager) is ONLINE on each node
  - BEST PRACTICE: Ensure the replication applied latency is under the threshold, default 900 seconds
    - SOLUTION: Use the check_tungsten_latency command to check the latency on each node
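The checklist above can be read as a set of conditions that must all hold before a failover has somewhere to go. The following is a minimal Python sketch of that reading, not the cluster Manager's actual decision logic. The `NodeStatus` class, its field names, the role strings, and both function names are hypothetical and exist only for illustration; the state strings are taken from the text as written and may need adjusting to match what your tooling reports.

```python
from dataclasses import dataclass
from typing import List

# Default applied-latency threshold named in the checklist (seconds).
DEFAULT_LATENCY_THRESHOLD = 900.0

# Replicator states the checklist treats as acceptable (spelled as in the text).
ACCEPTABLE_REPLICATOR_STATES = {"ONLINE", "GOING ONLINE:SYNCHRONIZING"}


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one node; all field names are illustrative."""
    name: str
    role: str               # assumed values: "primary" or "replica"
    datasource_state: str   # e.g. "ONLINE"
    standby: bool
    archive: bool
    manager_running: bool
    replicator_state: str
    applied_latency: float  # seconds


def is_failover_candidate(node: NodeStatus,
                          threshold: float = DEFAULT_LATENCY_THRESHOLD) -> bool:
    """True if the node passes every per-node check in the checklist."""
    return (
        node.role == "replica"
        and node.datasource_state == "ONLINE"
        and not node.standby
        and not node.archive
        and node.manager_running
        and node.replicator_state in ACCEPTABLE_REPLICATOR_STATES
        and node.applied_latency < threshold
    )


def failover_blockers(policy: str, nodes: List[NodeStatus],
                      threshold: float = DEFAULT_LATENCY_THRESHOLD) -> List[str]:
    """Return the reasons a failover might not proceed (empty if none found)."""
    blockers = []
    if policy.upper() != "AUTOMATIC":
        blockers.append(f"policy is {policy}, not AUTOMATIC")
    for node in nodes:
        if not node.manager_running:
            blockers.append(f"Manager not running on {node.name}")
    if not any(is_failover_candidate(n, threshold) for n in nodes):
        blockers.append("no eligible replica")
    return blockers


if __name__ == "__main__":
    # Example values only: a non-automatic policy and a replica lagging past 900 s.
    nodes = [
        NodeStatus(name="db1", role="primary", datasource_state="ONLINE",
                   standby=False, archive=False, manager_running=True,
                   replicator_state="ONLINE", applied_latency=0.0),
        NodeStatus(name="db2", role="replica", datasource_state="ONLINE",
                   standby=False, archive=False, manager_running=True,
                   replicator_state="ONLINE", applied_latency=1200.0),
    ]
    print(failover_blockers("MAINTENANCE", nodes))
    # prints ['policy is MAINTENANCE, not AUTOMATIC', 'no eligible replica']
```

In practice the inputs would come from the check_tungsten_* commands listed above rather than hand-built objects; the sketch only shows how the individual checks combine into a single answer.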