8.2.5. Understanding the FAILSAFE-SHUN state.

8.2.5. Understanding the `FAILSAFE-SHUN` state.
Prev	^Up	8.2. Tungsten Manager Failover Behavior	Next

8.2.5. Understanding the `FAILSAFE-SHUN` state.

8.2.5.1. What causes a FAILSAFE-SHUN
8.2.5.2. How to tune the delay before FAILSAFE-SHUN
8.2.5.3. FAQ

A Tungsten Cluster requires an odd number of database nodes to establish a voting quorum using group communications—typically, this means a three-node cluster. This quorum mechanism ensures that, in the event of a primary database failure, the system can automatically and correctly failover to a healthy node.

Because Tungsten Cluster’s quorum relies on jGroups protocols for communication, a highly reliable network is essential for proper functionality. While the cluster is designed to handle intermittent network failures, there may be situations where the Tungsten Manager executes a fail-safe operation to protect data integrity. In such cases, database nodes may isolate themselves from client traffic to prevent a split-brain scenario.

Although the Tungsten Cluster is built to withstand many hardware and software failures, the underlying network reliability ultimately defines the limits of its fault-tolerance capabilities.

This section will cover the basic steps the Tungsten Manager takes when a network failure isolates it from the other nodes—helping you better understand how FAILSAFE-SHUN works and why it matters.

8.2.5.1. What causes a `FAILSAFE-SHUN`

In the case of a network failure, on what basis does Tungsten initiate a fail-safe shun? Is it triggered by ICMP unreachability or JGroups unreachability?

Failover only happens if the manager is in a non-primary jGroups partition:

2025/02/02 23:12:36 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2025/02/02 23:12:36 | POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | SIMPLE MAJORITY SIZE: 2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 |
========================================================================
2025/02/02 23:12:36 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2025/02/02 23:12:36 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2025/02/02 23:12:36 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE…

A partition (a group) is formed by JGroups. When a manager first starts, it looks for a group to join. If a group is not available, then the Manager will create a group. The other Managers that start after the first manager will join the group. If a Manager is shut down, then it will leave the group. The logs show that the db1 manager was alone in the group and restarted in failsafe-shun mode.

But how did we arrive here? Every 3 seconds a heartbeat is sent to the other nodes. If there is no reply for 10 seconds, then a heartbeat gap detection rule will fire:

2025/02/02 23:12:14 | INFO | db1 | INFO [EnterpriseRulesConsequenceDispatcher] - HEARTBEAT GAP DETECTED FOR MEMBER 'db2'
2025/02/02 23:12:16 | INFO | db1 | INFO [EnterpriseRulesConsequenceDispatcher] - HEARTBEAT GAP DETECTED FOR MEMBER 'db3'
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:16 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:16 | (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 5 10.88.75.104
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:21 | HOST db2/10.88.75.105: NOT REACHABLE
2025/02/02 23:12:21 | (ping) result: false, duration: 5.01s, notes: ping -c 1 -w 5 10.88.75.105
2025/02/02 23:12:21 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db3'
2025/02/02 23:12:26 | HOST db3/10.88.75.106: NOT REACHABLE
2025/02/02 23:12:26 | (ping) result: false, duration: 5.00s, notes: ping -c 1 -w 5 10.88.75.106

In this case, as seen below, db1 viewed db2 and db3 as down or not reachable because the network was down. This situation will then trigger a MembershipInvalidAlarm. The alarm will check if the other managers are reachable through Group communication or not. If they are not reachable via jGroups, the manager will increment the dispatch number and after 10 seconds will check again. When the dispatch number equals the max dispatch AND the node is still seen as alone in the cluster, the manager will restart in FAILSAFE-SHUN mode.

2025/02/02 23:12:36 | INFO | db1 | INFO [Rule_0550$u58$_INVESTIGATE$u58$_TIME_KEEPER_FOR_INVALID_MEMBERSHIP_ALARM71605603] - TIMER EXPIRED, INCREMENTED ALARM: MembershipInvalidAlarm: FAULT: MEMBER db3@pod(UNKNOWN), MAX DISPATCH=3, DISPATCH=3, EXPIRED=true
2025/02/02 23:12:36 | INFO | db1 | INFO [Rule_0530$u58$_INVESTIGATE_MEMBERSHIP_VALIDITY1008640238] - CONSEQUENCE
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:36 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:36 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.88.75.104
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:36 | HOST db2/10.88.75.105: ALIVE
2025/02/02 23:12:36 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 5 10.88.75.105
2025/02/02 23:12:36 |
========================================================================
2025/02/02 23:12:36 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2025/02/02 23:12:36 | POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | SIMPLE MAJORITY SIZE: 2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 |
========================================================================
2025/02/02 23:12:36 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2025/02/02 23:12:36 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2025/02/02 23:12:36 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE…

The Tungsten Manager initiates a FAILSAFE-SHUN when a MemberHearbeatGap invokes a MembershipInvalidAlarm. Only a MembershipInvalidAlarm can cause a FAILSAFE-SHUN, and only if the host is alone in the group.

Note that the values above are MAX DISPATCH=3 and DISPATCH=3. The max dispatch value is controlled by the policy.invalid.membership.retry.threshold property.

In this case the network was online for 20 seconds between db1 and db2 (but not for db3) as we can see from the logs (at 23:12:33):

2025/02/02 23:12:33 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:33 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:33 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.88.75.104
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:33 | HOST db2/10.88.75.105: ALIVE
2025/02/02 23:12:33 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 5 10.88.75.105

At this point it was too late for the group communications to re-establish the quorum and form a group, so the Manager restarted at 23:12:36.

8.2.5.2. How to tune the delay before `FAILSAFE-SHUN`

To increase the delay before a FAILSAFE-SHUN is invoked, set the policy.invalid.membership.retry.threshold property in your /etc/tungsten/tungsten.ini and run tpm update. For example, setting this value to 6 would result in the Managers being able to survive a network blip that lasts a maximum of 50 seconds:

property=policy.invalid.membership.retry.threshold=6

The formula works out to: 6 x 10 = 60 – 10 (safety period for jGroups to re-form the group) = 50-second delay before fail-safe shun occurs.

8.2.5.3. FAQ

The following sections provide the questions and answers to questions often asked by customers.

8.2.5.3.1. Does an ICMP packet drop relate to an invalid membership issue in Tungsten?
8.2.5.3.2. Does the coordinator decide to initiate a FAILSAFE-SHUN across the cluster, or does each host's Manager independently invoke the fail-safe script?
8.2.5.3.3. We observed that DB1 initiated a FAILSAFE-SHUN at 23:12:36, DB2 at 23:12:33, and DB3 at 23:12:35. Based on the current settings, can you analyze the logs to determine the exact duration of the network blip before the FAILSAFE-SHUN was triggered across all three database nodes?
8.2.5.3.4. If we change the property policy.invalid.membership.retry.threshold=6, what would be the potential impact on failover behavior or data loss?
8.2.5.3.5. Apart from the heartbeat gap detection invalid alarm, where can we find logs in Tungsten that indicate non-validated database members?
8.2.5.3.6. What is the difference between a FAILSAFE-SHUN invocation and an automatic failover initiation when the first node detects a network issue? If the affected node is the primary, will it trigger an automatic failover, or a FAILSAFE-SHUN? If the affected node is a secondary, will it only trigger a FAILSAFE-SHUN? Here are the associated logs:
8.2.5.3.7. Does an ICMP packet drop cause a FAILSAFE-SHUN to initiate?
8.2.5.3.8. In the case of a network failure, on what basis does Tungsten initiate the initial FAILSAFE-SHUN? Is it triggered by ICMP unreachability or JGroups unreachability?
8.2.5.3.9. In some cases, we did not observe a ping failure in the Tungsten logs, yet Tungsten still initiated a FAILSAFE-SHUN. If this is due to JGroups, how can we trace it through the tungsten logs beyond generic messages like "Quorum lost" and similar entries?
8.2.5.3.10. What are your recommendations for moving from ICMP to a TCP-based network health check in Tungsten? What are the pros and cons of this transition?
8.2.5.3.11. Packet drops and overrun errors do not include timestamps, and other services remained unaffected while only Tungsten was impacted. How can we analyse Tungsten logs to determine whether the issue was caused by a network problem or network blip , especially given the lack of detailed JGroups logging?
8.2.5.3.12. Which protocol does the Heartbeat mechanism use to send signals in the Tungsten system (TCP, ICMP, or UDP)?
8.2.5.3.13. Which protocol does JGroups use by default (TCP or UDP)?
8.2.5.3.14. What is the difference between the following entries in Tungsten logs? GC VIEW OF CURRENT DB MEMBERS, VALIDATED DB MEMBERS, REACHABLE DB MEMBERS

8.2.5.3.1.	Does an ICMP packet drop relate to an invalid membership issue in Tungsten?
	No. Only a `MemberHeartbeatGap` (not getting a heartbeat reply for 10 seconds) will trigger a `MembershipInvalidAlarm`.
8.2.5.3.2.	Does the coordinator decide to initiate a `FAILSAFE-SHUN` across the cluster, or does each host's Manager independently invoke the fail-safe script?
	Each Manager independently invokes the fail-safe script if they are alone in the quorum.
8.2.5.3.3.	We observed that DB1 initiated a `FAILSAFE-SHUN` at 23:12:36, DB2 at 23:12:33, and DB3 at 23:12:35. Based on the current settings, can you analyze the logs to determine the exact duration of the network blip before the `FAILSAFE-SHUN` was triggered across all three database nodes?
	This can be seen only from your network monitoring logs. In the Manager logs we can see only the response to the `"ping -c 1 -w 5 hostname"` result during the membership validity check when we increment the timer, but during this 10-second period the network can be flapping (up-down-up-down), which is NOT visible from the Manager logs. What we can see during the membership validity check is that the network at that exact time was either up or down.
8.2.5.3.4.	If we change the property `policy.invalid.membership.retry.threshold=6`, what would be the potential impact on failover behavior or data loss?
	Failover can happen for the following reasons: if the primary MySQL server goes down. This has nothing to do with our case. if the host with the primary MySQL server goes down. This has nothing to do with our case. If the host with the primary MySQL server gets isolated AND the other two hosts can still form a majority of the quorum. This has nothing to do with our case. If the host with the primary MySQL server gets isolated AND also the other hosts get isolated from each other. In this case failover does not happen (we have only singular managers isolated from each other) and the cluster will go to failsafe shun. This is our case. This means that setting the threshold to 6 will only result in keeping the primary online to resist the network blip. Data loss after a failover can happen ONLY if the operator of the cluster issues a recover command in force mode without paying any attention to the error displayed. Error is displayed when there are transactions left in the primary binlog that were not replicated to the new Primary. In this case the recover command will issue an error and will not proceed. The operator needs to manually apply the transactions to the new Primary. If the operator forces the recover, data loss will occur.
8.2.5.3.5.	Apart from the heartbeat gap detection invalid alarm, where can we find logs in Tungsten that indicate non-validated database members?
	The message `"Non-validated database members"` in the Manager's `tmsvc.log` means that there wasn't a Manager on that host up and running or because of the network breakage that Manager is not reachable.
8.2.5.3.6.	What is the difference between a `FAILSAFE-SHUN` invocation and an automatic failover initiation when the first node detects a network issue? If the affected node is the primary, will it trigger an automatic failover, or a `FAILSAFE-SHUN`? If the affected node is a secondary, will it only trigger a `FAILSAFE-SHUN`? Here are the associated logs: From DB1: 2025/02/02 23:12:33 \| GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3 2025/02/02 23:12:33 \| VALIDATED DB MEMBERS ARE: db1 2025/02/02 23:12:33 \| REACHABLE DB MEMBERS ARE: db1, db3, db2 2025/02/02 23:12:36 \| GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3 2025/02/02 23:12:36 \| VALIDATED DB MEMBERS ARE: db1, db3 2025/02/02 23:12:36 \| REACHABLE DB MEMBERS ARE: db1, db3, db2 2025/02/02 23:12:36 \| GC VIEW OF CURRENT DB MEMBERS IS: db1 2025/02/02 23:12:36 \| VALIDATED DB MEMBERS ARE: db1 2025/02/02 23:12:36 \| REACHABLE DB MEMBERS ARE: db1 2025/02/02 23:12:36 \| FATAL \| db1 \| FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart. From DB2: 2025/02/02 23:12:28 \| GC VIEW OF CURRENT DB MEMBERS IS: db2 2025/02/02 23:12:28 \| VALIDATED DB MEMBERS ARE: db2 2025/02/02 23:12:28 \| REACHABLE DB MEMBERS ARE: db2 2025/02/02 23:12:33 \| GC VIEW OF CURRENT DB MEMBERS IS: db2 2025/02/02 23:12:33 \| VALIDATED DB MEMBERS ARE: db2 2025/02/02 23:12:33 \| REACHABLE DB MEMBERS ARE: db2 2025/02/02 23:12:33 \| FATAL \| db2 \| FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart. From DB3: 2025/02/02 23:12:35 \| GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3 2025/02/02 23:12:35 \| VALIDATED DB MEMBERS ARE: db1, db3 2025/02/02 23:12:35 \| REACHABLE DB MEMBERS ARE: db1, db3, db2 2025/02/02 23:12:35 \| GC VIEW OF CURRENT DB MEMBERS IS: db3 2025/02/02 23:12:35 \| VALIDATED DB MEMBERS ARE: db3 2025/02/02 23:12:35 \| REACHABLE DB MEMBERS ARE: db3 2025/02/02 23:12:35 \| FATAL \| db3 \| FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart.
	Automatic failover is initiated by the coordinator. For a coordinator to be able to failover, it must be in a partition that has a majority of nodes. If every Manager is in its own single partition, no failover can be expected. Only a `FAILSAFE-SHUN` can happen at this point.
8.2.5.3.7.	Does an ICMP packet drop cause a `FAILSAFE-SHUN` to initiate?
	No. A `FAILSAFE-SHUN` will only happen if the Manager is alone in the JGroups partition.
8.2.5.3.8.	In the case of a network failure, on what basis does Tungsten initiate the initial `FAILSAFE-SHUN`? Is it triggered by ICMP unreachability or JGroups unreachability?
	Tungsten initiates a `FAILSAFE-SHUN` when a `MemberHearbeatGap` invokes a `MembershipInvalidAlarm`. Only a `MembershipInvalidAlarm` can cause a `FAILSAFE-SHUN`, and only if the host is alone in the group.
8.2.5.3.9.	In some cases, we did not observe a ping failure in the Tungsten logs, yet Tungsten still initiated a `FAILSAFE-SHUN`. If this is due to JGroups, how can we trace it through the tungsten logs beyond generic messages like `"Quorum lost"` and similar entries?
	`FAILSAFE-SHUN` can happen even if there wasn't a ping failure (e.g. the host is up and running) but the other Managers are down (stopped, restarted).
8.2.5.3.10.	What are your recommendations for moving from ICMP to a TCP-based network health check in Tungsten? What are the pros and cons of this transition?
	Not recommended - ICMP is the best practice as seen over years of experience in the field. When ICMP is failing, TCP will fail too. If a network is unstable, using TCP to somehow mask the instability is not the correct approach to checking network health for a database cluster. Our implementation of JGroups uses TCP to communicate with the group members and the ping utility uses ICMP.
8.2.5.3.11.	Packet drops and overrun errors do not include timestamps, and other services remained unaffected while only Tungsten was impacted. How can we analyse Tungsten logs to determine whether the issue was caused by a network problem or network blip , especially given the lack of detailed JGroups logging?
	The correct way to analyze the Manager logs is to follow the provided example above. Start by locating the triggered events and then follow the included explanations. Grep for `HEARTBEAT GAP DETECTED FOR MEMBER` and look for which events trigger other events. Also, JGroups logging is disabled by default. The best practice is to keep JGroups logging disabled because it will rapidly fill the logs and is very difficult to interpret.
8.2.5.3.12.	Which protocol does the Heartbeat mechanism use to send signals in the Tungsten system (TCP, ICMP, or UDP)?
	Heartbeat events are sent via TCP
8.2.5.3.13.	Which protocol does JGroups use by default (TCP or UDP)?
	Our jGroups implementation uses TCP to communicate.
8.2.5.3.14.	What is the difference between the following entries in Tungsten logs? `GC VIEW OF CURRENT DB MEMBERS`, `VALIDATED DB MEMBERS`, `REACHABLE DB MEMBERS`
	`GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3` - this is how the group communication (JGroups GC) sees members `VALIDATED DB MEMBERS ARE: db1, db3` - managers that are up and running and respond to the pingMember command (TCP) `REACHABLE DB MEMBERS ARE: db1, db3, db2` - host reachable by ping (host ping utility ICMP)

Prev	Up	Next
8.2. Tungsten Manager Failover Behavior	^Level	8.3. Tungsten Manager Failover Tuning

Continuent Documentation