A Tungsten Cluster requires an odd number of database nodes to establish a voting quorum using group communications—typically, this means a three-node cluster. This quorum mechanism ensures that, in the event of a primary database failure, the system can automatically and correctly failover to a healthy node.
Because Tungsten Cluster’s quorum relies on jGroups protocols for communication, a highly reliable network is essential for proper functionality. While the cluster is designed to handle intermittent network failures, there may be situations where the Tungsten Manager executes a fail-safe operation to protect data integrity. In such cases, database nodes may isolate themselves from client traffic to prevent a split-brain scenario.
Although the Tungsten Cluster is built to withstand many hardware and software failures, the underlying network reliability ultimately defines the limits of its fault-tolerance capabilities.
This section covers the basic steps the Tungsten Manager takes when a network failure isolates it from the other nodes, to help you better understand how FAILSAFE-SHUN works and why it matters.
In the case of a network failure, on what basis does Tungsten initiate a fail-safe shun? Is it triggered by ICMP unreachability or JGroups unreachability?
A FAILSAFE-SHUN only happens if the Manager finds itself in a non-primary jGroups partition, as the quorum check below shows:
2025/02/02 23:12:36 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2025/02/02 23:12:36 | POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | SIMPLE MAJORITY SIZE: 2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 | ========================================================================
2025/02/02 23:12:36 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2025/02/02 23:12:36 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2025/02/02 23:12:36 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE…
A partition (a group) is formed by JGroups. When a Manager first starts, it looks for an existing group to join; if no group is available, it creates one. Managers that start afterwards join that group, and a Manager that is shut down leaves it. The logs show that the db1 Manager was alone in its group and restarted in FAILSAFE-SHUN mode.
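As a quick check outside the logs, each Manager's current view of the membership can be inspected interactively from cctrl. This is only a sketch: the service name alpha in the prompt is a placeholder for your own service, and the output format of the members command varies by version.

shell> cctrl
[LOGICAL] /alpha > members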
But how did we arrive here? Every 3 seconds a heartbeat is sent to the other nodes. If there is no reply for 10 seconds, then a heartbeat gap detection rule will fire:
2025/02/02 23:12:14 | INFO | db1 | INFO [EnterpriseRulesConsequenceDispatcher] - HEARTBEAT GAP DETECTED FOR MEMBER 'db2'
2025/02/02 23:12:16 | INFO | db1 | INFO [EnterpriseRulesConsequenceDispatcher] - HEARTBEAT GAP DETECTED FOR MEMBER 'db3'
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:16 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:16 | (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 5 10.88.75.104
2025/02/02 23:12:16 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:21 | HOST db2/10.88.75.105: NOT REACHABLE
2025/02/02 23:12:21 | (ping) result: false, duration: 5.01s, notes: ping -c 1 -w 5 10.88.75.105
2025/02/02 23:12:21 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db3'
2025/02/02 23:12:26 | HOST db3/10.88.75.106: NOT REACHABLE
2025/02/02 23:12:26 | (ping) result: false, duration: 5.00s, notes: ping -c 1 -w 5 10.88.75.106
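The ping commands recorded in the notes fields above can be repeated by hand from the isolated node to verify connectivity. The IP addresses below are the ones from this example and should be replaced with your own node addresses:

# From db1, using the same options the Manager used (1 packet, 5-second deadline)
ping -c 1 -w 5 10.88.75.105   # db2
ping -c 1 -w 5 10.88.75.106   # db3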
In this case, as seen above, db1 viewed db2 and db3 as down or not reachable because the network was down.
This situation then triggers a MembershipInvalidAlarm. The alarm checks whether the other Managers are reachable through group communication. If they are not reachable via jGroups, the Manager increments the dispatch number and checks again after 10 seconds. When the dispatch number reaches the max dispatch AND the node still sees itself as alone in the cluster, the Manager restarts in FAILSAFE-SHUN mode.
2025/02/02 23:12:36 | INFO | db1 | INFO [Rule_0550$u58$_INVESTIGATE$u58$_TIME_KEEPER_FOR_INVALID_MEMBERSHIP_ALARM71605603] - TIMER EXPIRED, INCREMENTED ALARM: MembershipInvalidAlarm: FAULT: MEMBER db3@pod(UNKNOWN), MAX DISPATCH=3, DISPATCH=3, EXPIRED=true
2025/02/02 23:12:36 | INFO | db1 | INFO [Rule_0530$u58$_INVESTIGATE_MEMBERSHIP_VALIDITY1008640238] - CONSEQUENCE
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:36 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:36 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.88.75.104
2025/02/02 23:12:36 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:36 | HOST db2/10.88.75.105: ALIVE
2025/02/02 23:12:36 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 5 10.88.75.105
2025/02/02 23:12:36 | ========================================================================
2025/02/02 23:12:36 | CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
2025/02/02 23:12:36 | POTENTIAL QUORUM SET MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | SIMPLE MAJORITY SIZE: 2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 | ========================================================================
2025/02/02 23:12:36 | CONCLUSION: I AM IN A NON-PRIMARY PARTITION OF 1 MEMBERS OUT OF A REQUIRED MAJORITY SIZE OF 2
2025/02/02 23:12:36 | AND THERE ARE 0 REACHABLE WITNESSES OUT OF 0
2025/02/02 23:12:36 | WARN | db1 | WARN [EnterpriseRulesConsequenceDispatcher] - COULD NOT ESTABLISH A QUORUM. RESTARTING SAFE…
The Tungsten Manager initiates a FAILSAFE-SHUN when a MemberHeartbeatGap invokes a MembershipInvalidAlarm. Only a MembershipInvalidAlarm can cause a FAILSAFE-SHUN, and only if the host is alone in the group.
Note that the values above are MAX DISPATCH=3 and DISPATCH=3. The max dispatch value is controlled by the policy.invalid.membership.retry.threshold property.
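To confirm the value a Manager is actually running with, the property can typically be checked in the Manager's generated properties file. The path below assumes a default installation under /opt/continuent and may differ in your environment:

shell> grep policy.invalid.membership.retry.threshold /opt/continuent/tungsten/tungsten-manager/conf/manager.properties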
In this case the network between db1 and db2 came back after roughly 20 seconds (the link to db3 remained down), as we can see from the logs at 23:12:33:
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: PING TIMEOUT=5
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
2025/02/02 23:12:33 | HOST db1/10.88.75.104: ALIVE
2025/02/02 23:12:33 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.88.75.104
2025/02/02 23:12:33 | NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
2025/02/02 23:12:33 | HOST db2/10.88.75.105: ALIVE
2025/02/02 23:12:33 | (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 5 10.88.75.105
At this point it was too late for the group communications to re-establish the quorum and form a group, so the Manager restarted at 23:12:36.
To increase the delay before a FAILSAFE-SHUN is invoked, set the policy.invalid.membership.retry.threshold property in your /etc/tungsten/tungsten.ini and run tpm update. For example, setting this value to 6 would allow the Managers to survive a network blip lasting a maximum of 50 seconds:
property=policy.invalid.membership.retry.threshold=6
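A minimal sketch of how this could look in /etc/tungsten/tungsten.ini, assuming the property is added to the [defaults] section (place it in a specific service section instead if that matches your layout), followed by the update:

[defaults]
...
property=policy.invalid.membership.retry.threshold=6

shell> tpm update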
The formula works out to: 6 x 10 = 60 seconds, minus 10 seconds (a safety period for jGroups to re-form the group), giving a 50-second delay before the fail-safe shun occurs.
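For comparison, the same arithmetic applied to the value of 3 seen in the logs above gives 3 x 10 = 30 – 10 = 20 seconds, which matches the observed timeline in this example: heartbeat gaps detected at 23:12:14 and 23:12:16, restart at 23:12:36.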
The following sections provide answers to questions often asked by customers.
8.2.5.3.1. Does an ICMP packet drop relate to an invalid membership issue in Tungsten?
No. Only a missed heartbeat over group communications (a MemberHeartbeatGap) can lead to an invalid membership alarm; an ICMP packet drop by itself does not.
8.2.5.3.2. Does the coordinator decide to initiate a FAILSAFE-SHUN?
Each Manager independently invokes the fail-safe script if it is alone in the quorum.
8.2.5.3.3. We observed that DB1 initiated a FAILSAFE-SHUN. How can we confirm that the network was actually down at that time?
This can be seen only from your network monitoring logs. In the Manager logs we can see only the response to the ping command.
8.2.5.3.4. If we change the property policy.invalid.membership.retry.threshold to 6, does this delay failover or risk data loss?
Failover can only happen when the coordinator is in a majority partition and detects that the primary has failed. Setting the threshold to 6 will only result in keeping the primary online through the network blip. Data loss after a failover can happen ONLY if the operator of the cluster issues a recover command in force mode without paying attention to the error displayed. The error is displayed when there are transactions left in the old primary's binlog that were not replicated to the new primary; in that case the recover command will report the error and will not proceed. The operator needs to manually apply those transactions to the new primary. If the operator forces the recover instead, data loss will occur.
8.2.5.3.5. Apart from the heartbeat gap detection invalid alarm, where can we find logs in Tungsten that indicate non-validated database members?
The message VALIDATED DB MEMBERS ARE: in the Manager log shows which members could be validated during the quorum check; any member that appears in GC VIEW OF CURRENT DB MEMBERS IS: but not in that list is a non-validated member.
8.2.5.3.6. What is the difference between a failover and a FAILSAFE-SHUN, and why did no failover happen here? The relevant log entries from each node are shown below.

From DB1:

2025/02/02 23:12:33 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
2025/02/02 23:12:33 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:33 | REACHABLE DB MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1, db3
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:36 | GC VIEW OF CURRENT DB MEMBERS IS: db1
2025/02/02 23:12:36 | VALIDATED DB MEMBERS ARE: db1
2025/02/02 23:12:36 | REACHABLE DB MEMBERS ARE: db1
2025/02/02 23:12:36 | FATAL | db1 | FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart.

From DB2:

2025/02/02 23:12:28 | GC VIEW OF CURRENT DB MEMBERS IS: db2
2025/02/02 23:12:28 | VALIDATED DB MEMBERS ARE: db2
2025/02/02 23:12:28 | REACHABLE DB MEMBERS ARE: db2
2025/02/02 23:12:33 | GC VIEW OF CURRENT DB MEMBERS IS: db2
2025/02/02 23:12:33 | VALIDATED DB MEMBERS ARE: db2
2025/02/02 23:12:33 | REACHABLE DB MEMBERS ARE: db2
2025/02/02 23:12:33 | FATAL | db2 | FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart.

From DB3:

2025/02/02 23:12:35 | GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
2025/02/02 23:12:35 | VALIDATED DB MEMBERS ARE: db1, db3
2025/02/02 23:12:35 | REACHABLE DB MEMBERS ARE: db1, db3, db2
2025/02/02 23:12:35 | GC VIEW OF CURRENT DB MEMBERS IS: db3
2025/02/02 23:12:35 | VALIDATED DB MEMBERS ARE: db3
2025/02/02 23:12:35 | REACHABLE DB MEMBERS ARE: db3
2025/02/02 23:12:35 | FATAL | db3 | FATAL [ClusterManagementHelper] - Fail-safe invoked. Forcing a failsafe exit followed by a restart.
Automatic failover is initiated by the coordinator. For the coordinator to be able to fail over, it must be in a partition that has a majority of nodes. If every Manager is in its own single partition, no failover can be expected; only a FAILSAFE-SHUN will occur.
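If you need to confirm which node is currently the coordinator, it is shown near the top of the cctrl ls output. This is a sketch; the service name alpha is a placeholder and the exact line format depends on the version:

shell> cctrl
[LOGICAL] /alpha > ls
COORDINATOR[db1:AUTOMATIC:ONLINE]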
8.2.5.3.7. Does an ICMP packet drop cause a FAILSAFE-SHUN?
No. A FAILSAFE-SHUN is caused only by a MembershipInvalidAlarm, which is raised after missed jGroups heartbeats, not by ICMP packet loss.
8.2.5.3.8. In the case of a network failure, on what basis does Tungsten initiate the initial FAILSAFE-SHUN? Is it triggered by ICMP unreachability or jGroups unreachability?
Tungsten initiates a FAILSAFE-SHUN based on jGroups unreachability, not ICMP; the ping checks recorded in the logs are diagnostic only.
8.2.5.3.9. In some cases, we did not observe a ping failure in the Tungsten logs, yet Tungsten still initiated a FAILSAFE-SHUN. Why?
A FAILSAFE-SHUN is triggered by jGroups heartbeat gaps and the resulting MembershipInvalidAlarm, not by ping results. As the example above shows, the shun can therefore occur even when the ICMP checks succeed, because group communications had not re-formed the group in time.
8.2.5.3.10. What are your recommendations for moving from ICMP to a TCP-based network health check in Tungsten? What are the pros and cons of this transition?
This is not recommended; ICMP is the best practice, as seen over years of experience in the field. When ICMP is failing, TCP will fail too. If a network is unstable, using TCP to mask the instability is not the correct approach to checking network health for a database cluster. Our JGroups implementation already uses TCP to communicate with the group members, while the ping utility uses ICMP.
8.2.5.3.11. Packet drops and overrun errors do not include timestamps, and other services remained unaffected while only Tungsten was impacted. How can we analyse Tungsten logs to determine whether the issue was caused by a network problem or a network blip, especially given the lack of detailed JGroups logging?
The correct way to analyze the Manager logs is to follow the example provided above: start by locating the triggering events and then follow the included explanations. Grep for HEARTBEAT GAP DETECTED and MembershipInvalidAlarm to find where the sequence starts.

Also, JGroups logging is disabled by default. The best practice is to keep JGroups logging disabled, because it rapidly fills the logs and is very difficult to interpret.
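As an illustration, a single grep over the Manager log can surface the whole sequence described in this section. The log path below assumes a default installation under /opt/continuent and may differ on your systems:

shell> grep -E 'HEARTBEAT GAP DETECTED|MembershipInvalidAlarm|CHECKING FOR QUORUM|RESTARTING SAFE' /opt/continuent/tungsten/tungsten-manager/log/tmsvc.log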
8.2.5.3.12. Which protocol does the Heartbeat mechanism use to send signals in the Tungsten system (TCP, ICMP, or UDP)?
Heartbeat events are sent via TCP.
8.2.5.3.13. Which protocol does JGroups use by default (TCP or UDP)?
Our jGroups implementation uses TCP to communicate.
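If you want to verify this on a running node, you can check the Manager's listening TCP sockets. The port below (7800) is an assumption based on the common default for the group-communication listener; adjust it to your configuration:

shell> ss -tlnp | grep 7800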
8.2.5.3.14. What is the difference between the following entries in Tungsten logs: GC VIEW OF CURRENT DB MEMBERS IS:, VALIDATED DB MEMBERS ARE:, and REACHABLE DB MEMBERS ARE:?

GC VIEW OF CURRENT DB MEMBERS IS: shows the members currently present in the group communication (jGroups) view. VALIDATED DB MEMBERS ARE: lists the members the Manager was actually able to validate during the quorum check, which is why a member can appear in the GC view without being validated (as in the db1 log at 23:12:33 above). REACHABLE DB MEMBERS ARE: lists the members the Manager considers reachable over the network at that point in the check.