Rule Organization - Detection, Investigation, Fencing, Recovery
The main focus of manager operations is the set of fault-processing business rules, and these rules are organized into categories based on their function. As you can infer from some of the previous definitions, there are four major categories of rules, along with some ancillary 'housekeeping' categories. The four major categories are:
Fault Detection - raises alarms for specific or nascent faults.
The text shown below comes directly from the source code for the manager rules and covers the major faults detected by the manager:
0600: DETECT MEMBER HEARTBEAT GAP
This rule fires if no other alarm types are already pending for a specific member and at least 30-45 seconds have elapsed since the last time a given manager sent a ClusterMemberHeartbeat event to the group of managers. When this rule fires, both a MemberHeartbeatGapAlarm and a MembershipInvalidAlarm are raised. Both of these alarms trigger other rules, explained later, which further investigate the current cluster connectivity and membership.
0601: DETECT STOPPED CLUSTER MANAGER
This rule fires after a MemberHeartbeatGapAlarm has been raised, since one reason for a heartbeat gap can be that a manager has stopped. Further processing determines whether or not the manager is, indeed, stopped. This rule generates a ManagerStoppedAlarm if it determines that the manager in question is, in fact, stopped.
0602: DETECT DATASERVER FAULT
This rule fires whenever monitoring detects that a given database server is not in the ONLINE state. The result of this rule firing is a DataServerFaultAlarm which, in turn, results in further investigation via one of the investigation rules described later.
0604: DETECT UNREACHABLE REMOTE SERVICE
In the case of a composite cluster, the current coordinator on each site connects to one manager on the remote site. After establishing this connection, the local manager polls the remote manager for liveness and generates a RemoteServiceHeartbeatNotification that has a status of REACHABLE. If the remote manager is not reachable, the polling manager generates a RemoteServiceHeartbeatNotification that has a resource state of UNREACHABLE, in which case this rule triggers and raises a RemoteDataServiceUnreachableAlarm.
0606: DETECT REPLICATOR FAULT
As its name indicates, this rule detects a replicator fault. It triggers when it sees a replicator notification indicating that any of the replicators in the cluster is in a STOPPED, SUSPECT or OFFLINE state. When this rule triggers, a ReplicatorFaultAlarm is raised.
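The detection conditions described above can be summarized in a short Python sketch. It is only an illustration of the stated conditions, not the manager's actual rule code: the function names, the string-based alarm lists, and the 30-second default threshold (the lower bound of the 30-45 second range given above) are assumptions made for this example. Rule 0601 is left out because the text does not spell out how the manager confirms that another manager has stopped.

```python
from typing import Dict, List, Set

# Assumed threshold: the text gives a 30-45 second range, so the lower
# bound is used here purely for illustration.
HEARTBEAT_GAP_SECONDS = 30.0

# Replicator states that rule 0606 treats as a fault, as listed in the text.
REPLICATOR_FAULT_STATES = {"STOPPED", "SUSPECT", "OFFLINE"}


def detect_member_heartbeat_gap(
    now: float,
    last_heartbeat: float,
    pending_alarms: Set[str],
    threshold_s: float = HEARTBEAT_GAP_SECONDS,
) -> List[str]:
    """0600: fire only if no alarm is pending and the gap is long enough."""
    if pending_alarms:
        return []
    if now - last_heartbeat >= threshold_s:
        return ["MemberHeartbeatGapAlarm", "MembershipInvalidAlarm"]
    return []


def detect_dataserver_fault(state: str) -> List[str]:
    """0602: any database server state other than ONLINE raises an alarm."""
    return [] if state == "ONLINE" else ["DataServerFaultAlarm"]


def detect_unreachable_remote_service(resource_state: str) -> List[str]:
    """0604: an UNREACHABLE heartbeat notification raises an alarm."""
    if resource_state == "UNREACHABLE":
        return ["RemoteDataServiceUnreachableAlarm"]
    return []


def detect_replicator_fault(replicator_states: Dict[str, str]) -> List[str]:
    """0606: one faulted replicator anywhere in the cluster is enough."""
    if any(s in REPLICATOR_FAULT_STATES for s in replicator_states.values()):
        return ["ReplicatorFaultAlarm"]
    return []
```

For example, calling detect_member_heartbeat_gap(100.0, 60.0, set()) yields both the heartbeat gap alarm and the membership alarm, while the same call with a non-empty set of pending alarms yields nothing.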
Fault Investigation - processes alarms by iteratively investigating whether or not each alarm represents a true fault that requires further action.
0525: INVESTIGATE MY LIVENESS
In this rule title, the word MY refers to the manager that is currently evaluating the rule. This rule triggers when it sees a MemberHeartbeatGapAlarm and, as a result of triggering, the manager checks whether it has both network connectivity and visibility of the other cluster members including, if necessary, visibility of a passive witness host. If, during this connectivity check, a manager determines that it is isolated from the rest of the cluster, it will restart itself in failsafe mode, meaning that it will shun all of its database resources and then attempt to join an existing cluster group as part of a quorum. This rule is particularly important in cases where there are transient or even protracted network outages, since it forms part of the strategy used to avoid split-brain operation. If the manager, after restarting, is able to become part of a cluster quorum group, the process of joining that group will result in shunned resources becoming available again if appropriate.
0530: INVESTIGATE MEMBERSHIP VALIDITY
This rule is triggered when it sees a MembershipInvalidAlarm that was previously generated as the result of a member heartbeat gap. The previous rule, INVESTIGATE MY LIVENESS, checks for network connectivity for the current manager. This rule checks whether the current manager is part of a cluster quorum group. This type of check implies a connectivity check as well, but goes further to see if other managers in the group are alive and operational. This rule is another critical part of split-brain avoidance since, depending on what it determines, it will take one of the following actions:
If the manager does not have network connectivity after checking every 10 seconds for a period of 60 seconds, it will restart itself in one of two modes:
If the manager detects that it is the last man standing, meaning that it is currently responsible for the Primary datasource and all of the other cluster members had previously stopped, it will restart normally, leaving the Primary datasource available, and will be prepared to be the leader of any new group of managers.
If the manager is not the last man standing, it will restart in failsafe mode i.e. will restart with all of its resources shunned and will attempt to join an existing group.
If the manager has network connectivity i.e. can see all of the other hosts in the cluster, it then checks to see if it is a part of a primary partition i.e. a cluster quorum group. If, the first time it checks, it determines that it is not a part of a primary partition, it immediately disconnects all existing Tungsten connector connections from itself. This has the effect, on the Tungsten connector side, of immediately suspending all new database connection requests until such time that the manager determines that it is in a primary partition. This is, again, a critical part of avoiding split-brain operation since it makes it impossible for connectors to satisfy new connection requests until a valid cluster quorum can be definitively validated.
The manager will then keep doing this check for quorum group membership, every 10 seconds, for a period of 60 seconds. If it determines that it is not a member of a quorum group, it will use the same criteria, as mentioned previously in the network connectivity case, to determine how it shall restart i.e.:
If it is the last man standing it will, as in the above case, restart normally, leaving the Primary datasource available.
If the manager is not the last man standing, it will restart in failsafe mode i.e. will restart with all of its resources shunned and will attempt to join an existing group.
If, after all of the previous checks, the manager establishes that it is a part of a quorum group, it will synchronize its view of the cluster with the current cluster coordinator, become available for Tungsten connectors to connect to it again (if it disconnected them during quorum validation), and then continue normal operations.
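The decision flow of INVESTIGATE MEMBERSHIP VALIDITY described above can be sketched as follows. This is a simplified model and not the manager's implementation: the function name, the action labels, the injectable sleep function, and the reading of "every 10 seconds, for a period of 60 seconds" as six polling attempts are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional


def investigate_membership_validity(
    has_connectivity: Callable[[], bool],
    in_quorum: Callable[[], bool],
    is_last_man_standing: bool,
    sleep: Callable[[float], None] = time.sleep,
    interval_s: int = 10,
    period_s: int = 60,
) -> List[str]:
    """Return the ordered list of (hypothetical) actions the manager takes."""
    actions: List[str] = []
    # Assumption: a 60 second period polled every 10 seconds is 6 attempts.
    attempts = max(1, period_s // interval_s)

    def restart_mode() -> str:
        # The last man standing keeps the Primary available; every other
        # manager restarts with all of its resources shunned.
        return "RESTART_NORMAL" if is_last_man_standing else "RESTART_FAILSAFE"

    def poll(check: Callable[[], bool],
             on_first_failure: Optional[Callable[[], None]] = None) -> bool:
        for attempt in range(attempts):
            if check():
                return True
            if attempt == 0 and on_first_failure is not None:
                on_first_failure()
            if attempt < attempts - 1:
                sleep(interval_s)
        return False

    # Step 1: network connectivity.
    if not poll(has_connectivity):
        actions.append(restart_mode())
        return actions

    # Step 2: quorum membership. Connectors are dropped on the first failed
    # check so that no new connections are served without a valid quorum.
    if not poll(in_quorum, lambda: actions.append("DISCONNECT_CONNECTORS")):
        actions.append(restart_mode())
        return actions

    # Step 3: quorum confirmed. Re-admit connectors only if they were dropped.
    if "DISCONNECT_CONNECTORS" in actions:
        actions += ["SYNC_WITH_COORDINATOR", "ACCEPT_CONNECTORS"]
    actions.append("CONTINUE")
    return actions
```

Passing the two checks in as callables keeps the polling logic separate from how connectivity and quorum are actually determined, and passing sleep=lambda s: None lets the flow be exercised without waiting for the full 60 seconds.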
0550: INVESTIGATE: TIME KEEPER FOR HEARBEAT GAP ALARM
0550: INVESTIGATE: TIME KEEPER FOR INVALID MEMBERSHIP ALARM
0550: INVESTIGATE: TIME KEEPER FOR MANAGER STOPPED ALARM
0550: INVESTIGATE: TIME KEEPER FOR DATASERVER STOPPED ALARM
0551: INVESTIGATE: TIME KEEPER FOR REMOTE SERVICE STOPPED ALARM
0552: INVESTIGATE: TIME KEEPER FOR REPLICATOR FAULT ALARM
Fault Fencing - fences validated faults, rendering them less disruptive/harmful from the standpoint of the application.
0303: FENCE FAILED NODE
0304: FENCE FAULTED DATASERVER
0305: FENCE UNREACHABLE REMOTE SERVICE
0306: FENCE REPLICATOR FAULT - DIMINISHED DATASOURCE
0306: FENCE REPLICATOR FAULT - EXPIRED ALARM
Fault Recovery - attempts to render the fault completely harmless by taking some action that either corrects the fault or by providing alternative resources to manage the fault.
0200a: RECOVER MASTER DATASOURCE BY FAILING OVER - NON-REPLICATOR FAULT
0200b: RECOVER MASTER DATASOURCE BY FAILING OVER - REPLICATOR FAULT
0201: RECOVER COMPOSITE DATASOURCES TO ONLINE
0201a: RECOVER FAILSAFE PHYSICAL SLAVE WITH ONLINE PRIMARY
0201: RECOVER FAILSAFE SHUNNED COMPOSITE SLAVE TO ONLINE
0202: RECOVER OFFLINE PHYSICAL DATASOURCES TO ONLINE
0203: RECOVER FAILED PHYSICAL DATASOURCES TO ONLINE
0204: RECOVER MASTER REPLICATORS TO ONLINE
0205: RECOVER SLAVE REPLICATORS TO ONLINE
0206: RECOVER FROM DIMINISHED STATE WHEN STOPPED REPLICATOR RESTARTS
0207: RECOVER MERGED MEMBERS
0208: RECOVER AND RECONCILE REMOTE DATA SERVICE STATE
0209: PREVENT MULTIPLE ONLINE MASTERS
0210: RECOVER WITNESSES TO ONLINE
0211: RECOVER REMOTE FAILSAFE SHUNNED COMPOSITE MASTER TO ONLINE
0212: RECOVER NON-READ-ONLY SLAVES TO READ-ONLY
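In the rule titles listed in this section, the first two digits of the rule number line up with the category: 06 for detection, 05 for investigation, 03 for fencing and 02 for recovery. The following Python sketch groups rule titles on that basis. The prefix-to-category mapping is inferred only from the titles shown here and is an assumption, not a documented guarantee; the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed mapping, inferred from the rule titles listed in this section.
CATEGORY_BY_PREFIX = {
    "06": "Fault Detection",
    "05": "Fault Investigation",
    "03": "Fault Fencing",
    "02": "Fault Recovery",
}

# A rule title is a four-digit number, an optional letter, a colon and a name.
RULE_TITLE = re.compile(r"^(\d{4}[a-z]?):\s*(.+)$")


def categorize(title: str) -> Tuple[str, str, str]:
    """Split a rule title into (rule id, rule name, category)."""
    match = RULE_TITLE.match(title.strip())
    if match is None:
        raise ValueError(f"not a rule title: {title!r}")
    rule_id, name = match.groups()
    # Prefixes outside the four major categories are reported as "Other".
    return rule_id, name, CATEGORY_BY_PREFIX.get(rule_id[:2], "Other")


def group_by_category(titles: Iterable[str]) -> Dict[str, List[str]]:
    """Collect rule ids under their category, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        rule_id, _name, category = categorize(title)
        groups[category].append(rule_id)
    return dict(groups)
```

Because several titles share a number (for example, three of the time keeper rules are numbered 0550), the rule number alone does not identify a rule; the sketch therefore keeps the name alongside the number in categorize.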