7.3.3. Rule Organization - Detection, Investigation, Fencing, Recovery

Content Being Written

This section of the documentation is currently being produced and may be incomplete and/or subject to change.

The main focus of manager operations is the set of fault-processing business rules, which are organized into categories based on their function. As you can infer from some of the previous definitions, there are four major categories of rules, along with some ancillary 'housekeeping' categories. The four major categories are:

  1. Fault Detection - raises alarms for specific or nascent faults.

    The text shown below comes directly from the source code for the manager rules and comprises the major faults detected by the manager:

    • rule "0600: DETECT MEMBER HEARTBEAT GAP" - This rule fires if no other alarm types are already pending for a specific member and at least 30-45 seconds have elapsed since the last time a given manager sent a ClusterMemberHeartbeat event to the group of managers. When this rule fires, both a MemberHeartbeatGapAlarm and a MembershipInvalidAlarm are raised. Both of these alarms trigger other rules, explained later, which further investigate the current cluster connectivity and membership.

    • rule "0601: DETECT STOPPED CLUSTER MANAGER" - This rule fires after a MemberHeartbeatGapAlarm has been raised, since one reason for a heartbeat gap can be that a manager has stopped. Further processing determines whether or not the manager is, indeed, stopped. If it is, this rule generates a ManagerStoppedAlarm.

    • rule "0602: DETECT DATASERVER FAULT" - This rule fires whenever monitoring detects that a given database server is not in the ONLINE state. The result of this rule firing is a DataServerFaultAlarm which, in turn, results in further investigation via one of the investigation rules described later.

    • rule "0604: DETECT UNREACHABLE REMOTE SERVICE" - In the case of a composite cluster, the current coordinator on each site connects to one manager on the remote site. After establishing this connection, the local manager polls the remote manager for liveness and generates a RemoteServiceHeartbeatNotification that has a status of REACHABLE. If the remote manager is not reachable, the manager that is polling the remote service generates a RemoteServiceHeartbeatNotification that has a resource state of UNREACHABLE, in which case this rule triggers and raises a RemoteDataServiceUnreachableAlarm.

    • rule "0606: DETECT REPLICATOR FAULT" - As its name indicates, this rule detects a replicator fault. It triggers if it sees a replicator notification indicating that any of the replicators in the cluster is in a STOPPED, SUSPECT or OFFLINE state. The result of this rule triggering is that a ReplicatorFaultAlarm is raised.
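
The detection conditions described above for rules 0600, 0602 and 0606 can be illustrated with a minimal Python sketch. This is not the manager's rule source. The function names, the pending_alarms dictionary, the use of plain float timestamps in seconds, and the choice of 30 seconds as the threshold (the text gives a range of 30-45 seconds) are all assumptions made for illustration. Only the alarm names and state names come from the descriptions above.

```python
# Assumed threshold: the text says "at least 30-45 seconds"; 30 is used here.
HEARTBEAT_GAP_SECONDS = 30.0

# Replicator states that rule 0606 treats as a fault, per the description above.
REPLICATOR_FAULT_STATES = {"STOPPED", "SUSPECT", "OFFLINE"}


def detect_heartbeat_gap(member, last_heartbeat, now, pending_alarms):
    """Sketch of rule 0600: return the alarms to raise for one member."""
    if pending_alarms.get(member):
        return []  # other alarm types are already pending for this member
    if now - last_heartbeat < HEARTBEAT_GAP_SECONDS:
        return []  # heartbeat is recent enough, nothing to raise
    return ["MemberHeartbeatGapAlarm", "MembershipInvalidAlarm"]


def detect_dataserver_fault(state):
    """Sketch of rule 0602: any state other than ONLINE raises an alarm."""
    return [] if state == "ONLINE" else ["DataServerFaultAlarm"]


def detect_replicator_fault(state):
    """Sketch of rule 0606: STOPPED, SUSPECT or OFFLINE raises an alarm."""
    return ["ReplicatorFaultAlarm"] if state in REPLICATOR_FAULT_STATES else []


# Hypothetical member "db2" whose last heartbeat was 40 seconds ago.
print(detect_heartbeat_gap("db2", last_heartbeat=100.0, now=140.0, pending_alarms={}))
# prints ['MemberHeartbeatGapAlarm', 'MembershipInvalidAlarm']
```

Each function returns a list of alarm names rather than raising them directly, which keeps the sketch free of side effects and mirrors the way one detection rule can produce more than one alarm.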

  2. Fault Investigation - processes alarms by iteratively investigating whether or not the alarm represents a true fault that requires further action.

    • rule "0525: INVESTIGATE MY LIVENESS" - In this rule title the word MY refers to the manager that is currently evaluating the rule. This rule triggers when it sees a MemberHeartbeatGapAlarm and, as a result of triggering, the manager checks whether it has both network connectivity and visibility of the other cluster members including, if necessary, visibility of a passive witness host. If, during this connectivity check, a manager determines that it is isolated from the rest of the cluster, it will restart itself in failsafe mode, meaning that it will shun all of its database resources and then attempt to join an existing cluster group as part of a quorum. This rule is particularly important in cases where there are transient or even protracted network outages, since it forms part of the strategy used to avoid split-brain operation. If the manager, after restarting, is able to become part of a cluster quorum group, the process of joining that group will result in shunned resources becoming available again if appropriate.

    • rule "0530: INVESTIGATE MEMBERSHIP VALIDITY" - This rule is triggered when it sees a MembershipInvalidAlarm, which was previously generated as the result of a member heartbeat gap. The previous rule, INVESTIGATE MY LIVENESS, checks network connectivity for the current manager. This rule checks whether the current manager is part of a cluster quorum group. This type of check implies a connectivity check as well, but goes further to see if other managers in the group are alive and operational. This rule is another critical part of split-brain avoidance since, depending on what it determines, it will take one of the following actions:

      • If the manager does not have network connectivity after checking, every 10 seconds, for a period of 60 seconds, it will restart itself in one of two modes:

        1. If the manager detects that it is the last man standing, meaning that it is currently responsible for the master datasource and all of the other cluster members had previously stopped, it will restart normally, leaving the master datasource available, and will be prepared to be the leader of any new group of managers.

        2. If the manager is not the last man standing, it will restart in failsafe mode i.e. will restart with all of its resources shunned and will attempt to join an existing group.

      • If the manager has network connectivity i.e. can see all of the other hosts in the cluster, it then checks to see if it is a part of a primary partition i.e. a cluster quorum group.  If, the first time it checks, it determines that it is not a part of a primary partition, it immediately disconnects all existing Tungsten connector connections from itself.  This has the effect, on the Tungsten connector side, of immediately suspending all new database connection requests until such time that the manager determines that it is in a primary partition.  This is, again, a critical part of avoiding split-brain operation since it makes it impossible for connectors to satisfy new connection requests until a valid cluster quorum can be definitively validated.

      • The manager will then keep doing this check for quorum group membership, every 10 seconds, for a period of 60 seconds. If it determines that it is not a member of a quorum group, it will use the same criteria, as mentioned previously in the network connectivity case, to determine how it shall restart i.e.:

        1. If it is the last man standing it will, as in the above case, restart normally, leaving the master datasource available.

        2. If the manager is not the last man standing, it will restart in failsafe mode i.e. will restart with all of its resources shunned and will attempt to join an existing group.

      • If, after all of the previous checks, the manager establishes that it is part of a quorum group, it will synchronize its view of the cluster with the current cluster coordinator and, if it disconnected Tungsten connectors during quorum validation, become available for Tungsten connectors to connect to it again. It will then continue normal operations.

    • rule "0550: INVESTIGATE: TIME KEEPER FOR HEARBEAT GAP ALARM" -

    • rule "0550: INVESTIGATE: TIME KEEPER FOR INVALID MEMBERSHIP ALARM" -

    • rule "0550: INVESTIGATE: TIME KEEPER FOR MANAGER STOPPED ALARM" -

    • rule "0550: INVESTIGATE: TIME KEEPER FOR DATASERVER STOPPED ALARM" -

    • rule "0551: INVESTIGATE: TIME KEEPER FOR REMOTE SERVICE STOPPED ALARM" -

    • rule "0552: INVESTIGATE: TIME KEEPER FOR REPLICATOR FAULT ALARM" -
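
The decision flow described above for rule 0530 (INVESTIGATE MEMBERSHIP VALIDITY) can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the manager's implementation: every function and parameter name here is hypothetical, the checks are passed in as callables, and the exact timing of the repeated checks is simplified. Only the 10-second interval, the 60-second window, and the three outcomes (continue, normal restart, failsafe restart) come from the text.

```python
import time

CHECK_INTERVAL_SECONDS = 10  # from the text: check every 10 seconds
CHECK_WINDOW_SECONDS = 60    # from the text: for a period of 60 seconds


def poll_until(check, on_first_failure=None, interval=CHECK_INTERVAL_SECONDS,
               window=CHECK_WINDOW_SECONDS, sleep=time.sleep):
    """Call check() once per interval until it succeeds or the window ends."""
    for attempt in range(window // interval):
        if check():
            return True
        if attempt == 0 and on_first_failure is not None:
            on_first_failure()  # e.g. disconnect connectors on the first miss
        sleep(interval)
    return False


def restart_mode(is_last_man_standing):
    """Last man standing restarts normally; any other member restarts failsafe."""
    return "normal" if is_last_man_standing else "failsafe"


def investigate_membership(has_connectivity, in_quorum, is_last_man_standing,
                           disconnect_connectors, sleep=time.sleep):
    """Return 'continue', or the restart mode 'normal' or 'failsafe'."""
    if not poll_until(has_connectivity, sleep=sleep):
        return restart_mode(is_last_man_standing())
    if poll_until(in_quorum, on_first_failure=disconnect_connectors, sleep=sleep):
        return "continue"  # part of a quorum group: resume normal operations
    return restart_mode(is_last_man_standing())


# Hypothetical scenario: connected, never in a quorum, not the last man standing.
calls = []
result = investigate_membership(
    has_connectivity=lambda: True,
    in_quorum=lambda: False,
    is_last_man_standing=lambda: False,
    disconnect_connectors=lambda: calls.append("disconnect"),
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(result)  # prints failsafe
```

Passing the sleep function as a parameter lets the example run instantly. The same restart_mode decision is reused for both the connectivity failure and the quorum failure, matching the text's statement that the same criteria apply in both cases.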

  3. Fault Fencing - fences validated faults, rendering them less disruptive/harmful from the standpoint of the application.

    • rule "0303: FENCE FAILED NODE" -

    • rule "0304: FENCE FAULTED DATASERVER" -

    • rule "0305: FENCE UNREACHABLE REMOTE SERVICE" -

    • rule "0306: FENCE REPLICATOR FAULT - DIMINISHED DATASOURCE" -

    • rule "0306: FENCE REPLICATOR FAULT - EXPIRED ALARM" -

  4. Fault Recovery - attempts to render the fault completely harmless by taking some action that either corrects the fault or by providing alternative resources to manage the fault.

    • rule "0200a: RECOVER MASTER DATASOURCE BY FAILING OVER - NON-REPLICATOR FAULT"  -

    • rule "0200b: RECOVER MASTER DATASOURCE BY FAILING OVER - REPLICATOR FAULT"  -

    • rule "0201: RECOVER COMPOSITE DATASOURCES TO ONLINE"  -

    • rule "0201a: RECOVER FAILSAFE PHYSICAL SLAVE WITH ONLINE PRIMARY"  -

    • rule "0201: RECOVER FAILSAFE SHUNNED COMPOSITE SLAVE TO ONLINE"  -

    • rule "0202: RECOVER OFFLINE PHYSICAL DATASOURCES TO ONLINE"  -

    • rule "0203: RECOVER FAILED PHYSICAL DATASOURCES TO ONLINE"  -

    • rule "0204: RECOVER MASTER REPLICATORS TO ONLINE"  -

    • rule "0205: RECOVER SLAVE REPLICATORS TO ONLINE"  -

    • rule "0206: RECOVER FROM DIMINISHED STATE WHEN STOPPED REPLICATOR RESTARTS"  -

    • rule "0207: RECOVER MERGED MEMBERS"  -

    • rule "0208: RECOVER AND RECONCILE REMOTE DATA SERVICE STATE"  -

    • rule "0209: PREVENT MULTIPLE ONLINE MASTERS"  -

    • rule "0210: RECOVER WITNESSES TO ONLINE"  -

    • rule "0211: RECOVER REMOTE FAILSAFE SHUNNED COMPOSITE MASTER TO ONLINE"  -

    • rule "0212: RECOVER NON-READ-ONLY SLAVES TO READ-ONLY"  -
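
In the rule names listed in this section, the first two digits of each rule number line up with the four categories: 06 for detection, 05 for investigation, 03 for fencing and 02 for recovery. The Python sketch below groups rule declarations on that basis. The mapping is inferred only from the names shown here and is an assumption, not a documented convention, and the function name categorize and the fallback label are hypothetical.

```python
import re

# Assumed mapping, inferred only from the rule names listed in this section.
CATEGORY_BY_PREFIX = {
    "06": "detection",
    "05": "investigation",
    "03": "fencing",
    "02": "recovery",
}

# Matches declarations such as: rule "0200a: RECOVER MASTER DATASOURCE ..."
RULE_NAME = re.compile(r'rule "(\d{4}[a-z]?): (.+)"')


def categorize(rule_line):
    """Return (rule_id, title, category) for one rule declaration line."""
    match = RULE_NAME.search(rule_line)
    if match is None:
        raise ValueError(f"not a rule declaration: {rule_line!r}")
    rule_id, title = match.groups()
    # Anything outside the four major prefixes is treated as ancillary.
    category = CATEGORY_BY_PREFIX.get(rule_id[:2], "housekeeping/other")
    return rule_id, title, category


print(categorize('rule "0303: FENCE FAILED NODE" -'))
# prints ('0303', 'FENCE FAILED NODE', 'fencing')
```

Keying on the number rather than on the first word of the title means that a rule such as "0209: PREVENT MULTIPLE ONLINE MASTERS" still lands in the recovery category, where this section lists it.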