1.2. Tungsten Manager

The Tungsten Manager is responsible for monitoring and managing a Continuent Tungsten dataservice. The manager has a number of control and supervisory roles for the operation of the cluster, acting both as a controller and as a central source of information about the status and health of the dataservice as a whole.

Primarily, the Tungsten Manager handles the following tasks:

  • Monitors the replication status of each datasource within the cluster.

  • Communicates the status of each datasource to the Tungsten Connectors. In the event of a change of status, the connectors are notified so that queries can be redirected accordingly.

  • Manages all the individual components of the system. Using the Java JMX system, the manager is able to directly control the different components, changing their status and controlling the replication process.

  • Checks the availability of datasources by using either the Echo TCP/IP protocol on port 7 (the default) or the system ping protocol to determine whether a host is available. The protocol to be used can be configured by adjusting the manager properties; a sketch of such a check appears after this list. For more information, see Section B.2.2.3, “Host Availability Checks”.

  • Includes an advanced rules engine. The rules engine is used to respond to different events within the cluster and perform the necessary operations to keep the dataservice in an optimal working state. During any change in status, whether user-initiated or triggered automatically by a failure, the rules are used to decide whether to restart services, swap masters, or reconfigure connectors.
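
As an illustration of the availability check described above, the following sketch attempts a TCP connection to the Echo service on port 7 and falls back to a system-level reachability test. This is not the Manager's actual implementation; the class name, timeout, and host name are illustrative assumptions.

    import java.io.IOException;
    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class HostAvailabilityCheck {
        private static final int ECHO_PORT = 7;       // default Echo TCP/IP port
        private static final int TIMEOUT_MS = 2000;   // per-attempt timeout (illustrative)

        // Returns true if the host accepts a TCP connection on the Echo port.
        static boolean echoPortReachable(String host) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, ECHO_PORT), TIMEOUT_MS);
                return true;
            } catch (IOException e) {
                return false;
            }
        }

        // Falls back to a system ping-style check (ICMP or TCP echo, depending on privileges).
        static boolean pingReachable(String host) throws IOException {
            return InetAddress.getByName(host).isReachable(TIMEOUT_MS);
        }

        public static void main(String[] args) throws IOException {
            String host = args.length > 0 ? args[0] : "db1";   // hypothetical host name
            System.out.println(host + " available: " + (echoPortReachable(host) || pingReachable(host)));
        }
    }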

1.2.1. Tungsten Manager Failover Rules

There are currently three discrete faults that can cause a failover of a master:

  • Database server failure - failover will occur 20 seconds after the initial detection.

    --property=policy.liveness.dbping.fail.threshold=1

    The Tungsten Manager is unable to connect to the database server and receives an I/O error. If the database cannot respond to a TCP connect request after the configured number of attempts, the database server is flagged as STOPPED, which initiates the failover.

    In practice this means that the database server process is gone and cannot respond to a TCP connect request. In this case, by default, the manager retries the connection at 10 second intervals up to the configured threshold, and once the full interval (20 seconds by default) has elapsed, flags the database server as being in the STOPPED state, which in turn initiates the failover. The retry-and-threshold pattern shared by all of these checks is illustrated in the sketch after this list.

  • Host failure - failover will occur 30 seconds after the initial detection.

    --property=policy.liveness.hostPing.fail.threshold=2

    The host on which the master database server is running is 'gone'. The first indication that the master host is gone may be that the manager on that host no longer appears in the group of managers, one of which runs on each database server host. It may also be that the managers on the hosts besides the master no longer see a 'heartbeat' message from the master manager. In circumstances like these, the remaining managers will, once every 10 seconds over the configured interval (30 seconds by default), attempt to establish definitively that the master host is indeed either gone or completely unreachable via the network. If this is established, the remaining managers in the group form a quorum and the coordinator of that group initiates the failover.

  • Replicator failure - if --property=policy.fence.masterReplicator is set to true, failover will occur 70 seconds after the initial detection.

    --property=policy.fence.masterReplicator.threshold=6

    Depending on how the manager is configured, a master replicator failure can also initiate a failover. A specific manager property (--property=policy.fence.masterReplicator=true) tells a manager to 'fence' a master replicator that goes into either a failed or stopped state. The manager will then try to recover the master replicator to an online state; if the master replicator has not recovered after the configured interval (70 seconds by default), a failover will be initiated. Note that this behavior is turned off by default: most customers prefer to keep a fully functional master running, even if replication fails, rather than have a failover occur.
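
Each of the thresholds above follows the same pattern: a liveness check runs every 10 seconds, consecutive failures are counted, and the fault is declared once the configured threshold is exceeded. The sketch below illustrates that counting pattern only; it is not the Manager's rules engine, the probe itself is stubbed out, and the exact bookkeeping that yields the (threshold + 1) * 10 second intervals is left aside.

    public class LivenessMonitor {
        private static final long CHECK_INTERVAL_MS = 10_000;  // one check every 10 seconds

        private final int failThreshold;   // e.g. policy.liveness.dbping.fail.threshold
        private int consecutiveFailures = 0;

        public LivenessMonitor(int failThreshold) {
            this.failThreshold = failThreshold;
        }

        // A real manager would perform the dbping/hostPing probe here; stubbed for illustration.
        boolean checkOnce() {
            return true;
        }

        // Probes the resource every 10 seconds; returns once the failure threshold is exceeded.
        public void run() throws InterruptedException {
            while (true) {
                if (checkOnce()) {
                    consecutiveFailures = 0;       // any successful check resets the count
                } else if (++consecutiveFailures > failThreshold) {
                    System.out.println("Threshold exceeded -- flagging resource as STOPPED");
                    return;                        // at this point a failover would be initiated
                }
                Thread.sleep(CHECK_INTERVAL_MS);
            }
        }
    }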

Important

The interval of time from the first detection of a fault until a failover occurs is configurable in 10 second increments. The listed default failover intervals are derived from the value of the corresponding 'threshold' property in the properties file (tungsten-manager/conf/manager.properties): interval = (threshold + 1) * 10 seconds
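
Applying this formula to the default thresholds listed above reproduces the stated intervals, as this small standalone calculation (not part of the Manager) shows:

    public class FailoverIntervals {
        // interval = (threshold + 1) * 10 seconds
        static int intervalSeconds(int threshold) {
            return (threshold + 1) * 10;
        }

        public static void main(String[] args) {
            System.out.println("dbping (threshold=1):           " + intervalSeconds(1) + "s");  // 20s
            System.out.println("hostPing (threshold=2):         " + intervalSeconds(2) + "s");  // 30s
            System.out.println("masterReplicator (threshold=6): " + intervalSeconds(6) + "s");  // 70s
        }
    }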