6.2.6. Management and Monitoring of Hadoop Deployments

Once the extractor and applier services have been installed, both can be monitored using trepctl. To monitor the master (extractor) service:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000023:0000000505545003;0
appliedLastSeqno       : 10992
appliedLatency         : 42.764
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000023:0000000505545003
currentTimeMillis      : 1389871897922
dataServerHost         : host1
extensions             : 
host                   : host1
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host1:2112/
maximumStoredSeqNo     : 10992
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:13306/
relativeLatency        : 158296.922
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 165845.474
transitioningTo        : 
uptimeSeconds          : 165850.047
useSSLConnection       : false
version                : Tungsten Replicator 5.1.1 build 202
Finished status command...
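
For scripted monitoring, the fields of this output can be read individually. The following is a minimal sketch, not part of the Tungsten tooling: status_field and check_status are hypothetical helper names, and the sketch assumes the status output has been saved to a file and keeps the NAME : VALUE layout shown above.

```shell
#!/bin/sh
# Sketch only: status_field and check_status are hypothetical helpers.

# Print the value of the field named $1 from status output on stdin.
# The line is split at the first colon, so values that contain
# colons (such as event IDs) are returned whole.
status_field() {
    awk -v name="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            gsub(/[ \t]+$/, "", key)
            if (key == name) {
                val = substr($0, pos + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val
                exit
            }
        }'
}

# Return 0 when the saved status in file $1 reports state ONLINE
# and pendingError NONE; otherwise print the problem and return 1.
check_status() {
    state=$(status_field state < "$1")
    error=$(status_field pendingError < "$1")
    if [ "$state" = "ONLINE" ] && [ "$error" = "NONE" ]; then
        echo "OK: state=$state"
    else
        echo "PROBLEM: state=$state pendingError=$error"
        return 1
    fi
}

# Example use (assumes trepctl is on the PATH):
#   trepctl status > /tmp/status.txt
#   check_status /tmp/status.txt
```

The exit code of check_status can be used by a cron job or an external monitoring system to raise an alert.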

On the slave, trepctl status shows the sequence number currently applied into Hadoop:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000010:0000000349816178;0
appliedLastSeqno       : 1102
appliedLatency         : 57.109
channels               : 1
clusterName            : alpha
currentEventId         : NONE
currentTimeMillis      : 1389629684476
dataServerHost         : 192.168.1.252
extensions             : 
host                   : 192.168.1.252
latestEpochNumber      : 236
masterConnectUri       : thl://host1:2112/
masterListenUri        : null
maximumStoredSeqNo     : 1102
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://host1:2112/
relativeLatency        : 121.476
resourcePrecedence     : 99
rmiPort                : 10002
role                   : slave
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : 192.168.1.252
state                  : ONLINE
timeInStateSeconds     : 9690.134
transitioningTo        : 
uptimeSeconds          : 10734.015
useSSLConnection       : false
version                : Tungsten Replicator 5.1.1 build 202
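
Comparing appliedLastSeqno on the master and the slave gives a rough measure of how far the applier is behind the extractor. The sketch below assumes both status outputs have been saved to files at about the same time; seqno_of and seqno_gap are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: seqno_of and seqno_gap are hypothetical helpers.

# Print the appliedLastSeqno value from the saved status file $1.
seqno_of() {
    awk -F: '$1 ~ /^appliedLastSeqno[ \t]*$/ {
        gsub(/[ \t]/, "", $2)
        print $2
        exit
    }' "$1"
}

# Print master seqno minus slave seqno.
# $1 = saved master status file, $2 = saved slave status file.
seqno_gap() {
    master=$(seqno_of "$1")
    slave=$(seqno_of "$2")
    if [ -z "$master" ] || [ -z "$slave" ]; then
        echo "appliedLastSeqno not found" >&2
        return 1
    fi
    echo $((master - slave))
}

# Example use:
#   seqno_gap master-status.txt slave-status.txt
```

A gap that keeps growing between checks suggests that the applier has stalled. The two sample outputs above were captured at different times, so their difference does not represent real lag.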

6.2.6.1. Troubleshooting Hadoop Replication

Replicating to Hadoop involves a number of discrete steps. Because the extract and apply process is batch-based and multi-stage, replication can stall or stop for a variety of reasons.

6.2.6.1.1. Errors Reading/Writing commitseqno.0 File

During initial installation, or when starting up replication, the replicator may report that the commitseqno.0 file cannot be created or written properly, or, during startup, that the file cannot be read.

The following checks and recovery procedures can be tried:

  • Check the permissions and ownership of the directory containing the commitseqno.0 file, and of the file itself:

    shell> hadoop fs -ls -R /user/tungsten/metadata
    drwxr-xr-x   - cloudera cloudera          0 2014-01-14 10:40 /user/tungsten/metadata/alpha
    -rw-r--r--   3 cloudera cloudera        251 2014-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
  • Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.

  • Check that the content of the file is correct. The content should be a JSON structure containing the replicator state and position information. For example:

    shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
    {
      "appliedLatency" : "0",
      "epochNumber" : "0",
      "fragno" : "0",
      "shardId" : "dna",
      "seqno" : "8",
      "eventId" : "mysql-bin.000015:0000000000103156;0",
      "extractedTstamp" : "1389706078000",
      "lastFrag" : "true",
      "sourceId" : "host1"
    }
  • Try deleting the commitseqno.0 file and placing the replicator online:

    shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
    shell> trepctl online
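
The content checks above can be scripted. The following is a minimal sketch, not part of the Tungsten tooling: check_commitseqno is a hypothetical helper, and the list of expected keys is taken from the sample file shown above rather than from a formal specification.

```shell
#!/bin/sh
# Sketch only: check_commitseqno is a hypothetical helper.

# Read commitseqno.0 content on stdin. Return 1 if it is empty,
# lacks one of the keys seen in the sample file, or has a
# non-numeric seqno; otherwise print the seqno and return 0.
check_commitseqno() {
    content=$(cat)
    if [ -z "$content" ]; then
        echo "EMPTY"
        return 1
    fi
    for key in seqno eventId epochNumber fragno lastFrag sourceId; do
        if ! printf '%s\n' "$content" | grep -q "\"$key\"[[:space:]]*:"; then
            echo "MISSING $key"
            return 1
        fi
    done
    seqno=$(printf '%s\n' "$content" |
        sed -n 's/.*"seqno"[[:space:]]*:[[:space:]]*"\([0-9][0-9]*\)".*/\1/p')
    if [ -z "$seqno" ]; then
        echo "BAD seqno"
        return 1
    fi
    echo "OK seqno=$seqno"
}

# Example use (path as in the listing above):
#   hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0 | check_commitseqno
```

This is a plain-text check rather than a full JSON parse, so it only confirms that the file is non-empty and looks like the expected structure.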

6.2.6.1.2. Recovering from Replication Failure

If replication fails, is manually stopped, or the host needs to be restarted, replication should continue from the point at which it was stopped. Files that were being written when replication was last running will be overwritten and the information recreated.

Unlike other heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within the HDFS of the target Hadoop environment. To recover from failed replication, this file must be deleted so that the THL can be re-read from the master and the CSV files recreated and applied into HDFS.

  1. On the slave, put the replicator offline:

    shell> trepctl offline
  2. Remove the THL files from the slave:

    shell> trepctl reset -thl
  3. Remove the staging CSV files replicated into Hadoop:

    shell> hadoop fs -rm -r /user/tungsten/staging
  4. Reset the restart position:

    shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0

    Replace alpha and /opt/continuent with the corresponding service name and installation location.

  5. Restart replication on the slave; this will start to recreate the THL files from the MySQL binary log:

    shell> trepctl online
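
The five steps above can be collected into a small wrapper that prints each command before anything is executed. This is a sketch under stated assumptions: SERVICE, INSTALL_DIR, STAGING_DIR and DRY_RUN are hypothetical variable names, the defaults mirror the paths used in the steps, and the commands are exactly those listed above.

```shell
#!/bin/sh
# Sketch only: the variable and function names here are hypothetical.
SERVICE=${SERVICE:-alpha}
INSTALL_DIR=${INSTALL_DIR:-/opt/continuent}
STAGING_DIR=${STAGING_DIR:-/user/tungsten/staging}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Run the recovery steps in order, stopping at the first failure.
recover_hadoop_slave() {
    run trepctl offline &&
    run trepctl reset -thl &&
    run hadoop fs -rm -r "$STAGING_DIR" &&
    run rm "$INSTALL_DIR/tungsten/tungsten-replicator/data/$SERVICE/commitseqno.0" &&
    run trepctl online
}

# Example use:
#   recover_hadoop_slave             # review the commands first
#   DRY_RUN=0; recover_hadoop_slave  # then execute them
```

Defaulting to dry-run mode is a deliberate choice here, because steps 2 to 4 delete data and the paths should be confirmed before they are run.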