5.3.4.5. Troubleshooting Hadoop Replication

Replicating to Hadoop involves a number of discrete steps. Because the extract and apply process is batch-oriented and multi-stage, replication can stall or stop for a variety of reasons.

5.3.4.5.1. Errors Reading/Writing commitseqno.0 File

During initial installation, or when bringing replication online, the replicator may report that the commitseqno.0 file cannot be created or written; during startup, it may report that the file cannot be read.

The following checks and recovery procedures can be tried:

  • Check the permissions and ownership of the directory containing the commitseqno.0 file, and of the file itself:

    shell> hadoop fs -ls -R /user/tungsten/metadata
    drwxr-xr-x   - cloudera cloudera          0 2014-01-14 10:40 /user/tungsten/metadata/alpha
    -rw-r--r--   3 cloudera cloudera        251 2014-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
  • Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.

  • Check that the content of the file is correct. The content should be a JSON structure containing the replicator's state and position information. For example:

    shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
    {
      "appliedLatency" : "0",
      "epochNumber" : "0",
      "fragno" : "0",
      "shardId" : "dna",
      "seqno" : "8",
      "eventId" : "mysql-bin.000015:0000000000103156;0",
      "extractedTstamp" : "1389706078000",
      "lastFrag" : "true",
      "sourceId" : "host1"
    }
  • Try deleting the commitseqno.0 file and placing the replicator online:

    shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
    shell> trepctl online
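The emptiness and validity checks above can be combined into a small helper. The following is a minimal sketch, assuming python3 is available on the host for JSON validation; the check_commitseqno function name is illustrative only:

```shell
# Sketch: check that commitseqno.0 content is non-empty and valid JSON.
# Assumes python3 is available; check_commitseqno is an illustrative name.
check_commitseqno() {
    content="$1"
    # An empty file may indicate a failed update of the sequence number.
    [ -n "$content" ] || { echo "empty"; return 1; }
    # Validate the JSON structure without printing it.
    echo "$content" | python3 -m json.tool >/dev/null 2>&1 || { echo "invalid JSON"; return 1; }
    echo "ok"
}
# Usage against the file in HDFS:
# check_commitseqno "$(hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0)"
```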

5.3.4.5.2. Recovering from Replication Failure

If replication fails, is manually stopped, or the host needs to be restarted, replication should continue from the point at which it was stopped. Files that were being written when replication stopped will be overwritten and their contents recreated.

Unlike other heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within the HDFS of the target Hadoop environment. To recover from failed replication, this file must be deleted so that the THL can be re-read from the master and the CSV files recreated and applied into HDFS.
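Before deleting the state file, it can be useful to record the position it holds. The following is a sketch only, assuming the metadata path shown in the previous section and that python3 is available; restart_position is an illustrative helper name, and the field names are taken from the example commitseqno.0 content shown earlier:

```shell
# Sketch: print the restart position (seqno and eventId) from commitseqno.0
# content read on stdin. Assumes python3; restart_position is illustrative.
restart_position() {
    python3 -c 'import json, sys; d = json.load(sys.stdin); print(d["seqno"], d["eventId"])'
}
# Usage:
# hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0 | restart_position
```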

  1. On the slave, put the replicator offline:

    shell> trepctl offline
  2. Remove the THL files from the slave:

    shell> trepctl reset -thl
  3. Remove the staging CSV files replicated into Hadoop:

    shell> hadoop fs -rm -r /user/tungsten/staging
  4. Reset the restart position:

    shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0

    Replace alpha and /opt/continuent with the corresponding service name and installation location.

  5. Restart replication on the slave; this will start to recreate the THL files from the MySQL binary log:

    shell> trepctl online
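The five steps above can be collected into a single sequence for review before running them. This is a sketch only; the service name alpha and installation path /opt/continuent mirror the placeholders used in the steps above and should be adjusted for your deployment:

```shell
# Sketch: print the full recovery sequence so it can be reviewed before
# execution. SERVICE and INSTALL mirror the placeholders in the steps above.
SERVICE=alpha
INSTALL=/opt/continuent

recovery_commands() {
    cat <<EOF
trepctl offline
trepctl reset -thl
hadoop fs -rm -r /user/tungsten/staging
rm $INSTALL/tungsten/tungsten-replicator/data/$SERVICE/commitseqno.0
trepctl online
EOF
}

# Review the sequence, then execute it with: recovery_commands | sh
recovery_commands
```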