4.6.4.5. Troubleshooting Hadoop Replication

Replicating to Hadoop involves a number of discrete, specific steps. Due to the batch and multi-stage nature of the extract and apply process, replication can stall or stop due to a variety of issues.

4.6.4.5.1. Errors Reading/Writing commitseqno.0 File

During initial installation, or when starting up replication, the replicator may report that the commitseqno.0 file cannot be created or written properly, or that the file cannot be read during startup.

The following checks and recovery procedures can be tried:

  • Check the permissions and ownership of the directory containing the commitseqno.0 file, and of the file itself:

    shell> hadoop fs -ls -R /user/tungsten/metadata
    drwxr-xr-x   - cloudera cloudera          0 2020-01-14 10:40 /user/tungsten/metadata/alpha
    -rw-r--r--   3 cloudera cloudera        251 2020-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
  • Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.

  • Check that the content of the file is correct; a sketch for validating it follows this list. The content should be a JSON structure containing the replicator state and position information. For example:

    shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
    {
      "appliedLatency" : "0",
      "epochNumber" : "0",
      "fragno" : "0",
      "shardId" : "dna",
      "seqno" : "8",
      "eventId" : "mysql-bin.000015:0000000000103156;0",
      "extractedTstamp" : "1578998421000"
      "lastFrag" : "true",
      "sourceId" : "host1"
    }
  • Try deleting the commitseqno.0 file and placing the replicator online:

    shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
    shell> trepctl online
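
If the ownership or permissions on the file are incorrect, they can be corrected with the standard hadoop fs commands, and the content can be checked for valid JSON. The following is a sketch only: the user, group and path (cloudera and /user/tungsten/metadata/alpha) are taken from the examples above and should be adjusted to match the installation, and the JSON check assumes python is available on the host:

    shell> hadoop fs -chown cloudera:cloudera /user/tungsten/metadata/alpha/commitseqno.0
    shell> hadoop fs -chmod 644 /user/tungsten/metadata/alpha/commitseqno.0
    shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0 | python -m json.tool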
4.6.4.5.2. Recovering from Replication Failure

If replication fails, is manually stopped, or the host needs to be restarted, replication should continue from the last point at which replication was stopped. Files that were being written when replication was last running will be overwritten and the information recreated.

Unlike other heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within HDFS on the target Hadoop environment. To recover from failed replication, this file must be deleted so that the THL can be re-read from the Source, and the CSV files recreated and applied into HDFS. A combined sketch of these steps is shown after the procedure below.

  1. On the Applier, put the replicator offline:

    shell> trepctl offline
  2. Remove the THL files from the Applier:

    shell> trepctl reset -thl
  3. Remove the staging CSV files replicated into Hadoop:

    shell> hadoop fs -rm -r /user/tungsten/staging
  4. Reset the restart position:

    shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0

    Replace alpha and /opt/continuent with the corresponding service name and installation location.

  5. Restart replication on the Applier; this will start to recreate the THL files from the MySQL binary log:

    shell> trepctl online
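
The above steps can be combined into a single script. The following is a sketch only, not part of the standard tools; the service name (alpha), installation directory (/opt/continuent) and HDFS staging path (/user/tungsten/staging) are taken from the examples above and must be adjusted to match the deployment:

    #!/bin/bash
    # Sketch only: reset Hadoop replication and restart from the MySQL binary log.
    set -e

    # Adjust these values to match the deployment
    SERVICE=alpha
    INSTALL=/opt/continuent
    STAGING=/user/tungsten/staging

    # 1. Put the replicator offline on the Applier
    trepctl offline
    # 2. Remove the THL files from the Applier (may prompt for confirmation)
    trepctl reset -thl
    # 3. Remove the staging CSV files replicated into Hadoop
    hadoop fs -rm -r "$STAGING"
    # 4. Reset the restart position
    rm "$INSTALL/tungsten/tungsten-replicator/data/$SERVICE/commitseqno.0"
    # 5. Restart replication; the THL files are recreated from the MySQL binary log
    trepctl online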
4.6.4.5.3. Missing Primary Key

Replication may fail at the applier stage if the source data does not contain the correct ROW format and information, including the primary key data. In this case, trepctl may report an error similar to the following:

...
pendingErrorEventId    : mysql-bin.000015:0000000000143981;0
pendingErrorSeqno      : 10
pendingExceptionMessage: Wrapped com.continuent.tungsten.replicator.ReplicatorException: »
    Unable to find a primary key for dna.alt_allele_attrib and there is no default » 
    from property stagePkeyColumn (../../tungsten-replicator//samples/scripts/batch/hdfs-merge.js#18)
pipelineSource         : UNKNOWN
relativeLatency        : -1.0
...

If the primary key was missing from the source data, the table structure on the Source must be updated to include a primary key, and the THL information recreated.
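
Tables without a primary key can be identified on the Source using information_schema, and a key added with ALTER TABLE before the THL is recreated. The statements below are a sketch only: the schema (dna) and table (dna.alt_allele_attrib) are taken from the error above, and the column allele_attrib_id is a hypothetical example; the correct key column(s) depend on the actual table definition:

    shell> mysql -e "SELECT t.table_schema, t.table_name \
             FROM information_schema.tables t \
             LEFT JOIN information_schema.table_constraints c \
               ON c.table_schema = t.table_schema \
              AND c.table_name = t.table_name \
              AND c.constraint_type = 'PRIMARY KEY' \
             WHERE t.table_schema = 'dna' \
               AND t.table_type = 'BASE TABLE' \
               AND c.constraint_name IS NULL"
    shell> # allele_attrib_id is a hypothetical column; use the table's real key column(s)
    shell> mysql -e "ALTER TABLE dna.alt_allele_attrib ADD PRIMARY KEY (allele_attrib_id)"

Once the table structure has been updated, the THL information can be recreated using the recovery procedure in Section 4.6.4.5.2.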