Troubleshooting Hadoop Replication
Replicating to Hadoop involves a number of discrete steps. Because the extract and apply process is batch-based and multi-stage, replication can stall or stop for a variety of reasons.
Errors Reading/Writing commitseqno.0 File
During initial installation, or when starting up replication, the replicator may report that the commitseqno.0 file cannot be created or written properly or, during startup, that the file cannot be read.
The following checks and recovery procedures can be tried:
Check the permissions of the directory containing the commitseqno.0 file, the file itself, and the ownership:
shell> hadoop fs -ls -R /user/tungsten/metadata
drwxr-xr-x - cloudera cloudera 0 2020-01-14 10:40 /user/tungsten/metadata/alpha
-rw-r--r-- 3 cloudera cloudera 251 2020-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.
Check that the content of the file is correct. The content should be a JSON structure containing the replicator state and position information. For example:
shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
{
"appliedLatency" : "0",
"epochNumber" : "0",
"fragno" : "0",
"shardId" : "dna",
"seqno" : "8",
"eventId" : "mysql-bin.000015:0000000000103156;0",
"extractedTstamp" : "1578998421000",
"lastFrag" : "true",
"sourceId" : "host1"
}
Try deleting the commitseqno.0 file and placing the replicator online:
shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
shell> trepctl online
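The content check described above can be partly automated. The following is a minimal Python sketch, not part of the replicator: it assumes the file content has already been fetched as text (for example from the output of hadoop fs -cat), and the function name check_commitseqno and the decision to treat every field in the sample as required are illustrative assumptions.

```python
import json

# Keys taken from the sample file shown above; treating all of them as
# required is an assumption made for this sketch.
EXPECTED_KEYS = (
    "appliedLatency", "epochNumber", "fragno", "shardId", "seqno",
    "eventId", "extractedTstamp", "lastFrag", "sourceId",
)


def check_commitseqno(text):
    """Return a list of problems found in commitseqno.0 content (empty if none)."""
    if not text.strip():
        return ["file is empty"]
    try:
        state = json.loads(text)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: %s" % exc.msg]
    if not isinstance(state, dict):
        return ["top-level JSON value is not an object"]
    # Report every expected key that is absent.
    problems = ["missing key: %s" % key for key in EXPECTED_KEYS if key not in state]
    # The sample stores seqno as a string of digits; flag anything else.
    if "seqno" in state and not str(state["seqno"]).isdigit():
        problems.append("seqno is not a non-negative integer: %r" % (state["seqno"],))
    return problems
```

An empty result means the content parsed as a JSON object with the expected fields. A structural slip, such as a missing comma between two fields, is reported as invalid JSON, which is a reasonable cue to try the delete-and-online step.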
Recovering from Replication Failure
If replication fails, is manually stopped, or the host needs to be restarted, replication should resume from the point at which it last stopped. Files that were being written when replication was last running will be overwritten and the information recreated.
Unlike other heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within the HDFS of the target Hadoop environment. To recover from failed replication, this file must be deleted so that the THL can be re-read from the Source and the CSV files can be recreated and applied into HDFS.
On the Applier, put the replicator offline:
shell> trepctl offline
Remove the THL files from the Applier:
shell> trepctl reset -thl
Remove the staging CSV files replicated into Hadoop:
shell> hadoop fs -rm -r /user/tungsten/staging
Reset the restart position:
shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0
Replace alpha and /opt/continuent with the corresponding service name and installation location.
Restart replication on the Applier; this will start to recreate the THL files from the MySQL binary log:
shell> trepctl online
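Because the order of these steps matters, it can help to script them so that a failure in one step stops the sequence. The Python sketch below is one possible way to do that, under stated assumptions: the function names recovery_commands and run_recovery are hypothetical, the commands are exactly those listed above with no extra options, and the sketch does not handle any interactive confirmation a command might ask for. It defaults to a dry run that only prints the commands.

```python
import subprocess


def recovery_commands(service="alpha", install_dir="/opt/continuent",
                      staging_dir="/user/tungsten/staging"):
    """Build the recovery steps, in order, as argument lists."""
    # Path layout copied from the step above; adjust for your installation.
    seqno_file = "%s/tungsten/tungsten-replicator/data/%s/commitseqno.0" % (
        install_dir, service)
    return [
        ["trepctl", "offline"],                       # put the replicator offline
        ["trepctl", "reset", "-thl"],                 # remove the THL files
        ["hadoop", "fs", "-rm", "-r", staging_dir],   # remove staging CSV files
        ["rm", seqno_file],                           # reset the restart position
        ["trepctl", "online"],                        # restart replication
    ]


def run_recovery(commands, dry_run=True):
    """Print each command; when dry_run is False, also execute it."""
    for argv in commands:
        print("shell> " + " ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status,
            # so later steps are skipped if an earlier one fails.
            subprocess.run(argv, check=True)
```

Calling run_recovery(recovery_commands()) prints the five commands for review; passing dry_run=False executes them on the Applier host.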
Missing Primary Key
Replication may fail at the applier stage if the source data does not contain the correct ROW format and information, including the primary key data. trepctl may report the following error:
...
pendingErrorEventId : mysql-bin.000015:0000000000143981;0
pendingErrorSeqno : 10
pendingExceptionMessage: Wrapped com.continuent.tungsten.replicator.ReplicatorException:
Unable to find a primary key for dna.alt_allele_attrib and there is no default
from property stagePkeyColumn (../../tungsten-replicator//samples/scripts/batch/hdfs-merge.js#18)
pipelineSource : UNKNOWN
relativeLatency : -1.0
...
If the primary key was missing in the source data, the table structure on the source must be updated, and the THL information recreated.
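When monitoring is scripted, the affected table can be extracted from the status text so that the right table is fixed on the source. The Python sketch below assumes the status output uses the "name : value" layout and the exception wording shown in the example above; the function names are hypothetical and the patterns may need adjusting if the output differs.

```python
import re

# Wording taken from the example error above.
PKEY_ERROR = re.compile(r"Unable to find a primary key for\s+(\S+)")
ERROR_SEQNO = re.compile(r"^\s*pendingErrorSeqno\s*:\s*(-?\d+)\s*$", re.MULTILINE)


def missing_pkey_table(status_text):
    """Return the 'schema.table' named in a missing-primary-key error, or None."""
    # The exception message can wrap across lines, so collapse whitespace first.
    flat = " ".join(status_text.split())
    match = PKEY_ERROR.search(flat)
    return match.group(1) if match else None


def pending_error_seqno(status_text):
    """Return the pendingErrorSeqno value as an int, or None if it is absent."""
    match = ERROR_SEQNO.search(status_text)
    return int(match.group(1)) if match else None
```

For the example output above, missing_pkey_table returns dna.alt_allele_attrib and pending_error_seqno returns 10, which identifies both the table to correct and the sequence number at which the applier stopped.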