Replicating to Hadoop involves a number of discrete steps. Because the extract and apply process is batch-based and multi-stage, replication can stall or stop for a variety of reasons.
During initial installation, or when starting up replication, the replicator may report that the commitseqno.0 file cannot be created or written properly, or, during startup, that the file cannot be read.
The following checks and recovery procedures can be tried:
Check the permissions and ownership of the directory containing the commitseqno.0 file, and of the file itself:
shell> hadoop fs -ls -R /user/tungsten/metadata
drwxr-xr-x - cloudera cloudera 0 2020-01-14 10:40 /user/tungsten/metadata/alpha
-rw-r--r-- 3 cloudera cloudera 251 2020-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.
Check that the content of the file is correct. The content should be a JSON structure containing the replicator state and position information. For example:
shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
{
"appliedLatency" : "0",
"epochNumber" : "0",
"fragno" : "0",
"shardId" : "dna",
"seqno" : "8",
"eventId" : "mysql-bin.000015:0000000000103156;0",
"extractedTstamp" : "1578998421000",
"lastFrag" : "true",
"sourceId" : "host1"
}
Try deleting the commitseqno.0 file and placing the replicator online:
shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
shell> trepctl online
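The content check described above can be partly automated. The following is a minimal sketch, not part of the product: check_commitseqno is a hypothetical helper name, and the choice of "seqno" and "eventId" as required keys is an assumption based only on the example file shown above. It confirms that the content is not empty and that those two keys are present; it is not a full JSON validator.

```shell
# Hypothetical helper: sanity-check commitseqno.0 content passed on stdin.
# Prints "ok", "empty", or "missing <key>"; returns non-zero on a problem.
# The required keys are an assumption taken from the example file above.
check_commitseqno() {
    content=$(cat)
    if [ -z "$content" ]; then
        echo "empty"
        return 1
    fi
    for key in seqno eventId; do
        # Look for "key" followed by optional whitespace and a colon
        if ! printf '%s\n' "$content" | grep -q "\"$key\"[[:space:]]*:"; then
            echo "missing $key"
            return 1
        fi
    done
    echo "ok"
}

# Usage, with the path from the example above:
#   hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0 | check_commitseqno
```

An "empty" result corresponds to the empty-file case described above, where deleting the file and placing the replicator online is the next thing to try.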
If replication fails, is manually stopped, or the host needs to be restarted, replication should continue from the point at which it was stopped. Files that were being written when replication was last running will be overwritten and the information recreated.
Unlike other heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within the HDFS of the target Hadoop environment. To recover from failed replication, this file must be deleted so that the THL can be re-read from the Source and the CSV files recreated and applied into HDFS.
On the Applier, put the replicator offline:
shell> trepctl offline
Remove the THL files from the Applier:
shell> trepctl reset -thl
Remove the staging CSV files replicated into Hadoop:
shell> hadoop fs -rm -r /user/tungsten/staging
Reset the restart position:
shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0
Replace alpha and /opt/continuent with the corresponding service name and installation location.
Restart replication on the Applier; this will start to recreate the THL files from the MySQL binary log:
shell> trepctl online
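Because the service name and installation location vary between deployments, it can help to generate the command sequence before running it. The sketch below only prints the commands from the steps above with those two values substituted; print_recovery_steps is a hypothetical helper name, and the staging path is copied unchanged from the example, so adjust it if your deployment differs.

```shell
# Hypothetical helper: print the recovery commands from the steps above,
# substituting the service name ($1) and installation directory ($2).
# Nothing is executed; review the output before running any of it.
print_recovery_steps() {
    service=${1:-alpha}
    install_dir=${2:-/opt/continuent}
    cat <<EOF
trepctl offline
trepctl reset -thl
hadoop fs -rm -r /user/tungsten/staging
rm ${install_dir}/tungsten/tungsten-replicator/data/${service}/commitseqno.0
trepctl online
EOF
}

# Usage:
#   print_recovery_steps alpha /opt/continuent
```

Printing rather than executing is a deliberate choice here: several of these commands delete data, so each one should be checked against the environment first.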
Replication may fail at the applier stage if the source data does not contain the correct ROW format and information, including the primary key data. trepctl may report the following error:
...
pendingErrorEventId    : mysql-bin.000015:0000000000143981;0
pendingErrorSeqno      : 10
pendingExceptionMessage: Wrapped com.continuent.tungsten.replicator.ReplicatorException: »
    Unable to find a primary key for dna.alt_allele_attrib and there is no default »
    from property stagePkeyColumn (../../tungsten-replicator//samples/scripts/batch/hdfs-merge.js#18)
pipelineSource         : UNKNOWN
relativeLatency        : -1.0
...
If the primary key is missing from the source data, the table structure on the source must be updated and the THL information recreated.
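To identify which table needs a primary key, the table name can be pulled out of the error text. This is a minimal sketch that assumes the message wording matches the example above; missing_pkey_table is a hypothetical helper name, not a trepctl feature.

```shell
# Hypothetical helper: read error text on stdin and print the schema.table
# named after "Unable to find a primary key for". Prints nothing if the
# message is not present. Assumes the wording shown in the example above.
missing_pkey_table() {
    sed -n 's/.*Unable to find a primary key for \([^ ]*\).*/\1/p'
}

# Usage:
#   trepctl status | missing_pkey_table
```

For the error shown above, this would report dna.alt_allele_attrib as the table whose structure must be updated on the source.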