6.4.4. Installing Hadoop Replication

Installation of the Hadoop replication consists of multiple stages:

  1. Install the Master replicator to extract information from your source database. Separate instructions are available for:

  2. Install the Slave replicator, which will apply information to the target Hadoop environment. See Section, “Slave Replicator Service”.

  3. Once the installation of the Master and Slave components has been completed, materialization of tables and views on

Generating Materialized Views

The continuent-tools-hadoop repository contains a set of tools that simplify creating DDL, building materialized views, and comparing data for the tables that have been replicated from MySQL.

To obtain the tools, use git

shell> ./bin/load-reduce-check -s test -Ujdbc:mysql:thin://tr-hadoop2:13306 -udbload -ppassword

The load-reduce-check command performs four distinct steps:

  1. Reads the schema from the MySQL server and creates the staging table DDL within Hive.

  2. Reads the schema from the MySQL server and creates the base table DDL within Hive.

  3. Executes the materialized view process on the data in each selected staging table to build the base table content.

  4. Performs a data comparison.

Accessing Generated Tables in Hive

If you have not already done so, follow the schema generation process described in Section, “Schema Generation”. This creates the necessary Hive schema and staging schema definitions.

Once the tables have been created through ddlscan, you can query the stage tables:

hive> select * from stage_xxx_movies_large limit 10;
I	10	1	57475	All in the Family	1971	Archie Feels Left Out (#4.17)
I	10	2	57476	All in the Family	1971	Archie Finds a Friend (#6.18)
I	10	3	57477	All in the Family	1971	Archie Gets the Business: Part 1 (#8.1)
I	10	4	57478	All in the Family	1971	Archie Gets the Business: Part 2 (#8.2)
I	10	5	57479	All in the Family	1971	Archie Gives Blood (#1.4)
I	10	6	57480	All in the Family	1971	Archie Goes Too Far (#3.17)
I	10	7	57481	All in the Family	1971	Archie in the Cellar (#4.10)
I	10	8	57482	All in the Family	1971	Archie in the Hospital (#3.15)
I	10	9	57483	All in the Family	1971	Archie in the Lock-Up (#2.3)
I	10	10	57484	All in the Family	1971	Archie Is Branded (#3.20)

Management and Monitoring of Hadoop Deployments

Once the two services (extractor and applier) have been installed, they can be monitored using trepctl. To monitor the master (extractor) service:

shell>  trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000023:0000000505545003;0
appliedLastSeqno       : 10992
appliedLatency         : 42.764
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000023:0000000505545003
currentTimeMillis      : 1389871897922
dataServerHost         : host1
extensions             : 
host                   : host1
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host1:2112/
maximumStoredSeqNo     : 10992
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:13306/
relativeLatency        : 158296.922
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 165845.474
transitioningTo        : 
uptimeSeconds          : 165850.047
useSSLConnection       : false
version                : Tungsten Replicator 6.0.4 build 27
Finished status command...
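
For routine checks, the state and appliedLatency fields of this output can be read from a script. The following is a minimal sketch, assuming a POSIX shell and awk. The function names status_field and check_status and the MAX_LATENCY threshold are illustrative and are not part of Tungsten Replicator; only the field names are taken from the output above.

```shell
#!/bin/sh
# Sketch: read fields from `trepctl status` output supplied on stdin.
# MAX_LATENCY is a hypothetical threshold in seconds; pick a value that
# suits your deployment.
MAX_LATENCY="${MAX_LATENCY:-60}"

# Print the value of the named field from status text on stdin.
# Each line is split at its first colon, so values that themselves
# contain colons (event IDs, URIs) are returned intact.
status_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }'
}

# Summarise the status text on stdin.
# Returns 0 when healthy, 1 on high latency, 2 when not ONLINE.
check_status() {
    status_text=$(cat)
    state=$(printf '%s\n' "$status_text" | status_field state)
    latency=$(printf '%s\n' "$status_text" | status_field appliedLatency)

    if [ "$state" != "ONLINE" ]; then
        echo "CRITICAL: state is ${state:-unknown}"
        return 2
    fi
    # awk performs the floating-point comparison.
    if awk -v l="$latency" -v m="$MAX_LATENCY" 'BEGIN { exit !(l + 0 > m + 0) }'; then
        echo "WARNING: appliedLatency ${latency}s exceeds ${MAX_LATENCY}s"
        return 1
    fi
    echo "OK: state ONLINE, appliedLatency ${latency}s"
}
```

With these functions loaded, running `trepctl status | check_status` prints a one-line summary and returns a non-zero exit code when the service is not ONLINE or the latency is above the threshold.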

When monitoring, the primary concern, beyond identifying and coping with any errors, is the applied latency. Larger numbers for applied latency generally indicate that the information is not being written out to disk effectively. There are a number of strategies that can be tried:

  • Confirm that the Hadoop environment is running effectively. Any delays to writing to HDFS will impact the replicator.

  • Adjust the block commit parameters. Tuning the block commit levels should find the balance between updates frequent enough to achieve the required latency and files large enough for Hadoop to process effectively through map/reduce. Try both increasing and reducing the sizes to find the correct settings for your source data.

Troubleshooting Hadoop Replication

Replicating to Hadoop involves a number of discrete, specific steps. Due to the batch and multi-stage nature of the extract and apply process, replication can stall or stop due to a variety of issues.

Errors Reading/Writing commitseqno.0 File

During initial installation, or when starting up replication, the replicator may report that the commitseqno.0 file cannot be created or written properly or, during startup, that the file cannot be read.

The following checks and recovery procedures can be tried:

  • Check the permissions and ownership of the commitseqno.0 file and of the directory that contains it:

    shell> hadoop fs -ls -R /user/tungsten/metadata
    drwxr-xr-x   - cloudera cloudera          0 2014-01-14 10:40 /user/tungsten/metadata/alpha
    -rw-r--r--   3 cloudera cloudera        251 2014-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
  • Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.

  • Check that the content of the file is correct. The content should be a JSON structure containing the replicator state and position information. For example:

    shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
    {
      "appliedLatency" : "0",
      "epochNumber" : "0",
      "fragno" : "0",
      "shardId" : "dna",
      "seqno" : "8",
      "eventId" : "mysql-bin.000015:0000000000103156;0",
      "extractedTstamp" : "1389706078000",
      "lastFrag" : "true",
      "sourceId" : "host1"
    }
  • Try deleting the commitseqno.0 file and placing the replicator online:

    shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
    shell> trepctl online

Recovering from Replication Failure

If replication fails, is manually stopped, or the host needs to be restarted, replication should continue from the point at which it was stopped. Files that were being written when replication was last running will be overwritten and the information recreated.

Unlike other heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within the HDFS of the target Hadoop environment. To recover from failed replication, this file must be deleted so that the THL can be re-read from the master and the CSV files recreated and applied into HDFS.

  1. On the Slave, put the replicator offline:

    shell> trepctl offline
  2. Remove the THL files from the slave:

    shell> trepctl reset -thl
  3. Remove the staging CSV files replicated into Hadoop:

    shell> hadoop fs -rm -r /user/tungsten/staging
  4. Reset the restart position:

    shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0

    Replace alpha and /opt/continuent with the corresponding service name and installation location.

  5. Restart replication on the slave; this will start to recreate the THL files from the MySQL binary log:

    shell> trepctl online
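
The five steps above can be collected into a small wrapper. This is a sketch only, assuming a POSIX shell on the slave host. The variables SERVICE, INSTALL_DIR, STAGING_DIR and DRY_RUN and the functions run and recover_replication are illustrative names, not part of Tungsten Replicator; the commands themselves are those listed in the steps. Because the procedure deletes data, the sketch defaults to printing each command rather than executing it.

```shell
#!/bin/sh
# Sketch of the recovery procedure. All variable and function names here
# are hypothetical; adjust the defaults to match your installation.
SERVICE="${SERVICE:-alpha}"                            # replication service name
INSTALL_DIR="${INSTALL_DIR:-/opt/continuent}"          # installation location
STAGING_DIR="${STAGING_DIR:-/user/tungsten/staging}"   # staging CSV files in HDFS
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print commands only

# Execute a command, or only print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Run the steps in order, stopping at the first failure.
recover_replication() {
    run trepctl offline &&
    run trepctl reset -thl &&
    run hadoop fs -rm -r "$STAGING_DIR" &&
    run rm "$INSTALL_DIR/tungsten/tungsten-replicator/data/$SERVICE/commitseqno.0" &&
    run trepctl online
}
```

Calling recover_replication with the defaults lists the commands that would be executed. Setting DRY_RUN=0 runs them for real, so review the printed commands first.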