3.4. Deploying MySQL to Hadoop Replication

Replication into Hadoop works by generating character-separated values (CSV) from ROW-based information and applying them to HDFS through a batch loading process. Files are written directly to HDFS using the Hadoop client libraries. A separate process then merges the existing data with the changed information extracted from the master database.
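The first step, turning row changes into character-separated text, can be sketched in Python as follows. This is only an illustration: the `(opcode, seqno, values)` tuple layout, the `I`/`D` opcodes, and the function name `rows_to_csv` are assumptions made for this example, not the format or API that Tungsten Replicator actually uses.

```python
import csv
import io


def rows_to_csv(changes, delimiter=","):
    """Serialize row-change events into character-separated text.

    Each change is a hypothetical (opcode, seqno, values) tuple, where
    opcode is "I" for an inserted row image and "D" for a deleted one.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for opcode, seqno, values in changes:
        # One output line per row image: metadata first, then column values.
        writer.writerow([opcode, seqno, *values])
    return buffer.getvalue()


# Assumed convention for this sketch: an update is a delete plus an insert.
changes = [
    ("I", 1, (1, "alice")),
    ("D", 2, (1, "alice")),
    ("I", 2, (1, "alicia")),
]
batch = rows_to_csv(changes)
```

In a real deployment the resulting batch file would be written to HDFS by the replicator; here it is simply kept in memory as a string.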

Deployment of Hadoop replication is similar to other heterogeneous installations; two separate installations are created:

  • Service Alpha on the master extracts the information from the MySQL binary log into THL.

  • Service Alpha on the slave reads the information from the remote replicator as THL, applying it to Hadoop. The applier works in two stages:

Figure 3.8. Topologies: MySQL to Hadoop

Basic requirements for replication into Hadoop:

  • Hadoop Replication is supported on the following Hadoop distributions:

    • Cloudera Enterprise 4.4, Cloudera Enterprise 5.0 (Certified)

    • Hortonworks Data Platform 2.0

    • Amazon Elastic MapReduce

    • IBM InfoSphere BigInsights

  • Source tables must have primary keys. Without a primary key, Tungsten Replicator cannot determine which row to update when the data reaches Hadoop.
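The primary key requirement is easier to see with a small sketch of a merge step. The code below is a simplified illustration, not the merge process shipped with the replicator: the function name `merge_changes`, the `I`/`D` opcodes, and the in-memory row tuples are all assumptions made for this example. Each change is matched to an existing row through its primary key; without such a key there would be nothing to match on.

```python
def merge_changes(existing, changes, key_index=0):
    """Apply ordered changes to existing rows, matching rows by primary key.

    existing:  list of row tuples already stored in the target.
    changes:   ordered list of (opcode, row) pairs; "I" inserts or
               replaces a row, "D" deletes it (hypothetical opcodes).
    key_index: position of the primary key column within each row.
    """
    # Index the current data by primary key so each change finds its row.
    table = {row[key_index]: row for row in existing}
    for opcode, row in changes:
        key = row[key_index]
        if opcode == "D":
            table.pop(key, None)
        elif opcode == "I":
            table[key] = row
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
    # Return rows ordered by primary key for a stable result.
    return [table[key] for key in sorted(table)]


existing = [(1, "alice"), (2, "bob")]
changes = [
    ("D", (1, "alice")),   # update of row 1, expressed as delete + insert
    ("I", (1, "alicia")),
    ("I", (3, "carol")),
    ("D", (2, "bob")),
]
merged = merge_changes(existing, changes)
```

With the sample data, row 1 is replaced, row 2 is removed, and row 3 is added, all located through the value in the primary key column.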