4.6.2. Preparing for Hadoop Replication

During the replication process, data is written from the MySQL database/table/row structure into corresponding Hadoop directories and files, as shown in the table below:

MySQL      Hadoop
Database   Directory
Table      Hive-compatible character-separated text file
Row        Line in the text file, fields terminated by character 0x01
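
As an illustration of the row format, a single line from one of the replicated files can be inspected with hadoop fs -cat; when piped through cat -A, the 0x01 field separators appear as ^A. The path below is purely illustrative and will depend on the service name and configuration:

shell> hadoop fs -cat /user/tungsten/staging/test/movies_large/movies_large-1.csv | head -n 1 | cat -A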

4.6.2.1. Hadoop Host

The Hadoop environment should have the following features and parameters for the most efficient operation:

  • Disk storage

    There must be enough disk storage for the change data, data being actively merged, and the live data for the replicated information. Depending on the configuration and rate of changes in the Source, the required data space will fluctuate.

    For example, replicating a 10GB dataset with 5GB of change data during replication will require between 25GB and 40GB of storage: 10GB for the original dataset, 5GB for the change data, and 10-25GB for the merged data. The exact size depends on the quantity of inserts, updates, and deletes.
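
    As a simple check before deployment, the available space on the local filesystem and within HDFS can be confirmed, for example:

    shell> df -h
    shell> hadoop fs -df -h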

  • Pre-requisites

    Currently, deployment of the target to a relay host is not supported. One host within the Hadoop cluster must be chosen to act as the target.

    The prerequisites for a standard Tungsten Replicator host should be followed, as described in Appendix B, Prerequisites. This will provide the base environment into which Tungsten Replicator can be installed.

  • HDFS Location

    The /user/tungsten directory must be writable by the replicator user within HDFS:

    shell> hadoop fs -mkdir /user/tungsten
    shell> hadoop fs -chmod 700 /user/tungsten
    shell> hadoop fs -chown tungsten /user/tungsten

    These commands should be executed by a user with HDFS administration rights (e.g. the hdfs user).
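
    The ownership and permissions can then be confirmed by listing the parent directory; the /user/tungsten entry should appear owned by tungsten with drwx------ permissions:

    shell> hadoop fs -ls /user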

  • Replicator User Group Membership

    The user that will be executing the replicator (typically tungsten, as recommended in Appendix B, Prerequisites) must be a member of the hive group on the Hadoop host where the replicator will be installed. Without this membership, the user will be unable to execute Hive queries.
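
    For example, on most Linux distributions the group membership can be added and then verified as follows, run as root and assuming the group created by the Hadoop/Hive installation is named hive:

    shell> usermod -a -G hive tungsten
    shell> id tungsten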

4.6.2.2. Schema Generation

In order to access the generated tables, both the staging tables and the final tables, a schema definition must be created. The ddlscan tool can read the existing table definitions from the source server and generate suitable Hive schema definitions for accessing the table data.

To create the definitions for the final tables, use the ddl-mysql-hive-0.10.vm template; you must specify the JDBC connection string, user, password, and database name. For example:

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
   -template ddl-mysql-hive-0.10.vm -db test
--
-- SQL generated on Wed Jan 29 16:17:05 GMT 2020 by Tungsten ddlscan utility
-- 
-- url = jdbc:mysql:thin://host1:13306/test
-- user = tungsten
-- dbName = test
--
CREATE DATABASE test;

DROP TABLE IF EXISTS test.movies_large;

CREATE TABLE test.movies_large
(
  id INT ,
  title STRING ,
  year INT ,
  episodetitle STRING  )
;

The output from this command should be applied to your Hive installation within the Hadoop cluster, for example by capturing the output to a file, transferring that file, and then running:

shell> cat schema.sql | hive
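
If ddlscan is run on the source host, one way to perform the capture and transfer is to redirect the output to a file and copy it to a host within the Hadoop cluster; the host name hadoop1 below is illustrative:

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
   -template ddl-mysql-hive-0.10.vm -db test > schema.sql
shell> scp schema.sql tungsten@hadoop1: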

To create the Hive table definitions that read the staging files loaded by the replicator, use the ddl-mysql-hive-0.10-staging.vm template:

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host:13306/test' -pass password \
    -template ddl-mysql-hive-0.10-staging.vm -db test

The process creates definitions for the staging schema and tables, based on the schema and table names on the source database.

Transfer this file to your Hadoop environment and then create the generated schema:

shell> cat schema-staging.sql | hive
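
At this point, the presence of the new tables can be confirmed from within Hive, which should list both the final and staging tables:

shell> hive -e 'USE test; SHOW TABLES;'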

The process creates matching schema names, but the table names are modified to include the prefix stage_xxx_. For example, for the table movies_large, a staging table named stage_xxx_movies_large is created. The Hive table definitions are created pointing to the external file-based tables, using the default 0x01 field separator and 0x0A (newline) record separator. If different values were used for these in the configuration, the schema definitions in the file captured from ddlscan should be updated by hand, as sketched below.
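
As a rough sketch of the kind of edit involved, the separators are defined by the ROW FORMAT clause of each staging table definition; the example below is a simplified illustration only (the actual ddlscan output may include additional columns and a LOCATION clause), and it is this clause that would be changed to match any non-default separators:

-- Simplified illustration; not the literal ddlscan output
CREATE EXTERNAL TABLE test.stage_xxx_movies_large
(
  id INT,
  title STRING,
  year INT,
  episodetitle STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;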

The tables should now be available within Hive. For more information on accessing and using the tables, see Section 4.6.4.3, “Accessing Generated Tables in Hive”.