5.2.2. Preparing Hosts for Hadoop Replication

During replication, data is mapped from the MySQL database/table/row structure onto a corresponding structure of Hadoop directories and files, as shown in the table below:

MySQL       Hadoop
---------   --------------------------------------------------------------
Database    Directory
Table       Hive-compatible character-separated text file
Row         Line in the text file, fields terminated by character 0x01

5.2.2.1. MySQL Host

Any data in MySQL can be replicated, although there are some known limitations, and some assumptions are made about the way the information is transferred.

The following are required for replication to Hadoop:

  • MySQL must be using row-based replication for information to be replicated to Hadoop. For the best results, set the global binary log format in the configuration file (my.cnf):

    binlog-format = row

    Alternatively, the global binary log format can be changed at runtime by executing the following statement:

    mysql> SET GLOBAL binlog_format = 'ROW';

    A change made with SET GLOBAL is lost when the MySQL server is restarted; placing the setting in the my.cnf file ensures that it is permanently enabled.

    For MySQL 5.6.2 and later, you must also enable full row log images:

    binlog-row-image = full

  • The server character set should be set to UTF8 by updating the MySQL configuration (my.cnf):

    character-set-server=utf8
    collation-server=utf8_general_ci

    Tables must also be created as UTF8 tables, and existing tables should be converted to UTF8 before they are replicated, to prevent character set corruption issues; an example conversion is shown after this list.

  • To prevent the timezone configuration from storing zone-adjusted values and exporting that information to the binary log and on to Hadoop, fix the server timezone to UTC in the configuration file (my.cnf):

    default-time-zone='+00:00'

  • Each table that is being replicated must have a primary key. Failure to provide a primary key in the table definition will cause replication to stop.
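
Taken together, the MySQL-side settings described above correspond to a my.cnf fragment along the following lines (a sketch only; merge these values into your existing configuration rather than replacing it, and note that binlog-row-image applies to MySQL 5.6.2 and later):

[mysqld]
binlog-format          = row
binlog-row-image       = full
character-set-server   = utf8
collation-server       = utf8_general_ci
default-time-zone      = '+00:00'

Existing tables can be converted to UTF8 before replication is enabled. For example, for the test.movies_large table used later in this section (the database and table names here are only illustrative):

mysql> ALTER DATABASE test CHARACTER SET utf8 COLLATE utf8_general_ci;
mysql> ALTER TABLE test.movies_large CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;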

5.2.2.2. Hadoop Host

The Hadoop environment should have the following features and parameters for the most efficient operation:

  • Disk storage

    There must be enough disk storage for the change data, data being actively merged, and the live data for the replicated information. Depending on the configuration and rate of changes in the master, the required data space will fluctuate.

    For example, replicating a 10GB dataset with 5GB of change data arriving during replication will require between 25GB and 40GB of storage: 10GB for the original dataset, 5GB of change data, and 10-25GB of merged data. The exact size depends on the quantity of inserts, updates and deletes.

  • Pre-requisites

    Currently, deployment of the slave to a relay host is not supported. One host within the Hadoop cluster must be chosen to act as the slave.

    The standard Tungsten Replicator prerequisites should be followed, as described in Appendix C, Prerequisites. This will provide the base environment into which Tungsten Replicator can be installed.

  • HDFS Location

    The /user/tungsten directory must be writable by the replicator user within HDFS:

    shell> hadoop fs -mkdir /user/tungsten
    shell> hadoop fs -chmod 700 /user/tungsten
    shell> hadoop fs -chown tungsten /user/tungsten

    These commands should be executed by a user with HDFS administration rights (e.g. the hdfs user).

  • Replicator User Group Membership

    The user that will be executing the replicator (typically tungsten, as recommended in Appendix C, Prerequisites) must be a member of the hive group on the Hadoop host where the replicator will be installed. Without this membership, the user will be unable to execute Hive queries.
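
    For example, on a typical Linux host the group membership could be added and then checked, together with the ownership of the HDFS directory created earlier, using commands similar to the following (a sketch only; the exact steps depend on how users and groups are managed in your environment):

    shell> sudo usermod -a -G hive tungsten
    shell> id tungsten
    shell> hadoop fs -ls /user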

5.2.2.3. Schema Generation

In order to access the generated tables, both staging and the final tables, it is necessary to create a schema definition. The ddlscan tool can be used to read the existing definition of the tables from the source server and generate suitable Hive schema definitions to access the table data.

To create the definitions for the final tables, use the ddl-mysql-hive-0.10.vm template; you must specify the JDBC connection string, user, password and database names. For example:

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
   -template ddl-mysql-hive-0.10.vm -db test
--
-- SQL generated on Wed Jan 29 16:17:05 GMT 2014 by Tungsten ddlscan utility
-- 
-- url = jdbc:mysql:thin://host1:13306/test
-- user = tungsten
-- dbName = test
--
CREATE DATABASE test;

DROP TABLE IF EXISTS test.movies_large;

CREATE TABLE test.movies_large
(
  id INT ,
  title STRING ,
  year INT ,
  episodetitle STRING  )
;

The output from this command should be applied to your Hive installation within the Hadoop cluster, for example by capturing the output to a file, transferring that file to the Hadoop host, and then running:

shell> cat schema.sql | hive
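
For instance, the capture and transfer steps might look like the following, assuming the Hadoop host is reachable as hadoop1 (the hostname and file name are placeholders; adjust them to your environment):

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
   -template ddl-mysql-hive-0.10.vm -db test > schema.sql
shell> scp schema.sql tungsten@hadoop1:
shell> ssh tungsten@hadoop1 'cat schema.sql | hive'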

To create Hive tables that read the staging files loaded by the replicator, use the ddl-mysql-hive-0.10-staging.vm template:

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host:13306/test' -pass password \
    -template ddl-mysql-hive-0.10-staging.vm -db test

The process creates the schema and tables which match the schema and table names on the source database.

Transfer this file to your Hadoop environment and then create the generated schema:

shell> cat schema-staging.sql | hive

The process creates matching schema names, but table names are modified to include the prefix stage_xxx_. For example, for the table movies_large a staging table named stage_xxx_movies_large is created. The Hive table definition is created pointing to the external file-based tables, using the default 0x01 field separator and 0x0A (newline) record separator. If different values were used for these in the configuration, the schema definition in the captured file from ddlscan should be updated by hand.
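
As a guide to what needs to be edited, a Hive external table over character-separated text files is typically defined with a row format clause along the following lines. The table and column definitions below simply reuse the movies_large example and are not the literal ddlscan output; the generated file also typically contains a LOCATION clause pointing at the replicated files in HDFS, which should be left as generated:

CREATE EXTERNAL TABLE test.stage_xxx_movies_large
(
  id INT,
  title STRING,
  year INT,
  episodetitle STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

The FIELDS TERMINATED BY and LINES TERMINATED BY values are the parts to update if non-default field or record separators were configured for the replicator.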

The tables should now be available within Hive. For more information on accessing and using the tables, see Section 5.2.5, “Accessing Generated Tables in Hive”.