Preparing for Hadoop Replication
During the replication process, data is exchanged from the MySQL database/table/row structure into corresponding Hadoop directory and files, as shown in the table below:
| MySQL | Hadoop |
|---|---|
| Database | Directory |
| Table | Hive-compatible Character-Separated Text file |
| Row | Line in the text file, fields terminated by character 0x01 |
Hadoop Host
The Hadoop environment should have the following features and parameters for the most efficient operation:
Disk storage
There must be enough disk storage for the change data, data being actively merged, and the live data for the replicated information. Depending on the configuration and rate of changes in the Source, the required data space will fluctuate.
For example, replicating a 10GB dataset, and 5GB of change data during replication, will require at least 30GB of storage. 10GB for the original dataset, 5GB of change data, and 10-25GB of merged data. The exact size is dependent on the quantity of inserts/updates/deletes.
Prerequisites
Currently, deployment of the target to a relay host is not supported. One host within the Hadoop cluster must be chosen to act as the target.
The prerequisites for a standard Tungsten Replicator should be followed, including:
- "Creating the User Environment"
- "Networking Requirements"
- "Directory Locations and Configuration"
- "3rd Party Software Requirements"
This will provide the base environment into which Tungsten Replicator can be installed.
HDFS Location
The
/user/tungstendirectory must be writable by the replicator user within HDFS:shell> hadoop fs -mkdir /user/tungstenshell> hadoop fs -chmod 700 /user/tungstenshell> hadoop fs -chown tungsten /user/tungstenThese commands should be executed by a user with HDFS administration rights (e.g. the
hdfsuser).Replicator User Group Membership
The user that will be executing the replicator (typically
tungsten, as recommended in the "Installation Prerequisites") must be a member of thehivegroup on the Hadoop host where the replicator will be installed. Without this membership, the user will be unable to execute Hive queries.
Schema Generation
In order to access the generated tables, both staging and the final tables, it is necessary to create a schema definition. The ddlscan tool
can be used to read the existing definition of the tables from the source server and generate suitable Hive schema definitions to access the table data.
To create the staging table definition, use the ddl-mysql-hive-0.10.vm template; you must specify the JDBC connection string, user,
password and database names. For example:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
-template ddl-mysql-hive-0.10.vm -db test
--
-- SQL generated on Wed Jan 29 16:17:05 GMT 2020 by Tungsten ddlscan utility
--
-- url = jdbc:mysql:thin://host1:13306/test
-- user = tungsten
-- dbName = test
--
CREATE DATABASE test;
DROP TABLE IF EXISTS test.movies_large;
CREATE TABLE test.movies_large
(
id INT ,
title STRING ,
year INT ,
episodetitle STRING )
;
The output from this command should be applied to your Hive installation within the Hadoop cluster. For example, by capturing the output, transferring that file and then running:
shell> cat schema.sql | hive
To create Hive tables that read the staging files loaded by the replicator, use the ddl-mysql-hive-0.10-staging.vm template:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host:13306/test' -pass password \
-template ddl-mysql-hive-0.10-staging.vm -db test
The process creates the schema and tables which match the schema and table names on the source database.
Transfer this file to your Hadoop environment and then create the generated schema:
shell> cat schema-staging.sql |hive
The process creates matching schema names, but table names are modified to include the prefix stage_xxx_. For example, for the
table movies_large a staging table named stage_xxx_movies_large is created. The Hive table definition is created pointing to
the external file-based tables, using the default 0x01 field separator and 0x0A (newline) record separator. If different
values were used for these in the configuration, the schema definition in the captured file from ddlscan should be updated by hand.
The tables should now be available within Hive. For more information on accessing and using the tables, see "Accessing Generated Tables in Hive".