During the replication process, data is exchanged from the MySQL database/table/row structure into corresponding Hadoop directory and files, as shown in the table below:
MySQL | Hadoop |
---|---|
Database | Directory |
Table | Hive-compatible Character-Separated Text file |
Row |
Line in the text file, fields terminated by character
0x01
|
The Hadoop environment should have the following features and parameters for the most efficient operation:
Disk storage
There must be enough disk storage for the change data, data being actively merged, and the live data for the replicated information. Depending on the configuration and rate of changes in the Source, the required data space will fluctuate.
For example, replicating a 10GB dataset, and 5GB of change data during replication, will require at least 30GB of storage. 10GB for the original dataset, 5GB of change data, and 10-25GB of merged data. The exact size is dependent on the quantity of inserts/updates/deletes.
Pre-requisites
Currently, deployment of the target to a relay host is not supported. One host within the Hadoop cluster must be chosen to act as the target.
The prerequisites for a standard Tungsten Replicator should be followed, including:
This will provide the base environment into which Tungsten Replicator can be installed.
HDFS Location
The /user/tungsten
directory
must be writable by the replicator user within HDFS:
shell>hadoop fs -mkdir /user/tungsten
shell>hadoop fs -chmod 700 /user/tungsten
shell>hadoop fs -chown tungsten /user/tungsten
These commands should be executed by a user with HDFS administration
rights (e.g. the hdfs
user).
Replicator User Group Membership
The user that will be executing the replicator (typically
tungsten
, as recommended in
the Appendix B, Prerequisites) must be a member of the
hive
group on the Hadoop host
where the replicator will be installed. Without this membership, the
user will be unable to execute Hive queries.
In order to access the generated tables, both staging and the final tables, it is necessary to create a schema definition. The ddlscan tool can be used to read the existing definition of the tables from the source server and generate suitable Hive schema definitions to access the table data.
To create the staging table definition, use the
ddl-mysql-hive-0.10.vm
template; you must specify the JDBC connection string, user, password
and database names. For example:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
-template ddl-mysql-hive-0.10.vm -db test
--
-- SQL generated on Wed Jan 29 16:17:05 GMT 2020 by Tungsten ddlscan utility
--
-- url = jdbc:mysql:thin://host1:13306/test
-- user = tungsten
-- dbName = test
--
CREATE DATABASE test;
DROP TABLE IF EXISTS test.movies_large;
CREATE TABLE test.movies_large
(
id INT ,
title STRING ,
year INT ,
episodetitle STRING )
;
The output from this command should be applied to your Hive installation within the Hadoop cluster. For example, by capturing the output, transferring that file and then running:
shell> cat schema.sql | hive
To create Hive tables that read the staging files loaded by the
replicator, use the
ddl-mysql-hive-0.10-staging.vm
:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host:13306/test' -pass password \
-template ddl-mysql-hive-0.10-staging.vm -db test
The process creates the schema and tables which match the schema and table names on the source database.
Transfer this file to your Hadoop environment and then create the generated schema:
shell> cat schema-staging.sql |hive
The process creates matching schema names, but table names are modified
to include the prefix stage_xxx_
.
For example, for the table
movies_large
a staging table named
stage_xxx_movies_large
is created.
The Hive table definition is created pointing to the external file-based
tables, using the default 0x01
field separator and 0x0A
(newline)
record separator. If different values were used for these in the
configuration, the schema definition in the captured file from
ddlscan should be updated by hand.
The tables should now be available within Hive. For more information on accessing and using the tables, see Section 4.6.4.3, “Accessing Generated Tables in Hive”.