During the replication process, data is exchanged from the MySQL database/table/row structure into corresponding Hadoop directory and files, as shown in the table below:
|Table||Hive-compatible Character-Separated Text file|
Line in the text file, fields terminated by character
The Hadoop environment should have the following features and parameters for the most efficient operation:
There must be enough disk storage for the change data, data being actively merged, and the live data for the replicated information. Depending on the configuration and rate of changes in the master, the required data space will fluctuate.
For example, replicating a 10GB dataset, and 5GB of change data during replication, will require at least 30GB of storage. 10GB for the original dataset, 5GB of change data, and 10-25GB of merged data. The exact size is dependent on the quantity of inserts/updates/deletes.
Currently, deployment of the slave to a relay host is not supported. One host within the Hadoop cluster must be chosen to act as the slave.
The prerequisites for a standard Tungsten Replicator should be followed, including:
This will provide the base environment into which Tungsten Replicator can be installed.
must be writable by the replicator user within HDFS:
hadoop fs -mkdir /user/tungstenshell>
hadoop fs -chmod 700 /user/tungstenshell>
hadoop fs -chown tungsten /user/tungsten
These commands should be executed by a user with HDFS administration
rights (e.g. the
Replicator User Group Membership
The user that will be executing the replicator (typically
tungsten, as recommended in
the Appendix C, Prerequisites) must be a member of the
hive group on the Hadoop host
where the replicator will be installed. Without this membership, the
user will be unable to execute Hive queries.
In order to access the generated tables, both staging and the final tables, it is necessary to create a schema definition. The ddlscan tool can be used to read the existing definition of the tables from the source server and generate suitable Hive schema definitions to access the table data.
To create the staging table definition, use the
template; you must specify the JDBC connection string, user, password
and database names. For example:
ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \ -template ddl-mysql-hive-0.10.vm -db test-- -- SQL generated on Wed Jan 29 16:17:05 GMT 2014 by Tungsten ddlscan utility -- -- url = jdbc:mysql:thin://host1:13306/test -- user = tungsten -- dbName = test -- CREATE DATABASE test; DROP TABLE IF EXISTS test.movies_large; CREATE TABLE test.movies_large ( id INT , title STRING , year INT , episodetitle STRING ) ;
The output from this command should be applied to your Hive installation within the Hadoop cluster. For example, by capturing the output, transferring that file and then running:
cat schema.sql | hive
To create Hive tables that read the staging files loaded by the
replicator, use the
ddlscan -user tungsten -url 'jdbc:mysql:thin://host:13306/test' -pass password \ -template ddl-mysql-hive-0.10-staging.vm -db test
The process creates the schema and tables which match the schema and table names on the source database.
Transfer this file to your Hadoop environment and then create the generated schema:
cat schema-staging.sql |hive
The process creates matching schema names, but table names are modified
to include the prefix
For example, for the table
movies_large a staging table named
stage_xxx_movies_large is created.
The Hive table definition is created pointing to the external file-based
tables, using the default
field separator and
record separator. If different values were used for these in the
configuration, the schema definition in the captured file from
ddlscan should be updated by hand.
The tables should now be available within Hive. For more information on accessing and using the tables, see Section 126.96.36.199, “Accessing Generated Tables in Hive”.