6.4.1. Hadoop Replication Operation
constructs change data from the source-database, and uses this information
in combination with any existing data to construct, using Hive, a
materialized view. A summary of this basic structure can be seen in
Figure 6.4, “Topologies: MySQL to Hadoop Replication Operation”.
Figure 6.4. Topologies: MySQL to Hadoop Replication Operation
The full replication of information operates as follows:
Data is extracted from the source database using the standard
extractor, for example by reading the row change data from the binlog
colnames filter is used to
extract column name information from the database. This enables the
row-change information to be tagged with the corresponding column
information. The data changes, and corresponding row names, are stored
in the THL.
pkey filter is used to
extract primary key data from the source tables.
On the slave replicator, the THL data is read and written into
batch-files in the character-separated value format.
The information in these files is change data, and contains not only
the original data, but also metadata about the operation performed
UPDATE, and the primary key of
for each table. All
statements are recorded as a
DELETE of the existing data,
INSERT of the new data.
A second process uses the CSV stage data and any existing data, to
build a materialized view that mirrors the source table data
The staging files created by the replicator are in a specific format that
incorporates change and operation information in addition to the original
The format of the files is a character separated values file, with
each row separated by a newline, and individual fields separated by
0x01. This is
supported by Hive as a native value separator.
The content of the file consists of the full row data extracted from
the master, plus metadata describing the operation for each row, the
sequence number, and then the full row information.
Table-specific primary key
I (Insert) or D (Delete)
SEQNO that generated this row
Unique row ID within the batch
The commit timestamp of the original tranaction, which can be used
|| || |
For example, the MySQL row:
| 3 | #1 Single | 2006 | Cats and Dogs (#1.4) |
Is represented within the staging files generated as:
I^A1318^A1^A2017-06-07 09:22:28.000^A3^A3^A#1 Single^A2006^ACats and Dogs (#1.4)
The character separator, and whether to use quoting, are configurable
within the replicator when it is deployed. The default is to use a newline
character for records, and the
character for fields. For more information on these fields and how they
can be configured, see
Section 7.2.7, “Supported CSV Formats”.
On the Hadoop host, information is stored into a number of locations
within the HDFS during the data transfer:
Table 6.2. Hadoop Replication Directory Locations
Top-level directory for Tungsten Replicator information
Location for metadata related to the replication operation
The directory (named after the servicename of the replicator
service) that holds service-specific metadata
Directory of the data transferred
Directory of the data transferred from a specific servicename.
Directory of the data transferred specific to a database.
Directory of the data transferred specific to a table.
Filename of a single file of the data transferred for a specific
table and database.
Files are automatically created, named according to the parent table name,
and the starting Tungsten Replicator sequence number for each file that is
transferred. The size of the files is determined by the batch and commit
parameters. For example, in the truncated list of files below displayed
using the hadoop fs command,
hadoop fs -ls /user/tungsten/staging/hadoop/chicago
Found 66 items
-rw-r--r-- 3 cloudera cloudera 1270236 2014-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-10.csv
-rw-r--r-- 3 cloudera cloudera 10274189 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-103.csv
-rw-r--r-- 3 cloudera cloudera 1275832 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-104.csv
-rw-r--r-- 3 cloudera cloudera 1275411 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-105.csv
-rw-r--r-- 3 cloudera cloudera 10370471 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-113.csv
-rw-r--r-- 3 cloudera cloudera 1279435 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-114.csv
-rw-r--r-- 3 cloudera cloudera 2544062 2014-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-12.csv
-rw-r--r-- 3 cloudera cloudera 11694202 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-123.csv
-rw-r--r-- 3 cloudera cloudera 1279072 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-124.csv
-rw-r--r-- 3 cloudera cloudera 2570481 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-126.csv
-rw-r--r-- 3 cloudera cloudera 9073627 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-133.csv
-rw-r--r-- 3 cloudera cloudera 1279708 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-134.csv
The individual file numbers will not be sequential, as they will depend on
the sequence number, batch size and range of tables transferred.