The batch applier loads data into the slave DBMS using CSV files and
appropriate load commands such as LOAD DATA INFILE.
Here is the basic algorithm.
While executing within a commit block, we write incoming transactions into
open CSV files maintained by a dedicated CSV writer class.
There is one CSV file per database table; a sample line appears after the
column descriptions below. Tungsten adds three extra column values to each
line of CSV output:
- A transaction opcode that has the value "I" for insert and "D" for
  delete. Other opcode types are available.
- The Tungsten transaction sequence number.
- A line number that starts at 1 and increments by 1 for each new row
  in the transaction.
- The commit timestamp, i.e. the origin timestamp of the committed
  statement that generated the row information.
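
For illustration, a single insert of a hypothetical row (42, 'hello') into
a table foo(id, msg) might produce a CSV line like the following; the table,
values, and timestamp are assumptions, not captured output:

    "I","4","1","2013-11-01 09:15:02.000","42","hello"

Reading left to right: the opcode, the Tungsten sequence number, the line
number within the transaction, the commit timestamp, and then the row
values.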
Different update types are handled as follows:
- Each insert generates a single row containing all values in the row,
  with an "I" opcode.
- Each delete generates a single row with the key and a "D" opcode.
  Non-key fields are null.
- Each update results in a delete with the row key followed by an insert
  containing the full new row (encoded as shown in the sample after this
  list).
- Statements are ignored. If you want DDL applied, you need to issue it
  yourself.
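
Continuing the hypothetical foo(id, msg) example, an update of row 42 from
'hello' to 'goodbye' would be encoded as a delete line followed by an
insert line; note the empty trailing field on the delete line, since
non-key fields are null:

    "D","5","1","2013-11-01 09:16:10.000","42",
    "I","5","2","2013-11-01 09:16:10.000","42","goodbye"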
Tungsten writes each row update into the CSV file for the corresponding
table. At commit time the following steps occur:
1. Flush and close each CSV file. This ensures that the files are fully
   visible in storage even if there is a later failure.
2. For each table, execute a merge script to move the data from CSV into
   the data warehouse. This script varies depending on the data warehouse
   type, or even for specific applications. It generally consists of a
   sequence of operating system commands, load commands like
   LOAD DATA INFILE to load the CSV data, and ordinary SQL commands to
   move and massage data.
3. When all tables are loaded, issue a single commit on the SQL
   connection.
The main requirement of merge scripts is that they must ensure that rows
load and that delete and insert operations apply in the correct order.
Tungsten includes load scripts for MySQL and Vertica that do this
automatically. A rough sketch of the general pattern appears below.
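
As a hedged illustration of that pattern for MySQL, the sketch below
assumes a hypothetical base table foo(id, msg), a staging table stage_foo
holding the three Tungsten metadata columns plus the row values, and an
invented file path; Tungsten's bundled scripts differ in detail:

    -- Load the CSV batch into a staging table that mirrors the base
    -- table plus the Tungsten metadata columns.
    LOAD DATA INFILE '/tmp/staging/foo.csv'
    INTO TABLE stage_foo
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (opcode, seqno, row_id, commit_ts, id, msg);

    -- Apply deletes first: remove every key touched by the batch so
    -- that updates (a D row followed by an I row) net out correctly.
    DELETE FROM foo
    WHERE id IN (SELECT id FROM stage_foo);

    -- Re-insert only rows whose final operation in the batch is an
    -- insert, using the line number to find the last image per key.
    INSERT INTO foo (id, msg)
    SELECT s.id, s.msg
    FROM stage_foo s
    JOIN (SELECT id, MAX(row_id) AS last_row
          FROM stage_foo GROUP BY id) t
      ON s.id = t.id AND s.row_id = t.last_row
    WHERE s.opcode = 'I';

    -- Empty the staging table with DELETE rather than TRUNCATE so the
    -- whole batch stays inside the single final commit.
    DELETE FROM stage_foo;

    -- The single commit covering the whole batch (step 3 above).
    COMMIT;

Running the delete phase before the insert phase is what makes the
D-then-I encoding of updates behave as a row replacement.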
It is common to use staging tables to help load data. These are described
in more detail in a later section.