The batch applier loads data into the Target DBMS using CSV files and
appropriate load commands like LOAD DATA
INFILE
or COPY
.
Here is the basic algorithm.
While executing within a commit block, we write incoming transactions into
open CSV files written by the class CsvWriter
.
There is one CSV file per database table. The following sample shows
typical contents.
"I","84900","1","2016-03-11 20:51:10.000","986","http://www.continent.com/software" "D","84901","2","2016-03-11 20:51:10.000","143",null "I","84901","3","2016-03-11 20:51:10.000","143","http://www.microsoft.com"
Tungsten adds four extra column values to each line of CSV output.
Column | Description |
---|---|
opcode
| A transaction code that has the value "I" for insert and "D" for delete. Other types are available. |
seqno
| The Tungsten transaction sequence number |
row_id
| A line number that starts with 1 and increments by 1 for each new row |
timestamp
| The commit timestamp, i.e. the origin timestamp of the committed statement that generated the row information. |
Different update types are handled as follows:
Each insert generates a single row containing all values in the row with an "I" opcode.
Each delete generates a single row with the key and a "D" opcode. Non-key fields are null.
Each update results in a delete with the row key followed by an insert.
Statements are ignored. If you want DDL you need to put it in yourself.
Tungsten writes each row update into the corresponding CSV file for the SQL. At commit time the following steps occur:
Flush and close each CSV file. This ensures that if there is a failure the files are fully visible in storage.
For each table execute a merge script to move the data from CSV into
the data warehouse. This script varies depending on the data warehouse
type or even for specific application. It generally consists of a
sequence of operating system commands, load commands like
COPY
or
LOAD DATA INFILE
to load in
the CSV data, and ordinary SQL commands to move/massage data.
When all tables are loaded, issue a single commit on the SQL connection.
The main requirement of merge scripts is that they must ensure rows load and that delete and insert operations apply in the correct order. Tungsten includes load scripts for MySQL and Vertica that do this automatically.
It is common to use staging tables to help load data. These are described in more detail in a later section.