5.6.1. How It Works

The batch applier loads data into the Target DBMS using CSV files and appropriate load commands like LOAD DATA INFILE or COPY. Here is the basic algorithm.

While executing within a commit block, we write incoming transactions into open CSV files written by the class CsvWriter. There is one CSV file per database table. The following sample shows typical contents.

"I","84900","1","2016-03-11 20:51:10.000","986","http://www.continent.com/software"
"D","84901","2","2016-03-11 20:51:10.000","143",null
"I","84901","3","2016-03-11 20:51:10.000","143","http://www.microsoft.com"

Tungsten adds four extra column values to each line of CSV output.

Column Description
opcode A transaction code that has the value "I" for insert and "D" for delete. Other types are available.
seqno The Tungsten transaction sequence number
row_id A line number that starts with 1 and increments by 1 for each new row
timestamp The commit timestamp, i.e. the origin timestamp of the committed statement that generated the row information.

Different update types are handled as follows:

  • Each insert generates a single row containing all values in the row with an "I" opcode.

  • Each delete generates a single row with the key and a "D" opcode. Non-key fields are null.

  • Each update results in a delete with the row key followed by an insert.

  • Statements are ignored. If you want DDL you need to put it in yourself.

Tungsten writes each row update into the corresponding CSV file for the SQL. At commit time the following steps occur:

  1. Flush and close each CSV file. This ensures that if there is a failure the files are fully visible in storage.

  2. For each table execute a merge script to move the data from CSV into the data warehouse. This script varies depending on the data warehouse type or even for specific application. It generally consists of a sequence of operating system commands, load commands like COPY or LOAD DATA INFILE to load in the CSV data, and ordinary SQL commands to move/massage data.

  3. When all tables are loaded, issue a single commit on the SQL connection.

The main requirement of merge scripts is that they must ensure rows load and that delete and insert operations apply in the correct order. Tungsten includes load scripts for MySQL and Vertica that do this automatically.

It is common to use staging tables to help load data. These are described in more detail in a later section.