5.6.5. Staging Tables

Staging tables are intermediate tables that hold data while it is being loaded into the data warehouse. There are several usage patterns for staging tables, which are described in the following sections.

5.6.5.1. Staging Table Names

Tungsten assumes that staging tables, if present, follow certain conventions for naming and provides a number of configuration properties for generating staging table names that match the base tables in the data warehouse without colliding with them.

Property            Description
stageColumnPrefix   Prefix for seqno, row_id, and opcode columns generated by Tungsten
stageTablePrefix    Prefix for stage table name
stageSchemaPrefix   Prefix for the schema in which the stage tables reside

These values are set in the static properties file that defines the replication service. They can be set at install time using --property options. The following example shows typical values from a service properties file.

replicator.applier.dbms.stageColumnPrefix=tungsten_
replicator.applier.dbms.stageTablePrefix=stage_xxx_
replicator.applier.dbms.stageSchemaPrefix=load_

If your data warehouse contains a table named foo in schema bar, these properties would result in a staging table named load_bar.stage_xxx_foo. The Tungsten-generated column containing the seqno, if present, would be named tungsten_seqno.
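
If the values are set at install time instead, each one is passed as a --property option. The following sketch shows the equivalent options; the surrounding installation command and its other arguments are omitted.

--property=replicator.applier.dbms.stageColumnPrefix=tungsten_
--property=replicator.applier.dbms.stageTablePrefix=stage_xxx_
--property=replicator.applier.dbms.stageSchemaPrefix=load_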

Note

Staging tables are by default in the same schema as the table they update. You can put them in a different schema using the stageSchemaPrefix property as shown in the example.

5.6.5.2. Whole Record Staging

Whole record staging loads the entire CSV file into an identical table, then runs queries to apply rows to the base table or tables in the data warehouse. One of the strengths of whole record staging is that it allows you to construct a merge script that can handle any combination of INSERT, UPDATE, or DELETE operations. A weakness is that whole record staging can result in sub-optimal I/O for workloads that consist mostly of INSERT operations.

For example, suppose we have a base table created by the following CREATE TABLE command:

CREATE TABLE `mydata` (
`id` int(11) NOT NULL,
`f_data` float DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

A whole record staging table would look as follows.

CREATE TABLE `stage_xxx_croc_mydata` (
`tungsten_opcode` char(1) DEFAULT NULL,
`tungsten_seqno` int(11) DEFAULT NULL,
`tungsten_row_id` int(11) DEFAULT NULL,
`id` int(11) NOT NULL,
`f_data` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Note that this table does not have a primary key defined. Most data warehouses do not use primary keys, and many of them do not even permit them in the CREATE TABLE syntax.

Note also that the non-primary-key columns must permit NULL values. This is required for deletes, which contain only the Tungsten-generated columns plus the primary key.
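
To illustrate how the staging table is used, the following is a minimal merge sketch for the mydata example above, written in MySQL syntax. It assumes a batch of rows has already been loaded into the staging table, that the opcode column marks deletes with 'D', and that rows within a batch are ordered by (tungsten_seqno, tungsten_row_id); the merge scripts used for a particular data warehouse differ in detail.

-- Remove any base rows whose keys appear in the staged batch.
DELETE FROM mydata
 WHERE id IN (SELECT DISTINCT id FROM stage_xxx_croc_mydata);

-- Re-insert the last staged image of each key, unless that image is a delete.
INSERT INTO mydata (id, f_data)
SELECT s.id, s.f_data
  FROM stage_xxx_croc_mydata s
 WHERE s.tungsten_opcode <> 'D'
   AND NOT EXISTS (
         SELECT 1
           FROM stage_xxx_croc_mydata later
          WHERE later.id = s.id
            AND (later.tungsten_seqno > s.tungsten_seqno
                 OR (later.tungsten_seqno = s.tungsten_seqno
                     AND later.tungsten_row_id > s.tungsten_row_id)));

-- Clear the staging table for the next batch.
TRUNCATE TABLE stage_xxx_croc_mydata;

Because the base rows for every staged key are removed first and only the final image of each key is re-inserted, the same pass handles any combination of INSERT, UPDATE, and DELETE operations.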

5.6.5.3. Delete Key Staging

Another approach is to load INSERT rows directly into the base data warehouse tables without staging. All you need to stage is the keys for deleted records. This reduces I/O considerably for workloads that consist mostly of inserts. The downside is that it may introduce ordering dependencies between DELETE and INSERT operations, which require special handling by upstream applications to generate transactions that will load without conflicts.

Delete key staging tables can be as simple as the following example:

CREATE TABLE `stage_xxx_croc_mydata` (
`id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
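
A load cycle with this layout might apply the staged delete keys to the base table and then load the INSERT rows directly from the batch CSV file. The following sketch uses MySQL syntax and a hypothetical CSV file path, and assumes that the deletes for a batch can safely be applied before its inserts.

-- Apply the staged delete keys to the base table.
DELETE FROM mydata
 WHERE id IN (SELECT id FROM stage_xxx_croc_mydata);

-- Load INSERT rows straight into the base table from the batch CSV file.
-- The file path is a placeholder.
LOAD DATA INFILE '/tmp/staging/mydata.csv'
  INTO TABLE mydata
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  (id, f_data);

-- Clear the delete-key staging table for the next batch.
TRUNCATE TABLE stage_xxx_croc_mydata;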

5.6.5.4. Staging Table Generation

Tungsten does not generate staging tables automatically. Creating staging tables is the responsibility of the user, but the process can be simplified by using the ddlscan tool with an appropriate template.
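
For example, staging table DDL for a MySQL target might be generated along the following lines. The connection details are placeholders, and the ddl-mysql-staging.vm template name is an assumption that depends on the templates shipped with your Tungsten version.

ddlscan -user tungsten -pass secret \
    -url jdbc:mysql://dbhost:3306/bar \
    -db bar -template ddl-mysql-staging.vm > staging_tables.sql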