Staging tables are intermediate tables that help with data loading. There are different usage patterns for staging tables.
Tungsten assumes that staging tables, if present, follow certain conventions for naming and provides a number of configuration properties for generating staging table names that match the base tables in the data warehouse without colliding with them.
Property | Description |
---|---|
stageColumnPrefix
| Prefix for seqno, row_id, and opcode columns generated by Tungsten |
stageTablePrefix
| Prefix for stage table name |
stageSchemaPrefix
| Prefix for the schema in which the stage tables reside |
These values are set in the static properties file that defines the
replication service. They can be set at install time using
--property
options. The following
example shows typical values from a service properties file.
replicator.applier.dbms.stageColumnPrefix=tungsten_ replicator.applier.dbms.stageTablePrefix=stage_xxx_ replicator.applier.dbms.stageSchemaPrefix=load_
If your data warehouse contains a table named
foo
in schema
bar
, these properties would result
in a staging table name of
load_bar.stage_xxx_foo
for the
staging table. The Tungsten generated column containing the
seqno
, if present, would be named
tungsten_seqno
.
Staging tables are by default in the same schema as the table they
update. You can put them in a different schema using the
stageSchemaPrefix
property as shown in the example.
Whole record staging loads the entire CSV file into an identical table,
then runs queries to apply rows to the base table or tables in the data
warehouse. One of the strengths of whole record staging is that it
allows you to construct a merge script that can handle any combination
of INSERT
,
UPDATE
, or
DELETE
operations. A weakness is
that whole record staging can result in sub-optimal I/O for workloads
that consist mostly of INSERT
operations.
For example, suppose we have a base table created by the following
CREATE TABLE
command:
CREATE TABLE `mydata` ( `id` int(11) NOT NULL, `f_data` float DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
A whole record staging table would look as follows.
CREATE TABLE `stage_xxx_croc_mydata` ( `tungsten_opcode` char(1) DEFAULT NULL, `tungsten_seqno` int(11) DEFAULT NULL, `tungsten_row_id` int(11) DEFAULT NULL, `id` int(11) NOT NULL, `f_data` float DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Note that this table does not have a primary key defined. Most data warehouses do not use primary keys and many of them do not even permit it in the create table syntax.
Note also that the non-primary columns must permit nulls. This is required for deletes, which contain only the Tungsten generated columns plus the primary key.
Another approach is to load
INSERT
rows directly into the
base data warehouse tables without staging. All you need to stage is the
keys for deleted records. This reduces I/O considerably for workloads
that have mostly inserts. The downside is that it may require introduce
ordering dependencies between
DELETE
and
INSERT
operations that require
special handling by upstream applications to generate transactions that
will load without conflicts.
Delete key staging tables can be as simple as the follow example:
CREATE TABLE `stage_xxx_croc_mydata` ( `id` int(11) NOT NULL, ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Tungsten does not generate staging tables automatically. Creation of staging tables is the responsibility of users, but using the ddlscan tool with the right template can be simplified.