4.2.5. Keeping CDC Information

The Redshift applier can keep the CDC data, that is, the raw CDC CSV data that is recorded and replicated during the loading process, rather than simply cleaning up the CDC files and deleting them. The CDC data can be useful if you want to be able to monitor data changes over time.

The process works as follows:

  1. Batch applier generates CSV files.

  2. Batch applier loads the CSV data into the staging tables.

  3. Batch applier loads the CSV data into the CDC tables.

  4. Staging data is merged with the base table data.

  5. Staging data is deleted.

Unlike the staging and base table information, the data in the CDC tables is kept forever, without removing any of the processed information. Using this data you can report on change information over time for different data sets, or even recreate datasets at a specific time by using the change information.

To enable this feature:

  1. When creating the DDL for the staging and base tables, also create the table information for the CDC data for each table. The actual format of the information is the same as the staging table data, and can be created using ddlscan:

    shell> ddlscan -service my_red -db test \
        -template ddl-mysql-redshift-staging.vm \
        -opt renameSchema cdc_{schema} -opt renameTable {table}_cdc
  2. In the configuration file, s3-config-svc.json for each service, specify the name of the table to be used when storing the CDC information using the storeCDCIn field. This should specify the table template to be used, with the schema and table name being automatically replaced by the load script. The structure should match the structure used by ddlscan to define the CDC tables:

    {
      "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
      "awsAccessKey" : "access-key-id",
      "awsSecretKey" : "secret-access-key",
      "storeCDCIn" : "cdc_{schema}.{table}_cdc"
    }
  3. Restart the replicator using replicator restart to update the configuration.