5.5.2. Preparing for Amazon Redshift Replication

Preparing the hosts for the replication process requires setting some key configuration parameters within the MySQL server to ensure that data is stored and written correctly. On the Amazon Redshift side, the database and schema must be created using the existing schema definition so that the databases and tables exist within Amazon Redshift.

Source Host

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors for MySQL extraction, or Chapter 4, Deploying Oracle Extractors for Oracle CDC extraction.

The following are required for replication to Amazon Redshift:

5.5.2.1. Redshift Preparation for Amazon Redshift Deployments

On the Amazon Redshift host, you need to perform some preparation of the destination database, first creating the database, and then creating the tables that are to be replicated. Setting up this process requires the configuration of a number of components outside of Tungsten Replicator in order to support the loading.

  • An existing Amazon Web Services (AWS) account, and the AWS Access Key and Secret Key required to interact with the account through the API.

  • A configured Amazon S3 service. If the S3 service has not already been configured, visit the AWS console and sign up for the Amazon S3 service.

  • The s3cmd installed and configured. The s3cmd can be downloaded from s3cmd on s3tools.org. To configure the command to automatically connect to the Amazon S3 service without requiring further authentication, the .s3cfg in the tungsten users home directory should be configured with the corresponding access key and secret key. For example:

    [default]
    access_key = ACCESS_KEY
    ...
    secret_key = SECRET_KEY
  • Create an S3 bucket that will be used to hold the CSV files that are generated by the replicator. This can be achieved either through the web interface, or via the command-line, for example:

    shell> s3cmd mb s3://tungsten-csv
  • A running Redshift instance must be available, and the port and IP address of the Tungsten Clustering that will be replicating into Redshift must have been added to the Redshift instance security credentials.

    Make a note of the user and password that has been provided with access to the Redshift instance, as these will be needed when installing the applier. Also make a note of the Redshift instance address, as this will need to be provided to the applier configuration.

  • Create an s3-config-servicename.json file based on the sample provided within cluster-home/samples/conf/s3-config-servicename.json within the Tungsten Clustering staging directory.

    Once created, the file will be copied into the /opt/continuent/share directory to be used by the batch applier script.

    If multiple services are being created, one file must be created for each service.

    {
      "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
      "awsAccessKey" : "access-key-id",
      "awsSecretKey" : "secret-access-key",
      "cleanUpS3Files" : "true"
    }

    The allowed options for this file are as follows:

    • awsS3Path — the location within your S3 storage where files should be loaded.

    • awsAccessKey — the S3 access key to access your S3 storage.

    • awsS3Path — the S3 secret key to access your S3 storage.

    • cleanUpS3Files — a boolean value used to identify whether the CSV files loaded into S3 should be deleted after they have been imported and merged. If set to true, the files are automatically deleted once the files have been successfully imported into the Redshift staging tables. If set to false, files are not automatically removed.

    • storeCDCIn — a definition table that stores the change data from the load, in addition to importing to staging and base tables. The {schema} and {table} variables will be automatically replaced with the corresponding schema and table name. For more information on keeping CDC information, see Section 5.5.5, “Keeping CDC Information”.

5.5.2.2. Amazon Redshift DDL Generation for Amazon Redshift Deployments

In order for the data to be written into the Redshift tables, the tables must be generated. Tungsten Replicator does not replicate the DDL statements between the source and applier between heterogeneous deployments due to differences in the format of the DDL statements. The supplied ddlscan tool can translate the DDL from the source database into suitable DDL for the target database.

For each database being replicated, DDL must be generated twice, once for the staging tables where the change data is loaded, and again for the live tables. To generate the necessary DDL:

  1. To generate the staging table DDL, ddlscan must be executed on the source (master) host. After the replicator has been installed, the ddlscan can automatically pick up the configuration to connect to the host, or it can be specified on the command line:

    On the source host for each database that is being replicated, run ddlscan using the ddl-mysql-redshift-staging.vm:

    shell> ddlscan -db test -template ddl-mysql-redshift-staging.vm
    DROP TABLE stage_xxx_test.stage_xxx_msg;
    CREATE TABLE stage_xxx_test.stage_xxx_msg
    (
      tungsten_opcode CHAR(2),
      tungsten_seqno INT,
      tungsten_row_id INT,
      tungsten_commit_timestamp TIMESTAMP,
      id INT,
      msg CHAR(80),
      PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
    );

    Check the output to ensure that no errors have been generated during the process. These may indicate datatype limitations that should be identified before continuing. The generated output should be captured and then executed on the Redshift host to create the table.

  2. Once the staging tables have been created, execute ddlscan again using the base table template, ddl-mysql-redshift.vm:

    shell> ddlscan -db test -template ddl-mysql-redshift.vm
    DROP TABLE test.msg;
    CREATE TABLE test.msg
    (
      id INT,
      msg CHAR(80),
      PRIMARY KEY (id)
    );

    Once again, check the output for errors, then capture the output and execute the generated DDL against the Redshift instance.

The DDL templates translate datatypes as directly as possible, with the following caveats:

  • The length of MySQL VARCHAR length is quadrupled, because MySQL counts characters, while Redshift counts bytes.

  • There is no TIME datatype in Redshift, instead, TIME columns are converted to VARCHAR(17).

  • Primary keys from MySQL are applied into Redshift where possible.

Once the DDL has been generated within the Redshift instance, the replicator will be ready to be installed.