4.2.2. Preparing for Amazon Redshift Replication

Preparing the hosts for the replication process requires setting some key configuration parameters within the MySQL server to ensure that data is stored and written correctly. On the Amazon Redshift side, the database and schema must be created using the existing schema definition so that the databases and tables exist within Amazon Redshift.

Source Host

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

The following are required for replication to Amazon Redshift:

4.2.2.1. Redshift Preparation for Amazon Redshift Deployments

On the Amazon Redshift host, you need to perform some preparation of the destination database, first creating the database, and then creating the tables that are to be replicated. Setting up this process requires the configuration of a number of components outside of Tungsten Replicator in order to support the loading.

  • An existing Amazon Web Services (AWS) account, and either the AWS Access Key and Secret Key, or configured IAM Roles, required to interact with the account through the API. For information on creating IAM Roles, see Section 4.2.2.2, “Configuring Identity Access Management within AWS”.

  • A configured Amazon S3 service. If the S3 service has not already been configured, visit the AWS console and sign up for the Amazon S3 service.

  • The s3cmd or aws command-line tools, installed and configured. s3cmd can be downloaded from s3tools.org.

    If using s3cmd, configure it to connect to the Amazon S3 service automatically, without requiring further authentication. The .s3cfg file in the tungsten user's home directory should be configured as follows:

    • Using Access Keys:

      [default]
      access_key = ACCESS_KEY
      secret_key = SECRET_KEY

    • Using IAM Roles: leave the values blank and copy the example as is:

      [default]
      access_key = 
      secret_key = 
      security_token =

  • Create an S3 bucket that will be used to hold the CSV files that are generated by the replicator. This can be achieved either through the web interface, or via the command-line, for example:

    shell> s3cmd mb s3://tungsten-csv

  • A running Redshift instance must be available, and the port and IP address of the Tungsten Cluster that will be replicating into Redshift must have been added to the Redshift instance security credentials.

    Make a note of the user and password that have been granted access to the Redshift instance, as these will be needed when installing the applier. Also make a note of the Redshift instance address, as this must be provided in the applier configuration.

  • Create an s3-config-servicename.json file based on the sample provided within cluster-home/samples/conf/s3-config-servicename.json within the Tungsten Replicator staging directory, or using the example below.

    Once created, the file will be copied into the /opt/continuent/share directory to be used by the batch applier script.

    If multiple services are being created, one file must be created for each service.

    The following example shows the use of Access and Secret Keys:

    {
      "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
      "awsAccessKey" : "access-key-id",
      "awsSecretKey" : "secret-access-key",
      "cleanUpS3Files" : "true"
    }

    The following example shows the use of IAM Roles:

    {
      "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
      "awsIAMRole" : "arn:iam-role",
      "cleanUpS3Files" : "true"
    }

    The allowed options for this file are as follows:

    • awsS3Path — the location within your S3 storage where files should be loaded.

    • awsAccessKey — the S3 access key to access your S3 storage. Not required if awsIAMRole is used.

    • awsSecretKey — the S3 secret key associated with the Access Key. Not required if awsIAMRole is used.

    • awsIAMRole — the IAM role configured to allow Redshift to interact with S3. Not required if awsAccessKey and awsSecretKey are in use.

    • multiServiceTarget (true/false) — to indicate if there are multiple appliers writing into the single Redshift Target, for example when the source is Tungsten Cluster Composite Active/Active or a Tungsten Replicator Fan-In Topology (Default: false).

    • singleLockTable (true/false) — to indicate the table lock behaviour when multiServiceTarget is true. Ignored if multiServiceTarget is set to false (Default: true).

    • lockTablePrefix — the prefix for the lock tables when singleLockTable is false. (Default: lock_xxx_)

    • s3Binary — the binary to use for uploading the CSV files to S3. (Valid Values: s3cmd, s4cmd, aws) (Default: s3cmd)

    • cleanUpS3Files — a boolean value used to identify whether the CSV files loaded into S3 should be deleted after they have been imported and merged. If set to true, the files are automatically deleted once the files have been successfully imported into the Redshift staging tables. If set to false, files are not automatically removed.

    • gzipS3Files — setting to true will result in the csv files being gzipped prior to loading into S3 (Default: false)

    • storeCDCIn — a definition table that stores the change data from the load, in addition to importing to staging and base tables. The {schema} and {table} variables will be automatically replaced with the corresponding schema and table name. For more information on keeping CDC information, see Section 4.2.5, “Keeping CDC Information”.
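The options above can be checked before the file is copied into place. The following is a minimal sketch in Python, assuming the option names listed above. The function name check_s3_config and the validation rules themselves (for example, treating awsS3Path as mandatory and expecting an s3:// prefix) are illustrative assumptions, not checks performed by Tungsten Replicator.

```python
import json

# Option names are taken from the list above. The rules applied to them are
# an illustrative reading of that list, not Tungsten Replicator's own checks.
KNOWN_OPTIONS = {
    "awsS3Path", "awsAccessKey", "awsSecretKey", "awsIAMRole",
    "multiServiceTarget", "singleLockTable", "lockTablePrefix",
    "s3Binary", "cleanUpS3Files", "gzipS3Files", "storeCDCIn",
}
VALID_S3_BINARIES = {"s3cmd", "s4cmd", "aws"}


def check_s3_config(text):
    """Return a list of human-readable problems found in the JSON text."""
    config = json.loads(text)
    problems = []

    unknown = sorted(set(config) - KNOWN_OPTIONS)
    if unknown:
        problems.append("unknown options: " + ", ".join(unknown))

    # Assumption: the path is always given as an s3:// location.
    if not str(config.get("awsS3Path", "")).startswith("s3://"):
        problems.append("awsS3Path should be an s3:// location")

    # Either both keys or an IAM role must be supplied.
    has_keys = "awsAccessKey" in config and "awsSecretKey" in config
    has_role = "awsIAMRole" in config
    if not (has_keys or has_role):
        problems.append("provide awsAccessKey and awsSecretKey, or awsIAMRole")

    if config.get("s3Binary", "s3cmd") not in VALID_S3_BINARIES:
        problems.append("s3Binary must be one of s3cmd, s4cmd, aws")

    return problems
```

Passing the contents of an s3-config-servicename.json file to check_s3_config returns an empty list when none of these checks fail. json.loads raises an exception if the file is not valid JSON.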

4.2.2.2. Configuring Identity Access Management within AWS

Identity and Access Management (IAM) within AWS is complex, but it is a useful and secure way of restricting how services interact with each other, and of restricting user access to the AWS platform.

Tungsten Replicator for Redshift requires a certain level of interaction between the replicator and S3, and between Redshift and S3.

Note

All versions up to and including Tungsten Replicator version 6.0 can utilise IAM Roles for uploading the CSV files to S3. However, for loading the data from S3 into Redshift, the only option is to use Access and Secret Keys.

Tungsten Replicator version 6.1 onwards will also allow for the use of IAM Roles for loading data from S3 into Redshift.

To use IAM Roles with Tungsten Replicator you will need to create two roles, with the following recommended policies:

To allow CSV files to be uploaded to S3:

  • Role should be associated with the AWS Service: EC2

  • AWS Defined Policy Name: AmazonS3FullAccess, or

  • Define and create your own policy, with, at minimum, the ability to write to the bucket you intend to use for the Redshift Applier

  • Associate this role to the EC2 instance running the Tungsten Replicator software

For use by the Redshift COPY command to load CSV files into the staging tables:

  • Role should be associated with the AWS Service: Redshift

  • AWS Defined Policy Name: AmazonS3FullAccess, or

  • Define and create your own policy, with, at minimum, the ability to read from the bucket you intend to use for the Redshift Applier

  • Associate this role to the Redshift Cluster.
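As an alternative to AmazonS3FullAccess, the two custom policies described above could be sketched as follows. This is a minimal example in Python that builds IAM policy documents scoped to one bucket. The helper name bucket_policy is hypothetical, and the exact set of S3 actions the replicator and Redshift need is an assumption here (in particular, s3:DeleteObject is included only on the guess that it is needed when cleanUpS3Files is enabled). Verify the actions against the AWS documentation before use.

```python
import json


def bucket_policy(bucket, object_actions, allow_list=False):
    """Build an IAM policy document limited to a single S3 bucket."""
    statements = [{
        "Effect": "Allow",
        "Action": sorted(object_actions),
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }]
    if allow_list:
        # s3:ListBucket applies to the bucket itself, not to its objects.
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket}",
        })
    return {"Version": "2012-10-17", "Statement": statements}


# "tungsten-csv" is the example bucket name used earlier in this section.
# Role for the EC2 instance: write the CSV files (delete is an assumption).
upload_policy = bucket_policy(
    "tungsten-csv", ["s3:PutObject", "s3:DeleteObject"], allow_list=True
)
# Role for the Redshift cluster: read the CSV files for COPY.
load_policy = bucket_policy("tungsten-csv", ["s3:GetObject"], allow_list=True)

print(json.dumps(load_policy, indent=2))
```

The printed JSON can be pasted into the policy editor in the AWS console when creating the custom policy for the corresponding role.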

Note

For more details and full instructions on creating and managing IAM roles, review the AWS documentation.

4.2.2.3. Amazon Redshift DDL Generation for Amazon Redshift Deployments

In order for the data to be written into the Redshift tables, the tables must first be created. Tungsten Replicator does not replicate DDL statements between the source and applier in heterogeneous deployments, due to differences in the format of the DDL statements. The supplied ddlscan tool can translate the DDL from the source database into suitable DDL for the target database.

For each database being replicated, DDL must be generated twice, once for the staging tables where the change data is loaded, and again for the live tables. To generate the necessary DDL:

  1. To generate the staging table DDL, ddlscan must be executed on the Extractor host. After the replicator has been installed, ddlscan can automatically pick up the configuration to connect to the host, or it can be specified on the command line.

    On the source host, for each database that is being replicated, run ddlscan using the ddl-mysql-redshift-staging.vm template:

    shell> ddlscan -db test -template ddl-mysql-redshift-staging.vm
    DROP TABLE stage_xxx_test.stage_xxx_msg;
    CREATE TABLE stage_xxx_test.stage_xxx_msg
    (
      tungsten_opcode CHAR(2),
      tungsten_seqno INT,
      tungsten_row_id INT,
      tungsten_commit_timestamp TIMESTAMP,
      id INT,
      msg CHAR(80),
      PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
    );

    Check the output to ensure that no errors have been generated during the process. These may indicate datatype limitations that should be identified before continuing. The generated output should be captured and then executed on the Redshift host to create the table.

  2. Once the staging tables have been created, execute ddlscan again using the base table template, ddl-mysql-redshift.vm:

    shell> ddlscan -db test -template ddl-mysql-redshift.vm
    DROP TABLE test.msg;
    CREATE TABLE test.msg
    (
      id INT,
      msg CHAR(80),
      PRIMARY KEY (id)
    );

    Once again, check the output for errors, then capture the output and execute the generated DDL against the Redshift instance.

The DDL templates translate datatypes as directly as possible, with the following caveats:

  • MySQL VARCHAR lengths are quadrupled, because MySQL counts characters, while Redshift counts bytes.

  • There is no TIME datatype in Redshift; instead, TIME columns are converted to VARCHAR(17).

  • Primary keys from MySQL are applied into Redshift where possible.
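The first two caveats can be illustrated with a short Python sketch. The function name redshift_column_type is hypothetical. It only mirrors the two documented rules, does not reproduce the ddlscan templates, and passes all other types through unchanged.

```python
import re


def redshift_column_type(mysql_type):
    """Apply the two documented caveats to a MySQL column type string."""
    normalized = mysql_type.strip().upper()

    match = re.fullmatch(r"VARCHAR\((\d+)\)", normalized)
    if match:
        # MySQL lengths are in characters; Redshift lengths are in bytes.
        return f"VARCHAR({int(match.group(1)) * 4})"

    if re.fullmatch(r"TIME(\(\d+\))?", normalized):
        # TIME columns are stored as text on the Redshift side.
        return "VARCHAR(17)"

    # Everything else is passed through unchanged in this sketch.
    return normalized


for column_type in ("VARCHAR(80)", "TIME", "INT"):
    print(column_type, "->", redshift_column_type(column_type))
```

Because the length is multiplied by four, wide MySQL VARCHAR columns are worth reviewing in the generated DDL: Redshift limits VARCHAR columns to 65535 bytes.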

Once the DDL has been generated within the Redshift instance, the replicator will be ready to be installed.

4.2.2.4. Handling Concurrent Writes from Multiple Appliers

Version Support: 6.1.4

The features outlined in this section were specifically introduced in Tungsten Replicator 6.1.4.

Redshift only supports the SERIALIZABLE transaction isolation level. This differs from relational databases such as MySQL, which uses REPEATABLE READ by default. Isolation levels determine the behaviour of the database for concurrent access to the tables within transactions.

When loading data into Redshift from multiple appliers, this isolation level can cause locking issues that manifest as errors in the replicator log similar to the following:

Detail: Serializable isolation violation on table - 150379, transactions forming the cycle are: 2356786, 2356787 
» (pid:17914) (../../tungsten-replicator//appliers/batch/redshift.js#219)

In some cases, the replicator will simply retry and carry on successfully, but on very busy systems this can sometimes cause the replicator to fall back into an OFFLINE:ERROR state and manual intervention would be required.
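To judge how often these errors occur, and which tables are involved, the replicator log can be scanned for the message shown above. The following Python sketch assumes the message wording matches the sample line exactly. The function name find_violations is hypothetical, and the sample lines are adapted from the example above rather than taken from a real log.

```python
import re

# Pattern derived from the sample log line above; other replicator or
# Redshift versions may word the message differently.
VIOLATION = re.compile(
    r"Serializable isolation violation on table - (\d+), "
    r"transactions forming the cycle are: (\d+(?:, \d+)*)"
)


def find_violations(log_lines):
    """Return (table_id, [transaction_ids]) for each violation found."""
    found = []
    for line in log_lines:
        match = VIOLATION.search(line)
        if match:
            table_id = int(match.group(1))
            transactions = [int(t) for t in match.group(2).split(", ")]
            found.append((table_id, transactions))
    return found


sample = [
    "Detail: Serializable isolation violation on table - 150379, "
    "transactions forming the cycle are: 2356786, 2356787 (pid:17914)",
    "an unrelated log line",
]
print(find_violations(sample))
```

In practice, the lines would be read from the replicator log file with open(). A table ID that appears repeatedly suggests a table that several appliers write to at the same time.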

To overcome this problem, the first step is to ensure that each applier has its own set of staging tables that the CSV files are loaded into. By default, all staging tables are named with the prefix stage_xxx_.

To generate the staging tables, you would typically run a ddlscan command similar to the following:

shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/ 
  » -db hr -template ddl-mysql-redshift-staging.vm > staging.sql

To change the default prefix of the staging tables, for example to stage_nyc_, provide the tablePrefix option to the ddlscan command as follows:

shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/ 
  » -db hr -template ddl-mysql-redshift-staging.vm -opt tablePrefix stage_nyc_ > staging.sql

You would need to execute this for each applier, changing the prefix accordingly. Once this has been executed and the tables have been built in Redshift, add the following property to each applier to specify which staging tables it should use. The property should be added to the tungsten.ini file, followed by a tpm update:

property=replicator.applier.dbms.stageTablePrefix=stage_nyc_
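When several appliers are involved, the per-applier command and property can be generated together so that the prefixes stay consistent. The following Python sketch only builds the strings and does not run ddlscan. The function name staging_setup, the second prefix stage_lon_ and the output file names are hypothetical, and the connection details are the example values used above.

```python
import shlex


def staging_setup(prefix, db, user="tungsten", password="secret",
                  url="jdbc:mysql:thin://db01:3306/"):
    """Return the ddlscan command and tungsten.ini line for one applier."""
    argv = [
        "ddlscan", "-user", user, "-pass", password, "-url", url,
        "-db", db, "-template", "ddl-mysql-redshift-staging.vm",
        "-opt", "tablePrefix", prefix,
    ]
    # Quote each argument so the command is safe to paste into a shell.
    command = " ".join(shlex.quote(arg) for arg in argv)
    ini_line = f"property=replicator.applier.dbms.stageTablePrefix={prefix}"
    return command, ini_line


# One prefix per applier; stage_lon_ is a made-up second example.
for prefix in ("stage_nyc_", "stage_lon_"):
    command, ini_line = staging_setup(prefix, "hr")
    print(f"{command} > staging_{prefix.strip('_')}.sql")
    print(ini_line)
```

Each printed command writes the staging DDL for one applier to its own file, and the matching property line belongs in that applier's tungsten.ini.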

4.2.2.4.1. Increase load rates

The first and easiest step to try to overcome the isolation errors is to increase the block commit size and the block commit interval. Each system works differently, so there is no simple calculation to find the right level. These values should be adjusted in small increments to find the right balance for your system.

Within your configuration, adjust the following two parameters:

  • svc-block-commit-size

  • svc-block-commit-interval

4.2.2.4.2. Enable Transaction Locking

Within the Redshift applier, it is possible to introduce table locking. This will enable multiple appliers to process their own THL and load the transactions without impacting, or being impacted by, other appliers.

This configuration should only be used when multiple appliers are in use. It must also be recognised that table locking could introduce latency in applying to Redshift on extremely busy systems, and could prevent client applications from reading the tables, due to Redshift's isolation level. To mitigate this, table locking should be combined with an increase in the block commit size and block commit interval properties mentioned above.

There are two table locking approaches; which one is better for you depends upon your environment.

  • Single Lock Table: This approach should be used for appliers in extremely busy systems where a block-commit-size of 500000 or greater does not eliminate isolation errors and where multiple tables are updated within each transaction.

  • One Lock Table per Base Table: This approach should be used for appliers in less busy systems, or where parallel apply has been enabled within the applier, regardless of system activity levels.

To enable the single lock table approach:

  • The following option should be added to the s3-config-servicename.json file:

    "multiServiceTarget": "true"

  • Connect to Redshift with the same account used by the applier, and using the DDL below, create the lock table:

    CREATE TABLE public.tungsten_lock_table
    (
      ID  INT
    );

To enable the lock table per base table approach:

  • The following options should be added to the s3-config-servicename.json file:

    "multiServiceTarget": "true",
    "singleLockTable": "false"

  • Create a lock table for each of the base tables within Redshift. A ddlscan template can be used to generate the DDL. In the following example, the ddlscan command generates lock table DDL for all tables within the hr schema:

    shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/ 
      » -db hr -template ddl-mysql-redshift-lock.vm > outfile.sql

    Execute the output from ddlscan against Redshift.
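The JSON changes for either approach can also be applied programmatically to an existing configuration. The following Python sketch assumes the file has already been loaded into a dictionary. The function name enable_locking is hypothetical, and it changes only the two options described above.

```python
import json


def enable_locking(config, per_table=False):
    """Return a copy of an s3-config dict with applier table locking enabled."""
    updated = dict(config)
    updated["multiServiceTarget"] = "true"
    if per_table:
        # One lock table per base table.
        updated["singleLockTable"] = "false"
    else:
        # singleLockTable defaults to true, so drop any explicit override.
        updated.pop("singleLockTable", None)
    return updated


# Sample values taken from the IAM Roles example earlier in this section.
base = {
    "awsS3Path": "s3://your-bucket-for-redshift/redshift-test",
    "awsIAMRole": "arn:iam-role",
    "cleanUpS3Files": "true",
}
print(json.dumps(enable_locking(base, per_table=True), indent=2))
```

The function returns a new dictionary and leaves the original untouched, so the result can be compared with the current file before it is written back.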

After enabling either of the above methods, if replication has already been installed, restart the replicator by issuing the following:

shell> replicator restart