3.7. Replicating from a Cluster to a Datawarehouse

You can replicate data from an existing cluster out to a datawarehouse solution, such as Hadoop or Vertica, by changing the configuration of your cluster and attaching a replication slave that handles the datawarehouse loading. The following modifications to the cluster are required so that the information is provided in the correct format for replication to the datawarehouse.

  • MySQL nodes should be configured to use row-based replication throughout.

    This is required because replication to the datawarehouse environment must send the raw row data, rather than SQL statements, which cannot be applied directly to a target datawarehouse. A minimal configuration fragment is sketched after this list.

  • Heterogeneous mode should be enabled within the cluster.

    Heterogeneous mode enables a number of filters that add the information required for the data to be replicated into the datawarehouse correctly, such as column names, primary key details, and SET and ENUM translation.

  • The replication slave, which replicates into the datawarehouse, should be configured in heterogeneous mode.

    This enables certain filters and processing that ensure the THL and the replicated content are formatted correctly.
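
For reference, a minimal my.cnf fragment that enables row-based logging on each MySQL node might look like the following; the log base name and server ID are illustrative, and the exact settings depend on your MySQL installation:

  [mysqld]
  # Row-based logging sends raw row images rather than SQL statements
  binlog_format = ROW
  # Binary logging must be enabled; the log base name is illustrative
  log-bin = mysql-bin
  # Each node requires a unique server ID
  server-id = 1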

The deployment works by configuring the cluster replication service in heterogeneous mode, and then replicating out to the slave that writes into the datawarehouse by using a cluster alias. The alias ensures that replication to your chosen datawarehouse continues effectively even when the cluster topology changes, for example when a master switch occurs during a failover.

Figure 3.6. Topologies: Replication from Cluster to Datawarehouse

In order to configure this, there are three steps:

  1. Create a cluster, or upgrade an existing cluster, to operate in heterogeneous mode

  2. Configure an alias to the existing Continuent Tungsten service

  3. Configure a replicator that points to the alias acting as a slave that writes into your chosen datawarehouse

There are also the following requirements:

  • Both Continuent Tungsten and Tungsten Replicator must be version 4.0 or later.

  • The hosts running the replicator and the cluster must be able to communicate with each other.

  • The replicator must be able to connect as the tungsten user to the databases within the cluster; a quick connectivity check is sketched after this list.
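
One quick way to verify connectivity before installing is to connect manually from the replicator host with the mysql client. The host, port, and credentials below are the sample values used in the configuration steps that follow; substitute your own:

  shell> mysql -h ct1 -P 13306 -u tungsten -psecret -e "SELECT 1"

If this fails, check firewall rules and the MySQL grants for the tungsten user before continuing.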

Deploying a combined cluster and datawarehouse configuration consists of two stages: the deployment of the heterogeneous cluster, and the deployment of the slave replicator that will replicate the data into the chosen datawarehouse.

  1. Deployment of a Heterogeneous Cluster

    1. On your staging server for the cluster, download the release package.

    2. Unpack the release package:

      shell> tar zxf continuent-tungsten-4.0.7-111.tar.gz

    3. Change to the unpackaged directory:

      shell> cd continuent-tungsten-4.0.7-111

    4. Execute the tpm command to configure the installation defaults:

      shell> ./tools/tpm configure defaults \
          --reset \
          --datasource-port=13306 \
          --home-directory=/opt/qa/continuent \
          --user=tungsten \
          --datasource-user=tungsten \
          --datasource-password=secret \
          --application-user=tungsten_testing \
          --application-password=private \
          --connector-listen-port=9999

    5. Next, configure the specific cluster configuration parameters. In this example, ct1 is the master, and ct3 and ct4 are slaves.

      shell> ./tools/tpm configure east \
          --topology=clustered \
          --master=ct1 \
          --slaves=ct3,ct4 \
          --connectors=ct1,ct3,ct4 \
          --enable-heterogeneous-master=true

    6. Finally, perform the installation and then start the services:

      shell> ./tools/tpm install east
      shell> ./tools/tpm start east

      If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

      Once complete, the cluster should be up and running with connectors and master/slave services.
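
      As a quick health check, you can connect to the manager from any of the nodes using the cctrl tool installed with the cluster, and list the dataservice; the prompt shown reflects the east service configured above:

      shell> cctrl
      [LOGICAL] /east > ls

      The output should show ct1 online as the master, with ct3 and ct4 online as slaves, before you continue to the next stage.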

  2. Next, you need to configure the replicator that will act as a slave, reading information from the cluster and then applying that data into the chosen datawarehouse. The sample below uses Amazon Redshift.

    The steps to create the service on the replicator are as follows:

    1. On your replicator node, download the release package. The replicator may run either on a separate host (for example, when replicating to Amazon Redshift), or on the same host as the target database service (for HP Vertica or Hadoop).

    2. Unpack the release package:

      shell> tar zxf tungsten-replicator-4.0.0-500.tar.gz

    3. Change to the unpackaged directory:

      shell> cd tungsten-replicator-4.0.0-500

    4. Execute the tpm command to configure the core installation:

      shell> ./tools/tpm configure defaults \
          --reset \
          --home-directory=/opt/continuent \
          --root-command-prefix=true
    5. Configure the information about the cluster. This does not install the cluster; it only provides the information the replicator installation needs to communicate correctly with all the nodes in the cluster, regardless of the topology.

      shell> ./tools/tpm configure east \
          --topology=clustered \
          --master=ct1 \
          --slaves=ct3,ct4 \
          --connectors=ct1,ct3,ct4 \
          --enable-heterogeneous-master=true

    6. Configure the replicator service to write the data into the target datawarehouse. The example below shows the configuration for replicating into Amazon Redshift. For other examples, see the corresponding deployment section:

      shell> ./tools/tpm configure east_slaves \
          --topology=cluster-slave \
          --relay=ct2 \
          --master-dataservice=east \
          --enable-heterogeneous-slave=true \
          --replication-host=$REDSHIFT_HOST \
          --replication-user=$REDSHIFT_USER \
          --replication-password=$REDSHIFT_PASSWORD \
          --datasource-type=redshift \
          --batch-enabled=true \
          --batch-load-template=redshift \
          --redshift-dbname=dev

    7. Now that the configuration is complete, the service can be installed and then started:

      shell> ./tools/tpm install east_slaves
      shell> ./tools/tpm start east_slaves

The cluster and datawarehouse slave should now be configured and running, with data being replicated both within the cluster and out to the Redshift datawarehouse.
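
To confirm that the slave replicator is applying data, check its status with trepctl on the replicator host. The state field should report ONLINE, and appliedLastSeqno should advance as changes are committed within the cluster:

  shell> trepctl status

If the replicator reports an error state, the trepctl status output and the replicator log files are the first places to look.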