Database Replication and Clustering
You can replicate data from an existing cluster out to a datawarehouse solution, such as Hadoop or Vertica, by changing the configuration of your cluster and attaching a replication slave that handles the datawarehouse loading. The following modifications to the cluster are required so that the information is provided in the correct format for replication to the datawarehouse.
MySQL nodes should be configured to use row-based replication throughout.
This is required because replication to the datawarehouse environment must send the raw row data, rather than SQL statements, which cannot be applied directly to a target datawarehouse. A quick way to check the configured format is shown after this list.
Heterogeneous mode should be enabled within the cluster.
Heterogeneous mode enables a number of filters that provide the additional information, such as column names, primary key details, and SET and ENUM translation, required so that the information can be replicated into the datawarehouse correctly.
The replication slave, which replicates into the datawarehouse, should be configured in heterogeneous mode.
This enables certain filters and processing that ensure the THL and replicated content is formatted correctly.
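For the first of these requirements (row-based replication), you can check the binary logging format on each MySQL node before installation. A minimal check, assuming the tungsten user and MySQL port used in the configuration below; adjust the values for your environment:
shell> mysql -utungsten -p -P13306 -e "SHOW VARIABLES LIKE 'binlog_format'"
The reported value should be ROW. If it is not, set binlog-format=row in the [mysqld] section of the MySQL configuration file on each node and restart MySQL before continuing.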
The deployment works by configuring the cluster replication service in heterogeneous mode, and then replicating out to the slave that writes into the datawarehouse by using a cluster alias. This ensures that changes to the cluster topology, for example a master switch during a failover, do not interrupt replication to your chosen datawarehouse.
In order to configure this, there are three steps:
Create a cluster, or upgrade an existing cluster, to operate in heterogeneous mode
Configure an alias to the existing Tungsten Clustering service
Configure a replicator that points to the alias acting as a slave that writes into your chosen datawarehouse
There are also the following requirements:
The Tungsten Clustering and Tungsten Replicator versions must be 4.0 or later.
Hosts on both the replicator and cluster must be able to communicate with each other.
The replicator must be able to connect as the tungsten user to the databases within the cluster (a simple connectivity check is shown after this list).
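A simple way to confirm the last requirement is to attempt a connection from the replicator host to each node in the cluster as the tungsten user. The hostnames and port below mirror the example deployment that follows; substitute your own values:
shell> mysql -utungsten -p -hct1 -P13306 -e "SELECT 1"
Repeat the check for each of the cluster hosts (ct1, ct3 and ct4 in this example).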
Deploying a combined cluster and datawarehouse configuration consists of two stages: the deployment of the heterogeneous cluster, and the deployment of the slave replicator that will replicate the data into the chosen datawarehouse.
Deployment of a Heterogeneous Cluster
On your staging server for the cluster, download the release package.
Unpack the release package:
shell> tar zxf tungsten-clustering-5.1.1-202.tar.gz
Change to the unpackaged directory:
shell> cd tungsten-clustering-5.1.1-202
Execute the tpm command to configure the installation defaults:
shell> ./tools/tpm configure defaults \
--reset \
--datasource-port=13306 \
--home-directory=/opt/qa/continuent \
--user=tungsten \
--datasource-user=tungsten \
--datasource-password=secret \
--application-user=tungsten_testing \
--application-password=private \
--connector-listen-port=9999
The description of each of the options is shown below:
configure defaults
Configures default options that will be used for all future services.
--reset
Clears any existing configuration.
--datasource-port=13306
Sets the port used for connecting to the MySQL server.
--home-directory=/opt/qa/continuent
The installation directory of the Tungsten service. This is where the service will be installed on each server in your dataservice.
--user=tungsten
The operating system user name that you have created for the Tungsten service, tungsten; this is the user used for installing the Tungsten service.
--datasource-user=tungsten
The user to be used for connections to the datasource.
--datasource-password=secret
The password that will be used for connections to the datasource.
--application-user=tungsten_testing
The user to be used for application connections.
--application-password=private
The password that will be used for application connections.
--connector-listen-port=9999
The port on which the Connector will listen for incoming connections.
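Before moving on, you can review the stored defaults by asking tpm to print the configuration back as the equivalent commands (assuming the tpm reverse command is available in your release):
shell> ./tools/tpm reverse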
Next, configure the specific cluster configuration parameters. In this example, ct1 is the master, and ct3 and ct4 are slaves.
shell> ./tools/tpm configure east \
--topology=clustered \
--master=ct1 \
--slaves=ct3,ct4 \
--connectors=ct1,ct3,ct4 \
--enable-heterogeneous-master=true
The description of each of the options is shown below:
configure east
Configures the options for the east cluster dataservice.
--topology=clustered
The topology of the deployed service, in this case a standard cluster.
--master=ct1
The master database of the deployed service.
--slaves=ct3,ct4
The slave databases of the deployed service.
--connectors=ct1,ct3,ct4
The hosts that will provide the connector service in the deployed cluster.
--enable-heterogeneous-master=true
Enables the cluster to support a heterogeneous service. This provides the necessary configuration information, such as enabling filters and other configuration, so that the THL generated is usable by the downstream slave that will replicate the data into the datawarehouse.
Finally, perform the installation and then start the services:
shell> ./tools/tpm install east
shell> ./tools/tpm start east
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once complete, the cluster should be up and running with connectors and master/slave services.
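Before adding the datawarehouse slave, it is worth confirming that the cluster is healthy. For example, from one of the cluster nodes (assuming the Tungsten tools are in your path), cctrl and trepctl can be used to check the dataservice and replicator state:
shell> echo "ls" | cctrl
shell> trepctl status | grep state
The ls output should show the master and slave datasources as ONLINE, and each replicator should report a state of ONLINE.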
Next, you need to configure the replicator that will act as a slave, reading information from the cluster and then applying that data into the chosen datawarehouse. The sample below uses Amazon Redshift.
The steps to create the service on the replicator are as follows:
On your replicator node, download the release package. This node may be located either on a separate host (for example when replicating to Amazon Redshift), or on the same node as the target database service (for HP Vertica or Hadoop).
Unpack the release package:
shell> tar zxf tungsten-replicator-4.0.0-500.tar.gz
Change to the unpackaged directory:
shell> cd tungsten-replicator-4.0.0-500
Execute the tpm command to configure the core installation:
shell> ./tools/tpm configure defaults \
--reset \
--home-directory=/opt/continuent \
--root-command-prefix=true
Configure the information about the cluster; this will not install the cluster, just provide the information required by the replicator installation so that it can correctly communicate with all the nodes in the cluster, regardless of the topology.
shell> ./tools/tpm configure east \
--topology=clustered \
--master=ct1 \
--slaves=ct3,ct4 \
--connectors=ct1,ct3,ct4 \
--enable-heterogeneous-master=true
Configure the replicator service to write the data into the target datawarehouse. The example below shows the configuration for replicating into Amazon Redshift. For other examples, see the corresponding deployment section:
shell> ./tools/tpm configure east_slaves \
--topology=cluster-slave \
--relay=ct2 \
--master-dataservice=east \
--enable-heterogeneous-slave=true \
--replication-host=$REDSHIFT_HOST \
--replication-user=$REDSHIFT_USER \
--replication-password=$REDSHIFT_PASSWORD \
--datasource-type=redshift \
--batch-enabled=true \
--batch-load-template=redshift \
--redshift-dbname=dev
The description of each of the options is shown below:
configure east_slaves
Configures the options for the east_slaves service.
--topology=cluster-slave
Configures the topology as a cluster-slave. This configures the individual replicator as a slave of all the nodes in the cluster, as defined in the previous configuration of the cluster topology.
--relay=ct2
Configures the node as the relay for the cluster; this is the host that will replicate data into the datawarehouse.
--master-dataservice=east
Configures the master dataservice that the relay will replicate data from. This picks up the configuration of the cluster and cluster hosts.
--enable-heterogeneous-slave=true
Configures the slave to correctly process the incoming data so that it can be written to the datawarehouse. This includes correcting the processing of text data types and configuring the appropriate filters.
--replication-host=$REDSHIFT_HOST
The target host for writing data. In the case of Redshift, this is the fully qualified hostname of the Redshift host.
--replication-user=$REDSHIFT_USER
The user within the Redshift service that will be used to write data into the database.
--replication-password=$REDSHIFT_PASSWORD
The password for the user within the Redshift service that will be used to write data into the database.
--datasource-type=redshift
Sets the datasource type to be used when storing information about the replication state.
--batch-enabled=true
Enables the batch service. This configures the JavaScript batch engine and CSV writing semantics used to generate the data to be applied into a datawarehouse.
--batch-load-template=redshift
The batch load template to be used. Since we are replicating into Redshift, the redshift template is used.
--redshift-dbname=dev
The name of the database within the Redshift service where the data will be written.
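Before installing the slave service, it can be useful to confirm that the replicator host can reach the Redshift endpoint. A minimal check using the standard psql client, assuming psql is installed and reusing the connection variables from the configuration above (5439 is the default Redshift port; adjust if your cluster uses a different one):
shell> psql -h $REDSHIFT_HOST -p 5439 -U $REDSHIFT_USER -d dev -c "SELECT 1"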
Once the configuration is complete, the service can be installed and then started:
shell> ./tools/tpm install east_slaves
shell> ./tools/tpm start east_slaves
The cluster and datawarehouse slave should now be configured and running, with data being replicated both within the cluster, and out to the Redshift datawarehouse.
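To confirm that data is flowing end to end, check the slave replicator from the replicator host (again assuming the Tungsten tools are in your path) and watch the applied sequence number advance as changes are made on the cluster master:
shell> trepctl services
shell> trepctl status | grep -E 'state|appliedLastSeqno'
The replicator should report a state of ONLINE, and appliedLastSeqno should increase as new transactions are replicated into Redshift.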