Copyright © 2023 Continuent Ltd
Abstract
This manual documents Tungsten Replicator 7.0. This includes information for:
Tungsten Replicator
Build date: 2024-11-28 (505168f6)
Up to date builds of this document: Tungsten Replicator 7.0 Manual (Online), Tungsten Replicator 7.0 Manual (PDF)
This manual documents Tungsten Replicator 7.0 up to and including 7.0.3 build 141. Differences between minor versions are highlighted stating the explicit minor release version, such as 7.0.3.x.
For other versions and products, please use the appropriate manual.
The trademarks, logos, and service marks in this Document are the property of Continuent or other third parties. You are not permitted to use these Marks without the prior written consent of Continuent or such appropriate third party. Continuent, Tungsten, uni/cluster, m/cluster, p/cluster, uc/connector, and the Continuent logo are trademarks or registered trademarks of Continuent in the United States, France, Finland and other countries.
All Materials on this Document are (and shall continue to be) owned exclusively by Continuent or other respective third party owners and are protected under applicable copyrights, patents, trademarks, trade dress and/or other proprietary rights. Under no circumstances will you acquire any ownership rights or other interest in any Materials by or through your access or use of the Materials. All right, title and interest not expressly granted is reserved to Continuent.
All rights reserved.
This documentation uses a number of text and style conventions to indicate and differentiate between different types of information:
Text in this style is used to show an important element or piece of information. It may be used and combined with other text styles as appropriate to the context.
Text in this style is used to show a section heading, table heading, or particularly important emphasis of some kind.
Program or configuration options are formatted using this style. Options are also automatically linked to their respective documentation page when this is known. For example, tpm and --hosts both link automatically to the corresponding reference page.
Parameters or information explicitly used to set values to commands or options are formatted using this style.
Option values, for example on the command-line, are marked up using this format: --help. Where possible, all option values are directly linked to the reference information for that option.
Commands, including sub-commands to a command-line tool are formatted using Text in this style. Commands are also automatically linked to their respective documentation page when this is known. For example, tpm links automatically to the corresponding reference page.
Text in this style indicates literal or character sequence text used to show a specific value.
Filenames, directories or paths are shown like this /etc/passwd. Filenames and paths are automatically linked to the corresponding reference page if available.
Bulleted lists are used to show lists, or detailed information for a list of items. Where this information is optional, a magnifying glass symbol enables you to expand, or collapse, the detailed instructions.
Code listings are used to show sample programs, code, configuration files and other elements. These can include both user input and replaceable values:
shell> cd /opt/continuent/software
shell> tar zxvf tungsten-replicator-7.0.3-141.tar.gz
In the above example, command-lines to be entered into a shell are prefixed using shell>. This shell is typically sh, ksh, or bash on Linux and Unix platforms.
If commands are to be executed using administrator privileges, each line will be prefixed with root-shell, for example:
root-shell> vi /etc/passwd
To make the selection of text easier for copy/pasting, ignorable text, such as shell>, is ignored during selection. This allows multi-line instructions to be copied without modification, for example:
mysql> create database test_selection;
mysql> drop database test_selection;
Lines prefixed with mysql> should be entered within the mysql command-line.
If a command-line or program listing entry contains lines that are too wide to be displayed within the documentation, they are marked using the » character:
the first line has been extended by using a » continuation line
They should be adjusted to be entered on a single line.
Text marked up with this style is information that is entered by the user (as opposed to generated by the system). Text formatted using this style should be replaced with the appropriate file, version number or other variable information according to the operation being performed.
In the HTML versions of the manual, blocks or examples that contain user input can be easily copied from the program listing. Where there are multiple entries or steps, use the 'Show copy-friendly text' link at the end of each section. This provides a copy of all the user-enterable text.
Are you planning on completing your first installation?
Have you followed the Appendix B, Prerequisites?
Have you chosen your deployment type? See Chapter 2, Deployment Overview
Is this a Primary/Replica deployment?
Are you looking to configure an applier?
Are you using the Tungsten Replicator AMI available in the Amazon AWS Marketplace?
Would you like to understand the different types of installation?
There are two installation methods available in tpm, INI and Staging. A comparison of the two methods is at
Do you want to upgrade to the latest version?
See Section 7.14.1, “ Upgrading Tungsten Replicator using tpm ”.
Are you trying to update or change the configuration of your system?
Would you like to perform database or operating system maintenance?
Do you need to backup or restore your system?
For backup instructions, see Section 7.7, “Creating a Backup”, and to restore a previously made backup, see Section 7.8, “Restoring a Backup”.
Tungsten Replicator™ is a replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Amazon RDS MySQL, Amazon Aurora, Microsoft Azure and Google Cloud SQL, and applied to a variety of transactional stores, NoSQL stores and datawarehouse stores. For a full list of supported sources and targets, see Table 1.1, “Supported Extractors” and Table 1.2, “Supported Appliers” below.
During replication, Tungsten Replicator assigns data a unique global transaction ID, and enables flexible statement and/or row-based replication of data. This enables data to be exchanged between different databases and different database versions. During replication, information can be filtered and modified, and deployment can be between on-premise or cloud-based databases. For performance, Tungsten Replicator™ provides support for parallel replication, and advanced topologies such as fan-in, star and active/active, and can be used efficiently in cross-site deployments.
Tungsten Replicator™ is the core foundation for Tungsten Cluster™ for HA, DR and geographically distributed solutions.
Features in Tungsten Replicator
Includes support for replicating into Hadoop (including Apache Hadoop, Cloudera, HortonWorks, MapR, Amazon EMR)
Includes support for replicating into Amazon Redshift, including storing change data within Amazon S3
Includes support for replicating into PostgreSQL, Apache Kafka, MongoDB
Includes support for replicating to and from Amazon Aurora/RDS (MySQL) deployments
Available as an AMI via Amazon Marketplace (Without Support)
SSL Support for managing MySQL deployments
Network Client filter for handling complex data translation/migration needs during replication
The table below shows the Tungsten Replicator versions in which each extractor is supported.
Table 1.1. Supported Extractors
Source | 5.3 | 5.4 | 6.0 | 6.1 | 7.0 |
---|---|---|---|---|---|
MySQL (5.0 to 5.6) | x | x | x | x | x |
MySQL 5.7 | x | x | x | x | x |
MySQL 8 | | | x | x | x |
MariaDB (5.5, 10) | x | x | x | x | x |
Amazon Aurora/RDS MySQL | x | x | x | x | x |
Google Cloud MySQL | x | x | x | x | x |
Microsoft Azure | x | x | x | x | x |
The table below shows the Tungsten Replicator versions in which each applier is supported.
Table 1.2. Supported Appliers
Target | 5.3 | 5.4 | 6.0 | 6.1 | 7.0 |
---|---|---|---|---|---|
MySQL (incl MariaDB) | x | x | x | x | x |
Amazon Aurora/RDS MySQL | x | x | x | x | x |
Microsoft Azure | x | x | x | x | x |
Google Cloud MySQL | x | x | x | x | x |
Oracle (incl. Cloud) | x | x | x | x | x |
PostgreSQL (incl. Cloud) | x | x | x | x | x |
Hadoop | x | x | x | x | x |
Vertica | x | x | x | x | x |
Amazon Redshift | x | x | x | x | x |
MongoDB | x | x | x | x | x |
MongoDB Atlas | | | | x (6.1.3) | x |
Apache Kafka | x | x | x | x | x |
Clickhouse | | | | x | x |
Tungsten Replicator is a high-performance replication engine that works with a number of different source and target databases to provide improved replication functionality over the native solution. With MySQL replication, for example, the enhanced functionality and information provided by Tungsten Replicator allows for global transaction IDs, advanced topology support such as Composite Active/Active, star, and fan-in, and enhanced latency identification.
In addition to providing enhanced functionality Tungsten Replicator is also capable of heterogeneous replication by enabling the replicated information to be transformed after it has been read from the data server to match the functionality or structure in the target server. This functionality allows for replication between MySQL and a variety of heterogeneous targets.
Understanding how Tungsten Replicator works requires looking at the overall replicator structure. There are three major components in the system that provide the core of the replication functionality:
Extractor
The extractor component reads data from a MySQL data server and writes that information into the Transaction History Log (THL). The role of the extractor is to read the information from a suitable source of change information and write it into the THL in the native or defined format, either as SQL statements or row-based information.
Information is always extracted from a source database and recorded within the THL in the form of a complete transaction. The full transaction information is recorded and logged against a single, unique, transaction ID used internally within the replicator to identify the data.
Applier
Appliers within Tungsten Replicator convert the THL information and apply it to a destination data server. The role of the applier is to read the THL information and apply that to the data server.
The applier works with a number of different target databases, and is responsible for writing the information to the database. Because the transactional data in the THL is stored either as SQL statements or row-based information, the applier has the flexibility to reformat the information to match the target data server. Row-based data can be reconstructed to match different database formats, for example, converting row-based information into an Oracle-specific table row, or a MongoDB document.
Transaction History Log (THL)
The THL contains the information extracted from a data server. Information within the THL is divided up by transactions, either implied or explicit, based on the data extracted from the data server. The THL structure, format, and content provides a significant proportion of the functionality and operational flexibility within Tungsten Replicator.
As the THL data is stored, additional information, such as the metadata and options in place when the statement or row data was extracted, is also recorded. Each transaction is also recorded with an incremental global transaction ID. This ID enables individual transactions within the THL to be identified, for example to retrieve their content, or to determine whether different appliers within a replication topology have written a specific transaction to a data server.
These components will be examined in more detail as different aspects of the system are described with respect to the different systems, features, and functionality that each system provides.
From this basic overview and structure of Tungsten Replicator, the replicator allows for a number of different topologies and solutions that replicate information between different services. Straightforward replication topologies, such as Primary/Replica are easy to understand with the basic concepts described above. More complex topologies use the same core components. For example, Composite Active/Active topologies make use of the global transaction ID to prevent the same statement or row data being applied to a data server multiple times. Fan-in topologies allow the data from multiple data servers to be combined into one data server.
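The role of the global transaction ID in avoiding duplicate application can be pictured with a small Python sketch. Everything in it is hypothetical: the `ThlEvent` and `Applier` names, the fields, and the in-memory bookkeeping are illustrative assumptions and do not reflect the replicator's actual storage format or classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ThlEvent:
    """Hypothetical, simplified stand-in for one transaction in the THL."""
    seqno: int        # incremental global transaction ID
    source_id: str    # host the change was extracted from
    payload: str      # SQL statement or a description of the row change
    metadata: dict = field(default_factory=dict)


class Applier:
    """Applies events in order and skips anything already written."""

    def __init__(self):
        self.last_applied = {}  # source_id -> highest seqno applied so far
        self.applied = []       # stands in for the target data server

    def apply(self, event):
        if event.seqno <= self.last_applied.get(event.source_id, -1):
            return False        # already applied from this source: skip
        self.applied.append(event.payload)
        self.last_applied[event.source_id] = event.seqno
        return True
```

Because the source and the ID travel with every event, an applier in this sketch can decide on its own whether a transaction has already been written, which is the property the more complex topologies rely on.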
Extractors exist for reading information from the following sources:
Reading the MySQL binary log (binlog) directly from the disk and translating that content and session information into the THL. This method can read the binlog in its different formats, such as statement, row and mixed-based logging.
Reading remotely from a MySQL server over a network, including from an Amazon RDS MySQL or Amazon Aurora instance. This enables the replicator to read the information remotely, either on services where direct access to the binlog is not available, or where the replicator cannot be installed (such as databases hosted on a Windows platform).
Once information has been recorded into THL, particularly when that information has been recorded in row-based format, it is possible to apply that information out to a variety of different targets, both transactional and SQL based solutions, and also NoSQL and analytical targets.
Available appliers include:
MySQL
Community Edition
Enterprise Edition
Percona
MariaDB
Amazon Aurora/RDS (Including cross region)
Google Cloud SQL
Microsoft Azure
Oracle
PostgreSQL
Amazon Redshift
HPE Vertica
Hadoop, compatible with all major distributions
MongoDB (Including Atlas from v6.1.3 onwards)
Apache Kafka
Clickhouse (Experimental)
For more information on how the heterogeneous replicator works, see Section 2.8.1, “How Heterogeneous Replication Works”. For more information on the batch applier, which works with datawarehouse targets, see Section 5.6, “Batch Loading for Data Warehouses”.
Tungsten Replicator operates by reading information from the source database and transferring that information to the Transaction History Log (THL).
Each transaction within the THL includes the SQL statement or the row-based data written to the database. The information also includes, where possible, transaction specific options and metadata, such as character set data, SQL modes and other information that may affect how the information is written when the data is applied. The combination of the metadata and the global transaction ID also enable more complex data replication scenarios to be supported, such as Composite Active/Active, without fear of duplicating statement or row data application because the source and global transaction ID can be compared.
In addition to all this information, the THL also includes a timestamp and a record of when the information was written into the database before the change was extracted. Using a combination of the global transaction ID and this timing information provides information on the latency and how up to date a dataserver is compared to the original datasource.
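As a rough illustration of how the timing information translates into a latency figure, the sketch below subtracts the time a change was committed on the source from the time it was applied on the target. The function name and the sample timestamps are assumptions made for this example; the replicator's own latency fields are calculated internally.

```python
from datetime import datetime, timezone


def applied_latency_seconds(source_commit_time, applied_time):
    """Seconds between the commit on the source and the apply on the target."""
    return (applied_time - source_commit_time).total_seconds()


# Sample values, invented for the example.
commit = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
applied = datetime(2024, 1, 1, 12, 0, 2, 500000, tzinfo=timezone.utc)
print(applied_latency_seconds(commit, applied))  # 2.5
```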
Depending on the underlying storage of the data, the information can be reformatted and applied to different data servers. When dealing with row-based data, this can be applied to a different type of data server, or completely reformatted and applied to non-table based services such as MongoDB.
THL information is stored for each replicator service, and can also be exchanged over the network between different replicator instances. This enables transaction data to be exchanged between different hosts within the same network or across wide-area-networks.
Filtering within the replicator enables the information within the THL to be removed, augmented, or modified as the information is transferred within and between the replicators.
During filtering, the information in the THL can be modified in a host of different ways, including but not limited to:
Filtering out information based on the schema name, table name or column name. This is useful if you want a subset of the information in your target database, or if you want to apply only certain columns to the information.
Filter information based on the content, or value of one or more fields.
Filter information based on the operation type, for example, only applying inserts to a target ignoring updates or deletes.
Modify or alter the format or structure of the data. This can be used to change the data format to be compatible with a target system, for example due to data type limitations, or sizes.
Add information to the data. For example, adding a database name, source name, or additional or compound fields into the target data. Within an analytics system this can be useful when combining data from multiple sources so that the source system or customer can still be identified.
The format, content, and structure of the data and the THL can be modified and new data can even be created through the filters.
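The kinds of filtering listed above can be sketched as a chain of small functions, each of which either passes a change on (possibly modified) or drops it. This is a conceptual Python illustration only: the dictionary layout of a change and all function names are assumptions, and it is not the replicator's filter API.

```python
def drop_schema(schema):
    """Build a filter that removes every change belonging to one schema."""
    def _filter(change):
        return None if change["schema"] == schema else change
    return _filter


def inserts_only(change):
    """Keep inserts; drop updates and deletes."""
    return change if change["op"] == "INSERT" else None


def add_source(name):
    """Build a filter that adds the source name to each row."""
    def _filter(change):
        row = dict(change["row"], source_name=name)
        return dict(change, row=row)
    return _filter


def run_filters(changes, filters):
    """Pass each change through the filters in order; None means dropped."""
    kept = []
    for change in changes:
        for flt in filters:
            change = flt(change)
            if change is None:
                break
        else:
            kept.append(change)
    return kept
```

Each filter returns a new dictionary rather than modifying its input, so the order of the chain is the only thing that determines the result.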
For more information on the filters available, and how to use them, see Chapter 11, Replication Filters.
Tungsten Replicator creates a unique replication interface between two databases. Because Tungsten Replicator is independent of the dataserver it affords a number of different advantages, including more flexible replication strategies, filtering, and easier control to pause, restart, and skip statements between hosts.
Replication is supported from, and to, different dataservers using different technologies through a series of extractor and applier components which independently read data from, and write data to, the dataservers in question.
The replication process is made possible by reading the binary log on each host. The information from the binary log is written into the Tungsten Replicator Transaction History Log (THL), and the THL is then transferred between hosts and then applied to each Target host. More information can be found in Chapter 1, Introduction.
Before covering the basics of creating different dataservices, there are some key terms that will be used throughout the setup and installation process that identify different components of the system. These are summarised in Table 2.1, “Key Terminology”.
Table 2.1. Key Terminology
Tungsten Term | Traditional Term | Description |
---|---|---|
dataserver | Database | The database on a host. Dataservers include MySQL or Oracle. |
datasource | Host or Node | One member of a dataservice and the associated Tungsten components. |
staging host | - | The machine (and directory) from which Tungsten Replicator is installed and configured. The machine does not need to be the same as any of the existing hosts in the cluster. |
staging directory | - | The directory where the installation files are located and the installer is executed. Further configuration and updates must be performed from this directory. |
Before attempting installation, there are a number of prerequisite tasks which must be completed to set up your hosts, database, and Tungsten Replicator service:
Set up a staging host from which you will configure and manage your installation.
Configure each host that will be used within your dataservice.
Configure your MySQL installation, so that Tungsten Replicator can work with the database.
Prepare and configure the target environment.
The following sections provide guidance and instructions for creating a number of different deployment scenarios using Tungsten Replicator.
Tungsten Replicator is available in a number of different distribution types, and the configuration methods available for these different packages differ. See Section 9.1, “Comparing Staging and INI tpm Methods” for more information on the available installation methods.
Deployment Type/Package | TAR/GZip | RPM |
---|---|---|
Staging Installation | Yes | No |
INI File Configuration | Yes | Yes |
Deploy Entire Cluster | Yes | No |
Deploy Per Machine | Yes | Yes |
Two primary deployment sources are available:
Using the TAR/GZip package creates a local directory that enables you to perform installs and updates from the extracted 'staging' directory, or use the INI file format.
Using the RPM package format is more suited to using the INI file format, as hosts can be installed and upgraded to the latest RPM package independently of each other.
All packages are named according to the product, version number, build release and extension. For example:
tungsten-replicator-7.0.3-141.tar.gz
The version number is 7.0.3 and the build number is 141. Build numbers indicate which build a particular release version is based on, and may be useful when installing patches provided by support.
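When scripting around downloaded packages, the naming scheme can be split back into its parts. The following Python sketch assumes the exact layout shown in the example name (product, a three-part version, a numeric build, then the extension); the pattern and function name are illustrative and other package names may need a different pattern.

```python
import re

# Assumed layout: <product>-<major.minor.patch>-<build>.<extension>
PACKAGE_RE = re.compile(
    r"^(?P<product>[a-z-]+)-(?P<version>\d+\.\d+\.\d+)-(?P<build>\d+)\.(?P<ext>.+)$"
)


def parse_package(filename):
    """Split a package file name into product, version, build and extension."""
    match = PACKAGE_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised package name: {filename}")
    return match.groupdict()


print(parse_package("tungsten-replicator-7.0.3-141.tar.gz"))
```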
To use the TAR/GZipped packages, download the files to your machine and unpack them:
shell>cd /opt/continuent/software
shell>tar zxf tungsten-replicator-7.0.3-141.tar.gz
This will create a directory matching the downloaded package name, version, and build number from which you can perform an install using either the INI file or command-line configuration. To do so, use the tpm command within the tools directory of the extracted package:
shell> cd tungsten-replicator-7.0.3-141
The RPM packages can be used for installation, but are primarily designed to be used in combination with the INI configuration file.
Installation
Installing the RPM package will do the following:
Create the tungsten system user if it doesn't exist
Make the tungsten system user part of the mysql group if it exists
Create the /opt/continuent/software directory
Unpack the software into /opt/continuent/software
Define the $CONTINUENT_PROFILES and $REPLICATOR_PROFILES environment variables
Update the profile script to include the /opt/continuent/share/env.sh script
Create the /etc/tungsten directory
Run tpm install if the /etc/tungsten.ini or /etc/tungsten/tungsten.ini file exists
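The final step depends on whether an INI file is present. A pre-flight check along these lines can confirm which file, if any, the package would find. This is a sketch only: the function name is hypothetical, and the order in which the two paths are tried is an assumption, since only the two locations themselves are given here.

```python
from pathlib import Path

# The two locations named above; the order of precedence is an assumption.
INI_CANDIDATES = ("/etc/tungsten.ini", "/etc/tungsten/tungsten.ini")


def find_ini(candidates=INI_CANDIDATES):
    """Return the first existing INI path, or None if none is present."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


ini = find_ini()
print(ini if ini else "no INI file found; tpm install would not be run")
```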
Although the RPM packages complete a number of the pre-requisite steps required to configure your cluster, there are additional steps, such as configuring ssh, that you still need to complete. For more information, see Appendix B, Prerequisites.
By using the package files you are able to set up a new server by creating the /etc/tungsten.ini file and then installing the package. Any output from the tpm command will go to /opt/continuent/service_logs/rpm.output.
If you download the package files directly, you may need to add the signing key to your environment before the package will load properly.
For yum platforms (RHEL/CentOS/Amazon Linux), the rpm command is used:
root-shell> rpm --import http://www.continuent.com/RPM-GPG-KEY-continuent
For Ubuntu/Debian platforms, the gpg command is used:
root-shell> gpg --keyserver keyserver.ubuntu.com --recv-key 7206c924
Once an INI file has been created and the packages are available, the installation can be completed using:
On RHEL/CentOS/Amazon Linux:
root-shell> yum install tungsten-replicator
On Ubuntu/Debian:
root-shell> apt-get install tungsten-replicator
Upgrades
If you upgrade to a new version of the RPM package it will do the following:
Unpack the software into
/opt/continuent/software
Run tpm update if the
/etc/tungsten.ini
or
/etc/tungsten/tungsten.ini
file exists
The tpm update will restart all Continuent Tungsten services so you do not need to do anything after upgrading the package file.
A successful deployment depends on being mindful during deployment, operations and ongoing maintenance.
Identify the best deployment method for your environment and use that in production and testing. See Section 9.1, “Comparing Staging and INI tpm Methods”.
Standardize the OS and database prerequisites. There are Ansible modules available for immediate use within AWS, or as a template for modifications.
More information on the Ansible method is available in this blog article.
Ensure that the output of the `hostname` command and the nodename entries in the Tungsten configuration match exactly prior to installing Tungsten.
The configuration keys that define nodenames are: --slaves, --dataservice-slaves, --members, --master, --dataservice-master-host, --masters and --relay.
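A quick way to catch a mismatch before installing is to compare the local hostname with the node names you intend to configure. The sketch below is an illustration under assumptions: the function name is hypothetical, the node list is supplied by hand rather than read from a Tungsten configuration, and it relies on Python's `socket.gethostname()`, which usually reports the same value as the hostname command.

```python
import socket


def hostname_matches(configured_nodes, hostname=None):
    """Return True if this host's name appears exactly in the node list."""
    if hostname is None:
        hostname = socket.gethostname()
    return hostname in configured_nodes


# Hypothetical node names; replace with the values used in your configuration.
nodes = ["db1", "db2", "db3"]
if not hostname_matches(nodes):
    print(f"{socket.gethostname()} is not listed in {nodes}")
```

The comparison is deliberately exact: a short name such as db1 does not match a fully qualified name such as db1.example.com.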
For security purposes you should ensure that you secure the following areas of your deployment:
Ensure that you create a unique installation and deployment user, such as tungsten, and set the correct file permissions on installed directories. See Section B.3.4, “Directory Locations and Configuration”.
When using ssh and/or SSL, ensure that the ssh key or certificates are suitably protected. See Section B.3.3.2, “SSH Configuration”.
Use a firewall, such as iptables, to protect the network ports that you need to use. The best solution is to ensure that only known hosts can connect to the required ports for Tungsten Cluster. For more information on the network ports required for Tungsten Cluster operation, see Section B.3.3.1, “Network Ports”.
If possible, use authentication and SSL connectivity between hosts to protect your data and authorisation for the tools used in your deployment.
See Chapter 6, Deployment: Security for more information.
Choose your topology from the deployment section and verify the configuration matches the basic settings. Additional settings may be included for custom features but the basics are needed to ensure proper operation. If your configuration is not listed or does not match our documented settings, we cannot guarantee correct operation.
If you are using ROW replication, any triggers that run additional INSERT/UPDATE/DELETE operations must be updated so they do not run on the Replica servers.
Make sure you know the structure of the Tungsten Cluster home directory and how to initialize your environment for administration. See Section 7.1, “The Home Directory” and Section 7.2, “Establishing the Shell Environment”.
Prior to migrating applications to Tungsten Cluster, test failover and recovery procedures from Chapter 7, Operations Guide. Be sure to try recovering a failed Primary and reprovisioning failed Replicas.
When deciding on the Service Name for your configurations, keep it simple and short, and use only alphanumerics (A-Z, a-z, 0-9) and underscores (_).
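The naming rule can be checked mechanically before a name is used in a configuration. This Python sketch encodes only the character rule stated above; the function name is hypothetical and tpm may enforce additional rules not covered here.

```python
import re

# Letters, digits and underscores only, at least one character.
SERVICE_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_service_name(name):
    """Return True if the name uses only alphanumerics and underscores."""
    return SERVICE_NAME_RE.fullmatch(name) is not None


for candidate in ("alpha", "east_1", "east-coast", "my service"):
    print(candidate, is_valid_service_name(candidate))
```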
In this section we identify the best practices for performing a Tungsten Software upgrade.
Identify the deployment method chosen for your environment, Staging or INI. See Section 9.1, “Comparing Staging and INI tpm Methods”.
The best practice for Tungsten software is to upgrade All-at-Once, performing zero Primary switches.
The Staging deployment method automatically does an All-at-Once upgrade - this is the basic design of the Staging method.
For an INI upgrade, there are two possible ways, One-at-a-Time (with at least one Primary switch), and All-at-Once (no switches at all).
See Section 9.4.3, “Upgrades with an INI File” for more information.
Here is the sequence of events for a proper Tungsten upgrade on a 3-node cluster with the INI deployment method:
Login to the Customer Downloads Portal and get the latest version of the software.
Copy the file (e.g. tungsten-clustering-7.0.2-161.tar.gz) to each host that runs a Tungsten component.
Set the cluster to policy MAINTENANCE
On every host:
Extract the tarball under /opt/continuent/software/ (i.e. create /opt/continuent/software/tungsten-clustering-7.0.2-161)
cd to the newly extracted directory
Run the Tungsten Package Manager tool, tools/tpm update --replace-release
For example, here are the steps in order:
On ONE database node:
shell> cctrl
cctrl> set policy maintenance
cctrl> exit
On EVERY Tungsten host at the same time:
shell> cd /opt/continuent/software
shell> tar xvzf tungsten-clustering-7.0.2-161.tar.gz
shell> cd tungsten-clustering-7.0.2-161
To perform the upgrade and restart the Connectors gracefully at the same time:
shell> tools/tpm update --replace-release
To perform the upgrade and delay the restart of the Connectors to a later time:
shell> tools/tpm update --replace-release --no-connectors
When it is time for the Connector to be promoted to the new version, perhaps after taking it out of the load balancer:
shell> tpm promote-connector
When all nodes are done, on ONE database node:
shell> cctrl
cctrl> set policy automatic
cctrl> exit
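For a runbook or change ticket, the same sequence can be generated from the package name. The Python sketch below only assembles the commands already shown in this section into an ordered list; it does not run anything, and the function name and the "where" labels are assumptions made for illustration.

```python
def upgrade_plan(package, restart_connectors=True):
    """Return the INI all-at-once upgrade steps as (where, command) pairs."""
    suffix = ".tar.gz"
    if not package.endswith(suffix):
        raise ValueError(f"expected a {suffix} package, got: {package}")
    directory = package[: -len(suffix)]

    update = "tools/tpm update --replace-release"
    if not restart_connectors:
        update += " --no-connectors"

    plan = [
        ("one database node", "cctrl: set policy maintenance"),
        ("every host", "cd /opt/continuent/software"),
        ("every host", f"tar xvzf {package}"),
        ("every host", f"cd {directory}"),
        ("every host", update),
    ]
    if not restart_connectors:
        # Connectors keep running the old version until promoted.
        plan.append(("each connector host, later", "tpm promote-connector"))
    plan.append(("one database node", "cctrl: set policy automatic"))
    return plan


for where, command in upgrade_plan("tungsten-clustering-7.0.2-161.tar.gz"):
    print(f"{where}: {command}")
```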
Why is it OK to upgrade and restart everything all at once?
Let’s look at each component to examine what happens during the upgrade, starting with the Manager layer.
Once the cluster is in Maintenance mode, the Managers cease to make changes to the cluster, and therefore Connectors will not reroute traffic either.
Since Manager control of the cluster is passive in Maintenance mode, it is safe to stop and restart all Managers - there will be zero impact to the cluster operations.
The Replicators function independently of client MySQL requests (which come through the Connectors and go to the MySQL database server), so even if the Replicators are stopped and restarted, there should be only a small window of delay while the replicas catch up with the Primary once upgraded. If the Connectors are reading from the Replicas, they may briefly get stale data if not using SmartScale.
Finally, when the Connectors are upgraded they must be restarted so the new version can take over. As discussed in this blog post, Zero-Downtime Upgrades, the Tungsten Cluster software upgrade process will do two key things to help keep traffic flowing during the Connector upgrade promote step:
Execute `connector graceful-stop 30` to gracefully drain existing connections and prevent new connections.
Using the new software version, initiate the start/retry feature which launches a new connector process while another one is still bound to the server socket. The new Connector process will wait for the socket to become available by retrying binding every 200ms by default (which is tunable), drastically reducing the window for application connection failures.
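The start/retry behaviour described above is, in general terms, a bounded retry loop. The following is a generic bash sketch of that pattern and not the Connector's own implementation. The function name retry_every, the attempt limit and the use of a fractional sleep (as supported by GNU sleep) are assumptions; only the 200ms interval comes from the text.

```shell
# Hypothetical helper: run a command repeatedly at a fixed interval
# until it succeeds or the attempt limit is reached.
# Usage: retry_every INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
retry_every() {
  local interval="$1" max_attempts="$2"
  shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1            # give up: the command never succeeded
    fi
    attempt=$((attempt + 1))
    sleep "$interval"     # e.g. 0.2 for a 200ms retry interval
  done
  return 0
}
```

For example, `retry_every 0.2 50 some_bind_check` would retry a hypothetical some_bind_check command every 200ms for up to 50 attempts.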
Set up proper monitoring for all servers as described in Section 7.15, “Monitoring Tungsten Cluster”.
Configure the Tungsten Cluster services to start up and shut down along with the server. See Section 2.5, “Configuring Startup on Boot”.
Your license allows for a testing cluster. Deploy a cluster that matches your production cluster and test all operations and maintenance operations there.
Disable any automatic operating system patching processes. The use of automatic patching will cause issues when all database servers automatically restart without coordination. See Section 7.13.3, “Performing Maintenance on an Entire Dataservice”.
Regularly check for maintenance releases and upgrade your environment. Every version includes stability and usability fixes to ease the administrative process.
There are a variety of tpm options that can be used to alter some aspect of the deployment during configuration. Although they might not be provided within the example deployments, they may be used or required for different installation environments. These include options such as altering the ports used by different components, or the commands and utilities used to monitor or manage the installation once deployment has been completed. Some of the most common options are included within this section.
Changes to the configuration should be made with tpm update. This continues the procedure of using tpm install during installation. See Section 9.5.20, “tpm update Command” for more information on using tpm update.
--datasource-systemctl-service
On some platforms and environments the command used to manage and control the MySQL or MariaDB service is handled by a tool other than the service or /etc/init.d/mysql commands.
Depending on the system or environment, other commands using the same basic structure may be used. For example, within CentOS 7, the command is systemctl. You can explicitly set the command to be used with the --datasource-systemctl-service option, which specifies the name of the tool.
The format of the corresponding command that will be used is expected to follow the same format as previous commands, for example to stop the database service:
shell> systemctl mysql stop
Different commands must follow the same basic structure: the command configured by --datasource-systemctl-service, the servicename, and the status (i.e. stop).
To shut down a running Tungsten Replicator operation you must switch off the replicator:
shell> replicator stop
Stopping Tungsten Replicator Service...
Stopped Tungsten Replicator Service.
Stopping the replicator in this way results in an ungraceful shutdown of the replicator. To perform a graceful shutdown, use trepctl offline first, then stop or restart the replicator.
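The graceful sequence described above can be wrapped in a small shell function. This is a minimal sketch: the function name graceful_replicator_stop is hypothetical, and it assumes a single replication service so that trepctl offline needs no further arguments.

```shell
# Hypothetical wrapper for a graceful shutdown of the replicator.
# Usage: graceful_replicator_stop
graceful_replicator_stop() {
  # Put the replicator into the OFFLINE state first; do not stop if that fails.
  trepctl offline || return 1
  # Then stop the replicator process itself.
  replicator stop
}
```

Stopping only after trepctl offline has succeeded keeps the two steps in the order the text recommends.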
To start the replicator service if it is not already running:
shell> replicator start
Starting Tungsten Replicator Service...
To restart the replicator service (stop and start):
shell> replicator restart
Stopping Tungsten Replicator Service...
Stopped Tungsten Replicator Service.
Starting Tungsten Replicator Service...
For some scenarios, such as initiating a load within a heterogeneous environment, the replicator can be started up in the OFFLINE state:
shell> replicator start offline
In a clustered environment, if the cluster was configured with auto-enable=false then you will need to put each node online individually.
By default, Tungsten Replicator does not start automatically on boot. To enable Tungsten Replicator to start at boot time on a system supporting the Linux Standard Base (LSB), use the deployall script provided in the installation directory to create the necessary boot scripts on your system:
shell> sudo deployall
To disable automatic startup at boot time, use the undeployall command:
shell> sudo undeployall
Removing components from a dataservice is quite straightforward, and usually involves both modifying the running service and changing the configuration. Changing the configuration is necessary to ensure that the host is not re-configured and installed when the installation is next updated.
To remove a datasource from an existing deployment there are two primary stages, removing it from the active service, and then removing it from the active configuration.
For example, to remove host6 from a service:
Login to host6.
Stop the replicator:
shell> replicator stop
Now that the node has been removed from the active dataservice, the host must be removed from the configuration. The exact method depends on which installation method was used with tpm:
If you are using the staging directory method with tpm:
Change to the staging directory. The current staging directory can be located using tpm query staging:
shell> tpm query staging
tungsten@host1:/home/tungsten/tungsten-replicator-7.0.3-141
shell> cd /home/tungsten/tungsten-replicator-7.0.3-141
Update the configuration, omitting the host from the list of members of the dataservice:
shell> tpm update alpha \
--members=host1,host2,host3
If you are using the INI file method with tpm:
Remove the INI configuration file:
shell> rm /etc/tungsten/tungsten.ini
Remove the installed software directory:
shell> rm -rf /opt/continuent
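For the staging method above, the new --members value is simply the previous member list with the removed host taken out. A minimal bash sketch of that step follows. The function name remove_member and the starting list host1,host2,host3,host6 are assumptions for illustration.

```shell
# Hypothetical helper: drop one host from a comma-separated member list.
# Usage: remove_member "host1,host2,host3,host6" host6
remove_member() {
  local members="$1" host="$2"
  printf '%s\n' "$members" |
    tr ',' '\n' |               # one host per line
    grep -vxF -- "$host" |      # drop exact, whole-line matches only
    paste -sd, -                # join the remaining hosts with commas
}
```

The result can then be passed to tpm, for example `tpm update alpha --members="$(remove_member "host1,host2,host3,host6" host6)"`.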
The following sections describe the different styles of deployment available and the different topologies that can be configured using Tungsten Replicator.
Replication Operation Support | |
---|---|
Statements Replicated | Yes, within MySQL/MySQL Topologies only |
Rows Replicated | Yes |
Schema Replicated | Yes, within MySQL/MySQL Topologies only |
ddlscan Supported | Yes, supported for mixed MySQL, and data warehouse targets |
Tungsten Replicator for MySQL operates by either:
Reading the MySQL binary log (binlog) directly from the disk and translating that content and session information into the THL. This method can read the binlog in its different formats, such as statement, row and mixed-based logging.
Reading the binlog remotely from the MySQL server over a network, for example from an Amazon Aurora MySQL instance. This enables the replicator to extract from services where direct access to the binlog is not available, or where the replicator cannot be installed on the database host. This is also referred to as Offboard installation.
The following diagrams show these two methods of extraction
Tungsten Replicator for MySQL is supported within the following environments:
MySQL Community Edition
MySQL Enterprise Edition from Oracle
Percona
MariaDB
Amazon RDS
Amazon Aurora
Google Cloud MySQL
In addition, the following requirements and limitations are in effect:
Tables must have primary keys (Only applicable when the target is not Oracle, MySQL or Postgres)
Row-based binary logging must be configured for heterogeneous deployment models
Datatype support varies, depending upon the target. Check applier documentation appropriate to deployment target for more detail.
Currently, DDL is only replicated in MySQL to MySQL deployments
The replicator can be installed in a number of ways to work within the limitations or restrictions of your environment, and supports a number of flexible topologies. These are outlined below.
Onboard
This method will involve the Tungsten Replicator being installed on the same host as the Source MySQL Database. This method is suitable for:
On-Premise deployments
EC2 Hosted Databases in AWS
Google Cloud SQL Hosted Instances
Offboard
This method will involve the Tungsten Replicator being installed on a different host from the Source MySQL Database. This method is suitable for:
On-Premise deployments
EC2 Instances in AWS
Google Cloud SQL Hosted Instances
Amazon RDS MySQL Instances
Amazon Aurora Instances
Direct
This method involves the Tungsten Replicator being installed on a different host from the source MySQL Database; however, the replicator will also act as the applier, writing out to the target. This method is suitable for:
Amazon RDS MySQL Instances
Amazon Aurora Instances
Cluster-Extractor topologies, extracting direct from a Tungsten Cluster
AWS Marketplace AMI
This method is based on a pre-built AMI available for purchase within the Amazon Marketplace. This method is suitable for:
Amazon AWS Hosted solutions, including RDS and Aurora
There are a number of different methods in which Tungsten Replicator can be configured. Review Section 2.7.2, “Understanding Deployment Models” for full details of the differences between each deployment style. The following sections explain the different topology styles that can be deployed:
Section 2.7.3.1, “Simple Primary/Replica Topology”
A simple Primary/Replica topology replicating from one source host to one target.
Section 2.7.3.2, “Active/Active Topology”
A more advanced topology allowing bi-directional replication between two or more hosts.
This topology can only be configured between MySQL hosts
Section 2.7.3.3, “Fan-Out Topology”
A more advanced Primary/Replica topology replicating from a single source host into multiple targets.
Each target can be of a different type, and advanced filtering can elevate this topology into a highly advanced solution.
Section 2.7.3.4, “Fan-In Topology”
The reverse of Fan-Out, this topology allows multiple source hosts to be replicated into a single target.
Advanced filtering within the replicator will allow flexibility to, for example, remap schemas.
Section 2.7.3.5, “Replicating in/out of an existing Tungsten Cluster”
Configuring the replicator as a Cluster-Extractor allows THL generated within an existing Tungsten Cluster to be replicated to a standalone target.
Primary/Replica is the simplest and most straightforward of all replication scenarios, and also the basis of all other types of topology. The fundamental basis for the Primary/Replica topology is that changes in the Source are distributed and applied to each of the configured Targets.
An active/active topology relies on a number of individual services that are used to define a Primary/Replica topology between each group of hosts. In a three-node active/active setup, for example, three different services are created on each host, and each service creates a Primary/Replica relationship between a primary host (itself) and the remote Targets. A change on any individual host will be replicated to the other databases in the topology, creating the active/active configuration.
The fan-out topology allows you to replicate from one single host out to two or more target hosts. Fan-out topologies are often used in situations where you have different reporting requirements, for example, sales figures may need aggregating and reporting within a Redshift environment but payroll information may need replicating to a MySQL environment for back office processing.
The fan-in topology is the logical opposite of a Primary/Replica topology. In a fan-in topology, the data from two (or more) Sources is combined together on one Target. Fan-in topologies are often used in situations where you have satellite databases, maybe for sales or retail operations, and need to combine that information together in a single database for processing.
If you have an existing cluster and you want to replicate the data out to a separate standalone server using Tungsten Replicator then you can create a cluster alias, and use a Primary/Replica topology to replicate from the cluster. This allows for THL events from the cluster to be applied to a separate server for the purposes of backup or separate analysis.
Heterogeneous deployments cover installations where data is being replicated between two different database solutions. These include, but are not limited to:
MySQL (Incl. Cloud based solutions such as Amazon RDS, Aurora or Google Cloud), to...
The following sections provide more detail and information on the setup and configuration of these different solutions.
Heterogeneous replication works slightly differently compared to the native MySQL to MySQL replication. This is because SQL statements, including both Data Manipulation Language (DML) and Data Definition Language (DDL) cannot be executed on a target system as they were extracted from the MySQL database. The SQL dialects are different, so that an SQL statement on MySQL is not the same as an SQL statement on Oracle, and differences in the dialects mean that either the statement would fail, or would perform an incorrect operation.
On targets that do not support SQL of any kind, such as MongoDB, replicating SQL statements would achieve nothing since they cannot be executed at all.
All heterogeneous replication deployments therefore use row-based replication. This extracts only the raw row data, not the statement information. Because it is only row-data, it can be easily re-assembled or constructed into another format, including statements in other SQL dialects, native appliers for alternative formats, such as JSON or BSON, or external CSV formats that enable the data to be loaded in bulk batches into a variety of different targets.
Replication into targets where the JDBC Driver can be used, such as Oracle and Postgres, works as follows:
Data is extracted from the source MySQL database:
The MySQL server is configured to write transactions into the MySQL binary log using row-based logging. This generates information in the log in the form of the individual updated rows, rather than the statement that was used to perform the update. For example, instead of recording the statement:
mysql> INSERT INTO MSG VALUES (1,'Hello World');
The information is stored as a row entry against the updated table:
The information is written into the THL as row-based events, with the event type (insert, update or delete) appended to the metadata of the THL event.
It is the raw row data that is stored in the THL. Because the row data, not the SQL statement, has been recorded, the differences between SQL dialects do not need to be taken into account. In fact, Data Definition Language (DDL) and other SQL statements are deliberately ignored so that replication does not break.
The row-based transactions stored in the THL are transferred from the Extractor to the Applier.
On the Applier side, the row-based event data is wrapped into a suitable SQL statement for the target database environment. Because the raw row data is available, it can be constructed into any suitable statement appropriate for the target database.
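The final step above, wrapping raw row data into a statement for the target, can be pictured with a small bash sketch. It is only an illustration of the idea: the function name row_to_insert and the choice to quote every value as a string literal are assumptions, and the real applier works through JDBC and handles column names, data types and escaping itself.

```shell
# Hypothetical helper: build an INSERT statement from a table name and raw row values.
# Usage: row_to_insert TABLE VALUE [VALUE...]
row_to_insert() {
  local table="$1"
  shift
  local values="" v
  for v in "$@"; do
    # Double any embedded single quotes so the value stays a valid SQL string literal.
    v=$(printf '%s' "$v" | sed "s/'/''/g")
    # Append the quoted value, comma-separated after the first one.
    values="${values:+$values, }'$v'"
  done
  printf 'INSERT INTO %s VALUES (%s);\n' "$table" "$values"
}
```

Using the row from the earlier example, `row_to_insert MSG 1 'Hello World'` prints `INSERT INTO MSG VALUES ('1', 'Hello World');`.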
For heterogeneous replication where data is written into a target database using a native applier, such as MongoDB, the row-based information is written into the database using the native API. With MongoDB, for example, data is reformatted into BSON and then applied into MongoDB using the native insert/update/delete API calls.
For batch appliers, such as Hadoop, Vertica and Redshift, the row-data is converted into CSV files in batches. The format of the CSV file includes both the original row data for all the columns of each table, and metadata on each line that contains the unique SEQNO and the operation type (insert, delete or update). A modified form of the CSV is used in some cases where the operation type is only an insert or delete, with updates being translated into a delete followed by an insert of the updated information.
These temporary CSV files are then loaded into the native environment as part of the replicator using a custom script that employs the specific tools of that database that support CSV imports. The raw CSV data is loaded into a staging table that contains the per-row metadata and the row data itself.
Depending on the batch environment, the loading of the data into the final destination tables is performed either within the same script, or by using a separate script. Both methods work in the same basic fashion: the base table is updated using the data from the staging table. Each row marked for deletion is deleted, and the latest row (the one with the highest SEQNO) for each primary key is then inserted.
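The effect of this merge can be modelled with a short shell and awk sketch. It is not the load script shipped with the replicator. The function name latest_rows and the staging layout (operation type, SEQNO, primary key, then row data, with no commas inside values) are assumptions made for illustration.

```shell
# Hypothetical model of the staging-to-base merge.
# Reads staging CSV on stdin in the assumed layout: opcode,seqno,pkey,data
# Prints the rows that should exist in the base table afterwards.
latest_rows() {
  awk -F, '
    {
      key = $3
      # Keep the row with the highest SEQNO for each primary key.
      # On a tie (an update split into delete + insert) the insert wins.
      if (!(key in seq) || $2 + 0 > seq[key] + 0 ||
          ($2 + 0 == seq[key] + 0 && $1 == "I")) {
        seq[key] = $2; op[key] = $1; row[key] = $0
      }
    }
    END {
      # A key whose latest operation is a delete is left out entirely.
      for (key in row) if (op[key] == "I") print row[key]
    }' | sort -t, -k3,3n
}
```

Given an insert, a later delete and re-insert for key 1, and an insert followed by a delete for key 2, only the latest insert for key 1 survives.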
Because heterogeneous replication does not replicate SQL statements, including DDL statements that would normally define and generate the table structures, a different method must be used to create those table structures on the target.
Tungsten Replicator includes a tool called ddlscan which can read the schema definition from MySQL and translate that into the schema definition required on the target database. During the process, differences in supported sizes and datatypes are identified and either modified to a suitable value, or highlighted as a definition that must be changed in the generated DDL.
Once this modified form of the DDL has been completed, it can then be executed against the target database to create the tables required for Tungsten Replicator to apply data. The same basic method is used in batch loading environments where a staging table is required, with the additional staging columns added to the DDL automatically.
For MongoDB or Kafka, where no explicit DDL needs to be generated, the use of ddlscan is not required.
The following sections outline the steps to configure the replicator for extraction. Each section covers the basic configuration to deploy an extractor in each of the deployment models (Onboard or Offboard) regardless of target database type.
To complete the deployment, after preparing the basic extractor configuration, follow the steps outlined in Chapter 4, Deploying Appliers appropriate to the target database type for your deployment.
Before installing Tungsten Replicator there are a number of steps that need to be completed to prepare the hosts.
First, ensure you have followed the general notes within Section B.3, “Host Configuration”. For supported platforms and environments, see Section B.1, “Requirements”.
If configuring extraction from MySQL instances hosted on your own hardware, or, for example, on EC2 instances, follow the MySQL specific pre-requisites within Section B.4, “MySQL Database Setup”.
If configuring extraction from Amazon RDS or Amazon Aurora, also follow the pre-requisites within Section B.4, “MySQL Database Setup”, paying specific attention to Section B.4.6, “MySQL Unprivileged Users”.
For more detail on changing parameters within Amazon AWS, see Section 3.3.1, “Changing Amazon RDS/Aurora Instance Configurations”
A pre-requisite checklist is available to download and can be used to ensure your environment is ready for installation. See Section B.5, “Prerequisite Checklist”
Primary/Replica is the simplest and most straightforward of all replication scenarios, and also the basis of all other types of topology. The fundamental basis for the Primary/Replica topology is that changes in the Primary are distributed and applied to each of the configured Replicas.
This deployment style can be used against the following sources
MySQL Community Edition
MySQL Enterprise Edition
Percona MySQL
MariaDB
Google Cloud MySQL
This deployment assumes full access to the host, including access to Binary Logs, therefore this deployment style is not suitable for RDS or Aurora extraction. For these sources, see Section 3.3, “Deploying an Extractor for Amazon Aurora”
tpm includes a specific topology structure for the basic Primary/Replica configuration, using the list of hosts and the Primary host definition to define the Primary/Replica relationship. Before starting the installation, the prerequisites must have been completed (see Appendix B, Prerequisites). To create a Primary/Replica using tpm:
There are two types of installation, either via a Staging Install, or via an INI file install. To understand the differences between these two installation methods, see Section 9.1, “Comparing Staging and INI tpm Methods”.
Regardless of which installation method you choose, the steps are the same, and are outlined below, using the appropriate example configuration based on your deployment style.
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Install the Tungsten Replicator package (see Section 2.1.2, “Using the RPM package files”), or download the compressed tarball and unpack it, either on the source host, or on the staging host:
shell> cd /opt/continuent/software
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change to the Tungsten Replicator staging directory:
shell> cd tungsten-replicator-7.0.3-141
Onboard Installation
Configure the replicator for extraction from a locally installed and configured MySQL Installation (In this example, the service name is alpha)
Staging method:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --profile-script=~/.bash_profile \
    --mysql-allow-intensive-checks=true \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=localhost \
    --members=localhost \
    --enable-heterogeneous-service=true \
    --replication-port=3306 \
    --replication-user=tungsten_alpha \
    --replication-password=secret \
    --datasource-mysql-conf=/etc/my.cnf
INI method:
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
profile-script=~/.bash_profile
mysql-allow-intensive-checks=true
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=localhost
members=localhost
enable-heterogeneous-service=true
replication-port=3306
replication-user=tungsten_alpha
replication-password=secret
datasource-mysql-conf=/etc/my.cnf
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--user=tungsten
user=tungsten
System User
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--mysql-allow-intensive-checks=true
mysql-allow-intensive-checks=true
For MySQL installation, enables detailed checks on the supported data types within the MySQL database to confirm compatibility. This includes checking each table definition individually for any unsupported data types.
Configuration group alpha
The description of each of the options is shown below:
--master=localhost
master=localhost
The hostname of the primary (extractor) within the current service.
--members=localhost
members=localhost
Hostnames for the dataservice members
--enable-heterogeneous-service=true
enable-heterogeneous-service=true
On a Primary:
--mysql-use-bytes-for-string is set to false.
colnames filter is enabled (in the binlog-to-q stage) to add column names to the THL information.
pkey filter is enabled (in the binlog-to-q and q-to-dbms stages), with the addPkeyToInserts and addColumnsToDeletes filter options set to false.
enumtostring filter is enabled (in the q-to-thl stage), to translate ENUM values to their string equivalents.
settostring filter is enabled (in the q-to-thl stage), to translate SET values to their string equivalents.
On a Replica:
--mysql-use-bytes-for-string is set to true.
--replication-port=3306
replication-port=3306
The network port used to connect to the database server. The default port used depends on the database being configured.
--replication-user=tungsten_alpha
replication-user=tungsten_alpha
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=secret
replication-password=secret
The password to be used when connecting to the database using the corresponding --replication-user.
--datasource-mysql-conf=/etc/my.cnf
datasource-mysql-conf=/etc/my.cnf
MySQL config file
In the above example, --datasource-mysql-conf is optional and can be used if the MySQL configuration file cannot be located by tpm, or is in a non-default location.
Offboard Installation
Configure the replicator for extraction from a remotely installed and configured MySQL Installation (In this example, the service name is alpha)
In the example below, the server offboardhost is the host that the Replicator is installed upon, and the server dbhost is the database host from which the events are extracted.
Staging method:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --profile-script=~/.bash_profile \
    --mysql-allow-intensive-checks=true \
    --skip-validation-check=MySQLAvailableCheck \
    --skip-validation-check=MySQLConfFile \
    --skip-validation-check=RowBasedBinaryLoggingCheck \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=offboardhost \
    --members=offboardhost \
    --enable-heterogeneous-service=true \
    --privileged-master=true \
    --replication-host=dbhost \
    --replication-port=3306 \
    --replication-user=tungsten_alpha \
    --replication-password=secret
INI method:
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
profile-script=~/.bash_profile
mysql-allow-intensive-checks=true
skip-validation-check=MySQLAvailableCheck
skip-validation-check=MySQLConfFile
skip-validation-check=RowBasedBinaryLoggingCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=offboardhost
members=offboardhost
enable-heterogeneous-service=true
privileged-master=true
replication-host=dbhost
replication-port=3306
replication-user=tungsten_alpha
replication-password=secret
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--user=tungsten
user=tungsten
System User
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--mysql-allow-intensive-checks=true
mysql-allow-intensive-checks=true
For MySQL installation, enables detailed checks on the supported data types within the MySQL database to confirm compatibility. This includes checking each table definition individually for any unsupported data types.
--skip-validation-check=MySQLAvailableCheck
skip-validation-check=MySQLAvailableCheck
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
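Finding the check name in an error message can be scripted. The sketch below assumes that every relevant line has the same shape as the single example above, with the check name as the last parenthesised token on an ERROR line. The function name skip_options_from_errors is hypothetical. As the text notes, skipping a check may lead to a non-working configuration, so the output is a list of candidates to review, not options to apply blindly.

```shell
# Hypothetical helper: read tpm validation output on stdin and print a
# --skip-validation-check option for the check named at the end of each ERROR line.
skip_options_from_errors() {
  # Capture the last "(Name)" token on lines containing "ERROR >> ".
  sed -n 's/^.*ERROR >> .*(\([A-Za-z0-9][A-Za-z0-9]*\))[^()]*$/--skip-validation-check=\1/p'
}
```

Fed the example error line above, the function prints `--skip-validation-check=MySQLDefaultTableTypeCheck`.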
--skip-validation-check=MySQLConfFile
skip-validation-check=MySQLConfFile
--skip-validation-check=RowBasedBinaryLoggingCheck
skip-validation-check=RowBasedBinaryLoggingCheck
These two options skip the named checks in the same way as described for --skip-validation-check=MySQLAvailableCheck above.
Configuration group alpha
The description of each of the options is shown below:
--master=offboardhost
master=offboardhost
The hostname of the primary (extractor) within the current service.
--members=offboardhost
members=offboardhost
Hostnames for the dataservice members
--enable-heterogeneous-service=true
enable-heterogeneous-service=true
On a Primary:
--mysql-use-bytes-for-string is set to false.
colnames filter is enabled (in the binlog-to-q stage) to add column names to the THL information.
pkey filter is enabled (in the binlog-to-q and q-to-dbms stages), with the addPkeyToInserts and addColumnsToDeletes filter options set to false.
enumtostring filter is enabled (in the q-to-thl stage), to translate ENUM values to their string equivalents.
settostring filter is enabled (in the q-to-thl stage), to translate SET values to their string equivalents.
On a Replica:
--mysql-use-bytes-for-string is set to true.
--privileged-master=true
privileged-master=true
Does the login for the Primary database service have superuser privileges
--replication-host=dbhost
replication-host=dbhost
Hostname of the datasource where the database is located. If the specified hostname matches the current host or member name, the database is assumed to be local. If the hostnames do not match, extraction is assumed to be via remote access. For MySQL hosts, this configures a remote replication Replica (relay) connection.
--replication-port=3306
replication-port=3306
The network port used to connect to the database server. The default port used depends on the database being configured.
--replication-user=tungsten_alpha
replication-user=tungsten_alpha
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=secret
replication-password=secret
The password to be used when connecting to the database using the corresponding --replication-user.
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
In both of the above examples, enable-heterogeneous-service is only required if the target applier is NOT a MySQL database.
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the installation has been completed, you can now proceed to configure the Applier service following the relevant step within Chapter 4, Deploying Appliers.
Following installation of the applier, the services can be started. For information on starting and stopping Tungsten Replicator, see Section 2.4, “Starting and Stopping Tungsten Replicator”; for configuring init scripts to start up and shut down when the system boots and shuts down, see Section 2.5, “Configuring Startup on Boot”.
For information on checking the running service, see Section 3.2.1, “Monitoring the MySQL Extractor”.
Once the service has been started, a quick view of the service status can be determined using trepctl:
shell> trepctl services
Processing services command...
NAME VALUE
---- -----
appliedLastSeqno: 3593
appliedLatency : 1.074
role : master
serviceName : alpha
serviceType : local
started : true
state : ONLINE
Finished services command...
The key fields are:
appliedLastSeqno and appliedLatency indicate the global transaction ID and latency of the host. These are important when monitoring the status of the cluster to determine how up to date a host is and whether a specific transaction has been applied.
role indicates the current role of the host within the scope of this dataservice.
state shows the current status of the host within the scope of this dataservice.
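As a minimal sketch of how these key fields might be checked from a script, the following shell function reads output in the format shown above and reports whether the service is ONLINE and within a latency threshold. The function name check_service and the threshold argument are illustrative assumptions and are not part of Tungsten Replicator.

```shell
#!/bin/sh
# check_service: hypothetical helper, not part of Tungsten Replicator.
# Reads `trepctl services` style output on stdin.
# $1 = maximum acceptable appliedLatency in seconds (an assumed threshold).
check_service() {
    awk -v max="$1" -F':' '
        {
            # Strip whitespace from the field name and trim the value
            gsub(/[ \t]/, "", $1)
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        $1 == "state"          { state = val }
        $1 == "appliedLatency" { latency = val }
        END {
            if (state != "ONLINE")     { print "NOT ONLINE: " state; exit 1 }
            if (latency + 0 > max + 0) { print "LAGGING: " latency; exit 1 }
            print "OK"
        }'
}

# Usage (assumes trepctl is in PATH):
#   trepctl services | check_service 10
```

The function returns a non-zero exit status when the state is not ONLINE or the latency exceeds the threshold, so it could be called from a monitoring job.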
More detailed status information can also be obtained. On the Extractor:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000009:0000000000001033;0
appliedLastSeqno : 3593
appliedLatency : 1.074
channels : 1
clusterName : default
currentEventId : mysql-bin.000009:0000000000001033
currentTimeMillis : 1373615598598
dataServerHost : host1
extensions :
latestEpochNumber : 3589
masterConnectUri :
masterListenUri : thl://host1:2112/
maximumStoredSeqNo : 3593
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host1:3306/
relativeLatency : 604904.598
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host1
state : ONLINE
timeInStateSeconds : 604903.621
transitioningTo :
uptimeSeconds : 1202137.328
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
For more information on using trepctl, see Section 8.20, “The trepctl Command”.
Definitions of the individual field descriptions in the above example output can be found in Section E.2, “Generated Field Reference”.
For more information on managing and operating your replicator installation, see Chapter 7, Operations Guide.
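When scripting against the detailed status output, individual fields can be extracted by name. The following is a sketch only; status_field is a hypothetical helper name. It splits each line on the first colon, because values such as masterListenUri and appliedLastEventId contain colons themselves.

```shell
#!/bin/sh
# status_field: hypothetical helper, not part of Tungsten Replicator.
# Prints the value of the field named in $1 from `trepctl status` style
# output read on stdin.
status_field() {
    awk -v want="$1" '
        {
            pos = index($0, ":")          # first colon separates name and value
            if (pos == 0) next            # skip header and footer lines
            name = substr($0, 1, pos - 1)
            gsub(/[ \t]/, "", name)
            if (name == want) {
                val = substr($0, pos + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val
                exit
            }
        }'
}

# Usage (assumes trepctl is in PATH):
#   trepctl status | status_field state
```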
Replicating from Amazon Aurora operates by directly accessing the binary log provided by Aurora, and enables you to take advantage of Amazon Web Services, either replicating from the remote Aurora instance, or to a standard EC2 instance within AWS. The complexity with Aurora is that there is no access to the host that is running the instance, or to the MySQL binary log files on that host.
To use this service, two aspects of the Tungsten Replicator are required: direct mode and unprivileged user support. Direct mode reads the MySQL binary log over the network, rather than accessing the binlog on the filesystem. The unprivileged mode enables the user to access and update information within Aurora without requiring SUPER privileges, which are unavailable within an Aurora instance. For more information, see Section B.4.6, “MySQL Unprivileged Users”.
The deployment requires a host for the extractor installation; this can be an EC2 instance within your AWS environment, or a remote host in your own environment.
This deployment follows a similar model to an Offboard Installation.
Before starting the installation, the prerequisites must have been completed (see Appendix B, Prerequisites) both on the host designated for the installation of the extractor and within the source database instance.
There are two types of installation: a Staging install or an INI file install.
To understand the differences between these two installation methods, see Section 9.1, “Comparing Staging and INI tpm Methods”.
Regardless of which installation method you choose, the steps are the same, and are outlined below.
Install the Tungsten Replicator package (see Section 2.1.2, “Using the RPM package files”), or download the compressed tarball and unpack it, either on the source host, or on the staging host:
shell> cd /opt/continuent/software
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change to the Tungsten Replicator staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the replicator for extraction (In this example, the service name is alpha)
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --profile-script=~/.bash_profile \
    --mysql-allow-intensive-checks=true \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --skip-validation-check=MySQLPermissionsCheck \
    --skip-validation-check=MySQLBinaryLogsEnabledCheck \
    --skip-validation-check=MySQLMyISAMCheck \
    --skip-validation-check=RowBasedBinaryLoggingCheck \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=localhost \
    --members=localhost \
    --enable-heterogeneous-service=true \
    --privileged-master=false \
    --replication-host=rds.endpoint.url \
    --replication-port=3306 \
    --replication-user=tungsten_alpha \
    --replication-password=secret \
    --datasource-mysql-conf=/dev/null \
    --svc-extractor-filters=dropcatalogdata \
    --property=replicator.service.comments=true
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
profile-script=~/.bash_profile
mysql-allow-intensive-checks=true
skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=MySQLPermissionsCheck
skip-validation-check=MySQLBinaryLogsEnabledCheck
skip-validation-check=MySQLMyISAMCheck
skip-validation-check=RowBasedBinaryLoggingCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=localhost
members=localhost
enable-heterogeneous-service=true
privileged-master=false
replication-host=rds.endpoint.url
replication-port=3306
replication-user=tungsten_alpha
replication-password=secret
datasource-mysql-conf=/dev/null
svc-extractor-filters=dropcatalogdata
property=replicator.service.comments=true
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
System User
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--mysql-allow-intensive-checks=true
mysql-allow-intensive-checks=true
For MySQL installation, enables detailed checks on the supported data types within the MySQL database to confirm compatibility. This includes checking each table definition individually for any unsupported data types.
--skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=InstallerMasterSlaveCheck
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
--skip-validation-check=MySQLPermissionsCheck
skip-validation-check=MySQLPermissionsCheck
--skip-validation-check=MySQLBinaryLogsEnabledCheck
skip-validation-check=MySQLBinaryLogsEnabledCheck
--skip-validation-check=MySQLMyISAMCheck
skip-validation-check=MySQLMyISAMCheck
--skip-validation-check=RowBasedBinaryLoggingCheck
skip-validation-check=RowBasedBinaryLoggingCheck
Configuration group alpha
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
--enable-heterogeneous-service=true
enable-heterogeneous-service=true
On a Primary:
--mysql-use-bytes-for-string is set to false.
The colnames filter is enabled (in the binlog-to-q stage) to add column names to the THL information.
The pkey filter is enabled (in the binlog-to-q and q-to-dbms stages), with the addPkeyToInserts and addColumnsToDeletes filter options set to false.
The enumtostring filter is enabled (in the q-to-thl stage) to translate ENUM values to their string equivalents.
The settostring filter is enabled (in the q-to-thl stage) to translate SET values to their string equivalents.
On a Replica:
--mysql-use-bytes-for-string is set to true.
Does the login for the Primary database service have superuser privileges
--replication-host=rds.endpoint.url
replication-host=rds.endpoint.url
Hostname of the datasource where the database is located. If the specified hostname matches the current host or member name, the database is assumed to be local. If the hostnames do not match, extraction is assumed to be via remote access. For MySQL hosts, this configures a remote replication Replica (relay) connection.
The network port used to connect to the database server. The default port used depends on the database being configured.
--replication-user=tungsten_alpha
replication-user=tungsten_alpha
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
The password to be used when connecting to the database using the corresponding --replication-user.
--datasource-mysql-conf=/dev/null
datasource-mysql-conf=/dev/null
MySQL config file
--svc-extractor-filters=dropcatalogdata
svc-extractor-filters=dropcatalogdata
Replication service extractor filters
--property=replicator.service.comments=true
property=replicator.service.comments=true
The --property option enables you to explicitly set property values in the target files. A number of different models are supported:
key=value
Set the property defined by key to the specified value without evaluating any template values or other rules.
key+=value
Add the value to the property defined by key. Template values and other options append their settings to the end of the specified property.
key~=/match/replace/
Evaluate any template values and other settings, and then perform the specified Ruby regex operation to the property defined by key. For example, --property=replicator.key~=/(.*)/somevalue,\1/ will prepend somevalue before the template value for replicator.key.
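The effect of the key~=/match/replace/ model can be illustrated outside of tpm. The sketch below emulates the documented expression with sed rather than Ruby, so it is an approximation of the behaviour and not how tpm itself evaluates the property; templatevalue is a placeholder standing in for whatever the template would produce for replicator.key.

```shell
#!/bin/sh
# Emulation only: tpm applies a Ruby regex; sed -E is used here to show
# the same substitution. "templatevalue" is a hypothetical template result.
template_value="templatevalue"

# /(.*)/somevalue,\1/ captures the whole template value and prepends "somevalue,"
result=$(printf '%s\n' "$template_value" | sed -E 's/(.*)/somevalue,\1/')

echo "$result"    # prints: somevalue,templatevalue
```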
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
In the above examples, enable-heterogeneous-service is only required if the target applier is NOT a MySQL database.
datasource-mysql-conf needs to be set as shown, because we do not have access to the my.cnf file.
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the installation has been completed, you can now proceed to configure the Applier service following the relevant step within Chapter 4, Deploying Appliers.
Following installation of the applier, the services can be started. For information on starting and stopping Tungsten Replicator, see Section 2.4, “Starting and Stopping Tungsten Replicator”; for configuring init scripts to start up and shut down when the system boots and shuts down, see Section 2.5, “Configuring Startup on Boot”.
Monitoring the extractor is the same as monitoring a MySQL extractor; for more information, see Section 3.2.1, “Monitoring the MySQL Extractor”.
The configuration of RDS and Aurora instances can be modified to change the parameters for MySQL instances, the Amazon equivalent of modifying the my.cnf file.
These steps can be used for changing the configuration for RDS instances only. See Section 3.3.1.2, “Changing Amazon Aurora Parameters using AWS Console” for steps to change Aurora parameters.
The parameters can be set internally by connecting to the instance and using the configuration function within the instance. For example:
mysql> call mysql.rds_set_configuration('binlog retention hours', 48);
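The same call can be issued non-interactively through the mysql command-line client. The following is a sketch under stated assumptions: retention_sql is a hypothetical helper that only builds the statement shown above for a given number of hours, and the host and user in the usage comment are the placeholder values from the configuration examples in this chapter.

```shell
#!/bin/sh
# retention_sql: hypothetical helper that prints the SQL statement for
# setting the binlog retention period. $1 = retention in whole hours.
retention_sql() {
    case "$1" in
        ''|*[!0-9]*)
            echo "hours must be a whole number" >&2
            return 1
            ;;
    esac
    printf "call mysql.rds_set_configuration('binlog retention hours', %s);\n" "$1"
}

# Usage (placeholder host and user; the client prompts for the password):
#   retention_sql 48 | mysql -h rds.endpoint.url -u tungsten_alpha -p
```

Validating the argument before building the statement avoids sending a malformed call to the instance.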
An RDS command-line interface is available which enables modifying these parameters. To enable the command-line interface:
shell> wget http://s3.amazonaws.com/rds-downloads/RDSCli.zip
shell> unzip RDSCli.zip
shell> export AWS_RDS_HOME=/home/tungsten/RDSCli-1.13.002
shell> export PATH=$PATH:$AWS_RDS_HOME/bin
The current RDS instances can be listed by using rds-describe-db-instances:
shell> rds-describe-db-instances --region=us-east-1
To change parameters, a new parameter group must be created, and then applied to a running instance or instances before restarting the instance:
Create a new custom parameter group:
shell> rds-create-db-parameter-group repgroup \
    -d 'Parameter group for DB Replicas' -f mysql5.1
Where repgroup is the replicator group name.
Set the new parameter value:
shell> rds-modify-db-parameter-group repgroup \
    --parameters \
    "name=max_allowed_packet,value=67108864, method=immediate"
Apply the parameter group to your instance:
shell> rds-modify-db-instance instancename \
    --db-parameter-group-name=repgroup
Where instancename is the name given to your instance.
Restart the instance:
shell> rds-reboot-db-instance instancename
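The four steps above can be sketched as a single script. This is an illustration only: apply_parameter and run are hypothetical helper names, the commands and options are those shown in the steps above, and by default the script runs in a dry-run mode that prints each command instead of executing it.

```shell
#!/bin/sh
# Hypothetical wrapper around the four RDS CLI steps shown above.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # display only; shell quoting is not preserved
    else
        "$@"
    fi
}

# $1 = parameter group, $2 = instance name, $3 = parameter, $4 = value
apply_parameter() {
    group=$1
    instance=$2
    name=$3
    value=$4
    # Each step runs only if the previous one succeeded
    run rds-create-db-parameter-group "$group" \
        -d "Parameter group for DB Replicas" -f mysql5.1 &&
    run rds-modify-db-parameter-group "$group" \
        --parameters "name=$name,value=$value, method=immediate" &&
    run rds-modify-db-instance "$instance" \
        --db-parameter-group-name="$group" &&
    run rds-reboot-db-instance "$instance"
}

# Usage:
#   apply_parameter repgroup instancename max_allowed_packet 67108864
```

Chaining the steps with && stops the sequence at the first failure, so the instance is not rebooted if the parameter group could not be created or applied.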
To change the parameters for Aurora instances, follow these guidelines using the AWS Console:
Log in to the AWS Console using your account credentials and navigate to the RDS Dashboard. From here, select "Parameter Groups" from the left-hand list.
Select the "Create Parameter Group" button at the top right.
This dialog allows you to create a new parameter group using an existing one as a template. Select the appropriate template to use and complete the rest of the details. You need to create both a DB Parameter Group and a DB Cluster Parameter Group.
Now that you have the two groups, you can modify the parameters accordingly by selecting the group in the list and then selecting the "Edit" option.
Now that the groups are set up, you can assign them to existing Aurora instances, or you can assign them during instance creation. If you are assigning them to existing instances, you may need to restart the instance for certain parameters to take effect.
Some parameters, such as enabling binary logging, can only be set via the DB Cluster Parameter Group; others can only be changed in the DB Parameter Group.
If you have an existing cluster and you want to replicate the data out to a separate standalone server using Tungsten Replicator then you can create a cluster alias, and use a Primary/Replica topology to replicate from the cluster. This allows for THL events from the cluster to be applied to a separate server for the purposes of backup or separate analysis.
During the installation process a cluster-alias and cluster-slave are declared. The cluster-alias describes all of the servers in the cluster and how they may be reached. The cluster-slave defines one or more servers that will replicate from the cluster.
The Tungsten Replicator will be installed on the Cluster-Extractor server. That server will download THL data and apply it to the local server. If the Cluster-Extractor has more than one server, one of them will be declared the relay (or Primary). The other members of the Cluster-Extractor may also download THL data from that server.
If the relay for the Cluster-Extractor fails, the other nodes will automatically start downloading THL data from a server in the cluster. If a non-relay server fails, it will not have any impact on the other members.
Identify the cluster to replicate from. You will need the Primary, Replicas and THL port (if specified). Use tpm reverse from a cluster member to find the correct values.
If you are replicating to a non-MySQL server, update the configuration of the cluster to include the following properties prior to beginning.
svc-extractor-filters=colnames,pkey
property=replicator.filter.pkey.addColumnsToDeletes=true
property=replicator.filter.pkey.addPkeyToInserts=true
Identify all servers that will replicate from the cluster. If there is more than one, a relay server should be identified to replicate from the cluster and provide THL data to other servers.
Prepare each server according to the prerequisites for the DBMS platform it is serving. If you are working with multiple DBMS platforms; treat each platform as a different Cluster-Extractor during deployment.
Make sure the THL port for the cluster is open between all servers.
Install the Tungsten Replicator package or download the Tungsten Replicator tarball, and unpack it:
shell> cd /opt/continuent/software
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change to the unpacked directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the replicator
shell> ./tools/tpm configure defaults \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --replication-password=secret \
    --replication-port=13306 \
    --replication-user=tungsten \
    --user=tungsten \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=host1 \
    --slaves=host2,host3 \
    --thl-port=2112 \
    --topology=cluster-alias
shell> ./tools/tpm configure beta \
    --relay=host6 \
    --relay-source=alpha \
    --topology=cluster-slave
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
profile-script=~/.bash_profile
replication-password=secret
replication-port=13306
replication-user=tungsten
user=tungsten
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=host1
slaves=host2,host3
thl-port=2112
topology=cluster-alias
[beta]
relay=host6
relay-source=alpha
topology=cluster-slave
Configuration group defaults
The description of each of the options is shown below:
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
The password to be used when connecting to the database using the corresponding --replication-user.
The network port used to connect to the database server. The default port used depends on the database being configured.
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
System User
Configuration group alpha
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
What are the Replicas for this dataservice?
Port to use for THL Operations
Replication topology for the dataservice.
Configuration group beta
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
Dataservice name to use as a relay source
Replication topology for the dataservice.
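In this configuration the relay-source of the cluster-slave service refers to the cluster-alias service by name. As a sketch only, the following hypothetical helper (check_relay_source is not part of tpm) reads an INI file in the format shown above and reports any relay-source that does not name a section with topology=cluster-alias. It assumes one key=value pair per line with no spaces around the equals sign.

```shell
#!/bin/sh
# check_relay_source: hypothetical helper, not part of tpm.
# $1 = path to an INI file such as /etc/tungsten/tungsten.ini
check_relay_source() {
    awk '
        # Remember the current [section] name
        /^\[.*\]$/ { section = substr($0, 2, length($0) - 2); next }
        {
            split($0, kv, "=")
            if (kv[1] == "topology" && kv[2] == "cluster-alias") aliases[section] = 1
            if (kv[1] == "relay-source") sources[section] = kv[2]
        }
        END {
            status = 0
            for (s in sources)
                if (!(sources[s] in aliases)) {
                    print "[" s "] relay-source=" sources[s] " has no cluster-alias section"
                    status = 1
                }
            exit status
        }' "$1"
}

# Usage:
#   check_relay_source /etc/tungsten/tungsten.ini
```

The check is performed after the whole file has been read, so the order of the sections in the file does not matter.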
If you are replicating to a non-MySQL server, include the following steps in your configuration.
shell> mkdir -p /opt/continuent/share/
shell> cp tungsten-replicator/support/filters-config/convertstringfrommysql.json /opt/continuent/share/
Then, include the following parameters in the configuration:
property=replicator.stage.remote-to-thl.filters=convertstringfrommysql
property=replicator.filter.convertstringfrommysql.definitionsFile=/opt/continuent/share/convertstringfrommysql.json
The cluster-alias dataservice name MUST be the same as the cluster dataservice name that you are replicating from.
Do not include start-and-report=true if you are taking over for MySQL native replication. See Section 7.10.1, “Migrating from MySQL Native Replication 'In-Place'” for next steps after completing installation.
Once the configuration has been completed, you can perform the installation to set up the services using this configuration:
shell> ./tools/tpm install
During the installation and startup, tpm will notify you of any problems that need to be fixed before the service can be correctly installed and started. If the service starts correctly, you should see the configuration and current status of the service.
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
The cluster should be installed and ready to use.
The following sections outline the steps to configure the replicator for applying into your target of choice. Each section covers the basic configuration to deploy an applier in each of the deployment models (Onboard or Offboard).
Before preparing the applier configuration, follow the steps outlined in Chapter 3, Deploying MySQL Extractors to configure the extractor.
Deploying the MySQL applier is the most straightforward of deployments. This section covers configuration of the applier into all releases of MySQL, including Amazon RDS, Amazon Aurora, Google Cloud SQL and Microsoft Azure.
Service Alpha on host1 extracts the information from the MySQL binary log into THL.
Service Alpha reads the information from the remote replicator as THL, and applies that to the target MySQL instance via a JDBC Connector.
The Applier replicator can be installed on:
A host with write access to the target database host
An EC2 Host with write access to the target Instance
The same host as the target database
The same host as the extractor (See Section 5.3, “Deploying Multiple Replicators on a Single Host”)
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
MySQL Target
Applies to:
Standalone hosted instances
EC2 hosted instances
Google Cloud hosted instances
Microsoft Azure hosted instances
To prepare the target MySQL Database, ensure the user accounts are created as per the steps outlined in Section B.4.5, “MySQL User Configuration”
Amazon RDS/Amazon Aurora Target
For Amazon based targets, as we do not have access to the host, nor can we configure accounts with elevated privileges, follow the steps in Section B.4.6, “MySQL Unprivileged Users” to prepare the target for replication
The data replicated from MySQL can be any data, although there are some known limitations and assumptions made on the way the information is transferred.
Table format should be updated to UTF8 by updating the MySQL configuration (my.cnf):
character-set-server=utf8
collation-server=utf8_general_ci
To prevent timezone configuration storing zone-adjusted values and exporting this information to the binary log and Amazon RDS, fix the timezone configuration to use UTC within the configuration file (my.cnf):
default-time-zone='+00:00'
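Whether a given my.cnf already carries these settings can be checked with a short script. The following is a minimal sketch; check_mycnf is a hypothetical helper, and it assumes each setting appears on its own line exactly as written above, with no spaces around the equals sign.

```shell
#!/bin/sh
# check_mycnf: hypothetical helper; $1 is the path to a my.cnf file.
# Prints each expected setting that is missing and returns non-zero
# if any are absent.
check_mycnf() {
    missing=0
    for setting in \
        "character-set-server=utf8" \
        "collation-server=utf8_general_ci" \
        "default-time-zone='+00:00'"
    do
        # -F fixed string, -x match the whole line, -q no output
        if ! grep -Fxq -- "$setting" "$1"; then
            echo "missing: $setting"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the path is an example; use the location of your own my.cnf):
#   check_mycnf /etc/my.cnf
```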
If your target is an Amazon RDS or Aurora instance that has not yet been created, follow the steps in Section 4.1.2, “Prepare Amazon RDS/Amazon Aurora”.
If your target is a hosted MySQL environment, proceed to Section 4.1.3, “Install MySQL Applier”
Create the Amazon Instance
If the instance does not already exist, create the Amazon RDS or Amazon Aurora instance and take a note of the endpoint URL reported. This information will be required when configuring the replicator service.
Also take a note of the user and password used for connecting to the instance.
Check your security group configuration.
The host used as the Target for applying changes to the Amazon instance must have been added to the security groups. Within Amazon RDS and Aurora, security groups configure the hosts that are allowed to connect to the Amazon instance, and hence update information within the database. The configuration must include the IP address of the Applier replicator, whether that host is within Amazon EC2 or external.
Change RDS/Aurora instance properties
Depending on the configuration and data to be replicated, the parameters of the running instance may need to be modified. For example, the max_allowed_packet parameter may need to be increased.
For more information on changing parameters, see Section 3.3.1, “Changing Amazon RDS/Aurora Instance Configurations”.
The applier will read information from the Extractor and write database changes into the target instance.
The process for configuring the Applier replicator is the same for local MySQL, remote MySQL, and Amazon RDS/Aurora targets, with a slightly different configuration for each, as outlined below:
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Use the appropriate template config for your target
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
The replicators can now be started using the replicator command.
The status of the replicator can be checked and monitored by using the trepctl command.
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --mysql-allow-intensive-checks=true \
    --profile-script=~/.bash_profile \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost,sourcehost \
    --datasource-type=mysql \
    --replication-user=tungsten \
    --replication-password=secret \
    --replication-host=remotedbhost
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
mysql-allow-intensive-checks=true
profile-script=~/.bash_profile
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=sourcehost
members=localhost,sourcehost
datasource-type=mysql
replication-user=tungsten
replication-password=secret
replication-host=remotedbhost
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
System User
--mysql-allow-intensive-checks=true
mysql-allow-intensive-checks=true
For MySQL installation, enables detailed checks on the supported data types within the MySQL database to confirm compatibility. This includes checking each table definition individually for any unsupported data types.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
Configuration group alpha
The description of each of the options is shown below:
--master=sourcehost
master=sourcehost
The hostname of the primary (extractor) within the current service.
--members=localhost,sourcehost
members=localhost,sourcehost
Hostnames for the dataservice members.
--datasource-type=mysql
datasource-type=mysql
Database type.
--replication-user=tungsten
replication-user=tungsten
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=secret
replication-password=secret
The password to be used when connecting to the database using the corresponding --replication-user.
--replication-host=remotedbhost
replication-host=remotedbhost
Hostname of the datasource where the database is located. If the specified hostname matches the current host or member name, the database is assumed to be local. If the hostnames do not match, extraction is assumed to be via remote access. For MySQL hosts, this configures a remote replication Replica (relay) connection.
replication-host should only be added to the above configuration if the target MySQL database is on a different host from the applier installation.
Amazon RDS and Amazon Aurora Targets
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --mysql-allow-intensive-checks=true \
    --profile-script=~/.bash_profile \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --skip-validation-check=MySQLPermissionsCheck \
    --skip-validation-check=MySQLBinaryLogsEnabledCheck \
    --skip-validation-check=MySQLMyISAMCheck \
    --skip-validation-check=RowBasedBinaryLoggingCheck \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost,sourcehost \
    --datasource-type=mysql \
    --datasource-mysql-conf=/dev/null \
    --replication-user=rdsuser \
    --replication-password=secret \
    --privileged-slave=false \
    --replication-host=rds-endpoint-url \
    --service-type=remote
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
mysql-allow-intensive-checks=true
profile-script=~/.bash_profile
skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=MySQLPermissionsCheck
skip-validation-check=MySQLBinaryLogsEnabledCheck
skip-validation-check=MySQLMyISAMCheck
skip-validation-check=RowBasedBinaryLoggingCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret

[alpha]
master=sourcehost
members=localhost,sourcehost
datasource-type=mysql
datasource-mysql-conf=/dev/null
replication-user=rdsuser
replication-password=secret
privileged-slave=false
replication-host=rds-endpoint-url
service-type=remote
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--user=tungsten
user=tungsten
System user.
--mysql-allow-intensive-checks=true
mysql-allow-intensive-checks=true
For MySQL installation, enables detailed checks on the supported data types within the MySQL database to confirm compatibility. This includes checking each table definition individually for any unsupported data types.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=InstallerMasterSlaveCheck
--skip-validation-check=MySQLPermissionsCheck
skip-validation-check=MySQLPermissionsCheck
--skip-validation-check=MySQLBinaryLogsEnabledCheck
skip-validation-check=MySQLBinaryLogsEnabledCheck
--skip-validation-check=MySQLMyISAMCheck
skip-validation-check=MySQLMyISAMCheck
--skip-validation-check=RowBasedBinaryLoggingCheck
skip-validation-check=RowBasedBinaryLoggingCheck
The following description applies to each of the checks listed above.
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) » uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
Configuration group alpha
The description of each of the options is shown below:
--master=sourcehost
master=sourcehost
The hostname of the primary (extractor) within the current service.
--members=localhost,sourcehost
members=localhost,sourcehost
Hostnames for the dataservice members.
--datasource-type=mysql
datasource-type=mysql
Database type.
--datasource-mysql-conf=/dev/null
datasource-mysql-conf=/dev/null
MySQL config file.
--replication-user=rdsuser
replication-user=rdsuser
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=secret
replication-password=secret
The password to be used when connecting to the database using the corresponding --replication-user.
--privileged-slave=false
privileged-slave=false
Whether the login for the Replica database service has superuser privileges.
--replication-host=rds-endpoint-url
replication-host=rds-endpoint-url
Hostname of the datasource where the database is located. If the specified hostname matches the current host or member name, the database is assumed to be local. If the hostnames do not match, extraction is assumed to be via remote access. For MySQL hosts, this configures a remote replication Replica (relay) connection.
--service-type=remote
service-type=remote
The replication service type.
Replication to MySQL and Amazon based instances operates in the same manner as all other replication environments. The current status can be monitored using trepctl. On the Extractor:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000043:0000000000000291;84
appliedLastSeqno : 2320
appliedLatency : 0.733
channels : 1
clusterName : alpha
currentEventId : mysql-bin.000043:0000000000000291
currentTimeMillis : 1387544952494
dataServerHost : host1
extensions :
host : host1
latestEpochNumber : 60
masterConnectUri : thl://localhost:/
masterListenUri : thl://host1:2112/
maximumStoredSeqNo : 2320
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host1:13306/
relativeLatency : 23.494
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host1
state : ONLINE
timeInStateSeconds : 99525.477
transitioningTo :
uptimeSeconds : 99527.364
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
On the Applier, use trepctl and monitor the appliedLatency and appliedLastSeqno. The output will include the hostname of the Amazon RDS instance:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000043:0000000000000291;84
appliedLastSeqno : 2320
appliedLatency : 797.615
channels : 1
clusterName : default
currentEventId : NONE
currentTimeMillis : 1387545785268
dataServerHost : documentationtest.cnlhon44f2wq.eu-west-1.rds.amazonaws.com
extensions :
host : documentationtest.cnlhon44f2wq.eu-west-1.rds.amazonaws.com
latestEpochNumber : 60
masterConnectUri : thl://host1:2112/
masterListenUri : thl://host2:2112/
maximumStoredSeqNo : 2320
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://host1:2112/
relativeLatency : 856.268
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : documentationtest.cnlhon44f2wq.eu-west-1.rds.amazonaws.com
state : ONLINE
timeInStateSeconds : 461.885
transitioningTo :
uptimeSeconds : 668.606
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
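When monitoring is scripted, the status listing can be reduced to the few fields that matter. The following Python sketch parses text in the format shown above and checks state and appliedLatency. The function names and the latency threshold are illustrative assumptions, not part of trepctl.

```python
def parse_trepctl_status(text):
    """Turn 'name : value' lines from trepctl status output into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only: values such as thl://host1:2112/ contain colons.
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip banner lines, headers, and anything without a single-word field name.
        if not sep or not name or " " in name:
            continue
        status[name] = value.strip()
    return status


def is_healthy(status, max_latency=60.0):
    """Assumed health rule: replicator ONLINE and appliedLatency within a threshold."""
    latency = float(status.get("appliedLatency", "inf"))
    return status.get("state") == "ONLINE" and latency <= max_latency


sample = """\
Processing status command...
NAME VALUE
---- -----
appliedLastSeqno : 2320
appliedLatency : 797.615
masterConnectUri : thl://host1:2112/
role : slave
state : ONLINE
Finished status command...
"""

status = parse_trepctl_status(sample)
print(status["appliedLastSeqno"], status["appliedLatency"], is_healthy(status, max_latency=900.0))
```

A suitable threshold depends on the deployment; the value used here is chosen only so that the sample passes.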
Amazon Redshift is a cloud-based data warehouse service that integrates with other Amazon services, such as S3, to provide an SQL-like interface to the loaded data. Replication for Amazon Redshift moves data from MySQL datastores, through S3, and into the Redshift environment in real-time, avoiding the need to manually export and import the data.
Replication to Amazon Redshift operates as follows:
Data is extracted from the source database into THL.
When extracting the data from the THL, the Amazon Redshift replicator writes the data into CSV files according to the name of the source tables. The files contain all of the row-based data, including the global transaction ID generated by the extractor during replication, and the operation type (insert, delete, etc) as part of the CSV data.
The generated CSV files are loaded into Amazon S3 using either the s3cmd command or the aws s3 cli tools. This enables easy access to your Amazon S3 installation and simplifies the loading.
The CSV data is loaded from S3 into Redshift staging tables using the Redshift COPY command, which imports raw CSV into Redshift tables.
SQL statements are then executed within Redshift to apply the batch-loaded CSV data to the live version of the tables. To work within the constraints of Amazon Redshift, updates are performed by deleting the old rows and inserting the new data.
Setting up replication requires setting up both the Extractor and Applier components as two different configurations, one for MySQL and the other for Amazon Redshift. Replication also requires some additional steps to ensure that the Amazon Redshift host is ready to accept the replicated data that has been extracted. Tungsten Replicator provides all the tools required to perform these operations during the installation and setup.
The Redshift applier makes use of the JavaScript based batch loading system (see Section 5.6.4, “JavaScript Batchloader Scripts”). This constructs change data from the source database. The change data is then loaded into staging tables, at which point a process merges the change data into the base tables. A summary of this basic structure can be seen in Figure 4.3, “Topologies: Redshift Replication Operation”.
Different object types within the two systems are mapped as follows:
The full replication of information operates as follows:
Data is extracted from the source database using the standard extractor, for example by reading the row change data from the binlog in MySQL.
The ColumnName filter (see Section 11.4.5, “ColumnName Filter”) is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.
The PrimaryKey filter (see Section 11.4.32, “PrimaryKey Filter”) is used to extract primary key data from the source tables.
On the Applier replicator, the THL data is read and written into batch-files in the character-separated value format.
The information in these files is change data, and contains not only the original row values from the source tables, but also metadata about the operation performed (i.e. INSERT, DELETE or UPDATE), and the primary key for each table. All UPDATE statements are recorded as a DELETE of the existing data, and an INSERT of the new data.
In addition to these core operation types, the batch applier can also be configured to record UPDATE operations that result in INSERT or DELETE rows. This enables Redshift to process the update information more simply than performing the individual DELETE and INSERT operations.
A second process uses the CSV stage data and any existing data, to build a materialized view that mirrors the source table data structure.
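The merge step in the list above can be modelled in a few lines. The following Python sketch is a conceptual illustration only, not the SQL that the Redshift batch loader executes: it applies staged change rows, tagged with the operation codes I, D, UI and UD used in this section, to a base table held in memory. The function name and the tuple layout are hypothetical.

```python
def apply_staging_rows(base, staging_rows):
    """Apply staged change rows to a base table held as {primary_key: row}.

    Each staging row is a tuple (opcode, seqno, primary_key, row_values).
    """
    def sort_key(row):
        opcode, seqno = row[0], row[1]
        # Order by seqno; within one seqno apply deletes before inserts, so an
        # UPDATE recorded as DELETE + INSERT leaves the new values in place.
        return (seqno, 0 if opcode in ("D", "UD") else 1)

    for opcode, _seqno, pkey, values in sorted(staging_rows, key=sort_key):
        if opcode in ("D", "UD"):
            base.pop(pkey, None)
        elif opcode in ("I", "UI"):
            base[pkey] = values
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return base


base = {3: ("#1 Single", "2006")}
staging = [
    ("I", 5, 4, ("Second", "2007")),      # new row inserted
    ("UD", 6, 3, ("#1 Single", "2006")),  # UPDATE: old values removed
    ("UI", 6, 3, ("#1 Single", "2008")),  # UPDATE: new values inserted
    ("D", 7, 4, ("Second", "2007")),      # row deleted again
]
print(apply_staging_rows(base, staging))
```

Applying deletes before inserts within a sequence number is what makes the DELETE plus INSERT representation of an UPDATE safe to replay.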
The staging files created by the replicator are in a specific format that incorporates change and operation information in addition to the original row data.
The format of the files is a character separated values file, with each row separated by a newline, and individual fields separated by the character 0x01. This is supported by Hive as a native value separator.
Each row in the file consists of metadata describing the operation, the sequence number, the primary key and the commit time, followed by the full row data extracted from the Source.
Operation | Sequence No | Table-specific primary key | DateTime | Table-columns... |
---|---|---|---|---|
OPTYPE | SEQNO that generated this row | PRIMARYKEY | DATETIME of source table commit | |
The operation field will match one of the following values:
Operation | Description | Notes |
---|---|---|
I | Row is an INSERT of new data | |
D | Row is a DELETE of existing data | |
UI | Row is an UPDATE which caused INSERT of data | |
UD | Row is an UPDATE which caused DELETE of data | |
For example, the MySQL row from an INSERT of:
| 3 | #1 Single | 2006 | Cats and Dogs (#1.4) |
Is represented within the CSV staging files generated as:
"I","5","3","2014-07-31 14:29:17.000","3","#1 Single","2006","Cats and Dogs (#1.4)"
The character separator, and whether to use quoting, are configurable within the replicator when it is deployed. For Redshift, the default behavior is to generate quoted and comma separated fields.
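To make the layout concrete, the following Python sketch splits the quoted, comma-separated example row above into its metadata fields and the table columns. It assumes the default Redshift format with four leading metadata fields, named here after the table headings in this section; the function name and dictionary keys are illustrative.

```python
import csv
import io


def parse_staging_line(line, n_meta=4):
    """Split one quoted, comma-separated staging row into metadata and row data."""
    fields = next(csv.reader(io.StringIO(line)))
    opcode, seqno, pkey, commit_ts = fields[:n_meta]
    return {
        "opcode": opcode,
        "seqno": int(seqno),
        "primary_key": pkey,
        "commit_timestamp": commit_ts,
        # Everything after the metadata is the row as stored in the source table.
        "columns": fields[n_meta:],
    }


line = '"I","5","3","2014-07-31 14:29:17.000","3","#1 Single","2006","Cats and Dogs (#1.4)"'
row = parse_staging_line(line)
print(row["opcode"], row["seqno"], row["columns"])
```

If the separator or quoting has been changed in the replicator configuration, the csv.reader call would need the matching delimiter and quoting arguments.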
Preparing the hosts for the replication process requires setting some key configuration parameters within the MySQL server to ensure that data is stored and written correctly. On the Amazon Redshift side, the database and schema must be created using the existing schema definition so that the databases and tables exist within Amazon Redshift.
Source Host
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
The following are required for replication to Amazon Redshift:
On the Amazon Redshift host, you need to perform some preparation of the destination database, first creating the database, and then creating the tables that are to be replicated. Setting up this process requires the configuration of a number of components outside of Tungsten Replicator in order to support the loading.
An existing Amazon Web Services (AWS) account, and either the AWS Access Key and Secret Key, or configured IAM Roles, required to interact with the account through the API. For information on creating IAM Roles, see Section 4.2.2.2, “Configuring Identity Access Management within AWS”
A configured Amazon S3 service. If the S3 service has not already been configured, visit the AWS console and sign up for the Amazon S3 service.
The s3cmd or the aws tools installed and configured. The s3cmd can be downloaded from s3cmd on s3tools.org.
If using s3cmd, configure it to connect to the Amazon S3 service automatically, without requiring further authentication. The .s3cfg file in the tungsten user's home directory should be configured as follows:
Using Access Keys:
[default]
access_key = ACCESS_KEY
secret_key = SECRET_KEY
Using IAM Roles (leave the values blank and copy the example as is):
[default]
access_key =
secret_key =
security_token =
Create an S3 bucket that will be used to hold the CSV files that are generated by the replicator. This can be achieved either through the web interface, or via the command-line, for example:
shell> s3cmd mb s3://tungsten-csv
A running Redshift instance must be available, and the port and IP address of the Tungsten Cluster that will be replicating into Redshift must have been added to the Redshift instance security credentials.
Make a note of the user and password that has been provided with access to the Redshift instance, as these will be needed when installing the applier. Also make a note of the Redshift instance address, as this will need to be provided to the applier configuration.
Create an s3-config-servicename.json file based on the sample provided within cluster-home/samples/conf/s3-config-servicename.json within the Tungsten Replicator staging directory, or using the example below.
Once created, the file will be copied into the /opt/continuent/share directory to be used by the batch applier script.
If multiple services are being created, one file must be created for each service.
The following example shows the use of Access and Secret Keys:
{
  "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
  "awsAccessKey" : "access-key-id",
  "awsSecretKey" : "secret-access-key",
  "cleanUpS3Files" : "true"
}
The following example shows the use of IAM Roles:
{
  "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
  "awsIAMRole" : "arn:iam-role",
  "cleanUpS3Files" : "true"
}
The allowed options for this file are as follows:
awsS3Path: the location within your S3 storage where files should be loaded.
awsAccessKey: the S3 access key to access your S3 storage. Not required if awsIAMRole is used.
awsSecretKey: the S3 secret key associated with the Access Key. Not required if awsIAMRole is used.
awsIAMRole: the IAM role configured to allow Redshift to interact with S3. Not required if awsAccessKey and awsSecretKey are in use.
multiServiceTarget (true/false): indicates whether there are multiple appliers writing into the single Redshift target, for example when the source is Tungsten Cluster Composite Active/Active or a Tungsten Replicator Fan-In Topology. (Default: false)
singleLockTable (true/false): indicates the table lock behaviour when multiServiceTarget is true. Ignored if multiServiceTarget is set to false. (Default: true)
lockTablePrefix: the prefix for the lock tables when singleLockTable is false. (Default: lock_xxx_)
s3Binary: the binary to use for loading CSV files up to S3. (Valid values: s3cmd, s4cmd, aws) (Default: s3cmd)
redshiftCopyOptions: allows additional valid syntax to be added to the Redshift COPY command during CSV loading from S3 into Redshift staging tables. A list of valid parameters can be found in the Redshift documentation.
cleanUpS3Files: a boolean value used to identify whether the CSV files loaded into S3 should be deleted after they have been imported and merged. If set to true, the files are automatically deleted once the files have been successfully imported into the Redshift staging tables. If set to false, files are not automatically removed.
gzipS3Files: setting to true will result in the CSV files being gzipped prior to loading into S3. (Default: false)
storeCDCIn: a definition table that stores the change data from the load, in addition to importing to staging and base tables. The {schema} and {table} variables will be automatically replaced with the corresponding schema and table name. For more information on keeping CDC information, see Section 4.2.5, “Keeping CDC Information”.
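When one file is needed per service, generating the files from a script can reduce copy-and-paste errors. The following Python sketch builds the JSON content from the options described above. The function name is hypothetical, and the rule that exactly one credential style (key pair or IAM role) must be supplied is an assumption made for this sketch rather than documented replicator behaviour.

```python
import json


def build_s3_config(s3_path, access_key=None, secret_key=None, iam_role=None,
                    clean_up=True):
    """Build the content of an s3-config-servicename.json file as a dict."""
    using_keys = access_key is not None and secret_key is not None
    # Assumption for this sketch: require exactly one credential style.
    if using_keys == (iam_role is not None):
        raise ValueError("supply either an access/secret key pair or an IAM role")
    config = {"awsS3Path": s3_path}
    if using_keys:
        config["awsAccessKey"] = access_key
        config["awsSecretKey"] = secret_key
    else:
        config["awsIAMRole"] = iam_role
    # The examples above store booleans as the strings "true"/"false".
    config["cleanUpS3Files"] = "true" if clean_up else "false"
    return config


config = build_s3_config("s3://your-bucket-for-redshift/redshift-test",
                         iam_role="arn:iam-role")
print(json.dumps(config, indent=2))
```

The printed JSON can be saved under the service-specific file name, with one file written per service.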
Identity Management within AWS is complex, but it is a useful and secure way of restricting how services interact with each other, and of restricting user access to the AWS platform.
Tungsten Replicator for Redshift requires a certain level of interaction between the replicator and S3, and between Redshift and S3.
All versions up to and including Tungsten Replicator version 6.0 can utilise IAM Roles for uploading the CSV files to S3. However, for loading the data from S3 into Redshift, the only option in those versions is to use Access and Secret Keys.
Tungsten Replicator version 6.1 onwards will also allow for the use of IAM Roles for loading data from S3 into Redshift.
To use IAM Roles with Tungsten Replicator you will need to create two roles, with the following recommended policies:
To allow CSV files to be loaded up to S3:
Role should be associated with the AWS Service: EC2
AWS Defined Policy Name: AmazonS3FullAccess, or
Define and create your own policy, with, at minimum, the ability to write to the bucket you intend to use for the Redshift Applier
Associate this role to the EC2 instance running the Tungsten Replicator software
For use by Redshift COPY command to load csv into staging tables:
Role should be associated with the AWS Service: Redshift
AWS Defined Policy Name: AmazonS3FullAccess, or
Define and create your own policy, with, at minimum, the ability to read from the bucket you intend to use for the Redshift Applier
Associate this role to the Redshift Cluster.
For more details and full instructions on creating and managing IAM roles, review the AWS documentation.
In order for the data to be written into the Redshift tables, the tables must be generated. Tungsten Replicator does not replicate DDL statements between the source and applier in heterogeneous deployments, due to differences in the format of the DDL statements. The supplied ddlscan tool can translate the DDL from the source database into suitable DDL for the target database.
For each database being replicated, DDL must be generated twice, once for the staging tables where the change data is loaded, and again for the live tables. To generate the necessary DDL:
To generate the staging table DDL, ddlscan must be executed on the Extractor host. After the replicator has been installed, ddlscan can automatically pick up the configuration to connect to the host, or it can be specified on the command line:
On the source host for each database that is being replicated, run ddlscan using the ddl-mysql-redshift-staging.vm template:
shell> ddlscan -db test -template ddl-mysql-redshift-staging.vm
DROP TABLE stage_xxx_test.stage_xxx_msg;
CREATE TABLE stage_xxx_test.stage_xxx_msg
(
tungsten_opcode CHAR(2),
tungsten_seqno INT,
tungsten_row_id INT,
tungsten_commit_timestamp TIMESTAMP,
id INT,
msg CHAR(80),
PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
);
Check the output to ensure that no errors have been generated during the process. These may indicate datatype limitations that should be identified before continuing. The generated output should be captured and then executed on the Redshift host to create the table.
Once the staging tables have been created, execute ddlscan again using the base table template, ddl-mysql-redshift.vm:
shell> ddlscan -db test -template ddl-mysql-redshift.vm
DROP TABLE test.msg;
CREATE TABLE test.msg
(
id INT,
msg CHAR(80),
PRIMARY KEY (id)
);
Once again, check the output for errors, then capture the output and execute the generated DDL against the Redshift instance.
The DDL templates translate datatypes as directly as possible, with the following caveats:
The length of MySQL VARCHAR columns is quadrupled, because MySQL counts characters, while Redshift counts bytes.
There is no TIME datatype in Redshift; instead, TIME columns are converted to VARCHAR(17).
Primary keys from MySQL are applied into Redshift where possible.
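The first two caveats can be expressed as a small mapping function. The following Python sketch is illustrative only and is not how the ddlscan templates are implemented; the function name is hypothetical, and all other types are passed through unchanged for simplicity.

```python
import re


def redshift_type(mysql_type):
    """Map a MySQL column type to a Redshift type per the caveats listed above."""
    normalized = mysql_type.strip().upper()
    match = re.fullmatch(r"VARCHAR\((\d+)\)", normalized)
    if match:
        # MySQL counts characters, Redshift counts bytes: quadruple the length.
        return f"VARCHAR({int(match.group(1)) * 4})"
    if normalized == "TIME":
        # Redshift has no TIME datatype in this mapping; store it as text.
        return "VARCHAR(17)"
    # Everything else is passed through unchanged in this sketch.
    return normalized


for mysql_type in ("VARCHAR(80)", "TIME", "INT"):
    print(mysql_type, "->", redshift_type(mysql_type))
```

Checking a handful of column types this way before running the generated DDL can help to spot columns whose converted length is larger than expected.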
Once the DDL has been generated within the Redshift instance, the replicator will be ready to be installed.
The features outlined in this section were specifically introduced in Tungsten Replicator 6.1.4.
Redshift only supports a SERIALIZABLE transaction isolation level, which differs from relational databases like MySQL, which is REPEATABLE READ by default. Isolation levels determine the behaviour of the database for concurrent access to the tables within transactions.
When loading data into Redshift from multiple appliers, this isolation level can cause locking issues that would manifest as errors in the Replicator Log similar to the following:
Detail: Serializable isolation violation on table - 150379, transactions forming the cycle are: 2356786, 2356787 » (pid:17914) (../../tungsten-replicator//appliers/batch/redshift.js#219)
In some cases, the replicator will simply retry and carry on successfully, but on very busy systems this can sometimes cause the replicator to fall back into an OFFLINE:ERROR state, and manual intervention would be required.
To overcome this problem, the first step is to ensure that each applier has its own set of staging tables that the CSV files are loaded into. By default, all staging tables are named with the prefix stage_xxx_.
To generate the staging tables, you would typically run ddlscan with a command similar to the following:
shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/ \
    -db hr -template ddl-mysql-redshift-staging.vm > staging.sql
To change the default prefix of the staging table, for example to stage_nyc_, provide the option to the ddlscan command as follows:
shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/ \
    -db hr -template ddl-mysql-redshift-staging.vm -opt tablePrefix stage_nyc_ > staging.sql
You would need to execute this for each applier, changing the prefix accordingly. Once this has been executed and the tables have been built in Redshift, you will then need to add the additional property to each applier to instruct which staging tables to use. The property should be added to the tungsten.ini file and a tpm update issued:
property=replicator.applier.dbms.stageTablePrefix=stage_nyc_
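With several appliers, the ddlscan command and the matching property line must stay in step for every prefix. The following Python sketch prints both for a list of prefixes, reusing the connection details from the example commands above. The function name, the second prefix (stage_lon_) and the per-prefix output file names are illustrative assumptions.

```python
import shlex


def ddlscan_staging_command(prefix, db="hr", url="jdbc:mysql:thin://db01:3306/",
                            user="tungsten", password="secret"):
    """Build the ddlscan command line for one applier's staging table prefix."""
    args = [
        "ddlscan", "-user", user, "-pass", password, "-url", url,
        "-db", db, "-template", "ddl-mysql-redshift-staging.vm",
        "-opt", "tablePrefix", prefix,
    ]
    # One output file per prefix (an assumed naming scheme for this sketch).
    outfile = f"staging_{prefix.strip('_')}.sql"
    return " ".join(shlex.quote(arg) for arg in args) + f" > {outfile}"


for prefix in ["stage_nyc_", "stage_lon_"]:
    print(ddlscan_staging_command(prefix))
    print(f"property=replicator.applier.dbms.stageTablePrefix={prefix}")
```

Each printed command would be run on the source host, and each property line added to the tungsten.ini of the corresponding applier.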
The first and easiest step to try to overcome the isolation errors is to increase the batch commit levels and the batch commit interval. Each system works differently, so there is no simple calculation to find the right level. These values should be adjusted in small increments to find the right balance for your system.
Within your configuration, adjust the following two parameters:
svc-block-commit-size
svc-block-commit-interval
Within the redshift applier, it is possible to introduce table locking. This will enable multiple appliers to process their own THL and load the transactions without impacting, or being impacted by, other appliers.
This configuration should only be used when multiple appliers are in use. The addition of table locking could introduce latency in applying to Redshift on extremely busy systems, and it could also prevent client applications from reading the tables, due to Redshift's isolation level. To avoid this, table locking should be combined with an increase in the block commit size and block commit interval properties mentioned above.
There are two table locking approaches; which one is better for you depends upon your environment.
Single Lock Table: This approach should be used for appliers in extremely busy systems where a block-commit-size of 500000 or greater does not eliminate isolation errors and where multiple tables are updated within each transaction.
One Lock Table per Base Table: This approach should be used for appliers in less busy systems, or where parallel apply has been enabled within the applier, regardless of system activity levels.
To enable the single lock table approach:
The following option should be added to the s3-config-servicename.json file:
"multiServiceTarget": "true"
Connect to Redshift with the same account used by the applier, and using the DDL below, create the lock table:
CREATE TABLE public.tungsten_lock_table ( ID INT );
To enable the lock table per base table approach:
The following options should be added to the s3-config-servicename.json file:
"multiServiceTarget": "true",
"singleLockTable": "false"
Create a lock table for each of the base tables within Redshift. A ddlscan template can be used to generate the DDL. In the following example, the ddlscan command generates lock table DDL for all tables within the hr schema:
shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/ \
    -db hr -template ddl-mysql-redshift-lock.vm > outfile.sql
Execute the output from ddlscan against Redshift.
After enabling either of the above methods, if replication has already been installed, restart the replicator by issuing the following:
shell> replicator restart
Replication into Redshift requires two separate replicator installations, one that extracts information from the source database, and a second that generates the CSV files, loads those files into S3 and then executes the statements on the Redshift database to import the CSV data and apply the transformations to build the final tables.
The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.
Once you have completed the configuration of the Amazon Redshift database, you can configure and install the applier as described using the steps below.
Before installing the applier, apply the following parameter to the extractor configuration.
Add the following to /etc/tungsten/tungsten.ini:
[alpha]
...Existing Replicator Config...
enable-heterogeneous-service=true
shell> tpm update
The above step is only applicable for standalone extractors. If you are configuring replications from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”
The applier can now be configured. Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --user=tungsten \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --topology=master-slave \
    --master=sourcehost \
    --members=localhost \
    --datasource-type=redshift \
    --replication-host=redshift.us-east-1.redshift.amazonaws.com \
    --replication-user=awsRedshiftUser \
    --replication-password=awsRedshiftPass \
    --redshift-dbname=dev \
    --batch-enabled=true \
    --batch-load-template=redshift \
    --svc-applier-filters=dropstatementdata \
    --svc-applier-block-commit-interval=30s \
    --svc-applier-block-commit-size=250000
shell> vi /etc/tungsten/tungsten.ini
[defaults]
user=tungsten
install-directory=/opt/continuent
profile-script=~/.bash_profile
rest-api-admin-user=apiuser
rest-api-admin-pass=secret

[alpha]
topology=master-slave
master=sourcehost
members=localhost
datasource-type=redshift
replication-host=redshift.us-east-1.redshift.amazonaws.com
replication-user=awsRedshiftUser
replication-password=awsRedshiftPass
redshift-dbname=dev
batch-enabled=true
batch-load-template=redshift
svc-applier-filters=dropstatementdata
svc-applier-block-commit-interval=30s
svc-applier-block-commit-size=250000
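The staging (command-line) and INI forms above carry the same settings: each --option=value flag corresponds to an option=value line in the matching section. The Python sketch below illustrates that mapping. The function flags_to_ini is a hypothetical helper, not a tpm feature, and it assumes that --reset is dropped because it applies only to staging configurations.

```python
def flags_to_ini(section, flags):
    """Convert tpm command-line flags into the text of an INI section."""
    lines = ["[{}]".format(section)]
    for flag in flags:
        if flag == "--reset":
            # Staging-only option; it has no INI counterpart in the example.
            continue
        # Drop the leading "--" to obtain the INI form of the option.
        lines.append(flag[2:] if flag.startswith("--") else flag)
    return "\n".join(lines)
```

For example, passing the section name defaults with the flags --reset, --user=tungsten and --install-directory=/opt/continuent produces a [defaults] section containing the user and install-directory lines.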
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
System User
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
Configuration group alpha
The description of each of the options is shown below:
Replication topology for the dataservice.
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
Database type
--replication-host=redshift.us-east-1.redshift.amazonaws.com
replication-host=redshift.us-east-1.redshift.amazonaws.com
Hostname of the datasource where the database is located. If the specified hostname matches the current host or member name, the database is assumed to be local. If the hostnames do not match, extraction is assumed to be via remote access. For MySQL hosts, this configures a remote replication Replica (relay) connection.
--replication-user=awsRedshiftUser
replication-user=awsRedshiftUser
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=awsRedshiftPass
replication-password=awsRedshiftPass
The password to be used when connecting to the database using the corresponding --replication-user.
Name of the Redshift database to replicate into
Should the replicator service use a batch applier
--batch-load-template=redshift
Value for the loadBatchTemplate property
--svc-applier-filters=dropstatementdata
svc-applier-filters=dropstatementdata
Replication service applier filters
--svc-applier-block-commit-interval=30s
svc-applier-block-commit-interval=30s
Minimum interval between commits
--svc-applier-block-commit-size=250000
svc-applier-block-commit-size=250000
Applier block commit size (min 1)
If your MySQL source is a Tungsten Cluster, ensure the additional steps below are also included in your applier configuration.
First, prepare the required filter configuration file as follows on the Redshift applier host(s) only:
shell> mkdir -p /opt/continuent/share/
shell> cp tungsten-replicator/support/filters-config/convertstringfrommysql.json /opt/continuent/share/
Then, include the following parameters in the configuration:
property=replicator.stage.remote-to-thl.filters=convertstringfrommysql
property=replicator.filter.convertstringfrommysql.definitionsFile=/opt/continuent/share/convertstringfrommysql.json
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
On the host that is loading data into Redshift, create the s3-config-servicename.json file and then copy that file into the share directory within the installed directory on that host. For example:
shell> cp s3-config-servicename.json /opt/continuent/share/
Now the services can be started:
shell> replicator start
Once the service is configured and running, the service can be monitored as normal using the trepctl command. See Section 4.2.6, “Management and Monitoring of Amazon Redshift Deployments” for more information.
Create a database within your source MySQL instance:
mysql> CREATE DATABASE redtest;
Create a table within your source MySQL instance:
mysql> CREATE TABLE redtest.msg (id INT PRIMARY KEY AUTO_INCREMENT,msg CHAR(80));
Create a schema for the tables:
redshift> CREATE SCHEMA redtest;
Create a staging table within your Redshift instance:
redshift> CREATE TABLE redtest.stage_xxx_msg (tungsten_opcode CHAR(1), \
tungsten_seqno INT, tungsten_row_id INT,tungsten_date CHAR(30),id INT,msg CHAR(80));
Create the target table:
redshift> CREATE TABLE redtest.msg (id INT,msg CHAR(80));
Insert some data within your MySQL source instance:
mysql> INSERT INTO redtest.msg VALUES (0,'First');
Query OK, 1 row affected (0.04 sec)
mysql> INSERT INTO redtest.msg VALUES (0,'Second');
Query OK, 1 row affected (0.04 sec)
mysql> INSERT INTO redtest.msg VALUES (0,'Third');
Query OK, 1 row affected (0.04 sec)
mysql> UPDATE redtest.msg SET msg = 'This is the first update of the second row' WHERE ID = 2;
Check the replicator status on the applier (host2):
shell> trepctl status
There should be 5 transactions replicated.
Check the table within Redshift:
redshift> SELECT * FROM redtest.msg;
1 First
3 Third
2 This is the first update of the second row
The Redshift applier can keep the CDC data, that is, the raw CDC CSV data that is recorded and replicated during the loading process, rather than simply cleaning up the CDC files and deleting them. The CDC data can be useful if you want to be able to monitor data changes over time.
The process works as follows:
Batch applier generates CSV files.
Batch applier loads the CSV data into the staging tables.
Batch applier loads the CSV data into the CDC tables.
Staging data is merged with the base table data.
Staging data is deleted.
Unlike the staging and base table information, the data in the CDC tables is kept forever, without removing any of the processed information. Using this data you can report on change information over time for different data sets, or even recreate datasets at a specific time by using the change information.
To enable this feature:
When creating the DDL for the staging and base tables, also create the table information for the CDC data for each table. The actual format of the information is the same as the staging table data, and can be created using ddlscan:
shell> ddlscan -service my_red -db test \
-template ddl-mysql-redshift-staging.vm \
-opt renameSchema cdc_{schema} -opt renameTable {table}_cdc
In the configuration file, s3-config-svc.json, for each service, specify the name of the table to be used when storing the CDC information using the storeCDCIn field. This should specify the table template to be used, with the schema and table name being automatically replaced by the load script. The structure should match the structure used by ddlscan to define the CDC tables:
{
"awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
"awsAccessKey" : "access-key-id",
"awsSecretKey" : "secret-access-key",
"storeCDCIn" : "cdc_{schema}.{table}_cdc"
}
Restart the replicator using replicator restart to update the configuration.
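To check that a storeCDCIn template resolves to the same names that ddlscan generated with the renameSchema and renameTable options, the substitution can be reproduced by hand. The Python sketch below is illustrative only: the function cdc_table_name is hypothetical, and it assumes the load script performs a plain textual replacement of the {schema} and {table} placeholders.

```python
def cdc_table_name(template, schema, table):
    """Expand a storeCDCIn-style template for one source table."""
    return template.replace("{schema}", schema).replace("{table}", table)
```

With the template cdc_{schema}.{table}_cdc, a source table redtest.msg would map to cdc_redtest.msg_cdc, so that table must exist in Redshift before the replicator is restarted.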
Monitoring an Amazon Redshift replication scenario requires checking the status of both the Extractor, which extracts data from MySQL, and the Applier, which retrieves the remote THL information and applies it to Amazon Redshift.
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000006:0000000000002857;-1
appliedLastSeqno : 15
appliedLatency : 1.918
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : alpha
currentEventId : mysql-bin.000006:0000000000002857
currentTimeMillis : 1407336195165
dataServerHost : redshift1
extensions :
host : redshift1
latestEpochNumber : 8
masterConnectUri : thl://localhost:/
masterListenUri : thl://redshift1:2112/
maximumStoredSeqNo : 15
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://redshift1:3306/tungsten_alpha
relativeLatency : 35.164
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : redshift1
state : ONLINE
timeInStateSeconds : 34.807
transitioningTo :
uptimeSeconds : 36.493
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
On the Applier, the output of trepctl shows the current sequence number and applier status:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000006:0000000000002857;-1
appliedLastSeqno : 15
appliedLatency : 154.748
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : alpha
currentEventId : NONE
currentTimeMillis : 1407336316454
dataServerHost : redshift.us-east-1.redshift.amazonaws.com
extensions :
host : redshift.us-east-1.redshift.amazonaws.com
latestEpochNumber : 8
masterConnectUri : thl://redshift1:2112/
masterListenUri : null
maximumStoredSeqNo : 15
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://redshift1:2112/
relativeLatency : 156.454
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : redshift.us-east-1.redshift.amazonaws.com
state : ONLINE
timeInStateSeconds : 2.28
transitioningTo :
uptimeSeconds : 524104.751
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
The appliedLastSeqno on the Extractor and the Applier should match as normal.
Because of the batching of transactions, the appliedLatency may be much higher than in a normal MySQL to MySQL replication.
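The comparison of appliedLastSeqno between the two hosts can be scripted. The Python sketch below parses captured trepctl status text into a dictionary and compares the field on both sides. The function names are hypothetical, and the sketch assumes that the output has already been collected from each host (for example, over SSH) and follows the "name : value" layout shown above.

```python
def parse_status(output):
    """Parse `trepctl status` text into a dict of field name -> value."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only; values such as event IDs and
        # thl:// URIs contain further colons.
        name, sep, value = line.partition(":")
        if not sep:
            # Banner lines ("Processing status command...") have no colon.
            continue
        fields[name.strip()] = value.strip()
    return fields


def seqnos_match(extractor_output, applier_output):
    """Return True when both replicators report the same appliedLastSeqno."""
    extractor = parse_status(extractor_output)
    applier = parse_status(applier_output)
    return extractor["appliedLastSeqno"] == applier["appliedLastSeqno"]
```

A mismatch does not necessarily indicate a fault: with batch loading, the applier may simply not yet have committed the latest block.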
The batch loading parameters controlling the batching of data can be tuned and updated by studying the output from the trepsvc.log log file. The log will show a line containing the number of rows updated:
INFO scripting.JavascriptExecutor COUNT: 4
See Section 12.1, “Block Commit” for more information on these parameters.
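When tuning, it can help to collect the COUNT values across many batches rather than reading them one at a time. The Python sketch below extracts them from log lines. The function name batch_row_counts is hypothetical, and the pattern assumes only the "JavascriptExecutor COUNT: n" text shown above, so any timestamp or thread prefix on the line is ignored.

```python
import re

# Matches the row-count message shown in the log excerpt above.
COUNT_PATTERN = re.compile(r"JavascriptExecutor COUNT: (\d+)")


def batch_row_counts(log_lines):
    """Return the row count reported for each batch, in log order."""
    counts = []
    for line in log_lines:
        match = COUNT_PATTERN.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts
```

Consistently small counts relative to the configured block commit size may suggest that the commit interval, rather than the commit size, is triggering each batch.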
Hewlett-Packard's Vertica provides support for BigData, SQL-based analysis and processing. Integration with MySQL enables data to be replicated live from the MySQL database directly into Vertica without the need to manually export and import the data.
Replication to Vertica operates as follows:
Data is extracted from the source database into THL.
When extracting the data from the THL, the Vertica replicator writes the data into CSV files according to the name of the source tables. The files contain all of the row-based data, including the global transaction ID generated by Tungsten Replicator during replication, and the operation type (insert, delete, etc) as part of the CSV data.
The CSV data is then loaded into Vertica into staging tables.
SQL statements are then executed to perform updates on the live version of the tables using the batch-loaded CSV information, deleting old rows and inserting the new data when performing updates, to work effectively within the confines of Vertica operation.
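The delete-then-insert merge described in the steps above can be illustrated with a small in-memory model. This Python sketch is not the SQL that the applier executes. It assumes that each staged row carries an opcode of 'I' (insert) or 'D' (delete), that an update arrives as a delete followed by an insert, and that rows are applied in sequence number and row ID order. The function name and tuple layout are hypothetical.

```python
def merge_staging(base, staging_rows):
    """Apply staged changes to a copy of the base table.

    base maps primary key -> row value; each staging row is a tuple of
    (opcode, seqno, row_id, key, row).
    """
    result = dict(base)
    # Apply changes in the order they were committed on the source.
    for opcode, _seqno, _row_id, key, row in sorted(
        staging_rows, key=lambda r: (r[1], r[2])
    ):
        if opcode == "D":
            result.pop(key, None)
        elif opcode == "I":
            result[key] = row
        else:
            raise ValueError("unknown opcode: {}".format(opcode))
    return result
```

Three inserts followed by an update of the second row leave three rows in the base table, with the second row holding the updated value.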
Setting up replication requires setting up both the Extractor and Applier components as two different configurations, one for MySQL and the other for Vertica. Replication also requires some additional steps to ensure that the Vertica host is ready to accept the replicated data that has been extracted. Tungsten Replicator includes all the tools required to perform these operations during the installation and setup.
Preparing the hosts for the replication process requires setting some key configuration parameters within the MySQL server to ensure that data is stored and written correctly. On the Vertica side, the database and schema must be created using the existing schema definition so that the databases and tables exist within Vertica.
Source Host
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
Vertica Host
On the Vertica host, you need to perform some preparation of the destination database, first creating the database, and then creating the tables that are to be replicated.
Create a database (if you want to use a different one than those already configured), and a schema that will contain the Tungsten data about the current replication position:
shell> vsql -Udbadmin -wsecret bigdata
Welcome to vsql, the Vertica Analytic Database v5.1.1-0 interactive terminal.
Type: \h for help with SQL commands
\? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
bigdata=> create schema tungsten_alpha;
The schema will be used only by Tungsten Replicator to store metadata about the replication process.
Locate the Vertica JDBC driver. This can be downloaded separately from the Vertica website. The driver will need to be copied into the Tungsten Replicator lib directory.
shell> cp vertica-jdbc-7.1.2-0.jar tungsten-replicator-7.0.3-141/tungsten-replicator/lib/
You need to create tables within Vertica according to the databases and tables that need to be replicated; the tables are not automatically created for you. From a Tungsten Replicator deployment directory, the ddlscan command can be used to identify the existing tables, and create table definitions for use within Vertica.
To use ddlscan, the template for Vertica must be specified, along with the user/password information to connect to the source database to collect the schema definitions. The tool should be run from the templates directory.
The tool will need to be executed twice; the first run generates the live table definitions:
shell> cd tungsten-replicator-7.0.3-141
shell> cd tungsten-replicator/samples/extensions/velocity/
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/access_log' -pass password \
    -template ddl-mysql-vertica.vm -db access_log
/*
SQL generated on Fri Sep 06 14:37:40 BST 2013 by ./ddlscan utility of Tungsten
url = jdbc:mysql:thin://host1:13306/access_log
user = tungsten
dbName = access_log
*/
CREATE SCHEMA access_log;
DROP TABLE access_log.access_log;
CREATE TABLE access_log.access_log
(
id INT ,
userid INT ,
datetime INT ,
session CHAR(30) ,
operation CHAR(80) ,
opdata CHAR(80) ) ORDER BY id;
...
The output should be redirected to a file and then used to create tables within Vertica:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/access_log' -pass password \
-template ddl-mysql-vertica.vm -db access_log >access_log.ddl
The output of the command should be checked to ensure that the table definitions are correct.
The file can then be applied to Vertica:
shell> cat access_log.ddl | vsql -Udbadmin -wsecret bigdata
This generates the table definitions for live data. The process should be repeated to create the table definitions for the staging data by using the staging template:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/access_log' -pass password \
-template ddl-mysql-vertica-staging.vm -db access_log >access_log.ddl-staging
Then applied to Vertica:
shell> cat access_log.ddl-staging | vsql -Udbadmin -wsecret bigdata
The process should be repeated for each database that will be replicated.
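Because two ddlscan runs are needed per database, the command lines are easy to generate in a loop. The Python sketch below only builds the argument lists and output file names; it does not run anything. The function name ddlscan_commands is hypothetical, and the host, port and credentials passed in are placeholders matching the examples above.

```python
def ddlscan_commands(databases, user, password, host, port):
    """Build (argv, output_file) pairs for the live and staging templates."""
    templates = (
        ("ddl-mysql-vertica.vm", "ddl"),
        ("ddl-mysql-vertica-staging.vm", "ddl-staging"),
    )
    commands = []
    for db in databases:
        url = "jdbc:mysql:thin://{}:{}/{}".format(host, port, db)
        for template, suffix in templates:
            argv = [
                "ddlscan", "-user", user, "-url", url, "-pass", password,
                "-template", template, "-db", db,
            ]
            # Output is redirected to <db>.ddl or <db>.ddl-staging.
            commands.append((argv, "{}.{}".format(db, suffix)))
    return commands
```

Each argv list could then be passed to subprocess.run with stdout redirected to the paired file name, and the resulting files reviewed before being applied to Vertica with vsql.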
Once the MySQL and Vertica databases have been prepared, you can proceed to installing Tungsten Replicator.
Before installing the applier, apply the following parameter to the extractor configuration.
Add the following to /etc/tungsten/tungsten.ini:
[alpha]
...Existing Replicator Config...
enable-heterogeneous-service=true
shell> tpm update
The above step is only applicable for standalone extractors. If you are configuring replications from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”
The applier can now be configured.
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Locate the Vertica JDBC driver. This can be downloaded separately from the Vertica website. The driver will need to be copied into the Tungsten Replicator lib directory.
shell> cp vertica-jdbc-7.1.2-0.jar tungsten-replicator-7.0.3-141/tungsten-replicator/lib/
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --user=tungsten \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --skip-validation-check=HostsFileCheck \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --topology=master-slave \
    --master=sourcehost \
    --members=localhost \
    --datasource-type=vertica \
    --replication-user=dbadmin \
    --replication-password=password \
    --vertica-dbname=dev \
    --batch-enabled=true \
    --batch-load-template=vertica6 \
    --batch-load-language=js \
    --replication-port=5433 \
    --svc-applier-filters=dropstatementdata \
    --svc-applier-block-commit-interval=30s \
    --svc-applier-block-commit-size=25000 \
    --disable-relay-logs=true
shell> vi /etc/tungsten/tungsten.ini
[defaults]
user=tungsten
install-directory=/opt/continuent
profile-script=~/.bash_profile
skip-validation-check=HostsFileCheck
skip-validation-check=InstallerMasterSlaveCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret

[alpha]
topology=master-slave
master=sourcehost
members=localhost
datasource-type=vertica
replication-user=dbadmin
replication-password=password
vertica-dbname=dev
batch-enabled=true
batch-load-template=vertica6
batch-load-language=js
replication-port=5433
svc-applier-filters=dropstatementdata
svc-applier-block-commit-interval=30s
svc-applier-block-commit-size=25000
disable-relay-logs=true
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
System User
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--skip-validation-check=HostsFileCheck
skip-validation-check=HostsFileCheck
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) » uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
--skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=InstallerMasterSlaveCheck
Configuration group alpha
The description of each of the options is shown below:
Replication topology for the dataservice.
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
Database type
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=password
The password to be used when connecting to the database using the corresponding --replication-user.
Name of the database to replicate into
Should the replicator service use a batch applier
--batch-load-template=vertica6
Value for the loadBatchTemplate property
Which script language to use for batch loading
The network port used to connect to the database server. The default port used depends on the database being configured.
--svc-applier-filters=dropstatementdata
svc-applier-filters=dropstatementdata
Replication service applier filters
--svc-applier-block-commit-interval=30s
svc-applier-block-commit-interval=30s
Minimum interval between commits
--svc-applier-block-commit-size=25000
svc-applier-block-commit-size=25000
Applier block commit size (min 1)
Disable the use of relay-logs?
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If you encounter problems during the installation, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the service is configured and running, the service can be monitored as normal using the trepctl command. See Section 4.3.3, “Management and Monitoring of Vertica Deployments” for more information.
Monitoring a Vertica replication scenario requires checking the status of both the Extractor - extracting data from MySQL - and the Applier which retrieves the remote THL information and applies it to Vertica.
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000012:0000000128889042;0
appliedLastSeqno : 1070
appliedLatency : 22.537
channels : 1
clusterName : alpha
currentEventId : mysql-bin.000012:0000000128889042
currentTimeMillis : 1378489888477
dataServerHost : mysqldb01
extensions :
latestEpochNumber : 897
masterConnectUri : thl://localhost:/
masterListenUri : thl://mysqldb01:2112/
maximumStoredSeqNo : 1070
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://mysqldb01:13306/
relativeLatency : 691980.477
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : mysqldb01
state : ONLINE
timeInStateSeconds : 694039.058
transitioningTo :
uptimeSeconds : 694041.81
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
On the Applier, the output of trepctl shows the current sequence number and applier status:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000012:0000000128889042;0
appliedLastSeqno : 1070
appliedLatency : 78.302
channels : 1
clusterName : default
currentEventId : NONE
currentTimeMillis : 1378479271609
dataServerHost : vertica01
extensions :
latestEpochNumber : 897
masterConnectUri : thl://mysqldb01:2112/
masterListenUri : null
maximumStoredSeqNo : 1070
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://mysqldb01:2112/
relativeLatency : 681363.609
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : vertica01
state : ONLINE
timeInStateSeconds : 681486.806
transitioningTo :
uptimeSeconds : 689922.693
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
The appliedLastSeqno on the Extractor and the Applier should match as normal.
Because of the batching of transactions, the appliedLatency may be much higher than in a normal MySQL to MySQL replication.
The following items detail some of the more common problems with replication through to Vertica. Often the underlying issue is related to the data types, the data format, or the number of columns.
If the following is reported by the replicator:
pendingError : Replicator unable to go online due to error » Operation failed: Online operation failed (Unable to prepare plugin: class » name=com.continuent.tungsten.replicator.datasource.DataSourceService » message=[Unable to load driver: com.vertica.jdbc.Driver])
state : OFFLINE:ERROR
The Vertica JDBC driver is missing from the installation. The Vertica JDBC JAR file must be placed into the tungsten-replicator/lib directory within the release directory before running tpm update or tpm install.
The following error:
pendingExceptionMessage: Invalid write to CSV file: name=/opt/continuent/tmp/staging/alpha/staging0/test-msg-1.csv » table=test.msg table_columns=schemaname,schemahash csv_columns=tungsten_opcode,tungsten_seqno, » tungsten_row_id,tungsten_commit_timestamp,nullschemaname,schemahash
This indicates that the source THL has not been marked up correctly. Either the colnames filter has not been enabled, or --enable-batch-service has not been configured during installation. This means that the source THL is not being populated with the right information: either the full list of columns is missing, or the column names and primary key information are incorrect. The configuration should be updated, and then the THL on both the Extractor and Applier should be recreated by using trepctl reset.
If you get an error similar to the following:
pendingExceptionMessage: CSV loading failed: schema=test table=msg CSV » file=/opt/continuent/tmp/staging/alpha/staging0/test-msg-1.csv » message=com.continuent.tungsten.replicator.ReplicatorException: Incoming table data » has no primary keys: test.msg » (/opt/continuent/tungsten/tungsten-replicator/appliers/batch/vertica6.js#70)
Either the pkey filter has not been enabled, or the source tables on the source database do not contain primary keys. This means that the source THL is not being populated with the primary key information from the table, which is required in order to load into Vertica through the batch mechanism. The configuration should be updated, and then the THL on both the Extractor and Applier should be recreated by using trepctl reset.
The following error indicates that the incoming data could not be loaded into the staging table within Vertica:
pendingError : Stage task failed: q-to-dbms
pendingExceptionMessage: CSV loading failed: schema=blog table=article CSV » file=/tmp/staging/alpha/staging0/blog-article-432.csv » message=com.continuent.tungsten.replicator.ReplicatorException: LOAD DATA ROW count does not match: sql=COPY blog.stage_xxx_article » FROM '/tmp/staging/alpha/staging0/blog-article-432.csv' » DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"' » expected_copy_rows=3614 rows=2233 ; exceptions are in » /tmp/tungsten_vertica_blog.article.exceptions » (../../tungsten-replicator//samples/scripts/batch/vertica6.js#67)
There are a number of possible reasons for this. The actual reasons can be found in the exceptions file that is generated; the error message contains its location. In this example, the file is /tmp/tungsten_vertica_blog.article.exceptions.
Possible reasons include:
Mismatch in the number of columns in the source file and the target table. Check the source and target tables match, including the four special fields used in all staging tables.
Mismatch in the data types of one or more of the columns in the target table. Check that the source and target table definitions match, or at least support the corresponding data; for example, check that the column size, length and format are correct. Loading character data into numeric columns, or floating point values into integer columns, is not supported.
Badly formatted CSV file. This happens when the incoming data contains newlines, commas or other data that is incompatible with the CSV format. The CSV file should have been kept; its location is also in the error message. Examine the file and check the format. You may need to enable filters to modify and 'clean' the data so that it is more compatible with the CSV format.
Remember that changes to the DDL within the source database are not automatically replicated to Vertica. Changes to the table definitions, additional tables, or additional databases, must all be updated manually within Vertica.
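A retained CSV file can be checked for column count mismatches before comparing table definitions by hand. The Python sketch below counts the fields on each row and reports rows that differ from the expected total of the four special staging fields plus the table's own columns. The function name bad_csv_rows is hypothetical, the special field names are taken from the error message shown earlier, and the comma delimiter and double-quote enclosure are assumed from the COPY statement above.

```python
import csv
import io

# The four special fields present in every staging table.
STAGING_COLUMNS = [
    "tungsten_opcode",
    "tungsten_seqno",
    "tungsten_row_id",
    "tungsten_commit_timestamp",
]


def bad_csv_rows(csv_text, table_columns):
    """Return (row_number, field_count) for rows with the wrong width."""
    expected = len(STAGING_COLUMNS) + len(table_columns)
    reader = csv.reader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) != expected
    ]
```

An empty result suggests that the column counts agree, in which case the exceptions file is more likely to point to a data type or formatting problem.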
If you get an error similar to the following when loading into Vertica:
stage_xxx_access_log does not exist
It means that the staging tables have not been created correctly. Check the steps for creating the staging tables using ddlscan in Section 4.3.1, “Preparing for Vertica Deployments”.
Replication may fail if date types contain zero values, which are legal in MySQL. For example, the timestamp 0000-00-00 00:00:00 is valid in MySQL. When the data is applied to Vertica, an error reporting a mismatch in the values is raised, for example:
ERROR 2631: Column "time" is of type timestamp but expression is of type int HINT: You will need to rewrite or cast the expression
Or:
ERROR 2992: Date/time field value out of range: "0" HINT: Perhaps you need a different "datestyle" setting
To address this error, use the zerodate2null filter, which translates zero-value dates into a valid NULL value. This can be enabled by adding the zerodate2null filter to the applier stage when configuring the service using tpm:
shell> ./tools/tpm update alpha --repl-svc-applier-filters=zerodate2null
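The effect of the translation can be pictured with a small Python sketch. This is an illustration of the idea only and is not the filter's actual implementation; the function name, the set of zero values and the column list are assumptions made for the example.

```python
# Assumed textual forms of MySQL zero dates, for illustration only.
ZERO_DATES = {"0000-00-00", "0000-00-00 00:00:00"}

def zero_dates_to_null(row, date_columns):
    """Return a copy of the row with zero-value dates replaced by None (NULL)."""
    return {
        name: (None if name in date_columns and value in ZERO_DATES else value)
        for name, value in row.items()
    }

row = {"id": "1", "time": "0000-00-00 00:00:00", "title": "0000-00-00"}
cleaned = zero_dates_to_null(row, date_columns={"time"})
print(cleaned)
```

Only the columns named as date columns are changed, so a string column that happens to contain the same text is left untouched.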
Kafka is a highly scalable messaging platform that provides a method for distributing information through a series of messages organised by a specified topic. With Tungsten Replicator the incoming stream of data from the upstream replicator is converted, on a row by row basis, into a JSON document that contains the row information. A new message is created for each row, even from multiple-row transactions.
Deploying Tungsten Replicator to apply into Kafka differs slightly from other deployments. There are two parts to the process:
Service Alpha on the Extractor extracts the information from the MySQL binary log into THL.
Service Alpha on the Applier reads the information from the remote replicator as THL, and applies it to Kafka.
With the Kafka applier, information is extracted from the source database using the row format. Column names and primary keys are identified and translated to a JSON format, which is then embedded into a larger Kafka message. The topic used is either composed from the schema name or can be configured to use an explicit topic type, and the information included in the Kafka message can include the source schema, table, and commit time.
The transfer operates as follows:
Data is extracted from MySQL using the standard extractor, reading the row change data from the binlog.
The ColumnName filter (see Section 11.4.5, “ColumnName Filter”) is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.
The PrimaryKey filter (see Section 11.4.32, “PrimaryKey Filter”) is used to add primary key information to row-based replication data.
The THL information is then applied to Kafka using the Kafka applier.
There are some additional considerations when applying to Kafka that should be taken into account:
Because Kafka is a message queue and not a database, traditional transactional semantics are not supported. This means that although the data will be applied to Kafka as messages, there is no guarantee of transactional consistency. By default the applier ensures that each message has been correctly received by the Kafka service; it is the responsibility of the Kafka environment and configuration to ensure delivery. The replicator.applier.dbms.requireacks property can be used to control how many acknowledgements must be received from the Kafka service.
One message is sent for each row of source information in each transaction. For example, if 20 rows have been inserted or updated in a single transaction, then 20 separate Kafka messages will be generated.
A separate message is broadcast for each row of each operation, and includes the operation type. So if 20 rows are deleted, 20 messages are generated, each with the operation type.
If replication fails in the middle of a large transaction and the replicator goes OFFLINE, it may resend rows and messages when it is brought back online.
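The one-message-per-row behaviour described above can be sketched in Python as follows. This is a simplified model and not the applier's code; the function name is hypothetical, and only the message fields documented in this section are used.

```python
def transaction_to_messages(seqno, schema, table, optype, rows):
    """Model how one transaction fans out into one message per affected row."""
    return [
        {
            "_seqno": str(seqno),        # every row shares the transaction's seqno
            "_source_schema": schema,
            "_source_table": table,
            "_optype": optype,           # INSERT, UPDATE or DELETE
            "record": dict(row),
        }
        for row in rows
    ]

# A single transaction deleting 20 rows yields 20 separate messages.
deleted_rows = [{"id": str(n)} for n in range(1, 21)]
messages = transaction_to_messages(4865, "test", "msg", "DELETE", deleted_rows)
print(len(messages))
```

Because rows may be resent after a failure, a consumer that needs exactly-once behaviour has to detect repeated messages itself.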
The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
In general, a row within a MySQL table is converted into a single message on the Kafka side. The topic used is made up of the schema name and table name, and the message ID is composed of the primary key information; it can optionally include the schema and table names as well.
For example, the following row within MySQL:
mysql> select * from messages where id = 99999 \G
*************************** 1. row ***************************
id: 99999
msg: Hello Kafka
1 row in set (0.00 sec)
Is replicated into Kafka as a Kafka message using the topic test_msg:
{
  "_seqno" : "4865",
  "_source_table" : "msg",
  "_committime" : "2017-07-13 15:30:37.0",
  "_source_schema" : "test",
  "record" : {
    "msg" : "Hello Kafka",
    "id" : "2384726"
  },
  "_optype" : "INSERT"
}
In the output, the record field contains the actual record data. The other fields in the message are:
_seqno — the THL sequence number of the transaction.
_source_table — the source table. Inclusion of this information is optional.
_committime — the original transaction commit time. Inclusion of this information is optional.
_source_schema — the source schema. Inclusion of this information is optional.
_optype — the operation type (INSERT, UPDATE, DELETE).
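A consumer can separate the row data from the replicator metadata with any JSON parser. The following Python sketch uses the sample message from this section; the variable names are illustrative, and the optional fields are read with a default because their inclusion depends on the applier configuration.

```python
import json

raw = """
{ "_seqno" : "4865", "_source_table" : "msg",
  "_committime" : "2017-07-13 15:30:37.0", "_source_schema" : "test",
  "record" : { "msg" : "Hello Kafka", "id" : "2384726" },
  "_optype" : "INSERT" }
"""

message = json.loads(raw)
record = message["record"]                  # the actual row data
seqno = int(message["_seqno"])              # values arrive as strings
optype = message["_optype"]
# Optional fields may be absent, so fall back to None.
source = (message.get("_source_schema"), message.get("_source_table"))
print(seqno, optype, source, record)
```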
When preparing the hosts you must be aware of this translation of the different structures, as it will have an effect on the way the information is replicated from MySQL to Kafka.
MySQL Host
The data replicated from MySQL can be any data, although there are some known limitations and assumptions made on the way the information is transferred.
When configuring the extractor database and host, ensure that the heterogeneous-specific prerequisites have been included; see Section B.4.4, “MySQL Configuration for Heterogeneous Deployments”.
For the best results when replicating, be aware of the following issues and limitations:
Use primary keys on all tables. The use of primary keys will improve the lookup of information within Kafka when rows are updated. Without a primary key on a table a full table scan is performed, which can affect performance.
MySQL TEXT columns are correctly replicated, but cannot be used as keys.
MySQL BLOB columns are converted to text using the configured character type. Depending on the data that is being stored within the BLOB, the data may need to be custom converted. A filter can be written to convert and reformat the content as required.
Kafka Host
On the Kafka side, status information is stored into the Zookeeper instance used for configuring Kafka, and the Zookeeper and Kafka instances must be up and running before the replicator is first started. There are no specific configuration elements required on the Kafka host.
Installation of the Kafka replication requires special configuration of the Extractor and Applier hosts so that each is configured for the correct datasource type.
Before installing the applier, apply the following parameter to the extractor configuration.
Add the following to /etc/tungsten/tungsten.ini:
[alpha]
...Existing Replicator Config...
enable-heterogeneous-service=true
shell> tpm update
The above step is only applicable for standalone extractors. If you are configuring replications from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost \
    --datasource-type=kafka \
    --replication-user=root \
    --replication-password=null \
    --replication-port=9092 \
    --property=replicator.applier.dbms.zookeeperString=localhost:2181 \
    --property=replicator.applier.dbms.requireacks=1
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
profile-script=~/.bash_profile
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=sourcehost
members=localhost
datasource-type=kafka
replication-user=root
replication-password=null
replication-port=9092
property=replicator.applier.dbms.zookeeperString=localhost:2181
property=replicator.applier.dbms.requireacks=1
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
Configuration group alpha
The description of each of the options is shown below:
--master=sourcehost
master=sourcehost
The hostname of the primary (extractor) within the current service.
--members=localhost
members=localhost
Hostnames for the dataservice members.
--datasource-type=kafka
datasource-type=kafka
Database type.
--replication-user=root
replication-user=root
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=null
replication-password=null
The password to be used when connecting to the database using the corresponding --replication-user.
--replication-port=9092
replication-port=9092
The network port used to connect to the database server. The default port used depends on the database being configured.
If your MySQL source is a Tungsten Cluster, ensure that the additional steps below are also included in your applier configuration.
First, prepare the required filter configuration file as follows on the Kafka applier host(s) only:
shell> mkdir -p /opt/continuent/share/
shell> cp tungsten-replicator/support/filters-config/convertstringfrommysql.json /opt/continuent/share/
Then, include the following parameters in the configuration:
property=replicator.stage.remote-to-thl.filters=convertstringfrommysql
property=replicator.filter.convertstringfrommysql.definitionsFile=/opt/continuent/share/convertstringfrommysql.json
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If you encounter problems during the installation, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the service is configured and running, the service can be monitored as normal using the trepctl command. See Section 4.4.3, “Management and Monitoring of Kafka Deployments” for more information.
A number of optional, configurable properties are available that control how Tungsten Replicator applies and populates information when the data is written into Kafka. The following properties can be set during configuration using --property=PROPERTYNAME=value:
Table 4.1. Optional Kafka Applier Properties
Option | Description |
---|---|
replicator.applier.dbms.embedCommitTime | Sets whether the commit time for the source row is embedded into the document |
replicator.applier.dbms.embedSchemaTable | Embed the source schema name and table name in the stored document |
replicator.applier.dbms.enabletxinfo.kafka | Embeds transaction information (generated by the rowaddtxninfo filter) into each Kafka message |
replicator.applier.dbms.enabletxninfoTopic | Embeds transaction information into a separate Kafka message broadcast on an independent channel from the one used by the actual database data. One message is sent per transaction or THL event. |
replicator.applier.dbms.keyFormat | Determines the format of the message ID |
replicator.applier.dbms.requireacks | Defines how many acknowledgements from Kafka nodes are required when writing messages to the Kafka cluster |
replicator.applier.dbms.retrycount | The number of retries for sending each message |
replicator.applier.dbms.txninfoTopic | Sets the topic name for transaction messages |
replicator.applier.dbms.zookeeperString | Connection string for Zookeeper, including hostname and port |
− replicator.applier.dbms.embedCommitTime
Option | replicator.applier.dbms.embedCommitTime | |
Description | Sets whether the commit time for the source row is embedded into the document | |
Value Type | boolean | |
Default | true | |
Valid Values | false | Do not embed the source database commit time |
 | true | Embed the source database commit time into the stored document |
Embeds the commit time of the source database row into the document information:
{
"_seqno" : "4865",
"_source_table" : "msg",
"_committime" : "2017-07-13 15:30:37.0",
"_source_schema" : "test",
"record" : {
"msg" : "Hello Kafka",
"id" : "2384726"
},
"_optype" : "INSERT"
}
− replicator.applier.dbms.embedSchemaTable
Option | replicator.applier.dbms.embedSchemaTable | |
Description | Embed the source schema name and table name in the stored document | |
Value Type | boolean | |
Default | true | |
Valid Values | false | Do not embed the schema or table name in the document |
 | true | Embed the source schema name and table name into the stored document |
If enabled, the message written into Kafka will include the source schema and table name. This can be used to identify the source of the information when the schema and table name are not being used for the topic name.
{
  "_seqno" : "4865",
  "_source_table" : "msg",
  "_committime" : "2017-07-13 15:30:37.0",
  "_source_schema" : "test",
  "record" : {
    "msg" : "Hello Kafka",
    "id" : "2384726"
  },
  "_optype" : "INSERT"
}
− replicator.applier.dbms.enabletxinfo.kafka
Option | replicator.applier.dbms.enabletxinfo.kafka | |
Description | Embeds transaction information (generated by the rowaddtxninfo filter) into each Kafka message | |
Value Type | boolean | |
Default | false | |
Valid Values | false | Do not include transaction information in each Kafka message |
 | true | Embed transaction information into each Kafka message |
Embeds information about the entire transaction into each message sent, using the data provided by the rowaddtxninfo filter and other information embedded in each THL event. The transaction information includes row counts, the event ID and the tables modified. Since one message is normally sent for each row of data, adding information about the full transaction to each message makes it possible to validate and identify which other messages are part of the same transaction when the messages are re-assembled by a Kafka client.
For example, when looking at a single message in Kafka, the message includes a txnInfo section:
{
  "_source_table" : "msg",
  "_committime" : "2018-03-07 12:53:21.0",
  "record" : {
    "msg2" : "txinfo",
    "id" : "109",
    "msg" : "txinfo"
  },
  "_optype" : "INSERT",
  "_seqno" : "164",
  "txnInfo" : {
    "schema" : [
      { "schemaName" : "msg", "rowCount" : "1", "tableName" : "msg" },
      { "rowCount" : "2", "schemaName" : "msg", "tableName" : "msgsub" }
    ],
    "serviceName" : "alpha",
    "totalCount" : "3",
    "tungstenTransId" : "164",
    "firstRecordInTransaction" : "true"
  },
  "_source_schema" : "msg"
}
This block of the overall message includes the following objects and information:
schema
An array of the row counts within this transaction, with a row count included for each schema and table.
serviceName
The name of the Tungsten Replicator service that generated the message.
totalCount
The total number of rows modified within the entire transaction.
firstRecordInTransaction
If this field exists, it should always be set to true and indicates that this message was generated by the first row inserted, updated or deleted in the overall transaction. This effectively indicates the start of the overall transaction.
lastRecordInTransaction
If this field exists, it should always be set to true and indicates that this message was generated by the last row inserted, updated or deleted in the overall transaction. This effectively indicates the end of the overall transaction.
Note that this information block is included in every message for each row within an overall transaction. The firstRecordInTransaction and lastRecordInTransaction fields can be used to identify the start and end of the transaction.
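As an illustration of how a client might use this block when re-assembling transactions, the Python sketch below groups messages by tungstenTransId and compares the number received with totalCount. The helper names are hypothetical and the messages are reduced to the fields needed for the example; it assumes every message carries a txnInfo block.

```python
from collections import defaultdict

def group_by_transaction(messages):
    """Group messages by txnInfo.tungstenTransId and report completeness."""
    groups = defaultdict(list)
    for message in messages:
        groups[message["txnInfo"]["tungstenTransId"]].append(message)
    report = {}
    for trans_id, rows in groups.items():
        expected = int(rows[0]["txnInfo"]["totalCount"])
        report[trans_id] = {
            "received": len(rows),
            "expected": expected,
            "complete": len(rows) == expected,
        }
    return report

def make_message(trans_id, total, row_id):
    # Hypothetical helper that builds a minimal message for the demonstration.
    return {
        "record": {"id": str(row_id)},
        "txnInfo": {"tungstenTransId": str(trans_id), "totalCount": str(total)},
    }

messages = [
    make_message(164, 3, 1), make_message(164, 3, 2), make_message(164, 3, 3),
    make_message(165, 2, 4),  # second row of transaction 165 not yet received
]
report = group_by_transaction(messages)
print(report)
```

A received count higher than the expected count would suggest that rows were resent, for example after the replicator went offline in the middle of a transaction.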
− replicator.applier.dbms.enabletxninfoTopic
Option | replicator.applier.dbms.enabletxninfoTopic | |
Description | Embeds transaction information into a separate Kafka message broadcast on an independent channel from the one used by the actual database data. One message is sent per transaction or THL event. | |
Value Type | boolean | |
Default | false | |
Valid Values | false | Do not generate transaction information |
 | true | Send transaction information on a separate Kafka topic for each transaction |
If enabled, a separate message containing information about the entire transaction is sent on a Kafka topic. The topic name can be configured by setting the replicator.applier.dbms.txninfoTopic property.
The default message sent will look like the following example:
{
  "txnInfo" : {
    "tungstenTransId" : "164",
    "schema" : [
      { "schemaName" : "msg", "rowCount" : "1", "tableName" : "msg" },
      { "schemaName" : "msg", "rowCount" : "2", "tableName" : "msgsub" }
    ],
    "totalCount" : "3",
    "serviceName" : "alpha"
  }
}
This block of the overall message includes the following objects and information:
schema
An array of the row counts within this transaction, with a row count included for each schema and table.
serviceName
The name of the Tungsten Replicator service that generated the message.
totalCount
The total number of rows modified within the entire transaction.
− replicator.applier.dbms.keyFormat
Option | replicator.applier.dbms.keyFormat | |
Description | Determines the format of the message ID | |
Value Type | string | |
Default | pkey | |
Valid Values | pkey | Combine the primary key column values into a single string |
 | pkeyus | Combine the primary key column values into a single string joined by an underscore character |
 | tspkey | Combine the schema name, table name, and primary key column values into a single string |
 | tspkeyus | Combine the schema name, table name, and primary key column values into a single string joined by an underscore character |
Determines the format of the message ID used when sending the message into Kafka. For example, when configured to use tspkeyus, the message ID consists of the schema name, table name and primary key column information separated by underscores, in the form SCHEMANAME_TABLENAME_234.
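The four formats can be compared with a short Python sketch. This is an illustration and not the applier's code: the function name is hypothetical, it assumes the us suffix denotes joining with underscores (as in the tspkeyus example above), and it assumes that the formats without the suffix simply concatenate the values.

```python
def message_id(key_format, schema, table, pkey_values):
    """Build an illustrative message ID for the given keyFormat value."""
    values = [str(value) for value in pkey_values]
    if key_format == "pkey":
        return "".join(values)                    # assumption: plain concatenation
    if key_format == "pkeyus":
        return "_".join(values)
    if key_format == "tspkey":
        return "".join([schema, table] + values)  # assumption: plain concatenation
    if key_format == "tspkeyus":
        return "_".join([schema, table] + values)
    raise ValueError("unknown keyFormat: " + key_format)

for name in ("pkey", "pkeyus", "tspkey", "tspkeyus"):
    print(name, message_id(name, "SCHEMANAME", "TABLENAME", [234]))
```

With a multi-column primary key, the underscore variants keep the individual values distinguishable, which may make them easier for a consumer to parse.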
− replicator.applier.dbms.requireacks
Option | replicator.applier.dbms.requireacks | |
Description | Defines how many acknowledgements from Kafka nodes are required when writing messages to the Kafka cluster | |
Value Type | string | |
Default | all | |
Valid Values | 1 | Only the lead host should acknowledge receipt of the message |
 | all | All nodes should acknowledge receipt of the message |
Sets the acknowledgement counter for sending messages into the Kafka queue.
− replicator.applier.dbms.retrycount
Option | replicator.applier.dbms.retrycount | |
Description | The number of retries for sending each message | |
Value Type | number | |
Default | 0 |
Determines the number of times that sending a message is retried before the operation fails.
− replicator.applier.dbms.txninfoTopic
Option | replicator.applier.dbms.txninfoTopic | |
Description | Sets the topic name for transaction messages | |
Value Type | string | |
Default | tungsten_transactions |
Sets the topic name to be used when sending independent transaction information messages about each THL event. See replicator.applier.dbms.enabletxninfoTopic.
− replicator.applier.dbms.zookeeperString
Option | replicator.applier.dbms.zookeeperString | |
Description | Connection string for Zookeeper, including hostname and port | |
Value Type | string | |
Default | ${replicator.global.db.host}:2181 |
The string to be used when connecting to Zookeeper. The default is to use port 2181 on the host used by replicator.global.db.host.
Once the extractor and applier have been installed, services can be monitored using the trepctl command.
For example, to monitor the extractor status:
shell> trepctl status
appliedLastEventId : mysql-bin.000009:0000000000002298;2340
appliedLastSeqno : 10
appliedLatency : 0.788
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : alpha
currentEventId : mysql-bin.000009:0000000000002298
currentTimeMillis : 1498687871560
dataServerHost : mysqlhost
extensions :
host : mysqlhost
latestEpochNumber : 0
masterConnectUri : thl://localhost:/
masterListenUri : thl://mysqlhost:2112/
maximumStoredSeqNo : 10
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : /var/lib/mysql
relativeLatency : 99185.56
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : east
serviceType : local
simpleServiceName : east
siteName : default
sourceId : mysqlhost
state : ONLINE
timeInStateSeconds : 101347.786
timezone : GMT
transitioningTo :
uptimeSeconds : 101358.88
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
The replicator service operates just the same as a standard extractor service of a typical MySQL replication service.
The Kafka applier service can be accessed either remotely from the extractor:
shell> trepctl -host kafka status
...
Or locally on the Kafka host:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000008:0000000000412301;0
appliedLastSeqno : 1296
appliedLatency : 10.253
channels : 1
clusterName : alpha
currentEventId : NONE
currentTimeMillis : 1377098139212
dataServerHost : kafka
extensions :
latestEpochNumber : 1286
masterConnectUri : thl://host1:2112/
masterListenUri : null
maximumStoredSeqNo : 1296
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://mysqlhost:2112/
relativeLatency : 771.212
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : kafka
state : ONLINE
timeInStateSeconds : 177783.343
transitioningTo :
uptimeSeconds : 180631.276
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
Monitoring the status of replication between the source and target is also the same. The appliedLastSeqno still indicates the sequence number that has been applied to Kafka, and the corresponding event ID can still be identified from appliedLastEventId.
Sequence numbers between the two hosts should match, as in a source/target deployment, but due to the method used to replicate, the applied latency may be higher.
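The comparison of sequence numbers can be scripted by parsing the trepctl status output from both hosts. The Python sketch below is one possible approach and is not a Tungsten tool; the function names are hypothetical and the sample text is abbreviated from the output format shown above.

```python
def parse_status(text):
    """Parse 'name : value' lines from trepctl status output into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, because values such as event IDs
        # and URIs contain colons themselves.
        name, separator, value = line.partition(":")
        if separator:
            status[name.strip()] = value.strip()
    return status

def seqno_lag(extractor_text, applier_text):
    """Return how many sequence numbers the applier is behind the extractor."""
    extractor = parse_status(extractor_text)
    applier = parse_status(applier_text)
    return int(extractor["appliedLastSeqno"]) - int(applier["appliedLastSeqno"])

extractor_output = """appliedLastEventId : mysql-bin.000009:0000000000002298;2340
appliedLastSeqno : 10
state : ONLINE
Finished status command..."""
applier_output = """Processing status command...
appliedLastSeqno : 8
masterConnectUri : thl://host1:2112/
state : ONLINE"""

print(seqno_lag(extractor_output, applier_output))
```

A lag of zero means the applier has applied everything the extractor has written to THL.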
To check for information within Kafka, use a tool or the kafka-console-consumer.sh command-line client:
shell> kafka-console-consumer.sh --topic test_msg --zookeeper localhost:2181
The output should be checked to ensure that information is being correctly replicated. If strings are shown as a hex value, for example:
"title" : "[B@7084a5c"
This probably indicates that the UTF8 and/or --mysql-use-bytes-for-string=false options were not used during installation. If you are reading from a cluster this is expected behavior, and you should enable the convertstringfrommysql filter as shown in the installation examples. In pure replicator scenarios, ensure that the --mysql-use-bytes-for-string=false setting is enabled, or that you are using --enable-heterogeneous-service.
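When checking a large number of messages, the unconverted values can be detected automatically. The Python sketch below is an illustrative check only; the function name is hypothetical, and the pattern is an assumption based on the example value shown above.

```python
import re

# Assumed shape of an unconverted value, based on the "[B@7084a5c" example.
BYTE_ARRAY_PATTERN = re.compile(r"^\[B@[0-9a-f]+$")

def unconverted_fields(message):
    """Return the names of record fields that look like unconverted byte arrays."""
    record = message.get("record", {})
    return sorted(
        name
        for name, value in record.items()
        if isinstance(value, str) and BYTE_ARRAY_PATTERN.match(value)
    )

message = {"_optype": "INSERT", "record": {"id": "1", "title": "[B@7084a5c"}}
print(unconverted_fields(message))
```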
Deploying replication into MongoDB differs slightly from other appliers. There are two parts to the process:
Service Alpha on the Extractor extracts the information from the MySQL binary log into THL.
Service Alpha on the Applier reads the information from the remote replicator as THL, and applies it to MongoDB.
Basic reformatting and restructuring of the data is performed by translating the structure extracted from one database in row format and restructuring for application in a different format. A filter, the ColumnNameFilter, is used to extract the column names against the extracted row-based information.
With the MongoDB applier, information is extracted from the source database using the row-format, column names and primary keys are identified, and translated to the BSON (Binary JSON) format supported by MongoDB. The fields in the source row are converted to the key/value pairs within the generated BSON.
The transfer operates as follows:
Data is extracted from MySQL using the standard extractor, reading the row change data from the binlog.
The ColumnName filter (see Section 11.4.5, “ColumnName Filter”) is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.
The THL information is then applied to MongoDB using the MongoDB applier.
The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.
The MongoDB applier can also be used to apply into a MongoDB Atlas instance.
The configuration for MongoDB Atlas is slightly different and follows a typical offboard applier process, similar in style to applying to Amazon Aurora instances.
Specific installation steps for MongoDB Atlas are outlined in Section 4.5.4, “Install MongoDB Atlas Applier”.
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
During the replication process, data is converted from the MySQL database/table/row structure into corresponding MongoDB structures.
In general, a row within a MySQL table is converted into a single document on the MongoDB side, and automatically added to a collection matching the table name.
For example, the following row within MySQL:
mysql> select * from recipe where recipeid = 1085 \G
*************************** 1. row ***************************
recipeid: 1085
title: Creamy egg and leek special
subtitle:
servings: 4
active: 1
parid: 0
userid: 0
rating: 0.0
cumrating: 0.0
createdate: 0
1 row in set (0.00 sec)
Is replicated into the MongoDB document:
{
  "_id" : ObjectId("5212233584ae46ce07e427c3"),
  "recipeid" : "1085",
  "title" : "Creamy egg and leek special",
  "subtitle" : "",
  "servings" : "4",
  "active" : "1",
  "parid" : "0",
  "userid" : "0",
  "rating" : "0.0",
  "cumrating" : "0.0",
  "createdate" : "0"
}
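The shape of this translation can be modelled in a few lines of Python. This sketch illustrates the mapping shown in the example and is not the applier's implementation; the function name is hypothetical, and rendering every value as a string is inferred from the sample document. The _id field is omitted because it is generated on the MongoDB side.

```python
def row_to_document(column_names, values):
    """Model a MySQL row as a MongoDB-style document with string values."""
    return {name: str(value) for name, value in zip(column_names, values)}

columns = ["recipeid", "title", "subtitle", "servings", "rating"]
row = [1085, "Creamy egg and leek special", "", 4, 0.0]
document = row_to_document(columns, row)
print(document)
```

Because numeric values arrive as strings in this model, queries against the resulting documents would need to compare against string values.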
When preparing the hosts you must be aware of this translation of the different structures, as it will have an effect on the way the information is replicated from MySQL to MongoDB.
MySQL Host
The data replicated from MySQL can be any data, although there are some known limitations and assumptions made on the way the information is transferred.
When configuring the extractor database and host, ensure that the heterogeneous-specific prerequisites have been included; see Section B.4.4, “MySQL Configuration for Heterogeneous Deployments”.
For the best results when replicating, be aware of the following issues and limitations:
Use primary keys on all tables. The use of primary keys will improve the lookup of information within MongoDB when rows are updated. Without a primary key on a table a full table scan is performed, which can affect performance.
MySQL TEXT columns are correctly replicated, but cannot be used as keys.
MySQL BLOB columns are converted to text using the configured character type. Depending on the data that is being stored within the BLOB, the data may need to be custom converted. A filter can be written to convert and reformat the content as required.
MongoDB Host
Enable networking; by default MongoDB is configured to listen only on the localhost (127.0.0.1) IP address. The address should be changed to the IP address of your host, or 0.0.0.0, which indicates all interfaces on the current host.
Ensure that network port 27017, or the port you want to use for MongoDB is configured as the listening port.
Installation of the MongoDB replication requires special configuration of the Source and Target hosts so that each is configured for the correct datasource type.
To configure the Applier replicators:
Before installing the applier, apply the following parameter to the extractor configuration.
Add the following to /etc/tungsten/tungsten.ini:
[alpha]
...Existing Replicator Config...
enable-heterogeneous-service=true
shell> tpm update
The above step is only applicable for standalone extractors. If you are configuring replications from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost \
    --datasource-type=mongodb \
    --replication-user=tungsten \
    --replication-password=secret \
    --svc-applier-filters=dropstatementdata \
    --role=slave \
    --replication-port=27017
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
profile-script=~/.bash_profile
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=sourcehost
members=localhost
datasource-type=mongodb
replication-user=tungsten
replication-password=secret
svc-applier-filters=dropstatementdata
role=slave
replication-port=27017
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
Configuration group alpha
The description of each of the options is shown below:
--master=sourcehost
master=sourcehost
The hostname of the primary (extractor) within the current service.
--members=localhost
members=localhost
Hostnames for the dataservice members
--datasource-type=mongodb
datasource-type=mongodb
Database type
--replication-user=tungsten
replication-user=tungsten
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=secret
replication-password=secret
The password to be used when connecting to the database using the corresponding --replication-user.
--svc-applier-filters=dropstatementdata
svc-applier-filters=dropstatementdata
Replication service applier filters
--role=slave
role=slave
What is the replication role for this service?
--replication-port=27017
replication-port=27017
The network port used to connect to the database server. The default port used depends on the database being configured.
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the replicators have started, the status of the service can be checked using trepctl. See Section 4.5.5, “Management and Monitoring of MongoDB Deployments” for more information.
Installation of the MongoDB replication requires special configuration of the Source and Target hosts so that each is configured for the correct datasource type.
To configure the Applier replicators:
Before installing the applier, add the following parameter to the extractor configuration on the extractor host, update the extractor using the details below, and then install the applier.
For Staging installs:
shell> cd tungsten-replicator-7.0.3-141
shell> ./tools/tpm configure alpha \
    --enable-heterogeneous-master=true
shell> ./tools/tpm update
For INI installs:
Add the following to /etc/tungsten/tungsten.ini:
[alpha]
...Existing Replicator Config...
enable-heterogeneous-master=true
shell> tpm update
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --disable-security-controls=false \
    --rmi-ssl=false \
    --thl-ssl=false \
    --rmi-authentication=false \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost \
    --datasource-type=mongodb \
    --replication-user=tungsten \
    --replication-password=secret \
    --svc-applier-filters=dropstatementdata \
    --role=slave \
    --replication-host=atlasendpoint.mongodb.net \
    --replication-port=27017 \
    '--property=replicator.applier.dbms.connectString=mongodb+srv://${replicator.global.db.user}:${replicator.global.db.password}@${replicator.global.db.host}/?retryWrites=true&w=majority'
The --property argument is single-quoted so that the shell does not interpret the ${...} placeholders or the & character.
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
profile-script=~/.bash_profile
disable-security-controls=false
rmi-ssl=false
thl-ssl=false
rmi-authentication=false
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=sourcehost
members=localhost
datasource-type=mongodb
replication-user=tungsten
replication-password=secret
svc-applier-filters=dropstatementdata
role=slave
replication-host=atlasendpoint.mongodb.net
replication-port=27017
property=replicator.applier.dbms.connectString=mongodb+srv://${replicator.global.db.user}:${replicator.global.db.password}@${replicator.global.db.host}/?retryWrites=true&w=majority
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--disable-security-controls=false
disable-security-controls=false
Disables all forms of security, including SSL, TLS and authentication
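--rmi-ssl=false
rmi-ssl=false
Enable SSL encryption of RMI communication on this host
--thl-ssl=false
thl-ssl=false
Enable SSL encryption of THL communication for this service
--rmi-authentication=false
rmi-authentication=false
Enable RMI authentication for the services running on this host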
Configuration group alpha
The description of each of the options is shown below:
--master=sourcehost
master=sourcehost
The hostname of the primary (extractor) within the current service.
--members=localhost
members=localhost
Hostnames for the dataservice members
--datasource-type=mongodb
datasource-type=mongodb
Database type
--replication-user=tungsten
replication-user=tungsten
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=secret
replication-password=secret
The password to be used when connecting to the database using the corresponding --replication-user.
--svc-applier-filters=dropstatementdata
svc-applier-filters=dropstatementdata
Replication service applier filters
--role=slave
role=slave
What is the replication role for this service?
--replication-host=atlasendpoint.mongodb.net
replication-host=atlasendpoint.mongodb.net
Hostname of the datasource where the database is located. If the specified hostname matches the current host or member name, the database is assumed to be local. If the hostnames do not match, extraction is assumed to be via remote access. For MySQL hosts, this configures a remote replication Replica (relay) connection.
--replication-port=27017
replication-port=27017
The network port used to connect to the database server. The default port used depends on the database being configured.
The --property option enables you to explicitly set property values in the target files. A number of different models are supported:
key=value
Set the property defined by key to the specified value without evaluating any template values or other rules.
key+=value
Add the value to the property defined by key. Template values and other options append their settings to the end of the specified property.
key~=/match/replace/
Evaluate any template values and other settings, and then perform the specified Ruby regex operation to the property defined by key. For example --property=replicator.key~=/(.*)/somevalue,\1/ will prepend somevalue before the template value for replicator.key.
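The effect of the key~=/match/replace/ model can be previewed outside tpm before it is added to a configuration. The sketch below is only an analogy: tpm applies a Ruby regex to the evaluated template value, whereas this uses an equivalent POSIX sed expression. The function name preview_property_regex and the sample value templatevalue are hypothetical.

```shell
# preview_property_regex VALUE
# Apply the equivalent of ~=/(.*)/somevalue,\1/ to VALUE using sed.
# sed stands in for tpm's Ruby regex handling here; this is an approximation.
preview_property_regex() {
    printf '%s\n' "$1" | sed 's/\(.*\)/somevalue,\1/'
}

# Hypothetical template value for replicator.key
preview_property_regex "templatevalue"
```

The whole existing value is captured by (.*) and re-emitted as \1, which is why somevalue ends up in front of it rather than replacing it.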
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
The above example assumes SSL is not enabled between the extractor and applier replicators.
If SSL is required, then you must omit the following properties from the example configs displayed above, or change the values to true: rmi-ssl=false, thl-ssl=false, rmi-authentication=false
Once you have installed the replicator, a few more steps are required to allow the replicator to authenticate with MongoDB Atlas.
MongoDB Atlas requires TLS connections for all Atlas Clusters, therefore we need to configure the replicator to recognise this.
Since May 1, 2021, MongoDB Atlas has used new TLS certificates with ISRG instead of IdenTrust as their root Certificate Authority.
All new clusters created after this time, and any existing clusters that have since been migrated to this new root CA, will need to follow the correct procedure to configure the replicator. Both procedures are described below; follow the one that relates to your configuration.
For MongoDB Atlas Clusters created PRIOR to May 1, 2021, or that have not yet migrated to the new LetsEncrypt root Certificate:
Using the correct Atlas Endpoint, issue the following command to retrieve the Atlas certificates:
shell> openssl s_client -showcerts -connect atlas-endpoint.mongodb.net:27017
The output may be quite long and will include at least two certificates, each bounded by a header and footer as follows:
-----BEGIN CERTIFICATE-----
xxxx
xxxx
-----END CERTIFICATE-----
Copy each certificate, including the header/footer, into individual files.
Using keytool, we now need to load each certificate into the truststore that was created during the replicator installation. Repeat the example below for each certificate, ensuring you use a unique alias name for each certificate.
shell> keytool -import -alias your-alias1 \
    -file cert1.cer \
    -keystore /opt/continuent/share/tungsten_truststore.ts
When prompted, the default password for the truststore will be tungsten unless you specified a different password during installation.
Once this is complete, you can now start the replicator
shell> replicator start
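Copying each certificate out of the openssl output by hand is error-prone, so the split can be scripted. The following is a minimal sketch, assuming the certificates appear between the standard BEGIN/END lines shown above. The function name split_certs, the cert1.cer, cert2.cer, ... file names and the atlas-certN aliases are illustrative choices, not part of the product.

```shell
# split_certs PREFIX
# Read `openssl s_client -showcerts` output on stdin and write each PEM
# certificate, including its BEGIN/END lines, to PREFIX1.cer, PREFIX2.cer, ...
split_certs() {
    awk -v prefix="$1" '
        /-----BEGIN CERTIFICATE-----/ { n++; out = prefix n ".cer"; incert = 1 }
        incert                        { print > out }
        /-----END CERTIFICATE-----/   { incert = 0; close(out) }
    '
}

# Example usage (not run here; requires network access to the Atlas endpoint):
#   openssl s_client -showcerts -connect atlas-endpoint.mongodb.net:27017 </dev/null \
#       | split_certs cert
#   for f in cert*.cer; do
#       keytool -import -alias "atlas-${f%.cer}" -file "$f" \
#           -keystore /opt/continuent/share/tungsten_truststore.ts
#   done
```

Each generated file can then be imported with keytool as described above, using a distinct alias per file.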
For MongoDB Atlas Clusters created AFTER May 1, 2021, or that have been migrated to the new LetsEncrypt root Certificate:
Obtain the LetsEncrypt root Certificate from here
Copy the certificate into a file called letsencrypt.pem in the home directory of the applier host, including the BEGIN and END header/footer, for example:
-----BEGIN CERTIFICATE-----
xxxx
xxxx
-----END CERTIFICATE-----
Using keytool, we now need to import this certificate into the truststore that was created during the replicator installation.
shell> keytool -import -alias letsencrypt \
    -file letsencrypt.pem \
    -keystore /opt/continuent/share/tungsten_truststore.ts
When prompted, the default password for the truststore will be tungsten unless you specified a different password during installation.
Once this is complete, you can now start the replicator
shell> replicator start
Once the replicators have started, the status of the service can be checked using trepctl. See Section 4.5.5, “Management and Monitoring of MongoDB Deployments” for more information.
Once the two services — extractor and applier — have been installed, the services can be monitored using trepctl. To monitor the extractor service:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000008:0000000000412301;0
appliedLastSeqno : 1296
appliedLatency : 1.889
channels : 1
clusterName : epsilon
currentEventId : mysql-bin.000008:0000000000412301
currentTimeMillis : 1377097812795
dataServerHost : host1
extensions :
latestEpochNumber : 1286
masterConnectUri : thl://localhost:/
masterListenUri : thl://host2:2112/
maximumStoredSeqNo : 1296
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host1:13306/
relativeLatency : 177444.795
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host1
state : ONLINE
timeInStateSeconds : 177443.948
transitioningTo :
uptimeSeconds : 177461.483
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
The replicator service operates just the same as a standard Extractor service of a typical MySQL replication service.
The MongoDB applier service can be accessed either remotely from the Extractor:
shell> trepctl -host host2 status
...
Or locally on the MongoDB host:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000008:0000000000412301;0
appliedLastSeqno : 1296
appliedLatency : 10.253
channels : 1
clusterName : alpha
currentEventId : NONE
currentTimeMillis : 1377098139212
dataServerHost : host2
extensions :
latestEpochNumber : 1286
masterConnectUri : thl://host1:2112/
masterListenUri : null
maximumStoredSeqNo : 1296
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://host1:2112/
relativeLatency : 177771.212
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host2
state : ONLINE
timeInStateSeconds : 177783.343
transitioningTo :
uptimeSeconds : 180631.276
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
Monitoring the status of replication between the Source and Target is also the same. The appliedLastSeqno still indicates the sequence number that has been applied to MongoDB, and the event ID from MongoDB can still be identified from appliedLastEventId.
Sequence numbers between the two hosts should match, as in a Primary/Replica deployment, but due to the method used to replicate, the applied latency may be higher. Tables that do not use primary keys, or large individual row updates may cause increased latency differences.
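The sequence-number comparison described above can be scripted by pulling appliedLastSeqno out of the trepctl status output on each host. This is a minimal sketch, assuming the "name : value" layout shown in the status listings above. The helper names status_field and seqno_lag are hypothetical, and host1 and host2 stand for the extractor and applier hosts.

```shell
# status_field NAME
# Read `trepctl status` output on stdin and print the value of field NAME.
status_field() {
    awk -v key="$1" '
        {
            name = $0
            sub(/[ ]*:.*$/, "", name)       # text before the first colon
            if (name == key) {
                sub(/^[^:]*:[ ]*/, "")     # drop the "NAME : " prefix from the line
                print
                exit
            }
        }
    '
}

# seqno_lag EXTRACTOR_SEQNO APPLIER_SEQNO
# Print how many sequence numbers the applier is behind the extractor.
seqno_lag() {
    echo $(( $1 - $2 ))
}

# Example usage (not run here; requires running replicators):
#   extractor=$(trepctl -host host1 status | status_field appliedLastSeqno)
#   applier=$(trepctl -host host2 status | status_field appliedLastSeqno)
#   seqno_lag "$extractor" "$applier"
```

A result of 0 means both hosts report the same applied sequence number. A small positive value is expected while the applier is catching up.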
To check for information within MongoDB, use the mongo command-line client:
shell> mongo
MongoDB shell version: 2.2.4
connecting to: test
> use cheffy;
switched to db cheffy
The show collections command lists the tables from MySQL that have been replicated to MongoDB:
> show collections
access_log
audit_trail
blog_post_record
helpdb
ingredient_recipes
ingredient_recipes_bytext
ingredients
ingredients_alt
ingredients_keywords
ingredients_matches
ingredients_measures
ingredients_plurals
ingredients_search_class
ingredients_search_class_map
ingredients_shop_class
ingredients_xlate
ingredients_xlate_class
keyword_class
keywords
measure_plurals
measure_trans
metadata
nut_fooddesc
nut_foodgrp
nut_footnote
nut_measure
nut_nutdata
nut_nutrdef
nut_rda
nut_rda_class
nut_source
nut_translate
nut_weight
recipe
recipe_coll_ids
recipe_coll_search
recipe_collections
recipe_comments
recipe_pics
recipebase
recipeingred
recipekeywords
recipemeta
recipemethod
recipenutrition
search_translate
system.indexes
terms
Collection counts should match the row count of the source tables:
> db.recipe.count()
2909
The db.collection.find() command can be used to list the documents within a given collection.
> db.recipe.find()
{ "_id" : ObjectId("5212233584ae46ce07e427c3"), "recipeid" : "1085", "title" : "Creamy egg and leek special", "subtitle" : "", "servings" : "4", "active" : "1", "parid" : "0", "userid" : "0", "rating" : "0.0", "cumrating" : "0.0", "createdate" : "0" }
{ "_id" : ObjectId("5212233584ae46ce07e427c4"), "recipeid" : "87", "title" : "Chakchouka", "subtitle" : "A traditional Arabian and North African dish and often accompanied with slices of cooked meat", "servings" : "4", "active" : "1", "parid" : "0", "userid" : "0", "rating" : "0.0", "cumrating" : "0.0", "createdate" : "0" }
...
The output should be checked to ensure that information is being correctly replicated. If strings are shown as a hex value, for example:
"title" : "[B@7084a5c"
It probably indicates that UTF8 and/or --mysql-use-bytes-for-string=false options were not used during installation. The configuration can be updated using tpm to address this issue.
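Values such as "[B@7084a5c" have the form Java uses when a byte array is printed as a string, so they can be searched for mechanically instead of scanning the output by eye. The sketch below assumes the documents have been written out as text, one or more lines per document. The function name find_byte_array_strings is hypothetical.

```shell
# find_byte_array_strings
# Read mongo shell output on stdin and print the lines whose string values
# look like Java byte-array references such as "[B@7084a5c".
find_byte_array_strings() {
    grep -E '"\[B@[0-9a-f]+"'
}

# Example usage (not run here; assumes the output was saved to recipes.txt):
#   find_byte_array_strings < recipes.txt
```

No output (and a non-zero exit status from grep) means no such values were found in the sampled documents.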
Replicating data into Hadoop is achieved by generating character-separated values from ROW-based information that is applied directly to the Hadoop HDFS using a batch loading process. Files are written directly to the HDFS using the Hadoop client libraries. A separate process is then used to merge existing data, and the changed information extracted from the Source database.
Deployment of the Hadoop replication is similar to other heterogeneous installations; two separate installations are created:
Service Alpha on the extractor, extracts the information from the MySQL binary log into THL.
Service Alpha on the applier, reads the information from the remote replicator as THL, applying it to Hadoop. The applier works in two stages:
Basic requirements for replication into Hadoop:
Hadoop Replication is supported on the following Hadoop distributions and releases:
Cloudera Enterprise 4.4, Cloudera Enterprise 5.0 (Certified) up to Cloudera Enterprise 5.8
HortonWorks DataPlatform 2.0
Amazon Elastic MapReduce
IBM InfoSphere BigInsights 2.1 and 3.0
MapR 3.0, 3.1, and 5.x
Pivotal HD 2.0
Apache Hadoop 2.1.0, 2.2.0
Source tables must have primary keys. Without a primary key, Tungsten Replicator is unable to determine the row to be updated when the data reaches Hadoop.
The Hadoop applier makes use of the JavaScript based batch loading system (see Section 5.6.4, “JavaScript Batchloader Scripts”). This constructs change data from the source-database, and uses this information in combination with any existing data to construct, using Hive, a materialized view. A summary of this basic structure can be seen in Figure 4.8, “Topologies: Hadoop Replication Operation”.
The full replication of information operates as follows:
Data is extracted from the source database using the standard extractor, for example by reading the row change data from the binlog in MySQL.
The colnames filter is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.
The pkey filter is used to extract primary key data from the source tables.
On the applier replicator, the THL data is read and written into batch-files in the character-separated value format.
The information in these files is change data, and contains not only the original data, but also metadata about the operation performed (i.e. INSERT, DELETE or UPDATE) and the primary key for each table. All UPDATE statements are recorded as a DELETE of the existing data, and an INSERT of the new data.
A second process uses the CSV stage data and any existing data, to build a materialized view that mirrors the source table data structure.
The staging files created by the replicator are in a specific format that incorporates change and operation information in addition to the original row data.
The format of the files is a character separated values file, with each row separated by a newline, and individual fields separated by the character 0x01. This is supported by Hive as a native value separator.
Each row in the file consists of metadata describing the operation, the sequence number, a unique row ID and the commit timestamp, followed by the primary key and the full row data extracted from the source.
Operation | Sequence No | Unique Row | Commit TimeStamp | Table-specific primary key | Table-column |
---|---|---|---|---|---|
I (Insert) or D (Delete) | SEQNO that generated this row | Unique row ID within the batch | The commit timestamp of the original transaction, which can be used for partitioning | | |
For example, the MySQL row:
| 3 | #1 Single | 2006 | Cats and Dogs (#1.4) |
Is represented within the staging files generated as:
I^A1318^A1^A2017-06-07 09:22:28.000^A3^A3^A#1 Single^A2006^ACats and Dogs (#1.4)
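To make the field layout concrete, the example row above can be decoded with awk using 0x01 as the field separator. This is a minimal sketch, assuming the default separators and the column order given in the table above (operation, sequence number, row ID, commit timestamp, primary key, then the table columns). The function name describe_staging_rows is hypothetical.

```shell
# describe_staging_rows
# Read staging-file rows on stdin (fields separated by 0x01) and summarise
# the metadata fields that precede the table data.
describe_staging_rows() {
    awk -F "$(printf '\001')" '{
        op = ($1 == "I") ? "insert" : (($1 == "D") ? "delete" : "unknown")
        printf "op=%s seqno=%s row=%s commit=%s pkey=%s columns=%d\n",
               op, $2, $3, $4, $5, NF - 5
    }'
}

# Example usage (not run here; requires access to HDFS):
#   hadoop fs -cat /user/tungsten/staging/alpha/hadoop/chicago/chicago-10.csv \
#       | describe_staging_rows
```

For the example row above, the first five fields are reported as metadata and the remaining four are counted as table columns.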
The character separator, and whether to use quoting, are configurable within the replicator when it is deployed. The default is to use a newline character for records, and the 0x01 character for fields. For more information on these fields and how they can be configured, see Section 5.6.7, “Supported CSV Formats”.
On the Hadoop host, information is stored into a number of locations within the HDFS during the data transfer:
Table 4.2. Hadoop Replication Directory Locations
Directory/File | Description |
---|---|
/user/USERNAME | Top-level directory for Tungsten Replicator information, using the configured replication user. |
/user/tungsten/metadata | Location for metadata related to the replication operation |
/user/tungsten/metadata/ | The directory (named after the servicename of the replicator service) that holds service-specific metadata |
/user/tungsten/staging | Directory of the data transferred |
/user/tungsten/staging/ | Directory of the data transferred from a specific servicename. |
/user/tungsten/staging/ | Directory of the data transferred specific to a database. |
/user/tungsten/staging/ | Directory of the data transferred specific to a table. |
/user/tungsten/staging/ | Filename of a single file of the data transferred for a specific table and database. |
Files are automatically created, named according to the parent table name, and the starting Tungsten Replicator sequence number for each file that is transferred. The size of the files is determined by the batch and commit parameters. For example, the truncated list of files below was displayed using the hadoop fs command:
shell> hadoop fs -ls /user/tungsten/staging/alpha/hadoop/chicago
Found 66 items
-rw-r--r-- 3 cloudera cloudera 1270236 2020-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-10.csv
-rw-r--r-- 3 cloudera cloudera 10274189 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-103.csv
-rw-r--r-- 3 cloudera cloudera 1275832 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-104.csv
-rw-r--r-- 3 cloudera cloudera 1275411 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-105.csv
-rw-r--r-- 3 cloudera cloudera 10370471 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-113.csv
-rw-r--r-- 3 cloudera cloudera 1279435 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-114.csv
-rw-r--r-- 3 cloudera cloudera 2544062 2020-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-12.csv
-rw-r--r-- 3 cloudera cloudera 11694202 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-123.csv
-rw-r--r-- 3 cloudera cloudera 1279072 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-124.csv
-rw-r--r-- 3 cloudera cloudera 2570481 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-126.csv
-rw-r--r-- 3 cloudera cloudera 9073627 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-133.csv
-rw-r--r-- 3 cloudera cloudera 1279708 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-134.csv
...
The individual file numbers will not be sequential, as they will depend on the sequence number, batch size and range of tables transferred.
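Because the listing is ordered alphabetically, chicago-103.csv appears before chicago-12.csv. When the files need to be examined in replication order, they can be sorted on the sequence number instead. The following sketch assumes file names of the form TABLE-SEQNO.csv as in the listing above. The function name sort_by_seqno is hypothetical.

```shell
# sort_by_seqno
# Read staging file paths on stdin and print them ordered by the numeric
# sequence number that follows the last "-" in the file name.
sort_by_seqno() {
    awk '{
        seqno = $0
        sub(/\.csv$/, "", seqno)   # strip the extension
        sub(/^.*-/, "", seqno)     # keep only the trailing number
        print seqno "\t" $0
    }' | sort -n | cut -f 2-
}

# Example usage (not run here; requires access to HDFS):
#   hadoop fs -ls /user/tungsten/staging/alpha/hadoop/chicago \
#       | awk '{ print $NF }' | grep '\.csv$' | sort_by_seqno
```

The sequence number is temporarily placed in front of each path as a sort key and removed again after sorting, so the paths themselves are printed unchanged.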
During the replication process, data is exchanged from the MySQL database/table/row structure into corresponding Hadoop directory and files, as shown in the table below:
MySQL | Hadoop |
---|---|
Database | Directory |
Table | Hive-compatible Character-Separated Text file |
Row | Line in the text file, fields terminated by character 0x01 |
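The mapping can be expressed as a small helper that builds the HDFS path for a given staging file. The layout used here is inferred from the file listing shown earlier (/user/tungsten/staging/alpha/hadoop/chicago/chicago-10.csv) and assumes the replication user is tungsten. The function name staging_path is hypothetical.

```shell
# staging_path SERVICE DATABASE TABLE SEQNO
# Print the HDFS path of the staging file for the given service, database,
# table and starting sequence number (layout inferred from the listing above).
staging_path() {
    printf '/user/tungsten/staging/%s/%s/%s/%s-%s.csv\n' "$1" "$2" "$3" "$3" "$4"
}

staging_path alpha hadoop chicago 10
```

With a different replication user or a customised staging directory the leading /user/tungsten/staging component would differ.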
The Hadoop environment should have the following features and parameters for the most efficient operation:
Disk storage
There must be enough disk storage for the change data, data being actively merged, and the live data for the replicated information. Depending on the configuration and rate of changes in the Source, the required data space will fluctuate.
For example, replicating a 10GB dataset, and 5GB of change data during replication, will require at least 30GB of storage. 10GB for the original dataset, 5GB of change data, and 10-25GB of merged data. The exact size is dependent on the quantity of inserts/updates/deletes.
Pre-requisites
Currently, deployment of the target to a relay host is not supported. One host within the Hadoop cluster must be chosen to act as the target.
The prerequisites for a standard Tungsten Replicator should be followed, including:
This will provide the base environment into which Tungsten Replicator can be installed.
HDFS Location
The /user/tungsten directory must be writable by the replicator user within HDFS:
shell> hadoop fs -mkdir /user/tungsten
shell> hadoop fs -chmod 700 /user/tungsten
shell> hadoop fs -chown tungsten /user/tungsten
These commands should be executed by a user with HDFS administration rights (e.g. the hdfs user).
Replicator User Group Membership
The user that will be executing the replicator (typically tungsten, as recommended in Appendix B, Prerequisites) must be a member of the hive group on the Hadoop host where the replicator will be installed. Without this membership, the user will be unable to execute Hive queries.
In order to access the generated tables, both staging and the final tables, it is necessary to create a schema definition. The ddlscan tool can be used to read the existing definition of the tables from the source server and generate suitable Hive schema definitions to access the table data.
To create the staging table definition, use the ddl-mysql-hive-0.10.vm template; you must specify the JDBC connection string, user, password and database names. For example:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
-template ddl-mysql-hive-0.10.vm -db test
--
-- SQL generated on Wed Jan 29 16:17:05 GMT 2020 by Tungsten ddlscan utility
--
-- url = jdbc:mysql:thin://host1:13306/test
-- user = tungsten
-- dbName = test
--
CREATE DATABASE test;
DROP TABLE IF EXISTS test.movies_large;
CREATE TABLE test.movies_large
(
id INT ,
title STRING ,
year INT ,
episodetitle STRING )
;
The output from this command should be applied to your Hive installation within the Hadoop cluster. For example, by capturing the output, transferring that file and then running:
shell> cat schema.sql | hive
To create Hive tables that read the staging files loaded by the replicator, use the ddl-mysql-hive-0.10-staging.vm template:
shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host:13306/test' -pass password \
-template ddl-mysql-hive-0.10-staging.vm -db test
The process creates the schema and tables which match the schema and table names on the source database.
Transfer this file to your Hadoop environment and then create the generated schema:
shell> cat schema-staging.sql | hive
The process creates matching schema names, but table names are modified to include the prefix stage_xxx_. For example, for the table movies_large a staging table named stage_xxx_movies_large is created.
The Hive table definition is created pointing to the external file-based tables, using the default 0x01 field separator and 0x0A (newline) record separator. If different values were used for these in the configuration, the schema definition in the captured file from ddlscan should be updated by hand.
The tables should now be available within Hive. For more information on accessing and using the tables, see Section 4.6.4.3, “Accessing Generated Tables in Hive”.
For replicating into HDFS where Kerberos support has been enabled, the hadoop_kerberos.js batch script can be used in place of the normal hadoop.js script.
The script will need modification before it can be used, due to the varying implementations of Kerberos, and to ensure the correct authentication parameters are used.
Before installing, edit the hadoop_kerberos.js file, located at tungsten-replicator/appliers/batch/hadoop-kerberos.js within the installation package. Within that file is the following line, which defines the command that is run before the HDFS operations are called:
var kinit_prefix = "kinit USER/LEVEL@REALM -k -t KEYTAB_FILE;"
Edit this line to set the correct command and/or authentication parameters, such as the username and keytab file. The configured command will be executed immediately before all the commands that operate on the Hadoop filesystem, including creating directories and files.
For example, the variable might be updated to:
var kinit_prefix = "kinit mc/admin@CLOUDERA -k -t mcadmin.keytab;"
When installing, use --batch-load-template=hadoop_kerberos.js to enable the new batch load script.
Installation of the Hadoop replication consists of multiple stages:
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
Install the Applier replicator which will apply information to the target Hadoop environment.
Once the installation of the Extractor and Applier components has been completed, materialization of tables and views can be performed.
The applier replicator service reads information from the THL of the source and applies this to a local instance of Hadoop.
Installation must take place on a node within the Hadoop cluster. Writing to a remote HDFS filesystem is not currently supported.
Before installing the applier, add the following parameter to the extractor configuration, update the extractor, and then install the applier.
For Staging installs:
shell> cd tungsten-replicator-7.0.3-141
shell> ./tools/tpm configure alpha \
    --enable-batch-service=true
shell> ./tools/tpm update
For INI installs:
Add the following to /etc/tungsten/tungsten.ini:
[alpha]
...Existing Replicator Config...
enable-batch-service=true
shell> tpm update
The applier can now be configured.
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --user=tungsten \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --skip-validation-check=HostsFileCheck \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --skip-validation-check=DatasourceDBPort \
    --skip-validation-check=DirectDatasourceDBPort \
    --skip-validation-check=ReplicationServicePipelines \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=host1 \
    --members=host2 \
    --property=replicator.datasource.global.csvType=hive \
    --property=replicator.stage.q-to-dbms.blockCommitInterval=1s \
    --property=replicator.stage.q-to-dbms.blockCommitRowCount=1000 \
    --replication-password=secret \
    --replication-user=tungsten \
    --batch-enabled=true \
    --batch-load-language=js \
    --batch-load-template=hadoop \
    --datasource-type=file
shell> vi /etc/tungsten/tungsten.ini
[defaults]
user=tungsten
install-directory=/opt/continuent
profile-script=~/.bash_profile
skip-validation-check=HostsFileCheck
skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=DatasourceDBPort
skip-validation-check=DirectDatasourceDBPort
skip-validation-check=ReplicationServicePipelines
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=host1
members=host2
property=replicator.datasource.global.csvType=hive
property=replicator.stage.q-to-dbms.blockCommitInterval=1s
property=replicator.stage.q-to-dbms.blockCommitRowCount=1000
replication-password=secret
replication-user=tungsten
batch-enabled=true
batch-load-language=js
batch-load-template=hadoop
datasource-type=file
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--user=tungsten
user=tungsten
System User
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--skip-validation-check=HostsFileCheck
skip-validation-check=HostsFileCheck
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) » uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
--skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=InstallerMasterSlaveCheck
--skip-validation-check=DatasourceDBPort
skip-validation-check=DatasourceDBPort
--skip-validation-check=DirectDatasourceDBPort
skip-validation-check=DirectDatasourceDBPort
--skip-validation-check=ReplicationServicePipelines
skip-validation-check=ReplicationServicePipelines
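As a minimal sketch of how the check name described for --skip-validation-check can be picked out of tpm output, the following Python function collects the parenthesized names and turns them into options. It assumes the check name always appears as a single word in parentheses, as in the example message; the function name `skip_options` is hypothetical and not part of tpm.

```python
import re

# Assumption: a check name is a single word in parentheses, such as
# "(MySQLDefaultTableTypeCheck)". "(WITH PASSWORD)" contains a space
# and is therefore not matched.
CHECK_NAME = re.compile(r"\(([A-Za-z]\w*)\)")


def skip_options(tpm_output: str) -> list[str]:
    """Return one --skip-validation-check option per distinct check name."""
    names = []
    for name in CHECK_NAME.findall(tpm_output):
        if name not in names:
            names.append(name)
    return [f"--skip-validation-check={name}" for name in names]


sample = (
    "ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) "
    "uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck)"
)
print(skip_options(sample))
```

Review each reported check before skipping it, since a skipped check may lead to an invalid or non-working configuration.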
Configuration group alpha
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
The password to be used when connecting to the database using the corresponding --replication-user.
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
Should the replicator service use a batch applier
Which script language to use for batch loading
Value for the loadBatchTemplate property
Database type
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the service has been installed it can be monitored using the trepctl command. See Section 4.6.4.4, “Management and Monitoring of Hadoop Deployments” for more information. If there are problems during installation, see Section 4.6.4.5, “Troubleshooting Hadoop Replication”.
Added in 6.0.4. From Tungsten Replicator 6.0.4, the continuent-tools-hadoop tools are packaged within the main Tungsten Replicator software bundle and can be found within ./tungsten-replicator/support/hadoop-tools
The continuent-tools-hadoop repository contains a set of tools that allow for the convenient creation of DDL, materialized views, and data comparison on the tables that have been replicated from MySQL.
To obtain the tools, use git
shell> ./bin/load-reduce-check -s test -Ujdbc:mysql:thin://tr-hadoop2:13306 -udbload -ppassword
The load-reduce-check command performs four distinct steps:
Reads the schema from the MySQL server and creates the staging table DDL within Hive
Reads the schema from the MySQL server and creates the base table DDL within Hive
Executes the materialized view process on the data in each selected staging table to build the base table content.
Performs a data comparison
If not already completed, follow the schema generation process described in Section 4.6.2.2, “Schema Generation”. This creates the necessary Hive schema and staging schema definitions.
Once the tables have been created through ddlscan you can query the stage tables:
hive> select * from stage_xxx_movies_large limit 10;
OK
I 10 1 57475 All in the Family 1971 Archie Feels Left Out (#4.17)
I 10 2 57476 All in the Family 1971 Archie Finds a Friend (#6.18)
I 10 3 57477 All in the Family 1971 Archie Gets the Business: Part 1 (#8.1)
I 10 4 57478 All in the Family 1971 Archie Gets the Business: Part 2 (#8.2)
I 10 5 57479 All in the Family 1971 Archie Gives Blood (#1.4)
I 10 6 57480 All in the Family 1971 Archie Goes Too Far (#3.17)
I 10 7 57481 All in the Family 1971 Archie in the Cellar (#4.10)
I 10 8 57482 All in the Family 1971 Archie in the Hospital (#3.15)
I 10 9 57483 All in the Family 1971 Archie in the Lock-Up (#2.3)
I 10 10 57484 All in the Family 1971 Archie Is Branded (#3.20)
Once the two services (extractor and applier) have been installed, the services can be monitored using trepctl. To monitor the Extractor service:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000023:0000000505545003;0
appliedLastSeqno : 10992
appliedLatency : 42.764
channels : 1
clusterName : alpha
currentEventId : mysql-bin.000023:0000000505545003
currentTimeMillis : 1389871897922
dataServerHost : host1
extensions :
host : host1
latestEpochNumber : 0
masterConnectUri : thl://localhost:/
masterListenUri : thl://host1:2112/
maximumStoredSeqNo : 10992
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host1:13306/
relativeLatency : 158296.922
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host1
state : ONLINE
timeInStateSeconds : 165845.474
transitioningTo :
uptimeSeconds : 165850.047
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
When monitoring, the primary concern beyond identifying and coping with any errors is the applied latency. Larger numbers for applied latency generally indicate that the information is not being written out to disk effectively. There are a number of strategies that should be checked:
Confirm that the Hadoop environment is running effectively. Any delays to writing to HDFS will impact the replicator.
Adjust the block commit parameters. Tuning the block commit levels should find the balance between frequent updates to achieve the required latency, and generating files of a suitable size so that Hadoop can process them effectively through map/reduce. You should try both increasing and reducing the sizes to find the correct settings for your source data.
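Watching the applied latency can be scripted. The following Python sketch parses the `name : value` lines of trepctl status output into a dictionary and compares appliedLatency against a limit. The function names and the 30-second limit are illustrative assumptions, and the sample text is an excerpt of the status output shown earlier; in practice the text would come from running trepctl status.

```python
def parse_status(text: str) -> dict[str, str]:
    """Parse 'name : value' lines of trepctl status output into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only; event IDs contain further colons.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            status[name.strip()] = value.strip()
    return status


def latency_exceeds(status: dict[str, str], limit_seconds: float) -> bool:
    """Return True if appliedLatency is above the given limit."""
    return float(status["appliedLatency"]) > limit_seconds


SAMPLE = """Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000023:0000000505545003;0
appliedLastSeqno : 10992
appliedLatency : 42.764
state : ONLINE
Finished status command..."""

status = parse_status(SAMPLE)
# The 30-second limit is a hypothetical value, not a recommendation.
print(status["state"], latency_exceeds(status, 30.0))
```

A suitable limit depends on the block commit settings in use, because larger commit intervals naturally raise the latency seen on the applier.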
Replicating to Hadoop involves a number of discrete, specific steps. Due to the batch and multi-stage nature of the extract and apply process, replication can stall or stop due to a variety of issues.
During initial installation, or when starting up replication, the replicator may report that the commitseqno.0 file cannot be created or written properly, or during startup, that the file cannot be read.
The following checks and recovery procedures can be tried:
Check the permissions of the directory containing the commitseqno.0 file, the file itself, and the ownership:
shell> hadoop fs -ls -R /user/tungsten/metadata
drwxr-xr-x - cloudera cloudera 0 2020-01-14 10:40 /user/tungsten/metadata/alpha
-rw-r--r-- 3 cloudera cloudera 251 2020-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.
Check the content of the file is correct. The content should be a JSON structure containing the replicator state and position information. For example:
shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
{
"appliedLatency" : "0",
"epochNumber" : "0",
"fragno" : "0",
"shardId" : "dna",
"seqno" : "8",
"eventId" : "mysql-bin.000015:0000000000103156;0",
"extractedTstamp" : "1578998421000",
"lastFrag" : "true",
"sourceId" : "host1"
}
Try deleting the commitseqno.0 file and placing the replicator online:
shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
shell> trepctl online
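The content checks in the list above (non-empty, valid JSON, position fields present) can be sketched in Python. The set of expected keys is an assumption taken from the example file shown above, not a documented schema, and `check_commitseqno` is a hypothetical helper. The content string would be obtained with hadoop fs -cat.

```python
import json

# Assumption: these keys are taken from the example file content only.
EXPECTED_KEYS = {"seqno", "eventId", "epochNumber", "fragno", "lastFrag", "sourceId"}


def check_commitseqno(content: str) -> list[str]:
    """Return a list of problems found in commitseqno.0 content (empty if none)."""
    if not content.strip():
        return ["file is empty"]
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    missing = sorted(EXPECTED_KEYS - data.keys())
    if missing:
        return ["missing keys: " + ", ".join(missing)]
    return []


sample = """{
  "appliedLatency" : "0",
  "epochNumber" : "0",
  "fragno" : "0",
  "shardId" : "dna",
  "seqno" : "8",
  "eventId" : "mysql-bin.000015:0000000000103156;0",
  "extractedTstamp" : "1578998421000",
  "lastFrag" : "true",
  "sourceId" : "host1"
}"""
print(check_commitseqno(sample))
```

An empty result means only that the file is structurally sound; it does not confirm that the stored sequence number is the correct restart position.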
If the replication fails, is manually stopped, or the host needs to be restarted, replication should continue from the last point when replication was stopped. Files that were being written when replication was last running will be overwritten and the information recreated.
Unlike other heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within the HDFS of the target Hadoop environment. To recover from failed replication, this file must be deleted so that the THL can be re-read from the Source and the CSV files recreated and applied into HDFS.
On the Applier, put the replicator offline:
shell> trepctl offline
Remove the THL files from the Applier:
shell> trepctl reset -thl
Remove the staging CSV files replicated into Hadoop:
shell> hadoop fs -rm -r /user/tungsten/staging
Reset the restart position:
shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0
Replace alpha and /opt/continuent with the corresponding service name and installation location.
Restart replication on the Applier; this will start to recreate the THL files from the MySQL binary log:
shell> trepctl online
Replication may fail at the applier stage if the source data does not contain the correct ROW format and information, including the primary key data. trepctl may report the following error:
...
pendingErrorEventId : mysql-bin.000015:0000000000143981;0
pendingErrorSeqno : 10
pendingExceptionMessage: Wrapped com.continuent.tungsten.replicator.ReplicatorException: »
 Unable to find a primary key for dna.alt_allele_attrib and there is no default »
 from property stagePkeyColumn (../../tungsten-replicator//samples/scripts/batch/hdfs-merge.js#18)
pipelineSource : UNKNOWN
relativeLatency : -1.0
...
If the primary key was missing in the source data, the table structure on the source must be updated, and the THL information recreated.
Replication Operation Support | |
---|---|
Statements Replicated | No |
Rows Replicated | Yes |
Schema Replicated | No |
ddlscan Supported | Yes |
Tungsten Cluster supports replication to Oracle as a datasource. This allows replication of data from MySQL to Oracle. See Section B.1.2, “Database Support” for more details.
Replication in these configurations operates using two separate replicators:
Replicator on the Extractor, extracts the information from the source database into THL.
Replicator on the Applier reads the information from the remote replicator as THL, and applies that to the target database.
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites followed by the additional prerequisites specific to Oracle Targets outlined in Section 4.7.1.1, “Additional Prerequisites for Oracle Targets” then finally follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
When replicating from MySQL to Oracle there are a number of datatype differences that should be accommodated to ensure reliable replication of the information. The core differences are described in Table 4.3, “Data Type differences when replicating data from MySQL to Oracle”.
Table 4.3. Data Type differences when replicating data from MySQL to Oracle
MySQL Datatype | Oracle Datatype | Notes |
---|---|---|
INT | NUMBER(10, 0) | |
BIGINT | NUMBER(19, 0) | |
TINYINT | NUMBER(3, 0) | |
SMALLINT | NUMBER(5, 0) | |
MEDIUMINT | NUMBER(7, 0) | |
DECIMAL(x,y) | NUMBER(x, y) | |
FLOAT | FLOAT | |
CHAR(n) | CHAR(n) | |
VARCHAR(n) | VARCHAR2(n) | For sizes less than 2000 bytes data can be replicated. For lengths larger than 2000 bytes, the data will be truncated when written into Oracle |
DATE | DATE | |
DATETIME | DATE | |
TIMESTAMP | DATE | |
TEXT | CLOB | Replicator can transform TEXT into CLOB or VARCHAR(N). If you choose VARCHAR(N) on Oracle, the length of the data accepted by Oracle will be limited to 4000. This is a limitation of Oracle. The size of CLOB columns within Oracle is calculated in terabytes. If TEXT fields on MySQL are known to be less than 4000 bytes (not characters) long, then VARCHAR(4000) can be used on Oracle. This may be faster than using CLOB. |
BLOB | BLOB | |
ENUM(...) | VARCHAR(255) | Use the EnumToString filter |
SET(...) | VARCHAR(255) | Use the SetToString filter |
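To illustrate how the mappings in Table 4.3 apply to individual column definitions, here is a small Python lookup. It is a sketch built only from the table; it is not how ddlscan or the replicator performs the conversion, and the function name `oracle_type` is hypothetical.

```python
import re

# Types that map to a fixed Oracle type regardless of any arguments.
FIXED = {
    "INT": "NUMBER(10, 0)", "BIGINT": "NUMBER(19, 0)", "TINYINT": "NUMBER(3, 0)",
    "SMALLINT": "NUMBER(5, 0)", "MEDIUMINT": "NUMBER(7, 0)", "FLOAT": "FLOAT",
    "DATE": "DATE", "DATETIME": "DATE", "TIMESTAMP": "DATE",
    "TEXT": "CLOB", "BLOB": "BLOB",
    "ENUM": "VARCHAR(255)", "SET": "VARCHAR(255)",
}
# Types whose size arguments are carried over to the Oracle type.
SIZED = {"DECIMAL": "NUMBER", "CHAR": "CHAR", "VARCHAR": "VARCHAR2"}


def oracle_type(mysql_type: str) -> str:
    """Map a MySQL column type to the Oracle type listed in Table 4.3."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:\((.*)\))?\s*", mysql_type)
    if not match:
        raise ValueError(f"cannot parse type {mysql_type!r}")
    base, args = match.group(1).upper(), match.group(2)
    if base in SIZED and args is not None:
        sizes = ", ".join(part.strip() for part in args.split(","))
        return f"{SIZED[base]}({sizes})"
    if base in FIXED:
        return FIXED[base]
    raise ValueError(f"no mapping in Table 4.3 for {mysql_type!r}")


for column_type in ["INT", "DECIMAL(10,2)", "VARCHAR(64)", "ENUM('a','b')"]:
    print(column_type, "->", oracle_type(column_type))
```

The lookup does not enforce the size limits noted in the table, so VARCHAR lengths and TEXT handling still need to be reviewed by hand.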
When replicating to Oracle, the ddlscan command can be used to generate DDL appropriate for the supported data types in the target database. In MySQL to Oracle deployments the DDL can be read from the MySQL server and generated for the Oracle server so that replication can begin without manually creating the Oracle specific DDL.
In addition, the following DDL differences and requirements exist:
Column orders on MySQL and Oracle must match, but column names do not have to match.
Using the dropcolumn filter, columns can be dropped and ignored if required.
Each table within MySQL should have a Primary Key. Without a primary key, full-row based lookups are performed on the data when performing UPDATE or DELETE operations. With a primary key, the pkey filter can add metadata to the UPDATE/DELETE event, enabling faster application of events within Oracle.
Indexes on MySQL and Oracle do not have to match. This allows for different index types and tuning between the two systems according to application and dataserver performance requirements.
Keywords that are restricted on Oracle should not be used within MySQL as table, column or database names. For example, the keyword SESSION is not allowed within Oracle. Tungsten Cluster determines the column name from the target database metadata by position (column reference), not name, so replication will not fail, but applications may need to be adapted. For compatibility, try to avoid Oracle keywords.
For more information on differences between MySQL and Oracle, see Oracle and MySQL Compared.
To make the process of migration from MySQL to Oracle easier, Tungsten Cluster includes a tool called ddlscan which will read table definitions from MySQL and create appropriate Oracle table definitions to use during replication.
For reference information on the ddlscan tool, see Section 8.6, “The ddlscan Command”.
When replicating to Oracle there are a number of key steps that must be performed. The primary process is the preparation of the Oracle database and DDL for the database schema that are being replicated. Although DDL statements will be replicated to Oracle, they will often fail because of SQL language differences. Because of this, tables within Oracle must be created before replication starts.
When applying to Oracle, there are additional prerequisites to ensure the replicator can connect to, and apply to, the target database.
For remote Oracle targets (Offboard Applier)
To enable the replicator to apply to a remote Oracle instance, the Replicator host will require an Oracle Client installation, with an appropriate TNS entry configured in the tnsnames.ora file.
In addition, the environment for the tungsten OS user will need to be configured with the ORACLE_HOME and LD_LIBRARY_PATH variables.
For remote and local Oracle targets
Before installing, you need to ensure that you have the ojdbc7.jar file in the correct location.
This can be copied to either:
$ORACLE_HOME/jdbc/lib, or
/opt/continuent/software/tungsten-replicator-7.0.3-141/tungsten_replicator/lib
Before installing replication, the Oracle target database must be configured:
A user and schema must exist for each database from MySQL that you want to replicate. In addition, the schema used by the services within Tungsten Cluster must have an associated schema and user name.
For example, if you are replicating the database sales to Oracle, the following statements must be executed to create a suitable schema. This can be performed through any connection, including sqlplus:
shell> sqlplus sys/oracle as sysdba
SQL> CREATE USER sales IDENTIFIED BY password DEFAULT TABLESPACE DEMO QUOTA UNLIMITED ON DEMO;
The above assumes a suitable tablespace has been created (DEMO in this case).
A schema must also be created for each service replicating into Oracle. For example, if the service is called alpha, then the tungsten_alpha schema/user must be created. The same command can be used:
SQL> CREATE USER tungsten_alpha IDENTIFIED BY password DEFAULT TABLESPACE DEMO QUOTA UNLIMITED ON DEMO;
One of the users used above must be configured so that it has the rights to connect to Oracle and has all rights so that it can execute statements on any schema:
SQL> GRANT CONNECT TO tungsten_alpha;
SQL> GRANT DBA TO tungsten_alpha;
The user/password combination selected will be required when configuring the Applier replication service.
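When several MySQL databases are replicated, the statements above repeat for each schema. The following Python sketch generates them for review before they are run through sqlplus. It reuses the statement forms shown above; the function name and the second schema name `access_log` in the usage line are illustrative assumptions, and the identifiers are not validated or quoted.

```python
def oracle_setup_sql(schemas, service, password, tablespace="DEMO"):
    """Build CREATE USER and GRANT statements for the given schemas and service.

    One user is created per replicated schema, plus the tungsten_<service>
    user, which is then granted CONNECT and DBA.
    """
    service_user = f"tungsten_{service}"
    statements = [
        f"CREATE USER {user} IDENTIFIED BY {password} "
        f"DEFAULT TABLESPACE {tablespace} QUOTA UNLIMITED ON {tablespace};"
        for user in list(schemas) + [service_user]
    ]
    statements.append(f"GRANT CONNECT TO {service_user};")
    statements.append(f"GRANT DBA TO {service_user};")
    return statements


for statement in oracle_setup_sql(["sales", "access_log"], "alpha", "password"):
    print(statement)
```

Because the output is plain text, it can be saved to a file and checked in the same way as the DDL generated by ddlscan.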
On the host which has been already configured as the Extractor, use ddlscan to extract the DDL for Oracle:
shell> cd tungsten-replicator-7.0.3-141
shell> ./bin/ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:3306/access_log' \
    -pass password -template ddl-mysql-oracle.vm -db access_log
The output should be captured and checked before applying it to your Oracle instance:
shell> ./bin/ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:3306/access_log' \
-pass password -template ddl-mysql-oracle.vm -db access_log > access_log.ddl
If you are happy with the output, it can be executed against your target Oracle database:
shell> cat access_log.ddl | sqlplus sys/oracle as sysdba
The generated DDL includes statements to drop existing tables if they exist. This will fail in a new installation, but the output can be ignored.
Once the process has been completed for this database, it must be repeated for each database that you plan on replicating from MySQL to Oracle.
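Repeating the ddlscan step per database can be scripted. This Python sketch only builds the command lines, using the same options as the example above (-user, -url, -pass, -template, -db); it does not run them. The helper name and the second database name `sales` are illustrative assumptions.

```python
import shlex


def ddlscan_command(db, host="host1", port=3306, user="tungsten",
                    password="password", template="ddl-mysql-oracle.vm"):
    """Return the ddlscan argument list for one MySQL database."""
    return [
        "./bin/ddlscan",
        "-user", user,
        "-url", f"jdbc:mysql:thin://{host}:{port}/{db}",
        "-pass", password,
        "-template", template,
        "-db", db,
    ]


for db in ["access_log", "sales"]:
    # Each command's output is redirected to its own DDL file for review.
    print(shlex.join(ddlscan_command(db)), ">", f"{db}.ddl")
```

Keeping the arguments as a list also allows the commands to be passed to subprocess.run without shell quoting concerns, should the step be automated further.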
The Applier replicator will read the THL from the remote Extractor and apply it into Oracle using a standard JDBC connection. The Applier replicator needs to know the Extractor hostname, and the datasource type.
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Obtain a copy of the Oracle JDBC driver and copy it into the tungsten-replicator/lib directory:
shell> cp ojdbc7.jar ./tungsten-replicator/lib/
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost \
    --datasource-type=oracle \
    --datasource-oracle-service=ORCL \
    --datasource-user=tungsten_alpha \
    --datasource-password=secret \
    --svc-applier-filters=dropstatementdata
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
profile-script=~/.bash_profile
skip-validation-check=InstallerMasterSlaveCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret

[alpha]
master=sourcehost
members=localhost
datasource-type=oracle
datasource-oracle-service=ORCL
datasource-user=tungsten_alpha
datasource-password=secret
svc-applier-filters=dropstatementdata
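The staging (command-line) and INI forms of the configuration above carry the same options: each `--name=value` argument corresponds to a `name=value` line under the section named after the configuration group. The following Python sketch shows that correspondence. It is an illustration based on the two examples only, with a hypothetical function name, and it drops --reset on the assumption that the option is specific to staging configurations.

```python
def to_ini(section: str, options: list[str]) -> str:
    """Convert tpm staging options (--name=value) into an INI section."""
    lines = [f"[{section}]"]
    for option in options:
        if not option.startswith("--"):
            raise ValueError(f"not a long option: {option!r}")
        if option == "--reset":
            continue  # assumption: staging-only, so no INI counterpart
        lines.append(option[2:])
    return "\n".join(lines)


print(to_ini("alpha", [
    "--master=sourcehost",
    "--members=localhost",
    "--datasource-type=oracle",
    "--datasource-oracle-service=ORCL",
]))
```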
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=InstallerMasterSlaveCheck
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) » uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
Configuration group alpha
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
Database type
--datasource-oracle-service=ORCL
datasource-oracle-service=ORCL
Oracle Service Name
--datasource-user=tungsten_alpha
datasource-user=tungsten_alpha
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
The password to be used when connecting to the database using the corresponding --replication-user.
--svc-applier-filters=dropstatementdata
svc-applier-filters=dropstatementdata
Replication service applier filters
The replication-host option should be added to the above configuration if the target Oracle database is on a different host to the applier installation.
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the installation has completed, the status of the service should be reported. The service should be online and reading events from the Extractor replicator.
The status of the replicator can be checked and monitored by using the trepctl command.
Deployment of replication to PostgreSQL service operates as follows:
Service Alpha on the Extractor, extracts the information from the MySQL binary log into THL.
Service Alpha on the Applier reads the information from the remote replicator as THL, and applies that to PostgreSQL using a standard JDBC driver by constructing PostgreSQL compatible SQL to insert, update and delete the target data.
The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.
Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.
For replication to PostgreSQL hosts, you must ensure that the networking and user configuration has been configured correctly.
Within the PostgreSQL configuration, two changes need to be made:
Configure the networking so that the listen address for the PostgreSQL server is set correctly. Edit the /etc/postgresql/main/postgresql.conf file and set the listen_addresses line either to * or to an explicit IP address. For example:
listen_addresses = '192.168.3.73'
Edit the /etc/postgresql/main/pg_hba.conf file and ensure that the password properties match the password settings and hostname limitations. In particular, the replicator will communicate over the public IP address, not localhost, and so you must ensure that network-based connections using a user/password combination are allowed. For example, you may want to add a line to the file that provides network-wide access, or at least access for the local network range:
local all all md5
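When reviewing pg_hba.conf, it can help to confirm that at least one record covers TCP/IP connections. In PostgreSQL, records of type `local` match Unix-domain socket connections only, while `host`, `hostssl` and `hostnossl` records match connections made over TCP/IP. The following Python sketch lists the network records in a file's content. The function name is hypothetical, the `192.168.3.0/24` range in the sample is an assumed example network, and the comment handling is deliberately simple (it does not account for `#` inside quoted values).

```python
# Record types that apply to TCP/IP connections.
NETWORK_TYPES = {"host", "hostssl", "hostnossl"}


def network_rules(pg_hba_text: str) -> list[list[str]]:
    """Return the fields of each pg_hba.conf record that applies to TCP/IP."""
    rules = []
    for raw in pg_hba_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if fields[0] in NETWORK_TYPES:
            rules.append(fields)
    return rules


sample = """# TYPE  DATABASE  USER  ADDRESS         METHOD
local   all       all                   md5
host    all       all   192.168.3.0/24  md5
"""
for rule in network_rules(sample):
    print(rule)
```

If the function returns an empty list, the replicator's connections over the public IP address would not match any record in the file.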
A suitable user must be created with rights and permissions to create databases, as this is required by the replicator to create databases, tables, and other objects. The createuser command can be used for this purpose. The --createdb option adds the CREATEDB permission:
shell> createuser tungsten --createdb
You will be prompted to provide a password for the user.
Alternatively, you can create the user and permissions through the psql interface:
shell> sudo -u postgres psql --port=5433 --user=postgres postgres
Type "help" for help.
postgres=# CREATE ROLE tungsten WITH LOGIN PASSWORD 'password';
postgres=# ALTER ROLE tungsten CREATEDB;
You may also want to grant specific privileges to existing databases which must be done within the psql interface:
shell> sudo -u postgres psql --port=5433 --user=postgres postgres
Type "help" for help.
postgres=# GRANT ALL ON DATABASE postgres TO tungsten;
Once you have completed the configuration of the PostgreSQL database, you can configure and install the PostgreSQL applier as described using the steps below.
Unpack the Tungsten Replicator distribution in the staging directory:
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change into the staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the installation using tpm:
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --profile-script=~/.bash_profile \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost,sourcehost \
    --datasource-type=postgresql \
    --postgresql-dbname=dbname \
    --replication-user=tungsten \
    --replication-password=secret \
    --replication-host=remotedbhost \
    --replication-port=5432
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
profile-script=~/.bash_profile
rest-api-admin-user=apiuser
rest-api-admin-pass=secret

[alpha]
master=sourcehost
members=localhost,sourcehost
datasource-type=postgresql
postgresql-dbname=dbname
replication-user=tungsten
replication-password=secret
replication-host=remotedbhost
replication-port=5432
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
System User
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
Configuration group alpha
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
--members=localhost,sourcehost
Hostnames for the dataservice members
Database type
Name of the database to replicate
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
The password to be used when connecting to the database using the corresponding --replication-user.
--replication-host=remotedbhost
Hostname of the datasource where the database is located. If the specified hostname matches the current host or member name, the database is assumed to be local. If the hostnames do not match, extraction is assumed to be via remote access. For MySQL hosts, this configures a remote replication Replica (relay) connection.
The network port used to connect to the database server. The default port used depends on the database being configured.
If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:
rest-api-admin-user=tungsten
rest-api-admin-pass=secret
Once the prerequisites and the configuration of the installation have been completed, the software can be installed:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the replicators have started, the status of the service can be checked using trepctl. See Section 4.8.3, “Management and Monitoring of PostgreSQL Deployments” for more information.
Once the two services — extractor and applier — have been installed, the services can be monitored using trepctl. To monitor the extractor service:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000008:0000000000412301;0
appliedLastSeqno : 1296
appliedLatency : 1.889
channels : 1
clusterName : epsilon
currentEventId : mysql-bin.000008:0000000000412301
currentTimeMillis : 1377097812795
dataServerHost : host1
extensions :
latestEpochNumber : 1286
masterConnectUri : thl://localhost:/
masterListenUri : thl://host2:2112/
maximumStoredSeqNo : 1296
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host1:13306/
relativeLatency : 177444.795
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host1
state : ONLINE
timeInStateSeconds : 177443.948
transitioningTo :
uptimeSeconds : 177461.483
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
The replicator service operates just the same as a standard Extractor service of a typical MySQL replication service.
The PostgreSQL applier service can be accessed either remotely from the Extractor:
shell> trepctl -host host2 status
...
Or locally on the Applier host:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000008:0000000000412301;0
appliedLastSeqno : 1296
appliedLatency : 10.253
channels : 1
clusterName : alpha
currentEventId : NONE
currentTimeMillis : 1377098139212
dataServerHost : host2
extensions :
latestEpochNumber : 1286
masterConnectUri : thl://host1:2112/
masterListenUri : null
maximumStoredSeqNo : 1296
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://host1:2112/
relativeLatency : 177771.212
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host2
state : ONLINE
timeInStateSeconds : 177783.343
transitioningTo :
uptimeSeconds : 180631.276
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
Monitoring the status of replication between the Source and Target is also the same. The appliedLastSeqno still indicates the sequence number that has been applied to PostgreSQL, and the event ID from PostgreSQL can still be identified from appliedLastEventId.
Sequence numbers between the two hosts should match, as in a Primary/Replica deployment, but due to the method used to replicate, the applied latency may be higher. Tables that do not use primary keys, or large individual row updates may cause increased latency differences.
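The sequence-number comparison described above could be scripted along the following lines. This is a minimal sketch, not something shipped with Tungsten Replicator: the helper name status_field is hypothetical, and it assumes the "NAME : VALUE" layout shown in the trepctl status output above. Sample lines from that output stand in for live data.

```shell
#!/bin/sh
# status_field NAME: read `trepctl status` output on stdin and print the
# value of the named field (hypothetical helper, assumed for illustration).
status_field() {
    awk -v name="$1" '
        {
            # Split on the first colon only: values such as event IDs
            # contain colons of their own.
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == name) { print val; exit }
        }'
}

# Sample lines copied from the status output above.
extractor_status='appliedLastEventId : mysql-bin.000008:0000000000412301;0
appliedLastSeqno : 1296
appliedLatency : 1.889'
applier_status='appliedLastSeqno : 1296
appliedLatency : 10.253'

src=$(printf '%s\n' "$extractor_status" | status_field appliedLastSeqno)
dst=$(printf '%s\n' "$applier_status" | status_field appliedLastSeqno)

if [ "$src" = "$dst" ]; then
    echo "in sync at seqno $src"
else
    echo "applier behind: extractor=$src applier=$dst"
fi
```

On a live deployment, the sample strings would be replaced by piping the output of trepctl status (on the Extractor) and trepctl -host host2 status (for the Applier) into status_field.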
If you have an AWS account, you can take advantage of pre-built EC2 hosts, complete with all necessary pre-requisites in place, launched from an AWS Marketplace AMI.
Upon launch, a wizard will start and prompt you for a number of credentials to build a default configuration for Tungsten Replicator.
For a complete end-to-end Replication Pipeline you will need:
One host launched from the Tungsten Replicator for MySQL Source Extraction AMI for each Source you wish to extract from
One or more hosts launched from the appropriate Target AMI to match your requirements
Source Databases
The Tungsten Replicator for MySQL Source Extraction is required for extraction from any of the following:
MySQL hosted on another EC2 instance
MySQL hosted on the same EC2 host launched from the AMI
An existing Tungsten Clustering Installation
Amazon RDS
Amazon Aurora
MySQL hosted on a remote non-AWS host
Google Cloud SQL
Microsoft Azure
Target Databases
MySQL Targets include all of the following:
MySQL hosted on another EC2 instance
MySQL hosted on the same EC2 host launched from the AMI
An existing Tungsten Clustering Installation
Amazon RDS
Amazon Aurora
MySQL hosted on a remote non-AWS host
Google Cloud SQL
Microsoft Azure
PostgreSQL (Including RDS)
Oracle (Including RDS)
Upon launch, the AMI does NOT include the required binaries for a locally hosted database instance. For a local install for either the extractor or the applier, this will need to be configured manually beforehand.
If you plan to extract from an existing Tungsten Cluster (Cluster-Extractor), a number of changes may need to be applied to your cluster configuration. In addition, your cluster must be running the same release as Tungsten Replicator. For more details on Cluster requirements, consult the appropriate Applier-specific pages in Chapter 4, Deploying Appliers.
For any non-AWS hosted instances, ensure the appropriate inbound and outbound security rules are in place to allow WAN Communication.
When using the AMI to configure an Extractor or Applier, it is important to ensure all the necessary target/source database pre-requisites are in place.
For extraction, ensure your source MySQL Instance is configured as per the Database specific notes in Section B.4, “MySQL Database Setup”
In addition, for Amazon based extraction, pay particular attention to Section B.4.6, “MySQL Unprivileged Users”
For preparing the target database, specific notes on target prerequisites, where appropriate, are detailed within each applier deployment section found in Chapter 4, Deploying Appliers.
Once you have prepared your sources and targets, you can launch the relevant AMIs from the Marketplace.
Within your AWS Dashboard, you can find the AMI by searching within the Marketplace for "Continuent"
Select the Extractor AMI and the Target AMI based on your choice of target database. Each AMI is restricted to only configure an applier based on the choice of target. There are no restrictions on extraction, providing the necessary prerequisites are in place.
Ensure you select a Security group that allows communication to the source and target databases. The required network ports are detailed in Section B.3.3.1, “Network Ports”.
After launching the AMI, obtain the public IP and connect to the shell using your preferred Terminal application, for example:
shell> ssh -i your-key.pem ec2-user@publicIP
Upon connecting, you will see a welcome message. From here you can connect as the tungsten user:
shell> sudo su - tungsten
The launch wizard will start automatically and start prompting you for details regarding your source or target database.
It is advisable to configure the Extractor AMI first as you will need to provide details of the extractor when you configure the applier.
Once you have provided all the information to the wizard, you will be prompted on screen for the next steps.
In summary, the wizard will have completed the following:
Created tungsten.ini within /etc/tungsten
Created additional directories for software installation
Created additional configuration files depending upon target requirements
Created a log file of the Wizard execution within /home/tungsten/ami-launch/log
The latest version of Tungsten Replicator will be unpacked within /opt/continuent/software
The wizard does not install the software. This allows you to fine-tune the configuration to suit your needs, such as adding additional filters, or adjusting memory and buffer allocations.
For more information on all the possible configuration parameters, see Section 9.8, “tpm Configuration Options”.
You can now install the software. Follow the on-screen instructions displayed after the Wizard completes to install using tpm, or review Section 9.4.2, “Installation with INI File”.
For further reading and understanding of how to manage the replicator, review Chapter 7, Operations Guide
For steps on starting and stopping the replicator, review Section 2.4, “Starting and Stopping Tungsten Replicator”
For details on how to monitor and interact with the running replicator using the trepctl tool, review Section 8.20, “The trepctl Command”
The fan-in topology is the logical opposite of a Primary/Replica topology. In a fan-in topology, the data from two Sources is combined together on one Target. Fan-in topologies are often used in situations where you have satellite databases, maybe for sales or retail operations, and need to combine that information together in a single database for processing.
Some additional considerations need to be made when using fan-in topologies:
If the same tables from each machine are being merged together, it is possible to get collisions in the data where auto increment is used. The effects can be minimized by using increment offsets within the MySQL configuration:
auto-increment-offset = 1
auto-increment-increment = 4
Fan-in can work more effectively, and be less prone to problems with the corresponding data, by configuring specific tables at different sites. For example, with two sites in New York and San Jose, databases and tables can be prefixed with the site name, i.e. sjc_sales and nyc_sales.
Alternatively, a filter can be configured to rename the database sales dynamically to the corresponding location based tables. See Section 11.4.34, “Rename Filter” for more information.
Statement-based replication will work for most instances, but where your statements are updating data dynamically within the statement, in fan-in the information may get increased according to the number of fan-in Sources. Update your configuration file to explicitly use row-based replication by adding the following to your my.cnf file:
binlog-format = row
Triggers can cause problems during fan-in replication if two different statements, one from each Source, are replicated to the Target and cause the operations to be triggered multiple times. Tungsten Replicator cannot prevent triggers from executing on the concentrator host and there is no way to selectively disable triggers. Check at the trigger level whether you are executing on a Source or Target. For more information, see Section C.4.1, “Triggers”.
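To illustrate the auto-increment consideration above, the following sketch prints the values each Source would generate when every Source shares the same auto-increment-increment but has its own auto-increment-offset. The per-host offsets (1 for host1, 2 for host2) are an assumed assignment for illustration, and the arithmetic assumes an empty table and an offset no larger than the increment. The function name is hypothetical.

```shell
#!/bin/sh
# auto_increment_values OFFSET INCREMENT COUNT: print the first COUNT values
# generated for an empty table: OFFSET, OFFSET+INCREMENT, OFFSET+2*INCREMENT...
auto_increment_values() {
    offset=$1; increment=$2; count=$3
    i=0
    out=""
    while [ "$i" -lt "$count" ]; do
        # Append the next value, space-separated.
        out="$out${out:+ }$((offset + i * increment))"
        i=$((i + 1))
    done
    echo "$out"
}

# Assumed assignment: one distinct offset per Source, shared increment of 4.
echo "host1: $(auto_increment_values 1 4 3)"
echo "host2: $(auto_increment_values 2 4 3)"
```

Because the two sequences never share a value, rows inserted on either Source can be merged on the Target without key collisions. An increment of 4 leaves room for up to four Sources.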
To create the configuration, the Extractors and services must be specified; the topology specification takes care of the actual configuration:
shell> ./tools/tpm configure epsilon \
--topology=fan-in \
--install-directory=/opt/continuent \
--replication-user=tungsten \
--replication-password=password \
--master=host1,host2 \
--members=host1,host2,host3 \
--master-services=alpha,beta \
--rest-api-admin-user=apiuser \
--rest-api-admin-pass=secret
shell> vi /etc/tungsten/tungsten.ini
[epsilon]
topology=fan-in
install-directory=/opt/continuent
replication-user=tungsten
replication-password=password
master=host1,host2
members=host1,host2,host3
master-services=alpha,beta
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
Configuration group epsilon
The description of each of the options is shown below:
Replication topology for the dataservice.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=password
The password to be used when connecting to the database using the corresponding --replication-user.
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
Data service names that should be used on each Primary
For additional options supported for configuration with tpm, see Chapter 9, The tpm Deployment Command.
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the installation has been completed, the service will be started and ready to use.
Once the service has been started, a quick view of the service status can be determined using trepctl. Because there are multiple services, the service name and host name must be specified explicitly. The Extractor connection of one of the fan-in hosts:
shell> trepctl -service alpha -host host1 status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000012:0000000000000418;0
appliedLastSeqno : 0
appliedLatency : 1.194
channels : 1
clusterName : alpha
currentEventId : mysql-bin.000012:0000000000000418
currentTimeMillis : 1375451438898
dataServerHost : host1
extensions :
latestEpochNumber : 0
masterConnectUri : thl://localhost:/
masterListenUri : thl://host1:2112/
maximumStoredSeqNo : 0
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host1:13306/
relativeLatency : 6232.897
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : alpha
serviceType : local
simpleServiceName : alpha
siteName : default
sourceId : host1
state : ONLINE
timeInStateSeconds : 6231.881
transitioningTo :
uptimeSeconds : 6238.061
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
The corresponding Extractor service from the other host is beta on host2:
shell> trepctl -service beta -host host2 status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000012:0000000000000415;0
appliedLastSeqno : 0
appliedLatency : 0.941
channels : 1
clusterName : beta
currentEventId : mysql-bin.000012:0000000000000415
currentTimeMillis : 1375451493579
dataServerHost : host2
extensions :
latestEpochNumber : 0
masterConnectUri : thl://localhost:/
masterListenUri : thl://host2:2112/
maximumStoredSeqNo : 0
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host2:13306/
relativeLatency : 6286.579
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : beta
serviceType : local
simpleServiceName : beta
siteName : default
sourceId : host2
state : ONLINE
timeInStateSeconds : 6285.823
transitioningTo :
uptimeSeconds : 6291.053
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
Note that because this is a fan-in topology, the sequence numbers and applied sequence numbers will be different for each service, as each service is independently storing data within the fan-in hub database.
The following sequence number combinations should match between the different hosts on each service:
The sequence numbers between host1 and host2 will not match, as they are two independent services.
For more information on using trepctl, see Section 8.20, “The trepctl Command”.
Definitions of the individual field descriptions in the above example output can be found in Section E.2, “Generated Field Reference”.
For more information on the management and operational details of your installation, see Chapter 7, Operations Guide.
It is possible to install multiple replicators on the same host. This can be useful either when building complex topologies with multiple services, or in heterogeneous environments where you are reading from one database and writing to another that may be installed on the same server.
When installing multiple replicator services on the same host, different values must be set for the following configuration parameters:
Before continuing with deployment you will need the following:
The name to use for the service.
The list of datasources in the service. These are the servers which will be running MySQL.
The username and password of the MySQL replication user.
All servers must be prepared with the proper prerequisites. See Appendix B, Prerequisites for additional details.
RMI network port used for communicating with the replicator service.
Set through the --rmi-port parameter to tpm. Note that RMI ports are configured in pairs; the default port is 10000, and port 10001 is used automatically. When specifying an alternative port, the subsequent port must also be available. For example, specifying port 10002 also requires 10003.
THL network port used for exchanging THL data.
Set through the --thl-port parameter to tpm. The default THL port is 2112. This option is required for services operating as Extractors.
Extractor THL port, i.e. the port from which an Applier will read THL events from the Extractor.
Set through the --master-thl-port parameter to tpm. When operating as an Applier, the explicit THL port should be specified to ensure that you are connecting to the THL port correctly.
Extractor hostname
Set through the --master-thl-host parameter to tpm. This is optional if the Extractor hostname has been configured correctly through the --master parameter.
Installation directory used when the replicator is installed.
Set through the --install-directory parameter to tpm. This directory must have been created, and be configured with suitable permissions, before installation starts. For more information, see Section B.3.4, “Directory Locations and Configuration”.
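Because RMI ports are allocated in pairs, it can help to confirm that the pairs chosen for two replicators on the same host do not collide before running tpm. The following is a minimal sketch of that arithmetic only: the function names are hypothetical, and it does not check whether the ports are actually free on the host.

```shell
#!/bin/sh
# rmi_port_pair BASE: print the two ports used by a replicator configured
# with --rmi-port=BASE (BASE and BASE+1).
rmi_port_pair() {
    echo "$1 $(( $1 + 1 ))"
}

# ports_overlap BASE_A BASE_B: succeed (exit 0) if the two RMI port pairs
# share a port, which happens when the bases are fewer than 2 apart.
ports_overlap() {
    diff=$(( $1 - $2 ))
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    [ "$diff" -lt 2 ]
}

echo "first replicator uses:  $(rmi_port_pair 10000)"
echo "second replicator uses: $(rmi_port_pair 10002)"
if ports_overlap 10000 10002; then
    echo "conflict"
else
    echo "no conflict"
fi
```

With the default port 10000 for one replicator, 10002 is the lowest base port that keeps the second pair clear of the first.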
For example, to create two services, one that reads from MySQL and another that writes to MongoDB on the same host:
Install the Tungsten Replicator package or download the Tungsten Replicator tarball, and unpack it:
shell> cd /opt/continuent/software
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Create the proper directories with appropriate ownership and permissions:
shell> sudo mkdir /opt/applier /opt/extractor
shell> sudo chown tungsten: /opt/applier/ /opt/extractor/
shell> sudo chmod 700 /opt/applier/ /opt/extractor/
Change to the Tungsten Replicator directory:
shell> cd tungsten-replicator-7.0.3-141
Extractor reading from MySQL, shown first using the Staging method and then using the INI method:
shell> ./tools/tpm configure defaults \
--reset \
--install-directory=/opt/extractor \
--user=tungsten \
--profile-script=~/.bash_profile \
--mysql-allow-intensive-checks=true \
--disable-security-controls=true \
--executable-prefix=ext \
--rest-api-admin-user=apiuser \
--rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
--master=offboardhost \
--members=offboardhost \
--enable-heterogeneous-service=true \
--replication-port=3306 \
--replication-user=tungsten_alpha \
--replication-password=secret \
--datasource-mysql-conf=/etc/my.cnf \
--svc-extractor-filters=colnames,pkey \
--property=replicator.filter.pkey.addColumnsToDeletes=true \
--property=replicator.filter.pkey.addPkeyToInserts=true \
--mysql-enable-enumtostring=true \
--mysql-enable-settostring=true \
--mysql-use-bytes-for-string=false
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/extractor
user=tungsten
profile-script=~/.bash_profile
mysql-allow-intensive-checks=true
disable-security-controls=true
executable-prefix=ext
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=offboardhost
members=offboardhost
enable-heterogeneous-service=true
replication-port=3306
replication-user=tungsten_alpha
replication-password=secret
datasource-mysql-conf=/etc/my.cnf
svc-extractor-filters=colnames,pkey
property=replicator.filter.pkey.addColumnsToDeletes=true
property=replicator.filter.pkey.addPkeyToInserts=true
mysql-enable-enumtostring=true
mysql-enable-settostring=true
mysql-use-bytes-for-string=false
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/extractor
install-directory=/opt/extractor
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
System User
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--mysql-allow-intensive-checks=true
mysql-allow-intensive-checks=true
For MySQL installation, enables detailed checks on the supported data types within the MySQL database to confirm compatibility. This includes checking each table definition individually for any unsupported data types.
--disable-security-controls=true
disable-security-controls=true
Disables all forms of security, including SSL, TLS and authentication
When enabled, the supplied prefix is added to each command alias that is generated for a given installation. This enables multiple installations to co-exist and be accessible through a unique alias. For example, if the executable prefix is configured as east, then an alias for the installation to trepctl will be created as east_trepctl.
Alias information for executable prefix data is stored within the $CONTINUENT_ROOT/share/aliases.sh file for each installation.
Configuration group alpha
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
--enable-heterogeneous-service=true
enable-heterogeneous-service=true
On a Primary:
--mysql-use-bytes-for-string is set to false.
colnames filter is enabled (in the binlog-to-q stage) to add column names to the THL information.
pkey filter is enabled (in the binlog-to-q and q-to-dbms stage), with the addPkeyToInserts and addColumnsToDeletes filter options set to false.
enumtostring filter is enabled (in the q-to-thl stage), to translate ENUM values to their string equivalents.
settostring filter is enabled (in the q-to-thl stage), to translate SET values to their string equivalents.
On a Replica:
--mysql-use-bytes-for-string is set to true.
The network port used to connect to the database server. The default port used depends on the database being configured.
--replication-user=tungsten_alpha
replication-user=tungsten_alpha
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
The password to be used when connecting to the database using the corresponding --replication-user.
--datasource-mysql-conf=/etc/my.cnf
datasource-mysql-conf=/etc/my.cnf
MySQL config file
--svc-extractor-filters=colnames,pkey
svc-extractor-filters=colnames,pkey
Replication service extractor filters
--mysql-enable-enumtostring=true
mysql-enable-enumtostring=true
Enable a filter to convert ENUM values to strings
--mysql-enable-settostring=true
Enable a filter to convert SET types to strings
--mysql-use-bytes-for-string=false
mysql-use-bytes-for-string=false
Transfer strings as their byte representation?
This is a standard configuration using the default ports, with the directory /opt/extractor.
Applier for writing to MongoDB, shown first using the Staging method and then using the INI method:
shell> ./tools/tpm configure defaults \
--reset \
--install-directory=/opt/applier \
--profile-script=~/.bash_profile \
--skip-validation-check=InstallerMasterSlaveCheck \
--executable-prefix=app \
--rest-api-admin-user=apiuser \
--rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
--master=localhost \
--members=localhost \
--role=slave \
--datasource-type=mongodb \
--replication-user=tungsten \
--replication-password=secret \
--rmi-port=10002 \
--master-thl-port=2112 \
--master-thl-host=localhost \
--thl-port=2113
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/applier
profile-script=~/.bash_profile
skip-validation-check=InstallerMasterSlaveCheck
executable-prefix=app
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
master=localhost
members=localhost
role=slave
datasource-type=mongodb
replication-user=tungsten
replication-password=secret
rmi-port=10002
master-thl-port=2112
master-thl-host=localhost
thl-port=2113
Configuration group defaults
The description of each of the options is shown below:
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/applier
install-directory=/opt/applier
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--skip-validation-check=InstallerMasterSlaveCheck
skip-validation-check=InstallerMasterSlaveCheck
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) » uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
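As a sketch of how the check name can be picked out of an error line like the one above, the following helper extracts the name in the trailing parentheses and builds the matching option. The helper name check_name is hypothetical, and the pattern assumes check names are alphanumeric and end in "Check", as in the examples in this section. The sample line is the error shown above with the line-wrap marker removed.

```shell
#!/bin/sh
# check_name: read tpm output on stdin and print the validation check name
# found in parentheses, e.g. "(MySQLDefaultTableTypeCheck)".
# Assumption: names are alphanumeric and end in "Check".
check_name() {
    sed -n 's/.*(\([A-Za-z0-9]*Check\)).*/\1/p'
}

line='ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck)'
name=$(printf '%s\n' "$line" | check_name)
echo "--skip-validation-check=$name"
```

The parenthesised "(WITH PASSWORD)" is ignored because it contains a space and does not end in "Check". As noted above, skipping a check may still lead to an invalid or non-working configuration.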
When enabled, the supplied prefix is added to each command alias that is generated for a given installation. This enables multiple installations to co-exist and be accessible through a unique alias. For example, if the executable prefix is configured as east, then an alias for the installation to trepctl will be created as east_trepctl.
Alias information for executable prefix data is stored within the $CONTINUENT_ROOT/share/aliases.sh file for each installation.
Configuration group alpha
The description of each of the options is shown below:
The hostname of the primary (extractor) within the current service.
Hostnames for the dataservice members
What is the replication role for this service?
Database type
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
The password to be used when connecting to the database using the corresponding --replication-user.
Replication RMI listen port
Primary THL Port
Primary THL Hostname
Port to use for THL Operations
In this configuration, the Extractor THL port is specified explicitly, along with the THL port used by this replicator, the RMI port used for administration, and the installation directory /opt/applier.
Run tpm to install the software:
shell> ./tools/tpm install
During the startup and installation, tpm will notify you of any problems that need to be fixed before the service can be correctly installed and started. If start-and-report is set and the service starts correctly, you should see the configuration and current status of the service.
Initialize your PATH and environment.
shell> source /opt/extractor/share/env.sh
shell> source /opt/applier/share/env.sh
Check the replication status.
When multiple replicators have been installed, checking the replicator status through trepctl depends on the replicator executable location used. If /opt/extractor/tungsten/tungsten-replicator/bin/trepctl is used, the extractor service status will be reported. If /opt/applier/tungsten/tungsten-replicator/bin/trepctl is used, then the applier service status will be reported.
To make things easier, executable-prefix has been used in the config examples above, which sets up OS aliases. These aliases are set up when you source the relevant env.sh files. This also happens by default when you log in to the host, provided profile-script has been specified.
The prefix and aliases simplify the use of all executables. For example, based on the setting of executable-prefix in the above config examples, to report the status of the extractor, you can execute:
shell> ext_trepctl status
Or to check the applier service:
shell> app_trepctl status
Alternatively, a specific replicator can be checked by explicitly specifying the RMI port of the service. For example, to check the extractor service:
shell> trepctl -port 10000 status
Or to check the applier service:
shell> trepctl -port 10002 status
When an explicit port has been specified in this way, the executable used is irrelevant. Any valid trepctl instance will work.
Further, either path may be used to get a summary view using multi_trepctl:
shell> /opt/extractor/tungsten/tungsten-replicator/scripts/multi_trepctl
| host | servicename | role | state | appliedlastseqno | appliedlatency |
| host1 | extractor | master | ONLINE | 0 | 1.724 |
| host1 | applier | slave | ONLINE | 0 | 0.000 |
Follow the guidelines in Section 2.2, “Best Practices”.
If you have an existing dataservice, data can be replicated from a standalone MySQL server into the service. The replication is configured by creating a service that reads from the standalone MySQL server and writes into the Primary of the target dataservice. By writing this way, changes are replicated to the Primary and Replica in the new deployment.
Additionally, using a replicator that writes data into an existing data service can be used when migrating from an existing service into a new Tungsten Cluster service.
In order to configure this deployment, there are two steps:
Create a new replicator that reads this data and writes the replicated data into the Primary of the destination dataservice.
Create a new replicator that reads the binary logs directly from the external MySQL service through the Primary of the destination dataservice
There are also the following requirements:
The host to which you want to replicate must have Tungsten Replicator 5.3.0 or later.
Hosts on both the replicator and cluster must be able to communicate with each other.
The replication user on the source host must have the RELOAD, REPLICATION SLAVE, and REPLICATION CLIENT GRANT privileges.
Replicator must be able to connect as the tungsten user to the databases within the cluster.
Install the Tungsten Replicator package (see Section 2.1.2, “Using the RPM package files”), or download the compressed tarball and unpack it on host1:
shell> cd /opt/replicator/software
shell> tar zxf tungsten-replicator-7.0.3-141.tar.gz
Change to the Tungsten Replicator staging directory:
shell> cd tungsten-replicator-7.0.3-141
Configure the replicator on host1
First we configure the defaults and a cluster alias that points to the Primaries and Replicas within the current Tungsten Cluster service that you are replicating from:
shell> ./tools/tpm configure defaults \
--install-directory=/opt/replicator \
--rmi-port=10002 \
--user=tungsten \
--replication-user=tungsten \
--replication-password=secret \
--skip-validation-check=MySQLNoMySQLReplicationCheck \
--rest-api-admin-user=apiuser \
--rest-api-admin-pass=secret
shell> ./tools/tpm configure beta \
--topology=direct \
--master=host1 \
--direct-datasource-host=host3 \
--thl-port=2113
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/replicator
rmi-port=10002
user=tungsten
replication-user=tungsten
replication-password=secret
skip-validation-check=MySQLNoMySQLReplicationCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[beta]
topology=direct
master=host1
direct-datasource-host=host3
thl-port=2113
Configuration group defaults
The description of each of the options is shown below:
--install-directory=/opt/replicator
install-directory=/opt/replicator
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
Replication RMI listen port
System User
For databases that required authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
The password to be used when connecting to the database using
the corresponding
--replication-user
.
--skip-validation-check=MySQLNoMySQLReplicationCheck
skip-validation-check=MySQLNoMySQLReplicationCheck
The --skip-validation-check option disables a given validation check. If any validation check fails, the installation, validation or configuration will automatically stop.
Using this option enables you to bypass the specified check, although skipping a check may lead to an invalid or non-working configuration.
You can identify a given check if an error or warning has been raised during configuration. For example, the default table type check:
... ERROR >> centos >> The datasource root@centos:3306 (WITH PASSWORD) uses MyISAM as the default storage engine (MySQLDefaultTableTypeCheck) ...
The check in this case is MySQLDefaultTableTypeCheck, and could be ignored using --skip-validation-check=MySQLDefaultTableTypeCheck.
Setting both --skip-validation-check and --enable-validation-check is equivalent to explicitly disabling the specified check.
Configuration group beta
The description of each of the options is shown below:
--topology=direct
topology=direct
Replication topology for the dataservice.
--master=host1
master=host1
The hostname of the primary (extractor) within the current service.
--direct-datasource-host=host3
direct-datasource-host=host3
Database server hostname
--thl-port=2113
thl-port=2113
Port to use for THL Operations
This creates a configuration that specifies that the topology should read directly from the source host, host3, writing directly to host1. An alternative THL port is provided to ensure that the THL listener is not operating on the same network port as the original.
Now install the service, which will create the replicator reading direct from host3 into host1:
shell> ./tools/tpm install
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
Once the installation has been completed, you must update the position of the replicator so that it points to the correct position within the source database to prevent errors during replication. If the replication is being created as part of a migration process, determine the position of the binary log from the external replicator service used when the backup was taken. For example:
mysql> SHOW MASTER STATUS\G
*************************** 1. row ***************************
File: mysql-bin.000026
Position: 1311
Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
Use dsctl set to update the replicator position to point to the Primary log position:
shell> /opt/replicator/tungsten/tungsten-replicator/bin/dsctl -service beta set \
-reset -seqno 0 -epoch 0 \
-source-id host3 -event-id mysql-bin.000026:1311
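The event ID passed to dsctl is the binary log file name and position joined by a colon. As a minimal sketch only (the helper names `parse_master_status` and `dsctl_set_args` are hypothetical and not part of Tungsten), the following Python parses the vertical SHOW MASTER STATUS output shown above and assembles the same arguments used in the dsctl command:

```python
import re

SAMPLE = """\
*************************** 1. row ***************************
File: mysql-bin.000026
Position: 1311
Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
"""

def parse_master_status(text):
    """Extract the File and Position fields from vertical SHOW MASTER STATUS output."""
    pairs = re.findall(r"^[ \t]*(\w+):[ \t]*(.*)$", text, flags=re.MULTILINE)
    fields = {name: value.strip() for name, value in pairs}
    return fields["File"], int(fields["Position"])

def dsctl_set_args(service, source_id, binlog_file, position):
    """Build the argument list for the dsctl call shown in the text (hypothetical helper)."""
    return [
        "dsctl", "-service", service, "set",
        "-reset", "-seqno", "0", "-epoch", "0",
        "-source-id", source_id,
        "-event-id", f"{binlog_file}:{position}",
    ]

binlog_file, position = parse_master_status(SAMPLE)
args = dsctl_set_args("beta", "host3", binlog_file, position)
print(" ".join(args))
```

The sketch only builds the argument list; running the command, and choosing the correct position for your backup, remain manual steps.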
Now start the replicator:
shell> /opt/replicator/tungsten/tungsten-replicator/bin/replicator start
Replication status should be checked by explicitly using the servicename and/or RMI port:
shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -service beta status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000026:0000000000001311;1252
appliedLastSeqno : 5
appliedLatency : 0.748
channels : 1
clusterName : beta
currentEventId : mysql-bin.000026:0000000000001311
currentTimeMillis : 1390410611881
dataServerHost : host1
extensions :
host : host3
latestEpochNumber : 1
masterConnectUri : thl://host3:2112/
masterListenUri : thl://host1:2113/
maximumStoredSeqNo : 5
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : jdbc:mysql:thin://host3:13306/
relativeLatency : 8408.881
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : beta
serviceType : local
simpleServiceName : beta
siteName : default
sourceId : host3
state : ONLINE
timeInStateSeconds : 8408.21
transitioningTo :
uptimeSeconds : 8409.88
useSSLConnection : false
version : Tungsten Replicator 7.0.3 build 141
Finished status command...
Parallel apply is an important technique for achieving high speed replication and curing Replica lag. It works by spreading updates to Replicas over multiple threads that split transactions on each schema into separate processing streams. This in turn spreads I/O activity across many threads, which results in faster overall updates on the Replica. In ideal cases throughput on Replicas may improve by up to 5 times over single-threaded MySQL native replication.
It is worth noting that the only thing Tungsten parallelizes is applying transactions to Replicas. All other operations in each replication service are single-threaded.
Parallel replication works best on workloads that meet the following criteria:
ROW-based binary logging must be enabled in the MySQL database.
Data are stored in independent schemas. If you have 100 customers per server with a separate schema for each customer, your application is a good candidate.
Transactions do not span schemas. Tungsten serializes such transactions, which is to say it stops parallel apply and runs them by themselves. If more than 2-3% of transactions are serialized in this way, most of the benefits of parallelization are lost.
Workload is well-balanced across schemas.
The Replica host(s) are capable and have free memory in the OS page cache.
The host on which the Replica runs has a sufficient number of cores to operate a large number of Java threads.
Not all workloads meet these requirements. If your transactions are within a single schema only, you may need to consider different approaches, such as Replica prefetch. Contact Continuent for other suggestions.
Parallel replication does not work well on underpowered hosts, such as Amazon m1.small instances. In fact, any host that is already I/O bound under single-threaded replication will typically not show much improvement with parallel apply.
Parallel apply is enabled using the svc-parallelization-type and channels options of tpm. The parallelization type defaults to none, which is to say that parallel apply is disabled. You should set it to disk. The channels option sets the number of channels (i.e., threads) you propose to use for applying data. Here is an example of a MySQL Applier installation with parallel apply enabled. The Replica will apply transactions using 10 channels.
shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --mysql-allow-intensive-checks=true \
    --profile-script=~/.bash_profile \
    --start-and-report=true
shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost,sourcehost \
    --datasource-type=mysql \
    --replication-user=tungsten \
    --replication-password=secret \
    --svc-parallelization-type=disk \
    --channels=10
shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
mysql-allow-intensive-checks=true
profile-script=~/.bash_profile
start-and-report=true
[alpha]
master=sourcehost
members=localhost,sourcehost
datasource-type=mysql
replication-user=tungsten
replication-password=secret
svc-parallelization-type=disk
channels=10
Configuration group defaults
The description of each of the options is shown below:
--reset
For staging configurations, deletes all pre-existing configuration information before updating with the new configuration values.
--install-directory=/opt/continuent
install-directory=/opt/continuent
Path to the directory where the active deployment will be installed. The configured directory will contain the software, THL and relay log information unless configured otherwise.
--user=tungsten
user=tungsten
System User
--mysql-allow-intensive-checks=true
mysql-allow-intensive-checks=true
For MySQL installation, enables detailed checks on the supported data types within the MySQL database to confirm compatibility. This includes checking each table definition individually for any unsupported data types.
--profile-script=~/.bash_profile
profile-script=~/.bash_profile
Append commands to include env.sh in this profile script
--start-and-report=true
start-and-report=true
Start the services and report out the status after configuration
Configuration group alpha
The description of each of the options is shown below:
--master=sourcehost
master=sourcehost
The hostname of the primary (extractor) within the current service.
--members=localhost,sourcehost
members=localhost,sourcehost
Hostnames for the dataservice members
--datasource-type=mysql
datasource-type=mysql
Database type
--replication-user=tungsten
replication-user=tungsten
For databases that require authentication, the username to use when connecting to the database using the corresponding connection method (native, JDBC, etc.).
--replication-password=secret
replication-password=secret
The password to be used when connecting to the database using the corresponding --replication-user.
--svc-parallelization-type=disk
svc-parallelization-type=disk
Method for implementing parallel apply
--channels=10
channels=10
Number of replication channels to use for parallel apply.
If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
There are several additional options that default to reasonable values. You may wish to change them in special cases.
buffer-size: Sets the replicator block commit size, which is the number of transactions to commit at once on Replicas. Values up to 100 are normally fine.
native-slave-takeover: Used to allow Tungsten to take over from native MySQL replication and parallelize it. See here for more.
You can check the number of active channels on a Replica by looking at the "channels" property once the replicator restarts.
Replica shell> trepctl -service alpha status| grep channels
channels : 10
The channel count for a Primary will ALWAYS be 1 because extraction is single-threaded:
Primary shell> trepctl -service alpha status| grep channels
channels : 1
Enabling parallel apply will dramatically increase the number of connections to the database server.
Typically the calculation on a Replica would be: Connections = Channel_Count x Service_Count x 2, so for a 4-way Composite Active/Active topology with 30 channels there would be 30 x 4 x 2 = 240 connections required for the replicator alone, not counting application traffic.
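The rule of thumb above is easy to evaluate for other topologies. The following is a minimal sketch; the function name `replicator_connections` is illustrative only and not part of any Tungsten tool:

```python
def replicator_connections(channels, services):
    """Connections = Channel_Count x Service_Count x 2 (rule of thumb from the text)."""
    return channels * services * 2

# The 4-way Composite Active/Active example with 30 channels:
required = replicator_connections(channels=30, services=4)
print(required)  # 240

# The same topology with the 10 channels used in the installation example:
print(replicator_connections(channels=10, services=4))  # 80
```

Compare the result with the max_connections setting of the MySQL server, remembering that application traffic needs connections on top of this figure.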
You may display the peak number of simultaneous connections used since the MySQL server started:
mysql> SHOW STATUS LIKE 'max_used_connections';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 190 |
+----------------------+-------+
1 row in set (0.00 sec)
Below are suggestions for how to change the maximum connections setting in MySQL both for the running instance as well as at startup:
mysql> SET GLOBAL max_connections = 512;
mysql> SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 512 |
+-----------------+-------+
1 row in set (0.00 sec)
shell> vi /etc/my.cnf
#max_connections = 151
max_connections = 512
Channels and Parallel Apply
Parallel apply works by using multiple threads for the final stage of the replication pipeline. These threads are known as channels. Restart points for each channel are stored as individual rows in table trep_commit_seqno if you are applying to a relational DBMS server, including MySQL, Oracle, and data warehouse products like Vertica.
When you set the channels argument, the tpm program configures the replication service to enable the requested number of channels. A value of 1 results in single-threaded operation.
Do not change the number of channels without setting the replicator offline cleanly. See the procedure later in this page for more information.
How Many Channels Are Enough?
Pick the smallest number of channels that loads the Replica fully. For evenly distributed workloads this means that you should increase channels so that more threads are simultaneously applying updates and soaking up I/O capacity. As long as each shard receives roughly the same number of updates, this is a good approach.
For unevenly distributed workloads, you may want to decrease channels to spread the workload more evenly across them. This ensures that each channel has productive work and minimizes the overhead of updating the channel position in the DBMS.
Once you have maximized I/O on the DBMS server leave the number of channels alone. Note that adding more channels than you have shards does not help performance as it will lead to idle channels that must update their positions in the DBMS even though they are not doing useful work. This actually slows down performance a little bit.
Effect of Channels on Backups
If you back up a Replica that operates with more than one channel, say 30, you can only restore that backup on another Replica that operates with the same number of channels. Otherwise, reloading the backup is the same as changing the number of channels without a clean offline.
When operating Tungsten Replicator in a Tungsten cluster, you should always set the number of channels to be the same for all replicators. Otherwise you may run into problems if you try to restore backups across MySQL instances that load with different locations.
If the replicator has only a single channel enabled, you can restore the backup anywhere. The same applies if you run the backup after the replicator has been taken offline cleanly.
When you issue a trepctl offline command, Tungsten Replicator will bring all channels to the same point in the log and then go offline. This is known as going offline cleanly. When a Replica has been taken offline cleanly the following are true:
The trep_commit_seqno table contains a single row
The trep_shard_channel table is empty
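The two conditions above can be checked programmatically. The sketch below is an illustration under stated assumptions: `is_offline_clean` is a hypothetical helper that accepts any Python DB-API cursor already positioned on the schema holding the replicator's catalog tables, and the in-memory SQLite tables (including their column names) are stand-ins invented for the demonstration, not the real Tungsten schema.

```python
import sqlite3

def is_offline_clean(cursor):
    """True when trep_commit_seqno has one row and trep_shard_channel has none."""
    cursor.execute("SELECT COUNT(*) FROM trep_commit_seqno")
    commit_rows = cursor.fetchone()[0]
    cursor.execute("SELECT COUNT(*) FROM trep_shard_channel")
    shard_rows = cursor.fetchone()[0]
    return commit_rows == 1 and shard_rows == 0

def demo_catalog(commit_rows, shard_rows):
    """Build an in-memory SQLite stand-in for the two tables (demo only)."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE trep_commit_seqno (task_id INTEGER, seqno INTEGER)")
    cur.execute("CREATE TABLE trep_shard_channel (shard_id TEXT, channel INTEGER)")
    cur.executemany("INSERT INTO trep_commit_seqno VALUES (?, ?)", commit_rows)
    cur.executemany("INSERT INTO trep_shard_channel VALUES (?, ?)", shard_rows)
    return cur

# Two channels with separate restart points: not a clean offline state.
running = demo_catalog([(0, 100), (1, 98)], [("db1", 0), ("db2", 1)])
# A single restart point and no shard assignments: clean offline state.
offline = demo_catalog([(0, 100)], [])

print(is_offline_clean(running))  # False
print(is_offline_clean(offline))  # True
```

The check is read-only; it never modifies the tables.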
When parallel replication is not enabled, you can take the replicator offline by stopping the replicator process. There is no need to issue a trepctl offline command first.
Putting a replicator offline may take a while if the slowest and fastest channels are far apart, i.e., if one channel gets far ahead of another. The separation between channels is controlled by the maxOfflineInterval parameter, which defaults to 5 seconds. This sets the allowable distance between commit timestamps processed on different channels. You can adjust this value at installation or later. The following example shows how to change it after installation. This can be done at any time and does not require the replicator to go offline cleanly.
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--property=replicator.store.parallel-queue.maxOfflineInterval=30
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
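The cut pipelines in the staging example split the `user@host:/directory` string reported by tpm query staging into its three parts. For scripted use, the same split can be sketched in Python; `parse_staging` is a hypothetical helper name, and the input is the sample output shown above:

```python
def parse_staging(location):
    """Split 'user@host:/path' into (user, host, directory)."""
    user_host, directory = location.split(":", 1)
    user, host = user_host.split("@", 1)
    return user, host, directory

user, host, directory = parse_staging(
    "tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141"
)
print(f"The staging USER is {user}")
print(f"The staging HOST is {host}")
print(f"The staging DIRECTORY is {directory}")
```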
shell> vi /etc/tungsten/tungsten.ini
[alpha]
...
property=replicator.store.parallel-queue.maxOfflineInterval=30
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
The offline interval is only the approximate time that Tungsten Replicator will take to go offline. Up to a point, larger values (say 60 or 120 seconds) allow the replicator to parallelize in spite of a few operations that are relatively slow. However, the downside is that going offline cleanly can become quite slow.
If you need to take a replicator offline quickly, you can either stop the replicator process or issue the following command:
shell> trepctl offline -immediate
Both of these result in an unclean shutdown. However, parallel replication is completely crash-safe provided you use transactional table types like InnoDB, so you will be able to restart without causing Replica consistency problems.
You must take the replicator offline cleanly to change the number of channels or when reverting to MySQL native replication. Failing to do so can result in errors when you restart replication.
To enable parallel replication after installation, take the replicator offline cleanly using the following command:
shell> trepctl offline
Modify the configuration to add two parameters:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure defaults \
--svc-parallelization-type=disk \
--channels=10
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
[defaults]
...
svc-parallelization-type=disk
channels=10
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
You may use an actual data service name in place of the keyword defaults.
Signal the changes by a complete restart of the Replicator process:
shell> replicator restart
You can check the number of active channels on a Replica by looking at the "channels" property once the replicator restarts.
Replica shell> trepctl -service alpha status| grep channels
channels : 10
The channel count for a Primary will ALWAYS be 1 because extraction is single-threaded:
Primary shell> trepctl -service alpha status| grep channels
channels : 1
Enabling parallel apply will dramatically increase the number of connections to the database server.
Typically the calculation on a Replica would be: Connections = Channel_Count x Service_Count x 2, so for a 4-way Composite Active/Active topology with 30 channels there would be 30 x 4 x 2 = 240 connections required for the replicator alone, not counting application traffic.
You may display the peak number of simultaneous connections used since the MySQL server started:
mysql> SHOW STATUS LIKE 'max_used_connections';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 190 |
+----------------------+-------+
1 row in set (0.00 sec)
Below are suggestions for how to change the maximum connections setting in MySQL both for the running instance as well as at startup:
mysql> SET GLOBAL max_connections = 512;
mysql> SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 512 |
+-----------------+-------+
1 row in set (0.00 sec)
shell> vi /etc/my.cnf
#max_connections = 151
max_connections = 512
To change the number of channels you must take the replicator offline cleanly using the following command:
shell> trepctl offline
This command brings all channels up to the same transaction in the log, then goes offline. If you look in the trep_commit_seqno table, you will notice only a single row, which shows that updates to the Replica have been completely serialized to a single point. At this point you may safely reconfigure the number of channels on the replicator, for example using the following command:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--channels=5
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
[alpha]
...
channels=5
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
You can check the number of active channels on a Replica by looking at the "channels" property once the replicator restarts.
If you attempt to reconfigure channels without going offline cleanly, Tungsten Replicator will signal an error when you attempt to go online with the new channel configuration. The cure is to revert to the previous number of channels, go online, and then go offline cleanly. Note that attempting to clean up the trep_commit_seqno and trep_shard_channel tables manually can result in your Replicas becoming inconsistent and requiring full resynchronization. You should only do such cleanup under direction from Continuent support.
Failing to follow the channel reconfiguration procedure carefully may result in your Replicas becoming inconsistent or failing. The cure is usually full resynchronization, so it is best to avoid this if possible.
The following steps describe how to gracefully disable parallel apply replication.
To disable parallel apply, you must first take the replicator offline cleanly using the following command:
shell> trepctl offline
This command brings all channels up to the same transaction in the log, then goes offline. If you look in the trep_commit_seqno table, you will notice only a single row, which shows that updates to the Replica have been completely serialized to a single point. At this point you may safely disable parallel apply on the replicator, for example using the following command:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--svc-parallelization-type=none \
--channels=1
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
[alpha]
...
svc-parallelization-type=none
channels=1
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
You can check the number of active channels on a Replica by looking at the "channels" property once the replicator restarts.
shell> trepctl -service alpha status| grep channels
channels : 1
If you attempt to reconfigure channels without going offline cleanly, Tungsten Replicator will signal an error when you attempt to go online with the new channel configuration. The cure is to revert to the previous number of channels, go online, and then go offline cleanly. Note that attempting to clean up the trep_commit_seqno and trep_shard_channel tables manually can result in your Replicas becoming inconsistent and requiring full resynchronization. You should only do such cleanup under direction from Continuent support.
Failing to follow the channel reconfiguration procedure carefully may result in your Replicas becoming inconsistent or failing. The cure is usually full resynchronization, so it is best to avoid this if possible.
As with channels you should only change the parallel queue type after the replicator has gone offline cleanly. The following example shows how to update the parallel queue type after installation:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--svc-parallelization-type=disk \
--channels=5
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
[alpha]
...
svc-parallelization-type=disk
channels=5
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
Basic monitoring of a parallel deployment can be performed using the techniques in Chapter 7, Operations Guide. Specific operations for parallel replication are provided in the following sections.
The replicator has several helpful commands for tracking replication performance:
Command | Description |
---|---|
trepctl status | Shows basic variables including overall latency of Replica and number of apply channels |
trepctl status -name shards | Shows the number of transactions for each shard |
trepctl status -name stores | Shows the configuration and internal counters for stores between tasks |
trepctl status -name tasks | Shows the number of transactions (events) and latency for each independent task in the replicator pipeline |
The trepctl status appliedLastSeqno parameter shows the sequence number of the last transaction committed. Here is an example from a Replica with 5 channels enabled.
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000211:0000000020094456;0
appliedLastSeqno : 78021
appliedLatency : 0.216
channels : 5
...
Finished status command...
When parallel apply is enabled, the meaning of appliedLastSeqno changes. It is the minimum recovery position across apply channels, which means it is the position where channels restart in the event of a failure. This number is quite conservative and may make replication appear to be further behind than it actually is.
Busy channels mark their position in table trep_commit_seqno as they commit. These positions are up-to-date with the traffic on that channel, but latency can differ between channels that carry a lot of big transactions and those that are more lightly loaded.
Inactive channels do not get any transactions, hence do not mark their position. Tungsten sends a control event across all channels so that they mark their commit position in trep_commit_seqno. It is possible to see a delay of many seconds or even minutes in unloaded systems from the true state of the Replica because of idle channels not marking their position yet.
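To illustrate why appliedLastSeqno is conservative, the sketch below takes a set of per-channel commit positions and reports the minimum, which is the position the replicator would restart from. The channel numbers and sequence numbers are invented for the example, and `restart_position` is a hypothetical helper, not a Tungsten function.

```python
def restart_position(channel_seqnos):
    """Minimum committed seqno across channels, i.e. the safe restart point."""
    return min(channel_seqnos.values())

# Hypothetical commit positions for a Replica running 5 channels.
committed = {0: 78025, 1: 78021, 2: 78030, 3: 78024, 4: 78027}

applied_last_seqno = restart_position(committed)
furthest_ahead = max(committed.values()) - applied_last_seqno

print(applied_last_seqno)  # 78021
print(furthest_ahead)      # 9
```

In this invented case the reported position is 78021 even though one channel has already committed nine transactions beyond it.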
For systems with few transactions it is useful to lower the synchronization interval to a smaller number of transactions, for example 500. The following command shows how to adjust the synchronization interval after installation:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--property=replicator.store.parallel-queue.syncInterval=500
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
[alpha]
...
property=replicator.store.parallel-queue.syncInterval=500
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
Note that there is a trade-off between the synchronization interval value and writes on the DBMS server. With the foregoing setting, all channels will write to the trep_commit_seqno table every 500 transactions. If there were 50 channels configured, this could lead to an increase in writes of up to 10%: together the 50 channels add 50 position-marking writes per 500 transactions, which is one extra write for every 10 transactions. In busy systems it is therefore better to use a higher synchronization interval.
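The arithmetic behind this trade-off can be written out as a short calculation. The following is a minimal sketch, not part of Tungsten Replicator; the function name extra_write_fraction is hypothetical, and it assumes the worst case described above, in which every channel writes its position once per synchronization interval.

```python
def extra_write_fraction(channels: int, sync_interval: int) -> float:
    """Upper bound on extra trep_commit_seqno writes per replicated transaction.

    Assumes (worst case) that each channel marks its position once every
    sync_interval transactions.
    """
    if channels < 1 or sync_interval < 1:
        raise ValueError("channels and sync_interval must be positive")
    return channels / sync_interval


if __name__ == "__main__":
    # Compare the lowered interval with the 10000 value shown in the status output.
    for interval in (500, 10000):
        fraction = extra_write_fraction(50, interval)
        print(f"syncInterval={interval}: up to {fraction:.1%} extra writes")
```

With 50 channels, an interval of 500 gives the 10% figure quoted above, while an interval of 10000 reduces the bound to 0.5%.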
You can check the current synchronization interval by running the trepctl status -name stores command, as shown in the following example:
shell> trepctl status -name stores
Processing status command (stores)...
...
NAME VALUE
---- -----
...
name : parallel-queue
...
storeClass : com.continuent.tungsten.replicator.thl.THLParallelQueue
syncInterval : 10000
Finished status command (stores)...
You can also force all channels to mark their current position by sending a heartbeat with the trepctl heartbeat command.
Relative latency is a trepctl status parameter. It indicates the time elapsed since appliedLastSeqno last advanced; for example:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000211:0000000020094766;0
appliedLastSeqno : 78022
appliedLatency : 0.571
...
relativeLatency : 8.944
Finished status command...
In this example the last transaction had a latency of 0.571 seconds from the time it committed on the Primary, and it committed 8.944 seconds ago. If relative latency increases significantly in a busy system, it may be a sign that replication is stalled. This is a good parameter to check in monitoring scripts.
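A monitoring script could apply such a check along the following lines. This is a minimal sketch that assumes the status output has the "name : value" layout shown above; the helper names parse_status and replication_stalled, and the threshold values, are illustrative and not part of Tungsten Replicator. In practice the text would be captured from the trepctl status command rather than embedded.

```python
SAMPLE = """\
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000211:0000000020094766;0
appliedLastSeqno : 78022
appliedLatency : 0.571
relativeLatency : 8.944
Finished status command...
"""


def parse_status(text: str) -> dict:
    """Collect 'name : value' lines into a dict; lines without a colon are skipped."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")  # split on the first colon only
        if sep and name.strip():
            values[name.strip()] = value.strip()
    return values


def replication_stalled(status: dict, max_relative_latency: float) -> bool:
    """True when relativeLatency exceeds the chosen threshold (in seconds)."""
    return float(status["relativeLatency"]) > max_relative_latency


if __name__ == "__main__":
    status = parse_status(SAMPLE)
    print("stalled" if replication_stalled(status, 60.0) else "ok")
```

The threshold is workload-dependent: on a lightly loaded system relative latency grows naturally between transactions, so the check is most meaningful on busy systems.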
Serialization count refers to the number of transactions that the replicator has handled that cannot be applied in parallel because they involve dependencies across shards. For example, a transaction that spans multiple shards must serialize because it might cause an out-of-order update with respect to transactions that update a single shard only.
You can detect the number of transactions that have been serialized by looking at the serializationCount parameter using the trepctl status -name stores command. The following example shows a replicator that has processed 1512 transactions with 26 serialized.
shell> trepctl status -name stores
Processing status command (stores)...
...
NAME VALUE
---- -----
criticalPartition : -1
discardCount : 0
estimatedOfflineInterval: 0.0
eventCount : 1512
headSeqno : 78022
maxOfflineInterval : 5
maxSize : 10
name : parallel-queue
queues : 5
serializationCount : 26
serialized : false
...
Finished status command (stores)...
In this case 1.7% of transactions are serialized. Generally speaking you will lose the benefits of parallel apply if more than 1-2% of transactions are serialized.
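The percentage is simply serializationCount divided by eventCount. The following minimal sketch reproduces the calculation for the values shown above; the function name and the 2% warning threshold are illustrative choices based on the 1-2% guideline, not settings provided by the replicator.

```python
def serialization_percent(serialization_count: int, event_count: int) -> float:
    """Share of processed transactions that had to be serialized, in percent."""
    if event_count <= 0:
        raise ValueError("event_count must be positive")
    return 100.0 * serialization_count / event_count


if __name__ == "__main__":
    pct = serialization_percent(26, 1512)
    note = "parallel apply benefits at risk" if pct > 2.0 else "acceptable"
    print(f"{pct:.1f}% serialized: {note}")
```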
The maximum offline interval (maxOfflineInterval) parameter controls the "distance" between the fastest and slowest channels when parallel apply is enabled. The replicator measures distance using the seconds between commit times of the last transaction processed on each channel. This time is roughly equivalent to the amount of time a replicator will require to go offline cleanly.
You can change the maxOfflineInterval as shown in the following example. The value is defined in seconds.
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--property=replicator.store.parallel-queue.maxOfflineInterval=30
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
[alpha]
...
property=replicator.store.parallel-queue.maxOfflineInterval=30
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
You can view the configured value as well as the estimated current value using the trepctl status -name stores command, as shown in the following example:
shell> trepctl status -name stores
Processing status command (stores)...
NAME VALUE
---- -----
...
estimatedOfflineInterval: 1.3
...
maxOfflineInterval : 30
...
Finished status command (stores)...
Parallel apply works best when transactions are distributed evenly across shards and those shards are distributed evenly across available channels. You can monitor the distribution of transactions over shards using the trepctl status -name shards command. This command lists transaction counts for all shards, as shown in the following example.
shell> trepctl status -name shards
Processing status command (shards)...
...
NAME VALUE
---- -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno : 78023
appliedLatency : 0.255
eventCount : 3523
shardId : cust1
stage : q-to-dbms
...
Finished status command (shards)...
If one or more shards have a very large eventCount value compared to the others, this is a sign that your transaction workload is poorly distributed across shards.
The listing of shards also offers a useful trick for finding serialized transactions. Transactions that Tungsten Replicator cannot safely parallelize are assigned the dummy shard ID #UNKNOWN. Look for this shard to find the count of serialized transactions. The appliedLastSeqno for this shard gives the sequence number of the most recent serialized transaction. As the following example shows, you can then list the contents of the transaction to see why it serialized. In this case, the transaction affected tables in different schemas.
shell> trepctl status -name shards
Processing status command (shards)...
NAME VALUE
---- -----
appliedLastEventId: mysql-bin.000211:0000000020095529;0
appliedLastSeqno : 78026
appliedLatency : 0.558
eventCount : 26
shardId : #UNKNOWN
stage : q-to-dbms
...
Finished status command (shards)...
shell> thl list -seqno 78026
SEQ# = 78026 / FRAG# = 0 (last frag)
- TIME = 2013-01-17 22:29:42.0
- EPOCH# = 1
- EVENTID = mysql-bin.000211:0000000020095529;0
- SOURCEID = logos1
- METADATA = [mysql_server_id=1;service=percona;shard=#UNKNOWN]
- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
  foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
  collation_connection = 8, collation_server = 33]
- SCHEMA =
- SQL(0) = insert into mats_0.foo values(1) /* ___SERVICE___ = [percona] */
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
  foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
  collation_connection = 8, collation_server = 33]
- SQL(1) = insert into mats_1.foo values(1)
The replicator normally distributes shards evenly across channels. As each new shard appears, it is assigned to the next channel number, which then rotates back to 0 once the maximum number has been assigned. If the shards have uneven transaction distributions, this may lead to an uneven number of transactions on the channels. To check, use the trepctl status -name tasks command and look for tasks belonging to the q-to-dbms stage.
shell> trepctl status -name tasks
Processing status command (tasks)...
...
NAME VALUE
---- -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno : 78023
appliedLatency : 0.248
applyTime : 0.003
averageBlockSize : 2.520
cancelled : false
currentLastEventId: mysql-bin.000211:0000000020095076;0
currentLastFragno : 0
currentLastSeqno : 78023
eventCount : 5302
extractTime : 274.907
filterTime : 0.0
otherTime : 0.0
stage : q-to-dbms
state : extract
taskId : 0
...
Finished status command (tasks)...
If you see one or more channels that have a very high eventCount, consider either assigning shards explicitly to channels or redistributing the workload in your application to get better performance.
Tungsten Replicator by default assigns channels using a round robin algorithm that assigns each new shard to the next available channel. The current shard assignments are tracked in table trep_shard_channel in the Tungsten catalog schema for the replication service.
For example, if you have 2 channels enabled and Tungsten processes three different shards, you might end up with a shard assignment like the following:
foo => channel 0
bar => channel 1
foobar => channel 0
This algorithm generally gives the best results for most installations and is crash-safe, since the contents of the trep_shard_channel table persist if either the DBMS or the replicator fails.
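The round-robin behavior described above can be illustrated with a small model. This is a minimal sketch, not the replicator's implementation: the class RoundRobinAssigner is hypothetical, and its in-memory dictionary merely stands in for the persistent trep_shard_channel table.

```python
class RoundRobinAssigner:
    """Toy model of round-robin shard-to-channel assignment."""

    def __init__(self, channels: int):
        if channels < 1:
            raise ValueError("at least one channel is required")
        self.channels = channels
        self.assignments = {}   # stand-in for the trep_shard_channel table
        self.next_channel = 0

    def channel_for(self, shard_id: str) -> int:
        # A shard keeps its channel once assigned; only new shards advance the counter.
        if shard_id not in self.assignments:
            self.assignments[shard_id] = self.next_channel
            self.next_channel = (self.next_channel + 1) % self.channels
        return self.assignments[shard_id]


if __name__ == "__main__":
    assigner = RoundRobinAssigner(channels=2)
    for shard in ("foo", "bar", "foobar", "foo"):
        print(f"{shard} => channel {assigner.channel_for(shard)}")
```

With two channels the sketch reproduces the assignment listed above, and a shard that reappears keeps the channel it was first given.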
It is possible to override the default assignment by updating the shard.list file found in the tungsten-replicator/conf directory. This file normally looks like the following:
# SHARD MAP FILE.
# This file contains shard handling rules used in the ShardListPartitioner
# class for parallel replication. If unchanged shards will be hashed across
# available partitions.
# You can assign shards explicitly using a shard name match, where the form
# is <db>=<partition>.
#common1=0
#common2=0
#db1=1
#db2=2
#db3=3
# Default partition for shards that do not match explicit name.
# Permissible values are either a partition number or -1, in which
# case values are hashed across available partitions. (-1 is the
# default.
#(*)=-1
# Comma-separated list of shards that require critical section to run.
# A "critical section" means that these events are single-threaded to
# ensure that all dependencies are met.
#(critical)=common1,common2
# Method for channel hash assignments. Allowed values are round-robin and
# string-hash.
(hash-method)=round-robin
You can update the shard.list file to do three types of custom overrides.
Change the hashing method for channel assignments. Round-robin uses the trep_shard_channel table. The string-hash method just hashes the shard name.
Assign shards to explicit channels. Add lines of the form shard=channel to the file as shown by the commented-out entries.
Define critical shards. These are shards that must be processed in serial fashion. For example if you have a sharded application that has a single global shard with reference information, you can declare the global shard to be critical. This helps avoid applications seeing out of order information.
Changes to shard.list must be made with care. The same cautions apply here as for changing the number of channels or the parallelization type. For subscription customers we strongly recommend conferring with Continuent Support before making changes.
Channels receive transactions through a special type of queue, known as a parallel queue. Tungsten offers two implementations of parallel queues, which vary in their performance as well as the requirements they may place on hosts that operate parallel apply. You choose the type of queue to enable using the --svc-parallelization-type option.
Do not change the parallel queue type without setting the replicator offline cleanly. See the procedure later in this page for more information.
Disk Parallel Queue (disk option)
A disk parallel queue uses a set of independent threads to read from the Transaction History Log and feed short in-memory queues used by channels. Disk queues have the advantage that they minimize memory required by Java. They also allow channels to operate some distance apart, which improves throughput. For instance, one channel may apply a transaction that committed 2 minutes before the transaction another channel is applying. This separation keeps a single slow transaction from blocking all channels.
Disk queues minimize memory consumption of the Java VM but to function efficiently they do require pages from the Operating System page cache. This is because the channels each independently read from the Transaction History Log. As long as the channels are close together the storage pages tend to be present in the Operating System page cache for all threads but the first, resulting in very fast reads. If channels become widely separated, for example due to a high maxOfflineInterval value, or the host has insufficient free memory, disk queues may operate slowly or impact other processes that require memory.
Memory Parallel Queue (memory option)
A memory parallel queue uses a set of in-memory queues to hold transactions. One stage reads from the Transaction History Log and distributes transactions across the queues. The channels each read from one of the queues. In-memory queues have the advantage that they do not need extra threads to operate, hence reduce the amount of CPU processing required by the replicator.
When you use in-memory queues you must set the maxSize property on the queue to a relatively large value. This value sets the total number of transaction fragments that may be in the parallel queue at any given time. If the queue hits this value, it does not accept further transaction fragments until existing fragments are processed. For best performance it is often necessary to use a relatively large number, for example 10,000 or greater.
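The effect of maxSize can be pictured with an ordinary bounded queue. The following minimal sketch uses Python's standard queue.Queue purely as a stand-in; it is not the replicator's Java implementation. The helper offer is hypothetical and rejects a fragment when the queue is full, whereas the replicator simply waits until space is available.

```python
import queue


def offer(q: queue.Queue, fragment: str) -> bool:
    """Try to enqueue a transaction fragment; report False when the queue is full."""
    try:
        q.put_nowait(fragment)
        return True
    except queue.Full:
        return False


if __name__ == "__main__":
    parallel_queue = queue.Queue(maxsize=2)      # plays the role of maxSize
    print(offer(parallel_queue, "frag-1"))       # accepted
    print(offer(parallel_queue, "frag-2"))       # accepted, queue now full
    print(offer(parallel_queue, "frag-3"))       # not accepted until a slot frees up
    parallel_queue.get_nowait()                  # a channel processes one fragment
    print(offer(parallel_queue, "frag-3"))       # accepted again
```

A small limit stalls the stage that feeds the queue whenever one channel falls behind, which is why a large value is usually needed for good throughput.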
The following example shows how to set the maxSize property after installation. This value can be changed at any time and does not require the replicator to go offline cleanly:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--property=replicator.store.parallel-queue.maxSize=10000
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.
[alpha]
...
property=replicator.store.parallel-queue.maxSize=10000
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.0.3-141
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.0.3-141
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.
You may need to increase the Java VM heap size when you increase the parallel queue maximum size. Use the --java-mem-size option on the tpm command for this purpose or edit the Replicator wrapper.conf file directly.
Memory queues are not recommended for production use at this time. Use disk queues.
Tungsten Replicator normally applies SQL changes to Targets by constructing SQL statements and executing them in the exact order that transactions appear in the Transaction History Log (THL). This works well for OLTP databases like MySQL, Oracle, and MongoDB. However, it is a poor approach for data warehouses.
Data warehouse products like Vertica or Redshift load very slowly through JDBC interfaces (50 times slower or even more compared to MySQL). Instead, such databases supply batch loading commands that upload data in parallel. For instance Vertica uses the COPY command.
Tungsten Replicator has a batch applier named SimpleBatchApplier that groups transactions and then loads data. This is known as "batch apply." You can configure Tungsten to load tens of thousands of transactions at once using templates that apply the correct commands for your chosen data warehouse.
While we use the term batch apply, Tungsten is not batch-oriented in the sense of traditional Extract/Transform/Load tools, which may run only a small number of batches a day. Tungsten builds batches automatically as transactions arrive in the log. The mechanism is designed to be self-adjusting. If small transaction batches cause loading to be slower, Tungsten will automatically tend to adjust the batch size upwards until it no longer lags during loading.
The batch applier loads data into the Target DBMS using CSV files and appropriate load commands like LOAD DATA INFILE or COPY. Here is the basic algorithm.
While executing within a commit block, we write incoming transactions into open CSV files written by the class CsvWriter. There is one CSV file per database table. The following sample shows typical contents.
"I","84900","1","2016-03-11 20:51:10.000","986","http://www.continent.com/software"
"D","84901","2","2016-03-11 20:51:10.000","143",null
"I","84901","3","2016-03-11 20:51:10.000","143","http://www.microsoft.com"
Tungsten adds four extra column values to each line of CSV output.
Column | Description |
---|---|
opcode | A transaction code that has the value "I" for insert and "D" for delete. Other types are available. |
seqno | The Tungsten transaction sequence number |
row_id | A line number that starts with 1 and increments by 1 for each new row |
timestamp | The commit timestamp, i.e. the origin timestamp of the committed statement that generated the row information. |
Different update types are handled as follows:
Each insert generates a single row containing all values in the row with an "I" opcode.
Each delete generates a single row with the key and a "D" opcode. Non-key fields are null.
Each update results in a delete with the row key followed by an insert.
Statements are ignored. If you need DDL changes on the target, you must apply them yourself.
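The handling rules above can be sketched as a small function that turns row changes into staged rows. This is a simplified illustration rather than the CsvWriter implementation: the function stage_rows and its input layout (a list of (seqno, kind, values) tuples with the key in the first column) are assumptions made for the example.

```python
def stage_rows(changes, commit_time):
    """Turn row changes into staged rows: [opcode, seqno, row_id, timestamp, *columns].

    changes is a list of (seqno, kind, values) tuples, where kind is "insert",
    "delete" or "update" and values[0] is assumed to be the primary key.
    """
    rows = []

    def emit(opcode, seqno, values):
        # row_id starts at 1 and increments for every row written to the file
        rows.append([opcode, seqno, len(rows) + 1, commit_time, *values])

    for seqno, kind, values in changes:
        key_only = [values[0]] + [None] * (len(values) - 1)  # non-key fields are null
        if kind == "insert":
            emit("I", seqno, values)
        elif kind == "delete":
            emit("D", seqno, key_only)
        elif kind == "update":
            emit("D", seqno, key_only)   # an update is a delete by key ...
            emit("I", seqno, values)     # ... followed by an insert of the new row
        else:
            raise ValueError(f"unsupported change type: {kind}")
    return rows


if __name__ == "__main__":
    changes = [
        (84900, "insert", [986, "http://www.continent.com/software"]),
        (84901, "update", [143, "http://www.microsoft.com"]),
    ]
    for row in stage_rows(changes, "2016-03-11 20:51:10.000"):
        print(row)
```

The two changes in the usage example yield three staged rows with the same opcodes, sequence numbers and row IDs as the sample CSV contents shown earlier.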
Tungsten writes each row update into the CSV file for the corresponding table. At commit time the following steps occur:
Flush and close each CSV file. This ensures that if there is a failure the files are fully visible in storage.
For each table execute a merge script to move the data from CSV into the data warehouse. This script varies depending on the data warehouse type or even for specific applications. It generally consists of a sequence of operating system commands, load commands like COPY or LOAD DATA INFILE to load in the CSV data, and ordinary SQL commands to move/massage data.
When all tables are loaded, issue a single commit on the SQL connection.
The main requirement of merge scripts is that they must ensure rows load and that delete and insert operations apply in the correct order. Tungsten includes load scripts for MySQL and Vertica that do this automatically.
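The ordering requirement can be illustrated with an in-memory model. This is only a sketch under simplifying assumptions: real merge scripts use load commands and set-based SQL, whereas here the base table is a dictionary keyed by primary key, and staged rows have the layout opcode, seqno, row_id, timestamp, key, followed by the remaining columns. The function name merge_staged is hypothetical.

```python
def merge_staged(base: dict, staged_rows: list) -> dict:
    """Apply staged D and I rows to base in (seqno, row_id) order."""
    for opcode, _seqno, _row_id, _ts, key, *rest in sorted(
        staged_rows, key=lambda row: (row[1], row[2])
    ):
        if opcode == "D":
            base.pop(key, None)      # delete by key; a missing key is not an error
        elif opcode == "I":
            base[key] = rest         # insert the full row
        else:
            raise ValueError(f"unexpected opcode: {opcode}")
    return base


if __name__ == "__main__":
    table = {143: ["http://old.example"]}
    staged = [
        # Deliberately listed out of order: the insert of the updated row first.
        ["I", 84901, 3, "2016-03-11 20:51:10.000", 143, "http://www.microsoft.com"],
        ["D", 84901, 2, "2016-03-11 20:51:10.000", 143, None],
    ]
    print(merge_staged(table, staged))
```

If the delete were applied after the insert, the updated row would be lost, which is the failure a merge script has to rule out.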
It is common to use staging tables to help load data. These are described in more detail in a later section.
Tungsten currently has some important limitations for batch loading, namely:
Primary keys must be a single column only. Tungsten does not handle multi-column keys.
Binary data is not certified and may cause problems when converted to CSV as it will be converted to Unicode.
These limitations will be relaxed in future releases.
Here is how to set up on MySQL. For more information on specific data warehouse types, refer to Chapter 2, Deployment Overview.
Enable row replication on the MySQL Source using set global binlog_format=row or by updating my.cnf.
Ensure that you are operating using GMT throughout your source and target database.
Install using the --batch-enabled=true option. Here is a typical Vertica applier configuration, taken from Section 4.3, “Deploying the Vertica Applier”:
shell> ./tools/tpm configure defaults \
    --reset \
    --user=tungsten \
    --install-directory=/opt/continuent \
    --profile-script=~/.bash_profile \
    --skip-validation-check=HostsFileCheck \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> ./tools/tpm configure alpha \
    --topology=master-slave \
    --master=sourcehost \
    --members=localhost \
    --datasource-type=vertica \
    --replication-user=dbadmin \
    --replication-password=password \
    --vertica-dbname=dev \
    --batch-enabled=true \
    --batch-load-template=vertica6 \
    --batch-load-language=js \
    --replication-port=5433 \
    --svc-applier-filters=dropstatementdata \
    --svc-applier-block-commit-interval=30s \
    --svc-applier-block-commit-size=25000 \
    --disable-relay-logs=true
shell> vi /etc/tungsten/tungsten.ini
[defaults]
user=tungsten
install-directory=/opt/continuent
profile-script=~/.bash_profile
skip-validation-check=HostsFileCheck
skip-validation-check=InstallerMasterSlaveCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret
[alpha]
topology=master-slave
master=sourcehost
members=localhost
datasource-type=vertica
replication-user=dbadmin
replication-password=password
vertica-dbname=dev
batch-enabled=true
batch-load-template=vertica6
batch-load-language=js
replication-port=5433
svc-applier-filters=dropstatementdata
svc-applier-block-commit-interval=30s
svc-applier-block-commit-size=25000
disable-relay-logs=true
The JavaScript batchloader enables data to be loaded into data warehouse and other targets through a simplified JavaScript command script. The script implements functions for specific stages of the apply process, from preparation to commit, allowing for internal data, external commands, and other operations to be executed in sequence.
The actual loading process works through the specification of a JavaScript batchload script that defines what operations to perform during each stage of the batchloading process. These mirror the basic steps in the operation of applying the data that is being batchloaded, as shown in Figure 5.3, “Batchloading: JavaScript”.
To summarize:
prepare() is called when the replicator goes online
begin() is called before a single transaction starts
apply() is called to copy and load the raw CSV data
commit() is called after the raw data has been loaded
release() is called when the replicator goes offline
The JavaScript batchloader can be used with parallel apply to enable multiple threads to be generated and apply data to the target database. This can be useful in datawarehouse environments where simultaneous loading (and commit) enables effective application of multiple table data into the datawarehouse.
The defined JavaScript methods like prepare, begin, commit, and release are called independently for each environment. This means that you should ensure actions in these methods do not conflict with each other.
CSV files are divided across the scripts. If there is a large number of files that all take about the same time to load and there are three threads (parallelization=3), each individual load script will see about a third of the files. You should therefore not code assumptions that you have seen all tables or CSV files in a single script.
Parallel loading is only recommended for data sources like Hadoop that are idempotent. When applying to a data source that is non-idempotent (for example MySQL or potentially Vertica) you should just use a single thread.
Staging tables are intermediate tables that help with data loading. There are different usage patterns for staging tables.
Tungsten assumes that staging tables, if present, follow certain conventions for naming and provides a number of configuration properties for generating staging table names that match the base tables in the data warehouse without colliding with them.
Property | Description |
---|---|
stageColumnPrefix | Prefix for seqno, row_id, and opcode columns generated by Tungsten |
stageTablePrefix | Prefix for stage table name |
stageSchemaPrefix | Prefix for the schema in which the stage tables reside |
These values are set in the static properties file that defines the replication service. They can be set at install time using --property options. The following example shows typical values from a service properties file.
replicator.applier.dbms.stageColumnPrefix=tungsten_
replicator.applier.dbms.stageTablePrefix=stage_xxx_
replicator.applier.dbms.stageSchemaPrefix=load_
If your data warehouse contains a table named foo in schema bar, these properties would result in a staging table name of load_bar.stage_xxx_foo for the staging table. The Tungsten generated column containing the seqno, if present, would be named tungsten_seqno.
Staging tables are by default in the same schema as the table they update. You can put them in a different schema using the stageSchemaPrefix property as shown in the example.
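The naming convention amounts to simple string concatenation, as the following minimal sketch shows. The two helper functions are hypothetical and only mirror the example above; they are not part of Tungsten Replicator.

```python
def staging_table_name(schema: str, table: str,
                       schema_prefix: str = "", table_prefix: str = "") -> str:
    """Build the qualified staging table name from the configured prefixes."""
    return f"{schema_prefix}{schema}.{table_prefix}{table}"


def staging_column_name(column: str, column_prefix: str = "") -> str:
    """Build the name of a Tungsten-generated column such as seqno."""
    return f"{column_prefix}{column}"


if __name__ == "__main__":
    print(staging_table_name("bar", "foo",
                             schema_prefix="load_", table_prefix="stage_xxx_"))
    print(staging_column_name("seqno", column_prefix="tungsten_"))
```

Leaving the schema prefix empty keeps the staging table in the same schema as the base table.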
Whole record staging loads the entire CSV file into an identical table, then runs queries to apply rows to the base table or tables in the data warehouse. One of the strengths of whole record staging is that it allows you to construct a merge script that can handle any combination of INSERT, UPDATE, or DELETE operations. A weakness is that whole record staging can result in sub-optimal I/O for workloads that consist mostly of INSERT operations.
For example, suppose we have a base table created by the following CREATE TABLE command:
CREATE TABLE `mydata` (
  `id` int(11) NOT NULL,
  `f_data` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
A whole record staging table would look as follows.
CREATE TABLE `stage_xxx_croc_mydata` (
  `tungsten_opcode` char(1) DEFAULT NULL,
  `tungsten_seqno` int(11) DEFAULT NULL,
  `tungsten_row_id` int(11) DEFAULT NULL,
  `id` int(11) NOT NULL,
  `f_data` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Note that this table does not have a primary key defined. Most data warehouses do not use primary keys and many of them do not even permit it in the create table syntax.
Note also that the non-primary columns must permit nulls. This is required for deletes, which contain only the Tungsten generated columns plus the primary key.
Another approach is to load INSERT rows directly into the base data warehouse tables without staging. All you need to stage is the keys for deleted records. This reduces I/O considerably for workloads that have mostly inserts. The downside is that it may introduce ordering dependencies between DELETE and INSERT operations that require special handling by upstream applications to generate transactions that will load without conflicts.
Delete key staging tables can be as simple as the following example:
CREATE TABLE `stage_xxx_croc_mydata` (
  `id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Tungsten does not generate staging tables automatically. Creation of staging tables is the responsibility of users, but it can be simplified by using the ddlscan tool with the right template.
Character sets are a headache in batch loading because all updates are written to and read from CSV files, which can result in invalid transactions along the replication path. Such problems are very difficult to debug. Here are some tips to improve the chances of successful replication.
Use UTF8 character sets consistently for all string and text data.
Force Tungsten to convert data to Unicode rather than transferring strings by setting the following configuration option:
mysql-use-bytes-for-string=false
When starting the replicator for MySQL replication, include the following configuration option:
java-file-encoding=UTF8
Tungsten Replicator supports a number of CSV formats that can and should be used with specific heterogeneous environments when using the batch loading process, or generating CSV files in general for testing or loading.
A number of standard types are included, and the use of these standard types when generating CSV is controlled by the replicator.datasource.global.csvType property. Depending on the configured target, the corresponding type will be configured automatically. For example, if you configure a Vertica deployment, the replicator will be configured to default to the Vertica style CSV format.
Using the wrong CSV format with a given target may break replication. You should always use the appropriate CSV format for the defined target.
Table 5.1. Continuent Tungsten Directory Structure
Format | Field Separator | Record Separator | Escape Sequence | Escaped Characters | Null Policy | Null Value | Show Headers | Use Quotes | Quote String | Suppressed Characters |
---|---|---|---|---|---|---|---|---|---|---|
hive | \u0001 | \n | \\ | \u0001\\ | Use Null Value | \\N | false | false | | \n\r |
mysql | , | \n | \\ | \\ | Use Null Value | \\N | false | true | \" | |
oracle | , | \n | \\ | \\ | Use Null Value | \\N | false | true | \" | |
vertica | , | \n | \\ | \\ | Skip Value | | false | true | \" | \n |
redshift | , | \n | \" | | Skip Value | | false | true | \" | \n |
In addition to the standardised types, the replicator.datasource.global.csvType property can be set to custom, in which case the following configurable values are used instead:
replicator.datasource.global.csv.fieldSeparator: the character used to separate fields, such as , (comma).
replicator.datasource.global.csv.RecordSeparator: the character used to separate records, such as the newline character.
replicator.datasource.global.csv.nullValue: the value to use for NULL (empty) values.
replicator.datasource.global.csv.useQuotes: whether to use quotes to encapsulate field values (specified using true or false).
replicator.datasource.global.csv.useHeaders: whether to include the column headers in the generated CSV (specified using true or false).
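To show how these settings shape a record, the following minimal sketch formats one row under two different configurations. It is an illustration only: the CsvFormat class and format_record function are hypothetical, escaping and suppressed characters are deliberately left out, and the choice to write null values unquoted is an assumption based on the sample CSV shown earlier.

```python
from dataclasses import dataclass


@dataclass
class CsvFormat:
    """A few of the settings from the formats table (illustrative subset)."""
    field_separator: str = ","
    record_separator: str = "\n"
    null_value: str = r"\N"
    use_quotes: bool = True
    quote: str = '"'


def format_record(values, fmt: CsvFormat) -> str:
    """Render one record; escaping is intentionally not handled in this sketch."""
    fields = []
    for value in values:
        if value is None:
            fields.append(fmt.null_value)            # nulls are written unquoted
        elif fmt.use_quotes:
            fields.append(f"{fmt.quote}{value}{fmt.quote}")
        else:
            fields.append(str(value))
    return fmt.field_separator.join(fields) + fmt.record_separator


if __name__ == "__main__":
    row = ["I", 74, None]
    mysql_like = CsvFormat()
    hive_like = CsvFormat(field_separator="\u0001", use_quotes=False)
    print(repr(format_record(row, mysql_like)))
    print(repr(format_record(row, hive_like)))
```

The same row produces quite different bytes under each configuration, which is why loading a file with the wrong format for the target can break replication.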
The CSV generated when using the batch loading process creates a number of special columns that are designed to hold the appropriate information for loading the staging data into the target system.
Five fields are supported, of which the first four are enabled by default:
opcode: The operation code, a one- or two-letter code indicating the operation type. For more information on the supported codes, see Section 5.6.9, “Batchloading Opcodes”.
seqno: Contains the current THL event (sequence) number for the row data being loaded. The sequence number generated is specific to the THL event number.
row_id: Contains a unique row ID (a monotonically incrementing number) which is unique to this CSV file for the table data being loaded. This can be useful for systems where the sequence number alone is not enough to identify an incoming row, even with the incoming primary key information.
commit_timestamp: The timestamp of when the data was originally committed by the source database, taken from the TIME within the THL event.
service: The service name of the replicator service that performed the loading and generated the CSV. This field is not enabled by default, but is provided to allow for data concentration into a BigData target while enabling identification of the source service and/or database that generated the data.
These fields are placed before the actual data for the corresponding table. For example, with the default setting, the following CSV is generated; the last three columns are specific to the table data:
"I","74","1","2017-05-26 13:00:11.000","655337","Dr No","kat"
The configuration of the list of fields, and the order in which they appear, is controlled by the replicator.applier.dbms.stageColumnNames property. By default, all four fields, in the order shown above, are used:
replicator.applier.dbms.stageColumnNames=opcode,seqno,row_id,commit_timestamp
The actual names used (and passed to the JavaScript environment) are also controlled by another property, replicator.applier.dbms.stageColumnPrefix. This value is prepended to each column name within the JS environment, and is expected by the various tools. For example, with the default prefix of tungsten_, the true name for the opcode column is tungsten_opcode.
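As a rough illustration of how the staging columns sit in front of the table data, the following Python sketch splits the example row shown earlier into prefixed metadata and table values. The helper name split_staged_row and the two constants are hypothetical and exist only for this example; the column list and the tungsten_ prefix are the defaults described in this section.

```python
import csv
import io

# Defaults taken from the properties described in this section.
STAGE_COLUMN_NAMES = "opcode,seqno,row_id,commit_timestamp"
STAGE_COLUMN_PREFIX = "tungsten_"


def split_staged_row(line, column_names=STAGE_COLUMN_NAMES, prefix=STAGE_COLUMN_PREFIX):
    """Split one staged CSV line into (metadata dict, table data list).

    Hypothetical helper for illustration; not part of Tungsten Replicator.
    """
    names = [prefix + name.strip() for name in column_names.split(",")]
    values = next(csv.reader(io.StringIO(line)))
    # The staging columns come first, followed by the table's own columns.
    metadata = dict(zip(names, values[:len(names)]))
    table_data = values[len(names):]
    return metadata, table_data


row = '"I","74","1","2017-05-26 13:00:11.000","655337","Dr No","kat"'
metadata, table_data = split_staged_row(row)
print(metadata["tungsten_opcode"], metadata["tungsten_seqno"], table_data)
```

If the list of staging columns is changed, a consumer written this way must be given the same list, which is one reason the default scripts can break when the defaults are modified.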
Modifying the list of fields generated by the CSV writer may stop batchloading from working. Unless otherwise noted, the default batchloading scripts all expect to see the default four columns (opcode, seqno, row_id and commit_timestamp).
The batchloading and CSV generation processes use the opcode value to specify the operation type for each row. The default mode is to use only the I and D codes for inserts and deletes respectively, with an update being represented as two rows, one a delete and the other an insert of the new information.
This behavior can be altered to denote updates with a U character, with the row containing the updated information. To enable this mode, set the replicator.applier.dbms.useUpdateOpcode property to true.
It is also possible to identify situations where the incoming row data indicates a delete operation that resulted from an update (for example, in a cascade or related column), and an insert from an update. When this mode is enabled, the opcode becomes a two-character value of UD and UI respectively. To enable this option, set the replicator.applier.dbms.distinguishUpdates property to true.
Changing the default opcode modes may cause replication to fail. The default JavaScript batchloading scripts expect the default I and D notation, with updates implied through a delete and insert operation.
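The three opcode modes can be summarised with a short Python sketch. This is only an illustration of the behavior described in this section; the function opcode_rows_for_update is hypothetical, and how the replicator behaves if both options are set at once is not covered here.

```python
def opcode_rows_for_update(old_row, new_row,
                           use_update_opcode=False,
                           distinguish_updates=False):
    """Return (opcode, row) pairs written for a single source UPDATE.

    Hypothetical illustration only; the replicator's own logic may
    differ, including how the two options interact if both are set.
    """
    if use_update_opcode:
        # One row, carrying the updated values.
        return [("U", new_row)]
    if distinguish_updates:
        # Delete and insert halves are marked as originating from an update.
        return [("UD", old_row), ("UI", new_row)]
    # Default: a plain delete followed by a plain insert.
    return [("D", old_row), ("I", new_row)]


old = ("655337", "Dr No", "kat")
new = ("655337", "Dr No", "mc")
print(opcode_rows_for_update(old, new))
print(opcode_rows_for_update(old, new, use_update_opcode=True))
print(opcode_rows_for_update(old, new, distinguish_updates=True))
```

A loading script that only understands I and D will not recognise the U, UD or UI rows, which is why changing the mode requires matching changes in the batchloading scripts.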
Time zones are another headache when using batch loading. For best results applications should standardize on a single time zone, preferably UTC, and use this consistently for all data. To ensure the Java VM outputs time data correctly to CSV files, you must set the JVM time zone to be the same as the standard time zone for your data. Here is the JVM setting in wrapper.conf:
# To ensure consistent handling of dates in heterogeneous and batch replication
# you should set the JVM timezone explicitly. Otherwise the JVM will default
# to the platform time, which can result in unpredictable behavior when
# applying date values to Targets. GMT is recommended to avoid inconsistencies.
wrapper.java.additional.5=-Duser.timezone=GMT
Beware that MySQL has two very similar data types: TIMESTAMP and DATETIME. Timestamps are stored in UTC and convert back to local time on display. Datetimes, by contrast, do not convert back to local time. If you mix time zones and use both data types, your time values will be inconsistent on loading.
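The effect of a mismatched time zone can be seen with a small Python sketch, which renders one instant as text under two different zones. The five-hour offset is an arbitrary assumption chosen for the example and has nothing to do with any particular deployment.

```python
from datetime import datetime, timedelta, timezone

CSV_FORMAT = "%Y-%m-%d %H:%M:%S"

# A single instant, as committed on the source.
instant = datetime(2017, 5, 26, 13, 0, 11, tzinfo=timezone.utc)

# Hypothetical local zone, five hours behind UTC.
local_zone = timezone(timedelta(hours=-5))

as_utc = instant.astimezone(timezone.utc).strftime(CSV_FORMAT)
as_local = instant.astimezone(local_zone).strftime(CSV_FORMAT)

print(as_utc)    # 2017-05-26 13:00:11
print(as_local)  # 2017-05-26 08:00:11
```

Once written to a CSV file the zone information is gone, so the two strings would be loaded as two different times. Fixing the JVM to one zone keeps the text form consistent.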
All the features discussed in this section are only available from version 6.1.15 of Tungsten Replicator.
There are occasions where Batch loading into MySQL may benefit your use case, such as loading large data warehouse environments, or where real-time replication isn't as critical.
A number of specific properties are available for MySQL targets; these are discussed below.
By default, when loading into MySQL using the Batch Applier, the process executes LOAD DATA INFILE statements to load the CSV files into the database.
If the applier is installed on a host remote from the database, these statements would typically fail, because LOAD DATA INFILE reads the file on the database server. In this case, you need to enable the following property in the configuration:
property=replicator.applier.dbms.useLoadDataLocalInfile=true
Tungsten Replicator includes a number of useful filters, such as the ability to drop certain DML statements on a schema or table level.
If you wish to drop such statements on a per-object basis, then you should continue to use the skipbyevent filter. However, if you want to drop ALL DELETE DML, then you can enable the following property:
property=replicator.applier.dbms.skipDeletes=true
By dropping deletes, you will then subsequently expose yourself to errors should rows be reinserted later with the same Primary or Unique Key values. Typically, this feature would be only enabled when you plan to capture and log key violations. See Section 5.6.11.6, “Log rows violating Primary/Unique Keys” for more information.
If you wish to specify a different CHARSET to be used when the data is being loaded into the target database, this can be set using the following property, for example:
property=replicator.applier.dbms.loadCharset=utf8mb4
Typically, the batch loader is used for heterogeneous targets, and therefore by default DDL statements will be dropped. However, when applying into MySQL the DDL statements would be valid and can therefore be executed.
To enable this, you should set the following property:
property=replicator.applier.dbms.applyStatements=true
Any changes to existing tables, or creation of new tables, will only apply to the main base table. You will still need to manually make changes to the relevant staging and error tables (if used).
If you use a lot of foreign keys in your target database, the nature of batch loading can cause errors: tables may not be loaded in sequence, so parent/child keys may only be valid once the complete transaction has been loaded.
To prevent this from happening, you can enable the property below which will force the batch loader to temporarily disable foreign key checks until after the full transaction has been loaded.
property=replicator.applier.dbms.disableForeignKeys=true
To prevent the replicator erroring on primary or unique key violations, you can instruct the replicator to log the offending rows in an error table, which will allow you to manually process the rows afterwards.
This is especially useful when you are dropping DELETE statements from the apply process.
The following properties can be set to enable this:
property=replicator.applier.dbms.useUpdateOpcode=true
property=replicator.applier.dbms.batchLogDuplicateRows=true
By default, this feature will only check against PRIMARY KEYS. If you wish to also check against UNIQUE keys, you will need the additional property:
property=replicator.applier.dbms.fetchKeysFromDatabase=true
By default, the error rows will be logged into tables called error_xxx_origTableName.
These tables will need to be created in advance, in the same way that you create the staging tables using ddlscan, but supplying the table prefix, for example:
shell> ddlscan -db hr -template ddl-mysql-staging.vm -opt tablePrefix error_xxx_
You can choose a different prefix if you wish, by replacing error_xxx_ with your choice in the above ddlscan statement. If you choose to do this, you will also need to supply the new prefix in your configuration using the following property:
property=replicator.applier.dbms.errorTablePrefix=your-prefix-here_
If you are loading tens of thousands of rows per transaction, and your target tables are very large, this feature could slow down the apply process, as the applier will first need to ensure the row being inserted does not violate any keys. The use of this feature should be fully tested in a load test environment and the risks fully understood before using in production.
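The cost comes from the extra key lookup needed for every incoming row. The Python sketch below is a conceptual illustration of that kind of check only; it is not the replicator's implementation, and the function route_rows and its arguments are hypothetical.

```python
def route_rows(incoming_rows, existing_keys, key_column):
    """Split rows into those safe to load and those that would violate the key.

    Conceptual illustration only; not how the batch applier is implemented.
    """
    seen = set(existing_keys)
    to_load, to_error_table = [], []
    for row in incoming_rows:
        key = row[key_column]
        if key in seen:
            # Would violate the key: set aside for the error table.
            to_error_table.append(row)
        else:
            seen.add(key)
            to_load.append(row)
    return to_load, to_error_table


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 2, "name": "c"}]
loaded, errors = route_rows(rows, existing_keys=[1], key_column="id")
print(loaded)
print(errors)
```

Against a real database the equivalent lookup has to consult the target table's keys for each row, which is why large tables and large transactions make the feature more expensive.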
By default, the CSV files generated as part of the batchloading process are named according to the schema name, table name, and the starting transaction sequence number that generated the data in the file. For example, the table orders within the schema sales generating the transaction information from sequence numbers 110 through 145 would have the name sales-orders-110.csv.
Because the size of the files can be quite large, and because within different target environments (particularly Hadoop or when uploading to S3) the speed with which the data can be uploaded or organised within the target can be critical, the files can also be partitioned. This splits up the files generated by a chosen value such as the commit time or data value.
The primary solution for partitioning is the DateTime partitioner, which uses a configurable date time value from the internal data structure as the basis for the information.
To enable date-based partitioning, you must specify the following properties during your configuration:
replicator.applier.dbms.partitionBy=tungsten_commit_timestamp
replicator.applier.dbms.partitionByClass=com.continuent.tungsten.replicator.applier.batch.DateTimeValuePartitioner
replicator.applier.dbms.partitionByFormat=yyyy-MM-dd-HH
The above sets the use of the tungsten_commit_timestamp field generated by the batchload CSV system as the basis of the value. The format specification is then used to specify the format of the data which will be embedded into the file name. The data formatter uses the Java date format strings, and you can use one or more of the following values:
yy: Year as two digit number
yyyy: Year as four digit number
MM: Month with leading zero
dd: Day with leading zero
HH: Hour in 24 hour format with leading zero
mm: Minute with leading zero
ss: Seconds with leading zero
For example, setting yyyy-MM-dd-HH (the default), the name of the CSV file will be orders-sales-2018-04-03-12-199.csv.
Note that the THL sequence number is still embedded in the filename (as the last item), as is the schema and table name.
Files generated will automatically be split by the configured value. Remember, however, that the commit timestamp is consistent for an individual transaction, so the data for a single transaction will never be split across multiple files, even if it takes time for the CSV file to be written. The key is the commit timestamp from the source database for the entire transaction that corresponds to the sequence number.
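To show how a format string groups rows, the Python sketch below derives the date portion of the partition from a commit timestamp. It is an approximation written for this example: the function partition_key is hypothetical, it only understands the tokens listed above (with lowercase yy for the two-digit year, as in Java date patterns), and it does not reproduce the full file name.

```python
import re
from datetime import datetime

# Java date pattern tokens mapped to Python strftime directives.
JAVA_TO_STRFTIME = {
    "yyyy": "%Y", "yy": "%y", "MM": "%m", "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S",
}
# "yyyy" is listed before "yy" so the longer token wins.
TOKEN_PATTERN = re.compile(r"yyyy|yy|MM|dd|HH|mm|ss")


def partition_key(commit_timestamp, java_format="yyyy-MM-dd-HH"):
    """Format a commit timestamp string using a Java-style pattern.

    Hypothetical helper for illustration; not part of Tungsten Replicator.
    """
    strftime_format = TOKEN_PATTERN.sub(
        lambda match: JAVA_TO_STRFTIME[match.group(0)], java_format
    )
    parsed = datetime.strptime(commit_timestamp, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.strftime(strftime_format)


print(partition_key("2018-04-03 12:15:00.000"))  # 2018-04-03-12
print(partition_key("2018-04-03 12:59:59.000"))  # 2018-04-03-12
print(partition_key("2018-04-03 13:00:00.000"))  # 2018-04-03-13
```

With the default hourly format, the first two timestamps fall into the same partition and the third starts a new one. A coarser format such as yyyy-MM-dd would group a whole day together.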
Authentication between command-line tools (trepctl), and between background services.
SSL/TLS between command-line tools and background services.
SSL/TLS between Tungsten Replicator and datasources.
SSL for all API calls.
File permissions and access by all components.
The following graphic provides a visual representation of the various communication channels which may be encrypted.
For the key to the above diagram, please see Section 9.5.15, “tpm report Command”.
If you are using a single staging directory to handle your complete installation, tpm will automatically create the necessary certificates for you. If you are using an INI based installation, then the installation process will create the certificates for you, however you will need to manually sync them between hosts prior to starting the various components.
It is assumed that your underlying database has SSL enabled and the certificates are available. If you need, and want, this level of security enabling, you can refer to Section 6.10.1, “Enabling Database SSL” for the steps required.
Additionally, if you are configuring heterogeneous replication, there will be additional manual steps required to ensure SSL communication to your chosen target database.
Due to a known issue in earlier Java revisions that may cause performance degradation with client connections, it is strongly advised that you ensure your Java version is one of the following MINIMUM releases before enabling SSL:
By default, security is enabled for new installations.
Security can be enabled/disabled by adding the disable-security-controls option to the configuration.
If this property is not supplied, or set to false, then security will be enabled. If set to true, then security will be disabled.
Enabling security through this single option, has the same effect as adding:
If you are enabling to-the-database encryption, you must ensure this has been enabled in your database and the relevant certificates are available first. See Section 6.10.1, “Enabling Database SSL” for steps.
Installing from a staging host will automatically generate certificates and configuration for a secured installation. No further changes or actions are required.
For INI-based installations, there are additional steps required to copy the needed certificate files to all of the nodes. Please see Section 6.1.2, “Enabling Security using the INI Method” for details.
Security is enabled by default during the initial install. Should you choose to disable it at install time, these steps will guide you through enabling it as part of a post-install update.
Enabled During Install
As mentioned, security is enabled by default. This is controlled by the --disable-security-controls option. If not supplied, the default is false. You can choose to specify this in your configuration for transparency if you wish.
shell> tools/tpm configure defaults --disable-security-controls=false \
    [...the rest of the configuration options...]
shell> tools/tpm install
The above configuration (and the default) will assume that your database has been configured with SSL enabled. The installation will error and fail if this is not the case. You must manually ensure database SSL has been enabled prior to issuing the install. Steps to enable this can be found in Section 6.10.1, “Enabling Database SSL”
If you DO NOT want to enable database level SSL, then you must also include the following option in the tpm configure command above:
--enable-connector-ssl=false --datasource-enable-ssl=false
Installing from a staging host will automatically generate certificates and configuration for a secured installation. No further changes or actions are required.
Enabling Post-Installation
If, at install time, you disabled security (by specifying --disable-security-controls=true) you can enable it by changing the value to false.
shell> tools/tpm configure defaults --disable-security-controls=false
shell> tools/tpm update --replace-jgroups-certificate --replace-tls-certificate --replace-release
The above configuration will assume that your database has been configured with SSL enabled. The update will error and fail if this is not the case. You must manually ensure database SSL has been enabled prior to issuing the update. Steps to enable this can be found in Section 6.10.1, “Enabling Database SSL”
If you DO NOT want to enable database level SSL, then you must also include the following options in the tpm configure command above:
--enable-connector-ssl=false --datasource-enable-ssl=false
Following the update, you will also need to manually re-sync the certificates and keystores to all other nodes within your configuration. The following example uses scp for the copy and uses db1 as the primary source for the files to be copied. Adjust accordingly for your environment.
Sync Certificates and Keystores to all nodes
db1> for host in db2 db3 db4 db5 db6; do
scp /opt/continuent/share/[jpt]* ${host}:/opt/continuent/share
scp /opt/continuent/share/.[jpt]* ${host}:/opt/continuent/share
done
Restart all components, on all hosts
shell> replicator restart
This update will force replicator processes to be restarted.
Security is enabled by default during the initial install. Should you choose to disable it at install time, these steps will guide you through enabling it as part of a post-install update.
Enabled During Install
As mentioned, security is enabled by default. This is controlled by the disable-security-controls property. If not supplied, the default is false. You can choose to specify this in your configuration for transparency if you wish.
disable-security-controls=false
The above configuration (and the default) will assume that your database has been configured with SSL enabled. The installation will error and fail if this is not the case. You must manually ensure database SSL has been enabled prior to issuing the install. Steps to enable this can be found in Section 6.10.1, “Enabling Database SSL”
If you DO NOT want to enable database level SSL, then you must also include the following option in your tungsten.ini file:
datasource-enable-ssl=false
Following installation there are a few additional steps that will be required before starting the software.
You must select one of the nodes and copy that node's certificate/keystore/truststore files to all other nodes.
For example, assuming you choose db1 and have 5 other nodes to copy the files to, you could use this syntax:
shell> for host in db2 db3 db4 db5 db6; do
scp /opt/continuent/share/[jpt]* ${host}:/opt/continuent/share/
scp /opt/continuent/share/.[jpt]* ${host}:/opt/continuent/share/
done
The above example assumes ssh has been setup between nodes as the tungsten OS user. If this is not the case you will need to use whichever methods you have available to sync these files.
Then, on all nodes, you can start the software:
shell> source /opt/continuent/share/env.sh
shell> startall
Enabling Post-Installation
If, at install time, you disabled security (by specifying disable-security-controls=true), you can enable it by changing the value to false in your tungsten.ini on all nodes.
The above configuration (and the default) will assume that your database has been configured with SSL enabled. The update will error and fail if this is not the case. You must manually ensure database SSL has been enabled prior to issuing the update. Steps to enable this can be found in Section 6.10.1, “Enabling Database SSL”
If you DO NOT want to enable database level SSL, then you must also include the following option in your tungsten.ini file:
datasource-enable-ssl=false
Before issuing the update, there are a number of additional steps required. These are outlined below:
First, configure the tungsten.ini file as follows:
disable-security-controls=false
start-and-report=false
Do the update on each node, which will generate new, different certificates on every node.
This update procedure will force replicators to be restarted.
shell> stopall
shell> tpm query staging
shell> cd {staging_directory}
shell> tools/tpm update --replace-jgroups-certificate --replace-tls-certificate --replace-release
As with a fresh install, you must then select one of the nodes and copy that node's certificate files to all other nodes:
For example, assuming you choose db1 and have 5 other nodes to copy the files to, you could use this syntax:
shell> for host in db2 db3 db4 db5 db6; do
scp /opt/continuent/share/[jpt]* ${host}:/opt/continuent/share/
scp /opt/continuent/share/.[jpt]* ${host}:/opt/continuent/share/
done
The above example assumes ssh has been setup between nodes as the tungsten OS user. If this is not the case you will need to use whichever methods you have available to sync these files.
On all nodes:
shell> startall
There may be situations where you wish to disable security for the entire installation.
Security can be disabled in the following ways during configuration with tpm:
--disable-security-controls=true
Disabling security through this single option, has the same effect as adding:
Disables file level protection, including ownership and file mode settings.
Disables the use of SSL/TLS for communicating with services, this includes starting, stopping, or controlling individual services and operations, such as putting Tungsten Replicator online or offline.
Disables the use of SSL/TLS for THL transmission between replicators.
Disables the use of authentication when accessing and controlling services.
--replicator-rest-api-ssl=false
Disables SSL for communication with the Replicator API. This does not disable the API altogether. To do that, refer to replicator-rest-api.
By default, tpm can automatically create suitable certificates and configuration for use in your deployment. To create the required certificates by hand, use one of the following procedures.
To manually generate the security files, use the steps below:
Generating a TLS Certificate
Run this command to create the keystore in /etc/tungsten/secure. You may use your own location, but the values for -storepass and -keypass must match.
shell> keytool -genkey -alias tls \
-validity 3650 \
-keyalg RSA -keystore /etc/tungsten/secure/tungsten_tls_keystore.jks \
-dname "cn=Continuent, ou=IT, o=Continuent, c=US" \
-storepass mykeystorepass -keypass mykeystorepass
Follow the steps in Section 6.3, “Creating Suitable Certificates” to create the TLS certificate.
Transfer the generated certificates to the same path on all hosts.
Update your configuration to specify the certificate and the keystore password:
java-tls-keystore-path=/etc/tungsten/secure/tungsten_tls_keystore.jks
java-keystore-password=mykeystorepass
This procedure will take a signed certificate from a known Certificate Authority and use it as the basis for all SSL operations within the replicator.
The example procedure below assumes that you have an existing, installed and running Primary/Replica topology with security enabled by setting disable-security-controls=false.
Assume a simple topology with member hosts db1 and db2.
In all examples below, because you are updating an existing secure installation, the password tungsten is required; do not change it.
Select one node to create the proper set of certs, for example db1:
shell> su - tungsten
shell> mkdir /etc/tungsten/secure
shell> mkdir ~/certs
shell> cd ~/certs
Copy the available files (CA cert, Intermediate cert (if needed), signed cert and signing key) into ~/certs/, i.e.:
ca.crt.pem int.crt.pem signed.crt.pem signing.key.pem
Create a pkcs12 (.p12) version of the signed certificate:
shell> openssl pkcs12 -export -in ~/certs/signed.crt.pem -inkey ~/certs/signing.key.pem \
    -out ~/certs/tungsten_sec.crt.p12 -name replserver
Enter Export Password:tungsten
Verifying - Enter Export Password:tungsten
When using OpenSSL 3.0 with Java 1.8, you MUST add the -legacy option to the openssl command.
Create a pkcs12-based keystore (.jks) version of the signed certificate:
shell> keytool -importkeystore -deststorepass tungsten -destkeystore /etc/tungsten/secure/tungsten_keystore.jks \
    -srckeystore ~/certs/tungsten_sec.crt.p12 -srcstoretype pkcs12 -deststoretype pkcs12
Importing keystore /home/tungsten/certs/tungsten_sec.crt.p12 to /etc/tungsten/secure/tungsten_keystore.jks...
Enter source keystore password:tungsten
Entry for alias replserver successfully imported.
Import command completed: 1 entries successfully imported, 0 entries failed or cancelled
Import the Certificate Authority's certificate into the keystore:
shell> keytool -import -alias careplserver -file ~/certs/ca.crt.pem -keypass tungsten \
    -keystore /etc/tungsten/secure/tungsten_keystore.jks -storepass tungsten
...
Trust this certificate? [no]:yes
Certificate was added to keystore
Import the Certificate Authority's intermediate certificate (if supplied) into the keystore:
shell> keytool -import -alias interreplserver -file ~/certs/int.crt.pem -keypass tungsten \
-keystore /etc/tungsten/secure/tungsten_keystore.jks -storepass tungsten
Certificate was added to keystore
Export the cert from the keystore into the file client.cer for use in the next step to create the truststore:
shell> keytool -export -alias replserver -file ~/certs/client.cer \
    -keystore /etc/tungsten/secure/tungsten_keystore.jks
Enter keystore password:tungsten
Certificate stored in file </home/tungsten/certs/client.cer>
Create the truststore:
shell> keytool -import -trustcacerts -alias replserver -file ~/certs/client.cer \
-keystore /etc/tungsten/secure/tungsten_truststore.ts -storepass tungsten -noprompt
Certificate was added to keystore
Create the rmi_jmx password store entry:
shell> tpasswd -c tungsten tungsten -t rmi_jmx -p /etc/tungsten/secure/passwords.store -e \
-ts /etc/tungsten/secure/tungsten_truststore.ts -tsp tungsten
Using parameters:
-----------------
security.properties = /opt/continuent/tungsten/cluster-home/../cluster-home/conf/security.properties
password.file.location = /etc/tungsten/secure/passwords.store
encrypted.password = true
truststore.location = /etc/tungsten/secure/tungsten_truststore.ts
truststore.password = *********
-----------------
Creating non existing file: /etc/tungsten/secure/passwords.store
User created successfuly: tungsten
Create the tls password store entry:
shell> tpasswd -c tungsten tungsten -t unknown -p /etc/tungsten/secure/passwords.store -e \
-ts /etc/tungsten/secure/tungsten_truststore.ts -tsp tungsten
Using parameters:
-----------------
security.properties = /opt/continuent/tungsten/cluster-home/../cluster-home/conf/security.properties
password.file.location = /etc/tungsten/secure/passwords.store
encrypted.password = true
truststore.location = /etc/tungsten/secure/tungsten_truststore.ts
truststore.password = ********
-----------------
User created successfuly: tungsten
List and verify the user for each security service password store entry, rmi_jmx and tls (which has a display tag of unknown):
shell> tpasswd -l -p /etc/tungsten/secure/passwords.store -ts /etc/tungsten/secure/tungsten_truststore.ts
Using parameters:
-----------------
security.properties = /opt/continuent/tungsten/cluster-home/../cluster-home/conf/security.properties
password.file.location = ./passwords.store
encrypted.password = true
truststore.location = ./tungsten_truststore.ts
truststore.password = ********
-----------------
Listing users by application type:
[unknown]
-----------
tungsten
[rmi_jmx]
-----------
tungsten
On host db1, transfer the generated certificates to the same path on all remaining hosts:
shell> for host in `seq 2 3`; do rsync -av /etc/tungsten/secure/ db$host:/etc/tungsten/secure/; done
Edit the /etc/tungsten/tungsten.ini configuration file on all nodes and add:
[defaults]
...
disable-security-controls=false
java-keystore-path=/etc/tungsten/secure/tungsten_keystore.jks
java-keystore-password=tungsten
java-truststore-path=/etc/tungsten/secure/tungsten_truststore.ts
java-truststore-password=tungsten
rmi-ssl=true
rmi-authentication=true
rmi-user=tungsten
java-passwordstore-path=/etc/tungsten/secure/passwords.store
When java-keystore-path is passed to tpm, the keystore must already contain both the tls and, where appropriate, the mysql certificates. tpm will NOT add the mysql certificate or generate the tls certificate when this option is set, so both certificates must be imported manually beforehand.
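Before running the update, it can be useful to confirm that every node's tungsten.ini carries the keystore, truststore and password store settings listed above. The following Python sketch is one possible pre-flight check; it is not a Continuent tool, and the function missing_security_keys and the sample text are hypothetical.

```python
import configparser

# Settings from the [defaults] block shown in this procedure.
REQUIRED_KEYS = [
    "java-keystore-path",
    "java-keystore-password",
    "java-truststore-path",
    "java-truststore-password",
    "java-passwordstore-path",
]


def missing_security_keys(ini_text, section="defaults"):
    """Return the required keys that are absent from the given INI text.

    Hypothetical helper for illustration; not part of tpm.
    """
    # strict=False tolerates repeated keys such as multiple property= lines.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(section, key)]


SAMPLE_INI = """
[defaults]
disable-security-controls=false
java-keystore-path=/etc/tungsten/secure/tungsten_keystore.jks
java-keystore-password=tungsten
java-truststore-path=/etc/tungsten/secure/tungsten_truststore.ts
java-truststore-password=tungsten
java-passwordstore-path=/etc/tungsten/secure/passwords.store
"""

print(missing_security_keys(SAMPLE_INI))
```

The check only looks for the presence of the settings. It does not open the keystore, so it cannot confirm that the tls and mysql certificates have actually been imported.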
On ALL nodes, stop the replicator software, execute the update, then start the replicators:
This procedure requires the complete restart of all layers of the Cluster, and will cause a brief downtime.
shell> tpm query staging
shell> cd {staging_dir}
shell> stopall
shell> tools/tpm update --replace-release
shell> startall
If you meet the requirements to use an automatically generated certificate from the staging directory, the tpm update command can handle the certificate replacement. Simply add the --replace-jgroups-certificate option to your command. This will create errors if your staging configuration does not reflect the full list of hosts or if you limit the command to a specific host.
shell> tools/tpm update --replace-jgroups-certificate --replace-release
If you do not meet these requirements, generate a new certificate and update it through the tpm command.
shell> tools/tpm configure SERVICE \
    --java-jgroups-keystore-path=/etc/tungsten/jgroups.jceks \
    --java-keystore-password=mykeystorepass
Then perform an update and replace the entire release directory:
shell> tools/tpm update --replace-release
If you meet the requirements to use an automatically generated certificate from the staging directory, the tpm update command can handle the certificate replacement. Simply add the --replace-tls-certificate option to your command. This will create errors if your staging configuration does not reflect the full list of hosts or if you limit the command to a specific host.
shell> tools/tpm update --replace-tls-certificate --replace-release
If you do not meet these requirements, generate a new certificate and update it through the tpm command.
shell> tools/tpm configure SERVICE \
    --java-tls-keystore-path=/etc/tungsten/tls.jks \
    --java-keystore-password=mykeystorepass
Then perform an update and replace the entire release directory:
shell> tools/tpm update --replace-release
Using the tpm update command, the general Continuent service encryption can be easily removed.
shell> tpm configure SERVICE \
    --thl-ssl=false \
    --rmi-ssl=false \
    --rmi-authentication=false
Then perform an update and replace the entire release directory:
shell> tpm update --replace-release
This section explains how to enable security between the database and various other parts of the topology, including:
Database server SSL
This is the first step, and the prerequisite for all the remaining steps. You must have the database server properly configured to support SSL before any of the other procedures will work.
Tungsten Replicator to the database server
This is usually the second step, and is what allows Tungsten Replicator to communicate securely with the database server.
See Section 6.10.2, “Configure Tungsten<>Database Secure Communication”
The steps outlined below explain how to enable security within MySQL (if it is not already enabled by default in the release you are using). There are different approaches depending on the version/distribution of MySQL you are using. If in any doubt, you should consult the appropriate documentation pages for the MySQL release you are using.
To enable Tungsten Replicator to communicate with Amazon Aurora, via SSL, the following simple steps can be followed.
Obtain the certificate from Amazon appropriate for the region in which your Aurora instance is hosted. More information can be found here.
Copy the file to the Tungsten Replicator host into a directory of your choice.
Add the following properties to your configuration. (In this example our certificate is within /opt/continuent/share. Adjust to suit your environment.)
property=replicator.datasource.global.connectionSpec.urlOptions=noPrepStmtCache=true&
» serverCertificate=/opt/continuent/share/rds-ca-2019-eu-west-1.pem
datasource-enable-ssl=true
You can now install, or if the replicator was already installed, issue an update.
If you choose to enable database level SSL within your MySQL installation, there are a number of additional steps required to allow the Tungsten Components to be able to communicate to the database layer.
The steps below make the following assumptions:
You have enabled SSL using the correct procedures for your distribution of MySQL. If not, refer to Section 6.10.1, “Enabling Database SSL”.
You have generated, and have access to, the client level certificates and keys
If you are installing an Offboard extractor/applier, the client certificates and keys have been copied to the extractor/applier hosts
If SSL has been enabled within the Tungsten installation, then you should either have the following parameter in your configuration, or it will be omitted altogether since security is enabled by default:
disable-security-controls=false
As a result, you should have a number of files within /opt/continuent/share:
shell> ls -l
total 20
-rw-rw-r-- 1 tungsten tungsten  104 Jul 18 10:15 jmxremote.access
-rw-rw-r-- 1 tungsten tungsten  729 Jul 18 10:15 passwords.store
-rw-rw-r-- 1 tungsten tungsten 2268 Jul 18 10:15 tungsten_keystore.jks
-rw-rw-r-- 1 tungsten tungsten 1079 Jul 18 10:15 tungsten_truststore.ts
If you do not have SSL enabled within the installation and you require this, then follow the steps in Section 6.1, “Enabling Security” first.
If you do not require SSL between the Replicators, and only require SSL between the replicator and the database, then add the following parameters to your configuration, but do not run tpm update yet.
java-truststore-path=/home/tungsten/tungsten_truststore.ts
java-truststore-password=tungsten
java-keystore-path=/home/tungsten/tungsten_keystore.jks
Next, add the following parameters to your installation, but do not run tpm update yet:
datasource-enable-ssl=true
You now need to convert the mysql client key to PKCS12 format. Adjust the path and filename in the example to suit your environment.
shell> openssl pkcs12 -export -in /home/tungsten/client-cert.pem \
    -inkey /home/tungsten/client-key.pem \
    -name mysql -out /home/tungsten/client-key.p12
When prompted for a password, you MUST enter tungsten.
When using OpenSSL 3.0 with Java 1.8, you MUST add the -legacy option to the openssl command.
You now need to import the key, either into the existing keystore if it exists, or into a new one if SSL is not being enabled at the replicator level.
If Tungsten level SSL has been enabled
shell> keytool -importkeystore -deststorepass tungsten \
-destkeystore /opt/continuent/share/tungsten_keystore.jks \
-srckeystore /home/tungsten/client-key.p12 -srcstoretype PKCS12
If ONLY Database SSL is required
shell> keytool -importkeystore -deststorepass tungsten \
-destkeystore /home/tungsten/tungsten_keystore.jks \
-srckeystore /home/tungsten/client-key.p12 -srcstoretype PKCS12
When prompted for a password, enter tungsten
Next, import the client certificate into the truststore
If Tungsten level SSL has been enabled
shell> keytool -import -alias mysql -trustcacerts -file /home/tungsten/ca.pem \
-keystore /opt/continuent/share/tungsten_truststore.ts
If ONLY Database SSL is required
shell> keytool -import -alias mysql -trustcacerts -file /home/tungsten/ca.pem \
-keystore /home/tungsten/tungsten_truststore.ts
When prompted for a password, enter tungsten
Finally, and only if Tungsten level SSL has been enabled, we need to create backup copies of the keystore and truststore as follows:
shell> cp /opt/continuent/share/tungsten_truststore.ts /opt/continuent/share/.tungsten_truststore.ts.orig
shell> cp /opt/continuent/share/tungsten_keystore.jks /opt/continuent/share/.tungsten_keystore.jks.orig
Issue tpm update to apply the configuration
The replicators will be restarted as part of the update process, and sho