Tungsten Replicator 7.1 Manual

Continuent Ltd

Abstract

This manual documents Tungsten Replicator 7.1. This includes information for:

  • Tungsten Replicator

Build date: 2024-10-10 (1408c26a)

Up-to-date builds of this document: Tungsten Replicator 7.1 Manual (Online), Tungsten Replicator 7.1 Manual (PDF)


Table of Contents

Preface
1. Legal Notice
2. Conventions
3. Quickstart Guide
1. Introduction
1.1. Tungsten Replicator
1.1.1. Extractor
1.1.2. Appliers
1.1.3. Transaction History Log (THL)
1.1.4. Filtering
2. Deployment Overview
2.1. Deployment Sources
2.1.1. Using the TAR/GZipped files
2.1.2. Using the RPM package files
2.2. Best Practices
2.2.1. Best Practices: Deployment
2.2.2. Best Practices: Upgrade
2.2.3. Best Practices: Operations
2.2.4. Best Practices: Maintenance
2.3. Common tpm Options During Deployment
2.4. Starting and Stopping Tungsten Replicator
2.5. Configuring Startup on Boot
2.6. Removing Datasources from a Deployment
2.6.1. Removing a Datasource from an Existing Deployment
2.7. Understanding Deployment Styles and Topologies
2.7.1. Tungsten Replicator Extraction Operation
2.7.2. Understanding Deployment Models
2.7.3. Understanding Deployment Topologies
2.7.3.1. Simple Primary/Replica Topology
2.7.3.2. Active/Active Topology
2.7.3.3. Fan-Out Topology
2.7.3.4. Fan-In Topology
2.7.3.5. Replicating in/out of an existing Tungsten Cluster
2.8. Understanding Heterogeneous Deployments
2.8.1. How Heterogeneous Replication Works
2.8.1.1. JDBC Applier based Replication
2.8.1.2. Native Applier Replication (e.g. MongoDB)
2.8.1.3. Batch Loading
2.8.1.4. Schema Creation and Replication
3. Deploying MySQL Extractors
3.1. MySQL Replication Pre-Requisites
3.2. Deploying a Primary/Replica Topology
3.2.1. Monitoring the MySQL Extractor
3.3. Deploying an Extractor for Amazon Aurora
3.3.1. Changing Amazon RDS/Aurora Instance Configurations
3.3.1.1. Changing Amazon RDS using command line functions
3.3.1.2. Changing Amazon Aurora Parameters using AWS Console
3.4. Replicating Data Out of a Cluster
3.4.1. Prepare: Replicating Data Out of a Cluster
3.4.2. Deploy: Replicating Data Out of a Cluster
4. Deploying Appliers
4.1. Deploying the MySQL Applier
4.1.1. Preparing for MySQL Replication
4.1.2. Prepare Amazon RDS/Amazon Aurora
4.1.3. Install MySQL Applier
4.1.3.1. Local and Remote MySQL Targets
4.1.3.2. Amazon RDS and Amazon Aurora Targets
4.1.4. Management and Monitoring of MySQL Deployments
4.2. Deploying the Amazon Redshift Applier
4.2.1. Redshift Replication Operation
4.2.2. Preparing for Amazon Redshift Replication
4.2.2.1. Redshift Preparation for Amazon Redshift Deployments
4.2.2.2. Configuring Identity Access Management within AWS
4.2.2.3. Amazon Redshift DDL Generation for Amazon Redshift Deployments
4.2.2.4. Handling Concurrent Writes from Multiple Appliers
4.2.3. Install Amazon Redshift Applier
4.2.4. Verifying your Redshift Installation
4.2.5. Keeping CDC Information
4.2.6. Management and Monitoring of Amazon Redshift Deployments
4.3. Deploying the Vertica Applier
4.3.1. Preparing for Vertica Deployments
4.3.2. Install Vertica Applier
4.3.3. Management and Monitoring of Vertica Deployments
4.3.4. Troubleshooting Vertica Installations
4.4. Deploying the Kafka Applier
4.4.1. Preparing for Kafka Replication
4.4.2. Install Kafka Applier
4.4.2.1. Optional Configuration Parameters for Kafka
4.4.3. Management and Monitoring of Kafka Deployments
4.5. Deploying the MongoDB Applier
4.5.1. MongoDB Atlas Replication
4.5.2. Preparing for MongoDB Replication
4.5.3. Install MongoDB Applier
4.5.4. Install MongoDB Atlas Applier
4.5.4.1. Import MongoDB Atlas Certificates
4.5.5. Management and Monitoring of MongoDB Deployments
4.6. Deploying the Hadoop Applier
4.6.1. Hadoop Replication Operation
4.6.2. Preparing for Hadoop Replication
4.6.2.1. Hadoop Host
4.6.2.2. Schema Generation
4.6.3. Replicating into Kerberos Secured HDFS
4.6.4. Install Hadoop Replication
4.6.4.1. Applier Replicator Service
4.6.4.2. Generating Materialized Views
4.6.4.3. Accessing Generated Tables in Hive
4.6.4.4. Management and Monitoring of Hadoop Deployments
4.6.4.5. Troubleshooting Hadoop Replication
4.7. Deploying the Oracle Applier
4.7.1. Preparing for Oracle Replication
4.7.1.1. Additional Prerequisites for Oracle Targets
4.7.1.2. Configure the Oracle database
4.7.1.3. Create the Destination Schema
4.7.2. Install Oracle Applier
4.8. Deploying the PostgreSQL Applier
4.8.1. Preparing for PostgreSQL Replication
4.8.1.1. PostgreSQL Database Setup
4.8.2. Install PostgreSQL Applier
4.8.3. Management and Monitoring of PostgreSQL Deployments
4.9. Deploying the Amazon S3 CSV Applier
4.9.1. S3 Replication Operation
4.9.2. Preparing for Amazon S3 Replication
4.9.3. Install Amazon S3 Applier
5. Deployment: Advanced
5.1. Deploying the Replicator using the AWS Marketplace AMI
5.1.1. Prepare Source/Target database instances
5.1.2. Launch and Configure AMI
5.2. Deploying a Fan-In Topology
5.2.1. Management and Monitoring of Fan-In Deployments
5.3. Deploying Multiple Replicators on a Single Host
5.3.1. Preparing Multiple Replicators
5.3.2. Install Multiple Replicators
5.3.3. Best Practices: Multiple Replicators
5.4. Replicating Data Into an Existing Dataservice
5.5. Deploying Parallel Replication
5.5.1. Application Prerequisites for Parallel Replication
5.5.2. Enabling Parallel Apply During Install
5.5.3. Channels
5.5.4. Parallel Replication and Offline Operation
5.5.4.1. Clean Offline Operation
5.5.4.2. Tuning the Time to Go Offline Cleanly
5.5.4.3. Unclean Offline
5.5.5. Adjusting Parallel Replication After Installation
5.5.5.1. How to Enable Parallel Apply After Installation
5.5.5.2. How to Change Channels Safely
5.5.5.3. How to Disable Parallel Replication Safely
5.5.5.4. How to Switch Parallel Queue Types Safely
5.5.6. Monitoring Parallel Replication
5.5.6.1. Useful Commands for Monitoring Parallel Replication
5.5.6.2. Parallel Replication and Applied Latency On Replicas
5.5.6.3. Relative Latency
5.5.6.4. Serialization Count
5.5.6.5. Maximum Offline Interval
5.5.6.6. Workload Distribution
5.5.7. Controlling Assignment of Shards to Channels
5.5.8. Disk vs. Memory Parallel Queues
5.6. Batch Loading for Data Warehouses
5.6.1. How It Works
5.6.2. Important Limitations
5.6.3. Batch Applier Setup
5.6.4. JavaScript Batchloader Scripts
5.6.4.1. JavaScript Batchloader with Parallel Apply
5.6.5. Staging Tables
5.6.5.1. Staging Table Names
5.6.5.2. Whole Record Staging
5.6.5.3. Delete Key Staging
5.6.5.4. Staging Table Generation
5.6.6. Character Sets
5.6.7. Supported CSV Formats
5.6.8. Columns in Generated CSV Files
5.6.9. Batchloading Opcodes
5.6.10. Time Zones
5.6.11. Batch Loading into MySQL
5.6.11.1. Configuring as an Offboard Batch Applier
5.6.11.2. Drop Delete Statements
5.6.11.3. Configure CHARSET to use on Load
5.6.11.4. Allow DDL Statements to execute
5.6.11.5. Disable Foreign Keys during load
5.6.11.6. Log rows violating Primary/Unique Keys
5.6.12. Data File Partitioning
6. Deployment: Security
6.1. Enabling Security
6.1.1. Enabling Security using the Staging Method
6.1.2. Enabling Security using the INI Method
6.2. Disabling Security
6.3. Creating Suitable Certificates
6.3.1. Creating Tungsten Internal Certificates Using tpm cert
6.3.2. Creating Tungsten Internal Certificates Manually
6.4. Installing from a Staging Host with Custom Certificates
6.4.1. Installing from a Staging Host with Manually-Generated Certificates
6.4.2. Installing from a Staging Host with Certificates Generated by tpm cert
6.5. Installing via INI File with Custom Certificates
6.5.1. Installing via INI File with Manually-Generated Certificates
6.5.2. Installing via INI File with Certificates Generated by tpm cert
6.6. Installing via INI File with CA-Signed Certificates
6.7. Replacing the JGroups Certificate from a Staging Directory
6.8. Replacing the TLS Certificate from a Staging Directory
6.9. Removing TLS Encryption from a Staging Directory
6.10. Enabling Tungsten<>Database Security
6.10.1. Enabling Database SSL
6.10.1.1. Generate the Database Certs
6.10.1.2. Common Steps for Enabling Database SSL
6.10.1.3. Enabling Database Level SSL with Amazon AWS Aurora
6.10.2. Configure Tungsten<>Database Secure Communication
7. Operations Guide
7.1. The Home Directory
7.2. Establishing the Shell Environment
7.3. Understanding Replicator Roles
7.4. Checking Replication Status
7.4.1. Understanding Replicator States
7.4.2. Replicator States During Operations
7.4.3. Changing Replicator States
7.5. Managing Transaction Failures
7.5.1. Identifying a Transaction Mismatch
7.5.2. Skipping Transactions
7.6. Provision or Reprovision a Replica
7.7. Creating a Backup
7.7.1. Using a Different Backup Tool
7.7.2. Using a Different Directory Location
7.7.3. Creating an External Backup
7.8. Restoring a Backup
7.8.1. Restoring a Specific Backup
7.8.2. Restoring an External Backup
7.8.3. Restoring from Another Replica
7.8.4. Manually Recovering from Another Replica
7.8.5. Reprovision a MySQL Replica using rsync
7.9. Deploying Automatic Replicator Recovery
7.10. Migrating and Seeding Data
7.10.1. Migrating from MySQL Native Replication 'In-Place'
7.10.2. Seeding Data for Heterogeneous Replication
7.10.2.1. Seeding Data from a Standalone Source
7.10.2.2. Seeding Data from a Cluster, for a Cluster-Extractor Target
7.11. Switching Primary Hosts
7.12. Configuring Parallel Replication
7.13. Performing Database or OS Maintenance
7.13.1. Performing Maintenance on a Single Replica
7.13.2. Performing Maintenance on a Primary
7.13.3. Performing Maintenance on an Entire Dataservice
7.13.4. Upgrading or Updating your JVM
7.14. Upgrading Tungsten Replicator
7.14.1. Upgrading Tungsten Replicator using tpm
7.14.2. Installing an Upgraded JAR Patch
7.14.3. Installing Patches
7.14.4. Upgrading to v7.0.0+
7.14.4.1. Background
7.14.4.2. Upgrade Decisions
7.14.4.3. Setup internal encryption and authentication
7.14.4.4. Enable Tungsten to Database Encryption
7.14.4.5. Enable MySQL SSL
7.14.4.6. Steps to upgrade using tpm
7.14.4.7. Optional Post-Upgrade steps to configure API
7.15. Monitoring Tungsten Cluster
7.15.1. Managing Log Files with logrotate
7.15.2. Monitoring Status Using cacti
7.15.3. Monitoring Status Using nagios
7.15.4. Monitoring Status Using Prometheus Exporters
7.15.4.1. Monitoring Status with Exporters Overview
7.15.4.2. Customizing the Prometheus Exporter Configuration
7.15.4.3. Disabling the Prometheus Exporters
7.15.4.4. Managing and Testing Exporters Using the tmonitor Command
7.15.4.5. Monitoring Node Status Using the External node_exporter
7.15.4.6. Monitoring MySQL Server Status Using the External mysqld_exporter
7.15.4.7. Monitoring Tungsten Replicator Status Using the Built-In Exporter
7.16. Rebuilding THL on the Primary
7.17. THL Encryption and Compression
7.17.1. In-Flight Compression
7.17.2. Encryption and Compression On-Disk
8. Command-line Tools
8.1. The clean_release_directory Command
8.2. The check_tungsten_latency Command
8.3. The check_tungsten_online Command
8.4. The check_tungsten_services Command
8.5. The deployall Command
8.6. The ddlscan Command
8.6.1. Optional Arguments
8.6.2. Supported Templates and Usage
8.6.2.1. ddl-check-pkeys.vm
8.6.2.2. ddl-mysql-hive-0.10.vm
8.6.2.3. ddl-mysql-hive-0.10-staging.vm
8.6.2.4. ddl-mysql-hive-metadata.vm
8.6.2.5. ddl-mysql-oracle.vm
8.6.2.6. ddl-mysql-oracle-cdc.vm
8.6.2.7. ddl-mysql-redshift.vm
8.6.2.8. ddl-mysql-redshift-staging.vm
8.6.2.9. ddl-mysql-vertica.vm
8.6.2.10. ddl-mysql-vertica-staging.vm
8.6.2.11. ddl-oracle-mysql.vm
8.6.2.12. ddl-oracle-mysql-pk-only.vm
8.7. The dsctl Command
8.7.1. dsctl get Command
8.7.2. dsctl set Command
8.7.3. dsctl reset Command
8.7.4. dsctl help Command
8.8. env.sh Script
8.9. The load-reduce-check Tool
8.9.1. Generating Staging DDL
8.9.2. Generating Live DDL
8.9.3. Materializing a View
8.9.4. Generating Sqoop Load Commands
8.9.5. Generating Metadata
8.9.6. Compare Loaded Data
8.10. The materialize Command
8.11. The tungsten_merge_logs Script
8.12. The multi_trepctl Command
8.12.1. multi_trepctl Options
8.12.2. multi_trepctl Commands
8.12.2.1. multi_trepctl backups Command
8.12.2.2. multi_trepctl heartbeat Command
8.12.2.3. multi_trepctl masterof Command
8.12.2.4. multi_trepctl list Command
8.12.2.5. multi_trepctl run Command
8.13. The tungsten_newrelic_event Command
8.14. The query Command
8.15. The replicator Command
8.16. The startall Command
8.17. The stopall Command
8.18. The tapi Command
8.19. The thl Command
8.19.1. thl Position Commands
8.19.2. thl dsctl Command
8.19.3. thl list Command
8.19.4. thl tail Command
8.19.5. thl index Command
8.19.6. thl purge Command
8.19.7. thl info Command
8.19.8. thl help Command
8.20. The trepctl Command
8.20.1. trepctl Options
8.20.2. trepctl Global Commands
8.20.2.1. trepctl kill Command
8.20.2.2. trepctl services Command
8.20.2.3. trepctl servicetable Command
8.20.2.4. trepctl thl Command
8.20.2.5. trepctl version Command
8.20.3. trepctl Service Commands
8.20.3.1. trepctl backup Command
8.20.3.2. trepctl capabilities Command
8.20.3.3. trepctl check Command
8.20.3.4. trepctl clear Command
8.20.3.5. trepctl clients Command
8.20.3.6. trepctl error Command
8.20.3.7. trepctl flush Command
8.20.3.8. trepctl heartbeat Command
8.20.3.9. trepctl load Command
8.20.3.10. trepctl offline Command
8.20.3.11. trepctl offline-deferred Command
8.20.3.12. trepctl online Command
8.20.3.13. trepctl pause Command
8.20.3.14. trepctl perf Command
8.20.3.15. trepctl properties Command
8.20.3.16. trepctl purge Command
8.20.3.17. trepctl qs Command
8.20.3.18. trepctl reset Command
8.20.3.19. trepctl restore Command
8.20.3.20. trepctl resume Command
8.20.3.21. trepctl setdynamic Command
8.20.3.22. trepctl setrole Command
8.20.3.23. trepctl shard Command
8.20.3.24. trepctl status Command
8.20.3.25. trepctl unload Command
8.20.3.26. trepctl wait Command
8.21. The tmonitor Command
8.22. The tpasswd Command
8.23. The tprovision Script
8.24. The tungsten_get_mysql_datadir Script
8.25. The tungsten_get_ports Script
8.26. The tungsten_health_check Script
8.27. The tungsten_monitor Script
8.28. The tungsten_mysql_ssl_setup Script
8.29. The tungsten_prep_upgrade Script
8.30. The tungsten_provision_thl Command
8.30.1. Provisioning from RDS
8.30.2. tungsten_provision_thl Reference
8.31. The tungsten_purge_thl Command
8.32. The tungsten_read_master_events Script
8.33. The tungsten_send_diag Script
8.34. The tungsten_skip_seqno Script
8.35. The tungsten_skip_all Command
8.36. The undeployall Command
9. The tpm Deployment Command
9.1. Comparing Staging and INI tpm Methods
9.2. Processing Installs and Upgrades
9.3. tpm Staging Configuration
9.3.1. Configuring default options for all services
9.3.2. Configuring a single service
9.3.3. Configuring a single host
9.3.4. Reviewing the current configuration
9.3.5. Installation
9.3.5.1. Installing a set of specific services
9.3.5.2. Installing a set of specific hosts
9.3.6. Upgrades from a Staging Directory
9.3.7. Configuration Changes from a Staging Directory
9.3.8. Converting from INI to Staging
9.4. tpm INI File Configuration
9.4.1. Creating an INI file
9.4.2. Installation with INI File
9.4.3. Upgrades with an INI File
9.4.4. Configuration Changes with an INI file
9.4.5. Converting from Staging to INI
9.4.6. Using the translatetoini.pl Script
9.5. tpm Commands
9.5.1. tpm ask Command
9.5.2. tpm cert Command
9.5.2. Introduction
9.5.2.1. tpm cert Usage
9.5.2.2. tpm cert {typeSpec}, Defined
9.5.2.3. {typeSpec} definitions
9.5.2.4. {passwordSpec} definitions
9.5.2.5. tpm cert: Getting Started - Basic Examples
9.5.2.6. tpm cert: Getting Started - Functional Database Cert Rotation Example
9.5.2.7. tpm cert: Getting Started - Conversion to Custom-Generated Security Files Example
9.5.2.8. tpm cert: Getting Started - Advanced Example
9.5.2.9. Using tpm cert add
9.5.2.10. Using tpm cert aliases
9.5.2.11. Using tpm cert ask
9.5.2.12. Using tpm cert backup
9.5.2.13. Using tpm cert cat
9.5.2.14. Using tpm cert changepass
9.5.2.15. Using tpm cert clean
9.5.2.16. Using tpm cert copy
9.5.2.17. Using tpm cert diff
9.5.2.18. Using tpm cert example
9.5.2.19. Using tpm cert info
9.5.2.20. Using tpm cert list
9.5.2.21. Using tpm cert gen
9.5.2.22. Using tpm cert remove
9.5.2.23. Using tpm cert rotate
9.5.2.24. Using tpm cert vi
9.5.3. tpm check Command
9.5.3.1. tpm check ini Command
9.5.3.2. tpm check ports Command
9.5.4. tpm configure Command
9.5.5. tpm delete-service Command
9.5.6. tpm diag Command
9.5.7. tpm fetch Command
9.5.8. tpm firewall Command
9.5.9. tpm help Command
9.5.10. tpm install Command
9.5.11. tpm keep Command
9.5.12. tpm mysql Command
9.5.13. tpm post-process Command
9.5.14. tpm purge-thl Command
9.5.15. tpm query Command
9.5.15.1. tpm query config
9.5.15.2. tpm query dataservices
9.5.15.3. tpm query deployments
9.5.15.4. tpm query manifest
9.5.15.5. tpm query modified-files
9.5.15.6. tpm query staging
9.5.15.7. tpm query version
9.5.16. tpm report Command
9.5.17. tpm reset Command
9.5.18. tpm reset-thl Command
9.5.19. tpm reverse Command
9.5.20. tpm uninstall Command
9.5.21. tpm update Command
9.5.22. tpm validate Command
9.5.23. tpm validate-update Command
9.6. tpm Common Options
9.7. tpm Validation Checks
9.8. tpm Configuration Options
9.8.1. A tpm Options
9.8.2. B tpm Options
9.8.3. C tpm Options
9.8.4. D tpm Options
9.8.5. E tpm Options
9.8.6. F tpm Options
9.8.7. H tpm Options
9.8.8. I tpm Options
9.8.9. J tpm Options
9.8.10. L tpm Options
9.8.11. M tpm Options
9.8.12. N tpm Options
9.8.13. O tpm Options
9.8.14. P tpm Options
9.8.15. R tpm Options
9.8.16. S tpm Options
9.8.17. T tpm Options
9.8.18. U tpm Options
9.8.19. V tpm Options
9.8.20. W tpm Options
10. Tungsten REST API (APIv2)
10.1. Getting Started with Tungsten REST API
10.1.1. Configuring the API
10.1.1.1. Network Ports
10.1.1.2. User Management
10.1.1.3. SSL/Encryption
10.1.1.4. Enabling and Disabling the API
10.1.2. How to Access the API
10.1.2.1. CURL calls and Examples
10.1.2.2. tapi
10.1.2.3. External Tools
10.1.3. Data Structures
10.1.3.1. Generic Payloads
10.1.3.2. INPUT and OUTPUT payloads
10.1.3.3. TAPI Datastructures
10.2. Replicator API Specifics
10.2.1. Replicator Endpoints
10.2.1.1. services
10.2.1.2. status
10.2.1.3. version
10.2.1.4. offline/online
10.2.1.5. purge
10.2.1.6. reset
10.2.2. Service Endpoints
10.2.2.1. backupCapabilities
10.2.2.2. backups
10.2.2.3. backup / restore
10.2.2.4. setrole
10.2.3. Service THL Endpoints
10.2.3.1. compression / encryption
10.2.3.2. genkey
11. Replication Filters
11.1. Enabling/Disabling Filters
11.2. Enabling Additional Filters
11.3. Filter Status
11.4. Filter Reference
11.4.1. ansiquotes.js Filter
11.4.2. BidiRemoteSlave (BidiSlave) Filter
11.4.3. breadcrumbs.js Filter
11.4.4. CaseTransform Filter
11.4.5. ColumnName Filter
11.4.6. ConvertStringFromMySQL Filter
11.4.7. DatabaseTransform (dbtransform) Filter
11.4.8. dbrename.js Filter
11.4.9. dbselector.js Filter
11.4.10. dbupper.js Filter
11.4.11. dropcolumn.js Filter
11.4.12. dropcomments.js Filter
11.4.13. dropddl.js Filter
11.4.14. dropmetadata.js Filter
11.4.15. droprow.js Filter
11.4.16. dropstatementdata.js Filter
11.4.17. dropsqlmode.js Filter
11.4.18. dropxa.js Filter
11.4.19. Dummy Filter
11.4.20. EnumToString Filter
11.4.21. EventMetadata Filter
11.4.22. foreignkeychecks.js Filter
11.4.23. Heartbeat Filter
11.4.24. insertsonly.js Filter
11.4.25. Logging Filter
11.4.26. maskdata.js Filter
11.4.27. MySQLSessionSupport (mysqlsessions) Filter
11.4.28. mapcharset Filter
11.4.29. NetworkClient Filter
11.4.29.1. Network Client Configuration
11.4.29.2. Network Filter Protocol
11.4.29.3. Sample Network Client
11.4.30. nocreatedbifnotexists.js Filter
11.4.31. OptimizeUpdates Filter
11.4.32. PrimaryKey Filter
11.4.32.1. Setting Custom Primary Key Definitions
11.4.33. PrintEvent Filter
11.4.34. Rename Filter
11.4.34.1. Rename Filter Examples
11.4.35. Replicate Filter
11.4.36. ReplicateColumns Filter
11.4.37. Row Add Database Name Filter
11.4.38. Row Add Transaction Info Filter
11.4.39. SetToString Filter
11.4.40. Shard Filter
11.4.41. shardbyrules.js Filter
11.4.42. shardbyseqno.js Filter
11.4.43. shardbytable.js Filter
11.4.44. SkipEventByType Filter
11.4.45. TimeDelay (delay) Filter
11.4.46. TimeDelayMsFilter (delayInMS) Filter
11.4.47. tosingledb.js Filter
11.4.48. truncatetext.js Filter
11.4.49. zerodate2null.js Filter
11.5. Standard JSON Filter Configuration
11.5.1. Rule Handling and Processing
11.5.2. Schema, Table, and Column Selection
11.6. JavaScript Filters
11.6.1. Writing JavaScript Filters
11.6.1.1. Implementable Functions
11.6.1.2. Getting Configuration Parameters
11.6.1.3. Logging Information and Exceptions
11.6.1.4. Exposed Data Structures
11.6.2. Installing Custom JavaScript Filters
11.6.2.1. Step 1: Copy JavaScript files
11.6.2.2. Step 2: Create Template Files
11.6.2.3. Step 3: (Optional) Copy json files
11.6.2.4. Step 4: Update Configuration
12. Performance and Tuning
12.1. Block Commit
12.1.1. Monitoring Block Commit Status
12.2. Improving Network Performance
12.3. Tungsten Replicator Block Commit and Memory Usage
A. Release Notes
A.1. Tungsten Replicator 7.1.4 GA (1 Oct 2024)
A.2. Tungsten Replicator 7.1.3 GA (25 Jun 2024)
A.3. Tungsten Replicator 7.1.2 GA (3 Apr 2024)
A.4. Tungsten Replicator 7.1.1 GA (15 Dec 2023)
A.5. Tungsten Replicator 7.1.0 GA (16 Aug 2023)
B. Prerequisites
B.1. Requirements
B.1.1. Operating Systems Support
B.1.2. Database Support
B.1.2. Version Support Matrix
B.1.2. MySQL "Innovation" Releases
B.1.3. RAM Requirements
B.1.4. Disk Requirements
B.1.5. Java Requirements
B.1.6. Cloud Deployment Requirements
B.1.7. Docker Support Policy
B.1.7.1. Overview
B.1.7.2. Background
B.1.7.3. Current State
B.1.7.4. Summary
B.2. Staging Host Configuration
B.3. Host Configuration
B.3.1. Operating System Version Support
B.3.2. Creating the User Environment
B.3.3. Configuring Network and SSH Environment
B.3.3.1. Network Ports
B.3.3.2. SSH Configuration
B.3.4. Directory Locations and Configuration
B.3.5. Configure Software
B.3.6. sudo Configuration
B.3.7. SELinux Configuration
B.4. MySQL Database Setup
B.4.1. MySQL Version Support
B.4.2. MySQL Configuration
B.4.3. MySQL Configuration for Active/Active Deployments
B.4.4. MySQL Configuration for Heterogeneous Deployments
B.4.5. MySQL User Configuration
B.4.6. MySQL Unprivileged Users
B.5. Prerequisite Checklist
C. Troubleshooting
C.1. Contacting Support
C.1.1. Support Request Procedure
C.1.2. Creating a Support Account
C.1.3. Open a Support Ticket
C.1.4. Open a Support Ticket via Email
C.1.5. Getting Updates for all Company Support Tickets
C.1.6. Support Severity Level Definitions
C.2. Support Tools
C.2.1. Generating Diagnostic Information
C.2.2. Generating Advanced Diagnostic Information
C.2.3. Using tungsten_upgrade_manager
C.3. Error/Cause/Solution
C.3.1. MySQLExtractException: unknown data type 0
C.3.2. Services requires a reset
C.3.3. OptimizeUpdatesFilter cannot filter, because column and key count is different. Make sure that it is defined before filters which remove keys (eg. PrimaryKeyFilter)
C.3.4. Unable to update the configuration of an installed directory
C.3.5. Too many open processes or files
C.3.6. There were issues configuring the sandbox MySQL server
C.3.7. Unexpected failure while extracting event
C.3.8. Attempt to write new log record with equal or lower fragno: seqno=3 previous stored fragno=32767 attempted new fragno=-32768
C.3.9. The session variable SQL_MODE when set to include ALLOW_INVALID_DATES does not apply statements correctly on the Replica.
C.3.10. Replicator runs out of memory
C.4. Known Issues
C.4.1. Triggers
C.5. Troubleshooting Timeouts
C.6. Troubleshooting Backups
C.7. Running Out of Disk Space
C.8. Troubleshooting SSH and tpm
C.9. Troubleshooting Data Differences
C.9.1. Identify Structural Differences
C.9.2. Identify Data Differences
C.10. Comparing Table Data
C.11. Troubleshooting Memory Usage
D. Files, Directories, and Environment
D.1. The Tungsten Cluster Install Directory
D.1.1. The backups Directory
D.1.1.1. Automatically Deleting Backup Files
D.1.1.2. Manually Deleting Backup Files
D.1.1.3. Copying Backup Files
D.1.1.4. Relocating Backup Storage
D.1.2. The releases Directory
D.1.3. The service_logs Directory
D.1.4. The share Directory
D.1.5. The thl Directory
D.1.5.1. Purging THL Log Information on a Replica
D.1.5.2. Purging THL Log Information on a Primary
D.1.5.3. Moving the THL File Location
D.1.5.4. Changing the THL Retention Times
D.1.6. The tungsten Directory
D.1.6.1. The tungsten-replicator Directory
D.2. Log Files
D.3. Environment Variables
E. Terminology Reference
E.1. Transaction History Log (THL)
E.1.1. THL Format
E.2. Generated Field Reference
E.2.1. Terminology: Fields masterConnectUri
E.2.2. Terminology: Fields masterListenUri
E.2.3. Terminology: Fields accessFailures
E.2.4. Terminology: Fields active
E.2.5. Terminology: Fields activeSeqno
E.2.6. Terminology: Fields appliedLastEventId
E.2.7. Terminology: Fields appliedLastSeqno
E.2.8. Terminology: Fields appliedLatency
E.2.9. Terminology: Fields applier.class
E.2.10. Terminology: Fields applier.name
E.2.11. Terminology: Fields applyTime
E.2.12. Terminology: Fields autoRecoveryEnabled
E.2.13. Terminology: Fields autoRecoveryTotal
E.2.14. Terminology: Fields averageBlockSize
E.2.15. Terminology: Fields blockCommitRowCount
E.2.16. Terminology: Fields cancelled
E.2.17. Terminology: Fields channel
E.2.18. Terminology: Fields channels
E.2.19. Terminology: Fields clusterName
E.2.20. Terminology: Fields commits
E.2.21. Terminology: Fields committedMinSeqno
E.2.22. Terminology: Fields criticalPartition
E.2.23. Terminology: Fields currentBlockSize
E.2.24. Terminology: Fields currentEventId
E.2.25. Terminology: Fields currentLastEventId
E.2.26. Terminology: Fields currentLastFragno
E.2.27. Terminology: Fields currentLastSeqno
E.2.28. Terminology: Fields currentTimeMillis
E.2.29. Terminology: Fields dataServerHost
E.2.30. Terminology: Fields discardCount
E.2.31. Terminology: Fields doChecksum
E.2.32. Terminology: Fields estimatedOfflineInterval
E.2.33. Terminology: Fields eventCount
E.2.34. Terminology: Fields extensions
E.2.35. Terminology: Fields extractTime
E.2.36. Terminology: Fields extractor.class
E.2.37. Terminology: Fields extractor.name
E.2.38. Terminology: Fields filter.#.class
E.2.39. Terminology: Fields filter.#.name
E.2.40. Terminology: Fields filterTime
E.2.41. Terminology: Fields flushIntervalMillis
E.2.42. Terminology: Fields fsyncOnFlush
E.2.43. Terminology: Fields headSeqno
E.2.44. Terminology: Fields intervalGuard
E.2.45. Terminology: Fields lastCommittedBlockSize
E.2.46. Terminology: Fields lastCommittedBlockTime
E.2.47. Terminology: Fields latestEpochNumber
E.2.48. Terminology: Fields logConnectionTimeout
E.2.49. Terminology: Fields logDir
E.2.50. Terminology: Fields logFileRetainMillis
E.2.51. Terminology: Fields logFileSize
E.2.52. Terminology: Fields maxChannel
E.2.53. Terminology: Fields maxDelayInterval
E.2.54. Terminology: Fields maxOfflineInterval
E.2.55. Terminology: Fields maxSize
E.2.56. Terminology: Fields maximumStoredSeqNo
E.2.57. Terminology: Fields minimumStoredSeqNo
E.2.58. Terminology: Fields name
E.2.59. Terminology: Fields offlineRequests
E.2.60. Terminology: Fields otherTime
E.2.61. Terminology: Fields pendingError
E.2.62. Terminology: Fields pendingErrorCode
E.2.63. Terminology: Fields pendingErrorEventId
E.2.64. Terminology: Fields pendingErrorSeqno
E.2.65. Terminology: Fields pendingExceptionMessage
E.2.66. Terminology: Fields pipelineSource
E.2.67. Terminology: Fields processedMinSeqno
E.2.68. Terminology: Fields queues
E.2.69. Terminology: Fields readOnly
E.2.70. Terminology: Fields relativeLatency
E.2.71. Terminology: Fields resourcePrecedence
E.2.72. Terminology: Fields rmiPort
E.2.73. Terminology: Fields role
E.2.74. Terminology: Fields seqnoType
E.2.75. Terminology: Fields serializationCount
E.2.76. Terminology: Fields serialized
E.2.77. Terminology: Fields serviceName
E.2.78. Terminology: Fields serviceType
E.2.79. Terminology: Fields shard_id
E.2.80. Terminology: Fields simpleServiceName
E.2.81. Terminology: Fields siteName
E.2.82. Terminology: Fields sourceId
E.2.83. Terminology: Fields stage
E.2.84. Terminology: Fields started
E.2.85. Terminology: Fields state
E.2.86. Terminology: Fields stopRequested
E.2.87. Terminology: Fields store.#
E.2.88. Terminology: Fields storeClass
E.2.89. Terminology: Fields syncInterval
E.2.90. Terminology: Fields taskCount
E.2.91. Terminology: Fields taskId
E.2.92. Terminology: Fields timeInCurrentEvent
E.2.93. Terminology: Fields timeInStateSeconds
E.2.94. Terminology: Fields timeoutMillis
E.2.95. Terminology: Fields totalAssignments
E.2.96. Terminology: Fields transitioningTo
E.2.97. Terminology: Fields uptimeSeconds
E.2.98. Terminology: Fields version
F. Internals
F.1. Extending Backup and Restore Behavior
F.1.1. Backup Behavior
F.1.2. Restore Behavior
F.1.3. Writing a Custom Backup/Restore Script
F.1.4. Enabling a Custom Backup Script
F.2. Character Sets in Database and Tungsten Cluster
F.3. Understanding Replication of Date/Time Values
F.3. Best Practices
F.4. Memory Tuning and Performance
F.4.1. Understanding Tungsten Replicator Memory Tuning
F.5. Tungsten Replicator Pipelines and Stages
F.6. Tungsten Cluster Schemas
G. Frequently Asked Questions (FAQ)
H. Ecosystem Support
H.1. Continuent Github Repositories
I. Configuration Property Reference

List of Figures

2.1. Internals: MySQL Extraction
2.2. Internals: Amazon Aurora/Remote Database, Offboard Extraction
2.3. Topologies: Primary/Replica
2.4. Topologies: Active/Active
2.5. Topologies: Fan-Out
2.6. Topologies: Fan-In
2.7. Topologies: Cluster-Extractor
3.1. Topologies: Primary/Replica
3.2. Topologies: Aurora Extraction
3.3. Fig 1. AWS Config
3.4. Fig 2. AWS Config
3.5. Fig 3. AWS Config
3.6. Fig 4. AWS Config
3.7. Fig 5. AWS Config
3.8. Fig 6. AWS Config
3.9. Fig 7. AWS Config
3.10. Topologies: Replicating Data Out of a Cluster
4.1. Topologies: Replicating to MySQL
4.2. Topologies: Replicating to Amazon Redshift
4.3. Topologies: Redshift Replication Operation
4.4. Topologies: Replicating to Vertica
4.5. Topologies: Replicating to Kafka
4.6. Topologies: Replicating to MongoDB
4.7. Topologies: Replicating to Hadoop
4.8. Topologies: Hadoop Replication Operation
4.9. Topologies: Replicating to Oracle
4.10. Topologies: Replicating to PostgreSQL
5.1. Topologies: Fan-in
5.2. Topologies: Replicating into a Dataservice
5.3. Batchloading: JavaScript
6.1. Security Internals: Cluster Communication Channels
7.1. Cacti Monitoring: Example Graphs
9.1. tpm Staging Based Deployment
9.2. tpm INI Based Deployment
9.3. Internals: Cluster Communication Channels
11.1. Filters: Pipeline Stages on Extractors
11.2. Filters: Pipeline Stages on Appliers
B.1. Tungsten Deployment

List of Tables

1.1. Supported Extractors
1.2. Supported Appliers
2.1. Key Terminology
4.1. Optional Kafka Applier Properties
4.2. Hadoop Replication Directory Locations
4.3. Data Type differences when replicating data from MySQL to Oracle
5.1. Continuent Tungsten Directory Structure
8.1. check_tungsten_latency Options
8.2. check_tungsten_online Options
8.3. check_tungsten_services Options
8.4. ddlscan Command-line Options
8.5. ddlscan Supported Templates
8.6. dsctl Commands
8.7. dsctl Command-line Options
8.8. dsctl Command-line Options
8.9. dsctl Command-line Options
8.10. tungsten_merge_logs Command-line Options
8.11. multi_trepctl Command-line Options
8.12. multi_trepctl --output Option
8.13. multi_trepctl Commands
8.14. tungsten_monitor Command-line Options
8.15. query Common Options
8.16. replicator Commands
8.17. replicator Commands Options for condrestart
8.18. replicator Commands Options for console
8.19. replicator Commands Options for restart
8.20. replicator Commands Options for start
8.21. tapi Generic Options
8.22. tapi CURL-related Options
8.23. tapi Nagios/NRPE/Zabbix-related Options
8.24. tapi Admin-related Options
8.25. tapi Filter-related Options
8.26. tapi API-related Options
8.27. tapi Status-related Options
8.28. tapi Backup and Restore-related Options
8.29. thl Options
8.30. trepctl Command-line Options
8.31. trepctl Replicator Wide Commands
8.32. trepctl Service Commands
8.33. trepctl backup Command Options
8.34. trepctl clients Command Options
8.35. trepctl offline-deferred Command Options
8.36. trepctl online Command Options
8.37. trepctl pause Command Options
8.38. trepctl purge Command Options
8.39. trepctl reset Command Options
8.40. trepctl resume Command Options
8.41. trepctl setdynamic Command Options
8.42. trepctl setrole Command Options
8.43. trepctl shard Command Options
8.44. trepctl status Command Options
8.45. trepctl wait Command Options
8.46. tmonitor Common Options
8.47. tpasswd Common Options
8.48. tprovision Command-line Options
8.49. tungsten_get_mysql_datadir Command-line Options
8.50. tungsten_get_ports Options
8.51. tungsten_health_check Command-line Options
8.52. tungsten_monitor Command-line Options
8.53. tungsten_prep_upgrade Command-line Options
8.54. tungsten_purge_thl Options
8.55. tungsten_read_master_events Command-line Options
8.56. tungsten_send_diag Command-line Options
8.57. tungsten_skip_seqno Command-line Options
8.58. tungsten_skip_all Options
9.1. TPM Deployment Methods
9.2. tpm Core Options
9.3. tpm Commands
9.4. tpm ask Common Options
9.5. tpm cert Read-Only Actions
9.6. tpm cert Write Actions
9.7. tpm cert Arguments
9.8. Convenience tags
9.9. typeSpecs for tpm cert ask
9.10. typeSpecs for tpm cert example
9.11. typeSpecs for tpm cert gen
9.12. typeSpecs for tpm cert vi
9.13. Options for tpm check
9.14. Options for tpm check ini
9.15. Options for tpm check ports
9.16. tpm delete-service Common Options
9.17. tpm keep Options
9.18. tpm post-process Options
9.19. tpm purge-thl Options
9.20. tpm report Common Options
9.21. tpm Common Options
9.22. tpm Validation Checks
9.23. tpm Configuration Options
B.1. Tungsten OS Support
B.2. MySQL/Tungsten Version Support
D.1. Continuent Tungsten Directory Structure
D.2. Continuent Tungsten tungsten Sub-Directory Structure
E.1. THL Event Format

Preface

This manual documents Tungsten Replicator 7.1 up to and including 7.1.4 build 10. Differences between minor versions are highlighted stating the explicit minor release version, such as 7.1.4.x.

For other versions and products, please use the appropriate manual.

1. Legal Notice

The trademarks, logos, and service marks in this Document are the property of Continuent or other third parties. You are not permitted to use these Marks without the prior written consent of Continuent or such appropriate third party. Continuent, Tungsten, uni/cluster, m/cluster, p/cluster, uc/connector, and the Continuent logo are trademarks or registered trademarks of Continuent in the United States, France, Finland and other countries.

All Materials on this Document are (and shall continue to be) owned exclusively by Continuent or other respective third party owners and are protected under applicable copyrights, patents, trademarks, trade dress and/or other proprietary rights. Under no circumstances will you acquire any ownership rights or other interest in any Materials by or through your access or use of the Materials. All right, title and interest not expressly granted is reserved to Continuent.

All rights reserved.

2. Conventions

This documentation uses a number of text and style conventions to indicate and differentiate between different types of information:

  • Text in this style is used to show an important element or piece of information. It may be used and combined with other text styles as appropriate to the context.

  • Text in this style is used to show a section heading, table heading, or particularly important emphasis of some kind.

  • Program or configuration options are formatted using this style. Options are also automatically linked to their respective documentation page when this is known. For example, tpm and --hosts both link automatically to the corresponding reference page.

  • Parameters or information explicitly used to set values to commands or options are formatted using this style.

  • Option values, for example on the command-line, are marked up using this format: --help. Where possible, all option values are directly linked to the reference information for that option.

  • Commands, including sub-commands to a command-line tool, are formatted using Text in this style. Commands are also automatically linked to their respective documentation page when this is known. For example, tpm links automatically to the corresponding reference page.

  • Text in this style indicates literal or character sequence text used to show a specific value.

  • Filenames, directories or paths are shown like this /etc/passwd. Filenames and paths are automatically linked to the corresponding reference page if available.

Bulleted lists are used to show lists, or detailed information for a list of items. Where this information is optional, a magnifying glass symbol enables you to expand, or collapse, the detailed instructions.

Code listings are used to show sample programs, code, configuration files and other elements. These can include both user input and replaceable values:

shell> cd /opt/continuent/software
shell> tar zxvf tungsten-replicator-7.1.4-10.tar.gz

In the above example, command lines to be entered into a shell are prefixed with shell>. This shell is typically sh, ksh, or bash on Linux and Unix platforms.

If commands are to be executed using administrator privileges, each line will be prefixed with root-shell, for example:

root-shell> vi /etc/passwd

To make the selection of text easier for copy/pasting, ignorable text, such as shell>, is ignored during selection. This allows multi-line instructions to be copied without modification, for example:

mysql> create database test_selection;
mysql> drop database test_selection;

Lines prefixed with mysql> should be entered within the mysql command-line.

If a command-line or program listing entry contains lines that are too wide to be displayed within the documentation, they are marked using the » character:

the first line has been extended by using a »
    continuation line

They should be adjusted to be entered on a single line.
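
As a rough illustration of that adjustment, the following Python sketch joins a wrapped listing back into single lines. The function name join_continuations is hypothetical, and the handling of the indented continuation text is an assumption based on the example above; it is not part of any Tungsten tool.

```python
CONTINUATION = "»"


def join_continuations(listing: str) -> str:
    """Join each line ending in the continuation marker with the line after it."""
    joined = []
    pending = ""
    for line in listing.splitlines():
        # Continuation text is indented in the manual, so drop that indent.
        text = pending + (line.lstrip() if pending else line)
        if text.rstrip().endswith(CONTINUATION):
            # Remove the marker and keep a single space before the next part.
            pending = text.rstrip()[: -len(CONTINUATION)].rstrip() + " "
        else:
            joined.append(text)
            pending = ""
    if pending:
        # A trailing marker with nothing after it: keep what was collected.
        joined.append(pending.rstrip())
    return "\n".join(joined)
```

Lines that do not end with the marker pass through unchanged, so the function can be applied to a whole listing.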

Text marked up with this style is information that is entered by the user (as opposed to generated by the system). Text formatted using this style should be replaced with the appropriate file, version number or other variable information according to the operation being performed.

In the HTML versions of the manual, blocks or examples that contain user input can be easily copied from the program listing. Where there are multiple entries or steps, use the 'Show copy-friendly text' link at the end of each section. This provides a copy of all the user-enterable text.

3. Quickstart Guide

Chapter 1. Introduction

Tungsten Replicator™ is a replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Amazon RDS MySQL, Amazon Aurora, Microsoft Azure and Google Cloud SQL, and applied to a variety of transactional stores, NoSQL stores and datawarehouse stores. For a full list of supported sources and targets, see Table 1.1, “Supported Extractors” and Table 1.2, “Supported Appliers” below.

During replication, Tungsten Replicator assigns data a unique global transaction ID, and enables flexible statement and/or row-based replication of data. This enables data to be exchanged between different databases and different database versions. During replication, information can be filtered and modified, and deployment can be between on-premise or cloud-based databases. For performance, Tungsten Replicator™ provides support for parallel replication, and advanced topologies such as fan-in, star and active/active, and can be used efficiently in cross-site deployments.

Tungsten Replicator™ is the core foundation for Tungsten Cluster™ for HA, DR and geographically distributed solutions.

Features in Tungsten Replicator

  • Includes support for replicating into Hadoop (including Apache Hadoop, Cloudera, HortonWorks, MapR, Amazon EMR)

  • Includes support for replicating into Amazon Redshift, including storing change data within Amazon S3

  • Includes support for replicating into PostgreSQL, Apache Kafka, MongoDB

  • Includes support for replicating to and from Amazon Aurora/RDS (MySQL) deployments

  • Available as an AMI via Amazon Marketplace (Without Support)

  • SSL Support for managing MySQL deployments

  • Network Client filter for handling complex data translation/migration needs during replication

The table below shows the versions of Tungsten Replicator in which support was added for each extractor.

Table 1.1. Supported Extractors

Source                    5.3  5.4  6.0  6.1  7.0
MySQL (5.0 to 5.6)        x    x    x    x    x
MySQL 5.7                 x    x    x    x    x
MySQL 8                   -    x    -    x    x
MariaDB (5.5, 10)         x    x    x    x    x
Amazon Aurora/RDS MySQL   x    x    x    x    x
Google Cloud MySQL        x    x    x    x    x
Microsoft Azure           x    x    x    x    x


The table below shows the versions of Tungsten Replicator in which support was added for each applier.

Table 1.2. Supported Appliers

Target                    5.3  5.4  6.0  6.1        7.0
MySQL (incl MariaDB)      x    x    x    x          x
Amazon Aurora/RDS MySQL   x    x    x    x          x
Microsoft Azure           x    x    x    x          x
Google Cloud MySQL        x    x    x    x          x
Oracle (incl. Cloud)      x    x    x    x          x
PostgreSQL (incl. Cloud)  x    x    x    x          x
Hadoop                    x    x    x    x          x
Vertica                   x    x    x    x          x
Amazon Redshift           x    x    x    x          x
MongoDB                   x    x    x    x          x
MongoDB Atlas             -    -    -    x (6.1.3)  x
Apache Kafka              x    x    x    x          x
Clickhouse                -    -    -    x          x


1.1. Tungsten Replicator

Tungsten Replicator is a high-performance replication engine that works with a number of different source and target databases to provide improved replication functionality over the native solution. With MySQL replication, for example, the enhanced functionality and information provided by Tungsten Replicator allows for global transaction IDs, advanced topology support such as Composite Active/Active, star, and fan-in, and enhanced latency identification.

In addition to providing enhanced functionality, Tungsten Replicator is also capable of heterogeneous replication by enabling the replicated information to be transformed after it has been read from the data server to match the functionality or structure in the target server. This functionality allows for replication between MySQL and a variety of heterogeneous targets.

Understanding how Tungsten Replicator works requires looking at the overall replicator structure. There are three major components in the system that provide the core of the replication functionality:

  • Extractor

    The extractor component reads data from a MySQL data server and writes that information into the Transaction History Log (THL). The role of the extractor is to read the information from a suitable source of change information and write it into the THL in the native or defined format, either as SQL statements or row-based information.

    Information is always extracted from a source database and recorded within the THL in the form of a complete transaction. The full transaction information is recorded and logged against a single, unique, transaction ID used internally within the replicator to identify the data.

  • Applier

    Appliers within Tungsten Replicator convert the THL information and apply it to a destination data server. The role of the applier is to read the THL information and apply that to the data server.

    The applier works with a number of different target databases, and is responsible for writing the information to the database. Because the transactional data in the THL is stored either as SQL statements or row-based information, the applier has the flexibility to reformat the information to match the target data server. Row-based data can be reconstructed to match different database formats, for example, converting row-based information into an Oracle-specific table row, or a MongoDB document.

  • Transaction History Log (THL)

    The THL contains the information extracted from a data server. Information within the THL is divided up by transactions, either implied or explicit, based on the data extracted from the data server. The THL structure, format, and content provides a significant proportion of the functionality and operational flexibility within Tungsten Replicator.

    As the THL data is stored, additional information, such as the metadata and options in place when the statement or row data was extracted, is recorded. Each transaction is also recorded with an incremental global transaction ID. This ID enables individual transactions within the THL to be identified, for example to retrieve their content, or to determine whether different appliers within a replication topology have written a specific transaction to a data server.

These components will be examined in more detail as different aspects of the system are described with respect to the different systems, features, and functionality that each system provides.

From this basic overview and structure of Tungsten Replicator, the replicator allows for a number of different topologies and solutions that replicate information between different services. Straightforward replication topologies, such as Primary/Replica are easy to understand with the basic concepts described above. More complex topologies use the same core components. For example, Composite Active/Active topologies make use of the global transaction ID to prevent the same statement or row data being applied to a data server multiple times. Fan-in topologies allow the data from multiple data servers to be combined into one data server.
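
The role of the global transaction ID in avoiding duplicate application can be sketched with a toy model. The Python below is a minimal illustration only: the class and field names (ThlEvent, Applier, seqno, source_id) are hypothetical and do not reflect the replicator's real THL format or internal position tracking.

```python
from dataclasses import dataclass, field


@dataclass
class ThlEvent:
    seqno: int      # incremental global transaction ID (illustrative name)
    source_id: str  # host the transaction was extracted from
    payload: str    # SQL statement or row data, kept opaque here


@dataclass
class Applier:
    applied: list = field(default_factory=list)
    # Highest transaction ID applied so far, tracked per source.
    last_seqno: dict = field(default_factory=dict)

    def apply(self, event: ThlEvent) -> bool:
        """Apply the event unless this source/ID pair was already applied."""
        if event.seqno <= self.last_seqno.get(event.source_id, -1):
            return False  # already written to the target, skip it
        self.applied.append(event.payload)
        self.last_seqno[event.source_id] = event.seqno
        return True
```

Because each event carries both its source and its ID, replaying the same event a second time is detected and ignored rather than applied twice.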

1.1.1. Extractor

Extractors exist for reading information from the following sources:

  • Reading the MySQL binary log (binlog) directly from the disk and translating that content and session information into the THL. This method can read the binlog in its different formats: statement, row, and mixed-based logging.

  • Remotely from a MySQL server over a network, including reading from an Amazon RDS MySQL or Amazon Aurora instance. This enables the replicator to read the information remotely, either on services where direct access to the binlog is not available, or where the replicator cannot be installed (such as databases hosted on a Windows platform).

1.1.2. Appliers

Once information has been recorded into THL, particularly when that information has been recorded in row-based format, it is possible to apply that information out to a variety of different targets, both transactional and SQL based solutions, and also NoSQL and analytical targets.

Available appliers include:

  • MySQL

    • Community Edition

    • Enterprise Edition

    • Percona

    • MariaDB

    • Amazon Aurora/RDS (Including cross region)

    • Google Cloud SQL

    • Microsoft Azure

  • Oracle

  • PostgreSQL

  • Amazon Redshift

  • HPE Vertica

  • Hadoop, compatible with all major distributions

  • MongoDB (Including Atlas from v6.1.3 onwards)

  • Apache Kafka

  • Clickhouse (Experimental)

For more information on how the heterogeneous replicator works, see Section 2.8.1, “How Heterogeneous Replication Works”. For more information on the batch applier, which works with datawarehouse targets, see Section 5.6, “Batch Loading for Data Warehouses”.

1.1.3. Transaction History Log (THL)

Tungsten Replicator operates by reading information from the source database and transferring that information to the Transaction History Log (THL).

Each transaction within the THL includes the SQL statement or the row-based data written to the database. The information also includes, where possible, transaction specific options and metadata, such as character set data, SQL modes and other information that may affect how the information is written when the data is applied. The combination of the metadata and the global transaction ID also enable more complex data replication scenarios to be supported, such as Composite Active/Active, without fear of duplicating statement or row data application because the source and global transaction ID can be compared.

In addition to all this information, the THL also includes a timestamp and a record of when the information was written into the database before the change was extracted. Using a combination of the global transaction ID and this timing information provides information on the latency and how up to date a dataserver is compared to the original datasource.
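
As a hedged illustration of how these two pieces of information can be combined, the Python sketch below computes a latency figure and a transaction gap. The function names and the idea of comparing two plain integers are assumptions made for the example; the replicator reports its own latency values through its status output.

```python
from datetime import datetime, timezone


def applied_latency_seconds(committed_at: datetime, applied_at: datetime) -> float:
    """Seconds between the original commit and the moment the target applied it."""
    return (applied_at - committed_at).total_seconds()


def transactions_behind(source_seqno: int, target_seqno: int) -> int:
    """How many transactions the target still has to apply (never negative)."""
    return max(source_seqno - target_seqno, 0)


if __name__ == "__main__":
    committed = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    applied = datetime(2024, 1, 1, 12, 0, 2, 500000, tzinfo=timezone.utc)
    print(applied_latency_seconds(committed, applied))
    print(transactions_behind(120, 117))
```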

Depending on the underlying storage of the data, the information can be reformatted and applied to different data servers. When dealing with row-based data, this can be applied to a different type of data server, or completely reformatted and applied to non-table based services such as MongoDB.
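
To make the idea of reformatting row-based data concrete, here is a minimal Python sketch that turns one row change into either a document or a parameterised INSERT. The function names and the flat columns/values representation are hypothetical; the real appliers work from the THL event structure and handle data types, keys, and escaping that are ignored here.

```python
def row_to_document(columns, values):
    """Shape a row as a key/value document, as a document store would hold it."""
    return dict(zip(columns, values))


def row_to_insert(schema, table, columns, values):
    """Shape the same row as a parameterised SQL INSERT plus its bind values."""
    column_list = ", ".join(columns)
    placeholders = ", ".join(["?"] * len(columns))
    statement = f"INSERT INTO {schema}.{table} ({column_list}) VALUES ({placeholders})"
    return statement, tuple(values)
```

The same row data feeds both functions, which is the property that lets a single extracted change be applied to very different targets.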

THL information is stored for each replicator service, and can also be exchanged over the network between different replicator instances. This enables transaction data to be exchanged between different hosts within the same network or across wide-area-networks.

1.1.4. Filtering

Filtering within the replicator enables the information within the THL to be removed, augmented, or modified as the information is transferred within and between the replicators.

During filtering, the information in the THL can be modified in a host of different ways, including but not limited to:

  • Filtering out information based on the schema name, table name or column name. This is useful if you want a subset of the information in your target database, or if you want to apply only certain columns to the information.

  • Filter information based on the content, or value of one or more fields.

  • Filter information based on the operation type, for example, only applying inserts to a target ignoring updates or deletes.

  • Modify or alter the format or structure of the data. This can be used to change the data format to be compatible with a target system, for example due to data type limitations, or sizes.

  • Add information to the data. For example, adding a database name, source name, or additional or compound fields into the target data. Within an analytics system this can be useful when combining data from multiple sources so that the source system or customer can still be identified.

The format, content, and structure of the data and the THL can be modified and new data can even be created through the filters.
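
The kinds of filtering listed above can be sketched as a chain of small functions. This Python example is a conceptual model only: the event dictionaries, the function names, and the convention that a filter returns None to drop an event are all assumptions for illustration, and the filters shipped with the replicator are configured and written differently.

```python
def drop_schemas(*schemas):
    """Remove every event that belongs to one of the named schemas."""
    blocked = set(schemas)

    def _filter(event):
        return None if event["schema"] in blocked else event

    return _filter


def only_operations(*operations):
    """Keep only events of the given operation types, for example INSERT."""
    allowed = set(operations)

    def _filter(event):
        return event if event["op"] in allowed else None

    return _filter


def add_field(name, value):
    """Add a field to the row data without modifying the original event."""

    def _filter(event):
        return {**event, "row": {**event["row"], name: value}}

    return _filter


def run_filters(events, filters):
    """Pass each event through the filters in order; None drops the event."""
    kept = []
    for event in events:
        for apply_filter in filters:
            event = apply_filter(event)
            if event is None:
                break
        else:
            kept.append(event)
    return kept
```

A chain such as run_filters(events, [drop_schemas("audit"), only_operations("INSERT"), add_field("source_name", "db1")]) removes one schema, keeps only inserts, and tags each surviving row with its source.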

For more information on the filters available, and how to use them, see Chapter 11, Replication Filters.

Chapter 2. Deployment Overview

Tungsten Replicator creates a unique replication interface between two databases. Because Tungsten Replicator is independent of the dataserver it affords a number of different advantages, including more flexible replication strategies, filtering, and easier control to pause, restart, and skip statements between hosts.

Replication is supported from, and to, different dataservers using different technologies through a series of extractor and applier components which independently read data from, and write data to, the dataservers in question.

The replication process is made possible by reading the binary log on each host. The information from the binary log is written into the Tungsten Replicator Transaction History Log (THL), and the THL is then transferred between hosts and then applied to each Target host. More information can be found in Chapter 1, Introduction.

Before covering the basics of creating different dataservices, there are some key terms that will be used throughout the setup and installation process that identify different components of the system. These are summarised in Table 2.1, “Key Terminology”.

Table 2.1. Key Terminology

Tungsten Term      Traditional Term  Description
dataserver         Database          The database on a host. Dataservers include MySQL or Oracle.
datasource         Host or Node      One member of a dataservice and the associated Tungsten components.
staging host       -                 The machine (and directory) from which Tungsten Replicator is installed and configured. The machine does not need to be the same as any of the existing hosts in the cluster.
staging directory  -                 The directory where the installation files are located and the installer is executed. Further configuration and updates must be performed from this directory.

Before attempting installation, there are a number of prerequisite tasks which must be completed to set up your hosts, database, and Tungsten Replicator service:

  1. Set up a staging host from which you will configure and manage your installation.

  2. Configure each host that will be used within your dataservice.

  3. Configure your MySQL installation, so that Tungsten Replicator can work with the database.

  4. Prepare and configure the target environment.

The following sections provide guidance and instructions for creating a number of different deployment scenarios using Tungsten Replicator.

2.1. Deployment Sources

Tungsten Replicator is available in a number of different distribution types, and the configuration methods available for these packages differ. See Section 9.1, “Comparing Staging and INI tpm Methods” for more information on the available installation methods.

Deployment Type/Package  TAR/GZip  RPM
Staging Installation     Yes       No
INI File Configuration   Yes       Yes
Deploy Entire Cluster    Yes       No
Deploy Per Machine       Yes       Yes

Two primary deployment sources are available, TAR/GZipped files and RPM packages. Each is described in its own section below.

All packages are named according to the product, version number, build release and extension. For example:

tungsten-replicator-7.1.4-10.tar.gz

The version number is 7.1.4 and build number 10. Build numbers indicate which build a particular release version is based on, and may be useful when installing patches provided by support.
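
The naming scheme can be illustrated with a short Python sketch that splits a package name into its parts. The regular expression is inferred from the single example name above and the function name parse_package_name is hypothetical; other package types may use additional fields that this pattern does not cover.

```python
import re

# Pattern assumed from the example name: product-version-build.extension
PACKAGE_NAME = re.compile(
    r"^(?P<product>[a-z]+(?:-[a-z]+)*)"  # e.g. tungsten-replicator
    r"-(?P<version>\d+\.\d+\.\d+)"       # e.g. 7.1.4
    r"-(?P<build>\d+)"                   # e.g. 10
    r"\.(?P<extension>.+)$"              # e.g. tar.gz
)


def parse_package_name(filename: str) -> dict:
    """Split a package file name into product, version, build and extension."""
    match = PACKAGE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised package name: {filename!r}")
    parts = match.groupdict()
    parts["build"] = int(parts["build"])  # build numbers compare as integers
    return parts
```

Keeping the build as an integer makes it straightforward to compare two builds of the same release version.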

2.1.1. Using the TAR/GZipped files

To use the TAR/GZipped packages, download the files to your machine and unpack them:

shell> cd /opt/continuent/software
shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz

This will create a directory matching the downloaded package name, version, and build number from which you can perform an install using either the INI file or command-line configuration. To do so, use the tpm command within the tools directory of the extracted package:

shell> cd tungsten-replicator-7.1.4-10

2.1.2. Using the RPM package files

The RPM packages can be used for installation, but are primarily designed to be used in combination with the INI configuration file.

Installation

Installing the RPM package will do the following:

  1. Create the tungsten system user if it doesn't exist

  2. Make the tungsten system user part of the mysql group if it exists

  3. Create the /opt/continuent/software directory

  4. Unpack the software into /opt/continuent/software

  5. Define the $CONTINUENT_PROFILES and $REPLICATOR_PROFILES environment variables

  6. Update the profile script to include the /opt/continuent/share/env.sh script

  7. Create the /etc/tungsten directory

  8. Run tpm install if the /etc/tungsten.ini or /etc/tungsten/tungsten.ini file exists

Although the RPM packages complete a number of the pre-requisite steps required to configure your cluster, there are additional steps, such as configuring ssh, that you still need to complete. For more information, see Appendix B, Prerequisites.

By using the package files you are able to set up a new server by creating the /etc/tungsten.ini file and then installing the package. Any output from the tpm command will go to /opt/continuent/service_logs/rpm.output.
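
Before installing the package, it can be useful to confirm that one of the two INI locations named above actually exists. The Python sketch below is a hypothetical helper, not part of the package scripts; the order in which the two paths are checked is an assumption, since the text does not state which one takes precedence.

```python
from pathlib import Path

# The two locations named in the installation steps, relative to the root.
INI_CANDIDATES = ("etc/tungsten.ini", "etc/tungsten/tungsten.ini")


def find_tungsten_ini(root="/"):
    """Return the first INI file found under root, or None if neither exists."""
    for relative in INI_CANDIDATES:
        candidate = Path(root) / relative
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_tungsten_ini()
    print(found if found else "no INI file found; tpm install would not be run")
```

The root parameter exists only so the check can be exercised against a scratch directory.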

Note

If you download the package files directly, you may need to add the signing key to your environment before the package will load properly.

For yum platforms (RHEL/CentOS/Amazon Linux), the rpm command is used:

root-shell> rpm --import http://www.continuent.com/RPM-GPG-KEY-continuent

For Ubuntu/Debian platforms, the gpg command is used:

root-shell> gpg --keyserver keyserver.ubuntu.com --recv-key 7206c924

Once an INI file has been created and the packages are available, the installation can be completed using:

  • On RHEL/CentOS/Amazon Linux:

    root-shell> yum install tungsten-replicator

  • On Ubuntu/Debian:

    root-shell> apt-get install tungsten-replicator

Upgrades

If you upgrade to a new version of the RPM package it will do the following:

  1. Unpack the software into /opt/continuent/software

  2. Run tpm update if the /etc/tungsten.ini or /etc/tungsten/tungsten.ini file exists

The tpm update will restart all Continuent Tungsten services so you do not need to do anything after upgrading the package file.

2.2. Best Practices

A successful deployment depends on being mindful during deployment, operations and ongoing maintenance.

2.2.1. Best Practices: Deployment

  • Identify the best deployment method for your environment and use that in production and testing. See Section 9.1, “Comparing Staging and INI tpm Methods”.

  • Standardize the OS and database prerequisites. There are Ansible modules available for immediate use within AWS, or as a template for modifications.

    More information on the Ansible method is available in this blog article.

  • Ensure that the output of the `hostname` command and the nodename entries in the Tungsten configuration match exactly prior to installing Tungsten.

    The configuration keys that define nodenames are: --slaves, --dataservice-slaves, --members, --master, --dataservice-master-host, --masters and --relay

  • For security purposes you should ensure that you secure the following areas of your deployment:

  • Choose your topology from the deployment section and verify the configuration matches the basic settings. Additional settings may be included for custom features but the basics are needed to ensure proper operation. If your configuration is not listed or does not match our documented settings, we cannot guarantee correct operation.

  • If you are using ROW replication, any triggers that run additional INSERT/UPDATE/DELETE operations must be updated so they do not run on the Replica servers.

  • Make sure you know the structure of the Tungsten Cluster home directory and how to initialize your environment for administration. See Section 7.1, “The Home Directory” and Section 7.2, “Establishing the Shell Environment”.

  • Prior to migrating applications to Tungsten Cluster, test failover and recovery procedures from Chapter 7, Operations Guide. Be sure to try recovering a failed Primary and reprovisioning failed Replicas.

  • When deciding on the Service Name for your configurations, keep them simple and short and only use alphanumerics (Aa-Zz,0-9) and underscores (_).
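
Two of the checks in the list above, the exact hostname match and the service name character set, lend themselves to a quick pre-flight script. The Python below is a sketch under stated assumptions: the function names are hypothetical, the nodename list is supplied by the caller rather than read from a Tungsten configuration, and socket.gethostname() is used as a stand-in for the output of the hostname command.

```python
import re
import socket

# Alphanumerics and underscores only, as recommended for service names.
SERVICE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_service_name(name: str) -> bool:
    """True when the name uses only letters, digits and underscores."""
    return SERVICE_NAME.fullmatch(name) is not None


def hostname_matches(nodenames, hostname=None) -> bool:
    """True when the local hostname appears exactly (case-sensitive) in nodenames."""
    if hostname is None:
        hostname = socket.gethostname()
    return hostname in nodenames
```

The comparison is deliberately strict: a short name such as db1 does not match a fully qualified db1.example.com, which is the kind of mismatch the recommendation is meant to catch.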

2.2.2. Best Practices: Upgrade

In this section we identify the best practices for performing a Tungsten Software upgrade.

  • Identify the deployment method chosen for your environment, Staging or INI. See Section 9.1, “Comparing Staging and INI tpm Methods”.

  • The best practice for Tungsten software is to upgrade All-at-Once, performing zero Primary switches.

  • The Staging deployment method automatically does an All-at-Once upgrade - this is the basic design of the Staging method.

  • For an INI upgrade, there are two possible ways, One-at-a-Time (with at least one Primary switch), and All-at-Once (no switches at all).

  • See Section 9.4.3, “Upgrades with an INI File” for more information.

  • Here is the sequence of events for a proper Tungsten upgrade on a 3-node cluster with the INI deployment method:

    • Login to the Customer Downloads Portal and get the latest version of the software.

    • Copy the file (e.g. tungsten-clustering-7.0.2-161.tar.gz) to each host that runs a Tungsten component.

    • Set the cluster to policy MAINTENANCE

    • On every host:

      • Extract the tarball under /opt/continuent/software/ (e.g. create /opt/continuent/software/tungsten-clustering-7.0.2-161)

      • cd to the newly extracted directory

      • Run the Tungsten Package Manager tool, tools/tpm update --replace-release

    • For example, here are the steps in order:

      On ONE database node:
      shell> cctrl
      cctrl> set policy maintenance
      cctrl> exit
      
      On EVERY Tungsten host at the same time:
      shell> cd /opt/continuent/software
      shell> tar xvzf tungsten-clustering-7.0.2-161.tar.gz
      shell> cd tungsten-clustering-7.0.2-161
      
      To perform the upgrade and restart the Connectors gracefully at the same time:
      shell> tools/tpm update --replace-release
      
      To perform the upgrade and delay the restart of the Connectors to a later time:
      shell> tools/tpm update --replace-release --no-connectors
      
      When it is time for the Connector to be promoted to the new version, perhaps after taking it out of the load balancer:
      shell> tpm promote-connector
      
      When all nodes are done, on ONE database node:
      shell> cctrl
      cctrl> set policy automatic
      cctrl> exit

WHY is it ok to upgrade and restart everything all at once?

Let’s look at each component to examine what happens during the upgrade, starting with the Manager layer.

Once the cluster is in Maintenance mode, the Managers cease to make changes to the cluster, and therefore Connectors will not reroute traffic either.

Since Manager control of the cluster is passive in Maintenance mode, it is safe to stop and restart all Managers - there will be zero impact to the cluster operations.

The Replicators function independently of client MySQL requests (which come through the Connectors and go to the MySQL database server), so even if the Replicators are stopped and restarted, there should be only a small window of delay while the replicas catch up with the Primary once upgraded. If the Connectors are reading from the Replicas, they may briefly get stale data if not using SmartScale.

Finally, when the Connectors are upgraded they must be restarted so the new version can take over. As discussed in this blog post, Zero-Downtime Upgrades, the Tungsten Cluster software upgrade process will do two key things to help keep traffic flowing during the Connector upgrade promote step:

  • Execute `connector graceful-stop 30` to gracefully drain existing connections and prevent new connections.

  • Using the new software version, initiate the start/retry feature which launches a new connector process while another one is still bound to the server socket. The new Connector process will wait for the socket to become available by retrying binding every 200ms by default (which is tunable), drastically reducing the window for application connection failures.
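
The start/retry behaviour described in the second point can be illustrated with a generic socket example. This Python sketch is not the Connector's implementation: the function name bind_with_retry, its parameters, and the timeout are assumptions, and only the 200ms default retry interval is taken from the text.

```python
import errno
import socket
import time


def bind_with_retry(port, host="127.0.0.1", interval=0.2, timeout=10.0):
    """Keep trying to bind a listening socket until the port becomes free."""
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # a different failure; retrying would not help
            if time.monotonic() >= deadline:
                raise TimeoutError(f"port {port} still in use after {timeout}s")
            time.sleep(interval)  # the previous process still holds the socket
```

The new process takes over as soon as the old one releases the port, so the gap in which clients cannot connect is bounded by roughly one retry interval.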

2.2.3. Best Practices: Operations

2.2.4. Best Practices: Maintenance

  • Your license allows for a testing cluster. Deploy a cluster that matches your production cluster and test all operations and maintenance operations there.

  • Disable any automatic operating system patching processes. The use of automatic patching will cause issues when all database servers automatically restart without coordination. See Section 7.13.3, “Performing Maintenance on an Entire Dataservice”.

  • Regularly check for maintenance releases and upgrade your environment. Every version includes stability and usability fixes to ease the administrative process.

2.3. Common tpm Options During Deployment

There are a variety of tpm options that can be used to alter some aspect of the deployment during configuration. Although they might not be provided within the example deployments, they may be used or required for different installation environments. These include options such as altering the ports used by different components, or the commands and utilities used to monitor or manage the installation once deployment has been completed. Some of the most common options are included within this section.

Changes to the configuration should be made with tpm update. This continues the procedure of using tpm install during installation. See Section 9.5.21, “tpm update Command” for more information on using tpm update.

  • --datasource-systemctl-service

    On some platforms and environments the command used to manage and control the MySQL or MariaDB service is handled by a tool other than the service or /etc/init.d/mysql commands.

    Depending on the system or environment, other commands using the same basic structure may be used. For example, within CentOS 7, the command is systemctl. You can explicitly set the command to be used by using the --datasource-systemctl-service option to specify the name of the tool.

    The format of the corresponding command that will be used is expected to follow the same format as previous commands, for example to stop the database service:

    shell> systemctl mysql stop

    Different commands must follow the same basic structure: the command configured by --datasource-systemctl-service, the service name, and the action (i.e. stop).

2.4. Starting and Stopping Tungsten Replicator

To shut down a running Tungsten Replicator operation you must switch off the replicator:

shell> replicator stop
Stopping Tungsten Replicator Service...
Stopped Tungsten Replicator Service.

Note

Stopping the replicator in this way results in an ungraceful shutdown of the replicator. To perform a graceful shutdown, use trepctl offline first, then stop or restart the replicator.
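The graceful sequence described in the note can be wrapped in a small shell function. This is a minimal sketch rather than a supported tool: graceful_stop is a hypothetical name, and the sketch assumes that trepctl and replicator are on the PATH and that a single replication service is configured.

```shell
# graceful_stop (hypothetical helper): take the service offline, then stop.
# Assumes trepctl and replicator are on PATH and one service is configured.
graceful_stop() {
    # Ask the replicator to go OFFLINE first; give up if that request fails.
    trepctl offline || return 1
    # Stop the replicator process once the offline request has returned.
    replicator stop
}
```

Calling graceful_stop in place of a bare replicator stop keeps the two steps in the right order.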

To start the replicator service if it is not already running:

shell> replicator start
Starting Tungsten Replicator Service...

To restart (stop and start) the replicator service:

shell> replicator restart
Stopping Tungsten Replicator Service...
Stopped Tungsten Replicator Service.
Starting Tungsten Replicator Service...

For some scenarios, such as initiating a load within a heterogeneous environment, the replicator can be started up in the OFFLINE state:

shell> replicator start offline

In a clustered environment, if the cluster was configured with auto-enable=false then you will need to put each node online individually.
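When a replicator has been started in the OFFLINE state, it has to be brought online explicitly. The following is a minimal sketch of that step: bring_online is a hypothetical helper, the retry count and delay are arbitrary defaults, and it assumes trepctl is on the PATH with a single service configured.

```shell
# bring_online (hypothetical helper) [TRIES] [DELAY_SECONDS]
# Requests ONLINE, then polls "trepctl status" until the state is ONLINE.
bring_online() {
    local tries=${1:-10} delay=${2:-2} state
    trepctl online || return 1
    while [ "$tries" -gt 0 ]; do
        # Extract the value of the "state" line from the status output.
        state=$(trepctl status | awk -F ':' '/^state[ \t]*:/ { gsub(/[ \t]/, "", $2); print $2 }')
        [ "$state" = "ONLINE" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    # The service did not report ONLINE within the allowed attempts.
    return 1
}
```

For example, bring_online 30 2 would poll for roughly one minute before giving up with a non-zero exit status.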

2.5. Configuring Startup on Boot

By default, Tungsten Replicator does not start automatically on boot. To enable Tungsten Replicator to start at boot time on a system supporting the Linux Standard Base (LSB), use the deployall script provided in the installation directory to create the necessary boot scripts on your system:

shell> sudo deployall

To disable automatic startup at boot time, use the undeployall command:

shell> sudo undeployall

2.6. Removing Datasources from a Deployment

Removing components from a dataservice is quite straightforward, and usually involves both modifying the running service and changing the configuration. Changing the configuration is necessary to ensure that the host is not re-configured and installed when the installation is next updated.

2.6.1. Removing a Datasource from an Existing Deployment

To remove a datasource from an existing deployment there are two primary stages: removing it from the active service, and then removing it from the active configuration.

For example, to remove host6 from a service:

  1. Login to host6.

  2. Stop the replicator:

    shell> replicator stop

Now that the node has been removed from the active dataservice, the host must be removed from the configuration.

  1. Remove the node from the configuration. The exact method depends on which installation method is used with tpm:

    • If you are using the staging directory method with tpm:

      1. Change to the staging directory. The current staging directory can be located using tpm query staging:

        shell> tpm query staging
        tungsten@host1:/home/tungsten/tungsten-replicator-7.1.4-10
        shell> cd /home/tungsten/tungsten-replicator-7.1.4-10
      2. Update the configuration, omitting the host from the list of members of the dataservice:

        shell> tpm update alpha \
            --members=host1,host2,host3
    • If you are using the INI file method with tpm:

      • Remove the INI configuration file:

        shell> rm /etc/tungsten/tungsten.ini
  2. Remove the installed software directory:

    shell> rm -rf /opt/continuent

2.7. Understanding Deployment Styles and Topologies

The following sections describe the different styles of deployment available and the different topologies that can be configured using Tungsten Replicator.

2.7.1. Tungsten Replicator Extraction Operation

Replication Operation    Support
---------------------    -------
Statements Replicated    Yes, within MySQL/MySQL Topologies only
Rows Replicated          Yes
Schema Replicated        Yes, within MySQL/MySQL Topologies only
ddlscan Supported        Yes, supported for mixed MySQL, and data warehouse targets

Tungsten Replicator for MySQL operates in one of two ways:

  • Reading the MySQL binary log (binlog) directly from the disk and translating that content and session information into the THL. This method can read the binlog in its different formats: statement, row and mixed-based logging.

  • Reading the binlog remotely from the MySQL server over a network, for example from an Amazon Aurora MySQL instance. This enables the replicator to extract from services where direct access to the binlog is not available, or where the replicator cannot be installed on the database host. This is also referred to as Offboard installation.

The following diagrams show these two methods of extraction:

Figure 2.1. Internals: MySQL Extraction

Figure 2.2. Internals: Amazon Aurora/Remote Database, Offboard Extraction

Tungsten Replicator for MySQL is supported within the following environments:

  • MySQL Community Edition

  • MySQL Enterprise Edition from Oracle

  • Percona

  • MariaDB

  • Amazon RDS

  • Amazon Aurora

  • Google Cloud MySQL

In addition, the following requirements and limitations are in effect:

  • Tables must have primary keys (Only applicable when the target is not Oracle, MySQL or Postgres)

  • Row-based binary logging must be configured for heterogeneous deployment models

  • Datatype support varies, depending upon the target. Check applier documentation appropriate to deployment target for more detail.

  • Currently, DDL is only replicated in MySQL to MySQL deployments

2.7.2. Understanding Deployment Models

The flexibility of the replicator allows you to install the software in a number of ways to fit the limitations or restrictions you may be faced with, in addition to supporting a number of flexible topologies. These are outlined below:

  • Onboard

    This method will involve the Tungsten Replicator being installed on the same host as the Source MySQL Database. This method is suitable for:

    • On-Premise deployments

    • EC2 Hosted Databases in AWS

    • Google Cloud SQL Hosted Instances

  • Offboard

    This method will involve the Tungsten Replicator being installed on a different host from the Source MySQL Database. This method is suitable for:

    • On-Premise deployments

    • EC2 Instances in AWS

    • Google Cloud SQL Hosted Instances

    • Amazon RDS MySQL Instances

    • Amazon Aurora Instances

  • Direct

    This method involves the Tungsten Replicator being installed on a different host from the source MySQL Database; however, the replicator will also act as the applier, writing out to the target. This method is suitable for:

    • Amazon RDS MySQL Instances

    • Amazon Aurora Instances

    • Cluster-Extractor topologies, extracting direct from a Tungsten Cluster

  • AWS Marketplace AMI

    This method is based on a pre-built AMI available for purchase within the Amazon Marketplace. This method is suitable for:

    • Amazon AWS Hosted solutions, including RDS and Aurora

2.7.3. Understanding Deployment Topologies

Tungsten Replicator can be configured in a number of different ways. Review Section 2.7.2, “Understanding Deployment Models” for full details of the differences between each deployment style. The following sections explain the different topology styles that can be deployed.

2.7.3.1. Simple Primary/Replica Topology

Primary/Replica is the simplest and most straightforward of all replication scenarios, and also the basis of all other types of topology. The fundamental basis for the Primary/Replica topology is that changes in the Source are distributed and applied to each of the configured Targets.

Figure 2.3. Topologies: Primary/Replica

2.7.3.2. Active/Active Topology

An active/active topology relies on a number of individual services that are used to define a Primary/Replica topology between each group of hosts. In a three-node active/active setup, for example, three different services are created on each host; each service creates a Primary/Replica relationship between a primary host (itself) and the remote Targets. A change on any individual host will be replicated to the other databases in the topology, creating the active/active configuration.

Figure 2.4. Topologies: Active/Active

2.7.3.3. Fan-Out Topology

The fan-out topology allows you to replicate from one single host out to two or more target hosts. Fan-out topologies are often used in situations where you have different reporting requirements; for example, sales figures may need aggregating and reporting within a Redshift environment, but payroll information may need replicating to a MySQL environment for back office processing.

Figure 2.5. Topologies: Fan-Out

2.7.3.4. Fan-In Topology

The fan-in topology is the logical opposite of a Primary/Replica topology. In a fan-in topology, the data from two (or more) Sources is combined together on one Target. Fan-in topologies are often used in situations where you have satellite databases, maybe for sales or retail operations, and need to combine that information together in a single database for processing.

Figure 2.6. Topologies: Fan-In

2.7.3.5. Replicating in/out of an existing Tungsten Cluster

If you have an existing cluster and you want to replicate the data out to a separate standalone server using Tungsten Replicator then you can create a cluster alias, and use a Primary/Replica topology to replicate from the cluster. This allows for THL events from the cluster to be applied to a separate server for the purposes of backup or separate analysis.

Figure 2.7. Topologies: Cluster-Extractor

2.8. Understanding Heterogeneous Deployments

Heterogeneous deployments cover installations where data is being replicated between two different database solutions. These include, but are not limited to:

The following sections provide more detail and information on the setup and configuration of these different solutions.

2.8.1. How Heterogeneous Replication Works

Heterogeneous replication works slightly differently compared to the native MySQL to MySQL replication. This is because SQL statements, including both Data Manipulation Language (DML) and Data Definition Language (DDL) cannot be executed on a target system as they were extracted from the MySQL database. The SQL dialects are different, so that an SQL statement on MySQL is not the same as an SQL statement on Oracle, and differences in the dialects mean that either the statement would fail, or would perform an incorrect operation.

On targets that do not support SQL of any kind, such as MongoDB, replicating SQL statements would achieve nothing since they cannot be executed at all.

All heterogeneous replication deployments therefore use row-based replication. This extracts only the raw row data, not the statement information. Because it is only row-data, it can be easily re-assembled or constructed into another format, including statements in other SQL dialects, native appliers for alternative formats, such as JSON or BSON, or external CSV formats that enable the data to be loaded in bulk batches into a variety of different targets.

2.8.1.1. JDBC Applier based Replication

Replication into targets where the JDBC Driver can be used, such as Oracle and Postgres, works as follows:

  1. Data is extracted from the source MySQL database:

    • The MySQL server is configured to write transactions into the MySQL binary log using row-based logging. This generates information in the log in the form of the individual updated rows, rather than the statement that was used to perform the update. For example, instead of recording the statement:

      mysql> INSERT INTO MSG VALUES (1,'Hello World');
    • The information is stored as a row entry against the updated table:

      1 Hello World
    • The information is written into the THL as row-based events, with the event type (insert, update or delete) appended to the metadata of the THL event.

    It is the raw row data that is stored in the THL. Because the row data, not the SQL statement, has been recorded, the differences in SQL dialects between the source and target do not need to be taken into account. In fact, Data Definition Language (DDL) and other SQL statements are deliberately ignored so that replication does not break.

  2. The row-based transactions stored in the THL are transferred from the Extractor to the Applier.

  3. On the Applier side, the row-based event data is wrapped into a suitable SQL statement for the target database environment. Because the raw row data is available, it can be constructed into any suitable statement appropriate for the target database.
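The idea behind step 3 can be illustrated with a short shell sketch. This is not how the applier is implemented; it only shows that raw row values are enough to build a statement for a target. The row_to_insert name, the tab-separated input format and the "quote anything that is not a plain number" rule are all assumptions made for this illustration.

```shell
# row_to_insert (hypothetical helper) TABLE
# Reads tab-separated row values on stdin and prints one INSERT per row.
row_to_insert() {
    awk -F '\t' -v table="$1" -v q="'" '
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                v = $i
                # Treat plain numbers as numeric; quote everything else.
                if (v !~ /^-?[0-9]+(\.[0-9]+)?$/) {
                    gsub(q, q q, v)      # double any embedded single quote
                    v = q v q
                }
                out = out (i > 1 ? "," : "") v
            }
            printf "INSERT INTO %s VALUES (%s);\n", table, out
        }'
}
```

Feeding the row from the example above, printf '1\tHello World\n' | row_to_insert MSG, rebuilds a statement of the same shape as the original INSERT, and the same row data could equally be rendered in another dialect.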

2.8.1.2. Native Applier Replication (e.g. MongoDB)

For heterogeneous replication where data is written into a target database using a native applier, such as MongoDB, the row-based information is written into the database using the native API. With MongoDB, for example, data is reformatted into BSON and then applied into MongoDB using the native insert/update/delete API calls.

2.8.1.3. Batch Loading

For batch appliers, such as Hadoop, Vertica and Redshift, the row-data is converted into CSV files in batches. The format of the CSV file includes both the original row data for all the columns of each table, and metadata on each line that contains the unique SEQNO and the operation type (insert, delete or update). A modified form of the CSV is used in some cases where the operation type is only an insert or delete, with updates being translated into a delete followed by an insert of the updated information.

These temporary CSV files are then loaded into the native environment as part of the replicator using a custom script that employs the specific tools of that database that support CSV imports. The raw CSV data is loaded into a staging table that contains the per-row metadata and the row data itself.

Depending on the batch environment, the loading of the data into the final destination tables is performed either within the same script, or by using a separate script. Both methods work in the same basic fashion: the base table is updated using the data from the staging table, with each row marked for deletion being deleted, and the latest row (the one with the highest SEQNO) for each primary key then being inserted.
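The merge logic can be sketched with awk over a simplified staging file. This is an illustration under stated assumptions, not the script shipped with the replicator: merge_staging is a hypothetical name, the four-column layout (opcode, SEQNO, primary key, value) is invented for the example, a tie on SEQNO is resolved in favour of the later line, and a key whose latest change is a delete is not re-inserted.

```shell
# merge_staging (hypothetical helper)
# Reads staging rows "opcode,seqno,pkey,value" on stdin and prints the
# DELETE and INSERT operations to run against the base table.
merge_staging() {
    awk -F ',' '
        {
            op = $1; seqno = $2 + 0; key = $3
            # Remember keys in order of first appearance for stable output.
            if (!(key in last_seq)) { order[++n] = key }
            if (op == "D") { deleted[key] = 1 }
            # Track the most recent change seen for each primary key.
            if (!(key in last_seq) || seqno >= last_seq[key]) {
                last_seq[key] = seqno; last_op[key] = op; last_val[key] = $4
            }
        }
        END {
            # First remove every key that has a delete marker in staging.
            for (i = 1; i <= n; i++)
                if (order[i] in deleted) print "DELETE " order[i]
            # Then insert the latest row of each key that still exists.
            for (i = 1; i <= n; i++)
                if (last_op[order[i]] == "I")
                    print "INSERT " order[i] "," last_val[order[i]]
        }'
}
```

An update appears in the staging data as a D row followed by an I row with the same SEQNO, so the key is deleted from the base table and then re-inserted with its newest values.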

2.8.1.4. Schema Creation and Replication

Because heterogeneous replication does not replicate SQL statements, including DDL statements that would normally define and generate the table structures, a different method must be used.

Tungsten Replicator includes a tool called ddlscan which can read the schema definition from MySQL and translate that into the schema definition required on the target database. During the process, differences in supported sizes and datatypes are identified and either modified to a suitable value, or highlighted as a definition that must be changed in the generated DDL.

Once this modified form of the DDL has been completed, it can then be executed against the target database to create the tables required for Tungsten Replicator to apply data. The same basic method is used in batch loading environments where a staging table is required, with the additional staging columns added to the DDL automatically.

For MongoDB or Kafka, where no explicit DDL needs to be generated, the use of ddlscan is not required.

Chapter 3. Deploying MySQL Extractors

The following sections outline the steps to configure the replicator for extraction. Each section covers the basic configuration to deploy an extractor in each of the deployment models (Onboard or Offboard) regardless of target database type.

To complete the deployment, after preparing the basic extractor configuration, follow the steps outlined in Chapter 4, Deploying Appliers appropriate to the target database type for your deployment.

3.1. MySQL Replication Pre-Requisites

Before installing Tungsten Replicator there are a number of steps that need to be completed to prepare the hosts.

First, ensure you have followed the general notes within Section B.3, “Host Configuration”. For supported platforms and environments, see Section B.1, “Requirements”.

If configuring extraction from MySQL instances hosted on your own hardware, or, for example, on EC2 instances, follow the MySQL specific pre-requisites within Section B.4, “MySQL Database Setup”.

If configuring extraction from Amazon RDS or Amazon Aurora, also follow the pre-requisites within Section B.4, “MySQL Database Setup”, paying specific attention to Section B.4.6, “MySQL Unprivileged Users”.

For more detail on changing parameters within Amazon AWS, see Section 3.3.1, “Changing Amazon RDS/Aurora Instance Configurations”.

A pre-requisite checklist is available to download and can be used to ensure your environment is ready for installation. See Section B.5, “Prerequisite Checklist”.

3.2. Deploying a Primary/Replica Topology

Primary/Replica is the simplest and most straightforward of all replication scenarios, and also the basis of all other types of topology. The fundamental basis for the Primary/Replica topology is that changes in the Primary are distributed and applied to each of the configured Replicas.

Figure 3.1. Topologies: Primary/Replica

This deployment style can be used against the following sources:

  • MySQL Community Edition

  • MySQL Enterprise Edition

  • Percona MySQL

  • MariaDB

  • Google Cloud MySQL

This deployment assumes full access to the host, including access to the binary logs, and is therefore not suitable for RDS or Aurora extraction. For these sources, see Section 3.3, “Deploying an Extractor for Amazon Aurora”.

tpm includes a specific topology structure for the basic Primary/Replica configuration, using the list of hosts and the Primary host definition to define the Primary/Replica relationship. Before starting the installation, the prerequisites must have been completed (see Appendix B, Prerequisites). To create a Primary/Replica using tpm:

There are two types of installation, either via a Staging Install, or via an ini file install.

To understand the differences between these two installation methods, see Section 9.1, “Comparing Staging and INI tpm Methods”

Regardless of which installation method you choose, the steps are the same, and are outlined below, using the appropriate example configuration based on your deployment style.

Note

If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

rest-api-admin-user=tungsten
rest-api-admin-pass=secret

In both of the above examples, enable-heterogeneous-service is only required if the target applier is NOT a MySQL database.

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the installation has been completed, you can now proceed to configure the Applier service following the relevant step within Chapter 4, Deploying Appliers.

Following installation of the applier, the services can be started. For information on starting and stopping Tungsten Replicator, see Section 2.4, “Starting and Stopping Tungsten Replicator”; for information on configuring init scripts to start up and shut down when the system boots and shuts down, see Section 2.5, “Configuring Startup on Boot”.

For information on checking the running service, see Section 3.2.1, “Monitoring the MySQL Extractor”.

3.2.1. Monitoring the MySQL Extractor

Once the service has been started, a quick view of the service status can be determined using trepctl:

shell> trepctl services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 3593
appliedLatency  : 1.074
role            : master
serviceName     : alpha
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

The key fields are:

  • appliedLastSeqno and appliedLatency indicate the global transaction ID and latency of the host. These are important when monitoring the status of the cluster to determine how up to date a host is and whether a specific transaction has been applied.

  • role indicates the current role of the host within the scope of this dataservice.

  • state shows the current status of the host within the scope of this dataservice.
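For scripted monitoring, individual fields can be pulled out of the trepctl output shown above. The following is a minimal sketch that assumes the "name : value" layout of that output; trepctl_field is a hypothetical helper name and is not part of the product.

```shell
# trepctl_field (hypothetical helper) NAME
# Reads trepctl output on stdin and prints the value of the field NAME.
trepctl_field() {
    awk -v name="$1" '
        {
            # Split each "key   : value" line at the first colon.
            idx = index($0, ":")
            if (idx == 0) next
            key = substr($0, 1, idx - 1)
            val = substr($0, idx + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            # Print only the first matching field.
            if (key == name && !found) { print val; found = 1 }
        }'
}
```

A typical use would be trepctl services | trepctl_field state, which prints the value from the state line of the output.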

More detailed status information can also be obtained. On the Extractor:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000009:0000000000001033;0
appliedLastSeqno       : 3593
appliedLatency         : 1.074
channels               : 1
clusterName            : default
currentEventId         : mysql-bin.000009:0000000000001033
currentTimeMillis      : 1373615598598
dataServerHost         : host1
extensions             : 
latestEpochNumber      : 3589
masterConnectUri       : 
masterListenUri        : thl://host1:2112/
maximumStoredSeqNo     : 3593
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:3306/
relativeLatency        : 604904.598
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 604903.621
transitioningTo        : 
uptimeSeconds          : 1202137.328
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

For more information on using trepctl, see Section 8.20, “The trepctl Command”.

Definitions of the individual fields in the above example output can be found in Section E.2, “Generated Field Reference”.

For more information on managing and operating your replicator installation, see Chapter 7, Operations Guide.
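The status output can also drive a simple health check. The sketch below is one possible approach, not a shipped monitoring script: check_status is a hypothetical name, the latency threshold is chosen by the caller, and the OK/WARNING/CRITICAL wording is only a convention borrowed from common monitoring tools.

```shell
# check_status (hypothetical helper) MAX_LATENCY_SECONDS
# Reads "trepctl status" output on stdin. Prints OK and returns 0 when the
# state is ONLINE and appliedLatency does not exceed the threshold;
# otherwise prints a reason and returns 1.
check_status() {
    awk -v max="$1" '
        /^state[ \t]*:/ {
            sub(/^[^:]*:[ \t]*/, ""); sub(/[ \t]+$/, ""); state = $0
        }
        /^appliedLatency[ \t]*:/ {
            sub(/^[^:]*:[ \t]*/, ""); sub(/[ \t]+$/, ""); latency = $0
        }
        END {
            if (state != "ONLINE") { print "CRITICAL: state is " state; exit 1 }
            if (latency + 0 > max + 0) { print "WARNING: latency " latency "s"; exit 1 }
            print "OK"
        }'
}
```

For example, trepctl status | check_status 60 would flag a replicator that is not ONLINE or whose applied latency exceeds one minute.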

3.3. Deploying an Extractor for Amazon Aurora

Replicating from Amazon Aurora operates by directly accessing the binary log provided by Aurora over the network. This enables you to take advantage of Amazon Web Services, with the replicator running either on a standard EC2 instance within AWS or on a remote host. The complexity with Aurora is that there is no access to the host that is running the instance, or to the MySQL binary log files on its filesystem.

To use this service, two aspects of the Tungsten Replicator are required: direct mode and unprivileged user support. Direct mode reads the MySQL binary log over the network, rather than accessing the binlog on the filesystem. The unprivileged mode enables the user to access and update information within Aurora without requiring SUPER privileges, which are unavailable within an Aurora instance. For more information, see Section B.4.6, “MySQL Unprivileged Users”.

The deployment requires a host for the extractor installation. This can be an EC2 instance within your AWS environment, or it could be a remote host in your own environment.

This deployment follows a similar model to an Offboard installation.

Figure 3.2. Topologies: Aurora Extraction

Before starting the installation, the prerequisites must have been completed (see Appendix B, Prerequisites) on both the Host designated for the installation of the extractor, and within the source database instance.

There are two types of installation, either via a Staging Install, or via an ini file install.

To understand the differences between these two installation methods, see Section 9.1, “Comparing Staging and INI tpm Methods”

Regardless of which installation method you choose, the steps are the same, and are outlined below.

In the above examples:

  • enable-heterogeneous-service is only required if the target applier is NOT a MySQL database.

  • datasource-mysql-conf needs to be set as shown, because there is no access to the my.cnf file.

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the installation has been completed, you can now proceed to configure the Applier service following the relevant step within Chapter 4, Deploying Appliers.

Following installation of the applier, the services can be started. For information on starting and stopping Tungsten Replicator, see Section 2.4, “Starting and Stopping Tungsten Replicator”; for information on configuring init scripts to start up and shut down when the system boots and shuts down, see Section 2.5, “Configuring Startup on Boot”.

Monitoring the extractor is the same as monitoring an extractor for MySQL. For more information, see Section 3.2.1, “Monitoring the MySQL Extractor”.

3.3.1. Changing Amazon RDS/Aurora Instance Configurations

The configuration of RDS and Aurora instances can be modified to change the parameters for MySQL instances; this is the Amazon equivalent of modifying the my.cnf file.

3.3.1.1. Changing Amazon RDS using command line functions

These steps can be used for changing the configuration for RDS Instances only. See Section 3.3.1.2, “Changing Amazon Aurora Parameters using AWS Console” for steps to change Aurora parameters.

The parameters can be set internally by connecting to the instance and using the configuration function within the instance. For example:

mysql> call mysql.rds_set_configuration('binlog retention hours', 48);

An RDS command-line interface is available which enables modifying these parameters. To enable the command-line interface:

shell> wget http://s3.amazonaws.com/rds-downloads/RDSCli.zip
shell> unzip RDSCli.zip
shell> export AWS_RDS_HOME=/home/tungsten/RDSCli-1.13.002
shell> export PATH=$PATH:$AWS_RDS_HOME/bin

The current RDS instances can be listed by using rds-describe-db-instances:

shell> rds-describe-db-instances --region=us-east-1

To change parameters, a new parameter group must be created, and then applied to a running instance or instances before restarting the instance:

  1. Create a new custom parameter group:

    shell> rds-create-db-parameter-group repgroup -d 'Parameter group for DB Replicas' -f mysql5.1

    Where repgroup is the replicator group name.

  2. Set the new parameter value:

    shell> rds-modify-db-parameter-group repgroup --parameters \
    "name=max_allowed_packet,value=67108864, method=immediate" 
  3. Apply the parameter group to your instance:

    shell> rds-modify-db-instance instancename --db-parameter-group-name=repgroup

    Where instancename is the name given to your instance.

  4. Restart the instance:

    shell> rds-reboot-db-instance instancename
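On hosts where the unified AWS CLI (aws) is installed instead of the standalone RDS tools, the same four steps can be expressed roughly as follows. This is a sketch under assumptions: apply_rds_parameter is a hypothetical wrapper, the aws command is assumed to be installed and configured with suitable credentials and region, and the parameter group family (for example mysql8.0) must match the engine version of your instance.

```shell
# apply_rds_parameter (hypothetical helper) GROUP FAMILY INSTANCE NAME VALUE
# Creates a parameter group, sets one parameter, attaches the group to the
# instance and reboots it. Assumes a configured "aws" CLI on PATH.
apply_rds_parameter() {
    local group=$1 family=$2 instance=$3 name=$4 value=$5
    # 1. Create a new custom parameter group.
    aws rds create-db-parameter-group \
        --db-parameter-group-name "$group" \
        --db-parameter-group-family "$family" \
        --description "Parameter group for DB Replicas" || return 1
    # 2. Set the new parameter value.
    aws rds modify-db-parameter-group \
        --db-parameter-group-name "$group" \
        --parameters "ParameterName=$name,ParameterValue=$value,ApplyMethod=immediate" || return 1
    # 3. Apply the parameter group to the instance.
    aws rds modify-db-instance \
        --db-instance-identifier "$instance" \
        --db-parameter-group-name "$group" || return 1
    # 4. Restart the instance.
    aws rds reboot-db-instance --db-instance-identifier "$instance"
}
```

For example, apply_rds_parameter repgroup mysql8.0 instancename max_allowed_packet 67108864 mirrors the values used in the steps above.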

3.3.1.2. Changing Amazon Aurora Parameters using AWS Console

To change the parameters for Aurora Instances, follow these guidelines using the AWS Console:

  1. Log in to the AWS Console using your account credentials and navigate to the RDS Dashboard. From here, select "Parameter Groups" from the left-hand list.

    Figure 3.3. Fig 1. AWS Config

  2. Select the "Create Parameter Group" button at the top right.

    Figure 3.4. Fig 2. AWS Config

  3. This dialog will now allow you to create a new parameter group using an existing one as a template. Select the appropriate template to use and complete the rest of the details. You need to create a DB Parameter Group and a DB Cluster Parameter Group.

    Figure 3.5. Fig 3. AWS Config

    Figure 3.6. Fig 4. AWS Config

    Figure 3.7. Fig 5. AWS Config

  4. Now that you have the two groups, you can modify the parameters accordingly by selecting the group in the list and then selecting the "Edit" option.

    Figure 3.8. Fig 6. AWS Config

    Figure 3.9. Fig 7. AWS Config

  5. Now that the groups are set up, you can assign these groups to existing Aurora Instances, or you can assign them during instance creation. If you are assigning to existing instances, you may need to restart the instance for certain parameters to take effect.

Some parameters, such as enabling binary logging, can only be set via the cluster parameter group; others can only be changed in the DB Parameter group.

3.4. Replicating Data Out of a Cluster

If you have an existing cluster and you want to replicate the data out to a separate standalone server using Tungsten Replicator then you can create a cluster alias, and use a Primary/Replica topology to replicate from the cluster. This allows for THL events from the cluster to be applied to a separate server for the purposes of backup or separate analysis.

Figure 3.10. Topologies: Replicating Data Out of a Cluster

During the installation process a cluster-alias and cluster-slave are declared. The cluster-alias describes all of the servers in the cluster and how they may be reached. The cluster-slave defines one or more servers that will replicate from the cluster.

The Tungsten Replicator will be installed on the Cluster-Extractor server. That server will download THL data and apply it to the local server. If the Cluster-Extractor has more than one server, one of them will be declared the relay (or Primary). The other members of the Cluster-Extractor may also download THL data from that server.

If the relay for the Cluster-Extractor fails, the other nodes will automatically start downloading THL data from a server in the cluster. If a non-relay server fails, it will not have any impact on the other members.

3.4.1. Prepare: Replicating Data Out of a Cluster

  1. Identify the cluster to replicate from. You will need the Primary, Replicas and THL port (if specified). Use tpm reverse from a cluster member to find the correct values.

  2. If you are replicating to a non-MySQL server, update the configuration of the cluster to include the following properties prior to beginning.

    svc-extractor-filters=colnames,pkey
    property=replicator.filter.pkey.addColumnsToDeletes=true
    property=replicator.filter.pkey.addPkeyToInserts=true

  3. Identify all servers that will replicate from the cluster. If there is more than one, a relay server should be identified to replicate from the cluster and provide THL data to other servers.

  4. Prepare each server according to the prerequisites for the DBMS platform it is serving. If you are working with multiple DBMS platforms, treat each platform as a different Cluster-Extractor during deployment.

  5. Make sure the THL port for the cluster is open between all servers.

3.4.2. Deploy: Replicating Data Out of a Cluster

  1. Install the Tungsten Replicator package or download the Tungsten Replicator tarball, and unpack it:

    shell> cd /opt/continuent/software
    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  2. Change to the unpackaged directory:

    shell> cd tungsten-replicator-7.1.4-10
  3. Configure the replicator

    The configuration is shown first using the Staging method, followed by the equivalent INI file.

    shell> ./tools/tpm configure defaults \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --replication-password=secret \
        --replication-port=13306 \
        --replication-user=tungsten \
        --user=tungsten \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=host1 \
        --slaves=host2,host3 \
        --thl-port=2112 \
        --topology=cluster-alias
    
    shell> ./tools/tpm configure beta \
        --relay=host6 \
        --relay-source=alpha \
        --topology=cluster-slave
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    replication-password=secret
    replication-port=13306
    replication-user=tungsten
    user=tungsten
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=host1
    slaves=host2,host3
    thl-port=2112
    topology=cluster-alias
    
    [beta]
    relay=host6
    relay-source=alpha
    topology=cluster-slave
    

    Important

    If you are replicating to a non-MySQL server, include the following steps in your configuration:

    shell> mkdir -p /opt/continuent/share/
    shell> cp tungsten-replicator/support/filters-config/convertstringfrommysql.json »
       /opt/continuent/share/

    Then, include the following parameters in the configuration

    property=replicator.stage.remote-to-thl.filters=convertstringfrommysql
    property=replicator.filter.convertstringfrommysql.definitionsFile= »
       /opt/continuent/share/convertstringfrommysql.json
    

    Important

    The name of the cluster-alias dataservice (alpha in this example) MUST be the same as the dataservice name of the cluster that you are replicating from.

    Note

    Do not include start-and-report=true if you are taking over for MySQL native replication. See Section 7.10.1, “Migrating from MySQL Native Replication 'In-Place'” for next steps after completing installation.

  4. Once the configuration has been completed, you can perform the installation to set up the services using this configuration:

    shell> ./tools/tpm install

During the installation and startup, tpm will notify you of any problems that need to be fixed before the service can be correctly installed and started. If the service starts correctly, you should see the configuration and current status of the service.

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

The cluster should be installed and ready to use.

Chapter 4. Deploying Appliers

Table of Contents

4.1. Deploying the MySQL Applier
4.1.1. Preparing for MySQL Replication
4.1.2. Prepare Amazon RDS/Amazon Aurora
4.1.3. Install MySQL Applier
4.1.3.1. Local and Remote MySQL Targets
4.1.3.2. Amazon RDS and Amazon Aurora Targets
4.1.4. Management and Monitoring of MySQL Deployments
4.2. Deploying the Amazon Redshift Applier
4.2.1. Redshift Replication Operation
4.2.2. Preparing for Amazon Redshift Replication
4.2.2.1. Redshift Preparation for Amazon Redshift Deployments
4.2.2.2. Configuring Identity Access Management within AWS
4.2.2.3. Amazon Redshift DDL Generation for Amazon Redshift Deployments
4.2.2.4. Handling Concurrent Writes from Multiple Appliers
4.2.3. Install Amazon Redshift Applier
4.2.4. Verifying your Redshift Installation
4.2.5. Keeping CDC Information
4.2.6. Management and Monitoring of Amazon Redshift Deployments
4.3. Deploying the Vertica Applier
4.3.1. Preparing for Vertica Deployments
4.3.2. Install Vertica Applier
4.3.3. Management and Monitoring of Vertica Deployments
4.3.4. Troubleshooting Vertica Installations
4.4. Deploying the Kafka Applier
4.4.1. Preparing for Kafka Replication
4.4.2. Install Kafka Applier
4.4.2.1. Optional Configuration Parameters for Kafka
4.4.3. Management and Monitoring of Kafka Deployments
4.5. Deploying the MongoDB Applier
4.5.1. MongoDB Atlas Replication
4.5.2. Preparing for MongoDB Replication
4.5.3. Install MongoDB Applier
4.5.4. Install MongoDB Atlas Applier
4.5.4.1. Import MongoDB Atlas Certificates
4.5.5. Management and Monitoring of MongoDB Deployments
4.6. Deploying the Hadoop Applier
4.6.1. Hadoop Replication Operation
4.6.2. Preparing for Hadoop Replication
4.6.2.1. Hadoop Host
4.6.2.2. Schema Generation
4.6.3. Replicating into Kerberos Secured HDFS
4.6.4. Install Hadoop Replication
4.6.4.1. Applier Replicator Service
4.6.4.2. Generating Materialized Views
4.6.4.3. Accessing Generated Tables in Hive
4.6.4.4. Management and Monitoring of Hadoop Deployments
4.6.4.5. Troubleshooting Hadoop Replication
4.7. Deploying the Oracle Applier
4.7.1. Preparing for Oracle Replication
4.7.1.1. Additional Prerequisites for Oracle Targets
4.7.1.2. Configure the Oracle database
4.7.1.3. Create the Destination Schema
4.7.2. Install Oracle Applier
4.8. Deploying the PostgreSQL Applier
4.8.1. Preparing for PostgreSQL Replication
4.8.1.1. PostgreSQL Database Setup
4.8.2. Install PostgreSQL Applier
4.8.3. Management and Monitoring of PostgreSQL Deployments
4.9. Deploying the Amazon S3 CSV Applier
4.9.1. S3 Replication Operation
4.9.2. Preparing for Amazon S3 Replication
4.9.3. Install Amazon S3 Applier

The following sections outline the steps to configure the replicator for applying into your target of choice. Each section covers the basic configuration to deploy an applier in each of the deployment models (Onboard or Offboard).

Before preparing the applier configuration, follow the steps outlined in Chapter 3, Deploying MySQL Extractors to configure the extractor.

4.1. Deploying the MySQL Applier

Deploying the MySQL applier is the most straightforward of deployments. This section covers configuration of the applier into all releases of MySQL, including Amazon RDS, Amazon Aurora, Google Cloud SQL and Microsoft Azure.

  • Service Alpha on host1 extracts the information from the MySQL binary log into THL.

  • Service Alpha reads the information from the remote replicator as THL, and applies that to the target MySQL instance via a JDBC Connector.

Figure 4.1. Topologies: Replicating to MySQL

The Applier replicator can be installed on:

4.1.1. Preparing for MySQL Replication

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

  • MySQL Target

    Applies to:

    • Standalone hosted instances

    • EC2 hosted instances

    • Google Cloud hosted instances

    • Microsoft Azure hosted instances

    To prepare the target MySQL database, ensure the user accounts are created as per the steps outlined in Section B.4.5, “MySQL User Configuration”.

  • Amazon RDS/Amazon Aurora Target

    For Amazon based targets, there is no access to the host and accounts cannot be configured with elevated privileges. Follow the steps in Section B.4.6, “MySQL Unprivileged Users” to prepare the target for replication.

The data replicated from MySQL can be any data, although there are some known limitations and assumptions made on the way the information is transferred.

  • Table format should be updated to UTF8 by updating the MySQL configuration (my.cnf):

    character-set-server=utf8
    collation-server=utf8_general_ci
  • To prevent MySQL from storing timezone-adjusted values and exporting them to the binary log and Amazon RDS, fix the timezone configuration to UTC within the configuration file (my.cnf):

    default-time-zone='+00:00'
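The settings above can be verified with a short script before deployment. The sketch below is illustrative only and is not a Tungsten tool: it assumes the settings live in the [mysqld] section of a simple my.cnf with no !include or !includedir directives, and the function name is hypothetical.

```python
import configparser

# Settings recommended in this section, expressed as [mysqld] options.
EXPECTED = {
    "character-set-server": "utf8",
    "collation-server": "utf8_general_ci",
    "default-time-zone": "+00:00",
}


def check_mysqld_settings(cnf_text):
    """Return {option: (expected, found)} for every setting that differs."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    found = {}
    if parser.has_section("mysqld"):
        # MySQL treats '-' and '_' in option names as interchangeable.
        found = {k.replace("_", "-"): v for k, v in parser["mysqld"].items()}
    problems = {}
    for option, expected in EXPECTED.items():
        value = found.get(option)
        if value is not None:
            # Values may be quoted in my.cnf, for example '+00:00'.
            value = value.strip("'\"")
        if value != expected:
            problems[option] = (expected, value)
    return problems
```

An empty result means that all three settings match. Any other result lists each option with the expected value and the value actually found (None if the option is missing).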

If your target is an Amazon RDS or Aurora instance that has not yet been created, follow the steps in Section 4.1.2, “Prepare Amazon RDS/Amazon Aurora”.

If your target is a hosted MySQL environment, proceed to Section 4.1.3, “Install MySQL Applier”.

4.1.2. Prepare Amazon RDS/Amazon Aurora

  • Create the Amazon Instance

    If the instance does not already exist, create the Amazon RDS or Amazon Aurora instance and take a note of the endpoint URL reported. This information will be required when configuring the replicator service.

    Also take a note of the user and password used for connecting to the instance.

  • Check your security group configuration.

    The host used as the Target for applying changes to the Amazon instance must have been added to the security groups. Within Amazon RDS and Aurora, security groups configure the hosts that are allowed to connect to the Amazon instance, and hence update information within the database. The configuration must include the IP address of the Applier replicator, whether that host is within Amazon EC2 or external.

  • Change RDS/Aurora instance properties

    Depending on the configuration and data to be replicated, the parameters of the running instance may need to be modified. For example, the max_allowed_packet parameter may need to be increased.

    For more information on changing parameters, see Section 3.3.1, “Changing Amazon RDS/Aurora Instance Configurations”.

4.1.3. Install MySQL Applier

The applier will read information from the Extractor and write database changes into the target instance.

The process for configuring the Applier replicator is the same for local or remote MySQL targets and for Amazon RDS/Aurora targets; only the configuration differs slightly, as outlined below:

  • Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  • Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10
  • Use the appropriate template config for your target

  • Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  • Once the prerequisites have been met and the installation has been configured, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

The replicators can now be started using the replicator command.

The status of the replicator can be checked and monitored by using the trepctl command.

4.1.3.1. Local and Remote MySQL Targets

  • Configure the installation using tpm:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/continuent \
        --user=tungsten \
        --mysql-allow-intensive-checks=true \
        --profile-script=~/.bash_profile \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=sourcehost \
        --members=localhost,sourcehost \
        --datasource-type=mysql \
        --replication-user=tungsten \
        --replication-password=secret \
        --replication-host=remotedbhost
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/continuent
    user=tungsten
    mysql-allow-intensive-checks=true
    profile-script=~/.bash_profile
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=sourcehost
    members=localhost,sourcehost
    datasource-type=mysql
    replication-user=tungsten
    replication-password=secret
    replication-host=remotedbhost
    

    replication-host should only be added to the above configuration if the target MySQL Database is on a different host to the applier installation
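The two configuration forms shown in this section carry the same options: each --option=value argument in the Staging method corresponds to an option=value line under the matching section of the INI file. The following Python sketch illustrates that correspondence only. It is not part of tpm, the function name is hypothetical, and it skips arguments without a value (such as --reset) because the INI example in this section does not list them.

```python
def staging_to_ini(section, options):
    """Render tpm staging-style arguments (--key=value) as an INI section."""
    lines = [f"[{section}]"]
    for option in options:
        if not option.startswith("--"):
            raise ValueError(f"not a staging option: {option!r}")
        if "=" not in option:
            # Arguments such as --reset carry no value and are skipped here.
            continue
        lines.append(option[2:])
    return "\n".join(lines)


alpha_ini = staging_to_ini(
    "alpha",
    [
        "--master=sourcehost",
        "--members=localhost,sourcehost",
        "--datasource-type=mysql",
    ],
)
```

Here alpha_ini holds an [alpha] header followed by the three options with the leading dashes removed.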

4.1.3.2. Amazon RDS and Amazon Aurora Targets

4.1.4. Management and Monitoring of MySQL Deployments

Replication to MySQL and Amazon based instances operates in the same manner as all other replication environments. The current status can be monitored using trepctl. On the Extractor:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000043:0000000000000291;84
appliedLastSeqno       : 2320
appliedLatency         : 0.733
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000043:0000000000000291
currentTimeMillis      : 1387544952494
dataServerHost         : host1
extensions             : 
host                   : host1
latestEpochNumber      : 60
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host1:2112/
maximumStoredSeqNo     : 2320
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:13306/
relativeLatency        : 23.494
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 99525.477
transitioningTo        : 
uptimeSeconds          : 99527.364
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

On the Applier, use trepctl and monitor the appliedLatency and appliedLastSeqno. The output will include the hostname of the Amazon RDS instance:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000043:0000000000000291;84
appliedLastSeqno       : 2320
appliedLatency         : 797.615
channels               : 1
clusterName            : default
currentEventId         : NONE
currentTimeMillis      : 1387545785268
dataServerHost         : documentationtest.cnlhon44f2wq.eu-west-1.rds.amazonaws.com
extensions             : 
host                   : documentationtest.cnlhon44f2wq.eu-west-1.rds.amazonaws.com
latestEpochNumber      : 60
masterConnectUri       : thl://host1:2112/
masterListenUri        : thl://host2:2112/
maximumStoredSeqNo     : 2320
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://host1:2112/
relativeLatency        : 856.268
resourcePrecedence     : 99
rmiPort                : 10000
role                   : slave
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : documentationtest.cnlhon44f2wq.eu-west-1.rds.amazonaws.com
state                  : ONLINE
timeInStateSeconds     : 461.885
transitioningTo        : 
uptimeSeconds          : 668.606
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...
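For automated monitoring, the name and value pairs in this output can be parsed by a script. The following is a minimal sketch in Python that relies only on the "name : value" layout shown above. The function names and the 60 second latency threshold are assumptions for illustration, not Tungsten defaults.

```python
def parse_trepctl_status(output):
    """Parse the text printed by 'trepctl status' into a dict of strings."""
    status = {}
    for line in output.splitlines():
        # Field lines look like "name   : value"; banner lines have no colon.
        name, separator, value = line.partition(":")
        if not separator:
            continue
        status[name.strip()] = value.strip()
    return status


def applier_is_healthy(status, max_latency=60.0):
    """True if the replicator is ONLINE and appliedLatency is within bounds."""
    latency = float(status.get("appliedLatency", "inf"))
    return status.get("state") == "ONLINE" and latency <= max_latency
```

The output of trepctl status could be captured with subprocess.run and passed to parse_trepctl_status. With the applier output shown above, applier_is_healthy returns False at the assumed 60 second threshold, because appliedLatency is 797.615.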

4.2. Deploying the Amazon Redshift Applier

Amazon Redshift is a cloud-based data warehouse service that integrates with other Amazon services, such as S3, to provide an SQL-like interface to the loaded data. Replication for Amazon Redshift moves data from MySQL datastores, through S3, and into the Redshift environment in real-time, avoiding the need to manually export and import the data.

Replication to Amazon Redshift operates as follows:

  • Data is extracted from the source database into THL.

  • When extracting the data from the THL, the Amazon Redshift replicator writes the data into CSV files according to the name of the source tables. The files contain all of the row-based data, including the global transaction ID generated by the extractor during replication, and the operation type (insert, delete, etc) as part of the CSV data.

  • The generated CSV files are loaded into Amazon S3 using either the s3cmd command or the aws s3 cli tools. This enables easy access to your Amazon S3 installation and simplifies the loading.

  • The CSV data is loaded from S3 into Redshift staging tables using the Redshift COPY command, which imports raw CSV into Redshift tables.

  • SQL statements are then executed within Redshift to update the live version of the tables from the batch-loaded CSV data. To work effectively within the confines of Amazon Redshift, updates are performed by deleting the old rows and inserting the new data.

Figure 4.2. Topologies: Replicating to Amazon Redshift

Setting up replication requires setting up both the Extractor and Applier components as two different configurations, one for MySQL and the other for Amazon Redshift. Replication also requires some additional steps to ensure that the Amazon Redshift host is ready to accept the replicated data that has been extracted. Tungsten Replicator provides all the tools required to perform these operations during the installation and setup.

4.2.1. Redshift Replication Operation

The Redshift applier makes use of the JavaScript based batch loading system (see Section 5.6.4, “JavaScript Batchloader Scripts”). This constructs change data from the source database. The change data is then loaded into staging tables, at which point a process merges the change data into the base tables. A summary of this basic structure can be seen in Figure 4.3, “Topologies: Redshift Replication Operation”.

Figure 4.3. Topologies: Redshift Replication Operation

Different object types within the two systems are mapped as follows:

MySQL       Redshift
Instance    Database
Database    Schema
Table       Table

The full replication of information operates as follows:

  1. Data is extracted from the source database using the standard extractor, for example by reading the row change data from the binlog in MySQL.

  2. The ColumnName filter (see Section 11.4.5, “ColumnName Filter”) is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.

    The PrimaryKey filter (see Section 11.4.32, “PrimaryKey Filter”) is used to extract primary key data from the source tables.

  3. On the Applier replicator, the THL data is read and written into batch-files in the character-separated value format.

    The information in these files is change data, and contains not only the original row values from the source tables, but also metadata about the operation performed (i.e. INSERT, DELETE or UPDATE) and the primary key for each table. All UPDATE statements are recorded as a DELETE of the existing data, and an INSERT of the new data.

    In addition to these core operation types, the batch applier can also be configured to record UPDATE operations that result in INSERT or DELETE rows. This enables Redshift to process the update information more simply than performing the individual DELETE and INSERT operations.

  4. A second process uses the CSV stage data and any existing data, to build a materialized view that mirrors the source table data structure.
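Steps 3 and 4 can be pictured as replaying change rows against the existing table contents. The Python sketch below is a conceptual illustration only; the replicator performs the merge with SQL inside Redshift, not with this code. It uses the operation codes described later in this section (I, D, UI and UD), models the base table as a dictionary keyed by primary key, and assumes that rows sharing a sequence number arrive in the order they were generated. The function name is hypothetical.

```python
DELETES = {"D", "UD"}   # plain DELETE, or the DELETE half of an UPDATE
INSERTS = {"I", "UI"}   # plain INSERT, or the INSERT half of an UPDATE


def apply_changes(base, staged_rows):
    """Return a copy of base with the staged change rows applied.

    base maps primary key -> row data.
    staged_rows is an iterable of (opcode, seqno, primary_key, row_data).
    """
    result = dict(base)
    # sorted() is stable, so rows with equal seqno keep their input order.
    for opcode, seqno, key, data in sorted(staged_rows, key=lambda row: row[1]):
        if opcode in DELETES:
            result.pop(key, None)
        elif opcode in INSERTS:
            result[key] = data
        else:
            raise ValueError(f"unknown operation code: {opcode!r}")
    return result
```

An UPDATE of the row with primary key 3 therefore appears as a UD row followed by a UI row with the same sequence number, and leaves a single row holding the new values.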

The staging files created by the replicator are in a specific format that incorporates change and operation information in addition to the original row data.

  • The format of the files is a character separated values file, with each row separated by a newline, and individual fields separated by the character 0x01. This is supported by Hive as a native value separator.

  • The content of the file consists of the full row data extracted from the Source, plus metadata describing the operation for each row, the sequence number, and then the full row information.

Operation  Sequence No                    Table-specific primary key  DateTime                         Table-columns...
OPTYPE     SEQNO that generated this row  PRIMARYKEY                  DATETIME of source table commit

The operation field will match one of the following values:

Operation  Description                                   Notes
I          Row is an INSERT of new data
D          Row is a DELETE of existing data
UI         Row is an UPDATE which caused INSERT of data
UD         Row is an UPDATE which caused DELETE of data

For example, the MySQL row from an INSERT of:

|  3 | #1 Single | 2006 | Cats and Dogs (#1.4)         |

Is represented within the CSV staging files generated as:

"I","5","3","2014-07-31 14:29:17.000","3","#1 Single","2006","Cats and Dogs (#1.4)"

The character separator, and whether to use quoting, are configurable within the replicator when it is deployed. For Redshift, the default behavior is to generate quoted and comma separated fields.
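The quoted, comma separated layout can be reproduced with the Python csv module, which is useful when building test data or checking staging files by hand. This is a sketch of the file format only, not the replicator's own writer; the function name and argument names are hypothetical.

```python
import csv
import io


def staging_row(opcode, seqno, primary_key, commit_time, columns):
    """Format one staging row as a fully quoted, comma separated line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    # Metadata fields come first, followed by the full row data.
    writer.writerow([opcode, seqno, primary_key, commit_time, *columns])
    return buffer.getvalue().rstrip("\n")


line = staging_row(
    "I", 5, 3, "2014-07-31 14:29:17.000",
    [3, "#1 Single", 2006, "Cats and Dogs (#1.4)"],
)
```

With these arguments, line matches the staging row shown in the example above.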

4.2.2. Preparing for Amazon Redshift Replication

Preparing the hosts for the replication process requires setting some key configuration parameters within the MySQL server to ensure that data is stored and written correctly. On the Amazon Redshift side, the database and schema must be created using the existing schema definition so that the databases and tables exist within Amazon Redshift.

Source Host

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

The following are required for replication to Amazon Redshift:

4.2.2.1. Redshift Preparation for Amazon Redshift Deployments

On the Amazon Redshift host, you need to perform some preparation of the destination database, first creating the database, and then creating the tables that are to be replicated. Setting up this process requires the configuration of a number of components outside of Tungsten Replicator in order to support the loading.

  • An existing Amazon Web Services (AWS) account, and either the AWS Access Key and Secret Key, or configured IAM Roles, required to interact with the account through the API. For information on creating IAM Roles, see Section 4.2.2.2, “Configuring Identity Access Management within AWS”

  • A configured Amazon S3 service. If the S3 service has not already been configured, visit the AWS console and sign up for the Amazon S3 service.

  • The s3cmd or the aws tools installed and configured. The s3cmd can be downloaded from s3cmd on s3tools.org.

    If using s3cmd, configure the command to connect to the Amazon S3 service automatically, without requiring further authentication. The .s3cfg file in the tungsten user's home directory should be configured as follows:

    • Using Access Keys:

      [default]
      access_key = ACCESS_KEY
      secret_key = SECRET_KEY
    • Using IAM Roles: Leave values blank - copy example as is

      [default]
      access_key = 
      secret_key = 
      security_token =
  • Create an S3 bucket that will be used to hold the CSV files that are generated by the replicator. This can be achieved either through the web interface, or via the command-line, for example:

    shell> s3cmd mb s3://tungsten-csv
  • A running Redshift instance must be available, and the port and IP address of the Tungsten Cluster that will be replicating into Redshift must have been added to the Redshift instance security credentials.

    Make a note of the user and password that have been provided with access to the Redshift instance, as these will be needed when installing the applier. Also make a note of the Redshift instance address, as this will need to be provided to the applier configuration.

  • Create an s3-config-servicename.json file based on the sample provided within cluster-home/samples/conf/s3-config-servicename.json within the Tungsten Replicator staging directory, or using the example below.

    Once created, the file will be copied into the /opt/continuent/share directory to be used by the batch applier script.

    If multiple services are being created, one file must be created for each service.

    The following example shows the use of Access and Secret Keys:

    {
      "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
      "awsAccessKey" : "access-key-id",
      "awsSecretKey" : "secret-access-key",
      "cleanUpS3Files" : "true"
    }

    The following example shows the use of IAM Roles:

    {
      "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
      "awsIAMRole" : "arn:iam-role",
      "cleanUpS3Files" : "true"
    }

    The allowed options for this file are as follows:

    • awsS3Path — the location within your S3 storage where files should be loaded.

    • awsAccessKey — the S3 access key to access your S3 storage. Not required if awsIAMRole is used.

    • awsSecretKey — the S3 secret key associated with the Access Key. Not required if awsIAMRole is used.

    • awsIAMRole — the IAM role configured to allow Redshift to interact with S3. Not required if awsAccessKey and awsSecretKey are in use.

    • multiServiceTarget (true/false) — to indicate if there are multiple appliers writing into the single Redshift Target, for example when the source is Tungsten Cluster Composite Active/Active or a Tungsten Replicator Fan-In Topology (Default: false).

    • singleLockTable (true/false) — to indicate the table lock behaviour when multiServiceTarget is true. Will be ignored if multiServiceTarget set to false (Default: true)

    • lockTablePrefix — the prefix for the lock tables when singleLockTable is false. (Default: lock_xxx_)

    • s3Binary — the binary to use for loading csv file up to S3. (Valid Values: s3cmd, s4cmd, aws) (Default: s3cmd)

    • redshiftCopyOptions — allows the passing of additional valid syntax to be added to the Redshift COPY command during csv loading from S3 into Redshift Staging Tables.

      A list of valid parameters can be found in the Redshift documentation

    • cleanUpS3Files — a boolean value used to identify whether the CSV files loaded into S3 should be deleted after they have been imported and merged. If set to true, the files are automatically deleted once the files have been successfully imported into the Redshift staging tables. If set to false, files are not automatically removed.

    • gzipS3Files — setting to true will result in the csv files being gzipped prior to loading into S3 (Default: false)

    • storeCDCIn — a definition table that stores the change data from the load, in addition to importing to staging and base tables. The {schema} and {table} variables will be automatically replaced with the corresponding schema and table name. For more information on keeping CDC information, see Section 4.2.5, “Keeping CDC Information”.
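Because the file must contain either an IAM role or a pair of access keys, it can be convenient to generate it with a small script that enforces that rule. The following Python sketch is an illustration under that assumption. The function name and argument names are hypothetical, and only the options used in the two examples above are emitted.

```python
import json


def build_s3_config(s3_path, access_key=None, secret_key=None,
                    iam_role=None, clean_up=True):
    """Return the text of an s3-config-servicename.json file."""
    config = {"awsS3Path": s3_path}
    if iam_role:
        config["awsIAMRole"] = iam_role
    elif access_key and secret_key:
        config["awsAccessKey"] = access_key
        config["awsSecretKey"] = secret_key
    else:
        raise ValueError(
            "provide either iam_role, or both access_key and secret_key"
        )
    # The examples in this section store this flag as a quoted string.
    config["cleanUpS3Files"] = "true" if clean_up else "false"
    return json.dumps(config, indent=2)
```

The returned text can be written to a file named for the service and then copied into place as described above. One file is still required for each service.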

4.2.2.2. Configuring Identity Access Management within AWS

Identity Management within AWS is complex, but it is a useful and secure way of restricting how services interact with each other, and of restricting user access to the AWS platform.

Tungsten Replicator for Redshift, and Tungsten Replicator for S3, require a certain level of interaction between the replicator and S3, and between Redshift and S3.

If configuring the Tungsten Replicator for S3, you only need to follow the relevant steps to allow the replicator to access, and write to, the S3 bucket that you create.

Note

All versions up to and including Tungsten Replicator version 6.0 can utilise IAM Roles for uploading the csv files to S3, however for loading the data from S3 into Redshift, the only option is to use Access and Secret Keys.

Tungsten Replicator version 6.1 onwards will also allow for the use of IAM Roles for loading data from S3 into Redshift.

To use IAM Roles with Tungsten Replicator you will need to create two roles, with the following recommended policies:

To allow csv files to be loaded up to S3:

  • Role should be associated with the AWS Service: EC2

  • AWS Defined Policy Name: AmazonS3FullAccess, or

  • Define and create your own policy, with, at minimum, the ability to write to the bucket you intend to use for the Redshift Applier

  • Associate this role to the EC2 instance running the Tungsten Replicator software

For use by Redshift COPY command to load csv into staging tables:

  • Role should be associated with the AWS Service: Redshift

  • AWS Defined Policy Name: AmazonS3FullAccess, or

  • Define and create your own policy, with, at minimum, the ability to read from the bucket you intend to use for the Redshift Applier

  • Associate this role to the Redshift Cluster.
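If you define your own policies instead of using AmazonS3FullAccess, each policy is a JSON document scoped to the bucket. The Python sketch below builds two such documents in the standard IAM policy format. The specific actions chosen (s3:PutObject, s3:GetObject and s3:ListBucket) are an assumption about what a minimum policy might contain and are not taken from this manual; for example, removing files after loading may also require s3:DeleteObject. The bucket name tungsten-csv is taken from the earlier s3cmd example, and the function name is hypothetical.

```python
import json


def bucket_policy(bucket, actions):
    """Build an IAM policy document allowing the given actions on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                # The bucket itself (for listing) and the objects within it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


# Assumed minimum for the EC2 role that uploads the CSV files.
write_policy = bucket_policy("tungsten-csv", ["s3:PutObject", "s3:ListBucket"])
# Assumed minimum for the Redshift role used by the COPY command.
read_policy = bucket_policy("tungsten-csv", ["s3:GetObject", "s3:ListBucket"])

write_policy_json = json.dumps(write_policy, indent=2)
```

Test any custom policy against your own deployment before relying on it, and treat the AWS documentation as the authority on the permissions each operation requires.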

Note

For more details and full instructions on creating and managing IAM roles, review the AWS documentation.

4.2.2.3. Amazon Redshift DDL Generation for Amazon Redshift Deployments

In order for the data to be written into the Redshift tables, the tables must be generated. Tungsten Replicator does not replicate the DDL statements between the source and applier in heterogeneous deployments, due to differences in the format of the DDL statements. The supplied ddlscan tool can translate the DDL from the source database into suitable DDL for the target database.

For each database being replicated, DDL must be generated twice, once for the staging tables where the change data is loaded, and again for the live tables. To generate the necessary DDL:

  1. To generate the staging table DDL, ddlscan must be executed on the Extractor host. After the replicator has been installed, the ddlscan can automatically pick up the configuration to connect to the host, or it can be specified on the command line:

    On the source host for each database that is being replicated, run ddlscan using the ddl-mysql-redshift-staging.vm:

    shell> ddlscan -db test -template ddl-mysql-redshift-staging.vm
    DROP TABLE stage_xxx_test.stage_xxx_msg;
    CREATE TABLE stage_xxx_test.stage_xxx_msg
    (
      tungsten_opcode CHAR(2),
      tungsten_seqno INT,
      tungsten_row_id INT,
      tungsten_commit_timestamp TIMESTAMP,
      id INT,
      msg CHAR(80),
      PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
    );

    Check the output to ensure that no errors have been generated during the process. These may indicate datatype limitations that should be identified before continuing. The generated output should be captured and then executed on the Redshift host to create the table.

  2. Once the staging tables have been created, execute ddlscan again using the base table template, ddl-mysql-redshift.vm:

    shell> ddlscan -db test -template ddl-mysql-redshift.vm
    DROP TABLE test.msg;
    CREATE TABLE test.msg
    (
      id INT,
      msg CHAR(80),
      PRIMARY KEY (id)
    );

    Once again, check the output for errors, then capture the output and execute the generated DDL against the Redshift instance.

The DDL templates translate datatypes as directly as possible, with the following caveats:

  • The length of a MySQL VARCHAR column is quadrupled, because MySQL counts characters, while Redshift counts bytes.

  • There is no TIME datatype in Redshift; instead, TIME columns are converted to VARCHAR(17).

  • Primary keys from MySQL are applied into Redshift where possible.

Once the DDL has been generated within the Redshift instance, the replicator will be ready to be installed.
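
The VARCHAR caveat above is simple arithmetic, and it can be useful to check the resulting column sizes before running ddlscan. The following is only a minimal sketch and is not part of Tungsten Replicator: the function name redshift_varchar_bytes is hypothetical, and it models nothing beyond the "quadruple the length" rule stated above (in particular, it ignores any upper limit that Redshift or the template may apply).

```shell
# Hypothetical helper: byte length that a MySQL VARCHAR(n) column needs in
# Redshift under the "quadruple the length" rule described above.
redshift_varchar_bytes() {
    chars=$1
    echo $(( chars * 4 ))
}

# Example: a MySQL VARCHAR(80) column becomes VARCHAR(320) in Redshift.
redshift_varchar_bytes 80   # prints 320
```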

4.2.2.4. Handling Concurrent Writes from Multiple Appliers

The features outlined in this section were specifically introduced in Tungsten Replicator 6.1.4.

Redshift only supports the SERIALIZABLE transaction isolation level, unlike relational databases such as MySQL, which uses REPEATABLE READ by default. Isolation levels determine the behaviour of the database for concurrent access to the tables within transactions.

When loading data into Redshift from multiple appliers, this isolation level can cause locking issues that manifest as errors in the replicator log similar to the following:

Detail: Serializable isolation violation on table - 150379, transactions forming the cycle are: 2356786, 2356787 
» (pid:17914) (../../tungsten-replicator//appliers/batch/redshift.js#219)

In some cases, the replicator will simply retry and carry on successfully, but on very busy systems this can sometimes cause the replicator to fall back into an OFFLINE:ERROR state and manual intervention would be required.

To overcome this problem, the first step is to ensure that each applier has its own set of staging tables that the CSV files are loaded into. By default, all staging tables are named with the prefix stage_xxx_.

To generate the staging tables, you would typically run a ddlscan command similar to the following:

shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/
  » -db hr -template ddl-mysql-redshift-staging.vm > staging.sql

To change the default prefix of the staging tables, for example to stage_nyc_, provide the tablePrefix option to the ddlscan command as follows:

shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/
  » -db hr -template ddl-mysql-redshift-staging.vm -opt tablePrefix stage_nyc_ > staging.sql

This command must be executed for each applier, changing the prefix accordingly. Once this has been done and the tables have been built in Redshift, add the following property to each applier to specify which staging tables to use. The property should be added to the tungsten.ini file, followed by a tpm update:

property=replicator.applier.dbms.stageTablePrefix=stage_nyc_
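
When several appliers are involved, the per-applier ddlscan commands can be generated in a loop. The following is a minimal sketch that only prints the commands so they can be reviewed before being run. The function name print_staging_ddlscan and the site names nyc and lon are hypothetical; the connection details and ddlscan options are copied from the examples above.

```shell
# Print (do not run) one ddlscan command per applier, each with its own
# staging-table prefix. Site names and connection details are examples only.
print_staging_ddlscan() {
    db=$1
    shift
    for site in "$@"; do
        echo "ddlscan -user tungsten -pass secret" \
             "-url jdbc:mysql:thin://db01:3306/" \
             "-db ${db} -template ddl-mysql-redshift-staging.vm" \
             "-opt tablePrefix stage_${site}_ > staging_${site}.sql"
    done
}

# One command for the "nyc" applier and one for the "lon" applier.
print_staging_ddlscan hr nyc lon
```

Each applier would then carry the matching stageTablePrefix property (stage_nyc_ or stage_lon_ in this example) in its own tungsten.ini.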

4.2.2.4.1. Increase load rates

The first and easiest step to try to overcome the isolation errors is to increase the block commit size and the block commit interval. Each system behaves differently, so there is no simple calculation to find the right level. These values should be adjusted in small increments to find the right balance for your system.

Within your configuration, adjust the following two parameters:

  • svc-block-commit-size

  • svc-block-commit-interval
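
Before adjusting these values, it can help to see what is currently configured. The following is a minimal sketch under stated assumptions: the function name show_block_commit is hypothetical, and the pattern accepts both the names listed above and the svc-applier-block-commit-* spelling used in the installation examples in this chapter.

```shell
# List the block commit settings currently present in a tungsten.ini file.
# The pattern matches the option names with or without the "applier-" part.
show_block_commit() {
    grep -E '^svc-(applier-)?block-commit-(size|interval)=' "$1"
}

# Typical call (path taken from the installation steps in this chapter):
# show_block_commit /etc/tungsten/tungsten.ini
```

After editing the values in tungsten.ini, the change is applied with tpm update, as with the other INI changes described in this chapter.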

4.2.2.4.2. Enable Transaction Locking

Within the Redshift applier, it is possible to introduce table locking. This enables multiple appliers to process their own THL and load the transactions without impacting, or being impacted by, other appliers.

This configuration should only be used when multiple appliers are in use. Be aware that table locking could introduce latency when applying to Redshift on extremely busy systems, and could also affect client applications reading the tables, due to Redshift's isolation level. To reduce this impact, combine table locking with an increase in the block commit size and block commit interval properties mentioned above.

There are two table locking approaches. Which one is better for you depends on your environment:

  • Single Lock Table: This approach should be used for appliers in extremely busy systems where a block-commit-size of 500000 or greater does not eliminate isolation errors and where multiple tables are updated within each transaction.

  • One Lock Table per Base Table: This approach should be used for appliers in less busy systems, or where parallel apply has been enabled within the applier, regardless of system activity levels.

To enable the single lock table approach:

  • The following option should be added to the s3-config-servicename.json file:

    "multiServiceTarget": "true"

  • Connect to Redshift with the same account used by the applier, and using the DDL below, create the lock table:

    CREATE TABLE public.tungsten_lock_table
    (
      ID  INT
    );

To enable the lock table per base table approach:

  • The following option should be added to the s3-config-servicename.json file:

    "multiServiceTarget": "true",
      "singleLockTable": "false"

  • Create a lock table for each of the base tables within Redshift. A ddlscan template can be used to generate the DDL. In the following example, the ddlscan command generates lock table DDL for all tables within the hr schema:

    shell> ddlscan -user tungsten -pass secret -url jdbc:mysql:thin://db01:3306/
      » -db hr -template ddl-mysql-redshift-lock.vm > outfile.sql

    Execute the output from ddlscan against Redshift.

After enabling either of the above methods, if replication has already been installed, restart the replicator by issuing the following:

shell> replicator restart
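
To show where the locking options sit within the JSON file, the following minimal sketch writes a complete file for the lock-table-per-base-table approach. It is an illustration under stated assumptions, not a definitive file: the function name write_s3_config is hypothetical, the bucket path and keys are placeholders copied from the s3-config example in Section 4.2.5, and a real deployment may need further fields.

```shell
# Write an s3-config file that enables one lock table per base table.
# Bucket path and keys are placeholders; replace them with real values.
write_s3_config() {
    cat > "$1" <<'EOF'
{
  "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
  "awsAccessKey" : "access-key-id",
  "awsSecretKey" : "secret-access-key",
  "multiServiceTarget" : "true",
  "singleLockTable" : "false"
}
EOF
}

# Example: generate the file locally, then copy it into the share directory.
# write_s3_config s3-config-servicename.json
```

For the single lock table approach, the singleLockTable line would be omitted and the comma after the multiServiceTarget entry removed.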

4.2.3. Install Amazon Redshift Applier

Replication into Redshift requires two separate replicator installations, one that extracts information from the source database, and a second that generates the CSV files, loads those files into S3 and then executes the statements on the Redshift database to import the CSV data and apply the transformations to build the final tables.

The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.

Once you have completed the configuration of the Amazon Redshift database, you can configure and install the applier as described using the steps below.

  1. Before installing the applier, the following parameter must be added to the extractor configuration.

    Add the following to /etc/tungsten/tungsten.ini:

    [alpha]
    ...Existing Replicator Config...
    enable-heterogeneous-service=true
    
    shell> tpm update

    Note

    The above step is only applicable for standalone extractors. If you are configuring replication from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”

  2. The applier can now be configured. Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  3. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10
  4. Configure the installation using tpm:

    The configuration is shown first as staging-method commands, and then as the equivalent INI file:

    shell> ./tools/tpm configure defaults \
        --reset \
        --user=tungsten \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --topology=master-slave \
        --master=sourcehost \
        --members=localhost \
        --datasource-type=redshift \
        --replication-host=redshift.us-east-1.redshift.amazonaws.com \
        --replication-user=awsRedshiftUser \
        --replication-password=awsRedshiftPass \
        --redshift-dbname=dev \
        --batch-enabled=true \
        --batch-load-template=redshift \
        --svc-applier-filters=dropstatementdata \
        --svc-applier-block-commit-interval=30s \
        --svc-applier-block-commit-size=250000
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    user=tungsten
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    topology=master-slave
    master=sourcehost
    members=localhost
    datasource-type=redshift
    replication-host=redshift.us-east-1.redshift.amazonaws.com
    replication-user=awsRedshiftUser
    replication-password=awsRedshiftPass
    redshift-dbname=dev
    batch-enabled=true
    batch-load-template=redshift
    svc-applier-filters=dropstatementdata
    svc-applier-block-commit-interval=30s
    svc-applier-block-commit-size=250000
    

  5. If your MySQL source is a Tungsten Cluster, ensure the additional steps below are also included in your applier configuration.

    First, prepare the required filter configuration file as follows on the Redshift applier host(s) only:

    shell> mkdir -p /opt/continuent/share/
    shell> cp tungsten-replicator/support/filters-config/convertstringfrommysql.json /opt/continuent/share/

    Then, include the following parameters in the configuration:

    property=replicator.stage.remote-to-thl.filters=convertstringfrommysql
    property=replicator.filter.convertstringfrommysql.definitionsFile=/opt/continuent/share/convertstringfrommysql.json
    
  6. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  7. Once the prerequisites have been met and the installation has been configured, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

On the host that is loading data into Redshift, create the s3-config-servicename.json file and then copy that file into the share directory within the installed directory on that host. For example:

shell> cp s3-config-servicename.json /opt/continuent/share/

Now the services can be started:

shell> replicator start

Once the service is configured and running, the service can be monitored as normal using the trepctl command. See Section 4.2.6, “Management and Monitoring of Amazon Redshift Deployments” for more information.

4.2.4. Verifying your Redshift Installation

  1. Create a database within your source MySQL instance:

    mysql> CREATE DATABASE redtest;
  2. Create a table within your source MySQL instance:

    mysql> CREATE TABLE redtest.msg (id INT PRIMARY KEY AUTO_INCREMENT,msg CHAR(80));
  3. Create a schema for the tables:

    redshift> CREATE SCHEMA redtest;
  4. Create a staging table within your Redshift instance:

    redshift> CREATE TABLE redtest.stage_xxx_msg (tungsten_opcode CHAR(1), \
        tungsten_seqno INT, tungsten_row_id INT,tungsten_date CHAR(30),id INT,msg CHAR(80));
  5. Create the target table:

    redshift> CREATE TABLE redtest.msg (id INT,msg CHAR(80));
  6. Insert some data within your MySQL source instance:

    mysql> INSERT INTO redtest.msg VALUES (0,'First');
    Query OK, 1 row affected (0.04 sec)
    
    mysql> INSERT INTO redtest.msg VALUES (0,'Second');
    Query OK, 1 row affected (0.04 sec)
    
    mysql> INSERT INTO redtest.msg VALUES (0,'Third');
    Query OK, 1 row affected (0.04 sec)
    
    mysql> UPDATE redtest.msg SET msg = 'This is the first update of the second row' WHERE ID = 2;
  7. Check the replicator status on the applier (host2):

    shell> trepctl status

    There should be 5 transactions replicated.

  8. Check the table within Redshift:

    redshift> SELECT * FROM redtest.msg;
    1	First
    3	Third
    2	This is the first update of the second row

4.2.5. Keeping CDC Information

The Redshift applier can keep the CDC data, that is, the raw CDC CSV data that is recorded and replicated during the loading process, rather than simply cleaning up the CDC files and deleting them. The CDC data can be useful if you want to be able to monitor data changes over time.

The process works as follows:

  1. Batch applier generates CSV files.

  2. Batch applier loads the CSV data into the staging tables.

  3. Batch applier loads the CSV data into the CDC tables.

  4. Staging data is merged with the base table data.

  5. Staging data is deleted.

Unlike the staging and base table information, the data in the CDC tables is kept forever, without removing any of the processed information. Using this data you can report on change information over time for different data sets, or even recreate datasets at a specific time by using the change information.

To enable this feature:

  1. When creating the DDL for the staging and base tables, also create the table information for the CDC data for each table. The actual format of the information is the same as the staging table data, and can be created using ddlscan:

    shell> ddlscan -service my_red -db test \
        -template ddl-mysql-redshift-staging.vm \
        -opt renameSchema cdc_{schema} -opt renameTable {table}_cdc
  2. In the configuration file, s3-config-svc.json for each service, specify the name of the table to be used when storing the CDC information using the storeCDCIn field. This should specify the table template to be used, with the schema and table name being automatically replaced by the load script. The structure should match the structure used by ddlscan to define the CDC tables:

    {
      "awsS3Path" : "s3://your-bucket-for-redshift/redshift-test",
      "awsAccessKey" : "access-key-id",
      "awsSecretKey" : "secret-access-key",
      "storeCDCIn" : "cdc_{schema}.{table}_cdc"
    }
  3. Restart the replicator using replicator restart to update the configuration.

4.2.6. Management and Monitoring of Amazon Redshift Deployments

Monitoring an Amazon Redshift replication scenario requires checking the status of both the Extractor, which extracts data from MySQL, and the Applier, which retrieves the remote THL information and applies it to Amazon Redshift. On the Extractor:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000006:0000000000002857;-1
appliedLastSeqno       : 15
appliedLatency         : 1.918
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000006:0000000000002857
currentTimeMillis      : 1407336195165
dataServerHost         : redshift1
extensions             : 
host                   : redshift1
latestEpochNumber      : 8
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://redshift1:2112/
maximumStoredSeqNo     : 15
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://redshift1:3306/tungsten_alpha
relativeLatency        : 35.164
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : redshift1
state                  : ONLINE
timeInStateSeconds     : 34.807
transitioningTo        : 
uptimeSeconds          : 36.493
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

On the Applier, the output of trepctl shows the current sequence number and applier status:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000006:0000000000002857;-1
appliedLastSeqno       : 15
appliedLatency         : 154.748
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : 1
clusterName            : alpha
currentEventId         : NONE
currentTimeMillis      : 1407336316454
dataServerHost         : redshift.us-east-1.redshift.amazonaws.com
extensions             : 
host                   : redshift.us-east-1.redshift.amazonaws.com
latestEpochNumber      : 8
masterConnectUri       : thl://redshift1:2112/
masterListenUri        : null
maximumStoredSeqNo     : 15
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://redshift1:2112/
relativeLatency        : 156.454
resourcePrecedence     : 99
rmiPort                : 10000
role                   : slave
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : redshift.us-east-1.redshift.amazonaws.com
state                  : ONLINE
timeInStateSeconds     : 2.28
transitioningTo        : 
uptimeSeconds          : 524104.751
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

The appliedLastSeqno values on the Extractor and the Applier should match as normal. Because of the batching of transactions, the appliedLatency may be much higher than in a normal MySQL to MySQL replication.

The batch loading parameters controlling the batching of data can be tuned and updated by studying the output from the trepsvc.log log file. The log will show a line containing the number of rows updated:

INFO  scripting.JavascriptExecutor COUNT: 4

See Section 12.1, “Block Commit” for more information on these parameters.
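
The COUNT lines can be summarised to see how many rows each batch is carrying. The following is a minimal sketch: the function name sum_batch_counts is hypothetical, and it assumes only that each relevant log line contains the text "scripting.JavascriptExecutor COUNT:" followed by the row count as the last field, as in the line shown above.

```shell
# Sum the row counts reported on "COUNT:" lines of a replicator log file.
sum_batch_counts() {
    awk '/scripting\.JavascriptExecutor COUNT:/ { total += $NF; n++ }
         END { printf "batches=%d rows=%d\n", n, total }' "$1"
}

# Example (the log location depends on your installation directory):
# sum_batch_counts /path/to/trepsvc.log
```

A low average number of rows per batch on a busy system may suggest that the block commit size or interval could be increased.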

4.3. Deploying the Vertica Applier

Hewlett-Packard's Vertica provides support for BigData, SQL-based analysis and processing. Integration with MySQL enables data to be replicated live from the MySQL database directly into Vertica without the need to manually export and import the data.

Replication to Vertica operates as follows:

  • Data is extracted from the source database into THL.

  • When extracting the data from the THL, the Vertica replicator writes the data into CSV files according to the name of the source tables. The files contain all of the row-based data, including the global transaction ID generated by Tungsten Replicator during replication, and the operation type (insert, delete, etc) as part of the CSV data.

  • The CSV data is then loaded into Vertica into staging tables.

  • SQL statements are then executed to update the live version of the tables using the batch-loaded CSV data. Updates are performed by deleting the old rows and inserting the new data, which works effectively within the confines of Vertica operation.

Figure 4.4. Topologies: Replicating to Vertica

Setting up replication requires setting up both the Extractor and Applier components as two different configurations, one for MySQL and the other for Vertica. Replication also requires some additional steps to ensure that the Vertica host is ready to accept the replicated data that has been extracted. Tungsten Replicator provides all the tools required to perform these operations during the installation and setup.

4.3.1. Preparing for Vertica Deployments

Preparing the hosts for the replication process requires setting some key configuration parameters within the MySQL server to ensure that data is stored and written correctly. On the Vertica side, the database and schema must be created using the existing schema definition so that the databases and tables exist within Vertica.

Source Host

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

Vertica Host

On the Vertica host, you need to perform some preparation of the destination database, first creating the database, and then creating the tables that are to be replicated.

  • Create a database (if you want to use a different one than those already configured), and a schema that will contain the Tungsten data about the current replication position:

    shell> vsql -Udbadmin -wsecret bigdata
    Welcome to vsql, the Vertica Analytic Database v5.1.1-0 interactive terminal.
    
    Type:  \h for help with SQL commands
           \? for help with vsql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
    bigdata=> create schema tungsten_alpha;

    The schema will be used only by Tungsten Replicator to store metadata about the replication process.

  • Locate the Vertica JDBC driver. This can be downloaded separately from the Vertica website. The driver will need to be copied into the Tungsten Replicator lib directory.

    shell> cp vertica-jdbc-7.1.2-0.jar tungsten-replicator-7.1.4-10/tungsten-replicator/lib/
  • You need to create tables within Vertica according to the databases and tables that need to be replicated; the tables are not automatically created for you. From a Tungsten Replicator deployment directory, the ddlscan command can be used to identify the existing tables, and create table definitions for use within Vertica.

    To use ddlscan, the template for Vertica must be specified, along with the user/password information to connect to the source database to collect the schema definitions. The tool should be run from the templates directory.

    The tool will need to be executed twice. The first run generates the live table definitions:

    shell> cd tungsten-replicator-7.1.4-10
    shell> cd tungsten-replicator/samples/extensions/velocity/
    shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/access_log' -pass password \
        -template ddl-mysql-vertica.vm -db access_log
    /*
    SQL generated on Fri Sep 06 14:37:40 BST 2013 by ./ddlscan utility of Tungsten
    
    url = jdbc:mysql:thin://host1:13306/access_log
    user = tungsten
    dbName = access_log
    */
    CREATE SCHEMA access_log;
    
    DROP TABLE access_log.access_log;
    
    CREATE TABLE access_log.access_log
    (
      id INT ,
      userid INT ,
      datetime INT ,
      session CHAR(30) ,
      operation CHAR(80) ,
      opdata CHAR(80)  ) ORDER BY id;
    ...

    The output should be redirected to a file and then used to create tables within Vertica:

    shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/access_log' -pass password \
        -template ddl-mysql-vertica.vm -db access_log >access_log.ddl

    The output of the command should be checked to ensure that the table definitions are correct.

    The file can then be applied to Vertica:

    shell> cat access_log.ddl | vsql -Udbadmin -wsecret bigdata

    This generates the table definitions for live data. The process should be repeated to create the table definitions for the staging data by using the staging template:

    shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/access_log' -pass password \
        -template ddl-mysql-vertica-staging.vm -db access_log >access_log.ddl-staging

    Then applied to Vertica:

    shell> cat access_log.ddl-staging | vsql -Udbadmin -wsecret bigdata

    The process should be repeated for each database that will be replicated.
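
When several databases are replicated, the live and staging steps above can be listed for each database in one pass. The following is a minimal sketch that only prints the commands for review. The function name print_vertica_ddl_steps, the second database name sales, and the output file naming are hypothetical; the hosts, credentials, and templates are copied from the examples above.

```shell
# Print (do not run) the ddlscan and vsql commands for the live and
# staging tables of each database named on the command line.
print_vertica_ddl_steps() {
    for db in "$@"; do
        for tmpl in ddl-mysql-vertica ddl-mysql-vertica-staging; do
            echo "ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/${db}'" \
                 "-pass password -template ${tmpl}.vm -db ${db} > ${db}.${tmpl}.sql"
            echo "cat ${db}.${tmpl}.sql | vsql -Udbadmin -wsecret bigdata"
        done
    done
}

# Four commands are printed per database: generate and apply, live then staging.
print_vertica_ddl_steps access_log sales
```

As described above, each generated file should still be checked before it is applied to Vertica.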

Once the MySQL and Vertica databases have been prepared, you can proceed to installing Tungsten Replicator.

4.3.2. Install Vertica Applier

  1. Before installing the applier, the following parameter must be added to the extractor configuration.

    Add the following to /etc/tungsten/tungsten.ini:

    [alpha]
    ...Existing Replicator Config...
    enable-heterogeneous-service=true
    
    shell> tpm update

    Note

    The above step is only applicable for standalone extractors. If you are configuring replication from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”

  2. The applier can now be configured.

    Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  3. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10
  4. Locate the Vertica JDBC driver. This can be downloaded separately from the Vertica website. The driver will need to be copied into the Tungsten Replicator lib directory.

    shell> cp vertica-jdbc-7.1.2-0.jar tungsten-replicator-7.1.4-10/tungsten-replicator/lib/
  5. Configure the installation using tpm:

    The configuration is shown first as staging-method commands, and then as the equivalent INI file:

    shell> ./tools/tpm configure defaults \
        --reset \
        --user=tungsten \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --skip-validation-check=HostsFileCheck \
        --skip-validation-check=InstallerMasterSlaveCheck \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --topology=master-slave \
        --master=sourcehost \
        --members=localhost \
        --datasource-type=vertica \
        --replication-user=dbadmin \
        --replication-password=password \
        --vertica-dbname=dev \
        --batch-enabled=true \
        --batch-load-template=vertica6 \
        --batch-load-language=js \
        --replication-port=5433 \
        --svc-applier-filters=dropstatementdata \
        --svc-applier-block-commit-interval=30s \
        --svc-applier-block-commit-size=25000 \
        --disable-relay-logs=true
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    user=tungsten
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    skip-validation-check=HostsFileCheck
    skip-validation-check=InstallerMasterSlaveCheck
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    topology=master-slave
    master=sourcehost
    members=localhost
    datasource-type=vertica
    replication-user=dbadmin
    replication-password=password
    vertica-dbname=dev
    batch-enabled=true
    batch-load-template=vertica6
    batch-load-language=js
    replication-port=5433
    svc-applier-filters=dropstatementdata
    svc-applier-block-commit-interval=30s
    svc-applier-block-commit-size=25000
    disable-relay-logs=true
    

  6. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  7. Once the prerequisites have been met and the installation has been configured, the software can be installed:

    shell> ./tools/tpm install

If you encounter problems during the installation, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the service is configured and running, the service can be monitored as normal using the trepctl command. See Section 4.3.3, “Management and Monitoring of Vertica Deployments” for more information.

4.3.3. Management and Monitoring of Vertica Deployments

Monitoring a Vertica replication scenario requires checking the status of both the Extractor, which extracts data from MySQL, and the Applier, which retrieves the remote THL information and applies it to Vertica. On the Extractor:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000012:0000000128889042;0
appliedLastSeqno       : 1070
appliedLatency         : 22.537
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000012:0000000128889042
currentTimeMillis      : 1378489888477
dataServerHost         : mysqldb01
extensions             :
latestEpochNumber      : 897
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://mysqldb01:2112/
maximumStoredSeqNo     : 1070
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://mysqldb01:13306/
relativeLatency        : 691980.477
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : mysqldb01
state                  : ONLINE
timeInStateSeconds     : 694039.058
transitioningTo        :
uptimeSeconds          : 694041.81
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

On the Applier, the output of trepctl shows the current sequence number and applier status:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000012:0000000128889042;0
appliedLastSeqno       : 1070
appliedLatency         : 78.302
channels               : 1
clusterName            : default
currentEventId         : NONE
currentTimeMillis      : 1378479271609
dataServerHost         : vertica01
extensions             :
latestEpochNumber      : 897
masterConnectUri       : thl://mysqldb01:2112/
masterListenUri        : null
maximumStoredSeqNo     : 1070
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://mysqldb01:2112/
relativeLatency        : 681363.609
resourcePrecedence     : 99
rmiPort                : 10000
role                   : slave
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : vertica01
state                  : ONLINE
timeInStateSeconds     : 681486.806
transitioningTo        :
uptimeSeconds          : 689922.693
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

The appliedLastSeqno values on the Extractor and the Applier should match as normal. Because of the batching of transactions, the appliedLatency may be much higher than in a normal MySQL to MySQL replication.

4.3.4. Troubleshooting Vertica Installations

The following items detail some of the more common problems with replication through to Vertica. Often the underlying issue is related to the data types, the data format, or the number of columns.

  • If the following is reported by the replicator:

    pendingError           : Replicator unable to go online due to error »
      Operation failed: Online operation failed (Unable to prepare plugin: class »
      name=com.continuent.tungsten.replicator.datasource.DataSourceService »
      message=[Unable to load driver: com.vertica.jdbc.Driver])
    state                  : OFFLINE:ERROR

    The Vertica JDBC driver is missing from the installation. The Vertica JDBC JAR file must be placed into the tungsten-replicator/lib directory within the release directory before running tpm update or tpm install.

  • The following error:

    pendingExceptionMessage: Invalid write to CSV file: name=/opt/continuent/tmp/staging/alpha/staging0/test-msg-1.csv »
      table=test.msg table_columns=schemaname,schemahash csv_columns=tungsten_opcode,tungsten_seqno, »
      tungsten_row_id,tungsten_commit_timestamp,nullschemaname,schemahash

    Indicates that the source THL has not been marked up correctly. Either the colnames filter has not been enabled, or --enable-batch-service has not been configured during installation. This means that the source THL is not being populated with the right information: either the full list of columns is missing, or the column names and primary key information are incorrect. The configuration should be updated, and then the THL on both the Extractor and Applier should be recreated by using trepctl reset.

  • If you get an error similar to the following:

    pendingExceptionMessage: CSV loading failed: schema=test table=msg CSV »
     file=/opt/continuent/tmp/staging/alpha/staging0/test-msg-1.csv »
     message=com.continuent.tungsten.replicator.ReplicatorException: Incoming table data »
     has no primary keys: test.msg »
     (/opt/continuent/tungsten/tungsten-replicator/appliers/batch/vertica6.js#70)

    Either the pkey filter has not been enabled, or the tables on the source database do not contain primary keys. This means that the source THL is not being populated with the primary key information from the table, which is required in order to load into Vertica through the batch mechanism. The configuration should be updated, and then the THL on both the Extractor and Applier should be recreated by using trepctl reset.

  • The following error indicates that the incoming data could not be loaded into the staging table within Vertica:

    pendingError  : Stage task failed: q-to-dbms
    pendingExceptionMessage: CSV loading failed: schema=blog table=article CSV »
      file=/tmp/staging/alpha/staging0/blog-article-432.csv »
      message=com.continuent.tungsten.replicator.ReplicatorException:
      LOAD DATA ROW count does not match: sql=COPY blog.stage_xxx_article »
      FROM '/tmp/staging/alpha/staging0/blog-article-432.csv' »
      DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"' »
      expected_copy_rows=3614 rows=2233 ; exceptions are in »
      /tmp/tungsten_vertica_blog.article.exceptions »
      (../../tungsten-replicator//samples/scripts/batch/vertica6.js#67)

    There are a number of possible reasons for this. The actual reasons can be found in the generated exceptions file, whose location is given in the error message; in this example, /tmp/tungsten_vertica_blog.article.exceptions. Possible reasons include:

    • Mismatch in the number of columns in the source file and the target table. Check the source and target tables match, including the four special fields used in all staging tables.

    • Mismatch in the data types of one or more of the columns in the target table. Check that the source and target table definitions match, or at least support the corresponding data; for example, that the column size, length, and format are compatible. Loading character data into numeric columns, or floating point values into integer columns, is not supported.

    • Badly formatted CSV file. This happens when the incoming data contains newlines, commas, or other data that is incompatible with the CSV format. The CSV file should have been kept; its location is also in the error message. Examine the file and check the format. You may need to enable filters to modify and 'clean' the data so that it is more compatible with the CSV format.

  • Remember that changes to the DDL within the source database are not automatically replicated to Vertica. Changes to the table definitions, additional tables, or additional databases, must all be updated manually within Vertica.

  • If you get errors similar to:

    stage_xxx_access_log does not exist

    When loading into Vertica, it means that the staging tables have not been created correctly. Check the steps for creating the staging tables using ddlscan in Section 4.3.1, “Preparing for Vertica Deployments”.

  • Replication may fail if date types contain zero values, which are legal in MySQL. For example, the timestamp 0000-00-00 00:00:00 is valid in MySQL. An error reporting a mismatch in the values will be reported when applying the data into Vertica, for example:

    ERROR 2631:  Column "time" is of type timestamp but expression is of type int
    HINT:  You will need to rewrite or cast the expression

    Or:

    ERROR 2992:  Date/time field value out of range: "0"
    HINT:  Perhaps you need a different "datestyle" setting

    To address this error, use the zerodate2null filter, which translates zero-value dates into a valid NULL value. This can be enabled by adding the zerodate2null filter to the applier stage when configuring the service using tpm:

    shell> ./tools/tpm update alpha --repl-svc-applier-filters=zerodate2null
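
When investigating the row count mismatch described earlier in this section, it can help to check whether every row in the retained staging CSV file has the expected number of fields. The following is a minimal sketch, not part of Tungsten Replicator. It assumes the delimiter and quote character shown in the COPY statement of the error message, and the sample rows (including the opcode value) are hypothetical:

```python
import csv
import io


def find_bad_rows(csv_file, expected_fields):
    """Return (row_number, field_count) for rows with an unexpected width."""
    # Delimiter and quote character follow the COPY statement in the error
    # message: DELIMITER ',' ENCLOSED BY '"'.
    reader = csv.reader(csv_file, delimiter=",", quotechar='"')
    bad = []
    for number, row in enumerate(reader, start=1):
        if len(row) != expected_fields:
            bad.append((number, len(row)))
    return bad


# Hypothetical sample: four staging fields plus two data columns (id, msg).
# The third row contains an unquoted comma, so it splits into seven fields.
sample = io.StringIO(
    '"I","1","1","2017-07-13 15:30:37","1","Hello"\n'
    '"I","2","2","2017-07-13 15:30:38","2","Hello, world"\n'
    '"I","3","3","2017-07-13 15:30:39","3",Hello, world\n'
)
print(find_bad_rows(sample, expected_fields=6))
```

To check a real file, open it with open(path, newline="") and pass the file object in place of the sample. The expected field count is the number of columns in the target table plus the four special staging fields.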

4.4. Deploying the Kafka Applier

Kafka is a highly scalable messaging platform that provides a method for distributing information through a series of messages organised by a specified topic. With Tungsten Replicator the incoming stream of data from the upstream replicator is converted, on a row by row basis, into a JSON document that contains the row information. A new message is created for each row, even from multiple-row transactions.

Deployment of a Tungsten Replicator to Kafka service is slightly different to other appliers. There are two parts to the process:

  • Service Alpha on the Extractor, extracts the information from the MySQL binary log into THL.

  • Service Alpha on the Applier, reads the information from the remote replicator as THL, and applies that to Kafka.

Figure 4.5. Topologies: Replicating to Kafka

With the Kafka applier, information is extracted from the source database using the row format, column names and primary keys are identified, and the data is translated to a JSON format and then embedded into a larger Kafka message. The topic used is either composed from the schema name or can be configured to use an explicit topic type, and the generated information included in the Kafka message can include the source schema, table, and commit time information.

The transfer operates as follows:

  1. Data is extracted from MySQL using the standard extractor, reading the row change data from the binlog.

  2. The ColumnName filter (see Section 11.4.5, “ColumnName Filter”) is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.

    The Section 11.4.32, “PrimaryKey Filter” filter is used to add primary key information to row-based replication data.

  3. The THL information is then applied to Kafka using the Kafka applier.

There are some additional considerations when applying to Kafka that should be taken into account:

  • Because Kafka is a message queue and not a database, traditional transactional semantics are not supported. This means that although the data will be applied to Kafka as a message, there is no guarantee of transactional consistency. By default, the applier will ensure that the message has been correctly received by the Kafka service; it is the responsibility of the Kafka environment and configuration to ensure delivery. The replicator.applier.dbms.requireacks property can be used to control how many acknowledgements must be received from the Kafka service.

  • One message is sent for each row of source information in each transaction. For example, if 20 rows have been inserted or updated in a single transaction, then 20 separate Kafka messages will be generated.

  • A separate message is broadcast for each operation, and includes the operation type. A single message will be broadcast for each row for each operation. So if 20 rows are deleted, 20 messages are generated, each with the operation type.

  • If replication fails in the middle of a large transaction, and the replicator goes OFFLINE, when the replicator goes online it may resend rows and messages.

The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.

4.4.1. Preparing for Kafka Replication

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

In general, it is easiest to think of each row within a MySQL table as being converted into a single message on the Kafka side. The topic used is made up of the schema name and table name. The message ID is composed of the primary key information by default, but can optionally include the schema and table name as well.

For example, the following row within MySQL:

mysql> select * from msg where id = 2384726 \G
*************************** 1. row ***************************
 id: 2384726
msg: Hello Kafka
1 row in set (0.00 sec)

Is replicated into Kafka as a Kafka message using the topic test_msg:

{
   "_seqno" : "4865",
   "_source_table" : "msg",
   "_committime" : "2017-07-13 15:30:37.0",
   "_source_schema" : "test",
   "record" : {
      "msg" : "Hello Kafka",
      "id" : "2384726"
   },
   "_optype" : "INSERT"
}

In the output, the record contains the actual record data. The other fields in the message are:

  • _seqno — the THL sequence number of the transaction.

  • _source_table — the source table. Inclusion of this information is optional.

  • _committime — the original transaction commit time. Inclusion of this information is optional.

  • _source_schema — the source schema. Inclusion of this information is optional.

  • _optype — the operation type (INSERT, UPDATE, DELETE).

When preparing the hosts you must be aware of this translation of the different structures, as it will have an effect on the way the information is replicated from MySQL to Kafka.
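
The translation described above can be sketched as follows. This is an illustration of the mapping only, not the applier's implementation: the build_message function is hypothetical, and it assumes the default behaviour shown in the example, where the topic is the schema and table name joined by an underscore and column values are written as strings.

```python
import json


def build_message(schema, table, row, seqno, commit_time, optype):
    """Return (topic, payload) for one row change, mirroring the example above."""
    topic = f"{schema}_{table}"  # topic composed from schema and table name
    payload = {
        "_seqno": str(seqno),
        "_source_table": table,
        "_committime": commit_time,
        "_source_schema": schema,
        # Column values appear as strings in the example message.
        "record": {column: str(value) for column, value in row.items()},
        "_optype": optype,
    }
    return topic, json.dumps(payload, indent=3)


topic, payload = build_message(
    "test", "msg", {"msg": "Hello Kafka", "id": 2384726},
    seqno=4865, commit_time="2017-07-13 15:30:37.0", optype="INSERT",
)
print(topic)
print(payload)
```

Each row of a multi-row transaction would pass through this mapping separately, which is why a transaction that changes 20 rows produces 20 messages.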

MySQL Host

The data replicated from MySQL can be any data, although there are some known limitations and assumptions made on the way the information is transferred.

When configuring the extractor database and host, ensure that the heterogeneous-specific prerequisites have been included; see Section B.4.4, “MySQL Configuration for Heterogeneous Deployments”.

For the best results when replicating, be aware of the following issues and limitations:

  • Use primary keys on all tables. The use of primary keys will improve the lookup of information within Kafka when rows are updated. Without a primary key on a table a full table scan is performed, which can affect performance.

  • MySQL TEXT columns are correctly replicated, but cannot be used as keys.

  • MySQL BLOB columns are converted to text using the configured character type. Depending on the data that is being stored within the BLOB, the data may need to be custom converted. A filter can be written to convert and reformat the content as required.

Kafka Host

On the Kafka side, status information is stored into the Zookeeper instance used for configuring Kafka, and the Zookeeper and Kafka instances must be up and running before the replicator is first started. There are no specific configuration elements required on the Kafka host.

4.4.2. Install Kafka Applier

Installation of the Kafka replication requires special configuration of the Extractor and Applier hosts so that each is configured for the correct datasource type.

  1. Before installing the applier, apply the following parameter to the extractor configuration.

    Add the following to /etc/tungsten/tungsten.ini:

    [alpha]
    ...Existing Replicator Config...
    enable-heterogeneous-service=true
    
    shell> tpm update

    Note

    The above step is only applicable for standalone extractors. If you are configuring replications from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”

  2. Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz

  3. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10

  4. Configure the installation using tpm:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=sourcehost \
        --members=localhost \
        --datasource-type=kafka \
        --replication-user=root \
        --replication-password=null \
        --replication-port=9092 \
        --property=replicator.applier.dbms.zookeeperString=localhost:2181 \
        --property=replicator.applier.dbms.requireacks=1
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=sourcehost
    members=localhost
    datasource-type=kafka
    replication-user=root
    replication-password=null
    replication-port=9092
    property=replicator.applier.dbms.zookeeperString=localhost:2181
    property=replicator.applier.dbms.requireacks=1
    

  5. If your MySQL source is a Tungsten Cluster, ensure that the additional steps below are also included in your applier configuration.

    First, prepare the required filter configuration file as follows on the Kafka applier host(s) only:

    shell> mkdir -p /opt/continuent/share/
    shell> cp tungsten-replicator/support/filters-config/convertstringfrommysql.json /opt/continuent/share/

    Then, include the following parameters in the configuration:

    property=replicator.stage.remote-to-thl.filters=convertstringfrommysql
    property=replicator.filter.convertstringfrommysql.definitionsFile=/opt/continuent/share/convertstringfrommysql.json
    
  6. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  7. Once the prerequisites and the configuration of the installation have been completed, the software can be installed:

    shell> ./tools/tpm install

If you encounter problems during the installation, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the service is configured and running, the service can be monitored as normal using the trepctl command. See Section 4.4.3, “Management and Monitoring of Kafka Deployments” for more information.

4.4.2.1. Optional Configuration Parameters for Kafka

A number of optional, configurable properties are available that control how Tungsten Replicator applies and populates information when the data is written into Kafka. The following properties can be set during configuration using --property=PROPERTYNAME=value:

Table 4.1. Optional Kafka Applier Properties

Option                                      Description
replicator.applier.dbms.embedCommitTime     Sets whether the commit time for the source row is embedded into the document
replicator.applier.dbms.embedSchemaTable    Embed the source schema name and table name in the stored document
replicator.applier.dbms.enabletxinfo.kafka  Embeds transaction information (generated by the rowaddtxninfo filter) into each Kafka message
replicator.applier.dbms.enabletxninfoTopic  Embeds transaction information into a separate Kafka message broadcast on an independent channel from the one used by the actual database data. One message is sent per transaction or THL event.
replicator.applier.dbms.keyFormat           Determines the format of the message ID
replicator.applier.dbms.requireacks         Defines how many acknowledgements from Kafka nodes are required when writing messages to the Kafka cluster
replicator.applier.dbms.retrycount          The number of retries for sending each message
replicator.applier.dbms.txninfoTopic        Sets the topic name for transaction messages
replicator.applier.dbms.zookeeperString     Connection string for Zookeeper, including hostname and port

replicator.applier.dbms.embedCommitTime

Option        replicator.applier.dbms.embedCommitTime
Description   Sets whether the commit time for the source row is embedded into the document
Value Type    boolean
Default       true
Valid Values  false  Do not embed the source database commit time
              true   Embed the source database commit time into the stored document

Embeds the commit time of the source database row into the document information:

{
   "_seqno" : "4865",
   "_source_table" : "msg",
   "_committime" : "2017-07-13 15:30:37.0",
   "_source_schema" : "test",
   "record" : {
      "msg" : "Hello Kafka",
      "id" : "2384726"
   },
   "_optype" : "INSERT"
}

replicator.applier.dbms.embedSchemaTable

Option        replicator.applier.dbms.embedSchemaTable
Description   Embed the source schema name and table name in the stored document
Value Type    boolean
Default       true
Valid Values  false  Do not embed the schema or table name in the document
              true   Embed the source schema name and table name into the stored document

If enabled, the message sent to Kafka will include the source schema and table name. This can be used by consumers to identify the source of the information.

{
   "_seqno" : "4865",
   "_source_table" : "msg",
   "_committime" : "2017-07-13 15:30:37.0",
   "_source_schema" : "test",
   "record" : {
      "msg" : "Hello Kafka",
      "id" : "2384726"
   },
   "_optype" : "INSERT"
}

replicator.applier.dbms.enabletxinfo.kafka

Option        replicator.applier.dbms.enabletxinfo.kafka
Description   Embeds transaction information (generated by the rowaddtxninfo filter) into each Kafka message
Value Type    boolean
Default       false
Valid Values  false  Do not include transaction information in each Kafka message
              true   Embed transaction information into each Kafka message

Embeds information about the entire transaction into each message sent, using the data provided by the rowaddtxninfo filter and other information embedded in each THL event. The transaction information includes row counts, the event ID, and the tables modified. Since one message is normally sent for each row of data, adding the information about the full transaction to each message makes it possible to validate and identify which other messages belong to the same transaction when the messages are re-assembled by a Kafka client.

For example, when looking at a single message in Kafka, the message includes a txnInfo section:

{
   "_source_table" : "msg",
   "_committime" : "2018-03-07 12:53:21.0",
   "record" : {
      "msg2" : "txinfo",
      "id" : "109",
      "msg" : "txinfo"
   },
   "_optype" : "INSERT",
   "_seqno" : "164",
   "txnInfo" : {
      "schema" : [
         {
            "schemaName" : "msg",
            "rowCount" : "1",
            "tableName" : "msg"
         },
         {
            "rowCount" : "2",
            "schemaName" : "msg",
            "tableName" : "msgsub"
         }
      ],
      "serviceName" : "alpha",
      "totalCount" : "3",
      "tungstenTransId" : "164",
      "firstRecordInTransaction" : "true"
   },
   "_source_schema" : "msg"
}

This block of the overall message includes the following objects and information:

  • schema

    An array of the row counts within this transaction, with a row count included for each schema and table.

  • serviceName

    The name of the Tungsten Replicator service that generated the message.

  • totalCount

    The total number of rows modified within the entire transaction.

  • firstRecordInTransaction

    If this field exists, it should always be set to true and indicates that this message was generated by the first row inserted, updated or deleted in the overall transaction. This effectively indicates the start of the overall transaction.

  • lastRecordInTransaction

    If this field exists, it should always be set to true and indicates that this message was generated by the last row inserted, updated or deleted in the overall transaction. This effectively indicates the end of the overall transaction.

Note that this information block is included in every message for each row within an overall transaction. The firstRecordInTransaction and lastRecordInTransaction can be used to identify the start and end of the transaction overall.
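
A Kafka client could use these fields to regroup row messages into transactions. The following is a minimal sketch of that idea, operating on messages that have already been decoded from JSON into dictionaries. The function and helper names are hypothetical, and the sketch assumes that a transaction is complete once the number of messages received equals totalCount and both marker fields have been seen:

```python
from collections import defaultdict


def group_transactions(messages):
    """Group row messages by tungstenTransId and report completeness."""
    grouped = defaultdict(list)
    for message in messages:
        grouped[message["txnInfo"]["tungstenTransId"]].append(message)

    report = {}
    for trans_id, rows in grouped.items():
        # Counts arrive as strings in the JSON, so convert before comparing.
        expected = int(rows[0]["txnInfo"]["totalCount"])
        has_first = any("firstRecordInTransaction" in r["txnInfo"] for r in rows)
        has_last = any("lastRecordInTransaction" in r["txnInfo"] for r in rows)
        report[trans_id] = {
            "received": len(rows),
            "expected": expected,
            "complete": len(rows) == expected and has_first and has_last,
        }
    return report


def row_message(trans_id, total, **flags):
    # Hypothetical helper that builds a minimal message for the demonstration.
    return {"txnInfo": {"tungstenTransId": trans_id, "totalCount": total, **flags}}


messages = [
    row_message("164", "3", firstRecordInTransaction="true"),
    row_message("164", "3"),
    row_message("164", "3", lastRecordInTransaction="true"),
    row_message("165", "2", firstRecordInTransaction="true"),
]
for trans_id, summary in group_transactions(messages).items():
    print(trans_id, summary)
```

In this sample, transaction 164 is reported as complete, while transaction 165 is still waiting for its second row.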

replicator.applier.dbms.enabletxninfoTopic

Option        replicator.applier.dbms.enabletxninfoTopic
Description   Embeds transaction information into a separate Kafka message broadcast on an independent channel from the one used by the actual database data. One message is sent per transaction or THL event.
Value Type    boolean
Default       false
Valid Values  false  Do not generate transaction information
              true   Send transaction information on a separate Kafka topic for each transaction

If enabled, a separate message containing information about the entire transaction is sent on a Kafka topic. The topic name can be configured by setting the replicator.applier.dbms.txninfoTopic property.

The default message sent will look like the following example:

{
   "txnInfo" : {
      "tungstenTransId" : "164",
      "schema" : [
         {
            "schemaName" : "msg",
            "rowCount" : "1",
            "tableName" : "msg"
         },
         {
            "schemaName" : "msg",
            "rowCount" : "2",
            "tableName" : "msgsub"
         }
      ],
      "totalCount" : "3",
      "serviceName" : "alpha"
   }
}

This block of the overall message includes the following objects and information:

  • schema

    An array of the row counts within this transaction, with a row count included for each schema and table.

  • serviceName

    The name of the Tungsten Replicator service that generated the message.

  • totalCount

    The total number of rows modified within the entire transaction.
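
A consumer of the transaction topic might turn such a message into the per-table row counts it expects to see on the data topics. The sketch below is illustrative only. The function name is hypothetical, and the check that the per-table counts add up to totalCount is an assumption based on the example message rather than documented behaviour:

```python
import json


def expected_rows_per_table(raw_message):
    """Return ({(schema, table): rows}, total) from a transaction message."""
    info = json.loads(raw_message)["txnInfo"]
    per_table = {
        (entry["schemaName"], entry["tableName"]): int(entry["rowCount"])
        for entry in info["schema"]
    }
    total = int(info["totalCount"])
    # Assumption: the per-table counts sum to totalCount, as in the example.
    if sum(per_table.values()) != total:
        raise ValueError("per-table row counts do not add up to totalCount")
    return per_table, total


raw = json.dumps({
    "txnInfo": {
        "tungstenTransId": "164",
        "schema": [
            {"schemaName": "msg", "rowCount": "1", "tableName": "msg"},
            {"schemaName": "msg", "rowCount": "2", "tableName": "msgsub"},
        ],
        "totalCount": "3",
        "serviceName": "alpha",
    }
})
print(expected_rows_per_table(raw))
```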

replicator.applier.dbms.keyFormat

Option        replicator.applier.dbms.keyFormat
Description   Determines the format of the message ID
Value Type    string
Default       pkey
Valid Values  pkey      Combine the primary key column values into a single string
              pkeyus    Combine the primary key column values into a single string joined by an underscore character
              tspkey    Combine the schema name, table name, and primary key column values into a single string
              tspkeyus  Combine the schema name, table name, and primary key column values into a single string joined by an underscore character

Determines the format of the message ID used when sending the message into Kafka. For example, when configured to use tspkeyus, the message ID will consist of the schema name, table name and primary key column information separated by underscores: SCHEMANAME_TABLENAME_234.
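
The four formats can be pictured with the short sketch below. It is not the applier's code, and the message_key function is hypothetical. It assumes that the formats without the us suffix concatenate their parts with no separator, which the descriptions above imply but do not state explicitly:

```python
def message_key(key_format, schema, table, pkey_values):
    """Build a message ID for the given keyFormat (illustrative only)."""
    values = [str(value) for value in pkey_values]
    if key_format == "pkey":
        return "".join(values)  # assumed: no separator
    if key_format == "pkeyus":
        return "_".join(values)
    if key_format == "tspkey":
        return "".join([schema, table] + values)  # assumed: no separator
    if key_format == "tspkeyus":
        return "_".join([schema, table] + values)
    raise ValueError(f"unknown keyFormat: {key_format}")


print(message_key("tspkeyus", "SCHEMANAME", "TABLENAME", [234]))
print(message_key("pkeyus", "SCHEMANAME", "TABLENAME", [234, "a"]))
```

The first call produces SCHEMANAME_TABLENAME_234, matching the example above. The underscore variants are easier for a consumer to split back into their parts, provided the names and key values do not themselves contain underscores.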

replicator.applier.dbms.requireacks

Option        replicator.applier.dbms.requireacks
Description   Defines how many acknowledgements from Kafka nodes are required when writing messages to the Kafka cluster
Value Type    string
Default       all
Valid Values  1    Only the lead host should acknowledge receipt of the message
              all  All nodes should acknowledge receipt of the message

Sets the acknowledgement counter for sending messages into the Kafka queue.

replicator.applier.dbms.retrycount

Option        replicator.applier.dbms.retrycount
Description   The number of retries for sending each message
Value Type    number
Default       0

Determines the number of times that sending a message is retried before failing.

replicator.applier.dbms.txninfoTopic

Option        replicator.applier.dbms.txninfoTopic
Description   Sets the topic name for transaction messages
Value Type    string
Default       tungsten_transactions

Sets the topic name to be used when sending independent transaction information messages about each THL event. See replicator.applier.dbms.enabletxninfoTopic.

replicator.applier.dbms.zookeeperString

Option        replicator.applier.dbms.zookeeperString
Description   Connection string for Zookeeper, including hostname and port
Value Type    string
Default       ${replicator.global.db.host}:2181

The string to be used when connecting to Zookeeper. The default is to use port 2181 on the host used by replicator.global.db.host.

4.4.3. Management and Monitoring of Kafka Deployments

Once the extractor and applier have been installed, services can be monitored using the trepctl command.

For example, to monitor the extractor status:

shell> trepctl status
appliedLastEventId     : mysql-bin.000009:0000000000002298;2340
appliedLastSeqno       : 10
appliedLatency         : 0.788
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000009:0000000000002298
currentTimeMillis      : 1498687871560
dataServerHost         : mysqlhost
extensions             :
host                   : mysqlhost
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://mysqlhost:2112/
maximumStoredSeqNo     : 10
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : /var/lib/mysql
relativeLatency        : 99185.56
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : east
serviceType            : local
simpleServiceName      : east
siteName               : default
sourceId               : mysqlhost
state                  : ONLINE
timeInStateSeconds     : 101347.786
timezone               : GMT
transitioningTo        :
uptimeSeconds          : 101358.88
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

The replicator service operates just the same as a standard extractor service of a typical MySQL replication service.

The Kafka applier service can be accessed either remotely from the extractor:

shell> trepctl -host kafka status
...

Or locally on the Kafka host:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000412301;0
appliedLastSeqno       : 1296
appliedLatency         : 10.253
channels               : 1
clusterName            : alpha
currentEventId         : NONE
currentTimeMillis      : 1377098139212
dataServerHost         : kafka
extensions             :
latestEpochNumber      : 1286
masterConnectUri       : thl://host1:2112/
masterListenUri        : null
maximumStoredSeqNo     : 1296
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://mysqlhost:2112/
relativeLatency        : 771.212
resourcePrecedence     : 99
rmiPort                : 10000
role                   : slave
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : kafka
state                  : ONLINE
timeInStateSeconds     : 177783.343
transitioningTo        :
uptimeSeconds          : 180631.276
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

Monitoring the status of replication between the source and target is also the same. The appliedLastSeqno still indicates the sequence number that has been applied to Kafka, and the event ID from MySQL can still be identified from appliedLastEventId.

Sequence numbers between the two hosts should match, as in a source/target deployment, but due to the method used to replicate, the applied latency may be higher.
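
The comparison of sequence numbers can be scripted. The sketch below parses the name : value layout of the status output shown above and reports how many sequence numbers the applier is behind the extractor. The function names are hypothetical, and the sample text is shortened to the fields that are used:

```python
def parse_status(text):
    """Parse 'name : value' lines from trepctl status output into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as URIs contain colons.
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name and " " not in name:
            status[name] = value.strip()
    return status


def seqno_lag(extractor_text, applier_text):
    """Return how many sequence numbers the applier is behind the extractor."""
    extractor = parse_status(extractor_text)
    applier = parse_status(applier_text)
    return int(extractor["appliedLastSeqno"]) - int(applier["appliedLastSeqno"])


extractor_output = """Processing status command...
appliedLastSeqno       : 1296
masterListenUri        : thl://mysqlhost:2112/
state                  : ONLINE
Finished status command...
"""
applier_output = """Processing status command...
appliedLastSeqno       : 1290
pipelineSource         : thl://mysqlhost:2112/
state                  : ONLINE
Finished status command...
"""
print(seqno_lag(extractor_output, applier_output))  # 6
```

In practice the two text blocks would come from running trepctl status on the extractor and trepctl -host kafka status for the applier, for example through subprocess.run with capture_output=True and text=True.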

To check for information within Kafka, use a tool or the kafka-console-consumer.sh command-line client:

shell> kafka-console-consumer.sh --topic test_msg --zookeeper localhost:2181

The output should be checked to ensure that information is being correctly replicated. If strings are shown as a hex value, for example:

"title" : "[B@7084a5c"

It probably indicates that UTF8 and/or --mysql-use-bytes-for-string=false options were not used during installation. If you are reading from a cluster this is expected behavior, and you should enable the convertstringfrommysql filter as shown in the installation examples. In pure replicator scenarios, ensure that the --mysql-use-bytes-for-string=false setting is enabled, or that you are using --enable-heterogeneous-service.
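
This condition can be detected automatically in consumed messages. A value such as [B@7084a5c is how Java prints a reference to a byte array, so a simple pattern match on the record values is enough to flag it. The following sketch is illustrative; the function name and the sample message are hypothetical:

```python
import json
import re

# Java prints a byte[] as "[B@" followed by a hexadecimal hash code.
BYTE_ARRAY_PATTERN = re.compile(r"^\[B@[0-9a-f]+$")


def unconverted_columns(raw_message):
    """Return the record columns whose value looks like a Java byte[] reference."""
    record = json.loads(raw_message).get("record", {})
    return sorted(
        column
        for column, value in record.items()
        if isinstance(value, str) and BYTE_ARRAY_PATTERN.match(value)
    )


sample = '{"record": {"id": "1", "title": "[B@7084a5c"}, "_optype": "INSERT"}'
print(unconverted_columns(sample))  # ['title']
```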

4.5. Deploying the MongoDB Applier

Deployment of a replication to MongoDB service is slightly different to other appliers; there are two parts to the process:

  • Service Alpha on the Extractor, extracts the information from the MySQL binary log into THL.

  • Service Alpha on the Applier reads the information from the remote replicator as THL, and applies that to MongoDB.

Figure 4.6. Topologies: Replicating to MongoDB

Basic reformatting and restructuring of the data is performed by translating the structure extracted from one database in row format and restructuring for application in a different format. A filter, the ColumnNameFilter, is used to extract the column names against the extracted row-based information.

With the MongoDB applier, information is extracted from the source database using the row-format, column names and primary keys are identified, and translated to the BSON (Binary JSON) format supported by MongoDB. The fields in the source row are converted to the key/value pairs within the generated BSON.

The transfer operates as follows:

  1. Data is extracted from MySQL using the standard extractor, reading the row change data from the binlog.

  2. The ColumnName filter (see Section 11.4.5, “ColumnName Filter”) is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.

  3. The THL information is then applied to MongoDB using the MongoDB applier.

The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.

4.5.1. MongoDB Atlas Replication

The MongoDB applier can also be used to apply into a MongoDB Atlas instance.

The configuration for MongoDB Atlas is slightly different and follows a typical offboard applier process, similar in style to applying to Amazon Aurora instances.

Specific installation steps for MongoDB Atlas are outlined in Section 4.5.4, “Install MongoDB Atlas Applier”.

4.5.2. Preparing for MongoDB Replication

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

During the replication process, data is exchanged from the MySQL database/table/row structure into corresponding MongoDB structures, as follows:

MySQL     MongoDB
Database  Database
Table     Collection
Row       Document

In general, it is easier to understand that a row within the MySQL table is converted into a single document on the MongoDB side, and automatically added to a collection matching the table name.

For example, the following row within MySQL:

mysql> select * from recipe where recipeid = 1085 \G
*************************** 1. row ***************************
  recipeid: 1085
     title: Creamy egg and leek special
  subtitle:
  servings: 4
    active: 1
     parid: 0
    userid: 0
    rating: 0.0
 cumrating: 0.0
createdate: 0
1 row in set (0.00 sec)

Is replicated into the MongoDB document:

{
    "_id" : ObjectId("5212233584ae46ce07e427c3"),
    "recipeid" : "1085",
    "title" : "Creamy egg and leek special",
    "subtitle" : "",
    "servings" : "4",
    "active" : "1",
    "parid" : "0",
    "userid" : "0",
    "rating" : "0.0",
    "cumrating" : "0.0",
    "createdate" : "0"
}

When preparing the hosts you must be aware of this translation of the different structures, as it will have an effect on the way the information is replicated from MySQL to MongoDB.
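The row-to-document translation shown above can be pictured with a short Python sketch. It is an illustration only, not the applier's own code: the function name row_to_document is hypothetical, and the choice to render every value as a string simply mirrors the example document above. Handling of NULL values is not covered by the example and is left out here.

```python
def row_to_document(row):
    """Turn one source row (column name -> value) into a document-like dict.

    Every value is rendered as a string, mirroring the example document
    shown in this section. This is an assumption drawn from that example.
    """
    return {column: str(value) for column, value in row.items()}


# A subset of the example row from the recipe table.
row = {
    "recipeid": 1085,
    "title": "Creamy egg and leek special",
    "subtitle": "",
    "servings": 4,
    "rating": 0.0,
}
document = row_to_document(row)
print(document)
```

The collection receiving the document would be named after the source table (recipe in this example), and MongoDB adds the _id field itself.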

MySQL Host

The data replicated from MySQL can be any data, although there are some known limitations and assumptions made on the way the information is transferred.

When configuring the extractor database and host, ensure that the heterogeneous-specific prerequisites have been included; see Section B.4.4, “MySQL Configuration for Heterogeneous Deployments”.

For the best results when replicating, be aware of the following issues and limitations:

  • Use primary keys on all tables. The use of primary keys will improve the lookup of information within MongoDB when rows are updated. Without a primary key on a table a full table scan is performed, which can affect performance.

  • MySQL TEXT columns are correctly replicated, but cannot be used as keys.

  • MySQL BLOB columns are converted to text using the configured character type. Depending on the data that is being stored within the BLOB, the data may need to be custom converted. A filter can be written to convert and reformat the content as required.

MongoDB Host

  • Enable networking; by default MongoDB is configured to listen only on the localhost (127.0.0.1) IP address. The address should be changed to the IP address of your host, or 0.0.0.0, which indicates all interfaces on the current host.

  • Ensure that network port 27017, or the port you want to use for MongoDB, is configured as the listening port.
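Once the listener has been reconfigured, reachability can be confirmed from the applier host before installing. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is hypothetical, and the host name in the usage comment is a placeholder.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (placeholder host name, default MongoDB port):
#   port_is_open("mongohost", 27017)
```

A True result only shows that something is listening on the port; it does not verify authentication or that the listener is MongoDB.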

4.5.3. Install MongoDB Applier

Note

The steps in this section relate specifically to applying to a standard MongoDB Instance. For configuring the applier to work with MongoDB Atlas, please refer to the following section: Section 4.5.4, “Install MongoDB Atlas Applier”

Installation of the MongoDB replication requires special configuration of the Source and Target hosts so that each is configured for the correct datasource type.

To configure the Applier replicators:

  1. Before installing the applier, the following parameter must be added to the extractor configuration.

    Add the following to /etc/tungsten/tungsten.ini:

    [alpha]
    ...Existing Replicator Config...
    enable-heterogeneous-service=true
    
    shell> tpm update

    Note

    The above step is only applicable for standalone extractors. If you are configuring replications from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”

  2. Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz

  3. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10

  4. Configure the installation using tpm:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=sourcehost \
        --members=localhost \
        --datasource-type=mongodb \
        --replication-user=tungsten \
        --replication-password=secret \
        --svc-applier-filters=dropstatementdata \
        --role=slave \
        --replication-port=27017
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=sourcehost
    members=localhost
    datasource-type=mongodb
    replication-user=tungsten
    replication-password=secret
    svc-applier-filters=dropstatementdata
    role=slave
    replication-port=27017
    

  5. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  6. Once the prerequisites and the configuration of the installation have been completed, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
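Before running tpm install, an INI-based configuration such as the one in step 4 can be sanity-checked with a few lines of Python. This is a sketch under stated assumptions: the helper check_applier_ini is hypothetical, and the two expected settings are taken from the example [alpha] section in this section rather than from any exhaustive list of requirements.

```python
import configparser

# Settings copied from the example [alpha] section; illustrative only.
EXPECTED = {"datasource-type": "mongodb", "role": "slave"}


def check_applier_ini(text, service="alpha"):
    """Return a list of problems found in the given service section."""
    # Interpolation is disabled so values containing % or ${...} are read as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(service):
        return [f"missing section [{service}]"]
    problems = []
    for key, expected in EXPECTED.items():
        actual = parser.get(service, key, fallback=None)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, found {actual!r}")
    return problems


SAMPLE_INI = """\
[defaults]
install-directory=/opt/continuent

[alpha]
master=sourcehost
members=localhost
datasource-type=mongodb
role=slave
replication-port=27017
"""

print(check_applier_ini(SAMPLE_INI))
```

An empty list means the two example settings are present; tpm performs its own validation during installation, which this sketch does not replace.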

Once the replicators have started, the status of the service can be checked using trepctl. See Section 4.5.5, “Management and Monitoring of MongoDB Deployments” for more information.

4.5.4. Install MongoDB Atlas Applier

Note

The steps in this section relate specifically to applying to a MongoDB Atlas Instance. For configuring the applier to work with standard MongoDB, please refer to the following section: Section 4.5.3, “Install MongoDB Applier”

Installation of the MongoDB replication requires special configuration of the Source and Target hosts so that each is configured for the correct datasource type.

To configure the Applier replicators:

  1. Before installing the applier, the following parameter must be added to the extractor configuration. Apply the parameter on the extractor host, update the extractor using the details below, and then install the applier:

    • For Staging installs:

      shell> cd tungsten-replicator-7.1.4-10
      shell> ./tools/tpm configure alpha \
      --enable-heterogeneous-master=true
      shell> ./tools/tpm update

    • For INI installs: Add the following to /etc/tungsten/tungsten.ini:

      [alpha]
      ...Existing Replicator Config...
      enable-heterogeneous-master=true
      
      shell> tpm update

  2. Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz

  3. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10

  4. Configure the installation using tpm:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --disable-security-controls=false \
        --rmi-ssl=false \
        --thl-ssl=false \
        --rmi-authentication=false \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=sourcehost \
        --members=localhost \
        --datasource-type=mongodb \
        --replication-user=tungsten \
        --replication-password=secret \
        --svc-applier-filters=dropstatementdata \
        --role=slave \
        --replication-host=atlasendpoint.mongodb.net \
        --replication-port=27017 \
        --property='replicator.applier.dbms.connectString=mongodb+srv://${replicator.global.db.user}:${replicator.global.db.password}@${replicator.global.db.host}/?retryWrites=true&w=majority'
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    disable-security-controls=false
    rmi-ssl=false
    thl-ssl=false
    rmi-authentication=false
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=sourcehost
    members=localhost
    datasource-type=mongodb
    replication-user=tungsten
    replication-password=secret
    svc-applier-filters=dropstatementdata
    role=slave
    replication-host=atlasendpoint.mongodb.net
    replication-port=27017
    property=replicator.applier.dbms.connectString=mongodb+srv://${replicator.global.db.user}:${replicator.global.db.password}@${replicator.global.db.host}/?retryWrites=true&w=majority
    

  5. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  6. Once the prerequisites and the configuration of the installation have been completed, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.
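The connectString property in the configuration above contains ${...} placeholders rather than literal credentials. The Python sketch below shows what the expanded URI would look like for the example values. It assumes, without confirmation from this section, that the three placeholders resolve to the configured replication user, password and host; the expand helper is hypothetical and is not how the replicator itself performs the substitution.

```python
import re

TEMPLATE = (
    "mongodb+srv://${replicator.global.db.user}:"
    "${replicator.global.db.password}@${replicator.global.db.host}"
    "/?retryWrites=true&w=majority"
)


def expand(template, properties):
    """Replace each ${name} placeholder with properties[name]."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: properties[m.group(1)], template)


# Assumed mapping from the example replication-user, replication-password
# and replication-host settings.
properties = {
    "replicator.global.db.user": "tungsten",
    "replicator.global.db.password": "secret",
    "replicator.global.db.host": "atlasendpoint.mongodb.net",
}
uri = expand(TEMPLATE, properties)
print(uri)
```

Seeing the expanded form can help when comparing the value against the connection string that the Atlas console provides.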

Important

The above example assumes SSL is not enabled between the extractor and applier replicators.

If SSL is required, then you must omit the following properties from the example configs displayed above, or change the values to true: rmi-ssl=false, thl-ssl=false, rmi-authentication=false

Once you have installed the replicator, there are a few more steps required to allow the replicator to be able to authenticate with MongoDB Atlas.

4.5.4.1. Import MongoDB Atlas Certificates

MongoDB Atlas requires TLS connections for all Atlas Clusters, therefore we need to configure the replicator to recognise this.

Note

From May 1, 2021, MongoDB Atlas has moved to new TLS certificates using ISRG instead of IdenTrust for their root Certificate Authority.

All new clusters created after this time, or any existing clusters that have since been migrated to this new root CA, will need to follow the correct procedure to configure the replicator. Both procedures are shown below; follow the one that relates to your configuration.

For MongoDB Atlas Clusters created PRIOR to May 1, 2021, or that have not yet migrated to the new LetsEncrypt root Certificate:

  1. Using the correct Atlas Endpoint, issue the following command to retrieve the Atlas certificates:

    shell> openssl s_client -showcerts -connect atlas-endpoint.mongodb.net:27017

  2. The output may be quite long and will include at least two certificates, each bounded by a header/footer as follows:

    -----BEGIN CERTIFICATE-----
    xxxx
    xxxx
    -----END CERTIFICATE-----

    Copy each certificate, including the header/footer, into individual files

  3. Using keytool, we now need to load each certificate into the truststore that was created during the replicator installation. Repeat the example below for each certificate, ensuring you use a unique alias name for each certificate.

    shell> keytool -import -alias your-alias1 -file cert1.cer -keystore /opt/continuent/share/tungsten_truststore.ts

    When prompted, the default password for the truststore will be tungsten unless you specified a different password during installation.

  4. Once this is complete, you can now start the replicator

    shell> replicator start

For MongoDB Atlas Clusters created AFTER May 1, 2021, or that have been migrated to the new LetsEncrypt root Certificate:

  1. Obtain the LetsEncrypt root Certificate from here

  2. Copy the certificate into a file called letsencrypt.pem in the home directory of the applier host, including the BEGIN and END header/footer, for example:

    -----BEGIN CERTIFICATE-----
    xxxx
    xxxx
    -----END CERTIFICATE-----

  3. Using keytool, we now need to import this certificate into the truststore that was created during the replicator installation.

    shell> keytool -import -alias letsencrypt -file letsencrypt.pem -keystore /opt/continuent/share/tungsten_truststore.ts

    When prompted, the default password for the truststore will be tungsten unless you specified a different password during installation.

  4. Once this is complete, you can now start the replicator

    shell> replicator start

Once the replicators have started, the status of the service can be checked using trepctl. See Section 4.5.5, “Management and Monitoring of MongoDB Deployments” for more information.
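For the first procedure above, copying each certificate out of the openssl s_client output by hand is error-prone. The following Python sketch splits saved output into one file per certificate. It assumes the output has been captured as text; the function names and the cert1.cer, cert2.cer naming are illustrative choices that match the keytool example rather than anything the replicator requires.

```python
import re

# Matches one PEM certificate block, header and footer included.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def extract_certificates(text):
    """Return every PEM certificate block found in text, in order."""
    return PEM_BLOCK.findall(text)


def write_certificates(text, prefix="cert"):
    """Write each certificate to prefix1.cer, prefix2.cer, ... and return the paths."""
    paths = []
    for index, pem in enumerate(extract_certificates(text), start=1):
        path = f"{prefix}{index}.cer"
        with open(path, "w") as handle:
            handle.write(pem + "\n")
        paths.append(path)
    return paths
```

Each resulting file can then be imported with keytool as described in step 3, using a distinct alias per file.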

4.5.5. Management and Monitoring of MongoDB Deployments

Once the two services — extractor and applier — have been installed, the services can be monitored using trepctl. To monitor the extractor service:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000412301;0
appliedLastSeqno       : 1296
appliedLatency         : 1.889
channels               : 1
clusterName            : epsilon
currentEventId         : mysql-bin.000008:0000000000412301
currentTimeMillis      : 1377097812795
dataServerHost         : host1
extensions             : 
latestEpochNumber      : 1286
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host2:2112/
maximumStoredSeqNo     : 1296
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:13306/
relativeLatency        : 177444.795
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 177443.948
transitioningTo        : 
uptimeSeconds          : 177461.483
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

The replicator service operates just the same as a standard Extractor service of a typical MySQL replication service.

The MongoDB applier service can be accessed either remotely from the Extractor:

shell> trepctl -host host2 status
...

Or locally on the MongoDB host:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000412301;0
appliedLastSeqno       : 1296
appliedLatency         : 10.253
channels               : 1
clusterName            : alpha
currentEventId         : NONE
currentTimeMillis      : 1377098139212
dataServerHost         : host2
extensions             : 
latestEpochNumber      : 1286
masterConnectUri       : thl://host1:2112/
masterListenUri        : null
maximumStoredSeqNo     : 1296
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://host1:2112/
relativeLatency        : 177771.212
resourcePrecedence     : 99
rmiPort                : 10000
role                   : slave
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host2
state                  : ONLINE
timeInStateSeconds     : 177783.343
transitioningTo        : 
uptimeSeconds          : 180631.276
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

Monitoring the status of replication between the Source and Target is also the same. The appliedLastSeqno still indicates the sequence number that has been applied to MongoDB, and the event ID from the MySQL binary log can still be identified from appliedLastEventId.

Sequence numbers between the two hosts should match, as in a Primary/Replica deployment, but due to the method used to replicate, the applied latency may be higher. Tables that do not use primary keys, or large individual row updates may cause increased latency differences.
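Comparing sequence numbers between the two hosts can be scripted by parsing the trepctl status output shown above. The Python sketch below is a hedged example: parse_status and seqno_gap are hypothetical helpers, and the parser relies only on the "name : value" layout visible in the sample output.

```python
def parse_status(output):
    """Parse 'name : value' lines from trepctl status output into a dict."""
    status = {}
    for line in output.splitlines():
        # Split on the first colon only; values such as event IDs contain colons.
        name, separator, value = line.partition(":")
        name = name.strip()
        if separator and name and " " not in name:
            status[name] = value.strip()
    return status


def seqno_gap(extractor_output, applier_output):
    """Return how many sequence numbers the applier is behind the extractor."""
    extractor = parse_status(extractor_output)
    applier = parse_status(applier_output)
    return int(extractor["appliedLastSeqno"]) - int(applier["appliedLastSeqno"])
```

A gap of zero corresponds to the matching sequence numbers in the two sample outputs; a persistent positive gap suggests the applier is falling behind.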

To check for information within MongoDB, use the mongo command-line client:

shell> mongo
MongoDB shell version: 2.2.4
connecting to: test
> use cheffy;
switched to db cheffy

The show collections command lists the tables from MySQL that have been replicated to MongoDB:

> show collections
access_log
audit_trail
blog_post_record
helpdb
ingredient_recipes
ingredient_recipes_bytext
ingredients
ingredients_alt
ingredients_keywords
ingredients_matches
ingredients_measures
ingredients_plurals
ingredients_search_class
ingredients_search_class_map
ingredients_shop_class
ingredients_xlate
ingredients_xlate_class
keyword_class
keywords
measure_plurals
measure_trans
metadata
nut_fooddesc
nut_foodgrp
nut_footnote
nut_measure
nut_nutdata
nut_nutrdef
nut_rda
nut_rda_class
nut_source
nut_translate
nut_weight
recipe
recipe_coll_ids
recipe_coll_search
recipe_collections
recipe_comments
recipe_pics
recipebase
recipeingred
recipekeywords
recipemeta
recipemethod
recipenutrition
search_translate
system.indexes
terms

Collection counts should match the row count of the source tables:

> db.recipe.count()
2909

The db.collection.find() command can be used to list the documents within a given collection.

> db.recipe.find()
{ "_id" : ObjectId("5212233584ae46ce07e427c3"), 
"recipeid" : "1085", 
"title" : "Creamy egg and leek special", 
"subtitle" : "", 
"servings" : "4", 
"active" : "1", 
"parid" : "0", 
"userid" : "0", 
"rating" : "0.0", 
"cumrating" : "0.0", 
"createdate" : "0" }
{ "_id" : ObjectId("5212233584ae46ce07e427c4"),
 "recipeid" : "87",
 "title" : "Chakchouka",
 "subtitle" : "A traditional Arabian and North African dish and often accompanied with slices of cooked meat",
 "servings" : "4",
 "active" : "1",
 "parid" : "0",
 "userid" : "0",
 "rating" : "0.0",
 "cumrating" : "0.0",
 "createdate" : "0" }    
 ...

The output should be checked to ensure that information is being correctly replicated. If strings are shown as a hex value, for example:

"title" : "[B@7084a5c"

It probably indicates that UTF8 and/or --mysql-use-bytes-for-string=false options were not used during installation. The configuration can be updated using tpm to address this issue.
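A value such as [B@7084a5c is the default string form of a Java byte array, so it can be detected mechanically. The Python sketch below scans a document, represented as a plain dict, for such values; the helper name suspicious_fields is hypothetical, and fetching the documents from MongoDB is left out.

```python
import re

# Default Java rendering of a byte array: "[B@" followed by a hex hash code.
BYTE_ARRAY_REF = re.compile(r"^\[B@[0-9a-f]+$")


def suspicious_fields(document):
    """Return the names of fields whose value looks like a byte-array reference."""
    return [
        name
        for name, value in document.items()
        if isinstance(value, str) and BYTE_ARRAY_REF.match(value)
    ]


print(suspicious_fields({"recipeid": "1085", "title": "[B@7084a5c"}))
```

Any field names returned point to columns whose string data was not converted as expected.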

4.6. Deploying the Hadoop Applier

Replicating data into Hadoop is achieved by generating character-separated values from ROW-based information that is applied directly to the Hadoop HDFS using a batch loading process. Files are written directly to the HDFS using the Hadoop client libraries. A separate process is then used to merge existing data, and the changed information extracted from the Source database.

Deployment of the Hadoop replication is similar to other heterogeneous installations; two separate installations are created:

  • Service Alpha on the extractor extracts the information from the MySQL binary log into THL.

  • Service Alpha on the applier reads the information from the remote replicator as THL, applying it to Hadoop. The applier works in two stages:

Figure 4.7. Topologies: Replicating to Hadoop

Basic requirements for replication into Hadoop:

  • Hadoop Replication is supported on the following Hadoop distributions and releases:

    • Cloudera Enterprise 4.4, Cloudera Enterprise 5.0 (Certified) up to Cloudera Enterprise 5.8

    • HortonWorks DataPlatform 2.0

    • Amazon Elastic MapReduce

    • IBM InfoSphere BigInsights 2.1 and 3.0

    • MapR 3.0, 3.1, and 5.x

    • Pivotal HD 2.0

    • Apache Hadoop 2.1.0, 2.2.0

  • Source tables must have primary keys. Without a primary key, Tungsten Replicator is unable to determine the row to be updated when the data reaches Hadoop.

4.6.1. Hadoop Replication Operation

The Hadoop applier makes use of the JavaScript based batch loading system (see Section 5.6.4, “JavaScript Batchloader Scripts”). This constructs change data from the source-database, and uses this information in combination with any existing data to construct, using Hive, a materialized view. A summary of this basic structure can be seen in Figure 4.8, “Topologies: Hadoop Replication Operation”.

Figure 4.8. Topologies: Hadoop Replication Operation

The full replication of information operates as follows:

  1. Data is extracted from the source database using the standard extractor, for example by reading the row change data from the binlog in MySQL.

  2. The colnames filter is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.

    The pkey filter is used to extract primary key data from the source tables.

  3. On the applier replicator, the THL data is read and written into batch-files in the character-separated value format.

    The information in these files is change data, and contains not only the original data, but also metadata about the operation performed (i.e. INSERT, DELETE or UPDATE) and the primary key for each table. All UPDATE statements are recorded as a DELETE of the existing data, and an INSERT of the new data.

  4. A second process uses the CSV stage data and any existing data, to build a materialized view that mirrors the source table data structure.
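The effect of steps 3 and 4 can be sketched in Python: change records are replayed in order over the existing data, with each UPDATE arriving as a DELETE followed by an INSERT. This is a conceptual illustration only. In the real deployment the merge is performed within Hadoop using Hive, and the materialize function and the tuple layout used here are assumptions made for the example.

```python
def materialize(existing, changes):
    """Apply change records to existing rows and return the merged view.

    existing: dict mapping primary key -> row data
    changes:  iterable of (seqno, row_id, operation, primary_key, row) tuples,
              where operation is "I" (insert) or "D" (delete)
    """
    view = dict(existing)
    # Replay in commit order: by sequence number, then row ID within the batch.
    for seqno, row_id, operation, key, row in sorted(
        changes, key=lambda change: (change[0], change[1])
    ):
        if operation == "D":
            view.pop(key, None)
        elif operation == "I":
            view[key] = row
        else:
            raise ValueError(f"unknown operation {operation!r}")
    return view


existing = {3: ("#1 Single", 2006)}
changes = [
    (1319, 2, "I", 3, ("#1 Single", 2007)),  # second half of an UPDATE
    (1319, 1, "D", 3, None),                 # first half of the same UPDATE
    (1320, 1, "I", 4, ("New title", 2010)),
]
print(materialize(existing, changes))
```

Ordering by sequence number and row ID is what lets the DELETE of an UPDATE take effect before its matching INSERT, even if the records are read out of order.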

The staging files created by the replicator are in a specific format that incorporates change and operation information in addition to the original row data.

  • The format of the files is a character separated values file, with each row separated by a newline, and individual fields separated by the character 0x01. This is supported by Hive as a native value separator.

  • Each row in the file consists of metadata describing the operation, the sequence number, a unique row ID and the commit timestamp, followed by the full row information extracted from the source.

Field                       Description
Operation                   I (Insert) or D (Delete)
Sequence No                 SEQNO that generated this row
Unique Row                  Unique row ID within the batch
Commit TimeStamp            The commit timestamp of the original transaction, which can be used for partitioning
Table-specific primary key
Table-column

For example, the MySQL row:

|  3 | #1 Single | 2006 | Cats and Dogs (#1.4)         |

Is represented within the staging files generated as:

I^A1318^A1^A2017-06-07 09:22:28.000^A3^A3^A#1 Single^A2006^ACats and Dogs (#1.4)

The character separator, and whether to use quoting, are configurable within the replicator when it is deployed. The default is to use a newline character for records, and the 0x01 character for fields. For more information on these fields and how they can be configured, see Section 5.6.7, “Supported CSV Formats”.
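With the default separators, a staging line can be split back into its metadata and row values. The Python sketch below parses the example line shown above (^A is the 0x01 character). The StagingRow type and parse_staging_line function are hypothetical names, and the sketch assumes the default format with no quoting.

```python
from collections import namedtuple

StagingRow = namedtuple(
    "StagingRow", "operation seqno row_id commit_timestamp values"
)


def parse_staging_line(line, field_separator="\x01"):
    """Split one staging record into its four metadata fields and row values."""
    fields = line.rstrip("\n").split(field_separator)
    operation, seqno, row_id, commit_timestamp = fields[:4]
    # Everything after the metadata: primary key value(s), then the columns.
    return StagingRow(operation, int(seqno), int(row_id), commit_timestamp, fields[4:])


example = "\x01".join(
    ["I", "1318", "1", "2017-06-07 09:22:28.000", "3",
     "3", "#1 Single", "2006", "Cats and Dogs (#1.4)"]
)
print(parse_staging_line(example))
```

In the example the value 3 appears twice: once as the table-specific primary key and once as the first table column.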

On the Hadoop host, information is stored into a number of locations within the HDFS during the data transfer:

Table 4.2. Hadoop Replication Directory Locations

Directory/File Description
/user/USERNAME Top-level directory for Tungsten Replicator information, using the configured replication user.
/user/tungsten/metadata Location for metadata related to the replication operation
/user/tungsten/metadata/alpha The directory (named after the servicename of the replicator service) that holds service-specific metadata
/user/tungsten/staging Directory of the data transferred
/user/tungsten/staging/servicename Directory of the data transferred from a specific servicename.
/user/tungsten/staging/servicename/databasename Directory of the data transferred specific to a database.
/user/tungsten/staging/servicename/databasename/tablename Directory of the data transferred specific to a table.
/user/tungsten/staging/servicename/databasename/tablename/tablename-###.csv Filename of a single file of the data transferred for a specific table and database.

Files are automatically created, named according to the parent table name, and the starting Tungsten Replicator sequence number for each file that is transferred. The size of the files is determined by the batch and commit parameters. For example, the truncated list of files below is displayed using the hadoop fs command:

shell> hadoop fs -ls /user/tungsten/staging/alpha/hadoop/chicago
Found 66 items
-rw-r--r-- 3 cloudera cloudera  1270236 2020-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-10.csv
-rw-r--r-- 3 cloudera cloudera 10274189 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-103.csv
-rw-r--r-- 3 cloudera cloudera  1275832 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-104.csv
-rw-r--r-- 3 cloudera cloudera  1275411 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-105.csv
-rw-r--r-- 3 cloudera cloudera 10370471 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-113.csv
-rw-r--r-- 3 cloudera cloudera  1279435 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-114.csv
-rw-r--r-- 3 cloudera cloudera  2544062 2020-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-12.csv
-rw-r--r-- 3 cloudera cloudera 11694202 2020-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-123.csv
-rw-r--r-- 3 cloudera cloudera  1279072 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-124.csv
-rw-r--r-- 3 cloudera cloudera  2570481 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-126.csv
-rw-r--r-- 3 cloudera cloudera  9073627 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-133.csv
-rw-r--r-- 3 cloudera cloudera  1279708 2020-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-134.csv
...

The individual file numbers will not be sequential, as they will depend on the sequence number, batch size and range of tables transferred.
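Because hadoop fs -ls lists names in string order, chicago-103.csv appears before chicago-12.csv in the listing above. When staging files need to be processed in sequence-number order, the numeric suffix can be extracted first. The following Python sketch shows one way to do that; the helper name staging_seqno is hypothetical.

```python
import re


def staging_seqno(path):
    """Return the starting sequence number encoded in a staging file name."""
    match = re.search(r"-(\d+)\.csv$", path)
    if match is None:
        raise ValueError(f"not a staging file name: {path!r}")
    return int(match.group(1))


base = "/user/tungsten/staging/alpha/hadoop/chicago/"
files = [
    base + "chicago-10.csv",
    base + "chicago-103.csv",
    base + "chicago-104.csv",
    base + "chicago-12.csv",
]
ordered = sorted(files, key=staging_seqno)
print([staging_seqno(path) for path in ordered])
```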

4.6.2. Preparing for Hadoop Replication

During the replication process, data is exchanged from the MySQL database/table/row structure into corresponding Hadoop directory and files, as shown in the table below:

MySQL Hadoop
Database Directory
Table Hive-compatible Character-Separated Text file
Row Line in the text file, fields terminated by character 0x01

4.6.2.1. Hadoop Host

The Hadoop environment should have the following features and parameters for the most efficient operation:

  • Disk storage

    There must be enough disk storage for the change data, data being actively merged, and the live data for the replicated information. Depending on the configuration and rate of changes in the Source, the required data space will fluctuate.

    For example, replicating a 10GB dataset with 5GB of change data during replication will require at least 30GB of storage: 10GB for the original dataset, 5GB of change data, and 10-25GB of merged data. The exact size is dependent on the quantity of inserts/updates/deletes.

  • Pre-requisites

    Currently, deployment of the target to a relay host is not supported. One host within the Hadoop cluster must be chosen to act as the target.

    The prerequisites for a standard Tungsten Replicator should be followed, including:

    This will provide the base environment into which Tungsten Replicator can be installed.

  • HDFS Location

    The /user/tungsten directory must be writable by the replicator user within HDFS:

    shell> hadoop fs -mkdir /user/tungsten
    shell> hadoop fs -chmod 700 /user/tungsten
    shell> hadoop fs -chown tungsten /user/tungsten

    These commands should be executed by a user with HDFS administration rights (e.g. the hdfs user).

  • Replicator User Group Membership

    The user that will be executing the replicator (typically tungsten, as recommended in Appendix B, Prerequisites) must be a member of the hive group on the Hadoop host where the replicator will be installed. Without this membership, the user will be unable to execute Hive queries.

4.6.2.2. Schema Generation

In order to access the generated tables, both staging and the final tables, it is necessary to create a schema definition. The ddlscan tool can be used to read the existing definition of the tables from the source server and generate suitable Hive schema definitions to access the table data.

To create the staging table definition, use the ddl-mysql-hive-0.10.vm template; you must specify the JDBC connection string, user, password and database names. For example:

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:13306/test' -pass password \
   -template ddl-mysql-hive-0.10.vm -db test
--
-- SQL generated on Wed Jan 29 16:17:05 GMT 2020 by Tungsten ddlscan utility
-- 
-- url = jdbc:mysql:thin://host1:13306/test
-- user = tungsten
-- dbName = test
--
CREATE DATABASE test;

DROP TABLE IF EXISTS test.movies_large;

CREATE TABLE test.movies_large
(
  id INT ,
  title STRING ,
  year INT ,
  episodetitle STRING  )
;

The output from this command should be applied to your Hive installation within the Hadoop cluster. For example, by capturing the output, transferring that file and then running:

shell> cat schema.sql | hive

To create Hive tables that read the staging files loaded by the replicator, use the ddl-mysql-hive-0.10-staging.vm template:

shell> ddlscan -user tungsten -url 'jdbc:mysql:thin://host:13306/test' -pass password \
    -template ddl-mysql-hive-0.10-staging.vm -db test

The process creates the schema and tables which match the schema and table names on the source database.

Transfer this file to your Hadoop environment and then create the generated schema:

shell> cat schema-staging.sql |hive

The process creates matching schema names, but table names are modified to include the prefix stage_xxx_. For example, for the table movies_large a staging table named stage_xxx_movies_large is created. The Hive table definition is created pointing to the external file-based tables, using the default 0x01 field separator and 0x0A (newline) record separator. If different values were used for these in the configuration, the schema definition in the captured file from ddlscan should be updated by hand.

The tables should now be available within Hive. For more information on accessing and using the tables, see Section 4.6.4.3, “Accessing Generated Tables in Hive”.

4.6.3. Replicating into Kerberos Secured HDFS

For replicating into HDFS where Kerberos support has been enabled, the hadoop_kerberos.js batch script can be used in place of the normal hadoop.js script.

The script will need modification before it can be used, due to the varying implementations of Kerberos, and to ensure the correct authentication parameters are used.

Before installing, edit the hadoop_kerberos.js file, located at tungsten-replicator/appliers/batch/hadoop-kerberos.js within the installation package. The file contains the following line, which is executed before any HDFS operations:

var kinit_prefix = "kinit USER/LEVEL@REALM -k -t KEYTAB_FILE;"

Edit this line to set the correct command and/or authentication parameters, such as the username and keytab file. The configured command will be executed immediately before all the commands that operate on the Hadoop filesystem, including creating directories and files.

For example, the variable might be updated to:

var kinit_prefix = "kinit mc/admin@CLOUDERA -k -t mcadmin.keytab;"

When installing, use --batch-load-template=hadoop_kerberos.js to enable the new batch load script.

4.6.4. Install Hadoop Replication

Installation of the Hadoop replication consists of multiple stages:

  1. Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

  2. Install the Applier replicator which will apply information to the target Hadoop environment.

  3. Once the installation of the Extractor and Applier components has been completed, materialization of tables and views can be performed.

4.6.4.1. Applier Replicator Service

The applier replicator service reads information from the THL of the source and applies this to a local instance of Hadoop.

Important

Installation must take place on a node within the Hadoop cluster. Writing to a remote HDFS filesystem is not currently supported.

  1. Before installing the applier, the following option must be added to the extractor configuration. Apply the parameter, update the extractor, and then install the applier.

    • For Staging Install:

      shell> cd tungsten-replicator-7.1.4-10
      shell> ./tools/tpm configure alpha \
        --enable-batch-service=true
      shell> ./tools/tpm update
    • For INI Installs: Add the following to /etc/tungsten/tungsten.ini

      
      [alpha]
      ...Existing Replicator Config...
      enable-batch-service=true
      
      
      shell> tpm update
  2. The applier can now be configured.

    Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  3. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10
  4. Configure the installation using tpm:

    The staging configuration is shown first, followed by the equivalent INI configuration:

    shell> ./tools/tpm configure defaults \
        --reset \
        --user=tungsten \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --skip-validation-check=HostsFileCheck \
        --skip-validation-check=InstallerMasterSlaveCheck \
        --skip-validation-check=DatasourceDBPort \
        --skip-validation-check=DirectDatasourceDBPort \
        --skip-validation-check=ReplicationServicePipelines \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=host1 \
        --members=host2 \
        --property=replicator.datasource.global.csvType=hive \
        --property=replicator.stage.q-to-dbms.blockCommitInterval=1s \
        --property=replicator.stage.q-to-dbms.blockCommitRowCount=1000 \
        --replication-password=secret \
        --replication-user=tungsten \
        --batch-enabled=true \
        --batch-load-language=js  \
        --batch-load-template=hadoop \
        --datasource-type=file
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    user=tungsten
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    skip-validation-check=HostsFileCheck
    skip-validation-check=InstallerMasterSlaveCheck
    skip-validation-check=DatasourceDBPort
    skip-validation-check=DirectDatasourceDBPort
    skip-validation-check=ReplicationServicePipelines
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=host1
    members=host2
    property=replicator.datasource.global.csvType=hive
    property=replicator.stage.q-to-dbms.blockCommitInterval=1s
    property=replicator.stage.q-to-dbms.blockCommitRowCount=1000
    replication-password=secret
    replication-user=tungsten
    batch-enabled=true
    batch-load-language=js 
    batch-load-template=hadoop
    datasource-type=file
    

  5. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  6. Once the prerequisites have been met and the installation has been configured, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the service has been installed it can be monitored using the trepctl command. See Section 4.6.4.4, “Management and Monitoring of Hadoop Deployments” for more information. If there are problems during installation, see Section 4.6.4.5, “Troubleshooting Hadoop Replication”.
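The staging and INI configurations in step 4 carry the same options in two notations. The Python sketch below shows that correspondence under the assumption, taken from the example above, that an INI entry is the staging option with its leading dashes removed and that --reset has no INI equivalent. The function name is hypothetical and this is not part of tpm.

```python
def staging_options_to_ini(section, options):
    """Render tpm staging-style options as an INI section.

    `options` is a list of strings such as "--user=tungsten". The leading
    dashes are removed and the remainder is copied unchanged. Options that
    only appear on the command line in the example (here just --reset)
    are dropped.
    """
    command_line_only = {"reset"}
    lines = ["[{}]".format(section)]
    for opt in options:
        name_value = opt.lstrip("-").strip()
        name = name_value.split("=", 1)[0]
        if name in command_line_only:
            continue
        lines.append(name_value)
    return "\n".join(lines)


if __name__ == "__main__":
    print(staging_options_to_ini("alpha", [
        "--master=host1",
        "--members=host2",
        "--datasource-type=file",
    ]))
```

Repeated options such as skip-validation-check and property are emitted once per occurrence, as in the INI example above.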

4.6.4.2. Generating Materialized Views

Added in 6.0.4. From Tungsten Replicator 6.0.4, the continuent-tools-hadoop tools are packaged within the main Tungsten Replicator software bundle and can be found within the ./tungsten-replicator/support/hadoop-tools directory.

The continuent-tools-hadoop repository contains a set of tools that allow for the convenient creation of DDL, materialized views, and data comparison on the tables that have been replicated from MySQL.

To obtain the tools, use git

shell> ./bin/load-reduce-check -s test -Ujdbc:mysql:thin://tr-hadoop2:13306 -udbload -ppassword

The load-reduce-check command performs four distinct steps:

  1. Reads the schema from the MySQL server and creates the staging table DDL within Hive

  2. Reads the schema from the MySQL server and creates the base table DDL within Hive

  3. Executes the materialized view process on each selected staging table data to build the base table content.

  4. Performs a data comparison

4.6.4.3. Accessing Generated Tables in Hive

If you have not already done so, follow the schema generation process described in Section 4.6.2.2, “Schema Generation”. This creates the necessary Hive schema and staging schema definitions.

Once the tables have been created through ddlscan you can query the stage tables:

hive> select * from stage_xxx_movies_large limit 10;
OK
I	10	1	57475	All in the Family	1971	Archie Feels Left Out (#4.17)
I	10	2	57476	All in the Family	1971	Archie Finds a Friend (#6.18)
I	10	3	57477	All in the Family	1971	Archie Gets the Business: Part 1 (#8.1)
I	10	4	57478	All in the Family	1971	Archie Gets the Business: Part 2 (#8.2)
I	10	5	57479	All in the Family	1971	Archie Gives Blood (#1.4)
I	10	6	57480	All in the Family	1971	Archie Goes Too Far (#3.17)
I	10	7	57481	All in the Family	1971	Archie in the Cellar (#4.10)
I	10	8	57482	All in the Family	1971	Archie in the Hospital (#3.15)
I	10	9	57483	All in the Family	1971	Archie in the Lock-Up (#2.3)
I	10	10	57484	All in the Family	1971	Archie Is Branded (#3.20)

4.6.4.4. Management and Monitoring of Hadoop Deployments

Once the two services — extractor and applier — have been installed, the services can be monitored using trepctl. To monitor the Extractor service:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000023:0000000505545003;0
appliedLastSeqno       : 10992
appliedLatency         : 42.764
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000023:0000000505545003
currentTimeMillis      : 1389871897922
dataServerHost         : host1
extensions             : 
host                   : host1
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host1:2112/
maximumStoredSeqNo     : 10992
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:13306/
relativeLatency        : 158296.922
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 165845.474
transitioningTo        : 
uptimeSeconds          : 165850.047
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

When monitoring, the primary concern, beyond identifying and coping with any errors, is the applied latency. Larger numbers for applied latency generally indicate that the information is not being written out to disk effectively. There are a number of areas that should be checked:

  • Confirm that the Hadoop environment is running effectively. Any delays to writing to HDFS will impact the replicator.

  • Adjust the block commit parameters. Tuning the block commit levels should find the balance between frequent updates to achieve the required latency, and generating files of a suitable size so that Hadoop can process them effectively through map/reduce. You should try both increasing and reducing the sizes to find the correct settings for your source data.
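The latency check described above can be automated. The following Python sketch parses the text printed by trepctl status into a dictionary and reports when appliedLatency exceeds a limit. The function names and the threshold are hypothetical choices for illustration; only the field names are taken from the sample output above.

```python
def parse_trepctl_status(text):
    """Parse the 'name : value' lines of trepctl status output into a dict."""
    status = {}
    for line in text.splitlines():
        # Skip banner and header lines, which contain no colon.
        if ":" not in line:
            continue
        # Split on the first colon only, since values may contain colons.
        name, value = line.split(":", 1)
        status[name.strip()] = value.strip()
    return status


def latency_warnings(status, max_latency_seconds):
    """Return a list of human-readable warnings for a parsed status dict."""
    warnings = []
    if status.get("state") != "ONLINE":
        warnings.append("replicator state is {}".format(status.get("state")))
    if status.get("pendingError", "NONE") != "NONE":
        warnings.append("pending error: {}".format(status["pendingError"]))
    latency = float(status.get("appliedLatency", "-1"))
    if latency > max_latency_seconds:
        warnings.append(
            "appliedLatency {} exceeds {}".format(latency, max_latency_seconds))
    return warnings


if __name__ == "__main__":
    import sys
    # Example use: trepctl status | python check_latency.py
    for warning in latency_warnings(parse_trepctl_status(sys.stdin.read()), 60):
        print(warning)
```

A suitable limit depends on the block commit settings in use, because the applier writes data in batches and some latency is expected between commits.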

4.6.4.5. Troubleshooting Hadoop Replication

Replicating to Hadoop involves a number of discrete, specific steps. Due to the batch and multi-stage nature of the extract and apply process, replication can stall or stop due to a variety of issues.

4.6.4.5.1. Errors Reading/Writing commitseqno.0 File

During initial installation, or when starting up replication, the replicator may report that the commitseqno.0 file cannot be created or written properly, or, during startup, that the file cannot be read.

The following checks and recovery procedures can be tried:

  • Check the permissions and ownership of the directory containing the commitseqno.0 file, and of the file itself:

    shell> hadoop fs -ls -R /user/tungsten/metadata
    drwxr-xr-x   - cloudera cloudera          0 2020-01-14 10:40 /user/tungsten/metadata/alpha
    -rw-r--r--   3 cloudera cloudera        251 2020-01-14 10:40 /user/tungsten/metadata/alpha/commitseqno.0
  • Check that the file is writable and is not empty. An empty file may indicate a problem updating the content with the new sequence number.

  • Check the content of the file is correct. The content should be a JSON structure containing the replicator state and position information. For example:

    shell> hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0
    {
      "appliedLatency" : "0",
      "epochNumber" : "0",
      "fragno" : "0",
      "shardId" : "dna",
      "seqno" : "8",
      "eventId" : "mysql-bin.000015:0000000000103156;0",
      "extractedTstamp" : "1578998421000",
      "lastFrag" : "true",
      "sourceId" : "host1"
    }
  • Try deleting the commitseqno.0 file and placing the replicator online:

    shell> hadoop fs -rm /user/tungsten/metadata/alpha/commitseqno.0
    shell> trepctl online
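The content checks in the list above can be sketched in Python. This is an illustrative helper, not part of the replicator: the function name is hypothetical, and the set of expected keys is simply the set shown in the example file content, which may differ between versions.

```python
import json

# Keys taken from the example commitseqno.0 content shown above.
EXPECTED_KEYS = {
    "appliedLatency", "epochNumber", "fragno", "shardId", "seqno",
    "eventId", "extractedTstamp", "lastFrag", "sourceId",
}


def check_commitseqno(content):
    """Return a list of problems found in commitseqno.0 content (empty if none)."""
    if not content.strip():
        return ["file is empty"]
    try:
        data = json.loads(content)
    except ValueError as exc:
        return ["content is not valid JSON: {}".format(exc)]
    if not isinstance(data, dict):
        return ["content is not a JSON object"]
    missing = sorted(EXPECTED_KEYS - set(data))
    return ["missing key: {}".format(key) for key in missing]


if __name__ == "__main__":
    import sys
    # Example use:
    #   hadoop fs -cat /user/tungsten/metadata/alpha/commitseqno.0 | python check.py
    for problem in check_commitseqno(sys.stdin.read()):
        print(problem)
```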
4.6.4.5.2. Recovering from Replication Failure

If the replication fails, is manually stopped, or the host needs to be restarted, replication should continue from the last point when replication was stopped. Files that were being written when replication was last running will be overwritten and the information recreated.

Unlike other Heterogeneous replication implementations, the Hadoop applier stores the current replication state and restart position in a file within the HDFS of the target Hadoop environment. To recover from failed replication, this file must be deleted, so that the THL can be re-read from the Source and CSV files will be recreated and applied into HDFS.

  1. On the Applier, put the replicator offline:

    shell> trepctl offline
  2. Remove the THL files from the Applier:

    shell> trepctl reset -thl
  3. Remove the staging CSV files replicated into Hadoop:

    shell> hadoop fs -rm -r /user/tungsten/staging
  4. Reset the restart position:

    shell> rm /opt/continuent/tungsten/tungsten-replicator/data/alpha/commitseqno.0

    Replace alpha and /opt/continuent with the corresponding service name and installation location.

  5. Restart replication on the Applier; this will start to recreate the THL files from the MySQL binary log:

    shell> trepctl online
4.6.4.5.3. Missing Primary Key

Replication may fail at the applier stage if the source data does not contain the correct ROW format and information, including the primary key data. trepctl may report the following error:

...
pendingErrorEventId    : mysql-bin.000015:0000000000143981;0
pendingErrorSeqno      : 10
pendingExceptionMessage: Wrapped com.continuent.tungsten.replicator.ReplicatorException: »
    Unable to find a primary key for dna.alt_allele_attrib and there is no default » 
    from property stagePkeyColumn (../../tungsten-replicator//samples/scripts/batch/hdfs-merge.js#18)
pipelineSource         : UNKNOWN
relativeLatency        : -1.0
...

If the primary key was missing in the source data, the table structure on the source must be updated, and the THL information recreated.
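To find out which table needs attention, the schema and table name can be pulled out of the exception text. The Python sketch below does this with a regular expression based only on the example message above; the function name is hypothetical and the message wording may vary between releases.

```python
import re

# Pattern based on the example pendingExceptionMessage shown above.
_MISSING_PKEY = re.compile(
    r"Unable to find a primary key for\s+([\w$]+)\.([\w$]+)")


def missing_pkey_table(message):
    """Return (schema, table) named in a missing-primary-key error, or None."""
    match = _MISSING_PKEY.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    example = ("Wrapped com.continuent.tungsten.replicator.ReplicatorException: "
               "Unable to find a primary key for dna.alt_allele_attrib "
               "and there is no default from property stagePkeyColumn")
    print(missing_pkey_table(example))
```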

4.7. Deploying the Oracle Applier

Replication Operation Support
Statements Replicated  No
Rows Replicated        Yes
Schema Replicated      No
ddlscan Supported      Yes

Tungsten Cluster supports replication to Oracle as a datasource. This allows replication of data from MySQL to Oracle. See Section B.1.2, “Database Support” for more details.

Figure 4.9. Topologies: Replicating to Oracle

Topologies: Replicating to Oracle

Replication in these configurations operates using two separate replicators:

  • Replicator on the Extractor, extracts the information from the source database into THL.

  • Replicator on the Applier reads the information from the remote replicator as THL, and applies that to the target database.

4.7.1. Preparing for Oracle Replication

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, followed by the additional prerequisites specific to Oracle targets outlined in Section 4.7.1.1, “Additional Prerequisites for Oracle Targets”. Then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

When replicating from MySQL to Oracle there are a number of datatype differences that should be accommodated to ensure reliable replication of the information. The core differences are described in Table 4.3, “Data Type differences when replicating data from MySQL to Oracle”.

Table 4.3. Data Type differences when replicating data from MySQL to Oracle

MySQL Datatype  Oracle Datatype  Notes
INT             NUMBER(10, 0)
BIGINT          NUMBER(19, 0)
TINYINT         NUMBER(3, 0)
SMALLINT        NUMBER(5, 0)
MEDIUMINT       NUMBER(7, 0)
DECIMAL(x,y)    NUMBER(x, y)
FLOAT           FLOAT
CHAR(n)         CHAR(n)
VARCHAR(n)      VARCHAR2(n)      Values shorter than 2000 bytes are replicated intact. Values longer than 2000 bytes are truncated when written into Oracle.
DATE            DATE
DATETIME        DATE
TIMESTAMP       DATE
TEXT            CLOB             Replicator can transform TEXT into CLOB or VARCHAR(N). If you choose VARCHAR(N) on Oracle, the length of the data accepted by Oracle will be limited to 4000. This is a limitation of Oracle. The size of CLOB columns within Oracle is calculated in terabytes. If TEXT fields on MySQL are known to be less than 4000 bytes (not characters) long, then VARCHAR(4000) can be used on Oracle. This may be faster than using CLOB.
BLOB            BLOB
ENUM(...)       VARCHAR(255)     Use the EnumToString filter
SET(...)        VARCHAR(255)     Use the SetToString filter
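The mappings in Table 4.3 can be expressed as a small lookup, which may be useful when reviewing hand-written Oracle DDL. The Python sketch below is illustrative only: it encodes the table rows as written, the function name is hypothetical, and it is not how ddlscan itself performs the conversion.

```python
import re

# Direct mappings from Table 4.3 (MySQL type name -> Oracle type).
SIMPLE_TYPES = {
    "INT": "NUMBER(10, 0)",
    "BIGINT": "NUMBER(19, 0)",
    "TINYINT": "NUMBER(3, 0)",
    "SMALLINT": "NUMBER(5, 0)",
    "MEDIUMINT": "NUMBER(7, 0)",
    "FLOAT": "FLOAT",
    "DATE": "DATE",
    "DATETIME": "DATE",
    "TIMESTAMP": "DATE",
    "TEXT": "CLOB",
    "BLOB": "BLOB",
}


def oracle_type(mysql_type):
    """Return the Oracle type that Table 4.3 lists for a MySQL column type."""
    text = mysql_type.strip().upper()
    match = re.fullmatch(r"(\w+)\s*(?:\((.*)\))?", text)
    if match is None:
        raise ValueError("unrecognised type: {}".format(mysql_type))
    name, args = match.group(1), match.group(2)
    if name == "DECIMAL" and args:
        parts = [part.strip() for part in args.split(",")]
        if len(parts) != 2:
            raise ValueError("expected DECIMAL(x,y): {}".format(mysql_type))
        return "NUMBER({}, {})".format(parts[0], parts[1])
    if name == "CHAR" and args:
        return "CHAR({})".format(args.strip())
    if name == "VARCHAR" and args:
        return "VARCHAR2({})".format(args.strip())
    if name in ("ENUM", "SET"):
        # Table 4.3 pairs these with the EnumToString / SetToString filters.
        return "VARCHAR(255)"
    if name in SIMPLE_TYPES:
        return SIMPLE_TYPES[name]
    raise ValueError("no mapping in Table 4.3 for: {}".format(mysql_type))


if __name__ == "__main__":
    for column_type in ("INT", "DECIMAL(10,2)", "VARCHAR(100)", "TEXT"):
        print(column_type, "->", oracle_type(column_type))
```

Types that the table does not list raise an error instead of guessing, so gaps are visible during review.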

When replicating to Oracle, the ddlscan command can be used to generate DDL appropriate for the supported data types in the target database. In MySQL to Oracle deployments the DDL can be read from the MySQL server and generated for the Oracle server so that replication can begin without manually creating the Oracle specific DDL.

In addition, the following DDL differences and requirements exist:

  • Column orders on MySQL and Oracle must match, but column names do not have to match.

    Using the dropcolumn filter, columns can be dropped and ignored if required.

  • Each table within MySQL should have a Primary Key. Without a primary key, full-row based lookups are performed on the data when performing UPDATE or DELETE operations. With a primary key, the pkey filter can add metadata to the UPDATE/DELETE event, enabling faster application of events within Oracle.

  • Indexes on MySQL and Oracle do not have to match. This allows for different index types and tuning between the two systems according to application and dataserver performance requirements.

  • Keywords that are restricted on Oracle should not be used within MySQL as table, column or database names. For example, the keyword SESSION is not allowed within Oracle. Tungsten Cluster determines the column name from the target database metadata by position (column reference), not name, so replication will not fail, but applications may need to be adapted. For compatibility, try to avoid Oracle keywords.

For more information on differences between MySQL and Oracle, see Oracle and MySQL Compared.

To make the process of migration from MySQL to Oracle easier, Tungsten Cluster includes a tool called ddlscan which will read table definitions from MySQL and create appropriate Oracle table definitions to use during replication.

For reference information on the ddlscan tool, see Section 8.6, “The ddlscan Command”.

When replicating to Oracle there are a number of key steps that must be performed. The primary process is the preparation of the Oracle database and DDL for the database schema that are being replicated. Although DDL statements will be replicated to Oracle, they will often fail because of SQL language differences. Because of this, tables within Oracle must be created before replication starts.

4.7.1.1. Additional Prerequisites for Oracle Targets

When applying to Oracle, additional prerequisites must be met to ensure the replicator can connect to, and apply to, the target database.

For remote Oracle targets (Offboard Applier)

To enable the replicator to apply to a remote Oracle instance, the Replicator host requires an Oracle Client installation, with an appropriate TNS entry configured in the tnsnames.ora file.

In addition, the environment for the tungsten OS user must be configured with the ORACLE_HOME and LD_LIBRARY_PATH variables.

For remote and local Oracle targets

Before installing you need to ensure that you have the ojdbc7.jar file in the correct location.

This can be copied to either:

  • $ORACLE_HOME/jdbc/lib, or

  • /opt/continuent/software/tungsten-replicator-7.1.4-10/tungsten-replicator/lib

4.7.1.2. Configure the Oracle database

Before installing replication, the Oracle target database must be configured:

  • A user and schema must exist for each database from MySQL that you want to replicate. In addition, the schema used by the services within Tungsten Cluster must have an associated schema and user name.

    For example, if you are replicating the database sales to Oracle, the following statements must be executed to create a suitable schema. This can be performed through any connection, including sqlplus:

    shell> sqlplus sys/oracle as sysdba
    SQL> CREATE USER sales IDENTIFIED BY password DEFAULT TABLESPACE DEMO QUOTA UNLIMITED ON DEMO;

    The above assumes a suitable tablespace has been created (DEMO in this case).

  • A schema must also be created for each service replicating into Oracle. For example, if the service is called alpha, then the tungsten_alpha schema/user must be created. The same command can be used:

    SQL> CREATE USER tungsten_alpha IDENTIFIED BY password DEFAULT TABLESPACE DEMO QUOTA UNLIMITED ON DEMO;
  • One of the users used above must be configured so that it has the rights to connect to Oracle and has all rights so that it can execute statements on any schema:

    SQL> GRANT CONNECT TO tungsten_alpha;
    SQL> GRANT DBA TO tungsten_alpha;

    The user/password combination selected will be required when configuring the Applier replication service.
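When several databases are replicated, the statements above have to be repeated for each one. The Python sketch below generates them following the pattern of the examples in this section. The function name is hypothetical, and values are inserted without any quoting or escaping, so it is only suitable for trusted names.

```python
def oracle_setup_statements(databases, service, tablespace, password):
    """Build CREATE USER / GRANT statements following the examples above.

    One user is created per replicated database, plus the tungsten_<service>
    user, which also receives the CONNECT and DBA grants.
    """
    template = ("CREATE USER {user} IDENTIFIED BY {password} "
                "DEFAULT TABLESPACE {ts} QUOTA UNLIMITED ON {ts};")
    service_user = "tungsten_{}".format(service)
    statements = []
    for user in list(databases) + [service_user]:
        statements.append(
            template.format(user=user, password=password, ts=tablespace))
    statements.append("GRANT CONNECT TO {};".format(service_user))
    statements.append("GRANT DBA TO {};".format(service_user))
    return statements


if __name__ == "__main__":
    for statement in oracle_setup_statements(["sales"], "alpha", "DEMO", "password"):
        print(statement)
```

The output can be reviewed and then pasted into a sqlplus session connected as a privileged user.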

4.7.1.3. Create the Destination Schema

On the host which has been already configured as the Extractor, use ddlscan to extract the DDL for Oracle:

shell> cd tungsten-replicator-7.1.4-10
shell> ./bin/ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:3306/access_log' \
    -pass password -template ddl-mysql-oracle.vm -db access_log

The output should be captured and checked before applying it to your Oracle instance:

shell> ./bin/ddlscan -user tungsten -url 'jdbc:mysql:thin://host1:3306/access_log' \
    -pass password -template ddl-mysql-oracle.vm -db access_log > access_log.ddl

If you are happy with the output, it can be executed against your target Oracle database:

shell> cat access_log.ddl | sqlplus sys/oracle as sysdba

The generated DDL includes statements to drop existing tables if they exist. This will fail in a new installation, but the output can be ignored.

Once the process has been completed for this database, it must be repeated for each database that you plan on replicating from MySQL to Oracle.
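Repeating the capture step for several databases can be scripted. The Python sketch below builds one ddlscan command line per database, using only the options shown in the examples above. The function names are hypothetical, and the default host, port, user and password are the placeholder values from this section.

```python
import shlex


def ddlscan_command(database, host="host1", port=3306, user="tungsten",
                    password="password", template="ddl-mysql-oracle.vm"):
    """Return the ddlscan argument list for one MySQL database."""
    url = "jdbc:mysql:thin://{}:{}/{}".format(host, port, database)
    return ["./bin/ddlscan", "-user", user, "-url", url,
            "-pass", password, "-template", template, "-db", database]


def ddlscan_script(databases):
    """Return shell lines that capture the DDL for each database in its own file."""
    lines = []
    for db in databases:
        command = " ".join(shlex.quote(arg) for arg in ddlscan_command(db))
        lines.append("{} > {}".format(command, shlex.quote(db + ".ddl")))
    return lines


if __name__ == "__main__":
    # Database names here are examples only.
    for line in ddlscan_script(["access_log", "sales"]):
        print(line)
```

Each generated file should still be checked before it is applied to the Oracle instance, as described above.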

4.7.2. Install Oracle Applier

The Applier replicator will read the THL from the remote Extractor and apply it into Oracle using a standard JDBC connection. The Applier replicator needs to know the Extractor hostname, and the datasource type.

  1. Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  2. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10
  3. Obtain a copy of the Oracle JDBC driver and copy it into the tungsten-replicator/lib directory:

    shell> cp ojdbc7.jar ./tungsten-replicator/lib/
  4. Configure the installation using tpm:

    The staging configuration is shown first, followed by the equivalent INI configuration:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --skip-validation-check=InstallerMasterSlaveCheck \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=sourcehost \
        --members=localhost \
        --datasource-type=oracle \
        --datasource-oracle-service=ORCL \
        --datasource-user=tungsten_alpha \
        --datasource-password=secret \
        --svc-applier-filters=dropstatementdata
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    skip-validation-check=InstallerMasterSlaveCheck
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=sourcehost
    members=localhost
    datasource-type=oracle
    datasource-oracle-service=ORCL
    datasource-user=tungsten_alpha
    datasource-password=secret
    svc-applier-filters=dropstatementdata
    

    replication-host should be added to the above configuration if the target Oracle database is on a different host from the applier installation.

  5. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  6. Once the prerequisites have been met and the installation has been configured, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the installation has completed, the status of the service should be reported. The service should be online and reading events from the Extractor replicator.

The status of the replicator can be checked and monitored by using the trepctl command.

4.8. Deploying the PostgreSQL Applier

Replication to a PostgreSQL service operates as follows:

  • Service Alpha on the Extractor, extracts the information from the MySQL binary log into THL.

  • Service Alpha on the Applier reads the information from the remote replicator as THL, and applies that to PostgreSQL using a standard JDBC driver by constructing PostgreSQL compatible SQL to insert, update and delete the target data.

Figure 4.10. Topologies: Replicating to PostgreSQL

Topologies: Replicating to PostgreSQL

The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.

4.8.1. Preparing for PostgreSQL Replication

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

4.8.1.1. PostgreSQL Database Setup

For replication to PostgreSQL hosts, you must ensure that the networking and user configuration have been set up correctly.

4.8.1.1.1. PostgreSQL Version Support

Database    Version   Support Status                   Notes
PostgreSQL  9.5, 9.6  Primary platform (applier only)

4.8.1.1.2. Enable PostgreSQL Networking

Within the PostgreSQL configuration, two changes need to be made:

  • Configure the networking so that the listen address for the PostgreSQL server is set correctly. Edit the /etc/postgresql/main/postgresql.conf file and set the listen_addresses line either to * or to an explicit IP address. For example:

    listen_addresses = '192.168.3.73'
  • Edit the /etc/postgresql/main/pg_hba.conf file and ensure that the password properties match the password settings and hostname limitations. In particular, the replicator will communicate over the public IP address, not localhost, and so you must ensure that network-based connections using a user/password combination are allowed. For example, you may want to add a line to the file that provides network-wide access, or at least access for the local network range:

    local   all             all                                     md5
4.8.1.1.3. User Configuration

A suitable user must be created with rights and permissions to create databases, as this is required by the replicator to create databases, tables, and other objects. The createuser command can be used for this purpose. The --createdb option adds the CREATEDB permission:

shell> createuser tungsten --createdb --pwprompt

You will be prompted to provide a password for the user.

Alternatively, you can create the user and permissions through the psql interface:

shell> sudo -u postgres psql --port=5433 --user=postgres postgres
    Type "help" for help.

    postgres=# CREATE ROLE tungsten WITH LOGIN PASSWORD 'password';
    postgres=# ALTER ROLE tungsten CREATEDB;

You may also want to grant specific privileges to existing databases which must be done within the psql interface:

shell> sudo -u postgres psql --port=5433 --user=postgres postgres
    Type "help" for help.

    postgres=# GRANT ALL ON DATABASE postgres TO tungsten;

4.8.2. Install PostgreSQL Applier

Once you have completed the configuration of the PostgreSQL database, you can configure and install the PostgreSQL applier using the steps below.

  1. Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  2. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10
  3. Configure the installation using tpm:

    The staging configuration is shown first, followed by the equivalent INI configuration:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/continuent \
        --user=tungsten \
        --profile-script=~/.bash_profile \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=sourcehost \
        --members=localhost,sourcehost \
        --datasource-type=postgresql \
        --postgresql-dbname=dbname \
        --replication-user=tungsten \
        --replication-password=secret \
        --replication-host=remotedbhost \
        --replication-port=5432
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/continuent
    user=tungsten
    profile-script=~/.bash_profile
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=sourcehost
    members=localhost,sourcehost
    datasource-type=postgresql
    postgresql-dbname=dbname
    replication-user=tungsten
    replication-password=secret
    replication-host=remotedbhost
    replication-port=5432
    

  4. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  5. Once the prerequisites have been met and the installation has been configured, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the replicators have started, the status of the service can be checked using trepctl. See Section 4.8.3, “Management and Monitoring of PostgreSQL Deployments” for more information.

4.8.3. Management and Monitoring of PostgreSQL Deployments

Once the two services — extractor and applier — have been installed, the services can be monitored using trepctl. To monitor the extractor service:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000412301;0
appliedLastSeqno       : 1296
appliedLatency         : 1.889
channels               : 1
clusterName            : epsilon
currentEventId         : mysql-bin.000008:0000000000412301
currentTimeMillis      : 1377097812795
dataServerHost         : host1
extensions             :
latestEpochNumber      : 1286
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host2:2112/
maximumStoredSeqNo     : 1296
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:13306/
relativeLatency        : 177444.795
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 177443.948
transitioningTo        :
uptimeSeconds          : 177461.483
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

The extractor operates in the same way as the Extractor service in a typical MySQL replication deployment.

The PostgreSQL applier service can be accessed either remotely from the Extractor:

shell> trepctl -host host2 status
...

Or locally on the Applier host:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000412301;0
appliedLastSeqno       : 1296
appliedLatency         : 10.253
channels               : 1
clusterName            : alpha
currentEventId         : NONE
currentTimeMillis      : 1377098139212
dataServerHost         : host2
extensions             :
latestEpochNumber      : 1286
masterConnectUri       : thl://host1:2112/
masterListenUri        : null
maximumStoredSeqNo     : 1296
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://host1:2112/
relativeLatency        : 177771.212
resourcePrecedence     : 99
rmiPort                : 10000
role                   : slave
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host2
state                  : ONLINE
timeInStateSeconds     : 177783.343
transitioningTo        :
uptimeSeconds          : 180631.276
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

Monitoring the status of replication between the Source and Target is also the same. The appliedLastSeqno still indicates the sequence number that has been applied to PostgreSQL, and the event ID of the source MySQL binary log position can still be identified from appliedLastEventId.

Sequence numbers between the two hosts should match, as in a Primary/Replica deployment, but due to the method used to replicate, the applied latency may be higher. Tables that do not use primary keys, or large individual row updates, may cause increased latency differences.
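
The comparison described above can be scripted. The following is a minimal sketch in Python, not part of the Tungsten tooling: it assumes the text of trepctl status has already been captured from each host (the two sample strings are abbreviated copies of the output shown earlier), and the helper names parse_status and compare are hypothetical.

```python
# Abbreviated sample output; in practice this text would be captured from
# "trepctl status" on each host (how it is captured is left to the reader).
EXTRACTOR_STATUS = """\
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000412301;0
appliedLastSeqno       : 1296
appliedLatency         : 1.889
role                   : master
state                  : ONLINE
Finished status command...
"""

APPLIER_STATUS = """\
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000412301;0
appliedLastSeqno       : 1296
appliedLatency         : 10.253
role                   : slave
state                  : ONLINE
Finished status command...
"""


def parse_status(text):
    """Turn the 'name : value' lines of the status output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, because values such as event IDs
        # and URIs contain colons of their own.
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip banner and header lines, which have no 'name :' prefix.
        if not sep or not name or " " in name:
            continue
        fields[name] = value.strip()
    return fields


def compare(extractor_text, applier_text):
    """Summarise whether the applier has caught up with the extractor."""
    ext = parse_status(extractor_text)
    app = parse_status(applier_text)
    return {
        "both_online": ext["state"] == "ONLINE" and app["state"] == "ONLINE",
        "seqno_gap": int(ext["appliedLastSeqno"]) - int(app["appliedLastSeqno"]),
        "applier_latency": float(app["appliedLatency"]),
    }
```

A seqno_gap of zero means both hosts report the same appliedLastSeqno. A higher appliedLatency on the applier is expected and is not, by itself, a sign of a problem.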

4.9. Deploying the Amazon S3 CSV Applier

Amazon S3 is a cloud-based data storage service that integrates with other Amazon services. Replication for Amazon S3 moves data from MySQL datastores, in real time, to CSV files stored within an S3 bucket.

Replication to Amazon S3 operates as follows:

  • Data is extracted from the source database into THL.

  • When extracting the data from the THL, the Amazon S3 replicator writes the data into CSV files according to the name of the source tables. The files contain all of the row-based data, including the global transaction ID generated by the extractor during replication, and the operation type (insert, delete, etc) as part of the CSV data.

  • The generated CSV files are loaded into Amazon S3 using either the s3cmd command or the aws s3 cli tools. This enables easy access to your Amazon S3 installation and simplifies the loading.

Setting up replication requires setting up both the Extractor and Applier components as two different configurations, one for MySQL and the other for Amazon S3. Replication also requires some additional steps to ensure that S3 is ready to accept the replicated data that has been extracted. Tungsten Replicator provides all the tools required to perform these operations during the installation and setup.

4.9.1. S3 Replication Operation

The S3 applier makes use of the JavaScript based batch loading system (see Section 5.6.4, “JavaScript Batchloader Scripts”). This constructs change data from the source database. The change data is then written into CSV files, optionally compressed, and loaded into an S3 bucket.

The full replication of information operates as follows:

  1. Data is extracted from the source database using the standard extractor, for example by reading the row change data from the binlog in MySQL.

  2. The ColumnName filter (see Section 11.4.5, “ColumnName Filter”) is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding column names, are stored in the THL.

    The PrimaryKey filter (see Section 11.4.32, “PrimaryKey Filter”) is used to extract primary key data from the source tables.

  3. On the Applier replicator, the THL data is read and written into batch-files in the character-separated value format.

    The information in these files is change data, and contains not only the original row values from the source tables, but also metadata about the operation performed (i.e. INSERT, DELETE or UPDATE) and the primary key for each table. All UPDATE statements are recorded as a DELETE of the existing data, and an INSERT of the new data.

    In addition to these core operation types, the batch applier can also be configured to record UPDATE operations that result in INSERT or DELETE rows.

The staging files created by the replicator are in a specific format that incorporates change and operation information in addition to the original row data.

  • The format of the files is a character separated values file, with each row separated by a newline, and individual fields separated by the character 0x01.

  • Each row of the file contains metadata describing the operation, the sequence number, the primary key, and the commit time, followed by the full row data extracted from the Source.

Operation   Sequence No                     Table-specific primary key   DateTime                          Table-columns...
OPTYPE      SEQNO that generated this row   PRIMARYKEY                   DATETIME of source table commit

The operation field will match one of the following values:

Operation   Description                                    Notes
I           Row is an INSERT of new data
D           Row is a DELETE of existing data
UI          Row is an UPDATE which caused INSERT of data
UD          Row is an UPDATE which caused DELETE of data
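
To illustrate how these operation codes describe a change stream, the following Python sketch replays a list of change rows into an in-memory table keyed by primary key. It is an illustration only, not the logic used by the batch loader scripts: the tuple layout, the sample data, and the function name apply_rows are assumptions, and rows are assumed to arrive in commit order with the delete half of an UPDATE before its insert half.

```python
def apply_rows(rows, table=None):
    """Replay (optype, seqno, pkey, values) tuples into a dict keyed by pkey.

    Assumes rows are already in commit order, and that the delete half of an
    UPDATE (UD) precedes its insert half (UI).
    """
    table = {} if table is None else table
    for optype, seqno, pkey, values in rows:
        if optype in ("D", "UD"):
            # Remove the existing row, if there is one.
            table.pop(pkey, None)
        elif optype in ("I", "UI"):
            # Store the new row image.
            table[pkey] = values
        else:
            raise ValueError(f"unknown operation type {optype!r} at seqno {seqno}")
    return table


# Hypothetical change stream: an insert, an update of that row, then an
# insert and delete of a second row.
CHANGES = [
    ("I", 5, "3", ("3", "#1 Single", "2006")),
    ("UD", 6, "3", ("3", "#1 Single", "2006")),
    ("UI", 6, "3", ("3", "#1 Single", "2007")),
    ("I", 7, "4", ("4", "Another Title", "2010")),
    ("D", 8, "4", ("4", "Another Title", "2010")),
]

final_state = apply_rows(CHANGES)
```

After the replay, only the row with primary key 3 remains, carrying the values from the UI row.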

For example, the MySQL row from an INSERT of:

|  3 | #1 Single | 2006 | Cats and Dogs (#1.4)         |

Is represented within the CSV files generated as:

"I","5","3","2014-07-31 14:29:17.000","3","#1 Single","2006","Cats and Dogs (#1.4)"

The character separator, and whether to use quoting, are configurable within the replicator when it is deployed. For S3, the default behavior is to generate quoted and comma separated fields.
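
As a sketch of how a consumer might read these files, the following Python example parses the sample line above using the standard csv module. It assumes the S3 defaults described here (quoted, comma-separated fields) and the column order shown in the format table. The ChangeRow type and parse_change_rows function are hypothetical names used for illustration.

```python
import csv
import io
from typing import List, NamedTuple


class ChangeRow(NamedTuple):
    optype: str         # I, D, UI or UD
    seqno: int          # sequence number that generated the row
    pkey: str           # table-specific primary key
    commit_time: str    # commit timestamp from the source, kept as text
    columns: List[str]  # the full row data


def parse_change_rows(text):
    """Parse quoted, comma-separated change rows into ChangeRow tuples."""
    reader = csv.reader(io.StringIO(text))
    return [ChangeRow(r[0], int(r[1]), r[2], r[3], r[4:]) for r in reader if r]


SAMPLE = '"I","5","3","2014-07-31 14:29:17.000","3","#1 Single","2006","Cats and Dogs (#1.4)"\n'

rows = parse_change_rows(SAMPLE)
```

If the replicator is deployed with a different separator or without quoting, the csv.reader call would need matching delimiter and quoting arguments.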

As the target for the Amazon S3 Applier is not a relational database in the sense of traditional Tungsten replication, the replicator stores its apply position as a JSON structure on the local filesystem.

This allows the replicator to know its starting position in the case of a restart.

The file is called commitseqno.0 and is located in the following directory: /opt/continuent/metadata/applier/serviceName

The contents of the file will look something like the following, and should NOT be edited unless advised to do so by Continuent Support:

{
  "sourceId" : "ext01",
  "epochNumber" : "0",
  "fragno" : "0",
  "eventId" : "mysql-bin.000002:0000000000134613;27",
  "seqno" : "427",
  "lastFrag" : "true",
  "extractedTstamp" : "1687439308000",
  "appliedLatency" : "0",
  "shardId" : "demo"
}
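
For monitoring purposes the position can be read without modifying the file. The following Python sketch parses a copy of the structure shown above. The function name read_position is hypothetical, and the sample text is embedded as a string so the example is self-contained; a real script would read the file contents in read-only mode.

```python
import json

# Copy of the example structure shown above. Note that every value is
# stored as a JSON string, including the numeric ones.
SAMPLE = """
{
  "sourceId" : "ext01",
  "epochNumber" : "0",
  "fragno" : "0",
  "eventId" : "mysql-bin.000002:0000000000134613;27",
  "seqno" : "427",
  "lastFrag" : "true",
  "extractedTstamp" : "1687439308000",
  "appliedLatency" : "0",
  "shardId" : "demo"
}
"""


def read_position(text):
    """Return the apply position fields, converted to native types."""
    data = json.loads(text)
    return {
        "source_id": data["sourceId"],
        "seqno": int(data["seqno"]),
        "event_id": data["eventId"],
        "last_frag": data["lastFrag"] == "true",
    }


position = read_position(SAMPLE)
```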

4.9.2. Preparing for Amazon S3 Replication

Preparing the hosts for the replication process requires setting some key configuration parameters within the MySQL server to ensure that data is stored and written correctly.

Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

The following are required for replication to Amazon S3:

  • An existing Amazon Web Services (AWS) account, and either the AWS Access Key and Secret Key, or configured IAM Roles, required to interact with the account through the API. For information on creating IAM Roles, see Section 4.2.2.2, “Configuring Identity Access Management within AWS”

  • A configured Amazon S3 service. If the S3 service has not already been configured, visit the AWS console and sign up for the Amazon S3 service.

  • If using s3cmd, configure the command to connect to the Amazon S3 service automatically, without requiring further authentication. The .s3cfg file in the tungsten user's home directory should be configured as follows:

    • Using Access Keys:

      [default]
      access_key = ACCESS_KEY
      secret_key = SECRET_KEY
    • Using IAM Roles (leave the values blank and copy the example as is):

      [default]
      access_key = 
      secret_key = 
      security_token =
  • Create an S3 bucket that will be used to hold the CSV files that are generated by the replicator. This can be achieved either through the web interface, or via the command-line, for example:

    shell> s3cmd mb s3://tungsten-csv
  • Create an s3-config-servicename.json file based on the sample provided within cluster-home/samples/conf/s3-config-servicename.json within the Tungsten Replicator staging directory, or using the example below.

    Once created, the file will be copied into the /opt/continuent/share directory to be used by the batch applier script.

    If multiple services are being created, one file must be created for each service.

    The following example shows the use of Access and Secret Keys:

    {
      "awsS3Path" : "s3://your-bucket-for-s3/s3-test",
      "awsAccessKey" : "access-key-id",
      "awsSecretKey" : "secret-access-key",
      "cleanUpS3Files" : "true"
    }

    The following example shows the use of IAM Roles:

    {
      "awsS3Path" : "s3://your-bucket-for-s3/s3-test",
      "awsIAMRole" : "arn:iam-role"
    }

    The allowed options for this file are as follows:

    • awsS3Path — the location within your S3 storage where files should be loaded.

    • awsAccessKey — the S3 access key to access your S3 storage. Not required if awsIAMRole is used.

    • awsSecretKey — the S3 secret key associated with the Access Key. Not required if awsIAMRole is used.

    • awsIAMRole — the IAM role configured to allow Redshift to interact with S3. Not required if awsAccessKey and awsSecretKey are in use.

    • s3Binary — the binary to use for loading csv file up to S3. (Valid Values: s3cmd, s4cmd, aws) (Default: s3cmd)

    • gzipS3Files — setting to true will result in the csv files being gzipped prior to loading into S3 (Default: false)
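
Because a malformed file is only discovered when the applier tries to load data, it can be useful to sanity-check it first. The following Python sketch is one possible check and is not part of Tungsten Replicator: the function name check_s3_config is hypothetical, and the rule that awsS3Path starts with s3:// is an assumption based on the examples above rather than a documented requirement.

```python
import json

VALID_BINARIES = ("s3cmd", "s4cmd", "aws")


def check_s3_config(text):
    """Return a list of problems found in the text of an s3-config file."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as exc:
        # Strict JSON parsers reject, for example, a trailing comma.
        return [f"not valid JSON: {exc}"]
    if not isinstance(cfg, dict):
        return ["top level must be a JSON object"]

    problems = []
    # Assumption: the path is always given as an s3:// location.
    if not str(cfg.get("awsS3Path", "")).startswith("s3://"):
        problems.append("awsS3Path should be an s3:// location")

    # Either both keys, or an IAM role, must be supplied.
    has_keys = bool(cfg.get("awsAccessKey")) and bool(cfg.get("awsSecretKey"))
    has_role = bool(cfg.get("awsIAMRole"))
    if not (has_keys or has_role):
        problems.append("set awsAccessKey and awsSecretKey, or awsIAMRole")

    # s3Binary is optional and defaults to s3cmd.
    if cfg.get("s3Binary", "s3cmd") not in VALID_BINARIES:
        problems.append("s3Binary must be one of: " + ", ".join(VALID_BINARIES))
    return problems


GOOD = '{"awsS3Path": "s3://your-bucket-for-s3/s3-test", "awsIAMRole": "arn:iam-role"}'
BAD = '{"awsS3Path": "s3://your-bucket-for-s3/s3-test", "awsIAMRole": "arn:iam-role",}'
```

An empty list means no problems were found by these checks. It does not prove that the credentials or bucket are usable.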

4.9.3. Install Amazon S3 Applier

Replication into S3 requires two separate replicator installations, one that extracts information from the source database, and a second that generates the CSV files and loads those files into S3.

The two replication services can operate on the same machine (see Section 5.3, “Deploying Multiple Replicators on a Single Host”), or they can be installed on two different machines.

Once you have completed the configuration of the Amazon S3 bucket, you can configure and install the applier as described using the steps below.

  1. Before installing the applier, add the following parameter to the extractor configuration.

    Add the following to /etc/tungsten/tungsten.ini, then apply the change with tpm update:

    [alpha]
    ...Existing Replicator Config...
    enable-heterogeneous-service=true
    
    shell> tpm update

    Note

    The above step is only applicable for standalone extractors. If you are configuring replication from an existing Tungsten Cluster (Cluster-Extractor), follow the steps outlined here to ensure the cluster is configured correctly: Section 3.4.1, “Prepare: Replicating Data Out of a Cluster”

  2. The applier can now be configured. Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  3. Change into the staging directory:

    shell> cd tungsten-replicator-7.1.4-10
  4. Configure the installation using tpm:

    shell> ./tools/tpm configure defaults \
        --reset \
        --user=tungsten \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret \
        --replicator-rest-api-ssl=true \
        --replicator-rest-api-port=8097 \
        --replicator-rest-api-authentication=true \
        --replicator-rest-api-address=0.0.0.0
    
    shell> ./tools/tpm configure alpha \
        --master=sourcehost \
        --members=localhost \
        --role=slave \
        --batch-enabled=true \
        --batch-load-template=s3 \
        --datasource-type=file \
        --enable-heterogeneous-service=true
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    user=tungsten
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    replicator-rest-api-ssl=true
    replicator-rest-api-port=8097
    replicator-rest-api-authentication=true
    replicator-rest-api-address=0.0.0.0
    
    [alpha]
    master=sourcehost
    members=localhost
    role=slave
    batch-enabled=true
    batch-load-template=s3
    datasource-type=file
    enable-heterogeneous-service=true
    

  5. If your MySQL source is a Tungsten Cluster, ensure the additional steps below are also included in your applier configuration.

    First, prepare the required filter configuration file as follows on the S3 applier host(s) only:

    shell> mkdir -p /opt/continuent/share/
    shell> cp tungsten-replicator/support/filters-config/convertstringfrommysql.json /opt/continuent/share/

    Then, include the following parameters in the configuration:

    property=replicator.stage.remote-to-thl.filters=convertstringfrommysql
    property=replicator.filter.convertstringfrommysql.definitionsFile=/opt/continuent/share/convertstringfrommysql.json
    
  6. Note

    If you plan to make full use of the REST API (which is enabled by default) you will need to also configure a username and password for API access. This must be done by specifying the following options in your configuration:

    rest-api-admin-user=tungsten
    rest-api-admin-pass=secret

  7. Once the prerequisites and configuration of the installation have been completed, the software can be installed:

    shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

On the host that is loading data into S3, create the s3-config-servicename.json file and then copy that file into the share directory within the installed directory on that host. For example:

shell> cp s3-config-servicename.json /opt/continuent/share/

Now the services can be started:

shell> replicator start

Chapter 5. Deployment: Advanced

Table of Contents

5.1. Deploying the Replicator using the AWS Marketplace AMI
5.1.1. Prepare Source/Target database instances
5.1.2. Launch and Configure AMI
5.2. Deploying a Fan-In Topology
5.2.1. Management and Monitoring Fan-in Deployments
5.3. Deploying Multiple Replicators on a Single Host
5.3.1. Preparing Multiple Replicators
5.3.2. Install Multiple Replicators
5.3.3. Best Practices: Multiple Replicators
5.4. Replicating Data Into an Existing Dataservice
5.5. Deploying Parallel Replication
5.5.1. Application Prerequisites for Parallel Replication
5.5.2. Enabling Parallel Apply During Install
5.5.3. Channels
5.5.4. Parallel Replication and Offline Operation
5.5.4.1. Clean Offline Operation
5.5.4.2. Tuning the Time to Go Offline Cleanly
5.5.4.3. Unclean Offline
5.5.5. Adjusting Parallel Replication After Installation
5.5.5.1. How to Enable Parallel Apply After Installation
5.5.5.2. How to Change Channels Safely
5.5.5.3. How to Disable Parallel Replication Safely
5.5.5.4. How to Switch Parallel Queue Types Safely
5.5.6. Monitoring Parallel Replication
5.5.6.1. Useful Commands for Parallel Monitoring Replication
5.5.6.2. Parallel Replication and Applied Latency On Replicas
5.5.6.3. Relative Latency
5.5.6.4. Serialization Count
5.5.6.5. Maximum Offline Interval
5.5.6.6. Workload Distribution
5.5.7. Controlling Assignment of Shards to Channels
5.5.8. Disk vs. Memory Parallel Queues
5.6. Batch Loading for Data Warehouses
5.6.1. How It Works
5.6.2. Important Limitations
5.6.3. Batch Applier Setup
5.6.4. JavaScript Batchloader Scripts
5.6.4.1. JavaScript Batchloader with Parallel Apply
5.6.5. Staging Tables
5.6.5.1. Staging Table Names
5.6.5.2. Whole Record Staging
5.6.5.3. Delete Key Staging
5.6.5.4. Staging Table Generation
5.6.6. Character Sets
5.6.7. Supported CSV Formats
5.6.8. Columns in Generated CSV Files
5.6.9. Batchloading Opcodes
5.6.10. Time Zones
5.6.11. Batch Loading into MySQL
5.6.11.1. Configuring as an Offboard Batch Applier
5.6.11.2. Drop Delete Statements
5.6.11.3. Configure CHARSET to use on Load
5.6.11.4. Allow DDL Statements to execute
5.6.11.5. Disable Foreign Keys during load
5.6.11.6. Log rows violating Primary/Unique Keys
5.6.12. Data File Partitioning

5.1. Deploying the Replicator using the AWS Marketplace AMI

If you have an AWS account, you can take advantage of pre-built EC2 hosts, complete with all necessary pre-requisites in place, launched from an AWS Marketplace AMI.

Upon launch, a wizard will start and prompt you for a number of credentials to build a default configuration for Tungsten Replicator.

For a complete end-to-end Replication Pipeline you will need:

Source Databases

The Tungsten Replicator for MySQL Source Extraction is required for extraction from any of the following:

  • MySQL hosted on another EC2 instance

  • MySQL hosted on the same EC2 host launched from the AMI

  • An existing Tungsten Clustering Installation

  • Amazon RDS

  • Amazon Aurora

  • MySQL hosted on a remote non-AWS host

  • Google Cloud SQL

  • Microsoft Azure

Target Databases

Note

Upon launch, the AMI does NOT include the required binaries for a locally hosted database instance. For a local install for either the extractor or the applier, this will need to be configured manually beforehand.

Note

If you plan to extract from an existing Tungsten Cluster (Cluster-Extractor), a number of changes may need to be applied to your cluster configuration. In addition, your cluster must be running the same release as Tungsten Replicator. For more details on Cluster requirements, consult the appropriate Applier-specific pages in Chapter 4, Deploying Appliers.

Note

For any non-AWS hosted instances, ensure the appropriate inbound and outbound security rules are in place to allow WAN Communication.

5.1.1. Prepare Source/Target database instances

When using the AMI to configure an Extractor or Applier, it is important to ensure all the necessary target/source database pre-requisites are in place.

  • For extraction, ensure your source MySQL Instance is configured as per the Database specific notes in Section B.4, “MySQL Database Setup”

  • In addition, for Amazon based extraction, pay particular attention to Section B.4.6, “MySQL Unprivileged Users”

  • For preparing the target database, specific notes for target pre-requisites, where appropriate, are detailed within each applier deployment section found at Chapter 4, Deploying Appliers

  • Once you have prepared your sources and targets, you can now launch the relevant AMIs from the Marketplace

  • Within your AWS Dashboard, you can find the AMI by searching within the Marketplace for "Continuent"

  • Select the Extractor AMI and the Target AMI based on your choice of target database. Each AMI is restricted to only configure an applier based on the choice of target. There are no restrictions on extraction, providing the necessary pre-requisites are in place.

  • Ensure you select a Security group that allows communication to the source and target databases. The required network ports are detailed in Section B.3.3.1, “Network Ports”

5.1.2. Launch and Configure AMI

After launching the AMI, obtain the public IP and connect to the shell using your preferred terminal application, for example:

shell> ssh -i your-key.pem ec2-user@publicIP

Upon connecting, you will see a welcome message. From here you can connect as the tungsten user:

shell> sudo su - tungsten

The launch wizard will start automatically and start prompting you for details regarding your source or target database.

It is advisable to configure the Extractor AMI first as you will need to provide details of the extractor when you configure the applier.

Once you have provided all the information to the wizard, you will be prompted on screen for the next steps.

  • In summary, the wizard will have completed the following:

    • Created tungsten.ini within /etc/tungsten

    • Created additional directories for software installation

    • Created additional configuration files depending upon target requirements

    • Created a log file of the Wizard execution within /home/tungsten/ami-launch/log

  • The latest version of Tungsten Replicator will be unpacked within /opt/continuent/software

  • The wizard does not install the software. This allows you to fine tune the configuration to suit your needs, such as adding additional filters, or adjusting memory and buffer allocations.

  • For more information on all the possible configuration parameters, see Section 9.8, “tpm Configuration Options”

  • You can now install the software. Follow the on-screen instructions displayed after Wizard completion to install using tpm, or review Section 9.4.2, “Installation with INI File”

  • For further reading and understanding of how to manage the replicator, review Chapter 7, Operations Guide

  • For steps on starting and stopping the replicator, review Section 2.4, “Starting and Stopping Tungsten Replicator”

  • For details on how to monitor and interact with the running replicator using the trepctl tool, review Section 8.20, “The trepctl Command”

5.2. Deploying a Fan-In Topology

The fan-in topology is the logical opposite of a Primary/Replica topology. In a fan-in topology, the data from two Sources is combined together on one Target. Fan-in topologies are often used in situations where you have satellite databases, maybe for sales or retail operations, and need to combine that information together in a single database for processing.

Figure 5.1. Topologies: Fan-in

Topologies: Fan-in

Some additional considerations need to be made when using fan-in topologies:

  • If the same tables from each machine are being merged together, it is possible to get collisions in the data where auto increment is used. The effects can be minimized by using increment offsets within the MySQL configuration:

    auto-increment-offset = 1
    auto-increment-increment = 4
  • Fan-in can work more effectively, and be less prone to problems with the corresponding data, by configuring specific tables at different sites. For example, with two sites in New York and San Jose, databases and tables can be prefixed with the site name, i.e. sjc_sales and nyc_sales.

    Alternatively, a filter can be configured to rename the database sales dynamically to the corresponding location based tables. See Section 11.4.34, “Rename Filter” for more information.

  • Statement-based replication will work for most instances, but where your statements are updating data dynamically within the statement, in fan-in the information may get increased according to the number of fan-in Sources. Update your configuration file to explicitly use row-based replication by adding the following to your my.cnf file:

    binlog-format = row
  • Triggers can cause problems during fan-in replication if two different statements, one from each Source, are replicated to the Target and cause the operations to be triggered multiple times. Tungsten Replicator cannot prevent triggers from executing on the concentrator host and there is no way to selectively disable triggers. Check at the trigger level whether you are executing on a Source or Target. For more information, see Section C.4.1, “Triggers”.
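
The effect of the auto-increment settings mentioned in the first consideration above can be shown with a short Python sketch. It models the sequence MySQL generates as offset + N × increment. Giving each Source its own offset (1 and 2 here) with a shared increment of 4 is an assumption for illustration; the original configuration only shows the values for one host.

```python
def auto_increment_values(offset, increment, count):
    """First `count` auto-increment values for the given offset and increment."""
    return [offset + n * increment for n in range(count)]


# Hypothetical assignment: same increment on every Source, a unique offset each.
host1_ids = auto_increment_values(offset=1, increment=4, count=5)
host2_ids = auto_increment_values(offset=2, increment=4, count=5)

# The two Sources never generate the same value, so rows merged on the
# fan-in Target do not collide on the auto-increment column.
collisions = set(host1_ids) & set(host2_ids)
```

With an increment of 4, up to four Sources can each be given a distinct offset from 1 to 4.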

To create the configuration, the Extractors and services must be specified. The topology specification takes care of the actual configuration:

shell> ./tools/tpm configure epsilon \
    --topology=fan-in \
    --install-directory=/opt/continuent \
    --replication-user=tungsten \
    --replication-password=password \
    --master=host1,host2 \
    --members=host1,host2,host3 \
    --master-services=alpha,beta \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret
shell> vi /etc/tungsten/tungsten.ini
[epsilon]
topology=fan-in
install-directory=/opt/continuent
replication-user=tungsten
replication-password=password
master=host1,host2
members=host1,host2,host3
master-services=alpha,beta
rest-api-admin-user=apiuser
rest-api-admin-pass=secret

For additional options supported for configuration with tpm, see Chapter 9, The tpm Deployment Command.

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the installation has been completed, the service will be started and ready to use.

5.2.1. Management and Monitoring Fan-in Deployments

Once the service has been started, a quick view of the service status can be determined using trepctl. Because there are multiple services, the service name and host name must be specified explicitly. For example, to check the Extractor service on one of the fan-in hosts:

shell> trepctl -service alpha -host host1 status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000012:0000000000000418;0
appliedLastSeqno       : 0
appliedLatency         : 1.194
channels               : 1
clusterName            : alpha
currentEventId         : mysql-bin.000012:0000000000000418
currentTimeMillis      : 1375451438898
dataServerHost         : host1
extensions             : 
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host1:2112/
maximumStoredSeqNo     : 0
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host1:13306/
relativeLatency        : 6232.897
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : alpha
serviceType            : local
simpleServiceName      : alpha
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 6231.881
transitioningTo        : 
uptimeSeconds          : 6238.061
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

The corresponding Extractor service from the other host is beta on host2:

shell> trepctl -service beta -host host2 status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000012:0000000000000415;0
appliedLastSeqno       : 0
appliedLatency         : 0.941
channels               : 1
clusterName            : beta
currentEventId         : mysql-bin.000012:0000000000000415
currentTimeMillis      : 1375451493579
dataServerHost         : host2
extensions             : 
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://host2:2112/
maximumStoredSeqNo     : 0
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host2:13306/
relativeLatency        : 6286.579
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : beta
serviceType            : local
simpleServiceName      : beta
siteName               : default
sourceId               : host2
state                  : ONLINE
timeInStateSeconds     : 6285.823
transitioningTo        : 
uptimeSeconds          : 6291.053
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...

Note that because this is a fan-in topology, the sequence numbers and applied sequence numbers will be different for each service, as each service is independently storing data within the fan-in hub database.

The following sequence number combinations should match between the different hosts on each service:

Extractor Service   Source Host   Target Host
alpha               host1         host3
beta                host2         host3

The sequence numbers between host1 and host2 will not match, as they are two independent services.

For more information on using trepctl, see Section 8.20, “The trepctl Command”.

Definitions of the individual field descriptions in the above example output can be found in Section E.2, “Generated Field Reference”.

For more information on management and operational details for managing your cluster installation, see Chapter 7, Operations Guide.

5.3. Deploying Multiple Replicators on a Single Host

It is possible to install multiple replicators on the same host. This can be useful either when building complex topologies with multiple services, or in heterogeneous environments where you are reading from one database and writing to another that may be installed on the same server.

When installing multiple replicator services on the same host, different values must be set for the following configuration parameters:

5.3.1. Preparing Multiple Replicators

Before continuing with deployment you will need the following:

  1. The name to use for the service.

  2. The list of datasources in the service. These are the servers which will be running MySQL.

  3. The username and password of the MySQL replication user.

All servers must be prepared with the proper prerequisites. See Appendix B, Prerequisites for additional details.

  • RMI network port used for communicating with the replicator service.

    Set through the --rmi-port parameter to tpm. Note that RMI ports are configured in pairs; the default port is 10000, and port 10001 is then used automatically. When specifying an alternative port, the subsequent port must also be available. For example, specifying port 10002 also requires 10003.

  • THL network port used for exchanging THL data.

    Set through the --thl-port parameter to tpm. The default THL port is 2112. This option is required for services operating as Extractors.

  • Extractor THL port, i.e. the port from which an Applier will read THL events from the Extractor

    Set through the --master-thl-port parameter to tpm. When operating as an Applier, the explicit THL port should be specified to ensure that you are connecting to the THL port correctly.

  • Extractor hostname

    Set through the --master-thl-host parameter to tpm. This is optional if the Extractor hostname has been configured correctly through the --master parameter.

  • Installation directory used when the replicator is installed.

    Set through the --install-directory or --install-directory parameters to tpm. This directory must have been created, and be configured with suitable permissions before installation starts. For more information, see Section B.3.4, “Directory Locations and Configuration”.
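
As a rough illustration of the RMI port pairing rule described above, the following shell sketch lists the ports each replicator would occupy and checks that none is claimed twice. The helper names rmi_port_pair and ports_are_distinct are hypothetical and are not part of tpm; the port numbers are simply the defaults and alternatives mentioned in this section.

```shell
#!/bin/sh
# Hypothetical helper (not part of tpm): print both RMI ports implied by
# a given --rmi-port value, since RMI ports are allocated in pairs.
rmi_port_pair() {
    echo "$1 $(( $1 + 1 ))"
}

# Hypothetical helper: succeed only if no port appears more than once.
ports_are_distinct() {
    dupes=$(printf '%s\n' "$@" | sort -n | uniq -d)
    [ -z "$dupes" ]
}

# Assumed layout: first replicator on the defaults, second on alternatives.
extractor_ports="$(rmi_port_pair 10000) 2112"
applier_ports="$(rmi_port_pair 10002) 2113"

# The variables are left unquoted on purpose so each port becomes one argument.
if ports_are_distinct $extractor_ports $applier_ports; then
    echo "no port conflicts"
else
    echo "port conflict detected"
fi
```

Choosing 10001 as the second replicator's RMI port would be reported as a conflict, because the first replicator already uses it as the second port of its pair.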

5.3.2. Install Multiple Replicators

For example, to create two services, one that reads from MySQL and another that writes to MongoDB on the same host:

  1. Install the Tungsten Replicator package or download the Tungsten Replicator tarball, and unpack it:

    shell> cd /opt/continuent/software
    shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz
  2. Create the proper directories with appropriate ownership and permissions:

    shell> sudo mkdir /opt/applier /opt/extractor
    shell> sudo chown tungsten: /opt/applier/ /opt/extractor/
    shell> sudo chmod 700 /opt/applier/ /opt/extractor/
  3. Change to the Tungsten Replicator directory:

    shell> cd tungsten-replicator-7.1.4-10
  4. Configure the Extractor reading from MySQL. The Staging method is shown first, followed by the INI method:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/extractor \
        --user=tungsten \
        --profile-script=~/.bash_profile \
        --mysql-allow-intensive-checks=true \
        --disable-security-controls=true \
        --executable-prefix=ext \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=offboardhost \
        --members=offboardhost \
        --enable-heterogeneous-service=true \
        --replication-port=3306 \
        --replication-user=tungsten_alpha \
        --replication-password=secret \
        --datasource-mysql-conf=/etc/my.cnf \
        --svc-extractor-filters=colnames,pkey \
        --property=replicator.filter.pkey.addColumnsToDeletes=true \
        --property=replicator.filter.pkey.addPkeyToInserts=true \
        --mysql-enable-enumtostring=true \
        --mysql-enable-settostring=true \
        --mysql-use-bytes-for-string=false
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/extractor
    user=tungsten
    profile-script=~/.bash_profile
    mysql-allow-intensive-checks=true
    disable-security-controls=true
    executable-prefix=ext
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=offboardhost
    members=offboardhost
    enable-heterogeneous-service=true
    replication-port=3306
    replication-user=tungsten_alpha
    replication-password=secret
    datasource-mysql-conf=/etc/my.cnf
    svc-extractor-filters=colnames,pkey
    property=replicator.filter.pkey.addColumnsToDeletes=true
    property=replicator.filter.pkey.addPkeyToInserts=true
    mysql-enable-enumtostring=true
    mysql-enable-settostring=true
    mysql-use-bytes-for-string=false
    

    This is a standard configuration using the default ports, with the directory /opt/extractor.

  5. Configure the Applier for writing to MongoDB. The Staging method is shown first, followed by the INI method:

    shell> ./tools/tpm configure defaults \
        --reset \
        --install-directory=/opt/applier \
        --profile-script=~/.bash_profile \
        --skip-validation-check=InstallerMasterSlaveCheck \
        --executable-prefix=app \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --master=localhost \
        --members=localhost \
        --role=slave \
        --datasource-type=mongodb \
        --replication-user=tungsten \
        --replication-password=secret \
        --rmi-port=10002 \
        --master-thl-port=2112 \
        --master-thl-host=localhost \
        --thl-port=2113
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    install-directory=/opt/applier
    profile-script=~/.bash_profile
    skip-validation-check=InstallerMasterSlaveCheck
    executable-prefix=app
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    master=localhost
    members=localhost
    role=slave
    datasource-type=mongodb
    replication-user=tungsten
    replication-password=secret
    rmi-port=10002
    master-thl-port=2112
    master-thl-host=localhost
    thl-port=2113
    

    In this configuration, the Extractor THL port is specified explicitly, along with the THL port used by this replicator, the RMI port used for administration, and the installation directory /opt/applier.

  6. Run tpm to install the software

    shell> ./tools/tpm install

    During the startup and installation, tpm will notify you of any problems that need to be fixed before the service can be correctly installed and started. If start-and-report is set and the service starts correctly, you should see the configuration and current status of the service.

  7. Initialize your PATH and environment.

    shell> source /opt/extractor/share/env.sh
    shell> source /opt/applier/share/env.sh

  8. Check the replication status.

When multiple replicators have been installed, the status reported by trepctl depends on which replicator's executable is used. If /opt/extractor/tungsten/tungsten-replicator/bin/trepctl is used, the extractor service status will be reported. If /opt/applier/tungsten/tungsten-replicator/bin/trepctl is used, the applier service status will be reported.

To make things easier, the configuration examples above set executable-prefix, which sets up OS aliases. These aliases are created when you source the relevant env.sh file. This also happens by default when you log in to the host, provided profile-script has been specified.

The prefix and aliases simplify the use of all executables. For example, based on the setting of executable-prefix in the configuration examples above, you can report the status of the extractor by executing:

shell> ext_trepctl status

Or to check the applier service:

shell> app_trepctl status

Alternatively, a specific replicator can be checked by explicitly specifying the RMI port of the service. For example, to check the extractor service:

shell> trepctl -port 10000 status

Or to check the applier service:

shell> trepctl -port 10002 status

When an explicit port has been specified in this way, the executable used is irrelevant. Any valid trepctl instance will work.

Further, either path may be used to get a summary view using multi_trepctl:

shell> /opt/extractor/tungsten/tungsten-replicator/scripts/multi_trepctl
| host   | servicename | role   | state  | appliedlastseqno | appliedlatency |
| host1  | extractor   | master | ONLINE |                0 |          1.724 |
| host1  | applier     | slave  | ONLINE |                0 |          0.000 |

5.3.3. Best Practices: Multiple Replicators

Follow the guidelines in Section 2.2, “Best Practices”.

5.4. Replicating Data Into an Existing Dataservice

If you have an existing dataservice, data can be replicated from a standalone MySQL server into the service. The replication is configured by creating a service that reads from the standalone MySQL server and writes into the Primary of the target dataservice. By writing this way, changes are replicated to the Primary and Replica in the new deployment.

Additionally, using a replicator that writes data into an existing data service can be used when migrating from an existing service into a new Tungsten Cluster service.

Figure 5.2. Topologies: Replicating into a Dataservice

Topologies: Replicating into a Dataservice

In order to configure this deployment, there are two steps:

  1. Create a new replicator that reads this data and writes the replicated data into the Primary of the destination dataservice.

  2. Create a new replicator that reads the binary logs directly from the external MySQL service through the Primary of the destination dataservice.

There are also the following requirements:

  • The host to which you want to replicate must have Tungsten Replicator 5.3.0 or later.

  • Hosts on both the replicator and cluster must be able to communicate with each other.

  • The replication user on the source host must have been granted the RELOAD, REPLICATION SLAVE, and REPLICATION CLIENT privileges.

  • The replicator must be able to connect as the tungsten user to the databases within the cluster.

Install the Tungsten Replicator package (see Section 2.1.2, “Using the RPM package files”), or download the compressed tarball and unpack it on host1:

shell> cd /opt/replicator/software
shell> tar zxf tungsten-replicator-7.1.4-10.tar.gz

Change to the Tungsten Replicator staging directory:

shell> cd tungsten-replicator-7.1.4-10

Configure the replicator on host1

First we configure the defaults and a cluster alias that points to the Primaries and Replicas within the current Tungsten Cluster service that you are replicating from:

The Staging method is shown first, followed by the INI method:

shell> ./tools/tpm configure defaults \
    --install-directory=/opt/replicator \
    --rmi-port=10002 \
    --user=tungsten \
    --replication-user=tungsten \
    --replication-password=secret \
    --skip-validation-check=MySQLNoMySQLReplicationCheck \
    --rest-api-admin-user=apiuser \
    --rest-api-admin-pass=secret

shell> ./tools/tpm configure beta \
    --topology=direct \
    --master=host1 \
    --direct-datasource-host=host3 \
    --thl-port=2113

shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/replicator
rmi-port=10002
user=tungsten
replication-user=tungsten
replication-password=secret
skip-validation-check=MySQLNoMySQLReplicationCheck
rest-api-admin-user=apiuser
rest-api-admin-pass=secret

[beta]
topology=direct
master=host1
direct-datasource-host=host3
thl-port=2113

This creates a configuration that specifies that the topology should read directly from the source host, host3, writing directly to host1. An alternative THL port is provided to ensure that the THL listener is not operating on the same network port as the original.

Now install the service, which will create the replicator reading directly from host3 and writing into host1:

shell> ./tools/tpm install

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

Once the installation has been completed, you must update the position of the replicator so that it points to the correct position within the source database to prevent errors during replication. If the replication is being created as part of a migration process, determine the position of the binary log from the external replicator service used when the backup was taken. For example:

mysql> show master status\G
*************************** 1. row ***************************
            File: mysql-bin.000026
        Position: 1311
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)

Use dsctl set to update the replicator position to point to the Primary log position:

shell> /opt/replicator/tungsten/tungsten-replicator/bin/dsctl -service beta set \
    -reset -seqno 0 -epoch 0 \
    -source-id host3 -event-id mysql-bin.000026:1311
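
The -event-id value above is the binary log file name and position joined with a colon. As a minimal sketch, assuming the vertical (\G) output format shown earlier, the value could be assembled with a small filter. The function name binlog_event_id is hypothetical, and the sample text is copied from the example output in this section.

```shell
#!/bin/sh
# Hypothetical helper: read "show master status\G" output on stdin and
# print FILE:POSITION, the form passed to dsctl with -event-id.
binlog_event_id() {
    awk -F': *' '
        $1 ~ /^ *File$/     { file = $2 }
        $1 ~ /^ *Position$/ { pos = $2 }
        END {
            if (file != "" && pos != "") print file ":" pos
            else exit 1
        }
    '
}

# Sample input taken from the example output in this section.
sample='*************************** 1. row ***************************
            File: mysql-bin.000026
        Position: 1311
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)'

event_id=$(printf '%s\n' "$sample" | binlog_event_id)

# Print the resulting command for review rather than executing it.
echo "dsctl -service beta set -reset -seqno 0 -epoch 0 -source-id host3 -event-id $event_id"
```

Printing the command instead of running it leaves room to confirm the position against the backup before the replicator position is reset.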

Now start the replicator:

shell> /opt/replicator/tungsten/tungsten-replicator/bin/replicator start

Replication status should be checked by explicitly using the servicename and/or RMI port:

shell> /opt/replicator/tungsten/tungsten-replicator/bin/trepctl -service beta status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000026:0000000000001311;1252
appliedLastSeqno       : 5
appliedLatency         : 0.748
channels               : 1
clusterName            : beta
currentEventId         : mysql-bin.000026:0000000000001311
currentTimeMillis      : 1390410611881
dataServerHost         : host1
extensions             : 
host                   : host3
latestEpochNumber      : 1
masterConnectUri       : thl://host3:2112/
masterListenUri        : thl://host1:2113/
maximumStoredSeqNo     : 5
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://host3:13306/
relativeLatency        : 8408.881
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : beta
serviceType            : local
simpleServiceName      : beta
siteName               : default
sourceId               : host3
state                  : ONLINE
timeInStateSeconds     : 8408.21
transitioningTo        : 
uptimeSeconds          : 8409.88
useSSLConnection       : false
version                : Tungsten Replicator 7.1.4 build 10
Finished status command...
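
When scripting checks against status output such as the above, it can help to pull out a single field. The following is a sketch only: status_field is a hypothetical helper, not a Tungsten tool, and it assumes the "NAME : VALUE" layout shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one field from "trepctl status"
# output read on stdin. It splits on the first colon only, so values that
# themselves contain colons (such as URIs) are kept whole.
status_field() {
    awk -v key="$1" '
        {
            i = index($0, ":")
            if (i == 0) next
            name = substr($0, 1, i - 1)
            sub(/[ \t]+$/, "", name)
            if (name == key) {
                value = substr($0, i + 1)
                sub(/^[ \t]+/, "", value)
                sub(/[ \t]+$/, "", value)
                print value
            }
        }
    '
}

# A few lines copied from the example status output.
sample_status='appliedLastSeqno       : 5
masterConnectUri       : thl://host3:2112/
state                  : ONLINE'

state=$(printf '%s\n' "$sample_status" | status_field state)
uri=$(printf '%s\n' "$sample_status" | status_field masterConnectUri)
echo "state=$state uri=$uri"
```

In practice the input would be piped from the trepctl status command shown above rather than from a sample string.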

5.5. Deploying Parallel Replication

Parallel apply is an important technique for achieving high speed replication and reducing Replica lag. It works by spreading updates to Replicas over multiple threads that split transactions on each schema into separate processing streams. This in turn spreads I/O activity across many threads, which results in faster overall updates on the Replica. In ideal cases throughput on Replicas may improve by up to 5 times over single-threaded MySQL native replication.

Note

It is worth noting that the only thing Tungsten parallelizes is applying transactions to Replicas. All other operations in each replication service are single-threaded.

5.5.1. Application Prerequisites for Parallel Replication

Parallel replication works best on workloads that meet the following criteria:

  • ROW based binary logging must be enabled in the MySQL database.

  • Data are stored in independent schemas. If you have 100 customers per server with a separate schema for each customer, your application is a good candidate.

  • Transactions do not span schemas. Tungsten serializes such transactions, which is to say it stops parallel apply and runs them by themselves. If more than 2-3% of transactions are serialized in this way, most of the benefits of parallelization are lost.

  • Workload is well-balanced across schemas.

  • The Replica host(s) are capable and have free memory in the OS page cache.

  • The host on which the Replica runs has a sufficient number of cores to operate a large number of Java threads.

Not all workloads meet these requirements. If your transactions are within a single schema only, you may need to consider different approaches, such as Replica prefetch. Contact Continuent for other suggestions.

Parallel replication does not work well on underpowered hosts, such as Amazon m1.small instances. In fact, any host that is already I/O bound under single-threaded replication will typically not show much improvement with parallel apply.

5.5.2. Enabling Parallel Apply During Install

Parallel apply is enabled using the svc-parallelization-type and channels options of tpm. The parallelization type defaults to none, which means that parallel apply is disabled; set it to disk to enable parallel apply. The channels option sets the number of channels (i.e., threads) you propose to use for applying data. The following example shows a MySQL Applier installation with parallel apply enabled. The Replica will apply transactions using 10 channels.

The Staging method is shown first, followed by the INI method:

shell> ./tools/tpm configure defaults \
    --reset \
    --install-directory=/opt/continuent \
    --user=tungsten \
    --mysql-allow-intensive-checks=true \
    --profile-script=~/.bash_profile \
    --start-and-report=true

shell> ./tools/tpm configure alpha \
    --master=sourcehost \
    --members=localhost,sourcehost \
    --datasource-type=mysql \
    --replication-user=tungsten \
    --replication-password=secret \
    --svc-parallelization-type=disk \
    --channels=10

shell> vi /etc/tungsten/tungsten.ini
[defaults]
install-directory=/opt/continuent
user=tungsten
mysql-allow-intensive-checks=true
profile-script=~/.bash_profile
start-and-report=true

[alpha]
master=sourcehost
members=localhost,sourcehost
datasource-type=mysql
replication-user=tungsten
replication-password=secret
svc-parallelization-type=disk
channels=10

If the installation process fails, check the output of the /tmp/tungsten-configure.log file for more information about the root cause.

There are several additional options that default to reasonable values. You may wish to change them in special cases.

  • buffer-size — Sets the replicator block commit size, which is the number of transactions to commit at once on Replicas. Values up to 100 are normally fine.

  • native-slave-takeover — Used to allow Tungsten to take over from native MySQL replication and parallelize it. See here for more.

You can check the number of active channels on a Replica by looking at the "channels" property once the replicator restarts.

Replica shell> trepctl -service alpha status| grep channels
channels               : 10

Important

The channel count for a Primary will ALWAYS be 1 because extraction is single-threaded:

Primary shell> trepctl -service alpha status| grep channels
channels               : 1

Warning

Enabling parallel apply will dramatically increase the number of connections to the database server.

Typically the calculation on a Replica would be: Connections = Channel_Count x Service_Count x 2, so for a 4-way Composite Active/Active topology with 30 channels there would be 30 x 4 x 2 = 240 connections required for the replicator alone, not counting application traffic.

You may display the peak number of simultaneous connections used since the MySQL server was started:

mysql> SHOW STATUS LIKE 'max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 190   |
+----------------------+-------+
1 row in set (0.00 sec)

Below are suggestions for how to change the maximum connections setting in MySQL both for the running instance as well as at startup:

mysql> SET GLOBAL max_connections = 512;

mysql> SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 512   |
+-----------------+-------+
1 row in set (0.00 sec)

shell> vi /etc/my.cnf
#max_connections = 151
max_connections = 512
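
The connection estimate given in the warning above is simple arithmetic, and can be sketched in shell when planning a max_connections value. The function name replicator_connections is hypothetical, and the figures are those used in the example above.

```shell
#!/bin/sh
# Hypothetical helper: estimate replicator connections on a Replica as
# channels x services x 2, following the formula given in the text.
replicator_connections() {
    channels=$1
    services=$2
    echo $(( channels * services * 2 ))
}

# Figures from the example: 30 channels in a 4-way topology,
# compared against the max_connections value set above.
needed=$(replicator_connections 30 4)
max_connections=512

echo "replicator connections: $needed of $max_connections"
```

The estimate covers the replicator only, so application connections still need to fit within whatever headroom remains.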

5.5.3. Channels

Channels and Parallel Apply

Parallel apply works by using multiple threads for the final stage of the replication pipeline. These threads are known as channels. Restart points for each channel are stored as individual rows in table trep_commit_seqno if you are applying to a relational DBMS server, including MySQL, Oracle, and data warehouse products like Vertica.

When you set the channels argument, the tpm program configures the replication service to enable the requested number of channels. A value of 1 results in single-threaded operation.

Do not change the number of channels without setting the replicator offline cleanly. See the procedure later in this page for more information.

How Many Channels Are Enough?

Pick the smallest number of channels that loads the Replica fully. For evenly distributed workloads this means that you should increase channels so that more threads are simultaneously applying updates and soaking up I/O capacity. As long as each shard receives roughly the same number of updates, this is a good approach.

For unevenly distributed workloads, you may want to decrease channels to spread the workload more evenly across them. This ensures that each channel has productive work and minimizes the overhead of updating the channel position in the DBMS.

Once you have maximized I/O on the DBMS server leave the number of channels alone. Note that adding more channels than you have shards does not help performance as it will lead to idle channels that must update their positions in the DBMS even though they are not doing useful work. This actually slows down performance a little bit.
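
As a small sketch of the guidance above that channels beyond the number of shards sit idle, a requested channel count can be capped at the shard count before it is passed to tpm. The function name cap_channels is hypothetical, and treating each independent schema as one shard is an assumption made for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: never request more channels than there are shards,
# since the extra channels would do no useful work.
cap_channels() {
    requested=$1
    shards=$2
    if [ "$requested" -gt "$shards" ]; then
        echo "$shards"
    else
        echo "$requested"
    fi
}

# Assumed figures: 30 channels requested, 12 independent schemas.
channels=$(cap_channels 30 12)
echo "--channels=$channels"
```

This only sets an upper bound. The appropriate value below that bound still depends on how evenly the workload is spread and on when I/O on the DBMS server is saturated.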

Effect of Channels on Backups

If you back up a Replica that operates with more than one channel, say 30, you can only restore that backup on another Replica that operates with the same number of channels. Otherwise, reloading the backup is the same as changing the number of channels without a clean offline.

When operating Tungsten Replicator in a Tungsten cluster, you should always set the number of channels to be the same for all replicators. Otherwise you may run into problems if you try to restore backups across MySQL instances that load with different locations.

If the replicator has only a single channel enabled, you can restore the backup anywhere. The same applies if you run the backup after the replicator has been taken offline cleanly.

5.5.4. Parallel Replication and Offline Operation

5.5.4.1. Clean Offline Operation

When you issue a trepctl offline command, Tungsten Replicator will bring all channels to the same point in the log and then go offline. This is known as going offline cleanly. When a Replica has been taken offline cleanly the following are true:

When parallel replication is not enabled, you can take the replicator offline by stopping the replicator process. There is no need to issue a trepctl offline command first.

5.5.4.2. Tuning the Time to Go Offline Cleanly

Putting a replicator offline may take a while if the slowest and fastest channels are far apart, i.e., if one channel gets far ahead of another. The separation between channels is controlled by the maxOfflineInterval parameter, which defaults to 5 seconds. This sets the allowable distance between commit timestamps processed on different channels. You can adjust this value at installation or later. The following example shows how to change it after installation. This can be done at any time and does not require the replicator to go offline cleanly.

The Staging method is shown first, followed by the INI method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
    --property=replicator.store.parallel-queue.maxOfflineInterval=30

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

shell> vi /etc/tungsten/tungsten.ini
[alpha]
...
property=replicator.store.parallel-queue.maxOfflineInterval=30

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

The offline interval is only the approximate time that Tungsten Replicator will take to go offline. Up to a point, larger values (say 60 or 120 seconds) allow the replicator to parallelize in spite of a few operations that are relatively slow. However, the downside is that going offline cleanly can become quite slow.

5.5.4.3. Unclean Offline

If you need to take a replicator offline quickly, you can either stop the replicator process or issue the following command:

shell> trepctl offline -immediate

Both of these result in an unclean shutdown. However, parallel replication is completely crash-safe provided you use transactional table types like InnoDB, so you will be able to restart without causing Replica consistency problems.

Warning

You must take the replicator offline cleanly to change the number of channels or when reverting to MySQL native replication. Failing to do so can result in errors when you restart replication.

5.5.5. Adjusting Parallel Replication After Installation

5.5.5.1. How to Enable Parallel Apply After Installation

To enable parallel replication after installation, take the replicator offline cleanly using the following command:

shell> trepctl offline

Modify the configuration to add two parameters:

The Staging method is shown first, followed by the INI method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure defaults \
    --svc-parallelization-type=disk \
    --channels=10

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

[defaults]
...
svc-parallelization-type=disk
channels=10

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

Note

You may use an actual data service name in place of the keyword defaults.

Signal the changes by a complete restart of the Replicator process:

shell> replicator restart

You can check the number of active channels on a Replica by looking at the "channels" property once the replicator restarts.

Replica shell> trepctl -service alpha status| grep channels
channels               : 10

Important

The channel count for a Primary will ALWAYS be 1 because extraction is single-threaded:

Primary shell> trepctl -service alpha status| grep channels
channels               : 1

Warning

Enabling parallel apply will dramatically increase the number of connections to the database server.

Typically the calculation on a Replica would be: Connections = Channel_Count x Service_Count x 2, so for a 4-way Composite Active/Active topology with 30 channels there would be 30 x 4 x 2 = 240 connections required for the replicator alone, not counting application traffic.

You may display the peak number of simultaneous connections used since the MySQL server was started:

mysql> SHOW STATUS LIKE 'max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 190   |
+----------------------+-------+
1 row in set (0.00 sec)

Below are suggestions for how to change the maximum connections setting in MySQL both for the running instance as well as at startup:

mysql> SET GLOBAL max_connections = 512;

mysql> SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 512   |
+-----------------+-------+
1 row in set (0.00 sec)

shell> vi /etc/my.cnf
#max_connections = 151
max_connections = 512

5.5.5.2. How to Change Channels Safely

To change the number of channels you must take the replicator offline cleanly using the following command:

shell> trepctl offline

This command brings all channels up to the same transaction in the log, then goes offline. If you look in the trep_commit_seqno table, you will notice only a single row, which shows that updates to the Replica have been completely serialized to a single point. At this point you may safely reconfigure the number of channels on the replicator, for example using the following commands.

The Staging method is shown first, followed by the INI method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
    --channels=5

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

[alpha]
...
channels=5

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

You can check the number of active channels on a Replica by looking at the "channels" property in the trepctl status output once the replicator restarts.

If you attempt to reconfigure channels without going offline cleanly, Tungsten Replicator will signal an error when you attempt to go online with the new channel configuration. The cure is to revert to the previous number of channels, go online, and then go offline cleanly. Note that attempting to clean up the trep_commit_seqno and trep_shard_channel tables manually can result in your Replicas becoming inconsistent and requiring full resynchronization. You should only do such cleanup under direction from Continuent support.

Warning

Failing to follow the channel reconfiguration procedure carefully may result in your Replicas becoming inconsistent or failing. The cure is usually full resynchronization, so it is best to avoid this if possible.
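
The steps in this section can be collected into a small wrapper so that the offline step is never skipped. The following is a minimal sketch rather than a supported tool: the function names run and change_channels and the DRY_RUN variable are hypothetical, and it uses only the commands shown in this section. By default it prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: change the channel count only after a clean offline.
# DRY_RUN=1 (the default here) prints each command instead of running it.
# For a real run, start from the staging directory reported by
# `tpm query staging` (Staging method).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

change_channels() {
    local service=$1 channels=$2
    # 1. Bring all channels to the same transaction, then go offline.
    run trepctl -service "$service" offline || return 1
    # 2. Reconfigure the service and apply the change.
    run ./tools/tpm configure "$service" "--channels=$channels" || return 1
    run ./tools/tpm update || return 1
    # 3. Check the "channels" property once the replicator has restarted.
    run trepctl -service "$service" status
}

change_channels alpha 5
```

Because each step returns early on failure, the configuration is not touched when the replicator fails to go offline cleanly. Set DRY_RUN=0 only after reviewing the printed commands.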

5.5.5.3. How to Disable Parallel Replication Safely

The following steps describe how to gracefully disable parallel apply replication.

Replication Graceful Offline (critical first step)

To disable parallel apply, you must first take the replicator offline cleanly using the following command:

shell> trepctl offline

This command brings all channels up to the same transaction in the log, then goes offline. If you look in the trep_commit_seqno table, you will notice only a single row, which shows that updates to the Replica have been completely serialized to a single point. At this point you may safely disable parallel apply on the replicator, for example using the following command:

Staging method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
    --svc-parallelization-type=none \
    --channels=1

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

INI method:

[alpha]
...
svc-parallelization-type=none
channels=1

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

Verification

You can check the number of active channels on a Replica by looking at the "channels" property once the replicator restarts.

shell> trepctl -service alpha status| grep channels
channels               : 1

Notes and Warnings

If you attempt to reconfigure channels without going offline cleanly, Tungsten Replicator will signal an error when you attempt to go online with the new channel configuration. The cure is to revert to the previous number of channels, go online, and then go offline cleanly. Note that attempting to clean up the trep_commit_seqno and trep_shard_channel tables manually can result in your Replicas becoming inconsistent and requiring full resynchronization. You should only do such cleanup under direction from Continuent support.

Warning

Failing to follow the channel reconfiguration procedure carefully may result in your Replicas becoming inconsistent or failing. The cure is usually full resynchronization, so it is best to avoid this if possible.

5.5.5.4. How to Switch Parallel Queue Types Safely

As with channels, you should only change the parallel queue type after the replicator has gone offline cleanly. The following example shows how to update the parallel queue type after installation:

Staging method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
    --svc-parallelization-type=disk \
    --channels=5

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

INI method:

[alpha]
...
svc-parallelization-type=disk
channels=5

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

5.5.6. Monitoring Parallel Replication

Basic monitoring of a parallel deployment can be performed using the techniques in Chapter 7, Operations Guide. Specific operations for parallel replication are provided in the following sections.

5.5.6.1. Useful Commands for Monitoring Parallel Replication

The replicator has several helpful commands for tracking replication performance:

Command                        Description
trepctl status                 Shows basic variables, including the overall latency of the Replica and the number of apply channels
trepctl status -name shards    Shows the number of transactions for each shard
trepctl status -name stores    Shows the configuration and internal counters for stores between tasks
trepctl status -name tasks     Shows the number of transactions (events) and latency for each independent task in the replicator pipeline

5.5.6.2. Parallel Replication and Applied Latency On Replicas

The trepctl status appliedLastSeqno parameter shows the sequence number of the last transaction committed. Here is an example from a Replica with 5 channels enabled.

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000211:0000000020094456;0
appliedLastSeqno       : 78021
appliedLatency         : 0.216
channels               : 5
...
Finished status command...

When parallel apply is enabled, the meaning of appliedLastSeqno changes. It is the minimum recovery position across apply channels, which means it is the position where channels restart in the event of a failure. This number is quite conservative and may make replication appear to be further behind than it actually is.

  • Busy channels mark their position in table trep_commit_seqno as they commit. These positions are up to date with the traffic on each channel, but latency can differ between channels that carry many large transactions and channels that are more lightly loaded.

  • Inactive channels do not receive any transactions and hence do not mark their position. Tungsten sends a control event across all channels so that they mark their commit position in trep_commit_seqno. In lightly loaded systems, idle channels that have not yet marked their position can make the reported position lag the true state of the Replica by many seconds or even minutes.

For systems with few transactions it is useful to lower the synchronization interval to a smaller number of transactions, for example 500. The following command shows how to adjust the synchronization interval after installation:

Staging method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
    --property=replicator.store.parallel-queue.syncInterval=500

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

INI method:

[alpha]
...
property=replicator.store.parallel-queue.syncInterval=500

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

Note that there is a trade-off between the synchronization interval value and writes on the DBMS server. With the foregoing setting, each channel writes to the trep_commit_seqno table every 500 transactions. If 50 channels were configured, this could increase writes by up to 10%: 50 channels each adding one write per 500 transactions amounts to one extra write for every 10 transactions. In busy systems it is therefore better to use a higher synchronization interval.
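
The arithmetic behind this trade-off can be checked quickly for other channel counts and intervals. The sketch below is illustrative only: the function name sync_write_overhead is hypothetical, and it assumes each channel adds exactly one position write per synchronization interval, which is an upper bound.

```shell
#!/usr/bin/env bash
# Extra trep_commit_seqno writes per 100 transactions, assuming each channel
# adds one position write every syncInterval transactions.
sync_write_overhead() {
    local channels=$1 sync_interval=$2
    awk -v c="$channels" -v s="$sync_interval" \
        'BEGIN { printf "%.2f\n", 100 * c / s }'
}

sync_write_overhead 50 500      # 50 channels, syncInterval=500   -> 10.00
sync_write_overhead 50 10000    # 50 channels, syncInterval=10000 -> 0.50
```

With the syncInterval of 10000 shown in the status output below, the same 50 channels would add at most 0.5% extra writes.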

You can check the current synchronization interval by running the trepctl status -name stores command, as shown in the following example:

shell> trepctl status -name stores
Processing status command (stores)...
...
NAME                      VALUE
----                      -----
...
name                    : parallel-queue
...
storeClass              : com.continuent.tungsten.replicator.thl.THLParallelQueue
syncInterval            : 10000
Finished status command (stores)...

You can also force all channels to mark their current position by sending a heartbeat with the trepctl heartbeat command.

5.5.6.3. Relative Latency

Relative latency is reported by the relativeLatency parameter of trepctl status. It indicates the time elapsed since appliedLastSeqno last advanced; for example:

shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000211:0000000020094766;0
appliedLastSeqno       : 78022
appliedLatency         : 0.571
...
relativeLatency        : 8.944
Finished status command...

In this example the last transaction committed on the Replica 0.571 seconds after it committed on the Primary, and that commit took place 8.944 seconds ago. If relative latency increases significantly in a busy system, it may be a sign that replication is stalled. This is a good parameter to check in monitoring scripts.
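
A monitoring check along these lines might look like the following minimal sketch. The function name check_relative_latency, the threshold, and the return codes (1 for a warning, 3 when the parameter is missing) are assumptions chosen for illustration; the only thing taken from the replicator is the relativeLatency line of the trepctl status output.

```shell
#!/usr/bin/env bash
# Sketch of a monitoring check: reads `trepctl status` output on stdin and
# compares relativeLatency against a threshold in seconds.
check_relative_latency() {
    local threshold=$1 value
    value=$(awk -F: '
        { key = $1; gsub(/[ \t]+/, "", key) }
        key == "relativeLatency" { val = $2; gsub(/[ \t]+/, "", val); print val; exit }')
    if [ -z "$value" ]; then
        echo "UNKNOWN: relativeLatency not found"
        return 3
    fi
    # Compare as numbers; awk exits 0 when the value exceeds the threshold.
    if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 > t + 0) }'; then
        echo "WARNING: relativeLatency ${value}s exceeds ${threshold}s"
        return 1
    fi
    echo "OK: relativeLatency ${value}s"
}

# Sample input copied from the status output above.
check_relative_latency 5 <<'EOF'
appliedLatency         : 0.571
relativeLatency        : 8.944
EOF
echo "exit status: $?"
```

In practice the function would be fed live output, for example trepctl status | check_relative_latency 60. A sensible threshold depends on how busy the system is, since relative latency also grows on an idle Primary.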

5.5.6.4. Serialization Count

Serialization count refers to the number of transactions that the replicator has handled that cannot be applied in parallel because they involve dependencies across shards. For example, a transaction that spans multiple shards must serialize because it might cause an out-of-order update with respect to transactions that update a single shard only.

You can detect the number of transactions that have been serialized by looking at the serializationCount parameter using the trepctl status -name stores command. The following example shows a replicator that has processed 1512 transactions with 26 serialized.

shell> trepctl status -name stores
Processing status command (stores)...
...
NAME                      VALUE
----                      -----
criticalPartition       : -1
discardCount            : 0
estimatedOfflineInterval: 0.0
eventCount              : 1512
headSeqno               : 78022
maxOfflineInterval      : 5
maxSize                 : 10
name                    : parallel-queue
queues                  : 5
serializationCount      : 26
serialized              : false
...
Finished status command (stores)...

In this case 1.7% of transactions are serialized (26 of 1512). Generally speaking, you will lose the benefits of parallel apply if more than 1-2% of transactions are serialized.
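
The percentage can be derived directly from the status output. The following is a minimal sketch, with the hypothetical function name serialization_pct; it assumes that eventCount is listed before serializationCount within the same store, as in the sample output above.

```shell
#!/usr/bin/env bash
# Reads `trepctl status -name stores` output on stdin and prints the
# percentage of serialized transactions for the parallel queue.
serialization_pct() {
    awk -F: '
        { key = $1; gsub(/[ \t]+/, "", key); val = $2; gsub(/[ \t]+/, "", val) }
        key == "eventCount"         { events = val }
        # Pair the count with the eventCount seen most recently in this store.
        key == "serializationCount" { serialized = val; total = events; found = 1 }
        END {
            if (found && total + 0 > 0) {
                printf "%.1f\n", 100 * serialized / total
            } else {
                print "n/a"
            }
        }'
}

# Sample input copied from the stores output above; prints 1.7
serialization_pct <<'EOF'
eventCount              : 1512
headSeqno               : 78022
name                    : parallel-queue
serializationCount      : 26
serialized              : false
EOF
```

Against a live replicator the equivalent call would be trepctl status -name stores | serialization_pct.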

5.5.6.5. Maximum Offline Interval

The maximum offline interval (maxOfflineInterval) parameter controls the "distance" between the fastest and slowest channels when parallel apply is enabled. The replicator measures distance using the seconds between commit times of the last transaction processed on each channel. This time is roughly equivalent to the amount of time a replicator will require to go offline cleanly.

You can change maxOfflineInterval as shown in the following example. The value is defined in seconds.

Staging method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
    --property=replicator.store.parallel-queue.maxOfflineInterval=30

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

INI method:

[alpha]
...
property=replicator.store.parallel-queue.maxOfflineInterval=30

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

You can view the configured value as well as the estimated current value using the trepctl status -name stores command, as shown in the following example:

shell> trepctl status -name stores
Processing status command (stores)...
NAME                      VALUE
----                      -----
...
estimatedOfflineInterval: 1.3
...
maxOfflineInterval      : 30
...
Finished status command (stores)...

5.5.6.6. Workload Distribution

Parallel apply works best when transactions are distributed evenly across shards and those shards are distributed evenly across available channels. You can monitor the distribution of transactions over shards using the trepctl status -name shards command. This command lists transaction counts for all shards, as shown in the following example.

shell> trepctl status -name shards
Processing status command (shards)...
...
NAME                VALUE
----                -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno  : 78023
appliedLatency    : 0.255
eventCount        : 3523
shardId           : cust1
stage             : q-to-dbms
...
Finished status command (shards)...

If one or more shards have a very large eventCount value compared to the others, this is a sign that your transaction workload is poorly distributed across shards.

The listing of shards also offers a useful trick for finding serialized transactions. Transactions that Tungsten Replicator cannot safely parallelize are assigned the dummy shard ID #UNKNOWN. Look for this shard to find the count of serialized transactions. The appliedLastSeqno for this shard gives the sequence number of the most recent serialized transaction. As the following example shows, you can then list the contents of the transaction to see why it serialized. In this case, the transaction affected tables in different schemas.

shell> trepctl status -name shards
Processing status command (shards)...
NAME                VALUE
----                -----
appliedLastEventId: mysql-bin.000211:0000000020095529;0
appliedLastSeqno  : 78026
appliedLatency    : 0.558
eventCount        : 26
shardId           : #UNKNOWN
stage             : q-to-dbms
...
Finished status command (shards)...

shell> thl list -seqno 78026
SEQ# = 78026 / FRAG# = 0 (last frag)
- TIME = 2013-01-17 22:29:42.0
- EPOCH# = 1
- EVENTID = mysql-bin.000211:0000000020095529;0
- SOURCEID = logos1
- METADATA = [mysql_server_id=1;service=percona;shard=#UNKNOWN]
- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
    foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
    collation_connection = 8, collation_server = 33]
- SCHEMA =
- SQL(0) = insert into mats_0.foo values(1) /* ___SERVICE___ = [percona] */
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
    foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
    collation_connection = 8, collation_server = 33]
- SQL(1) = insert into mats_1.foo values(1)
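
To compare shards at a glance, the shard listing can be reduced to one line per shard. The sketch below uses the hypothetical function name shard_distribution and assumes that eventCount precedes shardId within each shard entry, as in the listings above. The sample input combines the two entries shown in this section purely for illustration.

```shell
#!/usr/bin/env bash
# Reads `trepctl status -name shards` output on stdin and prints each shard
# with its event count and its share of all events.
shard_distribution() {
    awk -F: '
        { key = $1; gsub(/[ \t]+/, "", key); val = $2; gsub(/[ \t]+/, "", val) }
        key == "eventCount" { events = val + 0 }
        key == "shardId"    { n++; name[n] = val; count[n] = events; total += events }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %d %.1f%%\n", name[i], count[i], (total ? 100 * count[i] / total : 0)
        }'
}

shard_distribution <<'EOF'
eventCount        : 3523
shardId           : cust1
stage             : q-to-dbms
eventCount        : 26
shardId           : #UNKNOWN
stage             : q-to-dbms
EOF
```

A shard whose share is far larger than the others points to a poorly distributed workload, and the #UNKNOWN line gives the count of serialized transactions.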

The replicator normally distributes shards evenly across channels. As each new shard appears, it is assigned to the next channel number, which then rotates back to 0 once the maximum number has been assigned. If the shards have uneven transaction distributions, this may lead to an uneven number of transactions on the channels. To check, use the trepctl status -name tasks command and look for tasks belonging to the q-to-dbms stage.

shell> trepctl status -name tasks
Processing status command (tasks)...
...
NAME                VALUE
----                -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno  : 78023
appliedLatency    : 0.248
applyTime         : 0.003
averageBlockSize  : 2.520
cancelled         : false
currentLastEventId: mysql-bin.000211:0000000020095076;0
currentLastFragno : 0
currentLastSeqno  : 78023
eventCount        : 5302
extractTime       : 274.907
filterTime        : 0.0
otherTime         : 0.0
stage             : q-to-dbms
state             : extract
taskId            : 0
...
Finished status command (tasks)...

If you see one or more channels that have a very high eventCount, consider either assigning shards explicitly to channels or redistributing the workload in your application to get better performance.

5.5.7. Controlling Assignment of Shards to Channels

Tungsten Replicator by default assigns channels using a round robin algorithm that assigns each new shard to the next available channel. The current shard assignments are tracked in table trep_shard_channel in the Tungsten catalog schema for the replication service.

For example, if you have 2 channels enabled and Tungsten processes three different shards, you might end up with a shard assignment like the following:

foo => channel 0
bar => channel 1
foobar => channel 0

This algorithm generally gives the best results for most installations and is crash-safe, since the contents of the trep_shard_channel table persist if either the DBMS or the replicator fails.
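
The round-robin rule described above can be illustrated with a few lines of shell. This is a sketch of the behaviour only, not the replicator's implementation: the function name assign_round_robin is hypothetical, and the assignments are kept in memory rather than in the trep_shard_channel table.

```shell
#!/usr/bin/env bash
# Illustration of round-robin assignment: each shard seen for the first time
# gets the next channel number, wrapping back to 0 after the last channel.
# Reads shard names on stdin, one per line, in order of appearance.
assign_round_robin() {
    awk -v channels="$1" '
        BEGIN { next_channel = 0 }
        !($1 in assigned) {
            assigned[$1] = next_channel
            print $1 " => channel " assigned[$1]
            next_channel = (next_channel + 1) % channels
        }'
}

# Three distinct shards over 2 channels; the repeated "foo" keeps channel 0.
printf '%s\n' foo bar foobar foo | assign_round_robin 2
```

The output matches the assignment shown above: foo and foobar share channel 0, and bar is on channel 1.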

It is possible to override the default assignment by updating the shard.list file found in the tungsten-replicator/conf directory. This file normally looks like the following:

# SHARD MAP FILE.
# This file contains shard handling rules used in the ShardListPartitioner
# class for parallel replication.  If unchanged shards will be hashed across
# available partitions.

# You can assign shards explicitly using a shard name match, where the form
# is <db>=<partition>.
#common1=0
#common2=0
#db1=1
#db2=2
#db3=3

# Default partition for shards that do not match explicit name.
# Permissible values are either a partition number or -1, in which
# case values are hashed across available partitions.  (-1 is the
# default.
#(*)=-1

# Comma-separated list of shards that require critical section to run.
# A "critical section" means that these events are single-threaded to
# ensure that all dependencies are met.
#(critical)=common1,common2

# Method for channel hash assignments.  Allowed values are round-robin and
# string-hash.
(hash-method)=round-robin

You can update the shard.list file to do three types of custom overrides.

  1. Change the hashing method for channel assignments. Round-robin uses the trep_shard_channel table. The string-hash method just hashes the shard name.

  2. Assign shards to explicit channels. Add lines of the form shard=channel to the file as shown by the commented-out entries.

  3. Define critical shards. These are shards that must be processed in serial fashion. For example if you have a sharded application that has a single global shard with reference information, you can declare the global shard to be critical. This helps avoid applications seeing out of order information.

Changes to shard.list must be made with care. The same cautions apply here as for changing the number of channels or the parallelization type. For subscription customers we strongly recommend conferring with Continuent Support before making changes.

5.5.8. Disk vs. Memory Parallel Queues

Channels receive transactions through a special type of queue, known as a parallel queue. Tungsten offers two implementations of parallel queues, which vary in their performance as well as the requirements they may place on hosts that operate parallel apply. You choose the type of queue to enable using the --svc-parallelization-type option.

Warning

Do not change the parallel queue type without setting the replicator offline cleanly. See Section 5.5.5.4, “How to Switch Parallel Queue Types Safely” for more information.

Disk Parallel Queue (disk option)

A disk parallel queue uses a set of independent threads to read from the Transaction History Log and feed short in-memory queues used by channels. Disk queues have the advantage that they minimize memory required by Java. They also allow channels to operate some distance apart, which improves throughput. For instance, one channel may apply a transaction that committed 2 minutes before the transaction another channel is applying. This separation keeps a single slow transaction from blocking all channels.

Disk queues minimize memory consumption of the Java VM but to function efficiently they do require pages from the Operating System page cache. This is because the channels each independently read from the Transaction History Log. As long as the channels are close together the storage pages tend to be present in the Operating System page cache for all threads but the first, resulting in very fast reads. If channels become widely separated, for example due to a high maxOfflineInterval value, or the host has insufficient free memory, disk queues may operate slowly or impact other processes that require memory.

Memory Parallel Queue (memory option)

A memory parallel queue uses a set of in-memory queues to hold transactions. One stage reads from the Transaction History Log and distributes transactions across the queues. The channels each read from one of the queues. In-memory queues have the advantage that they do not need extra threads to operate, hence reduce the amount of CPU processing required by the replicator.

When you use in-memory queues you must set the maxSize property on the queue to a relatively large value. This value sets the total number of transaction fragments that may be in the parallel queue at any given time. If the queue hits this value, it does not accept further transaction fragments until existing fragments are processed. For best performance it is often necessary to use a relatively large number, for example 10,000 or greater.

The following example shows how to set the maxSize property after installation. This value can be changed at any time and does not require the replicator to go offline cleanly:

Staging method:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten

shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
    --property=replicator.store.parallel-queue.maxSize=10000

Run the tpm command to update the software with the Staging-based configuration:

shell> ./tools/tpm update

For information about making updates when using a Staging-method deployment, please see Section 9.3.7, “Configuration Changes from a Staging Directory”.

INI method:

[alpha]
...
property=replicator.store.parallel-queue.maxSize=10000

Run the tpm command to update the software with the INI-based configuration:

shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-replicator-7.1.4-10

shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-replicator-7.1.4-10

shell> cd {STAGING_DIRECTORY}

shell> ./tools/tpm update

For information about making updates when using an INI file, please see Section 9.4.4, “Configuration Changes with an INI file”.

You may need to increase the Java VM heap size when you increase the parallel queue maximum size. Use the --java-mem-size option on the tpm command for this purpose or edit the Replicator wrapper.conf file directly.

Warning

Memory queues are not recommended for production use at this time. Use disk queues.

5.6. Batch Loading for Data Warehouses

Tungsten Replicator normally applies SQL changes to Targets by constructing SQL statements and executing them in the exact order that transactions appear in the Transaction History Log (THL). This works well for OLTP databases like MySQL, Oracle, and MongoDB. However, it is a poor approach for data warehouses.

Data warehouse products like Vertica or Redshift load very slowly through JDBC interfaces (50 times slower or even more compared to MySQL). Instead, such databases supply batch loading commands that upload data in parallel. For instance, Vertica uses the COPY command.

Tungsten Replicator has a batch applier named SimpleBatchApplier that groups transactions and then loads data. This is known as "batch apply." You can configure Tungsten to load tens of thousands of transactions at once using templates that apply the correct commands for your chosen data warehouse.

While we use the term batch apply, Tungsten is not batch-oriented in the sense of traditional Extract/Transform/Load tools, which may run only a small number of batches a day. Tungsten builds batches automatically as transactions arrive in the log. The mechanism is designed to be self-adjusting. If small transaction batches cause loading to be slower, Tungsten will automatically tend to adjust the batch size upwards until it no longer lags during loading.

5.6.1. How It Works

The batch applier loads data into the Target DBMS using CSV files and appropriate load commands like LOAD DATA INFILE or COPY. Here is the basic algorithm.

While executing within a commit block, Tungsten writes incoming transactions into open CSV files using the class CsvWriter. There is one CSV file per database table. The following sample shows typical contents.

"I","84900","1","2016-03-11 20:51:10.000","986","http://www.continent.com/software"
"D","84901","2","2016-03-11 20:51:10.000","143",null
"I","84901","3","2016-03-11 20:51:10.000","143","http://www.microsoft.com"

Tungsten adds four extra column values to each line of CSV output.

Column       Description
opcode       A transaction code that has the value "I" for insert and "D" for delete. Other types are available.
seqno        The Tungsten transaction sequence number
row_id       A line number that starts with 1 and increments by 1 for each new row
timestamp    The commit timestamp, i.e. the origin timestamp of the committed statement that generated the row information.

Different update types are handled as follows:

  • Each insert generates a single row containing all values in the row with an "I" opcode.

  • Each delete generates a single row with the key and a "D" opcode. Non-key fields are null.

  • Each update results in a delete with the row key followed by an insert.

  • Statements are ignored. If you need DDL changes on the Target, you must apply them yourself.

Tungsten writes each row update into the CSV file for the corresponding table. At commit time the following steps occur:

  1. Flush and close each CSV file. This ensures that if there is a failure the files are fully visible in storage.

  2. For each table execute a merge script to move the data from CSV into the data warehouse. This script varies depending on the data warehouse type or even for a specific application. It generally consists of a sequence of operating system commands, load commands like COPY or LOAD DATA INFILE to load in the CSV data, and ordinary SQL commands to move/massage data.

  3. When all tables are loaded, issue a single commit on the SQL connection.

The main requirement of merge scripts is that they must ensure rows load and that delete and insert operations apply in the correct order. Tungsten includes load scripts for MySQL and Vertica that do this automatically.
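
The ordering requirement can be illustrated outside any database. The following sketch is not one of the bundled load scripts: the function name apply_batch is hypothetical, and it assumes that fields contain no embedded commas or quotes, that the single-column key of the table is CSV column 5, and that the only other table column is column 6. It replays the sample CSV rows shown earlier in seqno and row_id order and prints the rows that remain.

```shell
#!/usr/bin/env bash
# Sketch of the ordering a merge script must preserve. Reads Tungsten-style
# CSV rows on stdin and prints the resulting "key,value" rows.
# Rows are ordered by seqno (column 2), then row_id (column 3), so that the
# delete half of an update is always applied before its insert half.
apply_batch() {
    tr -d '"' |
        sort -t, -k2,2n -k3,3n |
        awk -F, '
            # A delete removes the row for this key.
            $1 == "D" { delete row[$5] }
            # An insert creates or recreates the row.
            $1 == "I" { row[$5] = $6 }
            END { for (key in row) print key "," row[key] }' |
        sort -t, -k1,1n
}

apply_batch <<'EOF'
"I","84900","1","2016-03-11 20:51:10.000","986","http://www.continent.com/software"
"D","84901","2","2016-03-11 20:51:10.000","143",null
"I","84901","3","2016-03-11 20:51:10.000","143","http://www.microsoft.com"
EOF
```

For the sample rows, key 143 ends up with the value from the later insert and key 986 keeps its inserted value. If the delete and insert for key 143 were applied in the opposite order, the row would be lost, which is why merge scripts must respect seqno and row_id.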

It is common to use staging tables to help load data. These are described in more detail in a later section.

5.6.2. Important Limitations

Tungsten currently has some important limitations for batch loading, namely:

  1. Primary keys must be a single column only. Tungsten does not handle multi-column keys.

  2. Binary data is not certified and may cause problems when converted to CSV as it will be converted to Unicode.

These limitations will be relaxed in future releases.

5.6.3. Batch Applier Setup

Here is how to set up batch loading from a MySQL Source. For more information on specific data warehouse types, refer to Chapter 2, Deployment Overview.

  1. Enable row replication on the MySQL Source using set global binlog_format=row or by updating my.cnf.

  2. Ensure that both your source and target databases operate in GMT.

  3. Install using the --batch-enabled=true option. Here is a typical Vertica applier configuration, taken from Section 4.3, “Deploying the Vertica Applier”:

    Staging method:

    shell> ./tools/tpm configure defaults \
        --reset \
        --user=tungsten \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --skip-validation-check=HostsFileCheck \
        --skip-validation-check=InstallerMasterSlaveCheck \
        --rest-api-admin-user=apiuser \
        --rest-api-admin-pass=secret
    
    shell> ./tools/tpm configure alpha \
        --topology=master-slave \
        --master=sourcehost \
        --members=localhost \
        --datasource-type=vertica \
        --replication-user=dbadmin \
        --replication-password=password \
        --vertica-dbname=dev \
        --batch-enabled=true \
        --batch-load-template=vertica6 \
        --batch-load-language=js \
        --replication-port=5433 \
        --svc-applier-filters=dropstatementdata \
        --svc-applier-block-commit-interval=30s \
        --svc-applier-block-commit-size=25000 \
        --disable-relay-logs=true
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    user=tungsten
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    skip-validation-check=HostsFileCheck
    skip-validation-check=InstallerMasterSlaveCheck
    rest-api-admin-user=apiuser
    rest-api-admin-pass=secret
    
    [alpha]
    topology=master-slave
    master=sourcehost
    members=localhost
    datasource-type=vertica
    replication-user=dbadmin
    replication-password=password
    vertica-dbname=dev
    batch-enabled=true
    batch-load-template=vertica6
    batch-load-language=js
    replication-port=5433
    svc-applier-filters=dropstatementdata
    svc-applier-block-commit-interval=30s
    svc-applier-block-commit-size=25000
    disable-relay-logs=true
    

5.6.4. JavaScript Batchloader Scripts

The JavaScript batchloader enables data to be loaded into data warehouse and other targets through a simplified JavaScript command script. The script implements functions for specific stages of the apply process, from preparation to commit, allowing for internal data, external commands, and other operations to be executed in sequence.

The actual loading process works through the specification of a JavaScript batchload script that defines what operations to perform during each stage of the batchloading process. These mirror the basic steps in the operation of applying the data that is being batchloaded, as shown in Figure 5.3, “Batchloading: JavaScript”.

Figure 5.3. Batchloading: JavaScript

To summarize:

  • prepare() is called when the replicator goes online

  • begin() is called before a single transaction starts

  • apply() is called to copy and load the raw CSV data

  • commit() is called after the raw data has been loaded

  • release() is called when the replicator goes offline
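The call order summarized above can be modelled with a small driver. This is only an illustrative sketch in Python, not the replicator's implementation: the RecordingScript class, the run_batches function and the CSV file names are hypothetical, and the sketch assumes that apply() is invoked once per CSV file within a transaction.

```python
class RecordingScript:
    """Hypothetical stand-in for a batchload script; records each call."""

    def __init__(self):
        self.calls = []

    def prepare(self):
        self.calls.append("prepare")

    def begin(self):
        self.calls.append("begin")

    def apply(self, csv_file):
        self.calls.append("apply:" + csv_file)

    def commit(self):
        self.calls.append("commit")

    def release(self):
        self.calls.append("release")


def run_batches(script, transactions):
    """Drive the script through its stages.

    transactions is a list of lists of CSV file names
    (assumption: one apply() call per CSV file).
    """
    script.prepare()                      # replicator goes online
    try:
        for csv_files in transactions:
            script.begin()                # before the transaction starts
            for csv_file in csv_files:
                script.apply(csv_file)    # copy and load the raw CSV data
            script.commit()               # after the raw data has been loaded
    finally:
        script.release()                  # replicator goes offline
    return script.calls


calls = run_batches(
    RecordingScript(),
    [["sales-orders-110.csv"], ["sales-items-146.csv", "sales-orders-146.csv"]],
)
print(calls)
```

The point of the model is that prepare() and release() bracket the whole online period, while begin(), apply() and commit() repeat for every transaction.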

5.6.4.1. JavaScript Batchloader with Parallel Apply

The JavaScript batchloader can be used with parallel apply to enable multiple threads to be generated and apply data to the target database. This can be useful in datawarehouse environments where simultaneous loading (and commit) enables effective application of multiple table data into the datawarehouse.

  • The defined JavaScript methods like prepare, begin, commit, and release are called independently for each environment. This means that you should ensure actions in these methods do not conflict with each other.

  • CSV files are divided across the scripts. If there is a large number of files that all take about the same time to load and there are three threads (parallelization=3), each individual load script will see about a third of the files. You should therefore not code assumptions that you have seen all tables or CSV files in a single script.

  • Parallel load scripts are only recommended for data sources like Hadoop that are idempotent. When applying to a data source that is non-idempotent (for example MySQL or potentially Vertica) you should just use a single thread.

5.6.5. Staging Tables

Staging tables are intermediate tables that help with data loading. There are different usage patterns for staging tables.

5.6.5.1. Staging Table Names

Tungsten assumes that staging tables, if present, follow certain conventions for naming and provides a number of configuration properties for generating staging table names that match the base tables in the data warehouse without colliding with them.

Property            Description
stageColumnPrefix   Prefix for seqno, row_id, and opcode columns generated by Tungsten
stageTablePrefix    Prefix for stage table name
stageSchemaPrefix   Prefix for the schema in which the stage tables reside

These values are set in the static properties file that defines the replication service. They can be set at install time using --property options. The following example shows typical values from a service properties file.

replicator.applier.dbms.stageColumnPrefix=tungsten_
replicator.applier.dbms.stageTablePrefix=stage_xxx_
replicator.applier.dbms.stageSchemaPrefix=load_

If your data warehouse contains a table named foo in schema bar, these properties would result in a staging table name of load_bar.stage_xxx_foo for the staging table. The Tungsten generated column containing the seqno, if present, would be named tungsten_seqno.

Note

Staging tables are by default in the same schema as the table they update. You can put them in a different schema using the stageSchemaPrefix property as shown in the example.
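The naming rule described above can be expressed as a short helper. This is a minimal Python sketch for illustration only: the staging_names function is hypothetical and simply concatenates the three prefixes in the way the example describes.

```python
def staging_names(schema, table,
                  stage_schema_prefix="load_",
                  stage_table_prefix="stage_xxx_",
                  stage_column_prefix="tungsten_"):
    """Return (qualified staging table name, seqno column name)."""
    stage_schema = stage_schema_prefix + schema
    stage_table = stage_table_prefix + table
    return stage_schema + "." + stage_table, stage_column_prefix + "seqno"


# Table foo in schema bar, using the property values from the example.
print(staging_names("bar", "foo"))

# With an empty schema prefix the staging table stays in the base schema.
print(staging_names("bar", "foo", stage_schema_prefix=""))
```

The first call yields load_bar.stage_xxx_foo and tungsten_seqno, matching the names given in the text.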

5.6.5.2. Whole Record Staging

Whole record staging loads the entire CSV file into an identical table, then runs queries to apply rows to the base table or tables in the data warehouse. One of the strengths of whole record staging is that it allows you to construct a merge script that can handle any combination of INSERT, UPDATE, or DELETE operations. A weakness is that whole record staging can result in sub-optimal I/O for workloads that consist mostly of INSERT operations.

For example, suppose we have a base table created by the following CREATE TABLE command:

CREATE TABLE `mydata` (
`id` int(11) NOT NULL,
`f_data` float DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

A whole record staging table would look as follows.

CREATE TABLE `stage_xxx_croc_mydata` (
`tungsten_opcode` char(1) DEFAULT NULL,
`tungsten_seqno` int(11) DEFAULT NULL,
`tungsten_row_id` int(11) DEFAULT NULL,
`id` int(11) NOT NULL,
`f_data` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Note that this table does not have a primary key defined. Most data warehouses do not use primary keys and many of them do not even permit it in the create table syntax.

Note also that the non-primary columns must permit nulls. This is required for deletes, which contain only the Tungsten generated columns plus the primary key.
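To make the merge step concrete, the following Python sketch loads a few staged rows and merges them into the base table. It is an illustration under stated assumptions, not one of the merge scripts shipped with Tungsten: it uses an in-memory SQLite database purely so that it can be run anywhere, the staged values are invented, and the two merge statements are one possible way to apply deletes before inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mydata (id INTEGER PRIMARY KEY, f_data REAL);
CREATE TABLE stage_xxx_croc_mydata (
    tungsten_opcode TEXT,
    tungsten_seqno INTEGER,
    tungsten_row_id INTEGER,
    id INTEGER NOT NULL,
    f_data REAL
);
INSERT INTO mydata VALUES (1, 1.5), (2, 2.5);
""")

# Staged changes: update id=1 (delete + insert), delete id=2, insert id=3.
conn.executemany(
    "INSERT INTO stage_xxx_croc_mydata VALUES (?, ?, ?, ?, ?)",
    [
        ("D", 10, 1, 1, None),   # deletes carry only the key columns
        ("I", 10, 2, 1, 9.5),
        ("D", 11, 3, 2, None),
        ("I", 12, 4, 3, 3.5),
    ],
)

# Step 1: remove every base row that has a staged delete.
conn.execute("""
DELETE FROM mydata
WHERE id IN (SELECT id FROM stage_xxx_croc_mydata
             WHERE tungsten_opcode = 'D')
""")

# Step 2: insert a staged row only when that insert is the last
# operation recorded for its key (highest row_id for the key).
conn.execute("""
INSERT INTO mydata (id, f_data)
SELECT s.id, s.f_data
FROM stage_xxx_croc_mydata s
WHERE s.tungsten_opcode = 'I'
  AND s.tungsten_row_id = (SELECT MAX(t.tungsten_row_id)
                           FROM stage_xxx_croc_mydata t
                           WHERE t.id = s.id)
""")

conn.execute("DELETE FROM stage_xxx_croc_mydata")  # empty the staging table
conn.commit()

rows = conn.execute("SELECT id, f_data FROM mydata ORDER BY id").fetchall()
print(rows)
```

Running deletes before inserts is what allows an update, staged as a delete followed by an insert, to replace the existing row without a key conflict.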

5.6.5.3. Delete Key Staging

Another approach is to load INSERT rows directly into the base data warehouse tables without staging. All you need to stage is the keys for deleted records. This reduces I/O considerably for workloads that have mostly inserts. The downside is that it may introduce ordering dependencies between DELETE and INSERT operations that require special handling by upstream applications to generate transactions that will load without conflicts.

Delete key staging tables can be as simple as the following example:

CREATE TABLE `stage_xxx_croc_mydata` (
`id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

5.6.5.4. Staging Table Generation

Tungsten does not generate staging tables automatically. Creation of staging tables is the responsibility of users, but it can be simplified by using the ddlscan tool with the right template.

5.6.6. Character Sets

Character sets are a headache in batch loading because all updates are written to and read from CSV files, which can result in invalid transactions along the replication path. Such problems are very difficult to debug. Here are some tips to improve the chances of trouble-free replication.

  • Use UTF8 character sets consistently for all string and text data.

  • Force Tungsten to convert data to Unicode rather than transferring strings:

    mysql-use-bytes-for-string=false

  • When starting the replicator for MySQL replication, include the following option:

    java-file-encoding=UTF8

5.6.7. Supported CSV Formats

Tungsten Replicator supports a number of CSV formats that can and should be used with specific heterogeneous environments when using the batch loading process, or generating CSV files in general for testing or loading.

A number of standard types are included, and the use of these standard types when generating CSV is controlled by the replicator.datasource.global.csvType property. Depending on the configured target, the corresponding type will be configured automatically. For example, if you configure a Vertica deployment, the replicator will be configured to default to the Vertica style CSV format.

Warning

Using the wrong CSV format with a given target may break replication. You should always use the appropriate CSV format for the defined target.

Table 5.1. Supported CSV Formats

Format Field Separator Record Separator Escape Sequence Escaped Characters Null Policy Null Value Show Headers Use Quotes Quote String Suppressed Characters
hive \u0001 \n \\ \u0001\\ Use Null Value \\N false false \n\r
mysql , \n \\ \\ Use Null Value \\N false true \"  
oracle , \n \\ \\ Use Null Value \\N false true \"  
vertica , \n \\ \\ Skip Value false true \" \n
redshift , \n \" Skip Value false true \" \n

In addition to the standardised types, the replicator.datasource.global.csvType property can be set to custom, in which case the following configurable values are used instead:

  • replicator.datasource.global.csv.fieldSeparator — the character used to separate fields, such as , (comma).

  • replicator.datasource.global.csv.RecordSeparator — the character used to separate records, such as the newline character.

  • replicator.datasource.global.csv.nullValue — the value to use for NULL (empty) values.

  • replicator.datasource.global.csv.useQuotes — whether to use quotes to encapsulate field values (specified using true or false).

  • replicator.datasource.global.csv.useHeaders — whether to include the column headers in the generated CSV (specified using true or false).

5.6.8. Columns in Generated CSV Files

The CSV generated when using the batch loading process creates a number of special columns that are designed to hold the appropriate information for loading the staging data into the target system.

Five fields are supported, of which the first four are enabled by default:

  • opcode — The operation code, a one- or two-letter code indicating the operation type. For more information on the supported codes, see Section 5.6.9, “Batchloading Opcodes”.

  • seqno — Contains the current THL event (sequence) number for the row data being loaded. The sequence number generated is specific to the THL event number.

  • row_id — Contains a unique row ID (a monotonically incrementing number) which is unique to this CSV file for the table data being loaded. This can be useful for systems where the sequence number alone is not enough to identify an incoming row, even with the incoming primary key information.

  • commit_timestamp — the timestamp of when the data was originally committed by the source database, taken from the TIME within the THL event.

  • service — the service name of the replicator service that performed the loading and generated the CSV. This field is not enabled by default, but is provided to allow for data concentration into a BigData target while enabling identification of the source service and/or database that generated the data.

These fields are placed before the actual data for the corresponding table. For example, with the default setting, the following CSV is generated; the last three columns are specific to the table data:

"I","74","1","2017-05-26 13:00:11.000","655337","Dr No","kat"

The configuration of the list of fields, and the order in which they appear, is controlled by the replicator.applier.dbms.stageColumnNames property. By default, all four fields, in the order shown above, are used:

replicator.applier.dbms.stageColumnNames=opcode,seqno,row_id,commit_timestamp

The actual names used (and passed to the JavaScript environment) are also controlled by another property, replicator.applier.dbms.stageColumnPrefix. This value is prepended to each column within the JS environment, and expected by the various tools. For example, with the default tungsten_ the true name for the opcode is tungsten_opcode.
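As an illustration of how the generated columns line up with the table data, the following Python sketch splits the sample row shown earlier into the Tungsten columns and the table columns. The split_row function and the table column names (id, title, category) are hypothetical, and Python's default CSV dialect is assumed to be close enough for this simple quoted row; it does not reproduce the escaping rules of each Tungsten CSV format.

```python
import csv
import io

# Default values of the two properties described above.
STAGE_COLUMN_NAMES = "opcode,seqno,row_id,commit_timestamp".split(",")
STAGE_COLUMN_PREFIX = "tungsten_"


def split_row(line, table_columns):
    """Split one CSV line into Tungsten metadata and table data."""
    fields = next(csv.reader(io.StringIO(line)))
    names = [STAGE_COLUMN_PREFIX + name for name in STAGE_COLUMN_NAMES]
    meta = dict(zip(names, fields[:len(names)]))
    data = dict(zip(table_columns, fields[len(names):]))
    return meta, data


line = '"I","74","1","2017-05-26 13:00:11.000","655337","Dr No","kat"'
meta, data = split_row(line, ["id", "title", "category"])
print(meta)
print(data)
```

Because the metadata columns always come first, a change to stageColumnNames shifts the position at which the table data starts.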

Warning

Modifying the list of fields generated by the CSV writer may stop batchloading from working. Unless otherwise noted, the default batchloading scripts all expect to see the default four columns (opcode, seqno, row_id and commit_timestamp).

5.6.9. Batchloading Opcodes

The batchloading and CSV generation processes use the opcode value to specify the operation type for each row. The default mode is to use only the I and D codes for inserts and deletes respectively, with an update being represented as two rows, one a delete and the other an insert of the new information.

This behavior can be altered to denote updates with a U character, with the row containing the updated information. To enable this mode, set the replicator.applier.dbms.useUpdateOpcode property to true.

It is also possible to identify situations where the incoming row data indicates a delete operation that resulted from an update (for example, in a cascade or related column), and an insert from an update. When this mode is enabled, the opcode becomes a two-character value of UD and UI respectively. To enable this option, set the replicator.applier.dbms.distinguishUpdates property to true.

Warning

Changing the default opcode modes may cause replication to fail. The default JavaScript batchloading scripts expect the default I and D notation with updates implied through a delete and insert operation.
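The three opcode modes can be compared side by side with a small model. This Python sketch is illustrative only: the rows_for_update function and its mode labels are hypothetical names chosen to mirror the properties above, and the sample values are invented.

```python
def rows_for_update(key, new_row, mode="default"):
    """Return the (opcode, values) rows written for one source UPDATE.

    mode is a label used only in this sketch:
      "default"            -> delete + insert using the D and I codes
      "useUpdateOpcode"    -> a single U row holding the updated values
      "distinguishUpdates" -> delete + insert tagged as UD and UI
    """
    if mode == "default":
        return [("D", key), ("I", new_row)]
    if mode == "useUpdateOpcode":
        return [("U", new_row)]
    if mode == "distinguishUpdates":
        return [("UD", key), ("UI", new_row)]
    raise ValueError("unknown mode: " + mode)


key = {"id": 655337}
new_row = {"id": 655337, "title": "Dr No", "category": "kat"}
for mode in ("default", "useUpdateOpcode", "distinguishUpdates"):
    print(mode, [opcode for opcode, _ in rows_for_update(key, new_row, mode)])
```

A merge script written for the default mode only looks for I and D, which is why rows tagged U, UD or UI would be ignored or rejected unless the script is adapted.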

5.6.10. Time Zones

Time zones are another headache when using batch loading. For best results applications should standardize on a single time zone, preferably UTC, and use this consistently for all data. To ensure the Java VM outputs time data correctly to CSV files, you must set the JVM time zone to be the same as the standard time zone for your data. Here is the JVM setting in wrapper.conf:

# To ensure consistent handling of dates in heterogeneous and batch replication
# you should set the JVM timezone explicitly.  Otherwise the JVM will default
# to the platform time, which can result in unpredictable behavior when
# applying date values to Targets.  GMT is recommended to avoid inconsistencies.
wrapper.java.additional.5=-Duser.timezone=GMT

Note

Beware that MySQL has two very similar data types: TIMESTAMP and DATETIME. Timestamps are stored in UTC and convert back to local time on display. Datetimes by contrast do not convert back to local time. If you mix timezones and use both data types your time values will be inconsistent on loading.
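The effect of a mismatched time zone can be seen by formatting the same instant twice. The Python sketch below is only an illustration of the problem, not of how the replicator formats values; the UTC-5 offset stands in for a hypothetical platform default time zone.

```python
from datetime import datetime, timedelta, timezone

# One instant in time, as committed on the source.
instant = datetime(2017, 5, 26, 13, 0, 11, tzinfo=timezone.utc)

# Hypothetical platform default zone, five hours behind GMT.
local_zone = timezone(timedelta(hours=-5))

fmt = "%Y-%m-%d %H:%M:%S"
as_gmt = instant.astimezone(timezone.utc).strftime(fmt)
as_local = instant.astimezone(local_zone).strftime(fmt)

print(as_gmt)    # 2017-05-26 13:00:11
print(as_local)  # 2017-05-26 08:00:11
```

Once written to CSV, both strings are plain text with no zone information, so the target cannot tell which of the two was intended.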

5.6.11. Batch Loading into MySQL

Note

All the features discussed in this section are only available from version 6.1.15 of Tungsten Replicator onwards.

There are occasions where batch loading into MySQL may benefit your use case, such as loading large data warehouse environments, or where real-time replication is not as critical.

A number of specific properties are available for MySQL targets. These are discussed below.

5.6.11.1. Configuring as an Offboard Batch Applier

By default, when loading into MySQL using the Batch Applier, the process executes LOAD DATA INFILE statements to load the CSV files into the database.

If the applier is installed on a remote host, this action would typically fail. In that case, enable the following property in the configuration:

property=replicator.applier.dbms.useLoadDataLocalInfile=true

5.6.11.2. Drop Delete Statements

Tungsten Replicator includes a number of useful filters, such as the ability to drop certain DML statements on a schema or table level.

If you wish to drop such statements on a per object basis, then you should continue to use the skipbyevent filter, however if you want to drop ALL DELETE DML, then you can enable the following property:

property=replicator.applier.dbms.skipDeletes=true

Warning

By dropping deletes, you will then subsequently expose yourself to errors should rows be reinserted later with the same Primary or Unique Key values. Typically, this feature would be only enabled when you plan to capture and log key violations. See Section 5.6.11.6, “Log rows violating Primary/Unique Keys” for more information.

5.6.11.3. Configure CHARSET to use on Load

If you wish to specify a different CHARSET to be used when the data is being loaded into the target database, this can be set using the following property, for example:

property=replicator.applier.dbms.loadCharset=utf8mb4

5.6.11.4. Allow DDL Statements to execute

Typically, the batch loader is used for heterogeneous targets, and therefore by default DDL statements will be dropped. However, when applying into MySQL the DDL statements would be valid and can therefore be executed.

To enable this, you should set the following property:

property=replicator.applier.dbms.applyStatements=true

Warning

Any changes to existing tables, or creation of new tables, will only apply to the main base table. You will still need to manually make changes to the relevant staging and error tables (if used).

5.6.11.5. Disable Foreign Keys during load

If your target database uses many foreign keys, batch loading can cause errors because tables may not be loaded in dependency order, so parent/child keys can only be validated once the complete transaction has been loaded.

To prevent this from happening, you can enable the property below which will force the batch loader to temporarily disable foreign key checks until after the full transaction has been loaded.

property=replicator.applier.dbms.disableForeignKeys=true

5.6.11.6. Log rows violating Primary/Unique Keys

To prevent the replicator erroring on primary or unique key violations, you can instruct the replicator to log the offending rows in an error table, which will allow you to manually process the rows afterwards.

This is especially useful when you are dropping DELETE statements from the apply process.

The following properties can be set to enable this:

property=replicator.applier.dbms.useUpdateOpcode=true
property=replicator.applier.dbms.batchLogDuplicateRows=true

By default, this feature will only check against PRIMARY KEYS. If you wish to also check against UNIQUE keys, you will need the additional property:

property=replicator.applier.dbms.fetchKeysFromDatabase=true

By default, the error rows will be logged into tables called error_xxx_origTableName.

These tables must be created in advance, in the same way that you create the staging tables using ddlscan, but supplying the table prefix, for example:

shell> ddlscan -db hr -template ddl-mysql-staging.vm -opt tablePrefix error_xxx_

You can choose a different prefix if you wish, by replacing error_xxx_ with your choice in the above ddlscan statement. If you choose to do this, you will also need to supply the new prefix in your configuration using the following property:

property=replicator.applier.dbms.errorTablePrefix=your-prefix-here_

Warning

If you are loading tens of thousands of rows per transaction, and your target tables are very large, this feature could slow down the apply process, as the applier first needs to ensure that the row being inserted does not violate any keys. The use of this feature should be fully tested in a load test environment and the risks fully understood before using it in production.

5.6.12. Data File Partitioning

By default, the CSV files generated as part of the batchloading process are named according to the schema name, table name, and the starting transaction sequence number that generated the data in the file. For example, the table orders within the schema sales generating the transaction information from sequence numbers 110 through 145 would have the name sales-orders-110.csv.

Because the size of the files can be quite large, and because within different target environments (particularly Hadoop or when uploading to S3) the speed with which the data can be uploaded or organised within the target can be critical, the files can also be partitioned. This splits up the files generated by a chosen value such as the commit time or data value.

The primary solution for partitioning is the DateTime partitioner, which uses a configurable date time value from the internal data structure to act as the basis for the information.

To enable date-based partitioning, you must specify the following properties during your configuration:

replicator.applier.dbms.partitionBy=tungsten_commit_timestamp
replicator.applier.dbms.partitionByClass=com.continuent.tungsten.replicator.applier.batch.DateTimeValuePartitioner
replicator.applier.dbms.partitionByFormat=yyyy-MM-dd-HH

The above sets the use of the tungsten_commit_timestamp field generated by the batchload CSV system as the basis of the value. The format specification is then used to specify the format of the data which will be embedded into the file name. The data formatter uses the Java date format strings, and you can use one or more of the following values:

  • yy

    Year as two digit number

  • yyyy

    Year as four digit number

  • MM

    Month with leading zero

  • dd

    Day with leading zero

  • HH

    Hour in 24 hour format with leading zero

  • mm

    Minute with leading zero

  • ss

    Seconds with leading zero

For example, setting yyyy-MM-dd-HH (the default), the name of the CSV file will be orders-sales-2018-04-03-12-199.csv. Note that the THL sequence number is still embedded in the filename (as the last item), as is the schema and table name.

Files generated will automatically be split by the configured value. Remember, however, that the commit timestamp is consistent for an individual transaction, so the data for a single transaction will never be split across multiple files, even if it takes time for the CSV file to be written. The key is the commit timestamp from the source database for the entire transaction that corresponds to the sequence number.
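The way a commit timestamp maps to a partition value can be sketched as follows. This Python example is an approximation for illustration and is not the DateTimeValuePartitioner implementation: the partition_value function is hypothetical, it handles only the format tokens listed above (with lowercase yy for the two-digit year, as in Java date formats), and it assumes the timestamp layout shown in the CSV examples.

```python
import re
from datetime import datetime

# Java date-format tokens mapped to Python strftime directives.
JAVA_TO_STRFTIME = {
    "yyyy": "%Y", "yy": "%y", "MM": "%m", "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S",
}
TOKEN_PATTERN = re.compile("yyyy|yy|MM|dd|HH|mm|ss")


def partition_value(commit_timestamp, java_format="yyyy-MM-dd-HH"):
    """Format a commit timestamp using a Java-style partition format."""
    fmt = TOKEN_PATTERN.sub(lambda m: JAVA_TO_STRFTIME[m.group(0)], java_format)
    parsed = datetime.strptime(commit_timestamp, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.strftime(fmt)


# Rows committed in the same hour share a partition value.
print(partition_value("2018-04-03 12:05:00.000"))
print(partition_value("2018-04-03 12:59:59.000"))
# A commit in the next hour starts a new one.
print(partition_value("2018-04-03 13:00:00.000"))
```

With the default format, a finer-grained pattern such as one including mm would produce more, smaller files, while a coarser one such as yyyy-MM-dd would produce fewer, larger files.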

Chapter 6. Deployment: Security

  • Authentication between command-line tools (trepctl), and between background services.

  • SSL/TLS between command-line tools and background services.

  • SSL/TLS between Tungsten Replicator and datasources.

  • SSL for all API calls.

  • File permissions and access by all components.

The following graphic provides a visual representation of the various communication channels which may be encrypted.

Figure 6.1. Security Internals: Cluster Communication Channels

For the key to the above diagram, please see Section 9.5.16, “tpm report Command”.

If you are using a single staging directory to handle your complete installation, tpm will automatically create the necessary certificates for you. If you are using an INI based installation, then the installation process will create the certificates for you, however you will need to manually sync them between hosts prior to starting the various components.

It is assumed that your underlying database has SSL enabled and the certificates are available. If you need this level of security and it is not yet enabled, refer to Section 6.10.1, “Enabling Database SSL” for the steps required.

Additionally, if you are configuring heterogeneous replication there will be additional manual steps required to ensure SSL communication to your chosen target database.

Important

Due to a known issue in earlier Java revisions that may cause performance degradation with client connections, it is strongly advised that you ensure your Java version is one of the following MINIMUM releases before enabling SSL:

  • Oracle JRE 8 Build 261
  • OpenJDK 8 Build 222

6.1. Enabling Security

By default, security is enabled for new installations.

Security can be enabled/disabled by adding the disable-security-controls option to the configuration.

If this property is not supplied, or set to false, then security will be enabled. If set to true, then security will be disabled.

Enabling security through this single option, has the same effect as adding:

Important

If you are enabling to-the-database encryption, you must ensure this has been enabled in your database and the relevant certificates are available first. See Section 6.10.1, “Enabling Database SSL” for steps.

Important

Installing from a staging host will automatically generate certificates and configuration for a secured installation. No further changes or actions are required.

For INI-based installations, there are additional steps required to copy the needed certificate files to all of the nodes. Please see Section 6.1.2, “Enabling Security using the INI Method” for details.

6.1.1. Enabling Security using the Staging Method

Security is enabled during initial install by default. If you chose to disable it at install time, these steps will guide you through enabling it as part of a post-install update.

Enabled During Install

As mentioned, security is enabled by default. This is controlled by the --disable-security-controls option. If not supplied, the default is false. You can choose to specify this in your configuration for transparency if you wish.

shell> tools/tpm configure defaults --disable-security-controls=false \
[...the rest of the configuration options...]
shell> tools/tpm install

The above configuration (and the default) will assume that your database has been configured with SSL enabled. The installation will error and fail if this is not the case. You must manually ensure database SSL has been enabled prior to issuing the install. Steps to enable this can be found in Section 6.10.1, “Enabling Database SSL”.

If you DO NOT want to enable database level SSL, then you must also include the following options in the tpm configure command above:

--enable-connector-ssl=false
--datasource-enable-ssl=false

Important

Installing from a staging host will automatically generate certificates and configuration for a secured installation. No further changes or actions are required.

Enabling Post-Installation

If, at install time, you disabled security (by specifying --disable-security-controls=true) you can enable it by changing the value to false.

shell> tools/tpm configure defaults --disable-security-controls=false
shell> tools/tpm update --replace-jgroups-certificate --replace-tls-certificate --replace-release

The above configuration will assume that your database has been configured with SSL enabled. The update will error and fail if this is not the case. You must manually ensure database SSL has been enabled prior to issuing the update. Steps to enable this can be found in Section 6.10.1, “Enabling Database SSL”.

If you DO NOT want to enable database level SSL, then you must also include the following options in the tpm configure command above:

--enable-connector-ssl=false
--datasource-enable-ssl=false

Following the update, you will also need to manually re-sync the certificates and keystores to all other nodes within your configuration. The following example uses scp for the copy and uses db1 as the primary source for the files to be copied. Adjust accordingly for your environment.

  1. Sync Certificates and Keystores to all nodes

    db1> for host in db2 db3 db4 db5 db6; do
    scp /opt/continuent/share/[jpt]* ${host}:/opt/continuent/share
    scp /opt/continuent/share/.[jpt]* ${host}:/opt/continuent/share
    done

  2. Restart all components, on all hosts

    shell> replicator restart

Warning

This update will force replicator processes to be restarted.

6.1.2. Enabling Security using the INI Method

Security is enabled during initial install by default. If you chose to disable it at install time, these steps will guide you through enabling it as part of a post-install update.

Enabled During Install

As mentioned, security is enabled by default. This is controlled by the disable-security-controls property. If not supplied, the default is false. You can choose to specify this in your configuration for transparency if you wish.

disable-security-controls=false

The above configuration (and the default) will assume that your database has been configured with SSL enabled. The installation will error and fail if this is not the case. You must manually ensure database SSL has been enabled prior to issuing the install. Steps to enable this can be found in