Amazon Redshift is a cloud-based data warehouse service that integrates with other Amazon services, such as S3, to provide an SQL-like interface to the loaded data. Replication for Amazon Redshift moves data from MySQL or Oracle datastores, through S3, and into the Redshift environment in real-time, avoiding the need to manually export and import the data.
Replication to Amazon Redshift operates as follows:
Data is extracted from the source database into THL.
When extracting the data from the THL, the Amazon Redshift replicator writes the data into CSV files according to the name of the source tables. The files contain all of the row-based data, including the global transaction ID generated by Tungsten Clustering during replication, and the operation type (insert, delete, etc) as part of the CSV data.
The generated CSV files are loaded into Amazon S3 using the s3cmd command. This enables easy access to your Amazon S3 installation and simplifies the loading.
The CSV data is loaded from S3 into Redshift staging tables using the
which imports raw CSV into Redshift tables.
SQL statements are then executed within Redshift to perform updates on the live version of the tables, using the CSV, batch loaded, information, deleting old rows, and inserting the new data when performing updates to work effectively within the confines of Amazon Redshift operation.
Setting up replication requires setting up both the master and slave components as two different configurations, one for MySQL (or Oracle) and the other for Amazon Redshift. Replication also requires some additional steps to ensure that the Amazon Redshift host is ready to accept the replicated data that has been extracted. Tungsten Clustering provides all the tools required to perform these operations during the installation and setup.