Heterogeneous replication works slightly differently from native MySQL to MySQL replication. This is because SQL statements, both Data Manipulation Language (DML) and Data Definition Language (DDL), cannot be executed on the target system exactly as they were extracted from the MySQL database. The SQL dialects differ: a statement written for MySQL is not the same as the equivalent statement on Oracle, and those differences mean that the statement would either fail or perform an incorrect operation.
On targets that do not support SQL of any kind, such as MongoDB, replicating SQL statements would achieve nothing since they cannot be executed at all.
All heterogeneous replication deployments therefore use row-based replication. This extracts only the raw row data, not the statement information. Because it is only row data, it can easily be re-assembled or constructed into another format, including statements in other SQL dialects, alternative formats such as JSON or BSON for native appliers, or external CSV files that enable the data to be loaded in bulk batches into a variety of different targets.
Replication into targets where the JDBC driver can be used, such as Oracle and Postgres, works as follows:
Data is extracted from the source MySQL database:
The MySQL server is configured to write transactions into the MySQL binary log using row-based logging. This generates information in the log in the form of the individual updated rows, rather than the statement that was used to perform the update. For example, instead of recording the statement:
mysql> INSERT INTO MSG VALUES (1,'Hello World');
The information is stored as a row entry against the updated table:
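For example, assuming the MSG table from the statement above has two columns, ID and MSG (the column names here are illustrative), the log records the column values of the affected row rather than the statement text:

    ID:  1
    MSG: 'Hello World'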
The information is written into the THL as row-based events, with the event type (insert, update or delete) appended to the metadata of the THL event.
It is the raw row data that is stored in the THL. Because the row data, not the SQL statement, has been recorded, the differences in SQL dialects between the source and target do not need to be taken into account. In fact, Data Definition Language (DDL) and other SQL statements are deliberately ignored so that replication does not break.
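The stored events can be examined with the thl command. The abbreviated output below is only an illustrative sketch of such a row-based event; the exact fields and layout vary between releases:

    shell> thl list -seqno 42
    SEQ# = 42 / FRAG# = 0 (last frag)
    - METADATA = [mysql_server_id=1;dbms_type=mysql;service=alpha;shard=test]
    - SQL(0) =
     - ACTION = INSERT
     - SCHEMA = test
     - TABLE = msg
     - ROW# = 0
      - COL(1: ) = 1
      - COL(2: ) = Hello World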
The row-based transactions stored in the THL are transferred from the Extractor to the Applier.
On the Applier side, the row-based event data is wrapped into a suitable SQL statement for the target database environment. Because the raw row data is available, it can be constructed into any suitable statement appropriate for the target database.
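As a sketch (not the exact statements generated by the applier, and with column names assumed from the example above), the row values allow fully explicit statements to be built in the target dialect:

    -- An insert event is rebuilt from the column values
    INSERT INTO MSG (ID, MSG) VALUES (1, 'Hello World');
    -- An update event carries both the key and the new values
    UPDATE MSG SET MSG = 'Hello World v2' WHERE ID = 1;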
For heterogeneous replication where data is written into a target database using a native applier, such as MongoDB, the row-based information is written into the database using the native API. With MongoDB, for example, data is reformatted into BSON and then applied into MongoDB using the native insert/update/delete API calls.
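For illustration, using the same example row and assumed field names, the document written into a MongoDB collection named after the table would look roughly like this (the exact typing and metadata handling depend on the applier configuration):

    { "ID" : 1, "MSG" : "Hello World" }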
For batch appliers, such as Hadoop, Vertica and Redshift, the row data is converted into CSV files in batches. The format of the CSV file includes both the original row data for all the columns of each table, and metadata on each line that contains the unique SEQNO and the operation type (insert, delete or update). A modified form of the CSV is used in some cases where the operation type is only an insert or delete, with updates being translated into a delete followed by an insert of the updated information.
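An illustrative CSV line for the example row might therefore look like the following, with the operation type, SEQNO and row number followed by the row's column values (the exact metadata columns and their order depend on the applier configuration):

    "I","42","1","1","Hello World"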
These temporary CSV files are then loaded into the native environment as part of the replicator using a custom script that employs that database's own CSV import tools. The raw CSV data is loaded into a staging table that contains the per-row metadata and the row data itself.
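A simplified sketch of such a staging table for the example MSG table is shown below; the tungsten_-prefixed metadata columns and the stage_xxx_ table prefix follow the usual naming convention, but the actual definition is generated automatically and varies by target:

    CREATE TABLE stage_xxx_msg (
      tungsten_opcode  CHAR(2),
      tungsten_seqno   INT,
      tungsten_row_id  INT,
      id               INT,
      msg              VARCHAR(80)
    );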
Depending on the batch environment, the loading of the data into the final destination tables is performed either within the same script, or by using a separate script. Both methods work in the same basic fashion: the base table is updated using the data from the staging table, with each row marked for deletion being deleted, and the latest row (determined by the highest SEQNO for each primary key) then being inserted.
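A simplified sketch of that merge, assuming the staging table layout shown earlier (the real, database-specific scripts handle more cases and use the native bulk facilities):

    -- Remove rows that were changed or deleted in this batch
    DELETE FROM msg WHERE id IN (SELECT id FROM stage_xxx_msg);
    -- Re-insert the latest surviving version of each row
    INSERT INTO msg (id, msg)
      SELECT id, msg FROM stage_xxx_msg s
      WHERE s.tungsten_opcode = 'I'
        AND s.tungsten_seqno = (SELECT MAX(tungsten_seqno)
                                  FROM stage_xxx_msg
                                 WHERE id = s.id);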
Because heterogeneous replication does not replicate SQL statements, including DDL statements that would normally define and generate the table structures, a different method must be used.
Tungsten Replicator includes a tool called ddlscan which can read the schema definition from MySQL and translate that into the schema definition required on the target database. During the process, differences in supported sizes and datatypes are identified and either modified to a suitable value, or highlighted as a definition that must be changed in the generated DDL.
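For example, a typical invocation reads the source schema over JDBC and renders it through a target-specific template; the connection details and template name shown here are illustrative only:

    shell> ddlscan -user tungsten -pass secret \
        -url jdbc:mysql:thin://sourcehost:3306/test \
        -db test -template ddl-mysql-oracle.vm > test.ddl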
Once this modified form of the DDL has been completed, it can then be executed against the target database to generate the DDL required for Tungsten Replicator to apply data. The same basic method is used in batch loading environments where a staging table is required, with the additional staging columns added to the DDL automatically.
For MongoDB or Kafka, where no explicit DDL needs to be generated, the use of ddlscan is not required.