3.1. How Heterogeneous Replication Works

Heterogeneous replication works slightly differently from native MySQL to MySQL replication. This is because SQL statements, both Data Manipulation Language (DML) and Data Definition Language (DDL), cannot be executed on a target system exactly as they were extracted from the MySQL database. The SQL dialects differ, so an SQL statement written for MySQL is not the same as an SQL statement written for Oracle, and the differences mean that a replicated statement would either fail or perform an incorrect operation.
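
As a small illustration of the dialect problem (the table and column names here are invented for the example), the same logical operation must be phrased differently on each platform:

    -- MySQL: backtick-quoted identifiers and a LIMIT clause
    SELECT * FROM `msg` ORDER BY id DESC LIMIT 1;

    -- Oracle: no LIMIT clause; ROWNUM (or FETCH FIRST on 12c and later)
    SELECT *
      FROM (SELECT * FROM msg ORDER BY id DESC)
     WHERE ROWNUM <= 1;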

On targets that do not support SQL of any kind, such as MongoDB, replicating SQL statements would achieve nothing since they cannot be executed at all.

All heterogeneous replication deployments therefore use row-based replication. This extracts only the raw row data, not the statement information. Because only row data is extracted, it can easily be reassembled or constructed into another format, including statements in other SQL dialects, alternative native formats such as JSON or BSON, or external CSV formats that enable the data to be loaded in bulk batches into a variety of different targets.

MySQL to Oracle, Oracle to MySQL, and Oracle to Oracle Replication

Replication between MySQL and Oracle, in either direction, and Oracle-to-Oracle replication, works as shown in Figure 3.1, “Topologies: Heterogeneous Operation”.

Figure 3.1. Topologies: Heterogeneous Operation

The process works as follows:

  1. Data is extracted from the source database. The exact method depends on whether data is being extracted from MySQL or Oracle.

    • For MySQL:

      The MySQL server is configured to write transactions into the MySQL binary log using row-based logging (a configuration sketch follows this list). This generates information in the log in the form of the individual updated rows, rather than the statement that was used to perform the update. For example, instead of recording the statement:

      mysql> INSERT INTO MSG VALUES (1,'Hello World');

      The information is stored as a row entry against the updated table:

      1 Hello World

      The information is written into the THL as row-based events, with the event type (insert, update, or delete) appended to the metadata of the THL event.

    • For Oracle CDC:

      The Oracle Change Data Capture (CDC) system records the row-level changes made to a table into a change table. Tungsten Replicator reads the change information from the change tables and generates row-based transactions within the THL.

    In both cases, it is the raw row data that is stored in the THL. Because the row data, not the SQL statement, has been recorded, the differences in SQL dialects between the two databases do not need to be taken into account. In fact, Data Definition Language (DDL) and other SQL statements are deliberately ignored so that replication does not break.

  2. The row-based transactions stored in the THL are transferred from the master to the slave.

  3. On the slave (or applier) side, the row-based event data is wrapped in a suitable SQL statement for the target database environment. Because the raw row data is available, it can be constructed into any statement appropriate for the target database, as illustrated in the sketch below.
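
On the extraction side, enabling row-based logging is a standard MySQL server setting. A minimal sketch of the relevant my.cnf entry:

    [mysqld]
    binlog_format = ROW

On the applier side, the row entry shown earlier can be reconstructed as a statement in the target dialect. Assuming the MSG table has columns named ID and MSG (the column names are invented for this example), the row 1 Hello World might be applied to Oracle as:

    INSERT INTO MSG (ID, MSG) VALUES (1, 'Hello World');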

Native Applier Replication (e.g. MongoDB)

For heterogeneous replication where data is written into a target database using a native applier, such as MongoDB, the row-based information is written into the database using the native API. With MongoDB, for example, data is reformatted into BSON and then applied to MongoDB using the native insert/update/delete API calls.
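
For example, the row shown earlier might surface in MongoDB as a document along the following lines (the field names depend on the source table's column definitions; this sketch assumes columns named ID and MSG):

    { "ID" : 1, "MSG" : "Hello World" }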

Batch Loading

For batch appliers, such as Vertica, the row data is converted into CSV files in batches. The format of the CSV file includes both the original row data for all the columns of each table, and metadata on each line that contains the unique SEQNO and the operation type (insert, delete or update). A modified form of the CSV is used in some cases, where the operation type is only an insert or delete, with updates translated into a delete followed by an insert of the updated information.
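
As an illustration, assuming the metadata fields lead each line (the exact field order and quoting depend on the applier configuration), an insert followed by an update of the same row, translated into a delete plus an insert, might appear as:

    "I","42",1,"Hello World"
    "D","43",1,"Hello World"
    "I","43",1,"Hello Again"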

These temporary CSV files are then loaded into the native environment by the replicator using a customized script that employs the specific CSV import tools of that database. The raw CSV data is loaded into a staging table that contains the per-row metadata and the row data itself.
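
For example, in a Vertica deployment the load script might stage such a file with the database's own bulk loader; the table and file names here are invented for the sketch:

    COPY stage_msg FROM '/tmp/staging/msg-42.csv'
         DELIMITER ',' ENCLOSED BY '"' DIRECT;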

Depending on the batch environment, the loading of the data into the final destination tables is performed either within the same script, or by using a separate script. Both methods work in the same basic fashion: the base table is updated using the data from the staging table, with each row marked for deletion being deleted, and the latest row for each primary key (calculated from the highest SEQNO) then being inserted.
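
A minimal sketch of that merge step, assuming a base table MSG with primary key ID and a staging table STAGE_MSG whose metadata columns are named OPCODE and SEQNO (all of these names are illustrative, not the replicator's actual column names):

    -- Remove every base-table row that the staging table touches
    DELETE FROM MSG
     WHERE ID IN (SELECT ID FROM STAGE_MSG);

    -- Re-insert the latest version of each row; rows whose final
    -- operation was a delete are skipped
    INSERT INTO MSG (ID, MSG)
    SELECT s.ID, s.MSG
      FROM STAGE_MSG s
      JOIN (SELECT ID, MAX(SEQNO) AS SEQNO
              FROM STAGE_MSG
             GROUP BY ID) latest
        ON s.ID = latest.ID AND s.SEQNO = latest.SEQNO
     WHERE s.OPCODE = 'I';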

Schema Creation and Replication

Because heterogeneous replication does not replicate SQL statements, including the DDL statements that would normally define and generate the table structures, a different method must be used.

Tungsten Replicator includes a tool called ddlscan, which can read the schema definition from MySQL or Oracle and translate it into the schema definition required on the target database. During the process, differences in supported sizes and datatypes are identified and either modified to a suitable value, or highlighted as a definition that must be changed in the generated DDL.

Once this modified form of the DDL has been completed, it can then be executed against the target database to generate the DDL required for Tungsten Replicator to apply data. The same basic process is used in batch loading environments where a staging table is required, with the additional staging columns added to the DDL automatically.
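
As a sketch of the kind of translation ddlscan performs (the exact datatype mapping depends on the template in use), a MySQL definition such as:

    CREATE TABLE MSG (
        ID INT PRIMARY KEY,
        MSG VARCHAR(100)
    );

might be rendered for Oracle as:

    CREATE TABLE MSG (
        ID NUMBER(10, 0) PRIMARY KEY,
        MSG VARCHAR2(100)
    );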

For MongoDB, where no explicit DDL needs to be generated, the use of ddlscan is not required.