7.2.6. Character Sets

Character sets are a headache in batch loading because all updates are written to, and then read back from, CSV files. An encoding mismatch at either step can result in invalid transactions along the replication path, and such problems are very difficult to debug. Here are some tips to improve the chances of trouble-free replication.

  • Use UTF8 character sets consistently for all string and text data.

  • Force Tungsten to convert string data to Unicode rather than transferring strings as raw bytes:

    shell> tpm ... --mysql-use-bytes-for-string=false

  • When starting the replicator for MySQL replication, include the following tpm option:

    shell> tpm ... --java-file-encoding=UTF8
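
The reason consistent encodings matter can be shown with a small, self-contained sketch. It is not Tungsten code: the helper csv_round_trip is a hypothetical name introduced only for illustration, and Python's standard csv module stands in for the batch loader's CSV stage. The sketch writes a value to a CSV file in one encoding and reads it back in another, which is roughly what happens when the writer and the reader of a batch file disagree about the character set.

```python
import csv
import os
import tempfile


def csv_round_trip(value: str, write_encoding: str, read_encoding: str) -> str:
    """Write one value to a temporary CSV file using write_encoding,
    then read it back using read_encoding (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", encoding=write_encoding, newline="") as f:
            csv.writer(f).writerow([value])
        with open(path, "r", encoding=read_encoding, newline="") as f:
            return next(csv.reader(f))[0]
    finally:
        os.remove(path)


original = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# Writer and reader agree on UTF-8: the value survives unchanged.
same = csv_round_trip(original, "utf-8", "utf-8")

# Writer uses UTF-8 but reader assumes Latin-1: the two UTF-8 bytes of the
# accented letter are decoded as two separate characters.
mixed = csv_round_trip(original, "utf-8", "latin-1")

print(same == original)
print(mixed == original)
print(len(original), len(mixed))
```

With matching encodings the value round-trips intact; with mismatched encodings the string comes back altered and one character longer, without any error being raised. That silent corruption is why such problems are hard to trace once they reach the target database.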