By default, the CSV files generated as part of the batchloading process are named according to the schema name, table name, and the starting transaction sequence number that generated the data in the file. For example, the table orders within the schema sales, generating transaction information from sequence numbers 110 through 145, would have the name sales-orders-110.csv.
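As an illustration of this convention only (the class and method names below are hypothetical, not the replicator's internal API), the base file name can be thought of as being composed like this:

public class BatchFileNaming {
    // Compose schema-table-startSeqno.csv as described above.
    static String baseFileName(String schema, String table, long startSeqno) {
        return String.format("%s-%s-%d.csv", schema, table, startSeqno);
    }

    public static void main(String[] args) {
        // Prints "sales-orders-110.csv" for the example above.
        System.out.println(baseFileName("sales", "orders", 110));
    }
}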
Because the files can be quite large, and because in different target environments (particularly Hadoop, or when uploading to S3) the speed with which the data can be uploaded or organised within the target can be critical, the files can also be partitioned. This splits the generated files by a chosen value, such as the commit time or a data value.
The primary solution for partitioning is the DateTime partitioner, which uses a configurable date/time value from the internal data structure as the basis for the partition. To enable date-based partitioning, specify the following properties in your configuration:
replicator.applier.dbms.partitionBy=tungsten_commit_timestamp
replicator.applier.dbms.partitionByClass=com.continuent.tungsten.replicator.applier.batch.DateTimeValuePartitioner
replicator.applier.dbms.partitionByFormat=yyyy-MM-dd-HH
The above sets the use of the tungsten_commit_timestamp
field generated by the batchload CSV system as the basis of the value. The
format specification is then used to define the format of the date value
which will be embedded into the filename. The date formatter uses the Java
date format strings, and you can use one or more of the following values:
yy
Year as a two-digit number
yyyy
Year as a four-digit number
MM
Month with leading zero
dd
Day with leading zero
HH
Hour in 24 hour format with leading zero
mm
Minute with leading zero
ss
Seconds with leading zero
For example, with the format yyyy-MM-dd-HH
(the default), the name of the CSV file will be
orders-sales-2018-04-03-12-199.csv.
Note that the THL sequence number is still embedded in the filename (as
the last item), as are the schema and table names.
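The sketch below shows how such a name could be produced: a commit timestamp is rendered with the yyyy-MM-dd-HH pattern and then combined with the table, schema, and sequence number components. The class and variable names are illustrative only and are not the replicator's internal API.

import java.text.SimpleDateFormat;
import java.util.Date;

public class PartitionedFileName {
    public static void main(String[] args) throws Exception {
        // Parse an example commit timestamp, then render it using the
        // configured partitionByFormat pattern (yyyy-MM-dd-HH).
        Date commitTime = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
                .parse("2018-04-03 12:15:45");
        String partition = new SimpleDateFormat("yyyy-MM-dd-HH").format(commitTime);

        // Combine with the other name components and the THL sequence
        // number, matching the example file name above.
        String fileName = String.format("%s-%s-%s-%d.csv",
                "orders", "sales", partition, 199);
        System.out.println(fileName); // orders-sales-2018-04-03-12-199.csv
    }
}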
Files generated will automatically be split by the configured value. Remember, however, that the commit timestamp is consistent for an individual transaction, so data for a single transaction will never be split across multiple files, even if writing the CSV file takes some time; the key is the commit timestamp from the source database for the entire transaction that corresponds to the sequence number.
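To make that last point concrete, the following hedged sketch (illustrative names only) shows how rows sharing a commit timestamp always resolve to the same partition key, and therefore the same file, regardless of when the rows are written out:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CommitTimePartitioning {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat source = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        SimpleDateFormat partitionFormat = new SimpleDateFormat("yyyy-MM-dd-HH");

        // Three rows written as part of one transaction share the same
        // commit timestamp, so they always produce the same key.
        List<Date> rowCommitTimes = List.of(
                source.parse("2018-04-03 12:15:45"),
                source.parse("2018-04-03 12:15:45"),
                source.parse("2018-04-03 12:15:45"));

        Map<String, Integer> rowsPerPartition = new LinkedHashMap<>();
        for (Date commitTime : rowCommitTimes) {
            rowsPerPartition.merge(partitionFormat.format(commitTime), 1, Integer::sum);
        }
        // All three rows fall under a single partition: {2018-04-03-12=3}
        System.out.println(rowsPerPartition);
    }
}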