Basic monitoring of a parallel deployment can be performed using the techniques in Chapter 8, Operations Guide. Specific operations for parallel replication are provided in the following sections.
The replicator has several helpful commands for tracking replication performance:
|Command|Description|
|---|---|
|trepctl status|Shows basic variables including overall latency of slave and number of apply channels|
|trepctl status -name shards|Shows the number of transactions for each shard|
|trepctl status -name stores|Shows the configuration and internal counters for stores between tasks|
|trepctl status -name tasks|Shows the number of transactions (events) and latency for each independent task in the replicator pipeline|
The trepctl status appliedLastSeqno parameter shows the sequence number of the last transaction committed. Here is an example from a slave with 5 channels enabled.
shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000211:0000000020094456;0
appliedLastSeqno       : 78021
appliedLatency         : 0.216
channels               : 5
...
Finished status command...
When parallel apply is enabled, the meaning of
appliedLastSeqno changes. It is the minimum
recovery position across apply channels, which means it is the position
where channels restart in the event of a failure. This number is quite
conservative and may make replication appear to be further behind than
it actually is.
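The effect of reporting the minimum position can be illustrated with a small sketch. The per-channel sequence numbers below are hypothetical values chosen for illustration, not output from a real replicator, and the function is not part of Tungsten.

```python
# Hypothetical last-committed sequence numbers for 5 apply channels.
channel_seqnos = {0: 78021, 1: 78035, 2: 78029, 3: 78040, 4: 78033}


def restart_position(positions):
    """Return the minimum committed seqno across all channels.

    This mirrors the idea that the reported position is the point from
    which every channel can safely restart after a failure.
    """
    return min(positions.values())


reported = restart_position(channel_seqnos)
furthest = max(channel_seqnos.values())

# The reported value trails the most advanced channel, which is why
# replication can look further behind than it really is.
print(f"reported position: {reported}")
print(f"most advanced channel is ahead by: {furthest - reported}")
```

In this sketch the reported position is governed entirely by the slowest channel, even though the other channels have committed later transactions.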
Busy channels mark their position in table
trep_commit_seqno as they
commit. These positions are up-to-date with the traffic on each channel, but
latency can differ between channels that carry many large
transactions and those that are more lightly loaded.
Inactive channels do not get any transactions, hence do not mark
their position. Tungsten sends a control event across all channels
so that they mark their commit position in
trep_commit_seqno. In lightly loaded
systems the reported position can lag the true state of the slave
by many seconds or even minutes because idle channels have not yet
marked their position.
For systems with few transactions it is useful to lower the synchronization interval to a smaller number of transactions, for example 500. The following command shows how to adjust the synchronization interval after installation:
shell> tpm update alpha \
    --property=replicator.store.parallel-queue.syncInterval=500
Note that there is a trade-off between the synchronization interval
value and writes on the DBMS server. With the foregoing setting, all
channels will write to the
trep_commit_seqno table every 500
transactions. If there were 50 channels configured, this could lead to
an increase in writes of up to 10%: 50 position updates for every 500
transactions amounts to one extra write per 10 transactions. In
busy systems it is therefore better to use a higher synchronization
interval.
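The write overhead can be worked through as a short calculation. This is a back-of-the-envelope sketch that assumes every channel writes its position exactly once per synchronization interval; the function name is made up for illustration.

```python
def extra_write_fraction(channels, sync_interval):
    """Upper bound on extra position writes per replicated transaction.

    Assumes each channel marks its position once every `sync_interval`
    transactions, so `channels` extra writes occur per interval.
    """
    return channels / sync_interval


# Compare the lowered interval (500) with the value 10000.
for interval in (500, 10000):
    fraction = extra_write_fraction(50, interval)
    print(f"syncInterval={interval}: up to {fraction:.1%} extra writes")
```

With 50 channels, an interval of 500 gives the 10% upper bound, while an interval of 10000 reduces it to 0.5%.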
You can check the current synchronization interval by running the trepctl status -name stores command, as shown in the following example:
shell> trepctl status -name stores
Processing status command (stores)...
...
NAME                      VALUE
----                      -----
...
name                    : parallel-queue
...
storeClass              : com.continuent.tungsten.replicator.thl.THLParallelQueue
syncInterval            : 10000
Finished status command (stores)...
You can also force all channels to mark their current position by sending a heartbeat with the trepctl heartbeat command.
Relative latency is a trepctl status parameter. It indicates the time elapsed since appliedLastSeqno last advanced; for example:
shell> trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000211:0000000020094766;0
appliedLastSeqno       : 78022
appliedLatency         : 0.571
...
relativeLatency        : 8.944
Finished status command...
In this example the last transaction was applied 0.571 seconds after it committed on the master, and that apply took place 8.944 seconds ago. If relative latency increases significantly in a busy system, it may be a sign that replication is stalled. This is a good parameter to check in monitoring scripts.
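A monitoring check along these lines might look like the following sketch. It parses status text in the NAME : VALUE layout used in this section. The 60-second threshold and the function names are assumptions for illustration, and the sample text is hard-coded so that the example runs without a replicator.

```python
SAMPLE_STATUS = """\
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000211:0000000020094766;0
appliedLastSeqno       : 78022
appliedLatency         : 0.571
relativeLatency        : 8.944
Finished status command...
"""


def parse_status(text):
    """Turn 'name : value' lines into a dict, skipping everything else."""
    values = {}
    for line in text.splitlines():
        # Split on the first colon only; values may contain colons themselves.
        name, sep, value = line.partition(":")
        if sep:
            values[name.strip()] = value.strip()
    return values


def is_stalled(status, max_relative_latency=60.0):
    """Flag a possible stall when relativeLatency exceeds the threshold.

    The default threshold is a hypothetical value; tune it per system.
    """
    return float(status["relativeLatency"]) > max_relative_latency


status = parse_status(SAMPLE_STATUS)
print("relativeLatency:", status["relativeLatency"])
print("possible stall:", is_stalled(status))
```

In a real script the text would come from running trepctl status, for example through the standard subprocess module, rather than from a constant.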
Serialization count refers to the number of transactions that the replicator has handled that cannot be applied in parallel because they involve dependencies across shards. For example, a transaction that spans multiple shards must serialize because it might cause an out-of-order update with respect to transactions that update a single shard only.
You can detect the number of transactions that have been serialized by
looking at the
serializationCount parameter using
the trepctl status -name stores command. The
following example shows a replicator that has processed 1512
transactions with 26 serialized.
shell> trepctl status -name stores
Processing status command (stores)...
...
NAME                      VALUE
----                      -----
criticalPartition       : -1
discardCount            : 0
estimatedOfflineInterval: 0.0
eventCount              : 1512
headSeqno               : 78022
maxOfflineInterval      : 5
maxSize                 : 10
name                    : parallel-queue
queues                  : 5
serializationCount      : 26
serialized              : false
...
Finished status command (stores)...
In this case 1.7% of transactions are serialized (26 of 1512). Generally speaking you will lose the benefits of parallel apply if more than 1-2% of transactions are serialized.
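The percentage can be derived directly from the eventCount and serializationCount values in the stores output. The helper below is a minimal sketch; its name is made up for illustration.

```python
def serialized_percent(event_count, serialization_count):
    """Percentage of processed transactions that had to be serialized."""
    if event_count == 0:
        return 0.0  # nothing processed yet, so nothing serialized
    return 100.0 * serialization_count / event_count


# Values taken from the stores output shown above.
pct = serialized_percent(event_count=1512, serialization_count=26)
print(f"{pct:.1f}% of transactions serialized")
```

A monitoring script could compare this value against the 1-2% guideline to decide whether the workload is benefiting from parallel apply.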
The maximum offline interval (maxOfflineInterval)
parameter controls the "distance" between the fastest and slowest
channels when parallel apply is enabled. The replicator measures
distance using the seconds between commit times of the last transaction
processed on each channel. This time is roughly equivalent to the amount
of time a replicator will require to go offline cleanly.
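The distance measurement can be pictured with the following sketch. It is an illustration of the idea only, not the replicator's actual implementation, and the commit times are hypothetical.

```python
from datetime import datetime

# Hypothetical commit times of the last transaction processed per channel.
last_commit = {
    0: datetime(2013, 1, 17, 22, 29, 42),
    1: datetime(2013, 1, 17, 22, 29, 40),
    2: datetime(2013, 1, 17, 22, 29, 37),
}


def channel_distance_seconds(commit_times):
    """Seconds between the most and least advanced channels."""
    times = commit_times.values()
    return (max(times) - min(times)).total_seconds()


distance = channel_distance_seconds(last_commit)
max_offline_interval = 5  # seconds, as in the stores output shown earlier

print(f"distance between channels: {distance} s")
print("within limit:", distance <= max_offline_interval)
```

Here the fastest channel is 5 seconds ahead of the slowest, which is right at the configured limit.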
You can change the
maxOfflineInterval as shown in
the following example. The value is defined in seconds.
shell> tpm update alpha --property=replicator.store.parallel-queue.maxOfflineInterval=15
You can view the configured value as well as the estimated current value using the trepctl status -name stores command, as shown in the following example:
shell> trepctl status -name stores
Processing status command (stores)...
NAME                      VALUE
----                      -----
...
estimatedOfflineInterval: 1.3
...
maxOfflineInterval      : 15
...
Finished status command (stores)...
Parallel apply works best when transactions are distributed evenly across shards and those shards are distributed evenly across available channels. You can monitor the distribution of transactions over shards using the trepctl status -name shards command. This command lists transaction counts for all shards, as shown in the following example.
shell> trepctl status -name shards
Processing status command (shards)...
...
NAME                VALUE
----                -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno  : 78023
appliedLatency    : 0.255
eventCount        : 3523
shardId           : cust1
stage             : q-to-dbms
...
Finished status command (shards)...
If one or more shards have a very large
eventCount value compared to the others, this is
a sign that your transaction workload is poorly distributed across
shards.
The listing of shards also offers a useful trick for finding serialized
transactions. Transactions that Tungsten Replicator cannot safely parallelize
are assigned the dummy shard ID
#UNKNOWN. Look for this shard to
find the count of serialized transactions. The
appliedLastSeqno for this shard gives the
sequence number of the most recent serialized transaction. As the
following example shows, you can then list the contents of the
transaction to see why it serialized. In this case, the transaction
affected tables in different schemas.
shell> trepctl status -name shards
Processing status command (shards)...
NAME                VALUE
----                -----
appliedLastEventId: mysql-bin.000211:0000000020095529;0
appliedLastSeqno  : 78026
appliedLatency    : 0.558
eventCount        : 26
shardId           : #UNKNOWN
stage             : q-to-dbms
...
Finished status command (shards)...
shell> thl list -seqno 78026
SEQ# = 78026 / FRAG# = 0 (last frag)
- TIME = 2013-01-17 22:29:42.0
- EPOCH# = 1
- EVENTID = mysql-bin.000211:0000000020095529;0
- SOURCEID = logos1
- METADATA = [mysql_server_id=1;service=percona;shard=#UNKNOWN]
- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
  foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
  collation_connection = 8, collation_server = 33]
- SCHEMA =
- SQL(0) = insert into mats_0.foo values(1) /* ___SERVICE___ = [percona] */
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
  foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
  collation_connection = 8, collation_server = 33]
- SQL(1) = insert into mats_1.foo values(1)
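When reviewing a listing like the one above, a rough heuristic can help spot statements that touch more than one schema. The sketch below uses a simple regular expression on schema-qualified table names. It is an illustration only: it does not reproduce how the replicator assigns shards, and it will miss unqualified table names and more complex SQL.

```python
import re

# Illustrative pattern for "schema.table" after common SQL keywords.
QUALIFIED_TABLE = re.compile(r"\b(?:into|update|from)\s+(\w+)\.\w+", re.IGNORECASE)


def schemas_touched(statements):
    """Collect the schema names referenced by schema-qualified tables."""
    schemas = set()
    for sql in statements:
        schemas.update(QUALIFIED_TABLE.findall(sql))
    return schemas


# The two statements from the transaction listed above.
statements = [
    "insert into mats_0.foo values(1)",
    "insert into mats_1.foo values(1)",
]

touched = schemas_touched(statements)
print("schemas:", sorted(touched))
print("spans multiple schemas:", len(touched) > 1)
```

For the transaction with sequence number 78026, the heuristic reports two schemas, mats_0 and mats_1, which matches the reason the transaction serialized.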
The replicator normally distributes shards evenly across channels. As
each new shard appears, it is assigned to the next channel number, which
then rotates back to 0 once the maximum number has been assigned. If the
shards have uneven transaction distributions, this may lead to an uneven
number of transactions on the channels. To check, use the
trepctl status -name tasks command and look for tasks
belonging to the q-to-dbms stage.
shell> trepctl status -name tasks
Processing status command (tasks)...
...
NAME                VALUE
----                -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno  : 78023
appliedLatency    : 0.248
applyTime         : 0.003
averageBlockSize  : 2.520
cancelled         : false
currentLastEventId: mysql-bin.000211:0000000020095076;0
currentLastFragno : 0
currentLastSeqno  : 78023
eventCount        : 5302
extractTime       : 274.907
filterTime        : 0.0
otherTime         : 0.0
stage             : q-to-dbms
state             : extract
taskId            : 0
...
Finished status command (tasks)...
If you see one or more channels that have a very high
eventCount, consider either assigning shards
explicitly to channels or redistributing the workload in your
application to get better performance.
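The round-robin assignment described above, and the way it can concentrate load on one channel, can be sketched as follows. The shard names and event counts are hypothetical, and the code is a simplified model rather than the replicator's own partitioning logic.

```python
from collections import defaultdict


def assign_round_robin(shards, channels):
    """Assign each shard, in order of first appearance, to the next channel.

    The channel number wraps back to 0 after the last channel is used.
    """
    return {shard: index % channels for index, shard in enumerate(shards)}


# Hypothetical per-shard event counts, listed in order of first appearance.
shard_events = {
    "cust1": 3523,
    "cust2": 410,
    "cust3": 388,
    "cust4": 1779,
    "cust5": 402,
    "cust6": 395,
}

assignment = assign_round_robin(shard_events, channels=3)

# Sum the events that land on each channel.
channel_events = defaultdict(int)
for shard, count in shard_events.items():
    channel_events[assignment[shard]] += count

for channel in sorted(channel_events):
    print(f"channel {channel}: {channel_events[channel]} events")
```

Although the six shards are spread evenly over three channels, the two busiest shards both land on channel 0, which ends up with several times the events of the other channels. This is the kind of imbalance that explicit shard assignment or workload redistribution is meant to correct.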