Basic monitoring of a parallel deployment can be performed using the techniques in Chapter 6, Operations Guide. Specific operations for parallel replication are provided in the following sections.
The replicator has several helpful commands for tracking replication performance:
Command | Description
---|---
trepctl status | Shows basic variables including overall latency of the Replica and number of apply channels
trepctl status -name shards | Shows the number of transactions for each shard
trepctl status -name stores | Shows the configuration and internal counters for stores between tasks
trepctl status -name tasks | Shows the number of transactions (events) and latency for each independent task in the replicator pipeline
The appliedLastSeqno parameter in the trepctl status output shows the sequence number of the last transaction committed. Here is an example from a Replica with 5 channels enabled:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000211:0000000020094456;0
appliedLastSeqno : 78021
appliedLatency : 0.216
channels : 5
...
Finished status command...
When parallel apply is enabled, the meaning of appliedLastSeqno changes: it is the minimum recovery position across apply channels, which means it is the position where channels restart in the event of a failure. This number is quite conservative and may make replication appear to be further behind than it actually is.
Busy channels mark their position in the trep_commit_seqno table as they commit. These positions are up-to-date with the traffic on each channel, but latency can develop between channels that carry many large transactions and those that are more lightly loaded.
Inactive channels do not receive any transactions, hence do not mark their position. Tungsten sends a control event across all channels so that they mark their commit position in trep_commit_seqno. On unloaded systems it is therefore possible to see a delay of many seconds or even minutes from the true state of the Replica, because idle channels have not yet marked their position.
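If you want to inspect per-channel commit positions directly, you can query the trep_commit_seqno table from the DBMS. The sketch below assumes a service named alpha, whose tracking schema would normally be tungsten_alpha, and a mysql client that can connect without extra options; adjust both to your deployment:

shell> mysql -e "SELECT task_id, seqno, update_timestamp FROM tungsten_alpha.trep_commit_seqno ORDER BY task_id"

Each row corresponds to one apply channel; a stale update_timestamp on an otherwise healthy Replica usually just means that channel has been idle since the last control event.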
For systems with few transactions it is useful to lower the synchronization interval to a smaller number of transactions, for example 500. The following examples show how to adjust the synchronization interval after installation, using either the Staging or the INI method. For a Staging-method deployment, locate the staging directory and reconfigure the service:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-clustering-7.1.4-10
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-clustering-7.1.4-10
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--property=replicator.store.parallel-queue.syncInterval=500
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 10.3.7, “Configuration Changes from a Staging Directory”.
For an INI-method deployment, add the following line to the [alpha] stanza of the INI file:

[alpha]
...
property=replicator.store.parallel-queue.syncInterval=500
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-clustering-7.1.4-10
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-clustering-7.1.4-10
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 10.4.4, “Configuration Changes with an INI file”.
Note that there is a trade-off between the synchronization interval value and writes on the DBMS server. With the foregoing setting, all channels will write to the trep_commit_seqno table every 500 transactions. If there were 50 channels configured, this could lead to an increase in writes of up to 10%: each channel could end up adding an extra write to mark its position every 10 transactions (50 extra writes per 500 transactions). In busy systems it is therefore better to use a higher synchronization interval.
You can check the current synchronization interval by running the trepctl status -name stores command, as shown in the following example:
shell> trepctl status -name stores
Processing status command (stores)...
...
NAME VALUE
---- -----
...
name : parallel-queue
...
storeClass : com.continuent.tungsten.replicator.thl.THLParallelQueue
syncInterval : 10000
Finished status command (stores)...
You can also force all channels to mark their current position by sending a heartbeat through the replicator with the trepctl heartbeat command.
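For example (the heartbeat must be issued on the Primary so that it travels through the replication stream to the Replicas):

shell> trepctl heartbeat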
Relative latency is a trepctl status parameter. It indicates the latency since the last time the appliedSeqno advanced; for example:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000211:0000000020094766;0
appliedLastSeqno : 78022
appliedLatency : 0.571
...
relativeLatency : 8.944
Finished status command...
In this example the last transaction committed 8.944 seconds ago and had an apply latency of 0.571 seconds from the time it committed on the Primary. If relative latency increases significantly on a busy system, it may be a sign that replication is stalled. This is a good parameter to check in monitoring scripts.
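As an illustration, a minimal monitoring sketch might look like the following; the 60-second threshold and the plain echo alert are assumptions to adapt to your own tooling:

#!/bin/bash
# Sketch: warn when relativeLatency exceeds a threshold (assumed: 60 seconds).
THRESHOLD=60
# Extract the relativeLatency value from the trepctl status output.
LATENCY=$(trepctl status | awk -F: '/relativeLatency/ {print $2 + 0}')
# awk handles the floating-point comparison; exit status 0 means "over threshold".
if awk -v l="$LATENCY" -v t="$THRESHOLD" 'BEGIN {exit !(l > t)}'; then
    echo "WARNING: relativeLatency is ${LATENCY}s (threshold ${THRESHOLD}s); replication may be stalled."
fi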
Serialization count refers to the number of transactions that the replicator has handled that cannot be applied in parallel because they involve dependencies across shards. For example, a transaction that spans multiple shards must serialize because it might cause an out-of-order update with respect to transactions that update a single shard only.
You can detect the number of transactions that have been serialized by looking at the serializationCount parameter using the trepctl status -name stores command. The following example shows a replicator that has processed 1512 transactions with 26 serialized.
shell> trepctl status -name stores
Processing status command (stores)...
...
NAME VALUE
---- -----
criticalPartition : -1
discardCount : 0
estimatedOfflineInterval: 0.0
eventCount : 1512
headSeqno : 78022
maxOfflineInterval : 5
maxSize : 10
name : parallel-queue
queues : 5
serializationCount : 26
serialized : false
...
Finished status command (stores)...
In this case 1.7% of transactions (26 of 1512) are serialized. Generally speaking, you will lose the benefits of parallel apply if more than 1-2% of transactions are serialized.
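To compute this ratio automatically rather than by hand, a small sketch like the following can derive the percentage from the store counters; the parsing assumes the output format shown above:

shell> trepctl status -name stores | awk -F: '
    /serializationCount/ {serialized = $2 + 0}
    /eventCount/         {events = $2 + 0}
    END {if (events > 0) printf "%.1f%% serialized (%d of %d)\n", 100 * serialized / events, serialized, events}'
1.7% serialized (26 of 1512)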
The maximum offline interval (maxOfflineInterval) parameter controls the "distance" between the fastest and slowest channels when parallel apply is enabled. The replicator measures distance using the seconds between commit times of the last transaction processed on each channel. This time is roughly equivalent to the amount of time a replicator will require to go offline cleanly.
You can change maxOfflineInterval as shown in the following examples; the value is defined in seconds. Again, either the Staging or the INI method can be used. For a Staging-method deployment:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-clustering-7.1.4-10
shell> echo The staging USER is `tpm query staging| cut -d: -f1 | cut -d@ -f1`
The staging USER is tungsten
shell> echo The staging HOST is `tpm query staging| cut -d: -f1 | cut -d@ -f2`
The staging HOST is db1
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-clustering-7.1.4-10
shell> ssh {STAGING_USER}@{STAGING_HOST}
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm configure alpha \
--property=replicator.store.parallel-queue.maxOfflineInterval=30
Run the tpm command to update the software with the Staging-based configuration:
shell> ./tools/tpm update
For information about making updates when using a Staging-method deployment, please see Section 10.3.7, “Configuration Changes from a Staging Directory”.
For an INI-method deployment, add the following line to the [alpha] stanza of the INI file:

[alpha]
...
property=replicator.store.parallel-queue.maxOfflineInterval=30
Run the tpm command to update the software with the INI-based configuration:
shell> tpm query staging
tungsten@db1:/opt/continuent/software/tungsten-clustering-7.1.4-10
shell> echo The staging DIRECTORY is `tpm query staging| cut -d: -f2`
The staging DIRECTORY is /opt/continuent/software/tungsten-clustering-7.1.4-10
shell> cd {STAGING_DIRECTORY}
shell> ./tools/tpm update
For information about making updates when using an INI file, please see Section 10.4.4, “Configuration Changes with an INI file”.
You can view the configured value as well as the estimated current value using the trepctl status -name stores command, as shown in the following example:
shell> trepctl status -name stores
Processing status command (stores)...
NAME VALUE
---- -----
...
estimatedOfflineInterval: 1.3
...
maxOfflineInterval : 30
...
Finished status command (stores)...
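If you track these two values in monitoring, a sketch along the following lines warns when the estimate approaches the configured limit; the 80% threshold is an arbitrary assumption:

shell> trepctl status -name stores | awk -F: '
    /estimatedOfflineInterval/ {est = $2 + 0}
    /maxOfflineInterval/       {max = $2 + 0}
    END {if (max > 0 && est > 0.8 * max) printf "WARNING: estimatedOfflineInterval %.1fs is close to maxOfflineInterval %ds\n", est, max}'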
Parallel apply works best when transactions are distributed evenly across shards and those shards are distributed evenly across available channels. You can monitor the distribution of transactions over shards using the trepctl status -name shards command. This command lists transaction counts for all shards, as shown in the following example.
shell> trepctl status -name shards
Processing status command (shards)...
...
NAME VALUE
---- -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno : 78023
appliedLatency : 0.255
eventCount : 3523
shardId : cust1
stage : q-to-dbms
...
Finished status command (shards)...
If one or more shards have a very large eventCount value compared to the others, this is a sign that your transaction workload is poorly distributed across shards.
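One way to spot skew quickly is to list every shard with its event count, sorted in descending order. This sketch assumes the output format shown above, where eventCount precedes shardId within each shard's block:

shell> trepctl status -name shards | awk -F: '
    /eventCount/ {count = $2 + 0}
    /shardId/    {gsub(/ /, "", $2); print $2, count}' | sort -k2 -rn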
The listing of shards also offers a useful trick for finding serialized transactions. Shards that Tungsten Replicator cannot safely parallelize are assigned the dummy shard ID #UNKNOWN. Look for this shard to find the count of serialized transactions. The appliedLastSeqno for this shard gives the sequence number of the most recent serialized transaction. As the following example shows, you can then list the contents of the transaction to see why it serialized. In this case, the transaction affected tables in different schemas.
shell> trepctl status -name shards
Processing status command (shards)...
NAME VALUE
---- -----
appliedLastEventId: mysql-bin.000211:0000000020095529;0
appliedLastSeqno : 78026
appliedLatency : 0.558
eventCount : 26
shardId : #UNKNOWN
stage : q-to-dbms
...
Finished status command (shards)...
shell> thl list -seqno 78026
SEQ# = 78026 / FRAG# = 0 (last frag)
- TIME = 2013-01-17 22:29:42.0
- EPOCH# = 1
- EVENTID = mysql-bin.000211:0000000020095529;0
- SOURCEID = logos1
- METADATA = [mysql_server_id=1;service=percona;shard=#UNKNOWN]
- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
  foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
  collation_connection = 8, collation_server = 33]
- SCHEMA =
- SQL(0) = insert into mats_0.foo values(1) /* ___SERVICE___ = [percona] */
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, »
  foreign_key_checks = 1, unique_checks = 1, sql_mode = '', character_set_client = 8, »
  collation_connection = 8, collation_server = 33]
- SQL(1) = insert into mats_1.foo values(1)
The replicator normally distributes shards evenly across channels. As each new shard appears, it is assigned to the next channel number, which then rotates back to 0 once the maximum number has been assigned. If the shards have uneven transaction distributions, this may lead to an uneven number of transactions on the channels. To check, use the trepctl status -name tasks command and look for tasks belonging to the q-to-dbms stage.
shell> trepctl status -name tasks
Processing status command (tasks)...
...
NAME VALUE
---- -----
appliedLastEventId: mysql-bin.000211:0000000020095076;0
appliedLastSeqno : 78023
appliedLatency : 0.248
applyTime : 0.003
averageBlockSize : 2.520
cancelled : false
currentLastEventId: mysql-bin.000211:0000000020095076;0
currentLastFragno : 0
currentLastSeqno : 78023
eventCount : 5302
extractTime : 274.907
filterTime : 0.0
otherTime : 0.0
stage : q-to-dbms
state : extract
taskId : 0
...
Finished status command (tasks)...
If you see one or more channels that have a very high eventCount, consider either assigning shards explicitly to channels or redistributing the workload in your application to get better performance.
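To summarize the per-task counts without reading the full output, a sketch such as the following can help; it assumes the field layout shown above, where eventCount and stage appear before taskId within each task's block:

shell> trepctl status -name tasks | awk -F: '
    /eventCount/ {count = $2 + 0}
    /^stage/     {gsub(/ /, "", $2); stage = $2}
    /taskId/     {if (stage == "q-to-dbms") print "task " ($2 + 0) ": " count " events"}'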