Tungsten Clustering

Tungsten Clustering 6.1.1

Build: 130

Release Date: 28 Oct 2019

End of Life Date: 15 Aug 2024

Product End of Life

This release is past End of Life.

Release 6.1.1 contains significant improvements as well as some needed bug fixes.

Improvements, new features and functionality (4)

Manager (4)

Improved the ability of the manager to detect unextracted, desirable binary log events when recovering the old Primary via cctrl after a failover.
The recover command will now fail if:
- any unextracted binlog events exist on the old Primary that we are trying to recover
- the old Primary THL contains more events than the Replicas
In this case, the recover command will display text similar to the following:
```
Recovery failed because the failed Primary has unextracted events in
the binlog. Please run the tungsten_find_orphaned script to inspect
this events. Provided you have a recent backup available, you can
try to restore the data source by issuing the following command:
     datasource {hostname} restore
Please consult the user manual at:
https://docs.continuent.com/tungsten-clustering-6.1/operations-restore.html
```
The tungsten_find_orphaned script is designed to locate orphaned MySQL binary logs that were not extracted into THL before a failover. For more information, please see "The tungsten_find_orphaned Command".
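The two failure conditions listed above can be modelled as a simple predicate. The following is a minimal sketch, not the Manager's actual code: the function name and its parameters are hypothetical, and it assumes that "more events than the Replicas" means more events than the most advanced Replica.

```python
from typing import List


def recover_should_fail(unextracted_binlog_events: int,
                        primary_thl_events: int,
                        replica_thl_events: List[int]) -> bool:
    """Return True if recovery of the old Primary should be refused.

    Hypothetical model of the two documented conditions; not Tungsten code.
    """
    # Condition 1: binlog events exist that were never extracted into THL.
    if unextracted_binlog_events > 0:
        return True
    # Condition 2: the old Primary's THL holds more events than the Replicas.
    # Assumption: compare against the most advanced Replica.
    most_advanced_replica = max(replica_thl_events, default=0)
    return primary_thl_events > most_advanced_replica
```

Under this model, a call such as `recover_should_fail(0, 100, [100, 98])` reports that recovery may go ahead, while any positive count of unextracted events blocks it.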
Issue: CT-996
Improved how the Manager and Replicator behave when MySQL dies on the Primary node.
**This improvement will induce a change of behavior in the product during failover by default, possibly causing a delay in failover as a way to protect data integrity.**
The new default setting for 6.1.1 is:
```
replicator.store.thl.stopOnDBError=false
```
This means that the Manager will wait until the Replicator reads all remaining binlog events on the failing Primary node.
Failover will only continue once:
- all available events are completely read from the binary logs on the Primary node
- all events have reached the Replicas
Warning
The new default means that failover could take longer than it used to.
When property=replicator.store.thl.stopOnDBError=true is set, the Replicator will stop extracting once it is unable to update the trep_commit_seqno table in MySQL, and the Manager will perform the failover without waiting, at the risk of possible data loss due to leaving binlog events behind. All such situations are logged.
For use cases where failover speed is more important than data accuracy, those NOT willing to wait for a long failover can set property=replicator.store.thl.stopOnDBError=true and still use tungsten_find_orphaned to manually analyze the orphaned events and perform the data recovery. For more information, please see "The tungsten_find_orphaned Command".
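The trade-off between the two settings can be summarised as a small decision function. This is only an illustrative sketch of the behavior described above; the function and parameter names are hypothetical and do not correspond to any Manager API.

```python
def failover_may_proceed(stop_on_db_error: bool,
                         binlog_fully_read: bool,
                         replicas_have_all_events: bool) -> bool:
    """Model when the Manager would continue with a failover.

    Hypothetical illustration of the documented behavior; not Tungsten code.
    """
    if stop_on_db_error:
        # stopOnDBError=true: fail over without waiting, accepting that
        # binlog events may be left behind on the failed Primary.
        return True
    # stopOnDBError=false (the 6.1.1 default): wait until every available
    # binlog event has been read and has reached the Replicas.
    return binlog_fully_read and replicas_have_all_events
```

With the default setting, the function only returns True once both waiting conditions hold, which is why failover can take longer than in earlier releases.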
Issue: CT-583
Improved the ability to configure the manager's behavior upon failover.
During a failover, the manager will now wait until the selected Replica has applied all stored THL events before promoting that node to Primary.
This wait time can be configured via the property=manager.failover.thl.apply.wait.timeout=0 property.
The default value is 0, which means "wait indefinitely until all stored THL events are applied".
Any value other than zero risks data loss: once the Replica is promoted to Primary, any unapplied stored events in the THL are ignored and therefore lost.
Whenever a failover occurs, the Replica with the most events stored in the local THL is selected so that when the events are eventually applied, the data is as close to the original Primary as possible with the least number of events missed.
That is usually, but not always, the most up-to-date Replica, which is the one with the most events applied.
There should be a good balance between the value for property=manager.failover.thl.apply.wait.timeout and the value for property=policy-relay-from-slave=900, which is the number of seconds within which a Replica must be current with the Primary in order to qualify as a candidate for failover. The default is 15 minutes (900 seconds).
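The difference between "most events stored" and "most events applied" is easy to miss, so here is a minimal sketch of the selection rule as described above. The `Replica` class, its field names, and `pick_failover_target` are hypothetical, and the latency filter is only modelled on the 900-second threshold mentioned in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    stored_seqno: int       # last event stored in the local THL
    applied_seqno: int      # last event applied to the database
    latency_seconds: float  # how far behind the Primary the Replica is


def pick_failover_target(replicas: List[Replica],
                         max_latency_seconds: float = 900.0) -> Optional[Replica]:
    """Pick the candidate with the most events stored in THL.

    Hypothetical model of the documented rule; not Tungsten code.
    """
    candidates = [r for r in replicas if r.latency_seconds <= max_latency_seconds]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.stored_seqno)


replicas = [
    Replica("db2", stored_seqno=120, applied_seqno=100, latency_seconds=5.0),
    Replica("db3", stored_seqno=110, applied_seqno=110, latency_seconds=1.0),
]
target = pick_failover_target(replicas)
# Events that must still be applied before promotion.
pending = target.stored_seqno - target.applied_seqno
```

In this example db2 is selected even though db3 has applied more events, because db2 holds more events in its THL. The 20 pending events on db2 are the ones that would be lost if a non-zero wait timeout expired before they were applied.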
Issue: CT-1022
A new feature called "Cluster State Savepoints" has been implemented.
This new functionality was created to support clean, consistent rollbacks during aborted switch and failover operations. This functionality works for both physical clusters as well as for composite clusters.
To support this new feature, a new cluster sub-command has been added to the cctrl command: cluster topology validate, which will check and validate a cluster topology and, in the process, will report any issues that it finds. The purpose of this command is to provide a fast way to see, immediately, if there are any issues with any components of a cluster.
Savepoints are created automatically with every switch and failover command. The savepoint is only used if there is an exception during switch or failover that can actually be rolled back.

Important
Not all exceptions during switch and failover will cause a rollback. In particular, if an exception happens during switch or failover AFTER a new primary datasource has been put online (relay or Primary) then the switch or failover operation cannot be rolled back.
The Manager is configured, by default, to hold a maximum of 50 savepoints. When that limit is hit, the Manager resets the current-savepoint-id to 0 and starts to overwrite existing savepoints, starting at 0.
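The wrap-around behavior of the savepoint counter can be pictured with a small model. This is a sketch under stated assumptions, not the Manager's implementation: the `SavepointStore` class and its methods are hypothetical, and savepoint ids are assumed to run from 0 to 49.

```python
from typing import Dict


class SavepointStore:
    """Hypothetical model of a bounded savepoint store; not Tungsten code."""

    def __init__(self, max_savepoints: int = 50) -> None:
        self.max_savepoints = max_savepoints
        self.current_id = 0
        self.slots: Dict[int, str] = {}

    def save(self, cluster_state: str) -> int:
        # When the limit is hit, reset the id to 0 and overwrite from there.
        if self.current_id >= self.max_savepoints:
            self.current_id = 0
        used_id = self.current_id
        self.slots[used_id] = cluster_state
        self.current_id += 1
        return used_id


store = SavepointStore()
ids = [store.save(f"state-{n}") for n in range(51)]
```

After 51 saves in this model, the store still holds 50 entries and the 51st save has overwritten the entry with id 0.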
Issue: CT-951

Bug Fixes (13)

Command-line Tools (8)

Fixed an issue where the command trepctl -all-services status -name watches would fail.
Issue: CT-977
Fixed an issue that would prevent reading remote binary logs when using SSL.
Issue: CT-958
Improved the ability to find needed binaries for commands: tungsten_find_position, tungsten_find_seqno and tungsten_get_rtt
Issue: CT-1054
Restored previously-removed log file symbolic links under \$CONTINUENT_ROOT/service_logs/
Issue: CT-1026
Fixed a bug where tpm diag would generate an empty zip file if the hostnames contain hyphens (-) or periods (.)
Issue: CT-1032
Updated the check_tungsten_services and zabbix_tungsten_services commands to auto-detect active witnesses.
Issue: CT-1043
Updated the check_tungsten.sh command to have the executable bit set.
Issue: CT-1037
Fixed an issue where installing with disable-security-controls=false, or updating using: tools/tpm update --replace-jgroups-certificate --replace-tls-certificate, would generate self-signed security certificates with a 1-year expiration, which would eventually cause installs to break.
This expiration time value is controlled by the tpm command option java-tls-key-lifetime, which is now set to 10 years or 3,650 days by default.
Issue: CT-937

Manager (5)

If the pipeline source replicator goes OFFLINE, the relay will now reconnect to a different Replica.
Issue: CT-871
Improved the ability to find needed binaries, both locally and over SSH, for commands: tungsten_find_orphaned and tungsten_is_recoverable
Issue: CT-1053
Fixed an issue where xtrabackup would time out during a backup via cctrl
Issue: CT-1045
Fixed an issue where the ls resources command run inside of cctrl would fail to list the MANAGER entry on a Replica node.
Issue: CT-599
Fixed an issue where the Manager would show an exception when the MySQL check script did not get expected results.
Issue: CT-912