A.1. Tungsten Clustering 5.4.1 GA (28 October 2019)

Version End of Life: Not Yet Set

Release 5.4.1 contains both significant improvements and some needed bug fixes.

Improvements, new features and functionality

  • Tungsten Manager

    • Improved how the Manager and Replicator behave when MySQL dies on the master node.

      This improvement changes the default behavior of the product during failover: failover may be delayed in order to protect data integrity.

      The new default setting in this release is:

      replicator.store.thl.stopOnDBError=false

      This means that the Manager will wait until the Replicator reads all remaining binlog events on the failing master node.

      Failover will only continue once:

      • all available events are completely read from the binary logs on the master node

      • all events have reached the slaves

      WARNING:

      The new default means that failover could take longer than it used to.

      When replicator.store.thl.stopOnDBError=true, the Replicator stops extracting as soon as it is unable to update the trep_commit_seqno table in MySQL, and the Manager performs the failover without waiting, at the risk of data loss from binlog events left behind. All such situations are logged.

      For use cases where failover speed is more important than data accuracy, those unwilling to wait for a long failover can set replicator.store.thl.stopOnDBError=true and then use tungsten_find_orphaned to manually analyze and recover the data. For more information, please see Section 8.24, “The tungsten_find_orphaned Command”.
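
      The trade-off between the two settings can be summarized as a simple gate. The following is a toy model only, not the Manager's actual code; the function and parameter names are hypothetical and exist purely for illustration.

```python
def failover_may_proceed(stop_on_db_error: bool,
                         binlog_fully_read: bool,
                         slaves_have_all_events: bool) -> bool:
    """Toy model of the failover gate described above (hypothetical names)."""
    if stop_on_db_error:
        # stopOnDBError=true: fail over without waiting; binlog events that
        # were never extracted may be left behind on the old master.
        return True
    # stopOnDBError=false (new default): wait until every available binlog
    # event has been read AND those events have reached the slaves.
    return binlog_fully_read and slaves_have_all_events
```

      With the default setting, failover_may_proceed(False, True, False) is False, which is the situation in which failover is delayed.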

      Issues: CT-583

    • Improved the ability of the manager to detect unextracted, desirable binary log events when recovering the old master via cctrl after a failover.

      The cctrl recover command will now fail if:

      • any unextracted binlog events exist on the old master that we are trying to recover

      • the old master THL contains more events than the slaves

      In this case, the cctrl recover command will display text similar to the following:

      Recovery failed because the failed master has unextracted events in
      the binlog. Please run the tungsten_find_orphaned script to inspect
      this events. Provided you have a recent backup available, you can
      try to restore the data source by issuing the following command:
                     datasource {hostname} restore
      Please consult the user manual at:
      https://docs.continuent.com/tungsten-clustering-6.1/operations-restore.html

      The tungsten_find_orphaned script is designed to locate orphaned MySQL binary logs that were not extracted into THL before a failover. For more information, please see Section 8.24, “The tungsten_find_orphaned Command”.
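
      The two failure conditions for cctrl recover can be sketched as a single check. This is an illustrative model under stated assumptions, not the product's implementation: the function name and parameters are hypothetical, and "contains more events" is modeled here as the old master's last THL sequence number being higher than that of every slave.

```python
from typing import Sequence


def recover_is_blocked(unextracted_binlog_events: int,
                       master_thl_last_seqno: int,
                       slave_thl_last_seqnos: Sequence[int]) -> bool:
    """Return True when the modeled recover would be refused.

    Expects at least one slave sequence number (hypothetical model).
    """
    # Condition 1: binlog events exist that were never extracted into THL.
    if unextracted_binlog_events > 0:
        return True
    # Condition 2: the old master's THL is ahead of every slave's THL.
    return master_thl_last_seqno > max(slave_thl_last_seqnos)
```

      In this model, recover_is_blocked(0, 500, [500, 498]) is False, while one leftover binlog event or a master THL at sequence number 501 would block the recovery.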

      Issues: CT-996

    • Improved the ability to configure the manager's behavior upon failover.

      During a failover, the manager will now wait until the selected slave has applied all stored THL events before promoting that node to master.

      This wait time can be configured via the manager.failover.thl.apply.wait.timeout=0 property.

      The default value is 0, which means "wait indefinitely until all stored THL events are applied".

      Any value other than zero risks data loss: once the slave is promoted to master, any stored but unapplied events in its THL are ignored, and therefore lost.

      Whenever a failover occurs, the slave with the most events stored in its local THL is selected, so that once those events are applied, the data is as close to the original master as possible, with the fewest events missed.

      That is usually, but not always, the most up-to-date slave, which is the one with the most events applied.

      There should be a good balance between the value for manager.failover.thl.apply.wait.timeout and the value for policy.slave.promotion.latency.threshold=900, which is the number of seconds within which a slave must be current with the master in order to qualify as a candidate for failover. The default is 900 seconds (15 minutes).
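
      The selection rule described above can be illustrated with a short sketch. It is a simplified model, not the Manager's code: the Slave class, its fields, the host names, and the "latency at or below the threshold" comparison are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    name: str
    stored_seqno: int       # last event written to the local THL
    applied_seqno: int      # last event applied to the database
    latency_seconds: float  # how far behind the master the slave is


def pick_failover_candidate(slaves: List[Slave],
                            latency_threshold: float = 900.0) -> Optional[Slave]:
    # Only slaves within the promotion latency threshold qualify.
    eligible = [s for s in slaves if s.latency_seconds <= latency_threshold]
    if not eligible:
        return None
    # Prefer the slave with the most events STORED, not the most applied.
    return max(eligible, key=lambda s: s.stored_seqno)


def unapplied_events(slave: Slave) -> int:
    # Events that would be lost if the slave were promoted before applying them.
    return slave.stored_seqno - slave.applied_seqno


slaves = [
    Slave("db2", stored_seqno=1000, applied_seqno=940, latency_seconds=30.0),
    Slave("db3", stored_seqno=990, applied_seqno=985, latency_seconds=5.0),
    Slave("db4", stored_seqno=1200, applied_seqno=100, latency_seconds=1800.0),
]
candidate = pick_failover_candidate(slaves)
print(candidate.name, unapplied_events(candidate))  # db2 60
```

      Here db4 is excluded by the latency threshold, and db2 is chosen over db3 even though db3 has applied more events. The 60 stored but unapplied events on db2 are what a non-zero apply wait timeout could lose if it expired before they were applied.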

      Issues: CT-1022

Bug Fixes

  • Command-line Tools

    • Installing with disable-security-controls=false, or updating with tools/tpm update --replace-jgroups-certificate --replace-tls-certificate, would generate self-signed security certificates with a 1-year expiration, which would eventually cause installations to break.

      This expiration time is controlled by the tpm command option --java-tls-key-lifetime, which now defaults to 10 years (3,650 days).
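
      As a quick check of what the longer lifetime means in calendar terms, the sketch below computes expiry dates for the old and new lifetimes. It assumes the lifetime is counted in days, as the figure of 3,650 days suggests; the issue date and the function name are hypothetical.

```python
from datetime import date, timedelta

LIFETIME_DAYS = 10 * 365  # 3,650 days


def expiry_date(issued: date, lifetime_days: int = LIFETIME_DAYS) -> date:
    # A certificate issued on `issued` expires `lifetime_days` later.
    return issued + timedelta(days=lifetime_days)


issued = date(2019, 10, 28)      # hypothetical issue date
print(expiry_date(issued))       # 2029-10-25
print(expiry_date(issued, 365))  # 2020-10-27
```

      Because of leap days, 3,650 days falls a few days short of ten calendar years, which is worth keeping in mind when scheduling certificate renewal.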

      Issues: CT-937

    • Updated the check_tungsten.sh command to have the executable bit set.

      Issues: CT-1037

    • Updated the check_tungsten_services and zabbix_tungsten_services commands to auto-detect active witnesses.

      Issues: CT-1043

  • Tungsten Manager

    • Fixed an issue where the Manager would show an exception when the MySQL check script did not get expected results.

      Issues: CT-912

    • Fixed an issue where xtrabackup would time out during a backup run via cctrl.

      Issues: CT-1045

    • Improved the ability to find needed binaries, both locally and over SSH, for the tungsten_find_orphaned and tungsten_is_recoverable commands.

      Issues: CT-1053