MySQL 8.0.22: Asynchronous Replication Automatic Connection (IO Thread) Failover

MySQL 8.0.22 was released on Oct 19, 2020, and came with nice features and a lot of bug fixes. Now, you can configure your async replica to choose a new source in case the existing source connection (IO thread) fails. In this blog, I am going to explain the entire process involved in this configuration with a use case.

Overview

This feature is very helpful for keeping your replica server in sync in case the current source fails.

To activate asynchronous connection failover, we need to set “SOURCE_CONNECTION_AUTO_FAILOVER=1” in the “CHANGE MASTER TO” statement.

Once the IO connection fails, the replica first tries to reconnect to the existing source based on the “MASTER_RETRY_COUNT” and “MASTER_CONNECT_RETRY” settings. Only after those attempts are exhausted will it perform the failover.

The feature only works when the IO connection fails, for example because the source crashed or was stopped, or because of a network failure. It will not be triggered if the replica is manually stopped using “STOP REPLICA”.

We have two new functions, which help to add and delete the server entries from the source list (a quick usage sketch follows the list).

  • asynchronous_connection_failover_add_source → Arguments (‘channel’, ‘host’, port, ‘network_namespace’, weight)
  • asynchronous_connection_failover_delete_source → Arguments (‘channel’, ‘host’, port, ‘network_namespace’)
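
For example, removing an entry from the source list would look like this (a minimal sketch; “herc7” and “dc2” are simply the channel and host names used in the use case below):

mysql> select asynchronous_connection_failover_delete_source('herc7', 'dc2', 3306, '');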

The source servers need to be configured in the table “mysql.replication_asynchronous_connection_failover”. We can also use the table “performance_schema.replication_asynchronous_connection_failover” to view the available servers in the source list.
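
To check which servers are currently in the source list for a channel, a plain SELECT against the Performance Schema table works (a sketch; the columns match the output shown later in this post):

mysql> select Channel_name, Host, Port, Weight
    -> from performance_schema.replication_asynchronous_connection_failover;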

Requirements

  • GTID should be enabled on all the servers (a minimal settings sketch follows this list).
  • For auto-positioning purposes, MASTER_AUTO_POSITION should be enabled on the replica ( CHANGE MASTER ).
  • The replication user and password should be the same on all the source servers.
  • The replication user and password must be set for the channel using the CHANGE MASTER .. FOR CHANNEL statement.
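
A minimal sketch of the GTID-related settings each server needs in my.cnf (only the GTID part is shown; adjust the rest of the configuration to your environment):

[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON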

Use Case

I have two data centers and three servers (dc1, dc2, report).

  • “dc1” and “report” servers are in the same data center.
  • “dc2” is in a different data center.
  • “dc1” and “dc2” are in active-passive async replication setup (dc1 – active, dc2 – passive)
  • “report” is configured as an async replica under “dc1” for reporting purposes.

Here, my requirement is: if the active node “dc1” fails, the “report” server should be reconfigured under “dc2” to get the live data, without any manual work after the failure happens.

Configuration for Automatic Connection Failover

I have installed MySQL 8.0.22 on all three servers and configured the active – passive replication between “dc1” and “dc2”.

[root@dc1 ~]# mysql -e "select @@version, @@version_comment\G"
*************************** 1. row ***************************
        @@version: 8.0.22
@@version_comment: MySQL Community Server - GPL

At dc1,

mysql> show replica status\G
                  Source_Host: dc2
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
1 row in set (0.00 sec)

At dc2,

mysql> show replica status\G
                  Source_Host: dc1
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
1 row in set (0.00 sec)

Now, I need to configure the “report” server as an async replica under “dc1” with automatic failover options.

At report,

mysql> change master to
    -> master_user='Autofailover',
    -> master_password='Autofailover@321',
    -> master_host='dc1',
    -> master_auto_position=1,
    -> get_master_public_key=1,
    -> source_connection_auto_failover=1,
    -> master_retry_count=3,
    -> master_connect_retry=10
    -> for channel "herc7";
Query OK, 0 rows affected, 2 warnings (0.03 sec)

 

source_connection_auto_failover : To activate the automatic failover feature.

master_retry_count, master_connect_retry : The default settings are huge ( master_retry_count = 86400, master_connect_retry = 60 ), with which we would need to wait 60 days ( 86400 * 60 / 60 / 60 / 24 ) for the failover. So, I reduced the settings to 30 seconds ( 3 * 10 ).
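
If you want to verify what the channel ended up with, the connection settings are exposed in the Performance Schema (a sketch; in 8.0.22 this table should also include a SOURCE_CONNECTION_AUTO_FAILOVER column):

mysql> select Connection_retry_interval, Connection_retry_count, Source_connection_auto_failover
    -> from performance_schema.replication_connection_configuration
    -> where Channel_name = 'herc7';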

mysql> start replica for channel "herc7";
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show replica status\G
                  Source_Host: dc1
                Connect_Retry: 10
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
        Seconds_Behind_Source: 0
                Last_IO_Error: 
    Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Source_Retry_Count: 3
      Last_IO_Error_Timestamp: 
                Auto_Position: 1
                 Channel_Name: herc7
1 row in set (0.00 sec)

You can see that replication has started and the failover settings are applied. The current primary source is “dc1”. Now, I am going to use the function to add the “dc2” server details into the source list for the failover.

At “report”,

mysql> select asynchronous_connection_failover_add_source('herc7', 'dc2', 3306, '', 50);
+------------------------------------------------------------------------------+
| asynchronous_connection_failover_add_source('herc7', 'dc2', 3306, '', 50)    |
+------------------------------------------------------------------------------+
| The UDF asynchronous_connection_failover_add_source() executed successfully. |
+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from mysql.replication_asynchronous_connection_failover\G
*************************** 1. row ***************************
     Channel_name: herc7
             Host: dc2
             Port: 3306
Network_namespace: 
           Weight: 50
1 row in set (0.00 sec)

It shows the source list is updated with dc2 details. We are good to perform the failover now. 

I am going to shut down the MySQL service on dc1. 

At dc1, 

[root@dc1 ~]# service mysqld stop
Redirecting to /bin/systemctl stop mysqld.service
[root@dc1 ~]#

At report server,

mysql> show replica status\G
                  Source_Host: dc1
                Connect_Retry: 10
           Replica_IO_Running: Connecting
          Replica_SQL_Running: Yes
        Seconds_Behind_Source: NULL
                Last_IO_Error: error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 2 message: Can't connect to MySQL server on 'dc1' (111)
    Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Source_Retry_Count: 3
      Last_IO_Error_Timestamp: 201019 21:32:26
                Auto_Position: 1
                 Channel_Name: herc7
1 row in set (0.00 sec)

The IO thread is in the “Connecting” state. This means it is still trying to re-establish the connection to the existing source (dc1), based on the “master_retry_count” and “master_connect_retry” settings.

 After 30 seconds,

mysql> show replica status\G
                  Source_Host: dc2
                Connect_Retry: 10
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
        Seconds_Behind_Source: 0
                Last_IO_Error: 
    Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Source_Retry_Count: 3
      Last_IO_Error_Timestamp: 
                Auto_Position: 1
                 Channel_Name: herc7
1 row in set (0.00 sec)

You can see the Source_Host was changed to “dc2”. So, the server “report” performed the auto failover and connected to “dc2”.

From the error log,

2020-10-19T21:32:16.247460Z 53 [ERROR] [MY-010584] [Repl] Slave I/O for channel 'herc7': error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 1 message: Can't connect to MySQL server on 'dc1' (111), Error_code: MY-002003
2020-10-19T21:32:26.249887Z 53 [ERROR] [MY-010584] [Repl] Slave I/O for channel 'herc7': error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 2 message: Can't connect to MySQL server on 'dc1' (111), Error_code: MY-002003
2020-10-19T21:32:36.251989Z 53 [ERROR] [MY-010584] [Repl] Slave I/O for channel 'herc7': error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 3 message: Can't connect to MySQL server on 'dc1' (111), Error_code: MY-002003
2020-10-19T21:32:36.254585Z 56 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-10-19T21:32:36.256170Z 56 [System] [MY-010562] [Repl] Slave I/O thread for channel 'herc7': connected to master 'Autofailover@dc2:3306',replication started in log 'FIRST' at position 196
2020-10-19T21:32:36.258628Z 56 [Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was f68b8693-1246-11eb-a6c0-5254004d77d3.

The first three lines show that it tried to reconnect to the existing primary source “dc1” at 10-second intervals. There was no response from “dc1”, so it performed the failover to “dc2” (connected to master ‘Autofailover@dc2:3306’).

It works perfectly!

Is Failback Possible?

Let’s experiment with the two scenarios below:

  • What happens if the primary node comes back online?
  • Does it perform a failback in case the server with higher weight comes back online?

What happens if the primary node comes back online?

I am going to start “dc1”, which was shut down earlier to test the failover.

At “dc1”,

[root@dc1 ~]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[root@dc1 ~]# mysql -e "show status like 'uptime'\G"
*************************** 1. row ***************************
Variable_name: Uptime
        Value: 4

Let’s see the replication on the “report” server.

mysql> show replica status\G
                  Source_Host: dc2
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
    Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
                 Channel_Name: herc7
1 row in set (0.00 sec)

No changes. It is still connected to “dc2”. Failback has not happened.

Does it perform a failback in case the server with higher weight comes back online?

To test this, I again shut down MySQL on “dc1” and updated the source list on the “report” server (dc1 weight > dc2 weight).

select asynchronous_connection_failover_add_source('herc7', 'dc1', 3306, '', 70)\G

mysql> select * from replication_asynchronous_connection_failover\G
*************************** 1. row ***************************
     Channel_name: herc7
             Host: dc1
             Port: 3306
Network_namespace: 
           Weight: 70
*************************** 2. row ***************************
     Channel_name: herc7
             Host: dc2
             Port: 3306
Network_namespace: 
           Weight: 50
2 rows in set (0.00 sec)

You can see the server “dc1” is configured with a higher weight (70). Now I am going to start the MySQL service on “dc1”.

At “dc1”,

[root@dc1 ~]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[root@dc1 ~]# mysql -e "show status like 'uptime'\G"
*************************** 1. row ***************************
Variable_name: Uptime
        Value: 37

At “report” server,

mysql> show replica status\G
                  Source_Host: dc2
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
    Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
1 row in set (0.00 sec)

No changes. So, once the failover to a new source is done, an automatic failback will not happen until that new source itself goes down.

From MySQL doc:

Once the replica has succeeded in making a connection, it does not change the connection unless the new source stops or there is a network failure. This is the case even if the source that became unavailable and triggered the connection change becomes available again and has a higher priority setting.
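
If you do want the “report” server to go back under “dc1” once it is healthy again, a manual switch works (a minimal sketch; after the switch, the channel still keeps its failover source list):

mysql> stop replica for channel "herc7";
mysql> change master to master_host='dc1' for channel "herc7";
mysql> start replica for channel "herc7";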

This solution is also very helpful in (cluster + async replica) environments. You can automatically switch the connection to another cluster node in case the existing source cluster node fails. If your network is not stable, consider setting proper retry thresholds, because with low thresholds you may face frequent failovers.
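
For example, with a three-node cluster as the source, each member can be added to the channel’s source list with its own weight (a sketch; the host names and weights here are made up for illustration):

mysql> select asynchronous_connection_failover_add_source('herc7', 'cluster-node1', 3306, '', 80);
mysql> select asynchronous_connection_failover_add_source('herc7', 'cluster-node2', 3306, '', 60);
mysql> select asynchronous_connection_failover_add_source('herc7', 'cluster-node3', 3306, '', 40);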


by Sri Sakthivel via Percona Database Performance Blog
