Various Ways to Perform Schema Upgrades with Percona XtraDB Cluster

Schema changes are one of the big challenges in Galera replication, so everyone who runs Percona XtraDB Cluster (PXC) or another Galera cluster should understand how schema change operations behave. In this blog, I am going to explain the operation and impact of the various schema change methods used in PXC/Galera clusters.

  • Schema changes with “wsrep_OSU_method = TOI”
  • Schema changes with “wsrep_OSU_method = RSU”
  • Schema changes with “ONLINE ALGORITHMS”
  • Schema changes with “pt-osc”
  • Schema changes with “gh-ost”

For testing:

  • I have configured a 3-node Percona XtraDB Cluster (8.0.19).
  • I am running a read/write load using sysbench.
mysql> select @@wsrep_cluster_address\G
*************************** 1. row ***************************
@@wsrep_cluster_address: gcomm://pxc81,pxc82,pxc83
1 row in set (0.00 sec)

mysql> select @@version, @@version_comment\G
*************************** 1. row ***************************
        @@version: 8.0.19-10
@@version_comment: Percona XtraDB Cluster (GPL), Release rel10, Revision 727f180, WSREP version 26.4.3
1 row in set (0.00 sec)

What is the Impact of Schema Changes in Clusters?

  • By default (TOI), all the nodes in the cluster are paused during the ALTER, because the ALTER has to be replicated to and applied on every node. A big ALTER will affect performance and can cause downtime.
  • A schema upgrade cannot be rolled back.
  • You can’t kill the ALTER query once it is running, so your application may have to wait until the ALTER completes.
mysql> pager grep alter
PAGER set to 'grep alter'

mysql> show processlist;
| 19 | root            | localhost | schema_changes | Query   |   18 | altering table           | alter table sbtest1 add index idx_c(c) |         0 |             0 |
7 rows in set (0.00 sec)

mysql> kill 19;
ERROR 1095 (HY000): You are not owner of thread 19
  • Metadata locks (MDLs) are taken only on the local node, not across all the nodes in the cluster, so you need additional control over this.

Schema Changes with “wsrep_OSU_method = TOI”

TOI: Total Order Isolation

  • TOI is the default method ( wsrep_OSU_method = TOI ) for schema changes.
  • DDL statements are processed in the same order, relative to other transactions, on every node.
  • The full cluster will be blocked/locked during the DDL operation. 
  • This guarantees data consistency.
mysql> select @@wsrep_OSU_method\G
*************************** 1. row ***************************
@@wsrep_OSU_method: TOI
1 row in set (0.00 sec)

Example:

I am going to run the below ALTER on “pxc81”.

alter table sbtest1 add index idx_c(c)

After initiating the ALTER on pxc81, the processlist shows that the COMMIT and UPDATE statements (from sysbench) are paused and only the ALTER is in progress. The COMMIT and UPDATE resume once the ALTER completes.

| 17 | root            | localhost | schema_changes | Execute |      15 | closing tables                           | COMMIT                                 |         0 |             0 |
| 17 | root            | localhost | schema_changes | Execute |      15 | updating                                 | UPDATE sbtest1 SET c='91668836759-30934071579-18064439108-53267873872-79461377960-32104006456-143369 |         0 |             1 |
| 24 | root            | localhost | schema_changes | Query   |      15 | altering table                           | alter table sbtest1 add index idx_c(c) |         0 |             0 |

SELECT statements can still run as long as “wsrep_sync_wait != 1”. With “wsrep_sync_wait = 1”, the SELECT needs a causality check before it reads, and that check cannot complete while the ALTER is running. So, it will fail.

SELECT with “wsrep_sync_wait=1”

|  1 | system user     |           | schema_changes | Query   |     0 | altering table         | alter table sbtest1 add index idx_c(c) |         0 |             0 |
| 15 | root            | localhost | schema_changes | Query   |    40 | starting               | select * from sbtest1 where id=1       |         0 |             0 |

mysql> select * from sbtest1 where id=1;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

TOI can be the right choice for quick operations.

  • CREATE STATEMENTS
  • RENAME INDEX
  • RENAME TABLE
  • DROP INDEX
  • ALGORITHM=INSTANT

Schema Changes with “wsrep_OSU_method = RSU”

RSU – Rolling Schema Upgrade

  • In this method, DDL statements are not replicated across the cluster nodes. You need to execute the DDL individually on every node.
  • The node that is executing the DDL desyncs from the cluster group. The other nodes in the cluster remain operational and keep receiving application connections.
  • Once the node finishes the DDL, it starts applying the writesets it missed.
  • The important thing in this method is that WRITEs must not be performed on that particular table until the schema upgrade completes on all the nodes. Users should be very clear on this, because a failure here will break the cluster and the data may be unrecoverable.
  • Gcache should be large enough to store the writesets that arrive while the node is desynced.
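
A rough way to judge whether Gcache is large enough is to measure how many bytes of writesets the cluster produces per second and multiply that by the expected ALTER duration. The sketch below does only that arithmetic. It assumes you sample the sum of the wsrep_replicated_bytes and wsrep_received_bytes status counters twice; the function name, the sample numbers, and the default safety factor of 2 are illustrative assumptions, not values measured on the test cluster.

```shell
#!/usr/bin/env bash
# Estimate the Gcache size (in bytes) needed to cover an RSU ALTER.
# Arguments (all integers):
#   $1  wsrep_replicated_bytes + wsrep_received_bytes at the first sample
#   $2  the same sum at the second sample
#   $3  seconds between the two samples
#   $4  expected ALTER duration in seconds
#   $5  safety factor (optional; defaults to 2, an assumption to tune)
estimate_gcache_bytes() {
  local first=$1 second=$2 interval=$3 alter_secs=$4 safety=${5:-2}
  if [ "$interval" -le 0 ]; then
    echo "interval must be positive" >&2
    return 1
  fi
  # Bytes of writesets per second, rounded up to stay on the safe side.
  local rate=$(( (second - first + interval - 1) / interval ))
  echo $(( rate * alter_secs * safety ))
}

# Example with made-up counters: 600 MB of writesets in 60 seconds,
# and an ALTER expected to take about 127 seconds.
# estimate_gcache_bytes 1000000000 1600000000 60 127
```

The result can then be compared with the gcache.size value shown in wsrep_provider_options. If the estimate is larger than the configured size, the desynced node may not be able to catch up from Gcache after the ALTER.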

Example:

At pxc82, I am going to execute the ALTER.

Session 1: (setting up the value to RSU – session-level) 

mysql> set wsrep_OSU_method=RSU;
Query OK, 0 rows affected (0.09 sec)

Session 2: (checking the node status)

mysql> show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_recv_queue    | 0      |
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
2 rows in set (0.00 sec)

Session 1: (executing the ALTER )

mysql> alter table sbtest1 add index idx_c(c);

Session 2: (checking again the node status )

Here the node went to Donor/Desynced state once the ALTER started. You can see the queue also keeps increasing.

mysql> nopager;  show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
PAGER set to stdout
+---------------------------+----------------+
| Variable_name             | Value          |
+---------------------------+----------------+
| wsrep_local_recv_queue    | 2053           |
| wsrep_local_state_comment | Donor/Desynced |
+---------------------------+----------------+
2 rows in set (0.21 sec)

Session 1: (ALTER completed)

mysql> alter table sbtest1 add index idx_c(c);
Query OK, 0 rows affected (2 min 6.52 sec)
Records: 0  Duplicates: 0  Warnings: 0

Session 2: (Node synced to cluster)

mysql> show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_recv_queue    | 0      |
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
2 rows in set (0.00 sec)

The same steps need to be executed on pxc81 and pxc83 as well. Once the ALTER has completed on all nodes, WRITEs to that table can be allowed again.
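
The per-node procedure can be sketched as a pair of small shell helpers. This is a minimal sketch under assumptions: both function names are hypothetical, the mysql client is assumed to read credentials from an option file, and the plan is only printed so that each command can be reviewed and run one node at a time.

```shell
#!/usr/bin/env bash
# Succeeds only when a node reports Synced with an empty receive queue.
#   $1  value of wsrep_local_state_comment
#   $2  value of wsrep_local_recv_queue
node_caught_up() {
  [ "$1" = "Synced" ] && [ "$2" -eq 0 ]
}

# Print (do not run) one RSU command per node.
#   $1      the DDL statement, without a trailing semicolon
#   $2...   node host names, in the order they should be altered
# Note: a DDL containing double quotes or backticks would need extra
# escaping before the printed command can be pasted into a shell.
print_rsu_plan() {
  local ddl=$1 node
  shift
  for node in "$@"; do
    # SET SESSION keeps RSU limited to this one connection.
    printf "mysql -h %s -e \"SET SESSION wsrep_OSU_method='RSU'; %s;\"\n" "$node" "$ddl"
  done
}

# Example:
# print_rsu_plan "ALTER TABLE schema_changes.sbtest1 ADD INDEX idx_c(c)" pxc82 pxc81 pxc83
```

Between nodes, the two status values checked in Session 2 can be fed to node_caught_up, and the next command is run only after it succeeds.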

The RSU method is not truly disruption-free, as a few bugs have been reported against it. Users should be very clear about what they are doing and careful when running RSU for schema updates:

https://jira.percona.com/browse/PXC-2620

https://jira.percona.com/browse/PXC-2293

https://jira.percona.com/browse/PXC-1980

Schema Changes with “ONLINE ALGORITHMS”

So far, we have three ALTER algorithms:

  • INPLACE
  • COPY
  • INSTANT

With TOI:

“ALGORITHM = INPLACE / COPY” still pauses the cluster during the operation. With TOI, Galera doesn’t let other transactions proceed while an ALTER TABLE statement runs, so any ALTER TABLE blocks all transactions on all nodes.

| 17 | root            | localhost | schema_changes | Execute |      12 | closing tables                           | COMMIT                                                               |         0 |             0 |
| 18 | root            | localhost | schema_changes | Execute |      12 | closing tables                           | COMMIT                                                               |         0 |             0 |
| 32 | root            | localhost | schema_changes | Query   |      13 | altering table                           | alter table sbtest1 add index idx_c(c), algorithm=inplace,

“ALGORITHM=INSTANT” is supported and faster in TOI.

mysql> alter table sbtest1 add column test_Ins int , algorithm=instant;
Query OK, 0 rows affected (0.24 sec)
Records: 0  Duplicates: 0  Warnings: 0

With RSU:

“ALGORITHM = INPLACE/COPY” brings no benefit with RSU either. It pauses Galera replication on the node and takes the node to the Desynced state.

mysql> show processlist;
| 62 | root            | localhost | schema_changes | Query   |    51 | altering table                       | alter table sbtest1 add index idx_c(c), algorithm=inplace, lock=none |         0 |             0 |
5 rows in set (0.06 sec)

mysql> nopager;  show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
PAGER set to stdout
+---------------------------+----------------+
| Variable_name             | Value          |
+---------------------------+----------------+
| wsrep_local_recv_queue    | 7335           |
| wsrep_local_state_comment | Donor/Desynced |
+---------------------------+----------------+
2 rows in set (0.03 sec)

“ALGORITHM=INSTANT” is supported and fast with RSU as well. Still, you can simply use TOI and avoid the additional per-node work.

mysql> alter table sbtest1 add column test_Inss int , algorithm=instant;
Query OK, 0 rows affected (0.19 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select @@wsrep_OSU_method;
+--------------------+
| @@wsrep_OSU_method |
+--------------------+
| RSU                |
+--------------------+
1 row in set (0.02 sec)

I would suggest using “ALGORITHM = INSTANT” with TOI wherever you can, but make sure you are on MySQL 8.0 or later. Unfortunately, “ALGORITHM=INSTANT” currently supports only a limited set of operations, such as adding new columns.
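
When the algorithm is named explicitly, MySQL returns an error if the operation cannot be done with that algorithm instead of falling back to a slower one, which makes it a safe first attempt. The helper below is a small sketch of that habit; the function name is hypothetical, and it only prints the statement.

```shell
#!/usr/bin/env bash
# Print an ALTER TABLE statement that insists on ALGORITHM=INSTANT.
#   $1  schema name
#   $2  table name
#   $3  the alter clause, e.g. "ADD COLUMN test_Ins INT"
instant_alter_sql() {
  printf 'ALTER TABLE `%s`.`%s` %s, ALGORITHM=INSTANT;\n' "$1" "$2" "$3"
}

# Example:
# instant_alter_sql schema_changes sbtest1 "ADD COLUMN test_Ins INT"
```

If the server rejects the statement, nothing has been changed, and the ALTER can be planned with one of the other methods in this post.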

Schema Changes with “pt-osc”

pt-osc: pt-online-schema-change (from Percona Toolkit)

Personally, I like this approach very much and use it most often in production environments. pt-osc provides non-blocking schema upgrades on all nodes in one shot. It should be used with the TOI method. The flow is as follows:

  • Creates a new table “_tablename_new” with the required modification.
  • Creates triggers ( insert / update / delete ) on the original table so that rows modified during the copy are also applied to the new table.
  • Copies the records from the original table to the new table in chunks.
  • Once the copy is completed, swaps the tables ( original → _old, _new → original ) and drops the triggers and the old table. Direct DDLs ( RENAME TABLE, DROP TABLE ) are used for this step ( wsrep_OSU_method=TOI ).

For the below ALTER, 

alter table schema_changes.sbtest1 add index idx_test_Ins(test_Ins);

Pt-osc flow in SQL syntax:

Creating new table...
CREATE TABLE `schema_changes`.`_sbtest1_new` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `test_Ins` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=20400335 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Created new table schema_changes._sbtest1_new OK.
Altering new table...
ALTER TABLE `schema_changes`.`_sbtest1_new` add index idx_test_Ins(test_Ins)
Altered `schema_changes`.`_sbtest1_new` OK.
Not creating triggers because this is a dry run.
Not copying rows because this is a dry run.
INSERT LOW_PRIORITY IGNORE INTO `schema_changes`.`_sbtest1_new` (`id`, `k`, `c`, `pad`, `test_ins`) SELECT `id`, `k`, `c`, `pad`, `test_ins` FROM `schema_changes`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) AND ((`id` <= ?)) LOCK IN SHARE MODE /*pt-online-schema-change 9253 copy nibble*/
SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `schema_changes`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) ORDER BY `id` LIMIT ?, 2 /*next chunk boundary*/
Not swapping tables because this is a dry run.
Not dropping old table because this is a dry run.
Not dropping triggers because this is a dry run.
DROP TRIGGER IF EXISTS `schema_changes`.`pt_osc_schema_changes_sbtest1_del`
DROP TRIGGER IF EXISTS `schema_changes`.`pt_osc_schema_changes_sbtest1_upd`
DROP TRIGGER IF EXISTS `schema_changes`.`pt_osc_schema_changes_sbtest1_ins`
2020-09-30T08:31:17 Dropping new table...
DROP TABLE IF EXISTS `schema_changes`.`_sbtest1_new`;
2020-09-30T08:31:17 Dropped new table OK.

pt-osc provides several options to keep the operation under control. You can control the connections, active threads, load, chunk size, etc.

For Galera, there is the option “--max-flow-ctl”. It checks the average time the cluster spent pausing for flow control (FC) and makes the tool pause if that goes over the percentage given in the option. By default, the tool does not check FC.

[root@pxc81 log]# less /bin/pt-online-schema-change  | grep -i pausing
         print STDERR "Pausing because PXC Flow Control is active\n";
         print STDERR "Pausing because "
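
Putting these options together, a run against the test table might look like the sketch below. The flags used (--alter, --max-flow-ctl, --chunk-size, --max-load, --critical-load, --dry-run, --execute) are pt-online-schema-change options, but the function name is hypothetical, the threshold values are illustrative assumptions, and credentials are assumed to come from an option file. The function prints the command for review instead of running it.

```shell
#!/usr/bin/env bash
# Print a pt-online-schema-change command line for review.
#   $1  mode: "dry-run" or "execute"
#   $2  alter clause, e.g. "add index idx_test_Ins(test_Ins)"
#   $3  schema
#   $4  table
#   $5  percentage to pass to --max-flow-ctl
ptosc_command() {
  local mode=$1 alter=$2 schema=$3 table=$4 max_fc=$5
  case "$mode" in
    dry-run|execute) ;;
    *) echo "mode must be dry-run or execute" >&2; return 1 ;;
  esac
  # Chunk size and load thresholds are example values; tune them per workload.
  printf 'pt-online-schema-change --alter "%s" --max-flow-ctl=%s --chunk-size=1000 --max-load Threads_running=25 --critical-load Threads_running=50 --%s D=%s,t=%s\n' \
    "$alter" "$max_fc" "$mode" "$schema" "$table"
}

# Example: review with a dry run first, then switch the mode to execute.
# ptosc_command dry-run "add index idx_test_Ins(test_Ins)" schema_changes sbtest1 5
```

Running the dry-run form first is cheap, since it creates and alters the new table but copies no rows and swaps nothing.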

To make schema changes on tables with FOREIGN KEYs, I would suggest using “--alter-foreign-keys-method=rebuild_constraints”. This helps to maintain the consistency of the schema and its relations. With this approach, before dropping the old table, the tool runs ALTER on all the child tables to drop the existing FK constraints and re-add FK constraints that point to the columns of the new table. Again, adding and dropping the FOREIGN KEY is a direct ALTER using TOI.

Schema Changes with “gh-ost”

gh-ost takes a similar approach to “pt-osc”. It also performs non-blocking ALTERs on all cluster nodes in one shot. The main difference is that gh-ost is triggerless: it uses the binary log to track the changes. So you need the following variables set to run gh-ost.

log-bin=sakthi-bin
binlog-format=ROW
log-slave-updates=ON

The flow will be like,

  • Creates the gh-ost table with the required modifications.
  • Copies the records from the original table to the new table in chunks.
  • Applies the live changes by reading the DML events from the binary logs.
  • Once the binary log events are applied, swaps the tables ( original → _old, gh-ost → original ) and drops the old table.

Example:

[root@pxc81 schema_changes]# gh-ost --alter="add index idx_test_Inss(test_Ins)" --database=schema_changes --table=sbtest1 --user=root --password=Jesus@7sakthI --allow-on-master --execute
[2020/09/30 09:40:56] [info] binlogsyncer.go:133 create BinlogSyncer with config {99999 mysql 127.0.0.1 3306 root    false false <nil> false UTC true 0 0s 0s 0 false}
[2020/09/30 09:40:56] [info] binlogsyncer.go:354 begin to sync binlog from position (binlog.000027, 196850993)
[2020/09/30 09:40:56] [info] binlogsyncer.go:203 register slave for master server 127.0.0.1:3306
[2020/09/30 09:40:56] [info] binlogsyncer.go:723 rotate to (binlog.000027, 196850993)
# Migrating `schema_changes`.`sbtest1`; Ghost table is `schema_changes`.`_sbtest1_gho`
# Migrating pxc81:3306; inspecting pxc81:3306; executing on pxc81
# Migration started at Wed Sep 30 09:40:56 +0000 2020
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle 
# Serving on unix socket: /tmp/gh-ost.schema_changes.sbtest1.sock
Copy: 0/6563240 0.0%; Applied: 0; Backlog: 0/1000; Time: 0s(total), 0s(copy); streamer: binlog.000027:196853401; Lag: 0.02s, State: migrating; ETA: N/A
Copy: 0/6563240 0.0%; Applied: 0; Backlog: 0/1000; Time: 1s(total), 1s(copy); streamer: binlog.000027:196858195; Lag: 0.01s, State: migrating; ETA: N/A
Copy: 22000/6563240 0.3%; Applied: 0; Backlog: 0/1000; Time: 2s(total), 2s(copy); streamer: binlog.000027:201067135; Lag: 0.01s, State: migrating; ETA: 9m58s

.......

Copy: 5682000/6563240 86.6%; Applied: 0; Backlog: 0/1000; Time: 16m10s(total), 16m10s(copy); streamer: binlog.000028:213168607; Lag: 0.01s, State: migrating; ETA: 2m30s
Copy: 6563000/6563240 100.0%; Applied: 0; Backlog: 0/1000; Time: 20m20s(total), 20m20s(copy); streamer: binlog.000028:382677405; Lag: 0.01s, State: migrating; ETA: 0s

Gh-ost also provides several options to perform effective operations.  You can control the connections, active threads, load, chunk size, etc.

But unfortunately, the “--max-flow-ctl” option is not available in gh-ost.
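
One possible workaround, offered as an untested sketch and not something verified in the tests above: gh-ost has a generic --throttle-query option, which throttles while the query returns a value greater than zero. A query on wsrep_flow_control_paused could approximate a flow-control check. The sketch assumes that the wsrep status variables are readable from performance_schema.global_status on your PXC version; the function name and the 0.05 threshold are illustrative, and credentials are left out.

```shell
#!/usr/bin/env bash
# Print a gh-ost command that throttles on Galera flow control.
#   $1  alter clause
#   $2  schema
#   $3  table
#   $4  wsrep_flow_control_paused fraction above which to throttle (e.g. 0.05)
ghost_command() {
  local alter=$1 schema=$2 table=$3 limit=$4
  # gh-ost throttles while this query returns a value greater than zero.
  local query="SELECT VARIABLE_VALUE > ${limit} FROM performance_schema.global_status WHERE VARIABLE_NAME = 'wsrep_flow_control_paused'"
  printf 'gh-ost --alter="%s" --database=%s --table=%s --allow-on-master --throttle-query="%s" --execute\n' \
    "$alter" "$schema" "$table" "$query"
}

# Example:
# ghost_command "add index idx_test_Inss(test_Ins)" schema_changes sbtest1 0.05
```

Keep in mind that wsrep_flow_control_paused is a fraction accumulated since the last FLUSH STATUS, so it reacts more slowly than the check that pt-osc performs.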

Conclusion

So, finally, I would say,

  • Always use a direct ALTER with TOI for metadata changes and INSTANT ALTERs.
  • Use pt-online-schema-change with TOI, and set suitable flow control thresholds, for InnoDB tables.
  • Schedule pt-online-schema-change operations on FOREIGN KEY tables in off-peak hours.
  • If you use RSU, never forget that you need to execute the ALTER on all nodes individually and that you must block the WRITEs for that particular table. Make sure your Gcache is large enough to hold the writesets.
  • If you are concerned about triggers, you can use gh-ost to make the ALTERs.

by Sri Sakthivel via Percona Database Performance Blog
