MongoDB 4.4 Performance Regression: Overwhelmed by Memory
There is a special class of database bugs in which the system performs worse when given more resources. Examples of such bugs that I have seen in MySQL:
Bug #15815 – This is where InnoDB on an 8-CPU system performed worse than on a 4-CPU system with increased concurrency.
Bug #29847 – This one is similar to what I will describe today: when given more memory (innodb_buffer_pool_size), the MySQL crash recovery process took longer than with less memory. We described it in our blog post Innodb Recovery – Is a large buffer pool always better?
It seems InnoDB flushing was not optimal with a large amount of memory at that time, and this is what I think is happening with MongoDB 4.4 in the scenario I will describe.
MongoDB 4.4 Load Data Procedures
When preparing data for my benchmark (Percona Server for MongoDB 4.2 vs 4.4 in Python TPCC Benchmark), I also measured how long it takes to load 1000 Warehouses (about 165GB of data in MongoDB). To get repeatable numbers, I repeated the procedure multiple times, as I usually do.
What I noticed is that MongoDB 4.4 shows interesting behavior when it starts with an unlimited cache (that is, on a server with 250GB of RAM it allocates about 125GB for the WiredTiger cache, and the rest can be used for the OS cache).
Let me describe my load procedure, which is quite simple:
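As a quick check of the 125GB figure: MongoDB's documented default for the WiredTiger internal cache is the larger of 50% of (RAM minus 1 GB) or 256 MB. The sketch below only reproduces that arithmetic; the function name is my own and is not part of any MongoDB tool.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def default_wiredtiger_cache_bytes(ram_bytes: int) -> int:
    """Documented default: the larger of 50% of (RAM - 1 GB) or 256 MB.

    Hypothetical helper for illustration only.
    """
    half_of_remaining = (ram_bytes - GIB) // 2
    return max(half_of_remaining, 256 * MIB)


# The server in this test has 250 GB of RAM.
cache_bytes = default_wiredtiger_cache_bytes(250 * GIB)
print(f"default WiredTiger cache: {cache_bytes / GIB:.1f} GiB")
```

For a 250 GB server this gives roughly 124.5 GB, which matches the "about 125GB" quoted above.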
- Load data into database TPCC1
- Sleep 10 mins
- Load data into database TPCC3
That is, the second load goes into a different database, and in the background the pages for database TPCC1 should be flushed and evicted.
Before jumping to the problem I see, let’s check the numbers for MongoDB 4.2. By numbers I mean how long it takes to complete step 1 and step 3.
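The three steps can be sketched as a small timing harness. This is a minimal sketch, not the actual script used for the benchmark: `load_database` is a hypothetical callable standing in for the real TPCC loader, and the `sleep` parameter exists only so the pause can be skipped in a dry run.

```python
import time
from typing import Callable, Dict


def run_load_procedure(
    load_database: Callable[[str], None],
    pause_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, float]:
    """Time step 1 and step 3 of the procedure, in minutes."""
    timings: Dict[str, float] = {}

    # Step 1: load data into database TPCC1.
    start = time.monotonic()
    load_database("TPCC1")
    timings["step1_min"] = (time.monotonic() - start) / 60

    # Step 2: sleep 10 minutes (600 seconds by default).
    sleep(pause_seconds)

    # Step 3: load data into database TPCC3.
    start = time.monotonic()
    load_database("TPCC3")
    timings["step3_min"] = (time.monotonic() - start) / 60

    return timings


if __name__ == "__main__":
    # Dry run with a stand-in loader that does nothing and no pause.
    def fake_loader(db_name: str) -> None:
        print(f"loading into {db_name}")

    print(run_load_procedure(fake_loader, pause_seconds=0))
```

Timing each load separately is what makes the difference between step 1 and step 3 visible.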
MongoDB 4.2 with limited memory (WiredTiger cache 25GB):
Step 1: 20 min
Step 3: 21 min
MongoDB 4.4 with limited memory (WiredTiger cache 25GB):
Step 1: 24 min
Step 3: 26 min
MongoDB 4.2 with 125GB WiredTiger cache:
Step 1: 18 min
Step 3: 19 min
And now to the problem I see:
MongoDB 4.4 with 125GB WiredTiger cache:
Step 1: 19 min
Step 3: 497 min
Notice that Step 3 takes more than 8 hours (497 minutes) instead of the usual ~20 minutes seen in all previous cases, and this happens when MongoDB has 125GB of WiredTiger cache.
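To put the Step 3 numbers side by side, here is a small calculation that uses only the timings reported above and takes MongoDB 4.2 with the 125GB cache as the baseline. The choice of baseline is mine, purely for illustration.

```python
# Step 3 load times in minutes, taken from the results above.
step3_minutes = {
    ("4.2", "25GB"): 21,
    ("4.4", "25GB"): 26,
    ("4.2", "125GB"): 19,
    ("4.4", "125GB"): 497,
}

# Baseline chosen for illustration: MongoDB 4.2 with 125GB WiredTiger cache.
baseline = step3_minutes[("4.2", "125GB")]

slowdown = {}
for (version, cache), minutes in step3_minutes.items():
    hours, mins = divmod(minutes, 60)
    slowdown[(version, cache)] = minutes / baseline
    print(
        f"MongoDB {version}, {cache} cache: "
        f"{hours}h {mins:02d}m ({minutes / baseline:.1f}x baseline)"
    )
```

The 497-minute run works out to 8 hours 17 minutes, roughly 26 times slower than the same step on MongoDB 4.2 with the same cache size.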
What’s interesting is that I do not see this issue when I limit the WiredTiger cache to 25GB, and also this problem does not exist in MongoDB 4.2.
That’s why I think MongoDB 4.4 starts to behave differently when it is given a lot of memory for the WiredTiger cache. Why exactly this happens I do not know yet, and I will continue to profile this case. A quick look indicates it may be related to the WiredTiger eviction process (similar to the InnoDB flushing problem in the crash recovery process) and to replication flow control, which was introduced in MongoDB 4.2 to keep replicas in sync (although I do not use replicas in this test).
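One way to watch eviction pressure while profiling is to look at the `wiredTiger.cache` section of the `serverStatus` output. The sketch below is not the profiling method used in this post. It assumes the statistic names "maximum bytes configured", "bytes currently in the cache" and "tracked dirty bytes in the cache", and it runs against a hand-made sample document with invented numbers rather than a live server.

```python
GIB = 1024 ** 3


def cache_pressure(server_status: dict) -> dict:
    """Return cache fill and dirty percentages from a serverStatus document.

    Hypothetical helper; the stat names are assumed to match serverStatus.
    """
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    used_bytes = cache["bytes currently in the cache"]
    dirty_bytes = cache["tracked dirty bytes in the cache"]
    return {
        "fill_pct": 100 * used_bytes / max_bytes,
        "dirty_pct": 100 * dirty_bytes / max_bytes,
    }


# Illustrative sample only: these numbers are invented, not measured.
sample_status = {
    "wiredTiger": {
        "cache": {
            "maximum bytes configured": 125 * GIB,
            "bytes currently in the cache": 100 * GIB,
            "tracked dirty bytes in the cache": 10 * GIB,
        }
    }
}

pressure = cache_pressure(sample_status)
print(f"cache fill: {pressure['fill_pct']:.1f}%, dirty: {pressure['dirty_pct']:.1f}%")
```

Against a live server, the input document could come from pymongo's `client.admin.command("serverStatus")`. Sampling it periodically during Step 3 would show whether the cache stays close to its eviction thresholds.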
by Vadim Tkachenko via Percona Database Performance Blog