MongoDB Best Practices 2020 Edition

In this blog post, we will discuss best practices for the MongoDB ecosystem, applied at the Operating System (OS) and MongoDB levels. The main objective of this post is to share my experience tuning MongoDB over the past years and to centralize in one place the diverse sources I came across along the way.

Spoiler alert: this post focuses on the MongoDB 3.6.x series and higher, since previous versions have reached their End-of-Life (EOL).

Note that the intent of tuning these settings is not exclusively to improve performance, but also to enhance the high availability and resilience of the MongoDB database.

Without further ado, let’s start with the OS settings.

Operating System(OS) Settings

Swappiness

Swappiness is a Linux kernel setting, ranging from 0 to 100, that influences how aggressively the Virtual Memory manager swaps pages to disk. A setting of “0” tells the kernel to swap only to avoid out-of-memory problems, while a setting of 100 tells it to swap aggressively to disk. The Linux default is usually 60, which is not ideal for database usage.

It is common to see a value of “0” (or sometimes “10”) on database servers, telling the kernel to avoid swapping as much as possible for better response times. However, Ovais Tariq details a known bug (or feature) when using a setting of “0”.

So it is recommended to set it to “1”. To change the swappiness value at runtime:
# Non-persistent - the value will revert to the previous value if you reboot the OS
echo 1 > /proc/sys/vm/swappiness

And to persist it across reboots, add the setting to /etc/sysctl.conf:
# A line in /etc/sysctl.conf makes the change persistent across reboots
echo "vm.swappiness = 1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

NUMA Architecture

Non-uniform memory access (NUMA) is a memory design in which a processor in a symmetric multiprocessing (SMP) system can access its local memory faster than non-local memory (memory local to another CPU). Here is an example of a system that has NUMA enabled:
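(The numactl --hardware output below is an illustrative sketch; the node sizes and free values are invented for this example.)

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 65430 MB
node 0 free: 21254 MB
node 1 cpus: 4 5 6 7
node 1 size: 65536 MB
node 1 free: 3492 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10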

As we can see, node 0 has more free memory than node 1. This causes an issue where the OS swaps even with memory available. The swap issue is explained in the excellent article by Jeremy Cole, Swap Insanity and NUMA Architecture. The article focuses on MySQL, but it is valid for MongoDB as well.

Unfortunately, MongoDB is not NUMA-aware, and because of this it can allocate memory unevenly, leading to the swap issue even with memory available. To solve this issue, the mongod process can use interleaved mode (fair memory allocation across all nodes) in two ways:

Start the mongod process with numactl --interleave=all :

numactl --interleave=all /usr/bin/mongod -f /etc/mongod.conf

Or if systemd is in use:

# Edit the file
/etc/systemd/system/multi-user.target.wants/mongod.service

If the existing ExecStart statement reads:

ExecStart=/usr/bin/mongod --config /etc/mongod.conf

Update that statement to read:
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf

Apply the change to systemd:
sudo systemctl daemon-reload

Restart any running mongod instances:

sudo systemctl stop mongod
sudo systemctl start mongod

And to validate the memory usage:

$ sudo numastat -p $(pidof mongod)

Per-node process memory usage (in MBs) for PID 35172 (mongod)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        19.40           27.36           46.77
Stack                        0.03            0.03            0.05
Private                      1.61           24.23           25.84
----------------  --------------- --------------- ---------------
Total                       21.04           51.62           72.66

zone_reclaim_mode

In some OS versions, vm.zone_reclaim_mode is enabled. The zone_reclaim_mode parameter allows setting a more or less aggressive approach to reclaiming memory when a zone runs out of memory. If it is set to zero, no zone reclaim occurs.

It is necessary to disable vm.zone_reclaim_mode when NUMA is enabled. To disable it, you can execute the following command:

sudo sysctl -w vm.zone_reclaim_mode=0
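This changes the value only at runtime. To keep it across reboots, the parameter can be added to /etc/sysctl.conf as well (a minimal sketch, assuming kernel parameters are managed through that file):

# Persist zone_reclaim_mode across reboots
echo "vm.zone_reclaim_mode = 0" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p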

IO Scheduler

The IO scheduler is the algorithm the kernel uses to commit reads and writes to disk. By default, most Linux installs use the CFQ (Completely-Fair Queue) scheduler. CFQ works well for many general use cases but provides few latency guarantees. Two other schedulers are deadline and noop. The deadline scheduler excels at latency-sensitive use cases (like databases), and noop is closer to no scheduling at all. For bare-metal servers, either deadline or noop (the performance difference between them is imperceptible) will be better than CFQ.

If you are running MongoDB inside a VM (which has its own IO scheduler beneath it), it is best to use “noop” and let the virtualization layer take care of the IO scheduling itself.

To change it, run the following as root (adjusting the disk device name accordingly):

# Verifying
$ cat /sys/block/xvda/queue/scheduler
noop [deadline] cfq

# Adjusting the value dynamically
$ echo "noop" > /sys/block/xvda/queue/scheduler

To make this change persistent, you must edit the GRUB configuration file (usually /etc/sysconfig/grub) and add an elevator option to GRUB_CMDLINE_LINUX. For example, you would change this line:

GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto console=ttyS0,115200"

With this line:

GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto console=ttyS0,115200 elevator=noop"

Note for AWS setups: There are cases where the I/O scheduler shows a value of none, most notably on AWS VM instance types where EBS volumes are exposed as NVMe block devices. This is because modern PCIe/NVMe devices have a substantial internal queue and bypass the IO scheduler altogether, so the setting has no use for them. In this case, none is the optimal setting for such disks.
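To confirm this on such an instance, check the scheduler of the NVMe device (the device name nvme0n1 below is an assumption; adjust it to your environment, and the output may also list other available schedulers):

$ cat /sys/block/nvme0n1/queue/scheduler
[none]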

Transparent Huge Pages

Databases use small memory pages, and Transparent Huge Pages (THP) tend to become fragmented and impact performance.

To disable it at runtime on RHEL/CentOS 6 and 7:

$ echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
$ echo "never" > /sys/kernel/mm/transparent_hugepage/defrag

To make this change survive a server restart, you’ll have to add the flag transparent_hugepage=never  to your kernel options (/etc/sysconfig/grub):

GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto console=ttyS0,115200 elevator=noop transparent_hugepage=never"

Rebuild the /boot/grub2/grub.cfg file by running the grub2-mkconfig -o command. Before rebuilding the GRUB2 configuration file, be sure to take a backup of the existing /boot/grub2/grub.cfg.

On BIOS-Based Machines

$ grub2-mkconfig -o /boot/grub2/grub.cfg

Troubleshooting

If Transparent Huge Pages (THP) is still not disabled, continue with the option below:

Disable tuned services

Disable the tuned service if it is re-enabling THP, using the commands below.

$ systemctl stop tuned
$ systemctl disable tuned
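After that, verify that THP is effectively disabled; the active value is the one shown in brackets:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]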

Dirty Ratio

The dirty_ratio  is the percentage of total system memory that can hold dirty pages. The default on most Linux hosts is between 20-30%. When you exceed the limit, the dirty pages are committed to disk, creating a small pause. To avoid the hard pause, there is a second ratio: dirty_background_ratio (default 10-15%) which tells the kernel to start flushing dirty pages to disk in the background without any pause.

20-30% is a good general default for “dirty_ratio,” but on large-memory database servers, this can be a lot of memory. For example, on a 128GB-memory host, this can allow up to 38.4GB of dirty pages. The background ratio won’t kick in until 12.8GB. It is recommended to lower this setting and monitor the impact on query performance and disk IO. The goal is to reduce memory usage without negatively impacting query performance.

A recommended setting for dirty ratios on large-memory (64GB+) database servers is: vm.dirty_ratio = 15 and vm.dirty_background_ratio = 5, or possibly less. (Red Hat recommends lower ratios of 10 and 3 for high-performance/large-memory servers.)

You can set this by adding the following lines to the /etc/sysctl.conf:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
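As with the other kernel parameters, the new values only take effect once sysctl reloads them; a quick way to apply and double-check (assuming the lines were added to /etc/sysctl.conf):

sudo sysctl -p
sysctl vm.dirty_ratio vm.dirty_background_ratio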

Filesystems mount options

MongoDB recommends the XFS filesystem for on-disk database data. Furthermore, proper mount options can improve performance noticeably. Make sure the drives are mounted with noatime and, if the drives are behind a RAID controller, that it has an appropriate battery-backed cache. It is possible to remount on the fly; for example, to remount /mnt/db/ with these options:

mount -o remount,rw,noatime /mnt/db

It is necessary to add/edit the corresponding line in /etc/fstab for the option to persist on reboots. For example:

UUID=f41e390f-835b-4223-a9bb-9b45984ddf8d /                       xfs     rw,noatime,attr2,inode64,noquota        0 0

Unix ulimit Settings

Most UNIX-like operating systems, including Linux and macOS, provide ways to limit and control the usage of system resources such as threads, files, and network connections on a per-process and per-user basis. These “ulimits” prevent single users from using too many system resources. Sometimes, these limits have low default values that can cause several issues in the course of regular MongoDB operation. For Linux distributions that use systemd, you can specify the limits within the [Service] section of the unit file:

First, open the file for editing:

vi /etc/systemd/system/multi-user.target.wants/mongod.service

And under [Service] add the following:

# (file size)
LimitFSIZE=infinity
# (cpu time)
LimitCPU=infinity
# (virtual memory size)
LimitAS=infinity
# (locked-in-memory size)
LimitMEMLOCK=infinity
# (open files)
LimitNOFILE=64000
# (processes/threads)
LimitNPROC=64000
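After editing the unit file, reload systemd and restart mongod so the new limits take effect:

sudo systemctl daemon-reload
sudo systemctl restart mongod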

To adjust for the user:

# Edit the file below
/etc/security/limits.conf

And add to the user that is starting the mongod process:

# In this example, the user is mongo
mongo hard cpu  unlimited
mongo soft cpu  unlimited
mongo hard memlock unlimited
mongo soft memlock unlimited
mongo hard nofile 64000
mongo soft nofile 64000
mongo hard nproc 192276
mongo soft nproc 192276
mongo hard fsize unlimited
mongo soft fsize unlimited
mongo hard as unlimited
mongo soft as unlimited
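To confirm the limits a running mongod process actually picked up, you can inspect its limits file under /proc:

$ cat /proc/$(pidof mongod)/limits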

To improve performance, we can safely set the limit of processes for the super-user root to be unlimited. Edit the .bashrc file and add the following line:

# vi /root/.bashrc
ulimit -u unlimited

Log out and log back in for the change to take effect.

Network Stack

Several of the Linux kernel’s default network tunings are either not optimal for MongoDB, limit a typical host with 1000mbps (or better) network interfaces, or cause unpredictable behavior with routers and load balancers. I suggest increasing the relatively low throughput settings (net.core.somaxconn and net.ipv4.tcp_max_syn_backlog) and decreasing the keepalive settings, as seen below.

Make these changes permanent by adding the following to /etc/sysctl.conf (or a new file /etc/sysctl.d/mongodb-sysctl.conf – if /etc/sysctl.d exists):

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_probes = 6
Note: you must run the command /sbin/sysctl -p as root/sudo (or reboot) to apply this change.
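If the settings were placed in a dedicated file as suggested above, sysctl can also load that specific file directly:

sudo sysctl -p /etc/sysctl.d/mongodb-sysctl.conf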

NTP Daemon

All of these deeper tunings make it easy to forget about something as simple as your clock source. As MongoDB is a distributed system, it relies on consistent time across nodes, so the NTP daemon should run permanently on all MongoDB hosts, mongos instances and arbiters included.

This is installed on RedHat/CentOS with:

sudo yum install ntp
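After installing, make sure the daemon is enabled, running, and actually synchronizing (the service name ntpd is the usual one for this package on RHEL/CentOS with systemd):

sudo systemctl enable ntpd
sudo systemctl start ntpd
# List the peers the daemon is synchronizing with
ntpq -p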

MongoDB Settings

Journal commit interval

Values can range from 1 to 500 milliseconds (the default is 100 ms). Lower values increase the durability of the journal, at the expense of disk performance. Since MongoDB usually runs as a replica set, it is possible to increase this parameter to get better performance:

# edit /etc/mongod.conf
storage:
  journal:
    enabled: true
    commitIntervalMs: 300

WiredTiger cache

For dedicated servers, it is possible to increase the WiredTiger (WT) cache. By default, it uses 50% of (RAM - 1 GB), or 256 MB, whichever is larger. Set the value to 60-70% and monitor the memory usage. For example, to set the WT cache to 50GB:

# edit /etc/mongod.conf
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50

If there is a monitoring tool in place, such as Percona Monitoring and Management (PMM), it is possible to track the memory usage over time.
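Even without a monitoring dashboard, the configured and currently used cache size can be checked from the mongo shell via serverStatus (the field names below are the ones exposed in the wiredTiger.cache section):

db.serverStatus().wiredTiger.cache["maximum bytes configured"]
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]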

Read/Write tickets

WiredTiger uses tickets to control the number of read/write operations simultaneously processed by the storage engine. The default value is 128 and works well for most cases, but in some cases the number of tickets is not enough. To adjust it:

use admin
db.adminCommand( { setParameter: 1, wiredTigerConcurrentReadTransactions: 256 } )
db.adminCommand( { setParameter: 1, wiredTigerConcurrentWriteTransactions: 256 } )

https://docs.mongodb.com/manual/reference/parameters/#wiredtiger-parameters

To make the change persistent, add it to the MongoDB configuration file:

# Two options below can be used for wiredTiger and inMemory storage engines
setParameter:
    wiredTigerConcurrentReadTransactions: 256
    wiredTigerConcurrentWriteTransactions: 256

To estimate the right value, it is necessary to observe the workload behavior. Again, PMM is suitable for this situation.

Note that sometimes increasing the level of parallelism might lead to the opposite effect of what is desired on an already loaded server. At that point, it might be necessary to reduce the number of tickets to the number of CPUs/vCPUs available (if the server has 16 cores, set the read and write tickets to 16 each). This parameter needs to be extensively tested!
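Ticket usage can also be checked directly from the mongo shell; the concurrentTransactions section of serverStatus shows how many read and write tickets are in use ("out") and how many are still available:

db.serverStatus().wiredTiger.concurrentTransactions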

Pitfalls for mongos in containers

The mongos process is not cgroups-aware, which means it can blow up CPU usage by creating tons of TaskExecutor threads. Also, grouping containers in Kubernetes Pods ends up creating tons of mongos processes, resulting in additional overhead. The same applies to automation (using Ansible, for example), where DevOps engineers in general tend to create a pool of mongos processes.

To avoid pool explosion, set the parameter taskExecutorPoolSize in the containerized mongos by running it with the following argument, or by setting this parameter in a configuration file: --setParameter taskExecutorPoolSize=X, where X is the number of CPU cores you assign to the container (for example, ‘CPU limits’ in Kubernetes or ‘cpuset/cpus’ in Docker). For example:

$ /opt/mongodb/4.0.6/bin/mongos --logpath /home/vinicius.grippa/data/data/mongos.log --port 37017 --configdb configRepl/localhost:37027,localhost:37028,localhost:37029 --keyFile /home/vinicius.grippa/data/data/keyfile --fork --setParameter taskExecutorPoolSize=1

Or using the configuration file:

setParameter:
     taskExecutorPoolSize: 1

Conclusion

I tried to cover and summarize the most common questions and incorrect settings that I see in daily activities. Using the recommended settings is not a silver bullet, but it will cover the majority of cases and will help provide a better experience for those who use MongoDB. Finally, having a proper monitoring system in place must be a priority in order to adjust the settings according to the application workload.

Useful Resources

Finally, you can reach us through social networks or our forum.


by Vinicius Grippa via Percona Database Performance Blog
