PMM 101: Troubleshooting MongoDB with Percona Monitoring and Management
Percona Monitoring and Management (PMM) is an open-source tool developed by Percona that allows you to monitor and manage your MongoDB, MySQL, and PostgreSQL databases. This blog will give you an overview of troubleshooting your MongoDB deployments with PMM.
Let’s start with a basic understanding of the architecture of PMM. PMM has two main architectural components:
- PMM Client – Client that lives on each database host in your environment. It collects server metrics, system metrics, and database metrics used for Query Analytics.
- PMM Server – Central part of PMM that the clients report all of their metric data to. It also presents dashboards, graphs, and tables of that data in its web interface for visualization.
For more details on the architecture of PMM, check out our docs.
Query Analytics
PMM Query Analytics ("QAN") allows you to analyze MongoDB query performance over periods of time. In the below screenshot you can see that the longest-running query was against the testData collection.
If we drill deeper by clicking on the query in PMM we can see exactly what it was running. In this case, the query was searching in the testData collection of the mpg database looking for records where the value of x is 987544.
This is very helpful in determining what each query is doing, how much it is running, and which queries make up the bulk of your load.
The output is from db.currentOp(), and I agree it may not be clear at a glance what the application-side (or mongo shell) command was. This is a limitation of the MongoDB API in general: the drivers send the request with perfect functional accuracy, but it does not necessarily resemble what the user typed (or programmed). With that understanding, and by focusing first on what the "command" field contains, it is not too hard to picture a likely original format. For example, the query above could have been sent by running "use mpg; db.testData.find({"x": {"$lte": …, "$gt": …}}).skip(0)" in the shell. The final ".skip(0)" is optional as it is 0 by default.
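If you want to cross-check what is running on the server right now, db.currentOp() also accepts a filter document. The sketch below is only illustrative; the 5-second threshold and the mpg namespace are values chosen for this example:

  // Show active operations against the mpg database running for 5+ seconds
  db.currentOp({
    "active": true,
    "secs_running": { "$gte": 5 },
    "ns": /^mpg\./
  })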
Additionally, you can see the full explain plan just as you would by adding .explain() to your query. In the below example we can see that the query did a full collection scan on the mpg.testData collection, so we should think about adding an index on the 'x' field to improve the performance of this query.
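As a rough sketch from the shell (reusing the database, collection, and field names from the example above), you could reproduce the explain output and add the suggested index like this:

  // Run the query with executionStats to confirm the COLLSCAN stage
  db.testData.find({ "x": 987544 }).explain("executionStats")

  // Create an ascending index on 'x' so the query can use an index scan instead
  db.testData.createIndex({ "x": 1 })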
Metrics Monitor
Metrics Monitor allows you to monitor, alert on, and visualize different metrics related to your database overall, its internal workings, and the systems it is running on.
Overall System Performance View
The first view that is helpful is your overall system performance view. Here you can see, at a high level, CPU and RAM usage, the amount of reads and writes from disk, network bandwidth, the number of database connections, database queries per second, and the uptime for both the host and the database. This view can often lead you to the problematic node(s) if you're experiencing any issues, and it also gives you a high-level picture of the overall health of your monitored environment.
WiredTiger Metrics
Next, we'll start digging into some of the database internal metrics that are helpful for troubleshooting MongoDB. These metrics mostly come from the WiredTiger storage engine, which has been the default storage engine for MongoDB since MongoDB 3.2. In addition to the metrics I cover, there are more documented here.
The WiredTiger storage engine uses tickets as a way to handle concurrency. The default is for WiredTiger to have 128 read and 128 write tickets. PMM allows you to alert when your available tickets are getting low, and you can correlate ticket usage with other metrics to understand why so many tickets are being utilized. The graph sample below shows a low-load situation: only ~1 ticket out of 128 was checked out at any time.
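If you want to see the same numbers directly on the server, db.serverStatus() exposes ticket usage under wiredTiger.concurrentTransactions on the MongoDB versions current as of this writing (the exact location can vary between versions); a minimal shell sketch:

  // Read vs. write tickets: 'out' is in use, 'available' is what remains
  var tickets = db.serverStatus().wiredTiger.concurrentTransactions;
  printjson({
    readOut: tickets.read.out,
    readAvailable: tickets.read.available,
    writeOut: tickets.write.out,
    writeAvailable: tickets.write.available
  });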
One thing that can cause a large number of tickets to be in use is high checkpoint times. WiredTiger, by default, does a full checkpoint at least every 60 seconds; this is controlled by the WiredTiger parameter checkpoint=(wait=60). Checkpointing flushes all the dirty pages to disk. (By the way, 'dirty' is not as bad as it sounds – it's just a storage engine term meaning 'not committed to disk yet'.) High checkpointing times can lead to more tickets being in use.
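Checkpoint timing is also visible in db.serverStatus() under the wiredTiger.transaction section. The statistic names below are taken from typical WiredTiger output and can differ between MongoDB releases:

  // Checkpoint timing as reported by WiredTiger (values in milliseconds)
  var txn = db.serverStatus().wiredTiger.transaction;
  print("most recent checkpoint time (ms): " + txn["transaction checkpoint most recent time (msecs)"]);
  print("max checkpoint time (ms): " + txn["transaction checkpoint max time (msecs)"]);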
Finally, we have the WiredTiger Cache Activity metrics. WiredTiger cache activity indicates the amount of data being read into or written from the cache. These metrics can help you baseline your normal cache activity, so you can notice if you have a large amount of data being read into the cache, perhaps from a poorly tuned query, or a lot of data being written from the cache.
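The underlying counters can also be read from db.serverStatus(); a small sketch (these are cumulative byte counts since the server started, and key names can vary slightly between versions):

  // Cumulative WiredTiger cache activity since server start (bytes)
  var cache = db.serverStatus().wiredTiger.cache;
  print("bytes read into cache:    " + cache["bytes read into cache"]);
  print("bytes written from cache: " + cache["bytes written from cache"]);
  print("bytes currently in cache: " + cache["bytes currently in the cache"]);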
Database Metrics
PMM also has database metrics that are not WiredTiger-specific. Here we can see the uptime for the node, queries per second, latency, connections, and the number of cursors. These are higher-level metrics that can be indicative of a larger problem such as connection storms, storage latency, or excessive queries per second. They can help you home in on potential issues with your database.
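Most of these graphs are fed by counters you can also read from the shell; a quick sketch of the relevant db.serverStatus() sections:

  // High-level server counters behind these graphs
  var status = db.serverStatus();
  print("uptime (seconds): " + status.uptime);
  printjson(status.connections);          // current and available connections
  printjson(status.opcounters);           // inserts, queries, updates, deletes, getmores, commands
  printjson(status.metrics.cursor.open);  // open cursors (total, pinned, noTimeout)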
Node Overview Metrics
System metrics can point you towards an issue at the O/S level that may or may not correlate to your database. CPU, CPU saturation, core usage, disk I/O, swap activity, and network traffic are some of the metrics that can help you find issues that may start at the O/S level or below. Additional metrics beyond those shown below can be found in our documentation.
Takeaways
In this blog, we've discussed how PMM can help you troubleshoot your MongoDB deployment. Whether you're looking at WiredTiger-specific metrics, system-level metrics, or database-level metrics, PMM has you covered. Thanks for reading!
Additional Resources:
Download Percona Monitoring and Management
PMM for MongoDB Quick Start Guide
by Mike Grayson via Percona Database Performance Blog