Tactics to Scale Down Your MongoDB Deployment

“Get big fast” was a mantra in the second eCommerce boom. MongoDB’s success during its startup years was due in large part to it being the best database to support that.

But maybe, after your growth years, the data’s rent bill has become a little high. Big data was nice, but you absorbed a lot of it, and now you realize you’re only getting profit from a subset of it.

The obvious thing to do is to cut back. Especially if you’re renting your hardware or paying by the GB, those gigabyte-months cost a lot.

Changing providers is one way to cut costs, but this is a different sort of article. It will focus on how to put your deployment on a diet without changing datacenters.

1. Sharded Cluster: Use Cheaper ‘Cold Data’ Shards

Preliminary requirements for using this method:

  • There is a significant fraction of ‘cold’ data in your DB, and it’d be fine if it were served with higher latency on the few occasions it is accessed.
  • The cold documents in sharded collections are identifiable by shard key order. E.g., your shard key begins with the “creation_datetime” field, and you know that old records are rarely accessed.
  • Your MongoDB cluster is on self-managed cloud servers, not a MongoDB DBaaS. (The following suggestions should be an option for a DBaaS in theory, but I don’t think any afford it in practice.)

‘Cheaper’ for this article simply means servers with less RAM and CPU, and disks that have higher latency/lower throughput. The latter especially can lead to > 50% cost savings once you have terabytes of disk files to keep.

If you’re aware of MongoDB shard zones, it’s probably in the context of geo-zoning – that’s the only use most people associate with them.

But shard zones are a generic concept – another way to use them is to label shards that have different hardware.

In a nutshell: Add new shards on lower-spec, lower-cost servers, and give them a shared zone tag (e.g. “lowram_hdd”). Assign the shard key range covering the coldest data of your big sharded collections to that same tag. The MongoDB balancer will then move chunks in that range, and only that range, to the new shards. If the shard key proves suitable for separating hot from cold documents, the rare accesses on the new shards will be slower, but in such small numbers that it doesn’t hurt the aggregate latency SLA you need to meet.

Some care is required when doing this: just adding shards with the balancer on will, as usual, cause ordinary balancing to begin immediately (i.e., all shards will start sending chunks to the new shard(s)). You’ll want the balancer off while the shard configuration is happening; then double-check that it balances only the data you intended after you re-enable it. It’s also better to add as many shards as you intend from the start – if you add just one at first and a second later, half the data on the first will be rebalanced again onto the second.
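As a sketch of the sequence above, with hypothetical shard names (“shardCold1”, “shardCold2”), a hypothetical collection `mydb.orders` sharded on `{ creation_datetime: 1 }`, and an arbitrary cut-off date, the mongo shell steps might look like this (on v3.4+; older versions use the equivalent `sh.addShardTag()`/`sh.addTagRange()` helpers):

```javascript
// Pause the balancer before changing shard/zone configuration
sh.stopBalancer();

// Tag the two new low-spec shards with a shared zone name
sh.addShardToZone("shardCold1", "lowram_hdd");
sh.addShardToZone("shardCold2", "lowram_hdd");

// Route the coldest shard key range (oldest documents) to that zone
sh.updateZoneKeyRange(
  "mydb.orders",
  { creation_datetime: MinKey },                    // from the very beginning
  { creation_datetime: ISODate("2018-01-01") },     // up to the hot/cold cut-off
  "lowram_hdd"
);

// Re-enable the balancer; only chunks in that range migrate to the cold shards
sh.startBalancer();
```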

Once most of the data has been moved off the original high-spec (and high-cost) shards, you can use the removeShard command in three stages (begin the draining process, check progress, and finalize) to remove as many of those shards as you need.
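The three stages are all the same command, repeated; the `state` field in the response tells you where you are (the shard name here is hypothetical):

```javascript
// Stage 1: start draining chunks off the shard (run against the admin DB)
db.adminCommand({ removeShard: "shardHot3" });  // response: state "started"

// Stage 2: re-run the same command any time to check progress
db.adminCommand({ removeShard: "shardHot3" });  // state "ongoing", with remaining chunk counts

// If the shard is the primary shard for any unsharded databases, those must be
// moved first with movePrimary before draining can complete.

// Stage 3: once remaining counts reach zero, the same command finalizes removal
db.adminCommand({ removeShard: "shardHot3" });  // state "completed"
```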

2. Shrink a Replica Set: Reduce Memory, Use Cheaper Storage

If your latency has always been fine, you may already have wondered if you can scale down. Compared to other databases I know, MongoDB makes this relatively easy to experiment with: add new, smaller-spec nodes into a replica set and try out the performance.

The key thing that affects latency is the size of the active data set. 99.9% of it should be kept in the WiredTiger cache, with one or two 9’s added or removed depending on your sensitivity to latency.

There’s no way to pre-calculate the active data set size. The “[Bytes] Read into” metric in the “WiredTiger Cache Activity” graph (available in MongoDB Ops Manager, Atlas, and Percona Monitoring and Management) directly indicates how much has to be read from outside the WiredTiger cache. If this is low (say < 100MB/s even at peaks, and < 10MB/s on average) over hours, days, and weeks, I would take that as a good indication that 99.99% of the most active data is in the WiredTiger cache.

There’s a way to systematically find your desirable WiredTiger cache size. Before you even provision servers with smaller RAM sizes, reduce the storage.wiredTiger.engineConfig.cacheSizeGB value; a restart of the nodes is necessary to make it effective. (Tip: If it is unset when you start, that means it was the default – 50% of the RAM size.) The WiredTiger cache activity will run at maximum after each restart while all the data is paged back in, but it will taper off over time and reach a new steady state, which I suggest observing for a week. When a node’s “read into” cache metric starts having peaks > 100MB/s, or > 10MB/s average over the long term, you’ve probably gone a step too far and should go back up one step.
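If you want to spot-check that rate directly rather than through a monitoring graph, the raw counter behind it is the “bytes read into cache” field of `db.serverStatus().wiredTiger.cache`. A minimal mongo shell sketch (the function name is mine, not a built-in):

```javascript
// Sample the cumulative "bytes read into cache" counter twice and
// convert the difference into an MB/s rate over the interval.
function readIntoCacheMBps(intervalSecs) {
  var before = db.serverStatus().wiredTiger.cache["bytes read into cache"];
  sleep(intervalSecs * 1000);  // mongo shell's built-in sleep (milliseconds)
  var after = db.serverStatus().wiredTiger.cache["bytes read into cache"];
  return (after - before) / intervalSecs / (1024 * 1024);
}

// A sustained figure well under 10 MB/s suggests the active set fits in cache
readIntoCacheMBps(60);
```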

Don’t be concerned about the “Written from” cache metric being non-zero – any document update requires bytes to be written from cache on the way to disk persistence.

The “<= 100MB/s peaks” figure above is just my estimate, suitable for the medium RAM sizes and storage throughput in use today. I mainly wanted to introduce everyone to using the WiredTiger cache metrics so you can see the relationship between the WiredTiger cache size, active data set size, and “read into” cache activity. Once you’re comfortable with that concept, I suggest looking at the average DB operation latency instead of cache activity volume. Latency is usually the main SLA we’re aiming to achieve.

3. The Hard One – Deleting Data

It’s evident that if you want to use less space, you can delete data that isn’t being used. It’s so obvious that it’s surprising how rarely this is done.

Reasons why I think it isn’t done much:

  • Until recently, disk space was becoming 100x cheaper per GB every decade – why waste days of your time finding stuff to delete when the latest disk with ample space costs less than one day of your time? (This is pre-cloud thinking, though; the option to invest in hardware now to save money in the long term is disappearing these days.)
  • DBAs won’t do it until they’re given a request that makes the responsibility clear. They don’t know which users might still need which records or fields.
  • The application team, who can identify whether data is being used or not, get no credit for doing DB maintenance.
  • Compliance-enforced retention policies. You’d have to archive the data for 5 or 10 years, so you need an archive policy… too much hassle.

This is not a technology problem; it’s organizational. There’s no reason for anyone to solve this one unless they get recognized for it. Although this blog’s primary audience is DBAs, I suggest that the application team be given the responsibility for (and recognition for financial benefits of) pruning data.

4. Remove the Secondaries

When a Little Downtime is OK, and Data Can Be Rebuilt

From MongoDB basics, everyone knows a normal replica set uses three servers. You can save two-thirds of the server cost by keeping the data on only a single-node replica set or a standalone node.

You can do this in the following situation:

  • You can live without that database on any random day – the day a hypothetical server crash happens.
  • The database isn’t the one true source of the data within it; it is fed by external systems that can easily enough be rerun to refill it at any time.

One general category this applies to is MongoDB databases that are a middle stage in an ETL system, or otherwise used only to produce output in batches. E.g., reports built once a day for internal users don’t change that day’s business activities. If the database server went down and the reports weren’t run again until the next day, there would be some annoyed internal users, but no business risk to speak of.

“It’s Supposed to Be No-Downtime, But I’ll Risk it With A Single Node”

I’m putting this in here to say: No, you can’t. You can’t run a 24/7 MongoDB database on the basis that a crash is, say, a < 1% risk over several years, and that it isn’t worth tripling the server cost for that.

The reason is: common maintenance practice in MongoDB assumes replica set failover. You didn’t have this option with older databases – you had to shut them down during maintenance – but by now we’ve become accustomed to the near-100% uptime provided by MongoDB’s automatic failover. No secondaries means downtime when upgrading a version, moving servers, etc.

Even having two nodes (one primary and one secondary) won’t cut it. Take one out, and the other will refuse to be primary because it can’t get majority voting confirmation (<= 50% is not a majority; > 50% is). You need three nodes to get through a restart of one node while giving the client applications a way to fail over to another data-bearing node. And the third node can’t be an arbiter if applications use v4.0+ transactions, or anything else that needs w:majority writes.
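The majority arithmetic above is worth making concrete – it’s why two voting members are no better than one for maintenance:

```javascript
// Votes needed to elect a primary: strictly more than half of the
// voting members, i.e. floor(n/2) + 1.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// With 2 voting members the majority is 2: lose one node and no primary is possible.
console.log(majority(2)); // 2
// With 3 voting members the majority is still 2: one node can restart safely.
console.log(majority(3)); // 2
```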

There is a way to have downtime-free maintenance if you are in a cloud or other ad-hoc-provisioning environment: add two new secondaries only for the duration of the maintenance. Initial sync will take a long time if your data is large. Do the maintenance as you would with a normal three-node replica set, then rs.remove(..) the two extras and free up the servers you borrowed.
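With hypothetical hostnames, the borrow-and-return cycle is just a pair of rs.add()/rs.remove() calls around the maintenance window:

```javascript
// Borrow two servers and add them as normal voting, data-bearing secondaries
rs.add("tmp-node-1:27017");
rs.add("tmp-node-2:27017");

// Wait until rs.status() shows both in SECONDARY state (initial sync finished)
// before starting, then do the maintenance with normal rolling-restart failover.

// Afterwards, remove them and return the borrowed servers
rs.remove("tmp-node-1:27017");
rs.remove("tmp-node-2:27017");
```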


Options 1 & 2: There are a couple of ways to move your MongoDB big data onto cheaper servers/storage while staying in the same topology and keeping high availability. In a cluster, it depends on whether your shard key can be used to define the boundary between ‘cold’ and ‘hot’ data. Non-sharded replica sets that never have high out-of-cache read activity can be downsized to servers with less ample RAM.

Option 3: Outright document deletion is an obvious way to shrink disk space (or even remove whole shards in a cluster), but it’s rarely done. I advise giving recognition to the people who can sign off on data deletion as the first step to overcoming the organizational boundaries that usually prevent it.

Option 4: Lastly, there’s the go-solo-no-secondaries mode if you don’t need 24/7 uptime.

by Akira Kurogane via Percona Database Performance Blog