MongoDB 4.4 Is Coming Out Soon – What Does the Code Tell Us?

I’ve been involved with MongoDB for quite a while now. I follow it at the code level, and I’ve been reviewing the changes in the 4.3 development version, plus the recent 4.4 release candidate versions rc0 – rc7.

There are a few thousand GitHub commits, but at a human scale you’d say roughly 100 to 200 things have been patched.

From the above, I’ve created a summary of the changes that are going to affect you as a MongoDB DBA, or an application programmer, when you upgrade from v4.2 to v4.4.

But First: A Review of 2019’s v4.2 Release.

Last year’s release was a big one. Much of the code weight went into one thing – Distributed Transactions – but there were other notable new features.

User-Visible Features:

  • Distributed Transactions
  • Wildcard Indexes
  • Moderate advances in the aggregation pipeline
  • $currentOp added idleSessions and idleCursors options, and so (finally) became fully comprehensive
  • MMAP Storage engine removed

Drivers:

  • Field-level Encryption (beta)
  • (MongoDB Atlas only) Full-text indexing via Lucene

Internals:

  • FlowControl (a throttle on primary nodes)
  • Improved Index build
  • Improved Initial Sync
  • Config file REST/EXEC directives
  • Last balancer janitorial task purged from mongos node

Now: 2020’s MongoDB v4.4-rc* ‘Beta’ Releases

User-Visible Changes

  • Hedged Reads
  • refineCollectionShardKey (= add more fields, suffix end only)
  • 1 new Aggregation Stage: $unionWith
  • Aggregation Pipeline MapReduce: $accumulator, $function
  • AWS IAM Role-based Authentication (Enterprise only?)
  • Read-only transactions
  • Compound hashed shard keys

Admin and Internal Improvements

  • Resumable Initial Sync
  • initialSyncSourceReadPreference
  • Pre-warming:
    • Secondary nodes by using mirrored read ops
    • mongos node’s route tables, shard conns
  • Minimum Oplog Window setting
  • Indexes build simultaneously, and commit together, in replica sets
  • “logv2” Structured text and json logging
  • Background validate command
  • WiredTiger-level backup improvements

Comparing the Releases

My personal summary of the above is:

4.2  The Great Ascendency

  • Transactions in sharded, general-purpose DB is a big headline
  • (Driver-side) Field-level Encryption

4.4  Born to be Mild

  • Tuning to improve latency at cold starts or ad-hoc interruptions
  • ~5% feature increase in Aggregation Pipeline, + MapReduce refactor

“So you’re saying … 4.4 is a Maintenance Release?”

Yes. But I love it!

A general observation I make about the software industry is that bugs which cause time loss for admins, but not for developers, get postponed indefinitely while a company is trying to grow its market share.

v4.4 is eliminating some long-outstanding examples of those bugs. Tech debt is being paid back more than new features are increasing it.

I’ve been wishing MongoDB would do a mostly-maintenance major release since v3.4.

Wish fulfilled!

Old Problems Fixed in 4.4

Resumable Initial Sync

Previously-unsolved problem scenario:

Your old, weak server needs to be replaced by a better one because it can’t handle the recent high load. You would usually just add the new node as a replica and later step the old one down, but this fails because the primary takes too long to serve the data to the initial-syncing new node.

The two most common ‘Too Big To Initial Sync’ failures I’ve known:

  • Falling off the oplog by the time the cloning step finishes.
    Fixed in v3.4. (The oplog is now continually fetched to the destination during the sync.)
  • Taking so long that a transient network error probably happens, which invalidates the initial sync in progress.
    Now fixed in v4.4, thanks to resumable initial sync.
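
For what it’s worth, the retry window appears to be governed by a new initialSyncTransientErrorRetryPeriodSeconds server parameter (default 24 hours). Here is a minimal shell sketch to inspect it on the syncing node; treat the parameter name and its runtime settability as my assumption until the 4.4 documentation confirms them:

// Hedged sketch: check the transient-error retry window on the initial-syncing node
db.adminCommand({ getParameter: 1, initialSyncTransientErrorRetryPeriodSeconds: 1 })

// If it is runtime-settable, raise it to 48 hours; otherwise pass it with
// --setParameter at mongod startup instead
db.adminCommand({ setParameter: 1, initialSyncTransientErrorRetryPeriodSeconds: 172800 })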

initialSyncSourceReadPreference

Previously-unsolved problem scenario:

The primary is overloaded, but the secondaries are doing better (e.g. because they have no read load). Adding a new node fails because the primary can’t serve the full data for initial sync fast enough.

Usability Bug: You couldn’t change the sync source before or during the initial sync. (And it usually chooses the primary.)

Now, thanks to SERVER-38731, there is:

--setParameter initialSyncSourceReadPreference=<read preference mode>
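
The value is a normal read preference mode (primary, primaryPreferred, secondary, secondaryPreferred, or nearest). As a small sketch, under my assumption that the parameter is startup-only and visible through getParameter:

// Start the new node with, e.g.:
//   mongod --setParameter initialSyncSourceReadPreference=secondaryPreferred ...
// Then verify from a shell connected to that node:
db.adminCommand({ getParameter: 1, initialSyncSourceReadPreference: 1 })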

Minimum Oplog Window Config Setting

Previous risk scenario:

Let’s say you typically get 500MB of updates an hour. The 20GB oplog size (a size in bytes, not hours) you configured comfortably covers more than a 24-hour window, all the time.
But then demand goes up and you’re suddenly getting 100x as many updates; the oplog window shrinks to less than 30 minutes.

This period, with suddenly increased write load, is a risky time. You might not be able to do any maintenance operations without falling off the oplog before you can restart. And you’re probably being asked to “do something, anything!” to deal with the high-load incident, e.g. being pushed into doing server upscaling just when it’s riskier to do so.

You can dynamically resize the oplog (replSetResizeOplog) since v3.6, so you can quickly add the extra Oplog window time you need. But it would be nice if that could happen automatically – the trouble may be happening outside of normal working hours, after all.

Solution: new storage.oplogMinRetentionHours config option.
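
As a sketch of how the scenario above could be handled, assuming the runtime replSetResizeOplog command also accepts the new minRetentionHours field in 4.4 (the numbers are only illustrative):

// In mongod.conf at startup:
//   storage:
//     oplogMinRetentionHours: 24
//
// Or at runtime: keep the 20GB (20480MB) size but also enforce a 24-hour floor
db.adminCommand({ replSetResizeOplog: 1, size: 20480, minRetentionHours: 24 })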

“logv2” Structured Text and json Logging

Scenario:

  • You work in a job supporting MongoDB
  • You need to process hundreds of thousands of log lines for a mystery issue
  • MongoDB log files are semi-structured, i.e. syntactically unreliable to parse

A main cause of the problem was that command details were printed by a minimal BSON string formatter. Human readable, but unreliable for parsing.

No log diagnostic tool could permanently resolve this issue. These logs could never support reliable, automated services.

But now the (new) systemLog.logFormat option is "json" by default. So although you can make the log output look something like the old format:

2020-05-14T01:32:32.969+09:00 I COMMAND [conn6] command test.foo command test.foo command: find { find: "foo", filter: { $oid: "5ebc21120070b8d72d002586" …

By default it will now be like this instead:

{"t":{"$date":"2020-05-14T01:32:32.969+09:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn6","msg":"Slow query","attr":{"type":"command","ns":"test.foo", "appName":"MongoDB Shell", "command": {"find":"foo","filter": {"_id": {"$oid":"5ebc21120070b8d72d002586"}},"lsid":{"id":{"$uuid":"e442fa30-ba0c-4cf1-b208-c812…..

It’s fully-legit JSON! Much wooting from me. Tip for shell scripters: you should try the jq command-line tool. This is the output of "jq . <logfile>":

{
  "t": { "$date": "2020-05-14T01:32:32.969+09:00" },
  "s": "I",
  "c": "COMMAND",
  "id": 51803,
  "ctx": "conn6",
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "test.foo",
    "appName": "MongoDB Shell",
    "command": {
      "find": "foo",
      "filter": {
        "_id": {
          "$oid": "5ebc21120070b8d72d002586"
        }
      },
      ....

An Old Problem is Given a New Life

MapReduce was all the rage about a decade ago, and MongoDB has had a MapReduce framework built in since v1.8 (with forward-incompatible functionality even earlier).

I’m surprised it wasn’t removed already (e.g. by 4.2). The reasons I don’t like it:

  • Frankenstein’ed onto the side of the main QueryEngine code. Tech debt.
  • MongoDB’s MapReduce means running JavaScript server-side. That is much slower, and it feels like just a matter of time before a security hole opens up.
  • Almost no-one used it. (I have a bias in that my history is with big deployments. Maybe it was more popular with small database users.)

Something was needed to address needs that the standard CRUD commands couldn’t, but this was achieved far better by the Aggregation Pipeline. It has been present since v2.2 and was, in my opinion, fully fledged by v3.2 or v3.4.

It appears the map-reduce paradigm of slow, CPU-intensive server-side JavaScript will live on. See the new $function and $accumulator operators, which can be used in the $group, $bucket, and $merge stages, in the 4.4 documentation.

On one hand, it’s better off inside the aggregation pipeline framework than anywhere else.

But on the other hand: aggregation stages using the $function operator will defy optimization – the query planner design can’t make assumptions about the documents coming out of arbitrary JavaScript code. The pipeline will presumably treat them as ‘blocking’ stages, and those increase latency.
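
To make the new syntax concrete, here is a hedged sketch (the "articles" collection and its fields are invented for illustration): a $group using $accumulator to collect titles per author, followed by a $function expression. Note that both of these particular examples could be written without server-side JavaScript (e.g. with $push and $toUpper), which would be the optimizer-friendly route.

// Hedged sketch of the 4.4 $accumulator and $function syntax
db.articles.aggregate([
  {
    $group: {
      _id: "$author",
      titleList: {
        $accumulator: {
          init: function() { return []; },                        // per-group starting state
          accumulate: function(state, t) { state.push(t); return state; },
          accumulateArgs: ["$title"],                              // arguments fed to accumulate()
          merge: function(a, b) { return a.concat(b); },           // combine partial states
          finalize: function(state) { return state.join("; "); },  // shape the final value
          lang: "js"
        }
      }
    }
  },
  {
    $addFields: {
      shoutyAuthor: {
        $function: {
          body: function(name) { return name.toUpperCase(); },     // arbitrary server-side JS
          args: ["$_id"],
          lang: "js"
        }
      }
    }
  }
])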

All New in 4.4

refineCollectionShardKey

Changing a shard key until now has always meant dropping the collection and starting again. The sharding metadata across all the shard nodes, configsvr nodes, and mongos nodes can’t be changed in an instant. Nor, of course, can the data be instantly moved around to new shards according to the sort and partitioning of a new shard key.

And that will still be the state of play for most people when they realize they picked the wrong shard key. This is when they find most of their queries are scattered to all shards, rather than being targeted, or when they see indivisible jumbo chunks appearing. From my perspective, having done global support of MongoDB for years, this bad-news day has happened a lot; that is why documentation, presentations, and courses about MongoDB sharding all emphasize the importance of picking the right shard key at the start.

But – special case – do you think you can save yourself by adding a new field on the end of the shard key? E.g.

{"manufacturer": 1, "product_serial_no": 1} ->

{"manufacturer": 1, "product_serial_no": 1, "reseller_id": 1}

or

{"author": 1, "article_id": 1} ->

{"author": 1, "article_id": 1, "draft_version": 1}

If so, then v4.4’s refineCollectionShardKey has you covered. It’s obviously making use of the following logic, so that mongos nodes and shard nodes can switch over without a synchronous, everyone-at-once step.

  • Documents stay on exactly the same shards they’re already on.
  • The new index with the extra field on the end can satisfy all the needs of the old index.
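
The command itself is a one-liner run against a mongos, reusing the author/article example above (the database and collection names are mine). I believe an index covering the new, longer key has to exist first:

// Prerequisite (as I understand it): an index that supports the refined key
db.getSiblingDB("blog").articles.createIndex(
  { author: 1, article_id: 1, draft_version: 1 }
)

// Refine the shard key by appending the suffix field
db.adminCommand({
  refineCollectionShardKey: "blog.articles",
  key: { author: 1, article_id: 1, draft_version: 1 }
})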

Mirrored Reads

Do you want the reads to your primary to be forwarded to a random secondary node as well? And have the result just discarded immediately where it was made?

Even if your immediate response is “Of course bloody not” the real answer may be: Yes.

The effect, in a nutshell, is that the WiredTiger cache in RAM on the secondary will become warmed up with approximately the same document population the primary node is using. Without mirrored reads, a secondary node’s cache would only have documents that were updated through replication.

When an election happens, the unwarmed node that becomes primary would then be doing disk reads to answer a lot of the queries. I would estimate it takes tens of seconds to prime the cache with much of the active dataset needed. In that time, storage engine accesses that would normally be served from memory (typically 0.1ms for a single indexed lookup) instead block waiting for disk (typically 1 to 10ms, depending on the disk type). Operation latency will jump for a while, say ten times or worse.

If you have tight SLAs for latency and you’ve noticed that they are being broken after elections, then mirrored reads are going to hit the spot for you.
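
In the release candidates this is controlled by the mirrorReads server parameter, and mirroring appears to be on by default at a low sampling rate (0.01, i.e. 1%, if I’m reading it correctly; treat that default as an assumption). A sketch of inspecting and raising it:

// Check the current mirroring setting on the primary
db.adminCommand({ getParameter: 1, mirrorReads: 1 })

// Mirror 10% of eligible primary reads to the secondaries instead
db.adminCommand({ setParameter: 1, mirrorReads: { samplingRate: 0.10 } })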

$unionWith

It’s UNION, but for MongoDB. There’s little else that can be added about this new aggregation pipeline stage. I only want to point out that it can take a pipeline of its own, and you need this if you want to union pre-aggregated or pre-filtered result sets together.

To achieve a ‘UNION DISTINCT’ result, you will need to add a $group stage after the $unionWith stage. Logical really; you only need to remember there isn’t a single-keyword way to do it.
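
A quick sketch showing both points, with collection and field names of my own invention: the $unionWith stage carries a sub-pipeline to pre-filter the second collection, and the trailing $group provides the ‘UNION DISTINCT’ effect.

db.orders_2019.aggregate([
  { $match: { status: "shipped" } },
  { $unionWith: {
      coll: "orders_2020",                              // the second result set to append
      pipeline: [ { $match: { status: "shipped" } } ]   // pre-filter it with its own pipeline
  } },
  { $group: { _id: "$customer_id" } }                   // de-duplicate: UNION DISTINCT, not UNION ALL
])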

Hedged Reads

Use these if you’d rather accept a stale secondary read that is delivered quickly than a primary-node read that is delivered slowly. But only as a fallback, when there is no response from the primary node within the maxTimeMSForHedgedReads (default = 150ms) time span.

It doesn’t guarantee that all reads will be returned within a hard ceiling of execution time; the secondary read may also be slow. You are, however, buying a second chance that will usually be better, and that will improve your tail-end latency a lot.

Note: only supported in a sharded cluster, even though it would seem to be just as easily implementable for a non-sharded replica set.
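
As a sketch of how you might opt in from the shell, under my assumption that the 4.4 shell’s readPref() helper accepts hedge options as a third argument and that the mongos timeout parameter is runtime-settable:

// On the mongos: optionally tighten the hedging window (default 150ms);
// if not runtime-settable, pass this with --setParameter at mongos startup instead
db.adminCommand({ setParameter: 1, maxTimeMSForHedgedReads: 100 })

// Per-query opt-in to hedged reads (assumed readPref(mode, tagSet, hedgeOptions) signature)
db.orders.find({ status: "shipped" }).readPref("nearest", null, { enabled: true })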

Possible Wildcards Before the GA Release

The GA version is not out yet – announcements so far say simply “this summer”. It’s possible some things might yet be slipped in.

WiredTiger Hot Backup

The WiredTiger storage API makes it possible to create a hot backup – a copy of each *.wt file as of a single snapshot point in time. This is not exposed in MongoDB Community Edition, but it is provided in Percona Server for MongoDB with the createBackup command, which can save to the filesystem or to AWS S3. MongoDB Enterprise Edition has included a $backupCursor aggregation stage since 4.2, but it remains hidden as far as documentation is concerned.
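
For reference, a hedged sketch of what each looks like from the shell. The Percona createBackup call is as I remember their documentation; the Enterprise $backupCursor stage is undocumented, so its exact shape here is an assumption:

// Percona Server for MongoDB: hot backup of this node to a local directory
db.getSiblingDB("admin").runCommand({ createBackup: 1, backupDir: "/data/backups/node1" })

// MongoDB Enterprise (hidden/undocumented): open a backup cursor listing the files to copy
db.getSiblingDB("admin").aggregate([ { $backupCursor: {} } ])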

The WiredTiger updates merged into v4.4 now include further enhancements, such as the ability to get incremental changes between one snapshot point and another. This makes it possible to create a cluster-wide backup from each shard and the config server replica set, synchronized to the same point in time by the distributed logical clock.

Will MongoDB stop obfuscating the API so that MongoDB Community Edition can make hot backups? For now, we must wait until 4.4 is officially released and its documentation moves from “Upcoming” to “Current” status to find out.

Otherwise, the external tool Percona Backup for MongoDB appears to be the only free solution for making consistent cluster-wide backups.

New Storage Engine?

Since about v4.2 it has been feasible to develop a storage engine out of sight. Given the relatively narrow code surface between the common MongoDB storage API and the bindings to the external library (WiredTiger, RocksDB, etc.), the work could be done in a private fork easily enough.

There are two notable storage engine candidates out there being offered under normal open source licenses:

  • At the beginning of 2020, approximately a year’s worth of community work resulted in a v4.0-compatible MongoRocks engine being shared on MongoDB’s labs GitHub. Then in May, it was enhanced to support the extra timestamp needed for distributed transactions, i.e. the crucial test for 4.2 (and 4.4) compatibility looks to have been passed.
  • The Heterogeneous-memory Storage Engine was open-sourced recently, and was privately shared a couple of years before that, I believe. (Please see this blog for a hands-on with the prototype.)

Another possibility is that WiredTiger’s currently-suppressed LSM mode will be enabled alongside the default B-tree mode.

But either way, I don’t see this happening in v4.4’s release this summer.

Summary

Putting the MapReduce rebirth aside, several longstanding problems have been solved, without much expansion of new features. I love that, for a change, at least one year’s annual battle with technical debt hasn’t been won by The Debt.

But on the other hand, the core database server doesn’t seem to be reorganizing for anything big in the future.




by Akira Kurogane via Percona Database Performance Blog
