A MongoDB Prototype With a New Heterogeneous-Memory Storage Engine (HSE)


Introducing a New MongoDB Storage Engine

Q. What is the Heterogeneous-memory Storage Engine?

A key-value store library developed (and open-sourced) by Micron. It will work with any normal storage, but works especially well with emerging SSDs or NVDIMMs (or other Storage Class Memory) that contain even faster NVM media, such as bleeding-edge NAND or Optane / 3D XPoint.

Q. So it goes faster with faster NVM storage? Doesn’t everything?

No. Non-volatile memory storage devices don’t reach their maximum potential when used like classic block devices, and this will apply to future NVM storage products even more so.

This storage engine (or, more to the point, the “mpool” driver it uses) only accepts writes as blocks or as append-only log streams, which also become whole blocks as soon as they’re committed. It will work with the NVMe Zoned Namespaces (ZNS) spec when SSDs supporting it come out. It was also run on an NVDIMM earlier in development (a year or more ago?), with the caveat that no NVDIMM test has been re-run recently.

Beyond speed, there is also the matter of endurance. Bytes in this media, unlike on an HDD, cannot be rewritten in place – a page (say 4 KB ~ 16 KB) can only be written in one go, and has to be completely erased before being written to again. An application that modifies a page in n steps inadvertently causes n full-page writes and erasures in the SSD’s flash. For example, updating a 4 KB page three times between flushes costs three full program/erase cycles where one would have sufficed. If this is your software’s typical write pattern, your SSD’s endurance will be reduced by a factor of n. From what I gather reading database-related papers and presentations from the storage industry, the average n seems to be 3 ~ 6.

Q. Does this HSE storage engine beat WiredTiger in MongoDB?

Yes, especially when the write load is high. And so far this has been shown on SSDs that only support traditional block interfaces, without using ZNS SSDs or SCM. Please see the YCSB test case results.

But it was built a while ago, so no in a marketplace sense for now, because it is only available for the already-EOL’ed 3.4 version of MongoDB. There are no blockers to making a v3.6- or v4.0+-compatible version according to the developers, but it hasn’t been kept in sync with MongoDB storage API changes since it was developed a year or two ago.

Latency

The best feature of HSE MongoDB isn’t the improved average latency / higher throughput for heavy write loads. It is the much better tail latency.

A checkpointing storage engine such as WiredTiger will have high latency during checkpoints if a large volume of updates/inserts is written between one checkpoint and the next. It’s like hitting a road bump once per minute. This is not a WiredTiger bug – it is as well-tuned as it can be by default, and it affords further manual tuning as well. Periodic latency bumps are a property/symptom of any consistent data store that flushes completely at intervals rather than continuously.

When tail latency is your key SLA, MongoDB with HSE would definitely be better than WiredTiger for you. (Probably better than RocksDB too, so long as the compaction is tuned.)

Q. New driver – More admin work?

Although you have to install an extra driver and initialize an SSD to be used by it, I think the answer is no; it would end up reducing admin work for DBAs who will have to scale up their DB in the coming years. And isn’t that the case for the majority of database deployments?

This storage engine will enable better vertical scaling by using NVM storage that outperforms normal SSDs. That, in turn, will delay the day you have to start horizontal scaling (i.e. change to a sharded cluster). Or, if you are already sharded, it will reduce the number of shards needed.

Summary

Micron has created, open-sourced (and published to Red Hat repositories so far) a new driver and an associated key-value store library (“HSE”) that improve performance and endurance for non-volatile memory media types (including stock-standard SSDs).

The HSE library is an interesting project in its own right for key-value applications in general, but it was also wrapped as a MongoDB storage engine as a proof of concept. In throughput and average latency, this engine matches or exceeds WiredTiger performance depending on the load type. As a rough summary, it is several multiples faster for high write loads, and within plus or minus 10% in the low-write cases as far as I can see. The best point, though, is that the latency impact of WiredTiger-style checkpoints is avoided, and hence latency is more consistent.

How to Try MongoDB With HSE

Overview

You can build from source for yourself, or just install from packages already made for RHEL 7.7 or 8.1.

But either way, before you can run the modified v3.4 MongoDB binaries the following prerequisites must be completed (a condensed sketch follows the list).

  • A) Install mpool kernel module and util commands
  • B) Format/initialize one (or more) drives as a mpool
  • C) Install the HSE library, create an HSE KVDB on the mpool
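
As a condensed sketch of that whole flow (the package file names and the device path are taken from the detailed examples later in this document; substitute your own):

# A) install the mpool kernel module and userspace utilities
sudo dnf install -y mpool-kmod-1.7.0*.rpm mpool-1.7.0*.rpm mpool-devel-1.7.0*.rpm

# B) format/initialize a drive as a mpool
sudo mpool create mydb /dev/nvme0n1 uid=$(id -nu) gid=$(id -ng) mode=0600 capsz=32

# C) install the HSE library and create a KVDB on the mpool
sudo dnf install -y hse-1.7.0*.rpm hse-devel-1.7.0*.rpm
hse kvdb create mydb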

Installing HSE and Its mpool Dependencies

The HSE project’s wiki includes install instructions for its prerequisites: mpool-kmod, mpool, and mpool-devel.

https://github.com/hse-project/hse/wiki/Install-from-Packages

❗ Only supported on RHEL 7.7 or RHEL 8.1.

E.g. attempting to build from source on Ubuntu 18 hits the following make error in mpool-kmod:

akira:pvar_src$ cd mpool-kmod
akira:mpool-kmod$ make package
Makefile:168: *** invalid MPOOL_DISTRO (unknown unknown0 0 0 unsupported) . Stop.

E.g. 2: in RHEL 8.0 the mpool.ko module will fail to install with an “Invalid parameters” error.

RHEL 8.1 was used in this document’s examples.

Installing mpool-kmod From Package

Download the rpm package from https://github.com/hse-project/mpool-kmod/releases. In this case it was mpool-kmod-1.7.0-r107.20200416-4.18.0-147.el8.x86_64.rpm.

[ec2-user@ip-10-0-0-85 ~]$ sudo dnf install -y mpool-kmod-1.7.0*.rpm
..    
Dependencies resolved.
==========================================================================================
 Package        Arch       Version                      Repository                   Size
==========================================================================================
Installing:
 mpool-kmod     x86_64     1.7.0-r107.20200416.el8      @commandline                1.1 M
Installing dependencies:
 bzip2          x86_64     1.0.6-26.el8                 rhel-8-baseos-rhui-rpms      60 k
 ...
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                  1/1 
  Installing       : bzip2-1.0.6-26.el8.x86_64                                        1/3 
  Installing       : sos-3.7-8.el8_1.noarch                                           2/3 
  Running scriptlet: mpool-kmod-1.7.0-r107.20200416.el8.x86_64                        3/3 
  Installing       : mpool-kmod-1.7.0-r107.20200416.el8.x86_64                        3/3 
  Running scriptlet: mpool-kmod-1.7.0-r107.20200416.el8.x86_64                        3/3 
modprobe: ERROR: could not insert 'mpool': Permission denied

*** NOTE ***
... 

Installed:
  mpool-kmod-1.7.0-r107.20200416.el8.x86_64           bzip2-1.0.6-26.el8.x86_64          
  sos-3.7-8.el8_1.noarch                             

Complete!

❗ If this package fails to install the kernel module due to the “modprobe: ERROR: could not insert ‘mpool’: Permission denied” error (shown in the example mis-installation above), that is a known bug caused by a conflict with SELinux on some but not all distributions – at the moment it seems to affect some of those found in AWS. Run the sudo command below as a workaround. Confirm the “mpool” module is loaded by checking for it in the output of lsmod. You will need to repeat this after each restart (or automate it; see the sketch after the lsmod output below).

[ec2-user@ip-10-0-0-85 ~]$ sudo insmod /usr/lib/mpool/modules/mpool.ko
[ec2-user@ip-10-0-0-85 ~]$ lsmod | grep mpool
mpool                 315392  0
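
To avoid repeating that insmod manually after every restart, one option is a small systemd unit that performs it at boot. This is only a sketch under my own assumptions – the unit name is my invention, and the mpool packages don’t ship anything equivalent as far as I know – ordered to run before the mpool.service unit that the mpool package installs.

# /etc/systemd/system/mpool-insmod.service (hypothetical unit, not part of the packages)
[Unit]
Description=Load the mpool kernel module (modprobe/SELinux workaround)
Before=mpool.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/insmod /usr/lib/mpool/modules/mpool.ko

[Install]
WantedBy=multi-user.target

Enable it once with “sudo systemctl enable mpool-insmod.service” and the module will be inserted on each boot.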

Installing mpool and mpool-devel

Download the rpm packages from https://github.com/hse-project/mpool/releases. In this case they were mpool-1.7.0-r106.20200416.el8.x86_64.rpm and mpool-devel-1.7.0-r106.20200416.el8.x86_64.rpm.

[ec2-user@ip-10-0-0-85 ~]$ sudo dnf install -y mpool-1.7.0*.rpm
Last metadata expiration check: 0:17:48 ago on Thu 23 Apr 2020 01:45:40 PM UTC.
Dependencies resolved.
==========================================================================================
 Package       Architecture   Version                          Repository            Size
==========================================================================================
Installing:
 mpool         x86_64         1.7.0-r106.20200416.el8          @commandline         411 k

Transaction Summary
==========================================================================================
Install  1 Package

Total size: 411 k
...
Running transaction
  Preparing        :                                                                  1/1 
  Running scriptlet: mpool-1.7.0-r106.20200416.el8.x86_64                             1/1 
  Installing       : mpool-1.7.0-r106.20200416.el8.x86_64                             1/1 
  Running scriptlet: mpool-1.7.0-r106.20200416.el8.x86_64                             1/1 
Created symlink /etc/systemd/system/multi-user.target.wants/mpool.service → /usr/lib/systemd/system/mpool.service.

  Verifying        : mpool-1.7.0-r106.20200416.el8.x86_64                             1/1 

Installed:
  mpool-1.7.0-r106.20200416.el8.x86_64                                                    

Complete!

[ec2-user@ip-10-0-0-85 ~]$ sudo dnf install -y mpool-devel-1.7.0*.rpm
Last metadata expiration check: 0:17:58 ago on Thu 23 Apr 2020 01:45:40 PM UTC.
Dependencies resolved.
==========================================================================================
 Package           Architecture Version                          Repository          Size
==========================================================================================
Installing:
 mpool-devel       x86_64       1.7.0-r106.20200416.el8          @commandline       564 k

Transaction Summary
==========================================================================================
Install  1 Package

Total size: 564 k
Installed size: 4.4 M
...

Installed:
  mpool-devel-1.7.0-r106.20200416.el8.x86_64                                              

Complete!

Create a mpool Device and Test It

https://github.com/hse-project/mpool/wiki

https://github.com/hse-project/mpool/wiki/Create-and-Destroy (The briefer quickstart suggestions in the HSE KV store documentation at https://github.com/hse-project/hse/wiki/Configure-Storage are also sufficient.)

Before proceeding: Confirm that /dev/mpoolctl exists – if it doesn’t then mpool-kmod was not installed successfully.
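
A quick check – if the file is absent, revisit the mpool-kmod installation step above:

[ec2-user@ip-10-0-0-85 ~]$ ls -l /dev/mpoolctl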

This example shows a server with an as-yet unmounted, unformatted 1.7 TB disk, /dev/nvme0n1, which is the one that will be used by mpool. As there is only one disk for this test, I’ve skipped putting it under LVM.

Execute the command below with the “mpool” command-line tool to create a mpool device. The mpool device name “mydb” used here is chosen arbitrarily.

[ec2-user@ip-10-0-0-85 ~]$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   80G  0 disk 
└─xvda1 202:1    0   80G  0 part /
nvme0n1 259:0    0  1.7T  0 disk 
[ec2-user@ip-10-0-0-85 ~]$ sudo mpool create mydb /dev/nvme0n1 uid=$(id -nu) gid=$(id -ng) mode=0600 capsz=32
[ec2-user@ip-10-0-0-85 ~]$ mpool list
MPOOL    TOTAL    USED   AVAIL  CAPACITY  LABEL    HEALTH
mydb     1.73t   1.16g   1.64t     0.07%    raw   optimal
[ec2-user@ip-10-0-0-85 ~]$ ls -l /dev/mpool
total 0
crw-------. 1 ec2-user ec2-user 240, 1 Apr 23 14:08 mydb

Installing hse and hse-devel

Download the rpm packages from https://github.com/hse-project/hse/releases. In this case they were hse-1.7.0-r193.20200420.el8.x86_64.rpm and hse-devel-1.7.0-r193.20200420.el8.x86_64.rpm.

[ec2-user@ip-10-0-0-85 ~]$ sudo dnf install -y hse-1.7.0*.rpm
Last metadata expiration check: 0:36:46 ago on Thu 23 Apr 2020 01:45:40 PM UTC.
Dependencies resolved.
==========================================================================================
 Package          Arch      Version                      Repository                  Size
==========================================================================================
Installing:
 hse              x86_64    1.7.0-r193.20200420.el8      @commandline               7.4 M
Installing dependencies:
 libmicrohttpd    x86_64    1:0.9.59-2.el8               rhel-8-baseos-rhui-rpms     81 k
 userspace-rcu    x86_64    0.10.1-2.el8                 rhel-8-baseos-rhui-rpms    101 k
... 

Installed:
  hse-1.7.0-r193.20200420.el8.x86_64          libmicrohttpd-1:0.9.59-2.el8.x86_64         
  userspace-rcu-0.10.1-2.el8.x86_64          

Complete!

[ec2-user@ip-10-0-0-85 ~]$ sudo dnf install -y hse-devel-1.7.0*.rpm
Last metadata expiration check: 0:37:00 ago on Thu 23 Apr 2020 01:45:40 PM UTC.
Dependencies resolved.
==========================================================================================
 Package          Architecture  Version                         Repository           Size
==========================================================================================
Installing:
 hse-devel        x86_64        1.7.0-r193.20200420.el8         @commandline         75 k
...

Installed:
  hse-devel-1.7.0-r193.20200420.el8.x86_64                                                

Complete!

Test That an HSE KVDB Can Be Created

https://github.com/hse-project/hse/wiki/Create-a-KVDB

[ec2-user@ip-10-0-0-85 ~]$ hse kvdb create mydb
Successfully created KVDB mydb
[ec2-user@ip-10-0-0-85 ~]$ hse kvdb list
kvdbs:
- name: mydb
  label: raw

❗ The KVDB shares its name with the mpool device. I.e. whatever name you gave in the “mpool create” command must be used again here. The syntax “kvdb create <name>” suggests you’re choosing the name, and can do so arbitrarily, but you’re only specifying the encompassing mpool device. (A better syntax, in my opinion, would be “hse kvdb create --mpool <name>”.)

There are more operations that can be done, such as creating the KV stores within this KVDB (sketched below), but for the purpose of installation confirmation the above is enough.
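
For example, creating a KV store inside the KVDB looks like the following, going from my reading of the HSE 1.7 wiki – the “kvs1” name is an arbitrary illustration, and I haven’t re-verified this exact syntax:

[ec2-user@ip-10-0-0-85 ~]$ hse kvs create mydb/kvs1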

Time to Rename: The examples above have created a mpool device and HSE KVDB called “mydb”. The following section, on running MongoDB with the HSE storage engine, assumes it will be called “mongoData” instead. So now would be a good time to deactivate and rename this mpool if you want to follow the next section letter-for-letter.
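
Going by the mpool CLI’s subcommand set, that would be something like the following. I haven’t re-verified the deactivate/rename/activate syntax, so treat this as a sketch:

[ec2-user@ip-10-0-0-85 ~]$ sudo mpool deactivate mydb
[ec2-user@ip-10-0-0-85 ~]$ sudo mpool rename mydb mongoData
[ec2-user@ip-10-0-0-85 ~]$ sudo mpool activate mongoData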

MongoDB with HSE

https://github.com/hse-project/hse/wiki/MongoDB

Installing From Packages

See https://github.com/hse-project/hse/wiki/MongoDB#install-mongodb-with-hse-from-packages

Building hse-mongo From Source

There are instructions at https://github.com/hse-project/hse/wiki/MongoDB#compile-mongodb.

For those already familiar with building MongoDB, I’d summarize it like this (a condensed command sketch follows the list):

  • You build the “v3.4.17.1-hse-1.7.0” branch
  • libuuid-devel, lz4-devel, and openssl-devel are extra dependencies
  • --ssl will work in RHEL 7.7, but does not in RHEL 8.1 (for now at least).
  • The HSE wiki instructions install scons as a normal executable, which you would get as a yum, dnf, or pip package. By habit, I used the buildscripts/scons.py script already in the source code instead. I found I had to add “-D MONGO_VERSION=3.4.17” as an extra scons parameter to start the build.
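
Putting those bullet points together, my build boiled down to roughly the following. Treat it as an outline rather than a verified script – the repository URL and the “mongod mongo” scons targets are the standard ones, not copied from a session log:

sudo dnf install -y libuuid-devel lz4-devel openssl-devel   # the extra dependencies
git clone https://github.com/hse-project/hse-mongo.git
cd hse-mongo
git checkout v3.4.17.1-hse-1.7.0
python2 buildscripts/scons.py -D MONGO_VERSION=3.4.17 mongod mongo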

Configuration of the mongod Node

https://github.com/hse-project/hse/wiki/MongoDB#new-mongodb-options

N.b. the storage.hse.mpoolName option must match a mpool device you’ve already created, and an HSE KVDB needs to have been created on it. If you haven’t already done that, do so before proceeding. If you want, it is possible to change the mpool name (see Managing KVDB).

See the “New MongoDB Options” link into the HSE wiki above for an example of the arguments that must be set in the mongod.conf options file. You’ll probably be merging those into an existing configuration file template you use; beware that there are some comments that make it easy to miss the nested YAML levels. In particular the “engine:” and “hse:” + “mpoolName:” lines are meant to be under the “storage:” section, as in the fragment below.
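
For reference, the relevant fragment looks like this (the values are taken from the running example later in this document):

# mongod.conf fragment – note that "engine" and "hse" are nested under "storage"
storage:
  dbPath: /home/ec2-user/hse_dbroot/data
  engine: hse
  hse:
    mpoolName: mongoData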

Start the mongod Node

Launching

Common error warning: if the KVDB named in the storage.hse.mpoolName configuration option is missing, or the mpool name is wrong, the node will hit a fatal assertion and abort. In the mongod log file it will look like this:

2020-04-26T01:15:33.337+0000 I CONTROL  [initandlisten] MongoDB starting : pid=4236 port=27017 dbpath=
/home/ec2-user/hse_dbroot/data 64-bit host=ip-10-0-0-85.ap-northeast-1.compute.internal
2020-04-26T01:15:33.337+0000 I CONTROL  [initandlisten] db version v3.4.17
2020-04-26T01:15:33.337+0000 I CONTROL  [initandlisten] git version: 6ed3cfea4fc4aa42f7ae16d23df3f74c300478ec
2020-04-26T01:15:33.337+0000 I CONTROL  [initandlisten] ...
2020-04-26T01:15:33.337+0000 I CONTROL  [initandlisten] options: { config: "/home/ec2-user/hse_dbroot/mongod.conf", net: { bindIp: "0.0.0.0" }, processManagement: { fork: true }, replication: { oplogSizeMB: 32000, replSetName: "testrs" }, security: { authorization: "enabled", keyFile: "/home/ec2-user/hse_dbroot/keyfile" }, setParameter: { internalQueryExecYieldIterations: "100000", internalQueryExecYieldPeriodMS: "1000", replWriterThreadCount: "64" }, storage: { dbPath: "/home/ec2-user/hse_dbroot/data", engine: "hse", hse: { mpoolName: "mongoData" } }, systemLog: { destination: "file", path: "/home/ec2-user/hse_dbroot/mongod.log" } }
2020-04-26T01:15:37.427+0000 I -        [initandlisten] Invariant failure: st resulted in status InternalError: HSE Error: kvdb/kvdb_log.c:309: No data available - #61
 at src/mongo/db/storage/hse/src/hse_engine.cpp 474
2020-04-26T01:15:37.427+0000 I -        [initandlisten]

***aborting after invariant() failure

2020-04-26T01:15:37.444+0000 F -        [initandlisten] Got signal: 6 (Aborted).
 0x556695c013aa 0x556695c00cee 0x556695c00d86 0x7efd0b43edc0 0x7efd0a14d8df 0x7efd0a137cf5 0x556694f5b
7dc 0x556694ef7d33 0x55669586a9ce 0x55669587085d 0x55669587e00a 0x55669582fe89 0x556694fbbd31 0x7efd0a
139873 0x55669501213e
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55669468 ..... ..... ..... .....

If it starts OK, the following will be printed to stdout as it begins:

[ec2-user@ip-10-0-0-85 ~]$ ~/hse-mongo/build/opt/mongo/mongod -f ~/hse_dbroot/mongod.conf 
2020-04-26T01:35:08.619+0000 I STORAGE  [main] Mpool Name: mongoData
2020-04-26T01:35:08.619+0000 I STORAGE  [main] Force Lag: 0
2020-04-26T01:35:08.619+0000 I STORAGE  [main] HSE config path str: 
2020-04-26T01:35:08.619+0000 I STORAGE  [main] HSE params str: 
2020-04-26T01:35:08.619+0000 I STORAGE  [main] Collection compression Algo str: lz4
2020-04-26T01:35:08.619+0000 I STORAGE  [main] Collection compression minimum size  str: 0
about to fork child process, waiting until server is ready for connections.
forked process: 4299
child process started successfully, parent exiting

The mongod log in this MongoDB 3.4.17 + HSE 1.7.0 build contains nothing special when it starts normally. As of this version (April 2020), the only evidence in the log that the HSE storage engine is being used is the “[initandlisten] options” line that reflects the configuration options.

Post-Launch, Regular MongoDB Administration

If a standalone node, or the first node in a new replica set

The first time you connect with the mongo shell there will be no authentication or authorization enabled, so simply use the “mongo” shell without any parameters except the host (and even that can be omitted if it’s localhost:27017).

If this node has replication enabled, run rs.initiate() first.

[ec2-user@ip-10-0-0-85 ~]$ ~/hse-mongo/build/opt/mongo/mongo --host "mongodb://localhost:27017"
MongoDB shell version v3.4.17
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.4.17
> rs.initiate()
{
        "info2" : "no configuration specified. Using a default configuration for the set",
        "me" : "ip-10-0-0-85.ap-northeast-1.compute.internal:27017",
        "ok" : 1
}
testrs:OTHER> //wait a few secs; hit return to see PRIMARY rs state reflected in the prompt
testrs:PRIMARY>

It is of no concern to the HSE storage engine, but by habit this is when we create the first user in a new replica set or cluster, so let’s do that now.

testrs:PRIMARY> use admin
switched to db admin
testrs:PRIMARY> db.createUser({"user":  "mongoadmin", "pwd": "secret", "roles": ["clusterAdmin", "userAdminAnyDatabase", "readWriteAnyDatabase"]})
Successfully added user: {
        "user" : "mongoadmin",
        "roles" : [
                "clusterAdmin",
                "userAdminAnyDatabase",
                "readWriteAnyDatabase"
        ]
}

# Test the connection with username and password. Using the MongoDB URI syntax here.
# Also fine to use the --host + --username + --password (+ --authenticationDatabase admin) syntax if you prefer that.
[ec2-user@ip-10-0-0-85 ~]$ ~/hse-mongo/build/opt/mongo/mongo --host "mongodb://mongoadmin:secret@localhost:27017/"
MongoDB shell version v3.4.17
connecting to: mongodb://mongoadmin:secret@localhost:27017/
MongoDB server version: 3.4.17
testrs:PRIMARY>

Adding an HSE node to an existing replica set

This procedure is not specific to the HSE storage engine; it’s just a reminder of the standard MongoDB procedure.

If it is on a host:port that is new to the replica set, connect to the current primary and run rs.add(“…”) to include it (example below). If the HSE node is being started in place of an existing WiredTiger (or MMAPv1) node then nothing needs to be done other than starting it – the other nodes will notice it and share the replica set config, so long as the replica set name (i.e. the replication.replSetName config value) matches. It will replicate everything, including user authentication information, from the other nodes.
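
E.g., from a mongo shell connected to the current primary (the host:port here is a hypothetical placeholder):

testrs:PRIMARY> rs.add("new-hse-host.example.com:27017")
{ "ok" : 1 }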

Check the HSE Storage Engine Is in Effect

One way to confirm dynamically that a mongod is using HSE is to look for the presence of an “hse” child object in the db.serverStatus() output:

testrs:PRIMARY> db.serverStatus().hse
{
        "versionInfo" : {
                "hseVersion" : "1.7.0",
                "hseConnectorVersion" : "3.4.17.1",
                "hseGitSha" : "1.7.0-r193.20200420.el8.x86_64",
                "hseConnectorGitSha" : "6ed3cfea4fc4aa42f7ae16d23df3f74c300478ec"
        },
        "appBytes" : {
                "hseAppBytesRead" : NumberLong(55250),
                "hseAppBytesWritten" : NumberLong(36305)
        },
        "rates" : {
                "hseOplogCursorRead" : "DISABLED"
        }
}

by Akira Kurogane via Percona Database Performance Blog
