Migrating from DynamoDB into MongoDB
Say you need to move from DynamoDB to MongoDB; what options do you have? Unfortunately, at the time of this writing, there are no available tools to do this easily.
In this post, we will explore the available options and discuss some of the potential issues and solutions.
Deciding on the Approach
A tutorial on the AWS site describes how to use AWS Data Pipeline to extract the DynamoDB data and write text files to S3. This seems overly complex and does not really help with the import into MongoDB.
The console also lets you export rows to text format, but once you have more than a few rows, this is not a practical option.
Another way to deal with the problem is to write our own program, using the available APIs to read data from DynamoDB and write to MongoDB. NimoShake, from the Alibaba group, does this, but unfortunately it is not very well documented. A similar approach is to use NodeJS code.
Let’s dig a bit deeper into the ad-hoc approach, as it seems like the best alternative. One thing to keep in mind as we get into the terabyte(s) range is that a single-threaded approach won’t be enough to copy the data in a reasonable amount of time. It is also desirable to be able to pause and resume in case of errors, or if you need to make some tweaks.
Extracting Data from DynamoDB
The way to read all of a table’s data in DynamoDB is by using the Scan operation, which is similar to a full table scan in relational databases.
DynamoDB uses the concept of a read capacity unit: one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size. You define (and pay for) provisioned capacity on a per-table basis. Requests are throttled if you exceed that value, so we need to rate-limit reads in order to stay within the provisioned capacity (taking into account not only this Scan operation but any pre-existing application traffic as well).
No problem – we just set a very high read capacity, right? Not quite. Even though DynamoDB distributes a large table’s data across multiple physical partitions, a Scan operation can only read from one partition at a time. That is probably not enough throughput to extract a multi-TB table in a reasonable amount of time.
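As a rough illustration of the pacing idea (the capacity budget and table name below are placeholders, not values from this migration), the Scan API can report how many capacity units each request consumed, and we can sleep accordingly before issuing the next one:
// Sketch only: pace Scan requests against a self-imposed RCU budget.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

const RCU_BUDGET_PER_SEC = 100; // leave headroom for live application traffic

async function pacedScan(params) {
  const page = await dynamodb.scan({
    ...params,
    ReturnConsumedCapacity: 'TOTAL',
  }).promise();
  const used = page.ConsumedCapacity.CapacityUnits;
  // Sleep long enough that this request's consumption fits inside the budget.
  await new Promise((resolve) => setTimeout(resolve, (used / RCU_BUDGET_PER_SEC) * 1000));
  return page;
}

pacedScan({ TableName: 'my_table' }).then((page) => console.log(page.Items.length));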
Speeding Things Up
What we need is called a parallel scan. Basically, we split the table into segments, with one Scan operation reading from each; luckily, the API supports this out of the box via the Segment and TotalSegments parameters.
Data is returned in 1 MB chunks, so we need to deal with pagination. A LastEvaluatedKey is returned with each “page”, containing the primary key (partition key, plus sort key if the table has one) of the last item returned.
To iterate, we construct a new Scan request with the same parameters as the previous one, except that this time we pass the LastEvaluatedKey as the ExclusiveStartKey parameter. This also lets us pause a Scan operation and resume it later from a known position.
I did find some odd behavior: sometimes, when reading the next page, I would get the last document of the previous “page” again, as the first result of the new one. I am not sure of the reason, as I could not find any documentation saying that duplicates are possible.
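Putting those pieces together, here is a minimal sketch of one worker of a parallel scan (the segment count, table name, and processPage callback are assumptions for illustration). Each worker loops over its own segment using ExclusiveStartKey, and the last key it saw can be persisted somewhere to support pause/resume:
// Sketch only: one worker of a parallel scan, resumable via ExclusiveStartKey.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

async function scanSegment(tableName, segment, totalSegments, processPage, startKey) {
  let lastKey = startKey;               // pass a previously saved key to resume
  do {
    const page = await dynamodb.scan({
      TableName: tableName,
      Segment: segment,
      TotalSegments: totalSegments,
      ExclusiveStartKey: lastKey,
    }).promise();

    await processPage(page.Items);      // e.g. convert types and insert into MongoDB
    lastKey = page.LastEvaluatedKey;    // undefined once the segment is exhausted
    // Persist lastKey here if you want to be able to pause and resume later.
  } while (lastKey);
}

// Example: 8 workers scanning the (hypothetical) "my_table" concurrently.
const TOTAL_SEGMENTS = 8;
Promise.all(
  Array.from({ length: TOTAL_SEGMENTS }, (_, i) =>
    scanSegment('my_table', i, TOTAL_SEGMENTS, async (items) => { /* write items */ })
  )
);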
Data Type Conversions
There is no 1-to-1 mapping between DynamoDB and MongoDB data types. We either accept the types that the driver assigns for us, or force the conversion. For example, we may have some Unix timestamp values stored as Number in DynamoDB, but want to store them as Date in MongoDB. Also, since JavaScript/NodeJS dates are measured in milliseconds internally, we have to multiply the values by 1000 (recall that “normal” Unix timestamps are measured in seconds). Here’s how to do the conversion and enforce the type:
item.created = new Date(item.created * 1000); // seconds -> milliseconds
Another issue is that some of the fields in DynamoDB documents may contain binary-encoded data, stored as a BinarySet. We have to convert these to String before feeding them to the MongoDB driver, in order to avoid double-encoding issues. Here’s how to do it:
item.data = item.data.toString('base64'); // keep the binary payload as a base64 string
Dealing with Primary Keys
DynamoDB tables can have a composite primary key, made up of a Partition Key and an (optional) Sort Key. We have to figure out how to translate this into MongoDB’s _id primary key.
If we don’t specify an _id value on insert, MongoDB will generate one for us dynamically. I think a better approach is to generate it ourselves so that a record in DynamoDB will always map to the same _id in MongoDB. If we need to re-run the data load, we won’t get duplicate items that way.
One way to do it is to concatenate the partition and sort keys into a comma-separated string, as follows:
const pk = { '_id': item.partition_key.concat(",").concat(item.sort_key) };
lodash.assign(item, pk);
If you are not familiar with lodash, it is a utility library that makes it easier to operate on complex data types. In this case we are using it to add a new key/value pair (the _id: something) to the document.
Writing Data into MongoDB
MongoDB’s default configuration is most probably not optimized for the heavy write profile of this kind of batch insert job, so some tuning might be required. I also recommend that you shard extensively, to keep data sizes manageable and sustain good insert rates by spreading the load. My rule of thumb is to keep a maximum of 1 TB per shard, ideally half of that. That might not seem like much, but think about the benefits for backup/restore and maintenance operations (like index creation).
We need to pre-split the collections to ensure an even distribution of both the write load and the data. Otherwise, you will see all inserts going to a single shard, and the balancer will need a really long time to re-distribute everything afterwards.
If you can, use a hashed shard key. This has the benefit of distributing the chunks evenly right away; if you don’t, the balancer can take a long time to do this job, even on an empty collection.
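For example (the database, collection, and chunk count below are placeholders), in the mongo shell, sharding on a hashed _id and requesting an initial number of chunks pre-creates and spreads the chunks before the load starts:
// Run against the sharded cluster; names and chunk count are placeholders.
sh.enableSharding("target_db");
db.adminCommand({
  shardCollection: "target_db.items",
  key: { _id: "hashed" },
  numInitialChunks: 1024   // pre-create chunks so writes spread out immediately
});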
In order to write efficiently, we have to make use of bulk inserts. In this case, the code was indeed using bulk writes, but it was doing upserts, which means existing documents on the target side get overwritten if they already exist. Instead, we can run plain bulk inserts with ordered:false: an insert that hits an existing _id fails with a duplicate key error we can simply ignore, while the rest of the batch still goes through.
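With the MongoDB NodeJS driver, that looks roughly like this (the connection string, database, and collection names are placeholders):
// Sketch only: unordered bulk insert with the MongoDB NodeJS driver.
const { MongoClient } = require('mongodb');

async function insertBatch(docs) {
  const client = new MongoClient('mongodb://localhost:27017'); // placeholder URI
  await client.connect();
  try {
    await client.db('target_db').collection('items')
      .insertMany(docs, { ordered: false });  // don't stop on duplicate key errors
  } catch (err) {
    // Duplicate key errors (code 11000) are expected when re-running the load;
    // anything else should be investigated.
    console.warn('insertMany reported errors:', err.message);
  } finally {
    await client.close();
  }
}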
Potential Improvements
Ideally, we should decouple the extract phase from the apply phase, which is how most replication solutions work. This has the drawback of requiring significant disk space to store the extracted data, but the benefit is that we can adjust the level of parallelism independently for the extract and apply processes.
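A very rough sketch of what the extract side could look like (the file layout here is arbitrary): each scanned page gets dumped as newline-delimited JSON, and a separate loader process reads those files and runs the bulk inserts at its own pace and level of parallelism.
// Sketch only: write each scanned page to disk as newline-delimited JSON,
// so a separate loader process can apply it to MongoDB independently.
const fs = require('fs');

fs.mkdirSync('./dump', { recursive: true });

function dumpPage(segment, pageNumber, items) {
  const path = `./dump/segment-${segment}-page-${pageNumber}.ndjson`;
  fs.writeFileSync(path, items.map((i) => JSON.stringify(i)).join('\n'));
}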
Finally, in most projects it is desirable to minimize downtime. In order to keep the existing DynamoDB application running while we do the data copy, we need to have some kind of live replication solution. I suggest you check out Corrado’s recent post Real-Time Replication From DynamoDB to MongoDB for details on how to solve that problem.
Final Words
Even if some assembly is required, it is possible to move data from DynamoDB to MongoDB. Care must be taken to not affect the existing application, by sizing read capacity units properly on the DynamoDB side. On the MongoDB side, you need to shard properly and pre-split the collections to spread the writes evenly across all shards. Also don’t forget to create the indexes!
If you are interested, I actually forked the NodeJS code from the post mentioned above to address some of the issues discussed in this post.
by Ivan Groenewold via Percona Database Performance Blog