Q&A on MongoDB backups on Amazon EC2

I recently got this question from one of my readers and I'm posting my response here for future reference:

I was impressed by your post about MongoDB because I have a similar setup at my company and I was thinking maybe you could give me some advice.

We have production servers with MongoDB 2.6 and a replica set. /data, /journal and /log are all separate EBS volumes. I wrote a script that takes a snapshot of the production secondary's /data volume every night. The /data volume is 600GB and it takes 8 hours to snapshot using the AWS snapshot tool. In the morning I restore that snapshot to the QA environment MongoDB; it takes 1 minute to create a volume from the snapshot and attach it to the QA instance. Now my boss is saying that taking a snapshot of a running production MongoDB drive might lead to inconsistent and invalid data. I found on the internet that db.fsyncLock would solve the problem. But what happens if you apply fsyncLock on a secondary (replica set) for 8 hours, nobody knows.
We store all data (data+journal+logs) on the same EBS volume. That’s also what the MongoDB documentation suggests: “To get a correct snapshot of a running mongod process, you must have journaling enabled and the journal must reside on the same logical volume as the other MongoDB data files.” (that doc is for 3.0 but it also applies to 2.x)
I suggest that you switch to having data+journal on the same EBS volume; after that you should be just fine doing snapshots. The current GP2 SSD disks allow large volumes and a good amount of IOPS, so you might be able to get away with having just one EBS volume instead of combining several volumes together with LVM. If you end up using LVM, make sure that you use the LVM snapshot sequence which I described in my blog http://www.juhonkoti.net/2015/01/26/tips-and-caveats-for-running-mongodb-in-production
I also suggest that you do snapshots more often than just once per night. EBS snapshots store only the blocks that have changed since the previous snapshot, so the more often you do snapshots, the faster each snapshot will be created. We do it once per hour.
Also, after the EBS snapshot API call has completed and the snapshot process has started, you can resume all operations on the disk which was just snapshotted. In other words: the data is frozen at an atomic moment during the EBS snapshot API call, and the snapshot will contain exactly the data as it was at that moment. The snapshot progress only tells you when you can restore a new EBS volume from that snapshot, and your volume IO performance is degraded a bit while the snapshot is being copied to S3 behind the scenes.
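For example, hourly snapshots can be driven by a single cron entry. This is only a sketch; the volume ID below is a placeholder and it assumes the AWS CLI is installed, configured and on cron's PATH:
# crontab entry: take an EBS snapshot of the data volume at the top of every hour
0 * * * * aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "hourly mongodb data snapshot"
The call returns as soon as the snapshot has been initiated, so the volume keeps serving traffic while the copy to S3 runs in the background.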
If you want to use fsyncLock (which, by the way, should not be required if you use the MongoDB journal) then implement the following sequence and you are fine:
  1. fsyncLock (db.fsyncLock())
  2. XFS freeze (xfs_freeze -f /mount/point)
  3. EBS snapshot
  4. XFS unfreeze (xfs_freeze -u /mount/point)
  5. fsyncUnlock (db.fsyncUnlock())
The entire process should not take more than a dozen or so seconds.
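
Here is a minimal shell sketch of that sequence. The mount point and volume ID are placeholders, and the mongo shell is assumed to be connecting to the local secondary:

# 1. flush writes and block new ones
mongo --eval "db.fsyncLock()"
# 2. freeze the filesystem so no new IO hits the volume
xfs_freeze -f /data
# 3. initiate the EBS snapshot; the call returns once the snapshot has started
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "mongodb data snapshot"
# 4. unfreeze the filesystem
xfs_freeze -u /data
# 5. release the mongod write lock
mongo --eval "db.fsyncUnlock()"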


Tips and caveats for running MongoDB in production

A friend of mine recently asked about tips and caveats when he was planning a production MongoDB installation. It’s a really nice database for many use cases, but like every other database it has its quirks. As we have been running MongoDB for several years we have encountered quite a few bugs and issues with it. Most of them have been fixed over the years, but some still persist. Here’s my take on a few key points worth knowing:

Slave with replication lag can get slaveOk=true queries

MongoDB replication is asynchronous. The master stores every operation into an oplog, which the slaves read one operation at a time and apply to their own data. If a slave can’t keep up it will fall behind and thus won’t contain all the updates which the master has already seen. If you are running software which does queries with slaveOk=true then mongos and some of the client drivers can direct those queries to one of the slaves. Now if your slave is lagging behind with its replication, there’s a very good chance that your application will get older data and might end up corrupting your data set logically. A ticket has been acknowledged but not scheduled for implementation: 3346.
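
Before deciding how to handle this, it helps to know how far behind a secondary actually is. A rough check from the shell (a sketch; it assumes you can connect to the replica set from the local mongo shell):

mongo --quiet --eval '
  var s = rs.status();
  var primary = s.members.filter(function (m) { return m.stateStr == "PRIMARY"; })[0];
  if (!primary) { print("no primary found"); quit(1); }
  s.members.forEach(function (m) {
    if (m.stateStr == "SECONDARY") {
      // difference between the primary and secondary optimes, in seconds
      print(m.name + " lag: " + (primary.optimeDate - m.optimeDate) / 1000 + "s");
    }
  });'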

There are two options: you can dynamically check the replication lag in your application and program it to drop the slaveOk=true property in that case, or you can reconfigure your cluster and hide the lagging slave so that mongos will not route slaveOk queries to it. This brings us to the second problem:

Reconfiguring the cluster often causes it to drop the primary for 10-15 seconds.

There’s really no other way of saying this, but this sucks. There are a number of operations which still, after all these years, cause the MongoDB cluster to throw its hands in the air, drop the primary from the cluster and completely rethink who should be the new master – usually ending up keeping the exact same master it had before. There have been numerous Jira issues about this but they’re half closed as duplicates and half resolved: 6572, 5788, 7833 plus more.

Keep your oplog big enough, but cluster size small enough.

If your database is getting thousands of updates per second, the time window the oplog can hold will start to shrink. If your database is also getting bigger, some operations might take so long that they can no longer complete within the oplog time window. Repairs, relaunches and backup restores are the main problems. We had one database with a 100GB oplog which could hold just about 14 hours of operations – not nearly enough to let the ops guys sleep well. Another problem is that in some cases the oplog will mostly live in active memory, which penalizes overall database performance as the hot cacheable data set shrinks.
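
You can check how much time your oplog currently covers straight from the shell, run against any replica set member:

# prints the configured oplog size and the time span between the first and last oplog entries
mongo --quiet --eval 'db.printReplicationInfo()'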

Solutions? Either manually partition your collections into several distinct MongoDB clusters or start using sharding.

A word on backups

This is not a MongoDB-specific issue; backups can be hard to implement in general. After a few tries, here’s our way, which has served us really well: We use AWS, so we’re big fans of the provisioned IOPS volumes. We attach several EBS volumes to the machine, keeping each volume under 300GB if possible so that AWS EBS snapshots won’t take forever. We then use LVM with striping to combine the EBS volumes into one LVM Volume Group. On top of that we create a Logical Volume which spans 80% of the available space and we create an XFS filesystem on it. The remaining 20% is left both for the LVM snapshots used during backups and as emergency space if we need to quickly enlarge the volume. XFS allows growing the filesystem without unmounting it, right on a live production system.
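
As a rough sketch of that volume layout (device names, volume group name, stripe size and mount point below are placeholders):

# three EBS volumes striped into one volume group
pvcreate /dev/xvdf /dev/xvdg /dev/xvdh
vgcreate mongo_vg /dev/xvdf /dev/xvdg /dev/xvdh
# use 80% of the space for the data volume, leave the rest for LVM snapshots and growth
lvcreate --name mongo_data --extents 80%VG --stripes 3 --stripesize 256 mongo_vg
# put XFS on it and mount
mkfs.xfs /dev/mongo_vg/mongo_data
mount /dev/mongo_vg/mongo_data /mongodb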

A snapshot is then done with the following sequence (a shell sketch follows the list):

  1. Create a new LVM snapshot. Internally this does an XFS freeze and fsync, ensuring that the filesystem is in a fully consistent state on disk. This causes MongoDB to freeze for around four seconds.
  2. Create EBS snapshots of each underlying EBS volume. We tag each volume with a timestamp, its position in the stripe, a stripe id and a “lineage” which we use to identify the data living in the volume set.
  3. Remove the LVM snapshot. The EBS volume performance is now degraded until the snapshots have completed. This is one of the reasons why we want to keep each EBS volume small enough. We usually have 2-4 EBS volumes per LVM group.
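
A minimal sketch of the same sequence. The volume group name, snapshot size, lineage name and EBS volume IDs are placeholders, error handling is omitted, and here the identifying info goes into the snapshot description (tags would work just as well):

# 1. take an LVM snapshot (this freezes and syncs XFS internally)
lvcreate --snapshot --size 20G --name mongo_snap /dev/mongo_vg/mongo_data
# 2. snapshot every EBS volume behind the volume group
for vol in vol-aaaa1111 vol-bbbb2222 vol-cccc3333; do
  aws ec2 create-snapshot --volume-id "$vol" --description "lineage=prod-rs0 volume=$vol taken=$(date -u +%FT%TZ)"
done
# 3. drop the LVM snapshot once the EBS snapshot calls have returned
lvremove -f /dev/mongo_vg/mongo_snap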

Restore is done in reverse order (a shell sketch follows the list):

  1. Use the AWS API to find the most recent set of EBS volumes for a given lineage which contains all the EBS volumes and whose snapshots have completed successfully.
  2. Create new EBS volumes from the snapshots and attach the volumes to the machine.
  3. Spin up the LVM so that the kernel finds the new volumes. The volume group will contain both the actual filesystem Logical Volume and the snapshot. The filesystem volume on its own is not consistent and cannot be used as such.
  4. Restore the snapshot into the volume. The snapshot contains the frozen, consistent state which we want to use, so we need to merge it back into the volume the snapshot was taken from.
  5. The volume is now ready to use. Remove the snapshot.
  6. Start MongoDB. It will replay the journal and then start reading the oplog from the master so that it can catch up with the rest of the cluster. Because the volumes were created from snapshots the new disks will be slow for at least an hour, so don’t be alarmed if mongostat says that the new slave isn’t doing anything. It will, eventually.
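
A sketch of the restore side. Snapshot IDs, availability zone, instance ID, device names and the volume ID returned by create-volume are all placeholders:

# 1.-2. create volumes from the snapshot set and attach them (repeat per stripe member)
aws ec2 create-volume --snapshot-id snap-aaaa1111 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-new11111 --instance-id i-0123456789abcdef0 --device /dev/sdf
# 3. make LVM discover the restored volume group and activate it
vgscan
vgchange -ay mongo_vg
# 4.-5. merge the consistent LVM snapshot back into the origin volume, then mount it
lvconvert --merge /dev/mongo_vg/mongo_snap
mount /dev/mongo_vg/mongo_data /mongodb
# 6. start mongod and let it replay the journal and catch up from the oplog
service mongod start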

Watch out for orphan Map-Reduce operations

If a client doing map-reduce gets killed, the map-reduce operation might get stuck and keep using resources. You can kill such operations, but even the kill can take some time. Just keep an eye out for these.
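
A quick way to spot and kill them from the shell. This is only a sketch; inspect the output of db.currentOp() yourself before killing anything:

mongo --quiet --eval '
  db.currentOp().inprog.forEach(function (op) {
    // map-reduce commands carry a "mapreduce" field in their query document
    if (op.query && op.query.mapreduce) {
      print("killing op " + op.opid + " (running " + op.secs_running + "s)");
      db.killOp(op.opid);
    }
  });'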

Quick way to analyze MongoDB frequent queries with tcpdump

MongoDB has an internal profiler, but it’s often too heavyweight for quick statistics on what kind of queries the database is getting. Luckily there’s an easy way to get some quick statistics with tcpdump. Granted, these examples are pretty naive in terms of accuracy, but they are really fast to run and they give out a lot of useful information.

Get top collections which are getting queries:

tcpdump -A -s 1400 dst port 27017 | grep query | perl -ne '/^.{28}([^.]+\.[^.]+)(.+)/; print "$1\n";' > /tmp/queries.txt
sort /tmp/queries.txt | uniq -c | sort -n -k 1 | tail

The first command dumps the beginning of each packet going to MongoDB as a string and filters out everything except queries. The perl regexp picks out the target collection name and prints it to stdout. You should run this for around 10 seconds and then stop it with Ctrl+C. The second command sorts this log and prints the top collections to stdout. You can run these commands either on your MongoDB machine or on your frontend machine.

You can also get more detailed statistics about the queries by looking at the tcpdump output. For example you can spot keywords like $and, $or, $readPreference etc. which can help you determine what kind of queries there are. Then you can pick out the queries you might want to cache with memcached or redis, or maybe move some queries to the secondary instances.
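
A similar one-liner gives a rough frequency count of such query operators (again, run it for around 10 seconds and stop with Ctrl+C; extend the operator list as needed):

tcpdump -A -s 1400 dst port 27017 | grep -o -E '\$(and|or|in|regex|readPreference)' | sort | uniq -c | sort -n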

Also check out this nice tool called MongoMem, which can tell you how much of each of your collections is stored in physical memory (RSS). This is also known as the “hot data”.