Tips and caveats for running MongoDB in production

A friend of mine recently asked for tips and caveats while he was planning a production MongoDB installation. It’s a really nice database for many use cases, but like every other database it has its quirks. We have been running MongoDB for several years and have run into quite a few bugs and issues with it. Most of them have been fixed over the years, but some still persist. Here’s my take on a few key points that are good to know:

Slave with replication lag can get slaveOk=true queries

MongoDB replication is asynchronous. The master stores every operation in an oplog, which the slaves read one operation at a time and apply to their own data. If a slave can’t keep up it falls behind and won’t contain all the updates the master has already seen. If your software runs queries with slaveOk=true, then mongos and some of the client drivers can direct those queries to one of the slaves. If that slave is lagging behind in replication, there’s a very good chance that your application gets stale data and might end up logically corrupting your data set. A ticket has been acknowledged but not scheduled for implementation: 3346.

There are two options: you can dynamically check the replication lag in your application and have it drop the slaveOk=true property when a slave is lagging, or you can reconfigure your cluster and hide the lagging slave so that mongos will not route slaveOk queries to it. This brings us to the second problem:
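The first option can be sketched roughly like this. It's a minimal sketch, not our production code: it assumes you have fetched a status document shaped like the output of the replSetGetStatus command (with pymongo that would be something like client.admin.command("replSetGetStatus")), and the function names and the 10 second lag budget are made up for illustration.

```python
from datetime import datetime, timedelta


def secondary_lags(status):
    """Return {member name: lag in seconds} for every secondary.

    `status` is assumed to look like the replSetGetStatus output:
    a "members" list whose entries carry "name", "stateStr" and "optimeDate".
    """
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary visible, so no lag can be computed
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def slave_ok_allowed(status, max_lag_seconds=10.0):
    """True only if there are secondaries and all of them are within budget."""
    lags = secondary_lags(status)
    return bool(lags) and all(lag <= max_lag_seconds for lag in lags.values())


# Hand-written sample document (hypothetical hosts), not real server output.
now = datetime(2014, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=120)},
    ]
}

if __name__ == "__main__":
    print(secondary_lags(sample_status))
    print("slaveOk allowed:", slave_ok_allowed(sample_status))
```

The application would call slave_ok_allowed() periodically and fall back to primary-only reads whenever it returns False. Checking all secondaries is the conservative choice, because the application usually cannot pick which slave mongos routes a query to.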

Reconfiguring the cluster often causes it to drop the primary for 10-15 seconds.

There’s really no other way to say this: this sucks. There are a number of operations which still, after all these years, cause the MongoDB cluster to throw its hands in the air, drop the primary from the cluster and completely rethink who should be the new master, usually ending up with the exact same master as before. There have been numerous Jira issues about this, but roughly half of them are closed as duplicates and half are resolved: 6572, 5788, 7833 plus more.

Keep your oplog big enough, but cluster size small enough.

If your database is getting thousands of updates per second, the time window that the oplog can hold starts to shrink. If your database is also growing, some operations might take so long that they can no longer complete within the oplog time window. Repairs, relaunches and backup restores are the main problems. We had one database with a 100GB oplog which could hold just about 14 hours of operations, not nearly enough to let the ops guys sleep well. Another problem is that in some cases the oplog will mostly live in active memory, which hurts overall database performance as the hot cacheable data set shrinks.
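To put numbers on this, here is a small back-of-the-envelope calculation using the figures above. It assumes a constant write rate, which is a simplification, and the 48 hour target window is just an example value I picked.

```python
def oplog_window_hours(oplog_gb, write_rate_gb_per_hour):
    """How many hours of operations fit into an oplog of the given size."""
    return oplog_gb / write_rate_gb_per_hour


def required_oplog_gb(target_hours, write_rate_gb_per_hour):
    """Oplog size needed to cover the target window at a constant write rate."""
    return target_hours * write_rate_gb_per_hour


# Figures from the text: a 100GB oplog held about 14 hours of operations.
observed_rate = 100 / 14  # GB of oplog written per hour

# Hypothetical target: a 48 hour window at the same write rate.
needed_for_48h = required_oplog_gb(48, observed_rate)

if __name__ == "__main__":
    print(f"oplog write rate: {observed_rate:.1f} GB/hour")
    print(f"oplog needed for 48 hours: {needed_for_48h:.0f} GB")
```

That is roughly 7.1 GB of oplog per hour, so a 48 hour window would need an oplog of about 343 GB on a single replica set. This is why simply growing the oplog stops being practical at some point.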

Solutions? Either manually partition your collections across several distinct MongoDB clusters or start using sharding.

A word on backups

This is not a MongoDB-specific issue: backups can be hard to implement. After a few tries, here’s our way, which has served us really well. We use AWS, so we’re big fans of the provisioned IOPS volumes. We mount several EBS volumes into the machine, as we want to keep each volume under 300GB if possible so that AWS EBS snapshots won’t take forever. We then use LVM with striping to combine the EBS volumes into one LVM Volume Group. On top of that we create a Logical Volume which spans 80% of the available space, and we create an XFS filesystem on it. The remaining 20% is left for both backups and emergency space if we need to quickly enlarge the volume. XFS allows growing the filesystem without unmounting it, right on a live production system.
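The layout described above could be built with commands along these lines. This is a sketch that only assembles the command lines and does not run anything; the device paths, the volume group and logical volume names, the mount point and the +10%VG growth step are hypothetical.

```python
def lvm_setup_commands(devices, vg="mongo_vg", lv="mongo_data", mount_point="/data"):
    """Commands to stripe the given EBS devices into one VG and put XFS on 80% of it."""
    lv_path = f"/dev/{vg}/{lv}"
    return [
        ["pvcreate", *devices],
        ["vgcreate", vg, *devices],
        # One stripe per device; leave 20% of the VG free for snapshots and growth.
        ["lvcreate", "--stripes", str(len(devices)),
         "--extents", "80%VG", "--name", lv, vg],
        ["mkfs.xfs", lv_path],
        ["mount", lv_path, mount_point],
    ]


def grow_commands(vg="mongo_vg", lv="mongo_data", mount_point="/data", extra="+10%VG"):
    """Commands to enlarge the LV and grow the mounted XFS filesystem online."""
    return [
        ["lvextend", "--extents", extra, f"/dev/{vg}/{lv}"],
        ["xfs_growfs", mount_point],
    ]


if __name__ == "__main__":
    for cmd in lvm_setup_commands(["/dev/xvdf", "/dev/xvdg"]) + grow_commands():
        print(" ".join(cmd))
```

Printing the commands first and running them by hand (or through subprocess.run once you trust them) is a deliberate choice here, since a typo in a device path on a production database machine is not something you want to discover afterwards.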

A snapshot is then done with the following sequence:

Create a new LVM snapshot. Internally this locks and syncs the XFS filesystem, ensuring that it is in a fully consistent state on disk. This causes MongoDB to freeze for around four seconds.
Create an EBS snapshot of each underlying EBS volume. We tag each volume with a timestamp, its position in the stripe, the stripe id and a “lineage” which we use to identify the data living in the volume set.
Remove the LVM snapshot. The EBS volume performance is now degraded until the snapshots are completed. This is one of the reasons why we want to keep each EBS volume small enough. We usually have 2-4 EBS volumes per LVM group.
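The three steps above can be sketched as a command plan. Again this only builds the command lines; the names, the 20G snapshot size and the exact tag keys are assumptions for illustration, and the tags are attached to the EBS snapshots here through the AWS CLI.

```python
def snapshot_commands(vg, lv, volume_ids, lineage, stripe_id, timestamp,
                      snap_size="20G"):
    """Build the LVM snapshot, EBS snapshot and cleanup commands in order."""
    snap = f"{lv}_snap"
    cmds = [
        # Step 1: LVM snapshot, taken from the free space left in the VG.
        ["lvcreate", "--snapshot", "--name", snap, "--size", snap_size,
         f"/dev/{vg}/{lv}"],
    ]
    # Step 2: one EBS snapshot per underlying volume, tagged for later lookup.
    for position, volume_id in enumerate(volume_ids):
        tags = {
            "timestamp": timestamp,
            "position": str(position),
            "stripe_id": stripe_id,
            "lineage": lineage,
        }
        tag_spec = "ResourceType=snapshot,Tags=[" + ",".join(
            f"{{Key={key},Value={value}}}" for key, value in tags.items()
        ) + "]"
        cmds.append(["aws", "ec2", "create-snapshot",
                     "--volume-id", volume_id,
                     "--tag-specifications", tag_spec])
    # Step 3: drop the LVM snapshot once the EBS snapshots have been started.
    cmds.append(["lvremove", "--force", f"/dev/{vg}/{snap}"])
    return cmds


if __name__ == "__main__":
    plan = snapshot_commands("mongo_vg", "mongo_data", ["vol-aaa", "vol-bbb"],
                             lineage="userdb", stripe_id="stripe-1",
                             timestamp="1388577600")
    for cmd in plan:
        print(" ".join(cmd))
```

Using an epoch timestamp as the tag value keeps the value free of characters that would need quoting and makes "most recent" a plain numeric comparison at restore time.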

Restore is done in reverse order:

Use the AWS API to find the most recent snapshot set for the given lineage that covers all EBS volumes and in which every snapshot has completed successfully.
Create new EBS volumes from the snapshots and attach the volumes to the machine.
Spin up the LVM so that the kernel finds the new volumes. The LVM will contain the actual filesystem Logical Volume and the snapshot. The filesystem volume is corrupted and cannot be used as such.
Restore the snapshot into the volume. The snapshot contains the fixed state which we want to use, so we need to merge it into the volume from which the snapshot was taken.
The volume is now ready to use. Remove the snapshot.
Start MongoDB. It will replay the journal and then start reading the oplog from the master so that it can get up to date with the rest of the cluster. Because the volumes were created from snapshots, the new disks will be slow for at least an hour, so don’t worry if mongostat says that the new slave isn’t doing anything. It will, eventually.

Watch out for orphan Map-Reduce operations

If a client doing map-reduce gets killed, the map-reduce operation might get stuck and keep using resources. You can kill such operations, but even the kill can take some time. Just keep an eye out for these.
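Keeping an eye out can be partly automated. The sketch below filters a document shaped like the output of db.currentOp() for map-reduce operations that have been running for a long time. Where exactly the command shows up in that output differs between MongoDB versions, so the "command" and "query" lookups and the 600 second threshold are assumptions you should check against your own server's output.

```python
def long_running_map_reduces(current_op, min_seconds=600):
    """Return the opids of map-reduce operations running at least min_seconds.

    `current_op` is assumed to look like db.currentOp() output: an "inprog"
    list whose entries carry "opid", "secs_running" and the command document
    under either "command" or "query".
    """
    opids = []
    for op in current_op.get("inprog", []):
        body = op.get("command") or op.get("query") or {}
        if not isinstance(body, dict):
            continue  # some entries carry a non-document here; skip them
        is_map_reduce = any(key.lower() == "mapreduce" for key in body)
        if is_map_reduce and op.get("secs_running", 0) >= min_seconds:
            opids.append(op["opid"])
    return opids


# Hand-written sample, not real server output.
sample_current_op = {
    "inprog": [
        {"opid": 101, "secs_running": 4200,
         "query": {"mapreduce": "events", "out": "daily_stats"}},
        {"opid": 102, "secs_running": 3,
         "command": {"mapReduce": "events", "out": "hourly_stats"}},
        {"opid": 103, "secs_running": 9000,
         "query": {"user_id": 42}},
    ]
}

if __name__ == "__main__":
    print(long_running_map_reduces(sample_current_op))
```

The returned opids are candidates to inspect and, if they really are orphans, to pass to db.killOp() in the mongo shell. I would not kill them automatically, since a long-running map-reduce can be perfectly legitimate.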

Juho Garo Mäkinen's blog

Useful learnings from software engineering.