Setting up a backup solution for my entire digital legacy (part 2 of 2)

As part of my LifeArchive project, I had to verify that I have sufficient methods to back up all my valuable assets so well that they will last for decades. Sadly, there is currently no mass storage media known to function for such a long time, and in any case you must be prepared to lose an entire site to flood, fire or some other disaster. This post explains how I solved the backup needs for my entire digital legacy. Be sure to read the first part: LifeArchive – store all your digital assets safely (part 1 of 2)

The cheapest way to store data currently is on hard disks. Google, Facebook, The Internet Archive, Dropbox and others are all known to host big data centers with a lot of machines and a lot of disks. At least Google is also known to use tapes for additional backups, but those are way too expensive at this small scale.

Disks also have their own problems. The biggest one is that they tend to break. Another is that they can silently corrupt your data, which traditional RAID systems cannot detect. As I’m a big fan of ZFS, my choice was to build a NAS on top of it. You can read more about this build in this blog post: Cheap NAS with ZFS in HP MicroServer N40L

Backups

As keeping all your eggs in one basket is just stupid, a good and redundant backup solution is the key to success. In my previous post I concluded that relying solely on cloud providers to host your data isn’t wise, but they are a viable choice for backups. I’ve chosen CrashPlan, which is a really nice cloud-based tool for incremental backups. Here are the pros and cons of CrashPlan:

Pros:

  • Nice GUI for both backing up and restoring files
  • Robust: the backups will eventually complete, and the service notifies you by email if something is broken
  • Supports Windows, OS X, Linux and Solaris / OpenIndiana
  • Unlimited storage on some of the plans
  • Does incremental backups, so you can recover a lost file from its history
  • Lets you back up both to the CrashPlan cloud and to your own storage if you run the client on multiple machines
  • Lets you back up to a friend’s machine (this doesn’t even cost you anything), so you can establish a backup ring with a few of your friends

Cons:

  • It’s still a black-box service, which might break down when you least expect it
  • The CrashPlan cloud is not very fast: upload rate is around 1 Mbps and download (restore) around 5 Mbps
  • You have to fully trust the CrashPlan client, as there’s no other way to access the archive except through the client

I set up the CrashPlan client to back up both into its cloud and to Kapsi Ry’s server, where I run another copy of the CrashPlan client. Running your own client is easy, and it gives me a much faster way to recover the data when I need to. As the data is encrypted, I don’t need to worry that a few thousand other users share the same server.

Another parallel backup solution

Even though CrashPlan feels like a really good service, I still didn’t want to trust it alone. I could always somehow forget to enter my new credit card number and let the data there expire, only to suffer a simultaneous fatal accident on my NAS. That’s why I wanted a redundant backup method. I happened to get another used HP MicroServer for a good bargain, so I set it up similarly with three 1 TB disks which I had lying around unused from my previous NAS. Used gear, used disks, but they’re good enough to act as my secondary backup method. I will of course still receive email notifications on disk failures and broken backups, so I’m well within my planned safety limits.

This secondary NAS lives at another site and is connected to the primary server at my home over an OpenVPN network. It doesn’t allow any incoming connections from the outside, so it’s also quite safe. I set up a simple rsync script on my main NAS to sync all data to this secondary NAS. The script uses the --delete option, so it removes files which have been deleted from the primary NAS. Because of this I also use a crontab entry to snapshot the backup each night, which protects me if I accidentally delete files from the main archive. I keep a week’s worth of daily snapshots and a few months of weekly snapshots, roughly as sketched below.
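A minimal sketch of the idea (hostnames, pool and dataset names are made-up examples, not my actual scripts). On the primary NAS, a nightly cron job pushes the archive over the VPN tunnel:

    #!/bin/sh
    # Mirror the archive to the secondary NAS over the OpenVPN tunnel.
    # --delete keeps the mirror exact, so deletions propagate too.
    rsync -az --delete -e ssh /tank/archive/ root@secondary-nas:/tank/backup/archive/

On the secondary NAS, another cron entry snapshots the backup dataset right after the sync. Reusing weekday-named snapshots makes a week of dailies rotate automatically:

    #!/bin/sh
    # Drop last week's snapshot with today's name, then take a fresh one.
    zfs destroy tank/backup/archive@daily-$(date +%a) 2>/dev/null
    zfs snapshot tank/backup/archive@daily-$(date +%a)

A similar pair of weekly-named snapshots covers the longer retention period.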

One of the biggest advantages over CrashPlan is that the files sit directly on the filesystem. There’s no encryption nor any proprietary client or tools to rely on, so you can safely assume that you can always get access to your files.

There’s also another option: get a few USB disks and set up a scheme where you automatically copy your entire archive to one disk. Then every once in a while unplug it, put it somewhere safe and replace it with another. I might do something like this once a year.
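If I do, the yearly job could stay this simple (mount point and paths are made-up examples):

    #!/bin/sh
    # Copy the whole archive onto the currently plugged-in USB disk,
    # then unmount it so it can be unplugged and stored off-site.
    mount /media/usb-backup
    rsync -a --delete /tank/archive/ /media/usb-backup/archive/
    umount /media/usb-backup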

Verifying backups and monitoring

“You don’t have backups unless you have proven you can restore from them.” – a well-known truth that many people tend to forget. Rsync backups are easy to verify: just run the entire archive through sha1sum on both sides and check that the checksums match. CrashPlan is a different beast, because you need to restore the entire archive to another place and verify it from there. It’s doable, but currently it can’t be automated.
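Verifying the rsync mirror can be as simple as this (host and paths are made-up examples):

    #!/bin/sh
    # Checksum every file on both sides, sort by path and diff the lists.
    # Any output means a missing or corrupted file to investigate.
    (cd /tank/archive && find . -type f -exec sha1sum {} + | sort -k 2) > /tmp/primary.sha1
    ssh secondary-nas 'cd /tank/backup/archive && find . -type f -exec sha1sum {} + | sort -k 2' > /tmp/secondary.sha1
    diff /tmp/primary.sha1 /tmp/secondary.sha1 && echo "backup verified OK"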

Monitoring is another thing. I’ve built all my scripts so that they email me if there’s a problem, so I can react immediately to errors. I’m planning to set up a Zabbix instance to keep track of everything, but I haven’t yet bothered.
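Cron does most of the work here: it mails any output of a job to the crontab owner (or to MAILTO on Vixie-style crons), so it’s enough to keep the scripts silent on success and noisy on failure. A sketch, with a made-up script path and address:

    MAILTO=me@example.com
    30 3 * * * /root/bin/sync-archive.sh || echo "archive sync failed on $(hostname)"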

Conclusions

Currently most of our digital assets aren’t stored safely enough that you can count on them all being available in the future. With this setup I’m confident that I can keep my entire digital legacy safe from hardware failures, cracking and human mistakes. I admit that the overall solution isn’t simple, but it’s very much doable for an IT-savvy person. The problem is that you currently can’t buy this kind of setup anywhere as a service, because you can’t be 100% sure that the service will stay up for the decades to come.

This setup also works as a personal cloud, assuming that your internet link is fast enough. With the VPN connections, I can also let my family members connect to this archive and store their valuable data in it. This is very handy, because that way I know I can access the digital legacy of my parents, who probably couldn’t do all this by themselves.

LifeArchive – store all your digital assets safely (part 1 of 2)

Remember the moments when you, or your parents, found some really old pictures buried deep in some closet and instantly got a warm and fuzzy feeling of memories? In this modern era of cloud services, we’re producing ever more personal data which we want to retain: pictures from your phone and your DSLR camera, documents you’ve made, DRM-free games, movies and music you’ve bought and so on. The list goes on and on. Today’s problem is that you have so many different devices that it’s really hard to keep track of where all your data is.

One solution is to rely on a number of cloud services: Dropbox, Google, Facebook, Flickr and others can all host your images and files, but can you really trust that your data is still there after ten years? What about 30 years? What if you use a paid service, forget to update your new credit card details, and the service deletes all your data? What about the files left in a corner of your old laptop after you bought a shiny new computer? You can’t count on the cloud providers you use still being in business decades from now.

The solution? Implement your own strategy by backing up all your valuable digital assets. After thinking about this for a few years, I finally came up with my current solution: gather all your digital assets into one place, so that they’re easy to back up. You can still use cloud services like Dropbox and Facebook, but make sure that you automatically back up all of these services into this central storage. This way there’s only one place you need to back up, and you can easily back it up to multiple different places for extra precaution.

First, identify what’s worth saving

  • I do photography, so there’s a bunch of .DNG and .JPG images in my Lightroom archive. I don’t photograph that much, so I can easily store them all, assuming I delete the shots which failed so badly that there’s no imaginable situation where I would want them.
  • I also like making movies. The raw footage takes too much space to be worth saving, so I only archive the project files and the final version. I store the raw footage on external drives which I don’t back up into this archive.
  • Pictures from my cell phone. There’s a ton of lovely images there which I want to save.
  • Emails, text messages from my phone, comments and messages from Facebook.
  • Misc project files, be it a 3D model, a source code file related to an experiment, drawings for my new home layout and so on. I produce this kind of small stuff on a weekly basis and I want to keep it for future reference and for the nostalgic value.
  • This blog and the backups related to it.

I calculated that I currently have about 250 GB of this personal data, spanning over a decade. My plan is to keep adding data to this central repository for the rest of my life and to transfer it to new hardware whenever the old breaks. In other words, this will be my entire digital legacy.

Action plan:

  1. Buy a good NAS for home
  2. Build a bunch of scripts and automation to fetch all data from the different cloud services to this NAS
  3. Implement good backup procedures for this NAS.

The first step was quite easy: I bought an HP MicroServer, which acts as a NAS in my home. You can read more about that project in this other blog post. The second step is the most complex: I have multiple computers, an Android cell phone and a few cloud services holding data that I want to save. I had to find existing solutions for each of these and build my own where I couldn’t find anything. The third step is easy, but it’s worth another blog post next week.

Archive pictures, edited videos and other projects

I can access the NAS directly from my workstations via Samba/CIFS mounts over the network, so I use it directly to host my Lightroom archive, edited video projects (not including the raw video assets) and the other project files I tend to produce. I also use it to store DRM-free music, videos and ebooks which I’ve bought from the internet.
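On a Linux workstation such a mount can be a single fstab entry, roughly like this (host name, share and credentials file are made-up examples):

    # /etc/fstab - mount the NAS archive share at boot
    //nas/archive  /mnt/archive  cifs  credentials=/root/.smbcredentials,uid=1000,_netdev  0  0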

Backing up Android phones

This covers the pictures I take with my phone, but also raw data and settings for applications. I found a nice program called Rsync for Android. It uses rsync over ssh keys to sync to a backup destination, which in my case runs inside an OpenIndiana zone on my NAS. The destination directory is mounted into the zone via lofs, so that only that specific data directory is exposed to the zone. I then use Llama to run the backup periodically.
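Exposing only that one directory to the zone is a one-time zonecfg change in the global zone, roughly like this (zone name and paths are made-up examples):

    # Loopback-mount only the phone backup directory into the zone,
    # so the rest of the pool stays invisible to it.
    zonecfg -z backupzone
    add fs
    set dir=/export/android
    set special=/tank/archive/android
    set type=lofs
    end
    commit
    exit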

In addition, I use SMS Backup + to sync all SMS and MMS messages to GMail under a special “SMS” label. Works like a charm!

Backing up GMail

gmvault does the trick. It uses IMAP to download new emails from GMail and it’s simple to set up.

I actually run two different instances of gmvault. Both sync my GMail account, but one deletes emails from its backup database when they have been deleted from GMail, and the other never deletes anything. The idea is that I can still restore my emails even if somebody gains access to my GMail and deletes them all. A script in my cron syncs both backup databases every night with the “-t quick” option.
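A sketch of such a nightly script (database paths are made-up examples, and the flags are worth double-checking against gmvault sync -h: -d points at the backup database, and -c no should disable gmvault’s database cleaning):

    #!/bin/sh
    # Database 1 mirrors GMail, including deletions (db cleaning on by default).
    gmvault sync -t quick -d /tank/archive/gmail-mirror me@gmail.com
    # Database 2 never deletes anything, which protects against a hijacked
    # account wiping the mailbox (-c no should turn db cleaning off).
    gmvault sync -t quick -c no -d /tank/archive/gmail-keep-all me@gmail.com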

Backup other servers

I have a few other servers, like the one hosting this blog, that need to be backed up. I use plain rsync over ssh keys from cron, backing them up every night. The rsync uses --backup and --backup-dir to retain some direct protection for deleted files.
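Such a nightly cron job can look roughly like this (host, key and paths are made-up examples):

    #!/bin/sh
    # Pull the web server into the archive. --backup/--backup-dir move
    # changed and deleted files into a dated directory instead of
    # discarding them, giving cheap protection against mistakes.
    rsync -az --delete \
          --backup --backup-dir=/tank/archive/servers/blog-old/$(date +%Y-%m-%d) \
          -e "ssh -i /root/.ssh/backup_key" \
          backup@blog.example.com:/var/www/ /tank/archive/servers/blog/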

Conclusions

This kind of central storage for all your digital assets needs some careful planning, but it’s an idea worth exploring. After you have established such an archive, you need to implement the backups, which I will talk about in the next post.

This kind of archive solves the problem of not being sure where all your files are – they either have a copy in the archive, or they don’t exist. Besides that, you can make some really entertaining discoveries when you crawl the depths of the archive and find files you thought you had lost a decade ago.

Read the second part of this blog series: Setting up a backup solution for my entire digital legacy (part 2 of 2)