Backups on the Home Front

In computing, you have exactly two options: 1) Have current, working, tested backups or 2) don’t care if your data is there tomorrow. There is no third option. Pretending that there is leads only to substantial cussing.

Unfortunately, people mostly either know this already, or won’t be convinced of it until they learn from the school of bitter experience. So, this isn’t a post to try to convince you to take backups; it’s a post about how I do it, presented in the hopes that it’ll make doing so easier and safer. (As with many of my computer-related posts, the actual implementation is somewhat specific to UNIX-like systems, though many of the general principles apply universally.)

Goals

There are a few different requirements that a backup solution needs to meet — for me, anyway; your requirements may differ. These include:

reliability — Write-only backups are no fun. Part of reliability is testability. You can’t count a backup solution as reliable unless you can and do periodically check (by actually retrieving files) that it works. If you didn’t personally get real data out of the backup, it didn’t happen. Reliability claims of hardware or software vendors aren’t enough. Success status reported by the backup software is not enough. (Actual example: expensive enterprise-level backup software saying “complete success” day after day, when all it was configured to back up was one empty directory.)
redundancy — While having a reliable backup system means that there’s a good chance that you’ll be able to recover what you need from a single backup disk or tape, redundancy means you have more than one such thing, and that they’re kept in different places. If all your backups are in the same building, you may be out of luck if that building burns or floods or is raided by thieves. Even backups in different places within the same city or region can all be lost if there is a widespread disaster like a hurricane.
capacity — The backup target should be big enough to hold all my important files. For that matter, it should be big enough to hold my unimportant files too, since I don’t want to have to pick and choose. Now, there are certainly files you can re-generate or re-install from some other source. But if you’re making decisions about what to omit (to save space), it’ll inevitably bite you. Example: Why keep all those .mp3 or .flac files? They’re huge, and you can always re-rip your CD collection. That’s fine, unless the reason you’re restoring is the house fire that also destroyed your CDs.
portability — Backups are worthless if the rare machine you need to read the media or the special software you need to restore the data were destroyed. Plan around the idea that you’re in a small town with the backup medium in your hand, a limited budget and a need to restore your data in the next few days — maybe on a public-access or borrowed computer.
security — If your backup media are stolen or seized, the perpetrator should be prevented (by strong, peer-reviewed encryption) from gaining access to your files.
ease of use — If making backups is a hassle, it’ll be skipped. If it isn’t automatic, it’ll be forgotten. It’s also important that restoring files be reasonably easy. On the one hand, you shouldn’t delete or overwrite files you want to keep, and you should always think before typing the command. On the other, mistakes are inevitable, and time you don’t have to spend recovering from an error is time you can spend on something productive.
ability to keep multiple generations — Sometimes a file gets overwritten or corrupted, and stays that way for a while before you notice anything wrong. In such cases, the most recent backup will just be a faithful copy of the bad data. It’s useful to have the ability to keep multiple generations of backups, spread out over time, and to restore from one of the older ones if you like.

Physical Medium

I’ve tried a number of different systems over the years, starting with “save to two different floppy disks”, then QIC tapes, various helical-scan systems, and recordable optical media (CD-R and DVD-R). I’ve found tapes to be prohibitively expensive, painfully slow and, unless you are absolutely religious about head cleaning and replacing media on schedule, prone to failure. Optical media have limited capacity, and the most reliable such media are write-once.

My current solution involves using plain old ATA (or SATA) hard disks in USB drive enclosures. These are fast, capacious enough to hold multiple snapshots of my entire system, hot-pluggable and more reliable than all but the very best tapes (the drives for which I cannot afford). They’re self-contained, so I can take them (or mail them) to off-site storage. They’re cheap and use commodity hardware, so I can have a bunch of backup devices, and buy more if I think I need them.

Software

I keep one such drive enclosure connected to my homebox, and use a cron script to run nightly backups. The backup device has a single partition formatted as an encrypted journaling filesystem (ext3 or reiserfs) — more details about this in a later section. The backup software is a simple Bourne-shell script I wrote:

Bourne shell script (3KB)

Note that there are configuration options within the body of the script which you must edit; do not just run this without first adapting it to your site.

There are substantial comments within the script itself explaining what it does in detail. In essence, it uses rsync(1) to make a snapshot of a filesystem. It uses the --link-dest= option to opportunistically make hard links for files which have not changed since the previous backup generation.

For example, say you have a 4GB movie file “cat.avi”. You make a first backup, and it puts a copy of cat.avi on the backup device. You make a second backup, and cat.avi hasn’t changed (in terms of contents, name or directory). The second backup will hard-link to cat.avi in the first backup. Now each backup contains cat.avi (with a link count of 2), and it only takes up 4GB of space on the backup (not 8GB).

If you edit the movie (say, by using a compositing program to add captions), and make a third backup on the same device, rsync will notice the change and make a fresh copy (without disturbing the previous two backups).

This combines the advantages of full backups (you need only look in one place to get the latest data, or any specific generation of data) and incremental backups (speed, efficient use of storage space on backup medium).
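As a rough illustration of the idea (a minimal sketch, not the script linked above), a snapshot function built on --link-dest might look like the following. The function name, the timestamp naming scheme, and the assumption that generation names sort chronologically are all illustrative choices rather than details taken from the original script.

```shell
#!/bin/sh
# Sketch only: hard-linked snapshots with rsync --link-dest.
# Assumes every entry under DEST_ROOT is a generation whose name sorts
# chronologically (e.g. 2009-06-01_030000) and contains no newlines.

# snapshot SRC DEST_ROOT [NAME]
snapshot() {
    src=$1
    mkdir -p "$2" || return 1
    # --link-dest treats a relative path as relative to the destination
    # directory, so work with an absolute path throughout.
    root=$(cd "$2" && pwd) || return 1
    name=${3:-$(date +%Y-%m-%d_%H%M%S)}

    # Most recent existing generation, if any.
    prev=$(ls -1 "$root" | sort | tail -n 1)

    if [ -n "$prev" ]; then
        # Unchanged files become hard links into the previous generation;
        # changed or new files are copied afresh.
        rsync -a --link-dest="$root/$prev" "$src/" "$root/$name/"
    else
        # First generation: nothing to link against, so this is a full copy.
        rsync -a "$src/" "$root/$name/"
    fi
}
```

Calling something like `snapshot /home /private-mnt/backup/home` twice and then running `ls -li` on the same file in both generations should show one inode number with a link count of 2, provided the file did not change in between.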

Because the backup medium is just a plain old mounted filesystem, I can navigate it and use the contents using all the same tools I use for my regular files. I can plug the enclosure into any modern Linux box with a USB port, mount the filesystem, and read my files without having to install anything special. I can do a full restore onto commodity hardware using any one of the various bootable rescue disks. Because not very many files change each night, backups after the first one are extremely fast, and a single drive can hold dozens of generations.

Encryption

My backup media are small, portable, and a tempting target for theft or seizure. They are also — because of the requirement for geographic redundancy — kept in various places where the physical security is maybe not the best.

As a result, I don’t want mere possession of one of my backup disks to be sufficient to allow someone to read my files. I also don’t want to depend on a key or token that might be destroyed or stolen in the same disaster that caused the need to restore from backup in the first place. So, I depend on encryption (using a passphrase I know, which is recorded nowhere) to protect the backups.

I don’t want a filesystem-level encryption solution, since there are (surprisingly common) cases where metadata like the names, sizes and timestamps of files and directories are as privacy-sensitive as the contents of the files; solutions which protect only the file contents are insufficient.

I also consider proprietary closed-source “encryption” products to be a contradiction in terms. You have to take the word of the vendor that the product works, and that they haven’t included back doors at the behest of some criminal enterprise (governmental or otherwise). This is one of the reasons I disdain “cloud” backup schemes: they generally ask you to trust some black-box alleged crypto, sight-unseen. (If you must use such a service, mitigate the risk by storing only big undifferentiated blobs of data you’ve encrypted yourself, using real open-source peer-reviewed crypto.)

Finally, I want something where the decryption software is ubiquitous, thoroughly tested, actively developed, and easy to come by if I have to do a “bare metal” restore.

Fortunately LUKS (Linux Unified Key Setup) is an excellent solution to all of the above concerns. Plenty of detail is available on the project homepage. A short description of the moving parts:

block device driver — abstracts storage hardware (like a USB hard disk) as a standard UNIX block device with a uniform API
device mapper — a generic framework for mapping one block device onto another
dm-crypt — a device mapper target that provides encryption
LUKS — provides standardized key management

When you plug in your drive enclosure, you get a block device file like /dev/sdb1 for the first (and perhaps only) partition on the disk. That partition (representing what’s on the physical disk) is full of encrypted hash, statistically indistinguishable from purely random numbers.

Using the device mapper, dm-crypt target and cryptographic functions built in to the kernel (probably via the cryptsetup(8) command), you get another (virtual) block device file like /dev/mapper/backup, which acts as a cleartext version of your physical device. That is to say, when you read from it, a block of encrypted data is read from the physical disk, decrypted, and presented to you. When you write, your cleartext data is encrypted then written to the physical device.

This is handy, because you can use the device mapper virtual block device like any other block device. In particular, you can make a filesystem on it, mount it, and then it is just another directory.

Details depend on your distribution, but the essential steps are running cryptsetup luksFormat to set up the key block, encryption scheme and passphrase, then cryptsetup luksOpen to use your passphrase to unlock a key and create a mapping. A tutorial for Ubuntu 8.04 is a representative example.
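The steps can be sketched roughly as follows. This is an illustration under assumptions, not a recipe for any particular distribution: /dev/sdb1 and the mapping name "backup" are taken from the examples in this post, while the `run` wrapper and the DRY_RUN variable are hypothetical conveniences that print each command instead of executing it. Note that luksFormat destroys whatever is already on the partition.

```shell
#!/bin/sh
# Sketch: prepare, open and close an encrypted backup partition.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0
# and run as root to execute them for real.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One-time setup. WARNING: luksFormat wipes existing data on DEV.
setup_backup_disk() {
    dev=$1
    name=$2
    run cryptsetup luksFormat "$dev"          # write LUKS header, set passphrase
    run cryptsetup luksOpen "$dev" "$name"    # unlock; creates /dev/mapper/NAME
    run mkfs.ext3 "/dev/mapper/$name"         # filesystem on the cleartext device
}

# Routine use: unlock with the passphrase, then mount.
open_backup_disk() {
    dev=$1
    name=$2
    mnt=$3
    run cryptsetup luksOpen "$dev" "$name"
    run mount "/dev/mapper/$name" "$mnt"
}

# Before unplugging: unmount, then remove the mapping.
close_backup_disk() {
    name=$1
    mnt=$2
    run umount "$mnt"
    run cryptsetup luksClose "$name"
}
```

With the default DRY_RUN=1, `open_backup_disk /dev/sdb1 backup /private-mnt/backup` just prints the two commands it would run, which is a cheap way to check the device name before committing to anything.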

Some distros (like Ubuntu 9.04) have very nice integration of this process — all you have to do is plug in a device with an encrypted partition; you’ll be prompted for the passphrase, and your filesystem will be mounted automatically.

Data Protection

Encryption is an essential component of data protection, but isn’t a complete solution by itself. A complete discussion is beyond the scope of this post, but some things to think about:

If your threat matrix includes scenarios where the attacker has you as well as your backup media, consider using a duress password in combination with decoy content. This will let you appear to comply with an attacker who threatens you with harm unless you disclose your data, while simultaneously making the data unrecoverable forever (even if you change your mind or the ruse is discovered). The beerbottle project is an example implementation.

Data is only protected if every copy is protected. While backups deserve special consideration because they have to be kept in places where physical security is less well-controlled, consider also encrypting your primary filesystem. (This is especially important for laptops, which travel to all manner of sketchy places.) An unencrypted swap partition may be full of things you’d rather keep to yourself.

Attackers routinely steal running machines, both host and UPS, dragging the lot back to where memory contents can be dumped and the contents of mounted filesystems examined. A simple limit switch can cut power if the case is lifted; easy to defeat, but effective if the attacker doesn’t expect it. If you’re the suspicious sort, there are trembler switches, accelerometers, magnetometers, and really any kind of electronic widget that can detect a condition which is reasonably likely to only arise if your box is being abducted. Creativity counts for a lot; an off-the-shelf solution will have an off-the-shelf countermeasure.

Controlling Writes

I keep my backup device mounted most of the time. It’s nice, because users have direct access (subject to the same file permissions as the original) to all the recent generations of backup. The one disadvantage, and it’s a big one, is that if a user can write to a file he can destroy the backups of that file. Since part of the purpose of backups is to protect users from their own errors, this is a serious problem.

My solution is to mount the backup filesystem on a mount point that only root can reach, and use that to perform the backups. I have a read-only bind mount accessible to users. A user can read a file if the mode bits allow it, but can’t write a file no matter what the permissions are. Example commands:

mount -t auto /dev/mapper/backup /private-mnt/backup
mount --bind /private-mnt/backup /mnt
mount -o remount,bind,ro /mnt

In the above example, the directory /private-mnt/ is owned by root and has mode 0700 (read, write, execute only by owner). Non-root users cannot do anything beneath /private-mnt/ (including traverse the backup mount point therein). Anyone can traverse directories, read files and even run programs under /mnt/ (subject to individual file permissions), but can modify nothing (due to the read-only status of the filesystem).
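It may be worth confirming that the user-visible mount really is read-only before trusting it. The following is a minimal sketch, assuming a Linux-style mount table in the format of /proc/self/mounts; the function name `is_ro` is made up for illustration, and mount points containing spaces (which the kernel escapes in that file) are not handled.

```shell
#!/bin/sh
# Sketch: succeed if MOUNTPOINT is currently mounted read-only.
# is_ro MOUNTPOINT [TABLE]
#   TABLE defaults to /proc/self/mounts; fields are
#   device, mount point, fstype, options, dump, pass.
is_ro() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }            # last matching entry wins
        END {
            if (opts ~ /(^|,)ro(,|$)/) exit 0
            exit 1                        # not mounted, or mounted rw
        }
    ' "${2:-/proc/self/mounts}"
}
```

For example, `is_ro /mnt || echo "/mnt is writable or not mounted" >&2` could run right after the mount commands above.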

Scheduling

Running the backup script can be done through a simple nightly cron job. Mounting the device involves a necessary manual step: entering the passphrase, which obviously cannot be kept in persistent storage. Rotating backup devices between home and remote locations also requires human effort.

I’m still finding the right compromise. Right now, I run (automated) backups nightly, and rotate the devices every few weeks. Each rotation requires remounting and typing the passphrase. If I have to reboot for some reason (like a lengthy power outage or a kernel update) that also requires a remount — but such events are few and far between.
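Because the disk may not be unlocked and mounted after a reboot or a rotation, a cron job probably should refuse to run rather than quietly fill the empty mount point directory on the root filesystem. One way to sketch that guard, assuming a hypothetical marker file named .backup-volume that was created once on the backup filesystem (the marker name and the script path in the comment are illustrative, not part of the original setup):

```shell
#!/bin/sh
# Sketch of a wrapper for cron to call. Example crontab entry (the path
# is hypothetical):
#   30 2 * * * /usr/local/sbin/nightly-backup
# cron mails whatever a job prints to the crontab owner (or to MAILTO),
# so staying silent on success and complaining on failure yields mail
# only when something needs attention.

# run_backup MOUNTPOINT COMMAND [ARGS...]
run_backup() {
    mnt=$1
    shift
    # If the marker is missing, the encrypted disk is probably not
    # unlocked and mounted, so do not run the backup command at all.
    if [ ! -e "$mnt/.backup-volume" ]; then
        echo "backup skipped: $mnt does not look like the backup volume" >&2
        return 1
    fi
    "$@"
}
```

A nightly script could then end with something like `run_backup /private-mnt/backup /path/to/backup-script`, where the last argument stands in for whatever actually performs the backup.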

Testing

If you’re using cron, it’s relatively easy to set things up so you get mail if the backup script fails for whatever reason. Actually reading the files is simple enough — just go into the mounted backup filesystem and do whatever you need to do (either as part of a random audit, or because you made a mess and actually need to).

If you want a more extensive test, it isn’t difficult to do bitwise comparisons of the backup filesystem (or portions thereof) against your running system. Just bear in mind that not all differences are errors; some files are supposed to change, after all.
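A bitwise comparison along those lines can be sketched with find(1) and cmp(1). This is a minimal illustration, assuming file names without embedded newlines; the function name `verify_tree` and the output wording are invented for the example.

```shell
#!/bin/sh
# Sketch: compare every regular file in a backup generation against the
# live tree, byte for byte.
# verify_tree LIVE_DIR BACKUP_DIR
verify_tree() {
    live=$1
    backup=$2
    # List files relative to the backup directory, then check each one
    # against its counterpart in the live tree.
    ( cd "$backup" && find . -type f ) | while IFS= read -r f; do
        if [ ! -e "$live/$f" ]; then
            echo "missing: $f"      # in the backup, gone from the live system
        elif ! cmp -s "$live/$f" "$backup/$f"; then
            echo "differs: $f"      # contents changed since the backup
        fi
    done
}
```

No output means every backed-up file matched. Lines marked "differs" or "missing" are prompts for a closer look rather than proof of a problem, since the live system keeps changing after the backup runs.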

Problems and Limitations

The chief problem I’ve encountered so far is inelegant handling of hard links in the source data. If foo and bar are hard links which refer to the same underlying data, a given generation of my backup will contain two copies of that data. It may be that I can solve this by being smarter about what options I pass to rsync. In some cases, using symlinks instead might be a workaround.
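One rsync option that may help here is -H (--hard-links), which asks rsync to find files in the source that are hard-linked to each other and recreate those links at the destination; -a does not imply it. The sketch below shows the option in isolation and has not been tested against the script described above; the function name is illustrative, and -H adds some memory and time overhead on large trees.

```shell
#!/bin/sh
# Sketch: copy a tree while preserving hard links that exist *within*
# the source tree (foo and bar pointing at the same data stay linked).
# copy_preserving_links SRC DEST
copy_preserving_links() {
    rsync -aH "$1/" "$2/"
}
```

After such a copy, `ls -li` on foo and bar in the destination should report the same inode number for both.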

Another potential hassle is automating the process of selectively throwing away older generations of backups. (Actually getting rid of a generation is not a problem — you just remove the directory tree.) Right now, I deal with this by re-formatting the drives at a certain point in the rotation. It wouldn’t be a bad idea to write a script that would do something like keep all backups less than a month old, then single generations at two, three, six and twelve months. I just haven’t done this yet.
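A pruning script of that sort might start from something like the sketch below. It implements a deliberately simpler policy than the one described above (keep everything newer than a cutoff date, plus the oldest generation of each earlier calendar month), and it assumes generation directories are named YYYY-MM-DD, which is an illustrative convention. It only prints candidates; it deletes nothing.

```shell
#!/bin/sh
# Sketch: decide which generations could be thrown away.
# Reads generation names (YYYY-MM-DD, one per line) on stdin and prints
# the ones that are older than CUTOFF and are not the oldest generation
# of their calendar month.
# prune_candidates CUTOFF
prune_candidates() {
    sort | awk -v cutoff="$1" '
        ($0 "") >= (cutoff "") { next }                  # recent: keep
        { month = substr($0, 1, 7) }                     # YYYY-MM
        !(month in seen)       { seen[month] = 1; next } # oldest of its month: keep
        { print }                                        # everything else: candidate
    '
}
```

With GNU date, the cutoff could be computed as `$(date -d '30 days ago' +%Y-%m-%d)`; that -d syntax is GNU-specific. Reviewing the printed list by hand before passing anything to `rm -rf` seems prudent.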

While there is a facility for excluding certain data on the running filesystem from the backup (using the rsync --filter=dir-merge option), it is not exactly straightforward to use. (However, it’s dead easy to check the next day’s backups to see if you got the set of files you expected.)
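For reference, a dir-merge rule tells rsync to look for a named filter file in each directory it visits and to apply the rules found there to that directory and everything beneath it. A minimal sketch, assuming the conventional file name .rsync-filter (the name rsync's -F shorthand uses) and an illustrative function name:

```shell
#!/bin/sh
# Sketch: per-directory exclusions via rsync's dir-merge filter rule.
# backup_with_filters SRC DEST
backup_with_filters() {
    rsync -a --filter='dir-merge /.rsync-filter' "$1/" "$2/"
}

# Example contents of a .rsync-filter file placed in a source directory
# (one rule per line; a leading "- " means exclude):
#   - *.tmp
#   - /cache/
```

The .rsync-filter files themselves are copied along with everything else, so the exclusions end up documented in each backup generation.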

This backup scheme is resilient in the face of single bits of backup media failing (in addition to whatever failure caused you to need to restore from backup in the first place). Beware of monoculture vulnerabilities in your hardware, though. I found a bargain on a loss-leader drive plus enclosure combo at a local shop, and bought several. All had power supplies which failed after a relatively short period of use. Buy enclosures from different vendors, and make sure the drives inside are different makes as well. (There are lots of stories about RAID arrays where all the drives came from the same lot, and all failed within hours of one another. Heed such stories.)

Some USB drive enclosures play silly mapping games, using a portion of your drive for their own nefarious purposes. Avoid these, as it is possible that if you transfer the drive to another enclosure, you’ll be unable to see your partition table without vigorous hackery. In fact, I’d suggest trying the drive-swap game right away when you get a new enclosure. Enclosures purchased alone seem less likely to exhibit annoying incompatibilities than those sold bundled with a drive.

This backup scheme is extremely simpleminded, in that it works by just copying all the files one at a time, whenever it is run. This works great for most things, but can be a problem for things that are in the midst of being updated when that copy is made. Relational databases are especially troublesome in this regard. You can work around this by programmatically dumping the database contents to a flat file just prior to running the backup. (Both MySQL and PostgreSQL offer means of doing this without shutting down the database.)
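The dump step can be sketched as a small helper that writes to a temporary file and renames it into place only if the dump command succeeds, so that a half-written dump never replaces the previous good one. The helper name, the paths and the database name below are hypothetical; pg_dump and mysqldump are the standard dump tools for the two databases mentioned.

```shell
#!/bin/sh
# Sketch: run a dump command and install its output atomically.
# dump_to OUTFILE COMMAND [ARGS...]
dump_to() {
    out=$1
    shift
    tmp="$out.tmp.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$out"     # rename within one directory is atomic
    else
        rm -f "$tmp"         # failed dump: leave any previous dump in place
        return 1
    fi
}

# Possible uses just before the nightly backup (names are examples):
#   dump_to /var/backups/db/mydb.sql pg_dump mydb
#   dump_to /var/backups/db/mydb.sql mysqldump --single-transaction mydb
# (--single-transaction gives a consistent snapshot for InnoDB tables.)
```

The flat file then gets picked up by the normal backup run like any other file.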

USB drive enclosures, with a few exceptions, do not support the commands needed to access reliability information available on the underlying drive hardware (SMART and the like). While tools like smartctl only give you advance notice of a fraction of drive failures, some warning is better than none. Hotplug ATA or SCSI gets around this problem, but adds others (chiefly expense and difficulty finding compatible hardware on short notice).

This backup scheme makes it much less likely you’ll lose data by accident, but complicates life if for some reason you need to get rid of all the copies of a given file.

Conclusion

I doubt that anyone else will want to use exactly the backup scheme I use, but I hope that some of the ideas here will help some of my readers. I also hope that you’ll point out my inevitable mistakes, and make suggestions if you have ideas for improvements.

Back up your data. It doesn’t matter how, so long as it’s complete, current and actually tested. More of your life than you think is on your computer. Think about someone being interviewed after a house fire — if all the people and pets are safe, their big worry is usually things like family photo albums. You have files on your computer that are analogous to — and in some cases, literally are — those photo albums. And they depend on a complicated, delicate machine that was mass-produced as cheaply as possible.

Protecting your files requires conscious action on your part. It isn’t hard, but if you don’t take it, you’re living your digital life on borrowed time.