(NOTE: This is the first of a three-part series on setting up a cloud-based backup system, and describes the rationale behind the various technologies selected for the system. The second describes how to set up and use the backup system, while the third describes setting up the backups to run automatically on a schedule.)
This is a series of three posts that describe how to set up a modern, performant, efficient, and economical cloud-based backup system on your machine. There are two famous proverbs on data backup. The first goes: “you are only as good as your last backup”. The second is much more cynical: “amateurs talk backup, professionals talk restoration”. The title of this series merges the insight of these two proverbs of the modern digital age, and adds another layer: I used the term “resurrection” to obliquely reference Richard Morgan’s Altered Carbon series (“resleeve” would be a more direct reference) because I have always felt these novels capture so well the visceral experience of losing data between backup cycles.
My Backup Requirements
I wanted a fast, reliable, incremental, versioned backup system that did not put too much stress on either my machine or my mind/taskload, and that allowed for backup to what must be one of the most popular destinations on the planet: the “Cloud”. (And yes, I snuck the word “versioned” in there because, despite the whimsical title of this series of posts, I actually want access not just to the last snapshot of my data but to older ones as well, in case there were files that I inadvertently deleted or corrupted in some of the more recent backups. As an added bonus, this would also be good insurance against the danger of a Rawling 4851 attack …)
I do use Git!
A whole lot!
The vast majority of my working and personal life is under Git.
But Git is a version control system, not a backup system. The differences are real, particularly in how large files that do not change often (or at all, once created) are handled. For example, the RAW data for my photographs is not under revision control. Same goes for all of my collected media (books, audiobooks, movies, documentaries, etc.). And, for that matter, neither is my email (the post-2012 email is all on IMAP servers anyway, so the mail that I would be interested in saving is mainly the pre-2012, non-IMAP mail, which is static).
Furthermore, the scoping is different as well. I wanted something that I could just point at a source to back it up. Perhaps more importantly, if I needed to recover the entire data system, I did not want to trawl through all my remotes, cloning them one by one. And, not to forget, not all my Git-revisioned projects are stored on remotes (all the important/active ones are, it is true, but there are many historical or merely experimental/prototype ones that are not).
I did use CrashPlan!
In fact, I still have an active CrashPlan account, which I use to back up my spouse’s Mac. But after moving (back) to Linux and installing the CrashPlan program, I ran into trouble with inotify watch exhaustion, caused by CrashPlan monitoring a huge number of files. Now, this was easily solved by increasing the inotify limit, but now that I had some insight into how things worked under the hood, I was uncomfortable with having such a resource hog running in the background monitoring all my files. Instead, it seemed to me from my perusal of the options that there were a number of really nice, really performant open source solutions that gave me everything CrashPlan did for about the same price for one machine, with costs that stayed the same as I added three or four more machines, and that, more importantly, were much more resource-friendly and free of vendor lock-in.
I do use rsync!
It is certainly one part of my backup strategy: I have an external SSD drive that I periodically plug in, and then run something like `rsync -a --delete --progress ~/Content /media/whome/backup/backup`. Change this destination to some online storage provider, wrap it up in a little script, stick it in a `cron` job, and all done, right? While this is a relatively easy and robust solution, giving immediate access to my data with no post-processing required, what I do not have is (space-efficient) rollback/versioning: the ability to, e.g., restore a file that might have been deleted two weeks ago, accidentally or otherwise, knowingly or otherwise. We could rsync to a different subdirectory each time to maintain history, but then things get ridiculously large very quickly.
Now, there actually is a really elegant trick called “rsync snapshots”, which creates hard links from each successive snapshot to its older counterpart when the file already exists in the archive. In some circumstances, this gives you some of the savings of the “deduplication” offered by the new generation of backup programs, though, as it comes at file-level rather than subfile-chunk granularity, it is not quite as efficient. More problematically, the space savings are lost when you rename a higher-level directory. It otherwise works well enough for most people most of the time, and, in fact, I will venture to put a large prior on you, too, using and liking this system if you use or have recently used a Mac, as this is the basis of how Time Machine works.
rsnapshot essentially wraps the “rsync + hard links” trick that I describe above into a nice little self-contained, full-featured program that I would strongly encourage you all to check out, as it might be all that you really need. And “Why not rsnapshot (for a backup system)?” is a little harder to answer than the above. Frankly, there might not even be a good answer: depending on your use case, rsnapshot might work just fine, and bring with it a few advantages over the system(s) discussed below. One “disadvantage” is data privacy: if you have any worry that your data might be stolen, then it is worth noting that one of the major things that the new generation of backup systems have over rsync-based ones is integrated (native) encryption or some other form of data security (along with deduplication and, in some cases, built-in compression). Now, it seems to me this is easy enough to wrap into an rsync pipeline, though it adds extra hassle, processing, and fragility. But more to the point, for me, is that I really do not care all that much about the privacy of most of the data that I am backing up: the vast majority of my work is open source anyway, I publish all my photographs on Flickr, and the few things that I do want to keep private (e.g., passwords and credit cards) are all in encrypted files anyway.
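For reference, a minimal rsnapshot configuration implementing the above might look like this sketch (the paths are examples, and note that rsnapshot requires the fields to be separated by actual tab characters, not spaces):

```
# Sketch of a minimal rsnapshot.conf (fields MUST be tab-separated)
snapshot_root	/backup/snapshots/
retain	daily	7
retain	weekly	4
backup	/home/whome/Content/	localhost/
```

With this in place, a cron job running `rsnapshot daily` maintains a rolling week of hard-linked snapshots.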
So, why not rsnapshot? The answer here, for me, is performance and, here is that word again, (subfile-level) deduplication. There are indications that rsnapshot does not scale well to large numbers of files, and I can confirm this from personal experience: while an rsnapshot update of some 650GB spread across thousands of files took over 4 hours (mostly spent indexing), the solution I describe below took literally less than five minutes. As for subfile-level vs. file-level deduplication: I frequently reorganize my directories and files, and with rsnapshot, even a single letter-case change in a directory name will result in the entire subtree being seen as different from the previous revision, with a potentially massive amount of new-but-not-new data being written to the next snapshot. Just as much of a deal-breaker for me is that it does not support native syncing to common storage service providers that use protocols such as S3 or B2.
There are a number of “new generation” backup systems that crucially feature “deduplication”, which results in some very nice savings in space. As noted above, you can hack in some savings by using the hard link trick with rsync, but deduplication proper uses subfile chunks, so that if you have a very large file with just a few blocks changed (think: email inbox), the subsequent backup will not incur the cost of duplicating the entire file. Same goes for a directory rename: this will be seen as a new entry by the rsync trick, but not by the new breed of backup programs. On top of all that, the new generation of backup systems are highly, highly performant, and have a lot of nice built-in features, like mounting the backups, etc.
Initially, I had thought that I would use Borg: it was very popular, well-supported, robust, efficient, performant, had an attractive and powerful API, and, I will admit it, had a really cool name. Unfortunately, it only supports SSH for writing to remotes and requires that the remote have “borg” installed, so if you wanted to back up to a third-party service, such as Backblaze or Wasabi, you needed to maintain a local repository and then rely on yet another tool, like [git-annex](https://git-annex.branchable.com/) or [rclone](https://rclone.org/), to do the upload. This gave the system a few too many moving parts for me. Honestly, this lack of (native) support for various storage service providers was the one thing that proved a deal-breaker for me.
[restic](https://restic.net/) is a very close competitor to “borg”, with all of its advantages except, IMHO, the cool name. In addition, however, it natively supports writing to a wide variety of backends, including not only Backblaze and Wasabi but also, among others, Amazon S3 and Google Cloud Storage. So “restic” it was …
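As a taste of what is to come in Part 2, pointing restic at an S3-compatible backend takes just a couple of commands. The bucket name below is a placeholder, and the credentials come from your provider’s console:

```shell
# Placeholder credentials and bucket name -- substitute your own (see Part 2).
export AWS_ACCESS_KEY_ID="<wasabi-access-key>"
export AWS_SECRET_ACCESS_KEY="<wasabi-secret-key>"
export RESTIC_PASSWORD="<repository-encryption-password>"

# One-time repository initialization on Wasabi's S3-compatible endpoint:
restic -r s3:s3.wasabisys.com/my-backup-bucket init

# Thereafter, each backup is a single command:
restic -r s3:s3.wasabisys.com/my-backup-bucket backup ~/Content
```

(For Backblaze B2, the repository string is of the form `b2:bucketname:path` instead, with `B2_ACCOUNT_ID` and `B2_ACCOUNT_KEY` set in the environment.)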
The Storage Service
“restic” is the backup technology that we will use. But we need somewhere to actually store the backup data. The cloud is the trendy (and, I think, also a good) answer.
But which cloud?
As noted, with “restic” you have your choice of backend services. In terms of pricing, the contenders came down to Backblaze and Wasabi. Their actual storage prices are almost identical, BUT: (a) Wasabi has a 1TB minimum monthly charge at their USD 4.99 per TB per month rate (Backblaze has no minimum, so you pay for what you store); and (b) Wasabi has no egress charges, whereas Backblaze does.
My current backup size is about 0.6TB, so with Wasabi I would definitely be paying for storage that I did not immediately need. On the other hand, this is offset by Backblaze’s egress and other transport charges. Egress charges are fees for downloads. With a backup system, you might think that this is a relatively rare case, with most transactions being uploads (which are free for both services)? But you also incur egress fees when, e.g., checking the integrity of the backup or downloading/viewing single files (both of which, I think, require an entire snapshot to be unrolled).
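To make the tradeoff concrete, here is a back-of-the-envelope monthly cost comparison using the numbers above. The Backblaze egress rate and the monthly download volume are my own illustrative assumptions, not quoted prices, so check the current price sheets before relying on this:

```shell
# Rough monthly cost model. ASSUMPTIONS: Backblaze charges roughly the same
# per-TB storage rate, egress costs USD 0.01/GB, and we download ~10 GB per
# month for spot-checks -- all placeholders for illustration.
awk 'BEGIN {
    size_tb     = 0.6     # my current backup size (from above)
    rate        = 4.99    # USD per TB per month
    egress_gb   = 10      # assumed monthly downloads (GB)
    egress_rate = 0.01    # assumed USD per GB egress (Backblaze only)

    wasabi    = 1.0 * rate                            # 1 TB minimum, no egress
    backblaze = size_tb * rate + egress_gb * egress_rate
    printf "Wasabi:    USD %.2f/month\n", wasabi
    printf "Backblaze: USD %.2f/month\n", backblaze
}'
```

On these assumed numbers, Backblaze would come out slightly cheaper until the backup grows toward Wasabi’s 1TB minimum, or until egress starts to pile up.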
In any case, Wasabi offers a free 1TB trial without a need for a credit card, and that led me to commit to trying out Wasabi for now. However, in what follows, I also provide usage examples for Backblaze, because that seems to be the more popular one and I may eventually go there if I am unhappy with Wasabi for any reason.
- Part 2: how to set up the backups and the backup systems.
- Part 3: scheduling the system to run for automatic backups.
- Why not “rsnapshot”:
- The ‘rsync + hard links’ trick:
- Storage providers:
- restic(backup) vs. borg(backup):