I recently decided to jump into the object storage revolution (yeah, I’m a little late). The drive comes partly from some very old archives I’d like to store offsite, and partly from wanting to streamline how I deploy applications that have things like data directories and databases needing backup.

The Customary

Lately, through my work at Arbor and my own personal dabbling, I’ve come to love the idea that a service may depend on one or more containers to function. For example, this blog relies on three containers linked together and sharing volumes.

Additionally, one repository defines everything that is needed for scriptthe.net: my docker-compose.yml defines runtime configuration, secret keys, etc., while the code, including scripts and configuration files, lives in the same repository, with a Dockerfile to build it all.
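If the shape of that is unclear, here’s a rough sketch expressed as plain docker run commands rather than the compose file itself; every name and image below is made up for illustration, and the real wiring lives in the repository’s docker-compose.yml.

```bash
# Hypothetical three-container layout: a data-only volume container,
# a database, and the application linked to both.
docker create --name blog-data -v /srv/blog busybox          # data-only volume container

docker run -d --name blog-db \
  -e POSTGRES_PASSWORD=changeme \
  postgres                                                   # database container

docker run -d --name blog-app \
  --link blog-db:db \
  --volumes-from blog-data \
  -p 80:8080 \
  myorg/blog                                                 # the application itself
```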

This is great and all, but when thinking about larger scale, including larger data sets and databases, how does one handle backing those up in a timely fashion? Especially if the application is containerized, how would you handle a push- or pull-based strategy? Some technical questions around this include:

Why do I need to make my postgres container available on the network so some backup system can pg_dump and save backups for me?

Why do I need to install some special client in my container to send backups somewhere, handle compression and encryption, and trim them after 15 days?

The Contemporary

In considering my options, it occurred to me that letting an application handle its own backups is the right way to go. Getting at containers, whether it be their data or their databases, is hard. Even with the cool Consul DNS bridge and WeaveWorks, trying to gain private access to a container’s resources is hacky. Instead, why not run a companion container whose sole purpose is to back up the assets of the service?

Using some great source images, I’ve come up with a few fairly generic Docker images which can run alongside your service containers. The first, s3-archive, just needs a volumes-from and a variable DATADIR to define the directory to back up. It will tar, compress, encrypt, and send the resulting archive to S3. The second, postgres-s3-archive, does exactly the same but utilizes pg_dumpall to safely export all databases and then compresses, encrypts, and sends the result to S3.
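To make the pattern concrete, here is a sketch of how such companions might be started. The only knobs taken from the description above are --volumes-from and DATADIR; the service names, AWS credential variables, and the way the database companion reaches postgres are assumptions for the sake of the example.

```bash
# Hypothetical service container whose /data volume needs archiving.
docker run -d --name myapp -v /data myorg/myapp

# Companion archiver: mounts the service's volumes and is told which
# directory to tar, compress, encrypt, and ship to S3.
docker run -d --name myapp-archive \
  --volumes-from myapp \
  -e DATADIR=/data \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  s3-archive

# Database companion: reaches the postgres container (here via a legacy
# link) and uses pg_dumpall before compressing, encrypting, and uploading.
docker run -d --name db-archive \
  --link mydb:postgres \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  postgres-s3-archive
```

The point of the design is that the service container never learns anything about S3, credentials, or schedules; all of that lives in the companion.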

For those out there who run Stash in production and would like a safer route to backing up their instance, I’ve made a stash-backup-client image that conveniently takes in configuration via variables and follows the same process as my other s3-archive images. This process is quite safe, as Stash actually enters Maintenance Mode for the duration of the backup. This ensures that neither the database nor the data storage is actively being utilized (open locks, etc.), which is always the preferred way to back up.
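For illustration only, an invocation might look something like the following; every variable name here is hypothetical, since the real ones are documented with the image.

```bash
# Hypothetical run of the stash-backup-client companion image.
# All environment variable names below are made up for illustration.
docker run -d --name stash-backup \
  -e STASH_URL=https://stash.example.com \
  -e STASH_USER=backup \
  -e STASH_PASSWORD=changeme \
  -e S3_BUCKET=my-backups \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  stash-backup-client
```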

The Experience

So this is cool: you’ve now got backups going to S3 automatically, at configurable intervals, highly compressed, and encrypted (both client- and server-side). The last step in making this beautiful setup work is handling expiration of older backups.

First, let me explain Versioning. With S3, you can enable Versioning at the bucket level. This means that every single version of every file is tracked and stored. This is super useful, even when you upload the same file multiple times: each upload gets a unique version ID. Using this unique value and the s3api, you can pull back any specific upload you’ve ever made to S3.
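As a quick AWS CLI sketch (the bucket and key names are placeholders), enabling Versioning and retrieving a particular version looks roughly like this:

```bash
# Turn on Versioning for the bucket (bucket name is hypothetical).
aws s3api put-bucket-versioning \
  --bucket my-backups \
  --versioning-configuration Status=Enabled

# List every stored version of the objects under backups/,
# including their unique version IDs.
aws s3api list-object-versions --bucket my-backups --prefix backups/

# Retrieve one specific upload by its version ID (taken from the
# list-object-versions output above).
aws s3api get-object \
  --bucket my-backups \
  --key backups/blog-2015-06-01.tar.gz.gpg \
  --version-id "$VERSION_ID" \
  blog-2015-06-01.tar.gz.gpg
```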

With Versioning enabled, which the previous-version rules of Lifecycle Policies depend on, you open up a world of possibilities relating to the current and previous versions of your files.

Let’s look at a pseudo-example:

Any current file in backups/ can be expired after x days. Any previous version of a file can be transitioned to Glacier after y days.
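A minimal sketch of that rule pair via the AWS CLI might look like this; the bucket name and day counts are placeholders, not a recommendation.

```bash
# Lifecycle configuration: expire current objects under backups/ after
# 30 days, and move previous (noncurrent) versions to Glacier after 7 days.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "backup-rotation",
      "Filter": { "Prefix": "backups/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 },
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 7, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-backups \
  --lifecycle-configuration file://lifecycle.json
```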

I won’t delve too much deeper into how this all works, but basically, any way you want it, you can pretty much get it.

Finale

One last thing I’d like to mention: S3 is not the only object storage service doing it right! Keep in mind that Google has a pretty strong offering, especially when it comes to archival through Google Nearline versus Amazon Glacier (just take a look at the restore times).

Mario Loria is a builder of diverse infrastructure with modern workloads on both bare-metal and cloud platforms. He's traversed roles in system administration, network engineering, and DevOps. You can learn more about him here.