Pages

Sunday, December 30, 2012

Amazon S3 and Glacier: A Cheap Solution for Long Term Storage Needs

In the last few years, lots of cloud-based storage services began providing relatively cheap solutions to many classes of storage needs. Many of them, especially consumer-oriented ones such as DropBox, Google Drive and Microsoft SkyDrive, try to appeal their users with free tiers and collaborative and social features. Google Drive is a clear case of this trend, having "absorbed" many of the features of the well-known Google Docs applications, seamlessly integrating them into easy to use applications for many platforms, both mobile and desktop-oriented.

I've been using these services for a long time now, and despite being really happy with them, I've been looking for alternative solutions for other kinds of storage needs. As an amateur photographer, for example, I generate a lot of files on a monthly basis, and my long-term storage need for backup is currently in the tens of gigabytes per month. If I used Google Drive to satisfy those needs, supposing I'm already in the terabyte range, I'd pay almost $50 per month! Competitors don't offer seriously cheaper solutions either. At that price, one could argue that a decent home-based storage solution could be a better solution to his problems.

The Backup Problem

The problem is that many consumer cloud storage services are not really meant for backup, and you're paying for a service which keeps your files always online. On the other hand, typical backup strategies involve storing files in mediums which are kept offline, typically reducing the total cost of the solution. At home, you could store your files in DVDs, and keep hard disk space available for other tasks. Instead of DVDs, you could use hard drives as well. We're not considering management issues here (DVDs and hard drives can fail over time, even if kept off and properly stored) but the important thing to grasp here is that different storage needs can be satisfied by different kind of storage classes, to minimize the long-term storage costs of assets whose size is most probably only going to grow over time.

This kind of issues has been addressed by Amazon, which recently rolled out a new service for low-cost long-term storage needs: Amazon Glacier.

What Glacier Is Not

As soon as Glacier was announced, there has been a lot of talking about it. At a cost of $0.01 per gigabyte per month, it clearly seemed an affordable solution for this kind of problems. The cost of one terabyte would be $10 per month, 5 times cheaper than Google Drive, 10 times cheaper than DropBox (at the time of writing).

But Glacier is a different kind of beast. For starters, Glacier requires you to keep track of a Glacier-generated document identifier every time you upload a new file. Basically, it acts like a gigantic database where you store your files and retrieve them by key. No fancy user interface, no typical file system hierarchies such as folders to organize your content.

Glacier's design philosophy is great for system integrators and enterprise applications using the Glacier API to meet their storage needs, but it certainly keeps the average user away from it.

Glacier Can Be Used as a New Storage Class in S3

Even if Glacier was meant and rolled out with enterprise users in mind, at the time of release the Glacier documentation already stated that Glacier would be seamlessly integrated with S3 in the near future.

S3 is a cloud storage web service which pioneered the cloud storage offerings, and it's as easy to use as any other consumer-oriented cloud storage service. In fact, if you're not willing to use the good S3 web interface, lots of S3 clients for almost every platform exist. Many of them even let you mount an S3 bucket as if it were an hard disk.

In the past, the downside of S3 for backup scenarios has always been its price, which was much higher than that of its competitors: 1 terabyte costs approximately $95 per month (for standard redundancy storage).

The great news is that now that Glacier has been integrated with S3, you can have the best of both worlds:
  • You can use S3 as your primary user interface to manage your storage. This means that you can keep on using your favourite S3 clients to manage the service.
  • You can configure S3 to transparently move content to Glacier using lifecycle policies.
  • You will pay Glacier's fees for content that's been moved to Glacier.
  • The integration is completely transparent and seamless: you won't need to perform any other kind of operation, your content will be transitioned to Glacier according to your rules and it will always be visible into your S3 bucket.

The only important thing to keep in mind is that files hosted on Glacier are kept offline and can be downloaded only if you request a "restore" job. A restore job can take up to 5 hours to be executed, but that's certainly acceptable in a non-critical backup/restore scenario.

How To Configure S3 and Use the Glacier Storage Class

The Glacier storage class cannot be used directly when uploading files to S3. Instead, transitions to Glacier are managed by a bucket's lifecycle rules. If you select one of your S3 buckets, you can use the Lifecycle properties to configure seamless file transitions to Glacier:

S3 Bucket Lifecycle Properties

In the previous image you can see a lifecycle rule of a bucket of mine, which move content to Glacier according to the rules I defined. You can create as many rules as you need and rules can contain both transitions and expirations. In this use case, we're interested in transitions:

S3 Lifecycle Rule - Transition to Glacier

As you can see in the previous image, the afore-mentioned S3 lifecycle rule instructs S3 to migrate all content from the images/ folder to Glacier after just 1 day (the minimum amount of time you can select). All files uploaded into the images directory will automatically be transitioned to glacier by S3.

As previously stated, the integration is transparent and you'll keep on seeing your content into your S3 bucket even after it's been transitioned to Glacier:

S3 Bucket Showing Glacier Content

Requesting a Restore Job

The seamless integration between the two services don't finish here. Glacier files are kept offline and if you try to download them you'll get an error instructing you to initiate a restore job.

You can initiate a restore job from within the S3 user interface using a new Action menu item:

S3 Actions Menu - Initiate Restore

When you initiate a restore job for part of your content (of course you can select only the files you need), you can specify the amount of time the content will be kept online, before being automatically migrated to Glacier again:

S3 Initiation a Restore Job on Glacier Content

This is great since you won't need to remember to transition content to Glacier again: you simply ask S3 to bring your content online for the specified amount of time.

Conclusions

This post quickly outlines the benefit of storing a backup copy of your important content on Amazon Glacier, taking advantage of the ease of use and the affordable price of this service. Glacier integration in S3 enables any kind of users to take advantage of it without even changing your existing S3 workflow. And if you're new to S3, it's just as easy to use as any other cloud storage service out there. Maybe their applications are not as fancy as Google's, but their offer is unmatched today, and there are lots of easy to use S3 clients, either free or commercial (such as Cyberduck and Transmit if you're a Mac user), or even browser based S3 clients such as plugins for Firefox and Google Chrome.

Everybody has got files to backup, and many people is unfortunately unaware of the intrinsic fragility of typical home-based backup strategies, let alone users that never perform any kind of backups. Hard disks fail, that's just a fact, you just don't know when it's going to happen. And besides hard disk failures, other problems may appear over time such as undetected data corruption, which can only be addressed using dedicated storage technologies (such as the ZFS file system), all of which are usually out of range of many user, either for their cost or for their skill requirements for setup and management.

In the last 6 years, I've been running a dedicated Solaris server for my storage needs, and I bought at least 10 hard drives. When I projected the total cost of ownership of this solution I realised how Glacier would allow me to spare a big amount of money. And it did.

Of course I'm still keeping a local copy of everything because I sometimes require quick access to it, but I reduced the redundancy of my disk pools to the bare minimum, and still have a good night sleep because I know that whatever happens my data is still safe at Amazon premises. If a disk breaks (it happened a few days ago), I'm not worried about array reconstruction, because it's not an issue any longer, and I just use two-way mirrors instead of more costly solutions. I could even give up using a mirror altogether, but I'm not willing to reconstruct the content from Glacier every time a disk fails (and it's going to happen at least once every 2/3 years, according to my personal statistics).

So far, I never needed to restore anything from Glacier, but I'm sure that day will eventually come. And I want to be prepared. And you should want to as well.

P.S.: Ted Forbes has cited this blog post in Episode 118 (Photo Storage with Amazon Glacier and S3) of The Art of Photography, his excellent podcast about photography. If you still don't know it, you should check it out. Ted is an amazing guy and his podcast is awesome, with content that ranges from tips, techniques and interesting digressions on the art of photography. I've learnt a lot from him and I bet you will, too.