varnish, nginx, and S3

A while back I wrote a proxy for accessing an S3 bucket that transparently decrypts objects before serving them. It combines some things I’ve already talked about in what I consider interesting ways.

Disclaimer: The proxy is in no way production-worthy. There are much better S3 proxies out there with much better documentation. This post just documents a trick I came up with to avoid using a database.

Overview

It turns out that nginx is very happy to deal with broken symlinks, and it actually does the right thing. If you turn on directory listings, the symlink will appear in the listing, but when someone tries to get a broken symlink, nginx will return a 404 response. This is actually great, because if varnish is sitting in front of nginx, it can be configured to restart the request handling process with another backend whenever nginx returns a 404 response. In our case the other backend is the S3 proxy, a very basic sinatra application that handles downloading things from S3, decrypting them, and re-jiggering some files on the system in a safe way before returning a response. Once the file is on disk, the next time we get a request for the same file nginx can find it and return it directly, and we don’t have to go to the sinatra application.
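
The filesystem behavior this relies on is easy to see from Ruby. The following is a minimal sketch, not part of the proxy itself; the file name and link target are made up for illustration. A broken symlink shows up in a directory listing, but opening it fails with ENOENT, which is the condition nginx reports as a 404.

```ruby
require "tmpdir"

# Returns [listed, is_symlink, exists, open_error] for a dangling symlink
# created in a throwaway directory. All names here are hypothetical.
def dangling_symlink_facts
  Dir.mktmpdir do |dir|
    link = File.join(dir, "report.pdf")
    # The target does not exist, so the link is broken.
    File.symlink("key-name/path/to/s3/object", link)

    listed = Dir.children(dir).include?("report.pdf")
    is_symlink = File.symlink?(link)
    exists = File.exist?(link) # follows the link, so this is false

    # Opening the path follows the link and fails because nothing is there.
    open_error = begin
      File.read(link)
      nil
    rescue SystemCallError => e
      e.class
    end

    [listed, is_symlink, exists, open_error]
  end
end

p dangling_symlink_facts
```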

An extra hitch is that we don’t use a single key for the entire bucket but potentially a different key for each file. One way to do the decryption would be to keep track of which key was used to encrypt which file in a database somewhere. That is in fact the correct solution, because once we need to scale the service beyond one server, what I’m about to recommend would lead to more trouble than it’s worth.

Unsavory Hacks

Remember that broken symlink? Well, I just happen to encode the entire path of the S3 object that the file corresponds to in that symlink’s target, and part of the path is the reference to the key that was used to encrypt it, e.g. file -> key-name/path/to/s3/object. The sinatra application looks up the key in a key store somewhere and uses it to decrypt the file after it has been downloaded from S3 into a temporary location; the result is then atomically moved into place. So finding the key and getting the object is a constant-time operation: we just use readlink and split the result up just the right way to get the name of the key and the reference to the object in S3.
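
A minimal Ruby sketch of the lookup and the atomic swap might look like the following. It assumes the key name is the first path component of the symlink target, as in the example above. The names parse_link and replace_link_atomically are hypothetical, the 0644 mode is an assumption, and the download and decryption steps are left out entirely.

```ruby
require "tempfile"

# Split a symlink target such as "key-name/path/to/s3/object" into the key
# name and the S3 object path. Assumes the key name is the first component.
def parse_link(path)
  target = File.readlink(path)
  key_name, object_path = target.split("/", 2)
  if key_name.nil? || key_name.empty? || object_path.nil? || object_path.empty?
    raise ArgumentError, "unexpected link target: #{target.inspect}"
  end
  [key_name, object_path]
end

# Replace the symlink at `path` with a regular file holding `plaintext`.
# The temporary file is created in the same directory so the rename stays
# on one filesystem, where it is atomic and replaces the link itself.
def replace_link_atomically(path, plaintext)
  tmp = Tempfile.create([".incoming-", ".tmp"], File.dirname(path))
  tmp_path = tmp.path
  begin
    tmp.binmode
    tmp.write(plaintext)
    tmp.close
    # Tempfile creates files with mode 0600; widening it so a separate web
    # server user can read the result is an assumption of this sketch.
    File.chmod(0644, tmp_path)
    File.rename(tmp_path, path)
  rescue StandardError
    # Clean up the temporary file if anything failed before the rename.
    tmp.close unless tmp.closed?
    File.unlink(tmp_path) if File.exist?(tmp_path)
    raise
  end
end
```

With the link from the example, parse_link("file") would give ["key-name", "path/to/s3/object"], and a later request for the same path would find a regular file instead of a broken link.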

Conclusion

So next time you think you need a database, think about how the filesystem and symlinks can be used to avoid it. I’m just kidding. Please use a database, and do not encode references to remote resources in symlinks.