CloudFront - one month on

After the first month of hosting most of my static content on CloudFront, I’ve been very impressed with the performance (which is, after all, the main reason for using a CDN).

There are a few minor irritations, though, and a few missing features which seem obvious to me:

rsync access

rsync would be much more efficient for uploading changes from a local working copy. S3Fox’s ‘sync’ feature largely fills this gap, and there’s probably a scriptable equivalent available somewhere.
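
Failing that, a minimal sync script along these lines should work, using the Python boto library (the bucket name and local directory here are made up). It relies on the fact that, for simple uploads, an S3 object’s ETag is the MD5 of its contents:

    import hashlib
    import os

    import boto

    conn = boto.connect_s3()  # reads AWS credentials from the environment
    bucket = conn.get_bucket('cdn.example.org')  # hypothetical bucket name

    def md5_of(path):
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()

    for dirpath, dirnames, filenames in os.walk('site'):
        for name in filenames:
            path = os.path.join(dirpath, name)
            keyname = os.path.relpath(path, 'site').replace(os.sep, '/')
            key = bucket.get_key(keyname)
            # Upload anything missing or changed since the last sync
            if key is None or key.etag.strip('"') != md5_of(path):
                bucket.new_key(keyname).set_contents_from_filename(
                    path, policy='public-read')
                print('uploaded ' + keyname)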

Flat namespace

There is no true directory or folder hierarchy, although most of the front-ends provide some simulation of this. In particular, this could become an issue with large numbers of files: on a regular file server, you might structure a million files in a two-level directory structure, with a hundred directories each containing a hundred sub-directories of a hundred files – in S3, you’d really just have a million objects in a single ‘bucket’ (S3’s top-level container). This might not bother S3 itself, but could be problematic for other tools.

It also complicates permissions a little: rather than granting permissions on a single parent, you need to modify the access control list of every single file. Granting or revoking access to that million-file collection would entail around two million metadata operations (one to fetch each ACL, another to write the new one back) – racking up something like $20 in AWS charges and probably taking many hours!
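
To make that cost concrete, the mass permission change would look something like this scripted with boto (hypothetical bucket name and prefix) – note the unavoidable two operations per object:

    import boto

    bucket = boto.connect_s3().get_bucket('cdn.example.org')  # hypothetical
    for key in bucket.list(prefix='photos/'):  # a 'directory' is just a key prefix
        policy = key.get_acl()       # metadata operation 1: fetch the ACL
        # ... add or remove grants on the policy here ...
        key.set_acl(policy)          # metadata operation 2: write it back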

Content listings

Permissions are a little different to the Unix hosting most of us will be familiar with. After uploading my first batch of content, I told S3Fox to grant anonymous read access recursively on my CloudFront bucket, as a quick way to make the objects I had just uploaded accessible. On a standard Unix system, you need X (execute) permission on a directory to access its contents; on S3, all you need is read access to the object (file) itself. Read access to the bucket itself actually enables access to an XML listing of the bucket’s contents, which you probably don’t want to expose.
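
In boto terms, the safe version of that recursive grant touches only the objects (hypothetical bucket name again); the commented-out line is the one which would expose the XML listing:

    import boto

    bucket = boto.connect_s3().get_bucket('cdn.example.org')  # hypothetical
    for key in bucket.list():
        key.set_canned_acl('public-read')   # each object becomes readable
    # bucket.set_canned_acl('public-read')  # would also publish the listing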

No DirectoryIndex

If you want a file like index.html served, you need to link to it explicitly: no linking to hostname/foo/ if you really want hostname/foo/index.html served, as you would expect from a standard web server. Having said that, uploading your HTML file with an object name of “foo/” may be sufficient: there are no directories in S3, but slashes are valid parts of object names instead. I’ll experiment further with this later and report back.
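
When I do experiment, it will probably be along these lines with boto (hypothetical bucket name; whether CloudFront then serves the page as hoped is exactly what needs testing):

    import boto

    bucket = boto.connect_s3().get_bucket('cdn.example.org')  # hypothetical
    key = bucket.new_key('foo/')  # the slash is just a character in the name
    key.set_contents_from_filename('index.html',
                                   headers={'Content-Type': 'text/html'},
                                   policy='public-read')
    # With luck, http://cdn.example.org/foo/ now serves this page.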

Expiry and versioning

Everything fetched by CloudFront has a minimum TTL of 24 hours, which you can increase by setting appropriate headers on the object in S3. Most CDNs set long TTLs for efficiency’s sake, although those, like CacheFly, which use push-based replication may not need to.

Unfortunately, the current version of S3Fox doesn’t give access to custom headers, but S3Hub is a good S3 client for Mac OS X which does – I have now set Expires headers dated 2019 and a 365-day TTL for caches on all the versioned objects on cdn.deadnode.org.
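
For anyone scripting uploads instead, the same headers can be set with boto at upload time – something like this (hypothetical bucket name, with the header values just described):

    import boto

    bucket = boto.connect_s3().get_bucket('cdn.example.org')  # hypothetical
    key = bucket.new_key('style-v4.css')
    key.set_contents_from_filename(
        'style-v4.css',
        headers={'Cache-Control': 'max-age=31536000',          # 365-day TTL
                 'Expires': 'Tue, 31 Dec 2019 23:59:59 GMT'},  # 2019 Expires
        policy='public-read')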

The trick here is to ensure that whenever an object changes, the URL by which it is referenced also changes. Embed in the URL either a hash of the object’s contents (MD5 or even CRC will do: the aim is to guard against near-contemporary versions of an object sharing a URL, not to provide actual “security”), a monotonic counter, or a timestamp.

I’m currently using the first two approaches: the CSS file for this site is presently style-v4.css, and will probably be replaced by style-v5.css at some point in the future; the banner image at the top is /1318a51d1a358a1aa49ccabaa7f48fad/header.jpeg, where the string of hex digits at the start is an MD5 hash of the file’s contents.
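
Generating those hashed URLs is straightforward – a simplified sketch (the object must, of course, also be uploaded under the matching hashed key):

    import hashlib

    def versioned_url(path, name):
        """Prefix the URL with an MD5 hash of the file's contents."""
        with open(path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        return '/%s/%s' % (digest, name)

    print(versioned_url('header.jpeg', 'header.jpeg'))
    # e.g. /1318a51d1a358a1aa49ccabaa7f48fad/header.jpeg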

Cost control

One noticeable omission is any kind of cost control. Most hosting packages I’ve looked at recently include some threshold at which they will warn you before you rack up outrageous charges. AWS seems to ensure that a sudden deluge of hits won’t overload the servers – the notorious “Slashdot effect” which reduced so many suddenly-popular sites to a collection of error messages in years past – but leaves you open to a wallet overload instead. Having said that, the prices are low enough that even a relatively large video file receiving thousands of unexpected downloads should only cost tens rather than hundreds or thousands of dollars.
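
A back-of-envelope check on that claim, assuming a US transfer rate of roughly $0.17/GB (an assumption based on CloudFront’s launch pricing; check the current price list):

    size_gb = 0.1        # a fairly large 100 MB video file
    downloads = 2000     # a sudden flood of unexpected visitors
    rate_per_gb = 0.17   # assumed US transfer price in dollars
    print('$%.2f' % (size_gb * downloads * rate_per_gb))  # => $34.00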

Dollars

The prices above were in US dollars for two reasons: first, it’s the currency visitors to this site are most likely to understand; secondly, it’s currently the only currency Amazon charges in. You need to be aware of this if, like me, your cards are denominated in another currency, both because exchange-rate movements mean Amazon’s prices effectively fluctuate, and because I have occasionally been burned: one of my Visa cards applies a significant surcharge to all transactions in a foreign currency.

No Edge-Side Includes

CloudFront apparently serves static content only, while some CDNs – most notably Akamai – offer assembly of content at the edge. Using ESI – an open specification submitted to the W3C, to which Akamai, IBM and others contributed – you can build a dynamic page on the CDN’s servers, stitching fragments together with tags like <esi:include>, with all the static portions of the page (even the most dynamic page will still contain a large amount of static data) cached in the CDN rather than fetched from your own origin servers.

Referrer restrictions (hot-linking protection)

There also seems to be no way to stop hot-linking from racking up large bandwidth bills. Most of the cheaper CDN options seem to lack this, presumably because of the extra overhead involved, but it would be a valuable feature. At one point some years ago, I discovered that almost half my total bandwidth bill resulted from one person linking directly to an animation clip I hosted, using it as her avatar on a popular online forum – so every single viewer of every page she had posted on was downloading that fairly large clip from my server. At the time, fixing this was a one-line change to my web server configuration file; on S3/CloudFront, it could have been an expensive ongoing problem.

Experience so far

The hit rates so far seem quite impressive: from the S3 billing figures, well over 90% of my traffic is served from the CloudFront nodes’ caches rather than being fetched from S3. With clusters on three continents (one in each of Hong Kong and Japan, eight in the US and four in Europe), access should be very fast for almost everyone, although Australia still seems an unfortunate omission for now. It is very good value, particularly at the low end: a nice clear pricing structure with no commitment, but the ability to scale up to almost any level you might need without problems. I do recall S3 having had an embarrassing outage or two over the last year, but the aggressive caching by CloudFront nodes should mean almost any S3 downtime would not affect CloudFront: only newly-uploaded content being accessed for the first time during an outage would hit an error. For extra resilience, it should be possible to mirror content between CloudFront and another CDN, switching between them in DNS.

The future

S3Fox gave good service initially, apart from the lack of support for adding headers to objects – I’ve heard there is now a beta version with this feature, so I will revisit it once the next version is released. In the meantime, S3Hub gives me this ability. I’ll also be looking at some way to automate uploading new content as part of a deployment script.

ESI support would also be nice, as would hot-linking protection of some sort. It should be possible to achieve something similar to ESI using JavaScript on the client, which could even improve performance by caching the static components in the browser after fetching them from the CDN – with the obvious drawback of needing some sort of fallback for users without JavaScript.
