Cloud Availability Zones: 3 is the Magic Number

(originally published on LinkedIn)

"The Cloud" presents a range of new concepts and requires new ways to think about architecture, risk and redundancy. Pretty much all major cloud providers urge we all "design for failure", but there are some myths about what this really means and what effect it can have on you.

Cloud providers work in "availability zones" (AZs). Essentially, in a given region/location there will be one or more AZs, and you can spread your services across them. This is intended to give you some protection against the loss of an entire AZ by the cloud provider.

The misstep...

As a general rule, cloud providers say you should always architect your application over three AZs. That way, if an AZ fails you still have two more left, which shouldn't overly affect your application.

We often see architects new to the cloud look at these sorts of statements and think that having three of everything is excessive ("that's just them trying to make money!"). In response, they'll architect with two AZs instead, thinking that's a suitable compromise between risk and cost. After all, they only have two on-prem datacentres, so surely this is enough, right?

A move to the cloud means you don't have to settle for some of the compromises you make when you build out on-prem.

The myths: 1. You need three of everything

Here is where the myths and legends of the cloud start to take over. Firstly, designing your architecture over three AZs almost certainly costs either nothing more, or only a comparatively tiny amount more, than doing so over two AZs. Designing for three AZs doesn't mean you need three of everything - in fact, far from it. For many use-cases, running two VMs (or whatever other resource) in two different AZs is more than enough redundancy. What three AZs means is that you could move one of your servers to a different AZ if you need to. This is a really important point.

An AZ failure may take down one of your two servers. They're in a redundant pair, so you don't actually experience any downtime, but you do lose your redundancy. You can recover from that quite easily by simply recreating your server in the third AZ (indeed, if your server is in an Auto Scaling Group, then the cloud should do this for you completely automatically).
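To make the recovery step concrete, here is a minimal Python sketch of the decision involved. It is not any provider's API: the function name `rebuild_redundancy`, the server names and the AZ labels are all made up for illustration, and it only models where a replacement server should go.

```python
from typing import Dict, List


def rebuild_redundancy(
    placement: Dict[str, str], all_azs: List[str], failed_az: str
) -> Dict[str, str]:
    """Return a new server -> AZ mapping with servers moved out of failed_az.

    Hypothetical helper: each displaced server goes to a healthy AZ that
    does not already hold one of the surviving servers, so the pair stays
    split across two AZs.
    """
    healthy = [az for az in all_azs if az != failed_az]
    new_placement = dict(placement)
    for server, az in placement.items():
        if az != failed_az:
            continue  # this server survived the failure
        # AZs already occupied by the other (surviving or re-placed) servers
        used = {
            a for s, a in new_placement.items()
            if s != server and a != failed_az
        }
        candidates = [a for a in healthy if a not in used]
        if not candidates:
            # This is the two-AZ trap: there is nowhere left to rebuild.
            raise RuntimeError("no spare AZ available to restore redundancy")
        new_placement[server] = candidates[0]
    return new_placement


# A redundant pair in AZs "a" and "b", with "c" designed in but unused.
print(rebuild_redundancy({"web-1": "a", "web-2": "b"}, ["a", "b", "c"], "a"))
```

With three AZs in the design, the displaced server lands in "c". Run the same call with only `["a", "b"]` and the sketch raises an error, which is the situation described next.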

If you don't have three AZs, then you're in quite a bit of trouble right now. The AZ may take days, weeks or months to recover - and in some cases, may never recover (the cloud provider may just choose to close it entirely and build out a new AZ in that region - what was a,b,c may just become a,c,d). As cloud providers will tell you, there's no SLA for AZ recovery.

To recover from an AZ failure when you only designed for two, you're going to need to entirely rework your architecture to remove the failed AZ and add in a new one. You might think you're not bothered about a few hours of work in the unlikely event you lose an AZ. But this is likely to be tricky work that only your most experienced cloud engineers can realistically do, and there's a high chance it'll have some knock-on effect on the remaining service. Your application experts will almost certainly have to be involved too, which means cross-discipline co-ordination - by no means impossible, but many organisations struggle with this sort of thing. There isn't really a playbook for how to do any of this, so you'll be making it up as you go along. Good luck with that.

The myths: 2. You should spread all your services over all three AZs

Just because you have three AZs doesn't automatically mean you should routinely use all three of them (especially if you're using two of everything). To understand this, we have to look at the maths of the situation. Let's say there's a one in a million chance of any given AZ failing in the next month, and that AZs fail independently of each other. Using just one AZ means your chance of being hit by an AZ failure is 1:1000000. Using two AZs means it's roughly 2:1000000, and so on. Using more AZs increases the likelihood that you'll be hit by a failure (of course, if you're spread over multiple AZs, you can recover from a failure easily, but the chances of experiencing one are actually higher).
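The arithmetic above can be checked with a few lines of Python. This is only a sketch: the one-in-a-million figure is the illustrative number from the text, not a real failure rate, the function names are made up, and the calculation assumes AZs fail independently of each other.

```python
def p_any_az_fails(p_single: float, n_azs: int) -> float:
    """Chance that at least one of n_azs fails, assuming independent failures."""
    return 1.0 - (1.0 - p_single) ** n_azs


def p_all_azs_fail(p_single: float, n_azs: int) -> float:
    """Chance that every one of n_azs fails at once, assuming independence."""
    return p_single ** n_azs


P_SINGLE = 1e-6  # illustrative monthly failure chance for one AZ

for n in (1, 2, 3):
    print(
        f"{n} AZ(s): hit by some failure = {p_any_az_fails(P_SINGLE, n):.3e}, "
        f"lose everything = {p_all_azs_fail(P_SINGLE, n):.3e}"
    )
```

The first figure grows almost linearly with the number of AZs in use, which is the point being made here. The second shrinks dramatically, which is why you still want your pair split over two AZs and not sitting in one.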

Assuming your application isn't big enough to warrant three of everything, spread two of everything over exactly two AZs (tools like Terraform make this easy, and once you start doing it, you're going to find it hard to do it wrong). The third AZ should stay completely empty - it's only there for you to be able to rebuild your redundancy if one of your two primary AZs fails.

Possibly the worst mistake you can make here is to (say) put a database in one AZ and the app server(s) that use it in different AZs. Loss of a single AZ will effectively guarantee you a problem. You're better off putting the database and some of the app servers in the same AZ, with the remaining app servers in a second AZ. Ideally of course you'd have a multi-AZ database, but if funds don't allow then maybe use a read-only replica in the second AZ (which you can promote to a master if you ever need to).
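One way to compare the two layouts is to ask, for each AZ in the region, what losing it would do to you. The Python sketch below does that under a deliberately simple assumption: a tier is "up" as long as at least one of its instances survives. The tier names, AZ labels and function names are hypothetical.

```python
from typing import Dict, List

# Tier name -> the AZ each instance of that tier runs in (illustrative model).
Layout = Dict[str, List[str]]


def impact_of_az_loss(layout: Layout, lost_az: str) -> str:
    """Classify the effect of losing one AZ on a layout."""
    lost_any = False
    for azs in layout.values():
        survivors = [az for az in azs if az != lost_az]
        if not survivors:
            return "outage"  # a whole tier is gone
        if len(survivors) < len(azs):
            lost_any = True  # tier survives, but with fewer instances
    return "degraded" if lost_any else "unaffected"


def exposure(layout: Layout, region_azs: List[str]) -> Dict[str, str]:
    """Impact of losing each AZ in the region, one at a time."""
    return {az: impact_of_az_loss(layout, az) for az in region_azs}


region = ["a", "b", "c"]
split = {"db": ["a"], "app": ["b", "c"]}      # database apart from its app servers
colocated = {"db": ["a"], "app": ["a", "b"]}  # database shares an AZ with an app server

print("split:    ", exposure(split, region))
print("colocated:", exposure(colocated, region))
```

In the split layout, every AZ in the region is one whose loss hurts you. In the colocated layout, the third AZ is untouched and available for rebuilding. Neither layout survives the loss of the database's AZ, which is the gap a multi-AZ database or a promotable replica closes.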

Conclusions

It turns out that the cloud providers know what they're talking about. When they advise we should all use three AZs, they're right. That doesn't mean you need to create three of everything - not at all, two will do in many cases. It just means your architecture should be able to work on three AZs, even if you don't routinely use the third one for anything. Failing to do so is gambling on failure - not a good place to go, and entirely unnecessary in the cloud.