Understanding Amazon EC2 failures and redundancy

I’ve been looking after several websites on Amazon EC2, and I’d like to share some thoughts about how Amazon EC2 can fail.

Only two instance types run on dedicated physical servers: c1.xlarge and m2.4xlarge. All other instance types run on servers shared with other customers.

If you start multiple instances for redundancy, there’s a good chance some of them will land on the same physical server, which defeats the purpose. For example, I’ve just tried launching eight instances, and three of them landed on the same physical server (as confirmed by traceroute; a sketch of the check follows).
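
For illustration, here is roughly how that check can be scripted. This is a hypothetical sketch: the internal IPs are made up, and it assumes classic EC2 networking, where a traceroute between instances showed the physical host (dom0) as the hop just before the destination.

```python
import subprocess

# Hypothetical internal IPs of freshly launched instances.
instances = ["10.16.2.3", "10.16.7.8", "10.16.9.4"]

def physical_host(ip):
    """Return the penultimate traceroute hop, which on classic EC2
    was the physical host (dom0) of the target instance."""
    lines = subprocess.run(
        ["traceroute", "-n", ip], capture_output=True, text=True
    ).stdout.splitlines()
    # Last line is the instance itself; the hop before it is the host.
    return lines[-2].split()[1] if len(lines) >= 2 else None

hosts = {ip: physical_host(ip) for ip in instances}
print(hosts)  # duplicate host values => instances share a server
```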

Therefore, to avoid a single point of failure, you will most likely need to start instances in more than one Availability Zone.
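
A minimal sketch of doing this with the boto3 SDK (the AMI ID, instance type, and zone names below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spread instances across zones so that no single physical server
# (or data center) holds every copy of the service.
for zone in ["us-east-1a", "us-east-1b"]:
    ec2.run_instances(
        ImageId="ami-12345678",   # placeholder AMI ID
        InstanceType="m1.small",  # placeholder instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```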

EBS durability

EBS does not provide true durability, because its write-back caches mean an acknowledged write may not yet have reached stable storage. Some people even claim that’s an advantage.

Replicating data to a second EBS volume in the same Availability Zone will not prevent disasters:

  • In a data center failure, multiple EBS volumes can get corrupted simultaneously.
  • There is no guarantee that two EBS volumes won’t land on the same physical hardware.

The only solution is to replicate data to another Availability Zone.
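
One way to get data out of a single zone is periodic EBS snapshots: they are stored in S3 at the region level, so a snapshot taken in one zone can be restored as a volume in another. A boto3 sketch, with placeholder IDs and zones (note that snapshots are point-in-time copies, not continuous replication):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshots live in S3 and are regional, not tied to one zone.
snap = ec2.create_snapshot(
    VolumeId="vol-12345678",      # placeholder volume ID
    Description="cross-AZ backup",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Restore the snapshot as a fresh volume in a different zone.
ec2.create_volume(
    SnapshotId=snap["SnapshotId"],
    AvailabilityZone="us-east-1b",
)
```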

The devil is in the details

Having a perfectly running system today does not guarantee you’ll be able to run it tomorrow.

In case of a forced server shutdown by Amazon, a hardware failure, or a stop/start of an EBS-backed instance, the following may happen:

  • The physical server may change. It’s possible to get a different CPU model with different extensions (e.g. an E5507 may change to an E5430, or even to an Opteron).
  • The Xen version may change. For example, 3.0.3-rc5-8.1.14.f may become 3.4.3-2.6.18.

There is no guarantee that your OS will support these changes.
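
One cheap mitigation is to record the environment at boot, so that a changed CPU model or hypervisor version shows up in your logs rather than as an unexplained crash. A hypothetical sketch for Xen-based instances (the sysfs paths below are exposed on Xen guests, but verify them on your kernel):

```python
def read_first(path, default=""):
    try:
        with open(path) as f:
            return f.readline().strip()
    except OSError:
        return default

def cpu_model():
    # "model name" line from /proc/cpuinfo, e.g. an Intel Xeon E5507.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return "unknown"

# On Xen guests the hypervisor version is exposed under /sys/hypervisor.
xen = (read_first("/sys/hypervisor/version/major", "?") + "."
       + read_first("/sys/hypervisor/version/minor", "?")
       + read_first("/sys/hypervisor/version/extra"))

print(f"cpu={cpu_model()} xen={xen}")
```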

Also, some Linux kernel bugs manifest only when running on EC2.

Conclusion

Creating reliable systems on EC2 can be hard, because critical services may need to span multiple Availability Zones. Most of the time such systems will run fine, but extensive sysadmin skills may be required when things go wrong.

P.S.

Taking the factors above into account, here are two common setup mistakes:

  • Hadoop HDFS with replication factor set to 2 for reliability
  • Database replication for reliability with all database servers in the same Availability Zone

Do you know more examples?