We (Twilio) have released a number of articles & presentations in this area:...

caseysoftware · on Oct 22, 2012

And Netflix's Simian Army is awesome:

http://techblog.netflix.com/2011/07/netflix-simian-army.html

They've even released "Chaos Monkey" as open source: http://techblog.netflix.com/2012/07/chaos-monkey-released-in...

wyck · on Oct 22, 2012

Yet Netflix is currently down, so maybe there is a problem with the chimp army?

bermanoid · on Oct 22, 2012

The problem is likely the same as usual: if the damn control plane is down, it doesn't matter how robust your failover architecture is, because your requests to bring up new machines go unanswered.

There's pretty much no way to architect around that one as an AWS user (apart from going fully multi-cloud, but "nobody" actually does that, at least at scale), and I'm kind of shocked that those bits of AWS are still not robust against "single AZ outages", given that they're involved in pretty much every one of these incidents and make them affect people on the entire cloud...

koide · on Oct 22, 2012

> apart from going fully multi-cloud, but "nobody" actually does that, at least at scale

Pirate Bay might disagree with that sentence: http://torrentfreak.com/pirate-bay-moves-to-the-cloud-become...

taligent · on Oct 22, 2012

Apparently iCloud is multi-cloud (AWS and Azure).

But regardless, it's not like all of EC2 went down, just one or two AZs. So why couldn't traffic be migrated transparently to other AZs/regions?

nathannecro · on Oct 22, 2012

Netflix is not down for me here (Eastern US).

caseysoftware · on Oct 22, 2012

I'm watching Netflix from Austin, so it's not entirely down in the US.

j45 · on Oct 22, 2012

Sweet, thanks for sharing!