
We (Twilio) have released a number of articles & presentations in this area:

http://www.slideshare.net/twilio/highavailability-infrastruc...

http://www.twilio.com/engineering/2011/04/22/why-twilio-wasn...

They cover strategy rather than how-to, but the principles apply.



And Netflix's Simian Army is awesome:

http://techblog.netflix.com/2011/07/netflix-simian-army.html

They've even released Chaos Monkey as open source: http://techblog.netflix.com/2012/07/chaos-monkey-released-in...


Yet Netflix is currently down, so maybe there is a problem with the chimp army?


The problem is likely the same as usual: if the damn control plane is down, it doesn't matter how robust your failover architecture is, because your requests to bring up new machines go unanswered.

There's pretty much no way to architect around that one as an AWS user (apart from going fully multi-cloud, but "nobody" actually does that, at least at scale). I'm kind of shocked that those bits of AWS are still not robust against "single AZ outages", given that they're involved in pretty much every one of these incidents and make them affect people across the entire cloud...
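
To make that dependency concrete, here's a toy Python model. It is only a sketch and does not use any real AWS API: `Cloud`, `Zone`, `launch` and `fail_over` are all made-up names for illustration. The assumption is that traffic can still reach machines that are already running, while launching new ones needs the control plane.

```python
from dataclasses import dataclass


class ControlPlaneDown(Exception):
    """Raised when the (hypothetical) provisioning API is unreachable."""


@dataclass
class Zone:
    name: str
    healthy: bool = True
    running: int = 0  # instances already up in this zone


@dataclass
class Cloud:
    zones: list
    control_plane_up: bool = True

    def launch(self, zone_name: str, count: int) -> None:
        # Assumption: bringing up new machines needs the control plane.
        if not self.control_plane_up:
            raise ControlPlaneDown("launch request went unanswered")
        for zone in self.zones:
            if zone.name == zone_name:
                zone.running += count

    def serving_capacity(self) -> int:
        # Assumption: machines that are already running keep serving.
        return sum(z.running for z in self.zones if z.healthy)


def fail_over(cloud: Cloud, needed: int) -> int:
    """Try to get back to `needed` instances; return what we end up with."""
    shortfall = needed - cloud.serving_capacity()
    if shortfall > 0:
        target = next((z for z in cloud.zones if z.healthy), None)
        if target is not None:
            try:
                cloud.launch(target.name, shortfall)
            except ControlPlaneDown:
                pass  # the failover logic ran fine, but no new machines arrive
    return cloud.serving_capacity()


cloud = Cloud([Zone("az-a", running=4), Zone("az-b", running=4)])
cloud.zones[0].healthy = False   # one AZ goes out
cloud.control_plane_up = False   # and the control plane goes with it
print(fail_over(cloud, needed=8))  # prints 4: only the surviving AZ's capacity
```

With `control_plane_up` left at `True`, the same call gets back to 8, which is the whole point: in this model the failover code is identical in both cases, and the only difference is whether the launch request is answered.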


> apart from going fully multi-cloud, but "nobody" actually does that, at least at scale

Pirate Bay might disagree with that sentence: http://torrentfreak.com/pirate-bay-moves-to-the-cloud-become...


Apparently iCloud is multi-cloud (AWS and Azure).

But regardless, it's not like all of EC2 went down, just one or two AZs. So why couldn't traffic be migrated transparently to other AZs/regions?


Netflix is not down for me here (Eastern US).


I'm watching Netflix from Austin, so it's not entirely down in the US.


Sweet, thanks for sharing!



