It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are making (or will make) to both the software and operational procedures to ensure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).
I feel bad for the person who made the mistake. Even though it's obviously a systemic problem, and highly unlikely to be an act of negligence, I'm sure he/she doesn't feel too hot right now.
It's operations. You fuck up, you suck it up, you fix it, then (and this is the important part) you prevent it from ever happening again. Feeling like shit for bringing something down is a good way to give yourself depression, given how often you will screw the pooch with root. In the same vein, anybody who says they'd fire the operator without any qualification on that remark should be given a wide berth.
People tend to forget that "fixing it" isn't just technical, it involves process, too. Every new hire that whines about change control and downtime windows would be the first to suggest them, were they troubleshooting the outage that demonstrated the need.
Nonsense. Someone has to be operating at the sharp end of the enable prompt, and sooner or later it'll be 0330 and that person will type Ethernet0 when they meant Ethernet1, whatever management you have in place.
When that happens, you do just what Joyent did here: you send out an embarrassed email to customers, everyone else in the ops team gets a few cheap laughs at the miscreant's expense, you have a meeting about it, discuss lessons learned, and you move on.
Everyone screws up. Everything goes down once in a while. This is why you build in redundancy at every level.
I've seen generally brilliant people get bitten by bad process. The worst example was an important hard drive being wiped thanks to a lack of labeling, obviously taking a production server down with it.
Other things that have caused outages: lack of power capacity planning, unplugging an unrelated test server from the network (go go gadget BGP), cascading backup power failure, building maintenance taking down AC units, expensive equipment caching ARP replies indefinitely… the list goes on.
I had my own fun fuckup too. I learned SQL on PostgreSQL, and had to fix a problem with logged data in a MySQL database. Not trusting myself, I typed "BEGIN;" to enter a transaction, ran my update, and queried the table to check my results. I noticed my update did more than I expected, so I entered "ROLLBACK;" only to learn that MyISAM tables don't actually implement transactions.
Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!
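For anyone who wants a guard against that particular surprise, here is a minimal sketch of checking a table's storage engine before trusting ROLLBACK. It assumes a DB-API cursor from a MySQL driver that uses the `%s` parameter style; the helper names (`table_engine`, `rollback_is_safe`) are made up for illustration, and the set of transactional engines is deliberately limited to InnoDB, so extend it only for engines you have verified yourself.

```python
# Hypothetical helpers: check whether a MySQL table's storage engine
# supports transactions before relying on BEGIN ... ROLLBACK.

# Assumption: only InnoDB is treated as transactional here. MyISAM, the
# engine from the anecdote above, ignores BEGIN/ROLLBACK.
TRANSACTIONAL_ENGINES = {"innodb"}

# information_schema.TABLES exposes each table's storage engine.
ENGINE_QUERY = (
    "SELECT ENGINE FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s"
)


def table_engine(cursor, schema, table):
    """Return the storage engine name reported for schema.table."""
    cursor.execute(ENGINE_QUERY, (schema, table))
    row = cursor.fetchone()
    if row is None:
        raise LookupError(f"no such table: {schema}.{table}")
    return row[0]


def rollback_is_safe(cursor, schema, table):
    """True only if the table's engine is in the known-transactional set."""
    engine = table_engine(cursor, schema, table)
    # ENGINE can be NULL (for example, for views); treat that as unsafe.
    return (engine or "").lower() in TRANSACTIONAL_ENGINES
```

Calling `rollback_is_safe(cursor, "logs", "events")` before the UPDATE (schema and table names here are placeholders) would return False for a MyISAM table, which is the cue to take a backup or copy the table first instead of counting on ROLLBACK.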
Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.
Why are you using a throwaway account? Ohh, I just saw the "dollars or Yen" remark. TIL we use throwaway accounts for the times we feel like being assholes, so the non-elites can't track it back to our physical neuroprocessors.
I'm not sure why I'm dignifying this Reddit drivel with a response, but my karma and account age should be your hint that you're barking up the wrong tree.
"check my previous responses and my credit score for how you should treat me" ohh what an old-man response. It's too bad the Imgur "downvote everything they ever posted" script doesn't work here on HN, now isn't it?
My account's karma and history exceed your account's on this site, and even worse, this individual comment bears more value than yours! Ooh burn!
Haha, this is a pretty spectacular amount of cognitive dissonance you're demonstrating here.
Let's post-mortem this lunacy:
1) You misinterpret jsmthrowaway's initial comment as vaguely racist (or something), notice the word "throwaway" in his user name and get really excited that you can stand on your high horse and call him out for hiding behind the shield of internet anonymity when he wants to be a (you think) racist idiot. Even though his comment is a legitimate currency conversion remark. See: https://www.google.com/search?q=1+usd+in+yen
2) He explains that clearly you're mistaken (which, really, seriously, what a kneejerk response from someone who just wanted to show how clever they were, calling out an "asshole") and further explains his account isn't even a "throwaway" in the traditional, trolly sense of the word, citing his account age and the fact that he regularly actively posts to the account.
3) You, caught perhaps in a moment of clarity, though I think I give you too much credit here, realize you were too eager to pounce on the "asshole" for his "Yen" remark, and perhaps you misread it. Your latent erection fading, you counter by explaining that your history and karma are even _more_ impressive, somehow completely avoiding taking responsibility for a completely nonsensical leap in logic and accusation of wrongdoing, while doubling down on your cognitive dissonance.
Please keep in mind that "price" means "how many dollars other humans are willing to trade for it right now"; not necessarily any concrete evaluation of the device's functionality compared to a human competitor or human operator...
That sounds so awful. I can't imagine living the rest of my life knowing that I had been a net negative in the world. All of my life's earnings would just be a partial restitution of that one second of destruction.
As a request: It looks like each time the status page is updated, the old UPDATE: <words> is removed. For the future, it would be great if the older updates were preserved so that people looking back could understand the chain of events, rather than just seeing the first / last pieces.
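To make the request concrete, here is a rough sketch of an append-only update log, where posting a new update never replaces an earlier one. Everything in it (`StatusUpdate`, `IncidentLog`, the rendered line format) is hypothetical and is not based on how the actual status page is built.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class StatusUpdate:
    """One immutable entry in an incident's history."""
    posted_at: datetime
    text: str


@dataclass
class IncidentLog:
    """Append-only list of updates for a single incident."""
    title: str
    updates: list = field(default_factory=list)

    def post(self, text, now=None):
        # Append rather than overwrite, so earlier updates stay visible.
        stamp = now or datetime.now(timezone.utc)
        update = StatusUpdate(stamp, text)
        self.updates.append(update)
        return update

    def render(self):
        # Oldest first, so readers can follow the chain of events.
        lines = [self.title]
        for u in self.updates:
            lines.append(f"UPDATE {u.posted_at:%Y-%m-%d %H:%M} UTC: {u.text}")
        return "\n".join(lines)
```

Rendering the full list each time, instead of only the latest entry, is the whole point: someone reading afterwards sees every intermediate update in order.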