Hacker News

Please tell me more about The Great Oops


It's inevitable that one of the major cloud providers will irrecoverably delete all customer data with one single fat-fingered command. Though in Google's case I'll also consider the prophecy fulfilled if they delete their own data.

It will forever be known as The Great Oops.


It's not inevitable, it's essentially impossible.

There are a few things that can cause tremendously widespread outages, essentially all of them network configuration changes. Actually deleting customer data is dramatically more difficult, to the point of being impossible - there are so many different services in so many different locations with so many layers of access control. There is no "one command" that can do such a thing - at the scale of a worldwide network of data centers there is no "rm -rf /".


Ah, but you fail to account for Google's incredible knack for building tools designed to do things at scale, or for putting AI in things that don't need it.

The possibility that Google will either unleash a malicious AI on its infrastructure, or develop a way to destroy a lot of data at scale quite efficiently, or manage some combination of the two is far from zero.

Bear in mind, this "Little Oops" should also have been impossible: https://www.techspot.com/news/103207-google-reveals-how-blan...


.....no?

"We deployed this private cloud with a missing parameter and it wasn't caught" is as different from "we wiped out all customer data" as hello world is from Kubernetes.

No one promised this "should be impossible". Did you confuse that with "we'll take steps to ensure this never happens again"?


It's pretty much half the puzzle, actually.

You contend there's no global rm -rf for a global cloud provider, but clearly a missing parameter can rm -rf a customer in an irrecoverable manner.

The only half you're missing is... how every major cloud outage happens today... a bad configuration update. These companies have hundreds of thousands of servers, but they also use orchestration tools to distribute sets of changes to all of them.

You only need a command that can rm -rf one box if you're distributing that command to every box.

Now sure, there are tons of security precautions and checks and such to prevent this! But pretending it's impossible is delusional. People do stupid stuff, at scale, every day.

The most likely scenario is a zero day in an environment necessitating an extremely rapid global rollout, combined with a plain, simple error.
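The amplification being argued about here can be sketched as a toy fleet push. Everything below is hypothetical (the function names, the `retention_days` parameter, the fleet shape are all made up for illustration, not any real orchestration tool): one config change with one omitted field, fanned out by automation, wipes every box it touches.

```python
# Hypothetical sketch, not any real orchestration tool: a single missing
# parameter in one config push becomes fleet-wide data loss.

def apply_config(server, config):
    # Assume each server treats a missing retention value as "retain nothing".
    if config.get("retention_days") is None:
        server["data"] = {}  # the destructive "cleanup" fires on this box
    return server

def push_to_fleet(fleet, config, validate=False):
    # The guard that would have stopped the rollout -- if it were enabled.
    if validate and config.get("retention_days") is None:
        raise ValueError("refusing rollout: retention_days unset")
    return [apply_config(s, config) for s in fleet]

fleet = [{"id": i, "data": {"rows": 100}} for i in range(1000)]
bad_config = {}  # the fat-fingered change: one parameter simply omitted

wiped = push_to_fleet(fleet, bad_config)
assert all(s["data"] == {} for s in wiped)  # every box, one push
```

The point of the sketch is that the blast radius comes from the distribution layer, not from the command itself: the same one-line mistake that ruins one machine ruins all of them when the push is uniform and unvalidated.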


And the most telling thing about most of these outages is that the provider later admits in their postmortem that they didn't really understand how the system they built worked until it fell over and they were forced to learn how it really works.

It's the sort of thing that used to keep me up at night.


When was the last time it wasn't a cascading failure caused by Rube Goldberg levels of interdependency on their own systems?


The release process, monitoring checks, etc. for a customer's private cloud are generally significantly different from the release process for a global product. I'm not going to get any more specific for all the standard NDA reasons, but having worked for Google and Microsoft, among others... no, the risk you describe doesn't translate from one to the other.


Do you not remember CrowdStrike?


Again: an outage caused by a config change is different from data loss.

The remediation was painful but it was not data loss.


What if a machine was supposed to be running to capture data?


Yet.


I understand you believe the checks cannot fail that catastrophically, and I agree that the likelihood they do is quite low.

But it can happen, and it only has to happen once. (Also FYI, telling me your work history just tells me you've drunk the koolaid, ain't proof you know more.)


Delete a decryption key. Good luck! I'll see you at the end of time.

Break your control plane, and you can't stop the propagation of poison.

Propagate the wrong trust bundle... everywhere.

Also, it's not about the delete command. It's about the automatic cleanup following behind it that shreds everything, or repurposes the storage.


Children of the kubernetic line.


Cyclic infrastructure dependencies suck :(


Google accidentally deleted customer location history data from customer devices (after intentionally deleting it from Google servers) just last year.

If you didn't back it up yourself, it's gone forever.


That seems unlikely. Is Google run by one Homer Simpson?


Yes.


I don’t know if you’re being serious, but that’s laughable.


The idea that all customer data will be deleted is far-fetched, but I feel like there have been some massive incidents. CrowdStrike comes to mind, but I feel it's entirely possible that Apple/Google/etc. could push out some kind of config update that bricks phones in a way that prevents them from downloading another update to fix them.

Though I'm sure the major players are all over this risk which is why it hasn't happened.


Google wiped all of UniSuper by mistake not too long ago; I don't see why such an occurrence couldn't happen more widely.



