Hacker News

Please tell me more about The Great Oops


It's inevitable that one of the major cloud providers will irrecoverably delete all customer data with one single fat-fingered command. Though in Google's case I'll also consider the prophecy fulfilled if they delete their own data.

It will forever be known as The Great Oops.


It's not inevitable, it's essentially impossible.

There are a few things that can cause tremendously widespread outages, essentially all of them network configuration changes. Actually deleting customer data is dramatically more difficult, to the point of being impossible - there are so many different services in so many different locations with so many layers of access control. There is no "one command" that can do such a thing - at the scale of a worldwide network of data centers there is no "rm -rf /".


Ah, but you fail to account for Google's incredible knack for building tools designed to do things at scale, or for putting AI in things that don't need it.

The possibility that Google will either unleash a malicious AI on its infrastructure, or develop a way to destroy a lot of data at scale quite efficiently, or manage some combination of the two is far from zero.

Bear in mind, this "Little Oops" should also have been impossible: https://www.techspot.com/news/103207-google-reveals-how-blan...


.....no?

"We deployed this private cloud with a missing parameter and it wasn't caught" is as different from "we wiped out all customer data" as hello world is from Kubernetes.

No one promised this "should be impossible". Did you confuse that with "we'll take steps to ensure this never happens again"?


It's pretty much half the puzzle, actually.

You contend there's no global rm -rf for a global cloud provider, but clearly a missing parameter can rm -rf a customer in an irrecoverable manner.

The only half you're missing is... how every major cloud outage happens today... a bad configuration update. These companies have hundreds of thousands of servers, but they also use orchestration tools to distribute sets of changes to all of them.

You only need a command that can rm -rf one box if you're distributing that command to every box.

Now sure, there are tons of security precautions and checks and such to prevent this! But pretending it's impossible is delusional. People do stupid stuff, at scale, every day.

The most likely scenario is a zero day in an environment necessitating an extremely rapid global rollout, combined with a plain, simple error.
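The amplification being argued about here can be sketched as a toy fleet push. Everything below is hypothetical (the function names, the `retention_days` parameter, the fleet shape are all made up for illustration, not any real orchestration tool): one config change with one omitted field, fanned out by automation, wipes every box it touches.

```python
# Hypothetical sketch, not any real orchestration tool: a single missing
# parameter in one config push becomes fleet-wide data loss.

def apply_config(server, config):
    # Assume each server treats a missing retention value as "retain nothing".
    if config.get("retention_days") is None:
        server["data"] = {}  # the destructive "cleanup" fires on this box
    return server

def push_to_fleet(fleet, config, validate=False):
    # The guard that would have stopped the rollout -- if it were enabled.
    if validate and config.get("retention_days") is None:
        raise ValueError("refusing rollout: retention_days unset")
    return [apply_config(s, config) for s in fleet]

fleet = [{"id": i, "data": {"rows": 100}} for i in range(1000)]
bad_config = {}  # the fat-fingered change: one parameter simply omitted

wiped = push_to_fleet(fleet, bad_config)
assert all(s["data"] == {} for s in wiped)  # every box, one push
```

The point of the sketch is that the blast radius comes from the distribution layer, not from the command itself: the same one-line mistake that ruins one machine ruins all of them when the push is uniform and unvalidated.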


And the most telling thing about most of these outages is that the provider later admits in their postmortem that they didn't really understand how the system they built worked until it fell over and they were forced to learn how it really works.

It's the sort of thing that used to keep me up at night.


When was the last time it wasn't a cascading failure caused by Rube Goldberg levels of interdependency on their own systems?


The release process, monitoring checks, etc. for a customer's private cloud are generally significantly different from the release process for a global product. I'm not going to get any more specific for all the standard NDA reasons, but having worked for Google and Microsoft, among others... no, the risk you describe doesn't translate from one to the other.


Do you not remember CrowdStrike?


Again: an outage caused by a config change is different from data loss.

The remediation was painful but it was not data loss.


What if a machine was supposed to be running to capture data?


Yet.


I understand you believe the checks cannot fail that catastrophically, and I agree that the likelihood they do is quite low.

But it can happen, and it only has to happen once. (Also FYI, telling me your work history just tells me you've drunk the koolaid, ain't proof you know more.)


Delete a decryption key. Good luck! I'll see you at the end of time.

Break your control plane, and you can't stop the propagation of poison.

Propagate the wrong trust bundle... everywhere.

Also, it's not about the delete command. It's about the automatic cleanup following behind it that shreds everything, or repurposes the storage.


Children of the kubernetic line.


Cyclic infrastructure dependencies suck :(


Google accidentally deleted customer location history data from customer devices (after intentionally deleting it from Google servers) just last year.

If you didn't back it up yourself, it's gone forever.


That seems unlikely. Is Google run by one Homer Simpson?


Yes.


I don’t know if you’re being serious, but that’s laughable.


The idea that all customer data will be deleted is far-fetched, but I feel like there have been some massive incidents. CrowdStrike comes to mind, but I feel it's entirely possible that Apple/Google/etc. could push out some kind of config update that bricks phones in a way that prevents them from downloading another update to fix them.

Though I'm sure the major players are all over this risk which is why it hasn't happened.


Google wiped all of UniSuper by mistake not too long ago; I don't see why such an occurrence couldn't happen more widely.



