
Has anyone ever seen S3 behave eventually consistent? I have not seen a lot of eventual consistency in the real world but I wonder if I'm just working on the wrong problems?


Yes. In fact I spent months working on a system to work around all the problems it caused for Hadoop: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hado.... We had a lot of automated tests that would fail because of missing (or, more recently, out-of-date) files a significant portion of the time (maybe 20-30%). We believed that S3 inconsistency was to blame, but I always felt like that was the typical "blame flaky infra" cop-out. As soon as S3Guard was in place, all those tests were consistently green. It really was S3 inconsistency, routinely.

To be fair to S3, Hadoop pretending S3 is a hierarchical filesystem is a bit of a hack. But I had cases where new objects wouldn't be listed for hours. There's only so much you can do to design an application around that, especially when Azure and Google Storage don't have that problem.


Yes, a ton. The big issue for me was: Does key exist? -> No -> Upload object with key. Followed by: Get key -> Not found. More rarely, I'd see an upload of an object followed by an overwrite of that object, followed by a download that returned some combination of the two if the download was segmented (like AWS's CLI will do).

The first one happened often enough to cause problems, the second one was a fairly rare event, but still had to be handled.
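One common mitigation for that first race was simply to retry the read with backoff until the object became visible. A rough sketch (the `fetch` callable and the timings are hypothetical, not a real client API):

```python
import time

def get_with_retry(fetch, key, attempts=5, base_delay=0.1):
    """Retry a read until the object is visible, with exponential backoff.

    `fetch(key)` should return the object body, or None on a 404.
    Raises KeyError if the object never becomes visible in time.
    """
    for attempt in range(attempts):
        body = fetch(key)
        if body is not None:
            return body
        # back off before the next probe: 0.1s, 0.2s, 0.4s, ...
        time.sleep(base_delay * (2 ** attempt))
    raise KeyError(key)
```

The backoff matters: tight polling loops against a key that was just written tended to keep hitting the same stale path.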

It might have been more of an issue with "large" buckets. The particular bucket where I had to dance around the missing-object issue had ~100 million fairly small objects. I ended up having to create a database to track which objects exist, so no consumer would ever ask S3 whether a not-yet-visible object existed.

Time to revisit all that mess, I suppose.


Yes, we have. Specifically:

  1. LIST
  2. PUT
  3. LIST

would trigger situations where (3) wouldn't include the object inserted in (2). This is well-known however.
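For that LIST-after-PUT gap, a typical mitigation was to poll the listing until the new key appeared, with a deadline so a genuinely lost write still surfaces as an error. A rough sketch (the `list_keys` callable is hypothetical):

```python
import time

def wait_for_listing(list_keys, expected_key, timeout=30.0, interval=0.5):
    """Poll an eventually consistent listing until `expected_key` shows up.

    `list_keys()` returns the current set of keys. Returns True once the
    key is listed, or False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if expected_key in list_keys():
            return True
        time.sleep(interval)
    return False
```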


Also:

  1. HEAD key -> 404
  2. PUT key  -> 200
  3. GET key  -> 404 (what? But I just put it!)
This is commonly used for "upload file if it doesn't exist".


Even more confusing, the 404 on GET is caused by doing the HEAD when the object doesn't exist. Without the previous HEAD, the GET would usually succeed.
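The described behavior can be modeled with a toy store in which a read of a missing key primes a negative (404) cache that briefly shadows a later write. This sketch is purely illustrative, not how S3 was actually implemented:

```python
class NegativeCachingStore:
    """Toy model: probing a missing key caches the 404, so a read
    shortly after the first PUT of that key can still return 404."""

    def __init__(self):
        self._objects = {}
        self._negative_cache = set()

    def head(self, key):
        if key not in self._objects:
            self._negative_cache.add(key)  # the 404 gets cached
            return False
        return True

    def put(self, key, body):
        self._objects[key] = body          # the write itself succeeds...

    def get(self, key):
        # ...but a previously cached 404 can shadow it for a while
        if key in self._negative_cache:
            self._negative_cache.discard(key)  # clears on a later read
            return None
        return self._objects.get(key)
```

In this model, skipping the HEAD (and just issuing the PUT) avoids priming the cache, which matches the observation above that the GET usually succeeded without the prior existence check.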


Yes. Writing objects from EMR Spark to S3 buckets in a different account. Immediately setting ACLs on the new objects fails because the objects are not yet "there". Using EMRFS Consistent View was previously the obvious solution here, but it adds significant complexity.


EMRFS consistent view is a mess. Changes to S3 files are likely to break EMRFS, the metadata needs to be kept in sync (otherwise the table keeps growing forever) using tools that are a pain to install, and when I used it to run Hive queries from a script, it didn't work.


This news is the death of EMRFS consistent view. The only thing it has left is fast listings of "directories" via its DynamoDB table.


Yeah, I never had that problem either. But I wonder how they are addressing the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem


Sacrificing availability I reckon.


Yup, the correct answer in the face of a network partition is for the PUT to return 500, even though the file really was uploaded. The API handler can't prove success, due to the network partition, so the correct move is to declare failure and let the client activate their contingencies.
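Since a full-object PUT is idempotent, the client-side contingency can be as simple as retrying on a server error: if the first attempt actually landed despite the 500, the retry just rewrites the same bytes. A sketch with a hypothetical `put` callable that returns an HTTP status code:

```python
def put_with_retry(put, key, body, attempts=3):
    """Retry a PUT on server errors.

    `put(key, body)` returns an HTTP status code. Returns True once a
    200 is observed, False if every attempt fails. Safe because
    rewriting identical bytes under the same key is harmless.
    """
    for _ in range(attempts):
        status = put(key, body)
        if status == 200:
            return True
    return False
```

A production version would back off between attempts and distinguish 5xx (retryable) from 4xx (not), but the core contingency is just "declare failure, let the client try again".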


I've never seen the minutes long delays that you were theoretically supposed to accommodate, but second-scale delays were pretty common in my team's experience. (We're writing a lot of files though, probably hundreds or thousands per second aggregated across all our customers.)


Yeah, I got seconds, but people were saying there were minutes of delay... the max I ever saw was 3 seconds on a 2 GB file...


I ran across issues where files that we PUT in S3 were still not visible in Hive 6+ hours later. It only happened a few times out of hundreds of jobs per day for jobs running daily the last 2-3 years, but it was annoying to deal with when it happened because there wasn't a lot we could do beyond open tickets with AWS.


All the time when using EMR with S3. Really frequent, and really annoying to mitigate.


Yes. I believe newer regions have had read-after-write consistency for a while, but you needed to use the right endpoint.

I remember a time when, if you were using the us-standard "region" and were unlucky, it could take 12 hours for your objects to become visible.


Yes.

I've observed eventual consistency with S3 prior to this change where it took on the order of several hundred milliseconds to observe an update of an existing key.

I've observed this as recently as 2 years ago.


There used to be a behavior with 404 caching that was pretty easy to trigger: if you race requesting an object and creating it, subsequent reads could continue to 404.


Yes, it caused us some pain with multipart uploads that we sort of proxied, them being multipart from the client to us as well.


Yes. When you're writing terabytes of data files it was very obvious.

That's why manifest files became so popular.
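A manifest sidesteps listing inconsistency entirely: the writer records the exact set of files it produced, and readers trust that single object instead of an eventually consistent LIST. A minimal sketch (the JSON layout and the `put`/`get` signatures are illustrative, not any particular framework's format):

```python
import json

def write_manifest(put, prefix, data_keys):
    """Record the data files' keys in a single manifest object.

    Readers fetch `<prefix>/manifest.json` and GET exactly those keys,
    never relying on a LIST that may be missing recent writes.
    """
    manifest_key = f"{prefix}/manifest.json"
    put(manifest_key, json.dumps({"files": sorted(data_keys)}).encode())
    return manifest_key

def read_manifest(get, prefix):
    """Return the list of data keys recorded in the manifest."""
    body = get(f"{prefix}/manifest.json")
    return json.loads(body)["files"]
```

The remaining race is the manifest itself, but a single read-after-write of a brand-new key was far more reliable than listing many recent keys.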


I think it was already strongly consistent within regions?

like if you had tried to read a non-existent key, then wrote to it, it might continue to appear to not exist for a minute?


Their previous consistency model was read-after-write (RAW), but not read-after-update (RAU), in most regions; I think us-east-1 was missing RAW consistency for quite a while.

After a write, you would _always_ be able to read the key you just wrote.

After an update, you could get a stale copy of the key if your subsequent read hit a different server.
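That RAW-but-not-RAU behavior can be modeled with a toy two-copy store where the first write of a key replicates synchronously but updates lag behind on the replica. Purely illustrative; this is not S3's actual architecture:

```python
class RawNotRauStore:
    """Toy model of RAW-but-not-RAU consistency: first writes are
    immediately visible everywhere, but an update may be served stale
    from a lagging replica."""

    def __init__(self):
        self._primary = {}
        self._replica = {}  # lags behind on updates

    def put(self, key, body):
        if key not in self._primary:
            # first write: replicated synchronously (read-after-write)
            self._replica[key] = body
        self._primary[key] = body  # updates land on the primary only

    def get(self, key, from_replica=False):
        store = self._replica if from_replica else self._primary
        return store.get(key)
```

A read routed to the replica after an update returns the old bytes, which is exactly the "subsequent read hit a different server" case described above.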


Not quite _always_. There were some documented caveats... the one I hit before was: Read nonexistent key, followed by a write of that same key, and then a subsequent read could return a stale read saying it didn’t exist. (even though it had just been written for the first time.)

Anyway I am glad to see these gaps and caveats have been closed.



