
Yes, https://github.com/GoogleCloudDataproc/hadoop-connectors

disclosure: I work at Google as well.



Right, https://github.com/GoogleCloudDataproc/hadoop-connectors/rel... was apparently the release:

> Delete metadata cache functionality because Cloud Storage has strong native list operation consistency already.

If folks are actually interested in these connectors, I'd also recommend this blog post from last year:

https://cloud.google.com/blog/products/data-analytics/new-re...

because even with consistency, GCS and S3 still aren't filesystems :).
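To make the "not a filesystem" point concrete: object stores expose a flat key namespace, so a "directory" is just a shared key prefix, and a directory rename is a copy+delete per object rather than one atomic metadata operation. A minimal sketch against a toy in-memory store (nothing here is real GCS/S3 API, purely an illustration):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy object store: a flat key -> bytes map, as in GCS/S3.
// "Directories" exist only as a key-prefix naming convention.
public class FakeObjectStore {
    private final Map<String, byte[]> objects = new TreeMap<>();

    public void put(String key, byte[] data) { objects.put(key, data); }

    // A POSIX rename is one atomic metadata update. Against a flat
    // object namespace it has to be a copy+delete per object under the
    // prefix, so a crash midway leaves the "directory" half-renamed.
    public int renamePrefix(String from, String to) {
        int moved = 0;
        for (String key : new TreeMap<>(objects).keySet()) {
            if (key.startsWith(from)) {
                objects.put(to + key.substring(from.length()), objects.remove(key));
                moved++;  // each object is a separate, non-atomic step
            }
        }
        return moved;
    }

    public boolean exists(String key) { return objects.containsKey(key); }
}
```

That per-object loop is why connector "rename" cost scales with the number of objects, and why strong list consistency fixes staleness but not atomicity.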


Do you have a link to the commits that removed the code? It'd be good to see what kind of complexity strong consistency can make redundant.


(Phone reply, sorry for the brevity)

Just comparing to the previous release:

https://github.com/GoogleCloudDataproc/hadoop-connectors/com...

there are some big deletions like in:

https://github.com/GoogleCloudDataproc/hadoop-connectors/com...

and

https://github.com/GoogleCloudDataproc/hadoop-connectors/com...

Igor would know more :)


In the case of Hadoop's S3 connector, this could eliminate this entire directory, plus its tests, plus a bunch of hooks in the main code: https://github.com/apache/hadoop/tree/trunk/hadoop-tools/had.... There's an argument in favor of keeping it in case other S3-compatible stores need it (though you'd still need DynamoDB or some equivalent), and because it makes metadata lookups much faster than S3 scans, which helps query-planning performance. But I imagine even fewer people will take that trade-off now that Amazon S3 itself is consistent.
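For anyone unfamiliar with what that directory of code was buying: the general idea behind such a metadata layer is to record every mutation in a strongly consistent side table and merge it with the backend's (possibly stale) LIST results, so a reader sees its own writes before the backend catches up. A hedged sketch of that pattern, not the actual Hadoop or connector code (all names here are made up):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of a consistency layer over an eventually consistent LIST:
// mutations are tracked in a strongly consistent side table (DynamoDB
// in the real design; a local set here) and merged into listings.
public class ListingCache {
    private final Set<String> backendListing;   // what a stale LIST returns
    private final Set<String> recentPuts = new TreeSet<>();
    private final Set<String> recentDeletes = new TreeSet<>();

    public ListingCache(Set<String> backendListing) {
        this.backendListing = backendListing;
    }

    public void recordPut(String key)    { recentPuts.add(key); recentDeletes.remove(key); }
    public void recordDelete(String key) { recentDeletes.add(key); recentPuts.remove(key); }

    // Union the stale backend results with tracked mutations so recent
    // writes appear and recent deletes disappear from the listing.
    public Set<String> list() {
        Set<String> result = new TreeSet<>(backendListing);
        result.addAll(recentPuts);
        result.removeAll(recentDeletes);
        return result;
    }
}
```

Once the backend LIST is strongly consistent, the merge step returns exactly what the backend already reports, which is why the whole layer (and its bookkeeping, failure modes, and extra datastore) becomes dead weight.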



