> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop
It's been a few years, but I lost customer data on gs:// due to listing consistency - create a bunch of files, list the dir, rename them one by one to commit the dir into Hive - the listing missed a file and the rename step skipped it.
Put in a breakpoint and the bug disappears; run the rename and the listing from the same JVM and the bug disappears.
Consistency issues are hell.
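To make the failure mode concrete, here is a hypothetical sketch (the toy filesystem and all names are mine, not the real GCS client): a mock store whose listings are eventually consistent, run through the same "create files, list dir, rename each to commit" protocol. A file the listing hasn't caught up to is silently left behind.

```python
class EventuallyConsistentFS:
    """Toy store: reads/renames are consistent, listings lag behind."""

    def __init__(self):
        self.files = {}        # path -> contents (reads are consistent)
        self.visible_at = {}   # path -> list-call index at which it appears
        self.list_calls = 0

    def create(self, path, data, lag=0):
        # lag models replication delay: the file exists, but the listing
        # index will not include it for the next `lag` list calls.
        self.files[path] = data
        self.visible_at[path] = self.list_calls + lag

    def list_dir(self, prefix):
        self.list_calls += 1
        return sorted(p for p in self.files
                      if p.startswith(prefix)
                      and self.visible_at[p] < self.list_calls)

    def rename(self, src, dst):
        self.files[dst] = self.files.pop(src)
        self.visible_at[dst] = self.visible_at.pop(src)


def commit(fs, staging, final):
    # The buggy protocol: trust a single listing and rename what it returns.
    for path in fs.list_dir(staging):
        fs.rename(path, final + path[len(staging):])


fs = EventuallyConsistentFS()
fs.create("/staging/part-0", b"a")         # already visible to listings
fs.create("/staging/part-1", b"b", lag=1)  # written, but not yet listed
commit(fs, "/staging/", "/final/")

committed = sorted(p for p in fs.files if p.startswith("/final/"))
stranded = sorted(p for p in fs.files if p.startswith("/staging/"))
print(committed)  # only part-0 made it into the committed dir
print(stranded)   # part-1 was silently left behind in staging
```

No error is raised anywhere, which is why a breakpoint (or doing the listing and rename from the same client, which sees its own writes) makes the bug vanish.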
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
Yes!
S3Guard was a complete pain to secure: it was built on top of DynamoDB, and since any client could be a writer for any file, you had to give all clients write access to the entire DynamoDB table, which was quite a security headache.
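To illustrate the scoping problem: because any S3Guard client could write metadata for any object, each client's IAM policy ended up granting write actions on the whole metadata table, roughly like the following (table name, region, and account ID are placeholders, and the exact action list is illustrative rather than the precise set the Hadoop docs required):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:UpdateItem",
      "dynamodb:DeleteItem",
      "dynamodb:BatchWriteItem",
      "dynamodb:Query"
    ],
    "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/s3guard-metadata"
  }]
}
```

DynamoDB has no per-item authorization that maps cleanly onto per-prefix S3 permissions, so a client with read-only access to one bucket prefix still held write access to every other client's metadata.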