You have a processing job which dumps a bunch of output files in a directory. A downstream job uses these files as input: it sees the new directory and pulls in all the files it contains.
Because S3 was not strongly consistent, the downstream job could see an arbitrary subset of the files for a short while after they were created, not just the oldest ones. This could cause your job to silently skip input files unless you provided some sort of manifest of everything expected in that batch. So you'd load the manifest, then keep retrying until all the input files showed up on whatever S3 node you were hitting.
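The retry loop described above can be sketched roughly like this. `list_keys` stands in for whatever listing call you'd actually use (e.g. a wrapper around `s3.list_objects_v2`); here it's a hypothetical callable, and the example simulates eventual consistency with a fake listing that reveals one more file per call:

```python
import time

def wait_for_manifest_files(list_keys, manifest, timeout_s=300, poll_s=5):
    """Poll until every key named in the manifest is visible, or time out.

    list_keys: callable returning the currently visible keys (hypothetical,
    would wrap an S3 listing in practice).
    """
    deadline = time.monotonic() + timeout_s
    missing = set(manifest)
    while True:
        missing -= set(list_keys())
        if not missing:
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still missing {len(missing)} files")
        time.sleep(poll_s)

# Simulated eventually consistent store: each listing reveals one more file.
visible = []
pending = ["batch/part-0000", "batch/part-0001", "batch/part-0002"]

def fake_list():
    if pending:
        visible.append(pending.pop(0))
    return list(visible)

ok = wait_for_manifest_files(
    fake_list,
    ["batch/part-0000", "batch/part-0001", "batch/part-0002"],
    poll_s=0,
)
```

The key point is that the loop terminates on the manifest, not on whatever the listing happens to return, so a partial listing just means another iteration rather than a silently incomplete batch.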
If this is an issue, one workaround is to wait X seconds after a file has been created before processing it, giving the listing time to become consistent.
A better idea would be to trigger a Lambda on each S3 object-created event, which either processes the file directly or pushes it onto an SQS queue, with each queued message then processed by another Lambda.
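A minimal sketch of the SQS-fed Lambda in that pipeline, assuming the standard S3-to-SQS-to-Lambda event shape (each SQS record's body is a JSON-encoded S3 event notification); `process_object` is a hypothetical placeholder for the real work:

```python
import json

def handler(event, context=None):
    """Lambda entry point for an SQS-triggered function.

    Each SQS record wraps an S3 event notification; pull out the
    bucket/key pairs so each new file can be processed individually.
    """
    processed = []
    for sqs_record in event["Records"]:
        s3_event = json.loads(sqs_record["body"])
        for rec in s3_event.get("Records", []):
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]
            # process_object(bucket, key)  # real processing would go here
            processed.append((bucket, key))
    return processed

# Example invocation with a hand-built event in the S3->SQS shape.
sample_event = {"Records": [{"body": json.dumps({"Records": [
    {"s3": {"bucket": {"name": "my-bucket"},
            "object": {"key": "batch/part-0000"}}}
]})}]}
result = handler(sample_event)
```

Since each file arrives as its own event, nothing ever depends on a consistent directory listing, which is what made this pattern attractive before S3 became strongly consistent.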
2. If you just take the maximum recorded delay, add say 20% padding, and wait that long before processing every dataset, the blanket wait could be detrimental to performance.
The example I gave happened to the team I was on in 2017/2018. We had thousands of files totaling terabytes of data in a given batch. The 90th percentile time to consistency was in the low tens of seconds; the 99th percentile was measured in minutes. The manifest-plus-retry approach avoids having to put a sleep(5 minutes) in front of every batch just to cover the 1% of cases.