OutputCommitter suitable for S3 workloads. Unlike the usual FileOutputCommitter, which
writes files to a _temporary/ directory before renaming them to their final location, this
committer writes directly to the final location.
The FileOutputCommitter is required for HDFS when speculation is enabled, because HDFS allows
only one writer at a time for a file (so two task attempts racing to write the same file
would not work). S3, however, supports multiple writers outputting to the same file, and
visibility of the written file is guaranteed to be atomic. This is a monotonic operation: all
writers should be writing the same data, so which one wins is immaterial.
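
The difference between the two commit strategies can be sketched with a small simulation.
This is only an illustration, not the committer's actual code and not the Hadoop API: the
local filesystem stands in for the output store, and the names commit_via_rename,
commit_direct and demo are hypothetical. It shows the rename-based path through _temporary/
next to the direct path, and two attempts writing identical data to the same final file.

```python
import os
import tempfile


def commit_via_rename(final_dir, name, data, attempt_id):
    """Rename-based style: write under _temporary/, then move into place."""
    tmp_dir = os.path.join(final_dir, "_temporary", attempt_id)
    os.makedirs(tmp_dir, exist_ok=True)
    tmp_path = os.path.join(tmp_dir, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
    final_path = os.path.join(final_dir, name)
    # The commit step: the file only appears at its final path after the rename.
    os.replace(tmp_path, final_path)
    return final_path


def commit_direct(final_dir, name, data):
    """Direct style: write straight to the final location, with no rename step."""
    final_path = os.path.join(final_dir, name)
    with open(final_path, "wb") as f:
        f.write(data)
    return final_path


def demo():
    data = b"part-00000 contents\n"

    # One attempt committed through the temporary directory.
    with tempfile.TemporaryDirectory() as out:
        path = commit_via_rename(out, "part-00000", data, "attempt_0")
        with open(path, "rb") as f:
            renamed_bytes = f.read()

    # Two attempts (as with speculation) write the same bytes to the same path.
    # Because the data is identical, it does not matter which write lands last.
    with tempfile.TemporaryDirectory() as out:
        for _ in range(2):
            path = commit_direct(out, "part-00000", data)
        with open(path, "rb") as f:
            direct_bytes = f.read()
        listing = sorted(os.listdir(out))

    return renamed_bytes, direct_bytes, listing
```

Calling demo() returns the bytes read back from each strategy and the directory listing left
by the direct writes. The sketch runs its attempts one after another, so it does not model
concurrent writers or the atomic-visibility guarantee described above; it only shows why
identical data makes the winner irrelevant.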
Code adapted from Ian Hummel's code from this PR:
https://github.com/themodernlife/spark/commit/4359664b1d557d55b0579023df809542386d5b8c