Compacting hadoop partitions is not supported out-of-the-box by hadoop, as files need to be read with the correct format and written again. The following steps are used to compact partitions with Spark:

1. Check if a compaction is already in progress by looking for a special file "_SDL_COMPACTING" in the data object's root hadoop path. If it exists and is not older than 12h, abort compaction with an exception. Otherwise create/update the special file "_SDL_COMPACTING"; if the file is older than 12h, the previous compaction process is assumed to have crashed.
2. As step 5 is not atomic (delete and move are two operations), check for possibly incomplete compactions of previous crashed runs and fix them. Incomplete compactions are marked with a special file "_SDL_MOVING" in the temporary path. Incompletely compacted partitions must be moved from the temporary path to the hadoop path and marked as compacted (see step 5).
3. Filter already compacted partitions from the given partitions by looking for the "_SDL_COMPACTED" file (see step 5).
4. Rewrite the data of the partitions to be compacted into a temporary path under the data object's hadoop path.
5. Delete each partition to be compacted from the hadoop path and move it from the temporary path to the hadoop path. This is done one-by-one to reduce the risk of data loss. To recover from an unexpected abort between delete and move, a special file "_SDL_MOVING" is created in the temporary path before deleting the hadoop path; after moving the temporary path, this file is deleted again. Mark compacted partitions by creating a special file "_SDL_COMPACTED".
6. Delete the "_SDL_COMPACTING" file created in step 1.
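The steps above can be sketched as follows. This is a minimal illustration using the local filesystem in place of HDFS and plain directory moves in place of Spark rewrites; the function and parameter names (`compact`, `rewrite`, `finish_move`) are hypothetical, not part of any actual API:

```python
import shutil
import time
from pathlib import Path

COMPACTING = "_SDL_COMPACTING"  # lock file in the data object's root path
MOVING = "_SDL_MOVING"          # marks a delete-and-move in progress
COMPACTED = "_SDL_COMPACTED"    # marks an already compacted partition
MAX_LOCK_AGE_S = 12 * 3600     # locks older than 12h are considered stale


def compact(root: Path, tmp: Path, partitions, rewrite) -> None:
    """Compact the given partitions of the data object rooted at `root`.

    `rewrite` is a caller-supplied function that reads a partition with the
    correct format and writes it compacted into the temporary path.
    """
    # Step 1: take the compaction lock, failing if a fresh lock exists.
    lock = root / COMPACTING
    if lock.exists() and time.time() - lock.stat().st_mtime < MAX_LOCK_AGE_S:
        raise RuntimeError("compaction already in progress")
    lock.touch()  # create, or refresh a stale lock from a crashed run

    # Step 2: finish delete-and-moves of a previously crashed run.
    for marker in tmp.glob(f"*/{MOVING}"):
        name = marker.parent.name
        if (root / name).exists():
            shutil.rmtree(root / name)  # remove possibly partial hadoop copy
        finish_move(root, tmp, name)

    # Step 3: skip partitions that are already compacted.
    todo = [p for p in partitions if not (root / p / COMPACTED).exists()]

    # Step 4: rewrite the remaining partitions into the temporary path.
    for p in todo:
        rewrite(root / p, tmp / p)

    # Step 5: delete + move each partition one-by-one, guarded by _SDL_MOVING.
    for p in todo:
        (tmp / p / MOVING).touch()  # hadoop copy is about to be deleted
        shutil.rmtree(root / p)
        finish_move(root, tmp, p)

    # Step 6: release the compaction lock.
    lock.unlink()


def finish_move(root: Path, tmp: Path, name: str) -> None:
    """Move a compacted partition into place and mark it compacted."""
    shutil.move(str(tmp / name), str(root / name))
    (root / name / MOVING).unlink(missing_ok=True)  # move is complete
    (root / name / COMPACTED).touch()
```

Note that `finish_move` deletes the "_SDL_MOVING" marker only after the move, so a crash at any point leaves either the marker or the "_SDL_COMPACTED" file behind, and step 2 of the next run can pick up where the crashed run left off.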