During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe; moving out the old data can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset data, which will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
  /data/db/tbl1/type=hot/region=11
  /data/db/tbl1/type=hot/region=12
  /data/db/tbl1/type=hot/region=13
  /data/db/tbl1/type=hot/region=14
Final state:
  /data/db/tbl1/type=cold/region=15
  /data/db/.Trash/tbl1/region=11
  /data/db/.Trash/tbl1/region=12
  /data/db/.Trash/tbl1/region=13
  /data/db/.Trash/tbl1/region=14
name of the table
the dataset with data from fromSubFolders, already repartitioned; it will be saved into newDataPath
path into which the combined and repartitioned data from the dataset will be committed
parent folder from which to remove the cleanUpFolders
list of sub-folders to remove once the writing and committing of the combined data is successful
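The flow above can be sketched as follows. This is a hypothetical outline, not the project's actual implementation: the method name, parameter list, and trash layout are assumptions derived from the example paths.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the compaction flow described above.
def compact(spark: SparkSession, fs: FileSystem, tableName: String,
            fromBase: String, fromSubFolders: Seq[String],
            newDataPath: String, numPartitions: Int): Unit = {
  // Merge the source folders and coalesce into fewer partitions.
  val data = spark.read
    .parquet(fromSubFolders.map(sub => s"$fromBase/$sub"): _*)
    .coalesce(numPartitions)

  // Fully write and commit the new version first.
  data.write.parquet(newDataPath)

  // Only after the commit succeeded, move the old folders into trash.
  val trash = new Path(s"/data/db/.Trash/$tableName")
  fs.mkdirs(trash)
  fromSubFolders.foreach { sub =>
    fs.rename(new Path(fromBase, sub), new Path(trash, sub))
  }
}
```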
Glob a list of table paths with partitions, and apply a partial function to collect (filter + map) the result, transforming each FileStatus to any type A.
return type of final sequence
parent folder which contains folders with table names
list of table names to search under
list of partition columns to include in the path
a partial function to transform FileStatus to any type A
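A minimal sketch of this globbing, assuming Hadoop's FileSystem API; the signature is inferred from the parameter descriptions and is not the project's actual one.

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Glob basePath/table/col1=*/col2=*/... and collect matching statuses.
def globTablePaths[A](fs: FileSystem, basePath: String,
                      tableNames: Seq[String],
                      partitionColumns: Seq[String],
                      collector: PartialFunction[FileStatus, A]): Seq[A] =
  tableNames.flatMap { table =>
    val pattern = (Seq(basePath, table) ++ partitionColumns.map(_ + "=*"))
      .mkString("/")
    // globStatus can return null when nothing matches, hence the Option.
    Option(fs.globStatus(new Path(pattern))).toSeq.flatten
      .collect(collector) // filter + map in one pass
  }
```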
Lists tables in the basePath. It will ignore any folder/table whose name starts with '.'
parent folder which contains folders with table names
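A sketch of the listing under the stated assumption that a table is a direct sub-folder of basePath; the method name is hypothetical.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// List table folders, skipping hidden ones such as .Trash and .tmp.
def listTables(fs: FileSystem, basePath: String): Seq[String] =
  fs.listStatus(new Path(basePath))
    .filter(s => s.isDirectory && !s.getPath.getName.startsWith("."))
    .map(_.getPath.getName)
    .toSeq
```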
Creates folders on the physical storage.
path to create
true if the folder already exists or was created without problems; false if any folder in the path could not be created
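With Hadoop's FileSystem this is a thin wrapper, sketched below; the exception handling is an assumption about how failures map to the boolean result.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// mkdirs behaves like `mkdir -p`: it also returns true when the
// folder already exists.
def createFolders(fs: FileSystem, path: String): Boolean =
  try fs.mkdirs(new Path(path))
  catch { case _: java.io.IOException => false }
```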
Opens a parquet file from the path, which can be a folder or a file. If there are partitioned sub-folders containing files with slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.
path to open
Some with the dataset if there is data; None if the path does not exist or cannot be opened
Exception
in case of connectivity problems
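The described behaviour maps closely onto Spark's built-in mergeSchema option for parquet; a minimal sketch, with the None-on-failure handling assumed:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

// mergeSchema reconciles slightly different schemas across
// partitioned sub-folders; a failed read becomes None.
def openParquet(spark: SparkSession, path: String): Option[DataFrame] =
  Try(spark.read.option("mergeSchema", "true").parquet(path)).toOption
```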
Checks if the path exists in the physical storage.
true if path exists in the storage layer
Reads the table info back.
parent folder which contains folders with table names
name of the table to read the info for
Writes out static data about the audit table into basePath/table_name/.table_info file.
parent folder which contains folders with table names
static information about the table that will not change during the table's existence
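A sketch of the write, assuming the table info is serialised to plain text; the serialisation format is an assumption, only the .table_info path comes from the description above.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Overwrite basePath/tableName/.table_info with the serialised info.
def writeTableInfo(fs: FileSystem, basePath: String,
                   tableName: String, info: String): Unit = {
  val out = fs.create(new Path(s"$basePath/$tableName/.table_info"), true)
  try out.write(info.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}
```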
Commits the dataset into the full path. The path is the full destination into which the parquet will be placed after it has been fully written into the temp folder.
name of the table; it will only be used to build the temp folder path
full destination path
dataset to write out; no partitioning will be performed on it
Exception
can be thrown due to access permissions, connectivity, or Spark UDFs (as datasets are lazily executed)
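The write-then-rename commit can be sketched as below; the temp folder location is an assumption. Because datasets are lazy, UDF and connectivity failures surface during the write, before the destination is touched.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, Row, SaveMode}

// Sketch: write to temp first, then atomically rename into place.
def commit(fs: FileSystem, tableName: String, destPath: String,
           data: Dataset[Row]): Unit = {
  val tmp = new Path(s"/data/db/.tmp/$tableName") // assumed temp location
  data.write.mode(SaveMode.Overwrite).parquet(tmp.toString) // UDFs run here
  if (!fs.rename(tmp, new Path(destPath)))
    throw new java.io.IOException(s"Could not commit $tmp to $destPath")
}
```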
Contains operations that interact with the physical storage and handles commits to the file system.
Created by Alexei Perelighin on 2018/03/05