Appends trailer data to every file in the data directory. A single trailer file in the pathOutputTrailer directory should correspond to a single data file in the pathOutputData directory. If a trailer for a given file does not exist, the file is moved as-is to the output directory.
Input data files directory
Input trailer files directory
Output concatenated files directory
Hadoop configuration (preferably sparkSession.sparkContext.hadoopConfiguration)
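A minimal sketch of how such a trailer append could work with the Hadoop FileSystem API; the method name appendTrailer and the parameter names pathInputData, pathInputTrailer and pathOutputConcatenated are assumptions for illustration, not the confirmed signature.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical sketch: for each file under pathInputData, append the
// same-named file from pathInputTrailer (if present) and write the result
// under pathOutputConcatenated; files without a trailer are copied as-is.
def appendTrailer(pathInputData: String,
                  pathInputTrailer: String,
                  pathOutputConcatenated: String,
                  configuration: Configuration): Unit = {
  val fs         = FileSystem.get(configuration)
  val trailerDir = new Path(pathInputTrailer)
  val outputDir  = new Path(pathOutputConcatenated)
  if (!fs.exists(outputDir)) fs.mkdirs(outputDir)

  fs.listStatus(new Path(pathInputData)).filter(_.isFile).foreach { status =>
    val dataFile    = status.getPath
    val trailerFile = new Path(trailerDir, dataFile.getName)
    val outputFile  = new Path(outputDir, dataFile.getName)

    if (fs.exists(trailerFile)) {
      // Concatenate data file + trailer file into the output file.
      val out = fs.create(outputFile)
      try {
        Seq(dataFile, trailerFile).foreach { src =>
          val in = fs.open(src)
          try IOUtils.copyBytes(in, out, configuration, false)
          finally in.close()
        }
      } finally out.close()
    } else {
      // No matching trailer: copy the data file to the output unchanged.
      FileUtil.copy(fs, dataFile, fs, outputFile, false, configuration)
    }
  }
}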
Method to get data from multiple source paths and combine it into a single destination path.
multiple source paths from which to merge the data.
destination path to combine all the data into.
flag to compress the final output file into gzip format.
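One possible implementation, sketched under the assumption that merging means streaming every file under each source path into one destination file; copyMerge and compressToGzip are illustrative names.

import java.util.zip.GZIPOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Stream all files from the source paths into a single destination file,
// optionally wrapping the destination stream in gzip compression.
def copyMerge(sourcePaths: Seq[String],
              destinationPath: String,
              compressToGzip: Boolean,
              configuration: Configuration): Unit = {
  val fs     = FileSystem.get(configuration)
  val rawOut = fs.create(new Path(destinationPath))
  val out    = if (compressToGzip) new GZIPOutputStream(rawOut) else rawOut
  try {
    for {
      source <- sourcePaths
      status <- fs.listStatus(new Path(source)) if status.isFile
    } {
      val in = fs.open(status.getPath)
      try IOUtils.copyBytes(in, out, configuration, false)
      finally in.close()
    }
  } finally out.close()
}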
Method to get an empty dataframe with the below Ab Initio log schema:
record string("|") node, timestamp, component, subcomponent, event_type; string("|\n") event_text; end
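Assuming each field of the DML record maps to a nullable Spark StringType column, the empty dataframe could be built as below; getEmptyLogDataFrame is a hypothetical name.

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One nullable string column per field of the "|"-delimited DML record above.
def getEmptyLogDataFrame(spark: SparkSession): DataFrame = {
  val schema = StructType(
    Seq("node", "timestamp", "component", "subcomponent", "event_type", "event_text")
      .map(StructField(_, StringType, nullable = true))
  )
  spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
}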
Method to read data from a Hive table.
spark session.
hive database.
hive table.
hive table partition to specifically read data from, if provided.
dataframe with the data read from the Hive table.
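A minimal sketch, assuming the optional partition is passed as a single "key=value" string that is turned into a filter; readHiveTable and its parameter names are illustrative.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Read the whole table, narrowing to one partition when a spec is given.
def readHiveTable(spark: SparkSession,
                  database: String,
                  table: String,
                  partition: String = ""): DataFrame = {
  val df = spark.read.table(s"$database.$table")
  if (partition.nonEmpty) {
    val Array(key, value) = partition.split("=", 2)
    df.filter(df(key) === value)   // e.g. partition = "dt=2021-01-01"
  } else df
}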
Reads a full Hive table partition by reading every subpartition separately and performing a union on all the resulting DataFrames.
This function is a temporary workaround for the Hive metastore crashing when too many partitions are queried at the same time.
spark session
hive database name
hive table name
top-level partition's key
top-level partition's value
A complete DataFrame with the selected hive table partition
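A sketch of the workaround, assuming the subpartitions can be listed with SHOW PARTITIONS and each one queried separately before unioning; the function and parameter names are illustrative, and the parsing of the partition specs is an assumption about the metastore output.

import org.apache.spark.sql.{DataFrame, SparkSession}

def readHiveTablePartition(spark: SparkSession,
                           database: String,
                           table: String,
                           partitionKey: String,
                           partitionValue: String): DataFrame = {
  // Specs come back as strings such as "dt=2021-01-01/hour=00".
  val subPartitions = spark
    .sql(s"SHOW PARTITIONS $database.$table PARTITION ($partitionKey='$partitionValue')")
    .collect()
    .map(_.getString(0))

  // One small query per subpartition keeps each metastore call cheap;
  // reduce assumes at least one subpartition exists.
  subPartitions
    .map { spec =>
      val predicate = spec
        .split("/")
        .map { kv => val Array(k, v) = kv.split("=", 2); s"$k='$v'" }
        .mkString(" AND ")
      spark.sql(s"SELECT * FROM $database.$table WHERE $predicate")
    }
    .reduce(_ union _)
}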
Method to take the union of all passed dataframes.
list of dataframes of which to take the union.
union of all passed input dataframes.
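Assuming all inputs share the same schema, the union reduces to a single fold; unionAll is an illustrative name (unionByName could be substituted if column order may differ).

import org.apache.spark.sql.DataFrame

// Fold the list into one DataFrame; requires a non-empty list with
// identically ordered schemas.
def unionAll(dataFrames: List[DataFrame]): DataFrame =
  dataFrames.reduce(_ union _)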
Method to write the data in the passed dataframe in a specific file format.
dataframe containing data.
path to write data to.
spark session.
underlying data source specific properties.
file format in which to persist data. Supported file formats are csv, text, json, parquet, and orc.
columns to be used for partitioning.
used to bucket the output by the given columns. If specified, the output is laid out on the file-system similar to Hive's bucketing scheme.
number of buckets to be used.
columns on which to order data while persisting.
table name for persisting data.
database name for persisting data.
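A hedged sketch of how these options could be wired into Spark's DataFrameWriter; the method name, parameter names and the Overwrite save mode are assumptions, and bucketing/sorting are only applied when persisting as a table because Spark requires bucketBy (and sortBy) to go through saveAsTable.

import org.apache.spark.sql.{DataFrame, SaveMode}

def writeDataFrame(df: DataFrame,
                   path: String,
                   props: Map[String, String],
                   format: String,                       // csv, text, json, parquet, orc
                   partitionColumns: Seq[String] = Nil,
                   bucketColumns: Seq[String] = Nil,
                   numBuckets: Option[Int] = None,
                   sortColumns: Seq[String] = Nil,
                   tableName: Option[String] = None,
                   databaseName: Option[String] = None): Unit = {
  var writer = df.write.format(format).options(props).mode(SaveMode.Overwrite)

  if (partitionColumns.nonEmpty)
    writer = writer.partitionBy(partitionColumns: _*)

  (numBuckets, bucketColumns) match {
    case (Some(n), bCol +: bRest) =>
      writer = writer.bucketBy(n, bCol, bRest: _*)
      // Spark only supports sortBy together with bucketBy.
      sortColumns match {
        case sCol +: sRest => writer = writer.sortBy(sCol, sRest: _*)
        case _             =>
      }
    case _ =>
  }

  tableName match {
    case Some(table) =>
      val qualified = databaseName.map(db => s"$db.$table").getOrElse(table)
      writer.option("path", path).saveAsTable(qualified)
    case None =>
      writer.save(path)
  }
}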
UDF to write logging parameters to the log port.
Deprecated; see the corresponding Javadoc for more information.
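For illustration only, a hypothetical UDF that formats the logging parameters into the "|"-delimited record layout of the Ab Initio log schema above.

import org.apache.spark.sql.functions.udf

// Join the six log fields with "|" so the row matches the log record format.
val writeToLog = udf {
  (node: String, timestamp: String, component: String,
   subcomponent: String, eventType: String, eventText: String) =>
    s"$node|$timestamp|$component|$subcomponent|$eventType|$eventText"
}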
Helper Utilities for reading/writing data from/to different data sources.