: Input Dataset Domain
: Input Dataset Schema
: List of globally defined types
: Input dataset path
: Storage Handler
Remove any extra quotes or BOM from the header.
: Header column name
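The header clean-up described here can be illustrated in plain Python. This is only a minimal sketch, not the job's actual code: the function name `clean_header_name`, the `;` separator, and the sample header are assumptions made for illustration.

```python
def clean_header_name(header: str) -> str:
    """Strip a leading BOM and any stray double quotes from one header cell."""
    # "\ufeff" is how the byte order mark appears once the file is decoded as UTF-8.
    name = header.lstrip("\ufeff").strip()
    # Drop double quotes left over from CSV quoting, then trim again.
    return name.replace('"', "").strip()


# Hypothetical raw header line: BOM at the start, every cell quoted.
raw_header = '\ufeff"id";"first name";"email"'
columns = [clean_header_name(cell) for cell in raw_header.split(";")]
print(columns)
```

Without this step, the first column name would carry the invisible BOM character and would not match the name declared in the schema.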
: Input Dataset Domain
Apply the schema to the dataset. This is where all the magic happens: valid records are stored in the accepted path / table and invalid records in the rejected path / table.
: Spark Dataset
: Headers found in the dataset
: Headers defined in the schema
Two lists: one with the columns present in both the schema and the dataset, and another with the headers present in the dataset only.
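The two returned lists can be sketched in plain Python as follows. The function name `intersect_headers` and the sample headers are hypothetical; the real method works on a Spark Dataset rather than on Python lists.

```python
def intersect_headers(dataset_headers, schema_headers):
    """Split dataset headers into those known to the schema and those that are not."""
    schema_set = set(schema_headers)
    # Keep the dataset's own column order in both result lists.
    in_both = [h for h in dataset_headers if h in schema_set]
    dataset_only = [h for h in dataset_headers if h not in schema_set]
    return in_both, dataset_only


in_both, dataset_only = intersect_headers(
    ["id", "name", "debug_flag"],   # headers found in the dataset
    ["id", "name", "email"],        # headers defined in the schema
)
print(in_both)       # columns to keep
print(dataset_only)  # columns absent from the schema
```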
Load the dataset using the Spark CSV reader and all metadata. Does not infer the schema. Columns not defined in the schema are dropped from the dataset (requires datasets with a header).
Spark DataFrame where each row holds a single string
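The loading behaviour (no type inference, columns outside the schema dropped) can be approximated without Spark using the standard `csv` module. This is a sketch under assumptions: the function name `load_with_schema`, the `;` delimiter, and the sample data are hypothetical, and the real job reads through the Spark CSV reader.

```python
import csv
import io


def load_with_schema(text, schema_headers, delimiter=";"):
    """Read delimited text, keeping only the columns declared in the schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    schema_set = set(schema_headers)
    # The first line must be a header; columns outside the schema are dropped.
    keep = [h for h in (reader.fieldnames or []) if h in schema_set]
    # No type inference: every value stays a string.
    return [{h: row[h] for h in keep} for row in reader]


sample = "id;name;extra\n1;Ann;x\n2;Bob;y\n"
rows = load_with_schema(sample, ["id", "name"])
print(rows)
```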
Merge the incoming and existing DataFrames using the merge options.
merged dataframe
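The text does not say what the merge options contain, so the following is only one plausible reading: a key-based merge in which incoming rows replace existing rows that share the same key. The `key` parameter and the function name `merge_rows` are assumptions, and rows are modelled as plain dictionaries instead of DataFrame rows.

```python
def merge_rows(existing, incoming, key):
    """Keep existing rows whose key is absent from incoming, then append incoming."""
    incoming_keys = {tuple(row[k] for k in key) for row in incoming}
    kept = [
        row for row in existing
        if tuple(row[k] for k in key) not in incoming_keys
    ]
    return kept + incoming


existing = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
incoming = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
merged = merge_rows(existing, incoming, key=["id"])
print(merged)
```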
Merged metadata
Spark Job name
Partition a dataset using dataset columns. To partition the dataset using the ingestion time, use the reserved column names :
: Input dataset
: list of columns to use for partitioning.
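Partitioning by column values can be pictured as grouping rows under one directory per distinct combination of values. The sketch below assumes a Hive-style `column=value` path layout; the function name `partition_rows` and the sample columns are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, columns):
    """Group rows by the values of the partition columns."""
    partitions = defaultdict(list)
    for row in rows:
        # Assumed layout: one "col=value" path segment per partition column.
        path = "/".join(f"{col}={row[col]}" for col in columns)
        partitions[path].append(row)
    return dict(partitions)


rows = [
    {"year": 2020, "month": 1, "amount": 10},
    {"year": 2020, "month": 1, "amount": 20},
    {"year": 2020, "month": 2, "amount": 30},
]
parts = partition_rows(rows, ["year", "month"])
print(sorted(parts))
```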
The Spark session used to run this job
: Input dataset path
Main entry point as required by the Spark Job interface
: Spark Session used for the job
Merge the new and existing datasets if required, then save using Overwrite / Append mode.
Save the typed dataset in Parquet. If Hive support is active, also register it as a Hive table; if analyze is active, also compute basic statistics.
: dataset to save
: absolute path
: Append or overwrite
: accepted or rejected area
: Input Dataset Schema
Dataset header names as defined by the schema
: Storage Handler
: List of globally defined types
: Headers found in the dataset
: Headers defined in the schema
Success if all headers in the schema exist in the dataset
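The header check amounts to verifying that the schema headers are a subset of the dataset headers. A minimal sketch, assuming a hypothetical function `validate_headers` that also reports which headers are missing (the original only states a success condition):

```python
def validate_headers(dataset_headers, schema_headers):
    """Return (ok, missing): ok is True when every schema header is in the dataset."""
    present = set(dataset_headers)
    missing = [h for h in schema_headers if h not in present]
    return len(missing) == 0, missing


ok, missing = validate_headers(["id", "name", "extra"], ["id", "name"])
print(ok, missing)

ok_bad, missing_bad = validate_headers(["id"], ["id", "email"])
print(ok_bad, missing_bad)
```

Extra columns in the dataset do not cause a failure here; only schema headers absent from the dataset do.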
Main class to ingest delimiter-separated values (DSV) files