: Input Dataset Domain
: Input Dataset Schema
: List of globally defined types
: Input dataset path
: Storage Handler
Saves a dataset. If the path is empty (the first time metrics are computed on the schema), we can write directly. If Parquet files are already stored there, create a temporary directory to compute in, then flush the path to move the updated metrics into it.
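The save strategy above can be sketched as follows. This is a minimal, hypothetical Python sketch of the "compute in a temp directory, then swap" pattern; the function name `save_dataset` and the callback-based API are illustrative assumptions, not the project's actual (Scala/Spark) implementation.

```python
import os
import shutil
import tempfile

def save_dataset(write_fn, path):
    """Write a dataset to `path` (hypothetical helper).

    If `path` already contains files (e.g. previously written Parquet),
    compute the new output in a temporary directory first, then replace
    the old path with it in one move.
    """
    if os.path.isdir(path) and os.listdir(path):
        # Path is non-empty: compute aside, then flush the old path.
        tmp = tempfile.mkdtemp()
        write_fn(tmp)
        shutil.rmtree(path)
        shutil.move(tmp, path)
    else:
        # First write: the path is empty, write directly.
        os.makedirs(path, exist_ok=True)
        write_fn(path)
```

Swapping a fully written temporary directory into place keeps readers from ever observing a half-written dataset.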
: dataset to be saved
: Path to save the file at
: Input Dataset Domain
Apply the schema to the dataset. This is where all the magic happens: valid records are stored in the accepted path / table, and invalid records in the rejected path / table.
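The accepted/rejected split can be illustrated with a small Python sketch. The function names (`validate_record`, `apply_schema`) and the type-map schema representation are assumptions for illustration only; the real job validates Spark rows against the project's schema model.

```python
def validate_record(record, schema):
    """Return None if the record matches the schema, else an error message.

    `schema` is a hypothetical mapping of attribute name -> expected type.
    """
    for name, expected_type in schema.items():
        if name not in record:
            return f"missing attribute: {name}"
        if not isinstance(record[name], expected_type):
            return f"bad type for {name}: {type(record[name]).__name__}"
    return None

def apply_schema(records, schema):
    """Split records into (accepted, rejected) lists.

    Accepted records would go to the accepted path / table,
    rejected records (annotated with the error) to the rejected one.
    """
    accepted, rejected = [], []
    for record in records:
        error = validate_record(record, schema)
        if error is None:
            accepted.append(record)
        else:
            rejected.append({**record, "error": error})
    return accepted, rejected
```

Keeping the rejection reason alongside each rejected record makes the rejected table directly usable for debugging bad input files.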
: Spark Dataset
Load the dataset using the Spark CSV reader and all metadata. Does not infer the schema. Columns not defined in the schema are dropped from the dataset (requires datasets with a header).
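A plain-Python sketch of that loading behavior, using the standard library's `csv` module rather than the Spark CSV reader: the header is required to map columns by name, no types are inferred (everything stays a string), and columns absent from the schema are dropped. The function name `load_csv` is an illustrative assumption.

```python
import csv
import io

def load_csv(text, schema_columns):
    """Read a headered CSV, keeping only columns declared in the schema.

    No schema inference: every value remains a string, as with
    inferSchema disabled in Spark's CSV reader.
    """
    reader = csv.DictReader(io.StringIO(text))
    # The header is mandatory: it is the only way to match columns by name.
    keep = [c for c in reader.fieldnames if c in schema_columns]
    return [{c: row[c] for c in keep} for row in reader]
```

Dropping undeclared columns early means downstream validation only ever sees attributes the schema knows about.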
Spark Dataset
Merged metadata
Spark Job name
: Parameters to pass as input (k1=v1,k2=v2,k3=v3)
Partition a dataset using dataset columns. To partition the dataset using the ingestion time, use the reserved column names:
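Column-based partitioning lays files out under Hive-style `column=value` directories, which is what Spark's `partitionBy` produces. Below is a hypothetical Python sketch of how such a partition path is derived from a record; the function name `partition_path` and the generic column names in the example are assumptions, not the project's reserved ingestion-time names.

```python
def partition_path(base, record, partition_columns):
    """Build a Hive-style partition path base/col1=v1/col2=v2/... for a record."""
    parts = [f"{col}={record[col]}" for col in partition_columns]
    return "/".join([base] + parts)
```

For example, partitioning on hypothetical `year` and `month` columns places a record in `base/year=.../month=...`, so readers can prune whole directories when filtering on those columns.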
: Input dataset
: List of columns to use for partitioning.
The Spark session used to run this job
: Input dataset path
Main entry point as required by the Spark Job interface
: Spark Session used for the job
Merge the new and existing datasets if required, then save using Overwrite / Append mode.
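The merge-then-save step can be sketched in Python on plain lists of records. This is an illustrative assumption about the semantics (incoming rows win on key collisions when overwriting); the function name `merge_and_save` and the `key` parameter are hypothetical, and the real job operates on Spark DataFrames with the writer's save modes.

```python
def merge_and_save(existing, incoming, key, mode="overwrite"):
    """Merge incoming rows into existing ones, then 'save' with the given mode.

    Append: keep everything, duplicates included.
    Overwrite: merge by key, letting incoming rows replace existing ones.
    """
    if mode == "append":
        return existing + incoming
    merged = {row[key]: row for row in existing}
    merged.update({row[key]: row for row in incoming})
    return list(merged.values())
```

Merging by key before an overwrite is what lets re-ingesting the same file update rows in place instead of duplicating them.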
: Input Dataset Schema
Dataset header names as defined by the schema
: Storage Handler
: List of globally defined types
Parse a simple one-level JSON file. Complex types such as arrays and maps are not supported; use JsonIngestionJob instead. This class handles simple JSON only, which makes it much faster.
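The restriction can be illustrated with a short Python sketch using the standard library's `json` module: accept only a flat object whose values are scalars, and reject arrays and nested objects. The function name `parse_simple_json` is an illustrative assumption; the actual job parses JSON lines in Scala.

```python
import json

def parse_simple_json(line):
    """Parse a one-level JSON object; reject arrays and nested objects.

    Restricting input to flat scalar attributes is what allows a much
    faster path than a generic (nested-capable) JSON ingestion job.
    """
    obj = json.loads(line)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    for key, value in obj.items():
        if isinstance(value, (list, dict)):
            raise ValueError(f"complex type not supported for attribute {key!r}")
    return obj
```

Inputs with arrays or maps should be routed to the generic JSON job instead of being parsed here.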