: Input Dataset Domain
: Input Dataset Schema
: List of globally defined types
: Input dataset path
: Storage Handler
Saves a dataset. If the path is empty (the first time we compute metrics on the schema), we can write directly.
If Parquet files are already stored there, create a temporary directory to compute in, then flush the path to move the updated metrics into it.
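The save strategy above could be sketched as follows. This is a hypothetical illustration, not the project's actual implementation: the `StorageHandler` methods (`exists`, `delete`, `move`) and the `.tmp` suffix are assumptions.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: StorageHandler's exists/delete/move are assumed method names.
def saveMetrics(dataset: DataFrame, savePath: Path, storage: StorageHandler): Unit = {
  if (!storage.exists(savePath)) {
    // First run: no Parquet files yet, write directly to the target path.
    dataset.write.mode(SaveMode.Overwrite).parquet(savePath.toString)
  } else {
    // Parquet files already present: compute into a temporary directory,
    // then flush by moving the updated metrics into the target path.
    val tmpPath = new Path(savePath.toString + ".tmp")
    dataset.write.mode(SaveMode.Overwrite).parquet(tmpPath.toString)
    storage.delete(savePath)
    storage.move(tmpPath, savePath)
  }
}
```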
: dataset to be saved
: Path to save the file at
: Input Dataset Domain
Where the magic happens
Input dataset as an RDD of String
Load the JSON as an RDD of String
Spark DataFrame loaded using metadata options
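The two loading steps above (raw RDD of String, then a DataFrame built with reader options) might look like this minimal sketch; the input path and the `multiLine` option are placeholders, since the actual options would come from the schema metadata.

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().master("local[*]").getOrCreate()

// Load the JSON as an RDD of String: one line per record.
val rdd = session.sparkContext.textFile("/path/to/input.json")

// Load the same file as a DataFrame; in practice the reader options
// would be derived from the dataset's metadata, not hard-coded.
val df = session.read
  .option("multiLine", value = false)
  .json("/path/to/input.json")
```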
Merged metadata
Partition a dataset using dataset columns. To partition the dataset using the ingestion time, use the reserved column names :
: Input dataset
: List of columns to use for partitioning.
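Partitioning by dataset columns can be sketched with Spark's `partitionBy`. This is an illustration only: the reserved ingestion-time column names are not reproduced here (the original list is truncated above), so plain column names are assumed.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: write the dataset partitioned by the given columns.
// When partitioning by ingestion time, the reserved column names
// would be passed in place of ordinary dataset columns.
def partitionedSave(dataset: DataFrame,
                    partitionColumns: List[String],
                    path: String): Unit =
  dataset.write
    .mode(SaveMode.Append)
    .partitionBy(partitionColumns: _*)
    .parquet(path)
```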
The Spark session used to run this job
: Input dataset path
Main entry point as required by the Spark Job interface
: Spark Session used for the job
Merge new and existing datasets if required, then save using Overwrite / Append mode.
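The merge-then-save decision could be sketched as below. The merge semantics shown (a simple union of new and existing rows) are an assumption for illustration, not the project's actual merge logic.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: when existing data must be merged, rewrite everything (Overwrite);
// otherwise just append the new rows.
def mergeAndSave(newDs: DataFrame,
                 existing: Option[DataFrame],
                 path: String): Unit =
  existing match {
    case Some(old) =>
      newDs.unionByName(old).write.mode(SaveMode.Overwrite).parquet(path)
    case None =>
      newDs.write.mode(SaveMode.Append).parquet(path)
  }
```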
: Input Dataset Schema
: Storage Handler
: List of globally defined types
Main class to ingest an XML file. If your JSON contains only simple top-level attributes (i.e. DSV-like, but in JSON format), please use SIMPLE_JSON instead. It's way faster.