A generic directed acyclic graph (DAG) consisting of DAGNodes interconnected with directed DAGEdges.
This DAG can have multiple start nodes and multiple end nodes as well as disconnected parts.
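The structure described above can be sketched as a minimal generic DAG. Names and signatures here are illustrative assumptions, not the actual SmartDataLake API; in particular, start and end nodes are derived purely from the edge list, which also makes disconnected nodes both start and end nodes:

```scala
// Hypothetical minimal sketch of a generic DAG (illustrative names only).
case class DAGNode(id: String)
case class DAGEdge(from: DAGNode, to: DAGNode)

case class DAG(nodes: Seq[DAGNode], edges: Seq[DAGEdge]) {
  // start nodes have no incoming edges; end nodes have no outgoing edges
  def startNodes: Seq[DAGNode] = nodes.filterNot(n => edges.exists(_.to == n))
  def endNodes: Seq[DAGNode]   = nodes.filterNot(n => edges.exists(_.from == n))
}
```

A DAG built from the edges `a -> b` and `c -> d` plus an isolated node `e` has multiple start nodes (`a`, `c`, `e`) and multiple end nodes (`b`, `d`, `e`), illustrating both properties mentioned above.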
A FileSubFeed is used to transport references to files between Actions.
paths to the files to be processed
id of the DataObject this SubFeed corresponds to
values of partitions transported by this SubFeed
used to remember processed input FileRefs for post-processing (e.g. delete after read)
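The fields above can be sketched as a simple case class. Field names and types are assumptions for illustration, not the real SmartDataLake signature:

```scala
// Hypothetical sketch of a file-based SubFeed (field names assumed, not the real API).
case class FileRef(fullPath: String, fileName: String)

case class FileSubFeed(
  fileRefs: Option[Seq[FileRef]],                     // files to be processed
  dataObjectId: String,                               // DataObject this SubFeed corresponds to
  partitionValues: Seq[Map[String, String]],          // partition values transported
  processedInputFileRefs: Option[Seq[FileRef]] = None // remembered for post-processing,
                                                      // e.g. delete after read
)
```

Keeping `processedInputFileRefs` separate from `fileRefs` lets a downstream step act on exactly the files an Action actually consumed.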
An InitSubFeed is used to initialize the first nodes of a DAG.
id of the DataObject this SubFeed corresponds to
values of partitions transported by this SubFeed
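An InitSubFeed carries no data payload, only the information needed to seed a DAG start node. A minimal sketch, with assumed field names:

```scala
// Hypothetical sketch of an InitSubFeed (field names assumed, not the real API):
// it carries no data, only what is needed to seed a start node of the DAG.
case class InitSubFeed(
  dataObjectId: String,                     // DataObject this SubFeed corresponds to
  partitionValues: Seq[Map[String, String]] // partition values transported
)

// Seeding a start node for one partition:
val seed = InitSubFeed("stg-customers", Seq(Map("dt" -> "2021-01-01")))
```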
Exception to signal that a configured pipeline can't be executed properly.
A SparkSubFeed is used to transport DataFrames between Actions.
Spark DataFrame to be processed. DataFrame should not be saved to state (@transient).
id of the DataObject this SubFeed corresponds to
values of partitions transported by this SubFeed
true if this SubFeed is a start node of the DAG
true if this SubFeed only contains a dummy DataFrame. Dummy DataFrames can be used to validate the lineage in the init phase, but not in the exec phase.
a Spark SQL filter expression. This is used by SparkIncrementalMode.
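The fields above can be sketched as follows. A stand-in type replaces `org.apache.spark.sql.DataFrame` so the example is self-contained; all names are illustrative assumptions, not the real SmartDataLake signature:

```scala
// Stand-in for org.apache.spark.sql.DataFrame, so this sketch has no Spark dependency.
case class DataFrame(rows: Seq[Map[String, Any]])

// Hypothetical sketch of a SparkSubFeed (field names assumed, not the real API).
case class SparkSubFeed(
  @transient dataFrame: Option[DataFrame],   // should not be saved to state
  dataObjectId: String,                      // DataObject this SubFeed corresponds to
  partitionValues: Seq[Map[String, String]], // partition values transported
  isDAGStart: Boolean = false,               // start node of the DAG?
  isDummy: Boolean = false,                  // dummy DataFrame for lineage validation (init phase)
  filter: Option[String] = None              // Spark SQL filter expression (SparkIncrementalMode)
)
```

Marking the DataFrame `@transient` keeps it out of any serialized state, matching the note above that the DataFrame should not be saved to state.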
A SubFeed transports references to data between Actions. Data can be represented by different technologies like Files or DataFrame.
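The common abstraction can be sketched as a trait that concrete SubFeeds implement, each carrying a different payload behind the same interface. Names are assumptions for illustration, not the real API:

```scala
// Hypothetical sketch of the common SubFeed abstraction (names assumed).
trait SubFeed {
  def dataObjectId: String
  def partitionValues: Seq[Map[String, String]]
}

// One concrete technology: a file-based payload (stand-in types).
case class FilePayload(paths: Seq[String])
case class FileBasedSubFeed(dataObjectId: String,
                            partitionValues: Seq[Map[String, String]],
                            payload: FilePayload) extends SubFeed
```

An Action can then be wired against `SubFeed` and stay agnostic of whether the data behind it is a set of files or a DataFrame.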
ActionPipelineContext contains start and runtime information about a SmartDataLake run.
feed selector of the run
application name of the run
runId of the run. Stays 1 if recovery is not enabled.
attemptId of the run. Stays 1 if recovery is not enabled.
registry of all SmartDataLake objects parsed from the config
timestamp used as reference in certain actions (e.g. HistorizeAction)
the command line parameters parsed into a SmartDataLakeBuilderConfig object
start time of the run
start time of attempt
true if this is a simulation run
current execution phase
Counter for how many times the DataFrame of a SparkSubFeed is reused by an Action later in the pipeline. The counter is increased during ExecutionPhase.Init when preparing the SubFeeds for an Action, and decreased in ExecutionPhase.Exec so the DataFrame can be unpersisted once it is no longer needed.
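The reuse counter described above follows a reference-counting pattern. A minimal sketch, assuming a mutable map keyed by DataObject id (the real ActionPipelineContext holds more state, and these names are hypothetical):

```scala
import scala.collection.mutable

// Hypothetical sketch of the DataFrame reuse counter (names assumed).
class ReuseStatistics {
  private val counters = mutable.Map.empty[String, Int]

  // Init phase: an Action registers that it will consume this DataFrame later.
  def registerReuse(dataObjectId: String): Unit =
    counters.update(dataObjectId, counters.getOrElse(dataObjectId, 0) + 1)

  // Exec phase: decrement after use; when the remaining count reaches zero,
  // the DataFrame can be unpersisted.
  def releaseReuse(dataObjectId: String): Int = {
    val remaining = counters.getOrElse(dataObjectId, 0) - 1
    counters.update(dataObjectId, remaining)
    remaining
  }
}
```

With two registered consumers, the first release leaves a count of 1 (keep the DataFrame cached) and the second leaves 0 (safe to unpersist).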