A generic directed acyclic graph (DAG) consisting of DAGNodes interconnected with directed DAGEdges.
This DAG can have multiple start nodes and multiple end nodes as well as disconnected parts.
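The structure described above can be sketched as a minimal generic DAG. Names and signatures here are illustrative assumptions, not the actual SmartDataLake API; in particular, start and end nodes are derived purely from the edge list, which also makes disconnected nodes both start and end nodes:

```scala
// Hypothetical minimal sketch of a generic DAG (illustrative names only).
case class DAGNode(id: String)
case class DAGEdge(from: DAGNode, to: DAGNode)

case class DAG(nodes: Seq[DAGNode], edges: Seq[DAGEdge]) {
  // start nodes have no incoming edges; end nodes have no outgoing edges
  def startNodes: Seq[DAGNode] = nodes.filterNot(n => edges.exists(_.to == n))
  def endNodes: Seq[DAGNode]   = nodes.filterNot(n => edges.exists(_.from == n))
}
```

A DAG built from the edges `a -> b` and `c -> d` plus an isolated node `e` has multiple start nodes (`a`, `c`, `e`) and multiple end nodes (`b`, `d`, `e`), illustrating both properties mentioned above.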
A FileSubFeed is used to transport references to files between Actions.
paths to the files to be processed
id of the DataObject this SubFeed corresponds to
values of partitions transported by this SubFeed
used to remember processed input FileRefs for post-processing (e.g. delete after read)
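The fields above can be sketched as a simple case class. Field names and types are assumptions for illustration, not the real SmartDataLake signature:

```scala
// Hypothetical sketch of a file-based SubFeed (field names assumed, not the real API).
case class FileRef(fullPath: String, fileName: String)

case class FileSubFeed(
  fileRefs: Option[Seq[FileRef]],                     // files to be processed
  dataObjectId: String,                               // DataObject this SubFeed corresponds to
  partitionValues: Seq[Map[String, String]],          // partition values transported
  processedInputFileRefs: Option[Seq[FileRef]] = None // remembered for post-processing,
                                                      // e.g. delete after read
)
```

Keeping `processedInputFileRefs` separate from `fileRefs` lets a downstream step act on exactly the files an Action actually consumed.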
An InitSubFeed is used to initialize the first nodes of a DAG.
id of the DataObject this SubFeed corresponds to
values of partitions transported by this SubFeed
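An InitSubFeed carries no data payload, only the information needed to seed a DAG start node. A minimal sketch, with assumed field names:

```scala
// Hypothetical sketch of an InitSubFeed (field names assumed, not the real API):
// it carries no data, only what is needed to seed a start node of the DAG.
case class InitSubFeed(
  dataObjectId: String,                     // DataObject this SubFeed corresponds to
  partitionValues: Seq[Map[String, String]] // partition values transported
)

// Seeding a start node for one partition:
val seed = InitSubFeed("stg-customers", Seq(Map("dt" -> "2021-01-01")))
```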
Exception to signal that a configured pipeline can't be executed properly.
A SparkSubFeed is used to transport DataFrames between Actions.
Spark DataFrame to be processed. DataFrame should not be saved to state (@transient).
id of the DataObject this SubFeed corresponds to
values of partitions transported by this SubFeed
true if this SubFeed is a start node of the DAG
true if this SubFeed only contains a dummy DataFrame. Dummy DataFrames can be used to validate the lineage in the init phase, but not in the exec phase.
a Spark SQL filter expression. This is used by SparkIncrementalMode.
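The fields above can be sketched as follows. A stand-in type replaces `org.apache.spark.sql.DataFrame` so the example is self-contained; all names are illustrative assumptions, not the real SmartDataLake signature:

```scala
// Stand-in for org.apache.spark.sql.DataFrame, so this sketch has no Spark dependency.
case class DataFrame(rows: Seq[Map[String, Any]])

// Hypothetical sketch of a SparkSubFeed (field names assumed, not the real API).
case class SparkSubFeed(
  @transient dataFrame: Option[DataFrame],   // should not be saved to state
  dataObjectId: String,                      // DataObject this SubFeed corresponds to
  partitionValues: Seq[Map[String, String]], // partition values transported
  isDAGStart: Boolean = false,               // start node of the DAG?
  isDummy: Boolean = false,                  // dummy DataFrame for lineage validation (init phase)
  filter: Option[String] = None              // Spark SQL filter expression (SparkIncrementalMode)
)
```

Marking the DataFrame `@transient` keeps it out of any serialized state, matching the note above that the DataFrame should not be saved to state.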
A SubFeed transports references to data between Actions. Data can be represented by different technologies like Files or DataFrame.
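The common abstraction can be sketched as a trait that concrete SubFeeds implement, each carrying a different payload behind the same interface. Names are assumptions for illustration, not the real API:

```scala
// Hypothetical sketch of the common SubFeed abstraction (names assumed).
trait SubFeed {
  def dataObjectId: String
  def partitionValues: Seq[Map[String, String]]
}

// One concrete technology: a file-based payload (stand-in types).
case class FilePayload(paths: Seq[String])
case class FileBasedSubFeed(dataObjectId: String,
                            partitionValues: Seq[Map[String, String]],
                            payload: FilePayload) extends SubFeed
```

An Action can then be wired against `SubFeed` and stay agnostic of whether the data behind it is a set of files or a DataFrame.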
ActionPipelineContext contains start and runtime information about a SmartDataLake run.
feed selector of the run
application name of the run
runId of the run. Stays 1 if recovery is not enabled.
attemptId of the run. Stays 1 if recovery is not enabled.
registry of all SmartDataLake objects parsed from the config
timestamp used as reference in certain actions (e.g. HistorizeAction)
the command line parameters parsed into a SmartDataLakeBuilderConfig object
start time of the run
start time of attempt
true if this is a simulation run
current execution phase
Counter for how many times the DataFrame of a SparkSubFeed is reused by an Action later in the pipeline. The counter is increased during ExecutionPhase.Init when preparing the SubFeeds for an Action, and decreased in ExecutionPhase.Exec so the DataFrame can be unpersisted once it is no longer needed.
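The reuse counter described above follows a reference-counting pattern. A minimal sketch, assuming a mutable map keyed by DataObject id (the real ActionPipelineContext holds more state, and these names are hypothetical):

```scala
import scala.collection.mutable

// Hypothetical sketch of the DataFrame reuse counter (names assumed).
class ReuseStatistics {
  private val counters = mutable.Map.empty[String, Int]

  // Init phase: an Action registers that it will consume this DataFrame later.
  def registerReuse(dataObjectId: String): Unit =
    counters.update(dataObjectId, counters.getOrElse(dataObjectId, 0) + 1)

  // Exec phase: decrement after use; when the remaining count reaches zero,
  // the DataFrame can be unpersisted.
  def releaseReuse(dataObjectId: String): Int = {
    val remaining = counters.getOrElse(dataObjectId, 0) - 1
    counters.update(dataObjectId, remaining)
    remaining
  }
}
```

With two registered consumers, the first release leaves a count of 1 (keep the DataFrame cached) and the second leaves 0 (safe to unpersist).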