SparkDataFlow

Instance Constructors

new SparkDataFlow(info: SparkDataFlowInfo)

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def actions(acs: Seq[DataFlowAction]): SparkDataFlow.this.type

Definition Classes
SparkDataFlow → DataFlow
def actions: Seq[DataFlowAction]

Actions to execute; these will be scheduled when inputs become available. Executed actions must be removed from the state.

Definition Classes
SparkDataFlow → DataFlow
def addAction[A <: DataFlowAction](action: A): SparkDataFlow.this.type

Creates new state of the dataflow by adding an action to it.
action
- action to add
returns
- new state with action

Definition Classes
DataFlow
Exceptions thrown
DataFlowException when: 1) at least one of the input labels is not present in the inputs 2) at least one of the input labels is not present in the outputs of existing actions
def addInput(label: String, value: Option[Any]): SparkDataFlow.this.type

Creates new state of the dataflow by adding an input. Duplicate labels are handled in prepareForExecution()
label
- name of the input
value
- values of the input
returns
- new state with the input

Definition Classes
DataFlow
def addInterceptor(interceptor: InterceptorAction, guidToIntercept: String): SparkDataFlow.this.type

Creates a new state of the data flow by replacing the intercepted action with the action that intercepts it. When an existing InterceptorAction is being replaced, the action to replace differs from the intercepted action held in that InterceptorAction.

Definition Classes
DataFlow
final def asInstanceOf[T0]: T0

Definition Classes
Any
def buildCommits(): SparkDataFlow.this.type

During the preparation-for-execution stage of the data flow, interacts with the data committer to add actions that implement the stages of the data committer.
This build uses tags to separate the stages of the data committer: cache, move, finish.

Attributes
protected[com.coxautodata.waimak.dataflow]
Definition Classes
DataFlow
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def commit(commitName: String)(labels: String*): SparkDataFlow.this.type

Groups labels to commit under a commit name. Can be called multiple times with the same commit name, thus adding labels to it. Multiple commit names can be defined in a single data flow.
By default, the committer is requested to cache the underlying labels on the flow before writing them out if caching is supported by the data committer. If caching is not supported this parameter is ignored. This behavior can be disabled by setting the CACHE_REUSED_COMMITTED_LABELS parameter.
commitName
name of the commit, which will be used to define its push implementation
labels
labels added to the commit name with partitions config

Definition Classes
DataFlow
def commit(commitName: String, repartition: Int)(labels: String*): SparkDataFlow.this.type

Groups labels to commit under a commit name. Can be called multiple times with the same commit name, thus adding labels to it. Multiple commit names can be defined in a single data flow.
By default, the committer is requested to cache the underlying labels on the flow before writing them out if caching is supported by the data committer. If caching is not supported this parameter is ignored. This behavior can be disabled by setting the CACHE_REUSED_COMMITTED_LABELS parameter.
commitName
name of the commit, which will be used to define its push implementation
repartition
how many partitions to repartition the data by
labels
labels added to the commit name with partitions config

Definition Classes
DataFlow
def commit(commitName: String, partitions: Seq[String], repartition: Boolean = true)(labels: String*): SparkDataFlow.this.type

Groups labels to commit under a commit name. Can be called multiple times with the same commit name, thus adding labels to it. Multiple commit names can be defined in a single data flow.
By default, the committer is requested to cache the underlying labels on the flow before writing them out if caching is supported by the data committer. If caching is not supported this parameter is ignored. This behavior can be disabled by setting the CACHE_REUSED_COMMITTED_LABELS parameter.
commitName
name of the commit, which will be used to define its push implementation
partitions
list of partition columns for the labels specified in this commit invocation. It will not impact labels from previous or following invocations of the commit with same commit name.
repartition
to repartition the data
labels
labels added to the commit name with partitions config

Definition Classes
DataFlow
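
The way repeated commit calls accumulate labels under a commit name can be illustrated with a small toy model. This is only a sketch, not the library's implementation: `ToyFlow` and `CommitEntry` are hypothetical names, and the three overloads are collapsed into one method with defaults for brevity.

```scala
object CommitSketch {
  // Hypothetical record of one committed label and the settings it was added with.
  final case class CommitEntry(label: String, partitions: Seq[String], repartition: Boolean)

  // Hypothetical stand-in for a flow: it only tracks commit names and their labels.
  final case class ToyFlow(commits: Map[String, Vector[CommitEntry]] = Map.empty) {

    // Each call appends labels to the named commit; the partition settings
    // apply only to the labels passed in this particular call.
    def commit(commitName: String, partitions: Seq[String] = Seq.empty, repartition: Boolean = true)(
        labels: String*): ToyFlow = {
      val added = labels.map(CommitEntry(_, partitions, repartition))
      val existing = commits.getOrElse(commitName, Vector.empty)
      copy(commits = commits.updated(commitName, existing ++ added))
    }
  }

  val flow: ToyFlow = ToyFlow()
    .commit("warehouse")("customers")
    .commit("warehouse", Seq("date"))("orders", "payments")
    .commit("archive")("customers")
}
```

In this model the second call to the "warehouse" commit adds two labels partitioned by "date" without changing the entry created by the first call.
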
def commitMeta(cm: CommitMeta): SparkDataFlow.this.type

Definition Classes
SparkDataFlow → DataFlow
def commitMeta: CommitMeta

Definition Classes
SparkDataFlow → DataFlow
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def execute(errorOnUnexecutedActions: Boolean = true): (Seq[DataFlowAction], DataFlow)

Execute this flow using the current executor on the flow. See DataFlowExecutor.execute() for more information.

Definition Classes
DataFlow
def executed(executed: DataFlowAction, outputs: Seq[Option[Any]]): SparkDataFlow.this.type

Creates a new state of the dataflow by removing the executed action from the actions list and adding its outputs to the inputs.
executed
- executed action
outputs
- outputs of the executed action
returns
- next stage data flow without the executed action, but with its outputs as inputs

Definition Classes
SparkDataFlow → DataFlow
Exceptions thrown
DataFlowException if number of provided outputs is not equal to the number of output labels of the action
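
The state transition described for executed can be sketched with a toy model. Everything here is hypothetical (`ToyAction`, `ToyFlow`), and `IllegalArgumentException` merely stands in for DataFlowException so that the block is self-contained.

```scala
object ExecutedSketch {
  // Hypothetical action: a guid plus the labels it produces, in order.
  final case class ToyAction(guid: String, outputLabels: List[String])

  // Hypothetical flow state: known inputs and actions still to run.
  final case class ToyFlow(
      inputs: Map[String, Option[Any]] = Map.empty,
      actions: Vector[ToyAction] = Vector.empty) {

    def executed(done: ToyAction, outputs: Seq[Option[Any]]): ToyFlow = {
      // One output value is expected per output label.
      if (outputs.size != done.outputLabels.size)
        throw new IllegalArgumentException(
          s"Action ${done.guid} has ${done.outputLabels.size} output labels " +
            s"but ${outputs.size} outputs were provided")
      copy(
        // Outputs of the executed action become inputs for the remaining actions.
        inputs = inputs ++ done.outputLabels.zip(outputs),
        // The executed action is dropped from the pending list.
        actions = actions.filterNot(_.guid == done.guid))
    }
  }
}
```

The original state is left untouched; each call returns the next state, which matches the "creates a new state" wording used throughout this page.
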
def executionPool(executionPoolName: String)(nestedFlow: (SparkDataFlow.this.type) ⇒ SparkDataFlow.this.type): SparkDataFlow.this.type

Creates a code block in which all actions are run on the specified execution pool. The same execution pool name can be used multiple times and nested pools are allowed; the name closest to the action is assigned to it.
Ex:
flow.executionPool("pool_1") {
  _.addAction(a1)
    .addAction(a2)
    .executionPool("pool_2") {
      _.addAction(a3)
        .addAction(a4)
    }
    .addAction(a5)
}
So actions a1, a2, a5 will be in pool_1 and actions a3, a4 in pool_2.
executionPoolName
pool name to assign to all actions inside of it, but it can be overwritten by the nested execution pools.

Definition Classes
DataFlow
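
The "closest name wins" rule for nested pools can be modelled in a few lines. This is a sketch under stated assumptions, not the library's code: `ToyFlow` is hypothetical, actions are plain strings, and the initial pool name "default" is an assumption made for the example.

```scala
object ExecutionPoolSketch {
  // Hypothetical flow: remembers the pool in scope and which pool each action got.
  final case class ToyFlow(
      currentPool: String = "default", // assumed name for the outermost pool
      assigned: Vector[(String, String)] = Vector.empty) {

    // An action is assigned to whichever pool is in scope when it is added.
    def addAction(name: String): ToyFlow =
      copy(assigned = assigned :+ (name -> currentPool))

    def executionPool(poolName: String)(nested: ToyFlow => ToyFlow): ToyFlow = {
      val outerPool = currentPool
      // Run the nested block with the new pool in scope...
      val inner = nested(copy(currentPool = poolName))
      // ...then restore the enclosing pool for actions added afterwards.
      inner.copy(currentPool = outerPool)
    }
  }

  val flow: ToyFlow = ToyFlow().executionPool("pool_1") {
    _.addAction("a1")
      .addAction("a2")
      .executionPool("pool_2") {
        _.addAction("a3").addAction("a4")
      }
      .addAction("a5")
  }

  // Action name -> pool name
  val pools: Map[String, String] = flow.assigned.toMap
}
```

Restoring the enclosing pool after the nested block is what puts a5 back in pool_1 in this model.
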
def executor: DataFlowExecutor

Current DataFlowExecutor associated with this flow

Definition Classes
SparkDataFlow → DataFlow
def finaliseExecution(): Try[SparkDataFlow.this.type]

A function called just after the flow is executed. By default, the implementation on DataFlow is a no-op; however, it is used in spark.SparkDataFlow to clean up the temporary directory.

Definition Classes
SparkDataFlow → DataFlow
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
val flowContext: SparkFlowContext

Definition Classes
SparkDataFlow → DataFlow
def foldLeftOver[A, S >: SparkDataFlow.this.type <: DataFlow](foldOver: Iterable[A])(f: (S, A) ⇒ S): S

Fold left over a collection, where the current DataFlow is the zero value. Lets you fold over a flow inline in the flow.
foldOver
Collection to fold over
f
Function to apply during the flow
returns
A DataFlow produced after repeated applications of f for each element in the collection

Definition Classes
DataFlow
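
A minimal sketch of the folding behaviour, assuming a hypothetical `ToyFlow` that only records action names. The real method is generic in the flow type; that is left out here to keep the example short.

```scala
object FoldLeftOverSketch {
  // Hypothetical stand-in for a flow: it only records the names of added actions.
  final case class ToyFlow(actions: Vector[String] = Vector.empty) {
    def addAction(name: String): ToyFlow = copy(actions = actions :+ name)

    // The current flow is the zero value of the fold.
    def foldLeftOver[A](foldOver: Iterable[A])(f: (ToyFlow, A) => ToyFlow): ToyFlow =
      foldOver.foldLeft(this)(f)
  }

  val tables = Seq("customers", "orders")

  // The fold sits inline in the chain, so the fluent style is not interrupted.
  val flow: ToyFlow = ToyFlow()
    .addAction("open")
    .foldLeftOver(tables)((f, table) => f.addAction(s"load_$table"))
    .addAction("close")
}
```

Without such a method, the chain would have to be broken into a separate `tables.foldLeft(flow)(...)` expression.
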
def getActionByGuid(actionGuid: String): DataFlowAction

Guids are unique, find action by guid

Definition Classes
DataFlow
def getActionByOutputLabel(outputLabel: String): DataFlowAction

Output labels are unique. Finds action that produces outputLabel.

Definition Classes
DataFlow
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
def inputs(inp: DataFlowEntities): SparkDataFlow.this.type

Definition Classes
SparkDataFlow → DataFlow
def inputs: DataFlowEntities

Inputs that were explicitly set or produced by previous actions; these are inputs for all following actions. Inputs are preserved in the data flow state, even if they are no longer required by the remaining actions. TODO: explore the option of removing the inputs that are no longer required by the remaining actions.

Definition Classes
SparkDataFlow → DataFlow
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def isTraceEnabled(): Boolean

Attributes
protected
Definition Classes
Logging
def isValidFlowDAG: Try[SparkDataFlow.this.type]

Flow DAG is valid iff:
1. All output labels and existing input labels are unique
2. Each action depends on labels that are produced by actions or already present in inputs
3. Active tags is empty
4. Active dependencies is zero
5. No cyclic dependencies in labels
6. No cyclic dependencies in tags
7. No cyclic dependencies in label tag combination

Definition Classes
DataFlow
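
Three of the conditions above (unique labels, no missing labels, no label cycles) can be sketched on a toy model. Tags and active state are deliberately left out, `ToyAction` and `validate` are hypothetical names, and `IllegalStateException` is used only to keep the block self-contained.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object ValidDagSketch {
  // Hypothetical action: the labels it consumes and the labels it produces.
  final case class ToyAction(guid: String, inputLabels: Set[String], outputLabels: Set[String])

  def validate(inputs: Set[String], actions: Seq[ToyAction]): Try[Unit] = {
    // Every label that exists already or will be produced, duplicates included.
    val allLabels: List[String] = inputs.toList ++ actions.toList.flatMap(_.outputLabels.toList)
    val duplicates: Set[String] =
      allLabels.groupBy(identity).collect { case (label, hits) if hits.size > 1 => label }.toSet
    val needed: Set[String] = actions.flatMap(_.inputLabels).toSet
    val missing: Set[String] = needed -- allLabels.toSet

    // Repeatedly "run" every action whose inputs are available.
    // Whatever can never become ready is part of (or blocked by) a cycle.
    @tailrec
    def unresolved(available: Set[String], pending: Seq[ToyAction]): Seq[ToyAction] = {
      val (ready, blocked) = pending.partition(_.inputLabels.subsetOf(available))
      if (ready.isEmpty) blocked
      else unresolved(available ++ ready.flatMap(_.outputLabels), blocked)
    }

    if (duplicates.nonEmpty)
      Failure(new IllegalStateException(s"Duplicate labels: ${duplicates.mkString(", ")}"))
    else if (missing.nonEmpty)
      Failure(new IllegalStateException(s"Labels never produced: ${missing.mkString(", ")}"))
    else {
      val stuck = unresolved(inputs, actions)
      if (stuck.nonEmpty)
        Failure(new IllegalStateException(s"Cyclic dependency among: ${stuck.map(_.guid).mkString(", ")}"))
      else Success(())
    }
  }
}
```

Returning a `Try` mirrors the signature above, so a caller can chain validation with later preparation steps instead of catching exceptions.
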
def logDebug(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logName: String

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def map[R >: SparkDataFlow.this.type](f: (SparkDataFlow.this.type) ⇒ R): R

Transforms the current dataflow by applying a function to it.
f
A function that transforms a dataflow object
returns
New dataflow

Definition Classes
DataFlow
def mapOption[R >: SparkDataFlow.this.type](f: (SparkDataFlow.this.type) ⇒ Option[R]): R

Optionally transform a dataflow depending on the output of the applying function. If the transforming function returns a None then the original dataflow is returned.
f
A function that returns an Option[DataFlow]
returns
DataFlow object that may have been transformed

Definition Classes
DataFlow
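
The difference between map and mapOption can be shown on a hypothetical `ToyFlow`. This is a sketch of the described behaviour only; the type parameters of the real methods are omitted.

```scala
object MapOptionSketch {
  // Hypothetical stand-in for a flow: it only records the names of added actions.
  final case class ToyFlow(actions: Vector[String] = Vector.empty) {
    def addAction(name: String): ToyFlow = copy(actions = actions :+ name)

    // Always applies the transformation.
    def map(f: ToyFlow => ToyFlow): ToyFlow = f(this)

    // Applies the transformation only if it yields Some; None keeps the original flow.
    def mapOption(f: ToyFlow => Option[ToyFlow]): ToyFlow = f(this).getOrElse(this)
  }

  // The debug flag is a made-up parameter used to drive the optional step.
  def build(debug: Boolean): ToyFlow =
    ToyFlow()
      .addAction("load")
      .mapOption(f => if (debug) Some(f.addAction("show")) else None)
      .map(_.addAction("write"))
}
```

The optional step stays inside the fluent chain, so no intermediate variable or if/else around the flow is needed.
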
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def nextRunnable(executionPoolsAvailable: Set[String]): Seq[DataFlowAction]

Returns actions that are ready to run:
1. have no input labels;
2. whose inputs have been created;
3. whose dependent tags have all been run;
4. belong to an available pool.
Will not include actions that are skipped.
executionPoolsAvailable
set of execution pools for which to schedule actions

Definition Classes
DataFlow
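
The readiness conditions above amount to a filter over the pending actions. The sketch below assumes a hypothetical `ToyAction` and passes the flow state in as plain sets; skipped actions are not modelled.

```scala
object NextRunnableSketch {
  // Hypothetical action with its label inputs, tag dependencies and assigned pool.
  final case class ToyAction(
      guid: String,
      inputLabels: Set[String],
      dependsOnTags: Set[String],
      pool: String)

  def nextRunnable(
      pending: Seq[ToyAction],
      availableInputs: Set[String],
      completedTags: Set[String],
      executionPoolsAvailable: Set[String]): Seq[ToyAction] =
    pending.filter { a =>
      // Covers both "no input labels" (empty set) and "all inputs created".
      a.inputLabels.subsetOf(availableInputs) &&
      // Every tag the action depends on must have finished.
      a.dependsOnTags.forall(completedTags.contains) &&
      // The action's pool must be one the caller can schedule on.
      executionPoolsAvailable.contains(a.pool)
    }
}
```

An action with an empty set of input labels passes the first check automatically, which is why conditions 1 and 2 collapse into a single subset test in this model.
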
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def prepareForExecution(): Try[SparkDataFlow.this.type]

A function called just before the flow is executed. By default, this function just checks the tagging state of the flow, and it can be overridden to add implementation-specific preparation steps. An overriding function should call this function first. It would be responsible for preparing the execution environment, such as cleaning temporary directories.

Definition Classes
SparkDataFlow → DataFlow
def push(commitName: String)(committer: DataCommitter): SparkDataFlow.this.type

Associates commit name with an implementation of a data committer. There must be only one data committer per one commit name.

Definition Classes
DataFlow
def schedulingMeta(sc: SchedulingMeta): SparkDataFlow.this.type

Definition Classes
SparkDataFlow → DataFlow
def schedulingMeta: SchedulingMeta

Definition Classes
SparkDataFlow → DataFlow
def schedulingMeta(mutateState: (SchedulingMetaState) ⇒ SchedulingMetaState)(nestedFlow: (SparkDataFlow.this.type) ⇒ SparkDataFlow.this.type): SparkDataFlow.this.type

Generic method that can be used to add context and state to all actions inside the block.
mutateState
function that adds attributes to the state
nestedFlow
all actions inside of this flow will be associated with the mutated state

Definition Classes
DataFlow
def spark: SparkSession
def sqlTables: Set[String]

Execution of the flow is lazy, but a dataset can only be registered as an SQL table once the dataset has been created. With multiple threads consuming the same table, registration of the dataset as an SQL table needs to happen in synchronised code.
Labels that need to be registered as temp spark views before the execution starts. This is necessary if they are to be reused by multiple parallel threads.
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def tag[S <: DataFlow](tags: String*)(taggedFlow: (SparkDataFlow.this.type) ⇒ S): SparkDataFlow.this.type

Tag all actions added during the taggedFlow lambda function with any given number of tags. These tags can then be used by the tagDependency() action to create a dependency in the running order of actions by tag.
tags
Tags to apply to added actions
taggedFlow
An intermediate flow that actions can be added to; those actions will be marked with the tags

Definition Classes
DataFlow
def tagDependency[S <: DataFlow](depTags: String*)(tagDependentFlow: (SparkDataFlow.this.type) ⇒ S): SparkDataFlow.this.type

Mark all actions added during the tagDependentFlow lambda function as having a dependency on the tags provided. These actions will only be run once all tagged actions have finished.
depTags
Tags to create a dependency on
tagDependentFlow
An intermediate flow that actions can be added to; those actions will depend on the tagged actions having completed before running

Definition Classes
DataFlow
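
How tag and tagDependency mark actions can be sketched with a toy model. `ToyFlow` and `Tagged` are hypothetical, and the choice to accumulate tags across nested blocks is an assumption of this sketch, not something stated above.

```scala
object TagSketch {
  // Hypothetical record of an action with the tags it carries and depends on.
  final case class Tagged(name: String, tags: Set[String], dependsOn: Set[String])

  final case class ToyFlow(
      activeTags: Set[String] = Set.empty,
      activeDeps: Set[String] = Set.empty,
      actions: Vector[Tagged] = Vector.empty) {

    // An action picks up whatever tags and dependencies are active when it is added.
    def addAction(name: String): ToyFlow =
      copy(actions = actions :+ Tagged(name, activeTags, activeDeps))

    def tag(tags: String*)(taggedFlow: ToyFlow => ToyFlow): ToyFlow = {
      val outerTags = activeTags
      // Tags are active only inside the block (nested accumulation is assumed).
      taggedFlow(copy(activeTags = outerTags ++ tags)).copy(activeTags = outerTags)
    }

    def tagDependency(depTags: String*)(tagDependentFlow: ToyFlow => ToyFlow): ToyFlow = {
      val outerDeps = activeDeps
      tagDependentFlow(copy(activeDeps = outerDeps ++ depTags)).copy(activeDeps = outerDeps)
    }
  }

  // "report" must wait for every action tagged "written", without sharing any label with them.
  val flow: ToyFlow = ToyFlow()
    .tag("written") { _.addAction("write_a").addAction("write_b") }
    .tagDependency("written") { _.addAction("report") }
}
```

After both blocks close, no tags or dependencies remain active in this model, which corresponds to the "active tags is empty" validity condition listed for isValidFlowDAG.
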
def tagState(ts: DataFlowTagState): SparkDataFlow.this.type

Definition Classes
SparkDataFlow → DataFlow
def tagState: DataFlowTagState

Definition Classes
SparkDataFlow → DataFlow
def tempFolder: Option[Path]

Folder into which the temp data will be saved before it is committed to the output storage: folders, RDBMS, key-value tables.
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
def withExecutor(executor: DataFlowExecutor): SparkDataFlow.this.type

Add a new executor to this flow, replacing the existing one
executor
DataFlowExecutor to add to this flow

Definition Classes
SparkDataFlow → DataFlow


class SparkDataFlow extends DataFlow with Logging


Inherited from DataFlow

Inherited from Logging

Inherited from AnyRef

Inherited from Any