com.coxautodata.waimak.dataflow.spark
class SimpleSparkDataFlow extends SparkDataFlow with Logging

Created by Alexei Perelighin on 22/12/17.

Linear Supertypes
SparkDataFlow, DataFlow[Dataset[_], SparkFlowContext], Logging, AnyRef, Any

Instance Constructors

  1. new SimpleSparkDataFlow(spark: SparkSession, inputs: DataFlowEntities[Option[Dataset[_]]], actions: Seq[DataFlowAction[Dataset[_], SparkFlowContext]], sqlTables: Set[String], tempFolder: Option[Path], commitLabels: Map[String, LabelCommitDefinition] = Map.empty, tagState: DataFlowTagState = ...)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. val actions: Seq[DataFlowAction[Dataset[_], SparkFlowContext]]

    Actions to execute; these will be scheduled when their inputs become available. Executed actions must be removed from the state.

    Definition Classes
    SimpleSparkDataFlow → DataFlow
  5. def addAction(action: DataFlowAction[Dataset[_], SparkFlowContext]): SimpleSparkDataFlow.this.type

    Creates a new state of the dataflow by adding an action to it.

    action

    - the action to add

    returns

    - new state with the action added

    Definition Classes
    DataFlow
    Exceptions thrown

    DataFlowException when at least one of the action's input labels is present neither in the inputs nor in the outputs of existing actions
  6. def addCommitLabel(label: String, definition: LabelCommitDefinition): SparkDataFlow

    Definition Classes
    SimpleSparkDataFlow → SparkDataFlow
  7. def addInput(label: String, value: Option[Dataset[_]]): SimpleSparkDataFlow.this.type

    Creates a new state of the dataflow by adding an input. Duplicate labels are handled in prepareForExecution().

    label

    - name of the input

    value

    - value of the input

    returns

    - new state with the input

    Definition Classes
    DataFlow
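As a hedged sketch (not taken from the Waimak sources), adding inputs might look like the following; `spark`, `flow`, and the labels and paths are all hypothetical:

```scala
// Hypothetical sketch: assumes an existing SparkSession `spark` and an
// existing SimpleSparkDataFlow `flow`; labels and paths are illustrative.
import org.apache.spark.sql.Dataset

val customers: Dataset[_] = spark.read.parquet("/tmp/customers")

// Each call returns a NEW flow state; `flow` itself is unchanged.
// Duplicate labels are only detected later, in prepareForExecution().
val flowWithInputs = flow
  .addInput("customers", Some(customers))
  .addInput("orders", None) // declared, but its value is not yet available
```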
  8. def addInterceptor(interceptor: InterceptorAction[Dataset[_], SparkFlowContext], guidToIntercept: String): SimpleSparkDataFlow.this.type

    Creates a new state of the data flow by replacing the intercepted action with the action that intercepts it. When an existing InterceptorAction is itself being replaced, the action to replace will differ from the intercepted action recorded in the InterceptorAction.

    Definition Classes
    DataFlow
  9. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  10. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  11. val commitLabels: Map[String, LabelCommitDefinition]

    Definition Classes
    SimpleSparkDataFlow → SparkDataFlow
  12. def createInstance(in: DataFlowEntities[Option[Dataset[_]]], ac: Seq[DataFlowAction[Dataset[_], SparkFlowContext]], tags: DataFlowTagState): DataFlow[Dataset[_], SparkFlowContext]

    All new states of the dataflow must be created via this factory method. This allows specific dataflows to pass their specific context objects into the new state.

    in

    - input entities for the next state

    ac

    - actions for the next state

    returns

    - new instance of the implementing class

    Attributes
    protected
    Definition Classes
    SimpleSparkDataFlow → DataFlow
  13. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  14. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  15. def executed(executed: DataFlowAction[Dataset[_], SparkFlowContext], outputs: Seq[Option[Dataset[_]]]): DataFlow[Dataset[_], SparkFlowContext]

    Creates a new state of the dataflow by removing the executed action from the actions list and adding its outputs to the inputs.

    executed

    - the executed action

    outputs

    - outputs of the executed action

    returns

    - next stage of the data flow, without the executed action but with its outputs as inputs

    Definition Classes
    SparkDataFlow → DataFlow
    Exceptions thrown

    DataFlowException if the number of provided outputs is not equal to the number of output labels of the action
  16. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  17. val flowContext: SparkFlowContext

    Definition Classes
    SparkDataFlow → DataFlow
  18. def getActionByGuid(actionGuid: String): DataFlowAction[Dataset[_], SparkFlowContext]

    Guids are unique; finds an action by its guid.

    Definition Classes
    DataFlow
  19. def getActionByOutputLabel(outputLabel: String): DataFlowAction[Dataset[_], SparkFlowContext]

    Output labels are unique; finds the action that produces outputLabel.

    Definition Classes
    DataFlow
  20. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  21. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  22. val inputs: DataFlowEntities[Option[Dataset[_]]]

    Inputs that were explicitly set or produced by previous actions; these are inputs for all following actions. Inputs are preserved in the data flow state even if they are no longer required by the remaining actions. //TODO: explore the option of removing the inputs that are no longer required by remaining actions!!!

    Definition Classes
    SimpleSparkDataFlow → DataFlow
  23. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  24. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  25. def isValidFlowDAG: Try[Unit]

    Flow DAG is valid iff:
    1. all output labels and existing input labels are unique;
    2. each action depends only on labels that are produced by other actions or are already present in the inputs;
    3. the set of active tags is empty;
    4. the number of active dependencies is zero;
    5. there are no cyclic dependencies in labels;
    6. there are no cyclic dependencies in tags;
    7. there are no cyclic dependencies in label-tag combinations.

    Definition Classes
    DataFlow
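Since the result is a Try[Unit], validity can be checked with ordinary pattern matching. A hedged sketch, where `flow` is an assumed existing SimpleSparkDataFlow:

```scala
// Hypothetical sketch: `flow` is an existing SimpleSparkDataFlow.
import scala.util.{Failure, Success}

flow.isValidFlowDAG match {
  case Success(_)  => // safe to hand the flow to an executor
  case Failure(ex) => // e.g. a duplicate label or a cyclic dependency
    throw ex
}
```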
  26. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  28. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  29. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  30. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  31. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  32. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  33. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  34. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  35. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  36. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  37. def map[R >: SimpleSparkDataFlow.this.type](f: (SimpleSparkDataFlow.this.type) ⇒ R): R

    Transforms the current dataflow by applying a function to it.

    f

    A function that transforms a dataflow object

    returns

    New dataflow

    Definition Classes
    DataFlow
  38. def mapOption[R >: SimpleSparkDataFlow.this.type](f: (SimpleSparkDataFlow.this.type) ⇒ Option[R]): R

    Optionally transforms the dataflow depending on the output of the applied function. If the transforming function returns None, the original dataflow is returned.

    f

    A function that returns an Option[DataFlow]

    returns

    DataFlow object that may have been transformed

    Definition Classes
    DataFlow
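A hedged sketch of mapOption's "transform or keep" behaviour; `flow` and `debugDs` are assumed, illustrative values, not from the Waimak sources:

```scala
// Hypothetical sketch: `flow` is an existing SimpleSparkDataFlow and
// `debugDs: Option[Dataset[_]]` an optional extra input; both illustrative.
val maybeDebugged = flow.mapOption { f =>
  debugDs.map(ds => f.addInput("debug", Some(ds)))
}
// If debugDs is None, maybeDebugged is the original flow, unchanged.
```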
  39. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  40. def nextRunnable(): Seq[DataFlowAction[Dataset[_], SparkFlowContext]]

    Returns actions that are ready to run:
    1. actions that have no input labels;
    2. actions whose inputs have all been created;
    3. actions whose dependent tags have all been run.

    Will not include actions that are skipped.

    Definition Classes
    DataFlow
  41. final def notify(): Unit

    Definition Classes
    AnyRef
  42. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  43. def prepareForExecution(): SimpleSparkDataFlow.this.type

    A function called just before the flow is executed. By default, this function only checks the tagging state of the flow; it can be overloaded with implementation-specific preparation steps, such as preparing an execution environment by cleaning temporary directories. An overloaded function should call this function first.

    Definition Classes
    SparkDataFlow → DataFlow
  44. val spark: SparkSession

    Definition Classes
    SimpleSparkDataFlow → SparkDataFlow
  45. val sqlTables: Set[String]

    Labels that need to be registered as temporary Spark views before the execution starts. Execution of the flow is lazy, but registration of a dataset as an SQL table can only happen once the dataset is created; with multiple threads consuming the same table, that registration needs to happen in synchronised code. Registering these labels up front is necessary if they are to be reused by multiple parallel threads.

    Definition Classes
    SimpleSparkDataFlow → SparkDataFlow
  46. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  47. def tag[S <: DataFlow[Dataset[_], SparkFlowContext]](tags: String*)(taggedFlow: (SimpleSparkDataFlow.this.type) ⇒ S): SimpleSparkDataFlow.this.type

    Tags all actions added during the taggedFlow lambda function with any given number of tags. These tags can then be used by the tagDependency() action to create a dependency in the running order of actions by tag.

    tags

    Tags to apply to added actions

    taggedFlow

    An intermediate flow that actions can be added to; those actions will be marked with the tags

    Definition Classes
    DataFlow
  48. def tagDependency[S <: DataFlow[Dataset[_], SparkFlowContext]](depTags: String*)(tagDependentFlow: (SimpleSparkDataFlow.this.type) ⇒ S): SimpleSparkDataFlow.this.type

    Marks all actions added during the tagDependentFlow lambda function as having a dependency on the tags provided. These actions will only run once all tagged actions have finished.

    depTags

    Tags to create a dependency on

    tagDependentFlow

    An intermediate flow that actions can be added to; those actions will run only after the tagged actions have completed

    Definition Classes
    DataFlow
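A hedged sketch of tag() and tagDependency() used together to force a running order; `flow`, `writeAction`, and `readBackAction` are illustrative stand-ins for real values, not from the Waimak sources:

```scala
// Hypothetical sketch: writeAction and readBackAction stand in for real
// DataFlowAction[Dataset[_], SparkFlowContext] values; names illustrative.
val orderedFlow = flow
  .tag("writes") { f =>
    f.addAction(writeAction) // runs under the "writes" tag
  }
  .tagDependency("writes") { f =>
    f.addAction(readBackAction) // runs only after all "writes" actions finish
  }
```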
  49. val tagState: DataFlowTagState

    Definition Classes
    SimpleSparkDataFlow → DataFlow
  50. val tempFolder: Option[Path]

    Folder into which temp data will be saved before being committed to the output storage: folders, RDBMSs, key-value tables.

    Definition Classes
    SimpleSparkDataFlow → SparkDataFlow
  51. def toString(): String

    Definition Classes
    AnyRef → Any
  52. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  53. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  54. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Inherited from SparkDataFlow

Inherited from DataFlow[Dataset[_], SparkFlowContext]

Inherited from Logging

Inherited from AnyRef

Inherited from Any
