Package com.coxautodata.waimak.dataflow

package dataflow

Created by Alexei Perelighin on 2018/01/11.

Type Members

  1. type ActionResult = Seq[Option[Any]]

  2. trait ActionScheduler extends AnyRef

    Defines functions that are specific to scheduling tasks, evaluating which execution pools are available and signaling back which actions have finished their execution.

    Created by Alexei Perelighin on 2018/07/06

  3. sealed case class CachePostAction[T](run: (Option[T]) ⇒ Option[T], labelToIntercept: String) extends PostAction[T] with Product with Serializable

  4. case class CommitEntry(label: String, commitName: String, partitions: Option[Either[Seq[String], Int]], repartition: Boolean, cache: Boolean) extends Product with Serializable

  5. implicit class CommitImplicits[Self <: DataFlow[Self]] extends AnyRef

  6. case class CommitMeta[S <: DataFlow[S]](commits: Map[String, Seq[CommitEntry]], pushes: Map[String, Seq[DataCommitter[S]]]) extends Product with Serializable

    Contains configurations for commits and pushes. While configs are added, there are no modifications to the data flow, as it waits for a validation before execution.

    commits

    Map[ COMMIT_NAME, Seq[CommitEntry] ]

    pushes

    Map[ COMMIT_NAME, Seq[DataCommitter] ] - there should be one committer per commit name, but due to the lazy definition of data flows, validation will have to catch violations. A construction sketch follows below.
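
    As an illustration only (the commit name, label and the MyFlow type are invented; the field meanings follow the signatures above, and the Left/Right reading of partitions is an assumption):

      import com.coxautodata.waimak.dataflow.{CommitEntry, CommitMeta}

      // One entry: commit the output labelled "report" under commit name "monthly",
      // repartitioning to 12 partitions and caching it in the temp area first.
      val entry = CommitEntry(
        label       = "report",
        commitName  = "monthly",
        partitions  = Some(Right(12)), // assumed: Right(n) = partition count, Left(cols) = partition columns
        repartition = true,
        cache       = true
      )

      // commits: Map[COMMIT_NAME, Seq[CommitEntry]]; no pushes registered here.
      val meta: CommitMeta[MyFlow] = CommitMeta(Map("monthly" -> Seq(entry)), Map.empty)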

  7. case class CommitMetadataExtension[S <: DataFlow[S]](commitMeta: CommitMeta[S]) extends DataFlowMetadataExtension[S] with Product with Serializable

  8. abstract class DataCommitter[A <: DataFlow[A]] extends AnyRef

    Defines the phases of each data committer in the data flow, which are:

      1) validate that the committer is properly configured
      2) cache the labels in a temp area
      3) when the flow is successful, move all cached data into its permanent storage
      4) finalise or clean up after the committer has committed all of the data into permanent storage

    A skeleton sketch follows below.

    Created by Alexei Perelighin
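
    The method names in this skeleton (validate, stage, moveToPermanent, finish) are invented to mirror the four phases above; they are not the abstract members of DataCommitter, which this page does not list:

      import scala.util.Try

      // Hypothetical shape only, to make the phase ordering concrete.
      abstract class FourPhaseCommitter[A <: DataFlow[A]] extends DataCommitter[A] {
        def validate(commitName: String, flow: A): Try[Unit] // 1) committer properly configured?
        def stage(commitName: String, flow: A): A            // 2) cache the labels in a temp area
        def moveToPermanent(commitName: String, flow: A): A  // 3) runs only after a successful flow
        def finish(commitName: String, flow: A): A           // 4) finalise / clean up
      }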

  9. abstract class DataFlow[Self <: DataFlow[Self]] extends Logging

    Defines a state of the data flow. State is defined by the inputs that are ready to be consumed and the actions that need to be executed. In most BAU cases, the initial state of the data flow has no inputs, as they need to be produced by the actions. When an action finishes, it can produce 0 to N outputs; to create the next state of the data flow, that action is removed from the flow and its outputs are added as inputs into the flow. This state transitioning enables restarts of the flow from any point, as well as debug/exploratory runs with already existing/manufactured/captured/materialised inputs.

    Inputs are also useful for unit testing, as they give access to all intermediate outputs of actions. A conceptual sketch of the state model follows below.
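
    The following is a conceptual model written for this page, not the library's actual API (FlowStateModel and its members are invented):

      // Invented illustration: state = materialised inputs + pending actions.
      final case class FlowStateModel(
          inputs: Map[String, Option[Any]], // label -> materialised value (None = empty)
          pending: List[DataFlowAction]
      ) {
        // When an action finishes, remove it and add its outputs as new inputs.
        def afterAction(done: DataFlowAction, outputs: Map[String, Option[Any]]): FlowStateModel =
          FlowStateModel(inputs ++ outputs, pending.filterNot(_ == done))
      }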

  10. trait DataFlowAction extends AnyRef

    An action to be performed as part of a data flow. Actions declare which input labels they expect in order to start execution (0 .. N) and can produce outputs associated with their output labels (0 .. N). Executors use these labels to schedule the actions sequentially or in parallel. A hedged sketch of a custom action follows below.
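
    In this sketch the member names and the performAction signature are assumptions modelled on the types listed on this page (DataFlowEntities, FlowContext, ActionResult), not confirmed API:

      import scala.util.{Success, Try}

      // Hypothetical action: consumes the "raw" input, produces the "clean" output.
      class CleanAction extends DataFlowAction {
        val inputLabels: List[String]  = List("raw")
        val outputLabels: List[String] = List("clean")

        def performAction[C <: FlowContext](inputs: DataFlowEntities, flowContext: C): Try[ActionResult] =
          Success(Seq(Some("cleaned-value"))) // one Option[Any] per output label
      }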

  11. trait DataFlowActionState extends AnyRef

    State of the action.

  12. case class DataFlowActionTags(tags: Set[String], dependentOnTags: Set[String]) extends Product with Serializable

    Represents the tag state on a given action.

    tags

    Tags belonging to this action

    dependentOnTags

    Tags this action is dependent on

  13. trait DataFlowConfigurationExtension[S <: DataFlow[S]] extends AnyRef

    Trait used to define a DataFlow configuration extension. This type of extension adds a pre-execution hook, and is enabled by setting spark.waimak.dataflow.extensions=${extensionKey},otherextension.

    Instances of this trait must be registered as services in a META-INF/services file, as they are loaded using ServiceLoader. A registration sketch follows below.
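
    A minimal registration sketch; the implementation class, its extensionKey value and the preExecutionHook method are invented for the example, and the service file path follows standard ServiceLoader conventions for this trait:

      // src/main/resources/META-INF/services/com.coxautodata.waimak.dataflow.DataFlowConfigurationExtension
      //   com.example.MyExtension

      package com.example

      import com.coxautodata.waimak.dataflow.{DataFlow, DataFlowConfigurationExtension}

      class MyExtension[S <: DataFlow[S]] extends DataFlowConfigurationExtension[S] {
        def extensionKey: String = "myextension" // enabled via spark.waimak.dataflow.extensions=myextension

        def preExecutionHook(flow: S): S = flow  // assumed hook name; runs before execution
      }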

  14. class DataFlowEntities extends AnyRef

    Maintains data flow entities (the inputs and outputs of data flow actions). Every entity has a label which must be unique across the data flow.

  15. class DataFlowException extends RuntimeException

  16. trait DataFlowExecutor extends Logging

    Created by Alexei Perelighin on 11/01/18.

  17. trait DataFlowMetadataExtension[S <: DataFlow[S]] extends AnyRef

    Trait used to define a DataFlow Metadata extension. This type of extension adds custom metadata to a flow and is keyed by the extension instance.

  18. trait DataFlowMetadataExtensionIdentifier extends AnyRef

    Trait used as an identifier for an instance of an extension.

  19. case class DataFlowTagState(activeTags: Set[String], activeDependentOnTags: Set[String], taggedActions: Map[String, DataFlowActionTags]) extends Product with Serializable

    Represents the tag state on a DataFlow. A usage sketch of the tag() and tagDependency() contexts follows after the field notes below.

    activeTags

    Tags currently active on the flow (i.e. within the tag() context)

    activeDependentOnTags

    Tag dependencies currently active on the flow (i.e. within the tagDependency() context)

    taggedActions

    Mapping of actions to their applied tag state
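
    As an illustration only (the helper functions and flow value are invented; tag and tagDependency are the contexts referred to above, assumed to be available on the flow):

      // Actions added inside tag(...) carry that tag; actions added inside
      // tagDependency(...) run only after all actions carrying that tag finish.
      val taggedFlow = flow
        .tag("extracts") { f =>
          addExtractActions(f)        // hypothetical helper adding the tagged actions
        }
        .tagDependency("extracts") { f =>
          addDownstreamActions(f)     // hypothetical helper adding the dependent actions
        }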

  20. class EmptyFlowContext extends FlowContext

  21. sealed case class Executed() extends DataFlowActionState with Product with Serializable

    Action was executed and cannot be executed again.

  22. case class ExecutionPoolDesc(poolName: String, maxJobs: Int, running: Set[String], threadsExecutor: Option[ExecutionContextExecutorService]) extends Product with Serializable

  23. sealed case class ExpectedInputIsEmpty(ready: Seq[String], notReady: Seq[String]) extends DataFlowActionState with Product with Serializable

    Cannot be executed: the expected input is present, but it is empty.

  24. trait FlowContext extends AnyRef

  25. trait FlowReporter extends AnyRef

  26. class InterceptorAction extends DataFlowAction

    This action can be added to the flow over an existing action; it will be scheduled instead of it and can override or intercept the behaviour of that action. This is useful when additional behaviours need to be added. Examples: registering outputs as Spark temp views for SQL, logging, filtering, persisting to disk, dredging, etc. A construction sketch follows below.

    Created by Alexei Perelighin on 23/02/2018.
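
    The PostActionInterceptor, TransformPostAction and CachePostAction entries on this page give enough of a signature for a sketch; existingAction and the "items" label are invented:

      // Intercept the action producing "items" and upper-case its String output.
      val interceptor = PostActionInterceptor[String](
        toIntercept = existingAction, // placeholder DataFlowAction producing "items"
        postActions = Seq(
          TransformPostAction[String](
            run = _.map(_.toUpperCase), // applied to the intercepted label's output
            labelToIntercept = "items"
          )
        )
      )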

  27. class NoReportingFlowReporter extends FlowReporter

  28. class ParallelActionScheduler extends ActionScheduler with Logging

    Can run multiple actions in parallel, with support for multiple execution pools.

    It was originally designed to benefit from the Spark Fair Scheduler (https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools). Execution pool names must be the same as the Fair Scheduler pool names, and the number of parallel jobs within a pool is the number of Java threads.

    Example: to configure 2 pools for the Spark Fair Scheduler, the following XML must be passed to Spark:

      <?xml version="1.0"?>
      <allocations>
        <pool name="high">
          <schedulingMode>FAIR</schedulingMode>
          <weight>1000</weight>
          <minShare>0</minShare>
        </pool>
        <pool name="medium">
          <schedulingMode>FAIR</schedulingMode>
          <weight>25</weight>
          <minShare>0</minShare>
        </pool>
      </allocations>

    The following configuration options need to be specified:

      1. spark.scheduler.mode=FAIR
      2. spark.scheduler.allocation.file=PATH_TO_THE_XML

    In code, the pools parameter:

      Map(
        ("high"   -> ExecutionPoolDesc("high", 10, Set.empty, None)),
        ("medium" -> ExecutionPoolDesc("medium", 20, Set.empty, None))
      )

    Created by Alexei Perelighin on 2018/07/10

  29. class ParallelDataFlowExecutor extends DataFlowExecutor with Logging

  30. sealed abstract class PostAction[T] extends AnyRef

  31. case class PostActionInterceptor[T](toIntercept: DataFlowAction, postActions: Seq[PostAction[T]]) extends InterceptorAction with Logging with Product with Serializable

  32. sealed case class ReadyToRun(ready: Seq[String]) extends DataFlowActionState with Product with Serializable

    Action is ready to run.

  33. sealed case class RequiresInput(ready: Seq[String], notReady: Seq[String]) extends DataFlowActionState with Product with Serializable

    Action cannot be executed, as it requires more inputs to be available.

  34. case class SchedulingMeta(state: SchedulingMetaState, actionState: Map[String, SchedulingMetaState]) extends Product with Serializable

    When a data flow is defined, certain hints related to its execution can be specified; these hints help the scheduler decide when and where to run each action. Further uses can be added to it.

    At the moment, when an action is added to the scheduling meta, it is automatically assigned the current execution pool; if there were other global context attributes to assign, the action could acquire them as well.

    state

    describes a current state of schedulingMeta

    actionState

    Map[ DataFlowAction.schedulingGuid, SchedulingMetaState ] - association between actions and their scheduling state (including the execution pool name)

  35. case class SchedulingMetaState(executionPoolName: String, context: Option[Any] = None) extends Product with Serializable

    Contains values that will be associated with all actions added to the data flow.

    executionPoolName

    name of the execution pool

  36. class SequentialDataFlowExecutor extends DataFlowExecutor with Logging

    Executes one action at a time without trying to parallelize them.

    Created by Alexei Perelighin 2017/12/27

  37. class SequentialScheduler extends ActionScheduler with Logging

    Executes only one action at a time.

    Created by Alexei Perelighin on 2018/07/06

  38. sealed case class TransformPostAction[T](run: (Option[T]) ⇒ Option[T], labelToIntercept: String) extends PostAction[T] with Product with Serializable

Value Members

  1. object CommitMeta extends Serializable

  2. object CommitMetadataExtension extends Serializable

  3. object CommitMetadataExtensionIdentifier extends DataFlowMetadataExtensionIdentifier with Product with Serializable

  4. val DEFAULT_POOL_NAME: String

  5. object DFExecutorPriorityStrategies

    Defines various priority strategies for DataFlowExecutor to use.

    Created by Alexei Perelighin on 24/08/2018.

  6. object DataFlow

  7. object DataFlowEntities

  8. object NoReportingFlowReporter

  9. object ParallelActionScheduler

  10. object ParallelDataFlowExecutor

  11. object SequentialDataFlowExecutor

  12. object Waimak

    Defines factory functions for creating and running Waimak data flows. A usage sketch follows below.

    Created by Alexei Perelighin on 2018/02/27
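
    A minimal end-to-end sketch, under the assumption that Waimak.sparkFlow and Waimak.sparkExecutor are the entry points and that actions such as openParquet and show come from the spark subpackage (verify against the spark package docs; the path and label are invented):

      import com.coxautodata.waimak.dataflow.Waimak
      import com.coxautodata.waimak.dataflow.spark._ // brings the Spark actions into scope
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("waimak-example").getOrCreate()

      // Define the flow lazily, then hand it to an executor to run.
      val flow = Waimak.sparkFlow(spark)
        .openParquet("/data/in")("events")
        .show("events")

      Waimak.sparkExecutor().execute(flow)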

  13. package spark
