Defines functions that are specific to scheduling tasks, evaluating which execution pools are available and signaling back which actions have finished their execution.
Created by Alexei Perelighin on 2018/07/06
Contains configurations for commits and pushes. While configs are added, there are no modifications to the data flow, as it waits for validation before execution.
Map[COMMIT_NAME, Seq[CommitEntry]]
Map[COMMIT_NAME, Seq[DataCommitter]] - there should be one committer per commit name, but due to the lazy definition of data flows, validation will have to catch violations.
Defines the phases of each data committer in the data flow:
1) validate that the committer is properly configured
2) cache the labels in a temporary area
3) when the flow is successful, move all cached data into its permanent storage
4) finalise or clean up after the committer has committed all of the data into permanent storage
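A minimal sketch of those four phases as they might appear on a committer; the trait and method names below are hypothetical illustrations, not the library's actual DataCommitter API:
{{{
// Hypothetical sketch of the four phases; names and signatures are
// illustrative only and do not mirror the actual Waimak API.
trait CommitterPhases[Flow] {
  // 1) fail fast if the committer is misconfigured
  def validate(flow: Flow, commitName: String): Either[String, Unit]
  // 2) stage the labelled outputs in a temporary area
  def cacheToTemp(flow: Flow, labels: Seq[String]): Flow
  // 3) on flow success, move the staged data into permanent storage
  def moveToPermanent(flow: Flow, labels: Seq[String]): Flow
  // 4) finalise/clean up once everything is committed
  def finalise(flow: Flow): Flow
}
}}}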
Created by Alexei Perelighin
Defines a state of the data flow. State is defined by the inputs that are ready to be consumed and the actions that need to be executed. In most business-as-usual (BAU) cases, the initial state of the data flow has no inputs, as they need to be produced by the actions. When an action finishes, it can produce 0 to N outputs; to create the next state of the data flow, that action is removed from the flow and its outputs are added as inputs. This state transitioning enables restarts of the flow from any point, as well as debug/exploratory runs with already existing/manufactured/captured/materialised inputs.
Inputs are also useful for unit testing, as they give access to all intermediate outputs of actions.
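A minimal sketch of the state transition described above, with hypothetical names rather than the actual Waimak types:
{{{
// Illustrative sketch of a state transition: the finished action is
// removed and its 0..N outputs become inputs of the next state.
object FlowStateSketch {
  case class FlowState[A](inputs: Map[String, Any], actions: Seq[A])

  def nextState[A](state: FlowState[A], finished: A, outputs: Map[String, Any]): FlowState[A] =
    FlowState(
      inputs = state.inputs ++ outputs,                 // outputs become inputs
      actions = state.actions.filterNot(_ == finished)  // finished action is removed
    )
}
}}}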
An action to be performed as part of a data flow. Actions declare which input labels they expect in order to start execution (0..N) and can produce outputs associated with output labels (0..N). Executors use these labels to schedule the actions sequentially or in parallel.
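For illustration, an action's label declarations could look like the following sketch (hypothetical trait, not the actual DataFlowAction interface):
{{{
// Hypothetical illustration of an action declaring its labels.
object LabelledActionSketch {
  trait LabelledAction {
    def inputLabels: List[String]   // 0..N labels required before execution
    def outputLabels: List[String]  // 0..N labels this action produces
  }

  // An action consuming one label and producing two:
  val example: LabelledAction = new LabelledAction {
    val inputLabels = List("raw_events")
    val outputLabels = List("clean_events", "rejected_events")
  }
}
}}}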
State of the action.
Represents the tag state on a given action
Tags belonging to this action
Tags this action is dependent on
Trait used to define a DataFlow Configuration extension.
This type of extension adds a pre-execution hook when an extension is enabled by setting spark.waimak.dataflow.extensions=${extensionKey},otherextension.
Instances of this trait must be registered as services in a META-INF/services file, as they are loaded using ServiceLoader.
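For example, discovery via the standard java.util.ServiceLoader mechanism looks roughly like this; the trait, package and file contents below are assumptions for illustration:
{{{
// A sketch of ServiceLoader-based discovery; names are illustrative only.
//
// File: META-INF/services/com.example.MyConfigurationExtension
//   com.example.MyConfigurationExtensionImpl
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._  // scala.collection.JavaConverters on Scala 2.12

trait MyConfigurationExtension {
  def extensionKey: String
}

object ExtensionLoading {
  // ServiceLoader instantiates every implementation listed in the services file
  val loaded: List[MyConfigurationExtension] =
    ServiceLoader.load(classOf[MyConfigurationExtension]).asScala.toList
}
}}}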
Maintains data flow entities (the inputs and outputs of data flow actions). Every entity has a label which must be unique across the data flow.
Created by Alexei Perelighin on 11/01/18.
Trait used to define a DataFlow Metadata extension. This type of extension adds custom metadata to a flow and is keyed by the extension instance.
Trait used as an identifier for an instance of an extension.
Represents the tag state on a DataFlow
Tags currently active on the flow (i.e. within the tag() context)
Tag dependencies currently active on the flow (i.e. within the tagDependency() context)
Mapping of actions to their applied tag state
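A hedged usage sketch of the two contexts; the flow value, the actions and the exact method shapes are assumed for illustration rather than quoted from the API:
{{{
// Hedged sketch: assumes tag(...) and tagDependency(...) open contexts in
// which added actions acquire tags / tag dependencies. `flow`,
// `writeAction` and `auditAction` are hypothetical placeholders.
val taggedFlow = flow
  .tag("writes") { f =>
    f.addAction(writeAction)       // carries the "writes" tag
  }
  .tagDependency("writes") { f =>
    f.addAction(auditAction)       // will only run after all "writes" actions
  }
}}}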
Action was executed and cannot be executed again.
Cannot be executed: the expected input is present but empty.
This action can be added to the flow over an existing action; it will be scheduled in place of it and can override or intercept the behaviour of the original action. This is useful when additional behaviours need to be added. Examples: registering outputs as Spark temp views for SQL, logging, filtering, persisting to disk, dredging, etc.
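A minimal sketch of the interception idea, with hypothetical names rather than the real interceptor API:
{{{
// Hypothetical sketch: an interceptor wraps an existing action and adds
// behaviour (here, logging) around the original execution.
trait Action { def run(inputs: Map[String, Any]): Map[String, Any] }

case class LoggingInterceptor(underlying: Action) extends Action {
  def run(inputs: Map[String, Any]): Map[String, Any] = {
    println(s"running with inputs: ${inputs.keys.mkString(", ")}")
    val outputs = underlying.run(inputs) // delegate to the intercepted action
    println(s"produced outputs: ${outputs.keys.mkString(", ")}")
    outputs
  }
}
}}}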
Created by Alexei Perelighin on 23/02/2018.
Can run multiple actions in parallel, with support for multiple execution pools.
It was originally designed to benefit from the Spark fair scheduler (https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools). Execution pool names must be the same as the Fair Scheduler pool names, and the number of parallel jobs within a pool is the number of Java threads.
Example: to configure 2 pools for the Spark Fair Scheduler, the following XML must be passed to Spark:
{{{
<?xml version="1.0"?>
<allocations>
  <pool name="high">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1000</weight>
    <minShare>0</minShare>
  </pool>
  <pool name="medium">
    <schedulingMode>FAIR</schedulingMode>
    <weight>25</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
}}}
The following configuration options need to be specified:
in code, the pools parameter: Map("high" -> ExecutionPoolDesc("high", 10, Set.empty, None), "medium" -> ExecutionPoolDesc("medium", 20, Set.empty, None))
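For context, these pools line up with Spark's own fair scheduler settings; a sketch of the wiring using standard Spark configuration (the allocation file path and app name are assumptions):
{{{
// Standard Spark settings for the fair scheduler; the file path and app
// name below are assumptions for illustration.
import org.apache.spark.sql.SparkSession

object FairSchedulerWiring {
  val spark: SparkSession = SparkSession.builder()
    .appName("waimak-pools-example")
    .config("spark.scheduler.mode", "FAIR")
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .getOrCreate()

  // Threads submit jobs into a named pool via a thread-local property:
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high")
}
}}}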
Created by Alexei Perelighin on 2018/07/10
Action is ready to run.
Action cannot be executed as it requires more inputs to be available.
When a Data Flow is defined, certain hints related to its execution can be specified; these hints help the scheduler decide when and where to run the action. Further uses can be added.
At the moment, when an action is added to the scheduling meta, it is automatically assigned the current execution pool; if there were other global context attributes to assign, the action could acquire them as well.
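A rough sketch of that behaviour, using hypothetical types for illustration:
{{{
// Hypothetical sketch: adding an action records the currently active
// execution pool against the action's scheduling GUID.
case class SchedulingMetaSketch(currentPool: String, actionPools: Map[String, String]) {
  def addAction(schedulingGuid: String): SchedulingMetaSketch =
    copy(actionPools = actionPools + (schedulingGuid -> currentPool))
  def setPool(pool: String): SchedulingMetaSketch = copy(currentPool = pool)
}
}}}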
Describes the current state of the schedulingMeta.
Map[DataFlowAction.schedulingGuid, Execution Pool Name] - association between actions and execution pool names
Contains values that will be associated with all actions added to the data flow.
name of the execution pool
Created by Alexei Perelighin 2017/12/27
Executes one action at a time without trying to parallelize them.
Executes only one action at a time.
Created by Alexei Perelighin on 2018/07/06
Defines various priority strategies for DataFlowExecutor to use.
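As an illustration, a priority strategy can be thought of as an ordering applied to the runnable actions before the executor picks what to schedule next; a minimal sketch with hypothetical names:
{{{
// Hypothetical sketch: a priority strategy reorders runnable actions.
object PrioritySketch {
  type PriorityStrategy[A] = Seq[A] => Seq[A]

  val asInFlow: PriorityStrategy[String] = identity  // preserve definition order
  val alphabetical: PriorityStrategy[String] = _.sorted
}
}}}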
Created by Alexei Perelighin on 24/08/2018.
Defines factory functions for creating and running Waimak data flows.
Created by Alexei Perelighin on 2018/02/27
Created by Alexei Perelighin on 2018/01/11.