Defines functions that are specific to scheduling tasks, evaluating which execution pools are available and signaling back which actions have finished their execution.
Created by Alexei Perelighin on 2018/07/06
Contains configurations for commits and pushes. While configs are added, there are no modifications to the data flow, as it waits for validation before execution.
Map[COMMIT_NAME, Seq[CommitEntry]]
Map[COMMIT_NAME, Seq[DataCommitter]] - there should be one committer per commit name, but due to the lazy definition of data flows, validation will have to catch violations.
Defines the phases of each data committer in the data flow:
1) validate that the committer is properly configured
2) cache the labels in a temporary area
3) when the flow is successful, move all cached data into its permanent storage
4) finalise or clean up after the committer has committed all of the data into permanent storage
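A minimal sketch of those four phases as they might appear on a committer; the trait and method names below are hypothetical illustrations, not the library's actual DataCommitter API:
{{{
// Hypothetical sketch of the four phases; names and signatures are
// illustrative only and do not mirror the actual Waimak API.
trait CommitterPhases[Flow] {
  // 1) fail fast if the committer is misconfigured
  def validate(flow: Flow, commitName: String): Either[String, Unit]
  // 2) stage the labelled outputs in a temporary area
  def cacheToTemp(flow: Flow, labels: Seq[String]): Flow
  // 3) on flow success, move the staged data into permanent storage
  def moveToPermanent(flow: Flow, labels: Seq[String]): Flow
  // 4) finalise/clean up once everything is committed
  def finalise(flow: Flow): Flow
}
}}}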
Created by Alexei Perelighin
Defines a state of the data flow. State is defined by the inputs that are ready to be consumed and the actions that need to be executed. In most business-as-usual (BAU) cases, the initial state of the data flow has no inputs, as they need to be produced by the actions. When an action finishes, it can produce 0 to N outputs; to create the next state of the data flow, that action is removed from the flow and its outputs are added as inputs. This state transitioning enables restarts of the flow from any point, as well as debug/exploratory runs with already existing/manufactured/captured/materialised inputs.
Inputs are also useful for unit testing, as they give access to all intermediate outputs of actions.
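A minimal sketch of the state transition described above, with hypothetical names rather than the actual Waimak types:
{{{
// Illustrative sketch of a state transition: the finished action is
// removed and its 0..N outputs become inputs of the next state.
object FlowStateSketch {
  case class FlowState[A](inputs: Map[String, Any], actions: Seq[A])

  def nextState[A](state: FlowState[A], finished: A, outputs: Map[String, Any]): FlowState[A] =
    FlowState(
      inputs = state.inputs ++ outputs,                 // outputs become inputs
      actions = state.actions.filterNot(_ == finished)  // finished action is removed
    )
}
}}}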
An action to be performed as part of a data flow. Actions declare which input labels they expect in order to start execution (0..N) and can produce outputs associated with output labels (0..N). Executors use these labels to schedule the actions sequentially or in parallel.
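For illustration, an action's label declarations could look like the following sketch (hypothetical trait, not the actual DataFlowAction interface):
{{{
// Hypothetical illustration of an action declaring its labels.
object LabelledActionSketch {
  trait LabelledAction {
    def inputLabels: List[String]   // 0..N labels required before execution
    def outputLabels: List[String]  // 0..N labels this action produces
  }

  // An action consuming one label and producing two:
  val example: LabelledAction = new LabelledAction {
    val inputLabels = List("raw_events")
    val outputLabels = List("clean_events", "rejected_events")
  }
}
}}}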
State of the action.
Represents the tag state on a given action
Tags belonging to this action
Tags this action is dependent on
Trait used to define a DataFlow Configuration extension.
This type of extension adds a pre-execution hook when an extension is enabled by setting spark.waimak.dataflow.extensions=${extensionKey},otherextension.
Instances of this trait must be registered as services in a META-INF/services file, as they are loaded using ServiceLoader.
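For example, discovery via the standard java.util.ServiceLoader mechanism looks roughly like this; the trait, package and file contents below are assumptions for illustration:
{{{
// A sketch of ServiceLoader-based discovery; names are illustrative only.
//
// File: META-INF/services/com.example.MyConfigurationExtension
//   com.example.MyConfigurationExtensionImpl
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._  // scala.collection.JavaConverters on Scala 2.12

trait MyConfigurationExtension {
  def extensionKey: String
}

object ExtensionLoading {
  // ServiceLoader instantiates every implementation listed in the services file
  val loaded: List[MyConfigurationExtension] =
    ServiceLoader.load(classOf[MyConfigurationExtension]).asScala.toList
}
}}}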
Maintains data flow entities (the inputs and outputs of data flow actions). Every entity has a label which must be unique across the data flow.
Created by Alexei Perelighin on 11/01/18.
Trait used to define a DataFlow Metadata extension. This type of extension adds custom metadata to a flow and is keyed by the extension instance.
Trait used as an identifier for an instance of an extension.
Represents the tag state on a DataFlow
Tags currently active on the flow (i.e. within the tag() context)
Tag dependencies currently active on the flow (i.e. within the tagDependency() context)
Mapping of actions to their applied tag state
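A hedged usage sketch of the two contexts; the flow value, the actions and the exact method shapes are assumed for illustration rather than quoted from the API:
{{{
// Hedged sketch: assumes tag(...) and tagDependency(...) open contexts in
// which added actions acquire tags / tag dependencies. `flow`,
// `writeAction` and `auditAction` are hypothetical placeholders.
val taggedFlow = flow
  .tag("writes") { f =>
    f.addAction(writeAction)       // carries the "writes" tag
  }
  .tagDependency("writes") { f =>
    f.addAction(auditAction)       // will only run after all "writes" actions
  }
}}}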
Action was executed and cannot be executed again.
Cannot be executed: the expected input is present but empty.
This action can be added to the flow over an existing action; it will be scheduled in place of it and can override or intercept the behaviour of the original action. This is useful when additional behaviours need to be added. Examples: registering outputs as Spark temp views for SQL, logging, filtering, persisting to disk, dredging, etc.
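A minimal sketch of the interception idea, with hypothetical names rather than the real interceptor API:
{{{
// Hypothetical sketch: an interceptor wraps an existing action and adds
// behaviour (here, logging) around the original execution.
trait Action { def run(inputs: Map[String, Any]): Map[String, Any] }

case class LoggingInterceptor(underlying: Action) extends Action {
  def run(inputs: Map[String, Any]): Map[String, Any] = {
    println(s"running with inputs: ${inputs.keys.mkString(", ")}")
    val outputs = underlying.run(inputs) // delegate to the intercepted action
    println(s"produced outputs: ${outputs.keys.mkString(", ")}")
    outputs
  }
}
}}}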
Created by Alexei Perelighin on 23/02/2018.
Can run multiple actions in parallel, with support for multiple execution pools.
It was originally designed to benefit from the Spark fair scheduler (https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools). Execution pool names must be the same as the Fair Scheduler pool names, and the number of parallel jobs within a pool is the number of Java threads.
Example: to configure 2 pools for the Spark Fair Scheduler, the following XML must be passed to Spark:
{{{
<?xml version="1.0"?>
<allocations>
  <pool name="high">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1000</weight>
    <minShare>0</minShare>
  </pool>
  <pool name="medium">
    <schedulingMode>FAIR</schedulingMode>
    <weight>25</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
}}}
The following configuration options need to be specified:
in code, the pools parameter: Map("high" -> ExecutionPoolDesc("high", 10, Set.empty, None), "medium" -> ExecutionPoolDesc("medium", 20, Set.empty, None))
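For context, these pools line up with Spark's own fair scheduler settings; a sketch of the wiring using standard Spark configuration (the allocation file path and app name are assumptions):
{{{
// Standard Spark settings for the fair scheduler; the file path and app
// name below are assumptions for illustration.
import org.apache.spark.sql.SparkSession

object FairSchedulerWiring {
  val spark: SparkSession = SparkSession.builder()
    .appName("waimak-pools-example")
    .config("spark.scheduler.mode", "FAIR")
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .getOrCreate()

  // Threads submit jobs into a named pool via a thread-local property:
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high")
}
}}}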
Created by Alexei Perelighin on 2018/07/10
Action is ready to run.
Action cannot be executed as it requires more inputs to be available.
When a Data Flow is defined, certain hints related to its execution can be specified; these hints help the scheduler decide when and where to run the action. Further uses can be added.
At the moment, when an action is added to the scheduling meta, it is automatically assigned the current execution pool; if there were other global context attributes to assign, the action could acquire them as well.
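A rough sketch of that behaviour, using hypothetical types for illustration:
{{{
// Hypothetical sketch: adding an action records the currently active
// execution pool against the action's scheduling GUID.
case class SchedulingMetaSketch(currentPool: String, actionPools: Map[String, String]) {
  def addAction(schedulingGuid: String): SchedulingMetaSketch =
    copy(actionPools = actionPools + (schedulingGuid -> currentPool))
  def setPool(pool: String): SchedulingMetaSketch = copy(currentPool = pool)
}
}}}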
Describes the current state of the schedulingMeta.
Map[DataFlowAction.schedulingGuid, Execution Pool Name] - association between actions and execution pool names
Contains values that will be associated with all actions added to the data flow.
name of the execution pool
Created by Alexei Perelighin 2017/12/27
Executes one action at a time without trying to parallelize them.
Executes only one action at a time.
Created by Alexei Perelighin on 2018/07/06
Defines various priority strategies for DataFlowExecutor to use.
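As an illustration, a priority strategy can be thought of as an ordering applied to the runnable actions before the executor picks what to schedule next; a minimal sketch with hypothetical names:
{{{
// Hypothetical sketch: a priority strategy reorders runnable actions.
object PrioritySketch {
  type PriorityStrategy[A] = Seq[A] => Seq[A]

  val asInFlow: PriorityStrategy[String] = identity  // preserve definition order
  val alphabetical: PriorityStrategy[String] = _.sorted
}
}}}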
Created by Alexei Perelighin on 24/08/2018.
Defines factory functions for creating and running Waimak data flows.
Created by Alexei Perelighin on 2018/02/27
Created by Alexei Perelighin on 2018/01/11.