com.coxautodata.waimak.dataflow.spark
Actions to execute, these will be scheduled when inputs become available. Executed actions must be removed from the state.
Creates new state of the dataflow by adding an action to it.
- action to add
- new state with action
DataFlowException
when:
1) at least one of the input labels is not present in the inputs
2) at least one of the input labels is not present in the outputs of existing actions
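A minimal sketch of how adding actions composes, assuming the Waimak builder helpers `openCsv` and `alias` (illustrative here, not defined in this doc):

```scala
// Each action-adding call returns a new flow state; it throws a
// DataFlowException if an action's input labels are neither among the
// flow's inputs nor produced by previously added actions.
val flow = Waimak.sparkFlow(spark)
  .openCsv(basePath)("csv_1")          // produces the output label "csv_1"
  .alias("csv_1", "audit_table")       // consumes "csv_1", produces "audit_table"
// .alias("missing_label", "x")        // would throw a DataFlowException
```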
Creates a new state of the dataflow by adding an input. Duplicate labels are handled in prepareForExecution().
- name of the input
- values of the input
- new state with the input
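A hedged sketch of adding an input, assuming a SparkDataFlow and a hypothetical Dataset value `ds`:

```scala
// The input value is wrapped in an Option; the label "customers" then
// becomes available to all following actions. Duplicate labels are only
// checked later, in prepareForExecution().
val flow = Waimak.sparkFlow(spark)
  .addInput("customers", Some(ds))
```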
Creates a new state of the data flow by replacing the intercepted action with the action that intercepts it. When replacing an existing InterceptorAction, the action to replace will differ from the action intercepted inside the InterceptorAction.
Execute this flow using the current executor on the flow.
Creates a new state of the dataflow by removing the executed action from the actions list and adding its outputs to the inputs.
- the executed action
- outputs of the executed action
- next stage data flow without the executed action, but with its outputs as inputs
DataFlowException
if the number of provided outputs is not equal to the number of output labels of the action
Creates a code block with all actions inside of it being run on the specified execution pool. The same execution pool name can be used multiple times and nested pools are allowed; the name closest to the action will be assigned to it.
Ex:
flow.executionPool("pool_1") { _.addAction(a1)
  .addAction(a2)
  .executionPool("pool_2") { _.addAction(a3)
    .addAction(a4)
  }
  .addAction(a5)
}
So actions a1, a2, a5 will be in pool_1 and actions a3, a4 in pool_2.
pool name to assign to all actions inside of it; it can be overridden by nested execution pools.
Current DataFlowExecutor associated with this flow
A function called just after the flow is executed. By default, the implementation on DataFlow is a no-op; however, it is used in spark.SparkDataFlow to clean up the temporary directory.
Fold left over a collection, where the current DataFlow is the zero value. Lets you fold over a flow inline in the flow.
Collection to fold over
Function to apply during the flow
A DataFlow produced after repeated applications of f for each element in the collection
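The fold can be sketched as follows; `openCsv` and the table names are assumptions for illustration:

```scala
// Open one CSV per table name by folding the collection into the flow.
val tables = Seq("customers", "orders", "items")
val flow = Waimak.sparkFlow(spark)
  .foldLeftOver(tables) { (f, table) => f.openCsv(basePath)(table) }
// Equivalent to:
// tables.foldLeft(Waimak.sparkFlow(spark))((f, t) => f.openCsv(basePath)(t))
```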
Guids are unique; finds an action by its guid.
Output labels are unique. Finds the action that produces outputLabel.
Inputs that were explicitly set or produced by previous actions; these are inputs for all following actions. Inputs are preserved in the data flow state, even if they are no longer required by the remaining actions. TODO: explore the option of removing inputs that are no longer required by the remaining actions.
Flow DAG is valid iff:
1. All output labels and existing input labels are unique
2. Each action depends on labels that are produced by actions or already present in inputs
3. Active tags is empty
4. Active dependencies is zero
5. No cyclic dependencies in labels
6. No cyclic dependencies in tags
7. No cyclic dependencies in label/tag combination
Takes a value of type A and a msg to log, returning a and logging the message at the desired level
a
Takes a value of type A and a function message from A to String, logs the value of invoking message(a) at the level described by the level parameter
a
logAndReturn(1, (num: Int) => s"number: $num", Info) // In the log we would see a log corresponding to "number: 1"
Transforms the current dataflow by applying a function to it.
A function that transforms a dataflow object
New dataflow
Optionally transform a dataflow depending on the output of the applying function. If the transforming function returns a None then the original dataflow is returned.
A function that returns an Option[DataFlow]
DataFlow object that may have been transformed
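A sketch contrasting map and mapOption; the `show` helper and `debugEnabled` flag are assumptions:

```scala
// map applies the transformation unconditionally.
val mapped = flow.map(f => f.show("audit_table"))

// mapOption keeps the original flow when the function returns None.
val maybeMapped = flow.mapOption { f =>
  if (debugEnabled) Some(f.show("audit_table")) else None
}
```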
Returns actions that are ready to run:
1. have no input labels;
2. whose inputs have been created;
3. whose dependent tags have all been run;
4. belong to the available pool
will not include actions that are skipped.
set of execution pools for which to schedule actions
A function called just before the flow is executed. This function first calls any extension preparation steps and then checks the tagging state of the flow. It can be overloaded to add implementation-specific preparation steps, such as preparing an execution environment or cleaning temporary directories; an overloaded function should call this function first.
Generic method that can be used to add context and state to all actions inside the block.
function that adds attributes to the state
all actions inside of this flow will be associated with the mutated state
Execution of the flow is lazy, but registration of the datasets as SQL tables can only happen when the dataset is created. With multiple threads consuming the same table, registration of the dataset as an SQL table needs to happen in synchronised code.
Labels that need to be registered as temp spark views before the execution starts. This is necessary if they are to be reused by multiple parallel threads.
Tag all actions added during the taggedFlow lambda function with any given number of tags. These tags can then be used by the tagDependency() action to create a dependency in the running order of actions by tag.
Tags to apply to added actions
An intermediate flow that actions can be added to that will be marked with the tags
Mark all actions added during the tagDependentFlow lambda function as having a dependency on the tags provided. These actions will only be run once all tagged actions have finished.
Tags to create a dependency on
An intermediate flow that actions can be added to that will depend on the tagged actions having completed before running
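Tagging and tag dependencies can be sketched as below, assuming the `openCsv`, `writeParquet` and `openParquet` helpers (illustrative only):

```scala
// Actions added under tagDependency("writes") will only be scheduled once
// every action tagged "writes" has finished, even though no label connects them.
flow
  .tag("writes") {
    _.openCsv(basePath)("csv_1")
     .writeParquet(outPath)("csv_1")
  }
  .tagDependency("writes") {
    _.openParquet(outPath)("csv_1_report")
  }
```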
Folder into which the temp data will be saved before commit into the output storage: folders, RDBMSs, key-value tables.
Add, update or remove a metadata extension from the flow using the identifier argument to find an existing extension.
Type of the DataFlowMetadataExtension
Identifier of extension to update or remove
Function that manipulates the extension on the flow. Input will be None if no existing extension with matching identifier exists on the flow. Return None to remove an existing extension with matching identifier from the flow.
Add a new executor to this flow, replacing the existing one
DataFlowExecutor to add to this flow
Introduces a Spark session into the data flows