io.smartdatalake.workflow.action
Adds an action event
applies additionalColumns
applies type casting decimal -> integral/float
applies custom transformation
applies filterClauseExpr
applies all the transformations above
Stop propagating the input DataFrame through the action and instead get a new DataFrame from the DataObject. This can help save memory and improve performance if the input DataFrame includes many transformations from previous Actions. The new DataFrame will be initialized according to the SubFeed's partitionValues.
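In plain Spark terms, breaking the lineage amounts to continuing with a DataFrame read back from storage instead of the accumulated query plan. A minimal sketch, with an illustrative path and format:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("lineageDemo").getOrCreate()
import spark.implicits._

// A DataFrame carrying many transformations from previous Actions:
val transformedDf = (1 to 10).toDF("id").withColumn("doubled", $"id" * 2)

// After the data has been written to the DataObject's storage, the next
// Action continues with a freshly read DataFrame whose plan starts at the
// storage location instead of the whole upstream transformation chain.
transformedDf.write.mode("overwrite").parquet("/tmp/stage/myDataObject")
val freshDf = spark.read.parquet("/tmp/stage/myDataObject")
```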
Runtime metrics
Note: runtime metrics are disabled by default, because they are only collected when Actions are run from an ActionDAG, which is not the case in tests or other use cases. If enabled, an exception is thrown when metrics are not found.
Enriches a SparkSubFeed with its DataFrame if not yet existing
input data object.
input SubFeed.
Action.exec implementation
SparkSubFeeds to be processed
processed SparkSubFeeds
optional execution mode for this Action
Returns the factory that can parse this type (that is, type CO).
Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.
the factory (object) for this class.
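Schematically, the pattern looks like the following sketch. The local FromConfigFactory trait is simplified here to keep the example self-contained; the library's actual trait takes additional parameters (e.g. an instance registry):

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Simplified stand-in for the library's FromConfigFactory trait:
trait FromConfigFactory[CO] {
  def fromConfig(config: Config): CO
}

case class MyAction(name: String) {
  // Returns the factory that can parse this type: the companion object.
  def factory: FromConfigFactory[MyAction] = MyAction
}

// The companion object implements FromConfigFactory:
object MyAction extends FromConfigFactory[MyAction] {
  override def fromConfig(config: Config): MyAction = MyAction(config.getString("name"))
}

// Usage: parse an instance from a HOCON fragment.
val parsed = MyAction.fromConfig(ConfigFactory.parseString("name = demo"))
```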
Filter DataFrame with given partition values
DataFrame to filter
partition values to use as filter condition
filter expression to apply
filtered DataFrame
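A minimal sketch of how such a filter can be assembled with plain Spark expressions, assuming partition values arrive as key/value maps (names are illustrative, not the library's actual signature):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}

// Each map is one partition-value combination; combinations are OR-ed,
// the key/value pairs within one combination are AND-ed.
def filterByPartitionValues(df: DataFrame, partitionValues: Seq[Map[String, String]]): DataFrame = {
  if (partitionValues.isEmpty) df
  else {
    val filterExpr: Column = partitionValues
      .map(pv => pv.map { case (k, v) => col(k) === lit(v) }.reduce(_ and _))
      .reduce(_ or _)
    df.where(filterExpr)
  }
}
```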
get latest runtime state
get latest runtime information for this action
A unique identifier for this instance.
Generic init implementation for Action.init
SparkSubFeeds to be processed
processed SparkSubFeeds
Input DataObjects. To be implemented by subclasses.
optional selection of main inputId used for execution mode and partition values propagation. Only needed if there are multiple input DataObjects.
optional selection of main outputId used for execution mode and partition values propagation. Only needed if there are multiple output DataObjects.
Additional metadata for the Action
optional Spark SQL expression evaluated as a where-clause against the DataFrame of metrics. Available columns are dataObjectId, key and value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
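For illustration, the check behaves like a plain where-clause on a small metrics DataFrame; the metric key records_written below is an assumption, actual keys depend on the DataObject type:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("metricsCheck").getOrCreate()
import spark.implicits._

// One row per collected metric, as exposed to the where-clause:
val metrics = Seq(
  ("myOutput", "records_written", "0"),   // metric key is illustrative
  ("myOutput", "no_data", "true")
).toDF("dataObjectId", "key", "value")

// A metricsFailCondition like this would raise MetricCheckFailed,
// because at least one row passes the where-clause:
val failCondition = "dataObjectId = 'myOutput' and key = 'records_written' and value = '0'"
assert(metrics.where(failCondition).count() > 0)
```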
applies multiple transformations to a SubFeed
provide an implementation of the DAG node id
Output DataObjects. To be implemented by subclasses.
Force persisting input DataFrames on disk. This improves performance if a DataFrame is used multiple times in the transformation and can serve as a recovery point in case a task gets lost. Note that DataFrames are persisted automatically by the previous Action if later Actions need the same data. To avoid this behaviour set breakDataFrameLineage=false.
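In plain Spark terms, the effect of this flag corresponds to the following sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.master("local[*]").appName("persistDemo").getOrCreate()
import spark.implicits._

val inputDf = (1 to 100).toDF("id")

// Materialize the input once on disk, so that using it several times in
// the transformation, or recovering a lost task, does not recompute the
// whole upstream lineage.
val persisted = inputDf.persist(StorageLevel.DISK_ONLY)
persisted.count() // forces materialization
```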
Executes operations needed after executing an action. In this step any tasks on input or output DataObjects needed after the main task are executed, e.g. a JdbcTableDataObject's postWriteSql or a CopyAction's deleteInputData.
Executes operations needed before executing an action. In this step any tasks on input or output DataObjects needed before the main task are executed, e.g. a JdbcTableDataObject's preWriteSql.
Prepare DataObjects prerequisites. In this step preconditions are prepared and tested: connections can be created, and needed structures exist, e.g. a Kafka topic or JDBC table.
This runs during the "prepare" phase of the DAG.
Applies changes to a SubFeed from a previous action in order to be used as input for this action's transformation.
outputs of the action that are used as inputs in the same action
Recursive inputs are DataObjects that are used as output and input in the same action. This is usually prohibited as it creates loops in the DAG. In special cases this makes sense, e.g. when building complex delta logic.
Resets the runtime state of this Action. This is mainly used for testing.
Sets the Spark job description for better traceability in the Spark UI
Note: This sets Spark local properties, which are propagated to the respective executor tasks. We rely on this to match metrics back to Actions and DataObjects. As writing to a DataObject on the Driver happens uninterrupted in the same exclusive thread, this is suitable.
phase description (be short...)
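In plain Spark terms, the mechanism corresponds to setting local properties on the SparkContext from the driver thread; the naming scheme below is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("jobDescDemo").getOrCreate()
val sc = spark.sparkContext

// Job group and description are Spark local properties: they travel with
// every task submitted from this thread and appear in the Spark UI, which
// allows matching task metrics back to the Action that triggered them.
sc.setJobGroup("Action~myAction", "exec")   // naming scheme is illustrative
sc.setJobDescription("exec myAction: write to myOutput")
```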
Transform the DataFrame of a SubFeed
This is displayed in the ASCII graph visualization
Transform SparkSubFeeds. To be implemented by subclasses.
SparkSubFeeds to be transformed
SparkSubFeeds to be enriched with transformed result
transformed SparkSubFeeds
Transform partition values
custom transformation to apply to multiple DataFrames
The transformed DataFrame is validated to include the output's partition columns; partition columns are moved to the end and the SubFeed's partition values are updated.
output DataObject
SubFeed with transformed DataFrame
validated and updated SubFeed
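A sketch of the reordering part, assuming partition columns are known from the output DataObject (names are illustrative):

```scala
import org.apache.spark.sql.DataFrame

// Move the output's partition columns to the end of the column list,
// keeping the relative order of all other columns.
def movePartitionColsLast(df: DataFrame, partitionCols: Seq[String]): DataFrame = {
  val otherCols = df.columns.filterNot(partitionCols.contains)
  df.select((otherCols ++ partitionCols).map(df.col): _*)
}
```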
Validate that a DataFrame contains a given list of columns, throwing an exception otherwise.
DataFrame to validate
Columns that must exist in DataFrame
name to mention in exception
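A minimal sketch of such a check (the method and parameter names are illustrative):

```scala
import org.apache.spark.sql.DataFrame

def validateDataFrameContainsCols(df: DataFrame, columns: Seq[String], debugName: String): Unit = {
  val missing = columns.diff(df.columns)
  if (missing.nonEmpty)
    throw new IllegalArgumentException(s"DataFrame $debugName is missing columns: ${missing.mkString(", ")}")
}
```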
writes the SubFeed to the output respecting the given execution mode
true if no data was transferred, otherwise false
Action to transform data according to a custom transformer. Allows transforming multiple input and output DataFrames.
input DataObjects
output DataObjects
custom transformation to apply to multiple DataFrames (see the sketch below)
optional selection of main inputId used for execution mode and partition values propagation. Only needed if there are multiple input DataObjects.
optional selection of main outputId used for execution mode and partition values propagation. Only needed if there are multiple output DataObjects.
optional execution mode for this Action
optional Spark SQL expression evaluated as a where-clause against the DataFrame of metrics. Available columns are dataObjectId, key and value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
outputs of the action that are used as inputs in the same action
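As an illustration of such a transformer, the following sketch joins two inputs into one output. The trait is declared locally to keep the example self-contained and mimics a CustomDfsTransformer-style interface in which DataFrames are passed and returned as maps keyed by DataObject id; the library's actual trait and registration mechanism may differ:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Local stand-in mimicking the expected multi-DataFrame transformer interface:
trait CustomDfsTransformer {
  def transform(session: SparkSession, options: Map[String, String],
                dfs: Map[String, DataFrame]): Map[String, DataFrame]
}

class JoinOrdersCustomersTransformer extends CustomDfsTransformer {
  override def transform(session: SparkSession, options: Map[String, String],
                         dfs: Map[String, DataFrame]): Map[String, DataFrame] = {
    // DataObject ids ("orders", "customers", "ordersEnriched") are illustrative
    val joined = dfs("orders").join(dfs("customers"), Seq("customerId"))
      .where(col("status") === "open")
    Map("ordersEnriched" -> joined)
  }
}
```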