Class/Object

io.smartdatalake.workflow.action

CustomSparkAction


case class CustomSparkAction(id: ActionId, inputIds: Seq[DataObjectId], outputIds: Seq[DataObjectId], transformer: CustomDfsTransformerConfig, breakDataFrameLineage: Boolean = false, persist: Boolean = false, mainInputId: Option[DataObjectId] = None, mainOutputId: Option[DataObjectId] = None, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None, recursiveInputIds: Seq[DataObjectId] = Seq(), inputIdsToIgnoreFilter: Seq[DataObjectId] = Seq())(implicit instanceRegistry: InstanceRegistry) extends SparkSubFeedsAction with Product with Serializable

Action to transform data according to a custom transformer. Supports transforming multiple input and output DataFrames; an illustrative transformer sketch follows the parameter list below.

inputIds

input DataObjects

outputIds

output DataObjects

transformer

custom transformation to apply to multiple DataFrames

mainInputId

optional selection of the main inputId used for execution mode and partition values propagation. Only needed if there are multiple input DataObjects.

mainOutputId

optional selection of the main outputId used for execution mode and partition values propagation. Only needed if there are multiple output DataObjects.

executionMode

optional execution mode for this Action

executionCondition

optional Spark SQL expression evaluated against SubFeedsExpressionData. If true, the Action is executed; otherwise it is skipped. See Condition for details.

metricsFailCondition

optional Spark SQL expression evaluated as a where-clause against the DataFrame of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.

recursiveInputIds

outputs of this action that are used as inputs in the same action

inputIdsToIgnoreFilter

optional list of input ids for which filters (partition values & filter clause) are ignored
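
The transformer parameter references user code implementing SDL's multi-DataFrame transformer interface. Below is a minimal sketch of such a transformer, assuming the CustomDfsTransformer trait in io.smartdatalake.workflow.action.customlogic with a transform(session, options, dfs) method; the class name, DataObject ids and the "joinCol" option key are illustrative assumptions, and the exact trait signature should be checked against the SDL version in use.

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import io.smartdatalake.workflow.action.customlogic.CustomDfsTransformer

  // Hypothetical transformer joining two input DataObjects into one output.
  // DataObject ids ("orders", "customers", "orders-enriched") are examples only.
  class EnrichOrdersTransformer extends CustomDfsTransformer {
    override def transform(session: SparkSession,
                           options: Map[String, String],
                           dfs: Map[String, DataFrame]): Map[String, DataFrame] = {
      val orders = dfs("orders")        // input DataFrames are keyed by DataObject id
      val customers = dfs("customers")
      val joinCol = options.getOrElse("joinCol", "customerId") // join key passed via options
      // return one DataFrame per output DataObject id
      Map("orders-enriched" -> orders.join(customers, Seq(joinCol), "left"))
    }
  }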

Linear Supertypes
Serializable, Serializable, Product, Equals, SparkSubFeedsAction, SparkAction, Action, AtlasExportable, SmartDataLakeLogger, DAGNode, ParsableFromConfig[Action], SdlConfigObject, AnyRef, Any

Instance Constructors

  1. new CustomSparkAction(id: ActionId, inputIds: Seq[DataObjectId], outputIds: Seq[DataObjectId], transformer: CustomDfsTransformerConfig, breakDataFrameLineage: Boolean = false, persist: Boolean = false, mainInputId: Option[DataObjectId] = None, mainOutputId: Option[DataObjectId] = None, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None, recursiveInputIds: Seq[DataObjectId] = Seq(), inputIdsToIgnoreFilter: Seq[DataObjectId] = Seq())(implicit instanceRegistry: InstanceRegistry)

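    For orientation, a hedged sketch of constructing this Action programmatically. Actions are normally declared in the SDL configuration and parsed via the companion factory (see ParsableFromConfig); the import paths for ActionId/DataObjectId, the no-arg InstanceRegistry constructor and the className field of CustomDfsTransformerConfig are assumptions about this SDL version, not confirmed by this page.

      import io.smartdatalake.config.InstanceRegistry
      import io.smartdatalake.config.SdlConfigObject.{ActionId, DataObjectId}
      import io.smartdatalake.workflow.action.CustomSparkAction
      import io.smartdatalake.workflow.action.customlogic.CustomDfsTransformerConfig

      // Hypothetical wiring: the ids must refer to DataObjects already registered
      // in the InstanceRegistry; only the required parameters are set here.
      implicit val registry: InstanceRegistry = new InstanceRegistry()
      val action = CustomSparkAction(
        id = ActionId("enrich-orders"),
        inputIds = Seq(DataObjectId("orders"), DataObjectId("customers")),
        outputIds = Seq(DataObjectId("orders-enriched")),
        transformer = CustomDfsTransformerConfig(className = Some("com.example.EnrichOrdersTransformer"))
      )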

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def addRuntimeEvent(phase: ExecutionPhase, state: RuntimeEventState, msg: Option[String] = None, results: Seq[SubFeed] = Seq()): Unit

    Adds an action event

    Definition Classes
    Action
  5. def applyAdditionalColumns(additionalColumns: Map[String, String], partitionValues: Seq[PartitionValues])(df: DataFrame)(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    applies additionalColumns

    Definition Classes
    SparkAction
  6. def applyCastDecimal2IntegralFloat(df: DataFrame): DataFrame

    applies type casting decimal -> integral/float

    Definition Classes
    SparkAction
  7. def applyCustomTransformation(transformer: CustomDfTransformerConfig, subFeed: SparkSubFeed)(df: DataFrame)(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    apply custom transformation

    Definition Classes
    SparkAction
  8. def applyFilter(filterClauseExpr: Column)(df: DataFrame): DataFrame

    applies filterClauseExpr

    Definition Classes
    SparkAction
  9. def applyTransformations(inputSubFeed: SparkSubFeed, transformation: Option[CustomDfTransformerConfig], columnBlacklist: Option[Seq[String]], columnWhitelist: Option[Seq[String]], additionalColumns: Option[Map[String, String]], standardizeDatatypes: Boolean, additionalTransformers: Seq[(DataFrame) ⇒ DataFrame], filterClauseExpr: Option[Column] = None)(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    applies all the transformations above

    Definition Classes
    SparkAction
  10. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  11. def atlasName: String

    Definition Classes
    Action → AtlasExportable
  12. def atlasQualifiedName(prefix: String): String

    Definition Classes
    AtlasExportable
  13. val breakDataFrameLineage: Boolean

    Stop propagating input DataFrame through action and instead get a new DataFrame from DataObject. This can help to save memory and performance if the input DataFrame includes many transformations from previous Actions. The new DataFrame will be initialized according to the SubFeed's partitionValues.

    Definition Classes
    CustomSparkAction → SparkAction
  14. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  15. def createEmptyDataFrame(dataObject: DataObject with CanCreateDataFrame, subFeed: SparkSubFeed)(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    Definition Classes
    SparkAction
  16. def enableRuntimeMetrics(): Unit

    Runtime metrics

    Note: runtime metrics are disabled by default, because they are only collected when running Actions from an ActionDAG. This is not the case for tests or other use cases. If enabled, exceptions are thrown if metrics are not found.

    Definition Classes
    Action
  17. def enrichSubFeedDataFrame(input: DataObject with CanCreateDataFrame, subFeed: SparkSubFeed, phase: ExecutionPhase)(implicit session: SparkSession, context: ActionPipelineContext): SparkSubFeed

    Enriches SparkSubFeed with DataFrame if not existing

    input

    input data object.

    subFeed

    input SubFeed.

    Definition Classes
    SparkAction
  18. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. final def exec(subFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Seq[SubFeed]

    Action.exec implementation

    subFeeds

    SparkSubFeeds to be processed

    returns

    processed SparkSubFeeds

    Definition Classes
    SparkSubFeedsAction → Action
  20. val executionCondition: Option[Condition]

    optional Spark SQL expression evaluated against SubFeedsExpressionData. If true, the Action is executed; otherwise it is skipped. See Condition for details.

    Definition Classes
    CustomSparkAction → Action
  21. var executionConditionResult: (Boolean, Option[String])

    Attributes
    protected
    Definition Classes
    Action
  22. val executionMode: Option[ExecutionMode]

    optional execution mode for this Action

    Definition Classes
    CustomSparkAction → Action
  23. var executionModeResult: Try[Option[ExecutionModeResult]]

    Attributes
    protected
    Definition Classes
    Action
  24. def factory: FromConfigFactory[Action]

    Returns the factory that can parse this type (that is, type CO).

    Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.

    returns

    the factory (object) for this class.

    Definition Classes
    CustomSparkAction → ParsableFromConfig
  25. def filterDataFrame(df: DataFrame, partitionValues: Seq[PartitionValues], genericFilter: Option[Column]): DataFrame

    Filter DataFrame with given partition values

    df

    DataFrame to filter

    partitionValues

    partition values to use as filter condition

    genericFilter

    filter expression to apply

    returns

    filtered DataFrame

    Definition Classes
    SparkAction
  26. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  27. def getAllLatestMetrics: Map[DataObjectId, Option[ActionMetrics]]

    Definition Classes
    Action
  28. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  29. def getFinalMetrics(dataObjectId: DataObjectId): Option[ActionMetrics]

    Definition Classes
    Action
  30. def getInputDataObject[T <: DataObject](id: DataObjectId)(implicit arg0: ClassTag[T], arg1: scala.reflect.api.JavaUniverse.TypeTag[T], registry: InstanceRegistry): T

    Attributes
    protected
    Definition Classes
    Action
  31. def getLatestMetrics(dataObjectId: DataObjectId): Option[ActionMetrics]

    Definition Classes
    Action
  32. def getLatestRuntimeState: Option[RuntimeEventState]

    get latest runtime state

    Definition Classes
    Action
  33. def getMainInput(inputSubFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): DataObject

    Definition Classes
    SparkSubFeedsAction
  34. def getOutputDataObject[T <: DataObject](id: DataObjectId)(implicit arg0: ClassTag[T], arg1: scala.reflect.api.JavaUniverse.TypeTag[T], registry: InstanceRegistry): T

    Attributes
    protected
    Definition Classes
    Action
  35. def getRuntimeInfo: Option[RuntimeInfo]

    get latest runtime information for this action

    Definition Classes
    Action
  36. val id: ActionId

    A unique identifier for this instance.

    Definition Classes
    CustomSparkAction → Action → SdlConfigObject
  37. final def init(subFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Seq[SubFeed]

    Generic init implementation for Action.init

    subFeeds

    SparkSubFeeds to be processed

    returns

    processed SparkSubFeeds

    Definition Classes
    SparkSubFeedsAction → Action
  38. val inputIds: Seq[DataObjectId]

    input DataObjects

  39. val inputIdsToIgnoreFilter: Seq[DataObjectId]

    optional list of input ids for which filters (partition values & filter clause) are ignored

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction
  40. val inputs: Seq[DataObject with CanCreateDataFrame]

    Input DataObjects. To be implemented by subclasses.

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction → Action
  41. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  42. def logWritingFinished(subFeed: SparkSubFeed, noData: Boolean, duration: Duration)(implicit session: SparkSession): Unit

    Definition Classes
    SparkAction
  43. def logWritingStarted(subFeed: SparkSubFeed)(implicit session: SparkSession): Unit

    Definition Classes
    SparkAction
  44. lazy val logger: Logger

    Attributes
    protected
    Definition Classes
    SmartDataLakeLogger
  45. val mainInputId: Option[DataObjectId]

    optional selection of the main inputId used for execution mode and partition values propagation. Only needed if there are multiple input DataObjects.

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction
  46. lazy val mainOutput: DataObject with CanWriteDataFrame

    Definition Classes
    SparkSubFeedsAction
  47. val mainOutputId: Option[DataObjectId]

    optional selection of the main outputId used for execution mode and partition values propagation. Only needed if there are multiple output DataObjects.

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction
  48. val metadata: Option[ActionMetadata]

    Additional metadata for the Action

    Definition Classes
    CustomSparkAction → Action
  49. val metricsFailCondition: Option[String]

    optional Spark SQL expression evaluated as a where-clause against the DataFrame of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.


    Definition Classes
    CustomSparkAction → Action
  50. def multiTransformDataFrame(inputDf: DataFrame, transformers: Seq[(DataFrame) ⇒ DataFrame]): DataFrame

    applies multiple transformations to a SubFeed

    Definition Classes
    SparkAction
  51. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  52. def nodeId: String

    provide an implementation of the DAG node id

    Definition Classes
    Action → DAGNode
  53. final def notify(): Unit

    Definition Classes
    AnyRef
  54. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  55. def onRuntimeMetrics(dataObjectId: Option[DataObjectId], metrics: ActionMetrics): Unit

    Definition Classes
    Action
  56. val outputIds: Seq[DataObjectId]

    output DataObjects

  57. val outputs: Seq[DataObject with CanWriteDataFrame]

    Output DataObjects. To be implemented by subclasses.

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction → Action
  58. val persist: Boolean

    Force persisting input DataFrames on disk. This improves performance if a DataFrame is used multiple times in the transformation and can serve as a recovery point in case a task gets lost. Note that DataFrames are persisted automatically by the previous Action if later Actions need the same data. To avoid this behaviour set breakDataFrameLineage=false.

    Definition Classes
    CustomSparkAction → SparkAction
  59. def postExec(inputSubFeeds: Seq[SubFeed], outputSubFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Executes operations needed after executing an action. In this step any task on Input- or Output-DataObjects needed after the main task is executed, e.g. a JdbcTableDataObject's postWriteSql or a CopyAction's deleteInputData.

    Definition Classes
    SparkSubFeedsAction → SparkAction → Action
  60. def preExec(subFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Executes operations needed before executing an action. In this step any phase on Input- or Output-DataObjects needed before the main task is executed, e.g. a JdbcTableDataObject's preWriteSql.

    Definition Classes
    Action
  61. def preInit(subFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Checks before initialization of the Action. In this step the execution condition is evaluated, and the Action's init is skipped if the result is false.

    Definition Classes
    Action
  62. def prepare(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Prepare DataObject prerequisites. In this step preconditions are prepared & tested: connections can be created, and needed structures exist, e.g. a Kafka topic or JDBC table.

    This runs during the "prepare" phase of the DAG.

    Definition Classes
    SparkSubFeedsAction → SparkAction → Action
  63. def prepareInputSubFeed(input: DataObject with CanCreateDataFrame, subFeed: SparkSubFeed, ignoreFilters: Boolean = false)(implicit session: SparkSession, context: ActionPipelineContext): SparkSubFeed

    Applies changes to a SubFeed from a previous action in order to be used as input for this action's transformation.

    Definition Classes
    SparkAction
  64. lazy val prioritizedMainInputCandidates: Seq[DataObject with CanCreateDataFrame]

    Definition Classes
    SparkSubFeedsAction
  65. val recursiveInputIds: Seq[DataObjectId]

    outputs of this action that are used as inputs in the same action

  66. val recursiveInputs: Seq[DataObject with CanCreateDataFrame]

    Recursive inputs are DataObjects that are used as output and input in the same action. This is usually prohibited as it creates loops in the DAG. In special cases this makes sense, e.g. when building a complex delta logic.

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction → Action
  67. def reset(): Unit

    Resets the runtime state of this Action. This is mainly used for testing.

    Definition Classes
    Action
  68. def setSparkJobMetadata(operation: Option[String] = None)(implicit session: SparkSession): Unit

    Sets the util job description for better traceability in the Spark UI

    Note: This sets Spark local properties, which are propagated to the respective executor tasks. We rely on this to match metrics back to Actions and DataObjects. As writing to a DataObject on the Driver happens uninterrupted in the same exclusive thread, this is suitable.

    operation

    phase description (be short...)

    Definition Classes
    Action
  69. def subFeedDfTransformer(fnTransform: (DataFrame) ⇒ DataFrame)(subFeed: SparkSubFeed): SparkSubFeed

    Transform the DataFrame of a SubFeed

    Definition Classes
    SparkAction
  70. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  71. final def toString(): String

    This is displayed in ascii graph visualization

    Definition Classes
    Action → AnyRef → Any
  72. def toStringMedium: String

    Definition Classes
    Action
  73. def toStringShort: String

    Definition Classes
    Action
  74. def transform(inputSubFeeds: Seq[SparkSubFeed], outputSubFeeds: Seq[SparkSubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Seq[SparkSubFeed]

    Transform SparkSubFeeds. To be implemented by subclasses.

    inputSubFeeds

    SparkSubFeeds to be transformed

    outputSubFeeds

    SparkSubFeeds to be enriched with transformed result

    returns

    transformed SparkSubFeeds

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction
  75. def transformPartitionValues(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Map[PartitionValues, PartitionValues]

    Transform partition values

    Definition Classes
    CustomSparkAction → SparkSubFeedsAction
  76. val transformer: CustomDfsTransformerConfig

    custom transformation to apply to multiple DataFrames

  77. def validateAndUpdateSubFeed(output: DataObject, subFeed: SparkSubFeed)(implicit session: SparkSession, context: ActionPipelineContext): SparkSubFeed

    The transformed DataFrame is validated to have the output's partition columns included; partition columns are moved to the end and the SubFeed's partition values are updated.

    output

    output DataObject

    subFeed

    SubFeed with transformed DataFrame

    returns

    validated and updated SubFeed

    Definition Classes
    SparkAction
  78. def validateDataFrameContainsCols(df: DataFrame, columns: Seq[String], debugName: String): Unit

    Validate that DataFrame contains a given list of columns, throwing an exception otherwise.

    df

    DataFrame to validate

    columns

    Columns that must exist in DataFrame

    debugName

    name to mention in exception

    Definition Classes
    SparkAction
  79. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  80. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  81. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  82. def writeSubFeed(subFeed: SparkSubFeed, output: DataObject with CanWriteDataFrame, isRecursiveInput: Boolean = false)(implicit session: SparkSession, context: ActionPipelineContext): Boolean

    writes subfeed to output respecting given execution mode

    returns

    true if no data was transferred, otherwise false

    Definition Classes
    SparkAction
