Class/Object

io.smartdatalake.workflow.action

DeduplicateAction

Related Docs: object DeduplicateAction | package action

Permalink

case class DeduplicateAction(id: ActionObjectId, inputId: DataObjectId, outputId: DataObjectId, transformer: Option[CustomDfTransformerConfig] = None, columnBlacklist: Option[Seq[String]] = None, columnWhitelist: Option[Seq[String]] = None, filterClause: Option[String] = None, standardizeDatatypes: Boolean = false, ignoreOldDeletedColumns: Boolean = false, ignoreOldDeletedNestedColumns: Boolean = true, breakDataFrameLineage: Boolean = false, persist: Boolean = false, initExecutionMode: Option[ExecutionMode] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkSubFeedAction with Product with Serializable

Action to deduplicate a subfeed. Deduplication keeps the last record for every key, also after it has been deleted in the source. It needs a transactional table as output with defined primary keys.

inputId

inputs DataObject

outputId

output DataObject

ignoreOldDeletedColumns

if true, remove no longer existing columns in Schema Evolution

ignoreOldDeletedNestedColumns

if true, remove no longer existing columns from nested data types in Schema Evolution. Keeping deleted columns in complex data types has performance impact as all new data in the future has to be converted by a complex function.

initExecutionMode

optional execution mode if this Action is a start node of a DAG run

Linear Supertypes
Serializable, Serializable, Product, Equals, SparkSubFeedAction, Action, SmartDataLakeLogger, DAGNode, ParsableFromConfig[Action], SdlConfigObject, AnyRef, Any
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. DeduplicateAction
  2. Serializable
  3. Serializable
  4. Product
  5. Equals
  6. SparkSubFeedAction
  7. Action
  8. SmartDataLakeLogger
  9. DAGNode
  10. ParsableFromConfig
  11. SdlConfigObject
  12. AnyRef
  13. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. All

Instance Constructors

  1. new DeduplicateAction(id: ActionObjectId, inputId: DataObjectId, outputId: DataObjectId, transformer: Option[CustomDfTransformerConfig] = None, columnBlacklist: Option[Seq[String]] = None, columnWhitelist: Option[Seq[String]] = None, filterClause: Option[String] = None, standardizeDatatypes: Boolean = false, ignoreOldDeletedColumns: Boolean = false, ignoreOldDeletedNestedColumns: Boolean = true, breakDataFrameLineage: Boolean = false, persist: Boolean = false, initExecutionMode: Option[ExecutionMode] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry)

    Permalink

    inputId

    inputs DataObject

    outputId

    output DataObject

    ignoreOldDeletedColumns

    if true, remove no longer existing columns in Schema Evolution

    ignoreOldDeletedNestedColumns

    if true, remove no longer existing columns from nested data types in Schema Evolution. Keeping deleted columns in complex data types has performance impact as all new data in the future has to be converted by a complex function.

    initExecutionMode

    optional execution mode if this Action is a start node of a DAG run

Value Members

  1. final def !=(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  4. def addRuntimeEvent(phase: String, state: RuntimeEventState, msg: String): Unit

    Permalink

    Adds an action event

    Adds an action event

    Definition Classes
    Action
  5. final def asInstanceOf[T0]: T0

    Permalink
    Definition Classes
    Any
  6. val breakDataFrameLineage: Boolean

    Permalink

    Stop propagating input DataFrame through action and instead get a new DataFrame from DataObject.

    Stop propagating input DataFrame through action and instead get a new DataFrame from DataObject. This can help to save memory and performance if the input DataFrame includes many transformations from previous Actions. The new DataFrame will be initialized according to the SubFeed's partitionValues.

    Definition Classes
    DeduplicateActionSparkSubFeedAction
  7. def clone(): AnyRef

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. val columnBlacklist: Option[Seq[String]]

    Permalink
  9. val columnWhitelist: Option[Seq[String]]

    Permalink
  10. def deduplicate(baseDf: DataFrame, newDf: DataFrame, keyColumns: Seq[String])(implicit session: SparkSession): DataFrame

    Permalink

    deduplicate -> keep latest record per key

    deduplicate -> keep latest record per key

    baseDf

    existing data

    newDf

    new data

    returns

    deduplicated data

  11. final def eq(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  12. final def exec(subFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Seq[SubFeed]

    Permalink

    Action.exec implementation

    Action.exec implementation

    subFeeds

    SparkSubFeed's to be processed

    returns

    processed SparkSubFeed's

    Definition Classes
    SparkSubFeedAction → Action
  13. def factory: FromConfigFactory[Action]

    Permalink

    Returns the factory that can parse this type (that is, type CO).

    Returns the factory that can parse this type (that is, type CO).

    Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.

    returns

    the factory (object) for this class.

    Definition Classes
    DeduplicateAction → ParsableFromConfig
  14. val filterClause: Option[String]

    Permalink
  15. def finalize(): Unit

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  16. final def getClass(): Class[_]

    Permalink
    Definition Classes
    AnyRef → Any
  17. def getInputDataObject[T <: DataObject](id: DataObjectId)(implicit arg0: ClassTag[T], arg1: scala.reflect.api.JavaUniverse.TypeTag[T], registry: InstanceRegistry): T

    Permalink
    Attributes
    protected
    Definition Classes
    Action
  18. def getOutputDataObject[T <: DataObject](id: DataObjectId)(implicit arg0: ClassTag[T], arg1: scala.reflect.api.JavaUniverse.TypeTag[T], registry: InstanceRegistry): T

    Permalink
    Attributes
    protected
    Definition Classes
    Action
  19. def getRuntimeState: Option[String]

    Permalink

    Definition Classes
    Action
  20. val id: ActionObjectId

    Permalink

    A unique identifier for this instance.

    A unique identifier for this instance.

    Definition Classes
    DeduplicateAction → Action → SdlConfigObject
  21. val ignoreOldDeletedColumns: Boolean

    Permalink

    if true, remove no longer existing columns in Schema Evolution

  22. val ignoreOldDeletedNestedColumns: Boolean

    Permalink

    if true, remove no longer existing columns from nested data types in Schema Evolution.

    if true, remove no longer existing columns from nested data types in Schema Evolution. Keeping deleted columns in complex data types has performance impact as all new data in the future has to be converted by a complex function.

  23. final def init(subFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Seq[SubFeed]

    Permalink

    Action.init implementation

    Action.init implementation

    subFeeds

    SparkSubFeed's to be processed

    returns

    processed SparkSubFeed's

    Definition Classes
    SparkSubFeedAction → Action
  24. val initExecutionMode: Option[ExecutionMode]

    Permalink

    optional execution mode if this Action is a start node of a DAG run

    optional execution mode if this Action is a start node of a DAG run

    Definition Classes
    DeduplicateActionSparkSubFeedAction
  25. val input: DataObject with CanCreateDataFrame

    Permalink

    Input DataObject which can CanCreateDataFrame

    Input DataObject which can CanCreateDataFrame

    Definition Classes
    DeduplicateActionSparkSubFeedAction
  26. val inputId: DataObjectId

    Permalink

    inputs DataObject

  27. val inputs: Seq[DataObject with CanCreateDataFrame]

    Permalink

    Input DataObjects To be implemented by subclasses

    Input DataObjects To be implemented by subclasses

    Definition Classes
    DeduplicateAction → Action
  28. final def isInstanceOf[T0]: Boolean

    Permalink
    Definition Classes
    Any
  29. lazy val logger: Logger

    Permalink
    Attributes
    protected
    Definition Classes
    SmartDataLakeLogger
  30. val metadata: Option[ActionMetadata]

    Permalink

    Additional metadata for the Action

    Additional metadata for the Action

    Definition Classes
    DeduplicateAction → Action
  31. final def ne(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  32. def nodeId: String

    Permalink

    provide an implementation of the DAG node id

    provide an implementation of the DAG node id

    Definition Classes
    Action → DAGNode
  33. final def notify(): Unit

    Permalink
    Definition Classes
    AnyRef
  34. final def notifyAll(): Unit

    Permalink
    Definition Classes
    AnyRef
  35. val output: TransactionalSparkTableDataObject

    Permalink

    Output DataObject which can CanWriteDataFrame

    Output DataObject which can CanWriteDataFrame

    Definition Classes
    DeduplicateActionSparkSubFeedAction
  36. val outputId: DataObjectId

    Permalink

    output DataObject

  37. val outputs: Seq[TransactionalSparkTableDataObject]

    Permalink

    Output DataObjects To be implemented by subclasses

    Output DataObjects To be implemented by subclasses

    Definition Classes
    DeduplicateAction → Action
  38. val persist: Boolean

    Permalink

    Force persisting DataFrame on Disk.

    Force persisting DataFrame on Disk. This helps to reduce memory needed for caching the DataFrame content and can serve as a recovery point in case an task get's lost.

    Definition Classes
    DeduplicateActionSparkSubFeedAction
  39. final def postExec(inputSubFeeds: Seq[SubFeed], outputSubFeeds: Seq[SubFeed])(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Permalink

    Executes operations needed after executing an action.

    Executes operations needed after executing an action. In this step any operation on Input- or Output-DataObjects needed after the main task is executed, e.g. JdbcTableDataObjects postSql or CopyActions deleteInputData.

    Definition Classes
    SparkSubFeedAction → Action
  40. def postExecSubFeed(inputSubFeed: SubFeed, outputSubFeed: SubFeed)(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Permalink
    Definition Classes
    SparkSubFeedAction
  41. def preExec(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Permalink

    Executes operations needed before executing an action.

    Executes operations needed before executing an action. In this step any operation on Input- or Output-DataObjects needed before the main task is executed, e.g. JdbcTableDataObjects preSql

    Definition Classes
    Action
  42. def prepare(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Permalink

    Prepare DataObjects prerequisites.

    Prepare DataObjects prerequisites. In this step preconditions are prepared & tested: - directories exists or can be created - connections can be created

    This runs during the "prepare" operation of the DAG.

    Definition Classes
    Action
  43. def setSparkJobDescription(operation: String)(implicit session: SparkSession): Unit

    Permalink

    Sets the util job description for better traceability in the Spark UI

    Sets the util job description for better traceability in the Spark UI

    operation

    operation description (be short...)

    session

    util session

    Definition Classes
    Action
  44. val standardizeDatatypes: Boolean

    Permalink
  45. final def synchronized[T0](arg0: ⇒ T0): T0

    Permalink
    Definition Classes
    AnyRef
  46. final def toString(): String

    Permalink

    This is displayed in ascii graph visualization

    This is displayed in ascii graph visualization

    Definition Classes
    Action → AnyRef → Any
  47. def toStringMedium: String

    Permalink
    Definition Classes
    Action
  48. def toStringShort: String

    Permalink
    Definition Classes
    Action
  49. def transform(subFeed: SparkSubFeed)(implicit session: SparkSession, context: ActionPipelineContext): SparkSubFeed

    Permalink

    Transform a SparkSubFeed.

    Transform a SparkSubFeed. To be implemented by subclasses.

    subFeed

    SparkSubFeed to be transformed

    returns

    transformed SparkSubFeed

    Definition Classes
    DeduplicateActionSparkSubFeedAction
  50. val transformer: Option[CustomDfTransformerConfig]

    Permalink
  51. object udfs extends Serializable

    Permalink
  52. final def wait(): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  53. final def wait(arg0: Long, arg1: Int): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  54. final def wait(arg0: Long): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Inherited from Serializable

Inherited from Serializable

Inherited from Product

Inherited from Equals

Inherited from SparkSubFeedAction

Inherited from Action

Inherited from SmartDataLakeLogger

Inherited from DAGNode

Inherited from ParsableFromConfig[Action]

Inherited from SdlConfigObject

Inherited from AnyRef

Inherited from Any

Ungrouped