Class/Object

com.coxautodata.waimak.dataflow.spark

ParquetDataCommitter

case class ParquetDataCommitter(outputBaseFolder: String, snapshotFolder: Option[String] = None, cleanupStrategy: Option[CleanUpStrategy[FileStatus]] = None, hadoopDBConnector: Option[HadoopDBConnector] = None) extends DataCommitter[SparkDataFlow] with Logging with Product with Serializable

Adds the actions necessary to commit labels as Parquet; supports snapshot folders and interaction with a DB connector.

Created by Alexei Perelighin on 2018/11/05

outputBaseFolder

folder under which each committed label will store its data. Ex: baseFolder/label_1/

snapshotFolder

optional name of the snapshot folder that will be used by all labels committed via this committer. It must be a full folder name and must not be the same as any previous snapshot of any of the commit-managed labels. Ex: baseFolder/label_1/snapshot_folder=20181128, baseFolder/label_1/snapshot_folder=20181129, baseFolder/label_2/snapshot_folder=20181128, baseFolder/label_2/snapshot_folder=20181129

cleanupStrategy

optional function that takes the list of available snapshots and returns the list of snapshots to remove

hadoopDBConnector

optional connector to the DB.
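
For orientation, a minimal construction and wiring sketch. The commit/push calls follow the commit pattern described in the Waimak documentation; the paths, label names, commit name, and the pre-existing flow value are illustrative:

    import com.coxautodata.waimak.dataflow.spark.ParquetDataCommitter

    // Writes each committed label under
    // /data/output/<label>/snapshot_folder=20181129
    val committer = ParquetDataCommitter(
      outputBaseFolder = "/data/output",
      snapshotFolder = Some("snapshot_folder=20181129")
    )

    // flow is an existing SparkDataFlow: register the labels under a
    // commit name, then push that commit through the committer.
    val committedFlow = flow
      .commit("daily_commit")("label_1", "label_2")
      .push("daily_commit")(committer)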

Linear Supertypes
Serializable, Serializable, Product, Equals, Logging, DataCommitter[SparkDataFlow], AnyRef, Any

Instance Constructors

  1. new ParquetDataCommitter(outputBaseFolder: String, snapshotFolder: Option[String] = None, cleanupStrategy: Option[CleanUpStrategy[FileStatus]] = None, hadoopDBConnector: Option[HadoopDBConnector] = None)

    outputBaseFolder

    folder under which each committed label will store its data. Ex: baseFolder/label_1/

    snapshotFolder

    optional name of the snapshot folder that will be used by all labels committed via this committer. It must be a full folder name and must not be the same as any previous snapshot of any of the commit-managed labels. Ex: baseFolder/label_1/snapshot_folder=20181128, baseFolder/label_1/snapshot_folder=20181129, baseFolder/label_2/snapshot_folder=20181128, baseFolder/label_2/snapshot_folder=20181129

    cleanupStrategy

    optional function that takes the list of available snapshots and returns the list of snapshots to remove

    hadoopDBConnector

    optional connector to the DB.

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. val cleanupStrategy: Option[CleanUpStrategy[FileStatus]]

    optional function that takes the list of available snapshots and returns the list of snapshots to remove

  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  8. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. def finish(commitName: String, commitUUID: UUID, labels: Seq[CommitEntry], flow: SparkDataFlow): SparkDataFlow

    Adds actions that are performed when all data is fully committed/moved into permanent storage. Can be used to do cleanup operations.

    commitName

    logical name of the commit

    commitUUID

    A UUID generated at runtime unique to a commit name

    labels

    labels that were committed

    flow

    data flow to which to add finalise actions

    returns

    data flow with finalise actions

    Attributes
    protected[com.coxautodata.waimak.dataflow]
    Definition Classes
    ParquetDataCommitter → DataCommitter
  10. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  11. val hadoopDBConnector: Option[HadoopDBConnector]


    optional connector to the DB.

  12. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  13. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  14. def logAndReturn[A](a: A, msg: String, level: Level): A


    Takes a value of type A and a msg to log, returning a and logging the message at the desired level.

    returns

    a

    Definition Classes
    Logging
  15. def logAndReturn[A](a: A, message: (A) ⇒ String, level: Level): A


    Takes a value of type A and a function message from A to String, and logs the value of invoking message(a) at the level described by the level parameter.

    returns

    a

    Definition Classes
    Logging
    Example:
    1. logAndReturn(1, (num: Int) => s"number: $num", Info)
      // In the log we would see a log corresponding to "number: 1"
  16. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  17. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  18. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  19. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  20. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  21. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  22. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  23. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  24. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  25. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  26. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def moveToPermanentStorageFlow(commitName: String, commitUUID: UUID, labels: Seq[CommitEntry], flow: SparkDataFlow): SparkDataFlow

    Adds actions to the flow that move data to the permanent storage, simulating a wave commit.

    commitName

    logical name of the commit

    commitUUID

    A UUID generated at runtime unique to a commit name

    labels

    labels to move

    flow

    data flow to which the move actions are added

    returns

    data flow with move actions

    Attributes
    protected[com.coxautodata.waimak.dataflow]
    Definition Classes
    ParquetDataCommitter → DataCommitter
  28. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  29. final def notify(): Unit

    Definition Classes
    AnyRef
  30. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  31. val outputBaseFolder: String

    folder under which each committed label will store its data. Ex: baseFolder/label_1/

  32. val snapshotFolder: Option[String]

    optional name of the snapshot folder that will be used by all labels committed via this committer. It must be a full folder name and must not be the same as any previous snapshot of any of the commit-managed labels. Ex: baseFolder/label_1/snapshot_folder=20181128, baseFolder/label_1/snapshot_folder=20181129, baseFolder/label_2/snapshot_folder=20181128, baseFolder/label_2/snapshot_folder=20181129

  33. def stageToTempFlow(commitName: String, commitUUID: UUID, labels: Seq[CommitEntry], flow: SparkDataFlow): SparkDataFlow

    Adds cache actions to the flow.

    commitName

    logical name of the commit

    commitUUID

    A UUID generated at runtime unique to a commit name

    labels

    labels to cache

    flow

    data flow to which the caching actions are added

    returns

    data flow with caching actions

    Attributes
    protected[com.coxautodata.waimak.dataflow]
    Definition Classes
    ParquetDataCommitter → DataCommitter
  34. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  35. def validate(flow: SparkDataFlow, commitName: String, entries: Seq[CommitEntry]): Try[Unit]

    Validates that:

    1. the data flow is a descendant of SparkDataFlow
    2. the data flow has a temp folder
    3. no committed label has an existing snapshot folder the same as the new one
    4. cleanup can only take place when a snapshot folder is defined

    flow

    data flow to validate

    Attributes
    protected[com.coxautodata.waimak.dataflow]
    Definition Classes
    ParquetDataCommitter → DataCommitter
  36. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  37. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  39. def withCleanupStrategy(strategy: CleanUpStrategy[FileStatus]): ParquetDataCommitter


    Sets a cleanup strategy for this Parquet committer.
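
    A hedged sketch of a custom strategy, assuming (per the cleanupStrategy description above) that CleanUpStrategy[FileStatus] maps a label's available snapshot folders to the ones to remove; verify the exact shape of the CleanUpStrategy type alias in your Waimak version:

        import org.apache.hadoop.fs.FileStatus

        // Assumption: the strategy receives the label name and its available
        // snapshot FileStatuses, and returns those to delete. Here we keep the
        // three lexicographically-latest snapshot folders and remove the rest.
        val keepLatestThree: CleanUpStrategy[FileStatus] =
          (_, snapshots) => snapshots.sortBy(_.getPath.getName).dropRight(3)

        val committer = ParquetDataCommitter("/data/output")
          .withSnapshotFolder("snapshot_folder=20181129")
          .withCleanupStrategy(keepLatestThree)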

  40. def withDateBasedSnapshotCleanup(folderPrefix: String, dateFormat: String, numberOfFoldersToKeep: Int): ParquetDataCommitter


    Configures a default implementation of a cleanup strategy based on dates encoded into the snapshot folder name.
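
    For example, with folder naming mirroring the snapshotFolder examples above (a sketch, not a definitive configuration):

        // Snapshot folders look like snapshot_folder=20181129: the prefix
        // "snapshot_folder=" plus a yyyyMMdd date. Keep the 5 most recent
        // snapshots per label and mark older ones for cleanup.
        val committer = ParquetDataCommitter("/data/output")
          .withSnapshotFolder("snapshot_folder=20181129")
          .withDateBasedSnapshotCleanup("snapshot_folder=", "yyyyMMdd", 5)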

  41. def withHadoopDBConnector(con: HadoopDBConnector): ParquetDataCommitter


    Sets a new DB connector.
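
    A sketch; the concrete HadoopDBConnector implementation depends on your project, so a placeholder is used here:

        // Placeholder: substitute a concrete HadoopDBConnector implementation.
        val connector: HadoopDBConnector = ???

        val committer = ParquetDataCommitter("/data/output")
          .withSnapshotFolder("snapshot_folder=20181129")
          .withHadoopDBConnector(connector)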

  42. def withSnapshotFolder(folder: String): ParquetDataCommitter


    Sets a snapshot folder for this Parquet committer.
