Class SparkRepartitionDef

Package io.smartdatalake.util.hdfs
case class SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None) extends SmartDataLakeLogger with Product with Serializable

Controls how a DataFrame is repartitioned before Spark writes it to Hadoop.

When writing multiple partitions of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition multiplied by the number of partitions to write. To spread the records of a partition over only numberOfTasksPerPartition Spark tasks, keyCols must be given; they are used to derive a task number inside the partition (hashvalue(keyCols) modulo numberOfTasksPerPartition).

When writing to an unpartitioned DataObject, or to only one partition of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition. Optional keyCols can be used to keep corresponding records together in the same task/file.
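The task-number derivation described above can be sketched as follows. This is a conceptual illustration only: Spark uses its own hash function internally, and Python's built-in `hash` here merely stands in for `hashvalue(keyCols)`.

```python
# Conceptual sketch: spread records over numberOfTasksPerPartition tasks
# inside one Hadoop partition by deriving a task number from the key columns:
#   hashvalue(keyCols) modulo numberOfTasksPerPartition
# (Python's hash() is only illustrative; Spark uses its own hash function.)

def task_number(record, key_cols, number_of_tasks_per_partition):
    """Derive the task number for a record from its key column values."""
    key = tuple(record[c] for c in key_cols)
    return hash(key) % number_of_tasks_per_partition

records = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 1, "value": "c"},  # same key as the first record
]
tasks = [task_number(r, ["id"], 2) for r in records]

# Records with equal key columns always land in the same task (and thus file):
assert tasks[0] == tasks[2]
```

Because the task number depends only on the key column values, all records sharing the same keyCols end up in the same task and therefore in the same output file.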

numberOfTasksPerPartition

Number of Spark tasks to create per partition before writing to DataObject by repartitioning the DataFrame. This controls how many files are created in each Hadoop partition.

keyCols

Optional key columns used to distribute records over Spark tasks inside a Hadoop partition. If the DataObject has Hadoop partitions defined, keyCols must be defined.

sortCols

Optional columns to sort records inside files created.

filename

Optional filename to rename the target file(s). If numberOfTasksPerPartition is greater than 1, multiple files can exist in a directory, and a number is inserted into the filename after the first '.'. Example: filename=data.csv results in files data.1.csv, data.2.csv, ...
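The renaming rule above ("a number is inserted after the first '.'") can be sketched like this; the helper name is hypothetical, and the real implementation in SparkRepartitionDef may differ in details.

```python
def numbered_filename(filename, task_number):
    """Insert a task number after the first '.' of the configured filename.

    Sketch of the renaming rule described above (hypothetical helper,
    not the actual SparkRepartitionDef implementation).
    """
    base, sep, rest = filename.partition(".")
    if not sep:
        # No '.' in the filename: append the number at the end.
        return f"{filename}.{task_number}"
    return f"{base}.{task_number}.{rest}"

print(numbered_filename("data.csv", 1))  # data.1.csv
print(numbered_filename("data.csv", 2))  # data.2.csv
```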

Linear Supertypes
Serializable, Serializable, Product, Equals, SmartDataLakeLogger, AnyRef, Any

Instance Constructors

  1. new SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None)

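In SmartDataLake, a SparkRepartitionDef is typically not constructed directly in code but configured on a file DataObject. The following is a minimal sketch; the DataObject name and path are hypothetical, and the exact configuration keys should be checked against the SDL version in use.

```hocon
dataObjects {
  csv-export {
    type = CsvFileDataObject
    path = "hdfs://namenode/data/export"
    # hypothetical example values; verify keys against your SDL version
    sparkRepartition {
      numberOfTasksPerPartition = 2
      keyCols = [id]
      sortCols = [ts]
      filename = "data.csv"
    }
  }
}
```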

Value Members

val filename: Option[String]

    Optional filename to rename the target file(s). If numberOfTasksPerPartition is greater than 1, multiple files can exist in a directory, and a number is inserted into the filename after the first '.'. Example: filename=data.csv results in files data.1.csv, data.2.csv, ...

val keyCols: Seq[String]

    Optional key columns used to distribute records over Spark tasks inside a Hadoop partition. If the DataObject has Hadoop partitions defined, keyCols must be defined.

lazy val logger: Logger

    Attributes: protected. Definition classes: SmartDataLakeLogger.

val numberOfTasksPerPartition: Int

    Number of Spark tasks to create per partition before writing to the DataObject, by repartitioning the DataFrame. This controls how many files are created in each Hadoop partition.

def prepareDataFrame(df: DataFrame, partitions: Seq[String], partitionValues: Seq[PartitionValues], dataObjectId: DataObjectId): DataFrame

    df: the DataFrame to repartition
    partitions: the DataObject's partition columns
    partitionValues: the partition values to be written with this DataFrame
    dataObjectId: id of the DataObject, used for logging

def renameFiles(fileRefs: Seq[FileRef])(implicit filesystem: FileSystem): Unit

val sortCols: Seq[String]

    Optional columns to sort records inside the files created.
