Package

io.smartdatalake.util

hdfs

package hdfs

Type Members

  1. case class PartitionValues(elements: Map[String, Any]) extends Product with Serializable

    A partition is defined by values for its partition columns. It can be represented by a Map whose keys are the partition column names and whose values are the corresponding partition column values.

    Annotations
    @DeveloperApi()
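As a minimal sketch of the Map representation described above, the following uses a local stand-in case class that only mirrors the documented signature; it is not the library class itself. The helper `partitionPath` and the Hive-style `column=value` layout are assumptions made for illustration and are not taken from this page.

```scala
// Local stand-in mirroring the documented signature; NOT the library class itself.
final case class PartitionValues(elements: Map[String, Any])

object PartitionValuesSketch {
  // Hypothetical helper (assumption): renders a Hive-style relative path,
  // using the explicitly given column order rather than the Map's iteration order.
  def partitionPath(pv: PartitionValues, partitionCols: Seq[String]): String =
    partitionCols
      .flatMap(col => pv.elements.get(col).map(value => s"$col=$value"))
      .mkString("/")

  def main(args: Array[String]): Unit = {
    // Keys are partition column names, values are the partition column values.
    val pv = PartitionValues(Map("dt" -> "20240101", "country" -> "CH"))
    println(partitionPath(pv, Seq("dt", "country")))
  }
}
```

Passing the column order explicitly avoids relying on how a Map happens to iterate over its entries.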
  2. case class SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None) extends SmartDataLakeLogger with Product with Serializable

    This controls repartitioning of the DataFrame before writing with Spark to Hadoop.

    When writing multiple partitions of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition multiplied by the number of partitions to write. To spread the records of each partition over only numberOfTasksPerPartition Spark tasks, keyCols must be given; they are used to derive a task number inside the partition (hashvalue(keyCols) modulo numberOfTasksPerPartition).

    When writing to an unpartitioned DataObject, or to only one partition of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition. Optional keyCols can be used to keep corresponding records together in the same task/file.

    numberOfTasksPerPartition

    Number of Spark tasks to create per partition, by repartitioning the DataFrame, before writing to the DataObject. This controls how many files are created in each Hadoop partition.

    keyCols

    Optional key columns to distribute records over Spark tasks inside a Hadoop partition. If numberOfTasksPerPartition is 1, this setting has no effect. If the DataObject has Hadoop partitions defined, keyCols must be defined.

    sortCols

    Optional columns by which to sort records inside the created files.

    filename

    Optional filename to rename the target file if numberOfTasksPerPartition is 1.
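The task arithmetic described for SparkRepartitionDef can be illustrated with a small plain-Scala sketch. It does not use Spark or the library: `totalTasks` and `taskNumber` are hypothetical helper names, and Scala's `hashCode` stands in for whatever hash function is actually applied to keyCols, which this page does not specify.

```scala
object RepartitionSketch {
  // Number of Spark tasks when writing several partitions:
  // numberOfTasksPerPartition multiplied by the number of partitions to write.
  // For an unpartitioned target or a single partition, pass partitionsToWrite = 1.
  def totalTasks(numberOfTasksPerPartition: Int, partitionsToWrite: Int): Int =
    numberOfTasksPerPartition * partitionsToWrite

  // Task number inside a partition: hash of the key column values modulo
  // numberOfTasksPerPartition. hashCode is only a stand-in hash (assumption);
  // floorMod keeps the result non-negative for negative hash values.
  def taskNumber(keyValues: Seq[Any], numberOfTasksPerPartition: Int): Int =
    java.lang.Math.floorMod(keyValues.hashCode(), numberOfTasksPerPartition)

  def main(args: Array[String]): Unit = {
    val tasksPerPartition = 4
    println(s"tasks for 3 partitions: ${totalTasks(tasksPerPartition, 3)}")

    // Records sharing the same key values get the same task number,
    // so they end up together in the same task/file.
    val records = Seq(Seq("customer-1"), Seq("customer-2"), Seq("customer-1"))
    records.foreach { key =>
      println(s"$key -> task ${taskNumber(key, tasksPerPartition)}")
    }
  }
}
```

With numberOfTasksPerPartition set to 1, every record maps to task 0, which matches the note that keyCols has no effect in that case.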
