Number of Spark tasks to create per Hadoop partition by repartitioning the DataFrame before writing to the DataObject. This controls how many files are created in each Hadoop partition.
Optional key columns used to distribute records over Spark tasks inside a Hadoop partition. If the DataObject has Hadoop partitions defined, keyCols must be defined.
Optional columns used to sort records inside the files created.
Optional filename to rename the target file(s). If numberOfTasksPerPartition is greater than 1, multiple files can exist in a directory, and a task number is inserted into the filename after the first '.'. Example: filename=data.csv results in files data.1.csv, data.2.csv, ...
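The renaming rule above can be sketched in plain Scala. This is an illustrative helper, not the actual implementation; the function name insertTaskNr is hypothetical.

```scala
// Hypothetical sketch of the renaming rule described above:
// a task number is inserted after the first '.' of the configured filename.
def insertTaskNr(filename: String, taskNr: Int): String = {
  val idx = filename.indexOf('.')
  if (idx < 0) s"$filename.$taskNr" // no extension: just append the task number
  else filename.substring(0, idx) + s".$taskNr" + filename.substring(idx)
}
```

For example, insertTaskNr("data.csv", 2) yields "data.2.csv".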
DataFrame to repartition.
Partition columns of the DataObject.
Partition values to be written with this DataFrame.
Id of the DataObject, used for logging.
This controls how the DataFrame is repartitioned before Spark writes it to Hadoop.
When writing multiple partitions of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition multiplied by the number of partitions to write. To spread the records of a partition over only numberOfTasksPerPartition Spark tasks, keyCols must be given; they are used to derive a task number inside the partition (hashvalue(keyCols) modulo numberOfTasksPerPartition).
When writing to an unpartitioned DataObject, or to only one partition of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition. Optional keyCols can be used to keep corresponding records together in the same task and file.
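The task derivation above can be sketched in plain Scala. This is a minimal illustration of hashvalue(keyCols) modulo numberOfTasksPerPartition, not the actual Spark-side implementation; the function names are hypothetical.

```scala
// Sketch: records with the same keyCols values get the same task number
// inside a Hadoop partition, so they end up in the same task/file.
def taskNrInPartition(keyColValues: Seq[Any], numberOfTasksPerPartition: Int): Int =
  math.floorMod(keyColValues.hashCode, numberOfTasksPerPartition) // always in [0, n)

// Total Spark tasks when writing several partitions of a partitioned DataObject.
def totalTasks(numberOfTasksPerPartition: Int, nbOfPartitionsToWrite: Int): Int =
  numberOfTasksPerPartition * nbOfPartitionsToWrite
```

Using math.floorMod rather than % keeps the derived task number non-negative even for negative hash values.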