Compacting hadoop partitions is not supported out-of-the-box by hadoop, as files need to be read with the correct format and written again. The following steps are used to compact partitions with Spark:

1. Check if a compaction is already in progress by looking for a special file "_SDL_COMPACTING" in the data object's root hadoop path. If it exists and is not older than 12h, abort compaction with an exception. Otherwise create/update the special file "_SDL_COMPACTING"; if the file is older than 12h, the previous compaction process is assumed to have crashed.
2. As step 5 is not atomic (delete and move are two operations), check for possibly incomplete compactions of previous crashed runs and fix them. Incomplete compactions are marked with a special file "_SDL_MOVING" in the temporary path. Incompletely compacted partitions must be moved from the temporary path to the hadoop path and marked as compacted (see step 5).
3. Filter already compacted partitions from the given partitions by looking for the "_SDL_COMPACTED" file (see step 5).
4. Rewrite the data of the partitions to be compacted into a temporary path under the data object's hadoop path.
5. Delete each partition to be compacted from the hadoop path and move it from the temporary path to the hadoop path. This is done one-by-one to reduce the risk of data loss. To recover from an unexpected abort between delete and move, a special file "_SDL_MOVING" is created in the temporary path before deleting the hadoop path; after moving the temporary path, this file is deleted again. Mark compacted partitions by creating a special file "_SDL_COMPACTED".
6. Delete the "_SDL_COMPACTING" file created in step 1.
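The steps above can be sketched as follows. This is a minimal illustration using the local filesystem in place of HDFS and plain directory moves in place of Spark rewrites; the function and parameter names (`compact`, `rewrite`, `finish_move`) are hypothetical, not part of any actual API:

```python
import shutil
import time
from pathlib import Path

COMPACTING = "_SDL_COMPACTING"  # lock file in the data object's root path
MOVING = "_SDL_MOVING"          # marks a delete-and-move in progress
COMPACTED = "_SDL_COMPACTED"    # marks an already compacted partition
MAX_LOCK_AGE_S = 12 * 3600     # locks older than 12h are considered stale


def compact(root: Path, tmp: Path, partitions, rewrite) -> None:
    """Compact the given partitions of the data object rooted at `root`.

    `rewrite` is a caller-supplied function that reads a partition with the
    correct format and writes it compacted into the temporary path.
    """
    # Step 1: take the compaction lock, failing if a fresh lock exists.
    lock = root / COMPACTING
    if lock.exists() and time.time() - lock.stat().st_mtime < MAX_LOCK_AGE_S:
        raise RuntimeError("compaction already in progress")
    lock.touch()  # create, or refresh a stale lock from a crashed run

    # Step 2: finish delete-and-moves of a previously crashed run.
    for marker in tmp.glob(f"*/{MOVING}"):
        name = marker.parent.name
        if (root / name).exists():
            shutil.rmtree(root / name)  # remove possibly partial hadoop copy
        finish_move(root, tmp, name)

    # Step 3: skip partitions that are already compacted.
    todo = [p for p in partitions if not (root / p / COMPACTED).exists()]

    # Step 4: rewrite the remaining partitions into the temporary path.
    for p in todo:
        rewrite(root / p, tmp / p)

    # Step 5: delete + move each partition one-by-one, guarded by _SDL_MOVING.
    for p in todo:
        (tmp / p / MOVING).touch()  # hadoop copy is about to be deleted
        shutil.rmtree(root / p)
        finish_move(root, tmp, p)

    # Step 6: release the compaction lock.
    lock.unlink()


def finish_move(root: Path, tmp: Path, name: str) -> None:
    """Move a compacted partition into place and mark it compacted."""
    shutil.move(str(tmp / name), str(root / name))
    (root / name / MOVING).unlink(missing_ok=True)  # move is complete
    (root / name / COMPACTED).touch()
```

Note that `finish_move` deletes the "_SDL_MOVING" marker only after the move, so a crash at any point leaves either the marker or the "_SDL_COMPACTED" file behind, and step 2 of the next run can pick up where the crashed run left off.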