During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe; moving out the old data can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset data, which will be written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
  /data/db/tbl1/type=hot/region=11
  /data/db/tbl1/type=hot/region=12
  /data/db/tbl1/type=hot/region=13
  /data/db/tbl1/type=hot/region=14
Final state:
  /data/db/tbl1/type=cold/region=15
  /data/db/.Trash/tbl1/region=11
  /data/db/.Trash/tbl1/region=12
  /data/db/.Trash/tbl1/region=13
  /data/db/.Trash/tbl1/region=14
name of the table
the dataset with data from fromSubFolders, already repartitioned; it will be saved into newDataPath
path into which the combined and repartitioned data from the dataset will be committed
parent folder from which to remove the cleanUpFolders
list of sub-folders to remove once the writing and committing of the combined data is successful
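The flow above can be sketched as follows. This is a hypothetical outline, not the project's actual implementation: the method name, parameter list, and trash layout are assumptions derived from the example paths.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the compaction flow described above.
def compact(spark: SparkSession, fs: FileSystem, tableName: String,
            fromBase: String, fromSubFolders: Seq[String],
            newDataPath: String, numPartitions: Int): Unit = {
  // Merge the source folders and coalesce into fewer partitions.
  val data = spark.read
    .parquet(fromSubFolders.map(sub => s"$fromBase/$sub"): _*)
    .coalesce(numPartitions)

  // Fully write and commit the new version first.
  data.write.parquet(newDataPath)

  // Only after the commit succeeded, move the old folders into trash.
  val trash = new Path(s"/data/db/.Trash/$tableName")
  fs.mkdirs(trash)
  fromSubFolders.foreach { sub =>
    fs.rename(new Path(fromBase, sub), new Path(trash, sub))
  }
}
```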
Glob a list of table paths with partitions, and apply a partial function to collect (filter + map) the result, transforming each FileStatus to any type A.
return type of final sequence
parent folder which contains folders with table names
list of table names to search under
list of partition columns to include in the path
a partial function to transform FileStatus to any type A
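A minimal sketch of this globbing, assuming Hadoop's FileSystem API; the signature is inferred from the parameter descriptions and is not the project's actual one.

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Glob basePath/table/col1=*/col2=*/... and collect matching statuses.
def globTablePaths[A](fs: FileSystem, basePath: String,
                      tableNames: Seq[String],
                      partitionColumns: Seq[String],
                      collector: PartialFunction[FileStatus, A]): Seq[A] =
  tableNames.flatMap { table =>
    val pattern = (Seq(basePath, table) ++ partitionColumns.map(_ + "=*"))
      .mkString("/")
    // globStatus can return null when nothing matches, hence the Option.
    Option(fs.globStatus(new Path(pattern))).toSeq.flatten
      .collect(collector) // filter + map in one pass
  }
```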
Lists tables in the basePath. It will ignore any folder/table whose name starts with '.'
parent folder which contains folders with table names
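A sketch of the listing under the stated assumption that a table is a direct sub-folder of basePath; the method name is hypothetical.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// List table folders, skipping hidden ones such as .Trash and .tmp.
def listTables(fs: FileSystem, basePath: String): Seq[String] =
  fs.listStatus(new Path(basePath))
    .filter(s => s.isDirectory && !s.getPath.getName.startsWith("."))
    .map(_.getPath.getName)
    .toSeq
```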
Creates folders on the physical storage.
path to create
true if the folder already exists or was created without problems; false if any folder in the path could not be created
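With Hadoop's FileSystem this is a thin wrapper, sketched below; the exception handling is an assumption about how failures map to the boolean result.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// mkdirs behaves like `mkdir -p`: it also returns true when the
// folder already exists.
def createFolders(fs: FileSystem, path: String): Boolean =
  try fs.mkdirs(new Path(path))
  catch { case _: java.io.IOException => false }
```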
Opens a parquet file from the path, which can be a folder or a file. If there are partitioned sub-folders containing files with slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.
path to open
Some with the dataset if there is data; None if the path does not exist or cannot be opened
Exception
in case of connectivity problems
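The described behaviour maps closely onto Spark's built-in mergeSchema option for parquet; a minimal sketch, with the None-on-failure handling assumed:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

// mergeSchema reconciles slightly different schemas across
// partitioned sub-folders; a failed read becomes None.
def openParquet(spark: SparkSession, path: String): Option[DataFrame] =
  Try(spark.read.option("mergeSchema", "true").parquet(path)).toOption
```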
Checks if the path exists in the physical storage.
true if path exists in the storage layer
Reads the table info back.
parent folder which contains folders with table names
name of the table to read the info for
Writes out static data about the audit table into basePath/table_name/.table_info file.
parent folder which contains folders with table names
static information about the table that will not change during the table's existence
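A sketch of the write, assuming the table info is serialised to plain text; the serialisation format is an assumption, only the .table_info path comes from the description above.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Overwrite basePath/tableName/.table_info with the serialised info.
def writeTableInfo(fs: FileSystem, basePath: String,
                   tableName: String, info: String): Unit = {
  val out = fs.create(new Path(s"$basePath/$tableName/.table_info"), true)
  try out.write(info.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}
```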
Commits the dataset into the full path. The path is the full destination into which the parquet will be placed after it has been fully written into the temp folder.
name of the table; it will only be used to build the temp folder path
full destination path
dataset to write out; no partitioning will be performed on it
Exception
can be thrown due to access permissions, connectivity, or Spark UDFs (as datasets are lazily executed)
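The write-then-rename commit can be sketched as below; the temp folder location is an assumption. Because datasets are lazy, UDF and connectivity failures surface during the write, before the destination is touched.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, Row, SaveMode}

// Sketch: write to temp first, then atomically rename into place.
def commit(fs: FileSystem, tableName: String, destPath: String,
           data: Dataset[Row]): Unit = {
  val tmp = new Path(s"/data/db/.tmp/$tableName") // assumed temp location
  data.write.mode(SaveMode.Overwrite).parquet(tmp.toString) // UDFs run here
  if (!fs.rename(tmp, new Path(destPath)))
    throw new java.io.IOException(s"Could not commit $tmp to $destPath")
}
```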
Contains operations that interact with the physical storage and handles commits to the file system.
Created by Alexei Perelighin on 2018/03/05