The JDBC driver to use for this RDBM
Escape a keyword (for use in a query), e.g. SQL Server uses [], Postgres uses ""
the keyword to escape
the escaped keyword
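For illustration only, a minimal sketch of what a SQL Server-style implementation of such an escape method could look like (the method name escapeKeyword is an assumption, not taken from the source):

    // Hypothetical sketch: wrap a keyword in square brackets, SQL Server style
    def escapeKeyword(keyword: String): String = s"[$keyword]"

    // e.g. escapeKeyword("select") returns "[select]", safe to use in a query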
JDBC connection properties
Tries to get whatever metadata information it can from the database. Uses the optionally provided values for pks and lastUpdated if it cannot get them from the database.
the database schema name
the table name
Optionally, the primary keys for this table
Optionally, the last updated column for this table
Success[AuditTableInfo] if all required metadata was either found or provided by the user; Failure if required metadata was neither found nor provided by the user; Failure if the metadata provided differed from the metadata found in the database
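As an illustrative sketch (the connector value, method name and parameter names below are assumptions, not source-confirmed), a caller might supply the optional primary keys and last-updated column and pattern match on the resulting Try:

    import scala.util.{Failure, Success}

    // Hypothetical call for illustration only
    val metadata = connector.getTableMetadata(
      dbSchemaName = "dbo",
      tableName = "customer",
      primaryKeys = Some(Seq("customer_id")),   // used only if not found in the database
      lastUpdatedColumn = Some("last_updated")  // used only if not found in the database
    )

    metadata match {
      case Success(info) => println(s"Resolved table info: $info")
      case Failure(e)    => println(s"Required metadata missing or mismatched: ${e.getMessage}")
    }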
The function to use to get the system timestamp in the database
Generates predicates which are used to form the partitions of the read Dataset. Queries the table to work out the primary key boundary points to use (so that each partition will contain a maximum of maxRowsPerPartition rows)
the table metadata
the last updated timestamp from which we wish to read data
the maximum number of rows we want in each partition
If the Dataset will have fewer rows than maxRowsPerPartition then None, otherwise predicates to use in order to create the partitions, e.g. "id >= 5 and id < 7"
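The predicate strings are in the form accepted by Spark's JDBC reader; a hedged sketch of how they would typically be applied (URL, table and boundary values are placeholders):

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val props = new Properties() // JDBC connection properties (user, password, driver, ...)

    // Each predicate defines one partition of the resulting DataFrame; boundary
    // points are chosen so no partition exceeds maxRowsPerPartition rows
    val predicates = Array("id >= 1 and id < 5", "id >= 5 and id < 7", "id >= 7")

    val df = spark.read.jdbc(
      "jdbc:sqlserver://myhost;databaseName=mydb", // placeholder URL
      "dbo.customer",                              // placeholder table
      predicates,
      props
    )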
Creates a Dataset for the given table containing data which was updated after or on the provided timestamp
the table metadata
the last updated timestamp for the table (if None, then we read everything)
Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.
If set to true, ignore the last updated and read everything
a Dataset for the given table
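A usage sketch, assuming a method along these lines exists on the connector (connector, tableMetadata, and all names and parameters here are illustrative, not source-confirmed):

    import java.sql.Timestamp

    // Read only rows updated on or after the given timestamp, splitting the read
    // into partitions of at most 500,000 rows; forceFullLoad = true would ignore
    // the timestamp and read the whole table
    val ds = connector.getTableDataset(
      meta = tableMetadata,
      lastUpdated = Some(Timestamp.valueOf("2023-01-01 00:00:00")),
      maxRowsPerPartition = Some(500000),
      forceFullLoad = false
    )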
Creates a Dataset for the given table containing data which was updated after or on the provided timestamp. Override this if required
the table metadata
the last updated timestamp from which we wish to read data (if None, then we read everything)
Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.
(Dataset for the given table, Column to use as the last updated)
This is what the column to use as the last updated will be called in the output DataFrames (in some cases, this will come from the provided last updated column, in others it will be the system timestamp)
Generate a query to select from the given table
the metadata for the table
the last updated timestamp from which we wish to read data
any additional columns which need to be specified on read (which won't be picked up by select *), e.g. HIDDEN fields
a query which selects from the given table
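For example, a sketch of the kind of query such a method might produce (the helper below and all column and table names are placeholders; the exact shape generated by the connector may differ):

    // Illustrative only: select everything plus explicitly-named hidden columns,
    // filtered on the last updated column when a timestamp is provided
    def selectQuery(schema: String, table: String, lastUpdatedCol: String,
                    lastUpdated: Option[java.sql.Timestamp], extraCols: Seq[String]): String = {
      val cols = ("*" +: extraCols).mkString(", ")
      val where = lastUpdated.map(ts => s" where $lastUpdatedCol >= '$ts'").getOrElse("")
      s"(select $cols from $schema.$table$where) s"
    }

    // e.g. selectQuery("dbo", "customer", "last_updated", Some(ts), Seq("HIDDEN_COL"))
    //      => "(select *, HIDDEN_COL from dbo.customer where last_updated >= '...') s"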
Creates a Spark Dataset for the table
a Spark Dataset for the table
This is what the column containing the system timestamp will be called in the output DataFrames
How to transform the target table name into the table name in the database if the two are different. Useful if you have multiple tables representing the same thing but with different names, and you wish them to be written to the same target table
a function which takes a target table name and returns the table name in the database
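A minimal sketch of such a function, assuming for illustration that the target table customer is backed by a database table named customer_eu:

    // Maps a target table name to the table name in the database;
    // the specific mapping below is illustrative only
    val resolveTableName: String => String = {
      case "customer" => "customer_eu"
      case other      => other // default: names match
    }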
Waimak RDBM connection mechanism