The JDBC driver to use for this RDBM
Escape a keyword (for use in a query), e.g. SQL Server uses [], Postgres uses ""
the keyword to escape
the escaped keyword
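For illustration only, a minimal sketch of what a SQL Server-style implementation of such an escape method could look like (the method name escapeKeyword is an assumption, not taken from the source):

    // Hypothetical sketch: wrap a keyword in square brackets, SQL Server style
    def escapeKeyword(keyword: String): String = s"[$keyword]"

    // e.g. escapeKeyword("select") returns "[select]", safe to use in a query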
JDBC connection properties
Tries to get whatever metadata information it can from the database. Uses the optionally provided values for pks and lastUpdated if it cannot get them from the database.
the database schema name
the table name
Optionally, the primary keys for this table
Optionally, the last updated column for this table
Success[AuditTableInfo] if all required metadata was either found or provided by the user; Failure if required metadata was neither found nor provided by the user; Failure if the metadata provided differed from the metadata found in the database
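As an illustrative sketch (the connector value, method name and parameter names below are assumptions, not source-confirmed), a caller might supply the optional primary keys and last-updated column and pattern match on the resulting Try:

    import scala.util.{Failure, Success}

    // Hypothetical call for illustration only
    val metadata = connector.getTableMetadata(
      dbSchemaName = "dbo",
      tableName = "customer",
      primaryKeys = Some(Seq("customer_id")),   // used only if not found in the database
      lastUpdatedColumn = Some("last_updated")  // used only if not found in the database
    )

    metadata match {
      case Success(info) => println(s"Resolved table info: $info")
      case Failure(e)    => println(s"Required metadata missing or mismatched: ${e.getMessage}")
    }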
The function to use to get the system timestamp in the database
Generates predicates which are used to form the partitions of the read Dataset. Queries the table to work out the primary key boundary points to use (so that each partition will contain a maximum of maxRowsPerPartition rows)
the table metadata
the last updated timestamp from which we wish to read data
the maximum number of rows we want in each partition
If the Dataset will have fewer rows than maxRowsPerPartition then None, otherwise predicates to use in order to create the partitions, e.g. "id >= 5 and id < 7"
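The predicate strings are in the form accepted by Spark's JDBC reader; a hedged sketch of how they would typically be applied (URL, table and boundary values are placeholders):

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val props = new Properties() // JDBC connection properties (user, password, driver, ...)

    // Each predicate defines one partition of the resulting DataFrame; boundary
    // points are chosen so no partition exceeds maxRowsPerPartition rows
    val predicates = Array("id >= 1 and id < 5", "id >= 5 and id < 7", "id >= 7")

    val df = spark.read.jdbc(
      "jdbc:sqlserver://myhost;databaseName=mydb", // placeholder URL
      "dbo.customer",                              // placeholder table
      predicates,
      props
    )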
Creates a Dataset for the given table containing data which was updated after or on the provided timestamp
the table metadata
the last updated timestamp for the table (if None, then we read everything)
Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.
If set to true, ignore the last updated and read everything
a Dataset for the given table
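A usage sketch, assuming a method along these lines exists on the connector (connector, tableMetadata, and all names and parameters here are illustrative, not source-confirmed):

    import java.sql.Timestamp

    // Read only rows updated on or after the given timestamp, splitting the read
    // into partitions of at most 500,000 rows; forceFullLoad = true would ignore
    // the timestamp and read the whole table
    val ds = connector.getTableDataset(
      meta = tableMetadata,
      lastUpdated = Some(Timestamp.valueOf("2023-01-01 00:00:00")),
      maxRowsPerPartition = Some(500000),
      forceFullLoad = false
    )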
Creates a Dataset for the given table containing data which was updated after or on the provided timestamp. Override this if required
the table metadata
the last updated timestamp from which we wish to read data (if None, then we read everything)
Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.
(Dataset for the given table, Column to use as the last updated)
This is what the column to use as the last updated will be called in the output DataFrames (in some cases, this will come from the provided last updated column, in others it will be the system timestamp)
Generate a query to select from the given table
the metadata for the table
the last updated timestamp from which we wish to read data
any additional columns which need to be specified on read (which won't be picked up by select *), e.g. HIDDEN fields
a query which selects from the given table
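For example, a sketch of the kind of query such a method might produce (the helper below and all column and table names are placeholders; the exact shape generated by the connector may differ):

    // Illustrative only: select everything plus explicitly-named hidden columns,
    // filtered on the last updated column when a timestamp is provided
    def selectQuery(schema: String, table: String, lastUpdatedCol: String,
                    lastUpdated: Option[java.sql.Timestamp], extraCols: Seq[String]): String = {
      val cols = ("*" +: extraCols).mkString(", ")
      val where = lastUpdated.map(ts => s" where $lastUpdatedCol >= '$ts'").getOrElse("")
      s"(select $cols from $schema.$table$where) s"
    }

    // e.g. selectQuery("dbo", "customer", "last_updated", Some(ts), Seq("HIDDEN_COL"))
    //      => "(select *, HIDDEN_COL from dbo.customer where last_updated >= '...') s"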
Creates a Spark Dataset for the table
a Spark Dataset for the table
This is what the column containing the system timestamp will be called in the output DataFrames
How to transform the target table name into the table name in the database if the two are different. Useful if you have multiple tables representing the same thing but with different names, and you wish them to be written to the same target table
a function which takes a target table name and returns the table name in the database
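A minimal sketch of such a function, assuming for illustration that the target table customer is backed by a database table named customer_eu:

    // Maps a target table name to the table name in the database;
    // the specific mapping below is illustrative only
    val resolveTableName: String => String = {
      case "customer" => "customer_eu"
      case other      => other // default: names match
    }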
Waimak RDBM connection mechanism