Exports a util DataFrame that contains properties and metadata extracted from all io.smartdatalake.workflow.action.Actions that are registered in the current InstanceRegistry.
Alternatively, it can export the properties and metadata of all io.smartdatalake.workflow.action.Actions defined in config files. For this, the configuration "config" has to be set to the location of the config.
Example:
```
dataObjects = {
  ...
  actions-exporter {
    type = ActionsExporterDataObject
    config = path/to/myconfiguration.conf
  }
  ...
}
```
The config value can point to a configuration file or a directory containing configuration files.
Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.
A io.smartdatalake.workflow.dataobject.DataObject backed by an Avro data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Avro formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementations are provided by the Databricks spark-avro project.
An optional schema for the spark data frame used when writing new Avro files. Note: Existing Avro files contain a source schema. Therefore, this schema is ignored when reading from existing Avro files.
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
org.apache.spark.sql.DataFrameWriter
org.apache.spark.sql.DataFrameReader
A DataObject backed by a comma-separated value (CSV) data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on CSV formatted files.
CSV reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.
Read Schema specifications:

If a data object schema is not defined via the schema attribute (default) and inferSchema is disabled (default) in csvOptions, then all column types are set to String and the first row of the CSV file is read to determine the column names and the number of fields. If the header option is disabled (default) in csvOptions, then the header is defined as "_c#" for each column, where "#" is the column index. Otherwise the first row of the CSV file is not included in the DataFrame content and its entries are used as the column names for the schema.

If a data object schema is not defined via the schema attribute and inferSchema is enabled in csvOptions, then the samplingRatio option (default: 1.0) in csvOptions is used to extract a sample from the CSV file in order to determine the input schema automatically.
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
An optional data object schema. If defined, any automatic schema inference is avoided.
Specifies the string format used for writing date typed data.
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
This data object sets the following default values for csvOptions: delimiter = "|", quote = null, header = false, and inferSchema = false. All other csvOptions default to the values defined by Apache Spark.
org.apache.spark.sql.DataFrameWriter
org.apache.spark.sql.DataFrameReader
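Putting the options above together, a CSV data object definition might look like the following sketch. The type name CsvFileDataObject and the path key are assumptions inferred from the naming pattern of the other data objects; the csvOptions keys are the ones described above.

```
dataObjects = {
  my-csv-input {
    type = CsvFileDataObject      # assumed type name
    path = /data/csv-input        # assumed key
    csvOptions {
      delimiter = ","             # overrides the "|" default
      header = true               # first row contains the column names
      inferSchema = true          # sample the file to derive column types
      samplingRatio = 0.1         # use 10% of the rows for schema inference
    }
  }
}
```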
Generic DataObject containing a config object. E.g. used to implement a CustomAction that reads a Webservice.
Additional metadata for a DataObject
Readable name of the DataObject
Description of the content of the DataObject
Name of the layer this DataObject belongs to
Name of the subject area this DataObject belongs to
Optional custom tags for this object
Exports a util DataFrame that contains properties and metadata extracted from all DataObjects that are registered in the current InstanceRegistry.
Alternatively, it can export the properties and metadata of all DataObjects defined in config files. For this, the configuration "config" has to be set to the location of the config.
Example:
```
dataObjects = {
  ...
  dataobject-exporter {
    type = DataObjectsExporterDataObject
    config = path/to/myconfiguration.conf
  }
  ...
}
```
The config value can point to a configuration file or a directory containing configuration files.
Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.
DataObject of type DeltaLakeTableDataObject. Provides details to an Action for accessing DeltaLake tables.
unique name of this data object
hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.
partition columns for this data object
type of date column
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
DeltaLake table to be written by this output
number of files created when writing into an empty table (otherwise the number will be derived from the existing data)
spark SaveMode to use when writing files, default is "overwrite"
DeltaLake table retention period of old transactions for time travel feature in hours
override the connection's permissions for files created in the table's hadoop directory
optional id of io.smartdatalake.workflow.connection.HiveTableConnection
meta data
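The fields above might be combined as in this sketch; key names such as path, partitions, saveMode, retentionPeriod and the table block are assumptions based on the field descriptions:

```
dataObjects = {
  my-delta-table {
    type = DeltaLakeTableDataObject
    path = /data/mydb/mytable     # hadoop directory for this table
    partitions = [dt]             # partition columns (assumed key)
    saveMode = overwrite          # default
    retentionPeriod = 168         # hours kept for time travel (assumed key)
    table {
      db = mydb
      name = mytable
    }
  }
}
```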
A DataObject backed by a Microsoft Excel data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Microsoft Excel (.xlsx) formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementation is provided by the Crealytics spark-excel project.
Read Schema:

When useHeader is set to true (default), the reader uses the first row of the Excel sheet as column names for the schema and does not include the first row as data values. Otherwise the column names are taken from the schema. If the schema is not provided or inferred, then each column name is defined as "_c#" where "#" is the column index.

When a data object schema is provided, it is used as the schema for the DataFrame. Otherwise, if inferSchema is enabled (default), the data types of the columns are inferred based on the first excerptSize rows (excluding the first). When no schema is provided and inferSchema is disabled, all columns are assumed to be of string type.
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
An optional data object schema. If defined, any automatic schema inference is avoided.
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop. Default is numberOfTasksPerPartition = 1.
Options passed to org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter for reading and writing Microsoft Excel files. Excel support is provided by the spark-excel project (see link below).
the name of the Excel Sheet to read from/write to. This option is required.
the number of rows in the excel spreadsheet to skip before any data is read. This option must not be set for writing.
the first column in the specified Excel Sheet to read from (1-based indexing). This option must not be set for writing.
TODO: this option is no longer used as far as I can tell; crealytics now uses dataAddress.
Limit the number of rows being returned on read to the first rowLimit rows. This is applied after numLinesToSkip.
If true, the first row of the Excel sheet specifies the column names. This option is required (default: true).
Empty cells are parsed as null values (default: true).
Infer the schema of the Excel sheet automatically (default: true).
A format string specifying the format to use when writing timestamps (default: dd-MM-yyyy HH:mm:ss).
A format string specifying the format to use when writing dates.
The number of rows that are stored in memory. If set, a streaming reader is used which can help with big files.
Sample size for schema inference.
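Combined, the Excel options above might be set as in this sketch. The surrounding keys (type name, path, the options-block key, and the sheet-name key) are assumptions, while useHeader, numLinesToSkip, rowLimit, inferSchema and excerptSize follow the descriptions above:

```
dataObjects = {
  my-excel-input {
    type = ExcelFileDataObject   # assumed type name
    path = /data/report.xlsx     # assumed key
    excelOptions {               # assumed key for the options block
      sheetName = "Sheet1"       # the sheet to read from (assumed key)
      useHeader = true           # first row contains the column names
      numLinesToSkip = 2         # rows to skip before reading data
      rowLimit = 1000            # read only the first 1000 rows (after numLinesToSkip)
      inferSchema = true         # infer column types (default)
      excerptSize = 100          # sample size for schema inference
    }
  }
}
```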
Foreign key definition
target database, if not defined it is assumed to be the same as the table owning the foreign key
referenced target table name
mapping of source column(s) to referenced target table column(s)
optional name for the foreign key, e.g. to depict its role
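A hedged sketch of how such a foreign key definition could look inside a table configuration. The field names db, table, columns and name follow the descriptions above; the exact HOCON shape, in particular the column-mapping syntax, is an assumption:

```
table {
  name = orders
  foreignKeys = [{
    db = refdb                    # optional; defaults to the database of the owning table
    table = customers             # referenced target table name
    columns { customer_id = id }  # source column -> referenced target column (assumed syntax)
    name = fk_orders_customer     # optional role name
  }]
}
```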
DataObject of type Hive. Provides details to access Hive tables to an Action
unique name of this data object
hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.
partition columns for this data object
enable compute statistics after writing data (default=false)
type of date column
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
hive table to be written by this output
number of files created when writing into an empty table (otherwise the number will be derived from the existing data)
spark SaveMode to use when writing files, default is "overwrite"
override the connection's permissions for files created in the table's hadoop directory
optional id of io.smartdatalake.workflow.connection.HiveTableConnection
meta data
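The fields above could be combined into a configuration like the following sketch; key names such as path, partitions, saveMode and connectionId are assumptions based on the field descriptions:

```
dataObjects = {
  my-hive-table {
    type = HiveTableDataObject
    path = /data/mydb/mytable   # hadoop directory for this table
    partitions = [dt]           # partition columns (assumed key)
    saveMode = overwrite        # default
    table {
      db = mydb
      name = mytable
    }
    connectionId = my-hive-conn # id of a HiveTableConnection (assumed key)
  }
}
```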
DataObject of type JDBC. Provides details for an action to access tables in a database through JDBC.
unique name of this data object
DDL-statement to be executed in prepare phase
SQL-statement to be executed before writing to table
SQL-statement to be executed after writing to table
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
The jdbc table to be read
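A hedged sketch of a JDBC table data object. The key names for the connection id and the SQL statements are assumptions derived from the field descriptions above:

```
dataObjects = {
  my-jdbc-table {
    type = JdbcTableDataObject
    connectionId = my-jdbc-conn   # assumed key for the JDBC connection id
    table {
      db = mydb
      name = mytable
    }
    # DDL executed in the prepare phase (assumed key)
    createSql = "create table if not exists mydb.mytable (id integer)"
    # executed before writing to the table (assumed key)
    preWriteSql = "delete from mydb.mytable"
  }
}
```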
DataObject of type JMS queue. Provides details to an Action to access JMS queues.
JNDI Context Factory
JNDI Provider URL
authentication information; for now only BasicAuthMode is supported.
JMS batch size
JMS Connection Factory
Name of MQ Queue
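The JMS fields above could be sketched as follows; the type name, all key names, and the authMode shape are assumptions based on the field descriptions:

```
dataObjects = {
  my-jms-queue {
    type = JmsDataObject                  # assumed type name
    contextFactory = "org.apache.activemq.jndi.ActiveMQInitialContextFactory"  # JNDI Context Factory
    providerUrl = "tcp://mq-host:61616"   # JNDI Provider URL
    batchSize = 100                       # JMS batch size
    connectionFactory = ConnectionFactory # JMS Connection Factory
    queue = MY.INPUT.QUEUE                # name of the MQ queue
    authMode {
      type = BasicAuthMode                # per the description, only BasicAuthMode is supported
    }
  }
}
```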
A io.smartdatalake.workflow.dataobject.DataObject backed by a JSON data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on JSON formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
Set the data type for all values to string.
By default, the JSON option multiline
is enabled.
org.apache.spark.sql.DataFrameWriter
org.apache.spark.sql.DataFrameReader
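A JSON file data object might be configured like this sketch. The type name JsonFileDataObject and the path, stringify and jsonOptions keys are assumptions; multiLine is the Spark JSON option mentioned above:

```
dataObjects = {
  my-json-input {
    type = JsonFileDataObject   # assumed type name
    path = /data/json-input     # assumed key
    stringify = true            # set the data type for all values to string (assumed key)
    jsonOptions {
      multiLine = true          # enabled by default for this data object
    }
  }
}
```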
Checks for Primary Key violations for all DataObjects with Primary Keys defined that are registered in the current InstanceRegistry. Returns the list of Primary Key violations as a DataFrame.
Alternatively, it can check for Primary Key violations of all DataObjects defined in config files. For this, the configuration "config" has to be set to the location of the config.
Example:
```
dataObjects = {
  ...
  primarykey-violations {
    type = PKViolatorsDataObject
    config = path/to/myconfiguration.conf
  }
  ...
}
```
Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.
A io.smartdatalake.workflow.dataobject.DataObject backed by an Apache Parquet data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Parquet formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.
unique name of this data object
Hadoop directory where this data object reads/writes its files. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied. Optionally defined partitions are appended with hadoop standard partition layout to this path. Only files ending with *.parquet* are considered as data for this DataObject.
partition columns for this data object
spark SaveMode to use when writing files, default is "overwrite"
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
override the connection's permissions for files created with this connection
optional id of io.smartdatalake.workflow.connection.HadoopFileConnection
Metadata describing this data object.
org.apache.spark.sql.DataFrameWriter
org.apache.spark.sql.DataFrameReader
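The fields above might be combined as in this sketch; the type name ParquetFileDataObject and the partitions/connectionId keys are assumptions based on the field descriptions:

```
dataObjects = {
  my-parquet-files {
    type = ParquetFileDataObject  # assumed type name
    path = /data/parquet/mytable  # partitions are appended in hadoop standard layout
    partitions = [dt]             # partition columns (assumed key)
    saveMode = append             # default is overwrite
    connectionId = my-hadoop-conn # id of a HadoopFileConnection (assumed key)
  }
}
```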
DataObject of type raw for files with unknown content. Provides details to an Action to access raw files.
Overwrite or Append new data.
Connects to SFtp files. Requires the Java library "com.hieronymus % sshj % 0.21.1". The following authentication mechanisms are supported: public/private key (the private key must be saved in ~/.ssh and the public key must be registered on the server) and user/password authentication (user and password are taken from two variables set as parameters; these variables can come from clear text (CLEAR), a file (FILE), or an environment variable (ENV)).
partition layout defines how partition values can be extracted from the path. Use "%<colname>%" as a token to extract the value for a partition column. With "%<colname:regex>%" a regex can be given to limit the search. This is especially useful if there is no character to delimit the last token from the rest of the path, or between two tokens.
Overwrite or Append new data.
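For user/password authentication, the two variables could be wired up roughly as follows. The CLEAR/FILE/ENV sources follow the description above, while the type name, key names, and the prefix syntax for selecting a source are assumptions:

```
connections {
  my-sftp {
    type = SFtpFileRefConnection           # assumed type name
    host = sftp.example.com                # assumed key
    authMode {
      type = BasicAuthMode
      user = "ENV#SFTP_USER"               # value taken from an environment variable (assumed syntax)
      password = "FILE#/etc/secrets/sftp"  # value read from a file (assumed syntax)
    }
  }
}
```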
DataObject of type Splunk. Provides details to an action to access Splunk logs.
Table attributes
optional override of db defined by connection
table name
optional select query
optional sequence of primary key columns
optional sequence of foreign key definitions. This is used as metadata for a data catalog.
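Based on the attributes above, a table definition could be sketched as follows; the field names db, name, query and primaryKey follow the descriptions, while the exact HOCON shape is an assumption:

```
table {
  db = mydb                                    # optional override of the db defined by the connection
  name = mytable                               # table name
  query = "select id, name from mydb.mytable"  # optional select query
  primaryKey = [id]                            # optional primary key columns
}
```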
DataObject to call a webservice and return the response as an InputStream. This is implemented as a FileRefDataObject because the response is treated as file content. FileRefDataObjects support partitioned data; for a WebserviceFileDataObject, partitions are mapped as query parameters to create the query string. All possible query parameter values must be given in the configuration.
list of partitions with list of possible values for every entry
definition of partitions in query string. Use %<partitionColName>% as placeholder for partition column value in layout.
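The partition mapping described above could look like the following sketch. The key names url, partitions, partitionValues and partitionLayout are assumptions, while the %<colname>% placeholder follows the description:

```
dataObjects = {
  my-webservice {
    type = WebserviceFileDataObject
    url = "https://example.com/api/data"                # assumed key
    partitions = [country]                              # partition columns (assumed key)
    partitionValues = [{country = CH}, {country = DE}]  # possible values per partition (assumed key)
    partitionLayout = "?country=%country%"              # partition value mapped into the query string
  }
}
```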
A io.smartdatalake.workflow.dataobject.DataObject backed by an XML data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on XML formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementations are provided by the Databricks spark-xml project.
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
org.apache.spark.sql.DataFrameWriter
org.apache.spark.sql.DataFrameReader
DataObject of type JDBC / Access. Provides an Action with access to an Access DB. The functionality is handled separately from JdbcTableDataObject to avoid problems with net.ucanaccess.jdbc.UcanaccessDriver.