io.smartdatalake.workflow.dataobject
DataObject of type raw for files with unknown content. Provides details to an Action to access raw files.
Definition of fileName. This is concatenated with path and partition layout to search for files. Default is an asterisk to match everything.
Overwrite or append new data.
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
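For illustration, such a DataObject might be instantiated in Scala roughly as follows. This is a minimal sketch: the id, path and partition column dt are hypothetical, the condition syntax is illustrative, and the exact constructor signature may vary between SDL versions.

  import io.smartdatalake.config.InstanceRegistry
  import io.smartdatalake.workflow.dataobject.RawFileDataObject

  // An InstanceRegistry is normally created by SDL from the parsed configuration.
  implicit val registry: InstanceRegistry = new InstanceRegistry()

  val rawFiles = RawFileDataObject(
    id = "rawFiles",                 // hypothetical DataObject id
    path = "/data/raw/events",       // hypothetical root path
    fileName = "*.csv",              // default would be "*" (match everything)
    // expect only partitions after 2020 to exist (assumes partition column "dt")
    expectedPartitionsCondition = Some("elements['dt'] > '20200101'")
  )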
Return the ACL definition for the Hadoop path of this DataObject
See also: org.apache.hadoop.fs.permission.AclEntry
Check if the input files exist.
Throws: IllegalArgumentException if failIfFilesMissing = true and no files are found at path.
Return the connection id.
Connection defines path prefix (scheme, authority, base path) and ACLs in a central location.
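For example, a Hadoop connection carrying the path prefix could be defined once and referenced by its id from several DataObjects. A sketch, assuming SDL's HadoopFileConnection, with hypothetical id and prefix values:

  import io.smartdatalake.workflow.connection.HadoopFileConnection

  // Central definition of scheme, authority and base path (hypothetical values).
  val dataLake = HadoopFileConnection(
    id = "dataLake",
    pathPrefix = "hdfs://namenode:8020/data/lake"
  )
  // DataObjects referencing connectionId = "dataLake" resolve their path
  // relative to this prefix and can inherit ACLs defined on the connection.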
Create an empty partition.
Create empty partitions for partition values that do not yet exist.
Delete all data. This is used to implement SaveMode.Overwrite.
Delete given files. This is used to cleanup files after they are processed.
Delete Hadoop Partitions.
Note that this is only possible if every set of column names in partitionValues is a valid "init" of this DataObject's partitions.
Every valid "init" can be produced by repeatedly removing the last element of a collection. Example: a,b of a,b,c -> OK; a,c of a,b,c -> NOK.
See also: scala.collection.TraversableLike.init
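In plain Scala terms, init drops the last element of a collection, so repeated application yields exactly the valid column-name prefixes:

  val partitionCols = Seq("a", "b", "c")
  partitionCols.init        // Seq("a", "b")  -> valid set of partition columns
  partitionCols.init.init   // Seq("a")       -> also valid
  // Seq("a", "c") is not an init of Seq("a", "b", "c") -> not allowed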
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
Extract partition values from a given file path
Returns the factory that can parse this type (that is, type CO).
Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.
Returns: the factory (object) for this class.
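A rough sketch of this pattern for a hypothetical MyDataObject (the exact FromConfigFactory signature differs between SDL versions):

  import com.typesafe.config.Config
  import io.smartdatalake.config.{FromConfigFactory, InstanceRegistry, ParsableFromConfig}

  case class MyDataObject(id: String, path: String) extends ParsableFromConfig[MyDataObject] {
    // Return the companion object, which acts as the parsing factory.
    override def factory: FromConfigFactory[MyDataObject] = MyDataObject
  }

  object MyDataObject extends FromConfigFactory[MyDataObject] {
    // Parse an instance from its Typesafe Config section (body elided).
    override def fromConfig(config: Config)(implicit registry: InstanceRegistry): MyDataObject = ???
  }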
Configure whether io.smartdatalake.workflow.action.Actions should fail if the input file(s) are missing on the file system.
Default is false.
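For example, assuming the flag is exposed as a constructor parameter, as for the other file-based DataObjects in this package (id and path hypothetical):

  // Abort the run instead of silently processing nothing when inputs are absent.
  val strictInput = RawFileDataObject(
    id = "strictInput",
    path = "/data/incoming",
    failIfFilesMissing = true   // default is false
  )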
Definition of fileName. This is concatenated with path and partition layout to search for files. Default is an asterisk to match everything.
Create a Hadoop FileSystem API handle for the provided SparkSession.
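With the plain Hadoop API, such a handle can be derived from the session's Hadoop configuration; a sketch of the underlying call (the method in this trait may differ in details):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.SparkSession

  def getFileSystem(path: String)(implicit session: SparkSession): FileSystem =
    // Resolves the FileSystem implementation matching the path's scheme/authority.
    new Path(path).getFileSystem(session.sparkContext.hadoopConfiguration)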
Filter list of partition values by expected partitions condition
Handle class cast exception when getting objects from instance registry
List files for given partition values
List of partition values to be filtered. If empty, all files in the root path of the DataObject will be listed.
Returns: a list of FileRefs.
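Reusing the rawFiles object sketched above, a call might look as follows (getFileRefs is the assumed method name; the partition column dt is hypothetical):

  import io.smartdatalake.util.hdfs.PartitionValues
  import org.apache.spark.sql.SparkSession

  implicit val session: SparkSession = SparkSession.builder().getOrCreate()

  // List files of one partition; an empty Seq would list the whole root path.
  val refs = rawFiles.getFileRefs(Seq(PartitionValues(Map("dt" -> "20200101"))))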
Get partition values formatted by partition layout.
Method for subclasses to override the base path for this DataObject. This is for instance needed if pathPrefix is defined in a connection.
Prepare paths to be searched.
A unique identifier for this instance.
Return the InstanceRegistry parsed from the SDL configuration used for this run.
Returns: the current InstanceRegistry.
List partitions on data object's root path
Additional metadata for the DataObject
Return a String specifying the partition layout.
For Hadoop the default partition layout is colname1=<value1>/colname2=<value2>/.../
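As a worked example, for partition columns dt and region the default layout resolves like this:

  // Default Hadoop partition layout for columns dt and region (illustrative):
  val layout = Seq("dt", "region").map(c => s"$c=<value>/").mkString
  // layout == "dt=<value>/region=<value>/"
  // A file of partition (dt=20200101, region=EU) is therefore searched under
  // <path>/dt=20200101/region=EU/<fileName>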
Definition of partition columns
The root path of the files that are handled by this DataObject.
Runs operations after reading from DataObject
Runs operations after writing to DataObject
Runs operations before reading from DataObject
Runs operations before writing to DataObject. Note: As the transformed SubFeed doesn't yet exist in Action.preWrite, no partition values can be passed as parameters as in preRead.
Prepare & test DataObject's prerequisits
Prepare & test DataObject's prerequisits
This runs during the "prepare" operation of the DAG.
Overwrite or append new data.
Default separator for paths.
Given some FileRefs for another DataObject, translate the paths to the root path of this DataObject