Package io.smartdatalake.definitions

package definitions

Type Members

  1. sealed trait AuthMode extends AnyRef

    Authentication modes define how an application authenticates itself to a given data object/connection

    You need to define one of the AuthMode subclasses as type, e.g.

    authMode {
      type = BasicAuthMode
      userVariable = myUser
      passwordVariable = myPassword
    }
  2. case class BasicAuthMode(userVariable: String, passwordVariable: String) extends AuthMode with Product with Serializable

    Derive options for various connection types to connect by basic authentication

  3. case class DefaultExecutionModeExpressionData(feed: String, application: String, runId: Int, attemptId: Int, referenceTimestamp: Option[Timestamp], runStartTime: Timestamp, attemptStartTime: Timestamp, givenPartitionValues: Seq[Map[String, String]], isStartNode: Boolean) extends Product with Serializable

    Attributes definition for spark expressions used as ExecutionMode conditions.

    givenPartitionValues

    Partition values specified on the command line (start action) or passed from the previous action

    isStartNode

    True if the current action is a start node of the DAG.
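
    For illustration, an execution mode condition referencing these attributes might look as follows (the expression itself is an illustrative assumption, not taken from this page):

    applyCondition = "isStartNode and size(givenPartitionValues) > 0"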

  4. sealed trait ExecutionMode extends SmartDataLakeLogger

    Execution mode defines how data is selected when running a data pipeline. You need to select one of the subclasses by defining type, e.g.

    executionMode = {
      type = SparkIncrementalMode
      compareCol = "id"
    }
  5. case class FailIfNoPartitionValuesMode() extends ExecutionMode with Product with Serializable

    An execution mode which just validates that partition values are given. Note: For start nodes of the DAG, partition values can be defined on the command line; for subsequent nodes, partition values are passed on from previous nodes.
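
    A minimal configuration sketch:

    executionMode = {
      type = FailIfNoPartitionValuesMode
    }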

  6. case class PartitionDiffMode(partitionColNb: Option[Int] = None, alternativeOutputId: Option[DataObjectId] = None, nbOfPartitionValuesPerRun: Option[Int] = None, applyCondition: Option[String] = None, failCondition: Option[String] = None) extends ExecutionMode with ExecutionModeWithMainInputOutput with Product with Serializable

    Partition difference execution mode lists partitions on the mainInput and mainOutput DataObject and starts loading all missing partitions. Partition columns to be used for the comparison need to be a common 'init' of the input and output partition columns. This mode needs mainInput/mainOutput DataObjects implementing CanHandlePartitions to list partitions. Partition values are passed on to following actions for the partition columns they have in common.

    partitionColNb

    optional number of partition columns to use as a common 'init'.

    alternativeOutputId

    optional alternative outputId of DataObject later in the DAG. This replaces the mainOutputId. It can be used to ensure processing all partitions over multiple actions in case of errors.

    nbOfPartitionValuesPerRun

    optional restriction of the number of partition values per run.

    applyCondition

    Condition to decide if execution mode should be applied or not. Define a spark sql expression working with attributes of DefaultExecutionModeExpressionData returning a boolean. Default is to apply the execution mode if given partition values (partition values from command line or passed from previous action) are not empty.

    failCondition

    Condition to fail application of execution mode if true. Define a spark sql expression working with attributes of PartitionDiffModeExpressionData returning a boolean. Default is that the application of the PartitionDiffMode does not fail the action. If there is no data to process, the following actions are skipped.
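
    A configuration sketch combining these options (the concrete values and the DataObject id are illustrative assumptions, not taken from this page):

    executionMode = {
      type = PartitionDiffMode
      partitionColNb = 1
      nbOfPartitionValuesPerRun = 10
      alternativeOutputId = "final-output"
      applyCondition = "size(givenPartitionValues) > 0"
    }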

  7. case class PartitionDiffModeExpressionData(feed: String, application: String, runId: Int, attemptId: Int, referenceTimestamp: Option[Timestamp], runStartTime: Timestamp, attemptStartTime: Timestamp, inputPartitionValues: Seq[Map[String, String]], outputPartitionValues: Seq[Map[String, String]], selectedPartitionValues: Seq[Map[String, String]]) extends Product with Serializable

  8. case class PublicKeyAuthMode(userVariable: String) extends AuthMode with Product with Serializable

    Validate by user and private/public key. The private key is read from .ssh.
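
    A configuration sketch (the userVariable value and its ENV# prefix are assumptions, not specified on this page):

    authMode {
      type = PublicKeyAuthMode
      userVariable = "ENV#SFTP_USER"   # assumption: value resolved from an environment variable
    }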

  9. case class SSLCertsAuthMode(keystorePath: String, keystoreType: Option[String], keystorePassVariable: String, truststorePath: String, truststoreType: Option[String], truststorePassVariable: String) extends AuthMode with Product with Serializable

    Validate by SSL certificates: only locations and credentials are configured here. Additional attributes should be supplied via the options map.
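
    A configuration sketch (paths, store types and the ENV# value prefix are illustrative assumptions):

    authMode {
      type = SSLCertsAuthMode
      keystorePath = "/path/to/keystore.jks"
      keystoreType = "JKS"
      keystorePassVariable = "ENV#KEYSTORE_PASSWORD"
      truststorePath = "/path/to/truststore.jks"
      truststoreType = "JKS"
      truststorePassVariable = "ENV#TRUSTSTORE_PASSWORD"
    }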

  10. case class SparkIncrementalMode(compareCol: String, alternativeOutputId: Option[DataObjectId] = None) extends ExecutionMode with ExecutionModeWithMainInputOutput with Product with Serializable

    Compares max entry in "compare column" between mainOutput and mainInput and incrementally loads the delta. This mode works only with SparkSubFeeds. The filter is not propagated to following actions.

    compareCol

    a comparable column name existing in mainInput and mainOutput used to identify the delta. Column content should be bigger for newer records.

    alternativeOutputId

    optional alternative outputId of DataObject later in the DAG. This replaces the mainOutputId. It can be used to ensure processing all partitions over multiple actions in case of errors.
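
    A configuration sketch using alternativeOutputId (column name and DataObject id are illustrative assumptions):

    executionMode = {
      type = SparkIncrementalMode
      compareCol = "last_updated"
      alternativeOutputId = "final-output"
    }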

  11. case class SparkStreamingOnceMode(checkpointLocation: String, inputOptions: Map[String, String] = Map(), outputOptions: Map[String, String] = Map(), outputMode: OutputMode = OutputMode.Append) extends ExecutionMode with Product with Serializable

    Spark streaming execution mode uses Spark Structured Streaming to incrementally execute data loads (trigger=Trigger.Once) and keep track of processed data. This mode needs a DataObject implementing CanCreateStreamingDataFrame and works only with SparkSubFeeds.

    checkpointLocation

    location for checkpoints of streaming query to keep state

    inputOptions

    additional options to apply when reading from the streaming source. These overwrite options set by the DataObjects.

    outputOptions

    additional options to apply when writing to the streaming sink. These overwrite options set by the DataObjects.
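
    A configuration sketch (the checkpoint path is illustrative; maxFilesPerTrigger is a standard Spark file-source option and only an example of what inputOptions might carry):

    executionMode = {
      type = SparkStreamingOnceMode
      checkpointLocation = "/path/to/checkpoints/myAction"
      inputOptions = { maxFilesPerTrigger = "10" }
    }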

  12. case class TokenAuthMode(tokenVariable: String) extends AuthMode with Product with Serializable

    Derive options for various connection types to connect by token
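
    A configuration sketch (the tokenVariable value and its ENV# prefix are assumptions, not specified on this page):

    authMode {
      type = TokenAuthMode
      tokenVariable = "ENV#API_TOKEN"   # assumption: value resolved from an environment variable
    }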

Value Members

  1. object DateColumnType extends Enumeration

    Datatype for date columns in Hive

  2. object Environment

    Environment dependent configurations. They can be set
    - by Java system properties (prefixed with "sdl.", e.g. "sdl.hadoopAuthoritiesWithAclsRequired")
    - by environment variables (prefixed with "SDL_" and camelCase converted to uppercase, e.g. "SDL_HADOOP_AUTHORITIES_WITH_ACLS_REQUIRED")
    - by a custom io.smartdatalake.app.SmartDataLakeBuilder implementation for your environment, which sets these variables directly.

  3. object HiveConventions

    Hive conventions

  4. object HiveTableLocationSuffix extends Enumeration

    Suffix used for alternating parquet HDFS paths (usually in TickTockHiveTableDataObject for integration layer)

  5. object OutputType extends Enumeration

    Options for HDFS output

  6. object SparkIncrementalMode extends Serializable

  7. object TechnicalTableColumn extends Enumeration

    Column names specific to historization of Hive tables
