Class

com.coxautodata.waimak.dataflow.spark

SparkDataFlowExtension

implicit class SparkDataFlowExtension extends Logging

Defines a functional builder for Spark-specific data flows and common functionality such as reading CSV/Parquet/Hive data, adding Spark SQL and Dataset steps, writing data out in various formats, and staging and committing multiple outputs into storage such as HDFS and Hive/Impala.

Linear Supertypes
Logging, AnyRef, Any

Instance Constructors

  1. new SparkDataFlowExtension(sparkDataFlow: SparkDataFlow)


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def alias(from: String, to: String): SparkDataFlow


    Creates an alias for an existing label; the alias will point to the same Dataset. This can be used when reading a table under one name and saving it under another without any transformations.
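
    Example:
    1. A minimal usage sketch (assuming a SparkDataFlow value named flow with an existing label "csv_table1"):
      // "output_table1" points to the same Dataset as "csv_table1"
      flow.alias("csv_table1", "output_table1")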

  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. def debugAsTable(labels: String*): SparkDataFlow


    In Zeppelin it is easier to debug and visualise data as Spark SQL tables. This action performs no data transformations; it only marks labels as SQL tables. Only after execution of the flow is it possible to query these tables.

    labels

    - labels to mark.
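
    Example:
    1. A usage sketch (assuming a SparkDataFlow value named flow with labels "table1" and "table2", and a SparkSession named spark available after execution):
      flow.debugAsTable("table1", "table2")
      // once the flow has been executed, the labels can be queried as SQL tables:
      spark.sql("SELECT * FROM table1").show()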

  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  15. def logAndReturn[A](a: A, msg: String, level: Level): A


    Takes a value of type A and a msg to log, returning a and logging the message at the desired level.

    returns

    a

    Definition Classes
    Logging
  16. def logAndReturn[A](a: A, message: (A) ⇒ String, level: Level): A


    Takes a value of type A and a function message from A to String, and logs the value of invoking message(a) at the level described by the level parameter.

    returns

    a

    Definition Classes
    Logging
    Example:
    1. logAndReturn(1, (num: Int) => s"number: $num", Info)
      // In the log we would see an entry corresponding to "number: 1"
  17. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  18. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  19. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  20. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  21. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  22. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  23. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  24. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  25. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  26. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  28. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  29. final def notify(): Unit

    Definition Classes
    AnyRef
  30. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  31. def open(basePath: String, snapshotFolder: Option[String], outputPrefix: Option[String], labels: Seq[String])(open: (String) ⇒ (DataFrameReader) ⇒ Dataset[_]): SparkDataFlow


    Opens multiple DataSets directly on the folders in the basePath folder. Folders named after the given labels must exist in basePath, and the respective data sets will become inputs of the flow with the same names. It is also possible to specify a prefix for the output labels: e.g. if the name is "table1" and the prefix is "test", the output label will be "test_table1".

    When generated models are used as inputs, they will have a snapshot folder which is the same across all models in the path. Use snapshotFolder to isolate the data of a single snapshot.

    Ex:
      /path/to/tables/table1/snapshot_key=2018_02_12_10_59_21
      /path/to/tables/table1/snapshot_key=2018_02_13_10_00_09
      /path/to/tables/table2/snapshot_key=2018_02_12_10_59_21
      /path/to/tables/table2/snapshot_key=2018_02_13_10_00_09

    There are 2 snapshots of the table1 and table2 tables. To access just one of the snapshots:

      basePath = "/path/to/tables"
      labels = Seq("table1", "table2")
      snapshotFolder = Some("snapshot_key=2018_02_13_10_00_09")
      outputPrefix = None

    This will add 2 inputs to the data flow, "table1" and "table2", without a prefix as outputPrefix is None.

    basePath

    Base path of all the labels

    snapshotFolder

    Optional snapshot folder (including key and value as key=value)

    outputPrefix

    Optional prefix to attach to the flow labels

    labels

    List of labels to open

    open

    - function that given a string can produce a function that takes a DataFrameReader and produces a Dataset
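
    Example:
    1. A sketch of the snapshot scenario above, expressed with the more specialised openParquet variant documented below (assuming a SparkDataFlow value named flow and Parquet data):
      flow.openParquet(
        basePath = "/path/to/tables",
        snapshotFolder = Some("snapshot_key=2018_02_13_10_00_09"),
        outputPrefix = None
      )("table1", "table2")
      // adds the inputs "table1" and "table2" to the flow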

  32. def open(label: String, open: (DataFrameReader) ⇒ Dataset[_], options: Map[String, String]): SparkDataFlow


    A generic action to open a dataset with a given label by providing a function that maps from a DataFrameReader object to a Dataset. In most cases the user should use a more specialised open function.

    label

    Label of the resulting dataset

    open

    Function that maps from a DataFrameReader object to a Dataset.
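
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow; the JSON path and option are illustrative):
      flow.open("events", reader => reader.json("/data/raw/events"), Map("multiLine" -> "true"))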

  33. def open(label: String, open: (SparkFlowContext) ⇒ Dataset[_]): SparkDataFlow


    A generic action to open a dataset with a given label by providing a function that maps from a SparkFlowContext object to a Dataset. In most cases the user should use a more specialised open function.

    label

    Label of the resulting dataset

    open

    Function that maps from a SparkFlowContext object to a Dataset.
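
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow, and assuming SparkFlowContext exposes the underlying SparkSession as spark):
      flow.open("customers", ctx => ctx.spark.read.parquet("/data/parquet/customers"))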

  34. def openCSV(basePath: String, snapshotFolder: Option[String] = None, outputPrefix: Option[String] = None, options: Map[String, String] = ...)(labels: String*): SparkDataFlow


    Opens CSV folders as data sets. See the parent open function for a complete description.

    basePath

    Base path of all the labels

    snapshotFolder

    Optional snapshot folder below table folder

    outputPrefix

    Optional prefix to attach to the dataset label

    options

    Options for the DataFrameReader

    labels

    List of labels/folders to open
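
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow and CSV folders table1 and table2 under /data/csv):
      // adds the inputs "raw_table1" and "raw_table2"
      flow.openCSV("/data/csv", outputPrefix = Some("raw"), options = Map("header" -> "true", "inferSchema" -> "true"))("table1", "table2")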

  35. def openFileCSV(path: String, label: String, options: Map[String, String] = ...): SparkDataFlow


    Open a CSV file based on a complete path.

    path

    Complete path of the CSV file(s) (can include glob)

    label

    Label to attach to the dataset

    options

    Options for the DataFrameReader

  36. def openFileParquet(path: String, label: String, options: Map[String, String] = Map()): SparkDataFlow


    Open Parquet file(s) based on a complete path.

    path

    Complete path of the parquet file(s) (can include glob)

    label

    Label to attach to the dataset

    options

    Options for the DataFrameReader
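
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow; the path is illustrative and may include a glob):
      flow.openFileParquet("/data/parquet/sales/2018/*", "sales_2018")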

  37. def openParquet(basePath: String, snapshotFolder: Option[String] = None, outputPrefix: Option[String] = None, options: Map[String, String] = Map())(labels: String*): SparkDataFlow


    Opens Parquet-based folders using open(). See the parent open function for a complete description.

    basePath

    Base path of all the labels

    snapshotFolder

    Optional snapshot folder below table folder

    outputPrefix

    Optional prefix to attach to the dataset label

    options

    Options for the DataFrameReader

    labels

    List of labels/folders to open

  38. def openTable(dbName: String, outputPrefix: Option[String] = None)(tables: String*): SparkDataFlow


    Opens multiple Hive/Impala tables. Table names become Waimak labels, which can be prefixed.

    dbName

    - name of the database that contains the table

    outputPrefix

    - optional prefix for the waimak label

    tables

    - list of table names in Hive/Impala that will also become waimak labels
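
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow and a Hive database named analytics):
      // adds prefixed inputs for both tables, e.g. "src_customers"
      flow.openTable("analytics", outputPrefix = Some("src"))("customers", "orders")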

  39. def partitionSort(input: String, output: String)(partitionCol: String*)(sortCols: String*): SparkDataFlow


    Before writing out data with partition folders, the Dataset needs to be reshuffled to avoid lots of small files in each folder. Optionally, it can also be sorted within each partition.

    This can also be used to solve the Secondary Sort problem: use mapPartitions on the output.

    partitionCol

    - columns to repartition/shuffle input data set

    sortCols

    - optional columns to sort by within each partition
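
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with an input label "events"; column names are illustrative):
      // repartition "events" by event_date and sort each partition by event_time
      flow.partitionSort("events", "events_sorted")("event_date")("event_time")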

  40. def printSchema(label: String): SparkDataFlow


    Prints the Dataset's schema to the console.

  41. def show(label: String): SparkDataFlow


    Adds an action that prints the first 10 lines of the input to the console. Useful for debugging and development purposes.

  42. def sql(input: String, inputs: String*)(outputLabel: String, sqlQuery: String, dropColumns: String*): SparkDataFlow


    Executes Spark SQL. All input labels are automatically registered as SQL tables.

    inputs

    - required input labels

    outputLabel

    - label of the output transformation

    sqlQuery

    - sql code that uses labels as table names

    dropColumns

    - optional list of columns to drop after transformation
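
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with input labels "orders" and "customers"; the query columns are illustrative):
      flow.sql("orders", "customers")(
        "orders_enriched",
        "SELECT o.*, c.region FROM orders o JOIN customers c ON o.customer_id = c.id"
      )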

  43. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  44. def toString(): String

    Definition Classes
    AnyRef → Any
  45. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String, l: String, n: String, o: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 12 input Datasets to 1 output Dataset using function f, which is a Scala function.

  46. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String, l: String, n: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 11 input Datasets to 1 output Dataset using function f, which is a Scala function.

  47. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String, l: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 10 input Datasets to 1 output Dataset using function f, which is a Scala function.

  48. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 9 input Datasets to 1 output Dataset using function f, which is a Scala function.

  49. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 8 input Datasets to 1 output Dataset using function f, which is a Scala function.

  50. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 7 input Datasets to 1 output Dataset using function f, which is a Scala function.

  51. def transform(a: String, b: String, c: String, d: String, e: String, g: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 6 input Datasets to 1 output Dataset using function f, which is a Scala function.

  52. def transform(a: String, b: String, c: String, d: String, e: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 5 input Datasets to 1 output Dataset using function f, which is a Scala function.

  53. def transform(a: String, b: String, c: String, d: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 4 input Datasets to 1 output Dataset using function f, which is a Scala function.

  54. def transform(a: String, b: String, c: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 3 input Datasets to 1 output Dataset using function f, which is a Scala function.

  55. def transform(a: String, b: String)(output: String)(f: (Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 2 input Datasets to 1 output Dataset using function f, which is a Scala function.
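
    Example:
    1. A sketch of the two-input variant (assuming a SparkDataFlow value named flow with input labels "orders" and "customers" sharing a customer_id column):
      flow.transform("orders", "customers")("orders_with_customers") { (orders, customers) =>
        orders.join(customers, "customer_id")
      }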

  56. def transform(a: String)(output: String)(f: (Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 1 input Dataset to 1 output Dataset using function f, which is a Scala function.

  57. def typedTransform[T](input: String)(output: String)(f: (Dataset[_]) ⇒ T): SparkDataFlow


    Transforms an input dataset to an instance of type T.

    T

    the type of the output of the transform function

    input

    the input label

    output

    the output label

    f

    the transform function

    returns

    a new SparkDataFlow with the action added
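
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with an input label "events", a case class Event matching its schema, and Spark encoders in scope via import spark.implicits._):
      flow.typedTransform[Dataset[Event]]("events")("typed_events")(_.as[Event])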

  58. def unitTransform(input: String)(f: (Dataset[_]) ⇒ Unit, actionName: String = "unit transform"): SparkDataFlow


    Takes a dataset and performs a function with side effects (Unit return type).

    input

    the input label

    f

    the side-effecting function

    actionName

    the name of the action

    returns

    a new SparkDataFlow with the action added
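
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with an input label "errors"):
      // log the number of error records as a side effect
      flow.unitTransform("errors")(ds => println(s"error records: ${ds.count()}"), "count errors")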

  59. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  60. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  61. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  62. def write(label: String, pre: (Dataset[_]) ⇒ Dataset[_], dfr: (DataFrameWriter[_]) ⇒ Unit): SparkDataFlow


    Base function for all write operations on the current data flow; in most cases users should use a more specialised one.

    label

    - label whose data set will be written out

    pre

    - dataset transformation function

    dfr

    - dataframe writer function

  63. def writeAsNamedFiles(label: String, basePath: String, numberOfFiles: Int, filenamePrefix: String, format: String = "parquet", options: Map[String, String] = Map.empty): SparkDataFlow


    Write a file or files with a specific filename to a folder. Allows you to control the final output filename without the Spark-generated part UUIDs. The filename will be $filenamePrefix.extension if the number of files is 1, otherwise $filenamePrefix.$fileNumber.extension, where the file number is incremental and zero-padded.

    label

    Label to write

    basePath

    Base path to write to

    numberOfFiles

    Number of files to generate

    filenamePrefix

    Prefix of name of the file up to the filenumber and extension

    format

    Format to write (e.g. parquet, csv) Default: parquet

    options

    Options to pass to the DataFrameWriter Default: Empty map
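
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with a label "report"):
      // writes the "report" dataset as two named CSV files under /output/reports
      flow.writeAsNamedFiles("report", "/output/reports", 2, "report", format = "csv", options = Map("header" -> "true"))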

  64. def writeCSV(basePath: String, options: Map[String, String] = Map.empty, overwrite: Boolean = false, numFiles: Option[Int] = Some(1))(labels: String*): SparkDataFlow


    Writes out data sets as CSV.

    basePath

    - path in which folders will be created

    options

    - list of options to apply to the dataframewriter

    overwrite

    - whether to overwrite existing data

    numFiles

    - number of files to produce as output

    labels

    - labels whose data set will be written out
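
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with labels "summary" and "detail"):
      // writes each label into its own folder under /output/csv, one CSV file per label
      flow.writeCSV("/output/csv", Map("header" -> "true"), overwrite = true)("summary", "detail")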

  65. def writeHiveManagedTable(database: String, overwrite: Boolean = false)(labels: String*): SparkDataFlow


    Writes out the dataset to a Hive-managed table. Data will be written out to the default Hive warehouse location as specified in the hive-site configuration. Table metadata is generated from the dataset schema, and tables and schemas can be overwritten by setting the optional overwrite flag to true.

    It is recommended to only use this action in non-production flows as it offers no mechanism for managing snapshots or cleanly committing table definitions.

    database

    - Hive database to create the table in

    overwrite

    - Whether to overwrite existing data and recreate table schemas if they already exist

    labels

    - List of labels to create as Hive tables. They will all be created in the same database
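
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with labels "dim_customer" and "fact_orders", and a Hive database named staging):
      flow.writeHiveManagedTable("staging", overwrite = true)("dim_customer", "fact_orders")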

  66. def writeParquet(basePath: String, overwrite: Boolean = false)(labels: String*): SparkDataFlow


    Writes multiple datasets as Parquet files into basePath. Names of the labels will become names of the folders under the basePath.

    basePath

    - path in which folders will be created

    overwrite

    - if true, overwrite the existing data. Defaults to false

    labels

    - labels to write as parquets, labels will become folder names
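
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with labels "table1" and "table2"):
      // creates the folders /output/parquet/table1 and /output/parquet/table2
      flow.writeParquet("/output/parquet")("table1", "table2")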

  67. def writePartitionedCSV(basePath: String, repartition: Boolean = true, options: Map[String, String] = Map.empty)(label: String, partitionColumns: String*): SparkDataFlow


    Writes out a data set as CSV, optionally partitioned by the given columns.

    basePath

    - base path of the label, label will be added to it

    repartition

    - repartition dataframe on partition columns

    options

    - list of options to apply to the dataframewriter

    label

    - label whose data set will be written out

    partitionColumns

    - optional list of partition columns, which will become partition folders

  68. def writePartitionedParquet(basePath: String, repartition: Int)(label: String): SparkDataFlow


    Writes out a data set as Parquet, repartitioned to a fixed number of partitions.

    basePath

    - base path of the label, label will be added to it

    repartition

    - repartition dataframe by a number of partitions

    label

    - label whose data set will be written out

  69. def writePartitionedParquet(basePath: String, repartition: Boolean = true)(label: String, partitionColumns: String*): SparkDataFlow


    Writes out a data set as Parquet, optionally partitioned by the given columns.

    basePath

    - base path of the label, label will be added to it

    repartition

    - repartition dataframe on partition columns

    label

    - label whose data set will be written out

    partitionColumns

    - optional list of partition columns, which will become partition folders
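
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with a label "events" containing an event_date column):
      // repartitions "events" on event_date and writes it under /output/parquet/events with event_date=... partition folders
      flow.writePartitionedParquet("/output/parquet")("events", "event_date")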
