Class

com.coxautodata.waimak.dataflow.spark

SparkDataFlowExtension

implicit class SparkDataFlowExtension extends Logging

Defines a functional builder for Spark-specific data flows and common functionality such as reading CSV/Parquet/Hive data, adding Spark SQL and Dataset steps, writing data out in various formats, and staging and committing multiple outputs into storage such as HDFS and Hive/Impala.

Linear Supertypes
Logging, AnyRef, Any

Instance Constructors

  1. new SparkDataFlowExtension(sparkDataFlow: SparkDataFlow)


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def alias(from: String, to: String): SparkDataFlow


    Creates an alias for an existing label; the alias will point to the same Dataset. This can be used when reading a table under one name and saving it under another without any transformations.
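
    Example:
    1. A minimal usage sketch (assuming a SparkDataFlow value named flow with an existing label "csv_table1"):
      // "output_table1" points to the same Dataset as "csv_table1"
      flow.alias("csv_table1", "output_table1")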

  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. def debugAsTable(labels: String*): SparkDataFlow


    In Zeppelin it is easier to debug and visualise data as Spark SQL tables. This action performs no data transformations; it only marks labels as SQL tables. Only after execution of the flow is it possible to query these tables.

    labels

    - labels to mark.
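
    Example:
    1. A usage sketch (assuming a SparkDataFlow value named flow with labels "table1" and "table2", and a SparkSession named spark available after execution):
      flow.debugAsTable("table1", "table2")
      // once the flow has been executed, the labels can be queried as SQL tables:
      spark.sql("SELECT * FROM table1").show()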

  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  15. def logAndReturn[A](a: A, msg: String, level: Level): A


    Takes a value of type A and a msg to log, returning a and logging the message at the desired level.

    returns

    a

    Definition Classes
    Logging
  16. def logAndReturn[A](a: A, message: (A) ⇒ String, level: Level): A


    Takes a value of type A and a function message from A to String, and logs the value of invoking message(a) at the level described by the level parameter.

    returns

    a

    Definition Classes
    Logging
    Example:
    1. logAndReturn(1, (num: Int) => s"number: $num", Info)
      // In the log we would see an entry corresponding to "number: 1"
  17. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  18. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  19. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  20. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  21. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  22. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  23. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  24. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  25. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  26. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  28. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  29. final def notify(): Unit

    Definition Classes
    AnyRef
  30. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  31. def open(basePath: String, snapshotFolder: Option[String], outputPrefix: Option[String], labels: Seq[String])(open: (String) ⇒ (DataFrameReader) ⇒ Dataset[_]): SparkDataFlow


    Opens multiple DataSets directly on the folders in the basePath folder. Folders named after the given labels must exist in basePath, and the respective data sets will become inputs of the flow with the same names. It is also possible to specify a prefix for the output labels: e.g. if the name is "table1" and the prefix is "test", the output label will be "test_table1".

    When generated models are used as inputs, they will have a snapshot folder which is the same across all models in the path. Use snapshotFolder to isolate the data of a single snapshot.

    Ex:
      /path/to/tables/table1/snapshot_key=2018_02_12_10_59_21
      /path/to/tables/table1/snapshot_key=2018_02_13_10_00_09
      /path/to/tables/table2/snapshot_key=2018_02_12_10_59_21
      /path/to/tables/table2/snapshot_key=2018_02_13_10_00_09

    There are 2 snapshots of the table1 and table2 tables. To access just one of the snapshots:

      basePath = "/path/to/tables"
      labels = Seq("table1", "table2")
      snapshotFolder = Some("snapshot_key=2018_02_13_10_00_09")
      outputPrefix = None

    This will add 2 inputs to the data flow, "table1" and "table2", without a prefix as outputPrefix is None.

    basePath

    Base path of all the labels

    snapshotFolder

    Optional snapshot folder (including key and value as key=value)

    outputPrefix

    Optional prefix to attach to the flow labels

    labels

    List of labels to open

    open

    - function that given a string can produce a function that takes a DataFrameReader and produces a Dataset
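
    Example:
    1. A sketch of the snapshot scenario above, expressed with the more specialised openParquet variant documented below (assuming a SparkDataFlow value named flow and Parquet data):
      flow.openParquet(
        basePath = "/path/to/tables",
        snapshotFolder = Some("snapshot_key=2018_02_13_10_00_09"),
        outputPrefix = None
      )("table1", "table2")
      // adds the inputs "table1" and "table2" to the flow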

  32. def open(label: String, open: (DataFrameReader) ⇒ Dataset[_], options: Map[String, String]): SparkDataFlow


    A generic action to open a dataset with a given label by providing a function that maps from a DataFrameReader object to a Dataset. In most cases the user should use a more specialised open function.

    label

    Label of the resulting dataset

    open

    Function that maps from a DataFrameReader object to a Dataset.
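
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow; the JSON path and option are illustrative):
      flow.open("events", reader => reader.json("/data/raw/events"), Map("multiLine" -> "true"))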

  33. def open(label: String, open: (SparkFlowContext) ⇒ Dataset[_]): SparkDataFlow


    A generic action to open a dataset with a given label by providing a function that maps from a SparkFlowContext object to a Dataset. In most cases the user should use a more specialised open function.

    label

    Label of the resulting dataset

    open

    Function that maps from a SparkFlowContext object to a Dataset.
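
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow, and assuming SparkFlowContext exposes the underlying SparkSession as spark):
      flow.open("customers", ctx => ctx.spark.read.parquet("/data/parquet/customers"))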

  34. def openCSV(basePath: String, snapshotFolder: Option[String] = None, outputPrefix: Option[String] = None, options: Map[String, String] = ...)(labels: String*): SparkDataFlow


    Opens CSV folders as data sets. See the parent open function for a complete description.

    basePath

    Base path of all the labels

    snapshotFolder

    Optional snapshot folder below table folder

    outputPrefix

    Optional prefix to attach to the dataset label

    options

    Options for the DataFrameReader

    labels

    List of labels/folders to open
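
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow and CSV folders table1 and table2 under /data/csv):
      // adds the inputs "raw_table1" and "raw_table2"
      flow.openCSV("/data/csv", outputPrefix = Some("raw"), options = Map("header" -> "true", "inferSchema" -> "true"))("table1", "table2")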

  35. def openFileCSV(path: String, label: String, options: Map[String, String] = ...): SparkDataFlow


    Open a CSV file based on a complete path.

    path

    Complete path of the CSV file(s) (can include glob)

    label

    Label to attach to the dataset

    options

    Options for the DataFrameReader

  36. def openFileParquet(path: String, label: String, options: Map[String, String] = Map()): SparkDataFlow


    Open Parquet file(s) based on a complete path.

    path

    Complete path of the parquet file(s) (can include glob)

    label

    Label to attach to the dataset

    options

    Options for the DataFrameReader
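
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow; the path is illustrative and may include a glob):
      flow.openFileParquet("/data/parquet/sales/2018/*", "sales_2018")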

  37. def openParquet(basePath: String, snapshotFolder: Option[String] = None, outputPrefix: Option[String] = None, options: Map[String, String] = Map())(labels: String*): SparkDataFlow


    Opens Parquet-based folders using open(). See the parent open function for a complete description.

    basePath

    Base path of all the labels

    snapshotFolder

    Optional snapshot folder below table folder

    outputPrefix

    Optional prefix to attach to the dataset label

    options

    Options for the DataFrameReader

    labels

    List of labels/folders to open

  38. def openTable(dbName: String, outputPrefix: Option[String] = None)(tables: String*): SparkDataFlow


    Opens multiple Hive/Impala tables. Table names become Waimak labels, which can be prefixed.

    dbName

    - name of the database that contains the table

    outputPrefix

    - optional prefix for the waimak label

    tables

    - list of table names in Hive/Impala that will also become waimak labels
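
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow and a Hive database named analytics):
      // adds prefixed inputs for both tables, e.g. "src_customers"
      flow.openTable("analytics", outputPrefix = Some("src"))("customers", "orders")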

  39. def partitionSort(input: String, output: String)(partitionCol: String*)(sortCols: String*): SparkDataFlow


    Before writing out data with partition folders, the Dataset needs to be reshuffled to avoid lots of small files in each folder. Optionally, it can also be sorted within each partition.

    This can also be used to solve the Secondary Sort problem: use mapPartitions on the output.

    partitionCol

    - columns to repartition/shuffle input data set

    sortCols

    - optional columns to sort by within each partition
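
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with an input label "events"; column names are illustrative):
      // repartition "events" by event_date and sort each partition by event_time
      flow.partitionSort("events", "events_sorted")("event_date")("event_time")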

  40. def printSchema(label: String): SparkDataFlow


    Prints the Dataset's schema to the console.

  41. def show(label: String): SparkDataFlow


    Adds an action that prints the first 10 lines of the input to the console. Useful for debugging and development purposes.

  42. def sql(input: String, inputs: String*)(outputLabel: String, sqlQuery: String, dropColumns: String*): SparkDataFlow


    Executes Spark SQL. All input labels are automatically registered as SQL tables.

    inputs

    - required input labels

    outputLabel

    - label of the output transformation

    sqlQuery

    - sql code that uses labels as table names

    dropColumns

    - optional list of columns to drop after transformation
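
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with input labels "orders" and "customers"; the query columns are illustrative):
      flow.sql("orders", "customers")(
        "orders_enriched",
        "SELECT o.*, c.region FROM orders o JOIN customers c ON o.customer_id = c.id"
      )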

  43. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  44. def toString(): String

    Definition Classes
    AnyRef → Any
  45. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String, l: String, n: String, o: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 12 input Datasets to 1 output Dataset using function f, which is a Scala function.

  46. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String, l: String, n: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 11 input Datasets to 1 output Dataset using function f, which is a Scala function.

  47. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String, l: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 10 input Datasets to 1 output Dataset using function f, which is a Scala function.

  48. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String, k: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 9 input Datasets to 1 output Dataset using function f, which is a Scala function.

  49. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String, i: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 8 input Datasets to 1 output Dataset using function f, which is a Scala function.

  50. def transform(a: String, b: String, c: String, d: String, e: String, g: String, h: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 7 input Datasets to 1 output Dataset using function f, which is a Scala function.

  51. def transform(a: String, b: String, c: String, d: String, e: String, g: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 6 input Datasets to 1 output Dataset using function f, which is a Scala function.

  52. def transform(a: String, b: String, c: String, d: String, e: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 5 input Datasets to 1 output Dataset using function f, which is a Scala function.

  53. def transform(a: String, b: String, c: String, d: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 4 input Datasets to 1 output Dataset using function f, which is a Scala function.

  54. def transform(a: String, b: String, c: String)(output: String)(f: (Dataset[_], Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 3 input Datasets to 1 output Dataset using function f, which is a Scala function.

  55. def transform(a: String, b: String)(output: String)(f: (Dataset[_], Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 2 input Datasets to 1 output Dataset using function f, which is a Scala function.
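
    Example:
    1. A sketch of the two-input variant (assuming a SparkDataFlow value named flow with input labels "orders" and "customers" sharing a customer_id column):
      flow.transform("orders", "customers")("orders_with_customers") { (orders, customers) =>
        orders.join(customers, "customer_id")
      }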

  56. def transform(a: String)(output: String)(f: (Dataset[_]) ⇒ Dataset[_]): SparkDataFlow


    Transforms 1 input Dataset to 1 output Dataset using function f, which is a Scala function.

  57. def typedTransform[T](input: String)(output: String)(f: (Dataset[_]) ⇒ T): SparkDataFlow


    Transforms an input dataset to an instance of type T.

    T

    the type of the output of the transform function

    input

    the input label

    output

    the output label

    f

    the transform function

    returns

    a new SparkDataFlow with the action added
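
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with an input label "events", a case class Event matching its schema, and Spark encoders in scope via import spark.implicits._):
      flow.typedTransform[Dataset[Event]]("events")("typed_events")(_.as[Event])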

  58. def unitTransform(input: String)(f: (Dataset[_]) ⇒ Unit, actionName: String = "unit transform"): SparkDataFlow


    Takes a dataset and performs a function with side effects (Unit return type).

    input

    the input label

    f

    the side-effecting function

    actionName

    the name of the action

    returns

    a new SparkDataFlow with the action added
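
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with an input label "errors"):
      // log the number of error records as a side effect
      flow.unitTransform("errors")(ds => println(s"error records: ${ds.count()}"), "count errors")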

  59. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  60. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  61. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  62. def write(label: String, pre: (Dataset[_]) ⇒ Dataset[_], dfr: (DataFrameWriter[_]) ⇒ Unit): SparkDataFlow


    Base function for all write operations on the current data flow; in most cases users should use a more specialised one.

    label

    - label whose data set will be written out

    pre

    - dataset transformation function

    dfr

    - dataframe writer function

  63. def writeAsNamedFiles(label: String, basePath: String, numberOfFiles: Int, filenamePrefix: String, format: String = "parquet", options: Map[String, String] = Map.empty): SparkDataFlow


    Write a file or files with a specific filename to a folder. Allows you to control the final output filename without the Spark-generated part UUIDs. The filename will be $filenamePrefix.extension if the number of files is 1, otherwise $filenamePrefix.$fileNumber.extension, where the file number is incremental and zero-padded.

    label

    Label to write

    basePath

    Base path to write to

    numberOfFiles

    Number of files to generate

    filenamePrefix

    Prefix of name of the file up to the filenumber and extension

    format

    Format to write (e.g. parquet, csv) Default: parquet

    options

    Options to pass to the DataFrameWriter Default: Empty map
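
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with a label "report"):
      // writes the "report" dataset as two named CSV files under /output/reports
      flow.writeAsNamedFiles("report", "/output/reports", 2, "report", format = "csv", options = Map("header" -> "true"))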

  64. def writeCSV(basePath: String, options: Map[String, String] = Map.empty, overwrite: Boolean = false, numFiles: Option[Int] = Some(1))(labels: String*): SparkDataFlow


    Writes out data sets as CSV.

    basePath

    - path in which folders will be created

    options

    - list of options to apply to the dataframewriter

    overwrite

    - whether to overwrite existing data

    numFiles

    - number of files to produce as output

    labels

    - labels whose data set will be written out
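
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with labels "summary" and "detail"):
      // writes each label into its own folder under /output/csv, one CSV file per label
      flow.writeCSV("/output/csv", Map("header" -> "true"), overwrite = true)("summary", "detail")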

  65. def writeHiveManagedTable(database: String, overwrite: Boolean = false)(labels: String*): SparkDataFlow


    Writes out the dataset to a Hive-managed table. Data will be written out to the default Hive warehouse location as specified in the hive-site configuration. Table metadata is generated from the dataset schema, and tables and schemas can be overwritten by setting the optional overwrite flag to true.

    It is recommended to only use this action in non-production flows as it offers no mechanism for managing snapshots or cleanly committing table definitions.

    database

    - Hive database to create the table in

    overwrite

    - Whether to overwrite existing data and recreate table schemas if they already exist

    labels

    - List of labels to create as Hive tables. They will all be created in the same database
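
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with labels "dim_customer" and "fact_orders", and a Hive database named staging):
      flow.writeHiveManagedTable("staging", overwrite = true)("dim_customer", "fact_orders")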

  66. def writeParquet(basePath: String, overwrite: Boolean = false)(labels: String*): SparkDataFlow


    Writes multiple datasets as Parquet files into basePath. Names of the labels will become names of the folders under the basePath.

    basePath

    - path in which folders will be created

    overwrite

    - if true, overwrite the existing data. Defaults to false

    labels

    - labels to write as parquets, labels will become folder names
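
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with labels "table1" and "table2"):
      // creates the folders /output/parquet/table1 and /output/parquet/table2
      flow.writeParquet("/output/parquet")("table1", "table2")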

  67. def writePartitionedCSV(basePath: String, repartition: Boolean = true, options: Map[String, String] = Map.empty)(label: String, partitionColumns: String*): SparkDataFlow


    Writes out a data set as CSV, optionally partitioned by the given columns.

    basePath

    - base path of the label, label will be added to it

    repartition

    - repartition dataframe on partition columns

    options

    - list of options to apply to the dataframewriter

    label

    - label whose data set will be written out

    partitionColumns

    - optional list of partition columns, which will become partition folders

  68. def writePartitionedParquet(basePath: String, repartition: Int)(label: String): SparkDataFlow


    Writes out a data set as Parquet, repartitioned to a fixed number of partitions.

    basePath

    - base path of the label, label will be added to it

    repartition

    - repartition dataframe by a number of partitions

    label

    - label whose data set will be written out

  69. def writePartitionedParquet(basePath: String, repartition: Boolean = true)(label: String, partitionColumns: String*): SparkDataFlow


    Writes out a data set as Parquet, optionally partitioned by the given columns.

    basePath

    - base path of the label, label will be added to it

    repartition

    - repartition dataframe on partition columns

    label

    - label whose data set will be written out

    partitionColumns

    - optional list of partition columns, which will become partition folders
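
    Example:
    1. A sketch (assuming a SparkDataFlow value named flow with a label "events" containing an event_date column):
      // repartitions "events" on event_date and writes it under /output/parquet/events with event_date=... partition folders
      flow.writePartitionedParquet("/output/parquet")("events", "event_date")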
