Creates an alias for an existing label; the alias will point to the same Dataset. This can be used when reading a table under one name and saving it under another without any transformations.
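Ex (an illustrative sketch; flow is an existing SparkDataFlow on which the label "prices" was opened earlier, and the label names are hypothetical):
flow.alias("prices", "prices_raw") // both labels now point to the same Dataset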
In Zeppelin it is easier to debug and visualise data as Spark SQL tables. This action does no data transformations; it only marks labels as SQL tables. Only after execution of the flow is it possible to query these tables.
- labels to mark.
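Ex (a sketch; the action name debugAsTable is an assumption about how this is exposed):
flow.debugAsTable("prices", "orders") // after the flow executes, the labels can be queried as SQL tables in Zeppelin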
Takes a value of type A and a message to log, returning the value and logging the message at the desired level.
the unchanged input value a
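Ex: logAndReturn(42, "computed the answer", Info) // logs "computed the answer" at Info level and returns 42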
Takes a value of type A and a function message from A to String, and logs the result of invoking message(a) at the level described by the level parameter.
the unchanged input value a
logAndReturn(1, (num: Int) => s"number: $num", Info) // In the log we would see a message corresponding to "number: 1"
Opens multiple DataSets directly on the folders in the basePath folder. Folders with the given names must exist in basePath, and the respective data sets will be inputs of the flow with the same names. It is also possible to specify a prefix for the output labels: e.g. if the name is "table1" and the prefix is "test", the output label will be "test_table1".
If the inputs are generated models, each will have a snapshot folder whose name is the same across all models in the path. Use snapshotFolder to isolate the data of a single snapshot.
Ex:
/path/to/tables/table1/snapshot_key=2018_02_12_10_59_21
/path/to/tables/table1/snapshot_key=2018_02_13_10_00_09
/path/to/tables/table2/snapshot_key=2018_02_12_10_59_21
/path/to/tables/table2/snapshot_key=2018_02_13_10_00_09
There are 2 snapshots of the table1 and table2 tables. To access just one of the snapshots:
basePath = /path/to/tables
names = Seq("table1", "table2")
snapshotFolder = Some("snapshot_key=2018_02_13_10_00_09")
outputPrefix = None
This will add 2 inputs to the data flow, "table1" and "table2", without a prefix since outputPrefix is None.
Base path of all the labels
Optional snapshot folder (including key and value as key=value)
Optional prefix to attach to the flow labels
List of labels to open
- function that, given a label, produces a function that takes a DataFrameReader and produces a Dataset
A generic action to open a dataset with a given label by providing a function that maps from a DataFrameReader object to a Dataset. In most cases the user should use a more specialised open function.
Label of the resulting dataset
Function that maps from a DataFrameReader object to a Dataset.
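Ex (a sketch; the exact parameter shape of open is an assumption):
flow.open("events", reader => reader.option("multiLine", "true").json("/data/events")) // hypothetical path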
A generic action to open a dataset with a given label by providing a function that maps from a SparkFlowContext object to a Dataset. In most cases the user should use a more specialised open function.
Label of the resulting dataset
Function that maps from a SparkFlowContext object to a Dataset.
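Ex (a sketch; assumes SparkFlowContext exposes the underlying SparkSession as spark):
flow.open("events", context => context.spark.read.parquet("/data/events")) // hypothetical path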
Opens CSV folders as data sets. See the parent function for a complete description.
Base path of all the labels
Optional snapshot folder below table folder
Optional prefix to attach to the dataset label
Options for the DataFrameReader
List of labels/folders to open
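Ex (a sketch; parameter names follow the docs above but the exact signature is an assumption):
flow.openCSV("/path/to/tables", options = Map("header" -> "true"))("table1", "table2")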
Open a CSV file based on a complete path.
Complete path of the CSV file(s) (can include glob)
Label to attach to the dataset
Options for the DataFrameReader
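Ex (a sketch; the exact signature is an assumption):
flow.openFileCSV("/data/in/2018_*.csv", "raw_events", Map("header" -> "true"))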
Open a Parquet path based on a complete path.
Complete path of the parquet file(s) (can include glob)
Label to attach to the dataset
Options for the DataFrameReader
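Ex (a sketch; the exact signature is an assumption):
flow.openFileParquet("/data/warehouse/events", "events")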
Opens Parquet-based folders using open(). See the parent function for a complete description.
Base path of all the labels
Optional snapshot folder below table folder
Optional prefix to attach to the dataset label
Options for the DataFrameReader
List of labels/folders to open
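Ex (a sketch; same parameter shape as openCSV above, which is an assumption):
flow.openParquet("/path/to/tables", snapshotFolder = Some("snapshot_key=2018_02_13_10_00_09"))("table1", "table2")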
Opens multiple Hive/Impala tables. Table names become Waimak labels, which can be prefixed.
- name of the database that contains the table
- optional prefix for the Waimak label
- list of table names in Hive/Impala that will also become Waimak labels
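Ex (a sketch; the exact signature is an assumption):
flow.openTable("analytics_db", Some("src"))("orders", "customers") // adds labels "src_orders" and "src_customers"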
Before writing out data with partition folders, the Dataset needs to be reshuffled to avoid lots of small files in each folder. Optionally it can also be sorted within each partition.
This can also be used to solve the Secondary Sort problem: use mapPartitions on the output.
- columns to repartition/shuffle input data set
- optional columns to sort by within each partition
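Ex (a sketch; the curried shape partitionSort(input, output)(partitionCols*)(sortCols*) is an assumption):
flow.partitionSort("events", "events_sorted")("country")("event_time") // one folder-friendly shuffle, sorted within each partition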
Prints DataSet's schema to console.
Adds an action that prints the first 10 rows of the input to the console. Useful for debugging and development purposes.
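Ex (a sketch; single-label signatures for both debug helpers are an assumption):
flow.printSchema("orders").show("orders")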
Executes Spark SQL. All input labels are automatically registered as SQL tables.
- required input labels
- label of the output transformation
- sql code that uses labels as table names
- optional list of columns to drop after transformation
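Ex (a sketch; the curried shape sql(inputs*)(outputLabel, sqlCode, dropColumns*) is an assumption):
flow.sql("orders", "customers")("enriched", "SELECT o.*, c.name FROM orders o JOIN customers c ON o.customer_id = c.id")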
Transforms 12 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 11 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 10 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 9 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 8 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 7 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 6 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 5 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 4 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 3 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 2 input DataSets to 1 output DataSet using function f, which is a Scala function.
Transforms 1 input DataSet to 1 output DataSet using function f, which is a Scala function.
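Ex (a sketch; the curried shape transform(inputs*)(output)(f) is an assumption):
flow.transform("orders", "customers")("joined") { (orders, customers) => orders.join(customers, "customer_id") }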
Transforms an input dataset to an instance of type T.
the type of the output of the transform function
the input label
the output label
the transform function
a new SparkDataFlow with the action added
Takes a dataset and performs a function with side effects (Unit return type).
the input label
the side-effecting function
the name of the action
a new SparkDataFlow with the action added
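Ex (a sketch; the action name sideEffect is hypothetical, standing in for however this is exposed):
flow.sideEffect("orders", ds => require(ds.columns.contains("order_id")), "validate_orders")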
Base function for all write operations on the current data flow; in most cases users should use a more specialised one.
- label whose data set will be written out
- dataset transformation function
- dataframe writer function
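Ex (a sketch; the parameter order follows the docs above but is an assumption):
flow.write("orders", ds => ds.coalesce(1), writer => writer.mode("overwrite").json("/data/out/orders")) // hypothetical path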
Write a file or files with a specific filename to a folder.
Allows you to control the final output filename without the Spark-generated part UUIDs.
The filename will be $filenamePrefix.extension if the number of files is 1, otherwise $filenamePrefix.$fileNumber.extension, where the file number is incremental and zero-padded.
Label to write
Base path to write to
Number of files to generate
Prefix of the file name, up to the file number and extension
Format to write (e.g. parquet, csv) Default: parquet
Options to pass to the DataFrameWriter Default: Empty map
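Ex (a sketch; the exact signature is an assumption):
flow.writeAsNamedFiles("report", "/data/out", 1, "daily_report", "csv", Map("header" -> "true")) // produces daily_report.csv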
Writes out data set as CSV.
- path in which folders will be created
- list of options to apply to the dataframewriter
- whether to overwrite existing data
- number of files to produce as output
- labels whose data set will be written out
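Ex (a sketch; the exact signature is an assumption):
flow.writeCSV("/data/out", Map("header" -> "true"), overwrite = true)("orders", "customers")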
Writes out the dataset to a Hive-managed table. Data will be written out to the default hive warehouse location as specified in the hive-site configuration. Table metadata is generated from the dataset schema, and tables and schemas can be overwritten by setting the optional overwrite flag to true.
It is recommended to only use this action in non-production flows as it offers no mechanism for managing snapshots or cleanly committing table definitions.
- Hive database to create the table in
- Whether to overwrite existing data and recreate table schemas if they already exist
- List of labels to create as Hive tables. They will all be created in the same database
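Ex (a sketch; the exact signature is an assumption):
flow.writeHiveManagedTable("scratch_db", overwrite = true)("orders", "customers")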
Writes multiple datasets as parquet files into basePath. The names of the labels become the names of the folders under basePath.
- path in which folders will be created
- if true then overwrite the existing data. By default it is false
- labels to write as parquets, labels will become folder names
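Ex (a sketch; the exact signature is an assumption):
flow.writeParquet("/data/out", overwrite = true)("orders", "customers") // creates /data/out/orders and /data/out/customers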
Writes out data set as CSV; it can have partitioned columns.
- base path of the label, label will be added to it
- repartition dataframe on partition columns
- list of options to apply to the dataframewriter
- label whose data set will be written out
- optional list of partition columns, which will become partition folders
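Ex (a sketch; the exact signature is an assumption):
flow.writePartitionedCSV("/data/out", repartition = true, Map("header" -> "true"))("events", "country", "event_date")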
Writes out data set as parquet, repartitioned into a given number of partitions.
- base path of the label, label will be added to it
- repartition dataframe by a number of partitions
- label whose data set will be written out
Writes out data set as parquet; it can have partitioned columns.
- base path of the label, label will be added to it
- repartition dataframe on partition columns
- label whose data set will be written out
- optional list of partition columns, which will become partition folders
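Ex (a sketch; the exact signature is an assumption):
flow.writePartitionedParquet("/data/out", repartition = true)("events", "country", "event_date") // creates country=.../event_date=... folders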
Defines a functional builder for Spark-specific data flows and common functionality like reading CSV/parquet/Hive data, adding Spark SQL steps, Dataset steps, writing data out into various formats, and staging and committing multiple outputs into storage like HDFS and Hive/Impala.
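Ex (an end-to-end sketch tying the actions above together; Waimak.sparkFlow and the executor call reflect the typical Waimak entry points but are assumptions here):
val flow = Waimak.sparkFlow(spark)
  .openCSV("/data/in")("orders")
  .sql("orders")("big_orders", "SELECT * FROM orders WHERE amount > 100")
  .writeParquet("/data/out")("big_orders")
Waimak.sparkExecutor().execute(flow)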