za.co.absa.spark.hats.transformations
Adds a column similar to df.withColumn(), but allows controlling the position of the new column by providing the name of an existing column after which the new column is inserted.
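A minimal usage sketch of the description above; the method name withColumnAfter and the column names are purely illustrative, not taken from the library's API:

```scala
import org.apache.spark.sql.functions._

// Given a dataframe with columns (id, name, city), add `upper_name`
// directly after `name` rather than at the end of the schema.
// `withColumnAfter` is a hypothetical name for the method described above.
val dfOut = df.withColumnAfter("upper_name", "name", upper(col("name")))
// Intended column order: id, name, upper_name, city
```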
Gathers errors from a nested error column into a global error column for the dataframe
A dataframe containing error columns.
A column name that can be nested deeply inside the dataframe.
An error column name at the root schema level. It will be created automatically if it does not exist.
A dataframe with a new field that contains the list of errors.
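As a sketch of the call described above, assuming the method is NestedArrayTransformations.gatherErrors and using illustrative column names:

```scala
import za.co.absa.spark.hats.transformations.NestedArrayTransformations

// Gather error records stored deep inside `people.addresses.errors`
// into a root-level error column `errCol` (created if it does not exist).
val dfOut = NestedArrayTransformations.gatherErrors(df, "people.addresses.errors", "errCol")
```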
Adds a column that can be inside nested structs, arrays and their combinations.
Dataframe to be transformed
A column name to be created
A new column value
A dataframe with a new field that contains transformed values.
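A sketch of the simple case, assuming the extension-method syntax of spark-hats (column names illustrative):

```scala
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.Extensions._

// Add a literal field `c` inside the struct `a.b`, even when `a` or `b`
// are arrays of structs; the field is created at that level of nesting.
val dfOut = df.nestedWithColumn("a.b.c", lit("hello"))
```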
Adds a column that can be inside nested structs, arrays and their combinations.
Dataframe to be transformed
A column name to be created
A function that takes a 'getField()' function and returns a column as a Spark expression.
A dataframe with a new field that contains transformed values.
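A sketch of the extended variant; the lambda receives a getField function that can resolve fields anywhere on the path to the new column (column names illustrative):

```scala
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.Extensions._

// Combine a root-level field with a field inside the nested array path.
val dfOut = df.nestedWithColumnExtended("people.addresses.full_address", getField =>
  concat(getField("id"), lit(" "), getField("people.addresses.city"))
)
```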
Drops a column from inside nested structs, arrays and their combinations.
Dataframe to be transformed
A column name to be dropped
A dataframe with the specified field dropped.
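For example (a sketch using illustrative column names):

```scala
import za.co.absa.spark.hats.Extensions._

// Remove the nested field `c` from `a.b`, leaving the rest of the struct intact.
val dfOut = df.nestedDropColumn("a.b.c")
```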
A nested struct map with error column support. Given a struct field the method will create a new child field of that struct as a transformation of struct fields and will update the error column according to a specified transformation. This is useful for transformations that require combining several fields of a struct in an array. Extended transformation functions are used so that the caller can access any field in the array path.
Here is an example demonstrating how to handle both root and nested cases:
val dfOut = nestedStructAndErrorMap(df, columnPath, "people.addresses.combinedField",
  (_, getField) => {
    // Struct transformation
    concat(getField("id"), getField("people.addresses.city"), getField("people.first_name"))
  },
  (_, getField) => {
    // Error column transformation
    if (isError(getField("people.addresses.city"))) ErrorCaseClass("Some error") else null
  })
An input DataFrame
A struct column name for which to apply the transformation
The output column name that will be added as a child of the source struct.
The name of the error column.
A function that applies a transformation to a column as a Spark expression
A function that should check error conditions and return an error column in case such conditions are met
A dataframe with a new field that contains transformed values.
A nested struct map. Given a struct field the method will create a new child field of that struct as a transformation of struct fields. This is useful for transformations such as concatenation of fields. The method uses extended transformation functions so the caller can access all parent fields as well.
Here is an example demonstrating how to handle both root and nested cases:
val dfOut = nestedStructMap(df, columnPath, "people.combinedField", (_, getField) => {
  // A root level field 'id' is concatenated with the full name field of an array of people.
  concat(getField("id"), lit(" "), getField("people.full_name"))
})
An input DataFrame
A struct column name for which to apply the transformation
The output column name that will be added as a child of the input struct.
A function that applies a transformation to a column as a Spark expression
A dataframe with a new field that contains transformed values.
A nested map that also appends errors to the error column and uses an extended transformation function that provides the ability to use fields in parent level of nesting. (see NestedArrayTransformations.nestedWithColumnMap above for the usage)
Dataframe to be transformed
A column name for which to apply the transformation, e.g. company.employee.firstName.
The output column name. The path is optional, e.g. you can use conformedName instead of company.employee.conformedName.
The name of the error column.
A function that applies a transformation to a column as a Spark expression.
A function that takes an input column and returns an expression for an error column.
A dataframe with a new field that contains transformed values.
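A sketch of how the parameters fit together, mirroring the extended (_, getField) lambda style of the nestedStructAndErrorMap example; the method name nestedExtendedWithColumnAndErrorMap is an assumption for the method documented here, and ErrorCaseClass and isError are illustrative helpers:

```scala
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.transformations.NestedArrayTransformations

// Conform a nested name field and flag rows that fail a validation check.
val dfOut = NestedArrayTransformations.nestedExtendedWithColumnAndErrorMap(df,
  "company.employee.firstName",  // input column
  "conformedName",               // output column (path optional)
  "errCol",                      // error column at the root level
  (_, getField) => upper(getField("company.employee.firstName")),
  (_, getField) =>
    if (isError(getField("company.employee.firstName"))) ErrorCaseClass("Some error") else null
)
```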
A nested struct map with error column support. Given a struct field the method will create a new child field of that struct as a transformation of struct fields and will update the error column according to a specified transformation. This is useful for transformations that require combining several fields of a struct in an array.
To use the root of the schema as the input struct, pass "" as the inputStructField. In this case null will be passed to the lambda function.
Here is an example demonstrating how to handle both root and nested cases:
val dfOut = nestedStructAndErrorMap(df, columnPath, "combinedField", c => {
  // Struct transformation
  if (c == null) {
    // The columns are at the root level
    concat(col("city"), col("street"))
  } else {
    // The columns are inside nested structs/arrays
    concat(c.getField("city"), c.getField("street"))
  }
}, c => {
  // Error column transformation
  if (c == null) {
    // The columns are at the root level
    if (isError(col("city"))) ErrorCaseClass("Some error") else null
  } else {
    // The columns are inside nested structs/arrays
    if (isError(c.getField("city"))) ErrorCaseClass("Some error") else null
  }
})
An input DataFrame
A struct column name for which to apply the transformation
The output column name that will be added as a child of the source struct.
The name of the error column.
A function that applies a transformation to a column as a Spark expression
A function that should check error conditions and return an error column in case such conditions are met
A dataframe with a new field that contains transformed values.
A nested struct map. Given a struct field the method will create a new child field of that struct as a transformation of struct fields. This is useful for transformations such as concatenation of fields.
To use the root of the schema as the input struct, pass "" as the inputStructField. In this case null will be passed to the lambda function.
Here is an example demonstrating how to handle both root and nested cases:
val dfOut = nestedStructMap(df, columnPath, "combinedField", c => {
  if (c == null) {
    // The columns are at the root level
    concat(col("city"), col("street"))
  } else {
    // The columns are inside nested structs/arrays
    concat(c.getField("city"), c.getField("street"))
  }
})
An input DataFrame
A struct column name for which to apply the transformation
The output column name that will be added as a child of the source struct.
A function that applies a transformation to a column as a Spark expression
A dataframe with a new field that contains transformed values.
Moves all fields of the specified struct up one level. This can only be invoked on a struct nested inside another struct.
root
|-- a: struct
| |-- b: struct
| | |-- c: string
| | |-- d: string
df.nestedUnstruct("a.b")
root
|-- a: struct
| |-- c: string
| |-- d: string
A struct column name that contains the fields to extract.
A dataframe with the struct removed and its fields moved up one level.
A nested map that also appends errors to the error column (see NestedArrayTransformations.nestedWithColumnMap above)
Dataframe to be transformed
A column name for which to apply the transformation, e.g. company.employee.firstName.
The output column name. The path is optional, e.g. you can use conformedName instead of company.employee.conformedName.
The name of the error column.
A function that applies a transformation to a column as a Spark expression.
A function that takes an input column and returns an expression for an error column.
A dataframe with a new field that contains transformed values.
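A sketch tying the parameters together; the column names are illustrative, and the error expression uses Spark's when function as a stand-in for a project-specific error record builder:

```scala
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.transformations.NestedArrayTransformations

// Uppercase a nested field and record an error message whenever the
// source value is null.
val dfOut = NestedArrayTransformations.nestedWithColumnAndErrorMap(df,
  "company.employee.firstName",
  "conformedName",
  "errCol",
  c => upper(c),
  c => when(c.isNull, lit("firstName is null"))
)
```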
Map transformation for columns that can be inside nested structs, arrays and its combinations.
If the input column is a primitive field the method will add outputColumnName at the same level of nesting by executing the expression, passing the source column into it. If a struct column is expected you can use the .getField(...) method to operate on its children. The output column name can omit the full path as the field will be created at the same level of nesting as the input column.
Dataframe to be transformed
A column name for which to apply the transformation, e.g. company.employee.firstName.
The output column name. The path is optional, e.g. you can use conformedName instead of company.employee.conformedName.
A function that applies a transformation to a column as a Spark expression.
A dataframe with a new field that contains transformed values.
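For instance (a sketch with illustrative column names):

```scala
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.transformations.NestedArrayTransformations

// Create `conformedName` next to `firstName` inside the possibly
// array-nested `company.employee` struct.
val dfOut = NestedArrayTransformations.nestedWithColumnMap(df,
  "company.employee.firstName",
  "conformedName",
  c => upper(c)
)
```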