Class

io.prophecy.libs

ExtendedDataFrameGlobal

implicit class ExtendedDataFrameGlobal extends ExtendedDataFrame

Linear Supertypes

ExtendedDataFrame, AnyRef, Any

Instance Constructors

  1. new ExtendedDataFrameGlobal(dataFrame: DataFrame)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def breakAndWriteDataFrameForOutputFile(outputColumns: Seq[String], fileColumnName: String, format: String, delimiter: Option[String] = None): Unit

    Method to break the input DataFrame into multiple DataFrames via the unique values of the fileColumnName column, and persist each resulting DataFrame to its corresponding output file.
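
    Example (a hedged sketch; the DataFrame, column names, and format value are hypothetical):

      import io.prophecy.libs._  // brings the ExtendedDataFrame implicits into scope

      // Write one delimited file per distinct value of the "region" column,
      // keeping only the "id" and "amount" columns in each output file.
      salesDF.breakAndWriteDataFrameForOutputFile(
        outputColumns = Seq("id", "amount"),
        fileColumnName = "region",
        format = "csv",
        delimiter = Some(",")
      )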

    Definition Classes
    ExtendedDataFrame
  6. def cleanDataFrame(): DataFrame

    Definition Classes
    ExtendedDataFrame
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate() @throws( ... )
  8. def collectDataFrameColumnsToApplyFilter(columnList: List[String], filterSourceDataFrame: DataFrame): DataFrame

    Method to collect values for the columnList columns from filterSourceDataFrame and pass them to the caller DataFrame to filter out values in the caller DataFrame.
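
    Example (a hedged sketch; the DataFrames and column names are hypothetical):

      // Collect the ("customer_id", "country") values present in
      // allowedCustomersDF and use them to filter the caller ordersDF.
      val filtered = ordersDF.collectDataFrameColumnsToApplyFilter(
        List("customer_id", "country"),
        allowedCustomersDF
      )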

    Definition Classes
    ExtendedDataFrame
  9. def compareRecords(otherDataFrame: DataFrame, componentName: String, limit: Int, spark: SparkSession): DataFrame

    Method which implements the logic of the Compare Records Ab Initio component. It works as follows:

    1. It joins the two input DataFrames by adding an incremental sequence number to each and joining on that sequence number.
    2. It compares all records of both input DataFrames and finds the count of mismatching records.
    3. If the mismatch record count exceeds limit, it throws an error to terminate workflow execution; otherwise it returns a DataFrame with the mismatch-count report.
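
    Example (a hedged sketch; the DataFrames, component name, and limit are hypothetical):

      // Fail the run if more than 10 records differ between the two inputs;
      // otherwise return the mismatch-count report.
      val report = actualDF.compareRecords(expectedDF, "CompareRecords_1", 10, spark)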

    Definition Classes
    ExtendedDataFrame
  10. def convertIntToLong(): DataFrame

    Definition Classes
    ExtendedDataFrame
  11. var dataFrame: DataFrame

    Definition Classes
    ExtendedDataFrame
  12. def deduplicate(typeToKeep: String, groupByColumns: List[Column] = List(lit(1)), orderByColumns: List[Column] = List(lit(1))): DataFrame

    Method for the Deduplicate operation, where the rows kept in each group are either the first, the last, or the unique-only rows. It first groups the input by all passed groupByColumns and then, depending on the typeToKeep value, performs further operations.

    For both the first and last options, it adds a temporary row_number column holding each row's number within its group of rows grouped by groupByColumns. To find the first records, it simply keeps all rows whose row_number is 1. To find the last records within each group, it also computes each group's row count and keeps all records whose row_number equals that count.

    For the unique-only case, it adds a temporary count column holding the count of rows within each window partition, then keeps the rows of the resulting DataFrame whose count value is 1.

    typeToKeep

    kind of rows to keep. Possible values are first, last and unique-only.

    groupByColumns

    columns to be used to group input records.

    returns

    DataFrame with first or last or unique-only records in each grouping of input records.
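
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      import org.apache.spark.sql.functions.col

      // Keep the first order per customer, ordering each group by timestamp.
      val firstOrders = ordersDF.deduplicate(
        "first",
        groupByColumns = List(col("customer_id")),
        orderByColumns = List(col("order_ts"))
      )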

    Definition Classes
    ExtendedDataFrame
  13. def deduplicateFromColumnNames(typeToKeep: String, groupByColumns: ArrayList[String]): DataFrame

    Definition Classes
    ExtendedDataFrame
  14. def denormalizeSorted(groupByColumns: List[Column] = List(lit(1)), orderByColumns: List[Column] = List(lit(1)), denormalizeRecordExpression: Column, finalizeExpressionMap: Map[String, Column], inputFilter: Option[Column] = None, outputFilter: Option[Column] = None, denormColumnName: String, countColumnName: String = "count"): DataFrame

    Definition Classes
    ExtendedDataFrame
  15. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  16. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  17. def generateLogOutput(componentName: String, subComponentName: String = "", perRowEventTypes: Option[Column] = None, perRowEventTexts: Option[Column] = None, inputRowCount: Long = 0, outputRowCount: Option[Long] = Some(0), finalLogEventType: Option[Column] = None, finalLogEventText: Option[Column] = None, finalEventExtraColumnMap: Map[String, Column] = Map(), sparkSession: SparkSession): DataFrame

    Method to generate Ab Initio log output for any component. This method takes as input an array of non-standard events emitted by a workflow component and serializes each of these events into a separate row. It also adds start and finish events, attaching count information to the finish event.
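
    Example (a hedged sketch; the component name and row counts are hypothetical):

      // Emit start/finish log events for a component that read 100 rows
      // and wrote 95 rows.
      val logDF = someDF.generateLogOutput(
        componentName = "Reformat_1",
        inputRowCount = 100L,
        outputRowCount = Some(95L),
        sparkSession = spark
      )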

    Definition Classes
    ExtendedDataFrame
  18. def generateSurrogateKeys(keyDF: DataFrame, naturalKeys: List[String], surrogateKey: String, overrideSurrogateKeys: Option[String], computeOldPortOutput: Boolean = false, spark: SparkSession): (DataFrame, DataFrame, DataFrame)

    Definition Classes
    ExtendedDataFrame
  19. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
    Annotations
    @HotSpotIntrinsicCandidate()
  20. def grouped(windowSize: Int): DataFrame

    Definition Classes
    ExtendedDataFrame
  21. def hashCode(): Int

    Definition Classes
    AnyRef → Any
    Annotations
    @HotSpotIntrinsicCandidate()
  22. def interim(subgraph: String, component: String, port: String)(implicit interimOutput: InterimOutput): DataFrame

    Definition Classes
    ExtendedDataFrame
    Annotations
    @Py4JWhitelist()
  23. def interim(subgraph: String, component: String, port: String, subPath: String, numRows: Int, detailedStats: Boolean = false)(implicit interimOutput: InterimOutput): DataFrame

    Definition Classes
    ExtendedDataFrame
    Annotations
    @Py4JWhitelist()
  24. def interim(subgraph: String, component: String, port: String, subPath: String, numRows: Int, interimOutput: InterimOutput, detailedStats: Boolean): DataFrame

    Definition Classes
    ExtendedDataFrame
    Annotations
    @Py4JWhitelist()
  25. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  26. def mergeMultipleFileContentInDataFrame(fileNameDF: DataFrame, spark: SparkSession, outputSchema: StructType, delimiter: String, readFormat: String, joinWithInputDataframe: Boolean = false): DataFrame

    Method to read the passed DataFrame fileNameDF and load the content of the filenames it contains. It also merges the fileName column and a unique sequence id into the final generated DataFrame holding the file content for all passed filenames.

    Finally, it joins the file-content DataFrame with the input DataFrame and returns the joined DataFrame.
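
    Example (a hedged sketch; the DataFrames, schema, and format values are hypothetical):

      import org.apache.spark.sql.types._

      val lineSchema = StructType(Seq(StructField("line", StringType)))
      // fileNamesDF is assumed to hold the paths of the files to read.
      val contentDF = inputDF.mergeMultipleFileContentInDataFrame(
        fileNamesDF, spark, lineSchema, delimiter = ",", readFormat = "csv"
      )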

    Definition Classes
    ExtendedDataFrame
  27. def metaPivot(pivotColumns: Seq[String], nameField: String, valueField: String, sparkSession: SparkSession): DataFrame

    Method to take a pivot on the passed pivot columns. This method splits records by the pivot columns, converting each input record into a series of separate output records: one output record for each field of data in the original input record that is not in the pivot list. Each output record contains the name and value of a single data field from the original input record, along with the pivot columns.
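
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      // Every column of peopleDF other than "id" becomes a separate
      // (field_name, field_value) output record per input row.
      val longDF = peopleDF.metaPivot(Seq("id"), "field_name", "field_value", spark)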

    Definition Classes
    ExtendedDataFrame
  28. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  29. def normalize(lengthExpression: Option[Column], finishedExpression: Option[Column], finishedCondition: Option[Column], alias: String, colsToSelect: List[Column], tempWindowExpr: Map[String, Column], lengthRelatedGlobalExpressions: Map[String, Column] = Map()): DataFrame

    Method to take care of the Ab Initio normalize functionality. It first replicates input DataFrame rows multiple times, depending on the passed lengthExpression or finishedExpression. lengthExpression evaluates to a number, and each row in the input data is replicated that many times.

    finishedExpression and finishedCondition are used to apply a filter condition to the input data, and the result of this condition is used to duplicate each input row multiple times.

    tempWindowExpr is used to evaluate temp variables for the Normalize-with-Temp case, using window functions. These expressions are then used in the computation of the final value for the normalize output.

    lengthExpression

    expression which evaluates to an integer value, used to duplicate input records.

    finishedExpression

    expression to be used in the filter condition during its evaluation for duplication of records.

    finishedCondition

    condition to be used to duplicate input records until the condition result is false.

    alias

    to be used to rename finishedExpressions

    colsToSelect

    columns to be selected after normalize operations.

    tempWindowExpr

    window expressions to compute value of temp variables.

    returns

    final normalize output for both the with-Temp and without-Temp cases.
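
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      import org.apache.spark.sql.functions.col

      // Replicate each row of itemsDF "qty" times, then select id and qty.
      val exploded = itemsDF.normalize(
        lengthExpression = Some(col("qty")),
        finishedExpression = None,
        finishedCondition = None,
        alias = "norm",
        colsToSelect = List(col("id"), col("qty")),
        tempWindowExpr = Map()
      )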

    Definition Classes
    ExtendedDataFrame
  30. final def notify(): Unit

    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate()
  31. final def notifyAll(): Unit

    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate()
  32. def readSeparatedValues(inputColumn: Column, outputSchemaColumns: List[String], recordSeparator: String, fieldSeparator: String): DataFrame

    Method to read textual data from inputColumn, split it into multiple records via recordSeparator, and then further split each record into multiple columns via fieldSeparator. Finally, the resulting data is mapped to the passed output columns.
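
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      import org.apache.spark.sql.functions.col

      // rawDF has a "payload" column such as "a,b\nc,d": newlines separate
      // records and commas separate fields.
      val parsed = rawDF.readSeparatedValues(col("payload"), List("col1", "col2"), "\n", ",")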

    Definition Classes
    ExtendedDataFrame
  33. def syncDataFrameColumnsWithSchema(columnNames: Seq[String]): DataFrame

    Method to sync the column names in the DataFrame with the column names passed as input.
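
    Example (a hedged sketch; the DataFrame and column names are hypothetical, and the name mapping is assumed to be positional):

      // Align the DataFrame's column names with the given names.
      val renamed = rawDF.syncDataFrameColumnsWithSchema(Seq("id", "name", "amount"))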

    Definition Classes
    ExtendedDataFrame
  34. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  35. def toString(): String

    Definition Classes
    AnyRef → Any
  36. def unionWithSchema(otherDataFrame: DataFrame): DataFrame

    Method to take the union of the current DataFrame with the passed otherDataFrame. This method also rearranges the columns of otherDataFrame into the same order as the current dataFrame's columns.
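
    Example (a hedged sketch; the DataFrame names are hypothetical):

      // Union by matching column order: otherDF's columns are reordered to
      // df's column order before the union is taken.
      val combined = df.unionWithSchema(otherDF)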

    Definition Classes
    ExtendedDataFrame
  37. lazy val vectorUDF: UserDefinedFunction

    Definition Classes
    ExtendedDataFrame
  38. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  39. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  40. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  41. def withColumnOptional(name: String, value: Column): DataFrame

    Adds a column with the defined value, if the column doesn't already exist.

    name

    Column's name

    value

    New column's value

    returns

    DataFrame with a new column if it doesn't exist already
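
    Example (a hedged sketch; the DataFrame and column name are hypothetical):

      import org.apache.spark.sql.functions.lit

      // Adds a "country" column with value "US" only if df has no such column.
      val withCountry = df.withColumnOptional("country", lit("US"))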

    Definition Classes
    ExtendedDataFrame
  42. def zipWithIndex(startValue: Long = 0L, incrementBy: Long = 1L, indexColName: String, sparkSession: SparkSession): DataFrame

    Method to add a new unique sequence column to the DataFrame, where the sequence starts at startValue and each row's value is incremented by incrementBy.
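
    Example (a hedged sketch; the DataFrame and column name are hypothetical):

      // Adds a "row_id" column holding 0, 1, 2, ... in row order.
      val indexed = df.zipWithIndex(0L, 1L, "row_id", spark)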

    Definition Classes
    ExtendedDataFrame

Deprecated Value Members

  1. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @Deprecated @deprecated @throws( classOf[java.lang.Throwable] )
    Deprecated

    (Since version ) see corresponding Javadoc for more information.

Inherited from libs.ExtendedDataFrame

Inherited from AnyRef

Inherited from Any
