Class

io.prophecy.libs

ExtendedDataFrameGlobal

implicit class ExtendedDataFrameGlobal extends ExtendedDataFrame

Linear Supertypes

ExtendedDataFrame, AnyRef, Any

Instance Constructors

  1. new ExtendedDataFrameGlobal(dataFrame: DataFrame)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def breakAndWriteDataFrameForOutputFile(outputColumns: Seq[String], fileColumnName: String, format: String, delimiter: Option[String] = None): Unit

    Method to break the input DataFrame into multiple DataFrames via the unique values of the fileColumnName column, and persist each resulting DataFrame to its corresponding output file.
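
    Example (a hedged sketch; the DataFrame, column names, and format value are hypothetical):

      import io.prophecy.libs._  // brings the ExtendedDataFrame implicits into scope

      // Write one delimited file per distinct value of the "region" column,
      // keeping only the "id" and "amount" columns in each output file.
      salesDF.breakAndWriteDataFrameForOutputFile(
        outputColumns = Seq("id", "amount"),
        fileColumnName = "region",
        format = "csv",
        delimiter = Some(",")
      )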

    Definition Classes
    ExtendedDataFrame
  6. def cleanDataFrame(): DataFrame

    Definition Classes
    ExtendedDataFrame
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate() @throws( ... )
  8. def collectDataFrameColumnsToApplyFilter(columnList: List[String], filterSourceDataFrame: DataFrame): DataFrame

    Method to collect values for the columnList columns from filterSourceDataFrame and pass them to the caller DataFrame to filter out values in the caller DataFrame.
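
    Example (a hedged sketch; the DataFrames and column names are hypothetical):

      // Collect the ("customer_id", "country") values present in
      // allowedCustomersDF and use them to filter the caller ordersDF.
      val filtered = ordersDF.collectDataFrameColumnsToApplyFilter(
        List("customer_id", "country"),
        allowedCustomersDF
      )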

    Definition Classes
    ExtendedDataFrame
  9. def compareRecords(otherDataFrame: DataFrame, componentName: String, limit: Int, spark: SparkSession): DataFrame

    Method which implements the logic of the Compare Records Ab Initio component. It works as follows:

    1. It joins the two input DataFrames by adding an incremental sequence number to each and joining on that sequence number.
    2. It compares all records of both input DataFrames and finds the count of mismatching records.
    3. If the mismatch record count exceeds limit, it throws an error to terminate workflow execution; otherwise it returns a DataFrame with the mismatch-count report.
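
    Example (a hedged sketch; the DataFrames, component name, and limit are hypothetical):

      // Fail the run if more than 10 records differ between the two inputs;
      // otherwise return the mismatch-count report.
      val report = actualDF.compareRecords(expectedDF, "CompareRecords_1", 10, spark)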

    Definition Classes
    ExtendedDataFrame
  10. def convertIntToLong(): DataFrame

    Definition Classes
    ExtendedDataFrame
  11. var dataFrame: DataFrame

    Definition Classes
    ExtendedDataFrame
  12. def deduplicate(typeToKeep: String, groupByColumns: List[Column] = List(lit(1)), orderByColumns: List[Column] = List(lit(1))): DataFrame

    Method for the Deduplicate operation, where the rows kept in each group are either the first, the last, or the unique-only rows. It first groups the input by all passed groupByColumns and then, depending on the typeToKeep value, performs further operations.

    For both the first and last options, it adds a temporary row_number column holding each row's number within its group of rows grouped by groupByColumns. To find the first records, it simply keeps all rows whose row_number is 1. To find the last records within each group, it also computes each group's row count and keeps all records whose row_number equals that count.

    For the unique-only case, it adds a temporary count column holding the count of rows within each window partition, then keeps the rows of the resulting DataFrame whose count value is 1.

    typeToKeep

    kind of rows to keep. Possible values are first, last and unique-only.

    groupByColumns

    columns to be used to group input records.

    returns

    DataFrame with first or last or unique-only records in each grouping of input records.
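
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      import org.apache.spark.sql.functions.col

      // Keep the first order per customer, ordering each group by timestamp.
      val firstOrders = ordersDF.deduplicate(
        "first",
        groupByColumns = List(col("customer_id")),
        orderByColumns = List(col("order_ts"))
      )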

    Definition Classes
    ExtendedDataFrame
  13. def deduplicateFromColumnNames(typeToKeep: String, groupByColumns: ArrayList[String]): DataFrame

    Definition Classes
    ExtendedDataFrame
  14. def denormalizeSorted(groupByColumns: List[Column] = List(lit(1)), orderByColumns: List[Column] = List(lit(1)), denormalizeRecordExpression: Column, finalizeExpressionMap: Map[String, Column], inputFilter: Option[Column] = None, outputFilter: Option[Column] = None, denormColumnName: String, countColumnName: String = "count"): DataFrame

    Definition Classes
    ExtendedDataFrame
  15. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  16. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  17. def generateLogOutput(componentName: String, subComponentName: String = "", perRowEventTypes: Option[Column] = None, perRowEventTexts: Option[Column] = None, inputRowCount: Long = 0, outputRowCount: Option[Long] = Some(0), finalLogEventType: Option[Column] = None, finalLogEventText: Option[Column] = None, finalEventExtraColumnMap: Map[String, Column] = Map(), sparkSession: SparkSession): DataFrame

    Method to generate Ab Initio log output for any component. This method takes as input an array of non-standard events emitted by a workflow component and serializes each of these events into a separate row. It also adds start and finish events, attaching count information to the finish event.
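
    Example (a hedged sketch; the component name and row counts are hypothetical):

      // Emit start/finish log events for a component that read 100 rows
      // and wrote 95 rows.
      val logDF = someDF.generateLogOutput(
        componentName = "Reformat_1",
        inputRowCount = 100L,
        outputRowCount = Some(95L),
        sparkSession = spark
      )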

    Definition Classes
    ExtendedDataFrame
  18. def generateSurrogateKeys(keyDF: DataFrame, naturalKeys: List[String], surrogateKey: String, overrideSurrogateKeys: Option[String], computeOldPortOutput: Boolean = false, spark: SparkSession): (DataFrame, DataFrame, DataFrame)

    Definition Classes
    ExtendedDataFrame
  19. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
    Annotations
    @HotSpotIntrinsicCandidate()
  20. def grouped(windowSize: Int): DataFrame

    Definition Classes
    ExtendedDataFrame
  21. def hashCode(): Int

    Definition Classes
    AnyRef → Any
    Annotations
    @HotSpotIntrinsicCandidate()
  22. def interim(subgraph: String, component: String, port: String)(implicit interimOutput: InterimOutput): DataFrame

    Definition Classes
    ExtendedDataFrame
    Annotations
    @Py4JWhitelist()
  23. def interim(subgraph: String, component: String, port: String, subPath: String, numRows: Int, detailedStats: Boolean = false)(implicit interimOutput: InterimOutput): DataFrame

    Definition Classes
    ExtendedDataFrame
    Annotations
    @Py4JWhitelist()
  24. def interim(subgraph: String, component: String, port: String, subPath: String, numRows: Int, interimOutput: InterimOutput, detailedStats: Boolean): DataFrame

    Definition Classes
    ExtendedDataFrame
    Annotations
    @Py4JWhitelist()
  25. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  26. def mergeMultipleFileContentInDataFrame(fileNameDF: DataFrame, spark: SparkSession, outputSchema: StructType, delimiter: String, readFormat: String, joinWithInputDataframe: Boolean = false): DataFrame

    Method to read the passed DataFrame fileNameDF and load the content of the filenames it contains. It also merges the fileName column and a unique sequence id into the final generated DataFrame holding the file content for all passed filenames.

    Finally, it joins the file-content DataFrame with the input DataFrame and returns the joined DataFrame.
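
    Example (a hedged sketch; the DataFrames, schema, and format values are hypothetical):

      import org.apache.spark.sql.types._

      val lineSchema = StructType(Seq(StructField("line", StringType)))
      // fileNamesDF is assumed to hold the paths of the files to read.
      val contentDF = inputDF.mergeMultipleFileContentInDataFrame(
        fileNamesDF, spark, lineSchema, delimiter = ",", readFormat = "csv"
      )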

    Definition Classes
    ExtendedDataFrame
  27. def metaPivot(pivotColumns: Seq[String], nameField: String, valueField: String, sparkSession: SparkSession): DataFrame

    Method to take a pivot on the passed pivot columns. This method splits records by the pivot columns, converting each input record into a series of separate output records: one output record for each field of data in the original input record that is not in the pivot list. Each output record contains the name and value of a single data field from the original input record, along with the pivot columns.
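
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      // Every column of peopleDF other than "id" becomes a separate
      // (field_name, field_value) output record per input row.
      val longDF = peopleDF.metaPivot(Seq("id"), "field_name", "field_value", spark)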

    Definition Classes
    ExtendedDataFrame
  28. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  29. def normalize(lengthExpression: Option[Column], finishedExpression: Option[Column], finishedCondition: Option[Column], alias: String, colsToSelect: List[Column], tempWindowExpr: Map[String, Column], lengthRelatedGlobalExpressions: Map[String, Column] = Map()): DataFrame

    Method to take care of the Ab Initio normalize functionality. It first replicates input DataFrame rows multiple times, depending on the passed lengthExpression or finishedExpression. lengthExpression evaluates to a number, and each row in the input data is replicated that many times.

    finishedExpression and finishedCondition are used to apply a filter condition to the input data, and the result of this condition is used to duplicate each input row multiple times.

    tempWindowExpr is used to evaluate temp variables for the Normalize-with-Temp case, using window functions. These expressions are then used in the computation of the final value for the normalize output.

    lengthExpression

    expression which evaluates to an integer value, used to duplicate input records.

    finishedExpression

    expression to be used in the filter condition during its evaluation for duplication of records.

    finishedCondition

    condition to be used to duplicate input records until the condition result is false.

    alias

    to be used to rename finishedExpressions

    colsToSelect

    columns to be selected after normalize operations.

    tempWindowExpr

    window expressions to compute value of temp variables.

    returns

    final normalize output for both the with-Temp and without-Temp cases.
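
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      import org.apache.spark.sql.functions.col

      // Replicate each row of itemsDF "qty" times, then select id and qty.
      val exploded = itemsDF.normalize(
        lengthExpression = Some(col("qty")),
        finishedExpression = None,
        finishedCondition = None,
        alias = "norm",
        colsToSelect = List(col("id"), col("qty")),
        tempWindowExpr = Map()
      )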

    Definition Classes
    ExtendedDataFrame
  30. final def notify(): Unit

    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate()
  31. final def notifyAll(): Unit

    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate()
  32. def readSeparatedValues(inputColumn: Column, outputSchemaColumns: List[String], recordSeparator: String, fieldSeparator: String): DataFrame

    Method to read textual data from inputColumn, split it into multiple records via recordSeparator, and then further split each record into multiple columns via fieldSeparator. Finally, the resulting data is mapped to the passed output columns.
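
    Example (a hedged sketch; the DataFrame and column names are hypothetical):

      import org.apache.spark.sql.functions.col

      // rawDF has a "payload" column such as "a,b\nc,d": newlines separate
      // records and commas separate fields.
      val parsed = rawDF.readSeparatedValues(col("payload"), List("col1", "col2"), "\n", ",")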

    Definition Classes
    ExtendedDataFrame
  33. def syncDataFrameColumnsWithSchema(columnNames: Seq[String]): DataFrame

    Method to sync the column names in the DataFrame with the column names passed as input.
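
    Example (a hedged sketch; the DataFrame and column names are hypothetical, and the name mapping is assumed to be positional):

      // Align the DataFrame's column names with the given names.
      val renamed = rawDF.syncDataFrameColumnsWithSchema(Seq("id", "name", "amount"))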

    Definition Classes
    ExtendedDataFrame
  34. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  35. def toString(): String

    Definition Classes
    AnyRef → Any
  36. def unionWithSchema(otherDataFrame: DataFrame): DataFrame

    Method to take the union of the current DataFrame with the passed otherDataFrame. This method also rearranges the columns of otherDataFrame into the same order as the current dataFrame's columns.
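
    Example (a hedged sketch; the DataFrame names are hypothetical):

      // Union by matching column order: otherDF's columns are reordered to
      // df's column order before the union is taken.
      val combined = df.unionWithSchema(otherDF)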

    Definition Classes
    ExtendedDataFrame
  37. lazy val vectorUDF: UserDefinedFunction

    Definition Classes
    ExtendedDataFrame
  38. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  39. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  40. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  41. def withColumnOptional(name: String, value: Column): DataFrame

    Adds a column with the defined value, if the column doesn't already exist.

    name

    Column's name

    value

    New column's value

    returns

    DataFrame with a new column if it doesn't exist already
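
    Example (a hedged sketch; the DataFrame and column name are hypothetical):

      import org.apache.spark.sql.functions.lit

      // Adds a "country" column with value "US" only if df has no such column.
      val withCountry = df.withColumnOptional("country", lit("US"))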

    Definition Classes
    ExtendedDataFrame
  42. def zipWithIndex(startValue: Long = 0L, incrementBy: Long = 1L, indexColName: String, sparkSession: SparkSession): DataFrame

    Method to add a new unique sequence column to the DataFrame, where the sequence starts at startValue and each row's value is incremented by incrementBy.
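
    Example (a hedged sketch; the DataFrame and column name are hypothetical):

      // Adds a "row_id" column holding 0, 1, 2, ... in row order.
      val indexed = df.zipWithIndex(0L, 1L, "row_id", spark)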

    Definition Classes
    ExtendedDataFrame

Deprecated Value Members

  1. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @Deprecated @deprecated @throws( classOf[java.lang.Throwable] )
    Deprecated

    (Since version ) see corresponding Javadoc for more information.

Inherited from libs.ExtendedDataFrame

Inherited from AnyRef

Inherited from Any
