Trait

io.prophecy.libs

DataHelpers

Related Docs: object DataHelpers | package libs

trait DataHelpers extends LazyLogging

Helper utilities for reading and writing data from and to different data sources.

Linear Supertypes
LazyLogging, AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def appendTrailer(pathInputData: String, pathInputTrailer: String, pathOutputConcatenated: String, configuration: Configuration): Unit

    Appends trailer data to each file in the data directory. A single trailer file in the pathInputTrailer directory should correspond to a single data file in the pathInputData directory.

    If a trailer for a given file does not exist, the data file is moved as-is to the output directory.

    pathInputData: Input data files directory
    pathInputTrailer: Input trailer files directory
    pathOutputConcatenated: Output directory for the concatenated files
    configuration: Hadoop configuration (preferably sparkSession.sparkContext.hadoopConfiguration)
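
    A minimal sketch of a call, assuming the companion object io.prophecy.libs.DataHelpers mixes in this trait; all paths are hypothetical:

      import io.prophecy.libs.DataHelpers
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("trailer-append").getOrCreate()

      // Append each trailer file under /in/trailer to its matching data
      // file under /in/data, writing the results to /out/concatenated.
      DataHelpers.appendTrailer(
        pathInputData = "/in/data",
        pathInputTrailer = "/in/trailer",
        pathOutputConcatenated = "/out/concatenated",
        configuration = spark.sparkContext.hadoopConfiguration
      )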

  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate() @throws( ... )
  7. def concatenate(sources: Seq[String], destination: String, compressToGZip: Boolean = false): Unit

    Reads data from multiple source paths and combines it into a single destination path.

    sources: Source paths from which to merge the data
    destination: Destination path into which all data is combined
    compressToGZip: Flag to gzip-compress the final output file
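
    A sketch of a typical call, assuming the companion object exposes this member; paths are hypothetical:

      import io.prophecy.libs.DataHelpers

      // Merge two part files into one gzip-compressed output file.
      DataHelpers.concatenate(
        sources = Seq("/data/out/part-00000", "/data/out/part-00001"),
        destination = "/data/merged/output.gz",
        compressToGZip = true
      )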

  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def executeNonSelectSQLQueries(sqlList: Seq[String], dbConnection: Connection): Unit

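    Presumably executes each statement in sqlList against dbConnection, returning no result sets (DDL and non-SELECT DML). A minimal sketch, assuming the companion object io.prophecy.libs.DataHelpers exposes this member; the JDBC URL and statements are hypothetical:

      import java.sql.{Connection, DriverManager}
      import io.prophecy.libs.DataHelpers

      val conn: Connection = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/analytics", "user", "secret")
      try {
        // Statements run in order; none is expected to return rows.
        DataHelpers.executeNonSelectSQLQueries(
          Seq(
            "CREATE TABLE IF NOT EXISTS audit_log (id INT, msg VARCHAR(256))",
            "DELETE FROM audit_log WHERE id < 0"
          ),
          conn
        )
      } finally {
        conn.close()
      }
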
  11. def ftpTo(remoteHost: String, userName: String, password: String, sourceFile: String, destFile: String, retryFailures: Boolean, retryCount: Int, retryPauseSecs: Int, mode: String, psCmd: String): (Boolean, Boolean, String, String)

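    Presumably transfers a file between the local system and a remote host with the given retry policy. A hedged sketch; the tuple element names and the mode and psCmd values are guesses, not documented behavior:

      import io.prophecy.libs.DataHelpers

      // Hypothetical host, credentials, and paths.
      val (success, retried, out, err) = DataHelpers.ftpTo(
        remoteHost = "ftp.example.com",
        userName = "user",
        password = "secret",
        sourceFile = "/local/data.csv",
        destFile = "/remote/data.csv",
        retryFailures = true,
        retryCount = 3,
        retryPauseSecs = 5,
        mode = "binary", // assumed transfer-mode value
        psCmd = ""       // assumed; purpose not documented here
      )
      if (!success) sys.error(s"FTP transfer failed: $err")
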
  12. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
    Annotations
    @HotSpotIntrinsicCandidate()
  13. def getEmptyLogDataFrame(sparkSession: SparkSession): DataFrame

    Returns an empty DataFrame with the Ab Initio log schema below:

      record
        string("|") node, timestamp, component, subcomponent, event_type;
        string("|\n") event_text;
      end
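
    For reference, a sketch of calling it, assuming the companion object exposes this member; the printed schema should mirror the record format above:

      import io.prophecy.libs.DataHelpers
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("log-schema").getOrCreate()

      // Empty DataFrame with columns node, timestamp, component,
      // subcomponent, event_type and event_text (presumably all strings).
      val logDf = DataHelpers.getEmptyLogDataFrame(spark)
      logDf.printSchema()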

  14. def hashCode(): Int

    Definition Classes
    AnyRef → Any
    Annotations
    @HotSpotIntrinsicCandidate()
  15. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  16. def loadBinaryFileAsBinaryDataFrame(filePath: String, lineDelimiter: String = "\n", minPartition: Int = 1, rowName: String = "line", spark: SparkSession): DataFrame

  17. def loadBinaryFileAsStringDataFrame(filePath: String, lineDelimiter: String = "\n", charSetEncoding: String = "Cp1047", minPartition: Int = 1, rowName: String = "line", spark: SparkSession): DataFrame

  18. def loadFixedWindowBinaryFileAsDataFrame(filePath: String, lineLength: Int, minPartition: Int = 1, rowName: String = "line", spark: SparkSession): DataFrame

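    None of the three loaders above carries a doc comment; loadBinaryFileAsBinaryDataFrame presumably keeps the raw bytes, while the string variant decodes them. A hedged sketch of typical calls (Cp1047, the default encoding, is the EBCDIC code page used on IBM mainframes; paths are hypothetical):

      import io.prophecy.libs.DataHelpers
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("binary-load").getOrCreate()

      // Newline-delimited EBCDIC file decoded into a string column "line".
      val text = DataHelpers.loadBinaryFileAsStringDataFrame(
        filePath = "/data/mainframe/export.dat",
        spark = spark)

      // Fixed-width file: every record is exactly 80 bytes long.
      val fixed = DataHelpers.loadFixedWindowBinaryFileAsDataFrame(
        filePath = "/data/mainframe/fixed.dat",
        lineLength = 80,
        spark = spark)
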
  19. lazy val logger: Logger

    Attributes
    protected
    Definition Classes
    LazyLogging
  20. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  21. final def notify(): Unit

    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate()
  22. final def notifyAll(): Unit

    Definition Classes
    AnyRef
    Annotations
    @HotSpotIntrinsicCandidate()
  23. def readHiveTable(spark: SparkSession, database: String, table: String, partition: String = ""): DataFrame

    Reads data from a Hive table.

    spark: Spark session
    database: Hive database name
    table: Hive table name
    partition: Specific Hive table partition to read, if provided
    returns: DataFrame with the data read from the Hive table
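
    A sketch of reading one partition, assuming the companion object exposes this member; names are hypothetical:

      import io.prophecy.libs.DataHelpers
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("hive-read")
        .enableHiveSupport()
        .getOrCreate()

      // Read only the ds=2024-01-01 partition; with the default ""
      // the whole table is presumably read.
      val orders = DataHelpers.readHiveTable(
        spark, database = "sales", table = "orders", partition = "ds=2024-01-01")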

  24. def readHiveTableInChunks(spark: SparkSession, database: String, table: String, partitionKey: String, partitionValue: String): DataFrame

    Reads a full Hive table partition by reading every subpartition separately and performing a union of the resulting DataFrames.

    This function is meant as a temporary workaround for the Hive metastore crashing when too many partitions are queried at once.

    spark: Spark session
    database: Hive database name
    table: Hive table name
    partitionKey: Top-level partition key
    partitionValue: Top-level partition value
    returns: A complete DataFrame with the selected Hive table partition
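
    A sketch of a call (hypothetical names; spark is a Hive-enabled SparkSession as in the previous sketch):

      import io.prophecy.libs.DataHelpers

      // Union every subpartition under ds=2024-01-01 one by one instead
      // of querying the metastore for the whole partition at once.
      val events = DataHelpers.readHiveTableInChunks(
        spark, database = "logs", table = "events",
        partitionKey = "ds", partitionValue = "2024-01-01")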

  25. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  26. def toString(): String

    Definition Classes
    AnyRef → Any
  27. def unionAll(df: DataFrame*): DataFrame

    Returns the union of all the passed DataFrames.

    df: DataFrames to union
    returns: Union of all the input DataFrames
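
    A self-contained sketch, assuming the companion object exposes this member:

      import io.prophecy.libs.DataHelpers
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("union-all").getOrCreate()
      import spark.implicits._

      val jan = Seq((1, "jan")).toDF("id", "month")
      val feb = Seq((2, "feb")).toDF("id", "month")
      val mar = Seq((3, "mar")).toDF("id", "month")

      // Varargs call; all inputs must share the same schema.
      val all = DataHelpers.unionAll(jan, feb, mar)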

  28. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  29. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  30. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  31. def writeDataFrame(df: DataFrame, path: String, spark: SparkSession, props: Map[String, String], format: String, partitionColumns: List[String] = Nil, bucketColumns: List[String] = Nil, numBuckets: Option[Int] = None, sortColumns: List[String] = Nil, tableName: Option[String] = None, databaseName: Option[String] = None): Unit

    Writes the data in the given DataFrame out in the specified file format.

    df: DataFrame containing the data
    path: Path to write the data to
    spark: Spark session
    props: Underlying data-source-specific properties
    format: File format in which to persist the data; supported formats are csv, text, json, parquet, and orc
    partitionColumns: Columns to partition the output by
    bucketColumns: Columns to bucket the output by; if specified, the output is laid out on the file system similar to Hive's bucketing scheme
    numBuckets: Number of buckets to use
    sortColumns: Columns to sort the output by while persisting
    tableName: Table name for persisting the data
    databaseName: Database name for persisting the data
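
    A sketch of a partitioned, bucketed parquet write registered as a Hive table, assuming the companion object exposes this member; all names are hypothetical:

      import io.prophecy.libs.DataHelpers
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("write-df")
        .enableHiveSupport()
        .getOrCreate()
      import spark.implicits._

      val df = Seq((1, "2024-01-01"), (2, "2024-01-02")).toDF("id", "ds")

      // Parquet output partitioned by ds, bucketed and sorted by id,
      // registered as sales.orders_out.
      DataHelpers.writeDataFrame(
        df = df,
        path = "/warehouse/sales/orders_out",
        spark = spark,
        props = Map.empty[String, String],
        format = "parquet",
        partitionColumns = List("ds"),
        bucketColumns = List("id"),
        numBuckets = Some(8),
        sortColumns = List("id"),
        tableName = Some("orders_out"),
        databaseName = Some("sales")
      )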

  32. lazy val write_to_log: UserDefinedFunction

    UDF to write logging parameters to the log port.

Deprecated Value Members

  1. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @Deprecated @deprecated @throws( classOf[java.lang.Throwable] )
    Deprecated

    See the corresponding Javadoc for more information.
