class TextReader extends Serializable

Class to read and parse text files.

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new TextReader(titleLengthSize: Int = 50, storeContent: Boolean = false, blockSplit: String = BLOCK_SPLIT_PATTERN, groupBrokenParagraphs: Boolean = false, paragraphSplit: String = DOUBLE_PARAGRAPH_PATTERN, shortLineWordThreshold: Int = 5, maxLineCount: Int = 2000, threshold: Double = 0.1)

    titleLengthSize

    Maximum character length used to determine if a text block qualifies as a title during parsing. The default value is 50.

    storeContent

    Whether to include the raw file content in the resulting DataFrame as a separate "content" column, alongside the parsed elements. By default, it is set to false.

    groupBrokenParagraphs

    Whether to merge fragmented lines into coherent paragraphs using heuristics based on line length and structure. By default, it is set to false.

    paragraphSplit

    Regex pattern used to detect paragraph boundaries when grouping broken paragraphs. Default value is DOUBLE_PARAGRAPH_PATTERN.

    shortLineWordThreshold

    Maximum word count for a line to be considered 'short' during broken paragraph grouping. Default value is 5.

    maxLineCount

    Maximum number of lines to evaluate when estimating paragraph layout characteristics. Default value is 2000.

    threshold

    Threshold ratio of empty lines used to decide between newline-based and broken-paragraph grouping. Default value is 0.1.
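The quantities that maxLineCount, threshold and shortLineWordThreshold refer to can be illustrated with a small standalone sketch. This is not TextReader's implementation: the object name ParagraphGroupingSketch and both helper functions are hypothetical, and treating shortLineWordThreshold as an inclusive bound is an assumption. Which grouping strategy is chosen on each side of the threshold is not stated here, so the sketch only computes the ratio and the comparison.

```scala
// Hypothetical illustration only; not the library's code.
object ParagraphGroupingSketch {

  // Fraction of empty (whitespace-only) lines among the first `maxLineCount` lines.
  def emptyLineRatio(text: String, maxLineCount: Int = 2000): Double = {
    val lines = text.split("\n", -1).take(maxLineCount)
    if (lines.isEmpty) 0.0
    else lines.count(_.trim.isEmpty).toDouble / lines.length
  }

  // True when the empty-line ratio falls below the configured threshold.
  def isBelowThreshold(text: String, threshold: Double = 0.1, maxLineCount: Int = 2000): Boolean =
    emptyLineRatio(text, maxLineCount) < threshold

  // A line counts as "short" when its word count does not exceed the threshold
  // (inclusive bound is an assumption).
  def isShortLine(line: String, shortLineWordThreshold: Int = 5): Boolean =
    line.trim.split("\\s+").count(_.nonEmpty) <= shortLineWordThreshold
}
```

For a text such as "a\n\nb\n\nc", two of the five lines are empty, so the ratio is 0.4 and the comparison against the default threshold of 0.1 is false.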

    Example

    val filePath = "./txt-files/simple-text.txt"
    val textReader = new TextReader()
    val textDf = textReader.txt(filePath)
    textDf.show()
    +--------------------+--------------------+
    |                path|                 txt|
    +--------------------+--------------------+
    |file:/content/txt...|[{Title, BIG DATA...|
    +--------------------+--------------------+
    
    textDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- txt: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)

    For more examples please refer to this notebook.
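The schema printed above describes each parsed element as a struct with elementType, content and metadata fields. A minimal sketch of working with that shape once it has been collected into plain Scala values might look like the following. ParsedElement is a hypothetical stand-in that mirrors the printed schema, not the library's class; the sample values, the "NarrativeText" type name and the "paragraph" metadata key are made up for illustration.

```scala
// Hypothetical stand-in mirroring the printed schema; not the library's class.
final case class ParsedElement(
    elementType: String,
    content: String,
    metadata: Map[String, String])

object ParsedElementSketch {

  // Made-up sample data for illustration.
  val sample: Seq[ParsedElement] = Seq(
    ParsedElement("Title", "Sample heading", Map("paragraph" -> "0")),
    ParsedElement("NarrativeText", "Some body text.", Map("paragraph" -> "0")))

  // Returns the content of every element whose elementType matches.
  def contentsOfType(elements: Seq[ParsedElement], elementType: String): Seq[String] =
    elements.collect { case e if e.elementType == elementType => e.content }
}
```

Calling ParsedElementSketch.contentsOfType(ParsedElementSketch.sample, "Title") keeps only the elements tagged as titles.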

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  10. def getOutputColumn: String
  11. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  13. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  14. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  15. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. def setOutputColumn(value: String): TextReader.this.type
  17. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  18. def toString(): String
    Definition Classes
    AnyRef → Any
  19. def txt(filePath: String): DataFrame

    Parses TXT files and returns a DataFrame.

    The DataFrame will contain:

    • "path": the file path,
    • "content": the raw text content (included when storeContent is true),
    • outputColumn: a Seq[HTMLElement] containing the parsed elements.
  20. def txtContent(content: String): DataFrame
  21. def txtToHTMLElement(text: String): Seq[HTMLElement]
  22. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  24. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
