class WordReader extends Serializable

Class to read and parse Word files.

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new WordReader(storeContent: Boolean = false, includePageBreaks: Boolean = false, inferTableStructure: Boolean = false)

    storeContent

    Whether to include the raw file content in the output DataFrame as a separate content column, alongside the structured output. Default is false.

    includePageBreaks

    Whether to detect and tag content with page break metadata. In Word documents, this includes manual and section breaks. In Excel files, this includes page breaks based on column boundaries. Default is false.

    inferTableStructure

    Whether to generate an HTML table representation from structured table content. When enabled, a full table element is added alongside cell-level elements, based on row and column layout. Default is false.

    Example

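    import com.johnsnowlabs.reader.WordReader

    // Path to a single .docx file; a directory of Word files is also accepted.
    val docPath = "./word-files/fake_table.docx"
    val wordReader = new WordReader()
    val wordDf = wordReader.doc(docPath)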
    
    wordDf.show()
    +--------------------+--------------------+
    |                path|                 doc|
    +--------------------+--------------------+
    |file:/content/wor...|[{Table, Header C...|
    +--------------------+--------------------+
    
    wordDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- doc: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)

    For more examples please refer to this notebook.
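
Because doc is an array of structs, it is often easier to inspect with one row per parsed element. The following is a minimal sketch using only standard Spark SQL functions (explode, col). It builds a small stand-in DataFrame with the schema printed above, so it runs without any Word file. The Element case class, the sampleDf name, and the sample rows are illustrative assumptions and not part of the WordReader API; in practice the input would be the DataFrame returned by doc.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical stand-in for one entry of the `doc` array (fields follow the printed schema).
case class Element(elementType: String, content: String, metadata: Map[String, String])

val spark = SparkSession.builder().master("local[1]").appName("word-elements").getOrCreate()
import spark.implicits._

// Made-up sample data; a real DataFrame would be returned by wordReader.doc(...).
val sampleDf = Seq(
  ("file:/tmp/sample.docx", Seq(
    Element("Table", "Header Col 1", Map.empty[String, String]),
    Element("Table", "Header Col 2", Map.empty[String, String])))
).toDF("path", "doc")

// One row per parsed element, with the struct fields promoted to top-level columns.
val elementsDf = sampleDf
  .select(col("path"), explode(col("doc")).as("element"))
  .select(col("path"), col("element.elementType"), col("element.content"), col("element.metadata"))

// Keep only the table elements.
val tablesDf = elementsDf.filter(col("elementType") === "Table")
tablesDf.show(truncate = false)
```

The snippet is written notebook-style, like the example above. In a compiled application the case class would need to be declared at the top level so that Spark can derive an encoder for it.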

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. def doc(filePath: String): DataFrame

    filePath

    Path to a single Word file or to a directory of Word files, e.g. "path/word/files".

    returns

    DataFrame with the parsed Word document content.

  7. def docToHTMLElement(content: Array[Byte]): Seq[HTMLElement]
  8. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def getOutputColumn: String
  13. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  14. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  15. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  16. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  17. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  18. def setOutputColumn(value: String): WordReader.this.type
  19. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  20. def toString(): String
    Definition Classes
    AnyRef → Any
  21. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  22. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
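
As a usage sketch for the WordReader-specific members listed above (doc, getOutputColumn, setOutputColumn): setOutputColumn returns the reader itself, so it can be chained directly onto the constructor. The snippet assumes that Spark NLP is on the classpath, that the class lives in the com.johnsnowlabs.reader package, and that the setter renames the column holding the parsed elements (doc in the example output). The column name word_elements and the helper parseWord are hypothetical and used for illustration only.

```scala
import com.johnsnowlabs.reader.WordReader
import org.apache.spark.sql.DataFrame

// Named arguments make it clear which of the three Boolean options are enabled.
val reader = new WordReader(storeContent = true, inferTableStructure = true).setOutputColumn("word_elements")

// Read back the configured output column name.
println(reader.getOutputColumn)

// Hypothetical helper: parse on demand; `path` may be a single file or a directory of Word files.
def parseWord(path: String): DataFrame = reader.doc(path)
```

Calling parseWord("path/word/files") would then return the parsed DataFrame, with the element array expected under the configured column name.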
