class Partition extends Serializable

The Partition class is a unified interface for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.

Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.

The class detects the appropriate reader either from the file extension or from a provided MIME contentType, and delegates to the relevant method of SparkNLPReader. Custom behavior (such as title thresholds or page break tagging) can be configured through the params map at initialization.

By abstracting reader initialization, type detection, and parsing logic, Partition simplifies document ingestion in scalable NLP pipelines.
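The reader-selection step described above can be pictured as a two-stage lookup: an explicit contentType wins, and the file extension is the fallback. The sketch below is illustrative only (the method names and dispatch tables are invented for the example, not the actual SparkNLPReader internals):

```scala
// Illustrative sketch of reader selection: an explicit MIME content_type
// takes precedence; otherwise fall back to the file extension.
// The reader names and tables here are hypothetical.
object ReaderDispatch {
  private val byMime = Map(
    "text/plain" -> "readText",
    "text/html" -> "readHtml",
    "message/rfc822" -> "readEmail",
    "application/pdf" -> "readPdf")

  private val byExtension = Map(
    "txt" -> "readText",
    "html" -> "readHtml",
    "eml" -> "readEmail",
    "pdf" -> "readPdf")

  def select(path: String, contentType: Option[String]): Option[String] =
    contentType.flatMap(byMime.get).orElse {
      path.split('.').lastOption.map(_.toLowerCase).flatMap(byExtension.get)
    }
}
```

For instance, `ReaderDispatch.select("doc.eml", None)` resolves via the extension, while `ReaderDispatch.select("doc.bin", Some("application/pdf"))` resolves via the MIME type.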

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new Partition(params: Map[String, String] = new java.util.HashMap())

    params

    Map of parameters with custom configurations. It includes the following parameters:

    • content_type (All): Override automatic file type detection.
    • store_content (All): Include raw file content in the output DataFrame as a separate 'content' column.
    • timeout (HTML): Timeout in seconds for fetching remote HTML content.
    • title_font_size (HTML, Excel): Minimum font size used to identify titles based on formatting.
    • include_page_breaks (Word, Excel): Whether to tag content with page break metadata.
    • group_broken_paragraphs (Text): Whether to merge broken lines into full paragraphs using heuristics.
    • title_length_size (Text): Max character length used to qualify text blocks as titles.
    • paragraph_split (Text): Regex to detect paragraph boundaries when grouping lines.
    • short_line_word_threshold (Text): Max word count for a line to be considered short.
    • threshold (Text): Ratio of empty lines used to switch between newline-based and paragraph grouping.
    • max_line_count (Text): Max lines evaluated when analyzing paragraph structure.
    • include_slide_notes (PowerPoint): Whether to include speaker notes from slides as narrative text.
    • infer_table_structure (Word, Excel, PowerPoint): Generate full HTML table structure from parsed table content.
    • append_cells (Excel): Append all rows into a single content block instead of individual elements.
    • cell_separator (Excel): String used to join cell values in a row for text output.
    • add_attachment_content (Email): Include text content of plain-text attachments in the output.
• headers (HTML): Used when a URL is provided; sets the necessary headers for the request.
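The interplay of threshold, group_broken_paragraphs, and max_line_count can be pictured with a small heuristic: if few lines in a sample are empty, line breaks are likely soft wraps and broken lines should be merged; if many lines are empty, blank lines already mark paragraph boundaries. This is a hedged sketch of that idea only; the actual heuristic in Spark NLP may differ in its details:

```scala
// Sketch of an empty-line-ratio heuristic (illustrative; parameter names
// mirror `threshold` and `max_line_count` above, but the real logic may differ).
def emptyLineRatio(text: String, maxLineCount: Int = 2000): Double = {
  val lines = text.split("\n", -1).take(maxLineCount)
  if (lines.isEmpty) 0.0
  else lines.count(_.trim.isEmpty).toDouble / lines.length
}

// Few empty lines => broken lines are probably wrapped paragraphs: merge them.
def shouldGroupBrokenParagraphs(text: String, threshold: Double = 0.1): Boolean =
  emptyLineRatio(text) < threshold
```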

    Example 1 (Reading Text Files)

    val txtDirectory = "/content/txtfiles/reader/txt"
    val textDf = Partition(Map("content_type" -> "text/plain")).partition(txtDirectory)
    textDf.show()
    
    +--------------------+--------------------+
    |                path|                 txt|
    +--------------------+--------------------+
    |file:/content/txt...|[{Title, BIG DATA...|
    +--------------------+--------------------+

    Example 2 (Reading Email Files)

    val emailDirectory = "./email-files/test-several-attachments.eml"
    val partitionDf = Partition(Map("content_type" -> "message/rfc822")).partition(emailDirectory)
    partitionDf.show()
    +--------------------+--------------------+
    |                path|               email|
    +--------------------+--------------------+
    |file:/content/ema...|[{Title, Test Sev...|
    +--------------------+--------------------+

    Example 3 (Reading Webpages)

      val htmlDf = Partition().partition("https://www.wikipedia.org")
      htmlDf.show()
    
    +--------------------+--------------------+
    |                 url|                html|
    +--------------------+--------------------+
    |https://www.wikip...|[{Title, Wikipedi...|
    +--------------------+--------------------+

    For more examples, see examples/python/data-preprocessing/SparkNLP_Partition_Reader_Demo.ipynb

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  10. def getOutputColumn: String
  11. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  13. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  14. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  15. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. def partition(path: String, headers: Map[String, String] = new java.util.HashMap()): DataFrame

    Takes a URL, file, or directory path and parses its content.

    path

    Path to a file or a local directory of files. Supports URLs and DFS file systems such as Databricks, HDFS, and Microsoft Fabric OneLake.

    headers

    If the path is a URL, sets the necessary headers for the request.

    returns

    DataFrame with parsed file content.
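    When the path is a URL, the headers argument carries the HTTP request headers. A minimal sketch, following the signature above (the header values are ordinary HTTP headers chosen for illustration; running it requires a Spark NLP environment):

    ```scala
    // Passing request headers when partitioning a URL.
    val headers = Map(
      "User-Agent" -> "Mozilla/5.0",
      "Accept-Language" -> "en-US,en;q=0.9")
    val htmlDf = Partition(Map("content_type" -> "text/html"))
      .partition("https://www.wikipedia.org", headers)
    htmlDf.show()
    ```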

  17. def partitionBytesContent(input: Array[Byte]): Seq[HTMLElement]
  18. def partitionStringContent(input: String, headers: Map[String, String] = new java.util.HashMap()): Seq[HTMLElement]
  19. def partitionText(text: String): DataFrame

    Parses and reads data from a string.

    text

    Text data in the form of a string.

    returns

    DataFrame with parsed text content.

    Example

       val content =
         """
           |The big brown fox
           |was walking down the lane.
           |
           |At the end of the lane,
           |the fox met a bear.
           |""".stripMargin
    
       val textDf = Partition(Map("groupBrokenParagraphs" -> "true")).partitionText(content)
       textDf.show()
    
    +--------------------------------------+
    |txt                                   |
    +--------------------------------------+
    |[{NarrativeText, The big brown fox was|
    +--------------------------------------+
    
       textDf.printSchema()
       root
            |-- txt: array (nullable = true)
            |    |-- element: struct (containsNull = true)
            |    |    |-- elementType: string (nullable = true)
            |    |    |-- content: string (nullable = true)
            |    |    |-- metadata: map (nullable = true)
            |    |    |    |-- key: string
            |    |    |    |-- value: string (valueContainsNull = true)
  20. def partitionUrls(urls: Array[String], headers: Map[String, String] = Map.empty): DataFrame

    Parses multiple URLs.

    urls

    List of URLs.

    headers

    Sets the necessary headers for the URL requests.

    returns

    DataFrame with parsed URL content.

    Example

    val htmlDf =
         Partition().partitionUrls(Array("https://www.wikipedia.org", "https://example.com/"))
    htmlDf.show()
    
    +--------------------+--------------------+
    |                 url|                html|
    +--------------------+--------------------+
    |https://www.wikip...|[{Title, Wikipedi...|
    |https://example.com/|[{Title, Example ...|
    +--------------------+--------------------+
    
    htmlDf.printSchema()
    root
      |-- url: string (nullable = true)
      |-- html: array (nullable = true)
      |    |-- element: struct (containsNull = true)
      |    |    |-- elementType: string (nullable = true)
      |    |    |-- content: string (nullable = true)
      |    |    |-- metadata: map (nullable = true)
      |    |    |    |-- key: string
      |    |    |    |-- value: string (valueContainsNull = true)
  21. def partitionUrlsJava(urls: List[String], headers: Map[String, String] = new java.util.HashMap()): DataFrame
  22. def setOutputColumn(value: String): Partition.this.type
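    setOutputColumn renames the column holding the parsed elements; getOutputColumn returns the current name. A short sketch (the assumption that the default column name follows the reader, e.g. "txt", "html", or "email", is based on the examples above; running it requires a Spark NLP environment):

    ```scala
    // Rename the output column before partitioning.
    val partition = Partition(Map("content_type" -> "text/plain"))
    partition.setOutputColumn("elements")
    val df = partition.partition("/content/txtfiles/reader/txt")
    // partition.getOutputColumn should now return "elements"
    ```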
  23. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  24. def toString(): String
    Definition Classes
    AnyRef → Any
  25. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
