class Partition extends Serializable

The Partition class is a unified interface for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.

Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.

The class detects the appropriate reader either from the file extension or from a provided MIME contentType, and delegates to the relevant method of SparkNLPReader. Custom behavior (such as title thresholds or page break tagging) can be configured through the params map at initialization.

By abstracting reader initialization, type detection, and parsing logic, Partition simplifies document ingestion in scalable NLP pipelines.
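The reader-selection step described above can be pictured as a two-stage lookup: an explicit contentType wins, and the file extension is the fallback. The sketch below is illustrative only (the method names and dispatch tables are invented for the example, not the actual SparkNLPReader internals):

```scala
// Illustrative sketch of reader selection: an explicit MIME content_type
// takes precedence; otherwise fall back to the file extension.
// The reader names and tables here are hypothetical.
object ReaderDispatch {
  private val byMime = Map(
    "text/plain" -> "readText",
    "text/html" -> "readHtml",
    "message/rfc822" -> "readEmail",
    "application/pdf" -> "readPdf")

  private val byExtension = Map(
    "txt" -> "readText",
    "html" -> "readHtml",
    "eml" -> "readEmail",
    "pdf" -> "readPdf")

  def select(path: String, contentType: Option[String]): Option[String] =
    contentType.flatMap(byMime.get).orElse {
      path.split('.').lastOption.map(_.toLowerCase).flatMap(byExtension.get)
    }
}
```

For instance, `ReaderDispatch.select("doc.eml", None)` resolves via the extension, while `ReaderDispatch.select("doc.bin", Some("application/pdf"))` resolves via the MIME type.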

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new Partition(params: Map[String, String] = new java.util.HashMap())

    params

    Map of parameters with custom configurations. It includes the following parameters:

    • content_type (All): Override automatic file type detection.
    • store_content (All): Include raw file content in the output DataFrame as a separate 'content' column.
    • timeout (HTML): Timeout in seconds for fetching remote HTML content.
    • title_font_size (HTML, Excel): Minimum font size used to identify titles based on formatting.
    • include_page_breaks (Word, Excel): Whether to tag content with page break metadata.
    • group_broken_paragraphs (Text): Whether to merge broken lines into full paragraphs using heuristics.
    • title_length_size (Text): Max character length used to qualify text blocks as titles.
    • paragraph_split (Text): Regex to detect paragraph boundaries when grouping lines.
    • short_line_word_threshold (Text): Max word count for a line to be considered short.
    • threshold (Text): Ratio of empty lines used to switch between newline-based and paragraph grouping.
    • max_line_count (Text): Max lines evaluated when analyzing paragraph structure.
    • include_slide_notes (PowerPoint): Whether to include speaker notes from slides as narrative text.
    • infer_table_structure (Word, Excel, PowerPoint): Generate full HTML table structure from parsed table content.
    • append_cells (Excel): Append all rows into a single content block instead of individual elements.
    • cell_separator (Excel): String used to join cell values in a row for text output.
    • add_attachment_content (Email): Include text content of plain-text attachments in the output.
• headers (HTML): Used when a URL is provided; sets the necessary headers for the request.
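The interplay of threshold, group_broken_paragraphs, and max_line_count can be pictured with a small heuristic: if few lines in a sample are empty, line breaks are likely soft wraps and broken lines should be merged; if many lines are empty, blank lines already mark paragraph boundaries. This is a hedged sketch of that idea only; the actual heuristic in Spark NLP may differ in its details:

```scala
// Sketch of an empty-line-ratio heuristic (illustrative; parameter names
// mirror `threshold` and `max_line_count` above, but the real logic may differ).
def emptyLineRatio(text: String, maxLineCount: Int = 2000): Double = {
  val lines = text.split("\n", -1).take(maxLineCount)
  if (lines.isEmpty) 0.0
  else lines.count(_.trim.isEmpty).toDouble / lines.length
}

// Few empty lines => broken lines are probably wrapped paragraphs: merge them.
def shouldGroupBrokenParagraphs(text: String, threshold: Double = 0.1): Boolean =
  emptyLineRatio(text) < threshold
```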

    Example 1 (Reading Text Files)

    val txtDirectory = "/content/txtfiles/reader/txt"
    val textDf = Partition(Map("content_type" -> "text/plain")).partition(txtDirectory)
    textDf.show()
    
    +--------------------+--------------------+
    |                path|                 txt|
    +--------------------+--------------------+
    |file:/content/txt...|[{Title, BIG DATA...|
    +--------------------+--------------------+

    Example 2 (Reading Email Files)

    val emailDirectory = "./email-files/test-several-attachments.eml"
    val partitionDf = Partition(Map("content_type" -> "message/rfc822")).partition(emailDirectory)
    partitionDf.show()
    +--------------------+--------------------+
    |                path|               email|
    +--------------------+--------------------+
    |file:/content/ema...|[{Title, Test Sev...|
    +--------------------+--------------------+

    Example 3 (Reading Webpages)

      val htmlDf = Partition().partition("https://www.wikipedia.org")
      htmlDf.show()
    
    +--------------------+--------------------+
    |                 url|                html|
    +--------------------+--------------------+
    |https://www.wikip...|[{Title, Wikipedi...|
    +--------------------+--------------------+

    For more examples, see examples/python/data-preprocessing/SparkNLP_Partition_Reader_Demo.ipynb

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  10. def getOutputColumn: String
  11. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  13. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  14. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  15. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. def partition(path: String, headers: Map[String, String] = new java.util.HashMap()): DataFrame

    Takes a URL, file, or directory path and parses its content.

    path

    Path to a file or a local directory of files. Supports URLs and DFS file systems such as Databricks, HDFS, and Microsoft Fabric OneLake.

    headers

    If the path is a URL, sets the necessary headers for the request.

    returns

    DataFrame with parsed file content.
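    When the path is a URL, the headers argument carries the HTTP request headers. A minimal sketch, following the signature above (the header values are ordinary HTTP headers chosen for illustration; running it requires a Spark NLP environment):

    ```scala
    // Passing request headers when partitioning a URL.
    val headers = Map(
      "User-Agent" -> "Mozilla/5.0",
      "Accept-Language" -> "en-US,en;q=0.9")
    val htmlDf = Partition(Map("content_type" -> "text/html"))
      .partition("https://www.wikipedia.org", headers)
    htmlDf.show()
    ```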

  17. def partitionBytesContent(input: Array[Byte]): Seq[HTMLElement]
  18. def partitionStringContent(input: String, headers: Map[String, String] = new java.util.HashMap()): Seq[HTMLElement]
  19. def partitionText(text: String): DataFrame

    Parses and reads data from a string.

    text

    Text data in the form of a string.

    returns

    DataFrame with parsed text content.

    Example

       val content =
         """
           |The big brown fox
           |was walking down the lane.
           |
           |At the end of the lane,
           |the fox met a bear.
           |""".stripMargin
    
       val textDf = Partition(Map("groupBrokenParagraphs" -> "true")).partitionText(content)
       textDf.show()
    
    +--------------------------------------+
    |txt                                   |
    +--------------------------------------+
    |[{NarrativeText, The big brown fox was|
    +--------------------------------------+
    
       textDf.printSchema()
       root
            |-- txt: array (nullable = true)
            |    |-- element: struct (containsNull = true)
            |    |    |-- elementType: string (nullable = true)
            |    |    |-- content: string (nullable = true)
            |    |    |-- metadata: map (nullable = true)
            |    |    |    |-- key: string
            |    |    |    |-- value: string (valueContainsNull = true)
  20. def partitionUrls(urls: Array[String], headers: Map[String, String] = Map.empty): DataFrame

    Parses multiple URLs.

    urls

    List of URLs.

    headers

    Sets the necessary headers for the URL requests.

    returns

    DataFrame with parsed URL content.

    Example

    val htmlDf =
         Partition().partitionUrls(Array("https://www.wikipedia.org", "https://example.com/"))
    htmlDf.show()
    
    +--------------------+--------------------+
    |                 url|                html|
    +--------------------+--------------------+
    |https://www.wikip...|[{Title, Wikipedi...|
    |https://example.com/|[{Title, Example ...|
    +--------------------+--------------------+
    
    htmlDf.printSchema()
    root
      |-- url: string (nullable = true)
      |-- html: array (nullable = true)
      |    |-- element: struct (containsNull = true)
      |    |    |-- elementType: string (nullable = true)
      |    |    |-- content: string (nullable = true)
      |    |    |-- metadata: map (nullable = true)
      |    |    |    |-- key: string
      |    |    |    |-- value: string (valueContainsNull = true)
  21. def partitionUrlsJava(urls: List[String], headers: Map[String, String] = new java.util.HashMap()): DataFrame
  22. def setOutputColumn(value: String): Partition.this.type
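    setOutputColumn renames the column holding the parsed elements; getOutputColumn returns the current name. A short sketch (the assumption that the default column name follows the reader, e.g. "txt", "html", or "email", is based on the examples above; running it requires a Spark NLP environment):

    ```scala
    // Rename the output column before partitioning.
    val partition = Partition(Map("content_type" -> "text/plain"))
    partition.setOutputColumn("elements")
    val df = partition.partition("/content/txtfiles/reader/txt")
    // partition.getOutputColumn should now return "elements"
    ```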
  23. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  24. def toString(): String
    Definition Classes
    AnyRef → Any
  25. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
