class Partition extends Serializable
The Partition class is a unified interface for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.
Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.
The class selects the appropriate reader from either the file extension or a supplied MIME content type (the content_type parameter), then delegates to the matching method of SparkNLPReader. Custom behavior, such as title thresholds and page-break tagging, is configured through the params map at construction time.
By abstracting reader initialization, type detection, and parsing logic, Partition simplifies document ingestion in scalable NLP pipelines.
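The detection step described above can be pictured with a small, purely illustrative sketch. It is not the Spark NLP implementation: the object `ContentTypeSketch` and the function `resolveContentType` are hypothetical names, and the table simply pairs the extensions listed above with their standard MIME types. The only behavior taken from the description is that an explicit content type takes precedence over the file extension.

```scala
// Hypothetical illustration only: this is NOT the Spark NLP implementation.
object ContentTypeSketch {

  // Extension-to-MIME table for the formats listed above (standard MIME types).
  private val byExtension: Map[String, String] = Map(
    "txt"  -> "text/plain",
    "html" -> "text/html",
    "htm"  -> "text/html",
    "doc"  -> "application/msword",
    "docx" -> "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xls"  -> "application/vnd.ms-excel",
    "xlsx" -> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt"  -> "application/vnd.ms-powerpoint",
    "pptx" -> "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "eml"  -> "message/rfc822",
    "msg"  -> "application/vnd.ms-outlook",
    "pdf"  -> "application/pdf"
  )

  /** An explicit content type wins; otherwise fall back to the file extension. */
  def resolveContentType(path: String, explicit: Option[String] = None): Option[String] =
    explicit.orElse {
      val dot = path.lastIndexOf('.')
      if (dot < 0 || dot == path.length - 1) None // no extension present
      else byExtension.get(path.substring(dot + 1).toLowerCase)
    }
}
```

For example, `ContentTypeSketch.resolveContentType("notes", Some("text/plain"))` returns the explicit type even though the path has no extension, which mirrors how content_type overrides automatic detection.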
Instance Constructors
- new Partition(params: Map[String, String] = new java.util.HashMap())
- params
Map of custom configuration parameters. The supported keys are listed below, with the formats each applies to in parentheses:
- content_type (All): Override automatic file type detection.
- store_content (All): Include raw file content in the output DataFrame as a separate 'content' column.
- timeout (HTML): Timeout in seconds for fetching remote HTML content.
- title_font_size (HTML, Excel): Minimum font size used to identify titles based on formatting.
- include_page_breaks (Word, Excel): Whether to tag content with page break metadata.
- group_broken_paragraphs (Text): Whether to merge broken lines into full paragraphs using heuristics.
- title_length_size (Text): Max character length used to qualify text blocks as titles.
- paragraph_split (Text): Regex to detect paragraph boundaries when grouping lines.
- short_line_word_threshold (Text): Max word count for a line to be considered short.
- threshold (Text): Ratio of empty lines used to switch between newline-based and paragraph grouping.
- max_line_count (Text): Max lines evaluated when analyzing paragraph structure.
- include_slide_notes (PowerPoint): Whether to include speaker notes from slides as narrative text.
- infer_table_structure (Word, Excel, PowerPoint): Generate full HTML table structure from parsed table content.
- append_cells (Excel): Append all rows into a single content block instead of individual elements.
- cell_separator (Excel): String used to join cell values in a row for text output.
- add_attachment_content (Email): Include text content of plain-text attachments in the output.
- headers (HTML): HTTP headers to send with the request when the path is a URL.
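As a sketch of how several of these keys combine, the snippet below configures the Excel-related options and lets the file extension choose the reader. It assumes Spark NLP is on the classpath, that the class is importable as `com.johnsnowlabs.partition.Partition`, and that it runs in spark-shell or a notebook; the file path is hypothetical.

```scala
import com.johnsnowlabs.partition.Partition

// Keys come from the parameter list above; all values are passed as strings.
val excelParams = Map(
  "append_cells"   -> "true", // append all rows into a single content block
  "cell_separator" -> " | ",  // string used to join cell values in a row
  "store_content"  -> "true"  // add the raw file content as a 'content' column
)

// Hypothetical path: no content_type is set, so detection relies on the .xlsx extension.
val excelDf = Partition(excelParams).partition("/data/reports/sales.xlsx")
excelDf.printSchema()
```

Keys that do not apply to the detected format, such as Text-only options on an Excel file, are best left out so the map documents what the run actually uses.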
Example 1 (Reading Text Files)
    val txtDirectory = "/content/txtfiles/reader/txt"
    val textDf = Partition(Map("content_type" -> "text/plain")).partition(txtDirectory)
    textDf.show()

    +--------------------+--------------------+
    |                path|                 txt|
    +--------------------+--------------------+
    |file:/content/txt...|[{Title, BIG DATA...|
    +--------------------+--------------------+
Example 2 (Reading Email Files)
    val emailDirectory = "./email-files/test-several-attachments.eml"
    val partitionDf = Partition(Map("content_type" -> "message/rfc822")).partition(emailDirectory)
    partitionDf.show()

    +--------------------+--------------------+
    |                path|               email|
    +--------------------+--------------------+
    |file:/content/ema...|[{Title, Test Sev...|
    +--------------------+--------------------+
Example 3 (Reading Webpages)
    val htmlDf = Partition().partition("https://www.wikipedia.org")
    htmlDf.show()

    +--------------------+--------------------+
    |                 url|                html|
    +--------------------+--------------------+
    |https://www.wikip...|[{Title, Wikipedi...|
    +--------------------+--------------------+
For more examples, see examples/python/data-preprocessing/SparkNLP_Partition_Reader_Demo.ipynb
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##(): Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
- final def getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def getOutputColumn: String
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def partition(path: String, headers: Map[String, String] = new java.util.HashMap()): DataFrame
Takes a URL, file, or directory path, then reads and parses its content.
- path
Path to a file or to a local directory containing the files to parse. Also supports URLs and distributed file systems such as Databricks, HDFS, and Microsoft Fabric OneLake.
- headers
HTTP headers to send with the request when the path is a URL.
- returns
DataFrame with parsed file content.
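The following sketch shows one way to pass request headers when the path is a URL. Because the signature above defaults headers to a `java.util.HashMap`, the example assumes the parameter is a Java map and builds one directly. The header values are illustrative, and the import path `com.johnsnowlabs.partition.Partition` is an assumption.

```scala
import com.johnsnowlabs.partition.Partition

// Assumption: headers is a java.util.Map, as the default value in the signature suggests.
val headers = new java.util.HashMap[String, String]()
headers.put("User-Agent", "Mozilla/5.0 (compatible; docs-example)") // illustrative value
headers.put("Accept-Language", "en-US")

// "timeout" is the HTML parameter documented above (seconds for fetching remote content).
val htmlDf = Partition(Map("timeout" -> "10"))
  .partition("https://www.wikipedia.org", headers)

htmlDf.select("url").show(truncate = false)
```

For local files and directories the headers argument can simply be omitted, since it only affects URL requests.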
- def partitionBytesContent(input: Array[Byte]): Seq[HTMLElement]
- def partitionStringContent(input: String, headers: Map[String, String] = new java.util.HashMap()): Seq[HTMLElement]
- def partitionText(text: String): DataFrame
Parses text supplied directly as a string.
- text
Text data in the form of a string.
- returns
DataFrame with parsed text content.
Example
    val content =
      """
        |The big brown fox
        |was walking down the lane.
        |
        |At the end of the lane,
        |the fox met a bear.
        |""".stripMargin
    val textDf = Partition(Map("groupBrokenParagraphs" -> "true")).partitionText(content)
    textDf.show()

    +--------------------------------------+
    |txt                                   |
    +--------------------------------------+
    |[{NarrativeText, The big brown fox was|
    +--------------------------------------+

    textDf.printSchema()

    root
     |-- txt: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
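Since the txt column holds an array of structs, downstream code often wants one row per element. The sketch below flattens the output using standard Spark SQL functions; the field names come from the schema printed above, while the import path `com.johnsnowlabs.partition.Partition` and a spark-shell or notebook environment are assumptions.

```scala
import com.johnsnowlabs.partition.Partition
import org.apache.spark.sql.functions.{col, explode}

val content =
  """
    |The big brown fox
    |was walking down the lane.
    |
    |At the end of the lane,
    |the fox met a bear.
    |""".stripMargin

val textDf = Partition(Map("groupBrokenParagraphs" -> "true")).partitionText(content)

// Turn the array column into one row per parsed element.
val elements = textDf
  .select(explode(col("txt")).as("element"))
  .select(
    col("element.elementType").as("elementType"),
    col("element.content").as("content"),
    col("element.metadata").as("metadata"))

// Keep only narrative paragraphs; other element types (for example Title) are dropped.
elements.filter(col("elementType") === "NarrativeText").show(truncate = false)
```

Flattening early keeps later filters and joins simple, at the cost of losing the per-document grouping unless a document identifier column is carried along.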
- def partitionUrls(urls: Array[String], headers: Map[String, String] = Map.empty): DataFrame
Parses multiple URLs.
- urls
Array of URLs to parse.
- headers
HTTP headers to send with each URL request.
- returns
DataFrame with parsed URL content.
Example
    val htmlDf = Partition().partitionUrls(Array("https://www.wikipedia.org", "https://example.com/"))
    htmlDf.show()

    +--------------------+--------------------+
    |                 url|                html|
    +--------------------+--------------------+
    |https://www.wikip...|[{Title, Wikipedi...|
    |https://example.com/|[{Title, Example ...|
    +--------------------+--------------------+

    htmlDf.printSchema()

    root
     |-- url: string (nullable = true)
     |-- html: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
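A quick way to inspect what was extracted from each page is to count elements by type per URL. This sketch passes headers as a Scala Map, which is an assumption based on the `Map.empty` default in the signature above. The header value is illustrative, and the import path `com.johnsnowlabs.partition.Partition` is also assumed.

```scala
import com.johnsnowlabs.partition.Partition
import org.apache.spark.sql.functions.{col, explode}

val urls = Array("https://www.wikipedia.org", "https://example.com/")

// Assumption: partitionUrls takes a Scala Map, as its Map.empty default suggests.
val headers = Map("User-Agent" -> "Mozilla/5.0 (compatible; docs-example)")

val htmlDf = Partition().partitionUrls(urls, headers)

// One row per (url, elementType) pair with the number of elements of that type.
val counts = htmlDf
  .select(col("url"), explode(col("html")).as("element"))
  .groupBy(col("url"), col("element.elementType").as("elementType"))
  .count()

counts.show(truncate = false)
```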
- def partitionUrlsJava(urls: List[String], headers: Map[String, String] = new java.util.HashMap()): DataFrame
- def setOutputColumn(value: String): Partition.this.type
- final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()