package partition
Type Members
- trait HasEmailReaderProperties extends ParamsAndFeaturesWritable
- trait HasExcelReaderProperties extends ParamsAndFeaturesWritable
- trait HasHTMLReaderProperties extends ParamsAndFeaturesWritable
- trait HasPowerPointProperties extends ParamsAndFeaturesWritable
- trait HasReaderProperties extends ParamsAndFeaturesWritable
- trait HasTextReaderProperties extends ParamsAndFeaturesWritable
-
class
Partition extends Serializable
The Partition class is a unified interface for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.
Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.
The class detects the appropriate reader either from the file extension or a provided MIME contentType, and delegates to the relevant method of SparkNLPReader. Custom behavior (like title thresholds, page breaks, etc.) can be configured through the params map during initialization.
By abstracting reader initialization, type detection, and parsing logic, Partition simplifies document ingestion in scalable NLP pipelines.
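As an illustrative sketch of this flow (the parameter key and file path below are assumptions for demonstration, not confirmed values; consult the SparkNLPReader documentation for the supported params), a Partition instance is built with a params map and then pointed at a source, letting the file extension drive reader selection:

import com.johnsnowlabs.partition.Partition

// Hypothetical params map: "contentType" forces a specific reader;
// omitting it lets Partition infer the reader from the file extension.
val partition = new Partition(Map("contentType" -> "text/html"))

// Delegates to the HTML reader of SparkNLPReader and returns
// the parsed content as a structured Spark DataFrame.
val htmlDf = partition.partition("./files/page.html")
htmlDf.show()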
-
class
PartitionTransformer extends AnnotatorModel[PartitionTransformer] with HasSimpleAnnotate[PartitionTransformer] with HasReaderProperties with HasEmailReaderProperties with HasExcelReaderProperties with HasHTMLReaderProperties with HasPowerPointProperties with HasTextReaderProperties with HasPdfProperties
The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows, enabling seamless reuse of your pipelines. PartitionTransformer can be used for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.
Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.
Example
import com.johnsnowlabs.partition.PartitionTransformer
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
import spark.implicits._

val testDataSet = Seq("https://www.blizzard.com", "https://www.google.com/")
  .toDS
  .toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val partition = new PartitionTransformer()
  .setInputCols("document")
  .setOutputCol("partition")
  .setContentType("url")
  .setHeaders(Map("Accept-Language" -> "es-ES"))

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, partition))

val pipelineModel = pipeline.fit(testDataSet)
val resultDf = pipelineModel.transform(testDataSet)

resultDf.show()
+--------------------+--------------------+--------------------+
|                text|            document|           partition|
+--------------------+--------------------+--------------------+
|https://www.blizz...|[{Title, Juegos d...|[{document, 0, 16...|
|https://www.googl...|[{Title, Gmail Im...|[{document, 0, 28...|
+--------------------+--------------------+--------------------+