package partition
Type Members
- trait HasEmailReaderProperties extends ParamsAndFeaturesWritable
- trait HasExcelReaderProperties extends ParamsAndFeaturesWritable
- trait HasHTMLReaderProperties extends ParamsAndFeaturesWritable
- trait HasPowerPointProperties extends ParamsAndFeaturesWritable
- trait HasReaderProperties extends ParamsAndFeaturesWritable
- trait HasTextReaderProperties extends ParamsAndFeaturesWritable
-
class
Partition extends Serializable
The Partition class is a unified interface for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.
Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.
The class detects the appropriate reader either from the file extension or a provided MIME contentType, and delegates to the relevant method of SparkNLPReader. Custom behavior (like title thresholds, page breaks, etc.) can be configured through the params map during initialization.
By abstracting reader initialization, type detection, and parsing logic, Partition simplifies document ingestion in scalable NLP pipelines.
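As an illustrative sketch of this flow (the parameter key and file path below are assumptions for demonstration, not confirmed values; consult the SparkNLPReader documentation for the supported params), a Partition instance is built with a params map and then pointed at a source, letting the file extension drive reader selection:

import com.johnsnowlabs.partition.Partition

// Hypothetical params map: "contentType" forces a specific reader;
// omitting it lets Partition infer the reader from the file extension.
val partition = new Partition(Map("contentType" -> "text/html"))

// Delegates to the HTML reader of SparkNLPReader and returns
// the parsed content as a structured Spark DataFrame.
val htmlDf = partition.partition("./files/page.html")
htmlDf.show()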
-
class
PartitionTransformer extends AnnotatorModel[PartitionTransformer] with HasSimpleAnnotate[PartitionTransformer] with HasReaderProperties with HasEmailReaderProperties with HasExcelReaderProperties with HasHTMLReaderProperties with HasPowerPointProperties with HasTextReaderProperties with HasPdfProperties
The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows, enabling seamless reuse of your pipelines. PartitionTransformer can be used for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.
Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.
Example
import com.johnsnowlabs.partition.PartitionTransformer
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
import spark.implicits._

val testDataSet = Seq("https://www.blizzard.com", "https://www.google.com/")
  .toDS
  .toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val partition = new PartitionTransformer()
  .setInputCols("document")
  .setOutputCol("partition")
  .setContentType("url")
  .setHeaders(Map("Accept-Language" -> "es-ES"))

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, partition))

val pipelineModel = pipeline.fit(testDataSet)
val resultDf = pipelineModel.transform(testDataSet)

resultDf.show()
+--------------------+--------------------+--------------------+
|                text|            document|           partition|
+--------------------+--------------------+--------------------+
|https://www.blizz...|[{Title, Juegos d...|[{document, 0, 16...|
|https://www.googl...|[{Title, Gmail Im...|[{document, 0, 28...|
+--------------------+--------------------+--------------------+