package partition

Type Members

  1. trait HasEmailReaderProperties extends ParamsAndFeaturesWritable
  2. trait HasExcelReaderProperties extends ParamsAndFeaturesWritable
  3. trait HasHTMLReaderProperties extends ParamsAndFeaturesWritable
  4. trait HasPowerPointProperties extends ParamsAndFeaturesWritable
  5. trait HasReaderProperties extends ParamsAndFeaturesWritable
  6. trait HasTextReaderProperties extends ParamsAndFeaturesWritable
  7. class Partition extends Serializable

    The Partition class is a unified interface for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.

    Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.

    The class detects the appropriate reader either from the file extension or a provided MIME contentType, and delegates to the relevant method of SparkNLPReader. Custom behavior (like title thresholds, page breaks, etc.) can be configured through the params map during initialization.

    By abstracting reader initialization, type detection, and parsing logic, Partition simplifies document ingestion in scalable NLP pipelines.
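As a minimal sketch of this flow (the params-map key, the file path, and the exact constructor form shown here are illustrative assumptions based on the description above, not verified API specifics; if the constructor expects a `java.util.Map`, convert with `.asJava`):

```scala
import com.johnsnowlabs.partition.Partition

// Configure reader behavior through the params map at initialization;
// the "contentType" key is an assumed example of such a parameter.
val partition = new Partition(Map("contentType" -> "text/html"))

// Point it at a file, directory, or URL (hypothetical path shown).
// The appropriate reader is selected from the file extension or the
// provided MIME contentType, and the parsed output comes back as a
// structured Spark DataFrame.
val resultDf = partition.partition("./example-docs/page.html")
resultDf.show()
```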

  8. class PartitionTransformer extends AnnotatorModel[PartitionTransformer] with HasSimpleAnnotate[PartitionTransformer] with HasReaderProperties with HasEmailReaderProperties with HasExcelReaderProperties with HasHTMLReaderProperties with HasPowerPointProperties with HasTextReaderProperties with HasPdfProperties

    The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows, enabling seamless reuse of your pipelines. PartitionTransformer can be used for extracting structured content from various document types using Spark NLP readers. It supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.

    Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.

    Example

    import com.johnsnowlabs.partition.PartitionTransformer
import com.johnsnowlabs.nlp.base.DocumentAssembler
    import org.apache.spark.ml.Pipeline
    import spark.implicits._
val testDataSet = Seq("https://www.blizzard.com", "https://www.google.com/").toDS.toDF("text")
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
  .setOutputCol("document")
    
    val partition = new PartitionTransformer()
      .setInputCols("document")
      .setOutputCol("partition")
      .setContentType("url")
      .setHeaders(Map("Accept-Language" -> "es-ES"))
    
    val pipeline = new Pipeline()
      .setStages(Array(documentAssembler, partition))
    
    val pipelineModel = pipeline.fit(testDataSet)
    val resultDf = pipelineModel.transform(testDataSet)
    
    resultDf.show()
    +--------------------+--------------------+--------------------+
    |                text|            document|           partition|
    +--------------------+--------------------+--------------------+
    |https://www.blizz...|[{Title, Juegos d...|[{document, 0, 16...|
    |https://www.googl...|[{Title, Gmail Im...|[{document, 0, 28...|
    +--------------------+--------------------+--------------------+

Value Members

  1. object Partition extends Serializable
