Packages

package reader

Type Members

  1. class CSVReader extends Serializable

    CSVReader partitions CSV files into structured elements with metadata, similar to ExcelReader.

  2. class EmailReader extends Serializable

    This class is used to read and parse email content.

  3. class ExcelReader extends Serializable

    This class is used to read and parse Excel files.

  4. case class HTMLElement(elementType: String, content: String, metadata: Map[String, String]) extends Product with Serializable
  5. class HTMLReader extends Serializable

    Class to parse and read HTML files.

  6. class MarkdownReader extends Serializable
  7. class PdfReader extends Serializable

    Class to parse and read PDF files.

  8. class PdfToText extends Transformer with DefaultParamsWritable with HasInputValidator with HasInputCol with HasOutputCol with HasLocalProcess with PdfToTextTrait with HasPdfProperties

    Extracts text from a PDF document, either as a single string or as one string per page. The input is a column containing the binary representation of the PDF document. The output consists of a text column and a page number column. If page splitting is enabled, each page is exploded into a separate row.

    It can be configured with the following properties:

    • pageNumCol: Page number output column name.
    • originCol: Input column name with original path of file.
    • partitionNum: Number of partitions. By default, it is set to 0.
    • storeSplittedPdf: Force storing the byte content of each split PDF. By default, it is set to false.
    • splitPage: Enable/disable splitting per page to identify page numbers and improve performance. By default, it is set to true.
    • onlyPageNum: Extract only page numbers. By default, it is set to false.
    • textStripper: Text stripper type used for output layout and formatting.
    • sort: Enable/disable sorting content on the page. By default, it is set to false.

    Example

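    import com.johnsnowlabs.reader.PdfToText
    import org.apache.spark.ml.Pipeline

    // Keep the bytes of each split page and emit one row per page.
    val pdfToText = new PdfToText()
      .setStoreSplittedPdf(true)
      .setSplitPage(true)

    // The binaryFile source yields path, modificationTime, length, and content columns.
    val filesDf = spark.read.format("binaryFile").load("Documents/files/pdf")
    val pipelineModel = new Pipeline()
      .setStages(Array(pdfToText))
      .fit(filesDf)

    val pdfDf = pipelineModel.transform(filesDf)

    pdfDf.show()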
    +--------------------+--------------------+------+--------------------+
    |                path|    modificationTime|length|                text|
    +--------------------+--------------------+------+--------------------+
    |file:/Users/paula...|2025-05-15 11:33:...| 25803|This is a Title \...|
    |file:/Users/paula...|2025-05-15 11:33:...| 15629|                  \n|
    |file:/Users/paula...|2025-05-15 11:33:...| 15629|                  \n|
    |file:/Users/paula...|2025-05-15 11:33:...| 15629|                  \n|
    |file:/Users/paula...|2025-05-15 11:33:...|  9487|   This is a page.\n|
    |file:/Users/paula...|2025-05-15 11:33:...|  9487|This is another p...|
    |file:/Users/paula...|2025-05-15 11:33:...|  9487| Yet another page.\n|
    |file:/Users/paula...|2025-05-15 11:56:...|  1563|Hello, this is li...|
    +--------------------+--------------------+------+--------------------+
    
    pdfDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- modificationTime: timestamp (nullable = true)
     |-- length: long (nullable = true)
     |-- text: string (nullable = true)
     |-- height_dimension: integer (nullable = true)
     |-- width_dimension: integer (nullable = true)
     |-- content: binary (nullable = true)
     |-- exception: string (nullable = true)
     |-- pagenum: integer (nullable = true)
  9. trait PdfToTextTrait extends Logging with PdfUtils
  10. class PowerPointReader extends Serializable

    Class to read and parse PowerPoint files.

  11. class Reader2Doc extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol with HasReaderProperties with HasEmailReaderProperties with HasExcelReaderProperties with HasHTMLReaderProperties with HasPowerPointProperties with HasTextReaderProperties with HasXmlReaderProperties

    The Reader2Doc annotator makes it easier to read files within existing Spark NLP workflows, so that existing pipelines can be reused. It extracts structured content from various document types using Spark NLP readers, supports many file types, and returns the parsed output as a structured Spark DataFrame.

    Supported formats include plain text, HTML, Word (.doc/.docx), Excel (.xls/.xlsx), PowerPoint (.ppt/.pptx), email files (.eml, .msg), and PDFs.

    Example

    import com.johnsnowlabs.reader.Reader2Doc
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import org.apache.spark.ml.Pipeline

    // pdfDirectory and emptyDataSet are assumed to be defined elsewhere.
    val reader2Doc = new Reader2Doc()
      .setContentType("application/pdf")
      .setContentPath(s"$pdfDirectory/")

    val pipeline = new Pipeline()
      .setStages(Array(reader2Doc))

    val pipelineModel = pipeline.fit(emptyDataSet)
    val resultDf = pipelineModel.transform(emptyDataSet)
    
    resultDf.show()
    +------------------------------------------------------------------------------------------------------------------------------------+
    |document                                                                                                                            |
    +------------------------------------------------------------------------------------------------------------------------------------+
    |[{document, 0, 14, This is a Title, {pageNumber -> 1, elementType -> Title, fileName -> pdf-title.pdf}, []}]                        |
    |[{document, 15, 38, This is a narrative text, {pageNumber -> 1, elementType -> NarrativeText, fileName -> pdf-title.pdf}, []}]      |
    |[{document, 39, 68, This is another narrative text, {pageNumber -> 1, elementType -> NarrativeText, fileName -> pdf-title.pdf}, []}]|
    +------------------------------------------------------------------------------------------------------------------------------------+
  12. class SparkNLPReader extends Serializable
  13. class TextReader extends Serializable

    Class to read and parse text files.

  14. class WordReader extends Serializable

    Class to read and parse Word files.

  15. class XMLReader extends Serializable

    Class to parse and read XML files.
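
The HTMLElement case class listed above has no description, so the following is a minimal sketch of how values with that shape (elementType, content, metadata) might be filtered and counted. It uses a local stand-in with the documented signature rather than the library class. The sample values, the "pageNumber" metadata key, the HTMLElementSketch object, and the helpers contentOf and countsByType are all hypothetical and exist only for illustration.

```scala
// Local stand-in mirroring the documented signature; not the library class itself.
final case class HTMLElement(
    elementType: String,
    content: String,
    metadata: Map[String, String])

object HTMLElementSketch {
  // Hypothetical sample values; the type labels and metadata key are assumptions.
  val elements: Seq[HTMLElement] = Seq(
    HTMLElement("Title", "This is a Title", Map("pageNumber" -> "1")),
    HTMLElement("NarrativeText", "This is a narrative text", Map("pageNumber" -> "1")),
    HTMLElement("NarrativeText", "This is another narrative text", Map("pageNumber" -> "1"))
  )

  // Return the content of every element whose type matches the requested one.
  def contentOf(elementType: String, items: Seq[HTMLElement]): Seq[String] =
    items.filter(_.elementType == elementType).map(_.content)

  // Count how many elements there are of each type.
  def countsByType(items: Seq[HTMLElement]): Map[String, Int] =
    items.groupBy(_.elementType).map { case (kind, group) => kind -> group.size }
}
```

Because the stand-in is an ordinary case class, the same filtering works on any collection of values with these three fields; whether the reader classes return such a collection directly is not stated on this page.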

Value Members

  1. object ElementType
  2. object MimeType
  3. object PdfToText extends DefaultParamsReadable[PdfToText] with Serializable
  4. object Reader2Doc extends DefaultParamsReadable[Reader2Doc] with Serializable
