Spark NLP 6.0.2 ScalaDoc - com.johnsnowlabs.reader

class EmailReader extends Serializable

This class is used to read and parse email content.

class ExcelReader extends Serializable

This class is used to read and parse Excel files.

case class HTMLElement(elementType: String, content: String, metadata: Map[String, String]) extends Product with Serializable
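
Because HTMLElement is a plain case class, its values can be filtered and grouped with ordinary pattern matching. The following is a minimal sketch, not taken from the library: it redeclares a local stand-in with the same three fields so that it runs without Spark NLP on the classpath, and the element type names ("Title", "NarrativeText"), the metadata key ("pageNumber"), and the helper names are all assumptions made for illustration.

```scala
// Local stand-in that mirrors the documented signature (assumption: used here
// only so the sketch compiles without the Spark NLP dependency).
case class HTMLElement(elementType: String, content: String, metadata: Map[String, String])

object HtmlElementSketch {
  // Hypothetical sample values; real element types and metadata keys may differ.
  val elements: Seq[HTMLElement] = Seq(
    HTMLElement("Title", "Quarterly Report", Map("pageNumber" -> "1")),
    HTMLElement("NarrativeText", "Revenue grew in the third quarter.", Map("pageNumber" -> "1")),
    HTMLElement("Title", "Outlook", Map("pageNumber" -> "2"))
  )

  // Return the content of every element whose type equals `elementType`.
  def contentOf(elementType: String, items: Seq[HTMLElement]): Seq[String] =
    items.collect { case HTMLElement(`elementType`, content, _) => content }

  // Group element contents by a metadata key, skipping elements that lack the key.
  def groupByMetadata(key: String, items: Seq[HTMLElement]): Map[String, Seq[String]] =
    items
      .flatMap(e => e.metadata.get(key).map(_ -> e.content))
      .groupBy(_._1)
      .map { case (k, pairs) => k -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    println(contentOf("Title", elements))
    println(groupByMetadata("pageNumber", elements))
  }
}
```

Keeping the helpers independent of any reader class means the same functions could be applied to elements from whatever source produces them.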

class HTMLReader extends Serializable

Class to parse and read HTML files.

class PdfToText extends Transformer with DefaultParamsWritable with HasInputValidator with HasInputCol with HasOutputCol with HasLocalProcess with PdfToTextTrait with HasPdfProperties

Extracts text from a PDF document, either as a single string or as one string per page. The input is a column containing the binary representation of the PDF document. The output is a column with the extracted text and a column with the page number. If page splitting is enabled, each page is emitted as a separate row.

It can be configured with the following properties:

pageNumCol: Page number output column name.
originCol: Input column name with original path of file.
partitionNum: Number of partitions. By default, it is set to 0.
storeSplittedPdf: Whether to store the byte content of the split PDF. By default, it is set to false.
splitPage: Enable/disable splitting per page to identify page numbers and improve performance. By default, it is set to true.
onlyPageNum: Extract only page numbers. By default, it is set to false.
textStripper: Text stripper type used for output layout and formatting.
sort: Enable/disable sorting content on the page. By default, it is set to false.

Example

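    import com.johnsnowlabs.reader.PdfToText
    import org.apache.spark.ml.Pipeline

    // Store the bytes of the split PDF and emit one row per page.
    val pdfToText = new PdfToText()
      .setStoreSplittedPdf(true)
      .setSplitPage(true)

    // Load the PDF files as binary content.
    val filesDf = spark.read.format("binaryFile").load("Documents/files/pdf")
    val pipelineModel = new Pipeline()
      .setStages(Array(pdfToText))
      .fit(filesDf)

    val pdfDf = pipelineModel.transform(filesDf)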

pdfDf.show()
+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|                text|
+--------------------+--------------------+------+--------------------+
|file:/Users/paula...|2025-05-15 11:33:...| 25803|This is a Title \...|
|file:/Users/paula...|2025-05-15 11:33:...| 15629|                  \n|
|file:/Users/paula...|2025-05-15 11:33:...| 15629|                  \n|
|file:/Users/paula...|2025-05-15 11:33:...| 15629|                  \n|
|file:/Users/paula...|2025-05-15 11:33:...|  9487|   This is a page.\n|
|file:/Users/paula...|2025-05-15 11:33:...|  9487|This is another p...|
|file:/Users/paula...|2025-05-15 11:33:...|  9487| Yet another page.\n|
|file:/Users/paula...|2025-05-15 11:56:...|  1563|Hello, this is li...|
+--------------------+--------------------+------+--------------------+

pdfDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- text: string (nullable = true)
 |-- height_dimension: integer (nullable = true)
 |-- width_dimension: integer (nullable = true)
 |-- content: binary (nullable = true)
 |-- exception: string (nullable = true)
 |-- pagenum: integer (nullable = true)
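
With page splitting enabled, the output holds one row per page, so a downstream step may want to rebuild one text value per file. The following is a minimal sketch that uses only the standard Spark SQL API and the column names from the schema above (path, pagenum, text). The object name, the mergePages helper, and the sample rows (including their page numbers) are hypothetical, and the small DataFrame built in main stands in for the real PdfToText output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, count, sort_array, struct}

object PdfPagesSketch {
  // Rebuild one text value per file from per-page rows, ordered by page number.
  // sort_array orders the structs by their first field (pagenum).
  def mergePages(pages: DataFrame): DataFrame =
    pages
      .groupBy("path")
      .agg(
        count("pagenum").as("pageCount"),
        sort_array(collect_list(struct(col("pagenum"), col("text")))).as("pages"))
      .select(col("path"), col("pageCount"), concat_ws("", col("pages.text")).as("text"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("pdf-pages-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the PdfToText output: only the columns used here.
    val pdfDf = Seq(
      ("file:/tmp/a.pdf", 1, "second page\n"),
      ("file:/tmp/a.pdf", 0, "first page\n"),
      ("file:/tmp/b.pdf", 0, "only page\n")
    ).toDF("path", "pagenum", "text")

    mergePages(pdfDf).show(truncate = false)
    spark.stop()
  }
}
```

Collecting (pagenum, text) pairs into a struct before sorting keeps the page order explicit, since the order in which collect_list gathers rows is not guaranteed after a shuffle.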

trait PdfToTextTrait extends Logging with PdfUtils

class PowerPointReader extends Serializable

Class to read and parse PowerPoint files.

class SparkNLPReader extends AnyRef

class TextReader extends Serializable

Class to read and parse text files.

class WordReader extends Serializable

Class to read and parse Word files.
