Represents an annotator's output parts and their details.
This class should grow once we start training on datasets and sharing params. For now it stands as a dummy placeholder for future reference.
This trait implements logic that applies NLP using Spark ML Pipeline transformers. It is expected to change significantly once UserDefinedTypes are allowed: https://issues.apache.org/jira/browse/SPARK-7768
Converts a CHUNK type column back into DOCUMENT.
Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Location entities are extracted and converted back into DOCUMENT type for further processing.
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)

val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
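Since the converted column is an ordinary DOCUMENT, it can be fed straight into annotators that expect one. A minimal sketch re-tokenizing the converted chunks, reusing the result DataFrame from above (the Tokenizer wiring and the "chunkToken" output name are illustrative, not part of Chunk2Doc itself):

import com.johnsnowlabs.nlp.annotator.Tokenizer

// Re-tokenize the DOCUMENT column produced by Chunk2Doc above.
// "chunkConverted" comes from the example; "chunkToken" is an arbitrary name chosen here.
val chunkTokenizer = new Tokenizer()
  .setInputCols("chunkConverted")
  .setOutputCol("chunkToken")

chunkTokenizer.fit(result).transform(result).selectExpr("chunkToken.result").show(false)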
Doc2Chunk for converting DOCUMENT annotations to CHUNK
PretrainedPipeline on how to use the PretrainedPipeline
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
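If the chunk column holds a single String per row rather than an array, setIsArray can be left at its default of false. A minimal sketch under that assumption, reusing the documentAssembler and Pipeline import from above (the "targetPhrase" column and variable names are illustrative):

// Single-string chunk column: one chunk per row, setIsArray is left at its default (false)
val singleChunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("targetPhrase")
  .setOutputCol("chunk")

val singleData = Seq(
  ("Spark NLP is an open-source text processing library.", "text processing library")
).toDF("text", "targetPhrase")

val singleResult = new Pipeline()
  .setStages(Array(documentAssembler, singleChunkAssembler))
  .fit(singleData)
  .transform(singleData)

singleResult.selectExpr("chunk.result").show(false)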
Chunk2Doc for converting CHUNK annotations to DOCUMENT
Prepares data into a format that is processable by Spark NLP.
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                        |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
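As a hedged sketch of the text pre-processing mentioned above, the example below assumes the "shrink" cleanup mode, which trims whitespace and collapses new lines; see the parameters section for the actual list of modes:

// Illustrative only: cleanupMode "shrink" trims leading/trailing whitespace and removes new lines
val cleaningAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

val messyData = Seq("  Spark NLP is an open-source \n text processing library.  ").toDF("text")
cleaningAssembler.transform(messyData).selectExpr("document.result").show(false)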
This transformer is designed to deal with embedding annotators, for example: WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.
This transformer is designed to deal with embedding annotators, for example: WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings. By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifiers or any other function that requires a featuresCol.
For more extended examples see the Spark NLP Workshop.
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}

// First the embeddings are extracted using the WordEmbeddingsModel
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val gloveEmbeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "cleanTokens")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

// Then the embeddings can be turned into a vector using the EmbeddingsFinisher
val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_sentence_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val data = Seq("Spark NLP is an open-source text processing library.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  gloveEmbeddings,
  embeddingsFinisher
)).fit(data)

val result = pipeline.transform(data)
result.select("finished_sentence_embeddings").show(false)
+--------------------------------------------------------------------------------------------------------+
|finished_sentence_embeddings                                                                              |
+--------------------------------------------------------------------------------------------------------+
|[[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.6856099963188171,0.5442799925804138...|
+--------------------------------------------------------------------------------------------------------+
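Since each finished element is a Spark ML Vector, the output can be handed to Spark ML estimators that expect a featuresCol. A minimal sketch, reusing the result DataFrame from the example above (the K-means clustering setup is illustrative only):

import org.apache.spark.ml.clustering.KMeans

// Each row of "finished_sentence_embeddings" holds an array of vectors (one per token),
// so explode it to obtain one feature vector per row before clustering.
val features = result.selectExpr("explode(finished_sentence_embeddings) AS features")

// Illustrative K-means run on the token vectors
val kmeansModel = new KMeans().setK(2).setFeaturesCol("features").fit(features)
kmeansModel.transform(features).select("features", "prediction").show(1)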
Finisher for finishing Strings
Converts annotation results into a format that is easier to use.
Converts annotation results into a format that is easier to use. It is useful to extract the results from Spark NLP Pipelines. The Finisher outputs annotation values into Strings.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
val explainResult = pipeline.transform(data)

explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]] |
+------------------------------------------------------------------------------------------------------------------------------------------------------+

val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
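If a single delimited String per row is preferred over an array of Strings, the output shape can be changed. A minimal sketch, assuming the standard setOutputAsArray parameter of the Finisher and reusing the explainResult DataFrame from above:

// Illustrative: disable array output so each row yields one concatenated String
val stringFinisher = new Finisher()
  .setInputCols("entities")
  .setOutputCols("output")
  .setOutputAsArray(false)

stringFinisher.transform(explainResult).select("output").show(false)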
EmbeddingsFinisher for finishing embeddings
AnnotatorApproaches may extend this trait in order to allow RecursivePipelines to include the trained PipelineModels of intermediate steps.
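As a rough illustration (not the trait's own API), a RecursivePipeline is assembled like a regular Spark ML Pipeline; the stage names below are assumed from the other examples in this documentation:

import com.johnsnowlabs.nlp.RecursivePipeline

// A RecursivePipeline is used like a regular Pipeline; annotators extending this trait
// additionally receive the PipelineModel trained from the preceding stages during fit.
val recursivePipeline = new RecursivePipeline()
  .setStages(Array(documentAssembler, sentenceDetector, tokenizer))

val recursiveModel = recursivePipeline.fit(data)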
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline

// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(false)

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
  .setInputCols("sentences", "cleanTokens")
  .setOutputCol("cleanText")

val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  tokenAssembler
)).fit(data)

val result = pipeline.transform(data)
result.select("cleanText").show(false)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
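To actually feed the reassembled document into a further annotator, a minimal hedged sketch re-tokenizing the cleaned text, reusing the result DataFrame from above (the "cleanToken" output name is illustrative):

// Re-tokenize the cleaned DOCUMENT produced by the TokenAssembler above.
// "cleanText" comes from the example; "cleanToken" is an arbitrary output name.
val cleanTokenizer = new Tokenizer()
  .setInputCols("cleanText")
  .setOutputCol("cleanToken")

cleanTokenizer.fit(result).transform(result).selectExpr("cleanToken.result").show(false)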
DocumentAssembler on the data structure
Represents an annotator's output parts and their details.
the type of annotation
the index of the first character under this annotation
the index of the last character under this annotation
associated metadata for this annotation
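As an illustration of these fields, the "New York" entity from the Chunk2Doc example above could be written out roughly as follows (the exact constructor signature is assumed here, so treat this as a sketch):

import com.johnsnowlabs.nlp.Annotation

// Sketch of the "New York" chunk from the Chunk2Doc example:
// begin and end are character offsets into the original text ("New York" spans 0 to 7).
val newYorkChunk = Annotation(
  annotatorType = "chunk",
  begin = 0,
  end = 7,
  result = "New York",
  metadata = Map("entity" -> "LOC", "sentence" -> "0", "chunk" -> "0"),
  embeddings = Array.emptyFloatArray
)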