package nlp
Type Members
-
case class
Annotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends IAnnotation with Product with Serializable
Represents an annotator's output parts and their details (a construction sketch follows the parameter list below).
- annotatorType
the type of annotation
- begin
the index of the first character under this annotation
- end
the index after the last character under this annotation
- metadata
associated metadata for this annotation
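For illustration only, a minimal sketch of constructing and reading an Annotation directly; the values below are made up for the example:

import com.johnsnowlabs.nlp.Annotation

// Hypothetical values chosen only to show the shape of the case class
val ann = Annotation(
  annotatorType = "token",
  begin = 0,
  end = 4,
  result = "Spark",
  metadata = Map("sentence" -> "0")
)

println(ann.result)               // Spark
println(ann.metadata("sentence")) // 0
println(ann.embeddings.length)    // 0, the default is Array.emptyFloatArray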
-
case class
AnnotationImage(annotatorType: String, origin: String, height: Int, width: Int, nChannels: Int, mode: Int, result: Array[Byte], metadata: Map[String, String]) extends IAnnotation with Product with Serializable
Represents ImageAssembler's output parts and their details
- annotatorType
Image annotator type
- origin
The origin of the image
- height
Height of the image in pixels
- width
Width of the image in pixels
- nChannels
Number of image channels
- mode
OpenCV-compatible type
- result
Result of the annotation
- metadata
Metadata of the annotation
-
abstract
class
AnnotatorApproach[M <: Model[M]] extends Estimator[M] with HasInputAnnotationCols with HasOutputAnnotationCol with HasOutputAnnotatorType with DefaultParamsWritable with CanBeLazy
This class should grow once we start training on datasets and sharing params. For now it stands as a dummy placeholder for future reference. A hedged sketch of a custom implementation is not included here; see the concrete annotators below for working examples.
-
abstract
class
AnnotatorModel[M <: Model[M]] extends Model[M] with RawAnnotator[M] with CanBeLazy
This class implements the logic that applies NLP using Spark ML Pipeline transformers. It should change significantly once UserDefinedTypes are allowed (https://issues.apache.org/jira/browse/SPARK-7768).
- trait CanBeLazy extends AnyRef
-
class
Chunk2Doc extends AnnotatorModel[Chunk2Doc] with HasSimpleAnnotate[Chunk2Doc]
Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Example
Location entities are extracted and converted back into DOCUMENT type for further processing:

import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)

val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
- See also
PretrainedPipeline on how to use the PretrainedPipeline
Doc2Chunk for converting DOCUMENT annotations to CHUNK
-
class
Doc2Chunk extends Model[Doc2Chunk] with RawAnnotator[Doc2Chunk]
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within the input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
- See also
Chunk2Doc for converting CHUNK annotations to DOCUMENT
-
class
DocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
-
class
EmbeddingsFinisher extends Transformer with DefaultParamsWritable
Extracts embeddings from Annotations into a more easily usable form. This is useful, for example, for WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.
By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier or any other function that requires a featureCol.
For more extended examples see the Spark NLP Workshop.
Example
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val gloveEmbeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "cleanTokens")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_sentence_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val data = Seq("Spark NLP is an open-source text processing library.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  gloveEmbeddings,
  embeddingsFinisher
)).fit(data)

val result = pipeline.transform(data)

val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
  .map { row =>
    val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
    (vector.size, vector)
  }.toDF("size", "vector")

resultWithSize.show(5, 80)
+----+--------------------------------------------------------------------------------+
|size|                                                                          vector|
+----+--------------------------------------------------------------------------------+
| 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
| 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
| 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
| 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
| 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+----+--------------------------------------------------------------------------------+
- See also
Finisher for finishing Strings
- class FeaturesReader[T <: HasFeatures] extends MLReader[T]
- class FeaturesWriter[T] extends MLWriter with HasFeatures
-
class
Finisher extends Transformer with DefaultParamsWritable
Converts annotation results into a format that is easier to use. It is useful to extract the results from Spark NLP Pipelines. The Finisher outputs annotation values as String.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val finisher = new Finisher().setInputCols("entities").setOutputCols("output")

val explainResult = pipeline.transform(data)
explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]  |
+------------------------------------------------------------------------------------------------------------------------------------------------------+

val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
- See also
EmbeddingsFinisher for finishing embeddings
-
class
GraphFinisher extends Transformer
Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.
Example
This is a continuation of the example of GraphExtraction. To see how the graph is extracted, see the documentation of that class.
import com.johnsnowlabs.nlp.GraphFinisher

val graphFinisher = new GraphFinisher()
  .setInputCol("graph")
  .setOutputCol("graph_finished")
  .setOutputAsArray(false)

val finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(false)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text                                                 |graph_finished                                                         |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
- See also
GraphExtraction to extract the graph.
- trait HasBatchedAnnotate[M <: Model[M]] extends AnyRef
- trait HasBatchedAnnotateImage[M <: Model[M]] extends AnyRef
- trait HasCaseSensitiveProperties extends ParamsAndFeaturesWritable
- trait HasClassifierActivationProperties extends ParamsAndFeaturesWritable
- trait HasEnableCachingProperties extends ParamsAndFeaturesWritable
- trait HasFeatures extends AnyRef
-
trait
HasImageFeatureProperties extends ParamsAndFeaturesWritable
Example of required parameters:
{ "do_normalize": true, "do_resize": true, "feature_extractor_type": "ViTFeatureExtractor", "image_mean": [ 0.5, 0.5, 0.5 ], "image_std": [ 0.5, 0.5, 0.5 ], "resample": 2, "size": 224 }
- trait HasInputAnnotationCols extends Params
-
trait
HasMultipleInputAnnotationCols extends HasInputAnnotationCols
Trait used to create annotators with input columns of variable length.
- trait HasOutputAnnotationCol extends Params
- trait HasOutputAnnotatorType extends AnyRef
- trait HasPretrained[M <: PipelineStage] extends AnyRef
-
trait
HasRecursiveFit[M <: Model[M]] extends AnyRef
AnnotatorApproaches may extend this trait in order to allow RecursivePipelines to pass them the PipelineModel trained on the intermediate (preceding) steps.
- trait HasRecursiveTransform[M <: Model[M]] extends AnyRef
- trait HasSimpleAnnotate[M <: Model[M]] extends AnyRef
- trait IAnnotation extends AnyRef
-
class
ImageAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol
Prepares images read by Spark into a format that is processable by Spark NLP. This component is needed to process images.
Example
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val pipeline = new Pipeline().setStages(Array(imageAssembler))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF.printSchema()
root
 |-- image_assembler: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- origin: string (nullable = true)
 |    |    |-- height: integer (nullable = false)
 |    |    |-- width: integer (nullable = false)
 |    |    |-- nChannels: integer (nullable = false)
 |    |    |-- mode: integer (nullable = false)
 |    |    |-- result: binary (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
- case class JavaAnnotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends IAnnotation with Product with Serializable
- class LightPipeline extends AnyRef
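LightPipeline carries no description on this page. As a hedged sketch of its typical use, it wraps an already fitted PipelineModel so that small inputs can be annotated directly on the driver without building a DataFrame. Here `pipeline` is assumed to be a fitted PipelineModel, such as the one produced in the EmbeddingsFinisher example above:

import com.johnsnowlabs.nlp.LightPipeline

// `pipeline` is assumed to be a fitted PipelineModel from one of the examples above
val lightPipeline = new LightPipeline(pipeline)

// Annotate a plain string; the result maps each output column name to its annotation results
val annotated: Map[String, Seq[String]] = lightPipeline.annotate("Spark NLP is an open-source text processing library.")
println(annotated.keys)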
-
class
MultiDocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an Array[String]. Additionally, MultiDocumentAssembler.setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.MultiDocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val multiDocumentAssembler = new MultiDocumentAssembler().setInputCols("text").setOutputCols("document")
val result = multiDocumentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
- trait ParamsAndFeaturesReadable[T <: HasFeatures] extends DefaultParamsReadable[T]
- trait ParamsAndFeaturesWritable extends DefaultParamsWritable with Params with HasFeatures
- trait RawAnnotator[M <: Model[M]] extends Model[M] with ParamsAndFeaturesWritable with HasOutputAnnotatorType with HasInputAnnotationCols with HasOutputAnnotationCol
- class RecursivePipeline extends Pipeline
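RecursivePipeline is listed without a description. Since it extends Pipeline, a hedged usage sketch is simply a drop-in replacement for a regular Spark ML Pipeline; its value, per HasRecursiveFit above, is that approaches extending that trait can see the PipelineModel trained on the preceding stages during fitting. The stages and data below are assumed to be the ones defined in the EmbeddingsFinisher example:

import com.johnsnowlabs.nlp.RecursivePipeline
import org.apache.spark.ml.PipelineModel

// Used exactly like org.apache.spark.ml.Pipeline
val recursivePipeline = new RecursivePipeline()
  .setStages(Array(
    documentAssembler, // reused from the EmbeddingsFinisher example above
    tokenizer
  ))

val model: PipelineModel = recursivePipeline.fit(data)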
- class RecursivePipelineModel extends Model[RecursivePipelineModel] with MLWritable with Logging
-
class
TokenAssembler extends AnnotatorModel[TokenAssembler] with HasSimpleAnnotate[TokenAssembler]
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline

// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(false)

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
  .setInputCols("sentences", "cleanTokens")
  .setOutputCol("cleanText")

val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  tokenAssembler
)).fit(data)

val result = pipeline.transform(data)
result.select("cleanText").show(false)
+----------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]  |
+----------------------------------------------------------------------------------------------------------------------------+
- See also
DocumentAssembler on the data structure
Value Members
- object ActivationFunction
- object Annotation extends Serializable
- object AnnotationImage extends Serializable
- object AnnotatorType
-
object
Chunk2Doc extends DefaultParamsReadable[Chunk2Doc] with Serializable
This is the companion object of Chunk2Doc. Please refer to that class for the documentation.
-
object
Doc2Chunk extends DefaultParamsReadable[Doc2Chunk] with Serializable
This is the companion object of Doc2Chunk. Please refer to that class for the documentation.
-
object
DocumentAssembler extends DefaultParamsReadable[DocumentAssembler] with Serializable
This is the companion object of DocumentAssembler. Please refer to that class for the documentation.
-
object
EmbeddingsFinisher extends DefaultParamsReadable[EmbeddingsFinisher] with Serializable
This is the companion object of EmbeddingsFinisher. Please refer to that class for the documentation.
-
object
Finisher extends DefaultParamsReadable[Finisher] with Serializable
This is the companion object of Finisher. Please refer to that class for the documentation.
-
object
MultiDocumentAssembler extends DefaultParamsReadable[MultiDocumentAssembler] with Serializable
This is the companion object of MultiDocumentAssembler. Please refer to that class for the documentation.
- object SparkNLP
-
object
TokenAssembler extends DefaultParamsReadable[TokenAssembler] with Serializable
This is the companion object of TokenAssembler. Please refer to that class for the documentation.
- object functions