Represents an annotator's output parts and their details.
This class should grow once we start training on datasets and sharing params. For now it stands as a dummy placeholder for future reference.
This trait implements logic that applies NLP using Spark ML Pipeline transformers. It is expected to change significantly once UserDefinedTypes are allowed: https://issues.apache.org/jira/browse/SPARK-7768
Converts a CHUNK type column back into DOCUMENT.
Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Location entities are extracted and converted back into DOCUMENT type for further processing.
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)

val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
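Since the converted column is an ordinary DOCUMENT, it can be fed straight into annotators that expect one. A minimal sketch re-tokenizing the converted chunks, reusing the result DataFrame from above (the Tokenizer wiring and the "chunkToken" output name are illustrative, not part of Chunk2Doc itself):

import com.johnsnowlabs.nlp.annotator.Tokenizer

// Re-tokenize the DOCUMENT column produced by Chunk2Doc above.
// "chunkConverted" comes from the example; "chunkToken" is an arbitrary name chosen here.
val chunkTokenizer = new Tokenizer()
  .setInputCols("chunkConverted")
  .setOutputCol("chunkToken")

chunkTokenizer.fit(result).transform(result).selectExpr("chunkToken.result").show(false)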
Doc2Chunk for converting DOCUMENT annotations to CHUNK
PretrainedPipeline on how to use the PretrainedPipeline
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
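If the chunk column holds a single String per row rather than an array, setIsArray can be left at its default of false. A minimal sketch under that assumption, reusing the documentAssembler and Pipeline import from above (the "targetPhrase" column and variable names are illustrative):

// Single-string chunk column: one chunk per row, setIsArray is left at its default (false)
val singleChunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("targetPhrase")
  .setOutputCol("chunk")

val singleData = Seq(
  ("Spark NLP is an open-source text processing library.", "text processing library")
).toDF("text", "targetPhrase")

val singleResult = new Pipeline()
  .setStages(Array(documentAssembler, singleChunkAssembler))
  .fit(singleData)
  .transform(singleData)

singleResult.selectExpr("chunk.result").show(false)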
Chunk2Doc for converting CHUNK annotations to DOCUMENT
Prepares data into a format that is processable by Spark NLP.
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                        |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
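As a hedged sketch of the text pre-processing mentioned above, the example below assumes the "shrink" cleanup mode, which trims whitespace and collapses new lines; see the parameters section for the actual list of modes:

// Illustrative only: cleanupMode "shrink" trims leading/trailing whitespace and removes new lines
val cleaningAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

val messyData = Seq("  Spark NLP is an open-source \n text processing library.  ").toDF("text")
cleaningAssembler.transform(messyData).selectExpr("document.result").show(false)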
This transformer is designed to deal with embedding annotators, for example: WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.
This transformer is designed to deal with embedding annotators, for example: WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings. By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifiers or any other function that requires a featuresCol.
For more extended examples see the Spark NLP Workshop.
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}

// First the embeddings are extracted using the WordEmbeddingsModel
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val gloveEmbeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "cleanTokens")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

// Then the embeddings can be turned into a vector using the EmbeddingsFinisher
val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_sentence_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val data = Seq("Spark NLP is an open-source text processing library.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  gloveEmbeddings,
  embeddingsFinisher
)).fit(data)

val result = pipeline.transform(data)
result.select("finished_sentence_embeddings").show(false)
+--------------------------------------------------------------------------------------------------------+
|finished_sentence_embeddings                                                                              |
+--------------------------------------------------------------------------------------------------------+
|[[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.6856099963188171,0.5442799925804138...|
+--------------------------------------------------------------------------------------------------------+
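Since each finished element is a Spark ML Vector, the output can be handed to Spark ML estimators that expect a featuresCol. A minimal sketch, reusing the result DataFrame from the example above (the K-means clustering setup is illustrative only):

import org.apache.spark.ml.clustering.KMeans

// Each row of "finished_sentence_embeddings" holds an array of vectors (one per token),
// so explode it to obtain one feature vector per row before clustering.
val features = result.selectExpr("explode(finished_sentence_embeddings) AS features")

// Illustrative K-means run on the token vectors
val kmeansModel = new KMeans().setK(2).setFeaturesCol("features").fit(features)
kmeansModel.transform(features).select("features", "prediction").show(1)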
Finisher for finishing Strings
Converts annotation results into a format that is easier to use.
Converts annotation results into a format that is easier to use. It is useful to extract the results from Spark NLP Pipelines. The Finisher outputs annotation values into Strings.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
val explainResult = pipeline.transform(data)

explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]] |
+------------------------------------------------------------------------------------------------------------------------------------------------------+

val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
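If a single delimited String per row is preferred over an array of Strings, the output shape can be changed. A minimal sketch, assuming the standard setOutputAsArray parameter of the Finisher and reusing the explainResult DataFrame from above:

// Illustrative: disable array output so each row yields one concatenated String
val stringFinisher = new Finisher()
  .setInputCols("entities")
  .setOutputCols("output")
  .setOutputAsArray(false)

stringFinisher.transform(explainResult).select("output").show(false)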
EmbeddingsFinisher for finishing embeddings
AnnotatorApproaches may extend this trait in order to allow RecursivePipelines to include the trained PipelineModels of intermediate steps.
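As a rough illustration (not the trait's own API), a RecursivePipeline is assembled like a regular Spark ML Pipeline; the stage names below are assumed from the other examples in this documentation:

import com.johnsnowlabs.nlp.RecursivePipeline

// A RecursivePipeline is used like a regular Pipeline; annotators extending this trait
// additionally receive the PipelineModel trained from the preceding stages during fit.
val recursivePipeline = new RecursivePipeline()
  .setStages(Array(documentAssembler, sentenceDetector, tokenizer))

val recursiveModel = recursivePipeline.fit(data)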
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline

// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(false)

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
  .setInputCols("sentences", "cleanTokens")
  .setOutputCol("cleanText")

val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  tokenAssembler
)).fit(data)

val result = pipeline.transform(data)
result.select("cleanText").show(false)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
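To actually feed the reassembled document into a further annotator, a minimal hedged sketch re-tokenizing the cleaned text, reusing the result DataFrame from above (the "cleanToken" output name is illustrative):

// Re-tokenize the cleaned DOCUMENT produced by the TokenAssembler above.
// "cleanText" comes from the example; "cleanToken" is an arbitrary output name.
val cleanTokenizer = new Tokenizer()
  .setInputCols("cleanText")
  .setOutputCol("cleanToken")

cleanTokenizer.fit(result).transform(result).selectExpr("cleanToken.result").show(false)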
DocumentAssembler on the data structure
Represents an annotator's output parts and their details.
the type of annotation
the index of the first character under this annotation
the index of the last character under this annotation
associated metadata for this annotation
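As an illustration of these fields, the "New York" entity from the Chunk2Doc example above could be written out roughly as follows (the exact constructor signature is assumed here, so treat this as a sketch):

import com.johnsnowlabs.nlp.Annotation

// Sketch of the "New York" chunk from the Chunk2Doc example:
// begin and end are character offsets into the original text ("New York" spans 0 to 7).
val newYorkChunk = Annotation(
  annotatorType = "chunk",
  begin = 0,
  end = 7,
  result = "New York",
  metadata = Map("entity" -> "LOC", "sentence" -> "0", "chunk" -> "0"),
  embeddings = Array.emptyFloatArray
)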