Represents an annotator's output parts and their details.
This class should grow once we start training on datasets and sharing params. For now it stands as a dummy placeholder for future reference.
This trait implements logic that applies NLP using Spark ML Pipeline transformers. It should strongly change once UserDefinedTypes are allowed: https://issues.apache.org/jira/browse/SPARK-7768
Converts a CHUNK type column back into DOCUMENT.
Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Location entities are extracted and converted back into DOCUMENT type for further processing.
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)

val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
Doc2Chunk for converting DOCUMENT annotations to CHUNK
PretrainedPipeline on how to use the PretrainedPipeline
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. The chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or ArrayType[StringType] (toggled with setIsArray). Useful for annotators that require a CHUNK type input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
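If each row provides only a single chunk string rather than an array, the chunk column can stay as StringType. A minimal sketch, assuming setIsArray defaults to false (the column name and data are illustrative):

// A minimal sketch, assuming setIsArray defaults to false so "target" holds
// a plain String that must occur inside the document text.
val singleChunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

val singleData = Seq(
  ("Spark NLP is an open-source text processing library.", "text processing library")
).toDF("text", "target")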
Chunk2Doc for converting CHUNK annotations to DOCUMENT
Prepares data into a format that is processable by Spark NLP.
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline.
The DocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
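As mentioned above, setCleanupMode can pre-process the raw text before it becomes a DOCUMENT. A minimal sketch, assuming "shrink" is one of the available modes (illustrative here; see the parameters section for the actual options):

// A minimal sketch: "shrink" is assumed to be a valid cleanup mode that
// collapses repeated whitespace and newlines before the document is created.
val cleaningAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")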
Extracts embeddings from Annotations into a more easily usable form.
Extracts embeddings from Annotations into a more easily usable form.
This is useful, for example, for WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.
By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors that are compatible with Spark ML functions such as LDA, K-means, the Random Forest classifier, or any other function that requires a featureCol.
For more extended examples see the Spark NLP Workshop.
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val gloveEmbeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "cleanTokens")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_sentence_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val data = Seq("Spark NLP is an open-source text processing library.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  gloveEmbeddings,
  embeddingsFinisher
)).fit(data)

val result = pipeline.transform(data)

val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
  .map { row =>
    val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
    (vector.size, vector)
  }.toDF("size", "vector")

resultWithSize.show(5, 80)
+----+--------------------------------------------------------------------------------+
|size|                                                                          vector|
+----+--------------------------------------------------------------------------------+
| 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
| 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
| 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
| 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
| 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+----+--------------------------------------------------------------------------------+
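Since setOutputAsVector(true) produces Spark ML vectors, the finished column can be passed straight to Spark ML estimators such as K-means. A minimal sketch continuing from the result above; the column alias and k = 2 are illustrative:

import org.apache.spark.ml.clustering.KMeans

// A minimal sketch: explode the per-token vectors into a "features" column
// and cluster them with Spark ML K-means (k = 2 is purely illustrative).
val features = result.selectExpr("explode(finished_sentence_embeddings) AS features")
val kmeansModel = new KMeans().setK(2).setFeaturesCol("features").setSeed(1L).fit(features)
kmeansModel.clusterCenters.foreach(center => println(center.size))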
Finisher for finishing Strings
Converts annotation results into a format that is easier to use.
Converts annotation results into a format that is easier to use. It is useful to extract the results from Spark NLP Pipelines. The Finisher outputs annotation values as Strings.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
val explainResult = pipeline.transform(data)

explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]] |
+------------------------------------------------------------------------------------------------------------------------------------------------------+

val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
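The Finisher also offers some control over the shape of its output. A minimal sketch, assuming setOutputAsArray and setIncludeMetadata are available setters, keeping the results as an array and emitting the annotation metadata alongside them:

// A minimal sketch, assuming these setters exist on Finisher:
// keep the finished values as an array and also output the annotation metadata.
val detailedFinisher = new Finisher()
  .setInputCols("entities")
  .setOutputCols("output")
  .setOutputAsArray(true)
  .setIncludeMetadata(true)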
EmbeddingsFinisher for finishing embeddings
Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.
Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.
This is a continuation of the example of GraphExtraction. To see how the graph is extracted, see the documentation of that class.
import com.johnsnowlabs.nlp.GraphFinisher

val graphFinisher = new GraphFinisher()
  .setInputCol("graph")
  .setOutputCol("graph_finished")
  .setOutputAsArray(false)

val finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(false)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text                                                 |graph_finished                                                         |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
GraphExtraction to extract the graph.
Trait used to create annotators with input columns of variable length.
AnnotatorApproaches may extend this trait in order to allow RecursivePipelines to include trained PipelineModels as intermediate steps.
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline

// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(false)

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
  .setInputCols("sentences", "cleanTokens")
  .setOutputCol("cleanText")

val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  tokenAssembler
)).fit(data)

val result = pipeline.transform(data)
result.select("cleanText").show(false)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
DocumentAssembler on the data structure
This is the companion object of Chunk2Doc.
This is the companion object of Chunk2Doc. Please refer to that class for the documentation.
This is the companion object of Doc2Chunk.
This is the companion object of Doc2Chunk. Please refer to that class for the documentation.
This is the companion object of DocumentAssembler.
This is the companion object of DocumentAssembler. Please refer to that class for the documentation.
This is the companion object of EmbeddingsFinisher.
This is the companion object of EmbeddingsFinisher. Please refer to that class for the documentation.
This is the companion object of Finisher.
This is the companion object of Finisher. Please refer to that class for the documentation.
This is the companion object of TokenAssembler.
This is the companion object of TokenAssembler. Please refer to that class for the documentation.
Represents an annotator's output parts and their details.
the type of annotation
the index of the first character under this annotation
the index after the last character under this annotation
associated metadata for this annotation
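The parameters above map onto the annotation schema shown in the DocumentAssembler example. A minimal sketch of constructing an Annotation by hand, assuming the case class mirrors that schema (annotatorType, begin, end, result, metadata, embeddings):

import com.johnsnowlabs.nlp.Annotation

// A minimal sketch, assuming the constructor mirrors the DataFrame schema above.
val annotation = Annotation(
  annotatorType = "document",
  begin = 0,
  end = 51,
  result = "Spark NLP is an open-source text processing library.",
  metadata = Map("sentence" -> "0"),
  embeddings = Array.emptyFloatArray
)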