Trains an averaged perceptron model for part-of-speech (POS) tagging. Assigns a POS tag to each word in a sentence.
For pretrained models, see PerceptronModel.
The training data must be a Spark DataFrame with a column of Annotations of type POS. Each Annotation must have its result member set to the POS tag, and its metadata member must contain a "word" entry that maps to the tagged word.
This training DataFrame can easily be created with the helper class POS.
POS().readDataset(spark, datasetPath).selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
...
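To make the annotation layout described above concrete, here is a minimal, Spark-free sketch that builds the same kind of rows from one line of tagged text. PosAnnotation and PosAnnotationSketch are hypothetical stand-ins, not library classes. The input format (tokens separated by single spaces, each written as word|TAG) is an assumption, since the training file format is not described here.

```scala
// Hypothetical stand-in for the library's Annotation type, reduced to the
// members mentioned above. It is not the real class.
final case class PosAnnotation(
    annotatorType: String,
    begin: Int,
    end: Int,
    result: String,                // the POS tag
    metadata: Map[String, String]) // must hold "word" -> the tagged word

object PosAnnotationSketch {
  // Assumption: tokens are separated by single spaces and written as
  // word<delimiter>TAG. Tokens without a usable delimiter are skipped.
  def parseLine(line: String, delimiter: Char = '|'): Vector[PosAnnotation] = {
    val tokens = line.split(" ").toVector.filter(_.lastIndexOf(delimiter) > 0)
    val (_, annotations) = tokens.foldLeft((0, Vector.empty[PosAnnotation])) {
      case ((offset, acc), token) =>
        // Split on the last delimiter so a word may itself contain one.
        val cut = token.lastIndexOf(delimiter)
        val word = token.substring(0, cut)
        val tag = token.substring(cut + 1)
        val end = offset + word.length - 1 // inclusive end offset
        // The next word starts after one separating space.
        (end + 2, acc :+ PosAnnotation("pos", offset, end, tag, Map("word" -> word)))
    }
    annotations
  }
}
```

Under these assumptions, PosAnnotationSketch.parseLine("Pierre|NNP Vinken|NNP ,|,") produces three annotations with offsets 0 to 5, 7 to 12 and 14 to 14, the same begin and end values as the first three rows of the output above.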
For extended examples of usage, see the Spark NLP Workshop and PerceptronApproach tests.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val trainingPerceptronDF = POS().readDataset(spark, datasetPath)

val trainedPos = new PerceptronApproach()
  .setInputCols("document", "token")
  .setOutputCol("pos")
  .setPosColumn("tags")
  .fit(trainingPerceptronDF)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  trainedPos
))

val data = Seq("To be or not to be, is this the question?").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("pos.result").show(false)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
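For intuition about what "averaged perceptron" training means, the following is a minimal, self-contained sketch of the general technique. It is not the library's implementation. The object name, the feature strings and the naive averaging scheme (summing every weight after each example) are assumptions chosen for brevity.

```scala
object AveragedPerceptronSketch {
  // (feature, tag) -> weight
  type Weights = Map[(String, String), Double]

  // Sum of the weights of all active features for one candidate tag.
  def score(weights: Weights, features: Set[String], tag: String): Double =
    features.iterator.map(f => weights.getOrElse((f, tag), 0.0)).sum

  // Highest-scoring tag; on ties the earliest tag in `tags` wins.
  // Assumes `tags` is non-empty.
  def predict(weights: Weights, tags: Seq[String], features: Set[String]): String =
    tags.maxBy(tag => score(weights, features, tag))

  def train(examples: Seq[(Set[String], String)], tags: Seq[String], epochs: Int): Weights = {
    var weights: Weights = Map.empty
    var totals: Weights = Map.empty // running sum of each weight over all steps
    var steps = 0
    for (_ <- 1 to epochs; (features, gold) <- examples) {
      val guess = predict(weights, tags, features)
      if (guess != gold) {
        // Perceptron update: reward the correct tag, penalize the wrong guess.
        for (f <- features) {
          weights = weights.updated((f, gold), weights.getOrElse((f, gold), 0.0) + 1.0)
          weights = weights.updated((f, guess), weights.getOrElse((f, guess), 0.0) - 1.0)
        }
      }
      // Naive averaging: add the current weights to the totals after every example.
      for ((key, w) <- weights) totals = totals.updated(key, totals.getOrElse(key, 0.0) + w)
      steps += 1
    }
    // The averaged weights are the totals divided by the number of steps.
    totals.map { case (key, sum) => key -> sum / steps }
  }
}
```

Averaging the weights over all steps, rather than keeping only the final values, damps the effect of late updates and tends to generalize better. Production implementations usually avoid the per-step summation by tracking when each weight last changed.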
Distributed averaged perceptron model for part-of-speech (POS) tagging.
Assigns a POS tag to each word in a sentence. Its training data (train_pos) is a Spark Dataset of POS-format values with Annotation columns.
See https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/pos/perceptron/DistributedPos.scala for further reference on how to use this API.
Averaged perceptron model for part-of-speech (POS) tagging. Assigns a POS tag to each word in a sentence.
This is the instantiated model of the PerceptronApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with the pretrained method of the companion object:
val posTagger = PerceptronModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("pos")
The default model is "pos_anc" if no name is provided.
For available pretrained models, see the Models Hub. Additionally, pretrained pipelines are available for this module; see Pipelines.
For extended examples of usage, see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  posTagger
))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(pos) as pos").show(false)
+-------------------------------------------+
|pos                                        |
+-------------------------------------------+
|[pos, 0, 4, NNP, [word -> Peter], []]      |
|[pos, 6, 11, NNP, [word -> Pipers], []]    |
|[pos, 13, 21, NNS, [word -> employees], []]|
|[pos, 23, 25, VBP, [word -> are], []]      |
|[pos, 27, 33, VBG, [word -> picking], []]  |
|[pos, 35, 39, NNS, [word -> pecks], []]    |
|[pos, 41, 42, IN, [word -> of], []]        |
|[pos, 44, 50, JJ, [word -> pickled], []]   |
|[pos, 52, 58, NNS, [word -> peppers], []]  |
+-------------------------------------------+
This is the companion object of PerceptronApproach. Please refer to that class for the documentation.
This is the companion object of PerceptronApproachDistributed. Please refer to that class for the documentation.
This is the companion object of PerceptronModel. Please refer to that class for the documentation.
Holds all unique tags found during training
Contains unambiguous words and their tags
Contains prediction information based on context frequencies
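The three pieces of data described above could plausibly be combined at prediction time as in the sketch below. This is an illustration under stated assumptions, not the library's code: the names PerceptronData, tags, taggedWordBook, featuresWeight and predictTag are hypothetical, as is the nested feature -> tag -> weight layout and the feature strings used in the usage note.

```scala
object PerceptronPredictionSketch {
  // Hypothetical container for the three pieces of model data.
  final case class PerceptronData(
      tags: Seq[String],                                // all unique tags
      taggedWordBook: Map[String, String],              // unambiguous word -> tag
      featuresWeight: Map[String, Map[String, Double]]) // feature -> tag -> weight

  // Assumes `model.tags` is non-empty.
  def predictTag(model: PerceptronData, word: String, features: Seq[String]): String =
    model.taggedWordBook.get(word) match {
      // Unambiguous words skip scoring entirely.
      case Some(tag) => tag
      case None =>
        // Otherwise pick the tag with the highest summed feature weight.
        model.tags.maxBy { tag =>
          features.iterator
            .map(f => model.featuresWeight.get(f).flatMap(_.get(tag)).getOrElse(0.0))
            .sum
        }
    }
}
```

For example, with weights "suffix=ing" -> (VB: 2.0, NN: 0.5) and "prev=DT" -> (NN: 1.0), a word carrying both features scores 2.0 for VB and 1.5 for NN, so it is tagged VB. A word listed in the word book is tagged directly from it without any scoring.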