Helper class to load a CoNLL type dataset for training.
Helper class for working with the CoNLL 2003 dataset for the NER task. The class is made for easy use from Java.
Helper class for creating DataFrames for training a part-of-speech tagger.
The dataset must contain one sentence per line, with each word joined to its respective tag by a delimiter:
Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ ,|, will|MD join|VB the|DT board|NN as|IN a|DT nonexecutive|JJ director|NN Nov.|NNP 29|CD .|.
The sentence can then be parsed with readDataset into a column with annotations of type POS.
In this example, the file test-training.txt has the content of the sentence above.
import com.johnsnowlabs.nlp.training.POS

val pos = POS()
val path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val posDf = pos.readDataset(spark, path, "|", "tags")

posDf.selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 16, 17, CD, [word -> 61], []]          |
|[pos, 19, 23, NNS, [word -> years], []]      |
|[pos, 25, 27, JJ, [word -> old], []]         |
|[pos, 29, 29, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
|[pos, 51, 52, IN, [word -> as], []]          |
|[pos, 47, 47, DT, [word -> a], []]           |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []]    |
|[pos, 78, 81, NNP, [word -> Nov.], []]       |
|[pos, 83, 84, CD, [word -> 29], []]          |
|[pos, 81, 81, ., [word -> .], []]            |
+---------------------------------------------+
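To make the expected line format concrete, the following is a minimal plain-Scala sketch of how one training line could be split into word and tag pairs. It is not Spark NLP's implementation: the object `PosLineSketch` and the function `parseLine` are hypothetical names introduced for illustration, and the sketch assumes whitespace-separated items with the tag after the last delimiter.

```scala
// Hypothetical helper, not part of Spark NLP: splits one training line
// of the form "word|TAG word|TAG ..." into (word, tag) pairs.
object PosLineSketch {
  def parseLine(line: String, delimiter: String = "|"): Seq[(String, String)] =
    line.trim.split("\\s+").toSeq.filter(_.nonEmpty).flatMap { item =>
      // Use the last delimiter so a word that itself contains the delimiter survives.
      val idx = item.lastIndexOf(delimiter)
      // Skip items without a delimiter, or with an empty word or an empty tag.
      if (idx <= 0 || idx + delimiter.length >= item.length) None
      else Some((item.substring(0, idx), item.substring(idx + delimiter.length)))
    }

  def main(args: Array[String]): Unit = {
    val pairs = parseLine("Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ")
    pairs.foreach { case (word, tag) => println(s"$word\t$tag") }
  }
}
```

Checking a few lines of a corpus this way before calling readDataset can help catch a wrong delimiter early.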
Helper class to load a CoNLL type dataset for training.
The dataset should be in the CoNLL 2003 format and needs to be loaded with readDataset. Other CoNLL datasets are not supported.
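As an illustration of the CoNLL 2003 layout, the plain-Scala sketch below groups token lines into sentences. It is a simplified stand-in and not the library's reader: `Conll2003FormatSketch`, `Token` and `parse` are hypothetical names, the sample lines are illustrative, and the sketch assumes four whitespace-separated columns per token line (word, POS tag, chunk tag, NER label), blank lines between sentences, and `-DOCSTART-` lines between documents.

```scala
// Hypothetical sketch, not Spark NLP's reader: groups CoNLL 2003 lines into sentences.
object Conll2003FormatSketch {
  // One token line: word, POS tag, chunk tag, NER label.
  final case class Token(word: String, pos: String, chunk: String, ner: String)

  def parse(lines: Seq[String]): Seq[Seq[Token]] = {
    val sentences = scala.collection.mutable.ArrayBuffer.empty[Seq[Token]]
    val current = scala.collection.mutable.ArrayBuffer.empty[Token]
    // Close the sentence being built, if it has any tokens.
    def flush(): Unit =
      if (current.nonEmpty) { sentences += current.toList; current.clear() }

    lines.foreach { raw =>
      val line = raw.trim
      if (line.isEmpty || line.startsWith("-DOCSTART-")) flush() // sentence or document boundary
      else line.split("\\s+") match {
        case Array(word, pos, chunk, ner) => current += Token(word, pos, chunk, ner)
        case _ => () // this sketch silently ignores malformed lines
      }
    }
    flush()
    sentences.toList
  }

  def main(args: Array[String]): Unit = {
    // Illustrative sample in the CoNLL 2003 column layout.
    val sample = Seq(
      "-DOCSTART- -X- -X- O",
      "",
      "EU NNP B-NP B-ORG",
      "rejects VBZ B-VP O",
      "German JJ B-NP B-MISC",
      "call NN I-NP O",
      "",
      "Peter NNP B-NP B-PER",
      "Blackburn NNP I-NP I-PER")
    parse(sample).foreach(s => println(s.map(t => s"${t.word}/${t.ner}").mkString(" ")))
  }
}
```

Files with a different column count or layout would fall into the ignored branch here, which mirrors why other CoNLL variants cannot be read as-is.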
Name of the DOCUMENT Annotator type column
Name of the sentences of DOCUMENT Annotator type column
Name of the TOKEN Annotator type column
Name of the POS Annotator type column
Index of the column for NER Label in the dataset
Index of the column for the POS tags in the dataset
Index of the column for the text in the dataset
Name of the NAMED_ENTITY Annotator type column
Whether to explode each sentence to a separate row
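The effect of the index and explode options can be sketched in plain Scala. This is only an assumption-laden illustration, not the library's code: `ConllOptionsSketch`, `Row`, `toRows` and the parameter names `posIndex`, `labelIndex` and `explodeSentences` are hypothetical, the word is assumed to sit in column 0, and the input is treated as a single document whose token lines are already split into columns.

```scala
// Hypothetical sketch of what the index and explode options select; not Spark NLP's code.
object ConllOptionsSketch {
  // One output row: words, POS tags and NER labels, kept aligned by position.
  final case class Row(words: Seq[String], pos: Seq[String], labels: Seq[String])

  // document: sentences of one document, each token line already split into columns.
  def toRows(
      document: Seq[Seq[Array[String]]],
      posIndex: Int,
      labelIndex: Int,
      explodeSentences: Boolean): Seq[Row] = {
    val perSentence = document.map { sentence =>
      Row(sentence.map(_(0)), sentence.map(_(posIndex)), sentence.map(_(labelIndex)))
    }
    if (explodeSentences) perSentence // one row per sentence
    else Seq(Row( // a single row for the whole document
      perSentence.flatMap(_.words),
      perSentence.flatMap(_.pos),
      perSentence.flatMap(_.labels)))
  }

  def main(args: Array[String]): Unit = {
    // Illustrative token lines in the CoNLL 2003 layout: word, POS, chunk, NER.
    val document = Seq(
      Seq("EU NNP B-NP B-ORG", "rejects VBZ B-VP O"),
      Seq("Peter NNP B-NP B-PER", "Blackburn NNP I-NP I-PER")
    ).map(_.map(_.split(" ")))

    println(toRows(document, posIndex = 1, labelIndex = 3, explodeSentences = true).length)
    println(toRows(document, posIndex = 1, labelIndex = 3, explodeSentences = false).length)
  }
}
```

With the CoNLL 2003 column order, the POS tag is column 1 and the NER label is column 3, which is why those indices are used in the usage lines.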