Helper class to load a CoNLL-type dataset for training.
Helper class to work with the CoNLL 2003 dataset for the NER task. The class is designed for easy use from Java.
Instantiates the class to read a CoNLL-U dataset.
The dataset should be in the CoNLL-U format and needs to be specified with readDataset, which will create a DataFrame with the data.
import com.johnsnowlabs.nlp.training.CoNLLU

val conlluFile = "src/test/resources/conllu/en.test.conllu"
val conllDataSet = CoNLLU(false).readDataset(ResourceHelper.spark, conlluFile)
conllDataSet.selectExpr("text", "form.result as form", "upos.result as upos", "xpos.result as xpos", "lemma.result as lemma")
  .show(1, false)

+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|text                                   |form                                          |upos                                         |xpos                          |lemma                                       |
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|What if Google Morphed Into GoogleOS?  |[What, if, Google, Morphed, Into, GoogleOS, ?]|[PRON, SCONJ, PROPN, VERB, ADP, PROPN, PUNCT]|[WP, IN, NNP, VBD, IN, NNP, .]|[what, if, Google, morph, into, GoogleOS, ?]|
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
Whether to split each sentence into a separate row
Helper class for creating DataFrames for training a part-of-speech tagger.
The dataset needs to consist of one sentence per line, with each word delimited from its respective tag:
Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ ,|, will|MD join|VB the|DT board|NN as|IN a|DT nonexecutive|JJ director|NN Nov.|NNP 29|CD .|.
The sentence can then be parsed with readDataset into a column with annotations of type POS.
In this example, the file test-training.txt has the content of the sentence above.
import com.johnsnowlabs.nlp.training.POS

val pos = POS()
val path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val posDf = pos.readDataset(spark, path, "|", "tags")

posDf.selectExpr("explode(tags) as tags").show(false)

+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 16, 17, CD, [word -> 61], []]          |
|[pos, 19, 23, NNS, [word -> years], []]      |
|[pos, 25, 27, JJ, [word -> old], []]         |
|[pos, 29, 29, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
|[pos, 51, 52, IN, [word -> as], []]          |
|[pos, 54, 54, DT, [word -> a], []]           |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []]    |
|[pos, 78, 81, NNP, [word -> Nov.], []]       |
|[pos, 83, 84, CD, [word -> 29], []]          |
|[pos, 86, 86, ., [word -> .], []]            |
+---------------------------------------------+
The PubTator format includes medical papers’ titles, abstracts, and tagged chunks.
For more information see PubTator Docs and MedMentions Docs.
readDataset is used to create a Spark DataFrame from a PubTator text file.
import com.johnsnowlabs.nlp.training.PubTator

val pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
val pubTatorDataSet = PubTator().readDataset(ResourceHelper.spark, pubTatorFile)
pubTatorDataSet.show(1)

+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|  doc_id|      finished_token|        finished_pos|        finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...|   [[sentence, 0], [...| [[word, DCTN4], [...|   [[word, DCTN4], [...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
Helper class to load a CoNLL type dataset for training.
The dataset should be in the CoNLL 2003 format and needs to be read with readDataset. Other CoNLL datasets are not supported. Two types of input paths are supported:

- Folder: a path ending in *, representing a collection of CoNLL files within a directory, e.g. 'path/to/multiple/conlls/*'. Using this pattern will result in all the files being read into a single DataFrame. Some constraints apply to the schemas of the multiple files.
- File: a path to a single file, e.g. 'path/to/single_file.conll'.
Example
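A minimal sketch of reading a CoNLL 2003 file with readDataset. The file path is illustrative, and the selected columns assume the default column names (text, token, pos, label):

```scala
import com.johnsnowlabs.nlp.training.CoNLL

// Read a single CoNLL 2003 file into a DataFrame.
// The path is illustrative and must point to an actual dataset.
val conllFile = "src/test/resources/conll2003/eng.train"
val trainingData = CoNLL().readDataset(ResourceHelper.spark, conllFile)

// Inspect the generated annotation columns.
trainingData
  .selectExpr("text", "token.result as tokens", "pos.result as pos", "label.result as label")
  .show(3, false)
```

The resulting DataFrame can be passed directly as training data to annotators such as a NER model trainer.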
- Name of the DOCUMENT Annotator type column
- Name of the sentences DOCUMENT Annotator type column
- Name of the TOKEN Annotator type column
- Name of the POS Annotator type column
- Index of the column for the NER label in the dataset
- Index of the column for the POS tags in the dataset
- Index of the column for the text in the dataset
- Name of the NAMED_ENTITY Annotator type column
- Whether to explode each sentence to a separate row
- Delimiter used to separate columns inside the CoNLL file
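The parameters above can be overridden when instantiating the class. A sketch, assuming the constructor parameter names explodeSentences and delimiter (the names are inferred from the parameter descriptions above and may differ between versions):

```scala
import com.johnsnowlabs.nlp.training.CoNLL

// Keep each document as a single row instead of exploding sentences,
// and read columns separated by tabs. The parameter names here are
// assumptions based on the parameter descriptions above.
val conll = CoNLL(explodeSentences = false, delimiter = "\t")
val trainingData = conll.readDataset(ResourceHelper.spark, "path/to/single_file.conll")
```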