com.johnsnowlabs.nlp.training
The PubTator format includes medical papers’ titles, abstracts, and tagged chunks.
For more information see PubTator Docs and MedMentions Docs.
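To make the three parts of the format concrete, here is a minimal sketch in plain Scala (no Spark) that parses one document block. It assumes the layout used by MedMentions: a `docId|t|title` line, a `docId|a|abstract` line, then tab-separated annotation lines of the form `docId, start, end, mention, semantic types, concept ID`. The `Mention` and `Paper` case classes, the `PubTatorSketch` object, and the sample record are hypothetical and made up for illustration; none of them are part of Spark NLP.

```scala
// Hypothetical types for illustration only; they are not part of Spark NLP.
final case class Mention(
    start: Int,
    end: Int,
    text: String,
    semanticTypes: List[String],
    conceptId: String)

final case class Paper(
    docId: String,
    title: String,
    abstractText: String,
    mentions: List[Mention])

object PubTatorSketch {

  // A made-up record; the concept ID is a placeholder, not a real identifier.
  val sample: String = Seq(
    "1000001|t|DCTN4 as a modifier",
    "1000001|a|A toy abstract.",
    "1000001\t0\t5\tDCTN4\tT116\tC0000000"
  ).mkString("\n")

  /** Parses one document block: a title line, an abstract line, and annotation lines. */
  def parse(block: String): Paper = {
    val lines = block.split("\n").map(_.trim).filter(_.nonEmpty).toList

    // Text lines look like docId|t|text or docId|a|text.
    // The limit of 3 keeps any '|' characters inside the text itself.
    val texts = lines.map(_.split("\\|", 3)).collect {
      case Array(id, kind, text) if kind == "t" || kind == "a" => (id, kind, text)
    }

    // Annotation lines have exactly six tab-separated fields.
    // Assumption: several semantic types may be comma-separated in one field.
    val mentions = lines.map(_.split("\t")).collect {
      case Array(_, start, end, text, types, concept) =>
        Mention(start.toInt, end.toInt, text, types.split(",").toList, concept)
    }

    Paper(
      docId = texts.headOption.map(_._1).getOrElse(""),
      title = texts.collectFirst { case (_, "t", text) => text }.getOrElse(""),
      abstractText = texts.collectFirst { case (_, "a", text) => text }.getOrElse(""),
      mentions = mentions
    )
  }

  def main(args: Array[String]): Unit = {
    val paper = parse(sample)
    println(paper.title)
    paper.mentions.foreach(println)
  }
}
```

The character offsets in each annotation line index into the document text, which is why `start` and `end` are kept as integers: in the sample, offsets 0 to 5 of the title cover the mention `DCTN4`.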
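readDataset creates a Spark DataFrame from a PubTator text file. As the example output shows, each row carries a doc_id together with the document's tokens (finished_token), part-of-speech tags (finished_pos), and NER labels (finished_ner), plus the corresponding metadata columns.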
Example
import com.johnsnowlabs.nlp.training.PubTator
import com.johnsnowlabs.nlp.util.io.ResourceHelper

// Path to a PubTator-formatted sample file
val pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
// Read the file into a DataFrame, using the SparkSession held by ResourceHelper
val pubTatorDataSet = PubTator().readDataset(ResourceHelper.spark, pubTatorFile)
pubTatorDataSet.show(1)
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|  doc_id|      finished_token|        finished_pos|        finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...|   [[sentence, 0], [...| [[word, DCTN4], [...|   [[word, DCTN4], [...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
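Once the DataFrame is loaded, a quick sanity check is to look at how the NER tags are distributed. The sketch below assumes that finished_ner is an array-of-strings column, as the output above suggests. The `PubTatorTagCounts` object and its `tagCounts` helper are hypothetical names introduced here for illustration; only standard Spark SQL functions are used.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc, explode}

// Hypothetical helper for illustration; it is not part of Spark NLP.
object PubTatorTagCounts {

  /** Counts how often each NER tag occurs across all documents.
    *
    * Assumes `df` has a `finished_ner` column of type array<string>.
    */
  def tagCounts(df: DataFrame): DataFrame =
    df.select(explode(col("finished_ner")).as("tag")) // one row per token-level tag
      .groupBy("tag")
      .count()
      .orderBy(desc("count")) // most frequent tags first
}
```

With the DataFrame from the example, this could be called as `PubTatorTagCounts.tagCounts(pubTatorDataSet).show()`. In token-level tagging the `O` tag usually dominates, so a table with only a handful of `B-`/`I-` tags may point to a problem with the input file.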