com.johnsnowlabs.nlp.annotators.seq2seq
Internal types to show Rows as a relevant StructType. Should be deleted once Spark releases UserDefinedTypes to @developerAPI.
Takes a document and annotations and produces new annotations of this annotator's annotation type.
Annotations that correspond to inputAnnotationCols generated by previous annotators, if any.
Any number of annotations processed for every input annotation; not necessarily a one-to-one relationship.
Size of every batch (Default depends on model).
ConfigProto from TensorFlow, serialized into a byte array. Get with config_proto.SerializeToString().
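A minimal sketch of passing the serialized config, assuming the conventional setConfigProtoBytes setter; the int values below are placeholders, not a real serialized ConfigProto:

```scala
import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer

// The array stands in for a real ConfigProto serialized in Python via
// config_proto.SerializeToString(); these values are illustrative only.
val gpt2 = GPT2Transformer.pretrained()
  .setInputCols("documents")
  .setOutputCol("generation")
  .setConfigProtoBytes(Array(50, 2, 32, 1))
```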
Requirement for annotator copies.
Whether or not to use sampling; use greedy decoding otherwise (Default: false).
Override for additional custom schema checks
Size of every batch.
Input annotation columns currently used
Gets the annotation column name that will be generated.
A list of token ids which are ignored in the decoder's output (Default: Array()).
Input annotator type: DOCUMENT
Columns that contain annotations necessary to run this annotator. AnnotatorType is used for both input and output columns if not specified.
Maximum length of the sequence to be generated (Default: 20).
Holds the merges.txt coming from the RoBERTa model.
Minimum length of the sequence to be generated (Default: 0).
If set to an int > 0, all ngrams of that size can only occur once (Default: 0).
Output annotator type: DOCUMENT
Optional random seed for the model. Needs to be of type Long.
The parameter for repetition penalty (Default: 1.0). 1.0 means no penalty. See this paper for more details.
Size of every batch.
Overrides the required annotators column if different than the default
Overrides the annotation column name when transforming
Set the transformer task, e.g. "summarize:" (Default: "").
The value used to modulate the next token probabilities (Default: 1.0).
The number of highest probability vocabulary tokens to keep for top-k filtering (Default: 50).
If set to a float < 1.0, only the most probable tokens with probabilities that add up to topP or higher are kept for generation (Default: 1.0).
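Taken together, a minimal sketch of configuring these generation parameters, assuming the setter names mirror the parameter names above (the values shown are the documented defaults, not recommendations):

```scala
import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer

// Sketch of the text-generation knobs documented above.
val gpt2 = GPT2Transformer.pretrained()
  .setInputCols("documents")
  .setOutputCol("generation")
  .setTask("")                 // prepended task prefix, e.g. "summarize:"
  .setMinOutputLength(0)       // minimum length of the generated sequence
  .setMaxOutputLength(20)      // maximum length of the generated sequence
  .setDoSample(false)          // greedy decoding when false
  .setTemperature(1.0)         // scales next-token probabilities
  .setTopK(50)                 // keep the k most probable tokens
  .setTopP(1.0)                // nucleus sampling threshold
  .setRepetitionPenalty(1.0)   // 1.0 means no penalty
  .setNoRepeatNgramSize(0)     // ngrams of this size occur at most once
  .setIgnoreTokenIds(Array())  // token ids dropped from the decoder output
```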
Given the requirements are met, this applies the ML transformation within a Pipeline or stand-alone. The output annotation will be generated as a new column; previous annotations are still available separately. Metadata is built at the schema level to record the annotations' structural information outside of their content.
Dataset[Row]
Requirement for pipeline transformation validation. It is called on fit().
Required uid for storing the annotator to disk.
Takes a Dataset and checks to see if all the required annotation types are present.
Dataset to be validated
True if all the required types are present, else false
Vocabulary used to encode the words to ids with bpeTokenizer.encode
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
GPT-2: the OpenAI Text-To-Text Transformer
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
Pretrained models can be loaded with pretrained of the companion object. The default model is "gpt2", if no name is provided. For available pretrained models, please see the Models Hub. For extended examples of usage, see GPT2TestSpec.
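A minimal loading sketch (the column names are illustrative):

```scala
import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer

// Loads the default "gpt2" model from the Models Hub.
val gpt2 = GPT2Transformer.pretrained()
  .setInputCols("documents")
  .setOutputCol("generation")
```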
Sources:
Paper Abstract:
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
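A minimal end-to-end sketch assembled from the parameters documented above; the column names and parameter values are illustrative, and a SparkSession named spark is assumed to be in scope:

```scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer
import org.apache.spark.ml.Pipeline
import spark.implicits._ // assumes an active SparkSession named `spark`

// Converts raw text into DOCUMENT annotations, the annotator's input type.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

// Loads the default model and configures a few generation parameters.
val gpt2 = GPT2Transformer.pretrained("gpt2")
  .setInputCols("documents")
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))

val data = Seq("My name is Leonardo.").toDF("text")
val result = pipeline.fit(data).transform(data)

// The generated continuation lands in the new "generation" column;
// the previous "documents" annotations remain available separately.
result.select("generation.result").show(truncate = false)
```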