com.johnsnowlabs.nlp.embeddings
Input annotation columns currently used.
Gets the name of the annotation column that will be generated.
Input Annotator Types: TOKEN
Columns that contain the annotations necessary to run this annotator. If not specified, the AnnotatorType is used for both input and output columns.
Param for maximum number of iterations (>= 0) (Default: 1).
Sets the maximum length (in words) of each sentence in the input data (Default: 1000). Any sentence longer than this threshold is divided into chunks of up to maxSentenceLength words.
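The chunking behaviour described for maxSentenceLength can be illustrated with a minimal plain-Python sketch. It is not the library's code: the helper name `split_into_chunks` is hypothetical, and it assumes the sentence is already tokenized into a list of words.

```python
def split_into_chunks(tokens, max_sentence_length=1000):
    """Split a token list into consecutive chunks of at most
    max_sentence_length tokens (hypothetical helper, for illustration)."""
    if max_sentence_length <= 0:
        raise ValueError("max_sentence_length must be positive")
    return [
        tokens[i:i + max_sentence_length]
        for i in range(0, len(tokens), max_sentence_length)
    ]


tokens = "the quick brown fox jumps over the lazy dog".split()
# With a threshold of 4, the 9-token sentence becomes chunks of 4, 4 and 1 tokens.
chunks = split_into_chunks(tokens, max_sentence_length=4)
```

Every chunk except possibly the last has exactly maxSentenceLength tokens, so no words are dropped.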
The minimum number of times a token must appear to be included in the word2vec model's vocabulary (Default: 5).
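As a rough illustration of how a minimum-count threshold shapes the vocabulary, the following plain-Python sketch counts tokens over a toy corpus and keeps only those that occur often enough. The function name `build_vocabulary` and the sample corpus are assumptions made for this example, not part of the library.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=5):
    """Return the set of tokens that occur at least min_count times
    across all sentences (hypothetical helper, for illustration)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


corpus = [["spark", "nlp", "spark"], ["spark", "doc2vec"], ["nlp", "rocks"]]
# "spark" occurs 3 times and "nlp" twice; the other tokens occur once.
vocab = build_vocabulary(corpus, min_count=2)
```

Tokens below the threshold never enter the vocabulary, so no vector is learned for them.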
Number of partitions for sentences of words (Default: 1).
Output Annotator Types: SENTENCE_EMBEDDINGS
Random seed for shuffling the dataset (Default: 44).
Overrides the required annotator columns if they differ from the defaults.
Overrides the annotation column name when transforming.
Param for the step size to be used for each iteration of optimization (> 0) (Default: 0.025).
Unique identifier for storage (Default: this.uid).
Requirement for pipeline transformation validation. It is called on fit().
Takes a Dataset and checks whether all the required annotation types are present.
Parameter: the Dataset to be validated.
Returns: true if all the required types are present, else false.
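The kind of check described here can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper `has_required_types` and the dictionary that maps column names to annotator types are assumptions made for the example.

```python
def has_required_types(column_types, required_types):
    """Return True if every required annotator type is provided by at
    least one column (hypothetical helper, for illustration).

    column_types maps a column name to the annotator type it holds.
    """
    available = set(column_types.values())
    return all(required in available for required in required_types)


# Assumed toy schema: two columns and the annotator type each one carries.
columns = {"document": "document", "token": "token"}
ok_token = has_required_types(columns, ["token"])
ok_both = has_required_types(columns, ["token", "sentence_embeddings"])
```

Here `ok_token` is true because a token column exists, while `ok_both` is false because no column provides sentence embeddings.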
The dimension of the code that you want to transform from words (Default: 100).
The window size (context words from [-window, window]) (Default: 5).
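To make the [-window, window] range concrete, the sketch below enumerates (center, context) pairs for a token list using a fixed symmetric window. The function name `context_pairs` is hypothetical, and the fixed window is a simplification: actual word2vec implementations may randomly shrink the window at each position.

```python
def context_pairs(tokens, window=5):
    """List (center, context) pairs, pairing each token with its neighbours
    at offsets -window..window, excluding the token itself
    (hypothetical helper, for illustration)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # clip at the sentence start
        hi = min(len(tokens), i + window + 1)  # clip at the sentence end
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# With window=1, each token is paired only with its immediate neighbours.
pairs = context_pairs(["a", "b", "c", "d"], window=1)
```

A larger window produces more pairs per token and captures broader context, at a higher training cost.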
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
Required input and expected output annotator types
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in the vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation from Spark ML. Our implementation uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Doc2VecModel.
Sources:
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Example