Word2Vec (deeplearning4j-nlp 0.4-rc3.7 API)

java.lang.Object
- org.deeplearning4j.models.embeddings.wordvectors.WordVectorsImpl
- - org.deeplearning4j.models.word2vec.Word2Vec

All Implemented Interfaces:

Serializable, WordVectors

Direct Known Subclasses:

ParagraphVectors
```
public class Word2Vec
extends WordVectorsImpl
```
Uses a three-layer neural network with a softmax output layer to convert a word, based on its context in the training examples, into a numeric vector.

Author:

Adam Gibson

See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`Word2Vec.Builder`

Field Summary

Fields
Modifier and Type	Field and Description
`protected com.google.common.util.concurrent.AtomicDouble`	`alpha`
`protected int`	`batchSize`
`protected Word2VecConfiguration`	`configuration`
`protected DocumentIterator`	`docIter`
`protected org.apache.commons.math3.random.RandomGenerator`	`g`
`protected InvertedIndex`	`invertedIndex`
`protected int`	`learningRateDecayWords`
`protected static org.slf4j.Logger`	`log`
`protected double`	`minLearningRate`
`protected int`	`numIterations`
`protected boolean`	`resetModel`
`protected double`	`sample`
`protected boolean`	`saveVocab`
`protected long`	`seed`
`protected SentenceIterator`	`sentenceIter`
`protected static long`	`serialVersionUID`
`protected boolean`	`shouldReset`
`protected TokenizerFactory`	`tokenizerFactory`
`protected long`	`totalWords`
`static String`	`UNK`
`protected boolean`	`useAdaGrad`
`protected TextVectorizer`	`vectorizer`
`protected VocabularyHolder`	`vocabularyHolder`
`protected int`	`window`
`protected int`	`workers`

Fields inherited from class org.deeplearning4j.models.embeddings.wordvectors.WordVectorsImpl
layerSize, lookupTable, minWordFrequency, stopWords, vocab

Constructor Summary

Constructors
Constructor and Description
`Word2Vec()`

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`protected void`	`addWords(List<VocabWord> sentence, AtomicLong nextRandom, List<VocabWord> currMiniBatch)`
`protected void`	`buildBinaryTree()`
`boolean`	`buildVocab()` Builds the vocabulary for training
`protected List<VocabWord>`	`digitizeSentence(List<String> tokens)` Returns the sentence as a list of words from the vocabulary.
`VocabCache`	`fillSpecialVocabulary(SentenceIterator iterator, int minWord)` Builds a vocabulary from a special source that should be treated separately.
`protected int`	`fillVocabulary(List<String> tokens)` Adds all unknown words to the vocabulary.
`void`	`fit()` Train the model
`SentenceIterator`	`getSentenceIter()`
`List<String>`	`getStopWords()`
`TokenizerFactory`	`getTokenizerFactory()`
`TextVectorizer`	`getVectorizer()`
`int`	`getWindow()`
`void`	`iterate(VocabWord w1, VocabWord w2, AtomicLong nextRandom, double alpha)` Train the word vector on the given words
`protected void`	`readStopWords()`
`protected void`	`resetWeights()`
`void`	`resetWeightsOnSetup()` Restarts training on the next fit().
`void`	`setSentenceIter(SentenceIterator sentenceIter)` Note that calling this setter assumes a training continuation, so the weights will not be reset.
`void`	`setTokenizerFactory(TokenizerFactory tokenizerFactory)`
`void`	`setup()` Builds the binary tree and resets the weights.
`void`	`setVectorizer(TextVectorizer vectorizer)`
`void`	`skipGram(int i, List<VocabWord> sentence, int b, AtomicLong nextRandom, double alpha)` Train via skip gram
`void`	`trainSentence(List<VocabWord> sentence, AtomicLong nextRandom, double alpha)` Train on a list of vocab words

Methods inherited from class org.deeplearning4j.models.embeddings.wordvectors.WordVectorsImpl
accuracy, getWordVector, getWordVectorMatrix, getWordVectorMatrixNormalized, hasWord, indexOf, lookupTable, setLookupTable, setVocab, similarity, similarWordsInVocabTo, vocab, wordsNearest, wordsNearest, wordsNearest, wordsNearestSum, wordsNearestSum, wordsNearestSum

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

serialVersionUID
```
protected static final long serialVersionUID
```
See Also:

Constant Field Values

configuration
```
protected transient Word2VecConfiguration configuration
```

tokenizerFactory
```
protected transient TokenizerFactory tokenizerFactory
```

sentenceIter
```
protected transient SentenceIterator sentenceIter
```

docIter
```
protected transient DocumentIterator docIter
```

vectorizer
```
protected transient TextVectorizer vectorizer
```

invertedIndex
```
protected transient InvertedIndex invertedIndex
```

vocabularyHolder
```
protected transient VocabularyHolder vocabularyHolder
```

g
```
protected transient org.apache.commons.math3.random.RandomGenerator g
```

workers
```
protected transient int workers
```

batchSize
```
protected int batchSize
```

sample
```
protected double sample
```

totalWords
```
protected long totalWords
```

alpha
```
protected com.google.common.util.concurrent.AtomicDouble alpha
```

window
```
protected int window
```

log
```
protected static final org.slf4j.Logger log
```

shouldReset
```
protected boolean shouldReset
```

numIterations
```
protected int numIterations
```

UNK
```
public static final String UNK
```
See Also:

Constant Field Values

seed
```
protected long seed
```

saveVocab
```
protected boolean saveVocab
```

minLearningRate
```
protected double minLearningRate
```

learningRateDecayWords
```
protected int learningRateDecayWords
```

useAdaGrad
```
protected boolean useAdaGrad
```

resetModel
```
protected boolean resetModel
```

Constructor Detail
Word2Vec
```
public Word2Vec()
```

Method Detail

getVectorizer
```
public TextVectorizer getVectorizer()
```

setVectorizer
```
public void setVectorizer(TextVectorizer vectorizer)
```

fillVocabulary
```
protected int fillVocabulary(List<String> tokens)
```
Adds all unknown words to the vocabulary and updates the counters of known words. Returns the number of words added to or incremented in the vocabulary.

Parameters:

tokens - list of strings received from Tokenizer
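
The counting behaviour described for fillVocabulary can be illustrated with a small stand-alone sketch. It is not the library's implementation: the class `VocabCountSketch`, its backing map, and the `count` helper are hypothetical names introduced only for illustration, and the sketch assumes that every non-empty token counts toward the returned total.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a vocabulary holder; not part of deeplearning4j.
class VocabCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Adds unknown tokens, increments known ones, and returns how many
    // tokens were added or incremented (assumption: empty tokens are skipped).
    int fillVocabulary(List<String> tokens) {
        int touched = 0;
        for (String token : tokens) {
            if (token == null || token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            touched++;
        }
        return touched;
    }

    // Current counter for a word, or 0 if the word is unknown.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        VocabCountSketch vocab = new VocabCountSketch();
        int touched = vocab.fillVocabulary(Arrays.asList("the", "cat", "the"));
        System.out.println(touched + " tokens processed, 'the' counted "
                + vocab.count("the") + " times");
    }
}
```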

fillSpecialVocabulary
```
public VocabCache fillSpecialVocabulary(SentenceIterator iterator,
                                        int minWord)
```
This method can be used to build a vocabulary from a special source that should be treated separately. For example, words from one source may need minWordFrequency set to 1, while the rest of the corpus uses minWordFrequency set to 5.

Parameters:

iterator -

Returns:
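
The per-source thresholds described for fillSpecialVocabulary can be sketched without the library. The example below is only an illustration under stated assumptions: `PerSourceThresholdSketch`, `buildVocab`, and `merge` are hypothetical names, plain token lists stand in for a SentenceIterator, and a map of counts stands in for VocabCache.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of per-source frequency thresholds; not deeplearning4j code.
class PerSourceThresholdSketch {

    // Counts tokens and keeps only words seen at least minWordFrequency times.
    static Map<String, Integer> buildVocab(List<String> tokens, int minWordFrequency) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < minWordFrequency);
        return counts;
    }

    // Merges the special vocabulary into the general one, summing shared counters.
    static Map<String, Integer> merge(Map<String, Integer> general, Map<String, Integer> special) {
        Map<String, Integer> merged = new TreeMap<>(general);
        special.forEach((word, count) -> merged.merge(word, count, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        List<String> special = Arrays.asList("nd4j", "dl4j");
        List<String> corpus = Arrays.asList(
                "the", "the", "the", "the", "the", "rare", "dl4j");
        // Special source keeps every word; the general corpus needs 5 occurrences.
        Map<String, Integer> vocab = merge(buildVocab(corpus, 5), buildVocab(special, 1));
        System.out.println(vocab);
    }
}
```

In this sketch, "rare" and the single corpus occurrence of "dl4j" fall below the general threshold, while both words from the special source survive.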

digitizeSentence
```
protected List<VocabWord> digitizeSentence(List<String> tokens)
```
Returns the sentence as a list of words from the vocabulary.

Parameters:

tokens - list of tokens from the sentence

Returns:

fit
```
public void fit()
         throws IOException
```
Train the model

Throws:

IOException

addWords
```
protected void addWords(List<VocabWord> sentence,
                        AtomicLong nextRandom,
                        List<VocabWord> currMiniBatch)
```

setup
```
public void setup()
```
Builds the binary tree and resets the weights.

buildVocab
```
public boolean buildVocab()
```
Builds the vocabulary for training

trainSentence
```
public void trainSentence(List<VocabWord> sentence,
                          AtomicLong nextRandom,
                          double alpha)
```
Train on a list of vocab words

Parameters:

sentence - the list of vocab words to train on

skipGram
```
public void skipGram(int i,
                     List<VocabWord> sentence,
                     int b,
                     AtomicLong nextRandom,
                     double alpha)
```
Train via skip gram

Parameters:

i -

sentence -
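
Since the parameters of skipGram are not described here, the following stand-alone sketch shows how skip-gram training pairs are commonly generated. It assumes that `i` is the position of the center word and that `b` shrinks the window on both sides, as in the reference word2vec C implementation; this page does not confirm either. `SkipGramWindowSketch` and `pairs` are hypothetical names, and strings stand in for VocabWord.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of skip-gram pair generation; not deeplearning4j code.
class SkipGramWindowSketch {

    // Returns (center, context) pairs for the word at position i.
    // Assumption: b in [0, window) narrows the window on each side.
    static List<String[]> pairs(int i, List<String> sentence, int window, int b) {
        List<String[]> out = new ArrayList<>();
        int end = window * 2 + 1 - b;
        for (int a = b; a < end; a++) {
            if (a == window) {
                continue; // offset of the center word itself
            }
            int c = i - window + a; // position of the context word
            if (c < 0 || c >= sentence.size()) {
                continue; // context position falls outside the sentence
            }
            out.add(new String[] {sentence.get(i), sentence.get(c)});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("the", "quick", "brown", "fox", "jumps");
        for (String[] pair : pairs(2, sentence, 2, 0)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

In the real method each pair would presumably be handed to a training step such as iterate(w1, w2, nextRandom, alpha); the sketch only collects the pairs.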

iterate
```
public void iterate(VocabWord w1,
                    VocabWord w2,
                    AtomicLong nextRandom,
                    double alpha)
```
Train the word vector on the given words

Parameters:

w1 - the first word to fit

buildBinaryTree
```
protected void buildBinaryTree()
```

resetWeights
```
protected void resetWeights()
```

readStopWords
```
protected void readStopWords()
```

setSentenceIter
```
public void setSentenceIter(SentenceIterator sentenceIter)
```
Note that calling this setter assumes a training continuation, so the weights will not be reset.

Parameters:

sentenceIter -

resetWeightsOnSetup
```
public void resetWeightsOnSetup()
```
Restarts training on the next fit(). Use this when the sentence iterator is set for a new training run.
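
The interplay between setSentenceIter and resetWeightsOnSetup can be pictured as a single flag. The sketch below is a guess at the mechanism, not the library's code: it assumes the `shouldReset` field listed above is what these two methods toggle and that setup() consults it. `ResetFlagSketch` and `resetCount` are hypothetical, and a plain Object stands in for SentenceIterator.

```java
// Hypothetical model of the reset flag; not deeplearning4j code.
class ResetFlagSketch {
    // Assumption: a fresh model initialises its weights on the first setup().
    private boolean shouldReset = true;
    private Object sentenceIter;
    private int resets = 0;

    // Setting a new iterator is treated as a training continuation.
    void setSentenceIter(Object sentenceIter) {
        this.sentenceIter = sentenceIter;
        this.shouldReset = false;
    }

    // Explicitly requests fresh weights for the next training run.
    void resetWeightsOnSetup() {
        this.shouldReset = true;
    }

    // Stands in for setup(): weights are reset only when the flag is set.
    void setup() {
        if (shouldReset) {
            resets++;
        }
    }

    int resetCount() {
        return resets;
    }

    public static void main(String[] args) {
        ResetFlagSketch model = new ResetFlagSketch();
        model.setup();                        // first run: weights initialised
        model.setSentenceIter("more text");   // continuation: keep weights
        model.setup();
        model.resetWeightsOnSetup();          // new training: start over
        model.setup();
        System.out.println("weight resets: " + model.resetCount());
    }
}
```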

getWindow
```
public int getWindow()
```

getStopWords
```
public List<String> getStopWords()
```

getSentenceIter
```
public SentenceIterator getSentenceIter()
```

getTokenizerFactory
```
public TokenizerFactory getTokenizerFactory()
```

setTokenizerFactory
```
public void setTokenizerFactory(TokenizerFactory tokenizerFactory)
```
