public class Word2Vec extends Object implements Persistable
Modifier and Type | Class and Description |
---|---|
static class | Word2Vec.Builder
Modifier and Type | Field and Description |
---|---|
protected com.google.common.util.concurrent.AtomicDouble | alpha
protected int | batchSize
protected VocabCache | cache
protected DocumentIterator | docIter
protected org.apache.commons.math3.random.RandomGenerator | g
protected Queue<List<List<VocabWord>>> | jobQueue
protected int | layerSize
protected int | learningRateDecayWords
protected static org.slf4j.Logger | log
protected double | minLearningRate
protected int | minWordFrequency
protected int | numIterations
protected AtomicInteger | rateOfChange
protected double | sample
protected boolean | saveVocab
protected long | seed
protected SentenceIterator | sentenceIter
protected static long | serialVersionUID
protected boolean | shouldReset
protected List<String> | stopWords
protected TokenizerFactory | tokenizerFactory
protected int | topNSize
protected long | totalWords
static String | UNK
protected boolean | useAdaGrad
protected TextVectorizer | vectorizer
protected int | window
protected int | workers
Constructor and Description |
---|
Word2Vec() |
Modifier and Type | Method and Description |
---|---|
Map<String,Double> | accuracy(List<String> questions): Accuracy based on questions, which are a space-separated list of strings where the first word is the query word, the next 2 words are negative, and the last word is the predicted word to be nearest
protected void | addWords(List<VocabWord> sentence, AtomicLong nextRandom, List<VocabWord> currMiniBatch)
protected void | buildBinaryTree()
boolean | buildVocab(): Builds the vocabulary for training
void | fit(): Train the model
VocabCache | getCache()
int | getLayerSize()
SentenceIterator | getSentenceIter()
List<String> | getStopWords()
TokenizerFactory | getTokenizerFactory()
int | getWindow()
double[] | getWordVector(String word): Get the word vector for a given word
org.nd4j.linalg.api.ndarray.INDArray | getWordVectorMatrix(String word): Get the word vector for a given word
org.nd4j.linalg.api.ndarray.INDArray | getWordVectorMatrixNormalized(String word): Returns the word vector divided by the norm2 of the array
boolean | hasWord(String word): Returns true if the model has this word in the vocab
int | indexOf(String word)
void | iterate(VocabWord w1, VocabWord w2, AtomicLong nextRandom, double alpha): Train the word vectors on the given pair of words
void | load(InputStream is)
protected void | readStopWords()
protected void | resetWeights()
void | resetWeightsOnSetup(): Restart training on the next fit()
void | setCache(VocabCache cache)
void | setLayerSize(int layerSize)
void | setSentenceIter(SentenceIterator sentenceIter): Note that calling this setter assumes a training continuation, so the weights are not reset
void | setTokenizerFactory(TokenizerFactory tokenizerFactory)
void | setup(): Build the binary tree and reset the weights
double | similarity(String word, String word2): Returns the similarity of 2 words
List<String> | similarWordsInVocabTo(String word, double accuracy): Find all words with similar characters in the vocab
void | skipGram(int i, List<VocabWord> sentence, int b, AtomicLong nextRandom, double alpha): Train via skip-gram
void | trainSentence(List<VocabWord> sentence, AtomicLong nextRandom, double alpha): Train on a list of vocab words
Collection<String> | wordsNearest(List<String> positive, List<String> negative, int top): Words nearest based on positive and negative words
Collection<String> | wordsNearest(String word, int n): Get the top n words most similar to the given word
Collection<String> | wordsNearestSum(List<String> positive, List<String> negative, int top): Words nearest based on positive and negative words
Collection<String> | wordsNearestSum(String word, int n): Get the top n words most similar to the given word
void | write(OutputStream os)
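Typical end-to-end usage combines the nested Word2Vec.Builder with a SentenceIterator and TokenizerFactory, then calls fit(). The sketch below is a minimal illustration only: the builder method names (minWordFrequency, layerSize, windowSize, iterate, tokenizerFactory), the LineSentenceIterator/DefaultTokenizerFactory helpers, and the package locations in the imports are assumptions that should be checked against the Word2Vec.Builder documentation for this version.

```java
import java.io.File;
import java.util.Collection;

import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecSketch {
    public static void main(String[] args) throws Exception {
        // Corpus: one sentence per line; the path is a placeholder.
        SentenceIterator sentences = new LineSentenceIterator(new File("sentences.txt"));
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        // Builder method names below are assumptions; see Word2Vec.Builder.
        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)     // drop rare words
                .layerSize(100)          // vector dimensionality
                .windowSize(5)           // context window
                .iterate(sentences)
                .tokenizerFactory(tokenizer)
                .build();

        vec.fit();  // builds the vocab and trains the model

        Collection<String> nearest = vec.wordsNearest("day", 10);
        System.out.println(nearest);
    }
}
```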
protected static final long serialVersionUID
protected transient TokenizerFactory tokenizerFactory
protected transient SentenceIterator sentenceIter
protected transient DocumentIterator docIter
protected transient VocabCache cache
protected int batchSize
protected int topNSize
protected double sample
protected long totalWords
protected AtomicInteger rateOfChange
protected com.google.common.util.concurrent.AtomicDouble alpha
protected int minWordFrequency
protected int window
protected int layerSize
protected transient org.apache.commons.math3.random.RandomGenerator g
protected static org.slf4j.Logger log
protected boolean shouldReset
protected int numIterations
public static final String UNK
protected long seed
protected boolean saveVocab
protected double minLearningRate
protected TextVectorizer vectorizer
protected int learningRateDecayWords
protected boolean useAdaGrad
protected int workers
public Map<String,Double> accuracy(List<String> questions)
questions - the questions to ask

public List<String> similarWordsInVocabTo(String word, double accuracy)
word - the word to compare
accuracy - the accuracy: 0 to 1

public int indexOf(String word)
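For accuracy(List<String>), a minimal sketch of building a question list in the format described above (one question per string: query word, two negative words, expected nearest word). The words and the surrounding vec instance are placeholders.

```java
// Fragment; assumes a trained model `vec` and java.util imports.
// Each string is "<query> <negative> <negative> <expected nearest>",
// with placeholder words chosen only to illustrate the format.
List<String> questions = Arrays.asList(
        "king man boy queen",
        "paris london berlin france");
Map<String, Double> scores = vec.accuracy(questions);
System.out.println(scores);
```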
public double[] getWordVector(String word)
word - the word to get the matrix for

public org.nd4j.linalg.api.ndarray.INDArray getWordVectorMatrix(String word)
word - the word to get the matrix for

public org.nd4j.linalg.api.ndarray.INDArray getWordVectorMatrixNormalized(String word)
word - the word to get the matrix for
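A small sketch of the three lookup variants, assuming a trained model vec and an in-vocabulary word (the word itself is a placeholder).

```java
// Fragment; assumes a trained model `vec` and
// import org.nd4j.linalg.api.ndarray.INDArray;
double[] raw = vec.getWordVector("day");                        // plain double[] form
INDArray asArray = vec.getWordVectorMatrix("day");              // ND4J array for the same word
INDArray unitLength = vec.getWordVectorMatrixNormalized("day"); // divided by its norm2
```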
public Collection<String> wordsNearestSum(List<String> positive, List<String> negative, int top)
positive - the positive words
negative - the negative words
top - the top n words

public Collection<String> wordsNearestSum(String word, int n)
word - the word to compare
n - the n to get

public Collection<String> wordsNearest(List<String> positive, List<String> negative, int top)
positive - the positive words
negative - the negative words
top - the top n words

public Collection<String> wordsNearest(String word, int n)
word - the word to compare
n - the n to get
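The positive/negative overloads support analogy-style queries. A sketch, assuming a trained model vec; with a well-trained corpus the first query often ranks "queen" highly, though that is not guaranteed.

```java
// Fragment; assumes a trained model `vec` plus java.util imports.
Collection<String> analogy = vec.wordsNearest(
        Arrays.asList("king", "woman"),   // positive: pull results toward these
        Arrays.asList("man"),             // negative: push results away from this
        5);                               // top 5 matches
Collection<String> neighbours = vec.wordsNearestSum("day", 10);  // 10 closest to "day"
```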
public boolean hasWord(String word)
word - the word to test for

public void fit() throws IOException
Throws: IOException
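hasWord(String) can serve as an out-of-vocabulary guard before a vector lookup. The sketch below assumes the vocab contains an entry for the UNK token, which depends on how the vocabulary was built.

```java
// Fragment; assumes a trained model `vec`. The token is a placeholder.
String token = "serendipity";
String key = vec.hasWord(token) ? token : Word2Vec.UNK;  // fall back to the UNK entry
double[] vector = vec.getWordVector(key);
```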
protected void addWords(List<VocabWord> sentence, AtomicLong nextRandom, List<VocabWord> currMiniBatch)
public void setup()
public boolean buildVocab()
public void trainSentence(List<VocabWord> sentence, AtomicLong nextRandom, double alpha)
sentence - the list of vocab words to train on

public void skipGram(int i, List<VocabWord> sentence, int b, AtomicLong nextRandom, double alpha)
i -
sentence -

public void iterate(VocabWord w1, VocabWord w2, AtomicLong nextRandom, double alpha)
w1 - the first word to fit

protected void buildBinaryTree()
protected void resetWeights()
public double similarity(String word, String word2)
word - the first word
word2 - the second word

protected void readStopWords()
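A one-line sketch of similarity(String, String), assuming both words are in the vocab (the word choices are placeholders); values closer to 1.0 indicate more similar vectors.

```java
// Fragment; assumes a trained model `vec`.
double sim = vec.similarity("day", "night");
System.out.println("similarity(day, night) = " + sim);
```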
public void write(OutputStream os)
Specified by: write in interface Persistable

public void load(InputStream is)
Specified by: load in interface Persistable
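A persistence sketch using the Persistable methods. The file name is arbitrary, and the assumption that a plain new Word2Vec() is a sufficient target for load(InputStream) has not been verified against this version.

```java
// Fragment; assumes a trained model `vec`, java.io imports, and an
// enclosing method that declares `throws IOException` for the stream handling.
try (OutputStream os = new FileOutputStream("word2vec-model.bin")) {
    vec.write(os);
}

Word2Vec restored = new Word2Vec();
try (InputStream is = new FileInputStream("word2vec-model.bin")) {
    restored.load(is);
}
```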
public void setSentenceIter(SentenceIterator sentenceIter)
sentenceIter -

public void resetWeightsOnSetup()
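Per the setter note in the method summary, swapping in a new SentenceIterator is treated as a training continuation, so the existing weights are kept. A sketch, assuming a file-backed iterator such as LineSentenceIterator (the iterator class and file path are assumptions of this sketch).

```java
// Fragment; assumes a trained model `vec`. Call resetWeightsOnSetup()
// first if training should start from fresh weights instead.
vec.setSentenceIter(new LineSentenceIterator(new File("more-sentences.txt")));
vec.fit();  // continues training on the new corpus
```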
public int getLayerSize()
public void setLayerSize(int layerSize)
public int getWindow()
public SentenceIterator getSentenceIter()
public TokenizerFactory getTokenizerFactory()
public void setTokenizerFactory(TokenizerFactory tokenizerFactory)
public VocabCache getCache()
public void setCache(VocabCache cache)