Class VocabularyHolder
- java.lang.Object
- org.deeplearning4j.models.word2vec.wordstore.VocabularyHolder
- All Implemented Interfaces: Serializable
public class VocabularyHolder extends Object implements Serializable
- See Also: Serialized Form
Nested Class Summary
- static class VocabularyHolder.Builder
-
Constructor Summary
- protected VocabularyHolder(): Default constructor.
- protected VocabularyHolder(@NonNull VocabCache<? extends SequenceElement> cache, boolean markAsSpecial): Builds VocabularyHolder from VocabCache.
-
Method Summary
- protected void activateScavenger(): Removes low-frequency words based on their frequency change between activations.
- void addWord(String word): Adds a new word to the vocabulary.
- void addWord(VocabularyWord word)
- static List<Byte> arrayToList(byte[] array, int codeLen): Used only for VocabCache compatibility purposes.
- static List<Integer> arrayToList(int[] array, int codeLen): Used only for VocabCache compatibility purposes.
- static HuffmanNode buildNode(List<Byte> codes, List<Integer> points, int codeLen, int index)
- void consumeVocabulary(VocabularyHolder holder)
- boolean containsWord(String word): Checks whether the vocabulary contains the word.
- Collection<VocabularyWord> getVocabulary()
- VocabularyWord getVocabularyWordByIdx(Integer id)
- VocabularyWord getVocabularyWordByString(String word)
- void incrementWordCounter(String word): Increments by one the number of occurrences of the word in the corpus.
- int indexOf(String word): Returns the index of the word in the sorted list.
- static byte[] listToArray(List<Byte> code)
- static int[] listToArray(List<Integer> points, int codeLen)
- int numWords()
- void resetWordCounters(): Resets the counters for all words in the vocabulary.
- protected void setScavengerActivationThreshold(int threshold): Needed ONLY for unit tests and should NOT be available in public scope.
- long totalWordsBeyondLimit()
- void transferBackToVocabCache()
- void transferBackToVocabCache(VocabCache cache)
- void transferBackToVocabCache(VocabCache cache, boolean emptyHolder): Required for compatibility purposes.
- void truncateVocabulary(): The same as truncateVocabulary(this.minWordFrequency).
- void truncateVocabulary(int threshold): All words with frequency below threshold will be removed.
- List<VocabularyWord> updateHuffmanCodes(): Builds a binary tree ordered by counter.
- List<VocabularyWord> words(): Returns the sorted list of words in the vocabulary.
-
-
-
Constructor Detail
-
VocabularyHolder
protected VocabularyHolder()
Default constructor
-
VocabularyHolder
protected VocabularyHolder(@NonNull VocabCache<? extends SequenceElement> cache, boolean markAsSpecial)
Builds VocabularyHolder from VocabCache. Tokens are ignored and only VocabularyWords are transferred, on the assumption that the cache has already been truncated by minWordFrequency. Huffman tree data is ignored and recalculated because of a suspected flaw in the dl4j Huffman implementation and its excessive memory usage. This code is required for compatibility between the dl4j w2v implementation and standalone w2v.
- Parameters:
cache
-
-
-
Method Detail
-
buildNode
public static HuffmanNode buildNode(List<Byte> codes, List<Integer> points, int codeLen, int index)
-
transferBackToVocabCache
public void transferBackToVocabCache()
-
transferBackToVocabCache
public void transferBackToVocabCache(VocabCache cache)
-
transferBackToVocabCache
public void transferBackToVocabCache(VocabCache cache, boolean emptyHolder)
This method is required for compatibility purposes. It simply transfers the vocabulary from VocabularyHolder into VocabCache.
- Parameters:
cache
-
-
setScavengerActivationThreshold
protected void setScavengerActivationThreshold(int threshold)
This method is needed ONLY for unit tests and should NOT be available in public scope. It sets the vocabulary size ratio at which the dynamic scavenger will be activated.
- Parameters:
threshold
-
-
arrayToList
public static List<Byte> arrayToList(byte[] array, int codeLen)
This method is used only for VocabCache compatibility purposes.
- Parameters:
array
codeLen
- Returns:
-
arrayToList
public static List<Integer> arrayToList(int[] array, int codeLen)
This method is used only for VocabCache compatibility purposes.
- Parameters:
array
codeLen
- Returns:
-
getVocabulary
public Collection<VocabularyWord> getVocabulary()
-
getVocabularyWordByString
public VocabularyWord getVocabularyWordByString(String word)
-
getVocabularyWordByIdx
public VocabularyWord getVocabularyWordByIdx(Integer id)
-
containsWord
public boolean containsWord(String word)
Checks whether the vocabulary contains the word.
- Parameters:
word
- to be looked for
- Returns:
- TRUE if the vocabulary contains the word, FALSE otherwise
-
incrementWordCounter
public void incrementWordCounter(String word)
Increments by one the number of occurrences of the word in the corpus.
- Parameters:
word
- whose counter is to be incremented
-
addWord
public void addWord(String word)
Adds a new word to the vocabulary.
- Parameters:
word
- to be added
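One plausible way to combine containsWord, incrementWordCounter and addWord is a simple counting loop. The sketch below uses a hypothetical in-memory stand-in, CountingSketch, that mirrors only those three method names; it is not the DL4J class, and the assumption that a newly added word starts with a count of 1 is not confirmed by this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in mirroring three documented method names.
// It is NOT the DL4J VocabularyHolder; counting semantics are assumptions.
class CountingSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    boolean containsWord(String word) {
        return counts.containsKey(word);
    }

    // Assumption: a newly added word starts with a count of 1.
    void addWord(String word) {
        counts.putIfAbsent(word, 1);
    }

    // Adds one to the word's count (an unseen word ends up with a count of 1).
    void incrementWordCounter(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Helper for inspection only; not part of the documented API.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Feeds whitespace-separated tokens through the contains/increment/add pattern.
    static CountingSketch countTokens(String text) {
        CountingSketch holder = new CountingSketch();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (holder.containsWord(token)) {
                holder.incrementWordCounter(token);
            } else {
                holder.addWord(token);
            }
        }
        return holder;
    }
}
```

Calling CountingSketch.countTokens on a short sentence and reading the counts back through count(...) shows the pattern end to end.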
-
addWord
public void addWord(VocabularyWord word)
-
consumeVocabulary
public void consumeVocabulary(VocabularyHolder holder)
-
activateScavenger
protected void activateScavenger()
This method removes low-frequency words based on their frequency change between activations. For example, if a word has appeared only once and has kept the same frequency over consecutive activations, it is assumed that it can be removed safely.
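The scavenging idea described above can be illustrated with a small standalone sketch. Everything here is hypothetical (the ScavengerSketch class, the Stat record, the maxCount parameter); it only demonstrates the rule "low count and unchanged since the previous activation means removable" and makes no claim about how DL4J implements it.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration of frequency-change scavenging; not the DL4J code.
class ScavengerSketch {
    // Per-word record: current count plus the count seen at the previous activation.
    static final class Stat {
        int count;
        int countAtLastActivation = -1; // -1 means "not yet seen by the scavenger"

        Stat(int count) {
            this.count = count;
        }
    }

    final Map<String, Stat> vocab = new HashMap<>();

    // Registers one occurrence of the word.
    void observe(String word) {
        Stat s = vocab.get(word);
        if (s == null) {
            vocab.put(word, new Stat(1));
        } else {
            s.count++;
        }
    }

    // Removes words whose count is at most maxCount and has not changed
    // since the previous activation; returns how many were removed.
    int activate(int maxCount) {
        int removed = 0;
        Iterator<Map.Entry<String, Stat>> it = vocab.entrySet().iterator();
        while (it.hasNext()) {
            Stat s = it.next().getValue();
            if (s.count <= maxCount && s.count == s.countAtLastActivation) {
                it.remove();
                removed++;
            } else {
                s.countAtLastActivation = s.count; // remember for the next activation
            }
        }
        return removed;
    }
}
```

In this sketch the first activation never removes anything, because no earlier snapshot exists to compare against.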
-
resetWordCounters
public void resetWordCounters()
This method resets the counters for all words in the vocabulary.
-
numWords
public int numWords()
- Returns:
- number of words in vocabulary
-
truncateVocabulary
public void truncateVocabulary()
The same as truncateVocabulary(this.minWordFrequency)
-
truncateVocabulary
public void truncateVocabulary(int threshold)
All words with frequency below threshold will be removed.
- Parameters:
threshold
- exclusive threshold for removal
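Because the threshold is exclusive, a word whose frequency equals the threshold is kept. A minimal sketch of that boundary behaviour, using a plain Map as a hypothetical stand-in for the vocabulary (TruncateSketch is not part of DL4J):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of an exclusive truncation threshold.
class TruncateSketch {
    // Removes every entry whose count is strictly below the threshold;
    // an entry with count == threshold is kept.
    static void truncate(Map<String, Integer> counts, int threshold) {
        counts.values().removeIf(c -> c < threshold);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 1);
        counts.put("b", 5);
        counts.put("c", 9);
        truncate(counts, 5);
        System.out.println(counts.keySet()); // "a" is gone; "b" and "c" remain
    }
}
```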
-
updateHuffmanCodes
public List<VocabularyWord> updateHuffmanCodes()
Builds a binary tree ordered by counter. Based on the original w2v by Google.
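For readers unfamiliar with the idea, a binary tree ordered by counter is a Huffman tree: the two least frequent nodes are merged repeatedly, so frequent words end up with shorter codes. The sketch below is a generic priority-queue variant under that assumption. HuffmanSketch and its helpers are hypothetical names; this is neither the array-based routine from the original word2vec code nor the DL4J implementation, and the choice of which child gets bit 0 is arbitrary here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Generic Huffman coding sketch; hypothetical, not the DL4J implementation.
class HuffmanSketch {
    static final class Node {
        final String word; // null for inner nodes
        final long count;
        final Node left;
        final Node right;

        Node(String word, long count, Node left, Node right) {
            this.word = word;
            this.count = count;
            this.left = left;
            this.right = right;
        }
    }

    // Returns a map from word to its binary code written as a string of '0' and '1'.
    static Map<String, String> codes(Map<String, Long> counts) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.count, b.count));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        // Repeatedly merge the two least frequent nodes into a new inner node.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.count + b.count, a, b));
        }
        Map<String, String> result = new HashMap<>();
        if (!queue.isEmpty()) {
            assign(queue.poll(), "", result);
        }
        return result;
    }

    // Walks the tree, appending '0' for the left branch and '1' for the right branch.
    private static void assign(Node node, String prefix, Map<String, String> out) {
        if (node.word != null) {
            out.put(node.word, prefix.isEmpty() ? "0" : prefix); // single-word vocabulary
            return;
        }
        assign(node.left, prefix + "0", out);
        assign(node.right, prefix + "1", out);
    }
}
```

With counts of 10, 3 and 2 for three words, the most frequent word receives a one-bit code and the other two receive two-bit codes.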
-
indexOf
public int indexOf(String word)
This method returns the index of the word in the sorted list.
- Parameters:
word
- Returns:
-
words
public List<VocabularyWord> words()
Returns the sorted list of words in the vocabulary. Sort is DESCENDING.
- Returns:
- list of VocabularyWord
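The relationship between words() and indexOf(String) can be pictured with a small standalone sketch. It assumes that the sort key is the word count, that ties are broken alphabetically, and that an unknown word yields -1; none of these details is stated on this page, and SortedWordsSketch is a hypothetical helper rather than DL4J code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a descending sort plus index lookup.
class SortedWordsSketch {
    // Returns the words ordered by count, highest first (assumed sort key).
    static List<String> words(Map<String, Integer> counts) {
        List<String> sorted = new ArrayList<>(counts.keySet());
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a)); // descending
            return byCount != 0 ? byCount : a.compareTo(b);              // assumed tie-break
        });
        return sorted;
    }

    // Position of the word in the sorted list, or -1 if absent (assumption).
    static int indexOf(Map<String, Integer> counts, String word) {
        return words(counts).indexOf(word);
    }
}
```

Under these assumptions the most frequent word always has index 0.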
-
totalWordsBeyondLimit
public long totalWordsBeyondLimit()
-
-