Class VocabularyHolder

    • Constructor Detail

      • VocabularyHolder

        protected VocabularyHolder()
        Default constructor
      • VocabularyHolder

        protected VocabularyHolder​(@NonNull VocabCache<? extends SequenceElement> cache,
                                   boolean markAsSpecial)
        Builds a VocabularyHolder from a VocabCache. Tokens are ignored and only VocabularyWords are transferred, on the assumption that the cache has already been truncated by minWordFrequency. Huffman tree data is ignored and recalculated because of a suspected flaw in the dl4j Huffman implementation and its excessive memory usage. This code is required for compatibility between the dl4j w2v implementation and the standalone w2v implementation.
        Parameters:
        cache - the VocabCache to build the holder from
    • Method Detail

      • transferBackToVocabCache

        public void transferBackToVocabCache()
      • transferBackToVocabCache

        public void transferBackToVocabCache​(VocabCache cache)
      • transferBackToVocabCache

        public void transferBackToVocabCache​(VocabCache cache,
                                             boolean emptyHolder)
        This method is required for compatibility purposes. It transfers the vocabulary from this VocabularyHolder into the given VocabCache.
        Parameters:
        cache - the VocabCache to transfer the vocabulary into
      • setScavengerActivationThreshold

        protected void setScavengerActivationThreshold​(int threshold)
        This method is needed ONLY for unit tests and should NOT be available in public scope. It sets the vocab size ratio at which the dynamic scavenger will be activated.
        Parameters:
        threshold -
      • arrayToList

        public static List<Byte> arrayToList​(byte[] array,
                                             int codeLen)
        This method is used only for VocabCache compatibility purposes
        Parameters:
        array -
        codeLen -
        Returns:
      • listToArray

        public static byte[] listToArray​(List<Byte> code)
      • listToArray

        public static int[] listToArray​(List<Integer> points,
                                        int codeLen)
      • arrayToList

        public static List<Integer> arrayToList​(int[] array,
                                                int codeLen)
        This method is used only for VocabCache compatibility purposes
        Parameters:
        array -
        codeLen -
        Returns:
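        The array/list conversions above can be sketched as follows. This is a minimal illustration, not the library's code: the class name CodeConversionSketch is hypothetical, and the reading of codeLen as "only the first codeLen entries are meaningful" is an assumption, since the parameter is not documented here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in used only for illustration; not part of the library.
class CodeConversionSketch {

    // Assumption: only the first codeLen entries of the array carry code bits.
    static List<Byte> arrayToList(byte[] array, int codeLen) {
        List<Byte> result = new ArrayList<>(codeLen);
        for (int i = 0; i < codeLen; i++) {
            result.add(array[i]);
        }
        return result;
    }

    // Copies the list back into a primitive array of the same length.
    static byte[] listToArray(List<Byte> code) {
        byte[] array = new byte[code.size()];
        for (int i = 0; i < code.size(); i++) {
            array[i] = code.get(i);
        }
        return array;
    }
}
```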
      • containsWord

        public boolean containsWord​(String word)
        Checks whether the word exists in the vocabulary
        Parameters:
        word - to be looked for
        Returns:
        TRUE if the vocabulary contains the word, FALSE otherwise
      • incrementWordCounter

        public void incrementWordCounter​(String word)
        Increments the number of occurrences of the word in the corpus by one
        Parameters:
        word - whose counter is to be incremented
      • addWord

        public void addWord​(String word)
        Adds a new word to the vocabulary
        Parameters:
        word - to be added
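        A typical counting loop built on containsWord, incrementWordCounter and addWord might look like the sketch below. It uses a hypothetical map-backed class (WordCounterSketch) rather than VocabularyHolder itself. The initial count of 1 for a new word, the decision to ignore unknown words in incrementWordCounter, and the observe and count helpers are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in used only for illustration; not part of the library.
class WordCounterSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    boolean containsWord(String word) {
        return counts.containsKey(word);
    }

    // Assumption: a newly added word starts with a count of 1.
    void addWord(String word) {
        counts.putIfAbsent(word, 1);
    }

    // Assumption: unknown words are ignored rather than added.
    void incrementWordCounter(String word) {
        counts.computeIfPresent(word, (w, c) -> c + 1);
    }

    // Hypothetical helper: returns 0 for words that were never added.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Hypothetical helper: the usual "increment if known, else add" step.
    void observe(String word) {
        if (containsWord(word)) {
            incrementWordCounter(word);
        } else {
            addWord(word);
        }
    }
}
```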
      • consumeVocabulary

        public void consumeVocabulary​(VocabularyHolder holder)
      • activateScavenger

        protected void activateScavenger()
        This method removes low-frequency words based on their frequency change between activations. For example, if a word has appeared only once and has retained the same frequency over consecutive activations, we can assume it can be removed safely.
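        The scavenging idea described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the class ScavengerSketch, the lastSeen snapshot map, the observe helper, and the "count of at most 1" cutoff are all hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in used only for illustration; not part of the library.
class ScavengerSketch {
    final Map<String, Integer> counts = new HashMap<>();
    // Counts as they were at the previous activation.
    private final Map<String, Integer> lastSeen = new HashMap<>();

    // Hypothetical helper: records one occurrence of the word.
    void observe(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Removes words that are rare AND did not change since the last activation.
    void activateScavenger() {
        Iterator<Map.Entry<String, Integer>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> entry = it.next();
            Integer previous = lastSeen.get(entry.getKey());
            boolean rare = entry.getValue() <= 1; // assumed cutoff
            boolean unchanged = previous != null && previous.equals(entry.getValue());
            if (rare && unchanged) {
                it.remove();
                lastSeen.remove(entry.getKey());
            } else {
                lastSeen.put(entry.getKey(), entry.getValue());
            }
        }
    }
}
```

        In this sketch a word is never dropped on the first activation that sees it, because there is no earlier snapshot to compare against.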
      • resetWordCounters

        public void resetWordCounters()
        This method resets the counters for all words in the vocabulary
      • numWords

        public int numWords()
        Returns:
        number of words in vocabulary
      • truncateVocabulary

        public void truncateVocabulary()
        The same as truncateVocabulary(this.minWordFrequency)
      • truncateVocabulary

        public void truncateVocabulary​(int threshold)
        All words with frequency below threshold will be removed
        Parameters:
        threshold - exclusive threshold for removal
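        The exclusive threshold means that a word whose frequency equals the threshold is kept. A minimal sketch over a plain map, where the class TruncateSketch and the map representation are hypothetical stand-ins for the holder's internal storage:

```java
import java.util.Map;

// Hypothetical stand-in used only for illustration; not part of the library.
class TruncateSketch {

    // Removes every entry whose count is strictly below the threshold.
    static void truncate(Map<String, Integer> counts, int threshold) {
        counts.values().removeIf(count -> count < threshold);
    }
}
```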
      • updateHuffmanCodes

        public List<VocabularyWord> updateHuffmanCodes()
        Builds a binary (Huffman) tree ordered by word counter. Based on the original w2v implementation by Google.
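        For readers unfamiliar with Huffman coding, the sketch below shows the general idea of building codes from word counts. It is not the library's code: it uses a priority queue for clarity, whereas the original w2v code works over a count-sorted array, and it returns codes as strings rather than updating VocabularyWord objects. The names HuffmanSketch, Node and buildCodes are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-in used only for illustration; not part of the library.
class HuffmanSketch {

    static final class Node {
        final String word; // null for inner nodes
        final long count;
        final Node left;
        final Node right;

        Node(String word, long count, Node left, Node right) {
            this.word = word;
            this.count = count;
            this.left = left;
            this.right = right;
        }
    }

    // Returns a map from word to its binary code, written as a string of 0s and 1s.
    static Map<String, String> buildCodes(Map<String, Integer> counts) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.count, b.count));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            queue.add(new Node(entry.getKey(), entry.getValue(), null, null));
        }
        // Repeatedly merge the two least frequent nodes into one inner node.
        while (queue.size() > 1) {
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node(null, first.count + second.count, first, second));
        }
        Map<String, String> codes = new HashMap<>();
        if (!queue.isEmpty()) {
            assign(queue.poll(), "", codes);
        }
        return codes;
    }

    // Walks the tree; a single-word vocabulary ends up with an empty code.
    private static void assign(Node node, String prefix, Map<String, String> codes) {
        if (node.word != null) {
            codes.put(node.word, prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }
}
```

        Frequent words end up with shorter codes, which is what makes hierarchical softmax cheaper for common words.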
      • indexOf

        public int indexOf​(String word)
        Returns the index of the word in the sorted list.
        Parameters:
        word -
        Returns:
      • words

        public List<VocabularyWord> words()
        Returns sorted list of words in vocabulary. Sort is DESCENDING.
        Returns:
        list of VocabularyWord
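        The relationship between words() and indexOf can be illustrated with a small sketch. It works on plain strings and a count map instead of VocabularyWord objects, and the class SortedWordsSketch is hypothetical. Returning -1 for an unknown word is the behavior of java.util.List.indexOf and is only an assumption about the library method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in used only for illustration; not part of the library.
class SortedWordsSketch {

    // Returns the words ordered by count, most frequent first (DESCENDING).
    static List<String> words(Map<String, Integer> counts) {
        List<String> sorted = new ArrayList<>(counts.keySet());
        sorted.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return sorted;
    }

    // Position of the word in the descending list; -1 if absent (assumption).
    static int indexOf(Map<String, Integer> counts, String word) {
        return words(counts).indexOf(word);
    }
}
```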
      • totalWordsBeyondLimit

        public long totalWordsBeyondLimit()