Class LuIndexWriter

  • All Implemented Interfaces:
    VersatileIndexing

    public abstract class LuIndexWriter
    extends java.lang.Object
    implements VersatileIndexing
    Wrapper class containing an instance of Lucene's IndexWriter that facilitates the indexing of both text objects and latent feature vectors. Note that LuIndexWriter is currently not thread-safe due to the way it uses PersonalizedDocFactory; this will be fixed in an upcoming version.
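    As a rough sketch of typical use, a concrete subclass supplies the file-parsing logic while LuIndexWriter drives the underlying IndexWriter. The subclass name and the text-only field layout below are illustrative; the constructor, the protected writer field, and indexFile(File) are the members documented on this page.

    public class PlainTextIndexWriter extends LuIndexWriter {

        public PlainTextIndexWriter(String indexDir, String splitVecPath)
                throws java.io.IOException {
            super(indexDir, splitVecPath);
        }

        @Override
        public void indexFile(java.io.File file) throws java.io.IOException {
            // Illustrative text-only schema; see indexFile(File) below for the
            // docFactory-based variant required by personalized search.
            String contents = new String(
                    java.nio.file.Files.readAllBytes(file.toPath()),
                    java.nio.charset.StandardCharsets.UTF_8);
            org.apache.lucene.document.Document doc =
                    new org.apache.lucene.document.Document();
            doc.add(new org.apache.lucene.document.TextField(
                    "contents", contents, org.apache.lucene.document.Field.Store.NO));
            writer.addDocument(doc); // protected IndexWriter inherited from LuIndexWriter
        }
    }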
    • Constructor Summary

      Constructors 
      Constructor Description
      LuIndexWriter​(java.lang.String indexDirectoryPath, int model, int numHash)  
      LuIndexWriter​(java.lang.String indexDirectoryPath, java.lang.String splitVecPath)  
    • Method Summary

      All Methods Instance Methods Abstract Methods Concrete Methods 
      Modifier and Type Method Description
      void close()
      Closes all open resources and releases the write lock.
      void createIndexFromDir​(java.lang.String dataDirPath, java.io.FileFilter filter)  
      void createIndexFromVecData​(double[][] itemVecs)  
      void deleteByID​(java.lang.Object ID)  
      abstract void indexFile​(java.io.File file)
      Implement this method to parse the information from your file that should be indexed.
      void optimize()  
      void setMaxBufferDocNum​(int num)
      Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new Segment.
      void setMaxBufferRAMSize​(double mb)
      Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • writer

        protected org.apache.lucene.index.IndexWriter writer
    • Constructor Detail

      • LuIndexWriter

        public LuIndexWriter​(java.lang.String indexDirectoryPath,
                             java.lang.String splitVecPath)
                      throws java.io.IOException
        Constructor using an existing LSHash vector object. This will try to allocate as much memory as possible for the writing buffer. If a path to the LSH vectors object is not specified, the index writer will still load, but any operation involving latent item vectors will throw a NullPointerException.
        Parameters:
        indexDirectoryPath - path to the folder containing the index files.
        splitVecPath - path to the object file containing the LSH vectors.
        Throws:
        java.io.IOException - thrown when a path or file does not exist.
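        For instance (the paths are placeholders, and PlainTextIndexWriter refers to the illustrative subclass sketched at the top of this page):

        LuIndexWriter indexWriter =
                new PlainTextIndexWriter("/path/to/index", "/path/to/lsh-vectors.o"); // placeholder paths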
      • LuIndexWriter

        public LuIndexWriter​(java.lang.String indexDirectoryPath,
                             int model,
                             int numHash)
                      throws java.io.IOException
        Constructor that randomizes a new hashtable, saves it to the folder containing the index files, and saves its metadata to the database. Note that this is intended to work with other, as-yet unreleased components of Cerebro; as such, instantiating LuIndexWriter this way is not recommended.
        Parameters:
        indexDirectoryPath - path to the folder containing the index files.
        model - model ID deciding which configuration to get from the database.
        numHash - number of hashing vectors to randomize.
        Throws:
        java.io.IOException - thrown when a path or file does not exist.
    • Method Detail

      • close

        public final void close()
                         throws java.io.IOException
        Closes all open resources and releases the write lock.

        Note that this may be a costly operation, so try to re-use a single writer instead of closing and opening a new one.

        NOTE: You must ensure no other threads are still making changes at the same time that this method is invoked.

        Throws:
        java.io.IOException
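        A minimal lifecycle sketch (indexWriter is the illustrative instance constructed above):

        try {
            // ... add, update, or delete documents with a single long-lived writer ...
        } finally {
            indexWriter.close(); // releases the write lock
        }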
      • setMaxBufferDocNum

        public final void setMaxBufferDocNum​(int num)
        Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new Segment. Large values generally give faster indexing.

        When this is set, the writer will flush every maxBufferedDocs added documents. Pass in IndexWriterConfig.DISABLE_AUTO_FLUSH to prevent triggering a flush due to number of buffered documents. Note that if flushing by RAM usage is also enabled, then the flush will be triggered by whichever comes first.

        Disabled by default (writer flushes by RAM usage).

        Takes effect immediately, but only the next time a document is added, updated or deleted.

        Throws:
        java.lang.IllegalArgumentException - if maxBufferedDocs is enabled but smaller than 2, or it disables maxBufferedDocs when ramBufferSize is already disabled.
        See Also:
        setMaxBufferRAMSize(double)
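        For example, to flush a new segment every 1,000 buffered documents (an arbitrary figure) and make document count the only trigger:

        indexWriter.setMaxBufferDocNum(1000);
        indexWriter.setMaxBufferRAMSize(
                org.apache.lucene.index.IndexWriterConfig.DISABLE_AUTO_FLUSH);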
      • setMaxBufferRAMSize

        public final void setMaxBufferRAMSize​(double mb)
        Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Generally for faster indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can.

        When this is set, the writer will flush whenever buffered documents and deletions use this much RAM. Pass in IndexWriterConfig.DISABLE_AUTO_FLUSH to prevent triggering a flush due to RAM usage. Note that if flushing by document count is also enabled, then the flush will be triggered by whichever comes first.

        The maximum RAM limit is inherently determined by the JVM's available memory. Yet, an IndexWriter session can consume a significantly larger amount of memory than the given RAM limit since this limit is just an indicator of when to flush memory-resident documents to the Directory. Flushes are likely to happen concurrently while other threads are adding documents to the writer. For application stability the available memory in the JVM should be significantly larger than the RAM buffer used for indexing.

        NOTE: the accounting of RAM usage for pending deletions is only approximate. Specifically, if you delete by Query, Lucene currently has no way to measure the RAM usage of individual Queries so the accounting will under-estimate and you should compensate by either calling commit() or refresh() periodically yourself.

        NOTE: It's not guaranteed that all memory resident documents are flushed once this limit is exceeded.

        The default value is IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB.

        Takes effect immediately, but only the next time a document is added, updated or deleted.

        Throws:
        java.lang.IllegalArgumentException - if ramBufferSize is enabled but non-positive, or it disables ramBufferSize when maxBufferedDocs is already disabled
        See Also:
        IndexWriterConfig.setRAMPerThreadHardLimitMB(int)
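        Conversely, to flush by RAM usage alone with a roughly 256 MB buffer (an illustrative value):

        indexWriter.setMaxBufferRAMSize(256.0);
        indexWriter.setMaxBufferDocNum(
                org.apache.lucene.index.IndexWriterConfig.DISABLE_AUTO_FLUSH);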
      • deleteByID

        public void deleteByID​(java.lang.Object ID)
                        throws java.io.IOException,
                               UnsupportedDataType
        Deletes a document by its unique ID. Note that you should let Cerebro handle the ID field automatically, only passing the ID value in as either an integer or a string.
        Parameters:
        ID - ID of the document to delete.
        Throws:
        java.io.IOException
        UnsupportedDataType
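        For example, assuming the document was indexed with an integer ID managed by Cerebro:

        indexWriter.deleteByID(42); // or indexWriter.deleteByID("item-42") for a string ID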
      • optimize

        public void optimize()
                      throws java.io.IOException
        With multithreading, trying to get all of the index into one segment has no advantage. You should let Lucene decide when to carry out index optimization.
        Throws:
        java.io.IOException
      • createIndexFromDir

        public final void createIndexFromDir​(java.lang.String dataDirPath,
                                             java.io.FileFilter filter)
                                      throws java.io.IOException
        Lists all the acceptable files in the given directory and passes them individually to indexFile(File) to index the file content.
        Parameters:
        dataDirPath - path to the folder containing the data.
        filter - an object to filter out the file types we don't want to read.
        Throws:
        java.io.IOException
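        For example, to index only the .txt files in a data folder (the path and filter are illustrative):

        indexWriter.createIndexFromDir("/path/to/data",
                file -> file.isFile() && file.getName().endsWith(".txt"));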
      • indexFile

        public abstract void indexFile​(java.io.File file)
                                throws java.io.IOException
        Implement this method to parse the information from your file that should be indexed. If you are using the personalized search function, use docFactory to create your Documents; do not create Lucene documents directly if you want to use personalized search. See the deprecated method createIndexFromVecData(double[][]) for an example of how to work with docFactory.
        Throws:
        java.io.IOException
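        A rough sketch of a personalized-search implementation, assuming docFactory is the PersonalizedDocFactory instance inherited from LuIndexWriter (it is referenced but not documented on this page) and that parseIdFrom/parseVectorFrom are hypothetical helpers for your own file format:

        @Override
        public void indexFile(java.io.File file) throws java.io.IOException {
            Object id = parseIdFrom(file);                 // hypothetical helper
            double[] latentVector = parseVectorFrom(file); // hypothetical helper
            try {
                // Pair every create(...) with getDoc() so the factory is cleared
                // before the next document.
                docFactory.create(id, latentVector);
                writer.addDocument(docFactory.getDoc());
            } catch (Exception e) {
                // The exact checked exceptions of docFactory are not listed on this
                // page; adapt the handling to your Cerebro version.
                throw new java.io.IOException(e);
            }
        }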
      • createIndexFromVecData

        public void createIndexFromVecData​(double[][] itemVecs)
                                    throws java.lang.Exception
        Indexes the given set of vectors, using a document's order number as its ID; this may vary from use case to use case.
        Parameters:
        itemVecs - the set of item latent vectors to be indexed.
        Throws:
        java.io.IOException
        DocNotClearedException - thrown when a call to PersonalizedDocFactory.create(Object, double[]) is not paired with a call to PersonalizedDocFactory.getDoc().
        java.lang.Exception
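        Since indexFile(File) above points here as a docFactory reference, a minimal call might look like this (toy vectors; each row's order number becomes its document ID):

        double[][] itemVecs = {
                {0.10, 0.72, 0.33},
                {0.58, 0.04, 0.91}
        };
        indexWriter.createIndexFromVecData(itemVecs); // declared to throw java.lang.Exception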