Class LuIndexWriter
- java.lang.Object
-
- ai.preferred.cerebro.index.builder.LuIndexWriter
-
- All Implemented Interfaces:
VersatileIndexing
public abstract class LuIndexWriter extends java.lang.Object implements VersatileIndexing
Wrapper class containing an instance of Lucene's IndexWriter
that facilitates the indexing of both text objects and latent feature vectors. Note that LuIndexWriter is currently not thread-safe due to the way it uses PersonalizedDocFactory. This will be fixed in a future version.
-
-
Field Summary
Fields
- protected PersonalizedDocFactory docFactory
- protected org.apache.lucene.index.IndexWriter writer
-
Constructor Summary
Constructors
- LuIndexWriter(java.lang.String indexDirectoryPath, int model, int numHash)
- LuIndexWriter(java.lang.String indexDirectoryPath, java.lang.String splitVecPath)
-
Method Summary
All Methods | Instance Methods | Abstract Methods | Concrete Methods
- void close()
  Closes all open resources and releases the write lock.
- void createIndexFromDir(java.lang.String dataDirPath, java.io.FileFilter filter)
- void createIndexFromVecData(double[][] itemVecs)
- void deleteByID(java.lang.Object ID)
- abstract void indexFile(java.io.File file)
  Implement this method to parse information from your file to be indexed.
- void optimize()
- void setMaxBufferDocNum(int num)
  Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new Segment.
- void setMaxBufferRAMSize(double mb)
  Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface ai.preferred.cerebro.index.builder.VersatileIndexing
indexKeyWords, indexLatentVectors
-
-
-
-
Field Detail
-
writer
protected org.apache.lucene.index.IndexWriter writer
-
docFactory
protected PersonalizedDocFactory docFactory
-
-
Constructor Detail
-
LuIndexWriter
public LuIndexWriter(java.lang.String indexDirectoryPath, java.lang.String splitVecPath) throws java.io.IOException
Constructor using an existing LSHash vector object. This will try to allocate as much memory as possible for the writing buffer. If the path to the LSH vectors object is not specified, the IndexWriter will still load, but any operation involving latent item vectors will throw a NullPointerException.
- Parameters:
indexDirectoryPath - directory of the folder containing the index files.
splitVecPath - path to the object file containing the LSH vectors.
- Throws:
java.io.IOException - triggered when a path or file does not exist.
-
LuIndexWriter
public LuIndexWriter(java.lang.String indexDirectoryPath, int model, int numHash) throws java.io.IOException
Constructor that randomizes a new hashtable, saves it to the same folder containing the index files, and saves the metadata to the database. Note that this is intended to work with other, unreleased components of Cerebro; as such, it is not recommended to instantiate LuIndexWriter this way.
- Parameters:
indexDirectoryPath - directory of the folder containing the index files.
model - model ID to decide which configuration to get from the database.
numHash - number of hashing vectors to randomize.
- Throws:
java.io.IOException - triggered when a path or file does not exist.
-
-
Method Detail
-
close
public final void close() throws java.io.IOException
Closes all open resources and releases the write lock. Note that this may be a costly operation, so try to re-use a single writer instead of closing it and opening a new one.
NOTE: You must ensure no other threads are still making changes at the same time that this method is invoked.
- Throws:
java.io.IOException
-
setMaxBufferDocNum
public final void setMaxBufferDocNum(int num)
Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new Segment. Large values generally give faster indexing. When this is set, the writer will flush every maxBufferedDocs added documents. Pass in
IndexWriterConfig.DISABLE_AUTO_FLUSH
to prevent triggering a flush due to the number of buffered documents. Note that if flushing by RAM usage is also enabled, then the flush will be triggered by whichever comes first. Disabled by default (the writer flushes by RAM usage).
Takes effect immediately, but only the next time a document is added, updated or deleted.
- Throws:
java.lang.IllegalArgumentException
- if maxBufferedDocs is enabled but smaller than 2, or if it disables maxBufferedDocs when ramBufferSize is already disabled.
- See Also:
setMaxBufferRAMSize(double)
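The "whichever comes first" interaction between the document-count and RAM triggers can be sketched as follows. This is a toy model of the documented flush semantics, not Lucene's implementation; all names and thresholds here are illustrative only.

```java
// Toy model of the dual flush triggers described above: either threshold
// may be disabled, and a flush fires when any enabled threshold is reached.
public class FlushPolicySketch {
    static final int DISABLE_AUTO_FLUSH = -1; // mirrors IndexWriterConfig's sentinel

    int maxBufferedDocs = DISABLE_AUTO_FLUSH; // disabled by default
    double maxBufferedRamMb = 16.0;           // flush-by-RAM enabled

    int bufferedDocs = 0;
    double bufferedRamMb = 0.0;

    // True when any enabled threshold has been reached.
    boolean shouldFlush() {
        boolean byDocs = maxBufferedDocs != DISABLE_AUTO_FLUSH
                && bufferedDocs >= maxBufferedDocs;
        boolean byRam = maxBufferedRamMb != DISABLE_AUTO_FLUSH
                && bufferedRamMb >= maxBufferedRamMb;
        return byDocs || byRam;
    }

    public static void main(String[] args) {
        FlushPolicySketch policy = new FlushPolicySketch();
        policy.maxBufferedDocs = 3;     // flush every 3 docs...
        policy.maxBufferedRamMb = 16.0; // ...or at 16 MB, whichever comes first
        policy.bufferedDocs = 3;        // doc-count threshold reached
        policy.bufferedRamMb = 1.5;     // RAM threshold not reached
        System.out.println(policy.shouldFlush()); // prints "true"
    }
}
```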
-
setMaxBufferRAMSize
public final void setMaxBufferRAMSize(double mb)
Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Generally, for faster indexing performance, it's best to flush by RAM usage instead of document count, and to use as large a RAM buffer as you can. When this is set, the writer will flush whenever buffered documents and deletions use this much RAM. Pass in
IndexWriterConfig.DISABLE_AUTO_FLUSH
to prevent triggering a flush due to RAM usage. Note that if flushing by document count is also enabled, then the flush will be triggered by whichever comes first. The maximum RAM limit is inherently determined by the JVM's available memory. Yet, an
IndexWriter
session can consume a significantly larger amount of memory than the given RAM limit, since this limit is just an indicator of when to flush memory-resident documents to the Directory. Flushes are likely to happen concurrently while other threads are adding documents to the writer. For application stability, the available memory in the JVM should be significantly larger than the RAM buffer used for indexing. NOTE: the accounting of RAM usage for pending deletions is only approximate. Specifically, if you delete by Query, Lucene currently has no way to measure the RAM usage of individual Queries, so the accounting will under-estimate, and you should compensate by calling commit() or refresh() periodically yourself.
NOTE: it's not guaranteed that all memory-resident documents are flushed once this limit is exceeded.
The default value is
IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB
. Takes effect immediately, but only the next time a document is added, updated or deleted.
- Throws:
java.lang.IllegalArgumentException
- if ramBufferSize is enabled but non-positive, or if it disables ramBufferSize when maxBufferedDocs is already disabled.
- See Also:
IndexWriterConfig.setRAMPerThreadHardLimitMB(int)
-
deleteByID
public void deleteByID(java.lang.Object ID) throws java.io.IOException, UnsupportedDataType
Deletes a document by its unique ID. Note that you should let Cerebro handle the ID field automatically, passing in only the ID value, either as an integer or a string.
- Parameters:
ID - ID of the document to delete.
- Throws:
java.io.IOException
UnsupportedDataType
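The restriction to integer or string IDs (the documented trigger for UnsupportedDataType) can be modeled with a simple type check. This is a sketch of the documented contract, not Cerebro's actual code.

```java
// Model of deleteByID's documented ID-type restriction: the ID value
// must be passed as an Integer or a String; any other type is rejected.
public class IdTypeCheck {
    static boolean isSupportedId(Object id) {
        return id instanceof Integer || id instanceof String;
    }

    public static void main(String[] args) {
        System.out.println(isSupportedId(42));       // Integer -> true
        System.out.println(isSupportedId("doc-42")); // String  -> true
        System.out.println(isSupportedId(3.14));     // Double  -> false
    }
}
```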
-
optimize
public void optimize() throws java.io.IOException
With multithreading, trying to get the entire index into one segment has no advantage; you should let Lucene decide when to carry out index optimization.
- Throws:
java.io.IOException
-
createIndexFromDir
public final void createIndexFromDir(java.lang.String dataDirPath, java.io.FileFilter filter) throws java.io.IOException
Lists all the acceptable files in the given directory and passes them individually to indexFile(File) to index the file content.
- Parameters:
dataDirPath - directory of the folder containing the data.
filter - an object to filter out all the types of file we don't want to read.
- Throws:
java.io.IOException
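A typical filter argument is a plain java.io.FileFilter; here is one that accepts only ".txt" files. The commented writer usage at the end is hypothetical (any concrete LuIndexWriter subclass and data path would do).

```java
import java.io.File;
import java.io.FileFilter;

// Example FileFilter for createIndexFromDir: accept only ".txt" files.
public class TxtFilterDemo {
    public static void main(String[] args) {
        FileFilter txtOnly = f -> f.getName().endsWith(".txt");
        System.out.println(txtOnly.accept(new File("notes.txt"))); // true
        System.out.println(txtOnly.accept(new File("image.png"))); // false
        // In real usage, with a concrete LuIndexWriter subclass:
        // writer.createIndexFromDir("/path/to/data", txtOnly);
    }
}
```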
-
indexFile
public abstract void indexFile(java.io.File file) throws java.io.IOException
Implement this method to parse information from your file to be indexed. If you are utilizing the personalized search function, please use docFactory to create your Documents; do not create Lucene documents directly if you want to use personalized search. See the deprecated function createIndexFromVecData(double[][]) as an example of how to work with docFactory.
- Throws:
java.io.IOException
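The parsing half of an indexFile implementation might look like the sketch below. The record format ("<id>\t<text>", one record per line) is an assumption for illustration; in a real subclass you would hand the parsed fields to docFactory instead of printing them.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;

// Sketch of file parsing inside an indexFile override. The tab-separated
// record format is hypothetical; adapt it to your own data files.
public class IndexFileSketch {
    // Split "<id>\t<text>" into its two fields.
    static String[] parseRecord(String line) {
        String[] parts = line.split("\t", 2);
        return new String[]{parts[0], parts.length > 1 ? parts[1] : ""};
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("records", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), Arrays.asList("42\thello world"));
        for (String line : Files.readAllLines(tmp.toPath())) {
            String[] rec = parseRecord(line);
            // Real code: docFactory would receive rec[0] (ID) and rec[1] (text).
            System.out.println(rec[0] + " -> " + rec[1]);
        }
    }
}
```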
-
createIndexFromVecData
public void createIndexFromVecData(double[][] itemVecs) throws java.lang.Exception
Indexes the given set of vectors, using a document's ordinal number as its ID; this may vary from use case to use case.
- Parameters:
itemVecs - the set of item latent vectors to be indexed.
- Throws:
java.io.IOException
DocNotClearedException - triggered when a call to PersonalizedDocFactory.create(Object, double[]) is not paired with a call to PersonalizedDocFactory.getDoc().
java.lang.Exception
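The create()/getDoc() pairing that DocNotClearedException enforces, together with ordinal-number IDs, can be modeled as below. This is a stand-in for PersonalizedDocFactory's documented contract, not its actual implementation.

```java
// Toy model of PersonalizedDocFactory's pairing contract: every create()
// must be followed by getDoc() before the next create() is allowed.
public class PairingSketch {
    private boolean pendingDoc = false;

    void create(Object id, double[] vec) {
        if (pendingDoc) // models DocNotClearedException
            throw new IllegalStateException("previous doc not collected");
        pendingDoc = true;
    }

    Object getDoc() {
        pendingDoc = false;
        return new Object(); // stands in for the assembled Lucene Document
    }

    public static void main(String[] args) {
        PairingSketch factory = new PairingSketch();
        double[][] itemVecs = {{0.1, 0.2}, {0.3, 0.4}};
        for (int id = 0; id < itemVecs.length; id++) { // ordinal number as ID
            factory.create(id, itemVecs[id]);
            factory.getDoc(); // paired with each create()
        }
        System.out.println("indexed " + itemVecs.length + " vectors");
    }
}
```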
-
-