Class DeDuplicatingTokenFilter
- java.lang.Object
- org.apache.lucene.util.AttributeSource
- org.apache.lucene.analysis.TokenStream
- org.apache.lucene.analysis.TokenFilter
- org.apache.lucene.analysis.FilteringTokenFilter
- org.apache.lucene.analysis.miscellaneous.DeDuplicatingTokenFilter

All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public class DeDuplicatingTokenFilter extends org.apache.lucene.analysis.FilteringTokenFilter
Inspects token streams for duplicate sequences of tokens. Token sequences have a minimum length; 6 is a good heuristic, as it avoids filtering common idioms and phrases but still detects the longer sections that are typical of cut-and-paste copies of text.

Internally, each token is hashed and reduced modulo 256 to a single byte (so there are 256 possible values for each token) and then recorded in a trie of seen byte sequences using a DuplicateByteSequenceSpotter. This trie is passed into the TokenFilter constructor so that a single object can be reused across multiple documents.

The emitDuplicates setting controls whether duplicate tokens are filtered from the results or are output (the DuplicateSequenceAttribute attribute can be used to inspect the number of prior sightings when emitDuplicates is true).
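
The hashing and trie bookkeeping described above can be sketched in plain Java. This is a simplified, hypothetical stand-in and not the code of DuplicateByteSequenceSpotter or of this filter: the class name DuplicateSequenceSketch, the process method, the use of String.hashCode() as the token hash, and the fixed-size sliding window are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT the real DuplicateByteSequenceSpotter.
final class DuplicateSequenceSketch {

    // One trie node per byte of a recorded sequence.
    private static final class Node {
        final Map<Byte, Node> children = new HashMap<>();
        int sightings; // how many times a full sequence has ended at this node
    }

    private final Node root = new Node();
    private final int sequenceLength;

    DuplicateSequenceSketch(int sequenceLength) {
        if (sequenceLength < 1) {
            throw new IllegalArgumentException("sequenceLength must be >= 1");
        }
        this.sequenceLength = sequenceLength;
    }

    // Assumption: String.hashCode() stands in for the token hash.
    // The hash is reduced modulo 256, so each token maps to one of 256 byte values.
    static byte hashToByte(String token) {
        return (byte) Math.floorMod(token.hashCode(), 256);
    }

    // Records every window of sequenceLength tokens in the shared trie.
    // Returns, for each token, the largest number of prior sightings of any
    // window that contains it (0 means the token was in no repeated window).
    int[] process(List<String> tokens) {
        int[] priorSightings = new int[tokens.size()];
        for (int start = 0; start + sequenceLength <= tokens.size(); start++) {
            Node node = root;
            for (int i = start; i < start + sequenceLength; i++) {
                node = node.children.computeIfAbsent(hashToByte(tokens.get(i)), k -> new Node());
            }
            int seenBefore = node.sightings++;
            for (int i = start; i < start + sequenceLength; i++) {
                priorSightings[i] = Math.max(priorSightings[i], seenBefore);
            }
        }
        return priorSightings;
    }
}
```

Because the trie lives in the sketch object rather than in a per-document structure, calling process on a second document reports sequences already seen in the first, which mirrors the reason the spotter is passed into the constructor. A filter built on this idea could drop tokens with a non-zero count, or emit them alongside the count, in the spirit of emitDuplicates. Since only one byte per token is stored, unrelated tokens can collide, so a longer minimum sequence length makes false matches less likely.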
Constructor Summary
Constructors:
- DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
- DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
Method Summary
Methods:
- protected boolean accept()
Methods inherited from class org.apache.lucene.analysis.FilteringTokenFilter
end, incrementToken, reset
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Constructor Detail
DeDuplicatingTokenFilter
public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
-
DeDuplicatingTokenFilter
public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
Parameters:
- in - The input token stream
- byteStreamDuplicateSpotter - object which retains trie of token sequences
- emitDuplicates - true if duplicate tokens are to be emitted (use DuplicateSequenceAttribute attribute to inspect number of prior sightings of tokens as part of a sequence).