public class HTTPContentDigest extends Processor
This processor allows the operator to specify a regular expression called strip-reg-expr. Any segment of a document that matches this regular expression will be rewritten with the blank character (character 32 in the ANSI character set) for the purpose of computing the digest. Only text documents are handled; binary files are skipped. The rewriting has no effect on the document for subsequent processing or archiving.
NOTE: Content digest only accounts for the document body, not headers.
The operator can also specify a maximum length for documents evaluated by this processor. Documents exceeding that length are ignored.
To further discriminate by file type or URL, an operator should use the override and refinement options.
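The stripping and size limit described above can be illustrated with a small standalone sketch. It is not the processor's actual implementation: the class name StripDigestSketch, the choice of SHA-1, the UTF-8 encoding, the collapse of each match to a single space, and the comparison of the limit against character count are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Heritrix.
final class StripDigestSketch {

    /**
     * Returns a hex digest of the body after blanking regex matches,
     * or an empty Optional when the body exceeds the size limit.
     */
    static Optional<String> digest(CharSequence body, String stripRegex, long maxSizeToDigest) {
        // Assumption: the limit is compared against the character count.
        if (body.length() > maxSizeToDigest) {
            return Optional.empty();
        }
        // Assumption: each match is replaced by one space (character 32).
        String stripped = Pattern.compile(stripRegex).matcher(body).replaceAll(" ");
        try {
            // Assumption: SHA-1 over UTF-8 bytes, chosen only for the example.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] hash = md.digest(stripped.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return Optional.of(hex.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        String regex = "<!-- generated: [^>]* -->";
        String first = "<p>Hello</p><!-- generated: 2019-01-01T00:00:00Z -->";
        String second = "<p>Hello</p><!-- generated: 2019-06-30T12:34:56Z -->";
        // The two bodies differ only in the stripped timestamp, so the digests agree.
        System.out.println(digest(first, regex, 1024).equals(digest(second, regex, 1024)));
        // A body longer than the limit is not digested at all.
        System.out.println(digest(first, regex, 10).isPresent());
    }
}
```

The stripped string is used only as input to the digest; the original body is left untouched, which mirrors the statement that stripping does not affect later processing or archiving.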
It is generally recommended that this recalculation be performed only when absolutely needed (because of stripping data that changes automatically each time the URL is fetched), as it is an expensive operation.
NOTE: This processor may open a ReplayCharSequence from the CrawlURI's Recorder, without closing that ReplayCharSequence, to allow reuse by later processors in sequence. In the usual (Heritrix) case, a call after all processing to the Recorder's endReplays() method ensures timely close of any reused ReplayCharSequences. Reuse of this processor elsewhere should ensure a similar cleanup call to Recorder.endReplays() occurs.
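The cleanup requirement in the note above amounts to a try/finally around the whole processing chain. The following sketch shows that pattern with stand-in types only: StubRecorder, its openReplay and openReplayCount methods, the Step interface, and runChain are all hypothetical and do not reflect the real Recorder API beyond the endReplays() name mentioned in the note.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are stand-ins, not Heritrix classes.
final class ReplayCleanupSketch {

    /** Stand-in for a recorder that hands out replays and can close them all. */
    static final class StubRecorder {
        private int openReplays = 0;

        CharSequence openReplay(String content) {
            openReplays++;          // a replay is now open and may be reused later
            return content;
        }

        void endReplays() {
            openReplays = 0;        // closes every replay that is still open
        }

        int openReplayCount() {
            return openReplays;
        }
    }

    /** Stand-in for one processing step in a chain. */
    interface Step {
        void process(StubRecorder recorder) throws Exception;
    }

    /** Runs every step, then always ends replays, even if a step fails. */
    static void runChain(StubRecorder recorder, List<Step> steps) throws Exception {
        try {
            for (Step step : steps) {
                step.process(recorder);
            }
        } finally {
            recorder.endReplays();
        }
    }

    public static void main(String[] args) {
        StubRecorder recorder = new StubRecorder();
        List<Step> steps = new ArrayList<>();
        steps.add(r -> r.openReplay("body"));   // digest-like step leaves its replay open
        steps.add(r -> { throw new IllegalStateException("later step failed"); });
        try {
            runChain(recorder, steps);
        } catch (Exception e) {
            System.out.println("chain failed: " + e.getMessage());
        }
        System.out.println("open replays after chain: " + recorder.openReplayCount());
    }
}
```

Placing the cleanup in a finally block means a failure in a later step does not leave a replay open.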
Constructor and Description |
---|
HTTPContentDigest(): Constructor. |
Modifier and Type | Method and Description |
---|---|
long | getMaxSizeToDigest() |
String | getStripRegex() |
protected void | innerProcess(CrawlURI curi): Actually performs the process. |
void | setMaxSizeToDigest(long threshold) |
void | setStripRegex(String regex) |
protected boolean | shouldProcess(CrawlURI uri): Determines whether the given uri should be processed by this processor. |
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop, toCheckpointJson
public String getStripRegex()
public void setStripRegex(String regex)
public long getMaxSizeToDigest()
public void setMaxSizeToDigest(long threshold)
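protected boolean shouldProcess(CrawlURI uri)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
Specified by: shouldProcess in class Processor
Parameters: uri - the URI to test
protected void innerProcess(CrawlURI curi) throws InterruptedException
Description copied from class: Processor
Actually performs the process. By the time this method is invoked, the given URI has passed the Processor.getEnabled(), the Processor.getShouldProcessRule() and the Processor.shouldProcess(CrawlURI) tests.
Specified by: innerProcess in class Processor
Parameters: curi - the URI to process
Throws: InterruptedException - if the thread is interrupted
Copyright © 2003–2019 Internet Archive. All rights reserved.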