public abstract class ContentExtractor extends Extractor
Fields inherited from class Extractor: DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor and Description |
---|
ContentExtractor() |
Modifier and Type | Method and Description |
---|---|
protected void | extract(CrawlURI uri): Extracts links. |
protected abstract boolean | innerExtract(CrawlURI uri): Actually extracts links. |
protected abstract boolean | shouldExtract(CrawlURI uri): Determines if otherwise valid URIs should have links extracted or not. |
protected boolean | shouldProcess(CrawlURI uri): Determines if links should be extracted from the given URI. |
Methods inherited from class Extractor: add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
protected final void extract(CrawlURI uri)
Extracts links.
protected final boolean shouldProcess(CrawlURI uri)
Determines if links should be extracted from the given URI. This method performs several checks in order. The first check is that CrawlURI.isSuccess() returns true.
The second check runs only if ExtractorParameters.getExtractIndependently() is false. It checks the result of CrawlURI.hasBeenLinkExtracted(). If that result is true, then this method returns false, because some other extractor has already claimed that links are extracted.
Next, this method checks that the content length of the URI is greater than zero (in other words, that there is actually content for links to be extracted from). If the content length of the URI is zero or less, then this method returns false.
Finally, this method delegates to shouldExtract(CrawlURI) and returns that result.
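The checks above can be read as a short chain of guard clauses. The following is a minimal sketch under stated assumptions, not the Heritrix source: `Uri` is a hypothetical stand-in for CrawlURI, `getContentLength()` is an assumed accessor name for the content length, the `extractIndependently` flag stands in for ExtractorParameters.getExtractIndependently(), and the final delegated check is passed in as a predicate called `subclassCheck`.

```java
import java.util.function.Predicate;

public class ShouldProcessSketch {

    /** Hypothetical stand-in for CrawlURI; only the calls named in the text are modeled. */
    interface Uri {
        boolean isSuccess();
        boolean hasBeenLinkExtracted();
        long getContentLength(); // assumed accessor name
    }

    /** Applies the documented checks in order and returns false at the first failure. */
    static boolean shouldProcess(Uri uri, boolean extractIndependently,
                                 Predicate<Uri> subclassCheck) {
        // Check 1: the URI must report success.
        if (!uri.isSuccess()) {
            return false;
        }
        // Check 2: skipped when extractors run independently of each other.
        if (!extractIndependently && uri.hasBeenLinkExtracted()) {
            return false;
        }
        // Check 3: there must be content to extract links from.
        if (uri.getContentLength() <= 0) {
            return false;
        }
        // Final step: delegate to the subclass-specific check.
        return subclassCheck.test(uri);
    }

    /** Builds a fixed-value Uri for demonstration purposes. */
    static Uri uri(final boolean success, final boolean linkExtracted, final long length) {
        return new Uri() {
            public boolean isSuccess() { return success; }
            public boolean hasBeenLinkExtracted() { return linkExtracted; }
            public long getContentLength() { return length; }
        };
    }

    public static void main(String[] args) {
        Predicate<Uri> always = u -> true;
        // Passes every check.
        System.out.println(shouldProcess(uri(true, false, 1024), false, always));
        // Rejected by check 2: another extractor already claimed the URI.
        System.out.println(shouldProcess(uri(true, true, 1024), false, always));
        // Check 2 is skipped when extracting independently.
        System.out.println(shouldProcess(uri(true, true, 1024), true, always));
        // Rejected by check 3: no content.
        System.out.println(shouldProcess(uri(true, false, 0), false, always));
    }
}
```

Ordering the checks from cheapest to most specific means the subclass check only ever sees URIs that succeeded, are unclaimed (or are being extracted independently), and have content.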
shouldProcess in class Processor
uri - the URI to check
protected abstract boolean shouldExtract(CrawlURI uri)
Determines if otherwise valid URIs should have links extracted or not. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
uri - the URI to check
protected abstract boolean innerExtract(CrawlURI uri)
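Actually extracts links. This method is invoked only for URIs that have passed the checks described in shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags and so on.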
This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
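The return-value contract can be illustrated with a toy subclass. This is a sketch under stated assumptions rather than Heritrix code: `Page` and `SketchExtractor` are hypothetical stand-ins for CrawlURI and ContentExtractor, the `linkExtracted` flag models the state reported by hasBeenLinkExtracted(), and the way `extract` sets that flag from the result of `innerExtract` is an assumption made for illustration. The regular expression is a deliberately naive way to find `href` attributes and is not how ExtractorHTML parses content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnerExtractSketch {

    /** Hypothetical stand-in for CrawlURI. */
    static class Page {
        final String contentType;
        final String body; // null simulates content that could not be read
        final List<String> outlinks = new ArrayList<>();
        boolean linkExtracted; // models hasBeenLinkExtracted()

        Page(String contentType, String body) {
            this.contentType = contentType;
            this.body = body;
        }
    }

    /** Hypothetical stand-in for ContentExtractor. */
    abstract static class SketchExtractor {
        protected abstract boolean shouldExtract(Page page);

        protected abstract boolean innerExtract(Page page);

        // Assumption: a true result marks the page so later extractors skip it.
        final void extract(Page page) {
            if (innerExtract(page)) {
                page.linkExtracted = true;
            }
        }
    }

    /** Toy subclass that collects href values from HTML content. */
    static class HrefExtractor extends SketchExtractor {
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

        @Override
        protected boolean shouldExtract(Page page) {
            return page.contentType != null && page.contentType.startsWith("text/html");
        }

        @Override
        protected boolean innerExtract(Page page) {
            if (page.body == null) {
                return false; // extraction failed; leave the page for other extractors
            }
            Matcher m = HREF.matcher(page.body);
            while (m.find()) {
                page.outlinks.add(m.group(1));
            }
            return true; // extraction completed
        }
    }

    /** Runs the toy extractor the way a pipeline might: check first, then extract. */
    static Page run(String contentType, String body) {
        Page page = new Page(contentType, body);
        SketchExtractor extractor = new HrefExtractor();
        if (extractor.shouldExtract(page)) {
            extractor.extract(page);
        }
        return page;
    }

    public static void main(String[] args) {
        Page ok = run("text/html", "<a href=\"http://example.com/a\">a</a>");
        System.out.println(ok.outlinks + " extracted=" + ok.linkExtracted);
        Page failed = run("text/html", null);
        System.out.println(failed.outlinks + " extracted=" + failed.linkExtracted);
    }
}
```

In this sketch a failed extraction leaves `linkExtracted` unset, so a later extractor that consults the flag would still attempt the page.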
uri - the URI whose links to extract
Copyright © 2003–2019 Internet Archive. All rights reserved.