public class ExtractorUniversal extends ContentExtractor
To accomplish this it scans through the bytecode and tries to build up strings of consecutive bytes that all represent characters that are valid in a URL (see #isURLableChar(int) for details). Once it hits the end of such a string (i.e. finds a character that should not be in a URL), it tries to determine whether it has found a URL. This is done by checking whether the string is an IP address prefixed with http(s):// or contains a dot followed by a Top Level Domain and then either the end of the string or a slash.
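The scanning step described above can be sketched as follows. This is an illustration only, not the class's actual code: `UrlRunScanner`, `candidateRuns`, and the character set accepted by `isUrlableChar` are assumptions made for the example, and the real `isURLableChar(int)` may accept a different set.

```java
import java.util.ArrayList;
import java.util.List;

class UrlRunScanner {
    // Assumption: a simplified stand-in for isURLableChar(int).
    // Accepts ASCII letters, digits, and common URL punctuation.
    static boolean isUrlableChar(int ch) {
        return (ch >= 'a' && ch <= 'z')
                || (ch >= 'A' && ch <= 'Z')
                || (ch >= '0' && ch <= '9')
                || "-._~:/?#[]@!$&'()*+,;=%".indexOf(ch) >= 0;
    }

    // Splits raw bytes into maximal runs of URL-able characters.
    // Each run is a candidate that still has to be tested for URL-likeness.
    static List<String> candidateRuns(byte[] content) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : content) {
            int ch = b & 0xFF; // treat the byte as an unsigned value
            if (isUrlableChar(ch)) {
                current.append((char) ch);
            } else if (current.length() > 0) {
                // A non-URL character ends the current run.
                runs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            runs.add(current.toString()); // run that reaches end of input
        }
        return runs;
    }
}
```

Each returned run would then be checked against the IP address and TLD rules before being treated as a link.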
Modifier and Type | Field and Description |
---|---|
protected static Pattern | IP_ADDRESS: Matches any string that begins with http:// or https:// followed by something that looks like an IP address (four numbers, none longer than 3 characters, separated by 3 dots). |
static Pattern | TLDs: Matches any string that begins with a TLD (no .) followed by a '/' slash or end of string. |
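As a rough illustration of what the two patterns describe, the sketch below approximates them with hypothetical regular expressions. `IP_ADDRESS_SKETCH`, `TLD_SKETCH`, and `looksLikeUrl` are names invented for this example, the TLD list is deliberately tiny, and the actual expressions behind `IP_ADDRESS` and `TLDs` may differ.

```java
import java.util.regex.Pattern;

class UrlCandidateCheck {
    // Assumption: http:// or https:// followed by four dot-separated
    // groups of 1 to 3 digits, with no further digit directly after.
    static final Pattern IP_ADDRESS_SKETCH =
            Pattern.compile("^https?://\\d{1,3}(?:\\.\\d{1,3}){3}(?!\\d)");

    // Assumption: a short sample of TLDs, each followed by '/' or end of string.
    static final Pattern TLD_SKETCH =
            Pattern.compile("^(?:com|org|net|edu|gov)(?:/|$)", Pattern.CASE_INSENSITIVE);

    // Returns true if the candidate starts with an IP-style URL, or if the
    // text after any dot begins with a known TLD followed by '/' or the end.
    static boolean looksLikeUrl(String candidate) {
        if (IP_ADDRESS_SKETCH.matcher(candidate).find()) {
            return true;
        }
        int dot = candidate.indexOf('.');
        while (dot >= 0) {
            if (TLD_SKETCH.matcher(candidate.substring(dot + 1)).find()) {
                return true;
            }
            dot = candidate.indexOf('.', dot + 1);
        }
        return false;
    }
}
```

Note that the TLD pattern is applied to the text after a dot, which is why it is described as beginning with a TLD "(no .)".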
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor and Description |
---|
ExtractorUniversal(): Constructor. |
Modifier and Type | Method and Description |
---|---|
long | getMaxSizeToParse() |
protected boolean | innerExtract(CrawlURI curi): Actually extracts links. |
void | setMaxSizeToParse(long threshold) |
protected boolean | shouldExtract(CrawlURI uri): Determines if otherwise valid URIs should have links extracted or not. |
extract, shouldProcess
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
protected static final Pattern IP_ADDRESS
public static final Pattern TLDs
public long getMaxSizeToParse()
public void setMaxSizeToParse(long threshold)
protected boolean shouldExtract(CrawlURI uri)
Description copied from class: ContentExtractor
The ExtractorHTML implementation checks that the content-type of the given URI is text/html.
Specified by: shouldExtract in class ContentExtractor
Parameters: uri - the URI to check
protected boolean innerExtract(CrawlURI curi)
Description copied from class: ContentExtractor
See ContentExtractor.shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags and so on.
This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
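The return-value contract described above can be modelled with a toy chain. This is a hypothetical sketch, not the pipeline's real mechanism: `ExtractorChainSketch`, `LinkExtractor`, and `runChain` are invented for the example, and content is represented as a plain `String` rather than a `CrawlURI`.

```java
import java.util.List;

class ExtractorChainSketch {
    // Hypothetical stand-in for an extractor; true means extraction
    // completed successfully, false means it did not.
    interface LinkExtractor {
        boolean innerExtract(String content);
    }

    // Runs extractors in order and stops after the first one that
    // returns true. Returns how many extractors were invoked.
    static int runChain(List<LinkExtractor> chain, String content) {
        int invoked = 0;
        for (LinkExtractor extractor : chain) {
            invoked++;
            if (extractor.innerExtract(content)) {
                break; // success: downstream extractors are skipped
            }
            // false: fall through so the next extractor can try
        }
        return invoked;
    }
}
```

Under this model, a last-resort extractor placed at the end of the chain only runs when every extractor before it has returned false.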
Specified by: innerExtract in class ContentExtractor
Parameters: curi - the URI whose links to extract
Copyright © 2003–2019 Internet Archive. All rights reserved.