ExtractorUniversal (Heritrix 3: 'modules' subproject (reusable components) 3.4.0-20190205 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor
      - org.archive.modules.extractor.ExtractorUniversal

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorUniversal
extends ContentExtractor
```
A last-ditch extractor that looks at the raw bytes of the content and tries to extract anything that looks like a link. If used, it should always be specified as the last link extractor in the order file.
To accomplish this, it scans through the bytes and builds up strings of consecutive bytes that all represent characters that are valid in a URL (see #isURLableChar(int) for details). Once it reaches the end of such a string (i.e. finds a character that should not be in a URL), it tries to determine whether it has found a URL. This is done by checking whether the string is an IP address prefixed with http(s):// or contains a dot followed by a top-level domain (TLD) and then either the end of the string or a slash.
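
The scanning approach described above can be sketched roughly as follows. This is a minimal illustration, not the Heritrix implementation: the class `UrlScanSketch`, the character set in `isUrlableChar`, both regular expressions, and the four-entry TLD list are all assumptions made for the example. The real `isURLableChar(int)`, `IP_ADDRESS`, and `TLDs` may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UrlScanSketch {
    // Assumed stand-in for IP_ADDRESS: http(s):// plus four groups of 1-3 digits.
    private static final Pattern IP_URL = Pattern.compile(
            "^https?://\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(?:[:/].*)?$");

    // Assumed stand-in for TLDs: a tiny sample list, followed by '/' or end of string.
    private static final Pattern TLD_AT_START = Pattern.compile(
            "^(?:com|org|net|is)(?:/.*)?$", Pattern.CASE_INSENSITIVE);

    // Assumed approximation of isURLableChar(int): letters, digits, some URL punctuation.
    static boolean isUrlableChar(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
                || (b >= '0' && b <= '9') || "-._~:/?#@&=%+".indexOf(b) >= 0;
    }

    // A candidate is URL-like if it is an IP URL, or if the text after some dot
    // starts with a TLD that is followed by '/' or the end of the string.
    static boolean looksLikeUrl(String s) {
        if (IP_URL.matcher(s).matches()) {
            return true;
        }
        for (int dot = s.indexOf('.'); dot >= 0; dot = s.indexOf('.', dot + 1)) {
            if (TLD_AT_START.matcher(s.substring(dot + 1)).matches()) {
                return true;
            }
        }
        return false;
    }

    // Collect runs of URL-able bytes and keep the runs that look like URLs.
    static List<String> scan(byte[] content) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= content.length; i++) {
            // -1 past the end acts as a terminator that flushes the final run.
            int b = i < content.length ? (content[i] & 0xFF) : -1;
            if (b >= 0 && isUrlableChar(b)) {
                run.append((char) b);
            } else {
                if (run.length() > 0 && looksLikeUrl(run.toString())) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        return found;
    }
}
```

Because the scan works on raw bytes with no knowledge of the document format, it can produce false positives, which is one reason to run such an extractor only after all format-aware extractors.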

Author:

Kristinn Sigurdsson

Field Summary

Fields
Modifier and Type	Field and Description
`protected static Pattern`	`IP_ADDRESS` Matches any string that begins with http:// or https:// followed by something that looks like an IP address (four numbers, none longer than 3 characters, separated by 3 dots).
`static Pattern`	`TLDs` Matches any string that begins with a TLD (no .) followed by a '/' slash or end of string.

Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted

Fields inherited from class org.archive.modules.Processor
beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description
`ExtractorUniversal()` Constructor.

Method Summary

Modifier and Type	Method and Description
`long`	`getMaxSizeToParse()`
`protected boolean`	`innerExtract(CrawlURI curi)` Actually extracts links.
`void`	`setMaxSizeToParse(long threshold)`
`protected boolean`	`shouldExtract(CrawlURI uri)` Determines if otherwise valid URIs should have links extracted or not.

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - IP_ADDRESS
```
protected static final Pattern IP_ADDRESS
```
    Matches any string that begins with http:// or https:// followed by something that looks like an IP address (four numbers, none longer than 3 characters, separated by 3 dots). Does not ensure that the numbers are each in the range 0-255.
  - TLDs
```
public static final Pattern TLDs
```
    Matches any string that begins with a TLD (no .) followed by a '/' slash or end of string. If followed by slash then nothing after the slash is of consequence.
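
To illustrate the two field descriptions above, here is a small sketch of how such patterns behave. The expressions `LOOSE_IP` and `TLD`, the sample TLD list, and the helper `hasValidOctets` are hypothetical and written for this example only; they are not the expressions compiled into `IP_ADDRESS` and `TLDs`. The helper shows one way a caller could add the 0-255 range check that the documented pattern does not perform.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternSketch {
    // Hypothetical approximation of IP_ADDRESS; octet ranges are NOT checked.
    static final Pattern LOOSE_IP = Pattern.compile(
            "^https?://(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})(?:[:/].*)?$");

    // Hypothetical approximation of TLDs using a tiny sample list.
    // Anything after the slash is accepted without further inspection.
    static final Pattern TLD = Pattern.compile(
            "^(?:com|org|net|is)(?:/.*)?$", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeIpUrl(String s) {
        return LOOSE_IP.matcher(s).matches();
    }

    // Optional stricter check (an addition for illustration): every octet <= 255.
    static boolean hasValidOctets(String s) {
        Matcher m = LOOSE_IP.matcher(s);
        if (!m.matches()) {
            return false;
        }
        for (int group = 1; group <= 4; group++) {
            if (Integer.parseInt(m.group(group)) > 255) {
                return false;
            }
        }
        return true;
    }

    // Input is the text that follows a dot, so it carries no leading '.'.
    static boolean startsWithTld(String afterDot) {
        return TLD.matcher(afterDot).matches();
    }
}
```

With these assumed patterns, `http://999.1.1.1/` passes `looksLikeIpUrl` but fails `hasValidOctets`, and `org/anything` passes `startsWithTld` while `organic` does not.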
- Constructor Detail
  - ExtractorUniversal
```
public ExtractorUniversal()
```
    Constructor.
- Method Detail
  - getMaxSizeToParse
```
public long getMaxSizeToParse()
```
  - setMaxSizeToParse
```
public void setMaxSizeToParse(long threshold)
```
  - shouldExtract
```
protected boolean shouldExtract(CrawlURI uri)
```
    Description copied from class: ContentExtractor
    
    Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
    
    Specified by:
    
    shouldExtract in class ContentExtractor
    
    Parameters:
    
    uri - the URI to check
    
    Returns:
    
    true if links should be extracted from that URI, false otherwise
  - innerExtract
```
protected boolean innerExtract(CrawlURI curi)
```
    Description copied from class: ContentExtractor
    
    Actually extracts links. The given URI will have passed the three checks described in ContentExtractor.shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from Anchor tags and so on.
    This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
    
    Specified by:
    
    innerExtract in class ContentExtractor
    
    Parameters:
    
    curi - the URI whose links to extract
    
    Returns:
    
    true if link extraction finished; false if downstream extractors should attempt to extract links
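
The return-value contract of `innerExtract` described above can be modeled with a small sketch. Everything here is hypothetical: `SketchExtractor`, `run`, and `fixed` are invented for illustration and are not part of the Heritrix API, and the real processor chain may signal completion by a different mechanism than a simple loop. The sketch only shows the documented effect: once an extractor returns true, later extractors are skipped.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractorChainSketch {
    /** Hypothetical stand-in for an extractor; not the Heritrix interface. */
    interface SketchExtractor {
        String name();

        /** Returns true if extraction finished, false if later extractors should try. */
        boolean innerExtract(String content);
    }

    /** Runs extractors in order and returns the names of those that were attempted. */
    static List<String> run(List<SketchExtractor> chain, String content) {
        List<String> attempted = new ArrayList<>();
        for (SketchExtractor extractor : chain) {
            attempted.add(extractor.name());
            if (extractor.innerExtract(content)) {
                break; // finished: downstream extractors are skipped
            }
        }
        return attempted;
    }

    /** Test helper: an extractor that always reports the given result. */
    static SketchExtractor fixed(String name, boolean result) {
        return new SketchExtractor() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public boolean innerExtract(String content) {
                return result;
            }
        };
    }
}
```

Under this model, a last-resort extractor placed at the end of the chain runs only when every earlier extractor has returned false.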
