ContentExtractor (Heritrix 3: 'modules' subproject (reusable components) 3.4.0-20190205 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

Direct Known Subclasses:

ExtractorCSS, ExtractorDOC, ExtractorHTML, ExtractorJS, ExtractorPDF, ExtractorSWF, ExtractorUniversal, ExtractorXML, TrapSuppressExtractor
```
public abstract class ContentExtractor
extends Extractor
```
Extracts links from the fetched content of a URI, as opposed to its headers.

Author:

pjack

Field Summary
- Fields inherited from class org.archive.modules.extractor.Extractor
  DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
- Fields inherited from class org.archive.modules.Processor
  beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description

ContentExtractor()

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `protected void` | `extract(CrawlURI uri)` Extracts links. |
| `protected abstract boolean` | `innerExtract(CrawlURI uri)` Actually extracts links. |
| `protected abstract boolean` | `shouldExtract(CrawlURI uri)` Determines if otherwise valid URIs should have links extracted or not. |
| `protected boolean` | `shouldProcess(CrawlURI uri)` Determines if links should be extracted from the given URI. |

Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ContentExtractor
```
public ContentExtractor()
```
- Method Detail
  - extract
```
protected final void extract(CrawlURI uri)
```
    Extracts links
    
    Specified by:
    
    extract in class Extractor
    
    Parameters:
    
    uri - the uri to extract links from
  - shouldProcess
```
protected final boolean shouldProcess(CrawlURI uri)
```
    Determines if links should be extracted from the given URI. This method performs four checks. It first checks if the URI was processed successfully, i.e. CrawlURI.isSuccess() returns true.
    The second check runs only if ExtractorParameters.getExtractIndependently() is false. It checks the result of CrawlURI.hasBeenLinkExtracted(). If that result is true, then this method returns false, because some other extractor has already claimed that links were extracted.
    Next, this method checks that the content length of the URI is greater than zero (in other words, that there is actually content for links to be extracted from). If the content length of the URI is zero or less, then this method returns false.
    Finally, this method delegates to shouldExtract(CrawlURI) and returns that result.
    
    Specified by:
    
    shouldProcess in class Processor
    
    Parameters:
    
    uri - the URI to check
    
    Returns:
    
    true if links should be extracted from the URI, false otherwise
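The order of the checks described for shouldProcess can be sketched as follows. This is a minimal illustration, not the Heritrix source: `FetchedUri` and `ShouldProcessSketch` are hypothetical stand-ins that model only the properties the description mentions, and the final subclass-supplied check is modeled as a `Predicate`.

```java
import java.util.function.Predicate;

/** Hypothetical stand-in for the CrawlURI properties the checks read. */
final class FetchedUri {
    final boolean success;        // models CrawlURI.isSuccess()
    final boolean linkExtracted;  // models CrawlURI.hasBeenLinkExtracted()
    final long contentLength;     // models the content length of the URI

    FetchedUri(boolean success, boolean linkExtracted, long contentLength) {
        this.success = success;
        this.linkExtracted = linkExtracted;
        this.contentLength = contentLength;
    }
}

/** Hypothetical sketch of the check sequence; names are assumptions. */
final class ShouldProcessSketch {

    static boolean shouldProcess(FetchedUri uri,
                                 boolean extractIndependently,
                                 Predicate<FetchedUri> subclassCheck) {
        // Check 1: the URI must have been processed successfully.
        if (!uri.success) {
            return false;
        }
        // Check 2: unless extractors run independently, skip URIs that
        // another extractor has already claimed.
        if (!extractIndependently && uri.linkExtracted) {
            return false;
        }
        // Check 3: there must be content to extract links from.
        if (uri.contentLength <= 0) {
            return false;
        }
        // Check 4: hand the final decision to the subclass-supplied check.
        return subclassCheck.test(uri);
    }

    public static void main(String[] args) {
        Predicate<FetchedUri> always = u -> true;
        System.out.println(shouldProcess(new FetchedUri(true, false, 1024), false, always)); // true
        System.out.println(shouldProcess(new FetchedUri(true, true, 1024), false, always));  // false
        System.out.println(shouldProcess(new FetchedUri(true, true, 1024), true, always));   // true
        System.out.println(shouldProcess(new FetchedUri(true, false, 0), false, always));    // false
    }
}
```

Because the cheap checks run first, the subclass-supplied check is only consulted for successfully fetched, unclaimed URIs with non-empty content.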
  - shouldExtract
```
protected abstract boolean shouldExtract(CrawlURI uri)
```
    Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
    
    Parameters:
    
    uri - the URI to check
    
    Returns:
    
    true if links should be extracted from that URI, false otherwise
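A content-type test of the kind mentioned for ExtractorHTML might look like the sketch below. It is an assumption for illustration only: `HtmlContentTypeCheck` and `looksLikeHtml` are hypothetical names, the method takes the raw Content-Type value as a string rather than a CrawlURI, and the actual ExtractorHTML logic may differ.

```java
import java.util.Locale;

/** Hypothetical helper; not part of Heritrix. */
final class HtmlContentTypeCheck {

    /** Returns true when the MIME type part of a Content-Type value is text/html. */
    static boolean looksLikeHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mime = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        // MIME types are case-insensitive, so normalize before the comparison.
        return mime.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml("text/html; charset=UTF-8")); // true
        System.out.println(looksLikeHtml("application/pdf"));          // false
    }
}
```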
  - innerExtract
```
protected abstract boolean innerExtract(CrawlURI uri)
```
    Actually extracts links. The given URI will have passed the three checks described in shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags, among other elements.
    This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
    
    Parameters:
    
    uri - the URI whose links to extract
    
    Returns:
    
    true if link extraction finished; false if downstream extractors should attempt to extract links
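The return-value contract of innerExtract (true only on complete extraction, false on failure such as an IO error) can be illustrated with the sketch below. Everything in it is an assumption made for the example: `InnerExtractSketch` is a hypothetical class, the content is read from a plain `Reader` instead of a CrawlURI, outlinks are collected into a list, and the simple `href` pattern is far cruder than a real HTML extractor.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the innerExtract contract; not Heritrix code. */
final class InnerExtractSketch {

    // Matches double-quoted href attribute values, ignoring case.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /**
     * Collects href values into outlinks. Returns true only if the whole
     * content was read; returns false if an IO error interrupted extraction,
     * in which case outlinks may hold a partial result.
     */
    static boolean innerExtract(Reader content, List<String> outlinks) {
        try (BufferedReader in = new BufferedReader(content)) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = HREF.matcher(line);
                while (m.find()) {
                    outlinks.add(m.group(1));
                }
            }
            return true;   // extraction finished
        } catch (IOException e) {
            return false;  // let downstream extractors try
        }
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        boolean done = innerExtract(
                new StringReader("<a href=\"/about\">About</a>"), links);
        System.out.println(done + " " + links); // true [/about]
    }
}
```

Catching the IOException inside the method, rather than letting it propagate, is what allows the boolean result to signal that later extractors should still attempt the URI.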
