ExtractorPDF (Heritrix 3: 'modules' subproject (reusable components) 3.4.0-20190205 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor
      - org.archive.modules.extractor.ExtractorPDF

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorPDF
extends ContentExtractor
```
Allows the caller to process a CrawlURI representing a PDF in order to extract URIs from it.

Author:

Parker Thompson

Field Summary
- Fields inherited from class org.archive.modules.extractor.Extractor
  DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
- Fields inherited from class org.archive.modules.Processor
  beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description

ExtractorPDF()


Method Summary

Modifier and Type	Method and Description
`long`	`getMaxSizeToParse()`
`protected boolean`	`innerExtract(CrawlURI curi)` Actually extracts links.
`void`	`setMaxSizeToParse(long threshold)`
`protected boolean`	`shouldExtract(CrawlURI uri)` Determines if otherwise valid URIs should have links extracted or not.

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ExtractorPDF
```
public ExtractorPDF()
```
- Method Detail
  - getMaxSizeToParse
```
public long getMaxSizeToParse()
```
  - setMaxSizeToParse
```
public void setMaxSizeToParse(long threshold)
```
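    The page gives no description for this getter/setter pair. As a minimal sketch, assuming the threshold is a byte count above which a document is not parsed, the pattern might look like the following. `SizeLimitedParserSketch`, `withinLimit`, and the initial value are hypothetical stand-ins for illustration, not Heritrix code.

```java
// Hypothetical stand-in, NOT the Heritrix class: it mirrors only the
// getter/setter signatures documented above.
class SizeLimitedParserSketch {
    // Arbitrary placeholder value, not the Heritrix default.
    private long maxSizeToParse = 1024L * 1024;

    public long getMaxSizeToParse() {
        return maxSizeToParse;
    }

    public void setMaxSizeToParse(long threshold) {
        this.maxSizeToParse = threshold;
    }

    // Assumed semantics: content larger than the threshold is skipped.
    boolean withinLimit(long recordedSizeBytes) {
        return recordedSizeBytes <= maxSizeToParse;
    }

    public static void main(String[] args) {
        SizeLimitedParserSketch sketch = new SizeLimitedParserSketch();
        sketch.setMaxSizeToParse(5L * 1024 * 1024); // 5 MiB
        System.out.println(sketch.withinLimit(1024));            // small document
        System.out.println(sketch.withinLimit(6L * 1024 * 1024)); // over the limit
    }
}
```

    In a real crawl configuration the property would normally be set on the ExtractorPDF bean itself; whether the comparison is inclusive is an assumption of this sketch.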
  - shouldExtract
```
protected boolean shouldExtract(CrawlURI uri)
```
    Description copied from class: ContentExtractor
    
    Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
    
    Specified by:
    
    shouldExtract in class ContentExtractor
    
    Parameters:
    
    uri - the URI to check
    
    Returns:
    
    true if links should be extracted from that URI, false otherwise
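
    The description above mentions a content-type check as a typical `shouldExtract` implementation. A minimal sketch of such a check for PDF content follows, assuming a match on the `application/pdf` MIME type. `FetchedDocSketch` is a hypothetical stand-in for CrawlURI, and the exact test that ExtractorPDF performs is not stated on this page.

```java
import java.util.Locale;

// Hypothetical stand-in for CrawlURI; exposes only the content type.
class FetchedDocSketch {
    private final String contentType;

    FetchedDocSketch(String contentType) {
        this.contentType = contentType;
    }

    String getContentType() {
        return contentType;
    }
}

class PdfShouldExtractSketch {
    // Assumed check: accept only documents whose content type
    // starts with "application/pdf" (parameters such as charset allowed).
    static boolean shouldExtract(FetchedDocSketch doc) {
        String ct = doc.getContentType();
        return ct != null
                && ct.toLowerCase(Locale.ROOT).startsWith("application/pdf");
    }

    public static void main(String[] args) {
        System.out.println(shouldExtract(new FetchedDocSketch("application/pdf")));
        System.out.println(shouldExtract(new FetchedDocSketch("text/html")));
    }
}
```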
  - innerExtract
```
protected boolean innerExtract(CrawlURI curi)
```
    Description copied from class: ContentExtractor
    
    Actually extracts links. The given URI will have passed the three checks described in ContentExtractor.shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from Anchor tags and so on.
    
    This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
    
    Specified by:
    
    innerExtract in class ContentExtractor
    
    Parameters:
    
    curi - the URI whose links to extract
    
    Returns:
    
    true if link extraction finished; false if downstream extractors should attempt to extract links
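
    The return-value contract described above can be illustrated with a toy extractor chain. This is a sketch under stated assumptions: `LinkExtractorSketch` and `ExtractorChainSketch` are hypothetical, and the real Heritrix pipeline tracks completion on the CrawlURI rather than through a loop like this one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an extractor; not a Heritrix interface.
interface LinkExtractorSketch {
    String name();

    // true: extraction finished; false: let downstream extractors try.
    boolean innerExtract(String content);
}

class ExtractorChainSketch {
    // Runs extractors in order and stops after the first one that
    // returns true. Returns the names of the extractors that were tried.
    static List<String> run(List<LinkExtractorSketch> chain, String content) {
        List<String> attempted = new ArrayList<>();
        for (LinkExtractorSketch extractor : chain) {
            attempted.add(extractor.name());
            if (extractor.innerExtract(content)) {
                break; // downstream extractors are skipped
            }
        }
        return attempted;
    }

    // Helper for the demo: an extractor with a fixed outcome.
    static LinkExtractorSketch fixed(String name, boolean result) {
        return new LinkExtractorSketch() {
            public String name() { return name; }
            public boolean innerExtract(String content) { return result; }
        };
    }

    public static void main(String[] args) {
        List<LinkExtractorSketch> chain = new ArrayList<>();
        chain.add(fixed("pdf", false));     // e.g. an IO error occurred
        chain.add(fixed("fallback", true)); // completes extraction
        chain.add(fixed("never-run", true));
        System.out.println(run(chain, "dummy content"));
    }
}
```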
