ExtractorHTMLForms (Heritrix 3: 'modules' subproject (reusable components) 3.4.0-20190418 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.forms.ExtractorHTMLForms

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorHTMLForms
extends Extractor
```
Extracts extra information about FORMs in HTML, loads it into the CrawlURI (for potential later use by FormLoginProcessor), and adds a small annotation to the crawl.log. This processor must come after ExtractorHTML, because it relies on information that ExtractorHTML leaves in the CrawlURI's A_FORM_OFFSETS data key. By default (with 'extractAllForms' set to false), it saves to the CrawlURI and annotates only those forms that appear to be login forms, as judged by the test HTMLForm.seemsLoginForm(). A typical CXML configuration first declares two top-level named beans:
```
 <bean id="extractorForms" class="org.archive.modules.forms.ExtractorHTMLForms">
 </bean>
 <bean id="formFiller" class="org.archive.modules.forms.FormLoginProcessor">
 </bean>
```
Then, inside the fetch chain, after all other extractors:
```
 <bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
   <list>
    ...ALL USUAL PREPROCESSORS/FETCHERS/EXTRACTORS HERE, THEN...
    <ref bean="extractorForms"/>
    <ref bean="formFiller"/>
   </list>
  </property>
 </bean>
```
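The default filtering described in the class overview (keep only forms that look like login forms unless 'extractAllForms' is true) can be pictured with the minimal sketch below. It is not the Heritrix source: FormFilterSketch, selectForms, and the string-based login test are hypothetical stand-ins. The real check is HTMLForm.seemsLoginForm(), whose rules are not described on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Heritrix.
class FormFilterSketch {

    /**
     * Keeps every form when extractAllForms is true; otherwise keeps only
     * the forms accepted by the supplied login-form test.
     */
    static <F> List<F> selectForms(List<F> forms,
                                   boolean extractAllForms,
                                   Predicate<F> seemsLoginForm) {
        List<F> kept = new ArrayList<>();
        for (F form : forms) {
            if (extractAllForms || seemsLoginForm.test(form)) {
                kept.add(form);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Forms are modeled as plain strings to keep the sketch self-contained.
        List<String> forms = Arrays.asList(
                "<form>search</form>",
                "<form>user password</form>");
        // ASSUMPTION: a placeholder login test, not the real seemsLoginForm() logic.
        Predicate<String> placeholderLoginTest = f -> f.contains("password");

        System.out.println(selectForms(forms, false, placeholderLoginTest));
        System.out.println(selectForms(forms, true, placeholderLoginTest));
    }
}
```

With the flag left at false, only the second form passes the placeholder test; with the flag set to true, both forms are kept.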
NOTE: This processor may open a ReplayCharSequence from the CrawlURI's Recorder, without closing that ReplayCharSequence, to allow reuse by later processors in sequence. In the usual (Heritrix) case, a call after all processing to the Recorder's endReplays() method ensures timely close of any reused ReplayCharSequences. Reuse of this processor elsewhere should ensure a similar cleanup call to Recorder.endReplays() occurs.
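The cleanup obligation in the note above can be sketched as a try/finally around the processor sequence. This is a minimal, self-contained model under stated assumptions: RecorderStandIn, openReplay(), hasOpenReplay(), and runChain are hypothetical stand-ins written for this illustration, and only the method name endReplays() is taken from the note. Code that reuses this processor outside Heritrix would call the real Recorder instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of the cleanup pattern; not Heritrix code.
class ReplayCleanupSketch {

    /** Stand-in for a recorder that shares one open replay among processors. */
    static class RecorderStandIn {
        private final String content;
        private CharSequence sharedReplay; // non-null while a replay is open

        RecorderStandIn(String content) {
            this.content = content;
        }

        /** Opens the replay on first use and returns the same one afterwards. */
        CharSequence openReplay() {
            if (sharedReplay == null) {
                sharedReplay = content;
            }
            return sharedReplay;
        }

        boolean hasOpenReplay() {
            return sharedReplay != null;
        }

        /** Releases any replay left open by earlier processors. */
        void endReplays() {
            sharedReplay = null;
        }
    }

    /** Runs each processor, then always releases replays, even on failure. */
    static void runChain(RecorderStandIn recorder,
                         List<Consumer<RecorderStandIn>> processors) {
        try {
            for (Consumer<RecorderStandIn> processor : processors) {
                processor.accept(recorder);
            }
        } finally {
            recorder.endReplays();
        }
    }

    public static void main(String[] args) {
        RecorderStandIn recorder = new RecorderStandIn("<form></form>");
        List<Consumer<RecorderStandIn>> chain = new ArrayList<>();
        chain.add(r -> r.openReplay()); // opens the replay and leaves it open
        chain.add(r -> r.openReplay()); // reuses the replay already open
        runChain(recorder, chain);
        System.out.println("replay still open: " + recorder.hasOpenReplay());
    }
}
```

Placing the call in a finally block is a design choice of this sketch: it keeps the cleanup from being skipped when a processor throws.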
Author:

gojomo

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`A_HTML_FORM_OBJECTS`

Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted

Fields inherited from class org.archive.modules.Processor
beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description
`ExtractorHTMLForms()`

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`protected void`	`analyze(CrawlURI curi, CharSequence cs)` Run analysis: find form METHOD, ACTION, and all INPUT names/values. Log as configured.
`void`	`extract(CrawlURI curi)` Extracts links from the given URI.
`protected String`	`findAttributeValueGroup(String pattern, int groupNumber, CharSequence cs)`
`protected List<CharSequence>`	`findGroups(String pattern, int groupNumber, CharSequence cs)`
`boolean`	`getExtractAllForms()`
`void`	`setExtractAllForms(boolean extractAllForms)`
`protected boolean`	`shouldProcess(CrawlURI uri)` Determines whether the given uri should be processed by this processor.

Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - A_HTML_FORM_OBJECTS
```
public static final String A_HTML_FORM_OBJECTS
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - ExtractorHTMLForms
```
public ExtractorHTMLForms()
```
- Method Detail
  - getExtractAllForms
```
public boolean getExtractAllForms()
```
  - setExtractAllForms
```
public void setExtractAllForms(boolean extractAllForms)
```
  - shouldProcess
```
protected boolean shouldProcess(CrawlURI uri)
```
    Description copied from class: Processor
    
    Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
    
    Specified by:
    
    shouldProcess in class Processor
    
    Parameters:
    
    uri - the URI to test
    
    Returns:
    
    true if this processor should process that uri; false if not
  - extract
```
public void extract(CrawlURI curi)
```
    Description copied from class: Extractor
    
    Extracts links from the given URI. Subclasses should use CrawlURI.getRecorder() to process the content of the URI. Any links that are discovered should be added to the CrawlURI.getOutLinks() set.
    
    Specified by:
    
    extract in class Extractor
    
    Parameters:
    
    curi - the uri to extract links from
  - analyze
```
protected void analyze(CrawlURI curi,
                       CharSequence cs)
```
    Run analysis: find form METHOD, ACTION, and all INPUT names/values. Log as configured.
    
    Parameters:
    
    curi - CrawlURI we're processing.
    
    cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
  - findGroups
```
protected List<CharSequence> findGroups(String pattern,
                                        int groupNumber,
                                        CharSequence cs)
```
  - findAttributeValueGroup
```
protected String findAttributeValueGroup(String pattern,
                                         int groupNumber,
                                         CharSequence cs)
```
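This page gives no description for findGroups or findAttributeValueGroup, so the sketch below shows only one plausible reading of their signatures, built on java.util.regex. It assumes that findGroups collects the requested capture group from every match and that findAttributeValueGroup returns the first match with one pair of surrounding quotes removed. The class GroupFinderSketch, the attr helper, and the regular expressions are hypothetical and are not taken from the Heritrix source. The main method strings the helpers together in the way analyze is described: find the form METHOD, ACTION, and INPUT names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in; the behavior of the real methods is an assumption.
class GroupFinderSketch {

    /** Collects capture group groupNumber from every match of pattern in cs. */
    static List<CharSequence> findGroups(String pattern, int groupNumber, CharSequence cs) {
        List<CharSequence> groups = new ArrayList<>();
        Matcher m = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(cs);
        while (m.find()) {
            String group = m.group(groupNumber);
            if (group != null) {
                groups.add(group);
            }
        }
        return groups;
    }

    /** Returns the first matching group with one pair of quotes stripped, or null. */
    static String findAttributeValueGroup(String pattern, int groupNumber, CharSequence cs) {
        Matcher m = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(cs);
        if (!m.find()) {
            return null;
        }
        String value = m.group(groupNumber);
        if (value != null && value.length() >= 2) {
            char first = value.charAt(0);
            char last = value.charAt(value.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                value = value.substring(1, value.length() - 1);
            }
        }
        return value;
    }

    /** Hypothetical helper: a regex whose group 1 is the value of the named attribute. */
    static String attr(String name) {
        return "\\s" + name + "\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)";
    }

    public static void main(String[] args) {
        String form = "<form method=\"POST\" action='/login'>"
                + "<input type=\"text\" name=\"user\">"
                + "<input type=\"password\" name=\"pass\"></form>";

        System.out.println("method: " + findAttributeValueGroup(attr("method"), 1, form));
        System.out.println("action: " + findAttributeValueGroup(attr("action"), 1, form));
        // Group 0 is the whole match, here each complete <input ...> tag.
        for (CharSequence input : findGroups("<input\\s[^>]*>", 0, form)) {
            System.out.println("input name: " + findAttributeValueGroup(attr("name"), 1, input));
        }
    }
}
```

Regular expressions are a pragmatic fit for loosely structured HTML, but this sketch makes no attempt to handle comments, scripts, or malformed tags.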
