public class ExtractorHTMLForms extends Extractor
<bean id="extractorForms" class="org.archive.modules.forms.ExtractorHTMLForms">
  <!-- <property name="extractAllForms" value="false" /> -->
</bean>
<bean id="formFiller" class="org.archive.modules.forms.FormLoginProcessor">
  <!-- generally these are overlaid with sheets rather than set directly -->
  <!-- <property name="applicableSurtPrefix" value="" /> -->
  <!-- <property name="loginUsername" value="" /> -->
  <!-- <property name="loginPassword" value="" /> -->
</bean>
Then, inside the fetch chain, after all other extractors:
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      ...ALL USUAL PREPROCESSORS/FETCHERS/EXTRACTORS HERE, THEN...
      <ref bean="extractorForms"/>
      <ref bean="formFiller"/>
    </list>
  </property>
</bean>
NOTE: This processor may open a ReplayCharSequence from the
CrawlURI's Recorder, without closing that ReplayCharSequence, to allow
reuse by later processors in sequence. In the usual (Heritrix) case, a
call after all processing to the Recorder's endReplays() method ensures
timely close of any reused ReplayCharSequences. Reuse of this processor
elsewhere should ensure a similar cleanup call to Recorder.endReplays()
occurs.

Modifier and Type | Field and Description |
---|---|
static String | A_HTML_FORM_OBJECTS |

Fields inherited from class Extractor: DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
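
The cleanup requirement in the note above on ReplayCharSequence reuse can be illustrated with a small, self-contained sketch. It does not use the Heritrix classes: `StubReplay` and `StubRecorder` are hypothetical stand-ins for ReplayCharSequence and Recorder, and `runChain` is an assumed driver, not part of the library. The only point it makes is that the caller, not the individual processor, issues the final `endReplays()`-style call, and does so in a `finally` block so it also runs when a processor fails.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class ReplayCleanupSketch {

    /** Hypothetical stand-in for a ReplayCharSequence: a closeable view of recorded content. */
    public static class StubReplay implements Closeable {
        private boolean open = true;

        public boolean isOpen() {
            return open;
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** Hypothetical stand-in for a Recorder that hands out one shared, reusable replay. */
    public static class StubRecorder {
        private StubReplay shared;

        /** Opens the replay on first use and returns the same instance afterwards. */
        public StubReplay getReplay() {
            if (shared == null) {
                shared = new StubReplay();
            }
            return shared;
        }

        /** Plays the role of Recorder.endReplays(): closes whatever was left open. */
        public void endReplays() {
            if (shared != null) {
                shared.close();
                shared = null;
            }
        }
    }

    /** Runs each processor in order; the shared replay is closed even if one of them throws. */
    public static void runChain(StubRecorder recorder, List<Consumer<StubRecorder>> processors) {
        try {
            for (Consumer<StubRecorder> processor : processors) {
                processor.accept(recorder);
            }
        } finally {
            recorder.endReplays();
        }
    }

    public static void main(String[] args) {
        StubRecorder recorder = new StubRecorder();
        StubReplay[] seen = new StubReplay[1];
        List<Consumer<StubRecorder>> chain = Arrays.asList(r -> seen[0] = r.getReplay());
        runChain(recorder, chain);
        System.out.println("replay open after chain: " + seen[0].isOpen());
    }
}
```

When embedding the real processor outside Heritrix, the equivalent of `runChain` would be whatever code drives the processors for one CrawlURI.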
Constructor and Description |
---|
ExtractorHTMLForms() |
Modifier and Type | Method and Description |
---|---|
protected void | analyze(CrawlURI curi, CharSequence cs) Run analysis: find form METHOD, ACTION, and all INPUT names/values. Log as configured. |
void | extract(CrawlURI curi) Extracts links from the given URI. |
protected String | findAttributeValueGroup(String pattern, int groupNumber, CharSequence cs) |
protected List<CharSequence> | findGroups(String pattern, int groupNumber, CharSequence cs) |
boolean | getExtractAllForms() |
void | setExtractAllForms(boolean extractAllForms) |
protected boolean | shouldProcess(CrawlURI uri) Determines whether the given uri should be processed by this processor. |

Methods inherited from class Extractor: add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

public static final String A_HTML_FORM_OBJECTS

public boolean getExtractAllForms()

public void setExtractAllForms(boolean extractAllForms)

protected boolean shouldProcess(CrawlURI uri)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
Specified by: shouldProcess in class Processor
Parameters:
uri - the URI to test

public void extract(CrawlURI curi)
Description copied from class: Extractor
Extracts links from the given URI. Subclasses should use CrawlURI.getRecorder() to process the content of the URI. Any links that are discovered should be added to the CrawlURI.getOutLinks() set.

protected void analyze(CrawlURI curi, CharSequence cs)
Run analysis: find form METHOD, ACTION, and all INPUT names/values. Log as configured.
Parameters:
curi - CrawlURI we're processing.
cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.

protected List<CharSequence> findGroups(String pattern, int groupNumber, CharSequence cs)

protected String findAttributeValueGroup(String pattern, int groupNumber, CharSequence cs)
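
The two regex helpers above are undocumented, so the following is only a sketch of one plausible reading of their signatures: apply a pattern to a character sequence and return the requested capture group, either for every match or for the first one. The class name, the helper names `collectGroups` and `firstGroup`, and all four patterns are hypothetical and chosen for illustration; they are not taken from the Heritrix source, and the real patterns are likely more tolerant of messy HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormRegexSketch {

    // Illustrative patterns only (assumptions, not the library's own).
    public static final String FORM = "<form\\b[^>]*>.*?</form>";
    public static final String METHOD = "\\bmethod\\s*=\\s*[\"']?([^\"'\\s>]+)";
    public static final String ACTION = "\\baction\\s*=\\s*[\"']?([^\"'\\s>]+)";
    public static final String INPUT_NAME = "<input\\b[^>]*\\bname\\s*=\\s*[\"']?([^\"'\\s>]+)";

    /** Returns the requested capture group of every match of pattern in cs. */
    public static List<CharSequence> collectGroups(String pattern, int groupNumber, CharSequence cs) {
        List<CharSequence> found = new ArrayList<>();
        Matcher m = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(cs);
        while (m.find()) {
            String group = m.group(groupNumber);
            if (group != null) { // a group that did not participate in the match is skipped
                found.add(group);
            }
        }
        return found;
    }

    /** Returns the requested capture group of the first match, or null when nothing matches. */
    public static String firstGroup(String pattern, int groupNumber, CharSequence cs) {
        List<CharSequence> groups = collectGroups(pattern, groupNumber, cs);
        return groups.isEmpty() ? null : groups.get(0).toString();
    }

    public static void main(String[] args) {
        String html = "<form method=\"POST\" action=\"/login\">"
                + "<input type=\"text\" name=\"user\">"
                + "<input type=\"password\" name=\"pass\">"
                + "</form>";
        // Mirror the documented analysis order: each form, then METHOD, ACTION, INPUT names.
        for (CharSequence form : collectGroups(FORM, 0, html)) {
            System.out.println("method: " + firstGroup(METHOD, 1, form));
            System.out.println("action: " + firstGroup(ACTION, 1, form));
            System.out.println("inputs: " + collectGroups(INPUT_NAME, 1, form));
        }
    }
}
```

Because the `cs` argument to analyze is documented as transient, a caller that wants to keep any extracted value should copy it (for example with `toString()`) before the underlying replay is closed.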
Copyright © 2003–2019 Internet Archive. All rights reserved.