Package | Description |
---|---|
org.archive.modules |
The beginnings of a refactored settings framework.
|
org.archive.modules.deciderules | |
org.archive.modules.deciderules.recrawl | |
org.archive.modules.deciderules.surt | |
org.archive.modules.fetcher | |
Modifier and Type | Method and Description |
---|---|
DecideRule |
Processor.getShouldProcessRule() |
Modifier and Type | Method and Description |
---|---|
void |
Processor.setShouldProcessRule(DecideRule rule)
Decide rule(s) (also particular to a URI) that determine whether
or not a particular URI is processed here.
|
Modifier and Type | Class and Description |
---|---|
class |
AcceptDecideRule |
class |
AddRedirectFromRootServerToScope |
class |
ContentLengthDecideRule |
class |
ContentTypeMatchesRegexDecideRule
DecideRule whose decision is applied if the URI's content-type
is present and matches the supplied regular expression.
|
class |
ContentTypeNotMatchesRegexDecideRule
DecideRule whose decision is applied if the URI's content-type
is present and does not match the supplied regular expression.
|
class |
DecideRuleSequence |
class |
ExternalGeoLocationDecideRule
A rule that can be configured to take alternate implementations
of the ExternalGeoLocationInterface.
|
class |
FetchStatusDecideRule
Rule applies the configured decision for any URI which has a
fetch status equal to the 'target-status' setting.
|
class |
FetchStatusMatchesRegexDecideRule |
class |
FetchStatusNotMatchesRegexDecideRule |
class |
HasViaDecideRule
Rule applies the configured decision for any URI which has a 'via'
(essentially, any URI other than a seed or certain kinds of mid-crawl adds).
|
class |
HopCrossesAssignmentLevelDomainDecideRule
Applies its decision if the current URI differs in that portion of
its hostname/domain that is assigned/sold by registrars, its
'assignment-level-domain' (ALD), also known as the 'public suffix' or, in previous
Heritrix versions, the 'topmost assigned SURT'.
|
class |
HopsPathMatchesRegexDecideRule
Rule applies configured decision to any CrawlURIs whose 'hops-path'
(string like "LLXE" etc.) matches the supplied regex.
|
class |
IpAddressSetDecideRule
IpAddressSetDecideRule must be used with
org.archive.crawler.prefetch.Preselector#setRecheckScope(boolean) set
to true because it relies on Heritrix's DNS lookup to establish the IP address
for a URI before it can run.
|
class |
MatchesFilePatternDecideRule
Compares suffix of a passed CrawlURI, UURI, or String against a regular
expression pattern, applying its configured decision to all matches.
|
class |
MatchesListRegexDecideRule
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regexes.
|
class |
MatchesRegexDecideRule
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regex.
|
class |
MatchesStatusCodeDecideRule
Provides a rule that returns "true" for any CrawlURIs which have a fetch
status code that falls within the provided inclusive range.
|
class |
NotMatchesFilePatternDecideRule
Rule applies configured decision to any URIs which do *not*
match the supplied (file-pattern) regex.
|
class |
NotMatchesListRegexDecideRule
Rule applies configured decision to any URIs which do *not*
match the supplied regexes.
|
class |
NotMatchesRegexDecideRule
Rule applies configured decision to any URIs which do *not*
match the supplied regex.
|
class |
NotMatchesStatusCodeDecideRule
Provides a rule that returns "true" for any CrawlURIs which have a fetch
status code that does not fall within the provided inclusive range.
|
class |
PathologicalPathDecideRule
Rule REJECTs any URI which contains an excessive number of identical,
consecutive path-segments (e.g. http://example.com/a/a/a/boo.html has 3 '/a'
segments).
|
class |
PredicatedDecideRule
Rule which applies the configured decision only if a
test evaluates to true.
|
class |
PrerequisiteAcceptDecideRule
Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in
the last hopsPath position).
|
class |
RejectDecideRule |
class |
ResourceLongerThanDecideRule
Applies configured decision for URIs with content length greater than
a given threshold length value.
|
class |
ResourceNoLongerThanDecideRule
Applies configured decision for URIs with content length less than or equal
to a given threshold length value.
|
class |
ResponseContentLengthDecideRule
Decide rule that will ACCEPT or REJECT a uri, depending on the
"decision" property, after it's fetched, if the content body is within a
specified size range, specified in bytes.
|
class |
SchemeNotInSetDecideRule
Rule applies the configured decision (default REJECT) for any URI which
has a URI-scheme NOT contained in the configured Set.
|
class |
ScriptedDecideRule
Rule which runs a JSR-223 script to make its decision.
|
class |
SeedAcceptDecideRule
Rule which ACCEPTs all 'seed' URIs (those for which
isSeed is true).
|
class |
SourceSeedDecideRule
Rule applies the configured decision for any URI discovered from one of
the seeds in sourceSeeds. |
class |
TooManyHopsDecideRule
Rule REJECTs any CrawlURIs whose total number of hops (length of the
hopsPath string, traversed links of any type) is over a threshold.
|
class |
TooManyPathSegmentsDecideRule
Rule REJECTs any CrawlURIs whose total number of path-segments (as
indicated by the count of '/' characters not including the first '//')
is over a given threshold.
|
class |
TransclusionDecideRule
Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath'; see
CrawlURI.getPathFromSeed()) ends
with at least one, but not more than the given number of,
non-navlink (non-'L') hops. |
class |
ViaSurtPrefixedDecideRule
Rule applies the configured decision for any URI which has a 'via' whose
SURT form matches any SURT specified in the surtPrefixes list.
|
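
Several of the descriptions above reduce to simple string or integer tests. The following is a minimal stand-alone sketch of a few of them (path-segment counting, repeated path segments, hop counting, the prerequisite check, and the inclusive status range). It does not use Heritrix classes: `RulePredicateSketch` and all of its method names are hypothetical, and the regular expression is an illustration rather than the pattern Heritrix itself uses.

```java
import java.util.regex.Pattern;

/** Stand-alone predicates that mirror a few rule descriptions; not Heritrix code. */
final class RulePredicateSketch {

    /** Counts '/' characters, not including the "//" that follows the scheme. */
    static int pathSegmentCount(String uri) {
        int schemeSlashes = uri.indexOf("//");
        String rest = schemeSlashes >= 0 ? uri.substring(schemeSlashes + 2) : uri;
        int count = 0;
        for (int i = 0; i < rest.length(); i++) {
            if (rest.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    /** True when one path segment occurs at least 'limit' times in a row (limit >= 2). */
    static boolean hasRepeatedSegment(String uri, int limit) {
        // Capture one "/segment", then require limit-1 further copies
        // that end on a segment boundary ('/' or end of input).
        Pattern p = Pattern.compile("(/[^/]+)\\1{" + (limit - 1) + ",}(?=/|$)");
        return p.matcher(uri).find();
    }

    /** True when the hops path (one character per hop) is longer than maxHops. */
    static boolean tooManyHops(String hopsPath, int maxHops) {
        return hopsPath.length() > maxHops;
    }

    /** True when the last hop is a prerequisite hop, marked 'P'. */
    static boolean isPrerequisite(String hopsPath) {
        return hopsPath.endsWith("P");
    }

    /** True when status lies in the inclusive range [lower, upper]. */
    static boolean statusInRange(int status, int lower, int upper) {
        return status >= lower && status <= upper;
    }
}
```

For the example URI from the PathologicalPathDecideRule description, `hasRepeatedSegment("http://example.com/a/a/a/boo.html", 3)` is true, while the same call on `http://example.com/a/a/boo.html` is false.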
Modifier and Type | Method and Description |
---|---|
List<DecideRule> |
DecideRuleSequence.getRules() |
Modifier and Type | Method and Description |
---|---|
protected void |
DecideRuleSequence.decisionMade(CrawlURI uri,
DecideRule decisiveRule,
int decisiveRuleNumber,
DecideResult result) |
Modifier and Type | Method and Description |
---|---|
void |
DecideRuleSequence.setRules(List<DecideRule> rules) |
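
This page does not say how a DecideRuleSequence combines its rules. The sketch below models one plausible reading, which is an assumption here: every rule is consulted in order, a rule may return NONE to pass, and the last non-NONE decision wins. The index of that rule plays the role of `decisiveRuleNumber` in `decisionMade(...)`. `RuleSequenceSketch`, `Rule`, `Decision`, and `Outcome` are hypothetical stand-ins, not the Heritrix types.

```java
import java.util.List;

/** Toy model of evaluating an ordered rule list; the types are stand-ins, not Heritrix classes. */
final class RuleSequenceSketch {

    enum Decision { ACCEPT, REJECT, NONE }

    /** Hypothetical rule shape: look at a URI string and return a decision. */
    interface Rule {
        Decision decisionFor(String uri);
    }

    /** The final decision plus the index of the rule that produced it (-1 if none did). */
    static final class Outcome {
        final Decision decision;
        final int decisiveRuleNumber;

        Outcome(Decision decision, int decisiveRuleNumber) {
            this.decision = decision;
            this.decisiveRuleNumber = decisiveRuleNumber;
        }
    }

    /** Assumed semantics: every rule runs in order and the last non-NONE decision wins. */
    static Outcome evaluate(List<Rule> rules, String uri) {
        Decision result = Decision.NONE;
        int decisive = -1;
        for (int i = 0; i < rules.size(); i++) {
            Decision d = rules.get(i).decisionFor(uri);
            if (d != Decision.NONE) {
                result = d;
                decisive = i;
            }
        }
        return new Outcome(result, decisive);
    }
}
```

Under these assumed semantics, a common arrangement is to start with a blanket reject and let later rules accept or re-reject specific URIs, so the order of the list passed to `setRules` matters.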
Modifier and Type | Class and Description |
---|---|
class |
IdenticalDigestDecideRule
Rule applies configured decision to any CrawlURIs whose revisit profile matches
WARCConstants.PROFILE_REVISIT_IDENTICAL_DIGEST. |
Modifier and Type | Class and Description |
---|---|
class |
NotOnDomainsDecideRule
Rule applies configured decision to any URIs that are
*not* in one of the domains in the configured set of
domains, filled from the seed set.
|
class |
NotOnHostsDecideRule
Rule applies configured decision to any URIs that
are *not* on one of the hosts in the configured set of
hosts, filled from the seed set.
|
class |
NotSurtPrefixedDecideRule
Rule applies configured decision to any URIs that, when
expressed in SURT form, do *not* begin with one of the prefixes
in the configured set.
|
class |
OnDomainsDecideRule
Rule applies configured decision to any URIs that
are on one of the domains in the configured set of
domains, filled from the seed set.
|
class |
OnHostsDecideRule
Rule applies configured decision to any URIs that
are on one of the hosts in the configured set of
hosts, filled from the seed set.
|
class |
SurtPrefixedDecideRule
Rule applies configured decision to any URIs that, when
expressed in SURT form, begin with one of the prefixes
in the configured set.
|
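
The SURT-based rules above test whether a URI, rewritten in SURT form, begins with a configured prefix. The sketch below illustrates the idea with a deliberately simplified conversion: it reverses the host labels and keeps the scheme and path, but ignores ports, user info, query strings, and the canonicalization a real crawler applies. `SurtPrefixSketch` and its methods are hypothetical and are not the Heritrix implementation.

```java
import java.net.URI;
import java.util.Collection;

/** Simplified SURT-prefix matching; an illustration only, not Heritrix code. */
final class SurtPrefixSketch {

    /** Rewrites e.g. http://www.archive.org/web/ as http://(org,archive,www,)/web/ */
    static String toSurt(String url) {
        URI u = URI.create(url);
        // Assumes an absolute URL with a host component.
        String[] labels = u.getHost().toLowerCase().split("\\.");
        StringBuilder sb = new StringBuilder(u.getScheme()).append("://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        sb.append(')');
        String path = u.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        return sb.toString();
    }

    /** True when the URL's simplified SURT form begins with any configured prefix. */
    static boolean surtPrefixed(String url, Collection<String> prefixes) {
        String surt = toSurt(url);
        for (String prefix : prefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the host labels are reversed, a prefix such as `http://(org,archive,` covers every host under archive.org, which is why prefix matching on SURT form is convenient for scoping.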
Modifier and Type | Method and Description |
---|---|
DecideRule |
FetchHTTP.getShouldFetchBodyRule() |
Modifier and Type | Method and Description |
---|---|
void |
FetchHTTP.setShouldFetchBodyRule(DecideRule rule)
DecideRules applied after receipt of HTTP response headers but before we
start to download the body.
|
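
As an illustration of a check made between receiving headers and downloading the body, the sketch below applies a content-type test in the spirit of ContentTypeMatchesRegexDecideRule to decide whether the body is worth fetching. It assumes that only a REJECT decision suppresses the download; that assumption, the class `FetchBodyGateSketch`, and its method names are hypothetical and are not taken from FetchHTTP.

```java
import java.util.regex.Pattern;

/** Toy model of a header-time check; the types are stand-ins, not Heritrix classes. */
final class FetchBodyGateSketch {

    enum Decision { ACCEPT, REJECT, NONE }

    /** Applies 'configured' when the content type is present and matches the regex, else NONE. */
    static Decision contentTypeRule(String contentType, Pattern regex, Decision configured) {
        if (contentType != null && regex.matcher(contentType).matches()) {
            return configured;
        }
        return Decision.NONE;
    }

    /** Assumed policy: download the body unless the rule explicitly REJECTs. */
    static boolean shouldFetchBody(Decision decision) {
        return decision != Decision.REJECT;
    }
}
```

With a pattern such as `(?i)video/.*` and a configured decision of REJECT, a `video/mp4` response would be cut off after the headers, while `text/html` or a missing content type would proceed to the body.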
Copyright © 2003–2022 Internet Archive. All rights reserved.