
A

A_ANNOTATIONS - Static variable in interface org.archive.modules.CoreAttributeConstants
shorthand string tokens indicating notable occurrences, separated by commas
A_CONTENT_DIGEST - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
content digest
A_CONTENT_DIGEST_COUNT - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
number of times we've seen this content digest (1 original + n duplicates)
A_CONTENT_DIGEST_HISTORY - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
content digest history map
A_CONTENT_TYPE - Static variable in interface org.archive.modules.CoreAttributeConstants
Extracted MIME type of fetched content; should be set immediately by fetching module if possible (rather than waiting for a later analyzer)
A_CREDENTIALS_KEY - Static variable in interface org.archive.modules.CoreAttributeConstants
Key to get credential avatars from A_LIST.
A_DELAY_FACTOR - Static variable in interface org.archive.modules.CoreAttributeConstants
Multiplier of last fetch duration to wait before fetching another item of the same class (e.g. host)
A_DISTANCE_FROM_SEED - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_DNS_FETCH_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_ETAG_HEADER - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
header name (and AList key) for ETag
A_FETCH_BEGAN_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_FETCH_COMPLETED_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_FETCH_HISTORY - Static variable in class org.archive.modules.CrawlURI
fetch history array
A_FETCH_HISTORY - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
A_FORCE_RETIRE - Static variable in interface org.archive.modules.CoreAttributeConstants
flag indicating the containing queue should be retired
A_FORM_OFFSETS - Static variable in class org.archive.modules.extractor.ExtractorHTML
 
A_FTP_CONTROL_CONVERSATION - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_FTP_FETCH_STATUS - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_HERITABLE_KEYS - Static variable in interface org.archive.modules.CoreAttributeConstants
Key to (optional) attribute specifying a list of keys that are passed to CandidateURIs that 'descend' (are discovered via) this URI.
A_HREF - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
A_HTML_BASE - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_HTML_FORM_OBJECTS - Static variable in class org.archive.modules.forms.ExtractorHTMLForms
 
A_HTTP_AUTH_CHALLENGES - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_HTTP_PROXY_HOST - Static variable in interface org.archive.modules.CoreAttributeConstants
local override of proxy host
A_HTTP_PROXY_PORT - Static variable in interface org.archive.modules.CoreAttributeConstants
local override of proxy port
A_HTTP_RESPONSE_HEADERS - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_LAST_MODIFIED_HEADER - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
header name (and AList key) for last-modified timestamp
A_META_ROBOTS - Static variable in class org.archive.modules.extractor.ExtractorHTML
 
A_MINIMUM_DELAY - Static variable in interface org.archive.modules.CoreAttributeConstants
Minimum delay before fetching another item of the same class (e.g. host).
A_MIRROR_PATH - Static variable in interface org.archive.modules.CoreAttributeConstants
Define for org.archive.crawler.writer.MirrorWriterProcessor.
A_MIRROR_PATH - Static variable in class org.archive.modules.writer.MirrorWriterProcessor
 
A_NONFATAL_ERRORS - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_ORIGINAL_DATE - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
date content payload was written
A_ORIGINAL_URL - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
url that the content payload was written for
A_PRECALC_PRECEDENCE - Static variable in interface org.archive.modules.CoreAttributeConstants
key to attribute containing pre-calculated precedence
A_PREREQUISITE_URI - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_REFERENCE_LENGTH - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
reference length (content length or virtual length)
A_RETRY_DELAY - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_RRECORD_SET_LABEL - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_RUNTIME_EXCEPTION - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_SERVER_IP - Static variable in interface org.archive.modules.CoreAttributeConstants
IP address of the server the resource was fetched from.
A_SOURCE_TAG - Static variable in interface org.archive.modules.CoreAttributeConstants
a 'source' (usu.
A_STATUS - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
key for status (when in history)
A_SUBMIT_DATA - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_SUBMIT_ENCTYPE - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_VIA_DIGEST - Static variable in class org.archive.modules.extractor.TrapSuppressExtractor
AList attribute key for carrying-forward content-digest from 'via'
A_WARC_FILE_OFFSET - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
offset into warc file of warc record with content payload
A_WARC_FILENAME - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
warc filename containing the content payload
A_WARC_RECORD_ID - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
warc record id of warc record with the content payload
A_WARC_RESPONSE_HEADERS - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_WARC_STATS - Static variable in interface org.archive.modules.CoreAttributeConstants
 
A_WRITE_TAG - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
Writer processors of all types are encouraged to put a 'writeTag' (analogous to HTTP 'etag') in the CrawlURI state.
aboutToLog() - Method in class org.archive.modules.CrawlURI
Notify CrawlURI it is about to be logged; opportunity for self-annotation
ABS_HTTP_URI_PATTERN - Static variable in class org.archive.modules.extractor.ExtractorURI
 
AbstractContentDigestHistory - Class in org.archive.modules.recrawl
Represents a store of information, presumably persistent, keyed by content digest.
AbstractContentDigestHistory() - Constructor for class org.archive.modules.recrawl.AbstractContentDigestHistory
 
AbstractCookieStore - Class in org.archive.modules.fetcher
 
AbstractCookieStore() - Constructor for class org.archive.modules.fetcher.AbstractCookieStore
 
AbstractCookieStore.LimitedCookieStoreFacade - Class in org.archive.modules.fetcher
 
AbstractPersistProcessor - Class in org.archive.modules.recrawl
 
AbstractPersistProcessor() - Constructor for class org.archive.modules.recrawl.AbstractPersistProcessor
 
AbstractProfile - Class in org.archive.modules.revisit
 
AbstractProfile() - Constructor for class org.archive.modules.revisit.AbstractProfile
 
AcceptDecideRule - Class in org.archive.modules.deciderules
 
AcceptDecideRule() - Constructor for class org.archive.modules.deciderules.AcceptDecideRule
 
accepts(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
 
accumulate(CrawlURI) - Method in class org.archive.crawler.util.CrawledBytesHistotable
 
action - Variable in class org.archive.modules.forms.HTMLForm
 
actions - Variable in class org.archive.modules.extractor.CustomSWFTags
 
actOn(File) - Method in class org.archive.modules.seeds.SeedModule
 
actOn(File) - Method in class org.archive.modules.seeds.TextSeedModule
Treat the given file as a source of additional seeds, announcing to SeedListeners.
add(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
 
add(T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
add(int, T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
addAll(Collection<? extends T>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
addAll(int, Collection<? extends T>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
addAllow(String) - Method in class org.archive.modules.net.RobotsDirectives
 
addAnnotations(CrawlURI, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
 
addContentLocationHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
 
addCookie(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
addCookie(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
 
addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
addCredential(Credential) - Method in class org.archive.modules.net.CrawlServer
Add an avatar.
addDisallow(String) - Method in class org.archive.modules.net.RobotsDirectives
 
addedCredentials - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
addedSeed(CrawlURI) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
If appropriate, convert seed notification into prefix-addition.
addedSeed(CrawlURI) - Method in interface org.archive.modules.seeds.SeedListener
 
addExtraInfo(String, Object) - Method in class org.archive.modules.CrawlURI
 
addField(String, String, String, boolean) - Method in class org.archive.modules.forms.HTMLForm
Add a discovered INPUT, tracking it as potential username/password receiver.
addField(String, String, String) - Method in class org.archive.modules.forms.HTMLForm
Add a discovered INPUT, tracking it as potential username/password receiver.
addHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
 
addHeaderLink(CrawlURI, String, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
 
addIfNotBlank(ANVLRecord, String, String) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
addLinkFromString(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
 
addOutlink(CrawlURI, String, LinkContext, Hop) - Method in class org.archive.modules.extractor.Extractor
Create and add a 'Link' to the CrawlURI with given URI/context/hop-type
addOutlink(CrawlURI, UURI, LinkContext, Hop) - Method in class org.archive.modules.extractor.Extractor
 
AddRedirectFromRootServerToScope - Class in org.archive.modules.deciderules
 
AddRedirectFromRootServerToScope() - Constructor for class org.archive.modules.deciderules.AddRedirectFromRootServerToScope
 
addRefreshHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
 
addRelativeToBase(CrawlURI, int, CharSequence, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
Adds an outlink to uri relative to uri.getBaseURI().
addRelativeToVia(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
Adds an outlink to uri relative to uri.getVia().
addResponseContent(HttpResponse, CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
This method populates curi with response status and content type.
addSeed(CrawlURI) - Method in class org.archive.modules.seeds.SeedModule
 
addSeed(CrawlURI) - Method in class org.archive.modules.seeds.TextSeedModule
Add a new seed to scope.
addSeedListener(SeedListener) - Method in class org.archive.modules.seeds.SeedModule
 
addStats(Map<String, Map<String, Long>>) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
addTotalBytesWritten(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
addWhoisLink(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchWhois
 
addWhoisLinks(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
Adds outlinks to whois:{domain} and whois:{ipAddress}
afterPropertiesSet() - Method in class org.archive.modules.CrawlMetadata
 
afterPropertiesSet() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
afterPropertiesSet() - Method in class org.archive.modules.extractor.ExtractorHTML
 
afterPropertiesSet() - Method in class org.archive.modules.ScriptedProcessor
 
agentsToDirectives - Variable in class org.archive.modules.net.Robotstxt
 
AggressiveExtractorHTML - Class in org.archive.modules.extractor
Extended version of ExtractorHTML with more aggressive javascript link extraction, where javascript code is parsed first with the general HTML tags regex, and then by the javascript speculative link regex.
AggressiveExtractorHTML() - Constructor for class org.archive.modules.extractor.AggressiveExtractorHTML
 
allInputs - Variable in class org.archive.modules.forms.HTMLForm
 
allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.CustomRobotsPolicy
 
allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.IgnoreRobotsPolicy
 
allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.ObeyRobotsPolicy
 
allows - Variable in class org.archive.modules.net.RobotsDirectives
 
allows(String) - Method in class org.archive.modules.net.RobotsDirectives
 
allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.RobotsPolicy
 
allowsAll() - Method in class org.archive.modules.net.Robotstxt
Does this policy effectively allow everything? (No disallows or timing (crawl-delay) directives?)
analyze(CrawlURI, CharSequence) - Method in class org.archive.modules.forms.ExtractorHTMLForms
Run analysis: find form METHOD, ACTION, and all INPUT names/values. Log as configured.
ANNOTATION_IS_SITEMAP - Static variable in class org.archive.modules.extractor.ExtractorRobotsTxt
 
ANNOTATION_UNWRITTEN - Static variable in class org.archive.modules.writer.WriterPoolProcessor
CrawlURI annotation indicating no record was written.
announceSeeds() - Method in class org.archive.modules.seeds.SeedModule
 
announceSeeds() - Method in class org.archive.modules.seeds.TextSeedModule
Announce all seeds from configured source to SeedListeners (including nonseed lines mixed in).
announceSeeds(CountDownLatch) - Method in class org.archive.modules.seeds.TextSeedModule
 
announceSeedsFromReader(BufferedReader, CountDownLatch) - Method in class org.archive.modules.seeds.TextSeedModule
Announce all seeds (and nonseed possible-directive lines) from the given Reader
appCtx - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
 
appCtx - Variable in class org.archive.modules.ScriptedProcessor
 
ARCHIVE_TIME_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
ARCWriterProcessor - Class in org.archive.modules.writer
Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format.
ARCWriterProcessor() - Constructor for class org.archive.modules.writer.ARCWriterProcessor
 
asAnnotation() - Method in class org.archive.modules.forms.HTMLForm
Provide abbreviated annotation, of the form...
assertNoSideEffects(CrawlURI) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
Asserts that the given URI has no URI errors, no localized errors, and no annotations.
atProcessor(Processor) - Method in interface org.archive.modules.ProcessorChain.ChainStatusReceiver
 
attach(CrawlURI) - Method in class org.archive.modules.credential.Credential
Attach this credentials avatar to the passed curi.
ATTR_MAX_BYTES_WRITTEN - Static variable in class org.archive.modules.writer.Kw3WriterProcessor
Max size for each file. Key for the maximum ARC bytes to write attribute.
audience - Variable in class org.archive.modules.CrawlMetadata
 
AUTH_SCHEME_REGISTRY - Static variable in class org.archive.modules.fetcher.FetchHTTP
 
autoregisterTo(AutoKryo) - Static method in class org.archive.modules.CrawlURI
 
autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.CrawlHost
 
autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.CrawlServer
 
autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.RobotsDirectives
 
autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.Robotstxt
 
availableRobotsPolicies - Variable in class org.archive.modules.CrawlMetadata
Map of all available RobotsPolicies, by name, to choose from.
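
As a rough illustration of how the values stored under A_DELAY_FACTOR and A_MINIMUM_DELAY might combine, the sketch below scales the last fetch duration by the delay factor and never goes below the minimum delay. The class PolitenessDelaySketch and the method delayMs are hypothetical names chosen for this example; they are not part of org.archive.modules, and the crawler's real scheduling logic may apply further bounds.

```java
public class PolitenessDelaySketch {

    // Hypothetical helper, not a Heritrix API: wait time before fetching
    // another item of the same class (e.g. host).
    static long delayMs(long lastFetchDurationMs, double delayFactor, long minimumDelayMs) {
        // Multiplier of last fetch duration (the A_DELAY_FACTOR idea).
        long scaled = Math.round(lastFetchDurationMs * delayFactor);
        // Never wait less than the minimum delay (the A_MINIMUM_DELAY idea).
        return Math.max(scaled, minimumDelayMs);
    }

    public static void main(String[] args) {
        // A fast fetch: the minimum delay dominates.
        System.out.println(delayMs(400, 5.0, 3000));
        // A slow fetch: the scaled duration dominates.
        System.out.println(delayMs(1200, 5.0, 3000));
    }
}
```

Scaling by the previous fetch duration means a slow server is automatically given more time between requests, while the floor keeps fast servers from being revisited too quickly.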

B

BaseRule - Class in org.archive.modules.canonicalize
Base of all rules applied when canonicalizing a URL that are configurable via the Heritrix settings system.
BaseRule() - Constructor for class org.archive.modules.canonicalize.BaseRule
Constructor.
BaseWARCRecordBuilder - Class in org.archive.modules.warc
 
BaseWARCRecordBuilder() - Constructor for class org.archive.modules.warc.BaseWARCRecordBuilder
 
BaseWARCWriterProcessor - Class in org.archive.modules.writer
 
BaseWARCWriterProcessor() - Constructor for class org.archive.modules.writer.BaseWARCWriterProcessor
 
BasicExecutionAwareEntityEnclosingRequest - Class in org.archive.modules.fetcher
 
BasicExecutionAwareEntityEnclosingRequest(String, String) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
 
BasicExecutionAwareEntityEnclosingRequest(String, String, ProtocolVersion) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
 
BasicExecutionAwareEntityEnclosingRequest(RequestLine) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
 
BasicExecutionAwareRequest - Class in org.archive.modules.fetcher
 
BasicExecutionAwareRequest(String, String) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareRequest
Creates an instance of this class using the given request method and URI.
BasicExecutionAwareRequest(String, String, ProtocolVersion) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareRequest
Creates an instance of this class using the given request method, URI and the HTTP protocol version.
BasicExecutionAwareRequest(RequestLine) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareRequest
Creates an instance of this class using the given request line.
bdb - Variable in class org.archive.modules.fetcher.BdbCookieStore
 
bdb - Variable in class org.archive.modules.fetcher.FetchWhois
 
bdb - Variable in class org.archive.modules.net.BdbServerCache
 
bdb - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
 
bdb - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
 
BdbContentDigestHistory - Class in org.archive.modules.recrawl
Bdb content digest history store.
BdbContentDigestHistory() - Constructor for class org.archive.modules.recrawl.BdbContentDigestHistory
 
BdbCookieStore - Class in org.archive.modules.fetcher
Cookie store using bdb for storage.
BdbCookieStore() - Constructor for class org.archive.modules.fetcher.BdbCookieStore
 
BdbCookieStore.RestrictedCollectionWrappedList<T> - Class in org.archive.modules.fetcher
A List implementation that wraps a Collection.
BdbServerCache - Class in org.archive.modules.net
ServerCache backed by BDB big maps; the usual choice for crawls.
BdbServerCache() - Constructor for class org.archive.modules.net.BdbServerCache
 
beanName - Variable in class org.archive.modules.deciderules.DecideRuleSequence
 
beanName - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
beanName - Variable in class org.archive.modules.Processor
 
blockAwaitingSeedLines - Variable in class org.archive.modules.seeds.TextSeedModule
Number of lines of seeds-source to read on initial load before proceeding with crawl.
buildAndAddOutlink(CrawlURI, Map<String, Object>) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
buildConnectionManager() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
buildPostRequestEntity(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.DnsResponseRecordBuilder
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.FtpControlConversationRecordBuilder
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.FtpResponseRecordBuilder
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.HttpRequestRecordBuilder
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.HttpResponseRecordBuilder
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.MetadataRecordBuilder
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.RevisitRecordBuilder
 
buildRecord(CrawlURI, URI) - Method in interface org.archive.modules.warc.WARCRecordBuilder
Builds a warc record for this capture.
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.WhoisResponseRecordBuilder
 
buildSurtPrefixSet() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Construct the set of prefixes to use, from the seed list (which may include both URIs and '+'-prefixed directives).
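
The buildSurtPrefixSet() entry above mentions a seed list that mixes plain URIs with '+'-prefixed directives. The sketch below shows only that separation step, under the assumption that blank lines are skipped and the leading '+' is dropped. SeedLineSketch and partition are hypothetical names for illustration, not Heritrix classes, and the conversion of each entry to a SURT prefix is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SeedLineSketch {

    // Plain seed URIs, in input order.
    final List<String> uris = new ArrayList<>();
    // '+'-prefixed directive lines, with the leading '+' removed.
    final List<String> directives = new ArrayList<>();

    // Hypothetical helper: split seed-list lines into URIs and directives.
    static SeedLineSketch partition(List<String> lines) {
        SeedLineSketch out = new SeedLineSketch();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no meaning
            }
            if (line.startsWith("+")) {
                out.directives.add(line.substring(1));
            } else {
                out.uris.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SeedLineSketch result = partition(Arrays.asList(
                "http://example.com/",
                "",
                "+http://(org,example,"));
        System.out.println("uris=" + result.uris);
        System.out.println("directives=" + result.directives);
    }
}
```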

C

calcOutputDirs() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
CandidateChain - Class in org.archive.modules
 
CandidateChain() - Constructor for class org.archive.modules.CandidateChain
 
candidatePasswordInputs - Variable in class org.archive.modules.forms.HTMLForm
 
candidateUserAgents - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
list of user-agents to try; if any are allowed, a URI will be crawled
candidateUserAgents - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
list of user-agents to try; if any are allowed, a URI will be crawled
candidateUsernameInputs - Variable in class org.archive.modules.forms.HTMLForm
 
CanonicalizationRule - Interface in org.archive.modules.canonicalize
A rule to apply when canonicalizing a URL.
canonicalize(String) - Method in interface org.archive.modules.canonicalize.CanonicalizationRule
Apply this canonicalization rule.
canonicalize(String) - Method in class org.archive.modules.canonicalize.FixupQueryString
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.LowercaseRule
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.RegexRule
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
Run the passed uuri through the list of rules.
canonicalize(String) - Method in class org.archive.modules.canonicalize.StripExtraSlashes
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.StripSessionCFIDs
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.StripSessionIDs
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.StripUserinfoRule
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.StripWWWNRule
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.StripWWWRule
 
canonicalize(String) - Method in class org.archive.modules.canonicalize.UriCanonicalizationPolicy
 
canonicalString - Variable in class org.archive.modules.CrawlURI
 
caseSensitiveFilesystem - Variable in class org.archive.modules.writer.MirrorWriterProcessor
True if the file system is case-sensitive, like UNIX.
catalog - Variable in class org.archive.modules.extractor.PDFParser
 
characterMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
This list is grouped in pairs.
checkBytesWritten() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
checked - Variable in class org.archive.modules.forms.HTMLForm.FormInput
 
checkMidfetchAbort(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
 
chmod - Variable in class org.archive.modules.writer.Kw3WriterProcessor
Should permissions be changed for the newly created dirs.
chmodValue - Variable in class org.archive.modules.writer.Kw3WriterProcessor
What should the permissions be set to.
chooseAuthScheme(Map<String, String>, String) - Method in class org.archive.modules.fetcher.FetchHTTP
 
cleanup(CrawlURI, Exception, String, int) - Method in class org.archive.modules.fetcher.FetchHTTP
Cleanup after a failed method execute.
clear() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
clear() - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
 
clear() - Method in class org.archive.modules.fetcher.BdbCookieStore
 
clear() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
clear() - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
clearExpired(Date) - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
 
clearExpired(Date) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
clearExpired(Date) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
clearPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
Clear prerequisite, if any.
close() - Method in class org.archive.modules.fetcher.DefaultServerCache
Called when shutting down the cache so we can do clean up.
close() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
 
collection - Variable in class org.archive.modules.writer.Kw3WriterProcessor
Name of collection.
COLLECTION_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
comment - Variable in class org.archive.modules.deciderules.DecideRule
 
compareTo(CrawlURI) - Method in class org.archive.modules.CrawlURI
 
compress - Variable in class org.archive.modules.writer.WriterPoolProcessor
Whether to gzip-compress files when writing to disk; by default true, meaning do-compress.
concludedSeedBatch() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
concludedSeedBatch() - Method in interface org.archive.modules.seeds.SeedListener
 
configureHttpClientBuilder() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
configureRequest() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
configureRequestHeaders() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
connectTimeoutMs - Variable in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
connMan - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
consecutiveConnectionErrors - Variable in class org.archive.modules.net.CrawlServer
 
considerIfLikelyUri(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
Consider whether a given string is URI-like.
considerQueryStringValues(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
Consider a query-string-like collection of key=value[&key=value] pairs for URI-like strings in the values.
considerString(Extractor, CrawlURI, boolean, String) - Method in class org.archive.modules.extractor.ExtractorJS
 
considerStringAsUri(String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
 
considerStrings(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorJS
 
considerStrings(Extractor, CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorJS
 
considerStrings(Extractor, CrawlURI, CharSequence, boolean) - Method in class org.archive.modules.extractor.ExtractorJS
 
constructRegex(int) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
 
contains(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
containsAll(Collection<?>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
containsContentTypeCharsetDeclaration() - Method in class org.archive.modules.CrawlURI
 
containsDataKey(String) - Method in class org.archive.modules.CrawlURI
 
containsHost(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
 
containsServer(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
 
CONTENT_LENGTH_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
CONTENT_MD5_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
CONTENT_TYPE_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
contentDigestHistory - Variable in class org.archive.modules.recrawl.ContentDigestHistoryLoader
 
contentDigestHistory - Variable in class org.archive.modules.recrawl.ContentDigestHistoryStorer
 
ContentDigestHistoryLoader - Class in org.archive.modules.recrawl
 
ContentDigestHistoryLoader() - Constructor for class org.archive.modules.recrawl.ContentDigestHistoryLoader
 
ContentDigestHistoryStorer - Class in org.archive.modules.recrawl
 
ContentDigestHistoryStorer() - Constructor for class org.archive.modules.recrawl.ContentDigestHistoryStorer
 
ContentExtractor - Class in org.archive.modules.extractor
Extracts links from the fetched content of a URI, as opposed to its headers.
ContentExtractor() - Constructor for class org.archive.modules.extractor.ContentExtractor
 
ContentExtractorTestBase - Class in org.archive.modules.extractor
Abstract base class for unit testing ContentExtractor implementations.
ContentExtractorTestBase() - Constructor for class org.archive.modules.extractor.ContentExtractorTestBase
 
ContentLengthDecideRule - Class in org.archive.modules.deciderules
 
ContentLengthDecideRule() - Constructor for class org.archive.modules.deciderules.ContentLengthDecideRule
Usual constructor.
contentTypeMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
This list is grouped in pairs.
ContentTypeMatchesRegexDecideRule - Class in org.archive.modules.deciderules
DecideRule whose decision is applied if the URI's content-type is present and matches the supplied regular expression.
ContentTypeMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule
 
ContentTypeNotMatchesRegexDecideRule - Class in org.archive.modules.deciderules
DecideRule whose decision is applied if the URI's content-type is present and does not match the supplied regular expression.
ContentTypeNotMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule
 
cookieComparator - Static variable in class org.archive.modules.fetcher.AbstractCookieStore
 
COOKIEDB_NAME - Static variable in class org.archive.modules.fetcher.BdbCookieStore
 
cookies - Variable in class org.archive.modules.fetcher.SimpleCookieStore
 
cookiesLoadFile - Variable in class org.archive.modules.fetcher.AbstractCookieStore
 
cookiesSaveFile - Variable in class org.archive.modules.fetcher.AbstractCookieStore
 
cookieStore - Variable in class org.archive.modules.fetcher.FetchHTTP
 
cookieStoreFor(CrawlURI) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
cookieStoreFor(String) - Method in class org.archive.modules.fetcher.BdbCookieStore
Returns a LimitedCookieStoreFacade whose LimitedCookieStoreFacade#getCookies() method returns only cookies from host and its parent domains, if applicable.
cookieStoreFor(String) - Method in interface org.archive.modules.fetcher.FetchHTTPCookieStore
Returns a CookieStore whose CookieStore.getCookies() returns all the cookies from host and each of its parent domains, if applicable.
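The "host and each of its parent domains" lookup described in the entry above can be pictured as an enumeration of candidate domains. The following is only a sketch under stated assumptions: the class and method names are hypothetical, no Heritrix code is used, and stopping before the bare top-level label is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of org.archive.modules.fetcher.
final class ParentDomainsSketch {

    /** Returns the host followed by its parent domains, most specific first. */
    static List<String> hostAndParents(String host) {
        List<String> out = new ArrayList<>();
        out.add(host);
        int dot = host.indexOf('.');
        while (dot >= 0) {
            String parent = host.substring(dot + 1);
            if (!parent.contains(".")) {
                break; // assumption: a bare top-level label is not a cookie domain
            }
            out.add(parent);
            dot = host.indexOf('.', dot + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(hostAndParents("www.example.com"));
    }
}
```

A store built this way would collect the cookies filed under each returned domain and merge them into one view.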
cookieStoreFor(CrawlURI) - Method in interface org.archive.modules.fetcher.FetchHTTPCookieStore
Returns a CookieStore whose CookieStore.getCookies() returns all the cookies that could possibly apply to curi.
cookieStoreFor(String) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
copyForwardWriteTagIfDupe(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
If this fetch is identical to the last written (archived) fetch, then copy forward the writeTag.
copyPersistSourceToHistoryMap(File, StoredSortedMap<String, Map>) - Static method in class org.archive.modules.recrawl.PersistProcessor
Populates a given StoredSortedMap (history map) from an old environment db or a persist log.
copyPersistSourceToHistoryMap(URL, StoredSortedMap<String, Map>) - Static method in class org.archive.modules.recrawl.PersistProcessor
Populates a given StoredSortedMap (history map) from an old persist log.
copyStats(Map<String, Map<String, Long>>) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
CoreAttributeConstants - Interface in org.archive.modules
Attribute keys and constant strings used by the core crawler classes.
countryCodes - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
Country code name.
crawlDelay - Variable in class org.archive.modules.net.RobotsDirectives
 
CrawledBytesHistotable - Class in org.archive.crawler.util
 
CrawledBytesHistotable() - Constructor for class org.archive.crawler.util.CrawledBytesHistotable
 
CrawlHost - Class in org.archive.modules.net
Represents a single remote "host".
CrawlHost(String) - Constructor for class org.archive.modules.net.CrawlHost
Create a new CrawlHost object.
CrawlHost(String, String) - Constructor for class org.archive.modules.net.CrawlHost
Create a new CrawlHost object.
CrawlMetadata - Class in org.archive.modules
Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs.
CrawlMetadata() - Constructor for class org.archive.modules.CrawlMetadata
 
CrawlServer - Class in org.archive.modules.net
Represents a single remote "server".
CrawlServer(String) - Constructor for class org.archive.modules.net.CrawlServer
Creates a new CrawlServer object.
CrawlURI - Class in org.archive.modules
Represents a candidate URI and the associated state it collects as it is crawled.
CrawlURI(UURI) - Constructor for class org.archive.modules.CrawlURI
Create a new instance of CrawlURI from a UURI.
CrawlURI(UURI, String, UURI, LinkContext) - Constructor for class org.archive.modules.CrawlURI
 
CrawlURI.FetchType - Enum in org.archive.modules
 
CrawlUriSWFAction(CrawlURI, Extractor) - Constructor for class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
 
createCrawlURI(UURI, LinkContext, Hop) - Method in class org.archive.modules.CrawlURI
Utility method for creating CrawlURIs that were found as outlinks from this CrawlURI.
createCrawlURI(String, LinkContext, Hop) - Method in class org.archive.modules.CrawlURI
 
createCrawlURI(UURI, LinkContext, Hop, int, boolean) - Method in class org.archive.modules.CrawlURI
Utility method for creation of CrawlURIs found extracting links from this CrawlURI.
createDNSLookup(String) - Method in class org.archive.modules.fetcher.FetchDNS
 
createFormSubmissionAttempt(CrawlURI, HTMLForm, String) - Method in class org.archive.modules.forms.FormLoginProcessor
 
createHostDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
Create a subdirectory named for the host in the URI.
createPortDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
Create a subdirectory named for the port in the URI.
createRecorder(String) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
Deprecated.
createRecorder(String, String) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
 
createSocket() - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
createSocket(String, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
createSocket(InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
createSocket(String, int, InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
createSocket(InetAddress, int, InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
createSocket(HttpContext) - Method in class org.archive.modules.fetcher.SocksSocketFactory
 
createSocket(HttpContext) - Method in class org.archive.modules.fetcher.SocksSSLSocketFactory
 
Credential - Class in org.archive.modules.credential
Credential type.
Credential() - Constructor for class org.archive.modules.credential.Credential
Constructor.
CredentialStore - Class in org.archive.modules.credential
Front door to the credential store.
CredentialStore() - Constructor for class org.archive.modules.credential.CredentialStore
Constructor.
CSS_BACKSLASH_ESCAPE - Static variable in class org.archive.modules.extractor.ExtractorCSS
 
CSS_URI_EXTRACTOR - Static variable in class org.archive.modules.extractor.ExtractorCSS
CSS URL extractor pattern.
curi - Variable in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
 
curi - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
customRobots - Variable in class org.archive.modules.net.CustomRobotsPolicy
textual alternate robots.txt rules to follow
CustomRobotsPolicy - Class in org.archive.modules.net
Follow a custom-written robots policy, rather than the site's own declarations. Does not support overlays of different custom-robots; instead, it is recommended that each custom policy be declared as a separate bean with a distinct name.
CustomRobotsPolicy() - Constructor for class org.archive.modules.net.CustomRobotsPolicy
 
customRobotstxt - Variable in class org.archive.modules.net.CustomRobotsPolicy
 
CustomSWFTags - Class in org.archive.modules.extractor
Overrides action tags that may hold URIs so that they use the CrawlUriSWFAction action.
CustomSWFTags(SWFActions) - Constructor for class org.archive.modules.extractor.CustomSWFTags
 

D

data - Variable in class org.archive.modules.CrawlURI
Flexible dynamic attributes list.
DecideResult - Enum in org.archive.modules.deciderules
The decision of a DecideRule.
DecideRule - Class in org.archive.modules.deciderules
 
DecideRule() - Constructor for class org.archive.modules.deciderules.DecideRule
 
DecideRuleSequence - Class in org.archive.modules.deciderules
 
DecideRuleSequence() - Constructor for class org.archive.modules.deciderules.DecideRuleSequence
 
decisionFor(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
 
decisionMade(CrawlURI, DecideRule, int, DecideResult) - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
DEFAULT_IP_WHOIS_SERVER - Static variable in class org.archive.modules.fetcher.FetchWhois
 
DEFAULT_LOWER_BOUND - Static variable in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Default lower bound
DEFAULT_PARAMETERS - Static variable in class org.archive.modules.extractor.Extractor
 
DEFAULT_UPPER_BOUND - Static variable in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Default upper bound
DefaultServerCache - Class in org.archive.modules.fetcher
Server and Host cache.
DefaultServerCache() - Constructor for class org.archive.modules.fetcher.DefaultServerCache
Constructor.
DefaultServerCache(ObjectIdentityCache<CrawlServer>, ObjectIdentityCache<CrawlHost>) - Constructor for class org.archive.modules.fetcher.DefaultServerCache
 
DefaultTempDirProvider - Class in org.archive.modules.net
 
DefaultTempDirProvider() - Constructor for class org.archive.modules.net.DefaultTempDirProvider
 
defaultURI() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
Returns a CrawlURI for testing purposes.
deferOrFinishGeneric(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchWhois
 
description - Variable in class org.archive.modules.CrawlMetadata
 
detach(CrawlURI) - Method in class org.archive.modules.credential.Credential
Detach this credential from passed curi.
detachAll(CrawlURI) - Method in class org.archive.modules.credential.Credential
Detach all credentials of this type from passed curi.
digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchDNS
 
digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchFTP
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchHTTP
 
digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchSFTP
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
directory - Variable in class org.archive.modules.writer.WriterPoolProcessor
 
directoryFile - Variable in class org.archive.modules.writer.MirrorWriterProcessor
Implicitly append this to a URI ending with '/'.
disallows - Variable in class org.archive.modules.net.RobotsDirectives
 
DispositionChain - Class in org.archive.modules
 
DispositionChain() - Constructor for class org.archive.modules.DispositionChain
 
DnsResponseRecordBuilder - Class in org.archive.modules.warc
 
DnsResponseRecordBuilder() - Constructor for class org.archive.modules.warc.DnsResponseRecordBuilder
 
doAbort(CrawlURI, AbstractExecutionAwareRequest, String) - Method in class org.archive.modules.fetcher.FetchHTTP
 
doCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
doCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
doCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
doCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
 
doCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
 
doCheckpoint(Checkpoint) - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
doCheckpoint(Checkpoint) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
document - Variable in class org.archive.modules.extractor.PDFParser
 
documentReader - Variable in class org.archive.modules.extractor.PDFParser
 
domain - Variable in class org.archive.modules.credential.Credential
The root domain this credential goes against: E.g.
doStripRegexMatch(String, String) - Method in class org.archive.modules.canonicalize.BaseRule
Run a regex that strips elements of a string.
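One plausible reading of "a regex that strips elements of a string" is a pattern with two capture groups around the part to drop. The sketch below assumes exactly that convention; the class name, method name, and the example pattern are hypothetical and are not taken from BaseRule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the two-group convention is an assumption.
final class StripRegexSketch {

    /**
     * If the whole input matches, return group 1 + group 2, which drops
     * whatever the pattern matched between the two groups.
     */
    static String strip(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches() && m.groupCount() >= 2) {
            return m.group(1) + m.group(2);
        }
        return input; // no match: leave the string untouched
    }

    public static void main(String[] args) {
        // Removes a leading "www." from the host part.
        System.out.println(strip("http://www.example.com/", "^(https?://)www\\.(.*)$"));
    }
}
```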
dotBegin - Variable in class org.archive.modules.writer.MirrorWriterProcessor
If a segment starts with '.', the '.' is replaced by this.
dotEnd - Variable in class org.archive.modules.writer.MirrorWriterProcessor
If a directory name ends with '.' it is replaced by this.
dumpSurtPrefixSet() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Dump the current prefixes in use to configured dump file (if any)
DUPLICATE - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
DUPLICATECOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 

E

elementContext(CharSequence, CharSequence) - Static method in class org.archive.modules.extractor.ExtractorHTML
Create a suitable XPath-like context from an element name and optional attribute name.
eligibleFormsAttemptsCount - Variable in class org.archive.modules.forms.FormLoginProcessor
 
eligibleFormsSeenCount - Variable in class org.archive.modules.forms.FormLoginProcessor
 
EMBED_MISC - Static variable in class org.archive.modules.extractor.LinkContext
Stand-in value for embeds without other context.
encounteredReferences - Variable in class org.archive.modules.extractor.PDFParser
 
enctype - Variable in class org.archive.modules.forms.HTMLForm
 
engineName - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
engine name; default "beanshell"
engineName - Variable in class org.archive.modules.ScriptedProcessor
engine name; default "beanshell"
ensureStandardPoliciesAvailable() - Method in class org.archive.modules.CrawlMetadata
 
equals(Object) - Method in class org.archive.modules.CrawlURI
 
equals(Object) - Method in class org.archive.modules.extractor.LinkContext
 
equals(Object) - Method in class org.archive.modules.net.CrawlHost
 
equals(Object) - Method in class org.archive.modules.net.CrawlServer
 
escapeForMultipart(String) - Static method in class org.archive.modules.fetcher.FetchHTTPRequest
Returns a copy of the string with non-ascii characters replaced by their html numeric character reference in decimal (e.g.
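Replacing non-ASCII characters with decimal numeric character references can be sketched as follows. This is a minimal illustration, not the FetchHTTPRequest implementation: the class and method names are hypothetical, and working per code point (rather than per UTF-16 char) is an assumption.

```java
// Hypothetical illustration of decimal numeric character references.
final class MultipartEscapeSketch {

    /** Copies ASCII through unchanged; writes everything else as &#NNN; */
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                // append(int) writes the code point's decimal value
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("caf\u00e9")); // prints caf&#233;
    }
}
```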
eTag - Variable in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.AddRedirectFromRootServerToScope
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule
Evaluate whether given object's string version does not match configured regex (by reversing the superclass's answer).
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
Evaluate whether given object is equal to the configured status
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule
Evaluate whether given object's FetchStatus does not match configured regex (by reversing the superclass's answer).
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.HasViaDecideRule
Evaluate whether given object has a via (referring) URI.
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
Evaluate whether given object's string version matches configured regexes
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
Evaluate whether given object's string version matches configured regex
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Returns "true" if the provided CrawlURI has a fetch status that falls within this instance's specified range.
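The range test described above, together with its negated counterpart, amounts to a bounds check on the fetch status. The sketch below is hypothetical: it works on a plain int instead of a CrawlURI, and treating both bounds as inclusive is an assumption that the index does not confirm.

```java
// Hypothetical illustration; bounds are assumed to be inclusive.
final class StatusRangeSketch {
    private final int lowerBound;
    private final int upperBound;

    StatusRangeSketch(int lowerBound, int upperBound) {
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    /** True when the status lies within [lowerBound, upperBound]. */
    boolean matches(int fetchStatus) {
        return fetchStatus >= lowerBound && fetchStatus <= upperBound;
    }

    /** The negated form simply reverses the answer above. */
    boolean notMatches(int fetchStatus) {
        return !matches(fetchStatus);
    }

    public static void main(String[] args) {
        StatusRangeSketch success = new StatusRangeSketch(200, 299);
        System.out.println(success.matches(204));    // prints true
        System.out.println(success.notMatches(404)); // prints true
    }
}
```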
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesFilePatternDecideRule
Evaluate whether given object's string version does not match configured regex (by reversing the superclass's answer).
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesListRegexDecideRule
Evaluate whether given object's string version does not match configured regexes (by reversing the superclass's answer).
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesRegexDecideRule
Evaluate whether given object's string version does not match configured regex (by reversing the superclass's answer).
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
Returns "true" if the provided CrawlURI has a fetch status that does not fall within this instance's specified range.
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
Evaluate whether given CrawlURI's revisit profile has been set to identical digest
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
Evaluate whether given object's URI scheme is not in the configured set of schemes.
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotOnDomainsDecideRule
Evaluate whether given object's URI is NOT in the set of domains -- simply reverse superclass's determination
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotOnHostsDecideRule
Evaluate whether given object's URI is NOT in the set of hosts -- simply reverse superclass's determination
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule
Evaluate whether given object's URI is NOT in the SURT prefix set -- simply reverse superclass's determination
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Evaluate whether given object's URI is covered by the SURT prefix set
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
Evaluate whether given object is over the threshold number of hops.
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
Evaluate whether given object is over the threshold number of path-segments.
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
Evaluate whether given object is within the acceptable thresholds of transitive hops.
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
Evaluate whether given object's surt form matches one of the supplied surts
execute() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
expectContinue() - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
 
expectedResult - Variable in class org.archive.modules.extractor.StringExtractorTestBase.TestData
 
expireCookie(Cookie, Date) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
expireCookie(Cookie, Date) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
expireCookie(Cookie, Date) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
extendHopsPath(String, char) - Static method in class org.archive.modules.CrawlURI
Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols), keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED.
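Capping the displayed hops while still extending the path can be sketched as below. Everything here is an assumption made for illustration: the limit of 5, the "N+" overflow prefix that counts dropped hops, and the class and method names. The sketch shows at most MAX_HOPS_DISPLAYED symbols; the actual CrawlURI limit and format may differ.

```java
// Hypothetical illustration; limit and overflow format are assumptions.
final class HopsPathSketch {
    static final int MAX_HOPS_DISPLAYED = 5; // assumed small value for the demo

    /** Appends a hop symbol, dropping the oldest shown hop once the cap is hit. */
    static String extend(String hopsPath, char hop) {
        if (hopsPath.length() < MAX_HOPS_DISPLAYED) {
            return hopsPath + hop;
        }
        int plus = hopsPath.indexOf('+');
        // Number of hops already dropped, read from an existing "N+" prefix.
        int overflow = plus < 0 ? 0 : Integer.parseInt(hopsPath.substring(0, plus));
        String shown = plus < 0 ? hopsPath : hopsPath.substring(plus + 1);
        return (overflow + 1) + "+" + shown.substring(1) + hop;
    }

    public static void main(String[] args) {
        System.out.println(extend("LL", 'E'));      // prints LLE
        System.out.println(extend("LLLLL", 'E'));   // prints 1+LLLLE
        System.out.println(extend("1+LLLLE", 'X')); // prints 2+LLLEX
    }
}
```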
ExternalGeoLocationDecideRule - Class in org.archive.modules.deciderules
A rule that can be configured to take alternate implementations of the ExternalGeoLookupInterface.
ExternalGeoLocationDecideRule() - Constructor for class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
ExternalGeoLookupInterface - Interface in org.archive.modules.deciderules
 
extract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
Extracts links
extract(CrawlURI) - Method in class org.archive.modules.extractor.Extractor
Extracts links from the given URI.
extract(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
Run extractor.
extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTTP
 
extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
Perform usual extraction on a CrawlURI
extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
Perform usual extraction on a CrawlURI
extract(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
Run extractor.
extract(CrawlURI) - Method in class org.archive.modules.forms.ExtractorHTMLForms
 
extractChallenges(HttpResponse, CrawlURI, AuthenticationStrategy) - Method in class org.archive.modules.fetcher.FetchHTTP
 
extractImplied(CharSequence, Pattern, String) - Static method in class org.archive.modules.extractor.ExtractorImpliedURI
Utility method for extracting 'implied' URI given a source uri, trigger pattern, and build pattern.
extractLink(CrawlURI, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
Consider a single Link for internal URIs
extractor - Variable in class org.archive.modules.extractor.ContentExtractorTestBase
An extractor created during the setUp.
Extractor - Class in org.archive.modules.extractor
Extracts links from fetched URIs.
Extractor() - Constructor for class org.archive.modules.extractor.Extractor
 
ExtractorCSS - Class in org.archive.modules.extractor
This extractor parses URIs from CSS files.
ExtractorCSS() - Constructor for class org.archive.modules.extractor.ExtractorCSS
 
ExtractorDOC - Class in org.archive.modules.extractor
This class allows the caller to extract href-style links from Word 97-format Word documents.
ExtractorDOC() - Constructor for class org.archive.modules.extractor.ExtractorDOC
 
ExtractorHTML - Class in org.archive.modules.extractor
Basic link-extraction, from an HTML content-body, using regular expressions.
ExtractorHTML() - Constructor for class org.archive.modules.extractor.ExtractorHTML
 
ExtractorHTMLForms - Class in org.archive.modules.forms
Extracts extra information about FORMs in HTML, loading this into the CrawlURI (for potential later use by FormLoginProcessor) and adding a small annotation to the crawl.log.
ExtractorHTMLForms() - Constructor for class org.archive.modules.forms.ExtractorHTMLForms
 
ExtractorHTTP - Class in org.archive.modules.extractor
Extracts URIs from HTTP response headers.
ExtractorHTTP() - Constructor for class org.archive.modules.extractor.ExtractorHTTP
 
ExtractorImpliedURI - Class in org.archive.modules.extractor
An extractor for finding 'implied' URIs inside other URIs.
ExtractorImpliedURI() - Constructor for class org.archive.modules.extractor.ExtractorImpliedURI
Constructor.
extractorJS - Variable in class org.archive.modules.extractor.ExtractorHTML
JavaScript extractor used to process inline JavaScript.
ExtractorJS - Class in org.archive.modules.extractor
Processes JavaScript files for strings that are likely to be crawlable URIs.
ExtractorJS() - Constructor for class org.archive.modules.extractor.ExtractorJS
 
extractorJS - Variable in class org.archive.modules.extractor.ExtractorSWF
JavaScript extractor used to process inline JavaScript.
ExtractorMultipleRegex - Class in org.archive.modules.extractor
An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
ExtractorMultipleRegex() - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex
 
ExtractorMultipleRegex.GroupList - Class in org.archive.modules.extractor
 
ExtractorMultipleRegex.MatchList - Class in org.archive.modules.extractor
 
extractorParameters - Variable in class org.archive.modules.extractor.Extractor
 
ExtractorParameters - Interface in org.archive.modules.extractor
Bean interface for parameters consulted by multiple Extractors, and thus provided by some shared object.
ExtractorPDF - Class in org.archive.modules.extractor
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs
ExtractorPDF() - Constructor for class org.archive.modules.extractor.ExtractorPDF
 
ExtractorRobotsTxt - Class in org.archive.modules.extractor
 
ExtractorRobotsTxt() - Constructor for class org.archive.modules.extractor.ExtractorRobotsTxt
 
ExtractorSitemap - Class in org.archive.modules.extractor
 
ExtractorSitemap() - Constructor for class org.archive.modules.extractor.ExtractorSitemap
 
ExtractorSWF - Class in org.archive.modules.extractor
Extracts URIs from SWF (flash/shockwave) files.
ExtractorSWF() - Constructor for class org.archive.modules.extractor.ExtractorSWF
 
ExtractorSWF.CrawlUriSWFAction - Class in org.archive.modules.extractor
SWF action that handles discovered URIs.
ExtractorSWF.ExtractorTagParser - Class in org.archive.modules.extractor
TagParser customized to ignore SWFTags that will never contain extractable URIs.
ExtractorTagParser(SWFTagTypes) - Constructor for class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
ExtractorUniversal - Class in org.archive.modules.extractor
A last-ditch extractor that will look at the raw byte code and try to extract anything that looks like a link.
ExtractorUniversal() - Constructor for class org.archive.modules.extractor.ExtractorUniversal
Constructor.
ExtractorURI - Class in org.archive.modules.extractor
An extractor for finding URIs inside other URIs.
ExtractorURI() - Constructor for class org.archive.modules.extractor.ExtractorURI
Constructor
ExtractorXML - Class in org.archive.modules.extractor
A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
ExtractorXML() - Constructor for class org.archive.modules.extractor.ExtractorXML
 
extractQueryStringLinks(UURI) - Static method in class org.archive.modules.extractor.ExtractorURI
Look for URIs inside the supplied UURI.
extractURIs() - Method in class org.archive.modules.extractor.PDFParser
Extract URIs from all objects found in a PDF document's catalog.
extractURIs(PdfObject) - Method in class org.archive.modules.extractor.PDFParser
Parse a PdfDictionary, looking for URIs recursively and adding them to foundURIs
extraInfo - Variable in class org.archive.modules.CrawlURI
 

F

failedExecuteCleanup(CrawlURI, Exception) - Method in class org.archive.modules.fetcher.FetchHTTP
Cleanup after a failed method execute.
fetch(CrawlURI, String, String) - Method in class org.archive.modules.fetcher.FetchWhois
 
FETCH_DISREGARDS - Static variable in class org.archive.modules.fetcher.FetchStats
 
FETCH_FAILURES - Static variable in class org.archive.modules.fetcher.FetchStats
 
FETCH_NONRESPONSES - Static variable in class org.archive.modules.fetcher.FetchStats
 
FETCH_RESPONSES - Static variable in class org.archive.modules.fetcher.FetchStats
 
FETCH_SUCCESSES - Static variable in class org.archive.modules.fetcher.FetchStats
 
FetchChain - Class in org.archive.modules
 
FetchChain() - Constructor for class org.archive.modules.FetchChain
 
FetchDNS - Class in org.archive.modules.fetcher
Processor to resolve 'dns:' URIs.
FetchDNS() - Constructor for class org.archive.modules.fetcher.FetchDNS
 
fetcher - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
FetchErrors - Class in org.archive.modules.fetcher
 
FetchErrors() - Constructor for class org.archive.modules.fetcher.FetchErrors
 
FetchFTP - Class in org.archive.modules.fetcher
Fetches documents and directory listings using FTP.
FetchFTP() - Constructor for class org.archive.modules.fetcher.FetchFTP
Constructs a new FetchFTP.
FetchFTP.SocketFactoryWithTimeout - Class in org.archive.modules.fetcher
A SocketFactory much like javax.net.DefaultSocketFactory, except that the createSocket() methods that open connections support a connect timeout.
FetchHistoryProcessor - Class in org.archive.modules.recrawl
Maintain a history of fetch information inside the CrawlURI's attributes.
FetchHistoryProcessor() - Constructor for class org.archive.modules.recrawl.FetchHistoryProcessor
 
FetchHTTP - Class in org.archive.modules.fetcher
HTTP fetcher that uses Apache HttpComponents.
FetchHTTP() - Constructor for class org.archive.modules.fetcher.FetchHTTP
 
FetchHTTPCookieStore - Interface in org.archive.modules.fetcher
 
FetchHTTPRequest - Class in org.archive.modules.fetcher
 
FetchHTTPRequest(FetchHTTP, CrawlURI) - Constructor for class org.archive.modules.fetcher.FetchHTTPRequest
 
FetchHTTPRequest.RecordingHttpClientConnection - Class in org.archive.modules.fetcher
 
FetchHTTPRequest.ServerCacheResolver - Class in org.archive.modules.fetcher
Implementation of DnsResolver that uses the server cache which is normally expected to have been populated by FetchDNS.
FetchSFTP - Class in org.archive.modules.fetcher
 
FetchSFTP() - Constructor for class org.archive.modules.fetcher.FetchSFTP
Constructs a new FetchSFTP.
FetchStats - Class in org.archive.modules.fetcher
Collector of statistics for a 'subset' of a crawl, such as a server (host:port), host, or frontier group (e.g. queue).
FetchStats() - Constructor for class org.archive.modules.fetcher.FetchStats
 
FetchStats.CollectsFetchStats - Interface in org.archive.modules.fetcher
 
FetchStats.HasFetchStats - Interface in org.archive.modules.fetcher
 
FetchStats.Stage - Enum in org.archive.modules.fetcher
 
FetchStatusCodes - Interface in org.archive.modules.fetcher
Constant flag codes to be used, in lieu of per-protocol codes (like HTTP's 200, 404, etc.), when network/internal/out-of-band conditions occur.
fetchStatusCodesToString(int) - Static method in class org.archive.modules.CrawlURI
Takes a status code and converts it into a human readable string.
FetchStatusDecideRule - Class in org.archive.modules.deciderules
Rule applies the configured decision for any URI which has a fetch status equal to the 'target-status' setting.
FetchStatusDecideRule() - Constructor for class org.archive.modules.deciderules.FetchStatusDecideRule
Usual constructor.
FetchStatusMatchesRegexDecideRule - Class in org.archive.modules.deciderules
 
FetchStatusMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule
Usual constructor.
FetchStatusNotMatchesRegexDecideRule - Class in org.archive.modules.deciderules
 
FetchStatusNotMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule
Usual constructor.
FetchWhois - Class in org.archive.modules.fetcher
WHOIS Fetcher (RFC 3912).
FetchWhois() - Constructor for class org.archive.modules.fetcher.FetchWhois
 
FetchWhois.UrlStatus - Enum in org.archive.modules.fetcher
 
fileLogger - Variable in class org.archive.modules.deciderules.DecideRuleSequence
 
findAttributeValueGroup(String, int, CharSequence) - Method in class org.archive.modules.forms.ExtractorHTMLForms
 
findGroups(String, int, CharSequence) - Method in class org.archive.modules.forms.ExtractorHTMLForms
 
FINISH - Static variable in class org.archive.modules.ProcessResult
 
finishCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
finishCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
finishCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
finishCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
 
finishCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
 
finishCheckpoint(Checkpoint) - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
FirstNamedRobotsPolicy - Class in org.archive.modules.net
Working from an ordered list of potential User-Agents, consisting of first the regularly-configured User-Agent and then those in the candidateUserAgents list, consider each potential agent in order.
FirstNamedRobotsPolicy() - Constructor for class org.archive.modules.net.FirstNamedRobotsPolicy
 
fixUpName() - Method in class org.archive.modules.net.CrawlHost
 
FixupQueryString - Class in org.archive.modules.canonicalize
Strip any trailing question mark.
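Stripping a trailing question mark is a small canonicalization step that makes "page?" and "page" compare equal. The sketch below covers only what the one-line summary states; the class and method names are hypothetical, and the real rule may handle further cases.

```java
// Hypothetical illustration of the summary line only.
final class TrailingQuestionMarkSketch {

    /** Removes a single '?' at the very end of the URL, if present. */
    static String stripTrailingQuestionMark(String url) {
        return url.endsWith("?") ? url.substring(0, url.length() - 1) : url;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingQuestionMark("http://example.com/page?"));
    }
}
```

A URL with a non-empty query string, such as http://example.com/page?a=1, passes through unchanged.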
FixupQueryString() - Constructor for class org.archive.modules.canonicalize.FixupQueryString
 
flattenVia() - Method in class org.archive.modules.CrawlURI
Method returns string version of this URI's referral URI.
flattenVia(CrawlURI) - Static method in class org.archive.modules.Processor
 
forAllHostsDo(Closure) - Method in class org.archive.modules.fetcher.DefaultServerCache
NOTE: Should not mutate the CrawlHost instance so retrieved; depending on the hostscache implementation, the change may not be reliably persistent.
forAllHostsDo(Closure) - Method in class org.archive.modules.net.ServerCache
Utility for performing an action on every CrawlHost.
forceFetch() - Method in class org.archive.modules.CrawlURI
If this method returns true, this URI should be fetched even though it already has been crawled.
formData(String, String) - Method in class org.archive.modules.forms.HTMLForm
 
FormInput() - Constructor for class org.archive.modules.forms.HTMLForm.FormInput
 
formItems - Variable in class org.archive.modules.credential.HtmlFormCredential
Form items.
FormLoginProcessor - Class in org.archive.modules.forms
A step, post-ExtractorHTMLForms, where a followup CrawlURI to attempt a form submission may be synthesized.
FormLoginProcessor() - Constructor for class org.archive.modules.forms.FormLoginProcessor
 
foundURIs - Variable in class org.archive.modules.extractor.PDFParser
 
frequentFlushes - Variable in class org.archive.modules.writer.WriterPoolProcessor
Whether to flush to underlying file frequently (at least after each record), or not.
fromCheckpointJson(JSONObject) - Method in class org.archive.modules.extractor.Extractor
 
fromCheckpointJson(JSONObject) - Method in class org.archive.modules.forms.FormLoginProcessor
 
fromCheckpointJson(JSONObject) - Method in class org.archive.modules.Processor
Restore internal state from JSONObject stored at earlier checkpoint-time.
fromCheckpointJson(JSONObject) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
fromCheckpointJson(JSONObject) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
fromCheckpointJson(JSONObject) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
fromHopsViaString(String) - Static method in class org.archive.modules.CrawlURI
 
FtpControlConversationRecordBuilder - Class in org.archive.modules.warc
 
FtpControlConversationRecordBuilder() - Constructor for class org.archive.modules.warc.FtpControlConversationRecordBuilder
 
FtpResponseRecordBuilder - Class in org.archive.modules.warc
 
FtpResponseRecordBuilder() - Constructor for class org.archive.modules.warc.FtpResponseRecordBuilder
 
fullVia - Variable in class org.archive.modules.CrawlURI
 

G

generateRecordID() - Static method in class org.archive.modules.warc.BaseWARCRecordBuilder
 
generator - Variable in class org.archive.modules.writer.BaseWARCWriterProcessor
Generator for record IDs
get(Object, String) - Method in class org.archive.modules.credential.CredentialStore
 
get(CharSequence, CharSequence) - Static method in class org.archive.modules.extractor.HTMLLinkContext
Return an instance of HTMLLinkContext for attribute attr in element el.
get(String) - Static method in class org.archive.modules.extractor.HTMLLinkContext
Return an instance of HTMLLinkContext for the given path.
get(int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
getAcceptCompression() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getAcceptHeaders() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getAcceptNonDnsResolves() - Method in class org.archive.modules.fetcher.FetchDNS
 
getAction() - Method in class org.archive.modules.forms.HTMLForm
 
getAll() - Method in class org.archive.modules.credential.CredentialStore
 
getAlsoCheckVia() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
getAnnotations() - Method in class org.archive.modules.CrawlURI
Get the annotations set for this URI.
getApplicableSurtPrefix() - Method in class org.archive.modules.forms.FormLoginProcessor
 
getAttributeEither(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchHTTP
Get a value either from inside the CrawlURI instance, or from settings (module attributes).
getAudience() - Method in class org.archive.modules.CrawlMetadata
 
getAvailableRobotsPolicies() - Method in class org.archive.modules.CrawlMetadata
 
getBaseURI() - Method in class org.archive.modules.CrawlURI
Get the (HTML) Base URI used for derelativizing internal URIs.
getBeanName() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
getBeanName() - Method in class org.archive.modules.Processor
 
getBlockAwaitingSeedLines() - Method in class org.archive.modules.seeds.TextSeedModule
 
getByRealm(Set<Credential>, String, CrawlURI) - Static method in class org.archive.modules.credential.HttpAuthenticationCredential
Convenience method that looks up a credential in the passed set, using the realm as the key.
getCandidateUserAgents() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
getCandidateUserAgents() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
getCanonicalString() - Method in class org.archive.modules.CrawlURI
 
getCaseSensitiveFilesystem() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getChain() - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
getCharacterMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getChmod() - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
getChmodValue() - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
getClassKey() - Method in class org.archive.modules.CrawlURI
Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.
getCollection() - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
getComment() - Method in class org.archive.modules.deciderules.DecideRule
 
getCompress() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getConfiguredHttpVersion() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getConnectTimeoutMs() - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
getContentDeclaredCharset(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTML
 
getContentDeclaredCharset(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorXML
 
getContentDigest() - Method in class org.archive.modules.CrawlURI
Return the retained content-digest value, if any.
getContentDigestHistory() - Method in class org.archive.modules.CrawlURI
 
getContentDigestSchemeString() - Method in class org.archive.modules.CrawlURI
 
getContentDigestString() - Method in class org.archive.modules.CrawlURI
 
getContentLength() - Method in class org.archive.modules.CrawlURI
For completed HTTP transactions, the length of the content-body.
getContentLengthThreshold() - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
 
getContentLengthThreshold() - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
 
getContentRegexes() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
getContentSize() - Method in class org.archive.modules.CrawlURI
Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers.
getContentType() - Method in class org.archive.modules.CrawlURI
Get the content type of this URI.
getContentTypeMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getCookies() - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
 
getCookies() - Method in class org.archive.modules.fetcher.BdbCookieStore
 
getCookies() - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
getCookiesLoadFile() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
getCookiesSaveFile() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
getCookieStore() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getCountryCode() - Method in class org.archive.modules.net.CrawlHost
Get the country code of this host.
getCountryCodes() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
getCrawlDelay() - Method in class org.archive.modules.net.RobotsDirectives
 
getCreateHostDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getCreatePortDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getCredentials() - Method in class org.archive.modules.CrawlURI
 
getCredentials() - Method in class org.archive.modules.credential.CredentialStore
 
getCredentials(CrawlURI, Class<?>) - Method in class org.archive.modules.fetcher.FetchHTTP
 
getCredentials() - Method in class org.archive.modules.net.CrawlServer
 
getCredentialStore() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getCredentialTypes() - Static method in class org.archive.modules.credential.CredentialStore
 
getCustomRobots() - Method in class org.archive.modules.net.CustomRobotsPolicy
 
getData() - Method in class org.archive.modules.CrawlURI
 
getDataList(String) - Method in class org.archive.modules.CrawlURI
Convenience method: return (creating if necessary) the list at the given data key.
getDecision() - Method in class org.archive.modules.deciderules.PredicatedDecideRule
 
getDefaultCharset() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getDefaultEncoding() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getDefaultMaxFileSize() - Method in class org.archive.modules.writer.ARCWriterProcessor
 
getDefaultMaxFileSize() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
getDefaultMaxFileSize() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getDefaultRules() - Static method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
A reasonable set of default rules to use, if no others are provided by operator configuration.
getDefaultStorePaths() - Method in class org.archive.modules.writer.ARCWriterProcessor
 
getDefaultStorePaths() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
getDefaultStorePaths() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getDeferrals() - Method in class org.archive.modules.CrawlURI
Get the deferral count.
getDescription() - Method in class org.archive.modules.CrawlMetadata
 
getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchDNS
 
getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchFTP
 
getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getDigestContent() - Method in class org.archive.modules.fetcher.FetchDNS
 
getDigestContent() - Method in class org.archive.modules.fetcher.FetchFTP
 
getDigestContent() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getDigestContent() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getDirectivesFor(String, boolean) - Method in class org.archive.modules.net.Robotstxt
Return the RobotsDirectives, if any, appropriate for the given User-Agent string.
getDirectivesFor(String) - Method in class org.archive.modules.net.Robotstxt
Return directives to use for the given User-Agent, resorting to wildcard rules or the default no-directives if necessary.
getDirectory() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getDirectoryFile() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getDisableJavaDnsResolves() - Method in class org.archive.modules.fetcher.FetchDNS
 
getDnsOverHttpServer() - Method in class org.archive.modules.fetcher.FetchDNS
 
getDNSRecord(long, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
 
getDomain() - Method in class org.archive.modules.credential.Credential
 
getDotBegin() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getDotEnd() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getDupByHashBytes() - Method in class org.archive.modules.fetcher.FetchStats
 
getDupByHashUrls() - Method in class org.archive.modules.fetcher.FetchStats
 
getEarliestNextURIEmitTime() - Method in class org.archive.modules.net.CrawlHost
Get the earliest time a URI for this host could be emitted.
getEmbedHopCount() - Method in class org.archive.modules.CrawlURI
Get the embed hop count.
getEnabled() - Method in class org.archive.modules.canonicalize.BaseRule
 
getEnabled() - Method in interface org.archive.modules.canonicalize.CanonicalizationRule
 
getEnabled() - Method in class org.archive.modules.deciderules.DecideRule
 
getEnabled() - Method in class org.archive.modules.Processor
 
getEnctype() - Method in class org.archive.modules.forms.HTMLForm
 
getEngine() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
Get the proper ScriptEngine instance -- either shared or local to this thread.
getEngine() - Method in class org.archive.modules.ScriptedProcessor
Get the proper ScriptEngine instance -- either shared or local to this thread.
getEngineName() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
getEngineName() - Method in class org.archive.modules.ScriptedProcessor
 
getEntity() - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
 
getETag() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
getExtract404s() - Method in interface org.archive.modules.extractor.ExtractorParameters
Whether to extract links from responses with a 404 'not found' response code.
getExtractAllForms() - Method in class org.archive.modules.forms.ExtractorHTMLForms
 
getExtractFromDirs() - Method in class org.archive.modules.fetcher.FetchFTP
Returns the extract.from.dirs attribute for this FetchFTP and the given curi.
getExtractFromDirs() - Method in class org.archive.modules.fetcher.FetchSFTP
Returns the extract.from.dirs attribute for this FetchSFTP and the given curi.
getExtractIndependently() - Method in interface org.archive.modules.extractor.ExtractorParameters
Whether each extractor should make an independent decision as to whether it can extract links from a URI's content (when value is true), or whether a previous extractor's success (marking the URI as hasBeenLinkExtracted) should cancel later extractors (when value is false).
getExtractJavascript() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getExtractOnlyFormGets() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getExtractorJS() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getExtractorJS() - Method in class org.archive.modules.extractor.ExtractorSWF
 
getExtractorParameters() - Method in class org.archive.modules.extractor.Extractor
 
getExtractParent() - Method in class org.archive.modules.fetcher.FetchFTP
Returns the extract.parent attribute for this FetchFTP and the given curi.
getExtractParent() - Method in class org.archive.modules.fetcher.FetchSFTP
Returns the extract.parent attribute for this FetchSFTP and the given curi.
getExtractValueAttributes() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getExtraInfo() - Method in class org.archive.modules.CrawlURI
 
getFetchAttempts() - Method in class org.archive.modules.CrawlURI
Get the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
getFetchBeginTime() - Method in class org.archive.modules.CrawlURI
 
getFetchCompletedTime() - Method in class org.archive.modules.CrawlURI
 
getFetchDisregards() - Method in class org.archive.modules.fetcher.FetchStats
 
getFetchDuration() - Method in class org.archive.modules.CrawlURI
 
getFetchHistory() - Method in class org.archive.modules.CrawlURI
 
getFetchNonResponses() - Method in class org.archive.modules.fetcher.FetchStats
 
getFetchResponses() - Method in class org.archive.modules.fetcher.FetchStats
 
getFetchStatus() - Method in class org.archive.modules.CrawlURI
Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
getFetchSuccesses() - Method in class org.archive.modules.fetcher.FetchStats
 
getFetchType() - Method in class org.archive.modules.CrawlURI
 
getFirstARecord(Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
 
getFormat() - Method in class org.archive.modules.canonicalize.RegexRule
 
getFormat() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
 
getFormItems() - Method in class org.archive.modules.credential.HtmlFormCredential
 
getFormProvince(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
Get the 'form province' - either the configured (applicableSurtPrefix) or inferred (full current server) range of URIs that is considered covered by one form login
getFrequentFlushes() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getFrom() - Method in class org.archive.modules.CrawlMetadata
 
getFrom() - Method in interface org.archive.modules.fetcher.UserAgentProvider
 
getFullVia() - Method in class org.archive.modules.CrawlURI
 
getHarvester() - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
getHistoryDbName() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
getHistoryDbName() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
 
getHistoryLength() - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
 
getHolder() - Method in class org.archive.modules.CrawlURI
Return the 'holder' for the convenience of an external facility.
getHolderCost() - Method in class org.archive.modules.CrawlURI
Return the 'holderCost' for the convenience of an external facility (Frontier).
getHolderKey() - Method in class org.archive.modules.CrawlURI
Return the 'holderKey' for convenience of an external facility (Frontier).
getHopChar() - Method in enum org.archive.modules.extractor.Hop
Returns a hop character suitable for display in logs.
getHopCount() - Method in class org.archive.modules.CrawlURI
Get total hops from seed.
getHopString() - Method in enum org.archive.modules.extractor.Hop
 
getHostAddress(CrawlURI) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
from WriterPoolProcessor
getHostAddress(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
Deprecated.
WARCRecordBuilder instances use CrawlURI.getServerIP()
getHostFor(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
Get the CrawlHost associated with name.
getHostFor(String) - Method in class org.archive.modules.net.ServerCache
 
getHostFor(UURI) - Method in class org.archive.modules.net.ServerCache
Get the CrawlHost associated with curi.
getHostMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getHostName() - Method in class org.archive.modules.net.CrawlHost
Get the host name.
getHttpAuthChallenges() - Method in class org.archive.modules.CrawlURI
 
getHttpAuthChallenges() - Method in class org.archive.modules.net.CrawlServer
 
getHttpBindAddress() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getHttpMethod() - Method in class org.archive.modules.credential.HtmlFormCredential
Deprecated.
ignored, always POST
getHttpProxyHost() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getHttpProxyPassword() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getHttpProxyPort() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getHttpProxyUser() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getHttpResponseHeader(String) - Method in class org.archive.modules.CrawlURI
 
getId() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
 
getIgnoreCookies() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getIgnoreFormActionUrls() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getIgnoreUnexpectedHtml() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getInferRootPage() - Method in class org.archive.modules.extractor.ExtractorHTTP
 
getInFromFile(String) - Method in class org.archive.modules.extractor.PDFParser
Read a file named 'doc' and store its bytes for later processing.
getIP() - Method in class org.archive.modules.net.CrawlHost
Get the IP address for this host.
getIpAddresses() - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
 
getIpFetched() - Method in class org.archive.modules.net.CrawlHost
Get the time when the IP address for this host was last looked up.
getIpTTL() - Method in class org.archive.modules.net.CrawlHost
Get the TTL value from the DNS record for this host.
getIsolateThreads() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
getIsolateThreads() - Method in class org.archive.modules.ScriptedProcessor
 
getJobName() - Method in class org.archive.modules.CrawlMetadata
 
getJumpTarget() - Method in class org.archive.modules.ProcessResult
 
getKey() - Method in class org.archive.modules.credential.Credential
 
getKey() - Method in class org.archive.modules.credential.HtmlFormCredential
 
getKey() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
getKey() - Method in class org.archive.modules.net.CrawlHost
 
getKey() - Method in class org.archive.modules.net.CrawlServer
 
getKeyedProperties() - Method in class org.archive.modules.canonicalize.BaseRule
 
getKeyedProperties() - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
 
getKeyedProperties() - Method in class org.archive.modules.CrawlMetadata
 
getKeyedProperties() - Method in class org.archive.modules.credential.CredentialStore
 
getKeyedProperties() - Method in class org.archive.modules.deciderules.DecideRule
 
getKeyedProperties() - Method in class org.archive.modules.Processor
 
getKeyedProperties() - Method in class org.archive.modules.ProcessorChain
 
getLastHop() - Method in class org.archive.modules.CrawlURI
Convenience access to the last hop character, as a string.
getLastModified() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
getLastSuccessTime() - Method in class org.archive.modules.fetcher.FetchStats
 
getLinkCount() - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
 
getLinkHopCount() - Method in class org.archive.modules.CrawlURI
Get the link hop count.
getListLogicalOr() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
 
getLogExtraInfo() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
getLogFile() - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
getLoggerModule() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
getLoggerModule() - Method in class org.archive.modules.extractor.Extractor
 
getLoggerModule() - Method in class org.archive.modules.forms.FormLoginProcessor
 
getLogin() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
getLoginPassword() - Method in class org.archive.modules.forms.FormLoginProcessor
 
getLoginUri() - Method in class org.archive.modules.credential.HtmlFormCredential
 
getLoginUsername() - Method in class org.archive.modules.forms.FormLoginProcessor
 
getLogToFile() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
getLookup() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
getLowerBound() - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Returns the lower bound on the range of acceptable status codes.
getLowerBound() - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
 
getMaxAttributeNameLength() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getMaxAttributeValLength() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getMaxElementLength() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchFTP
 
getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getMaxFileSizeBytes() - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
getMaxFileSizeBytes() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getMaxHops() - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
 
getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchFTP
 
getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getMaxOutlinks() - Method in interface org.archive.modules.extractor.ExtractorParameters
The maximum number of outlinks to discover from any URI's content.
getMaxPathDepth() - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
 
getMaxPathLength() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getMaxRepetitions() - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
 
getMaxSegLength() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getMaxSizeToDigest() - Method in class org.archive.modules.extractor.HTTPContentDigest
 
getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorPDF
 
getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorUniversal
 
getMaxSpeculativeHops() - Method in class org.archive.modules.deciderules.TransclusionDecideRule
 
getMaxTotalBytesToWrite() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getMaxTransHops() - Method in class org.archive.modules.deciderules.TransclusionDecideRule
 
getMaxWaitForIdleMs() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getMetadata() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getMetadata() - Method in class org.archive.modules.writer.ARCWriterProcessor
 
getMetadata() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
getMetadata() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getMetadataProvider() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getModuleClass() - Method in class org.archive.state.ModuleTestBase
Returns the class of the module to test.
getName() - Method in class org.archive.modules.net.CrawlServer
 
getNamedUserAgents() - Method in class org.archive.modules.net.Robotstxt
 
getNonFatalFailures() - Method in class org.archive.modules.CrawlURI
 
getNotModifiedBytes() - Method in class org.archive.modules.fetcher.FetchStats
 
getNotModifiedUrls() - Method in class org.archive.modules.fetcher.FetchStats
 
getNovelBytes() - Method in class org.archive.modules.fetcher.FetchStats
 
getNovelUrls() - Method in class org.archive.modules.fetcher.FetchStats
 
getOnlyStoreIfWriteTagPresent() - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
 
getOperator() - Method in class org.archive.modules.CrawlMetadata
 
getOperatorContactUrl() - Method in class org.archive.modules.CrawlMetadata
 
getOperatorFrom() - Method in class org.archive.modules.CrawlMetadata
 
getOrdinal() - Method in class org.archive.modules.CrawlURI
Get the ordinal (serial number) assigned at creation.
getOrganization() - Method in class org.archive.modules.CrawlMetadata
 
getOtherDupBytes() - Method in class org.archive.modules.fetcher.FetchStats
 
getOtherDupUrls() - Method in class org.archive.modules.fetcher.FetchStats
 
getOutLinks() - Method in class org.archive.modules.CrawlURI
Returns discovered links.
getOverlayMap(String) - Method in class org.archive.modules.CrawlURI
 
getOverlayNames() - Method in class org.archive.modules.CrawlURI
 
getPassword() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
getPassword() - Method in class org.archive.modules.fetcher.FetchFTP
 
getPassword() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getPath() - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
getPath() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getPathFromSeed() - Method in class org.archive.modules.CrawlURI
 
getPathQuery(CrawlURI) - Method in class org.archive.modules.net.RobotsPolicy
 
getPattern() - Method in enum org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
 
getPayloadDigest() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
 
getPayloadDigest() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
getPolicyBasisUURI() - Method in class org.archive.modules.CrawlURI
Get the UURI that should be used as the basis of policy/overlay decisions.
getPolitenessDelay() - Method in class org.archive.modules.CrawlURI
 
getPool() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getPoolMaxActive() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getPort() - Method in class org.archive.modules.net.CrawlServer
Get the port number for this server.
getPrecedence() - Method in class org.archive.modules.CrawlURI
 
getPrefix() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getPreloadSource() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
 
getPreloadSourceUrl() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
 
getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
Return the authentication URI, either absolute or relative, that serves as a prerequisite for the passed curi.
getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
 
getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
getPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
Get the prerequisite for this URI.
getProcessors() - Method in class org.archive.modules.ProcessorChain
 
getProcessStatus() - Method in class org.archive.modules.ProcessResult
 
getProfileName() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
 
getProfileName() - Method in interface org.archive.modules.revisit.RevisitProfile
 
getProfileName() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
getProtocolVersion() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
Returns the HTTP protocol version to be used for this request.
getRealm() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
getRecordedFinishes() - Method in class org.archive.modules.fetcher.FetchStats
 
getRecordedSize() - Method in class org.archive.modules.CrawlURI
Get size of data recorded (transferred)
getRecordedSize(CrawlURI) - Static method in class org.archive.modules.Processor
 
getRecorder() - Method in class org.archive.modules.CrawlURI
Get the HTTP recorder associated with this URI.
getRecorder() - Method in class org.archive.state.ModuleTestBase
 
getRecordID() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
getRecordIDGenerator() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
getRefersToDate() - Method in class org.archive.modules.revisit.AbstractProfile
 
getRefersToRecordID() - Method in class org.archive.modules.revisit.AbstractProfile
 
getRefersToTargetURI() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
 
getRegex() - Method in class org.archive.modules.canonicalize.RegexRule
 
getRegex() - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
Use a preset if configured to do so.
getRegex() - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
 
getRegex() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
 
getRegexList() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
 
getRemaining() - Method in class org.archive.modules.fetcher.FetchStats
 
getRemoveTriggerUris() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
 
getRequestLine() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
Returns the request line of this request.
getRescheduleTime() - Method in class org.archive.modules.CrawlURI
 
getResourceDir() - Method in class org.archive.state.ModuleTestBase
Returns the location of the Java resources directory for your project.
getRevisitProfile() - Method in class org.archive.modules.CrawlURI
 
getRobotsDenials() - Method in class org.archive.modules.fetcher.FetchStats
 
getRobotsPolicy() - Method in class org.archive.modules.CrawlMetadata
Get the currently-effective RobotsPolicy, as specified by the string name and chosen from the full available map.
getRobotsPolicyName() - Method in class org.archive.modules.CrawlMetadata
 
getRobotstxt() - Method in class org.archive.modules.net.CrawlServer
 
getRules() - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
 
getRules() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
getSchedulingDirective() - Method in class org.archive.modules.CrawlURI
 
getSchemes() - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
 
getScratchDisk() - Method in interface org.archive.modules.extractor.TempDirProvider
 
getScratchDisk() - Method in class org.archive.modules.net.DefaultTempDirProvider
 
getScriptSource() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
getScriptSource() - Method in class org.archive.modules.ScriptedProcessor
 
getSeedListeners() - Method in class org.archive.modules.seeds.SeedModule
 
getSeeds() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
getSeedsAsSurtPrefixes() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
getSendConnectionClose() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSendIfModifiedSince() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSendIfNoneMatch() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSendRange() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSendReferer() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSerialNo() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getServerCache() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
getServerCache() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
getServerCache() - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
 
getServerCache() - Method in class org.archive.modules.fetcher.FetchDNS
 
getServerCache() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getServerCache() - Method in class org.archive.modules.fetcher.FetchWhois
 
getServerCache() - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
getServerCache() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getServerFor(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
Get the CrawlServer associated with name.
getServerFor(String) - Method in class org.archive.modules.net.ServerCache
 
getServerFor(UURI) - Method in class org.archive.modules.net.ServerCache
Get the CrawlServer associated with curi.
getServerIP() - Method in class org.archive.modules.CrawlURI
Returns the IP address the request was fetched against or null if unavailable.
getServerKey(CrawlURI) - Static method in class org.archive.modules.fetcher.FetchHTTP
 
getServerKey(UURI) - Static method in class org.archive.modules.net.CrawlServer
Get key to use doing lookup on server instances.
getShouldFetchBodyRule() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getShouldMasquerade() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
getShouldMasquerade() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
getShouldProcessRule() - Method in class org.archive.modules.Processor
 
getSkipIdenticalDigests() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getSocket() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
 
getSocketInputStream(Socket) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
 
getSocketOutputStream(Socket) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
 
getSocksProxyHost() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSocksProxyPort() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchFTP
 
getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchWhois
 
getSourceCodeDir() - Method in class org.archive.state.ModuleTestBase
Returns the location of the source code directory for your project.
getSourceSeeds() - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
 
getSourceTag() - Method in class org.archive.modules.CrawlURI
 
getSourceTagSeeds() - Method in class org.archive.modules.seeds.SeedModule
 
getSSLSession() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
 
getSslTrustLevel() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getStartNewFilesOnCheckpoint() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getStats() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
getStatusCodes() - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
 
getStorePaths() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getString(CrawlURI) - Method in class org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule
 
getString(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule
 
getString(CrawlURI) - Method in class org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule
 
getString(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
 
getStripRegex() - Method in class org.archive.modules.extractor.HTTPContentDigest
 
getSubstats() - Method in interface org.archive.modules.fetcher.FetchStats.HasFetchStats
 
getSubstats() - Method in class org.archive.modules.net.CrawlHost
 
getSubstats() - Method in class org.archive.modules.net.CrawlServer
 
getSuccessBytes() - Method in class org.archive.modules.fetcher.FetchStats
 
getSuffixAtEnd() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getSurtPrefixes() - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
 
getSurtsDumpFile() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
getSurtsSource() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
getSurtsSourceFile() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Deprecated.
redundant now that we have SurtPrefixedDecideRule.surtsSource
getTemplate() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
getTemplate() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getTextSource() - Method in class org.archive.modules.seeds.TextSeedModule
 
getThreadNumber() - Method in class org.archive.modules.CrawlURI
Get the number of the ToeThread responsible for processing this uri.
getTimeoutPerRegexSeconds() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
 
getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchFTP
 
getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getTooLongDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getTotalBytes() - Method in class org.archive.crawler.util.CrawledBytesHistotable
 
getTotalBytes() - Method in class org.archive.modules.fetcher.FetchStats
 
getTotalBytesWritten() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getTotalScheduled() - Method in class org.archive.modules.fetcher.FetchStats
 
getTotalUrls() - Method in class org.archive.crawler.util.CrawledBytesHistotable
 
getTransHops() - Method in class org.archive.modules.CrawlURI
Tally up the number of transitive (non-simple-link) hops at the end of this CrawlURI's pathFromSeed.
getTreatFramesAsEmbedLinks() - Method in class org.archive.modules.extractor.ExtractorHTML
 
getUnderscoreSet() - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
getUpperBound() - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Returns the upper bound on the range of acceptable status codes.
getUpperBound() - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
Returns the upper bound on the range of acceptable status codes.
getUpperBound() - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
 
getURI() - Method in class org.archive.modules.CrawlURI
 
getURICount() - Method in class org.archive.modules.Processor
Returns the number of URIs this processor has handled.
getUriRegex() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
getURIs() - Method in class org.archive.modules.extractor.PDFParser
Get a list of URIs retrieved from the PDF during the extractURIs operation.
getURL(String, String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
Overwrite handling of discovered URIs.
getUrlPattern() - Method in class org.archive.modules.extractor.ExtractorSitemap
 
getUseHeaderLength() - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
 
getUseHTTP11() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getUsePreset() - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
 
getUserAgent() - Method in class org.archive.modules.CrawlMetadata
 
getUserAgent() - Method in class org.archive.modules.CrawlURI
Get the user agent to use for crawling this URI.
getUserAgent() - Method in interface org.archive.modules.fetcher.UserAgentProvider
 
getUserAgentProvider() - Method in class org.archive.modules.fetcher.FetchHTTP
 
getUserAgentTemplate() - Method in class org.archive.modules.CrawlMetadata
 
getUsername() - Method in class org.archive.modules.fetcher.FetchFTP
 
getUsername() - Method in class org.archive.modules.fetcher.FetchSFTP
 
getUURI() - Method in class org.archive.modules.CrawlURI
 
getValidator() - Method in class org.archive.modules.CrawlMetadata
 
getValidTestData() - Method in class org.archive.modules.extractor.StringExtractorTestBase
Returns an array of valid test data pairs.
getVia() - Method in class org.archive.modules.CrawlURI
 
getViaContext() - Method in class org.archive.modules.CrawlURI
 
getWarcHeaders() - Method in class org.archive.modules.revisit.AbstractProfile
 
getWarcHeaders() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
 
getWarcHeaders() - Method in interface org.archive.modules.revisit.RevisitProfile
 
getWarcHeaders() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
getWhoisQuery(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
 
getWhoisServer(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
 
getWriteBufferSize() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
getWriteMetadata() - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
getWriteRequests() - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
groovyTemplate() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
groovyTemplates - Variable in class org.archive.modules.extractor.ExtractorMultipleRegex
 
GroupList(MatchResult) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.GroupList
 
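Example (illustrative only): the getTransHops() entry above describes tallying the transitive (non-simple-link) hops at the end of a CrawlURI's pathFromSeed. The following is a minimal sketch of that idea, not the library's implementation. It assumes that the character 'L' marks a simple link hop and that every other character is a transitive hop. The class and method names are hypothetical.

```java
public class TransHopsSketch {

    /**
     * Counts the trailing hops in a path-from-seed string that are not
     * simple links. Assumption: 'L' denotes a simple link hop; any other
     * character denotes a transitive hop.
     */
    static int countTrailingTransHops(String pathFromSeed) {
        int count = 0;
        // Walk backwards from the end until the first simple link hop.
        for (int i = pathFromSeed.length() - 1; i >= 0; i--) {
            if (pathFromSeed.charAt(i) == 'L') {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTrailingTransHops("LLXE")); // prints 2
        System.out.println(countTrailingTransHops("LLL"));  // prints 0
    }
}
```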

H

handle401(HttpResponse, CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
Server is looking for basic/digest auth credentials (RFC2617).
harvester - Variable in class org.archive.modules.writer.Kw3WriterProcessor
Name of the harvester that is used for the web harvesting.
HARVESTER_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
hasBeenLinkExtracted() - Method in class org.archive.modules.CrawlURI
If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content.
hasBeenLookedUp() - Method in class org.archive.modules.net.CrawlHost
Return true if the IP for this host has been looked up.
hasContentDigestHistory() - Method in class org.archive.modules.CrawlURI
 
hasCredentials() - Method in class org.archive.modules.CrawlURI
 
hasCredentials() - Method in class org.archive.modules.net.CrawlServer
 
hasDirectives - Variable in class org.archive.modules.net.RobotsDirectives
 
hasErrors - Variable in class org.archive.modules.net.Robotstxt
 
hashCode() - Method in class org.archive.modules.CrawlURI
 
hashCode() - Method in class org.archive.modules.extractor.LinkContext
 
hashCode() - Method in class org.archive.modules.net.CrawlHost
 
hashCode() - Method in class org.archive.modules.net.CrawlServer
 
hasHttpAuthenticationCredential(CrawlURI) - Static method in class org.archive.modules.Processor
 
hasIdenticalDigest(CrawlURI) - Static method in class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
Utility method for testing if a CrawlURI's revisit profile matches an identical payload digest.
hasIdenticalDigest(CrawlURI) - Static method in class org.archive.modules.recrawl.FetchHistoryProcessor
Utility method for testing if a CrawlURI's last two history entries (one being the most recent fetch) have identical content-digest information.
hasPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
 
hasPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
 
hasPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
hasPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
 
hasRfc2617Credential() - Method in class org.archive.modules.CrawlURI
 
HasViaDecideRule - Class in org.archive.modules.deciderules
Rule applies the configured decision for any URI which has a 'via' (essentially, any URI that was not a seed or some kinds of mid-crawl adds).
HasViaDecideRule() - Constructor for class org.archive.modules.deciderules.HasViaDecideRule
Usual constructor.
hasWriteTag(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
 
haveOverlayNamesBeenSet() - Method in class org.archive.modules.CrawlURI
 
haveSeen(int, int) - Method in class org.archive.modules.extractor.PDFParser
Indicates, based on a PDFObject's generation/id pair, whether the parser has already encountered this object (or a reference to it), so that it does not loop infinitely on cycles within the PDF.
HEADER_LENGTH_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
HEADER_MD5_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
HEADER_PREDICTS_MISSING - Static variable in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
 
HEADER_TRUNC - Static variable in interface org.archive.modules.CoreAttributeConstants
 
HEADER_TRUNC - Static variable in class org.archive.modules.fetcher.FetchErrors
 
HIGH - Static variable in class org.archive.modules.SchedulingConstants
High scheduling priority.
HIGHEST - Static variable in class org.archive.modules.SchedulingConstants
Highest scheduling priority.
HISTORY_DB_CONFIG - Static variable in class org.archive.modules.recrawl.PersistProcessor
 
historyDb - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
 
historyDb - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
 
historyDbConfig - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
 
historyDbConfig() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
historyDbName - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
 
historyDbName - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
 
historyLength - Variable in class org.archive.modules.recrawl.FetchHistoryProcessor
Desired history array length.
historyRealloc(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
Get or create proper-sized history array
holder - Variable in class org.archive.modules.CrawlURI
 
holderCost - Variable in class org.archive.modules.CrawlURI
spot for an integer cost to be placed by external facility (frontier).
holderKey - Variable in class org.archive.modules.CrawlURI
 
Hop - Enum in org.archive.modules.extractor
The kind of "hop" from one URI to another.
HopCrossesAssignmentLevelDomainDecideRule - Class in org.archive.modules.deciderules
Applies its decision if the current URI differs in that portion of its hostname/domain that is assigned/sold by registrars: its 'assignment-level-domain' (ALD), also known as 'public suffix' or, in previous Heritrix versions, 'topmost assigned SURT'.
HopCrossesAssignmentLevelDomainDecideRule() - Constructor for class org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule
 
HopsPathMatchesRegexDecideRule - Class in org.archive.modules.deciderules
Rule applies configured decision to any CrawlURIs whose 'hops-path' (string like "LLXE" etc.) matches the supplied regex.
HopsPathMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule
Usual constructor.
hopString - Variable in enum org.archive.modules.extractor.Hop
 
hostKeys() - Method in class org.archive.modules.fetcher.DefaultServerCache
 
hostKeys() - Method in class org.archive.modules.net.ServerCache
 
hostMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
This list is grouped in pairs.
HostResolver - Interface in org.archive.modules.fetcher
 
hosts - Variable in class org.archive.modules.fetcher.DefaultServerCache
hostname -> CrawlHost.
hostSubset(String) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
HTMLForm - Class in org.archive.modules.forms
Simple representation of a discovered HTML Form.
HTMLForm() - Constructor for class org.archive.modules.forms.HTMLForm
 
HTMLForm.FormInput - Class in org.archive.modules.forms
 
HTMLForm.NameValue - Class in org.archive.modules.forms
 
HtmlFormCredential - Class in org.archive.modules.credential
Credential that holds everything needed to do a GET/POST to an HTML form.
HtmlFormCredential() - Constructor for class org.archive.modules.credential.HtmlFormCredential
Constructor.
HtmlFormCredential.Method - Enum in org.archive.modules.credential
 
HTMLLinkContext - Class in org.archive.modules.extractor
XPath-like context for HTML discovered URIs.
HTMLLinkContext(String) - Constructor for class org.archive.modules.extractor.HTMLLinkContext
Constructor.
HTMLLinkContext(CharSequence, CharSequence) - Constructor for class org.archive.modules.extractor.HTMLLinkContext
 
HTTP_BIND_ADDRESS - Static variable in class org.archive.modules.fetcher.FetchHTTP
 
HTTP_SCHEME - Static variable in class org.archive.modules.fetcher.FetchHTTP
 
HttpAuthenticationCredential - Class in org.archive.modules.credential
A Basic/Digest HTTP Authentication (RFC2617) credential.
HttpAuthenticationCredential() - Constructor for class org.archive.modules.credential.HttpAuthenticationCredential
Constructor.
httpClientBuilder - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
httpClientContext - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
HTTPContentDigest - Class in org.archive.modules.extractor
A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
HTTPContentDigest() - Constructor for class org.archive.modules.extractor.HTTPContentDigest
Constructor.
httpMethod - Variable in class org.archive.modules.credential.HtmlFormCredential
Deprecated.
ignored, always POST
HttpRequestRecordBuilder - Class in org.archive.modules.warc
 
HttpRequestRecordBuilder() - Constructor for class org.archive.modules.warc.HttpRequestRecordBuilder
 
HttpResponseRecordBuilder - Class in org.archive.modules.warc
 
HttpResponseRecordBuilder() - Constructor for class org.archive.modules.warc.HttpResponseRecordBuilder
 
HTTPS_SCHEME - Static variable in class org.archive.modules.fetcher.FetchHTTP
 
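Example (illustrative only): the HopsPathMatchesRegexDecideRule entry above describes matching a CrawlURI's 'hops-path' (a string like "LLXE") against a supplied regex. The following is a minimal sketch of such a match using java.util.regex. The pattern, the class name, and the choice of whole-string matching are assumptions made for illustration, and the sketch does not use the Heritrix classes themselves.

```java
import java.util.regex.Pattern;

public class HopsPathRegexSketch {

    // Hypothetical pattern: hops-paths that end in two or more
    // consecutive hops other than 'L'.
    static final Pattern TRAILING_NON_LINK = Pattern.compile(".*[^L]{2,}");

    /** Returns true if the whole hops-path string matches the pattern. */
    static boolean matchesHopsPath(String hopsPath) {
        return TRAILING_NON_LINK.matcher(hopsPath).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesHopsPath("LLXE")); // prints true
        System.out.println(matchesHopsPath("LLE"));  // prints false
    }
}
```

A rule of this kind would then apply its configured decision only to the URIs for which the match succeeds.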

I

IdenticalDigestDecideRule - Class in org.archive.modules.deciderules.recrawl
Rule applies configured decision to any CrawlURIs whose revisit profile is set with a profile matching WARCConstants.PROFILE_REVISIT_IDENTICAL_DIGEST
IdenticalDigestDecideRule() - Constructor for class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
Usual constructor.
IdenticalPayloadDigestRevisit - Class in org.archive.modules.revisit
 
IdenticalPayloadDigestRevisit(String) - Constructor for class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
Minimal constructor.
IgnoreRobotsPolicy - Class in org.archive.modules.net
Policy to ignore robots.
IgnoreRobotsPolicy() - Constructor for class org.archive.modules.net.IgnoreRobotsPolicy
 
IMG_DATA_ORIGINAL - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
IMG_DATA_ORIGINAL_SET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
IMG_DATA_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
IMG_DATA_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
IMG_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
IMG_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
includesRetireDirective() - Method in class org.archive.modules.CrawlURI
 
incrementConsecutiveConnectionErrors() - Method in class org.archive.modules.net.CrawlServer
 
incrementDeferrals() - Method in class org.archive.modules.CrawlURI
Increment the deferral count.
incrementDiscardedOutLinks() - Method in class org.archive.modules.CrawlURI
 
incrementFetchAttempts() - Method in class org.archive.modules.CrawlURI
Increment the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
indexOf(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
INFERRED_MISC - Static variable in class org.archive.modules.extractor.LinkContext
Stand-in value for inferred urls without other context.
inferRootPage - Variable in class org.archive.modules.extractor.ExtractorHTTP
should all HTTP URIs be used to infer a link to the site's root?
inheritFrom(CrawlURI) - Method in class org.archive.modules.CrawlURI
Inherit (copy) the relevant keys-values from the ancestor.
initHttpClientBuilder() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
initialize() - Method in class org.archive.modules.extractor.PDFParser
Initialize opens the document for reading.
initializeFromReader(Reader) - Method in class org.archive.modules.net.Robotstxt
 
initOutputStream(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
Get the OutputStream for the file to write to.
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.AcceptDecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.RejectDecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.SeedAcceptDecideRule
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
Actually extracts links.
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorCSS
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorDOC
Processes a word document and extracts any hyperlinks from it.
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorJS
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDF
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorRobotsTxt
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSitemap
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorUniversal
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorXML
 
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.TrapSuppressExtractor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.extractor.Extractor
Processes the given URI.
innerProcess(CrawlURI) - Method in class org.archive.modules.extractor.HTTPContentDigest
 
innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchDNS
 
innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchFTP
Processes the given URI.
innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
 
innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchSFTP
Processes the given URI.
innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
 
innerProcess(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.Processor
Actually performs the process.
innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
 
innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
 
innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistStoreProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.ScriptedProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
innerProcessResult(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
 
innerProcessResult(CrawlURI) - Method in class org.archive.modules.Processor
 
innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.ARCWriterProcessor
Writes a CrawlURI and its associated data to store file.
innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
Writes a CrawlURI and its associated data to store file.
innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
innerRejectProcess(CrawlURI) - Method in class org.archive.modules.Processor
Invoked after a URI has been rejected.
innerRejectProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
INSTANCE - Static variable in class org.archive.modules.net.IgnoreRobotsPolicy
 
INSTANCE - Static variable in class org.archive.modules.net.ObeyRobotsPolicy
 
INSTANCE - Static variable in class org.archive.modules.net.RobotsTxtOnlyPolicy
 
invert(DecideResult) - Static method in enum org.archive.modules.deciderules.DecideResult
 
IP_ADDRESS - Static variable in class org.archive.modules.extractor.ExtractorUniversal
Matches any string that begins with http:// or https:// followed by something that looks like an IP address (four numbers, none longer than 3 characters, separated by 3 dots).
IP_ADDRESS_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
IP_ADDRESS_REGEX - Static variable in class org.archive.modules.fetcher.FetchWhois
 
IP_NEVER_EXPIRES - Static variable in class org.archive.modules.net.CrawlHost
Flag value indicating always-valid IP
IP_NEVER_LOOKED_UP - Static variable in class org.archive.modules.net.CrawlHost
Flag value indicating an IP has not yet been looked up
IpAddressSetDecideRule - Class in org.archive.modules.deciderules
IpAddressSetDecideRule must be used with org.archive.crawler.prefetch.Preselector#setRecheckScope(boolean) set to true because it relies on Heritrix's DNS lookup to establish the IP address for a URI before it can run.
IpAddressSetDecideRule() - Constructor for class org.archive.modules.deciderules.IpAddressSetDecideRule
 
is2XXSuccess() - Method in class org.archive.modules.CrawlURI
 
isCheckpointRecovery - Variable in class org.archive.modules.fetcher.BdbCookieStore
are we a checkpoint recovery? (in which case, reuse stored cookie data?)
isCheckpointRecovery - Variable in class org.archive.modules.net.BdbServerCache
 
isCookieCountMaxedForDomain(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
isDisableSNI() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
isEmpty() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
isEnableLenientExtraction() - Method in class org.archive.modules.extractor.ExtractorSitemap
 
isEveryTime() - Method in class org.archive.modules.credential.Credential
 
isEveryTime() - Method in class org.archive.modules.credential.HtmlFormCredential
 
isEveryTime() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
isHtmlExpectedHere(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
Test whether this HTML is so unexpected (eg in place of a GIF URI) that it shouldn't be scanned for links.
isHttpTransaction() - Method in class org.archive.modules.CrawlURI
Return true if this is an HTTP transaction.
isLocation() - Method in class org.archive.modules.CrawlURI
 
isMultipleFormSubmitInputs(String) - Method in class org.archive.modules.forms.HTMLForm
 
isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.CustomRobotsPolicy
 
isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
isolateThreads - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine.
isolateThreads - Variable in class org.archive.modules.ScriptedProcessor
Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine.
isPost() - Method in class org.archive.modules.credential.Credential
 
isPost() - Method in class org.archive.modules.credential.HtmlFormCredential
 
isPost() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
isPrerequisite() - Method in class org.archive.modules.CrawlURI
Returns true if this CrawlURI is a prerequisite.
isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
 
isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
 
isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
isQuadAddress(CrawlURI, String, CrawlHost) - Method in class org.archive.modules.fetcher.FetchDNS
 
isRevisit() - Method in class org.archive.modules.CrawlURI
Indicates if this CrawlURI object has been deemed a revisit.
isRobotsExpired(int) - Method in class org.archive.modules.net.CrawlServer
Is the robots policy expired.
isRunning - Variable in class org.archive.modules.deciderules.DecideRuleSequence
 
isRunning() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
isRunning - Variable in class org.archive.modules.fetcher.AbstractCookieStore
 
isRunning() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
isRunning() - Method in class org.archive.modules.fetcher.FetchWhois
 
isRunning - Variable in class org.archive.modules.net.BdbServerCache
 
isRunning() - Method in class org.archive.modules.net.BdbServerCache
 
isRunning - Variable in class org.archive.modules.Processor
 
isRunning() - Method in class org.archive.modules.Processor
 
isRunning - Variable in class org.archive.modules.ProcessorChain
 
isRunning() - Method in class org.archive.modules.ProcessorChain
 
isRunning() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
isRunning() - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
isRunning() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
 
isSeed() - Method in class org.archive.modules.CrawlURI
 
isSuccess() - Method in class org.archive.modules.CrawlURI
Ask this URI if it was a success or not.
isSuccess(CrawlURI) - Static method in class org.archive.modules.Processor
 
isValidRobots() - Method in class org.archive.modules.net.CrawlServer
If true then valid robots.txt information has been retrieved.
iterator() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
iterator() - Method in class org.archive.modules.ProcessorChain
 

J

JAVASCRIPT_STRING_EXTRACTOR - Static variable in class org.archive.modules.extractor.ExtractorJS
 
JerichoExtractorHTML - Class in org.archive.modules.extractor
Improved link-extraction from an HTML content-body using jericho-html parser.
JerichoExtractorHTML() - Constructor for class org.archive.modules.extractor.JerichoExtractorHTML
 
jobName - Variable in class org.archive.modules.CrawlMetadata
 
JS_MISC - Static variable in class org.archive.modules.extractor.LinkContext
Stand-in value for JavaScript-discovered urls without other context.
JSSTRING - Static variable in class org.archive.modules.extractor.ExtractorSWF
 
jump(String) - Static method in class org.archive.modules.ProcessResult
 

K

kp - Variable in class org.archive.modules.canonicalize.BaseRule
 
kp - Variable in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
 
kp - Variable in class org.archive.modules.CrawlMetadata
 
kp - Variable in class org.archive.modules.credential.CredentialStore
 
kp - Variable in class org.archive.modules.deciderules.DecideRule
 
kp - Variable in class org.archive.modules.Processor
 
kp - Variable in class org.archive.modules.ProcessorChain
 
Kw3Constants - Interface in org.archive.modules.writer
 
Kw3WriterProcessor - Class in org.archive.modules.writer
Processor module that writes the results of successful fetches to files on disk.
Kw3WriterProcessor() - Constructor for class org.archive.modules.writer.Kw3WriterProcessor
Constructor.

L

lastIndexOf(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
lastModified - Variable in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
lastSuccessTime - Variable in class org.archive.modules.fetcher.FetchStats
 
LENGTH_TRUNC - Static variable in interface org.archive.modules.CoreAttributeConstants
 
LENGTH_TRUNC - Static variable in class org.archive.modules.fetcher.FetchErrors
 
LimitedCookieStoreFacade(List<Cookie>) - Constructor for class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
 
LinkContext - Class in org.archive.modules.extractor
The context of link discovery.
LinkContext() - Constructor for class org.archive.modules.extractor.LinkContext
 
LinkContext.SimpleLinkContext - Class in org.archive.modules.extractor
Class for representing handy default LinkContext values.
linkExtractorFinished() - Method in class org.archive.modules.CrawlURI
Note that link extraction has been performed on this CrawlURI.
listIterator() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
listIterator(int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
load(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractContentDigestHistory
Looks up the history by key persistKeyFor(curi) and loads it into curi.getContentDigestHistory().
load(CrawlURI) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
loadCookies(ConfigFile) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
loadCookies(Reader) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
log - Variable in class org.archive.modules.recrawl.PersistLogProcessor
 
logExtraInfo - Variable in class org.archive.modules.deciderules.DecideRuleSequence
Whether to include the "extra info" field for each entry in crawl.log.
logFile - Variable in class org.archive.modules.recrawl.PersistLogProcessor
 
logger - Static variable in class org.archive.modules.canonicalize.RegexRule
 
logger - Static variable in class org.archive.modules.extractor.AggressiveExtractorHTML
 
logger - Variable in class org.archive.modules.fetcher.AbstractCookieStore
 
loggerModule - Variable in class org.archive.modules.deciderules.DecideRuleSequence
 
loggerModule - Variable in class org.archive.modules.extractor.Extractor
 
loggerModule - Variable in class org.archive.modules.forms.FormLoginProcessor
 
login - Variable in class org.archive.modules.credential.HttpAuthenticationCredential
Login.
loginUri - Variable in class org.archive.modules.credential.HtmlFormCredential
Full URI of page that contains the HTML login form we're to apply these credentials to: E.g.
logUriError(URIException, UURI, CharSequence) - Method in class org.archive.modules.extractor.Extractor
 
logUriError(URIException, UURI, CharSequence) - Method in interface org.archive.modules.extractor.UriErrorLoggerModule
 
longestPrefixLength(ConcurrentSkipListSet<String>, String) - Method in class org.archive.modules.net.RobotsDirectives
 
lookup - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
lookup(InetAddress) - Method in interface org.archive.modules.deciderules.ExternalGeoLookupInterface
 
lookupTable(String[]) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
 
LowercaseRule - Class in org.archive.modules.canonicalize
Lowercases the URL.
LowercaseRule() - Constructor for class org.archive.modules.canonicalize.LowercaseRule
Constructor.

M

main(String[]) - Static method in class org.archive.modules.extractor.ExtractorHTML
 
main(String[]) - Static method in class org.archive.modules.extractor.PDFParser
 
main(String[]) - Static method in class org.archive.modules.recrawl.PersistProcessor
Utility main for importing a log into a BDB-JE environment or moving a database between environments (2 arguments), or simply dumping a log to stderr in a more readable format (1 argument).
makeBindings(Map<String, ExtractorMultipleRegex.MatchList>, String[], int) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
makeCrawlURI(String) - Method in class org.archive.state.ModuleTestBase
 
makeData(String, String) - Method in class org.archive.modules.extractor.StringExtractorTestBase
 
makeDirty() - Method in class org.archive.modules.net.CrawlHost
 
makeDirty() - Method in class org.archive.modules.net.CrawlServer
 
makeExtractor() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
Subclasses should return an Extractor instance to test.
makeHeritable(String) - Method in class org.archive.modules.CrawlURI
Make the given key 'heritable', meaning its value will be added to descendant CrawlURIs.
makeModule() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
 
makeModule() - Method in class org.archive.state.ModuleTestBase
Return an example instance of the module.
makeNonHeritable(String) - Method in class org.archive.modules.CrawlURI
Make the given key non-'heritable', meaning its value will not be added to descendant CrawlURIs.
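A minimal sketch of the 'heritable' idea described for makeHeritable and makeNonHeritable, using a plain map rather than a CrawlURI. The class HeritableData and the method createChild are hypothetical, and the choice to carry the heritable-key set itself forward to descendants is an assumption of this sketch, not something the index states.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not Heritrix code.
class HeritableData {
    final Map<String, Object> data = new HashMap<>();
    final Set<String> heritableKeys = new HashSet<>();

    void makeHeritable(String key) { heritableKeys.add(key); }

    void makeNonHeritable(String key) { heritableKeys.remove(key); }

    /** Creates a descendant that receives only the values stored under heritable keys. */
    HeritableData createChild() {
        HeritableData child = new HeritableData();
        for (String key : heritableKeys) {
            if (data.containsKey(key)) {
                child.data.put(key, data.get(key));
            }
        }
        // Assumption: heritability itself is passed on, so later descendants inherit too.
        child.heritableKeys.addAll(heritableKeys);
        return child;
    }
}
```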
makeTempDir() - Static method in class org.archive.modules.net.DefaultTempDirProvider
 
makeWhoisUrl(String, String) - Method in class org.archive.modules.fetcher.FetchWhois
 
MANIFEST_MISC - Static variable in class org.archive.modules.extractor.LinkContext
Stand-in value for manifest urls without other context.
markAsSeen(int, int) - Method in class org.archive.modules.extractor.PDFParser
Note that an object (id/generation pair) has been seen by this parser so that it can be handled differently when it is encountered again.
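One way to track "seen" id/generation pairs is sketched below. It is not PDFParser's actual bookkeeping; the class SeenObjects, the method haveSeen, and the packing of both ints into a single long are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not Heritrix code.
class SeenObjects {
    private final Set<Long> seen = new HashSet<>();

    /** Packs the pair into one long: id in the high 32 bits, generation in the low 32 bits. */
    private static long key(int id, int generation) {
        return ((long) id << 32) | (generation & 0xFFFFFFFFL);
    }

    /** Records the pair; returns true only the first time the pair is marked. */
    boolean markAsSeen(int id, int generation) {
        return seen.add(key(id, generation));
    }

    boolean haveSeen(int id, int generation) {
        return seen.contains(key(id, generation));
    }
}
```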
markPrerequisite(String) - Method in class org.archive.modules.CrawlURI
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
MatchesFilePatternDecideRule - Class in org.archive.modules.deciderules
Compares suffix of a passed CrawlURI, UURI, or String against a regular expression pattern, applying its configured decision to all matches.
MatchesFilePatternDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesFilePatternDecideRule
Usual constructor.
MatchesFilePatternDecideRule.Preset - Enum in org.archive.modules.deciderules
 
MatchesListRegexDecideRule - Class in org.archive.modules.deciderules
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexes.
MatchesListRegexDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesListRegexDecideRule
Usual constructor.
MatchesRegexDecideRule - Class in org.archive.modules.deciderules
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regex.
MatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesRegexDecideRule
Usual constructor.
MatchesStatusCodeDecideRule - Class in org.archive.modules.deciderules
Provides a rule that returns "true" for any CrawlURIs which have a fetch status code that falls within the provided inclusive range.
MatchesStatusCodeDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Creates a new MatchesStatusCodeDecideRule instance.
MatchList(String, CharSequence) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.MatchList
 
MatchList(ExtractorMultipleRegex.GroupList...) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.MatchList
 
MAX_COOKIES_FOR_DOMAIN - Static variable in class org.archive.modules.fetcher.AbstractCookieStore
 
MAX_SIZE - Static variable in class org.archive.modules.net.Robotstxt
 
maxFileSizeBytes - Variable in class org.archive.modules.writer.Kw3WriterProcessor
Max size for each file.
maxFileSizeBytes - Variable in class org.archive.modules.writer.WriterPoolProcessor
Max size of each file.
maxPathLength - Variable in class org.archive.modules.writer.MirrorWriterProcessor
Maximum file system path length.
maxSegLength - Variable in class org.archive.modules.writer.MirrorWriterProcessor
Maximum file system path segment length.
maxTotalBytesToWrite - Variable in class org.archive.modules.writer.WriterPoolProcessor
Total file bytes to write to disk.
maxWaitForIdleMs - Variable in class org.archive.modules.writer.WriterPoolProcessor
Maximum time to wait on idle writer before (possibly) creating an additional instance.
maybeAddConditionalGetHeader(boolean, String, String) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
Add the given conditional-GET header, if the setting is enabled and a suitable value is available in the URI history.
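The guard described for maybeAddConditionalGetHeader can be sketched with a plain header map. The class ConditionalGetSketch and its method signature are hypothetical and do not mirror FetchHTTPRequest; only the HTTP header names used in the usage note (If-None-Match, If-Modified-Since) are standard.

```java
import java.util.Map;

// Hypothetical illustration; not Heritrix code.
class ConditionalGetSketch {
    /**
     * Adds the header only when the feature is enabled and the history
     * supplied a usable value; otherwise the request is left unchanged.
     */
    static void maybeAddHeader(Map<String, String> requestHeaders, boolean enabled,
                               String headerName, String valueFromHistory) {
        if (enabled && valueFromHistory != null && !valueFromHistory.isEmpty()) {
            requestHeaders.put(headerName, valueFromHistory);
        }
    }
}
```

A caller would typically pass a previously recorded ETag as the value for If-None-Match, or a previously recorded Last-Modified date as the value for If-Modified-Since.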
maybeMidfetchAbort(CrawlURI, AbstractExecutionAwareRequest) - Method in class org.archive.modules.fetcher.FetchHTTP
 
MEDIUM - Static variable in class org.archive.modules.SchedulingConstants
Medium priority.
META - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
META_HREF - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
metadata - Variable in class org.archive.modules.extractor.ExtractorHTML
CrawlMetadata provides the robots honoring policy to use when considering a robots META tag.
MetadataRecordBuilder - Class in org.archive.modules.warc
 
MetadataRecordBuilder() - Constructor for class org.archive.modules.warc.MetadataRecordBuilder
 
method - Variable in class org.archive.modules.forms.HTMLForm
 
MIN_ROBOTS_RETRIES - Static variable in class org.archive.modules.net.CrawlServer
only check if robots-fetch is perhaps superfluous after this many tries
MirrorWriterProcessor - Class in org.archive.modules.writer
Processor module that writes the results of successful fetches to files on disk.
MirrorWriterProcessor() - Constructor for class org.archive.modules.writer.MirrorWriterProcessor
 
ModuleTestBase - Class in org.archive.state
Base class for unit testing Module implementations.
ModuleTestBase() - Constructor for class org.archive.state.ModuleTestBase
Magical constructor that attempts to auto-create static key field descriptions for your module class.
MostFavoredRobotsPolicy - Class in org.archive.modules.net
Follow a most-favored robots policy -- allowing a URL if either the conventionally-configured User-Agent, or any of a number of alternate User-Agents (from the candidateUserAgents list) would be allowed.
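The "most-favored" decision amounts to a logical OR over user agents, which can be sketched as below. The class MostFavoredSketch and the robotsAllows predicate (user agent, URL) are hypothetical stand-ins for whatever evaluates the robots rules; this is not the Heritrix implementation.

```java
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical illustration; not Heritrix code.
class MostFavoredSketch {
    /**
     * Returns true if the primary user agent, or any candidate user agent,
     * would be allowed to fetch the URL under the given robots rules.
     */
    static boolean allows(String primaryUserAgent, List<String> candidateUserAgents,
                          String url, BiPredicate<String, String> robotsAllows) {
        if (robotsAllows.test(primaryUserAgent, url)) {
            return true;
        }
        for (String userAgent : candidateUserAgents) {
            if (robotsAllows.test(userAgent, url)) {
                return true;
            }
        }
        return false;
    }
}
```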
MostFavoredRobotsPolicy() - Constructor for class org.archive.modules.net.MostFavoredRobotsPolicy
 

N

name - Variable in class org.archive.modules.forms.HTMLForm.FormInput
 
name - Variable in class org.archive.modules.forms.HTMLForm.NameValue
 
namedUserAgents - Variable in class org.archive.modules.net.Robotstxt
 
NameValue(String, String) - Constructor for class org.archive.modules.forms.HTMLForm.NameValue
 
NAVLINK_MISC - Static variable in class org.archive.modules.extractor.LinkContext
Stand-in value for navlink urls without other context.
newEngine() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
Create a new ScriptEngine instance, preloaded with any supplied source file and the variables 'self' (this ScriptedDecideRule) and 'context' (the ApplicationContext).
newEngine() - Method in class org.archive.modules.ScriptedProcessor
Create a new ScriptEngine instance, preloaded with any supplied source file and the variables 'self' (this ScriptedProcessor) and 'context' (the ApplicationContext).
NO_DIRECTIVES - Static variable in class org.archive.modules.net.Robotstxt
 
NO_ROBOTS - Static variable in class org.archive.modules.net.Robotstxt
empty, reusable instance for all sites providing no rules
nonseedLine(String) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Consider nonseed lines as possible SURT prefix directives.
nonseedLine(String) - Method in interface org.archive.modules.seeds.SeedListener
 
nonseedLine(String) - Method in class org.archive.modules.seeds.TextSeedModule
Handle a read line that is not a seed, but may still have meaning to seed-consumers (such as scoping beans).
NORMAL - Static variable in class org.archive.modules.SchedulingConstants
Normal/low priority.
normalizeHost(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
NotMatchesFilePatternDecideRule - Class in org.archive.modules.deciderules
Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regex.
NotMatchesFilePatternDecideRule() - Constructor for class org.archive.modules.deciderules.NotMatchesFilePatternDecideRule
Usual constructor.
NotMatchesListRegexDecideRule - Class in org.archive.modules.deciderules
Rule applies configured decision to any URIs which do *not* match the supplied regex.
NotMatchesListRegexDecideRule() - Constructor for class org.archive.modules.deciderules.NotMatchesListRegexDecideRule
Usual constructor.
NotMatchesRegexDecideRule - Class in org.archive.modules.deciderules
Rule applies configured decision to any URIs which do *not* match the supplied regex.
NotMatchesRegexDecideRule(String) - Constructor for class org.archive.modules.deciderules.NotMatchesRegexDecideRule
Usual constructor.
NotMatchesStatusCodeDecideRule - Class in org.archive.modules.deciderules
Provides a rule that returns "true" for any CrawlURIs which have a fetch status code that does not fall within the provided inclusive range.
NotMatchesStatusCodeDecideRule() - Constructor for class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
 
NOTMODIFIED - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
NOTMODIFIEDCOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
NotOnDomainsDecideRule - Class in org.archive.modules.deciderules.surt
Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set.
NotOnDomainsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.NotOnDomainsDecideRule
Usual constructor.
NotOnHostsDecideRule - Class in org.archive.modules.deciderules.surt
Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set.
NotOnHostsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.NotOnHostsDecideRule
Usual constructor.
NotSurtPrefixedDecideRule - Class in org.archive.modules.deciderules.surt
Rule applies configured decision to any URIs that, when expressed in SURT form, do *not* begin with one of the prefixes in the configured set.
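For readers unfamiliar with SURT form, the sketch below shows a deliberately simplified conversion (scheme, host, and path only; no port, userinfo, or query handling) and the prefix test that the SURT-based rules rely on. The class SurtSketch and both method names are hypothetical, and the real Heritrix conversion handles more cases than this.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical, simplified illustration; not Heritrix code.
class SurtSketch {
    /** Rewrites scheme://host/path so the host labels appear in reverse order. */
    static String toSurt(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder sb = new StringBuilder(uri.getScheme()).append("://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        sb.append(')');
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        return sb.toString();
    }

    /** True if the URL's SURT form begins with any of the configured prefixes. */
    static boolean hasPrefix(String url, Iterable<String> surtPrefixes) {
        String surt = toSurt(url);
        for (String prefix : surtPrefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, http://www.example.com/a becomes http://(com,example,www,)/a, so the prefix http://(com,example, covers every host under example.com. A "Not" rule would apply its decision when hasPrefix returns false.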
NotSurtPrefixedDecideRule() - Constructor for class org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule
Usual constructor.
NOVEL - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
NOVELCOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
numberOfCURIsHandled - Variable in class org.archive.modules.extractor.ExtractorJS
 
numberOfCURIsHandled - Variable in class org.archive.modules.extractor.TrapSuppressExtractor
 
numberOfCURIsSuppressed - Variable in class org.archive.modules.extractor.TrapSuppressExtractor
 
numberOfFormsProcessed - Variable in class org.archive.modules.extractor.JerichoExtractorHTML
 
numberOfLinksExtracted - Variable in class org.archive.modules.extractor.Extractor
 

O

obeyMetaRobotsNofollow - Variable in class org.archive.modules.net.CustomRobotsPolicy
whether to obey the 'nofollow' directive in an HTML META ROBOTS element
obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.CustomRobotsPolicy
 
obeyMetaRobotsNofollow - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
whether to obey the 'nofollow' directive in an HTML META ROBOTS element
obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.IgnoreRobotsPolicy
 
obeyMetaRobotsNofollow - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
whether to obey the 'nofollow' directive in an HTML META ROBOTS element
obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.ObeyRobotsPolicy
 
obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.RobotsPolicy
 
obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.RobotsTxtOnlyPolicy
 
ObeyRobotsPolicy - Class in org.archive.modules.net
Classic obey-robots-as-declared policy.
ObeyRobotsPolicy() - Constructor for class org.archive.modules.net.ObeyRobotsPolicy
 
obtainReader() - Method in class org.archive.modules.seeds.TextSeedModule
 
onApplicationEvent(ApplicationEvent) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
OnDomainsDecideRule - Class in org.archive.modules.deciderules.surt
Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.
OnDomainsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.OnDomainsDecideRule
Usual constructor.
OnHostsDecideRule - Class in org.archive.modules.deciderules.surt
Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.
OnHostsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.OnHostsDecideRule
Usual constructor.
onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.AcceptDecideRule
 
onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
 
onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
 
onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.RejectDecideRule
 
onlyStoreIfWriteTagPresent - Variable in class org.archive.modules.recrawl.AbstractPersistProcessor
 
operator - Variable in class org.archive.modules.CrawlMetadata
 
ordinal - Variable in class org.archive.modules.CrawlURI
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering.
org.archive.crawler.util - package org.archive.crawler.util
 
org.archive.modules - package org.archive.modules
The beginnings of a refactored settings framework.
org.archive.modules.canonicalize - package org.archive.modules.canonicalize
 
org.archive.modules.credential - package org.archive.modules.credential
Contains HTML form login and HTTP basic and digest credentials used by Heritrix when logging into sites.
org.archive.modules.deciderules - package org.archive.modules.deciderules
 
org.archive.modules.deciderules.recrawl - package org.archive.modules.deciderules.recrawl
 
org.archive.modules.deciderules.surt - package org.archive.modules.deciderules.surt
 
org.archive.modules.extractor - package org.archive.modules.extractor
 
org.archive.modules.fetcher - package org.archive.modules.fetcher
 
org.archive.modules.forms - package org.archive.modules.forms
 
org.archive.modules.net - package org.archive.modules.net
 
org.archive.modules.recrawl - package org.archive.modules.recrawl
 
org.archive.modules.revisit - package org.archive.modules.revisit
 
org.archive.modules.seeds - package org.archive.modules.seeds
 
org.archive.modules.warc - package org.archive.modules.warc
 
org.archive.modules.writer - package org.archive.modules.writer
 
org.archive.state - package org.archive.state
 
organization - Variable in class org.archive.modules.CrawlMetadata
 
OTHERDUPLICATE - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
OTHERDUPLICATECOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
outLinks - Variable in class org.archive.modules.CrawlURI
All discovered outbound urls as CrawlURIs (navlinks, embeds, etc.)
overlayMapsSource - Variable in class org.archive.modules.CrawlURI
 
overlayNames - Variable in class org.archive.modules.CrawlURI
 

P

parseDefineBits(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineBitsJPEG3(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineBitsLossless(InStream, int, boolean) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineButtonSound(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineFont(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineFont2(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineJPEG2(InStream, int) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineJPEGTables(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineShape(int, InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineSound(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseDefineSprite(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseFontInfo(InStream, int, boolean) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parsePlaceObject2(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
 
parseRobotsTxt(InputStream) - Method in class org.archive.modules.extractor.ExtractorRobotsTxt
 
password - Variable in class org.archive.modules.credential.HttpAuthenticationCredential
Password.
path - Variable in class org.archive.modules.writer.Kw3WriterProcessor
Top-level directory for archive files.
path - Variable in class org.archive.modules.writer.MirrorWriterProcessor
Top-level directory for mirror files.
PathologicalPathDecideRule - Class in org.archive.modules.deciderules
Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (e.g. http://example.com/a/a/a/boo.html == 3 '/a' segments)
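The check described for PathologicalPathDecideRule can be illustrated by counting runs of repeated segments. This sketch is not how the rule is implemented in Heritrix; the class RepeatedSegments, its methods, and the maxRepetitions parameter are hypothetical.

```java
// Hypothetical illustration; not Heritrix code.
class RepeatedSegments {
    /** Returns the length of the longest run of identical, consecutive path segments. */
    static int longestRun(String path) {
        int best = 0;
        int run = 0;
        String previous = null;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before a leading '/'
            }
            run = segment.equals(previous) ? run + 1 : 1;
            best = Math.max(best, run);
            previous = segment;
        }
        return best;
    }

    /** True if some segment repeats consecutively more than maxRepetitions times. */
    static boolean isPathological(String path, int maxRepetitions) {
        return longestRun(path) > maxRepetitions;
    }
}
```

For the path /a/a/a/boo.html from the description, longestRun reports 3, matching the "3 '/a' segments" count.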
PathologicalPathDecideRule() - Constructor for class org.archive.modules.deciderules.PathologicalPathDecideRule
Constructs a new PathologicalPathDecideRule.
payloadDigest - Variable in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
 
payloadDigest - Variable in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
PDFParser - Class in org.archive.modules.extractor
Supports PDF parsing operations.
PDFParser(String) - Constructor for class org.archive.modules.extractor.PDFParser
 
PDFParser(byte[]) - Constructor for class org.archive.modules.extractor.PDFParser
 
persistKeyFor(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractContentDigestHistory
 
persistKeyFor(CrawlURI) - Static method in class org.archive.modules.recrawl.PersistProcessor
Return a preferred String key for persisting the given CrawlURI's AList state.
persistKeyFor(String) - Static method in class org.archive.modules.recrawl.PersistProcessor
 
PersistLoadProcessor - Class in org.archive.modules.recrawl
Loads CrawlURI attributes from previous fetch from persistent storage for consultation by a later recrawl.
PersistLoadProcessor() - Constructor for class org.archive.modules.recrawl.PersistLoadProcessor
 
PersistLogProcessor - Class in org.archive.modules.recrawl
Log CrawlURI attributes from latest fetch for consultation by a later recrawl.
PersistLogProcessor() - Constructor for class org.archive.modules.recrawl.PersistLogProcessor
 
PersistOnlineProcessor - Class in org.archive.modules.recrawl
Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later).
PersistOnlineProcessor() - Constructor for class org.archive.modules.recrawl.PersistOnlineProcessor
 
PersistProcessor - Class in org.archive.modules.recrawl
Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence.
PersistProcessor() - Constructor for class org.archive.modules.recrawl.PersistProcessor
 
PersistStoreProcessor - Class in org.archive.modules.recrawl
Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
PersistStoreProcessor() - Constructor for class org.archive.modules.recrawl.PersistStoreProcessor
 
politenessDelay - Variable in class org.archive.modules.CrawlURI
 
poolMaxActive - Variable in class org.archive.modules.writer.WriterPoolProcessor
Maximum active files in pool.
populateHtmlFormCredential(HtmlFormCredential) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
populateHttpCredential(HttpHost, AuthScheme, String, String) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
populateHttpProxyCredential() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
populatePersistEnv(String, File) - Static method in class org.archive.modules.recrawl.PersistProcessor
Populates a new environment db from an old environment db or a persist log.
populateTargetCredential() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
Add credentials if any to passed method.
PredicatedDecideRule - Class in org.archive.modules.deciderules
Rule which applies the configured decision only if a test evaluates to true.
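The pattern named by PredicatedDecideRule, a fixed decision gated by a test, can be sketched generically. The class PredicatedRuleSketch, its nested Decision enum, and the decide method are hypothetical names chosen for this illustration and are not the Heritrix types.

```java
import java.util.function.Predicate;

// Hypothetical illustration; not Heritrix code.
class PredicatedRuleSketch<T> {
    enum Decision { ACCEPT, REJECT, NONE }

    private final Decision configured;
    private final Predicate<T> test;

    PredicatedRuleSketch(Decision configured, Predicate<T> test) {
        this.configured = configured;
        this.test = test;
    }

    /** Returns the configured decision when the test passes, otherwise expresses no opinion. */
    Decision decide(T candidate) {
        return test.test(candidate) ? configured : Decision.NONE;
    }
}
```

As a usage example, a rule built with Decision.REJECT and the predicate s -> s >= 400 && s <= 499 would reject status code 404 and return NONE for 200, which mirrors the inclusive-range test described for the status-code rules.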
PredicatedDecideRule() - Constructor for class org.archive.modules.deciderules.PredicatedDecideRule
 
prefix - Variable in class org.archive.modules.writer.WriterPoolProcessor
File prefix.
prefixFrom(String) - Method in class org.archive.modules.deciderules.surt.OnDomainsDecideRule
 
prefixFrom(String) - Method in class org.archive.modules.deciderules.surt.OnHostsDecideRule
 
prefixFrom(String) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
preloadSource - Variable in class org.archive.modules.recrawl.PersistLoadProcessor
A source (either log file or BDB directory) from which to copy history information into the current store at startup.
preloadSourceUrl - Variable in class org.archive.modules.recrawl.PersistLoadProcessor
A log file source url from which to copy history information into the current store at startup.
prepare() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
prepare() - Method in class org.archive.modules.fetcher.BdbCookieStore
 
prepare() - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
PREREQ_MISC - Static variable in class org.archive.modules.extractor.LinkContext
Stand-in value for prerequisite urls without other context.
PrerequisiteAcceptDecideRule - Class in org.archive.modules.deciderules
Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in the last hopsPath position).
PrerequisiteAcceptDecideRule() - Constructor for class org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
 
presumedUsernameInput() - Method in class org.archive.modules.forms.HTMLForm
 
PROCEED - Static variable in class org.archive.modules.ProcessResult
 
process(CrawlURI) - Method in class org.archive.modules.Processor
Processes the given URI.
process(CrawlURI, ProcessorChain.ChainStatusReceiver) - Method in class org.archive.modules.ProcessorChain
 
processEmbed(CrawlURI, CharSequence, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
 
processEmbed(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
 
processForm(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
 
processGeneralTag(CrawlURI, CharSequence, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
 
processGeneralTag(CrawlURI, Element, Attributes) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
 
processingCleanup() - Method in class org.archive.modules.CrawlURI
Clean up after a run through the processing chain.
processLink(CrawlURI, CharSequence, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
Handle generic HREF cases.
processLinkTagWithRel(CrawlURI, CharSequence, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
 
processMeta(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
Process metadata tags.
processMeta(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
 
Processor - Class in org.archive.modules
A processor of URIs.
Processor() - Constructor for class org.archive.modules.Processor
 
ProcessorChain - Class in org.archive.modules
Collection of Processors to run.
ProcessorChain() - Constructor for class org.archive.modules.ProcessorChain
 
ProcessorChain.ChainStatusReceiver - Interface in org.archive.modules
 
ProcessorTestBase - Class in org.archive.modules
Unit test for Processor.
ProcessorTestBase() - Constructor for class org.archive.modules.ProcessorTestBase
 
ProcessResult - Class in org.archive.modules
Returned by a Processor's process method to indicate the status of the process.
ProcessResult.ProcessStatus - Enum in org.archive.modules
 
processScript(CrawlURI, CharSequence, int) - Method in class org.archive.modules.extractor.AggressiveExtractorHTML
 
processScript(CrawlURI, CharSequence, int) - Method in class org.archive.modules.extractor.ExtractorHTML
 
processScript(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
 
processScriptCode(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
Extract the (java)script source in the given CharSequence.
processStyle(CrawlURI, CharSequence, int) - Method in class org.archive.modules.extractor.ExtractorHTML
Process style text.
processStyle(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
 
processStyleCode(Extractor, CrawlURI, CharSequence) - Static method in class org.archive.modules.extractor.ExtractorCSS
 
processXml(Extractor, CrawlURI, CharSequence) - Static method in class org.archive.modules.extractor.ExtractorXML
 
promoteCredentials(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
Promote successful credential to the server.
proxyHost - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
publishAddedSeed(CrawlURI) - Method in class org.archive.modules.seeds.SeedModule
 
publishConcludedSeedBatch() - Method in class org.archive.modules.seeds.SeedModule
 
publishNonSeedLine(String) - Method in class org.archive.modules.seeds.SeedModule
 
push(String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
 
putHttpResponseHeader(String, String) - Method in class org.archive.modules.CrawlURI
 

Q

qualifyRecordID(URI, String, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 

R

readCookies(Reader) - Method in class org.archive.modules.fetcher.AbstractCookieStore
Load cookies.
readPrefixes() - Method in class org.archive.modules.deciderules.surt.OnDomainsDecideRule
Patch the SURT prefix set so that it only includes host-enforcing prefixes
readPrefixes() - Method in class org.archive.modules.deciderules.surt.OnHostsDecideRule
Patch the SURT prefix set so that it only includes host-enforcing prefixes
readPrefixes() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
readUuri(String) - Method in class org.archive.modules.CrawlURI
Read a UURI from a String, handling a null or URIException
realm - Variable in class org.archive.modules.credential.HttpAuthenticationCredential
Basic/Digest Auth realm.
recordDNS(CrawlURI, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
 
RecordingHttpClientConnection(int, int, CharsetDecoder, CharsetEncoder, MessageConstraints, ContentLengthStrategy, ContentLengthStrategy, HttpMessageWriterFactory<HttpRequest>, HttpMessageParserFactory<HttpResponse>, HttpHost, CrawlURI) - Constructor for class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
 
recoveryCheckpoint - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
recoveryCheckpoint - Variable in class org.archive.modules.Processor
 
RecrawlAttributeConstants - Interface in org.archive.modules.recrawl
 
refersToDate - Variable in class org.archive.modules.revisit.AbstractProfile
 
refersToRecordID - Variable in class org.archive.modules.revisit.AbstractProfile
 
refersToTargetURI - Variable in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
 
RegexRule - Class in org.archive.modules.canonicalize
General conversion rule.
RegexRule() - Constructor for class org.archive.modules.canonicalize.RegexRule
 
RejectDecideRule - Class in org.archive.modules.deciderules
 
RejectDecideRule() - Constructor for class org.archive.modules.deciderules.RejectDecideRule
 
remove(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
remove(int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
removeAll(Collection<?>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
report() - Method in class org.archive.modules.extractor.Extractor
 
report() - Method in class org.archive.modules.extractor.JerichoExtractorHTML
 
report() - Method in class org.archive.modules.Processor
 
report() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
reportTo(PrintWriter) - Method in class org.archive.modules.CrawlURI
 
reportTo(PrintWriter) - Method in class org.archive.modules.fetcher.FetchStats
 
reportTo(PrintWriter) - Method in class org.archive.modules.ProcessorChain
Compiles and returns a human readable report on the active processors.
request - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
requestConfigBuilder - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
rescheduleTime - Variable in class org.archive.modules.CrawlURI
A future time at which this CrawlURI should be reenqueued.
resetConsecutiveConnectionErrors() - Method in class org.archive.modules.net.CrawlServer
 
resetDeferrals() - Method in class org.archive.modules.CrawlURI
Reset deferrals counter.
resetFetchAttempts() - Method in class org.archive.modules.CrawlURI
Reset fetchAttempts counter.
resetForRescheduling() - Method in class org.archive.modules.CrawlURI
Reset state that should not persist when a URI is rescheduled for a specific future time.
resetState() - Method in class org.archive.modules.extractor.PDFParser
Reinitialize the object as though a new one were created.
resetState(byte[]) - Method in class org.archive.modules.extractor.PDFParser
Reset the object and initialize it with a new byte array (the document).
resetState(String) - Method in class org.archive.modules.extractor.PDFParser
Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read
resolve(String) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
 
resolve(String) - Method in interface org.archive.modules.fetcher.HostResolver
 
ResourceLongerThanDecideRule - Class in org.archive.modules.deciderules
Applies configured decision for URIs with content length greater than a given threshold length value.
ResourceLongerThanDecideRule() - Constructor for class org.archive.modules.deciderules.ResourceLongerThanDecideRule
 
ResourceNoLongerThanDecideRule - Class in org.archive.modules.deciderules
Applies configured decision for URIs with content length less than or equal to a given threshold length value.
ResourceNoLongerThanDecideRule() - Constructor for class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
 
ResponseContentLengthDecideRule - Class in org.archive.modules.deciderules
Decide rule that will ACCEPT or REJECT a uri, depending on the "decision" property, after it's fetched, if the content body is within a specified size range, specified in bytes.
ResponseContentLengthDecideRule() - Constructor for class org.archive.modules.deciderules.ResponseContentLengthDecideRule
 
RestrictedCollectionWrappedList(Collection<T>) - Constructor for class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
retainAll(Collection<?>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
RevisitProfile - Interface in org.archive.modules.revisit
 
RevisitRecordBuilder - Class in org.archive.modules.warc
 
RevisitRecordBuilder() - Constructor for class org.archive.modules.warc.RevisitRecordBuilder
 
ROBOTS_DENIALS - Static variable in class org.archive.modules.fetcher.FetchStats
 
ROBOTS_NOT_FETCHED - Static variable in class org.archive.modules.net.CrawlServer
 
RobotsDirectives - Class in org.archive.modules.net
Represents the directives that apply to a user-agent (or set of user-agents)
RobotsDirectives() - Constructor for class org.archive.modules.net.RobotsDirectives
 
robotsFetched - Variable in class org.archive.modules.net.CrawlServer
 
RobotsPolicy - Class in org.archive.modules.net
RobotsPolicy represents the strategy used by the crawler for determining how robots.txt files will be honored.
RobotsPolicy() - Constructor for class org.archive.modules.net.RobotsPolicy
 
robotstxt - Variable in class org.archive.modules.net.CrawlServer
 
Robotstxt - Class in org.archive.modules.net
Utility class for parsing and representing 'robots.txt' format directives, into a list of named user-agents and map from user-agents to RobotsDirectives.
Robotstxt() - Constructor for class org.archive.modules.net.Robotstxt
 
Robotstxt(Reader) - Constructor for class org.archive.modules.net.Robotstxt
 
Robotstxt(ReadSource) - Constructor for class org.archive.modules.net.Robotstxt
 
RobotsTxtOnlyPolicy - Class in org.archive.modules.net
Policy to obey robots.txt but ignore meta nofollow.
RobotsTxtOnlyPolicy() - Constructor for class org.archive.modules.net.RobotsTxtOnlyPolicy
 
rootUriMatch(ServerCache, CrawlURI) - Method in class org.archive.modules.credential.Credential
Test whether the passed curi matches this credential's rootUri.
ROUTE_PLANNER - Static variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
RulesCanonicalizationPolicy - Class in org.archive.modules.canonicalize
URI Canonicalization Policy
RulesCanonicalizationPolicy() - Constructor for class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
 
runTest() - Method in class org.archive.state.ModuleTestBase
 

S

S_BLOCKED_BY_CUSTOM_PROCESSOR - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Blocked by custom prefetcher processor.
S_BLOCKED_BY_QUOTA - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Blocked due to exceeding an established quota.
S_BLOCKED_BY_RUNTIME_LIMIT - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Blocked due to exceeding an established runtime.
S_BLOCKED_BY_USER - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
blocked from fetch by user setting.
S_CONNECT_FAILED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
HTTP connect failed
S_CONNECT_LOST - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
HTTP connect broken
S_DEEMED_CHAFF - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
'chaff' detection of traps/content of negligible value applied
S_DEEMED_NOT_FOUND - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
synthetic status, used when some other status (such as connection-lost) is considered by policy the same as a document-not-found
S_DEFERRED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
temporary status assigned URIs awaiting preconditions; appearance in logs is a bug
S_DELETED_BY_USER - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
deleted from frontier by user
S_DNS_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
DNS success
S_DOMAIN_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
DNS prerequisite failed, precluding attempt
S_DOMAIN_UNRESOLVABLE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
DNS lookup failed
S_GETBYNAME_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
InetAddress.getByName success
S_NOT_FOUND - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
HTTP 404 NOT FOUND
S_OTHER_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Other prerequisite failed, precluding attempt
S_OUT_OF_SCOPE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
out-of-scope upon reexamination (only when scope changes during crawl)
S_PREREQUISITE_UNSCHEDULABLE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Prerequisite could not be scheduled, precluding attempt
S_PROCESSING_THREAD_KILLED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Processing thread was killed
S_ROBOTS_PRECLUDED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
robots rules precluded fetch
S_ROBOTS_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Robots prerequisite failed, precluding attempt
S_RUNTIME_EXCEPTION - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Unexpected runtime exception; see runtime-errors.log
S_SERIOUS_ERROR - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
severe java 'Error' conditions (OutOfMemoryError, StackOverflowError, etc.) during URI processing
S_TIMEOUT - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
HTTP timeout (before any meaningful response received)
S_TOO_MANY_EMBED_HOPS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
overstepped embed/trans hops
S_TOO_MANY_LINK_HOPS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
overstepped link hops
S_TOO_MANY_RETRIES - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
multiple retries all failed
S_UNATTEMPTED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
fetch never tried (perhaps protocol unsupported or illegal URI)
S_UNFETCHABLE_URI - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
URI recognized as unsupported or illegal
S_UNQUEUEABLE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
URI could not be queued in Frontier; when URIs are properly filtered for format, should never occur
S_WHOIS_GENERIC_FINISHED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
Finished all fetches for serverless WHOIS url (whois:foo.org)
S_WHOIS_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
WHOIS success
saveCookies() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
saveCookies(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
saveHeader(CrawlURI, Map<String, Object>, String) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
Save a header from the given HTTP operation into the Map.
saveHeader(CrawlURI, ANVLRecord, String, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
Saves a header from the given HTTP operation into the provider headers under a new name
SchedulingConstants - Class in org.archive.modules
 
SchemeNotInSetDecideRule - Class in org.archive.modules.deciderules
Rule applies the configured decision (default REJECT) for any URI which has a URI-scheme NOT contained in the configured Set.
SchemeNotInSetDecideRule() - Constructor for class org.archive.modules.deciderules.SchemeNotInSetDecideRule
Usual constructor.
schemes - Variable in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
set of schemes to test URI scheme
SCRIPT_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
ScriptedDecideRule - Class in org.archive.modules.deciderules
Rule which runs a JSR-223 script to make its decision.
ScriptedDecideRule() - Constructor for class org.archive.modules.deciderules.ScriptedDecideRule
 
ScriptedProcessor - Class in org.archive.modules
A processor which runs a JSR-223 script on the CrawlURI.
ScriptedProcessor() - Constructor for class org.archive.modules.ScriptedProcessor
Constructor.
scriptSource - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
 
scriptSource - Variable in class org.archive.modules.ScriptedProcessor
 
SeedAcceptDecideRule - Class in org.archive.modules.deciderules
Rule which ACCEPTs all 'seed' URIs (those for which isSeed is true).
SeedAcceptDecideRule() - Constructor for class org.archive.modules.deciderules.SeedAcceptDecideRule
 
seedLine(String) - Method in class org.archive.modules.seeds.TextSeedModule
Handle a read line that is probably a seed.
SeedListener - Interface in org.archive.modules.seeds
Implemented by components which want notifications of seed list changes.
seedListeners - Variable in class org.archive.modules.seeds.SeedModule
 
SeedModule - Class in org.archive.modules.seeds
 
SeedModule() - Constructor for class org.archive.modules.seeds.SeedModule
 
seeds - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
seedsAsSurtPrefixes - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Should seeds also be interpreted as SURT prefixes.
seemsLoginForm() - Method in class org.archive.modules.forms.HTMLForm
For now, we consider a POST form with only 1 password field and 1 potential username field (type text or email) to be a likely login form.
serverCache - Variable in class org.archive.modules.deciderules.DecideRuleSequence
 
serverCache - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
serverCache - Variable in class org.archive.modules.deciderules.IpAddressSetDecideRule
 
serverCache - Variable in class org.archive.modules.fetcher.FetchDNS
Used to do DNS lookups.
serverCache - Variable in class org.archive.modules.fetcher.FetchHTTP
 
serverCache - Variable in class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
 
serverCache - Variable in class org.archive.modules.fetcher.FetchWhois
 
ServerCache - Class in org.archive.modules.net
Abstract class for crawl-global registry of CrawlServer (host:port) and CrawlHost (hostname) objects.
ServerCache() - Constructor for class org.archive.modules.net.ServerCache
 
serverCache - Variable in class org.archive.modules.writer.Kw3WriterProcessor
The server cache to use.
serverCache - Variable in class org.archive.modules.writer.WriterPoolProcessor
 
ServerCacheResolver(ServerCache) - Constructor for class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
 
serverInetAddr - Variable in class org.archive.modules.fetcher.FetchDNS
 
ServerNotModifiedRevisit - Class in org.archive.modules.revisit
 
ServerNotModifiedRevisit() - Constructor for class org.archive.modules.revisit.ServerNotModifiedRevisit
Minimal constructor.
servers - Variable in class org.archive.modules.fetcher.DefaultServerCache
hostname[:port] -> CrawlServer.
set(int, T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
setAcceptCompression(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Set headers to accept compressed responses.
setAcceptHeaders(List<String>) - Method in class org.archive.modules.fetcher.FetchHTTP
Accept Headers to include in each request.
setAcceptNonDnsResolves(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
If a DNS lookup fails, whether or not to fall back to InetAddress resolution, which may use local 'hosts' files or other mechanisms.
setAction(String) - Method in class org.archive.modules.forms.HTMLForm
 
setAlsoCheckVia(boolean) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Whether to also make the configured decision if a URI's 'via' URI (the URI from which it was discovered) in SURT form begins with any of the established prefixes.
setApplicableSurtPrefix(String) - Method in class org.archive.modules.forms.FormLoginProcessor
SURT prefix against which configured username/password is applicable.
setApplicationContext(ApplicationContext) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
setApplicationContext(ApplicationContext) - Method in class org.archive.modules.ScriptedProcessor
 
setAudience(String) - Method in class org.archive.modules.CrawlMetadata
 
setAvailableRobotsPolicies(Map<String, RobotsPolicy>) - Method in class org.archive.modules.CrawlMetadata
 
setBaseURI(String) - Method in class org.archive.modules.CrawlURI
Set the (HTML) Base URI used for derelativizing internal URIs.
setBaseURI(UURI) - Method in class org.archive.modules.CrawlURI
 
setBdbModule(BdbModule) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
setBdbModule(BdbModule) - Method in class org.archive.modules.fetcher.FetchWhois
 
setBdbModule(BdbModule) - Method in class org.archive.modules.net.BdbServerCache
 
setBdbModule(BdbModule) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
setBdbModule(BdbModule) - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
 
setBeanName(String) - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
setBeanName(String) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
setBeanName(String) - Method in class org.archive.modules.Processor
 
setBlockAwaitingSeedLines(int) - Method in class org.archive.modules.seeds.TextSeedModule
 
setCandidateUserAgents(List<String>) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
setCandidateUserAgents(List<String>) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
setCanonicalString(String) - Method in class org.archive.modules.CrawlURI
 
setCaseSensitiveFilesystem(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setChain(List<? extends WARCRecordBuilder>) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
setCharacterEncoding(CrawlURI, Recorder, HttpResponse) - Method in class org.archive.modules.fetcher.FetchHTTP
Set the character encoding based on the result headers or default.
setCharacterMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setChmod(boolean) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
setChmodValue(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
setClassKey(String) - Method in class org.archive.modules.CrawlURI
 
setCollection(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
setComment(String) - Method in class org.archive.modules.deciderules.DecideRule
 
setCompress(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setConnectTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
setContentDigest(byte[]) - Method in class org.archive.modules.CrawlURI
setContentDigest(String, byte[]) - Method in class org.archive.modules.CrawlURI
 
setContentDigestHistory(AbstractContentDigestHistory) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
 
setContentDigestHistory(AbstractContentDigestHistory) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
 
setContentLengthThreshold(long) - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
Content-length threshold.
setContentLengthThreshold(long) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
Max content-length this filter will allow to pass through.
setContentRegexes(Map<String, String>) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
A map of { name => regex }.
setContentSize(long) - Method in class org.archive.modules.CrawlURI
Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server).
setContentType(String) - Method in class org.archive.modules.CrawlURI
Set a fetched uri's content type.
setContentTypeMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setCookiesLoadFile(ConfigFile) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
setCookiesSaveFile(ConfigPath) - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
setCookieStore(AbstractCookieStore) - Method in class org.archive.modules.fetcher.FetchHTTP
 
setCountryCode(String) - Method in class org.archive.modules.net.CrawlHost
Set country code for this host.
setCountryCodes(List<String>) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
setCrawlDelay(float) - Method in class org.archive.modules.net.RobotsDirectives
 
setCreateHostDirectory(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setCreatePortDirectory(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setCredentials(Map<String, Credential>) - Method in class org.archive.modules.credential.CredentialStore
Credentials used by Heritrix when authenticating.
setCredentialStore(CredentialStore) - Method in class org.archive.modules.fetcher.FetchHTTP
Used to store credentials.
setCustomRobots(ReadSource) - Method in class org.archive.modules.net.CustomRobotsPolicy
 
setDecision(DecideResult) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
 
setDefaultEncoding(String) - Method in class org.archive.modules.fetcher.FetchHTTP
The character encoding to use for files that do not have one specified in the HTTP response headers.
setDescription(String) - Method in class org.archive.modules.CrawlMetadata
 
setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchDNS
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchFTP
 
setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchHTTP
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchSFTP
 
setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchSFTP
Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
setDirectory(ConfigPath) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setDirectoryFile(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setDisableJavaDnsResolves(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
Optionally, only allow InetAddress resolution, precisely because it may use local 'hosts' files or other mechanisms.
setDisableSNI(boolean) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
 
setDnsOverHttpServer(String) - Method in class org.archive.modules.fetcher.FetchDNS
URL to the DNS-on-HTTP(S) server.
setDomain(String) - Method in class org.archive.modules.credential.Credential
 
setDotBegin(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setDotEnd(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setEarliestNextURIEmitTime(long) - Method in class org.archive.modules.net.CrawlHost
Set the earliest time a URI for this host could be emitted.
setEnabled(boolean) - Method in class org.archive.modules.canonicalize.BaseRule
 
setEnabled(boolean) - Method in class org.archive.modules.deciderules.DecideRule
 
setEnabled(boolean) - Method in class org.archive.modules.Processor
Whether or not this processor will execute for a particular URI.
setEnableLenientExtraction(boolean) - Method in class org.archive.modules.extractor.ExtractorSitemap
If true, all urls in the sitemap file are extracted, regardless of whether or not they obey the scoping rules specified in the sitemap protocol (https://www.sitemaps.org/protocol.html).
setEnctype(String) - Method in class org.archive.modules.forms.HTMLForm
 
setEngineName(String) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
setEngineName(String) - Method in class org.archive.modules.ScriptedProcessor
 
setEntity(HttpEntity) - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
 
setError(String) - Method in class org.archive.modules.CrawlURI
 
setETag(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
setExtractAllForms(boolean) - Method in class org.archive.modules.forms.ExtractorHTMLForms
If true, report all FORMs.
setExtractFromDirs(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
Set to true to extract further URIs from FTP directories.
setExtractFromDirs(boolean) - Method in class org.archive.modules.fetcher.FetchSFTP
Set to true to extract further URIs from SFTP directories.
setExtractJavascript(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
If true, in-page Javascript is scanned for strings that appear likely to be URIs.
setExtractOnlyFormGets(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
If true, only ACTION URIs with a METHOD of GET (explicit or implied) are extracted.
setExtractorJS(ExtractorJS) - Method in class org.archive.modules.extractor.ExtractorHTML
 
setExtractorJS(ExtractorJS) - Method in class org.archive.modules.extractor.ExtractorSWF
 
setExtractorParameters(ExtractorParameters) - Method in class org.archive.modules.extractor.Extractor
 
setExtractParent(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
Set to true to extract the parent URI from all FTP URIs.
setExtractParent(boolean) - Method in class org.archive.modules.fetcher.FetchSFTP
Set to true to extract the parent URI from all SFTP URIs.
setExtractValueAttributes(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
If true, strings that look like URIs found in unusual places (such as form VALUE attributes) will be extracted.
setFetchBeginTime(long) - Method in class org.archive.modules.CrawlURI
 
setFetchCompletedTime(long) - Method in class org.archive.modules.CrawlURI
 
setFetchHistory(Map<String, Object>[]) - Method in class org.archive.modules.CrawlURI
 
setFetchStatus(int) - Method in class org.archive.modules.CrawlURI
Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
setFetchType(CrawlURI.FetchType) - Method in class org.archive.modules.CrawlURI
 
setForceFetch(boolean) - Method in class org.archive.modules.CrawlURI
Method to signal that this URI should be fetched even though it already has been crawled.
setForceRetire(boolean) - Method in class org.archive.modules.CrawlURI
 
setFormat(String) - Method in class org.archive.modules.canonicalize.RegexRule
The format string to use when a match is found.
setFormat(String) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
Replacement pattern to build 'implied' URI, using captured groups of trigger expression.
setFormItems(Map<String, String>) - Method in class org.archive.modules.credential.HtmlFormCredential
 
setFrequentFlushes(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setFullVia(CrawlURI) - Method in class org.archive.modules.CrawlURI
 
setHarvester(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
setHistoryDbName(String) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
setHistoryDbName(String) - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
 
setHistoryLength(int) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
 
setHolder(Object) - Method in class org.archive.modules.CrawlURI
Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
setHolderCost(int) - Method in class org.archive.modules.CrawlURI
Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI
setHolderKey(Object) - Method in class org.archive.modules.CrawlURI
Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
setHostMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setHttpAuthChallenges(Map<String, String>) - Method in class org.archive.modules.CrawlURI
 
setHttpAuthChallenges(Map<String, String>) - Method in class org.archive.modules.net.CrawlServer
 
setHttpBindAddress(String) - Method in class org.archive.modules.fetcher.FetchHTTP
Local IP address or hostname to use when making connections (binding sockets).
setHttpMethod(HtmlFormCredential.Method) - Method in class org.archive.modules.credential.HtmlFormCredential
Deprecated.
ignored, always POST
setHttpProxyHost(String) - Method in class org.archive.modules.fetcher.FetchHTTP
Proxy host IP (set only if needed).
setHttpProxyPassword(String) - Method in class org.archive.modules.fetcher.FetchHTTP
Proxy password (set only if needed).
setHttpProxyPort(Integer) - Method in class org.archive.modules.fetcher.FetchHTTP
Proxy port (set only if needed).
setHttpProxyUser(String) - Method in class org.archive.modules.fetcher.FetchHTTP
Proxy user (set only if needed).
setIdentityCache(ObjectIdentityCache<?>) - Method in class org.archive.modules.net.CrawlHost
 
setIdentityCache(ObjectIdentityCache<?>) - Method in class org.archive.modules.net.CrawlServer
 
setIgnoreCookies(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Disable cookie handling.
setIgnoreFormActionUrls(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
If true, URIs appearing as the ACTION attribute in HTML FORMs are ignored.
setIgnoreUnexpectedHtml(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
If true, URIs which end in typical non-HTML extensions (such as .gif) will not be scanned as if they were HTML.
setInferRootPage(boolean) - Method in class org.archive.modules.extractor.ExtractorHTTP
 
setIP(InetAddress, long) - Method in class org.archive.modules.net.CrawlHost
Set the IP address for this host.
setIpAddresses(Set<String>) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
 
setIsolateThreads(boolean) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
setIsolateThreads(boolean) - Method in class org.archive.modules.ScriptedProcessor
 
setJobName(String) - Method in class org.archive.modules.CrawlMetadata
 
setLastModified(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
setListLogicalOr(boolean) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
True if the list of regular expressions should be considered as logically OR when matching.
setLogExtraInfo(boolean) - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
setLogFile(ConfigPath) - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
setLoggerModule(SimpleFileLoggerProvider) - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
setLoggerModule(UriErrorLoggerModule) - Method in class org.archive.modules.extractor.Extractor
 
setLoggerModule(UriErrorLoggerModule) - Method in class org.archive.modules.forms.FormLoginProcessor
 
setLogin(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
setLoginPassword(String) - Method in class org.archive.modules.forms.FormLoginProcessor
Password string to use in appropriate form input field.
setLoginUri(String) - Method in class org.archive.modules.credential.HtmlFormCredential
 
setLoginUsername(String) - Method in class org.archive.modules.forms.FormLoginProcessor
Username (or similar) string to use in appropriate form input field.
setLogToFile(boolean) - Method in class org.archive.modules.deciderules.DecideRuleSequence
If enabled, log decisions to file named logs/{spring-bean-id}.log.
setLookup(ExternalGeoLookupInterface) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
setLowerBound(Integer) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Sets the lower bound on the range of acceptable status codes.
setLowerBound(long) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
The rule will apply if the url has been fetched and content body length is greater than or equal to this number of bytes.
setMaxAttributeNameLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
 
setMaxAttributeValLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
 
setMaxElementLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
 
setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchFTP
The maximum KB/sec to use when fetching data from a server.
setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchHTTP
The maximum KB/sec to use when fetching data from a server.
setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchSFTP
The maximum KB/sec to use when fetching data from a server.
setMaxFileSizeBytes(long) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
setMaxFileSizeBytes(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setMaxHops(int) - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
Max path depth for which this filter will match.
setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchFTP
Maximum length in bytes to fetch.
setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchHTTP
Maximum length in bytes to fetch.
setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchSFTP
Maximum length in bytes to fetch.
setMaxPathDepth(int) - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
Number of path segments beyond which this rule will reject URIs.
setMaxPathLength(int) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setMaxRepetitions(int) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
Number of times the pattern should be allowed to occur.
setMaxSegLength(int) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setMaxSizeToDigest(long) - Method in class org.archive.modules.extractor.HTTPContentDigest
Maximum file size to digest; longer files will be ignored.
setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorPDF
The maximum size of PDF files to consider.
setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorUniversal
How deep to look into files for URI strings, in bytes.
setMaxSpeculativeHops(int) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
Maximum number of speculative ('X') hops to ACCEPT.
setMaxTotalBytesToWrite(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setMaxTransHops(int) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
Maximum number of non-navlink (non-'L') hops to ACCEPT.
setMaxWaitForIdleMs(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setMetadata(CrawlMetadata) - Method in class org.archive.modules.extractor.ExtractorHTML
 
setMetadataProvider(CrawlMetadata) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setMethod(String) - Method in class org.archive.modules.forms.HTMLForm
 
setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.CustomRobotsPolicy
 
setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
setOnlyStoreIfWriteTagPresent(boolean) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
 
setOperator(String) - Method in class org.archive.modules.CrawlMetadata
 
setOperatorContactUrl(String) - Method in class org.archive.modules.CrawlMetadata
 
setOperatorFrom(String) - Method in class org.archive.modules.CrawlMetadata
 
setOrdinal(long) - Method in class org.archive.modules.CrawlURI
 
setOrganization(String) - Method in class org.archive.modules.CrawlMetadata
 
setOtherCodings(CrawlURI, Recorder, HttpResponse) - Method in class org.archive.modules.fetcher.FetchHTTP
Set the transfer and content encodings based on headers (if necessary).
setOverlayMapsSource(OverlayMapsSource) - Method in class org.archive.modules.CrawlURI
 
setPassword(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
setPassword(String) - Method in class org.archive.modules.fetcher.FetchFTP
The password to send to FTP servers.
setPassword(String) - Method in class org.archive.modules.fetcher.FetchSFTP
The password to send to SFTP servers.
setPath(ConfigPath) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
setPath(ConfigPath) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setPayloadDigest(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
 
setPolitenessDelay(long) - Method in class org.archive.modules.CrawlURI
 
setPool(WriterPool) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setPoolMaxActive(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setPrecedence(int) - Method in class org.archive.modules.CrawlURI
 
setPrefix(String) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setPreloadSource(ConfigPath) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
 
setPreloadSourceUrl(String) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
 
setPrerequisite(boolean) - Method in class org.archive.modules.CrawlURI
Set if this CrawlURI is itself a prerequisite URI.
setPrerequisiteUri(CrawlURI) - Method in class org.archive.modules.CrawlURI
Set a prerequisite for this URI.
setProcessors(List<Processor>) - Method in class org.archive.modules.ProcessorChain
 
setRealm(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
 
setRecorder(Recorder) - Method in class org.archive.modules.CrawlURI
Set the http recorder to be associated with this uri.
setRecordIDGenerator(RecordIDGenerator) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
 
setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
 
setRefersToDate(String) - Method in class org.archive.modules.revisit.AbstractProfile
Set the refers to date
setRefersToDate(long) - Method in class org.archive.modules.revisit.AbstractProfile
Set the refers to date
setRefersToRecordID(String) - Method in class org.archive.modules.revisit.AbstractProfile
 
setRefersToTargetURI(String) - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
 
setRegex(Pattern) - Method in class org.archive.modules.canonicalize.RegexRule
The regular expression to use to match.
setRegex(Pattern) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
 
setRegex(Pattern) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
Triggering regular expression.
setRegexList(List<Pattern>) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
The list of regular expressions to evaluate against the URI.
setRemoveTriggerUris(boolean) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
If true, all URIs that match trigger regular expression are removed from the list of extracted URIs.
setRescheduleTime(long) - Method in class org.archive.modules.CrawlURI
 
setRevisitProfile(RevisitProfile) - Method in class org.archive.modules.CrawlURI
 
setRobotsPolicyName(String) - Method in class org.archive.modules.CrawlMetadata
Robots policy name
setRules(List<CanonicalizationRule>) - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
 
setRules(List<DecideRule>) - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
setSchedulingDirective(int) - Method in class org.archive.modules.CrawlURI
 
setSchemes(Set<String>) - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
 
setScriptSource(ReadSource) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
 
setScriptSource(ReadSource) - Method in class org.archive.modules.ScriptedProcessor
 
setSeed(boolean) - Method in class org.archive.modules.CrawlURI
Set the isSeed attribute of this URI.
setSeedListeners(Set<SeedListener>) - Method in class org.archive.modules.seeds.SeedModule
 
setSeeds(SeedModule) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
setSeedsAsSurtPrefixes(boolean) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
setSendConnectionClose(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Send 'Connection: close' header with every request.
setSendIfModifiedSince(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Send 'If-Modified-Since' header, if previous 'Last-Modified' fetch history information is available in URI history.
setSendIfNoneMatch(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Send 'If-None-Match' header, if previous 'ETag' fetch history information is available in URI history.
setSendRange(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Send 'Range' header when a limit (#MAX_LENGTH_BYTES) is set on document size.
setSendReferer(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Send 'Referer' header with every request.
setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
 
setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
 
setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchDNS
 
setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchHTTP
Used to do DNS lookups.
setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchWhois
 
setServerCache(ServerCache) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
setServerCache(ServerCache) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setServerIP(String) - Method in class org.archive.modules.CrawlURI
 
setShouldFetchBodyRule(DecideRule) - Method in class org.archive.modules.fetcher.FetchHTTP
DecideRules applied after receipt of HTTP response headers but before we start to download the body.
setShouldMasquerade(boolean) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
 
setShouldMasquerade(boolean) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
 
setShouldProcessRule(DecideRule) - Method in class org.archive.modules.Processor
Decide rule(s) (also particular to a URI) that determine whether or not a particular URI is processed here.
setSizes(CrawlURI, Recorder) - Method in class org.archive.modules.fetcher.FetchHTTP
Update CrawlURI internal sizes based on current transaction (and in the case of 304s, history)
setSkipIdenticalDigests(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setSocksProxyHost(String) - Method in class org.archive.modules.fetcher.FetchHTTP
Sets a SOCKS5 proxy host to use.
setSocksProxyPort(Integer) - Method in class org.archive.modules.fetcher.FetchHTTP
Sets a SOCKS5 proxy port to use.
setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchFTP
If the socket is unresponsive for this number of milliseconds, give up.
setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchHTTP
If the socket is unresponsive for this number of milliseconds, give up.
setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchSFTP
If the socket is unresponsive for this number of milliseconds, give up.
setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchWhois
If the socket is unresponsive for this number of milliseconds, give up.
setSourceSeeds(Set<String>) - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
 
setSourceTag(String) - Method in class org.archive.modules.CrawlURI
 
setSourceTagSeeds(boolean) - Method in class org.archive.modules.seeds.SeedModule
 
setSpecialQueryTemplates(Map<String, String>) - Method in class org.archive.modules.fetcher.FetchWhois
 
setSslTrustLevel(ConfigurableX509TrustManager.TrustLevel) - Method in class org.archive.modules.fetcher.FetchHTTP
SSL certificate trust level.
setStartNewFilesOnCheckpoint(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
Whether to close output files and start new ones on checkpoint.
setStatusCodes(List<Integer>) - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
 
setStorePaths(List<ConfigPath>) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setStripRegex(String) - Method in class org.archive.modules.extractor.HTTPContentDigest
A regular expression that matches those portions of downloaded documents that need to be ignored when calculating the content digest.
setSuffixAtEnd(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setSurtPrefixes(List<String>) - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
 
setSurtsDumpFile(ConfigFile) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
setSurtsSource(ReadSource) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
setSurtsSourceFile(ConfigFile) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Deprecated. 
setTemplate(String) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
URI-building template.
setTemplate(String) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setTextSource(ReadSource) - Method in class org.archive.modules.seeds.TextSeedModule
 
setThreadNumber(int) - Method in class org.archive.modules.CrawlURI
Set the number of the ToeThread responsible for processing this uri.
setTimeoutPerRegexSeconds(long) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
The timeout for regular expression matching, in seconds.
setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchFTP
If the fetch is not completed in this number of seconds, give up (and retry later).
setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchHTTP
If the fetch is not completed in this number of seconds, give up (and retry later).
setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchSFTP
If the fetch is not completed in this number of seconds, give up (and retry later).
setTooLongDirectory(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setTotalBytesWritten(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setTreatFramesAsEmbedLinks(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
If true, FRAME/IFRAME SRC-links are treated as embedded resources (like IMG, 'E' hop-type), otherwise they are treated as navigational links.
setUnderscoreSet(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
setUnresolvable(CrawlURI, CrawlHost) - Method in class org.archive.modules.fetcher.FetchDNS
 
setUp() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
setupCopyEnvironment(File) - Static method in class org.archive.modules.recrawl.PersistProcessor
 
setupCopyEnvironment(File, boolean) - Static method in class org.archive.modules.recrawl.PersistProcessor
 
setUpperBound(Integer) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
Sets the upper bound on the range of acceptable status codes.
setUpperBound(Integer) - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
Sets the upper bound on the range of acceptable status codes.
setUpperBound(long) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
The rule will apply if the url has been fetched and content body length is less than or equal to this number of bytes.
setupPool(AtomicInteger) - Method in class org.archive.modules.writer.ARCWriterProcessor
 
setupPool(AtomicInteger) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
setupPool(AtomicInteger) - Method in class org.archive.modules.writer.WriterPoolProcessor
Set up pool of files.
setupSimpleLog(String) - Method in interface org.archive.modules.SimpleFileLoggerProvider
 
setUriRegex(String) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
Regular expression against which to match the URI.
setUrlPattern(String) - Method in class org.archive.modules.extractor.ExtractorSitemap
If urlPattern is not null then any url marked as a sitemap and matching the pattern is assumed to be a sitemap.
setUseHeaderLength(boolean) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
Shall this rule be used as a midfetch rule? If true, this rule will determine content length based on HTTP header information, otherwise the size of the already downloaded content will be used.
setUseHTTP11(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
Use HTTP/1.1.
setUsePreset(MatchesFilePatternDecideRule.Preset) - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
 
setUserAgent(String) - Method in class org.archive.modules.CrawlURI
Set the user agent to use when crawling this URI.
setUserAgentProvider(UserAgentProvider) - Method in class org.archive.modules.fetcher.FetchHTTP
 
setUserAgentTemplate(String) - Method in class org.archive.modules.CrawlMetadata
 
setUsername(String) - Method in class org.archive.modules.fetcher.FetchFTP
The username to send to FTP servers.
setUsername(String) - Method in class org.archive.modules.fetcher.FetchSFTP
The username to send to SFTP servers.
setVia(UURI) - Method in class org.archive.modules.CrawlURI
 
setWriteBufferSize(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
setWriteMetadata(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
Whether to write 'metadata' type records.
setWriteRequests(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
Whether to write 'request' type records.
setWriteRevisitForIdenticalDigests(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
setWriteRevisitForNotModified(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
sharedEngine - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
 
sharedEngine - Variable in class org.archive.modules.ScriptedProcessor
 
shortReportLegend() - Method in class org.archive.modules.CrawlURI
 
shortReportLegend() - Method in class org.archive.modules.fetcher.FetchStats
 
shortReportLegend() - Method in class org.archive.modules.ProcessorChain
 
shortReportLine() - Method in class org.archive.modules.CrawlURI
 
shortReportLine() - Method in class org.archive.modules.fetcher.FetchStats
 
shortReportLineTo(PrintWriter) - Method in class org.archive.modules.CrawlURI
 
shortReportLineTo(PrintWriter) - Method in class org.archive.modules.fetcher.FetchStats
 
shortReportLineTo(PrintWriter) - Method in class org.archive.modules.ProcessorChain
 
shortReportMap() - Method in class org.archive.modules.CrawlURI
 
shortReportMap() - Method in class org.archive.modules.fetcher.FetchStats
 
shortReportMap() - Method in class org.archive.modules.ProcessorChain
 
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.DnsResponseRecordBuilder
 
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.FtpControlConversationRecordBuilder
 
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.FtpResponseRecordBuilder
 
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.HttpRequestRecordBuilder
 
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.HttpResponseRecordBuilder
 
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.MetadataRecordBuilder
If you don't want metadata records, take this class out of the chain.
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.RevisitRecordBuilder
 
shouldBuildRecord(CrawlURI) - Method in interface org.archive.modules.warc.WARCRecordBuilder
Decides whether to build a record for the given capture.
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.WhoisResponseRecordBuilder
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
Determines if otherwise valid URIs should have links extracted or not.
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorCSS
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorDOC
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorJS
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDF
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorRobotsTxt
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSitemap
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorUniversal
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorXML
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.TrapSuppressExtractor
 
shouldLoad(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
Whether the current CrawlURI's state should be loaded
shouldMasquerade - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
whether to adopt the user-agent that is allowed for the fetch
shouldMasquerade - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
whether to adopt the user-agent that is allowed for the fetch
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
Determines if links should be extracted from the given URI.
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTTP
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.HTTPContentDigest
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchDNS
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchFTP
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
Whether this processor can fetch the given CrawlURI.
shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchSFTP
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.forms.ExtractorHTMLForms
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.Processor
Determines whether the given uri should be processed by this processor.
shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistStoreProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.ScriptedProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
 
shouldStore(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
Whether the current CrawlURI's state should be persisted (to log or direct to database)
shouldWrite(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
shouldWrite(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
Whether the given CrawlURI should be written to archive files.
SimpleCookieStore - Class in org.archive.modules.fetcher
In-memory cookie store, mostly for testing.
SimpleCookieStore() - Constructor for class org.archive.modules.fetcher.SimpleCookieStore
 
SimpleFileLoggerProvider - Interface in org.archive.modules
 
SimpleLinkContext(String) - Constructor for class org.archive.modules.extractor.LinkContext.SimpleLinkContext
 
size() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
size() - Method in class org.archive.modules.ProcessorChain
 
skipIdenticalDigests - Variable in class org.archive.modules.writer.WriterPoolProcessor
Whether to skip the writing of a record when URI history information is available and indicates the prior fetch had an identical content digest.
socketFactory - Variable in class org.archive.modules.fetcher.FetchFTP
 
SocketFactoryWithTimeout() - Constructor for class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
 
socksProxyHost - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
socksProxyPort - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
SocksSocketFactory - Class in org.archive.modules.fetcher
 
SocksSocketFactory() - Constructor for class org.archive.modules.fetcher.SocksSocketFactory
 
SocksSSLSocketFactory - Class in org.archive.modules.fetcher
 
SocksSSLSocketFactory(SSLContext) - Constructor for class org.archive.modules.fetcher.SocksSSLSocketFactory
 
sortableKey(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
Returns a string that uniquely identifies the cookie. The format of the key is "normalizedDomain;name;path".
SOURCE_DATA_ORIGINAL_SET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
SOURCE_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
 
SourceSeedDecideRule - Class in org.archive.modules.deciderules
Rule applies the configured decision for any URI discovered from one of the seeds in sourceSeeds.
SourceSeedDecideRule() - Constructor for class org.archive.modules.deciderules.SourceSeedDecideRule
 
sourceSeeds - Variable in class org.archive.modules.deciderules.SourceSeedDecideRule
 
sourceTagSeeds - Variable in class org.archive.modules.seeds.SeedModule
Whether to tag seeds with their own URI as a heritable 'source' String, which will be carried-forward to all URIs discovered on paths originating from that seed.
specialQueryTemplates - Variable in class org.archive.modules.fetcher.FetchWhois
 
SPECULATIVE_MISC - Static variable in class org.archive.modules.extractor.LinkContext
Stand-in value for speculative/aggressively extracted urls without other context.
sslContext - Variable in class org.archive.modules.fetcher.FetchHTTP
 
sslContext() - Method in class org.archive.modules.fetcher.FetchHTTP
 
sslTrustLevel - Variable in class org.archive.modules.fetcher.FetchHTTP
 
STANDARD_POLICIES - Static variable in class org.archive.modules.net.RobotsPolicy
 
start() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
start() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
start() - Method in class org.archive.modules.fetcher.FetchHTTP
 
start() - Method in class org.archive.modules.fetcher.FetchWhois
 
start() - Method in class org.archive.modules.net.BdbServerCache
 
start() - Method in class org.archive.modules.Processor
 
start() - Method in class org.archive.modules.ProcessorChain
 
start() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
start() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
 
start() - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
start() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
 
start() - Method in interface org.archive.modules.SimpleFileLoggerProvider
 
start() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
startCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
startCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
 
startCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
 
startCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
 
startCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
 
startCheckpoint(Checkpoint) - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
startNewFilesOnCheckpoint - Variable in class org.archive.modules.writer.WriterPoolProcessor
 
stats - Variable in class org.archive.modules.writer.BaseWARCWriterProcessor
 
STATUS_CODE_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
statusCodes - Variable in class org.archive.modules.deciderules.FetchStatusDecideRule
 
stop() - Method in class org.archive.modules.deciderules.DecideRuleSequence
 
stop() - Method in class org.archive.modules.fetcher.AbstractCookieStore
 
stop() - Method in class org.archive.modules.fetcher.FetchHTTP
 
stop() - Method in class org.archive.modules.fetcher.FetchWhois
 
stop() - Method in class org.archive.modules.net.BdbServerCache
 
stop() - Method in class org.archive.modules.Processor
 
stop() - Method in class org.archive.modules.ProcessorChain
 
stop() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
stop() - Method in class org.archive.modules.recrawl.PersistLogProcessor
 
stop() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
 
stop() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
store(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractContentDigestHistory
Stores curi.getContentDigestHistory() for the key persistKeyFor(curi).
store - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
 
store(CrawlURI) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
 
store - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
 
storeDNSRecord(CrawlURI, String, CrawlHost, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
 
storePaths - Variable in class org.archive.modules.writer.WriterPoolProcessor
Where to save files.
StringExtractorTestBase - Class in org.archive.modules.extractor
 
StringExtractorTestBase() - Constructor for class org.archive.modules.extractor.StringExtractorTestBase
 
StringExtractorTestBase.TestData - Class in org.archive.modules.extractor
 
StripExtraSlashes - Class in org.archive.modules.canonicalize
Strip any extra slashes, '/', found in the path.
StripExtraSlashes() - Constructor for class org.archive.modules.canonicalize.StripExtraSlashes
 
StripSessionCFIDs - Class in org.archive.modules.canonicalize
Strip ColdFusion session ids.
StripSessionCFIDs() - Constructor for class org.archive.modules.canonicalize.StripSessionCFIDs
 
StripSessionIDs - Class in org.archive.modules.canonicalize
Strip known session ids.
StripSessionIDs() - Constructor for class org.archive.modules.canonicalize.StripSessionIDs
 
stripToMinimal() - Method in class org.archive.modules.CrawlURI
Remove all attributes set on this uri.
StripUserinfoRule - Class in org.archive.modules.canonicalize
Strip any 'userinfo' found on http/https URLs.
StripUserinfoRule() - Constructor for class org.archive.modules.canonicalize.StripUserinfoRule
 
StripWWWNRule - Class in org.archive.modules.canonicalize
Strip any 'www[0-9]*' found on http/https URLs IF they have some path/query component (content after third slash).
StripWWWNRule() - Constructor for class org.archive.modules.canonicalize.StripWWWNRule
 
StripWWWRule - Class in org.archive.modules.canonicalize
Strip any 'www' found on http/https URLs, IF they have some path/query component (content after third slash).
StripWWWRule() - Constructor for class org.archive.modules.canonicalize.StripWWWRule
 
subList(int, int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
submitStatusFor(String) - Method in class org.archive.modules.forms.FormLoginProcessor
 
subset(CrawlURI, Class<?>) - Method in class org.archive.modules.credential.CredentialStore
Return set made up of all credentials of the passed type.
subset(CrawlURI, Class<?>, String) - Method in class org.archive.modules.credential.CredentialStore
Return set made up of all credentials of the passed type.
substats - Variable in class org.archive.modules.net.CrawlHost
 
substats - Variable in class org.archive.modules.net.CrawlServer
 
SUCCESS_BYTES - Static variable in class org.archive.modules.fetcher.FetchStats
 
suffixAtEnd - Variable in class org.archive.modules.writer.MirrorWriterProcessor
If true, the suffix is placed at the end of the path, after the query (if any).
summary() - Method in class org.archive.crawler.util.CrawledBytesHistotable
 
SurtPrefixedDecideRule - Class in org.archive.modules.deciderules.surt
Rule applies configured decision to any URIs that, when expressed in SURT form, begin with one of the prefixes in the configured set.
SurtPrefixedDecideRule() - Constructor for class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
surtPrefixes - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
 
surtPrefixes - Variable in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
 
surtsDumpFile - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Dump file to save SURT prefixes actually used; useful for debugging SURTs.
surtsSource - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
Text from which to infer SURT prefixes.
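
The StripWWWNRule entry above describes removing a 'www[0-9]*' host prefix from http/https URLs only when there is content after the third slash. The following is a minimal sketch of that behaviour, not the library's implementation: the class name StripWwwNSketch, the method strip, and the regular expression are all hypothetical, and "content after third slash" is assumed to mean at least one character following it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Heritrix StripWWWNRule class.
class StripWwwNSketch {
    // Assumed pattern: http/https scheme, "www" plus optional digits and a dot,
    // then the rest of the host, the third slash, and at least one more character.
    private static final Pattern WWWN =
            Pattern.compile("(?i)^(https?://)www[0-9]*\\.([^/]+/.+)$");

    // Returns the URL without its www/wwwN prefix when the pattern matches,
    // otherwise returns the input unchanged.
    static String strip(String url) {
        Matcher m = WWWN.matcher(url);
        return m.matches() ? m.group(1) + m.group(2) : url;
    }

    public static void main(String[] args) {
        System.out.println(strip("http://www2.example.com/a?b=1"));
        System.out.println(strip("http://www.example.com/"));
    }
}
```

Under these assumptions a URL with nothing after the third slash, such as the second one in main, is left untouched, which mirrors the "IF they have some path/query component" condition in the description.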

T

tagDefineButton(int, Vector) - Method in class org.archive.modules.extractor.CustomSWFTags
 
tagDefineButton2(int, boolean, Vector) - Method in class org.archive.modules.extractor.CustomSWFTags
 
tagDefineSprite(int) - Method in class org.archive.modules.extractor.CustomSWFTags
 
tagDoAction() - Method in class org.archive.modules.extractor.CustomSWFTags
 
tagDoInActions(int) - Method in class org.archive.modules.extractor.CustomSWFTags
 
tagDoInitAction(int) - Method in class org.archive.modules.extractor.CustomSWFTags
 
tagPlaceObject2(boolean, int, int, int, Matrix, AlphaTransform, int, String, int) - Method in class org.archive.modules.extractor.CustomSWFTags
 
tally(CrawlURI, FetchStats.Stage) - Method in interface org.archive.modules.fetcher.FetchStats.CollectsFetchStats
 
tally(CrawlURI, FetchStats.Stage) - Method in class org.archive.modules.fetcher.FetchStats
 
targetHost - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 
TempDirProvider - Interface in org.archive.modules.extractor
 
template - Variable in class org.archive.modules.writer.WriterPoolProcessor
Template from which a filename is interpolated.
test(int) - Method in class org.archive.modules.deciderules.ResourceLongerThanDecideRule
 
test(int) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
 
TestData(CrawlURI, CrawlURI) - Constructor for class org.archive.modules.extractor.StringExtractorTestBase.TestData
 
testExtraction() - Method in class org.archive.modules.extractor.StringExtractorTestBase
Tests each text/URI pair in the test data array.
testFinished() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
Tests that a URI whose linkExtractionFinished flag has been set has no links extracted.
testSerializationIfAppropriate() - Method in class org.archive.state.ModuleTestBase
Tests that the module can be serialized.
testZeroContent() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
Tests that a URI with a zero content length has no links extracted.
TextSeedModule - Class in org.archive.modules.seeds
Module that announces a list of seeds from a text source (such as a ConfigFile or ConfigString), and provides a mechanism for adding seeds after a crawl has begun.
TextSeedModule() - Constructor for class org.archive.modules.seeds.TextSeedModule
 
textSource - Variable in class org.archive.modules.seeds.TextSeedModule
Text from which to extract seeds
threadEngine - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
 
threadEngine - Variable in class org.archive.modules.ScriptedProcessor
 
TIMER_TRUNC - Static variable in interface org.archive.modules.CoreAttributeConstants
 
TIMER_TRUNC - Static variable in class org.archive.modules.fetcher.FetchErrors
 
TLDs - Static variable in class org.archive.modules.extractor.ExtractorUniversal
Matches any string that begins with a TLD (no .) followed by a '/' slash or end of string.
toArray() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
toArray(T[]) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
 
toCheckpointJson() - Method in class org.archive.modules.extractor.Extractor
 
toCheckpointJson() - Method in class org.archive.modules.forms.FormLoginProcessor
 
toCheckpointJson() - Method in class org.archive.modules.Processor
Return a JSONObject of current state that can be consulted on recovery to restore necessary values.
toCheckpointJson() - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
toCheckpointJson() - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
toCheckpointJson() - Method in class org.archive.modules.writer.WriterPoolProcessor
 
tooLongDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
If all the directories in the URI would exceed, or come close to exceeding, the file system maximum path length, then they are all replaced by this.
TooManyHopsDecideRule - Class in org.archive.modules.deciderules
Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold.
TooManyHopsDecideRule() - Constructor for class org.archive.modules.deciderules.TooManyHopsDecideRule
Usual constructor.
TooManyPathSegmentsDecideRule - Class in org.archive.modules.deciderules
Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold.
TooManyPathSegmentsDecideRule() - Constructor for class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
Usual constructor.
toString() - Method in class org.archive.modules.CrawlURI
 
toString() - Method in class org.archive.modules.extractor.HTMLLinkContext
 
toString() - Method in class org.archive.modules.extractor.LinkContext.SimpleLinkContext
 
toString() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
 
toString() - Method in class org.archive.modules.forms.HTMLForm.FormInput
 
toString() - Method in class org.archive.modules.forms.HTMLForm
 
toString() - Method in class org.archive.modules.net.CrawlHost
 
toString() - Method in class org.archive.modules.net.CrawlServer
 
TOTAL_BYTES - Static variable in class org.archive.modules.fetcher.FetchStats
 
TOTAL_SCHEDULED - Static variable in class org.archive.modules.fetcher.FetchStats
 
TransclusionDecideRule - Class in org.archive.modules.deciderules
Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see CrawlURI.getPathFromSeed()) ends with at least one, but not more than the given number of, non-navlink ('L') hops.
TransclusionDecideRule() - Constructor for class org.archive.modules.deciderules.TransclusionDecideRule
Usual constructor.
TrapSuppressExtractor - Class in org.archive.modules.extractor
Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
TrapSuppressExtractor() - Constructor for class org.archive.modules.extractor.TrapSuppressExtractor
Usual constructor.
TRUNC_SUFFIX - Static variable in interface org.archive.modules.CoreAttributeConstants
Fetch truncation codes present in CrawlURI annotations.
TRUNC_SUFFIX - Static variable in class org.archive.modules.fetcher.FetchErrors
Fetch truncation codes present in ProcessorURI annotations.
type - Variable in class org.archive.modules.forms.HTMLForm.FormInput
 
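
The TooManyPathSegmentsDecideRule entry above counts path segments as the number of '/' characters, not including the first '//'. A minimal sketch of that counting rule follows. The class PathSegmentCountSketch, its methods, and the parameter maxPathDepth are hypothetical names chosen for illustration; this is not the rule's actual code or configuration.

```java
// Hypothetical illustration only; not the Heritrix TooManyPathSegmentsDecideRule class.
class PathSegmentCountSketch {
    // Counts '/' characters that appear after the first "//" in the URI string.
    // If there is no "//", every '/' in the string is counted.
    static int countPathSegments(String uri) {
        int doubleSlash = uri.indexOf("//");
        int from = (doubleSlash >= 0) ? doubleSlash + 2 : 0;
        int count = 0;
        for (int i = from; i < uri.length(); i++) {
            if (uri.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    // True when the count is over the given threshold (assumed parameter name).
    static boolean overThreshold(String uri, int maxPathDepth) {
        return countPathSegments(uri) > maxPathDepth;
    }

    public static void main(String[] args) {
        System.out.println(countPathSegments("http://example.com/a/b/c"));
        System.out.println(overThreshold("http://example.com/a/b/c", 2));
    }
}
```

A rule built on this count would presumably REJECT a URI for which overThreshold returns true, matching the "over a given threshold" wording in the description.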

U

ULTRA_SUFFIX_WHOIS_SERVER - Static variable in class org.archive.modules.fetcher.FetchWhois
 
UNCALCULATED - Static variable in class org.archive.modules.CrawlURI
 
underscoreSet - Variable in class org.archive.modules.writer.MirrorWriterProcessor
If a directory name appears (case-insensitive) in this list then an underscore is placed before it.
updateMetadataAfterWrite(CrawlURI, WARCWriter, long) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
 
updateRobots(CrawlURI) - Method in class org.archive.modules.net.CrawlServer
Update the server's robotstxt
uri - Variable in class org.archive.modules.extractor.StringExtractorTestBase.TestData
 
URI_HISTORY_DBNAME - Static variable in class org.archive.modules.recrawl.PersistProcessor
name of history Database
UriCanonicalizationPolicy - Class in org.archive.modules.canonicalize
URI Canonicalization Policy
UriCanonicalizationPolicy() - Constructor for class org.archive.modules.canonicalize.UriCanonicalizationPolicy
 
uriCount - Variable in class org.archive.modules.Processor
The number of URIs processed by this processor.
UriErrorLoggerModule - Interface in org.archive.modules.extractor
 
URL_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
 
urlsWritten - Variable in class org.archive.modules.writer.BaseWARCWriterProcessor
 
UserAgentProvider - Interface in org.archive.modules.fetcher
 
useSocksProxy - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
 

V

validate(Pattern, String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
 
VALIDATOR - Static variable in class org.archive.modules.CrawlMetadata
 
validRobots - Variable in class org.archive.modules.net.CrawlServer
 
value - Variable in class org.archive.modules.forms.HTMLForm.FormInput
 
value - Variable in class org.archive.modules.forms.HTMLForm.NameValue
 
valueOf(String) - Static method in enum org.archive.modules.CrawlURI.FetchType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.archive.modules.credential.HtmlFormCredential.Method
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.archive.modules.deciderules.DecideResult
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.archive.modules.extractor.Hop
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.archive.modules.fetcher.FetchStats.Stage
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.archive.modules.fetcher.FetchWhois.UrlStatus
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.archive.modules.ProcessResult.ProcessStatus
Returns the enum constant of this type with the specified name.
values() - Static method in enum org.archive.modules.CrawlURI.FetchType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.archive.modules.credential.HtmlFormCredential.Method
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.archive.modules.deciderules.DecideResult
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.archive.modules.extractor.Hop
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.archive.modules.fetcher.FetchStats.Stage
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.archive.modules.fetcher.FetchWhois.UrlStatus
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.archive.modules.ProcessResult.ProcessStatus
Returns an array containing the constants of this enum type, in the order they are declared.
verifySerialization(Object, byte[], Object, byte[]) - Method in class org.archive.state.ModuleTestBase
Verifies that serialization was successful.
ViaSurtPrefixedDecideRule - Class in org.archive.modules.deciderules
Rule applies the configured decision for any URI which has a 'via' whose SURT form matches any SURT specified in the surtPrefixes list.
ViaSurtPrefixedDecideRule() - Constructor for class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
 

W

WARC_NOVEL_CONTENT_BYTES - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
WARC_NOVEL_URLS - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
 
warcHeaderFor(String) - Method in class org.archive.modules.forms.FormLoginProcessor
 
WARCRecordBuilder - Interface in org.archive.modules.warc
Implementations of this interface are each responsible for building a particular type of WARC record.
WARCWriterChainProcessor - Class in org.archive.modules.writer
WARC writer processor.
WARCWriterChainProcessor() - Constructor for class org.archive.modules.writer.WARCWriterChainProcessor
 
WARCWriterProcessor - Class in org.archive.modules.writer
Deprecated.
WARCWriterProcessor() - Constructor for class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
WHOIS_SERVER_REGEX - Static variable in class org.archive.modules.fetcher.FetchWhois
 
WhoisResponseRecordBuilder - Class in org.archive.modules.warc
 
WhoisResponseRecordBuilder() - Constructor for class org.archive.modules.warc.WhoisResponseRecordBuilder
 
wildcardDirectives - Variable in class org.archive.modules.net.Robotstxt
 
write(CrawlURI, long, InputStream, String) - Method in class org.archive.modules.writer.ARCWriterProcessor
 
write(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
write(String, CrawlURI) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeArchiveInfoPart(String, CrawlURI, ReplayInputStream, OutputStream) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
writeBufferSize - Variable in class org.archive.modules.writer.WriterPoolProcessor
Size of buffer in front of disk-writing.
writeContentPart(String, CrawlURI, ReplayInputStream, OutputStream) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
writeDnsRecords(CrawlURI, WARCWriter, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeFtpControlConversation(WARCWriter, String, URI, CrawlURI, ANVLRecord, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeFtpRecords(WARCWriter, CrawlURI, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeHeaderPart(String, ReplayInputStream, OutputStream) - Method in class org.archive.modules.writer.Kw3WriterProcessor
 
writeHttpRecords(CrawlURI, WARCWriter, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeMetadata(WARCWriter, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeMimeFile(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
The actual writing of the Kulturarw3 MIME-file.
writeRecords(CrawlURI, WARCWriter) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
 
writeRequest(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeResource(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeResponse(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeRevisit(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
writeRevisit(WARCWriter, String, String, URI, CrawlURI, ANVLRecord, long) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 
WriterPoolProcessor - Class in org.archive.modules.writer
Abstract implementation of a file pool processor.
WriterPoolProcessor() - Constructor for class org.archive.modules.writer.WriterPoolProcessor
 
writeWhoisRecords(WARCWriter, CrawlURI, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
Deprecated.
 

Copyright © 2003–2022 Internet Archive. All rights reserved.