Index
A
- A_ANNOTATIONS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
shorthand string tokens indicating notable occurrences, separated by commas
- A_CONTENT_DIGEST - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
content digest
- A_CONTENT_DIGEST_COUNT - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
number of times we've seen this content digest (1 original + n duplicates)
- A_CONTENT_DIGEST_HISTORY - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
content digest history map
- A_CONTENT_TYPE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Extracted MIME type of fetched content; should be set immediately by fetching module if possible (rather than waiting for a later analyzer)
- A_CREDENTIALS_KEY - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Key to get credential avatars from A_LIST.
- A_DELAY_FACTOR - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Multiplier of last fetch duration to wait before fetching another item of the same class (eg host)
- A_DISTANCE_FROM_SEED - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_DNS_FETCH_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_DNS_SERVER_IP_LABEL - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_ETAG_HEADER - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
header name (and AList key) for ETag
- A_FETCH_BEGAN_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_FETCH_COMPLETED_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_FETCH_HISTORY - Static variable in class org.archive.modules.CrawlURI
-
fetch history array
- A_FETCH_HISTORY - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
Deprecated.
- A_FORCE_RETIRE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
flag indicating the containing queue should be retired
- A_FORM_OFFSETS - Static variable in class org.archive.modules.extractor.ExtractorHTML
- A_FTP_CONTROL_CONVERSATION - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_FTP_FETCH_STATUS - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_HERITABLE_KEYS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Key to (optional) attribute specifying a list of keys that are passed to CandidateURIs that 'descend' (are discovered via) this URI.
- A_HREF - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- A_HTML_BASE - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_HTML_FORM_OBJECTS - Static variable in class org.archive.modules.forms.ExtractorHTMLForms
- A_HTTP_AUTH_CHALLENGES - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_HTTP_PROXY_HOST - Static variable in interface org.archive.modules.CoreAttributeConstants
-
local override of proxy host
- A_HTTP_PROXY_PORT - Static variable in interface org.archive.modules.CoreAttributeConstants
-
local override of proxy port
- A_HTTP_RESPONSE_HEADERS - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_LAST_MODIFIED_HEADER - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
header name (and AList key) for last-modified timestamp
- A_META_ROBOTS - Static variable in class org.archive.modules.extractor.ExtractorHTML
- A_MINIMUM_DELAY - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Minimum delay before fetching another item of the same class (eg host).
- A_MIRROR_PATH - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Define for org.archive.crawler.writer.MirrorWriterProcessor.
- A_MIRROR_PATH - Static variable in class org.archive.modules.writer.MirrorWriterProcessor
- A_NONFATAL_ERRORS - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_ORIGINAL_DATE - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
date content payload was written
- A_ORIGINAL_URL - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
url that the content payload was written for
- A_PRECALC_PRECEDENCE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
key to attribute containing pre-calculated precedence
- A_PREREQUISITE_URI - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_REFERENCE_LENGTH - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
reference length (content length or virtual length)
- A_RETRY_DELAY - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_RRECORD_SET_LABEL - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_RUNTIME_EXCEPTION - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_SOURCE_TAG - Static variable in interface org.archive.modules.CoreAttributeConstants
-
a 'source' (usu.
- A_STATUS - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
key for status (when in history)
- A_SUBMIT_DATA - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_SUBMIT_ENCTYPE - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_VIA_DIGEST - Static variable in class org.archive.modules.extractor.TrapSuppressExtractor
-
AList attribute key for carrying-forward content-digest from 'via'
- A_WARC_FILE_OFFSET - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
offset into warc file of warc record with content payload
- A_WARC_FILENAME - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
warc filename containing the content payload
- A_WARC_RECORD_ID - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
warc record id of warc record with the content payload
- A_WARC_RESPONSE_HEADERS - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_WARC_STATS - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_WHOIS_SERVER_IP - Static variable in interface org.archive.modules.CoreAttributeConstants
- A_WRITE_TAG - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
Writer processors of all types are encouraged to put a 'writeTag' (analogous to HTTP 'etag') in the CrawlURI state.
- aboutToLog() - Method in class org.archive.modules.CrawlURI
-
Notify CrawlURI it is about to be logged; opportunity for self-annotation
- ABS_HTTP_URI_PATTERN - Static variable in class org.archive.modules.extractor.ExtractorURI
- AbstractContentDigestHistory - Class in org.archive.modules.recrawl
-
Represents a store of information, presumably persistent, keyed by content digest.
- AbstractContentDigestHistory() - Constructor for class org.archive.modules.recrawl.AbstractContentDigestHistory
- AbstractCookieStore - Class in org.archive.modules.fetcher
- AbstractCookieStore() - Constructor for class org.archive.modules.fetcher.AbstractCookieStore
- AbstractCookieStore.LimitedCookieStoreFacade - Class in org.archive.modules.fetcher
- AbstractPersistProcessor - Class in org.archive.modules.recrawl
- AbstractPersistProcessor() - Constructor for class org.archive.modules.recrawl.AbstractPersistProcessor
- AbstractProfile - Class in org.archive.modules.revisit
- AbstractProfile() - Constructor for class org.archive.modules.revisit.AbstractProfile
- ACCEPT - org.archive.modules.deciderules.DecideResult
-
Indicates the URI was accepted.
- AcceptDecideRule - Class in org.archive.modules.deciderules
- AcceptDecideRule() - Constructor for class org.archive.modules.deciderules.AcceptDecideRule
- accepts(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
- accumulate(CrawlURI) - Method in class org.archive.crawler.util.CrawledBytesHistotable
- action - Variable in class org.archive.modules.forms.HTMLForm
- actions - Variable in class org.archive.modules.extractor.CustomSWFTags
- actOn(File) - Method in class org.archive.modules.seeds.SeedModule
- actOn(File) - Method in class org.archive.modules.seeds.TextSeedModule
-
Treat the given file as a source of additional seeds, announcing to SeedListeners.
- add(int, T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- add(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
- add(T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- addAll(int, Collection<? extends T>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- addAll(Collection<? extends T>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- addAllow(String) - Method in class org.archive.modules.net.RobotsDirectives
- addAnnotations(CrawlURI, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
- addContentLocationHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
- addCookie(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- addCookie(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
- addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.BdbCookieStore
- addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.SimpleCookieStore
- addCredential(Credential) - Method in class org.archive.modules.net.CrawlServer
-
Add an avatar.
- addDisallow(String) - Method in class org.archive.modules.net.RobotsDirectives
- addedCredentials - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- addedSeed(CrawlURI) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
If appropriate, convert seed notification into prefix-addition.
- addedSeed(CrawlURI) - Method in interface org.archive.modules.seeds.SeedListener
- addExtraInfo(String, Object) - Method in class org.archive.modules.CrawlURI
- addField(String, String, String) - Method in class org.archive.modules.forms.HTMLForm
-
Add a discovered INPUT, tracking it as potential username/password receiver.
- addField(String, String, String, boolean) - Method in class org.archive.modules.forms.HTMLForm
-
Add a discovered INPUT, tracking it as potential username/password receiver.
- addHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
- addHeaderLink(CrawlURI, String, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
- addIfNotBlank(ANVLRecord, String, String) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- addLinkFromString(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
- addOutlink(CrawlURI, String, LinkContext, Hop) - Method in class org.archive.modules.extractor.Extractor
-
Create and add a 'Link' to the CrawlURI with given URI/context/hop-type
- addOutlink(CrawlURI, UURI, LinkContext, Hop) - Method in class org.archive.modules.extractor.Extractor
- AddRedirectFromRootServerToScope - Class in org.archive.modules.deciderules
- AddRedirectFromRootServerToScope() - Constructor for class org.archive.modules.deciderules.AddRedirectFromRootServerToScope
- addRefreshHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
- addRelativeToBase(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
- addRelativeToVia(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
- addResponseContent(HttpResponse, CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
This method populates curi with response status and content type.
- addSeed(CrawlURI) - Method in class org.archive.modules.seeds.SeedModule
- addSeed(CrawlURI) - Method in class org.archive.modules.seeds.TextSeedModule
-
Add a new seed to scope.
- addSeedListener(SeedListener) - Method in class org.archive.modules.seeds.SeedModule
- addStats(Map<String, Map<String, Long>>) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- addWhoisLink(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchWhois
- addWhoisLinks(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
-
Adds outlinks to whois:{domain} and whois:{ipAddress}
- afterPropertiesSet() - Method in class org.archive.modules.CrawlMetadata
- afterPropertiesSet() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- afterPropertiesSet() - Method in class org.archive.modules.extractor.ExtractorHTML
- afterPropertiesSet() - Method in class org.archive.modules.ScriptedProcessor
- agentsToDirectives - Variable in class org.archive.modules.net.Robotstxt
- AggressiveExtractorHTML - Class in org.archive.modules.extractor
-
Extended version of ExtractorHTML with more aggressive javascript link extraction, where javascript code is parsed first with the general HTML tags regex and then by the javascript speculative link regex.
- AggressiveExtractorHTML() - Constructor for class org.archive.modules.extractor.AggressiveExtractorHTML
- ALL - org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
- allInputs - Variable in class org.archive.modules.forms.HTMLForm
- allows - Variable in class org.archive.modules.net.RobotsDirectives
- allows(String) - Method in class org.archive.modules.net.RobotsDirectives
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.CustomRobotsPolicy
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.IgnoreRobotsPolicy
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.ObeyRobotsPolicy
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.RobotsPolicy
- allowsAll() - Method in class org.archive.modules.net.Robotstxt
-
Does this policy effectively allow everything? (No disallows or timing (crawl-delay) directives?)
- analyze(CrawlURI, CharSequence) - Method in class org.archive.modules.forms.ExtractorHTMLForms
-
Run analysis: find form METHOD, ACTION, and all INPUT names/values. Log as configured.
- ANNOTATION_IS_SITEMAP - Static variable in class org.archive.modules.extractor.ExtractorRobotsTxt
- ANNOTATION_UNWRITTEN - Static variable in class org.archive.modules.writer.WriterPoolProcessor
-
CrawlURI annotation indicating no record was written.
- announceSeeds() - Method in class org.archive.modules.seeds.SeedModule
- announceSeeds() - Method in class org.archive.modules.seeds.TextSeedModule
-
Announce all seeds from configured source to SeedListeners (including nonseed lines mixed in).
- announceSeeds(CountDownLatch) - Method in class org.archive.modules.seeds.TextSeedModule
- announceSeedsFromReader(BufferedReader, CountDownLatch) - Method in class org.archive.modules.seeds.TextSeedModule
-
Announce all seeds (and nonseed possible-directive lines) from the given Reader
- appCtx - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
- appCtx - Variable in class org.archive.modules.ScriptedProcessor
- ARCHIVE_TIME_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- ARCWriterProcessor - Class in org.archive.modules.writer
-
Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format.
- ARCWriterProcessor() - Constructor for class org.archive.modules.writer.ARCWriterProcessor
- asAnnotation() - Method in class org.archive.modules.forms.HTMLForm
-
Provide abbreviated annotation, of the form...
- assertNoSideEffects(CrawlURI) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Asserts that the given URI has no URI errors, no localized errors, and no annotations.
- atProcessor(Processor) - Method in interface org.archive.modules.ProcessorChain.ChainStatusReceiver
- attach(CrawlURI) - Method in class org.archive.modules.credential.Credential
-
Attach this credentials avatar to the passed curi.
- ATTR_MAX_BYTES_WRITTEN - Static variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Max size for each file. Key for the maximum ARC bytes to write attribute.
- audience - Variable in class org.archive.modules.CrawlMetadata
- AUDIO - org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
- AUTH_SCHEME_REGISTRY - Static variable in class org.archive.modules.fetcher.FetchHTTP
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.CrawlURI
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.CrawlHost
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.CrawlServer
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.RobotsDirectives
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.Robotstxt
- availableRobotsPolicies - Variable in class org.archive.modules.CrawlMetadata
-
Map of all available RobotsPolicies, by name, to choose from.
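As a reading aid for the A_DELAY_FACTOR and A_MINIMUM_DELAY entries above, here is a minimal sketch of how a delay multiplier and a minimum delay could be combined into a wait time. The class and method names (PolitenessDelaySketch, waitMillis) are hypothetical, and the rule that the larger of the two values wins is an assumption for illustration, not something this index states about Heritrix.

```java
/** Hypothetical helper for illustration; not part of org.archive.modules. */
class PolitenessDelaySketch {

    /**
     * Returns how long to wait before fetching another item of the same
     * class (eg host).
     *
     * Assumption: the last fetch duration is scaled by the delay factor,
     * and the minimum delay applies whenever the scaled value is smaller.
     */
    static long waitMillis(long lastFetchDurationMs, double delayFactor, long minimumDelayMs) {
        long scaled = Math.round(lastFetchDurationMs * delayFactor);
        return Math.max(scaled, minimumDelayMs);
    }

    public static void main(String[] args) {
        // 400 ms * 5.0 = 2000 ms, below the 3000 ms minimum
        System.out.println(waitMillis(400, 5.0, 3000));  // prints 3000
        // 1200 ms * 5.0 = 6000 ms, above the minimum
        System.out.println(waitMillis(1200, 5.0, 3000)); // prints 6000
    }
}
```

Under this assumption a fast response never shortens the wait below the configured floor, while a slow response lengthens it in proportion.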
B
- BaseRule - Class in org.archive.modules.canonicalize
-
Base of all rules, applied when canonicalizing a URL, that are configurable via the Heritrix settings system.
- BaseRule() - Constructor for class org.archive.modules.canonicalize.BaseRule
-
Constructor.
- BaseWARCRecordBuilder - Class in org.archive.modules.warc
- BaseWARCRecordBuilder() - Constructor for class org.archive.modules.warc.BaseWARCRecordBuilder
- BaseWARCWriterProcessor - Class in org.archive.modules.writer
- BaseWARCWriterProcessor() - Constructor for class org.archive.modules.writer.BaseWARCWriterProcessor
- BasicExecutionAwareEntityEnclosingRequest - Class in org.archive.modules.fetcher
- BasicExecutionAwareEntityEnclosingRequest(String, String) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
- BasicExecutionAwareEntityEnclosingRequest(String, String, ProtocolVersion) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
- BasicExecutionAwareEntityEnclosingRequest(RequestLine) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
- BasicExecutionAwareRequest - Class in org.archive.modules.fetcher
- BasicExecutionAwareRequest(String, String) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareRequest
-
Creates an instance of this class using the given request method and URI.
- BasicExecutionAwareRequest(String, String, ProtocolVersion) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareRequest
-
Creates an instance of this class using the given request method, URI and the HTTP protocol version.
- BasicExecutionAwareRequest(RequestLine) - Constructor for class org.archive.modules.fetcher.BasicExecutionAwareRequest
-
Creates an instance of this class using the given request line.
- bdb - Variable in class org.archive.modules.fetcher.BdbCookieStore
- bdb - Variable in class org.archive.modules.fetcher.FetchWhois
- bdb - Variable in class org.archive.modules.net.BdbServerCache
- bdb - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
- bdb - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
- BdbContentDigestHistory - Class in org.archive.modules.recrawl
-
Bdb content digest history store.
- BdbContentDigestHistory() - Constructor for class org.archive.modules.recrawl.BdbContentDigestHistory
- BdbCookieStore - Class in org.archive.modules.fetcher
-
Cookie store using bdb for storage.
- BdbCookieStore() - Constructor for class org.archive.modules.fetcher.BdbCookieStore
- BdbCookieStore.RestrictedCollectionWrappedList<T> - Class in org.archive.modules.fetcher
-
A List implementation that wraps a Collection.
- BdbServerCache - Class in org.archive.modules.net
-
ServerCache backed by BDB big maps; the usual choice for crawls.
- BdbServerCache() - Constructor for class org.archive.modules.net.BdbServerCache
- beanName - Variable in class org.archive.modules.deciderules.DecideRuleSequence
- beanName - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- beanName - Variable in class org.archive.modules.Processor
- blockAwaitingSeedLines - Variable in class org.archive.modules.seeds.TextSeedModule
-
Number of lines of seeds-source to read on initial load before proceeding with crawl.
- buildAndAddOutlink(CrawlURI, Map<String, Object>) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- buildConnectionManager() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- buildPostRequestEntity(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.DnsResponseRecordBuilder
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.FtpControlConversationRecordBuilder
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.FtpResponseRecordBuilder
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.HttpRequestRecordBuilder
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.HttpResponseRecordBuilder
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.MetadataRecordBuilder
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.RevisitRecordBuilder
- buildRecord(CrawlURI, URI) - Method in interface org.archive.modules.warc.WARCRecordBuilder
-
Builds a warc record for this capture.
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.warc.WhoisResponseRecordBuilder
- buildSurtPrefixSet() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Construct the set of prefixes to use, from the seed list (which may include both URIs and '+'-prefixed directives).
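The BaseRule entry above describes rules applied while canonicalizing a URL. The following is a minimal sketch of that idea, assuming each rule maps one URL string to another and that rules run in list order. All names here (CanonicalizeSketch, applyAll, exampleRules) and the three example rules are hypothetical stand-ins; they do not reproduce the org.archive.modules.canonicalize API or the behavior of its actual rule classes.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical illustration of a rule chain; not Heritrix code. */
class CanonicalizeSketch {

    /** Applies each rule in order, feeding each result to the next rule. */
    static String applyAll(List<UnaryOperator<String>> rules, String url) {
        String current = url;
        for (UnaryOperator<String> rule : rules) {
            current = rule.apply(current);
        }
        return current;
    }

    /** Example rules, invented for this sketch. */
    static List<UnaryOperator<String>> exampleRules() {
        return List.of(
            // lowercase the whole URL
            u -> u.toLowerCase(Locale.ROOT),
            // drop a leading "www." after the http(s) scheme
            u -> u.replaceFirst("^(https?://)www\\.", "$1"),
            // collapse runs of slashes, leaving the "//" after the scheme alone
            u -> u.replaceAll("(?<!:)/{2,}", "/")
        );
    }

    public static void main(String[] args) {
        // prints http://example.com/a/b
        System.out.println(applyAll(exampleRules(), "HTTP://WWW.Example.com//a///b"));
    }
}
```

Because each rule sees the output of the previous one, the order of the list matters: here the "www." rule relies on the URL having been lowercased first.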
C
- calcOutputDirs() - Method in class org.archive.modules.writer.WriterPoolProcessor
- CandidateChain - Class in org.archive.modules
- CandidateChain() - Constructor for class org.archive.modules.CandidateChain
- candidatePasswordInputs - Variable in class org.archive.modules.forms.HTMLForm
- candidateUserAgents - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
-
list of user-agents to try; if any are allowed, a URI will be crawled
- candidateUserAgents - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
-
list of user-agents to try; if any are allowed, a URI will be crawled
- candidateUsernameInputs - Variable in class org.archive.modules.forms.HTMLForm
- CanonicalizationRule - Interface in org.archive.modules.canonicalize
-
A rule to apply when canonicalizing a URL.
- canonicalize(String) - Method in interface org.archive.modules.canonicalize.CanonicalizationRule
-
Apply this canonicalization rule.
- canonicalize(String) - Method in class org.archive.modules.canonicalize.FixupQueryString
- canonicalize(String) - Method in class org.archive.modules.canonicalize.LowercaseRule
- canonicalize(String) - Method in class org.archive.modules.canonicalize.RegexRule
- canonicalize(String) - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
-
Run the passed uuri through the list of rules.
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripExtraSlashes
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripSessionCFIDs
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripSessionIDs
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripUserinfoRule
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripWWWNRule
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripWWWRule
- canonicalize(String) - Method in class org.archive.modules.canonicalize.UriCanonicalizationPolicy
- canonicalString - Variable in class org.archive.modules.CrawlURI
- caseSensitiveFilesystem - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
True if the file system is case-sensitive, like UNIX.
- catalog - Variable in class org.archive.modules.extractor.PDFParser
- characterMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
This list is grouped in pairs.
- checkBytesWritten() - Method in class org.archive.modules.writer.WriterPoolProcessor
- checked - Variable in class org.archive.modules.forms.HTMLForm.FormInput
- checkMidfetchAbort(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
- chmod - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Should permissions be changed for the newly created dirs.
- chmodValue - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
What should the permissions be set to.
- chooseAuthScheme(Map<String, String>, String) - Method in class org.archive.modules.fetcher.FetchHTTP
- cleanup(CrawlURI, Exception, String, int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Cleanup after a failed method execute.
- clear() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- clear() - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
- clear() - Method in class org.archive.modules.fetcher.BdbCookieStore
- clear() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- clear() - Method in class org.archive.modules.fetcher.SimpleCookieStore
- clearExpired(Date) - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
- clearExpired(Date) - Method in class org.archive.modules.fetcher.BdbCookieStore
- clearExpired(Date) - Method in class org.archive.modules.fetcher.SimpleCookieStore
- clearPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
-
Clear prerequisite, if any.
- close() - Method in class org.archive.modules.fetcher.DefaultServerCache
-
Called when shutting down the cache so we can do clean up.
- close() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
- collection - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Name of collection.
- COLLECTION_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- comment - Variable in class org.archive.modules.deciderules.DecideRule
- compareTo(CrawlURI) - Method in class org.archive.modules.CrawlURI
- compress - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Whether to gzip-compress files when writing to disk; by default true, meaning do-compress.
- concludedSeedBatch() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- concludedSeedBatch() - Method in interface org.archive.modules.seeds.SeedListener
- configureHttpClientBuilder() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- configureRequest() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- configureRequestHeaders() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- connectTimeoutMs - Variable in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- connMan - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- consecutiveConnectionErrors - Variable in class org.archive.modules.net.CrawlServer
- considerIfLikelyUri(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Consider whether a given string is URI-like.
- considerQueryStringValues(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Consider a query-string-like collection of key=value[&key=value] pairs for URI-like strings in the values.
- considerString(Extractor, CrawlURI, boolean, String) - Method in class org.archive.modules.extractor.ExtractorJS
- considerStringAsUri(String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
- considerStrings(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorJS
- considerStrings(Extractor, CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorJS
- considerStrings(Extractor, CrawlURI, CharSequence, boolean) - Method in class org.archive.modules.extractor.ExtractorJS
- constructRegex(int) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
- contains(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- containsAll(Collection<?>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- containsContentTypeCharsetDeclaration() - Method in class org.archive.modules.CrawlURI
- containsDataKey(String) - Method in class org.archive.modules.CrawlURI
- containsHost(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
- containsServer(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
- CONTENT_LENGTH_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- CONTENT_MD5_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- CONTENT_TYPE_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- contentDigestHistory - Variable in class org.archive.modules.recrawl.ContentDigestHistoryLoader
- contentDigestHistory - Variable in class org.archive.modules.recrawl.ContentDigestHistoryStorer
- ContentDigestHistoryLoader - Class in org.archive.modules.recrawl
- ContentDigestHistoryLoader() - Constructor for class org.archive.modules.recrawl.ContentDigestHistoryLoader
- ContentDigestHistoryStorer - Class in org.archive.modules.recrawl
- ContentDigestHistoryStorer() - Constructor for class org.archive.modules.recrawl.ContentDigestHistoryStorer
- ContentExtractor - Class in org.archive.modules.extractor
-
Extracts links from the fetched content of a URI, as opposed to its headers.
- ContentExtractor() - Constructor for class org.archive.modules.extractor.ContentExtractor
- ContentExtractorTestBase - Class in org.archive.modules.extractor
-
Abstract base class for unit testing ContentExtractor implementations.
- ContentExtractorTestBase() - Constructor for class org.archive.modules.extractor.ContentExtractorTestBase
- ContentLengthDecideRule - Class in org.archive.modules.deciderules
- ContentLengthDecideRule() - Constructor for class org.archive.modules.deciderules.ContentLengthDecideRule
-
Usual constructor.
- contentTypeMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
This list is grouped in pairs.
- ContentTypeMatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
DecideRule whose decision is applied if the URI's content-type is present and matches the supplied regular expression.
- ContentTypeMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule
- ContentTypeNotMatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
DecideRule whose decision is applied if the URI's content-type is present and does not match the supplied regular expression.
- ContentTypeNotMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule
- cookieComparator - Static variable in class org.archive.modules.fetcher.AbstractCookieStore
- COOKIEDB_NAME - Static variable in class org.archive.modules.fetcher.BdbCookieStore
- cookies - Variable in class org.archive.modules.fetcher.SimpleCookieStore
- cookiesLoadFile - Variable in class org.archive.modules.fetcher.AbstractCookieStore
- cookiesSaveFile - Variable in class org.archive.modules.fetcher.AbstractCookieStore
- cookieStore - Variable in class org.archive.modules.fetcher.FetchHTTP
- cookieStoreFor(String) - Method in class org.archive.modules.fetcher.BdbCookieStore
-
Returns an AbstractCookieStore.LimitedCookieStoreFacade whose AbstractCookieStore.LimitedCookieStoreFacade.getCookies() method returns only cookies from host and its parent domains, if applicable.
- cookieStoreFor(String) - Method in interface org.archive.modules.fetcher.FetchHTTPCookieStore
-
Returns a CookieStore whose CookieStore.getCookies() returns all the cookies from host and each of its parent domains, if applicable.
- cookieStoreFor(String) - Method in class org.archive.modules.fetcher.SimpleCookieStore
- cookieStoreFor(CrawlURI) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- cookieStoreFor(CrawlURI) - Method in interface org.archive.modules.fetcher.FetchHTTPCookieStore
-
Returns a CookieStore whose CookieStore.getCookies() returns all the cookies that could possibly apply to curi.
- copyForwardWriteTagIfDupe(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
If this fetch is identical to the last written (archived) fetch, then copy forward the writeTag.
- copyPersistSourceToHistoryMap(File, StoredSortedMap<String, Map>) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Populates a given StoredSortedMap (history map) from an old environment db or a persist log.
- copyPersistSourceToHistoryMap(URL, StoredSortedMap<String, Map>) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Populates a given StoredSortedMap (history map) from an old persist log.
- copyStats(Map<String, Map<String, Long>>) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- CoreAttributeConstants - Interface in org.archive.modules
-
Attribute keys and constant strings used by the core crawler classes.
- countryCodes - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
Country code name.
- crawlDelay - Variable in class org.archive.modules.net.RobotsDirectives
- CrawledBytesHistotable - Class in org.archive.crawler.util
- CrawledBytesHistotable() - Constructor for class org.archive.crawler.util.CrawledBytesHistotable
- CrawlHost - Class in org.archive.modules.net
-
Represents a single remote "host".
- CrawlHost(String) - Constructor for class org.archive.modules.net.CrawlHost
-
Create a new CrawlHost object.
- CrawlHost(String, String) - Constructor for class org.archive.modules.net.CrawlHost
-
Create a new CrawlHost object.
- CrawlMetadata - Class in org.archive.modules
-
Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs.
- CrawlMetadata() - Constructor for class org.archive.modules.CrawlMetadata
- CrawlServer - Class in org.archive.modules.net
-
Represents a single remote "server".
- CrawlServer(String) - Constructor for class org.archive.modules.net.CrawlServer
-
Creates a new CrawlServer object.
- CrawlURI - Class in org.archive.modules
-
Represents a candidate URI and the associated state it collects as it is crawled.
- CrawlURI(UURI) - Constructor for class org.archive.modules.CrawlURI
-
Create a new instance of CrawlURI from a UURI.
- CrawlURI(UURI, String, UURI, LinkContext) - Constructor for class org.archive.modules.CrawlURI
- CrawlURI.FetchType - Enum in org.archive.modules
- CrawlUriSWFAction(CrawlURI, Extractor) - Constructor for class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
- createCrawlURI(String, LinkContext, Hop) - Method in class org.archive.modules.CrawlURI
- createCrawlURI(UURI, LinkContext, Hop) - Method in class org.archive.modules.CrawlURI
-
Utility method for creating CrawlURIs that were found as outlinks from the current CrawlURI.
- createCrawlURI(UURI, LinkContext, Hop, int, boolean) - Method in class org.archive.modules.CrawlURI
-
Utility method for creation of CrawlURIs found extracting links from this CrawlURI.
- createFormSubmissionAttempt(CrawlURI, HTMLForm, String) - Method in class org.archive.modules.forms.FormLoginProcessor
- createHostDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Create a subdirectory named for the host in the URI.
- createPortDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Create a subdirectory named for the port in the URI.
- createRecorder(String) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Deprecated.
- createRecorder(String, String) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
- createSocket() - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- createSocket(String, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- createSocket(String, int, InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- createSocket(InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- createSocket(InetAddress, int, InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- Credential - Class in org.archive.modules.credential
-
Credential type.
- Credential() - Constructor for class org.archive.modules.credential.Credential
-
Constructor.
- CredentialStore - Class in org.archive.modules.credential
-
Front door to the credential store.
- CredentialStore() - Constructor for class org.archive.modules.credential.CredentialStore
-
Constructor.
- CSS_BACKSLASH_ESCAPE - Static variable in class org.archive.modules.extractor.ExtractorCSS
- CSS_URI_EXTRACTOR - Static variable in class org.archive.modules.extractor.ExtractorCSS
-
CSS URL extractor pattern.
- curi - Variable in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
- curi - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- CUSTOM - org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
- customRobots - Variable in class org.archive.modules.net.CustomRobotsPolicy
-
textual alternate robots.txt rules to follow
- CustomRobotsPolicy - Class in org.archive.modules.net
-
Follow a custom-written robots policy, rather than the site's own declarations. Does not support overlays of different custom-robots; instead, it is recommended that each custom policy be declared as a separate bean with a distinct name.
- CustomRobotsPolicy() - Constructor for class org.archive.modules.net.CustomRobotsPolicy
- customRobotstxt - Variable in class org.archive.modules.net.CustomRobotsPolicy
- CustomSWFTags - Class in org.archive.modules.extractor
-
Overwrite action tags that may hold a URI to use the CrawlUriSWFAction action.
- CustomSWFTags(SWFActions) - Constructor for class org.archive.modules.extractor.CustomSWFTags
D
- data - Variable in class org.archive.modules.CrawlURI
-
Flexible dynamic attributes list.
- DecideResult - Enum in org.archive.modules.deciderules
-
The decision of a DecideRule.
- DecideRule - Class in org.archive.modules.deciderules
- DecideRule() - Constructor for class org.archive.modules.deciderules.DecideRule
- DecideRuleSequence - Class in org.archive.modules.deciderules
- DecideRuleSequence() - Constructor for class org.archive.modules.deciderules.DecideRuleSequence
- decisionFor(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
- decisionMade(CrawlURI, DecideRule, int, DecideResult) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- DEFAULT_IP_WHOIS_SERVER - Static variable in class org.archive.modules.fetcher.FetchWhois
- DEFAULT_LOWER_BOUND - Static variable in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Default lower bound
- DEFAULT_PARAMETERS - Static variable in class org.archive.modules.extractor.Extractor
- DEFAULT_UPPER_BOUND - Static variable in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Default upper bound
- DefaultServerCache - Class in org.archive.modules.fetcher
-
Server and Host cache.
- DefaultServerCache() - Constructor for class org.archive.modules.fetcher.DefaultServerCache
-
Constructor.
- DefaultServerCache(ObjectIdentityCache<CrawlServer>, ObjectIdentityCache<CrawlHost>) - Constructor for class org.archive.modules.fetcher.DefaultServerCache
- DefaultTempDirProvider - Class in org.archive.modules.net
- DefaultTempDirProvider() - Constructor for class org.archive.modules.net.DefaultTempDirProvider
- defaultURI() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Returns a CrawlURI for testing purposes.
- deferOrFinishGeneric(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchWhois
- description - Variable in class org.archive.modules.CrawlMetadata
- detach(CrawlURI) - Method in class org.archive.modules.credential.Credential
-
Detach this credential from passed curi.
- detachAll(CrawlURI) - Method in class org.archive.modules.credential.Credential
-
Detach all credentials of this type from passed curi.
- digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchDNS
-
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchFTP
-
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchHTTP
- digestAlgorithm - Variable in class org.archive.modules.fetcher.FetchSFTP
-
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- directory - Variable in class org.archive.modules.writer.WriterPoolProcessor
- directoryFile - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Implicitly append this to a URI ending with '/'.
- disallows - Variable in class org.archive.modules.net.RobotsDirectives
- DispositionChain - Class in org.archive.modules
- DispositionChain() - Constructor for class org.archive.modules.DispositionChain
- DISREGARDED - org.archive.modules.fetcher.FetchStats.Stage
- DnsResponseRecordBuilder - Class in org.archive.modules.warc
- DnsResponseRecordBuilder() - Constructor for class org.archive.modules.warc.DnsResponseRecordBuilder
- doAbort(CrawlURI, AbstractExecutionAwareRequest, String) - Method in class org.archive.modules.fetcher.FetchHTTP
- doCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- doCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
- doCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
- doCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
- doCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
- doCheckpoint(Checkpoint) - Method in class org.archive.modules.recrawl.PersistLogProcessor
- doCheckpoint(Checkpoint) - Method in class org.archive.modules.writer.WriterPoolProcessor
- document - Variable in class org.archive.modules.extractor.PDFParser
- documentReader - Variable in class org.archive.modules.extractor.PDFParser
- domain - Variable in class org.archive.modules.credential.Credential
-
The root domain this credential goes against: E.g.
- DONE - org.archive.modules.fetcher.FetchWhois.UrlStatus
- doStripRegexMatch(String, String) - Method in class org.archive.modules.canonicalize.BaseRule
-
Run a regex that strips elements of a string.
- dotBegin - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
If a segment starts with '.', the '.' is replaced by this.
- dotEnd - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
If a directory name ends with '.' it is replaced by this.
- dumpSurtPrefixSet() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Dump the current prefixes in use to configured dump file (if any)
- DUPLICATE - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- DUPLICATECOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
E
- elementContext(CharSequence, CharSequence) - Static method in class org.archive.modules.extractor.ExtractorHTML
-
Create a suitable XPath-like context from an element name and optional attribute name.
- eligibleFormsAttemptsCount - Variable in class org.archive.modules.forms.FormLoginProcessor
- eligibleFormsSeenCount - Variable in class org.archive.modules.forms.FormLoginProcessor
- EMBED - org.archive.modules.extractor.Hop
-
Embedded links necessary to render the page, like IMG/@SRC.
- EMBED_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for embeds without other context.
- encounteredReferences - Variable in class org.archive.modules.extractor.PDFParser
- enctype - Variable in class org.archive.modules.forms.HTMLForm
- engineName - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
-
engine name; default "beanshell"
- engineName - Variable in class org.archive.modules.ScriptedProcessor
-
engine name; default "beanshell"
- ensureStandardPoliciesAvailable() - Method in class org.archive.modules.CrawlMetadata
- equals(Object) - Method in class org.archive.modules.CrawlURI
- equals(Object) - Method in class org.archive.modules.extractor.LinkContext
- equals(Object) - Method in class org.archive.modules.net.CrawlHost
- equals(Object) - Method in class org.archive.modules.net.CrawlServer
- escapeForMultipart(String) - Static method in class org.archive.modules.fetcher.FetchHTTPRequest
-
Returns a copy of the string with non-ascii characters replaced by their html numeric character reference in decimal (e.g.
- eTag - Variable in class org.archive.modules.revisit.ServerNotModifiedRevisit
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.AddRedirectFromRootServerToScope
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule
-
Evaluate whether given object's string version does not match configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
-
Evaluate whether given object is equal to the configured status
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule
-
Evaluate whether given object's FetchStatus does not match configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.HasViaDecideRule
-
Evaluate whether given object is over the threshold number of hops.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
Evaluate whether given object's string version matches configured regexes
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
-
Evaluate whether given object's string version matches configured regex
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Returns "true" if the provided CrawlURI has a fetch status that falls within this instance's specified range.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesFilePatternDecideRule
-
Evaluate whether given object's string version does not match configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesListRegexDecideRule
-
Evaluate whether given object's string version does not match configured regexes (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesRegexDecideRule
-
Evaluate whether given object's string version does not match configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
-
Returns "true" if the provided CrawlURI has a fetch status that does not fall within this instance's specified range.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
-
Evaluate whether given CrawlURI's revisit profile has been set to identical digest
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
Evaluate whether given object is over the threshold number of hops.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotOnDomainsDecideRule
-
Evaluate whether given object's URI is NOT in the set of domains -- simply reverse superclass's determination
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotOnHostsDecideRule
-
Evaluate whether given object's URI is NOT in the set of hosts -- simply reverse superclass's determination
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule
-
Evaluate whether given object's URI is NOT in the SURT prefix set -- simply reverse superclass's determination
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Evaluate whether given object's URI is covered by the SURT prefix set
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
-
Evaluate whether given object is over the threshold number of hops.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
-
Evaluate whether given object is over the threshold number of path-segments.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
-
Evaluate whether given object is within the acceptable thresholds of transitive hops.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
-
Evaluate whether given object's surt form matches one of the supplied surts
- execute() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- expectContinue() - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
- expectedResult - Variable in class org.archive.modules.extractor.StringExtractorTestBase.TestData
- extendHopsPath(String, char) - Static method in class org.archive.modules.CrawlURI
-
Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols), keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED.
- ExternalGeoLocationDecideRule - Class in org.archive.modules.deciderules
-
A rule that can be configured to take alternate implementations of the ExternalGeoLookupInterface.
- ExternalGeoLocationDecideRule() - Constructor for class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- ExternalGeoLookupInterface - Interface in org.archive.modules.deciderules
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Extracts links
- extract(CrawlURI) - Method in class org.archive.modules.extractor.Extractor
-
Extracts links from the given URI.
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTTP
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
Perform usual extraction on a CrawlURI
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
-
Perform usual extraction on a CrawlURI
- extract(CrawlURI) - Method in class org.archive.modules.forms.ExtractorHTMLForms
- extract(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Run extractor.
- extract(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
-
Run extractor.
- extractChallenges(HttpResponse, CrawlURI, AuthenticationStrategy) - Method in class org.archive.modules.fetcher.FetchHTTP
- extractImplied(CharSequence, Pattern, String) - Static method in class org.archive.modules.extractor.ExtractorImpliedURI
-
Utility method for extracting 'implied' URI given a source uri, trigger pattern, and build pattern.
- extractLink(CrawlURI, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
-
Consider a single Link for internal URIs
- extractor - Variable in class org.archive.modules.extractor.ContentExtractorTestBase
-
An extractor created during the setUp.
- Extractor - Class in org.archive.modules.extractor
-
Extracts links from fetched URIs.
- Extractor() - Constructor for class org.archive.modules.extractor.Extractor
- ExtractorCSS - Class in org.archive.modules.extractor
-
This extractor parses URIs from CSS-type files.
- ExtractorCSS() - Constructor for class org.archive.modules.extractor.ExtractorCSS
- ExtractorDOC - Class in org.archive.modules.extractor
-
This class allows the caller to extract href-style links from Word 97-format Word documents.
- ExtractorDOC() - Constructor for class org.archive.modules.extractor.ExtractorDOC
- ExtractorHTML - Class in org.archive.modules.extractor
-
Basic link-extraction, from an HTML content-body, using regular expressions.
- ExtractorHTML() - Constructor for class org.archive.modules.extractor.ExtractorHTML
- ExtractorHTMLForms - Class in org.archive.modules.forms
-
Extracts extra information about FORMs in HTML, loading this into the CrawlURI (for potential later use by FormLoginProcessor) and adding a small annotation to the crawl.log.
- ExtractorHTMLForms() - Constructor for class org.archive.modules.forms.ExtractorHTMLForms
- ExtractorHTTP - Class in org.archive.modules.extractor
-
Extracts URIs from HTTP response headers.
- ExtractorHTTP() - Constructor for class org.archive.modules.extractor.ExtractorHTTP
- ExtractorImpliedURI - Class in org.archive.modules.extractor
-
An extractor for finding 'implied' URIs inside other URIs.
- ExtractorImpliedURI() - Constructor for class org.archive.modules.extractor.ExtractorImpliedURI
-
Constructor.
- extractorJS - Variable in class org.archive.modules.extractor.ExtractorHTML
-
Javascript extractor to use to process inline javascript.
- extractorJS - Variable in class org.archive.modules.extractor.ExtractorSWF
-
Javascript extractor to use to process inline javascript.
- ExtractorJS - Class in org.archive.modules.extractor
-
Processes Javascript files for strings that are likely to be crawlable URIs.
- ExtractorJS() - Constructor for class org.archive.modules.extractor.ExtractorJS
- ExtractorMultipleRegex - Class in org.archive.modules.extractor
-
An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
- ExtractorMultipleRegex() - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex
- ExtractorMultipleRegex.GroupList - Class in org.archive.modules.extractor
- ExtractorMultipleRegex.MatchList - Class in org.archive.modules.extractor
- extractorParameters - Variable in class org.archive.modules.extractor.Extractor
- ExtractorParameters - Interface in org.archive.modules.extractor
-
Bean interface for parameters consulted by multiple Extractors, and thus provided by some shared object.
- ExtractorPDF - Class in org.archive.modules.extractor
-
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs
- ExtractorPDF() - Constructor for class org.archive.modules.extractor.ExtractorPDF
- ExtractorRobotsTxt - Class in org.archive.modules.extractor
- ExtractorRobotsTxt() - Constructor for class org.archive.modules.extractor.ExtractorRobotsTxt
- ExtractorSitemap - Class in org.archive.modules.extractor
- ExtractorSitemap() - Constructor for class org.archive.modules.extractor.ExtractorSitemap
- ExtractorSWF - Class in org.archive.modules.extractor
-
Extracts URIs from SWF (flash/shockwave) files.
- ExtractorSWF() - Constructor for class org.archive.modules.extractor.ExtractorSWF
- ExtractorSWF.CrawlUriSWFAction - Class in org.archive.modules.extractor
-
SWF action that handles discovered URIs.
- ExtractorSWF.ExtractorTagParser - Class in org.archive.modules.extractor
-
TagParser customized to ignore SWFTags that will never contain extractable URIs.
- ExtractorTagParser(SWFTagTypes) - Constructor for class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- ExtractorUniversal - Class in org.archive.modules.extractor
-
A last-ditch extractor that will look at the raw byte code and try to extract anything that looks like a link.
- ExtractorUniversal() - Constructor for class org.archive.modules.extractor.ExtractorUniversal
-
Constructor.
- ExtractorURI - Class in org.archive.modules.extractor
-
An extractor for finding URIs inside other URIs.
- ExtractorURI() - Constructor for class org.archive.modules.extractor.ExtractorURI
-
Constructor
- ExtractorXML - Class in org.archive.modules.extractor
-
A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
- ExtractorXML() - Constructor for class org.archive.modules.extractor.ExtractorXML
- extractQueryStringLinks(UURI) - Static method in class org.archive.modules.extractor.ExtractorURI
-
Look for URIs inside the supplied UURI.
- extractURIs() - Method in class org.archive.modules.extractor.PDFParser
-
Extract URIs from all objects found in a PDF document's catalog.
- extractURIs(PdfObject) - Method in class org.archive.modules.extractor.PDFParser
-
Parse a PdfDictionary, looking for URIs recursively and adding them to foundURIs
- extraInfo - Variable in class org.archive.modules.CrawlURI
F
- FAILED - org.archive.modules.fetcher.FetchStats.Stage
- failedExecuteCleanup(CrawlURI, Exception) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Cleanup after a failed method execute.
- fetch(CrawlURI, String, String) - Method in class org.archive.modules.fetcher.FetchWhois
- FETCH_DISREGARDS - Static variable in class org.archive.modules.fetcher.FetchStats
- FETCH_FAILURES - Static variable in class org.archive.modules.fetcher.FetchStats
- FETCH_NONRESPONSES - Static variable in class org.archive.modules.fetcher.FetchStats
- FETCH_RESPONSES - Static variable in class org.archive.modules.fetcher.FetchStats
- FETCH_SUCCESSES - Static variable in class org.archive.modules.fetcher.FetchStats
- FetchChain - Class in org.archive.modules
- FetchChain() - Constructor for class org.archive.modules.FetchChain
- FetchDNS - Class in org.archive.modules.fetcher
-
Processor to resolve 'dns:' URIs.
- FetchDNS() - Constructor for class org.archive.modules.fetcher.FetchDNS
- fetcher - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- FetchErrors - Class in org.archive.modules.fetcher
- FetchErrors() - Constructor for class org.archive.modules.fetcher.FetchErrors
- FetchFTP - Class in org.archive.modules.fetcher
-
Fetches documents and directory listings using FTP.
- FetchFTP() - Constructor for class org.archive.modules.fetcher.FetchFTP
-
Constructs a new FetchFTP.
- FetchFTP.SocketFactoryWithTimeout - Class in org.archive.modules.fetcher
-
A SocketFactory much like javax.net.DefaultSocketFactory, except that the createSocket() methods that open connections support a connect timeout.
- FetchHistoryProcessor - Class in org.archive.modules.recrawl
-
Maintain a history of fetch information inside the CrawlURI's attributes.
- FetchHistoryProcessor() - Constructor for class org.archive.modules.recrawl.FetchHistoryProcessor
- FetchHTTP - Class in org.archive.modules.fetcher
-
HTTP fetcher that uses Apache HttpComponents.
- FetchHTTP() - Constructor for class org.archive.modules.fetcher.FetchHTTP
- FetchHTTPCookieStore - Interface in org.archive.modules.fetcher
- FetchHTTPRequest - Class in org.archive.modules.fetcher
- FetchHTTPRequest(FetchHTTP, CrawlURI) - Constructor for class org.archive.modules.fetcher.FetchHTTPRequest
- FetchHTTPRequest.RecordingHttpClientConnection - Class in org.archive.modules.fetcher
- FetchHTTPRequest.ServerCacheResolver - Class in org.archive.modules.fetcher
-
Implementation of DnsResolver that uses the server cache which is normally expected to have been populated by FetchDNS.
- FetchSFTP - Class in org.archive.modules.fetcher
- FetchSFTP() - Constructor for class org.archive.modules.fetcher.FetchSFTP
-
Constructs a new FetchSFTP.
- FetchStats - Class in org.archive.modules.fetcher
-
Collector of statistics for a 'subset' of a crawl, such as a server (host:port), host, or frontier group (e.g. queue).
- FetchStats() - Constructor for class org.archive.modules.fetcher.FetchStats
- FetchStats.CollectsFetchStats - Interface in org.archive.modules.fetcher
- FetchStats.HasFetchStats - Interface in org.archive.modules.fetcher
- FetchStats.Stage - Enum in org.archive.modules.fetcher
- FetchStatusCodes - Interface in org.archive.modules.fetcher
-
Constant flag codes to be used, in lieu of per-protocol codes (like HTTP's 200, 404, etc.), when network/internal/out-of-band conditions occur.
- fetchStatusCodesToString(int) - Static method in class org.archive.modules.CrawlURI
-
Takes a status code and converts it into a human readable string.
- FetchStatusDecideRule - Class in org.archive.modules.deciderules
-
Rule applies the configured decision for any URI which has a fetch status equal to the 'target-status' setting.
- FetchStatusDecideRule() - Constructor for class org.archive.modules.deciderules.FetchStatusDecideRule
-
Usual constructor.
- FetchStatusMatchesRegexDecideRule - Class in org.archive.modules.deciderules
- FetchStatusMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule
-
Usual constructor.
- FetchStatusNotMatchesRegexDecideRule - Class in org.archive.modules.deciderules
- FetchStatusNotMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule
-
Usual constructor.
- FetchWhois - Class in org.archive.modules.fetcher
-
WHOIS Fetcher (RFC 3912).
- FetchWhois() - Constructor for class org.archive.modules.fetcher.FetchWhois
- FetchWhois.UrlStatus - Enum in org.archive.modules.fetcher
- fileLogger - Variable in class org.archive.modules.deciderules.DecideRuleSequence
- findAttributeValueGroup(String, int, CharSequence) - Method in class org.archive.modules.forms.ExtractorHTMLForms
- findGroups(String, int, CharSequence) - Method in class org.archive.modules.forms.ExtractorHTMLForms
- FINISH - org.archive.modules.ProcessResult.ProcessStatus
-
The Processor believes that the ProcessorURI is invalid, or otherwise incapable of further processing at this time.
- FINISH - Static variable in class org.archive.modules.ProcessResult
- finishCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- finishCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
- finishCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
- finishCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
- finishCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
- finishCheckpoint(Checkpoint) - Method in class org.archive.modules.recrawl.PersistLogProcessor
- FirstNamedRobotsPolicy - Class in org.archive.modules.net
-
Working from an ordered list of potential User-Agents, consisting of first the regularly-configured User-Agent and then those in the candidateUserAgents list, consider each potential agent in order.
- FirstNamedRobotsPolicy() - Constructor for class org.archive.modules.net.FirstNamedRobotsPolicy
- fixUpName() - Method in class org.archive.modules.net.CrawlHost
- FixupQueryString - Class in org.archive.modules.canonicalize
-
Strip any trailing question mark.
- FixupQueryString() - Constructor for class org.archive.modules.canonicalize.FixupQueryString
- flattenVia() - Method in class org.archive.modules.CrawlURI
-
Method returns string version of this URI's referral URI.
- flattenVia(CrawlURI) - Static method in class org.archive.modules.Processor
- forAllHostsDo(Closure) - Method in class org.archive.modules.fetcher.DefaultServerCache
-
NOTE: Should not mutate the CrawlHost instance so retrieved; depending on the hostscache implementation, the change may not be reliably persistent.
- forAllHostsDo(Closure) - Method in class org.archive.modules.net.ServerCache
-
Utility for performing an action on every CrawlHost.
- forceFetch() - Method in class org.archive.modules.CrawlURI
-
If this method returns true, this URI should be fetched even though it already has been crawled.
- formData(String, String) - Method in class org.archive.modules.forms.HTMLForm
- FormInput() - Constructor for class org.archive.modules.forms.HTMLForm.FormInput
- formItems - Variable in class org.archive.modules.credential.HtmlFormCredential
-
Form items.
- FormLoginProcessor - Class in org.archive.modules.forms
-
A step, post-ExtractorHTMLForms, where a followup CrawlURI to attempt a form submission may be synthesized.
- FormLoginProcessor() - Constructor for class org.archive.modules.forms.FormLoginProcessor
- foundURIs - Variable in class org.archive.modules.extractor.PDFParser
- frequentFlushes - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Whether to flush to underlying file frequently (at least after each record), or not.
- fromCheckpointJson(JSONObject) - Method in class org.archive.modules.extractor.Extractor
- fromCheckpointJson(JSONObject) - Method in class org.archive.modules.forms.FormLoginProcessor
- fromCheckpointJson(JSONObject) - Method in class org.archive.modules.Processor
-
Restore internal state from JSONObject stored at earlier checkpoint-time.
- fromCheckpointJson(JSONObject) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- fromCheckpointJson(JSONObject) - Method in class org.archive.modules.writer.WriterPoolProcessor
- fromHopsViaString(String) - Static method in class org.archive.modules.CrawlURI
- FtpControlConversationRecordBuilder - Class in org.archive.modules.warc
- FtpControlConversationRecordBuilder() - Constructor for class org.archive.modules.warc.FtpControlConversationRecordBuilder
- FtpResponseRecordBuilder - Class in org.archive.modules.warc
- FtpResponseRecordBuilder() - Constructor for class org.archive.modules.warc.FtpResponseRecordBuilder
- fullVia - Variable in class org.archive.modules.CrawlURI
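
The FixupQueryString entry above summarizes its rule as "Strip any trailing question mark." A minimal sketch of that one stated behavior follows. The class name TrailingQuestionMarkSketch and the method stripTrailingQuestionMark are hypothetical and do not come from org.archive.modules.canonicalize; the real class may apply further clean-up that this index does not describe.

```java
public class TrailingQuestionMarkSketch {

    // Hypothetical helper (an assumption for illustration, not the Heritrix code):
    // removes a single trailing '?' so that a URI with an empty query string
    // canonicalizes to the same text as the URI without one.
    public static String stripTrailingQuestionMark(String uri) {
        if (uri != null && uri.endsWith("?")) {
            return uri.substring(0, uri.length() - 1);
        }
        return uri;
    }

    public static void main(String[] args) {
        // Empty query string: the trailing '?' is dropped.
        System.out.println(stripTrailingQuestionMark("http://example.com/page?"));
        // Non-empty query string: left unchanged.
        System.out.println(stripTrailingQuestionMark("http://example.com/page?a=1"));
    }
}
```

Only a trailing question mark is treated as removable here; a question mark followed by parameters is left alone because it carries a real query.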
G
- generateRecordID() - Static method in class org.archive.modules.warc.BaseWARCRecordBuilder
- generator - Variable in class org.archive.modules.writer.BaseWARCWriterProcessor
-
Generator for record IDs
- get(int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- get(CharSequence, CharSequence) - Static method in class org.archive.modules.extractor.HTMLLinkContext
-
return an instance of HTMLLinkContext for attribute attr in element el.
- get(Object, String) - Method in class org.archive.modules.credential.CredentialStore
- get(String) - Static method in class org.archive.modules.extractor.HTMLLinkContext
-
return an instance of HTMLLinkContext for path path.
- GET - org.archive.modules.credential.HtmlFormCredential.Method
- getAcceptCompression() - Method in class org.archive.modules.fetcher.FetchHTTP
- getAcceptHeaders() - Method in class org.archive.modules.fetcher.FetchHTTP
- getAcceptNonDnsResolves() - Method in class org.archive.modules.fetcher.FetchDNS
- getAction() - Method in class org.archive.modules.forms.HTMLForm
- getAll() - Method in class org.archive.modules.credential.CredentialStore
- getAlsoCheckVia() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- getAnnotations() - Method in class org.archive.modules.CrawlURI
-
Get the annotations set for this uri.
- getApplicableSurtPrefix() - Method in class org.archive.modules.forms.FormLoginProcessor
- getAttributeEither(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Get a value either from inside the CrawlURI instance, or from settings (module attributes).
- getAudience() - Method in class org.archive.modules.CrawlMetadata
- getAvailableRobotsPolicies() - Method in class org.archive.modules.CrawlMetadata
- getBaseURI() - Method in class org.archive.modules.CrawlURI
-
Get the (HTML) Base URI used for derelativizing internal URIs.
- getBeanName() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- getBeanName() - Method in class org.archive.modules.Processor
- getBlockAwaitingSeedLines() - Method in class org.archive.modules.seeds.TextSeedModule
- getByRealm(Set<Credential>, String, CrawlURI) - Static method in class org.archive.modules.credential.HttpAuthenticationCredential
-
Convenience method that does look up on passed set using realm for key.
- getCandidateUserAgents() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- getCandidateUserAgents() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- getCanonicalString() - Method in class org.archive.modules.CrawlURI
- getCaseSensitiveFilesystem() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getChain() - Method in class org.archive.modules.writer.WARCWriterChainProcessor
- getCharacterMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getChmod() - Method in class org.archive.modules.writer.Kw3WriterProcessor
- getChmodValue() - Method in class org.archive.modules.writer.Kw3WriterProcessor
- getClassKey() - Method in class org.archive.modules.CrawlURI
-
Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.
- getCollection() - Method in class org.archive.modules.writer.Kw3WriterProcessor
- getComment() - Method in class org.archive.modules.deciderules.DecideRule
- getCompress() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getConfiguredHttpVersion() - Method in class org.archive.modules.fetcher.FetchHTTP
- getConnectTimeoutMs() - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- getContentDeclaredCharset(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTML
- getContentDeclaredCharset(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorXML
- getContentDigest() - Method in class org.archive.modules.CrawlURI
-
Return the retained content-digest value, if any.
- getContentDigestHistory() - Method in class org.archive.modules.CrawlURI
- getContentDigestSchemeString() - Method in class org.archive.modules.CrawlURI
- getContentDigestString() - Method in class org.archive.modules.CrawlURI
- getContentLength() - Method in class org.archive.modules.CrawlURI
-
For completed HTTP transactions, the length of the content-body.
- getContentLengthThreshold() - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
- getContentLengthThreshold() - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- getContentRegexes() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- getContentSize() - Method in class org.archive.modules.CrawlURI
-
Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers.
- getContentType() - Method in class org.archive.modules.CrawlURI
-
Get the content type of this URI.
- getContentTypeMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getCookies() - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
- getCookies() - Method in class org.archive.modules.fetcher.BdbCookieStore
- getCookies() - Method in class org.archive.modules.fetcher.SimpleCookieStore
- getCookiesLoadFile() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- getCookiesSaveFile() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- getCookieStore() - Method in class org.archive.modules.fetcher.FetchHTTP
- getCountryCode() - Method in class org.archive.modules.net.CrawlHost
-
Get country code of this host
- getCountryCodes() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- getCrawlDelay() - Method in class org.archive.modules.net.RobotsDirectives
- getCreateHostDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getCreatePortDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getCredentials() - Method in class org.archive.modules.CrawlURI
- getCredentials() - Method in class org.archive.modules.credential.CredentialStore
- getCredentials() - Method in class org.archive.modules.net.CrawlServer
- getCredentials(CrawlURI, Class<?>) - Method in class org.archive.modules.fetcher.FetchHTTP
- getCredentialStore() - Method in class org.archive.modules.fetcher.FetchHTTP
- getCredentialTypes() - Static method in class org.archive.modules.credential.CredentialStore
- getCustomRobots() - Method in class org.archive.modules.net.CustomRobotsPolicy
- getData() - Method in class org.archive.modules.CrawlURI
- getDataList(String) - Method in class org.archive.modules.CrawlURI
-
Convenience method: return (creating if necessary) list at given data key
- getDecision() - Method in class org.archive.modules.deciderules.PredicatedDecideRule
- getDefaultCharset() - Method in class org.archive.modules.fetcher.FetchHTTP
- getDefaultEncoding() - Method in class org.archive.modules.fetcher.FetchHTTP
- getDefaultMaxFileSize() - Method in class org.archive.modules.writer.ARCWriterProcessor
- getDefaultMaxFileSize() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- getDefaultMaxFileSize() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getDefaultRules() - Static method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
-
A reasonable set of default rules to use, if no others are provided by operator configuration.
- getDefaultStorePaths() - Method in class org.archive.modules.writer.ARCWriterProcessor
- getDefaultStorePaths() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- getDefaultStorePaths() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getDeferrals() - Method in class org.archive.modules.CrawlURI
-
Get the deferral count.
- getDescription() - Method in class org.archive.modules.CrawlMetadata
- getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchDNS
- getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchFTP
- getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchHTTP
- getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchSFTP
- getDigestContent() - Method in class org.archive.modules.fetcher.FetchDNS
- getDigestContent() - Method in class org.archive.modules.fetcher.FetchFTP
- getDigestContent() - Method in class org.archive.modules.fetcher.FetchHTTP
- getDigestContent() - Method in class org.archive.modules.fetcher.FetchSFTP
- getDirectivesFor(String) - Method in class org.archive.modules.net.Robotstxt
-
Return directives to use for the given User-Agent, resorting to wildcard rules or the default no-directives if necessary.
- getDirectivesFor(String, boolean) - Method in class org.archive.modules.net.Robotstxt
-
Return the RobotsDirectives, if any, appropriate for the given User-Agent string.
- getDirectory() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getDirectoryFile() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getDisableJavaDnsResolves() - Method in class org.archive.modules.fetcher.FetchDNS
- getDNSRecord(long, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
- getDNSServerIPLabel() - Method in class org.archive.modules.CrawlURI
- getDomain() - Method in class org.archive.modules.credential.Credential
- getDotBegin() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getDotEnd() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getDupByHashBytes() - Method in class org.archive.modules.fetcher.FetchStats
- getDupByHashUrls() - Method in class org.archive.modules.fetcher.FetchStats
- getEarliestNextURIEmitTime() - Method in class org.archive.modules.net.CrawlHost
-
Get the earliest time a URI for this host could be emitted.
- getEmbedHopCount() - Method in class org.archive.modules.CrawlURI
-
Get the embed hop count.
- getEnabled() - Method in class org.archive.modules.canonicalize.BaseRule
- getEnabled() - Method in interface org.archive.modules.canonicalize.CanonicalizationRule
- getEnabled() - Method in class org.archive.modules.deciderules.DecideRule
- getEnabled() - Method in class org.archive.modules.Processor
- getEnctype() - Method in class org.archive.modules.forms.HTMLForm
- getEngine() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
Get the proper ScriptEngine instance -- either shared or local to this thread.
- getEngine() - Method in class org.archive.modules.ScriptedProcessor
-
Get the proper ScriptEngine instance -- either shared or local to this thread.
- getEngineName() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- getEngineName() - Method in class org.archive.modules.ScriptedProcessor
- getEntity() - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
- getETag() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- getExtract404s() - Method in interface org.archive.modules.extractor.ExtractorParameters
-
Whether to extract links from responses with a 404 'not found' response code.
- getExtractAllForms() - Method in class org.archive.modules.forms.ExtractorHTMLForms
- getExtractFromDirs() - Method in class org.archive.modules.fetcher.FetchFTP
-
Returns the extract.from.dirs attribute for this FetchFTP and the given curi.
- getExtractFromDirs() - Method in class org.archive.modules.fetcher.FetchSFTP
-
Returns the extract.from.dirs attribute for this FetchSFTP and the given curi.
- getExtractIndependently() - Method in interface org.archive.modules.extractor.ExtractorParameters
-
Whether each extractor should make an independent decision as to whether it can extract links from a URI's content (when value is true), or whether a previous extractor's success (marking the URI as hasBeenLinkExtracted) should cancel later extractors (when value is false).
- getExtractJavascript() - Method in class org.archive.modules.extractor.ExtractorHTML
- getExtractOnlyFormGets() - Method in class org.archive.modules.extractor.ExtractorHTML
- getExtractorJS() - Method in class org.archive.modules.extractor.ExtractorHTML
- getExtractorJS() - Method in class org.archive.modules.extractor.ExtractorSWF
- getExtractorParameters() - Method in class org.archive.modules.extractor.Extractor
- getExtractParent() - Method in class org.archive.modules.fetcher.FetchFTP
-
Returns the extract.parent attribute for this FetchFTP and the given curi.
- getExtractParent() - Method in class org.archive.modules.fetcher.FetchSFTP
-
Returns the extract.parent attribute for this FetchSFTP and the given curi.
- getExtractValueAttributes() - Method in class org.archive.modules.extractor.ExtractorHTML
- getExtraInfo() - Method in class org.archive.modules.CrawlURI
- getFetchAttempts() - Method in class org.archive.modules.CrawlURI
-
Get the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
- getFetchBeginTime() - Method in class org.archive.modules.CrawlURI
- getFetchCompletedTime() - Method in class org.archive.modules.CrawlURI
- getFetchDisregards() - Method in class org.archive.modules.fetcher.FetchStats
- getFetchDuration() - Method in class org.archive.modules.CrawlURI
- getFetchHistory() - Method in class org.archive.modules.CrawlURI
- getFetchNonResponses() - Method in class org.archive.modules.fetcher.FetchStats
- getFetchResponses() - Method in class org.archive.modules.fetcher.FetchStats
- getFetchStatus() - Method in class org.archive.modules.CrawlURI
-
Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
- getFetchSuccesses() - Method in class org.archive.modules.fetcher.FetchStats
- getFetchType() - Method in class org.archive.modules.CrawlURI
- getFirstARecord(Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
- getFormat() - Method in class org.archive.modules.canonicalize.RegexRule
- getFormat() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
- getFormItems() - Method in class org.archive.modules.credential.HtmlFormCredential
- getFormProvince(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
-
Get the 'form province' - either the configured (applicableSurtPrefix) or inferred (full current server) range of URIs that is considered covered by one form login
- getFrequentFlushes() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getFrom() - Method in class org.archive.modules.CrawlMetadata
- getFrom() - Method in interface org.archive.modules.fetcher.UserAgentProvider
- getFullVia() - Method in class org.archive.modules.CrawlURI
- getHarvester() - Method in class org.archive.modules.writer.Kw3WriterProcessor
- getHistoryDbName() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- getHistoryDbName() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
- getHistoryLength() - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
- getHolder() - Method in class org.archive.modules.CrawlURI
-
Return the 'holder' for the convenience of an external facility.
- getHolderCost() - Method in class org.archive.modules.CrawlURI
-
Return the 'holderCost' for convenience of external facility (frontier)
- getHolderKey() - Method in class org.archive.modules.CrawlURI
-
Return the 'holderKey' for convenience of an external facility (Frontier).
- getHopChar() - Method in enum org.archive.modules.extractor.Hop
-
Returns a hop character suitable for display in logs.
- getHopCount() - Method in class org.archive.modules.CrawlURI
-
Get total hops from seed.
- getHopString() - Method in enum org.archive.modules.extractor.Hop
- getHostAddress(CrawlURI) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
from WriterPoolProcessor
- getHostAddress(CrawlURI) - Method in class org.archive.modules.warc.BaseWARCRecordBuilder
-
Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).
- getHostAddress(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Deprecated. WARCRecordBuilder instances use BaseWARCRecordBuilder.getHostAddress(CrawlURI)
- getHostFor(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
-
Get the CrawlHost associated with name.
- getHostFor(String) - Method in class org.archive.modules.net.ServerCache
- getHostFor(UURI) - Method in class org.archive.modules.net.ServerCache
-
Get the CrawlHost associated with curi.
- getHostMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getHostName() - Method in class org.archive.modules.net.CrawlHost
-
Get the host name.
- getHttpAuthChallenges() - Method in class org.archive.modules.CrawlURI
- getHttpAuthChallenges() - Method in class org.archive.modules.net.CrawlServer
- getHttpBindAddress() - Method in class org.archive.modules.fetcher.FetchHTTP
- getHttpMethod() - Method in class org.archive.modules.credential.HtmlFormCredential
-
Deprecated. Ignored; always POST.
- getHttpProxyHost() - Method in class org.archive.modules.fetcher.FetchHTTP
- getHttpProxyPassword() - Method in class org.archive.modules.fetcher.FetchHTTP
- getHttpProxyPort() - Method in class org.archive.modules.fetcher.FetchHTTP
- getHttpProxyUser() - Method in class org.archive.modules.fetcher.FetchHTTP
- getHttpResponseHeader(String) - Method in class org.archive.modules.CrawlURI
- getId() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
- getIgnoreCookies() - Method in class org.archive.modules.fetcher.FetchHTTP
- getIgnoreFormActionUrls() - Method in class org.archive.modules.extractor.ExtractorHTML
- getIgnoreUnexpectedHtml() - Method in class org.archive.modules.extractor.ExtractorHTML
- getInferRootPage() - Method in class org.archive.modules.extractor.ExtractorHTTP
- getInFromFile(String) - Method in class org.archive.modules.extractor.PDFParser
-
Read a file named 'doc' and store its bytes for later processing.
- getIP() - Method in class org.archive.modules.net.CrawlHost
-
Get the IP address for this host.
- getIpAddresses() - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
- getIpFetched() - Method in class org.archive.modules.net.CrawlHost
-
Get the time when the IP address for this host was last looked up.
- getIpTTL() - Method in class org.archive.modules.net.CrawlHost
-
Get the TTL value from the dns record for this host.
- getIsolateThreads() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- getIsolateThreads() - Method in class org.archive.modules.ScriptedProcessor
- getJobName() - Method in class org.archive.modules.CrawlMetadata
- getJumpTarget() - Method in class org.archive.modules.ProcessResult
- getKey() - Method in class org.archive.modules.credential.Credential
- getKey() - Method in class org.archive.modules.credential.HtmlFormCredential
- getKey() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- getKey() - Method in class org.archive.modules.net.CrawlHost
- getKey() - Method in class org.archive.modules.net.CrawlServer
- getKeyedProperties() - Method in class org.archive.modules.canonicalize.BaseRule
- getKeyedProperties() - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
- getKeyedProperties() - Method in class org.archive.modules.CrawlMetadata
- getKeyedProperties() - Method in class org.archive.modules.credential.CredentialStore
- getKeyedProperties() - Method in class org.archive.modules.deciderules.DecideRule
- getKeyedProperties() - Method in class org.archive.modules.Processor
- getKeyedProperties() - Method in class org.archive.modules.ProcessorChain
- getLastHop() - Method in class org.archive.modules.CrawlURI
-
convenience access to last hop character, as string
- getLastModified() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- getLastSuccessTime() - Method in class org.archive.modules.fetcher.FetchStats
- getLinkCount() - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
- getLinkHopCount() - Method in class org.archive.modules.CrawlURI
-
Get the link hop count.
- getListLogicalOr() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
- getLogExtraInfo() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- getLogFile() - Method in class org.archive.modules.recrawl.PersistLogProcessor
- getLoggerModule() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- getLoggerModule() - Method in class org.archive.modules.extractor.Extractor
- getLoggerModule() - Method in class org.archive.modules.forms.FormLoginProcessor
- getLogin() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- getLoginPassword() - Method in class org.archive.modules.forms.FormLoginProcessor
- getLoginUri() - Method in class org.archive.modules.credential.HtmlFormCredential
- getLoginUsername() - Method in class org.archive.modules.forms.FormLoginProcessor
- getLogToFile() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- getLookup() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- getLowerBound() - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Returns the lower bound on the range of acceptable status codes.
- getLowerBound() - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
- getMaxAttributeNameLength() - Method in class org.archive.modules.extractor.ExtractorHTML
- getMaxAttributeValLength() - Method in class org.archive.modules.extractor.ExtractorHTML
- getMaxElementLength() - Method in class org.archive.modules.extractor.ExtractorHTML
- getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchFTP
- getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchHTTP
- getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchSFTP
- getMaxFileSizeBytes() - Method in class org.archive.modules.writer.Kw3WriterProcessor
- getMaxFileSizeBytes() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getMaxHops() - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
- getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchFTP
- getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchHTTP
- getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchSFTP
- getMaxOutlinks() - Method in interface org.archive.modules.extractor.ExtractorParameters
-
The maximum number of outlinks to discover from any URI's content.
- getMaxPathDepth() - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
- getMaxPathLength() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getMaxRepetitions() - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
- getMaxSegLength() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getMaxSizeToDigest() - Method in class org.archive.modules.extractor.HTTPContentDigest
- getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorPDF
- getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorUniversal
- getMaxSpeculativeHops() - Method in class org.archive.modules.deciderules.TransclusionDecideRule
- getMaxTotalBytesToWrite() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getMaxTransHops() - Method in class org.archive.modules.deciderules.TransclusionDecideRule
- getMaxWaitForIdleMs() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getMetadata() - Method in class org.archive.modules.extractor.ExtractorHTML
- getMetadata() - Method in class org.archive.modules.writer.ARCWriterProcessor
- getMetadata() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- getMetadata() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getMetadataProvider() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getModuleClass() - Method in class org.archive.state.ModuleTestBase
-
Returns the class of the module to test.
- getName() - Method in class org.archive.modules.net.CrawlServer
- getNamedUserAgents() - Method in class org.archive.modules.net.Robotstxt
- getNonFatalFailures() - Method in class org.archive.modules.CrawlURI
- getNotModifiedBytes() - Method in class org.archive.modules.fetcher.FetchStats
- getNotModifiedUrls() - Method in class org.archive.modules.fetcher.FetchStats
- getNovelBytes() - Method in class org.archive.modules.fetcher.FetchStats
- getNovelUrls() - Method in class org.archive.modules.fetcher.FetchStats
- getOnlyStoreIfWriteTagPresent() - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
- getOperator() - Method in class org.archive.modules.CrawlMetadata
- getOperatorContactUrl() - Method in class org.archive.modules.CrawlMetadata
- getOperatorFrom() - Method in class org.archive.modules.CrawlMetadata
- getOrdinal() - Method in class org.archive.modules.CrawlURI
-
Get the ordinal (serial number) assigned at creation.
- getOrganization() - Method in class org.archive.modules.CrawlMetadata
- getOtherDupBytes() - Method in class org.archive.modules.fetcher.FetchStats
- getOtherDupUrls() - Method in class org.archive.modules.fetcher.FetchStats
- getOutLinks() - Method in class org.archive.modules.CrawlURI
-
Returns discovered links.
- getOverlayMap(String) - Method in class org.archive.modules.CrawlURI
- getOverlayNames() - Method in class org.archive.modules.CrawlURI
- getPassword() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- getPassword() - Method in class org.archive.modules.fetcher.FetchFTP
- getPassword() - Method in class org.archive.modules.fetcher.FetchSFTP
- getPath() - Method in class org.archive.modules.writer.Kw3WriterProcessor
- getPath() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getPathFromSeed() - Method in class org.archive.modules.CrawlURI
- getPathQuery(CrawlURI) - Method in class org.archive.modules.net.RobotsPolicy
- getPattern() - Method in enum org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
- getPayloadDigest() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
- getPayloadDigest() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- getPolicyBasisUURI() - Method in class org.archive.modules.CrawlURI
-
Get the UURI that should be used as the basis of policy/overlay decisions.
- getPolitenessDelay() - Method in class org.archive.modules.CrawlURI
- getPool() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getPoolMaxActive() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getPort() - Method in class org.archive.modules.net.CrawlServer
-
Get the port number for this server.
- getPrecedence() - Method in class org.archive.modules.CrawlURI
- getPrefix() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getPreloadSource() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
- getPreloadSourceUrl() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
- getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
-
Return the authentication URI, either absolute or relative, that serves as prerequisite for the passed curi.
- getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
- getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- getPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
-
Get the prerequisite for this URI.
- getProcessors() - Method in class org.archive.modules.ProcessorChain
- getProcessStatus() - Method in class org.archive.modules.ProcessResult
- getProfileName() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
- getProfileName() - Method in interface org.archive.modules.revisit.RevisitProfile
- getProfileName() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- getProtocolVersion() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
-
Returns the HTTP protocol version to be used for this request.
- getRealm() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- getRecordedFinishes() - Method in class org.archive.modules.fetcher.FetchStats
- getRecordedSize() - Method in class org.archive.modules.CrawlURI
-
Get size of data recorded (transferred)
- getRecordedSize(CrawlURI) - Static method in class org.archive.modules.Processor
- getRecorder() - Method in class org.archive.modules.CrawlURI
-
Get the http recorder associated with this uri.
- getRecorder() - Method in class org.archive.state.ModuleTestBase
- getRecordID() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- getRecordIDGenerator() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- getRefersToDate() - Method in class org.archive.modules.revisit.AbstractProfile
- getRefersToRecordID() - Method in class org.archive.modules.revisit.AbstractProfile
- getRefersToTargetURI() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
- getRegex() - Method in class org.archive.modules.canonicalize.RegexRule
- getRegex() - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
-
Use a preset if configured to do so.
- getRegex() - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
- getRegex() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
- getRegexList() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
- getRemaining() - Method in class org.archive.modules.fetcher.FetchStats
- getRemoveTriggerUris() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
- getRequestLine() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
-
Returns the request line of this request.
- getRescheduleTime() - Method in class org.archive.modules.CrawlURI
- getResourceDir() - Method in class org.archive.state.ModuleTestBase
-
Returns the location of the Java resources directory for your project.
- getRevisitProfile() - Method in class org.archive.modules.CrawlURI
- getRobotsDenials() - Method in class org.archive.modules.fetcher.FetchStats
- getRobotsPolicy() - Method in class org.archive.modules.CrawlMetadata
-
Get the currently-effective RobotsPolicy, as specified by the string name and chosen from the full available map.
- getRobotsPolicyName() - Method in class org.archive.modules.CrawlMetadata
- getRobotstxt() - Method in class org.archive.modules.net.CrawlServer
- getRules() - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
- getRules() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- getSchedulingDirective() - Method in class org.archive.modules.CrawlURI
- getSchemes() - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
- getScratchDisk() - Method in interface org.archive.modules.extractor.TempDirProvider
- getScratchDisk() - Method in class org.archive.modules.net.DefaultTempDirProvider
- getScriptSource() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- getScriptSource() - Method in class org.archive.modules.ScriptedProcessor
- getSeedListeners() - Method in class org.archive.modules.seeds.SeedModule
- getSeeds() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- getSeedsAsSurtPrefixes() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- getSendConnectionClose() - Method in class org.archive.modules.fetcher.FetchHTTP
- getSendIfModifiedSince() - Method in class org.archive.modules.fetcher.FetchHTTP
- getSendIfNoneMatch() - Method in class org.archive.modules.fetcher.FetchHTTP
- getSendRange() - Method in class org.archive.modules.fetcher.FetchHTTP
- getSendReferer() - Method in class org.archive.modules.fetcher.FetchHTTP
- getSerialNo() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getServerCache() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- getServerCache() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- getServerCache() - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
- getServerCache() - Method in class org.archive.modules.fetcher.FetchDNS
- getServerCache() - Method in class org.archive.modules.fetcher.FetchHTTP
- getServerCache() - Method in class org.archive.modules.fetcher.FetchWhois
- getServerCache() - Method in class org.archive.modules.warc.BaseWARCRecordBuilder
- getServerCache() - Method in class org.archive.modules.writer.Kw3WriterProcessor
- getServerCache() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getServerFor(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
-
Get the CrawlServer associated with name.
- getServerFor(String) - Method in class org.archive.modules.net.ServerCache
- getServerFor(UURI) - Method in class org.archive.modules.net.ServerCache
-
Get the CrawlServer associated with curi.
- getServerKey(CrawlURI) - Static method in class org.archive.modules.fetcher.FetchHTTP
- getServerKey(UURI) - Static method in class org.archive.modules.net.CrawlServer
-
Get key to use doing lookup on server instances.
- getShouldFetchBodyRule() - Method in class org.archive.modules.fetcher.FetchHTTP
- getShouldMasquerade() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- getShouldMasquerade() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- getShouldProcessRule() - Method in class org.archive.modules.Processor
- getSkipIdenticalDigests() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getSocket() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
- getSocketInputStream(Socket) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
- getSocketOutputStream(Socket) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
- getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchFTP
- getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchHTTP
- getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchSFTP
- getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchWhois
- getSourceCodeDir() - Method in class org.archive.state.ModuleTestBase
-
Returns the location of the source code directory for your project.
- getSourceSeeds() - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
- getSourceTag() - Method in class org.archive.modules.CrawlURI
- getSourceTagSeeds() - Method in class org.archive.modules.seeds.SeedModule
- getSSLSession() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
- getSslTrustLevel() - Method in class org.archive.modules.fetcher.FetchHTTP
- getStartNewFilesOnCheckpoint() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getStats() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- getStatusCodes() - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
- getStorePaths() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
- getStripRegex() - Method in class org.archive.modules.extractor.HTTPContentDigest
- getSubstats() - Method in interface org.archive.modules.fetcher.FetchStats.HasFetchStats
- getSubstats() - Method in class org.archive.modules.net.CrawlHost
- getSubstats() - Method in class org.archive.modules.net.CrawlServer
- getSuccessBytes() - Method in class org.archive.modules.fetcher.FetchStats
- getSuffixAtEnd() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getSurtPrefixes() - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
- getSurtsDumpFile() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- getSurtsSource() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- getSurtsSourceFile() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Deprecated. Redundant now that we have SurtPrefixedDecideRule.surtsSource.
- getTemplate() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- getTemplate() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getTextSource() - Method in class org.archive.modules.seeds.TextSeedModule
- getThreadNumber() - Method in class org.archive.modules.CrawlURI
-
Get the number of the ToeThread responsible for processing this uri.
- getTimeoutPerRegexSeconds() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
- getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchFTP
- getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchHTTP
- getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchSFTP
- getTooLongDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getTotalBytes() - Method in class org.archive.crawler.util.CrawledBytesHistotable
- getTotalBytes() - Method in class org.archive.modules.fetcher.FetchStats
- getTotalBytesWritten() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getTotalScheduled() - Method in class org.archive.modules.fetcher.FetchStats
- getTotalUrls() - Method in class org.archive.crawler.util.CrawledBytesHistotable
- getTransHops() - Method in class org.archive.modules.CrawlURI
-
Tally up the number of transitive (non-simple-link) hops at the end of this CrawlURI's pathFromSeed.
- getTreatFramesAsEmbedLinks() - Method in class org.archive.modules.extractor.ExtractorHTML
- getUnderscoreSet() - Method in class org.archive.modules.writer.MirrorWriterProcessor
- getUpperBound() - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Returns the upper bound on the range of acceptable status codes.
- getUpperBound() - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
-
Returns the upper bound on the range of acceptable status codes.
- getUpperBound() - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
- getURI() - Method in class org.archive.modules.CrawlURI
- getURICount() - Method in class org.archive.modules.Processor
-
Returns the number of URIs this processor has handled.
- getUriRegex() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- getURIs() - Method in class org.archive.modules.extractor.PDFParser
-
Get a list of URIs retrieved from the Pdf during the extractURIs operation.
- getURL(String, String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
-
Overwrite handling of discovered URIs.
- getUseHeaderLength() - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- getUseHTTP11() - Method in class org.archive.modules.fetcher.FetchHTTP
- getUsePreset() - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
- getUserAgent() - Method in class org.archive.modules.CrawlMetadata
- getUserAgent() - Method in class org.archive.modules.CrawlURI
-
Get the user agent to use for crawling this URI.
- getUserAgent() - Method in interface org.archive.modules.fetcher.UserAgentProvider
- getUserAgentProvider() - Method in class org.archive.modules.fetcher.FetchHTTP
- getUserAgentTemplate() - Method in class org.archive.modules.CrawlMetadata
- getUsername() - Method in class org.archive.modules.fetcher.FetchFTP
- getUsername() - Method in class org.archive.modules.fetcher.FetchSFTP
- getUURI() - Method in class org.archive.modules.CrawlURI
- getValidator() - Method in class org.archive.modules.CrawlMetadata
- getValidTestData() - Method in class org.archive.modules.extractor.StringExtractorTestBase
-
Returns an array of valid test data pairs.
- getVia() - Method in class org.archive.modules.CrawlURI
- getViaContext() - Method in class org.archive.modules.CrawlURI
- getWarcHeaders() - Method in class org.archive.modules.revisit.AbstractProfile
- getWarcHeaders() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
- getWarcHeaders() - Method in interface org.archive.modules.revisit.RevisitProfile
- getWarcHeaders() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- getWhoisQuery(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
- getWhoisServer(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
- getWriteBufferSize() - Method in class org.archive.modules.writer.WriterPoolProcessor
- getWriteMetadata() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- getWriteRequests() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- groovyTemplate() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- groovyTemplates - Variable in class org.archive.modules.extractor.ExtractorMultipleRegex
- GroupList(MatchResult) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.GroupList
H
- handle401(HttpResponse, CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Server is looking for basic/digest auth credentials (RFC2617).
- harvester - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Name of the harvester that is used for the web harvesting.
- HARVESTER_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- hasBeenLinkExtracted() - Method in class org.archive.modules.CrawlURI
-
If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content.
- hasBeenLookedUp() - Method in class org.archive.modules.net.CrawlHost
-
Return true if the IP for this host has been looked up.
- hasContentDigestHistory() - Method in class org.archive.modules.CrawlURI
- hasCredentials() - Method in class org.archive.modules.CrawlURI
- hasCredentials() - Method in class org.archive.modules.net.CrawlServer
- hasDirectives - Variable in class org.archive.modules.net.RobotsDirectives
- hasErrors - Variable in class org.archive.modules.net.Robotstxt
- hashCode() - Method in class org.archive.modules.CrawlURI
- hashCode() - Method in class org.archive.modules.extractor.LinkContext
- hashCode() - Method in class org.archive.modules.net.CrawlHost
- hashCode() - Method in class org.archive.modules.net.CrawlServer
- hasHttpAuthenticationCredential(CrawlURI) - Static method in class org.archive.modules.Processor
- hasIdenticalDigest(CrawlURI) - Static method in class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
-
Utility method for testing if a CrawlURI's revisit profile matches an identical payload digest.
- hasIdenticalDigest(CrawlURI) - Static method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
Utility method for testing if a CrawlURI's last two history entries (one being the most recent fetch) have identical content-digest information.
- hasPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
- hasPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
- hasPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- hasPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
- hasRfc2617Credential() - Method in class org.archive.modules.CrawlURI
- HasViaDecideRule - Class in org.archive.modules.deciderules
-
Rule applies the configured decision for any URI which has a 'via' (essentially, any URI that was not a seed or some kind of mid-crawl add).
- HasViaDecideRule() - Constructor for class org.archive.modules.deciderules.HasViaDecideRule
-
Usual constructor.
- hasWriteTag(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
- haveOverlayNamesBeenSet() - Method in class org.archive.modules.CrawlURI
- haveSeen(int, int) - Method in class org.archive.modules.extractor.PDFParser
-
Indicates, based on a PDFObject's generation/id pair, whether the parser has already encountered this object (or a reference to it), so that we don't loop infinitely on cycles within the PDF.
- HEADER_LENGTH_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- HEADER_MD5_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- HEADER_PREDICTS_MISSING - Static variable in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- HEADER_TRUNC - Static variable in interface org.archive.modules.CoreAttributeConstants
- HEADER_TRUNC - Static variable in class org.archive.modules.fetcher.FetchErrors
- HIGH - Static variable in class org.archive.modules.SchedulingConstants
-
High scheduling priority.
- HIGHEST - Static variable in class org.archive.modules.SchedulingConstants
-
Highest scheduling priority.
- HISTORY_DB_CONFIG - Static variable in class org.archive.modules.recrawl.PersistProcessor
- historyDb - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
- historyDb - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
- historyDbConfig - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
- historyDbConfig() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- historyDbName - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
- historyDbName - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
- historyLength - Variable in class org.archive.modules.recrawl.FetchHistoryProcessor
-
Desired history array length.
- historyRealloc(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
Get or create proper-sized history array
- holder - Variable in class org.archive.modules.CrawlURI
- holderCost - Variable in class org.archive.modules.CrawlURI
-
spot for an integer cost to be placed by external facility (frontier).
- holderKey - Variable in class org.archive.modules.CrawlURI
- Hop - Enum in org.archive.modules.extractor
-
The kind of "hop" from one URI to another.
- HopCrossesAssignmentLevelDomainDecideRule - Class in org.archive.modules.deciderules
-
Applies its decision if the current URI differs in that portion of its hostname/domain that is assigned/sold by registrars, its 'assignment-level-domain' (ALD) (AKA 'public suffix' or in previous Heritrix versions, 'topmost assigned SURT')
- HopCrossesAssignmentLevelDomainDecideRule() - Constructor for class org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule
- HopsPathMatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any CrawlURIs whose 'hops-path' (string like "LLXE" etc.) matches the supplied regex.
- HopsPathMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule
-
Usual constructor.
- hopString - Variable in enum org.archive.modules.extractor.Hop
- hostKeys() - Method in class org.archive.modules.fetcher.DefaultServerCache
- hostKeys() - Method in class org.archive.modules.net.ServerCache
- hostMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
This list is grouped in pairs.
- HostResolver - Interface in org.archive.modules.fetcher
- hosts - Variable in class org.archive.modules.fetcher.DefaultServerCache
-
hostname -> CrawlHost.
- hostSubset(String) - Method in class org.archive.modules.fetcher.BdbCookieStore
- HTMLForm - Class in org.archive.modules.forms
-
Simple representation of a discovered HTML Form.
- HTMLForm() - Constructor for class org.archive.modules.forms.HTMLForm
- HTMLForm.FormInput - Class in org.archive.modules.forms
- HTMLForm.NameValue - Class in org.archive.modules.forms
- HtmlFormCredential - Class in org.archive.modules.credential
-
Credential that holds everything needed to do a GET/POST to an HTML form.
- HtmlFormCredential() - Constructor for class org.archive.modules.credential.HtmlFormCredential
-
Constructor.
- HtmlFormCredential.Method - Enum in org.archive.modules.credential
- HTMLLinkContext - Class in org.archive.modules.extractor
-
XPath-like context for HTML discovered URIs.
- HTMLLinkContext(CharSequence, CharSequence) - Constructor for class org.archive.modules.extractor.HTMLLinkContext
- HTMLLinkContext(String) - Constructor for class org.archive.modules.extractor.HTMLLinkContext
-
Constructor.
- HTTP_BIND_ADDRESS - Static variable in class org.archive.modules.fetcher.FetchHTTP
- HTTP_GET - org.archive.modules.CrawlURI.FetchType
- HTTP_POST - org.archive.modules.CrawlURI.FetchType
- HTTP_SCHEME - Static variable in class org.archive.modules.fetcher.FetchHTTP
- HttpAuthenticationCredential - Class in org.archive.modules.credential
-
A Basic/Digest HTTP Authentication (RFC2617) credential.
- HttpAuthenticationCredential() - Constructor for class org.archive.modules.credential.HttpAuthenticationCredential
-
Constructor.
- httpClientBuilder - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- httpClientContext - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- HTTPContentDigest - Class in org.archive.modules.extractor
-
A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
- HTTPContentDigest() - Constructor for class org.archive.modules.extractor.HTTPContentDigest
-
Constructor.
- httpMethod - Variable in class org.archive.modules.credential.HtmlFormCredential
-
Deprecated. Ignored; always POST.
- HttpRequestRecordBuilder - Class in org.archive.modules.warc
- HttpRequestRecordBuilder() - Constructor for class org.archive.modules.warc.HttpRequestRecordBuilder
- HttpResponseRecordBuilder - Class in org.archive.modules.warc
- HttpResponseRecordBuilder() - Constructor for class org.archive.modules.warc.HttpResponseRecordBuilder
- HTTPS_SCHEME - Static variable in class org.archive.modules.fetcher.FetchHTTP
I
- IdenticalDigestDecideRule - Class in org.archive.modules.deciderules.recrawl
-
Rule applies configured decision to any CrawlURIs whose revisit profile is set with a profile matching WARCConstants.PROFILE_REVISIT_IDENTICAL_DIGEST.
- IdenticalDigestDecideRule() - Constructor for class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
-
Usual constructor.
- IdenticalPayloadDigestRevisit - Class in org.archive.modules.revisit
- IdenticalPayloadDigestRevisit(String) - Constructor for class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
-
Minimal constructor.
- IgnoreRobotsPolicy - Class in org.archive.modules.net
-
Policy to ignore robots.
- IgnoreRobotsPolicy() - Constructor for class org.archive.modules.net.IgnoreRobotsPolicy
- IMAGES - org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
- IMG_DATA_ORIGINAL - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- IMG_DATA_ORIGINAL_SET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- IMG_DATA_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- IMG_DATA_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- IMG_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- IMG_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- IN_PROGRESS - org.archive.modules.fetcher.FetchWhois.UrlStatus
- includesRetireDirective() - Method in class org.archive.modules.CrawlURI
- incrementConsecutiveConnectionErrors() - Method in class org.archive.modules.net.CrawlServer
- incrementDeferrals() - Method in class org.archive.modules.CrawlURI
-
Increment the deferral count.
- incrementDiscardedOutLinks() - Method in class org.archive.modules.CrawlURI
- incrementFetchAttempts() - Method in class org.archive.modules.CrawlURI
-
Increment the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
- indexOf(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- INFERRED - org.archive.modules.extractor.Hop
-
Inferred/implied links -- not necessarily literally in the source material, but deduced by convention.
- INFERRED_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for inferred urls without other context.
- inferRootPage - Variable in class org.archive.modules.extractor.ExtractorHTTP
-
should all HTTP URIs be used to infer a link to the site's root?
- inheritFrom(CrawlURI) - Method in class org.archive.modules.CrawlURI
-
Inherit (copy) the relevant keys-values from the ancestor.
- initHttpClientBuilder() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- initialize() - Method in class org.archive.modules.extractor.PDFParser
-
Initialize opens the document for reading.
- initializeFromReader(Reader) - Method in class org.archive.modules.net.Robotstxt
- initOutputStream(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
Get the OutputStream for the file to write to.
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.AcceptDecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.RejectDecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.SeedAcceptDecideRule
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Actually extracts links.
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorCSS
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorDOC
-
Processes a word document and extracts any hyperlinks from it.
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorJS
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDF
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorRobotsTxt
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSitemap
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorUniversal
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorXML
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.TrapSuppressExtractor
- innerProcess(CrawlURI) - Method in class org.archive.modules.extractor.Extractor
-
Processes the given URI.
- innerProcess(CrawlURI) - Method in class org.archive.modules.extractor.HTTPContentDigest
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchDNS
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchFTP
-
Processes the given URI.
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchSFTP
-
Processes the given URI.
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
- innerProcess(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.Processor
-
Actually performs the process.
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLogProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistStoreProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.ScriptedProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.Processor
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.ARCWriterProcessor
-
Writes a CrawlURI and its associated data to store file.
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated. Writes a CrawlURI and its associated data to store file.
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
- innerRejectProcess(CrawlURI) - Method in class org.archive.modules.Processor
-
Invoked after a URI has been rejected.
- innerRejectProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
- INSTANCE - Static variable in class org.archive.modules.net.IgnoreRobotsPolicy
- INSTANCE - Static variable in class org.archive.modules.net.ObeyRobotsPolicy
- invert(DecideResult) - Static method in enum org.archive.modules.deciderules.DecideResult
- IP_ADDRESS - Static variable in class org.archive.modules.extractor.ExtractorUniversal
-
Matches any string that begins with http:// or https:// followed by something that looks like an IP address (four numbers, none longer than 3 chars, separated by 3 dots).
- IP_ADDRESS_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- IP_ADDRESS_REGEX - Static variable in class org.archive.modules.fetcher.FetchWhois
- IP_NEVER_EXPIRES - Static variable in class org.archive.modules.net.CrawlHost
-
Flag value indicating always-valid IP
- IP_NEVER_LOOKED_UP - Static variable in class org.archive.modules.net.CrawlHost
-
Flag value indicating an IP has not yet been looked up
- IpAddressSetDecideRule - Class in org.archive.modules.deciderules
-
IpAddressSetDecideRule must be used with org.archive.crawler.prefetch.Preselector#setRecheckScope(boolean) set to true, because it relies on Heritrix's DNS lookup to establish the IP address for a URI before it can run.
- IpAddressSetDecideRule() - Constructor for class org.archive.modules.deciderules.IpAddressSetDecideRule
- is2XXSuccess() - Method in class org.archive.modules.CrawlURI
- isCheckpointRecovery - Variable in class org.archive.modules.fetcher.BdbCookieStore
-
are we a checkpoint recovery? (in which case, reuse stored cookie data?)
- isCheckpointRecovery - Variable in class org.archive.modules.net.BdbServerCache
- isCookieCountMaxedForDomain(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- isDisableSNI() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- isEmpty() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- isEveryTime() - Method in class org.archive.modules.credential.Credential
- isEveryTime() - Method in class org.archive.modules.credential.HtmlFormCredential
- isEveryTime() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- isHtmlExpectedHere(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Test whether this HTML is so unexpected (eg in place of a GIF URI) that it shouldn't be scanned for links.
- isHttpTransaction() - Method in class org.archive.modules.CrawlURI
-
Return true if this is an HTTP transaction.
- isLocation() - Method in class org.archive.modules.CrawlURI
- isMultipleFormSubmitInputs(String) - Method in class org.archive.modules.forms.HTMLForm
- isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.CustomRobotsPolicy
- isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- isolateThreads - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
-
Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine.
- isolateThreads - Variable in class org.archive.modules.ScriptedProcessor
-
Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine.
- isPost() - Method in class org.archive.modules.credential.Credential
- isPost() - Method in class org.archive.modules.credential.HtmlFormCredential
- isPost() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- isPrerequisite() - Method in class org.archive.modules.CrawlURI
-
Returns true if this CrawlURI is a prerequisite.
- isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
- isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
- isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- isQuadAddress(CrawlURI, String, CrawlHost) - Method in class org.archive.modules.fetcher.FetchDNS
- isRevisit() - Method in class org.archive.modules.CrawlURI
-
Indicates if this CrawlURI object has been deemed a revisit.
- isRobotsExpired(int) - Method in class org.archive.modules.net.CrawlServer
-
Is the robots policy expired.
- isRunning - Variable in class org.archive.modules.deciderules.DecideRuleSequence
- isRunning - Variable in class org.archive.modules.fetcher.AbstractCookieStore
- isRunning - Variable in class org.archive.modules.net.BdbServerCache
- isRunning - Variable in class org.archive.modules.Processor
- isRunning - Variable in class org.archive.modules.ProcessorChain
- isRunning() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- isRunning() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- isRunning() - Method in class org.archive.modules.fetcher.FetchWhois
- isRunning() - Method in class org.archive.modules.net.BdbServerCache
- isRunning() - Method in class org.archive.modules.Processor
- isRunning() - Method in class org.archive.modules.ProcessorChain
- isRunning() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- isRunning() - Method in class org.archive.modules.recrawl.PersistLogProcessor
- isRunning() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
- isSeed() - Method in class org.archive.modules.CrawlURI
- isSuccess() - Method in class org.archive.modules.CrawlURI
-
Ask this URI if it was a success or not.
- isSuccess(CrawlURI) - Static method in class org.archive.modules.Processor
- isValidRobots() - Method in class org.archive.modules.net.CrawlServer
-
If true then valid robots.txt information has been retrieved.
- iterator() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- iterator() - Method in class org.archive.modules.ProcessorChain
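The IP_ADDRESS entry in this section describes a pattern for strings that begin with http:// or https:// followed by something shaped like an IP address. The following is a minimal sketch of such a pattern using java.util.regex. The class name IpUrlSketch and the exact expression are illustrative assumptions, not the pattern that ExtractorUniversal actually defines.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the actual ExtractorUniversal.IP_ADDRESS pattern.
class IpUrlSketch {
    // Scheme, then four groups of 1-3 digits separated by three dots.
    // The lookahead rejects a fourth group longer than 3 digits.
    // Only the shape is checked; octets are not range-checked (999.1.1.1 passes).
    static final Pattern IP_URL = Pattern.compile(
            "^https?://\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(?!\\d).*");

    static boolean looksLikeIpUrl(String s) {
        return IP_URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIpUrl("http://192.168.0.1/index.html"));
        System.out.println(looksLikeIpUrl("http://example.com/"));
    }
}
```

Because the description only asks for something that "looks like" an IP address, the sketch deliberately stops short of validating that each number is at most 255.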
J
- JAVASCRIPT_STRING_EXTRACTOR - Static variable in class org.archive.modules.extractor.ExtractorJS
- JerichoExtractorHTML - Class in org.archive.modules.extractor
-
Improved link-extraction from an HTML content-body using jericho-html parser.
- JerichoExtractorHTML() - Constructor for class org.archive.modules.extractor.JerichoExtractorHTML
- jobName - Variable in class org.archive.modules.CrawlMetadata
- JS_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for JavaScript-discovered urls without other context.
- JSSTRING - Static variable in class org.archive.modules.extractor.ExtractorSWF
- jump(String) - Static method in class org.archive.modules.ProcessResult
- JUMP - org.archive.modules.ProcessResult.ProcessStatus
-
The Processor has specified the next processor for the URI.
K
- kp - Variable in class org.archive.modules.canonicalize.BaseRule
- kp - Variable in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
- kp - Variable in class org.archive.modules.CrawlMetadata
- kp - Variable in class org.archive.modules.credential.CredentialStore
- kp - Variable in class org.archive.modules.deciderules.DecideRule
- kp - Variable in class org.archive.modules.Processor
- kp - Variable in class org.archive.modules.ProcessorChain
- Kw3Constants - Interface in org.archive.modules.writer
- Kw3WriterProcessor - Class in org.archive.modules.writer
-
Processor module that writes the results of successful fetches to files on disk.
- Kw3WriterProcessor() - Constructor for class org.archive.modules.writer.Kw3WriterProcessor
-
Constructor.
L
- lastIndexOf(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- lastModified - Variable in class org.archive.modules.revisit.ServerNotModifiedRevisit
- lastSuccessTime - Variable in class org.archive.modules.fetcher.FetchStats
- LENGTH_TRUNC - Static variable in interface org.archive.modules.CoreAttributeConstants
- LENGTH_TRUNC - Static variable in class org.archive.modules.fetcher.FetchErrors
- LimitedCookieStoreFacade(List<Cookie>) - Constructor for class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
- LinkContext - Class in org.archive.modules.extractor
-
The context of link discovery.
- LinkContext() - Constructor for class org.archive.modules.extractor.LinkContext
- LinkContext.SimpleLinkContext - Class in org.archive.modules.extractor
-
Class for representing handy default LinkContext values.
- linkExtractorFinished() - Method in class org.archive.modules.CrawlURI
-
Note that link extraction has been performed on this CrawlURI.
- listIterator() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- listIterator(int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- load(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractContentDigestHistory
-
Looks up the history by key persistKeyFor(curi) and loads it into curi.getContentDigestHistory().
- load(CrawlURI) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- loadCookies(Reader) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- loadCookies(ConfigFile) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- log - Variable in class org.archive.modules.recrawl.PersistLogProcessor
- logExtraInfo - Variable in class org.archive.modules.deciderules.DecideRuleSequence
-
Whether to include the "extra info" field for each entry in crawl.log.
- logFile - Variable in class org.archive.modules.recrawl.PersistLogProcessor
- logger - Static variable in class org.archive.modules.canonicalize.RegexRule
- logger - Static variable in class org.archive.modules.extractor.AggressiveExtractorHTML
- logger - Variable in class org.archive.modules.fetcher.AbstractCookieStore
- loggerModule - Variable in class org.archive.modules.deciderules.DecideRuleSequence
- loggerModule - Variable in class org.archive.modules.extractor.Extractor
- loggerModule - Variable in class org.archive.modules.forms.FormLoginProcessor
- login - Variable in class org.archive.modules.credential.HttpAuthenticationCredential
-
Login.
- loginUri - Variable in class org.archive.modules.credential.HtmlFormCredential
-
Full URI of page that contains the HTML login form we're to apply these credentials to: E.g.
- logUriError(URIException, UURI, CharSequence) - Method in class org.archive.modules.extractor.Extractor
- logUriError(URIException, UURI, CharSequence) - Method in interface org.archive.modules.extractor.UriErrorLoggerModule
- longestPrefixLength(ConcurrentSkipListSet<String>, String) - Method in class org.archive.modules.net.RobotsDirectives
- lookup - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- lookup(InetAddress) - Method in interface org.archive.modules.deciderules.ExternalGeoLookupInterface
- lookupTable(String[]) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
- LowercaseRule - Class in org.archive.modules.canonicalize
-
Lowercases the URL.
- LowercaseRule() - Constructor for class org.archive.modules.canonicalize.LowercaseRule
-
Constructor.
M
- main(String[]) - Static method in class org.archive.modules.extractor.PDFParser
- main(String[]) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Utility main for importing a log into a BDB-JE environment or moving a database between environments (2 arguments), or simply dumping a log to stderr in a more readable format (1 argument).
- makeBindings(Map<String, ExtractorMultipleRegex.MatchList>, String[], int) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- makeCrawlURI(String) - Method in class org.archive.state.ModuleTestBase
- makeData(String, String) - Method in class org.archive.modules.extractor.StringExtractorTestBase
- makeDirty() - Method in class org.archive.modules.net.CrawlHost
- makeDirty() - Method in class org.archive.modules.net.CrawlServer
- makeExtractor() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Subclasses should return an Extractor instance to test.
- makeHeritable(String) - Method in class org.archive.modules.CrawlURI
-
Make the given key 'heritable', meaning its value will be added to descendant CrawlURIs.
- makeModule() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
- makeModule() - Method in class org.archive.state.ModuleTestBase
-
Return an example instance of the module.
- makeNonHeritable(String) - Method in class org.archive.modules.CrawlURI
-
Make the given key non-'heritable', meaning its value will not be added to descendant CrawlURIs.
- makeTempDir() - Static method in class org.archive.modules.net.DefaultTempDirProvider
- makeWhoisUrl(String, String) - Method in class org.archive.modules.fetcher.FetchWhois
- MANIFEST - org.archive.modules.extractor.Hop
-
Found in some form of site provided URL manifest (e.g.
- MANIFEST_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for manifest urls without other context.
- markAsSeen(int, int) - Method in class org.archive.modules.extractor.PDFParser
-
Note that an object (id/generation pair) has been seen by this parser so that it can be handled differently when it is encountered again.
- markPrerequisite(String) - Method in class org.archive.modules.CrawlURI
-
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
- MatchesFilePatternDecideRule - Class in org.archive.modules.deciderules
-
Compares suffix of a passed CrawlURI, UURI, or String against a regular expression pattern, applying its configured decision to all matches.
- MatchesFilePatternDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesFilePatternDecideRule
-
Usual constructor.
- MatchesFilePatternDecideRule.Preset - Enum in org.archive.modules.deciderules
- MatchesListRegexDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexes.
- MatchesListRegexDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
Usual constructor.
- MatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regex.
- MatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesRegexDecideRule
-
Usual constructor.
- MatchesStatusCodeDecideRule - Class in org.archive.modules.deciderules
-
Provides a rule that returns "true" for any CrawlURIs which have a fetch status code that falls within the provided inclusive range.
- MatchesStatusCodeDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Creates a new MatchesStatusCodeDecideRule instance.
- MatchList(String, CharSequence) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.MatchList
- MatchList(ExtractorMultipleRegex.GroupList...) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.MatchList
- MAX_COOKIES_FOR_DOMAIN - Static variable in class org.archive.modules.fetcher.AbstractCookieStore
- MAX_SIZE - Static variable in class org.archive.modules.net.Robotstxt
- maxFileSizeBytes - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Max size for each file.
- maxFileSizeBytes - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Max size of each file.
- maxPathLength - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Maximum file system path length.
- maxSegLength - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Maximum file system path segment length.
- maxTotalBytesToWrite - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Total file bytes to write to disk.
- maxWaitForIdleMs - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Maximum time to wait on idle writer before (possibly) creating an additional instance.
- maybeAddConditionalGetHeader(boolean, String, String) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
Add the given conditional-GET header, if the setting is enabled and a suitable value is available in the URI history.
- maybeMidfetchAbort(CrawlURI, AbstractExecutionAwareRequest) - Method in class org.archive.modules.fetcher.FetchHTTP
- MEDIUM - Static variable in class org.archive.modules.SchedulingConstants
-
Medium priority.
- META - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- META_HREF - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- metadata - Variable in class org.archive.modules.extractor.ExtractorHTML
-
CrawlMetadata provides the robots honoring policy to use when considering a robots META tag.
- MetadataRecordBuilder - Class in org.archive.modules.warc
- MetadataRecordBuilder() - Constructor for class org.archive.modules.warc.MetadataRecordBuilder
- method - Variable in class org.archive.modules.forms.HTMLForm
- MIN_ROBOTS_RETRIES - Static variable in class org.archive.modules.net.CrawlServer
-
only check if robots-fetch is perhaps superfluous after this many tries
- MirrorWriterProcessor - Class in org.archive.modules.writer
-
Processor module that writes the results of successful fetches to files on disk.
- MirrorWriterProcessor() - Constructor for class org.archive.modules.writer.MirrorWriterProcessor
- MISC - org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
- ModuleTestBase - Class in org.archive.state
-
Base class for unit testing Module implementations.
- ModuleTestBase() - Constructor for class org.archive.state.ModuleTestBase
-
Magical constructor that attempts to auto-create static key field descriptions for your module class.
- MostFavoredRobotsPolicy - Class in org.archive.modules.net
-
Follow a most-favored robots policy -- allowing a URL if either the conventionally-configured User-Agent, or any of a number of alternate User-Agents (from the candidateUserAgents list) would be allowed.
- MostFavoredRobotsPolicy() - Constructor for class org.archive.modules.net.MostFavoredRobotsPolicy
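The makeHeritable(String) and makeNonHeritable(String) entries in this section describe keys whose values are, or are not, added to descendant CrawlURIs. The following is a rough sketch of that idea using a stand-in class. HeritableSketch, its fields, and createChild() are hypothetical names, and carrying the heritable marking over to the child is an assumption rather than documented behaviour.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a URI record; not the real CrawlURI.
class HeritableSketch {
    final Map<String, Object> data = new HashMap<>();
    final Set<String> heritableKeys = new HashSet<>();

    void makeHeritable(String key) { heritableKeys.add(key); }

    void makeNonHeritable(String key) { heritableKeys.remove(key); }

    // Create a descendant that receives only values stored under heritable keys.
    // Assumption: the heritable marking itself is passed on as well.
    HeritableSketch createChild() {
        HeritableSketch child = new HeritableSketch();
        for (String key : heritableKeys) {
            if (data.containsKey(key)) {
                child.data.put(key, data.get(key));
                child.heritableKeys.add(key);
            }
        }
        return child;
    }

    public static void main(String[] args) {
        HeritableSketch parent = new HeritableSketch();
        parent.data.put("source", "seed-list");
        parent.data.put("scratch", 42);
        parent.makeHeritable("source");
        // Only the heritable key reaches the child.
        System.out.println(parent.createChild().data.keySet());
    }
}
```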
N
- name - Variable in class org.archive.modules.forms.HTMLForm.FormInput
- name - Variable in class org.archive.modules.forms.HTMLForm.NameValue
- namedUserAgents - Variable in class org.archive.modules.net.Robotstxt
- NameValue(String, String) - Constructor for class org.archive.modules.forms.HTMLForm.NameValue
- NAVLINK - org.archive.modules.extractor.Hop
-
Navigation links, like A/@HREF.
- NAVLINK_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for navlink urls without other context.
- newEngine() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
Create a new ScriptEngine instance, preloaded with any supplied source file and the variables 'self' (this ScriptedDecideRule) and 'context' (the ApplicationContext).
- newEngine() - Method in class org.archive.modules.ScriptedProcessor
-
Create a new ScriptEngine instance, preloaded with any supplied source file and the variables 'self' (this ScriptedProcessor) and 'context' (the ApplicationContext).
- NO_DIRECTIVES - Static variable in class org.archive.modules.net.Robotstxt
- NO_ROBOTS - Static variable in class org.archive.modules.net.Robotstxt
-
empty, reusable instance for all sites providing no rules
- NONE - org.archive.modules.deciderules.DecideResult
-
Indicates the URI was neither accepted nor rejected.
- nonseedLine(String) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Consider nonseed lines as possible SURT prefix directives.
- nonseedLine(String) - Method in interface org.archive.modules.seeds.SeedListener
- nonseedLine(String) - Method in class org.archive.modules.seeds.TextSeedModule
-
Handle a read line that is not a seed, but may still have meaning to seed-consumers (such as scoping beans).
- NORMAL - Static variable in class org.archive.modules.SchedulingConstants
-
Normal/low priority.
- normalizeHost(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- NotMatchesFilePatternDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regex.
- NotMatchesFilePatternDecideRule() - Constructor for class org.archive.modules.deciderules.NotMatchesFilePatternDecideRule
-
Usual constructor.
- NotMatchesListRegexDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any URIs which do *not* match the supplied regex.
- NotMatchesListRegexDecideRule() - Constructor for class org.archive.modules.deciderules.NotMatchesListRegexDecideRule
-
Usual constructor.
- NotMatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any URIs which do *not* match the supplied regex.
- NotMatchesRegexDecideRule(String) - Constructor for class org.archive.modules.deciderules.NotMatchesRegexDecideRule
-
Usual constructor.
- NotMatchesStatusCodeDecideRule - Class in org.archive.modules.deciderules
-
Provides a rule that returns "true" for any CrawlURIs which have a fetch status code that does not fall within the provided inclusive range.
- NotMatchesStatusCodeDecideRule() - Constructor for class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
- NOTMODIFIED - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- NOTMODIFIEDCOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- NotOnDomainsDecideRule - Class in org.archive.modules.deciderules.surt
-
Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set.
- NotOnDomainsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.NotOnDomainsDecideRule
-
Usual constructor.
- NotOnHostsDecideRule - Class in org.archive.modules.deciderules.surt
-
Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set.
- NotOnHostsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.NotOnHostsDecideRule
-
Usual constructor.
- NotSurtPrefixedDecideRule - Class in org.archive.modules.deciderules.surt
-
Rule applies configured decision to any URIs that, when expressed in SURT form, do *not* begin with one of the prefixes in the configured set.
- NotSurtPrefixedDecideRule() - Constructor for class org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule
-
Usual constructor.
- NOVEL - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- NOVELCOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- numberOfCURIsHandled - Variable in class org.archive.modules.extractor.ExtractorJS
- numberOfCURIsHandled - Variable in class org.archive.modules.extractor.TrapSuppressExtractor
- numberOfCURIsSuppressed - Variable in class org.archive.modules.extractor.TrapSuppressExtractor
- numberOfFormsProcessed - Variable in class org.archive.modules.extractor.JerichoExtractorHTML
- numberOfLinksExtracted - Variable in class org.archive.modules.extractor.Extractor
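The NotMatchesStatusCodeDecideRule entry in this section, like its MatchesStatusCodeDecideRule counterpart, is described in terms of an inclusive status-code range. The following is a small sketch of an inclusive range test and its complement. StatusRangeSketch and inRange are hypothetical names and do not reflect how the decide rules are actually implemented or configured.

```java
import java.util.function.IntPredicate;

// Hypothetical illustration; not the actual decide-rule classes.
class StatusRangeSketch {
    // True when status lies in [lower, upper], both ends included.
    static IntPredicate inRange(int lower, int upper) {
        return status -> status >= lower && status <= upper;
    }

    public static void main(String[] args) {
        IntPredicate success = inRange(200, 299);
        // The "not matches" variant is simply the complement of the same test.
        IntPredicate notSuccess = success.negate();
        System.out.println(success.test(299));
        System.out.println(notSuccess.test(404));
    }
}
```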
O
- obeyMetaRobotsNofollow - Variable in class org.archive.modules.net.CustomRobotsPolicy
-
whether to obey the 'nofollow' directive in an HTML META ROBOTS element
- obeyMetaRobotsNofollow - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
-
whether to obey the 'nofollow' directive in an HTML META ROBOTS element
- obeyMetaRobotsNofollow - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
-
whether to obey the 'nofollow' directive in an HTML META ROBOTS element
- obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.CustomRobotsPolicy
- obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.IgnoreRobotsPolicy
- obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.ObeyRobotsPolicy
- obeyMetaRobotsNofollow() - Method in class org.archive.modules.net.RobotsPolicy
- ObeyRobotsPolicy - Class in org.archive.modules.net
-
Classic obey-robots-as-declared policy.
- ObeyRobotsPolicy() - Constructor for class org.archive.modules.net.ObeyRobotsPolicy
- obtainReader() - Method in class org.archive.modules.seeds.TextSeedModule
- onApplicationEvent(ApplicationEvent) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- OnDomainsDecideRule - Class in org.archive.modules.deciderules.surt
-
Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.
- OnDomainsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.OnDomainsDecideRule
-
Usual constructor.
- OnHostsDecideRule - Class in org.archive.modules.deciderules.surt
-
Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.
- OnHostsDecideRule() - Constructor for class org.archive.modules.deciderules.surt.OnHostsDecideRule
-
Usual constructor.
- onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.AcceptDecideRule
- onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
- onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
- onlyDecision(CrawlURI) - Method in class org.archive.modules.deciderules.RejectDecideRule
- onlyStoreIfWriteTagPresent - Variable in class org.archive.modules.recrawl.AbstractPersistProcessor
- operator - Variable in class org.archive.modules.CrawlMetadata
- ordinal - Variable in class org.archive.modules.CrawlURI
-
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering.
- org.archive.crawler.util - package org.archive.crawler.util
- org.archive.modules - package org.archive.modules
-
The beginnings of a refactored settings framework.
- org.archive.modules.canonicalize - package org.archive.modules.canonicalize
- org.archive.modules.credential - package org.archive.modules.credential
-
Contains HTML form login and basic and digest credentials used by Heritrix when logging into sites.
- org.archive.modules.deciderules - package org.archive.modules.deciderules
- org.archive.modules.deciderules.recrawl - package org.archive.modules.deciderules.recrawl
- org.archive.modules.deciderules.surt - package org.archive.modules.deciderules.surt
- org.archive.modules.extractor - package org.archive.modules.extractor
- org.archive.modules.fetcher - package org.archive.modules.fetcher
- org.archive.modules.forms - package org.archive.modules.forms
- org.archive.modules.net - package org.archive.modules.net
- org.archive.modules.recrawl - package org.archive.modules.recrawl
- org.archive.modules.revisit - package org.archive.modules.revisit
- org.archive.modules.seeds - package org.archive.modules.seeds
- org.archive.modules.warc - package org.archive.modules.warc
- org.archive.modules.writer - package org.archive.modules.writer
- org.archive.state - package org.archive.state
- organization - Variable in class org.archive.modules.CrawlMetadata
- OTHERDUPLICATE - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- OTHERDUPLICATECOUNT - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- outLinks - Variable in class org.archive.modules.CrawlURI
-
All discovered outbound urls as CrawlURIs (navlinks, embeds, etc.)
- overlayMapsSource - Variable in class org.archive.modules.CrawlURI
- overlayNames - Variable in class org.archive.modules.CrawlURI
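The ordinal entry in this section describes a monotonically increasing number that helps ordering tend towards breadth-first. The following is a minimal sketch of why that works: if each item receives its ordinal at discovery time and a queue is ordered by ordinal, items come out in discovery order. OrdinalSketch, Item, discover, and next are hypothetical names, and this is not how the Heritrix frontier is implemented.

```java
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the Heritrix frontier.
class OrdinalSketch {
    static final class Item {
        final String uri;
        final long ordinal;

        Item(String uri, long ordinal) {
            this.uri = uri;
            this.ordinal = ordinal;
        }
    }

    // Hands out strictly increasing ordinals, one per discovered item.
    private final AtomicLong nextOrdinal = new AtomicLong();

    // Lowest ordinal first, so earlier discoveries are served first.
    private final PriorityQueue<Item> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.ordinal, b.ordinal));

    void discover(String uri) {
        queue.add(new Item(uri, nextOrdinal.getAndIncrement()));
    }

    // Returns the earliest-discovered URI still queued, or null if empty.
    String next() {
        Item item = queue.poll();
        return item == null ? null : item.uri;
    }

    public static void main(String[] args) {
        OrdinalSketch frontier = new OrdinalSketch();
        frontier.discover("http://example.com/");
        frontier.discover("http://example.com/a");
        frontier.discover("http://example.com/b");
        System.out.println(frontier.next());
    }
}
```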
P
- parseDefineBits(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineBitsJPEG3(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineBitsLossless(InStream, int, boolean) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineButtonSound(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineFont(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineFont2(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineJPEG2(InStream, int) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineJPEGTables(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineShape(int, InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineSound(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseDefineSprite(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseFontInfo(InStream, int, boolean) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parsePlaceObject2(InStream) - Method in class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
- parseRobotsTxt(InputStream) - Method in class org.archive.modules.extractor.ExtractorRobotsTxt
- password - Variable in class org.archive.modules.credential.HttpAuthenticationCredential
-
Password.
- path - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Top-level directory for archive files.
- path - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Top-level directory for mirror files.
- PathologicalPathDecideRule - Class in org.archive.modules.deciderules
-
Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (eg http://example.com/a/a/a/boo.html == 3 '/a' segments)
- PathologicalPathDecideRule() - Constructor for class org.archive.modules.deciderules.PathologicalPathDecideRule
-
Constructs a new PathologicalPathDecideRule.
- payloadDigest - Variable in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
- payloadDigest - Variable in class org.archive.modules.revisit.ServerNotModifiedRevisit
- PDFParser - Class in org.archive.modules.extractor
-
Supports PDF parsing operations.
- PDFParser(byte[]) - Constructor for class org.archive.modules.extractor.PDFParser
- PDFParser(String) - Constructor for class org.archive.modules.extractor.PDFParser
- persistKeyFor(String) - Static method in class org.archive.modules.recrawl.PersistProcessor
- persistKeyFor(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractContentDigestHistory
- persistKeyFor(CrawlURI) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Return a preferred String key for persisting the given CrawlURI's AList state.
- PersistLoadProcessor - Class in org.archive.modules.recrawl
-
Loads CrawlURI attributes from previous fetch from persistent storage for consultation by a later recrawl.
- PersistLoadProcessor() - Constructor for class org.archive.modules.recrawl.PersistLoadProcessor
- PersistLogProcessor - Class in org.archive.modules.recrawl
-
Log CrawlURI attributes from latest fetch for consultation by a later recrawl.
- PersistLogProcessor() - Constructor for class org.archive.modules.recrawl.PersistLogProcessor
- PersistOnlineProcessor - Class in org.archive.modules.recrawl
-
Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later).
- PersistOnlineProcessor() - Constructor for class org.archive.modules.recrawl.PersistOnlineProcessor
- PersistProcessor - Class in org.archive.modules.recrawl
-
Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence.
- PersistProcessor() - Constructor for class org.archive.modules.recrawl.PersistProcessor
- PersistStoreProcessor - Class in org.archive.modules.recrawl
-
Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
- PersistStoreProcessor() - Constructor for class org.archive.modules.recrawl.PersistStoreProcessor
- politenessDelay - Variable in class org.archive.modules.CrawlURI
- poolMaxActive - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Maximum active files in pool.
- populateHtmlFormCredential(HtmlFormCredential) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- populateHttpCredential(HttpHost, AuthScheme, String, String) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- populateHttpProxyCredential() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- populatePersistEnv(String, File) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Populates a new environment db from an old environment db or a persist log.
- populateTargetCredential() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
Add credentials if any to passed method.
- POST - org.archive.modules.credential.HtmlFormCredential.Method
- PredicatedDecideRule - Class in org.archive.modules.deciderules
-
Rule which applies the configured decision only if a test evaluates to true.
- PredicatedDecideRule() - Constructor for class org.archive.modules.deciderules.PredicatedDecideRule
- prefix - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
File prefix.
- prefixFrom(String) - Method in class org.archive.modules.deciderules.surt.OnDomainsDecideRule
- prefixFrom(String) - Method in class org.archive.modules.deciderules.surt.OnHostsDecideRule
- prefixFrom(String) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- preloadSource - Variable in class org.archive.modules.recrawl.PersistLoadProcessor
-
A source (either log file or BDB directory) from which to copy history information into the current store at startup.
- preloadSourceUrl - Variable in class org.archive.modules.recrawl.PersistLoadProcessor
-
A log file source url from which to copy history information into the current store at startup.
- prepare() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- prepare() - Method in class org.archive.modules.fetcher.BdbCookieStore
- prepare() - Method in class org.archive.modules.fetcher.SimpleCookieStore
- PREREQ - org.archive.modules.extractor.Hop
-
Implied prerequisite links, like dns or robots.
- PREREQ_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for prerequisite urls without other context.
- PrerequisiteAcceptDecideRule - Class in org.archive.modules.deciderules
-
Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in the last hopsPath position).
- PrerequisiteAcceptDecideRule() - Constructor for class org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
- presumedUsernameInput() - Method in class org.archive.modules.forms.HTMLForm
- PROCEED - org.archive.modules.ProcessResult.ProcessStatus
-
The URI was processed normally, and no special action needs to be taken by the framework.
- PROCEED - Static variable in class org.archive.modules.ProcessResult
- process(CrawlURI) - Method in class org.archive.modules.Processor
-
Processes the given URI.
- process(CrawlURI, ProcessorChain.ChainStatusReceiver) - Method in class org.archive.modules.ProcessorChain
- processEmbed(CrawlURI, CharSequence, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
- processEmbed(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
- processForm(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
- processGeneralTag(CrawlURI, Element, Attributes) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
- processGeneralTag(CrawlURI, CharSequence, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
- processingCleanup() - Method in class org.archive.modules.CrawlURI
-
Clean up after a run through the processing chain.
- processLink(CrawlURI, CharSequence, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Handle generic HREF cases.
- processMeta(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
- processMeta(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Process metadata tags.
- Processor - Class in org.archive.modules
-
A processor of URIs.
- Processor() - Constructor for class org.archive.modules.Processor
- ProcessorChain - Class in org.archive.modules
-
Collection of Processors to run.
- ProcessorChain() - Constructor for class org.archive.modules.ProcessorChain
- ProcessorChain.ChainStatusReceiver - Interface in org.archive.modules
- ProcessorTestBase - Class in org.archive.modules
-
Unit test for Processor.
- ProcessorTestBase() - Constructor for class org.archive.modules.ProcessorTestBase
- ProcessResult - Class in org.archive.modules
-
Returned by a Processor's process method to indicate the status of the process.
- ProcessResult.ProcessStatus - Enum in org.archive.modules
- processScript(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
- processScript(CrawlURI, CharSequence, int) - Method in class org.archive.modules.extractor.AggressiveExtractorHTML
- processScript(CrawlURI, CharSequence, int) - Method in class org.archive.modules.extractor.ExtractorHTML
- processScriptCode(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Extract the (java)script source in the given CharSequence.
- processStyle(CrawlURI, Element) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
- processStyle(CrawlURI, CharSequence, int) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Process style text.
- processStyleCode(Extractor, CrawlURI, CharSequence) - Static method in class org.archive.modules.extractor.ExtractorCSS
- processXml(Extractor, CrawlURI, CharSequence) - Static method in class org.archive.modules.extractor.ExtractorXML
- promoteCredentials(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Promote successful credential to the server.
- proxyHost - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- publishAddedSeed(CrawlURI) - Method in class org.archive.modules.seeds.SeedModule
- publishConcludedSeedBatch() - Method in class org.archive.modules.seeds.SeedModule
- publishNonSeedLine(String) - Method in class org.archive.modules.seeds.SeedModule
- push(String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
- putHttpResponseHeader(String, String) - Method in class org.archive.modules.CrawlURI
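The PredicatedDecideRule entry above describes a rule that applies its configured decision only if a test evaluates to true. A minimal sketch of that pattern follows. The `PredicatedRuleSketch` class, its `Decision` enum and the string-based predicate are hypothetical stand-ins for illustration and are not the Heritrix classes themselves.

```java
import java.util.function.Predicate;

/** Illustrative only: a decision that is applied only when a test passes. */
public class PredicatedRuleSketch {
    /** Hypothetical stand-in for a decide result; not the Heritrix enum. */
    public enum Decision { ACCEPT, REJECT, NONE }

    private final Decision configured;
    private final Predicate<String> test;

    public PredicatedRuleSketch(Decision configured, Predicate<String> test) {
        this.configured = configured;
        this.test = test;
    }

    /** Returns the configured decision if the test passes, otherwise NONE. */
    public Decision decide(String uri) {
        return test.test(uri) ? configured : Decision.NONE;
    }

    public static void main(String[] args) {
        // Assumed example policy: reject anything using the ftp scheme.
        PredicatedRuleSketch rule =
            new PredicatedRuleSketch(Decision.REJECT, u -> u.startsWith("ftp:"));
        System.out.println(rule.decide("ftp://example.org/file"));
        System.out.println(rule.decide("http://example.org/"));
    }
}
```

Returning a neutral value when the test fails lets a later rule in a sequence make the call instead.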
Q
- qualifyRecordID(URI, String, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
R
- readCookies(Reader) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
Load cookies.
- readPrefixes() - Method in class org.archive.modules.deciderules.surt.OnDomainsDecideRule
-
Patch the SURT prefix set so that it only includes host-enforcing prefixes
- readPrefixes() - Method in class org.archive.modules.deciderules.surt.OnHostsDecideRule
-
Patch the SURT prefix set so that it only includes host-enforcing prefixes
- readPrefixes() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- readUuri(String) - Method in class org.archive.modules.CrawlURI
-
Read a UURI from a String, handling a null or URIException
- realm - Variable in class org.archive.modules.credential.HttpAuthenticationCredential
-
Basic/Digest Auth realm.
- recordDNS(CrawlURI, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
- RecordingHttpClientConnection(int, int, CharsetDecoder, CharsetEncoder, MessageConstraints, ContentLengthStrategy, ContentLengthStrategy, HttpMessageWriterFactory<HttpRequest>, HttpMessageParserFactory<HttpResponse>) - Constructor for class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
- recoveryCheckpoint - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- recoveryCheckpoint - Variable in class org.archive.modules.Processor
- RecrawlAttributeConstants - Interface in org.archive.modules.recrawl
- REFER - org.archive.modules.extractor.Hop
-
Referral/redirect links, like header 'Location:' on a 301/302 response.
- refersToDate - Variable in class org.archive.modules.revisit.AbstractProfile
- refersToRecordID - Variable in class org.archive.modules.revisit.AbstractProfile
- refersToTargetURI - Variable in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
- RegexRule - Class in org.archive.modules.canonicalize
-
General conversion rule.
- RegexRule() - Constructor for class org.archive.modules.canonicalize.RegexRule
- REJECT - org.archive.modules.deciderules.DecideResult
-
Indicates the URI was rejected.
- RejectDecideRule - Class in org.archive.modules.deciderules
- RejectDecideRule() - Constructor for class org.archive.modules.deciderules.RejectDecideRule
- RELOCATED - org.archive.modules.fetcher.FetchStats.Stage
- remove(int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- remove(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- removeAll(Collection<?>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- report() - Method in class org.archive.modules.extractor.Extractor
- report() - Method in class org.archive.modules.extractor.JerichoExtractorHTML
- report() - Method in class org.archive.modules.Processor
- report() - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- reportTo(PrintWriter) - Method in class org.archive.modules.CrawlURI
- reportTo(PrintWriter) - Method in class org.archive.modules.fetcher.FetchStats
- reportTo(PrintWriter) - Method in class org.archive.modules.ProcessorChain
-
Compiles and returns a human readable report on the active processors.
- request - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- requestConfigBuilder - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- rescheduleTime - Variable in class org.archive.modules.CrawlURI
-
A future time at which this CrawlURI should be reenqueued.
- resetConsecutiveConnectionErrors() - Method in class org.archive.modules.net.CrawlServer
- resetDeferrals() - Method in class org.archive.modules.CrawlURI
-
Reset deferrals counter.
- resetFetchAttempts() - Method in class org.archive.modules.CrawlURI
-
Reset fetchAttempts counter.
- resetForRescheduling() - Method in class org.archive.modules.CrawlURI
-
Reset state that should not persist when a URI is rescheduled for a specific future time.
- resetState() - Method in class org.archive.modules.extractor.PDFParser
-
Reinitialize the object as though a new one were created.
- resetState(byte[]) - Method in class org.archive.modules.extractor.PDFParser
-
Reset the object and initialize it with a new byte array (the document).
- resetState(String) - Method in class org.archive.modules.extractor.PDFParser
-
Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read
- resolve(String) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
- resolve(String) - Method in interface org.archive.modules.fetcher.HostResolver
- ResourceLongerThanDecideRule - Class in org.archive.modules.deciderules
-
Applies configured decision for URIs with content length greater than a given threshold length value.
- ResourceLongerThanDecideRule() - Constructor for class org.archive.modules.deciderules.ResourceLongerThanDecideRule
- ResourceNoLongerThanDecideRule - Class in org.archive.modules.deciderules
-
Applies configured decision for URIs with content length less than or equal to a given threshold length value.
- ResourceNoLongerThanDecideRule() - Constructor for class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- ResponseContentLengthDecideRule - Class in org.archive.modules.deciderules
-
Decide rule that will ACCEPT or REJECT a uri, depending on the "decision" property, after it's fetched, if the content body is within a specified size range, specified in bytes.
- ResponseContentLengthDecideRule() - Constructor for class org.archive.modules.deciderules.ResponseContentLengthDecideRule
- RestrictedCollectionWrappedList(Collection<T>) - Constructor for class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- retainAll(Collection<?>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- RETRIED - org.archive.modules.fetcher.FetchStats.Stage
- RevisitProfile - Interface in org.archive.modules.revisit
- RevisitRecordBuilder - Class in org.archive.modules.warc
- RevisitRecordBuilder() - Constructor for class org.archive.modules.warc.RevisitRecordBuilder
- ROBOTS_DENIALS - Static variable in class org.archive.modules.fetcher.FetchStats
- ROBOTS_NOT_FETCHED - Static variable in class org.archive.modules.net.CrawlServer
- RobotsDirectives - Class in org.archive.modules.net
-
Represents the directives that apply to a user-agent (or set of user-agents)
- RobotsDirectives() - Constructor for class org.archive.modules.net.RobotsDirectives
- robotsFetched - Variable in class org.archive.modules.net.CrawlServer
- RobotsPolicy - Class in org.archive.modules.net
-
RobotsPolicy represents the strategy used by the crawler for determining how robots.txt files will be honored.
- RobotsPolicy() - Constructor for class org.archive.modules.net.RobotsPolicy
- robotstxt - Variable in class org.archive.modules.net.CrawlServer
- Robotstxt - Class in org.archive.modules.net
-
Utility class for parsing and representing 'robots.txt' format directives, into a list of named user-agents and map from user-agents to RobotsDirectives.
- Robotstxt() - Constructor for class org.archive.modules.net.Robotstxt
- Robotstxt(Reader) - Constructor for class org.archive.modules.net.Robotstxt
- Robotstxt(ReadSource) - Constructor for class org.archive.modules.net.Robotstxt
- rootUriMatch(ServerCache, CrawlURI) - Method in class org.archive.modules.credential.Credential
-
Test whether the passed curi matches this credential's rootUri.
- ROUTE_PLANNER - Static variable in class org.archive.modules.fetcher.FetchHTTPRequest
- RulesCanonicalizationPolicy - Class in org.archive.modules.canonicalize
-
URI Canonicalization Policy
- RulesCanonicalizationPolicy() - Constructor for class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
- runTest() - Method in class org.archive.state.ModuleTestBase
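The ResourceLongerThanDecideRule and ResourceNoLongerThanDecideRule entries above describe complementary threshold tests: one matches content lengths greater than the threshold, the other matches lengths less than or equal to it. A minimal sketch of those two comparisons follows. The `LengthThresholdSketch` class and its method names are hypothetical and only illustrate the boundary behaviour as worded in the index.

```java
/** Illustrative only: the two complementary length tests described above. */
public class LengthThresholdSketch {

    /** True when the content length is strictly greater than the threshold. */
    public static boolean longerThan(long contentLength, long threshold) {
        return contentLength > threshold;
    }

    /** True when the content length is less than or equal to the threshold. */
    public static boolean noLongerThan(long contentLength, long threshold) {
        return contentLength <= threshold;
    }

    public static void main(String[] args) {
        long threshold = 1024; // assumed example value, in bytes
        for (long length : new long[] {1023, 1024, 1025}) {
            System.out.println(length
                + " longerThan=" + longerThan(length, threshold)
                + " noLongerThan=" + noLongerThan(length, threshold));
        }
    }
}
```

For any length exactly one of the two tests holds, so a length equal to the threshold falls on the "no longer than" side.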
S
- S_BLOCKED_BY_CUSTOM_PROCESSOR - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Blocked by custom prefetcher processor.
- S_BLOCKED_BY_QUOTA - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Blocked due to exceeding an established quota.
- S_BLOCKED_BY_RUNTIME_LIMIT - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Blocked due to exceeding an established runtime.
- S_BLOCKED_BY_USER - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
blocked from fetch by user setting.
- S_CONNECT_FAILED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP connect failed
- S_CONNECT_LOST - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP connect broken
- S_DEEMED_CHAFF - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
'chaff' detection of traps/content of negligible value applied
- S_DEEMED_NOT_FOUND - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
synthetic status, used when some other status (such as connection-lost) is considered by policy the same as a document-not-found
- S_DEFERRED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
temporary status assigned URIs awaiting preconditions; appearance in logs is a bug
- S_DELETED_BY_USER - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
deleted from frontier by user
- S_DNS_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
DNS success
- S_DOMAIN_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
DNS prerequisite failed, precluding attempt
- S_DOMAIN_UNRESOLVABLE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
DNS lookup failed
- S_GETBYNAME_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
InetAddress.getByName success
- S_NOT_FOUND - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP 404 NOT FOUND
- S_OTHER_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Other prerequisite failed, precluding attempt
- S_OUT_OF_SCOPE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
out-of-scope upon reexamination (only when scope changes during crawl)
- S_PREREQUISITE_UNSCHEDULABLE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Prerequisite could not be scheduled, precluding attempt
- S_PROCESSING_THREAD_KILLED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Processing thread was killed
- S_ROBOTS_PRECLUDED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
robots rules precluded fetch
- S_ROBOTS_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Robots prerequisite failed, precluding attempt
- S_RUNTIME_EXCEPTION - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Unexpected runtime exception; see runtime-errors.log
- S_SERIOUS_ERROR - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
severe java 'Error' conditions (OutOfMemoryError, StackOverflowError, etc.) during URI processing
- S_TIMEOUT - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP timeout (before any meaningful response received)
- S_TOO_MANY_EMBED_HOPS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
overstepped embed/trans hops
- S_TOO_MANY_LINK_HOPS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
overstepped link hops
- S_TOO_MANY_RETRIES - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
multiple retries all failed
- S_UNATTEMPTED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
fetch never tried (perhaps protocol unsupported or illegal URI)
- S_UNFETCHABLE_URI - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
URI recognized as unsupported or illegal
- S_UNQUEUEABLE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
URI could not be queued in Frontier; when URIs are properly filtered for format, should never occur
- S_WHOIS_GENERIC_FINISHED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Finished all fetches for serverless WHOIS url (whois:foo.org)
- S_WHOIS_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
WHOIS success
- saveCookies() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- saveCookies(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- saveHeader(CrawlURI, Map<String, Object>, String) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
Save a header from the given HTTP operation into the Map.
- saveHeader(CrawlURI, ANVLRecord, String, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated. Saves a header from the given HTTP operation into the provider headers under a new name
- SCHEDULED - org.archive.modules.fetcher.FetchStats.Stage
- SchedulingConstants - Class in org.archive.modules
- SchemeNotInSetDecideRule - Class in org.archive.modules.deciderules
-
Rule applies the configured decision (default REJECT) for any URI which has a URI-scheme NOT contained in the configured Set.
- SchemeNotInSetDecideRule() - Constructor for class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
Usual constructor.
- schemes - Variable in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
set of schemes to test URI scheme
- SCRIPT_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- ScriptedDecideRule - Class in org.archive.modules.deciderules
-
Rule which runs a JSR-223 script to make its decision.
- ScriptedDecideRule() - Constructor for class org.archive.modules.deciderules.ScriptedDecideRule
- ScriptedProcessor - Class in org.archive.modules
-
A processor which runs a JSR-223 script on the CrawlURI.
- ScriptedProcessor() - Constructor for class org.archive.modules.ScriptedProcessor
-
Constructor.
- scriptSource - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
- scriptSource - Variable in class org.archive.modules.ScriptedProcessor
- SeedAcceptDecideRule - Class in org.archive.modules.deciderules
-
Rule which ACCEPTs all 'seed' URIs (those for which isSeed is true).
- SeedAcceptDecideRule() - Constructor for class org.archive.modules.deciderules.SeedAcceptDecideRule
- seedLine(String) - Method in class org.archive.modules.seeds.TextSeedModule
-
Handle a read line that is probably a seed.
- SeedListener - Interface in org.archive.modules.seeds
-
Implemented by components which want notifications of seed list changes.
- seedListeners - Variable in class org.archive.modules.seeds.SeedModule
- SeedModule - Class in org.archive.modules.seeds
- SeedModule() - Constructor for class org.archive.modules.seeds.SeedModule
- seeds - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- seedsAsSurtPrefixes - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Should seeds also be interpreted as SURT prefixes.
- seemsLoginForm() - Method in class org.archive.modules.forms.HTMLForm
-
For now, we consider a POST form with only 1 password field and 1 potential username field (type text or email) to be a likely login form.
- serverCache - Variable in class org.archive.modules.deciderules.DecideRuleSequence
- serverCache - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- serverCache - Variable in class org.archive.modules.deciderules.IpAddressSetDecideRule
- serverCache - Variable in class org.archive.modules.fetcher.FetchDNS
-
Used to do DNS lookups.
- serverCache - Variable in class org.archive.modules.fetcher.FetchHTTP
- serverCache - Variable in class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
- serverCache - Variable in class org.archive.modules.fetcher.FetchWhois
- serverCache - Variable in class org.archive.modules.warc.BaseWARCRecordBuilder
- serverCache - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
The server cache to use.
- serverCache - Variable in class org.archive.modules.writer.WriterPoolProcessor
- ServerCache - Class in org.archive.modules.net
-
Abstract class for crawl-global registry of CrawlServer (host:port) and CrawlHost (hostname) objects.
- ServerCache() - Constructor for class org.archive.modules.net.ServerCache
- ServerCacheResolver(ServerCache) - Constructor for class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
- serverInetAddr - Variable in class org.archive.modules.fetcher.FetchDNS
- ServerNotModifiedRevisit - Class in org.archive.modules.revisit
- ServerNotModifiedRevisit() - Constructor for class org.archive.modules.revisit.ServerNotModifiedRevisit
-
Minimal constructor.
- servers - Variable in class org.archive.modules.fetcher.DefaultServerCache
-
hostname[:port] -> CrawlServer.
- set(int, T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- setAcceptCompression(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Set headers to accept compressed responses.
- setAcceptHeaders(List<String>) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Accept Headers to include in each request.
- setAcceptNonDnsResolves(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
- setAction(String) - Method in class org.archive.modules.forms.HTMLForm
- setAlsoCheckVia(boolean) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- setApplicableSurtPrefix(String) - Method in class org.archive.modules.forms.FormLoginProcessor
- setApplicationContext(ApplicationContext) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- setApplicationContext(ApplicationContext) - Method in class org.archive.modules.ScriptedProcessor
- setAudience(String) - Method in class org.archive.modules.CrawlMetadata
- setAvailableRobotsPolicies(Map<String, RobotsPolicy>) - Method in class org.archive.modules.CrawlMetadata
- setBaseURI(String) - Method in class org.archive.modules.CrawlURI
-
Set the (HTML) Base URI used for derelativizing internal URIs.
- setBaseURI(UURI) - Method in class org.archive.modules.CrawlURI
- setBdbModule(BdbModule) - Method in class org.archive.modules.fetcher.BdbCookieStore
- setBdbModule(BdbModule) - Method in class org.archive.modules.fetcher.FetchWhois
- setBdbModule(BdbModule) - Method in class org.archive.modules.net.BdbServerCache
- setBdbModule(BdbModule) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- setBdbModule(BdbModule) - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
- setBeanName(String) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- setBeanName(String) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- setBeanName(String) - Method in class org.archive.modules.Processor
- setBlockAwaitingSeedLines(int) - Method in class org.archive.modules.seeds.TextSeedModule
- setCandidateUserAgents(List<String>) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- setCandidateUserAgents(List<String>) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- setCanonicalString(String) - Method in class org.archive.modules.CrawlURI
- setCaseSensitiveFilesystem(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setChain(List<? extends WARCRecordBuilder>) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
- setCharacterEncoding(CrawlURI, Recorder, HttpResponse) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Set the character encoding based on the result headers or default.
- setCharacterMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setChmod(boolean) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- setChmodValue(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- setClassKey(String) - Method in class org.archive.modules.CrawlURI
- setCollection(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- setComment(String) - Method in class org.archive.modules.deciderules.DecideRule
- setCompress(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setConnectTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- setContentDigest(byte[]) - Method in class org.archive.modules.CrawlURI
-
Deprecated.
- setContentDigest(String, byte[]) - Method in class org.archive.modules.CrawlURI
- setContentDigestHistory(AbstractContentDigestHistory) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
- setContentDigestHistory(AbstractContentDigestHistory) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
- setContentLengthThreshold(long) - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
- setContentLengthThreshold(long) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- setContentRegexes(Map<String, String>) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
A map of { name => regex }.
- setContentSize(long) - Method in class org.archive.modules.CrawlURI
-
Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server).
- setContentType(String) - Method in class org.archive.modules.CrawlURI
-
Set a fetched uri's content type.
- setContentTypeMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setCookiesLoadFile(ConfigFile) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- setCookiesSaveFile(ConfigPath) - Method in class org.archive.modules.fetcher.AbstractCookieStore
- setCookieStore(AbstractCookieStore) - Method in class org.archive.modules.fetcher.FetchHTTP
- setCountryCode(String) - Method in class org.archive.modules.net.CrawlHost
-
Set country code for this host
- setCountryCodes(List<String>) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- setCrawlDelay(float) - Method in class org.archive.modules.net.RobotsDirectives
- setCreateHostDirectory(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setCreatePortDirectory(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setCredentials(Map<String, Credential>) - Method in class org.archive.modules.credential.CredentialStore
- setCredentialStore(CredentialStore) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Used to store credentials.
- setCustomRobots(ReadSource) - Method in class org.archive.modules.net.CustomRobotsPolicy
- setDecision(DecideResult) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
- setDefaultEncoding(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
The character encoding to use for files that do not have one specified in the HTTP response headers.
- setDescription(String) - Method in class org.archive.modules.CrawlMetadata
- setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchDNS
- setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchFTP
- setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchSFTP
- setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
- setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
- setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchSFTP
- setDirectory(ConfigPath) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setDirectoryFile(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setDisableJavaDnsResolves(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
- setDisableSNI(boolean) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
- setDNSServerIPLabel(String) - Method in class org.archive.modules.CrawlURI
- setDomain(String) - Method in class org.archive.modules.credential.Credential
- setDotBegin(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setDotEnd(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setEarliestNextURIEmitTime(long) - Method in class org.archive.modules.net.CrawlHost
-
Set the earliest time a URI for this host could be emitted.
- setEnabled(boolean) - Method in class org.archive.modules.canonicalize.BaseRule
- setEnabled(boolean) - Method in class org.archive.modules.deciderules.DecideRule
- setEnabled(boolean) - Method in class org.archive.modules.Processor
- setEnctype(String) - Method in class org.archive.modules.forms.HTMLForm
- setEngineName(String) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- setEngineName(String) - Method in class org.archive.modules.ScriptedProcessor
- setEntity(HttpEntity) - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
- setError(String) - Method in class org.archive.modules.CrawlURI
- setETag(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- setExtractAllForms(boolean) - Method in class org.archive.modules.forms.ExtractorHTMLForms
- setExtractFromDirs(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
- setExtractFromDirs(boolean) - Method in class org.archive.modules.fetcher.FetchSFTP
- setExtractJavascript(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
- setExtractOnlyFormGets(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
- setExtractorJS(ExtractorJS) - Method in class org.archive.modules.extractor.ExtractorHTML
- setExtractorJS(ExtractorJS) - Method in class org.archive.modules.extractor.ExtractorSWF
- setExtractorParameters(ExtractorParameters) - Method in class org.archive.modules.extractor.Extractor
- setExtractParent(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
- setExtractParent(boolean) - Method in class org.archive.modules.fetcher.FetchSFTP
- setExtractValueAttributes(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
- setFetchBeginTime(long) - Method in class org.archive.modules.CrawlURI
- setFetchCompletedTime(long) - Method in class org.archive.modules.CrawlURI
- setFetchHistory(Map<String, Object>[]) - Method in class org.archive.modules.CrawlURI
- setFetchStatus(int) - Method in class org.archive.modules.CrawlURI
-
Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
- setFetchType(CrawlURI.FetchType) - Method in class org.archive.modules.CrawlURI
- setForceFetch(boolean) - Method in class org.archive.modules.CrawlURI
-
Method to signal that this URI should be fetched even though it already has been crawled.
- setForceRetire(boolean) - Method in class org.archive.modules.CrawlURI
- setFormat(String) - Method in class org.archive.modules.canonicalize.RegexRule
- setFormat(String) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
- setFormItems(Map<String, String>) - Method in class org.archive.modules.credential.HtmlFormCredential
- setFrequentFlushes(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setFullVia(CrawlURI) - Method in class org.archive.modules.CrawlURI
- setHarvester(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- setHistoryDbName(String) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- setHistoryDbName(String) - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
- setHistoryLength(int) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
- setHolder(Object) - Method in class org.archive.modules.CrawlURI
-
Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
- setHolderCost(int) - Method in class org.archive.modules.CrawlURI
-
Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI.
- setHolderKey(Object) - Method in class org.archive.modules.CrawlURI
-
Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
- setHostMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setHttpAuthChallenges(Map<String, String>) - Method in class org.archive.modules.CrawlURI
- setHttpAuthChallenges(Map<String, String>) - Method in class org.archive.modules.net.CrawlServer
- setHttpBindAddress(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Local IP address or hostname to use when making connections (binding sockets).
- setHttpMethod(HtmlFormCredential.Method) - Method in class org.archive.modules.credential.HtmlFormCredential
-
Deprecated. Ignored, always POST.
- setHttpProxyHost(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy host IP (set only if needed).
- setHttpProxyPassword(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy password (set only if needed).
- setHttpProxyPort(Integer) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy port (set only if needed).
- setHttpProxyUser(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy user (set only if needed).
- setIdentityCache(ObjectIdentityCache<?>) - Method in class org.archive.modules.net.CrawlHost
- setIdentityCache(ObjectIdentityCache<?>) - Method in class org.archive.modules.net.CrawlServer
- setIgnoreCookies(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Disable cookie handling.
- setIgnoreFormActionUrls(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
- setIgnoreUnexpectedHtml(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
- setInferRootPage(boolean) - Method in class org.archive.modules.extractor.ExtractorHTTP
- setIP(InetAddress, long) - Method in class org.archive.modules.net.CrawlHost
-
Set the IP address for this host.
- setIpAddresses(Set<String>) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
- setIsolateThreads(boolean) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- setIsolateThreads(boolean) - Method in class org.archive.modules.ScriptedProcessor
- setJobName(String) - Method in class org.archive.modules.CrawlMetadata
- setLastModified(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- setListLogicalOr(boolean) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
- setLogExtraInfo(boolean) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- setLogFile(ConfigPath) - Method in class org.archive.modules.recrawl.PersistLogProcessor
- setLoggerModule(UriErrorLoggerModule) - Method in class org.archive.modules.extractor.Extractor
- setLoggerModule(UriErrorLoggerModule) - Method in class org.archive.modules.forms.FormLoginProcessor
- setLoggerModule(SimpleFileLoggerProvider) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- setLogin(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- setLoginPassword(String) - Method in class org.archive.modules.forms.FormLoginProcessor
- setLoginUri(String) - Method in class org.archive.modules.credential.HtmlFormCredential
- setLoginUsername(String) - Method in class org.archive.modules.forms.FormLoginProcessor
- setLogToFile(boolean) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- setLookup(ExternalGeoLookupInterface) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- setLowerBound(long) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
-
The rule will apply if the url has been fetched and content body length is greater than or equal to this number of bytes.
- setLowerBound(Integer) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Sets the lower bound on the range of acceptable status codes.
- setMaxAttributeNameLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
- setMaxAttributeValLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
- setMaxElementLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
- setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchFTP
- setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
The maximum KB/sec to use when fetching data from a server.
- setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchSFTP
- setMaxFileSizeBytes(long) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- setMaxFileSizeBytes(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setMaxHops(int) - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
- setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchFTP
- setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Maximum length in bytes to fetch.
- setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchSFTP
- setMaxPathDepth(int) - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
- setMaxPathLength(int) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setMaxRepetitions(int) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
- setMaxSegLength(int) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setMaxSizeToDigest(long) - Method in class org.archive.modules.extractor.HTTPContentDigest
- setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorPDF
- setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorUniversal
- setMaxSpeculativeHops(int) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
- setMaxTotalBytesToWrite(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setMaxTransHops(int) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
- setMaxWaitForIdleMs(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setMetadata(CrawlMetadata) - Method in class org.archive.modules.extractor.ExtractorHTML
- setMetadataProvider(CrawlMetadata) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setMethod(String) - Method in class org.archive.modules.forms.HTMLForm
- setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.CustomRobotsPolicy
- setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- setOnlyStoreIfWriteTagPresent(boolean) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
- setOperator(String) - Method in class org.archive.modules.CrawlMetadata
- setOperatorContactUrl(String) - Method in class org.archive.modules.CrawlMetadata
- setOperatorFrom(String) - Method in class org.archive.modules.CrawlMetadata
- setOrdinal(long) - Method in class org.archive.modules.CrawlURI
- setOrganization(String) - Method in class org.archive.modules.CrawlMetadata
- setOtherCodings(CrawlURI, Recorder, HttpResponse) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Set the transfer and content encodings based on headers (if necessary).
- setOverlayMapsSource(OverlayMapsSource) - Method in class org.archive.modules.CrawlURI
- setPassword(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- setPassword(String) - Method in class org.archive.modules.fetcher.FetchFTP
- setPassword(String) - Method in class org.archive.modules.fetcher.FetchSFTP
- setPath(ConfigPath) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- setPath(ConfigPath) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setPayloadDigest(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
- setPolitenessDelay(long) - Method in class org.archive.modules.CrawlURI
- setPool(WriterPool) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setPoolMaxActive(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setPrecedence(int) - Method in class org.archive.modules.CrawlURI
- setPrefix(String) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setPreloadSource(ConfigPath) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
- setPreloadSourceUrl(String) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
- setPrerequisite(boolean) - Method in class org.archive.modules.CrawlURI
-
Set if this CrawlURI is itself a prerequisite URI.
- setPrerequisiteUri(CrawlURI) - Method in class org.archive.modules.CrawlURI
-
Set a prerequisite for this URI.
- setProcessors(List<Processor>) - Method in class org.archive.modules.ProcessorChain
- setRealm(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
- setRecorder(Recorder) - Method in class org.archive.modules.CrawlURI
-
Set the http recorder to be associated with this uri.
- setRecordIDGenerator(RecordIDGenerator) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
- setRefersToDate(long) - Method in class org.archive.modules.revisit.AbstractProfile
-
Set the refers-to date.
- setRefersToDate(String) - Method in class org.archive.modules.revisit.AbstractProfile
-
Set the refers-to date.
- setRefersToRecordID(String) - Method in class org.archive.modules.revisit.AbstractProfile
- setRefersToTargetURI(String) - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
- setRegex(Pattern) - Method in class org.archive.modules.canonicalize.RegexRule
- setRegex(Pattern) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
- setRegex(Pattern) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
- setRegexList(List<Pattern>) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
- setRemoveTriggerUris(boolean) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
- setRescheduleTime(long) - Method in class org.archive.modules.CrawlURI
- setRevisitProfile(RevisitProfile) - Method in class org.archive.modules.CrawlURI
- setRobotsPolicyName(String) - Method in class org.archive.modules.CrawlMetadata
- setRules(List<CanonicalizationRule>) - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
- setRules(List<DecideRule>) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- setSchedulingDirective(int) - Method in class org.archive.modules.CrawlURI
- setSchemes(Set<String>) - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
- setScriptSource(ReadSource) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
- setScriptSource(ReadSource) - Method in class org.archive.modules.ScriptedProcessor
- setSeed(boolean) - Method in class org.archive.modules.CrawlURI
-
Set the isSeed attribute of this URI.
- setSeedListeners(Set<SeedListener>) - Method in class org.archive.modules.seeds.SeedModule
- setSeeds(SeedModule) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- setSeedsAsSurtPrefixes(boolean) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- setSendConnectionClose(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'Connection: close' header with every request.
- setSendIfModifiedSince(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'If-Modified-Since' header, if previous 'Last-Modified' fetch history information is available in URI history.
- setSendIfNoneMatch(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'If-None-Match' header, if previous 'ETag' fetch history information is available in URI history.
- setSendRange(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
- setSendReferer(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'Referer' header with every request.
- setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.DecideRuleSequence
- setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
- setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
- setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchDNS
- setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Used to do DNS lookups.
- setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchWhois
- setServerCache(ServerCache) - Method in class org.archive.modules.warc.BaseWARCRecordBuilder
- setServerCache(ServerCache) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- setServerCache(ServerCache) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setShouldFetchBodyRule(DecideRule) - Method in class org.archive.modules.fetcher.FetchHTTP
-
DecideRules applied after receipt of HTTP response headers but before we start to download the body.
- setShouldMasquerade(boolean) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
- setShouldMasquerade(boolean) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
- setShouldProcessRule(DecideRule) - Method in class org.archive.modules.Processor
- setSizes(CrawlURI, Recorder) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Update CrawlURI internal sizes based on current transaction (and in the case of 304s, history)
- setSkipIdenticalDigests(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchFTP
- setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
If the socket is unresponsive for this number of milliseconds, give up.
- setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchSFTP
- setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchWhois
- setSourceSeeds(Set<String>) - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
- setSourceTag(String) - Method in class org.archive.modules.CrawlURI
- setSourceTagSeeds(boolean) - Method in class org.archive.modules.seeds.SeedModule
- setSpecialQueryTemplates(Map<String, String>) - Method in class org.archive.modules.fetcher.FetchWhois
- setSslTrustLevel(ConfigurableX509TrustManager.TrustLevel) - Method in class org.archive.modules.fetcher.FetchHTTP
-
SSL certificate trust level.
- setStartNewFilesOnCheckpoint(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Whether to close output files and start new ones on checkpoint.
- setStatusCodes(List<Integer>) - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
- setStorePaths(List<ConfigPath>) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setStripRegex(String) - Method in class org.archive.modules.extractor.HTTPContentDigest
- setSuffixAtEnd(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setSurtPrefixes(List<String>) - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
- setSurtsDumpFile(ConfigFile) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- setSurtsSource(ReadSource) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- setSurtsSourceFile(ConfigFile) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Deprecated.
- setTemplate(String) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
URI-building template.
- setTemplate(String) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setTextSource(ReadSource) - Method in class org.archive.modules.seeds.TextSeedModule
- setThreadNumber(int) - Method in class org.archive.modules.CrawlURI
-
Set the number of the ToeThread responsible for processing this uri.
- setTimeoutPerRegexSeconds(long) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
- setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchFTP
- setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
If the fetch is not completed in this number of seconds, give up (and retry later).
- setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchSFTP
- setTooLongDirectory(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setTotalBytesWritten(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setTreatFramesAsEmbedLinks(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
- setUnderscoreSet(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- setUnresolvable(CrawlURI, CrawlHost) - Method in class org.archive.modules.fetcher.FetchDNS
- setUp() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Sets up the ContentExtractorTestBase.extractor.
- setupCopyEnvironment(File) - Static method in class org.archive.modules.recrawl.PersistProcessor
- setupCopyEnvironment(File, boolean) - Static method in class org.archive.modules.recrawl.PersistProcessor
- setUpperBound(long) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
-
The rule will apply if the url has been fetched and content body length is less than or equal to this number of bytes.
- setUpperBound(Integer) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Sets the upper bound on the range of acceptable status codes.
- setUpperBound(Integer) - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
-
Sets the upper bound on the range of acceptable status codes.
- setupPool(AtomicInteger) - Method in class org.archive.modules.writer.ARCWriterProcessor
- setupPool(AtomicInteger) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- setupPool(AtomicInteger) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Set up pool of files.
- setupSimpleLog(String) - Method in interface org.archive.modules.SimpleFileLoggerProvider
- setUriRegex(String) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
Regular expression against which to match the URI.
- setUseHeaderLength(boolean) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- setUseHTTP11(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Use HTTP/1.1.
- setUsePreset(MatchesFilePatternDecideRule.Preset) - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
- setUserAgent(String) - Method in class org.archive.modules.CrawlURI
-
Set the user agent to use when crawling this URI.
- setUserAgentProvider(UserAgentProvider) - Method in class org.archive.modules.fetcher.FetchHTTP
- setUserAgentTemplate(String) - Method in class org.archive.modules.CrawlMetadata
- setUsername(String) - Method in class org.archive.modules.fetcher.FetchFTP
- setUsername(String) - Method in class org.archive.modules.fetcher.FetchSFTP
- setVia(UURI) - Method in class org.archive.modules.CrawlURI
- setWriteBufferSize(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
- setWriteMetadata(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- setWriteRequests(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- setWriteRevisitForIdenticalDigests(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- setWriteRevisitForNotModified(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- sharedEngine - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
- sharedEngine - Variable in class org.archive.modules.ScriptedProcessor
- shortReportLegend() - Method in class org.archive.modules.CrawlURI
- shortReportLegend() - Method in class org.archive.modules.fetcher.FetchStats
- shortReportLegend() - Method in class org.archive.modules.ProcessorChain
- shortReportLine() - Method in class org.archive.modules.CrawlURI
- shortReportLine() - Method in class org.archive.modules.fetcher.FetchStats
- shortReportLineTo(PrintWriter) - Method in class org.archive.modules.CrawlURI
- shortReportLineTo(PrintWriter) - Method in class org.archive.modules.fetcher.FetchStats
- shortReportLineTo(PrintWriter) - Method in class org.archive.modules.ProcessorChain
- shortReportMap() - Method in class org.archive.modules.CrawlURI
- shortReportMap() - Method in class org.archive.modules.fetcher.FetchStats
- shortReportMap() - Method in class org.archive.modules.ProcessorChain
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.DnsResponseRecordBuilder
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.FtpControlConversationRecordBuilder
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.FtpResponseRecordBuilder
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.HttpRequestRecordBuilder
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.HttpResponseRecordBuilder
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.MetadataRecordBuilder
-
If you don't want metadata records, take this class out of the chain.
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.RevisitRecordBuilder
- shouldBuildRecord(CrawlURI) - Method in interface org.archive.modules.warc.WARCRecordBuilder
-
Decides whether to build a record for the given capture.
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.warc.WhoisResponseRecordBuilder
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Determines if otherwise valid URIs should have links extracted or not.
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorCSS
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorDOC
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorJS
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDF
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorRobotsTxt
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSitemap
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorUniversal
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorXML
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.TrapSuppressExtractor
- shouldLoad(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
-
Whether the current CrawlURI's state should be loaded
- shouldMasquerade - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
-
whether to adopt the user-agent that is allowed for the fetch
- shouldMasquerade - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
-
whether to adopt the user-agent that is allowed for the fetch
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Determines if links should be extracted from the given URI.
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTTP
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.HTTPContentDigest
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchDNS
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchFTP
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Whether this processor can fetch the given CrawlURI.
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchSFTP
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
- shouldProcess(CrawlURI) - Method in class org.archive.modules.forms.ExtractorHTMLForms
- shouldProcess(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.Processor
-
Determines whether the given uri should be processed by this processor.
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLogProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistStoreProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.ScriptedProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
- shouldStore(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
-
Whether the current CrawlURI's state should be persisted (to log or direct to database)
- shouldWrite(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
- shouldWrite(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Whether the given CrawlURI should be written to archive files.
- SimpleCookieStore - Class in org.archive.modules.fetcher
-
In-memory cookie store, mostly for testing.
- SimpleCookieStore() - Constructor for class org.archive.modules.fetcher.SimpleCookieStore
- SimpleFileLoggerProvider - Interface in org.archive.modules
- SimpleLinkContext(String) - Constructor for class org.archive.modules.extractor.LinkContext.SimpleLinkContext
- size() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- size() - Method in class org.archive.modules.ProcessorChain
- skipIdenticalDigests - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Whether to skip the writing of a record when URI history information is available and indicates the prior fetch had an identical content digest.
- socketFactory - Variable in class org.archive.modules.fetcher.FetchFTP
- SocketFactoryWithTimeout() - Constructor for class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
- sortableKey(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
Returns a string that uniquely identifies the cookie. The format of the key is "normalizedDomain;name;path".
- SOURCE_DATA_ORIGINAL_SET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- SOURCE_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
- SourceSeedDecideRule - Class in org.archive.modules.deciderules
-
Rule applies the configured decision for any URI discovered from one of the seeds in sourceSeeds.
- SourceSeedDecideRule() - Constructor for class org.archive.modules.deciderules.SourceSeedDecideRule
- sourceSeeds - Variable in class org.archive.modules.deciderules.SourceSeedDecideRule
- sourceTagSeeds - Variable in class org.archive.modules.seeds.SeedModule
-
Whether to tag seeds with their own URI as a heritable 'source' String, which will be carried-forward to all URIs discovered on paths originating from that seed.
- specialQueryTemplates - Variable in class org.archive.modules.fetcher.FetchWhois
- SPECULATIVE - org.archive.modules.extractor.Hop
-
Speculative/aggressively extracted links, perhaps embed or nav, as in javascript.
- SPECULATIVE_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for speculative/aggressively extracted urls without other context.
- sslContext - Variable in class org.archive.modules.fetcher.FetchHTTP
- sslContext() - Method in class org.archive.modules.fetcher.FetchHTTP
- sslTrustLevel - Variable in class org.archive.modules.fetcher.FetchHTTP
- STANDARD_POLICIES - Static variable in class org.archive.modules.net.RobotsPolicy
- start() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- start() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- start() - Method in class org.archive.modules.fetcher.FetchHTTP
- start() - Method in class org.archive.modules.fetcher.FetchWhois
- start() - Method in class org.archive.modules.net.BdbServerCache
- start() - Method in class org.archive.modules.Processor
- start() - Method in class org.archive.modules.ProcessorChain
- start() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- start() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
- start() - Method in class org.archive.modules.recrawl.PersistLogProcessor
- start() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
- start() - Method in interface org.archive.modules.SimpleFileLoggerProvider
- start() - Method in class org.archive.modules.writer.WriterPoolProcessor
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.recrawl.PersistLogProcessor
- startNewFilesOnCheckpoint - Variable in class org.archive.modules.writer.WriterPoolProcessor
- stats - Variable in class org.archive.modules.writer.BaseWARCWriterProcessor
- STATUS_CODE_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- statusCodes - Variable in class org.archive.modules.deciderules.FetchStatusDecideRule
- stop() - Method in class org.archive.modules.deciderules.DecideRuleSequence
- stop() - Method in class org.archive.modules.fetcher.AbstractCookieStore
- stop() - Method in class org.archive.modules.fetcher.FetchHTTP
- stop() - Method in class org.archive.modules.fetcher.FetchWhois
- stop() - Method in class org.archive.modules.net.BdbServerCache
- stop() - Method in class org.archive.modules.Processor
- stop() - Method in class org.archive.modules.ProcessorChain
- stop() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- stop() - Method in class org.archive.modules.recrawl.PersistLogProcessor
- stop() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
- stop() - Method in class org.archive.modules.writer.WriterPoolProcessor
- store - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
- store - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
- store(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractContentDigestHistory
-
Stores curi.getContentDigestHistory() for the key persistKeyFor(curi).
- store(CrawlURI) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
- storeDNSRecord(CrawlURI, String, CrawlHost, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
- storePaths - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Where to save files.
- StringExtractorTestBase - Class in org.archive.modules.extractor
- StringExtractorTestBase() - Constructor for class org.archive.modules.extractor.StringExtractorTestBase
- StringExtractorTestBase.TestData - Class in org.archive.modules.extractor
- StripExtraSlashes - Class in org.archive.modules.canonicalize
-
Strip any extra slashes, '/', found in the path.
- StripExtraSlashes() - Constructor for class org.archive.modules.canonicalize.StripExtraSlashes
- StripSessionCFIDs - Class in org.archive.modules.canonicalize
-
Strip cold fusion session ids.
- StripSessionCFIDs() - Constructor for class org.archive.modules.canonicalize.StripSessionCFIDs
- StripSessionIDs - Class in org.archive.modules.canonicalize
-
Strip known session ids.
- StripSessionIDs() - Constructor for class org.archive.modules.canonicalize.StripSessionIDs
- stripToMinimal() - Method in class org.archive.modules.CrawlURI
-
Remove all attributes set on this uri.
- StripUserinfoRule - Class in org.archive.modules.canonicalize
-
Strip any 'userinfo' found on http/https URLs.
- StripUserinfoRule() - Constructor for class org.archive.modules.canonicalize.StripUserinfoRule
- StripWWWNRule - Class in org.archive.modules.canonicalize
-
Strip any 'www[0-9]*' found on http/https URLs IF they have some path/query component (content after third slash).
- StripWWWNRule() - Constructor for class org.archive.modules.canonicalize.StripWWWNRule
- StripWWWRule - Class in org.archive.modules.canonicalize
-
Strip any 'www' found on http/https URLs, IF they have some path/query component (content after third slash).
- StripWWWRule() - Constructor for class org.archive.modules.canonicalize.StripWWWRule
- subList(int, int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- SUBMIT - org.archive.modules.extractor.Hop
-
Synthesized form-submit
- submitStatusFor(String) - Method in class org.archive.modules.forms.FormLoginProcessor
- subset(CrawlURI, Class<?>) - Method in class org.archive.modules.credential.CredentialStore
-
Return set made up of all credentials of the passed type.
- subset(CrawlURI, Class<?>, String) - Method in class org.archive.modules.credential.CredentialStore
-
Return set made up of all credentials of the passed type.
- substats - Variable in class org.archive.modules.net.CrawlHost
- substats - Variable in class org.archive.modules.net.CrawlServer
- SUCCEEDED - org.archive.modules.fetcher.FetchStats.Stage
- SUCCESS_BYTES - Static variable in class org.archive.modules.fetcher.FetchStats
- suffixAtEnd - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
If true, the suffix is placed at the end of the path, after the query (if any).
- summary() - Method in class org.archive.crawler.util.CrawledBytesHistotable
- SurtPrefixedDecideRule - Class in org.archive.modules.deciderules.surt
-
Rule applies configured decision to any URIs that, when expressed in SURT form, begin with one of the prefixes in the configured set.
- SurtPrefixedDecideRule() - Constructor for class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- surtPrefixes - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
- surtPrefixes - Variable in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
- surtsDumpFile - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Dump file to save the SURT prefixes actually used; useful for debugging SURTs.
- surtsSource - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Text from which to infer SURT prefixes.
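The SurtPrefixedDecideRule entries above describe matching a URI, once expressed in SURT form, against a configured set of prefixes. A minimal sketch of that idea follows, assuming a simplified SURT conversion: the scheme, then the host labels reversed and comma-separated inside parentheses, then the path. Query strings, ports, and userinfo are ignored here. The class SurtPrefixSketch and its methods toSurt and matchesAnyPrefix are hypothetical names used for illustration only; they are not part of the Heritrix API.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SurtPrefixSketch {

    // Simplified SURT form (assumption): scheme://(reversed,host,labels,)path
    // Expects an absolute URL with a host; anything else is out of scope here.
    static String toSurt(String url) {
        URI u = URI.create(url);
        List<String> labels = Arrays.asList(u.getHost().toLowerCase().split("\\."));
        Collections.reverse(labels); // "www.archive.org" becomes org, archive, www
        String path = (u.getRawPath() == null) ? "" : u.getRawPath();
        return u.getScheme() + "://(" + String.join(",", labels) + ",)" + path;
    }

    // True when the SURT form of the URL begins with any configured prefix.
    static boolean matchesAnyPrefix(String url, List<String> surtPrefixes) {
        String surt = toSurt(url);
        for (String prefix : surtPrefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> prefixes = Arrays.asList("http://(org,archive,");
        System.out.println(toSurt("http://www.archive.org/movies"));
        System.out.println(matchesAnyPrefix("http://www.archive.org/movies", prefixes));
    }
}
```

Because the host labels are reversed, a single prefix such as `http://(org,archive,` covers the domain and all of its subdomains, which is the main reason to compare in SURT form rather than on the raw URL.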
T
- tagDefineButton(int, Vector) - Method in class org.archive.modules.extractor.CustomSWFTags
- tagDefineButton2(int, boolean, Vector) - Method in class org.archive.modules.extractor.CustomSWFTags
- tagDefineSprite(int) - Method in class org.archive.modules.extractor.CustomSWFTags
- tagDoAction() - Method in class org.archive.modules.extractor.CustomSWFTags
- tagDoInActions(int) - Method in class org.archive.modules.extractor.CustomSWFTags
- tagDoInitAction(int) - Method in class org.archive.modules.extractor.CustomSWFTags
- tagPlaceObject2(boolean, int, int, int, Matrix, AlphaTransform, int, String, int) - Method in class org.archive.modules.extractor.CustomSWFTags
- tally(CrawlURI, FetchStats.Stage) - Method in interface org.archive.modules.fetcher.FetchStats.CollectsFetchStats
- tally(CrawlURI, FetchStats.Stage) - Method in class org.archive.modules.fetcher.FetchStats
- targetHost - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
- TempDirProvider - Interface in org.archive.modules.extractor
- template - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Template from which a filename is interpolated.
- test(int) - Method in class org.archive.modules.deciderules.ResourceLongerThanDecideRule
- test(int) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
- TestData(CrawlURI, CrawlURI) - Constructor for class org.archive.modules.extractor.StringExtractorTestBase.TestData
- testExtraction() - Method in class org.archive.modules.extractor.StringExtractorTestBase
-
Tests each text/URI pair in the test data array.
- testFinished() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Tests that a URI whose linkExtractionFinished flag has been set has no links extracted.
- testSerializationIfAppropriate() - Method in class org.archive.state.ModuleTestBase
-
Tests that the module can be serialized.
- testZeroContent() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Tests that a URI with a zero content length has no links extracted.
- TextSeedModule - Class in org.archive.modules.seeds
-
Module that announces a list of seeds from a text source (such as a ConfigFile or ConfigString), and provides a mechanism for adding seeds after a crawl has begun.
- TextSeedModule() - Constructor for class org.archive.modules.seeds.TextSeedModule
- textSource - Variable in class org.archive.modules.seeds.TextSeedModule
-
Text from which to extract seeds
- threadEngine - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
- threadEngine - Variable in class org.archive.modules.ScriptedProcessor
- TIMER_TRUNC - Static variable in interface org.archive.modules.CoreAttributeConstants
- TIMER_TRUNC - Static variable in class org.archive.modules.fetcher.FetchErrors
- TLDs - Static variable in class org.archive.modules.extractor.ExtractorUniversal
-
Matches any string that begins with a TLD (no .) followed by a '/' slash or end of string.
- toArray() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- toArray(T[]) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
- toCheckpointJson() - Method in class org.archive.modules.extractor.Extractor
- toCheckpointJson() - Method in class org.archive.modules.forms.FormLoginProcessor
- toCheckpointJson() - Method in class org.archive.modules.Processor
-
Return a JSONObject of current state that can be consulted on recovery to restore necessary values.
- toCheckpointJson() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- toCheckpointJson() - Method in class org.archive.modules.writer.WriterPoolProcessor
- tooLongDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
If all the directories in the URI would exceed, or come close to exceeding, the file system maximum path length, then they are all replaced by this.
- TooManyHopsDecideRule - Class in org.archive.modules.deciderules
-
Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold.
- TooManyHopsDecideRule() - Constructor for class org.archive.modules.deciderules.TooManyHopsDecideRule
-
Usual constructor.
- TooManyPathSegmentsDecideRule - Class in org.archive.modules.deciderules
-
Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold.
- TooManyPathSegmentsDecideRule() - Constructor for class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
-
Usual constructor.
- toString() - Method in class org.archive.modules.CrawlURI
- toString() - Method in class org.archive.modules.extractor.HTMLLinkContext
- toString() - Method in class org.archive.modules.extractor.LinkContext.SimpleLinkContext
- toString() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
- toString() - Method in class org.archive.modules.forms.HTMLForm.FormInput
- toString() - Method in class org.archive.modules.forms.HTMLForm
- toString() - Method in class org.archive.modules.net.CrawlHost
- toString() - Method in class org.archive.modules.net.CrawlServer
- TOTAL_BYTES - Static variable in class org.archive.modules.fetcher.FetchStats
- TOTAL_SCHEDULED - Static variable in class org.archive.modules.fetcher.FetchStats
- TransclusionDecideRule - Class in org.archive.modules.deciderules
-
Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see CrawlURI.getPathFromSeed()) ends with at least one, but not more than, the given number of non-navlink ('L') hops.
- TransclusionDecideRule() - Constructor for class org.archive.modules.deciderules.TransclusionDecideRule
-
Usual constructor.
- TrapSuppressExtractor - Class in org.archive.modules.extractor
-
Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
- TrapSuppressExtractor() - Constructor for class org.archive.modules.extractor.TrapSuppressExtractor
-
Usual constructor.
- TRUNC_SUFFIX - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Fetch truncation codes present in CrawlURI annotations.
- TRUNC_SUFFIX - Static variable in class org.archive.modules.fetcher.FetchErrors
-
Fetch truncation codes present in ProcessorURI annotations.
- type - Variable in class org.archive.modules.forms.HTMLForm.FormInput
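The TooManyHopsDecideRule and TooManyPathSegmentsDecideRule entries above both describe a count compared against a threshold. The sketch below illustrates the two counts as the index describes them: the length of the hopsPath string, and the number of '/' characters excluding the first '//'. The class ThresholdRuleSketch, its method names, and the plain String inputs are assumptions made for illustration; the real rules operate on CrawlURI objects, and their handling of edge cases (for example very long hop paths) is not shown in this index.

```java
class ThresholdRuleSketch {

    // Count '/' characters, skipping the first "//" (the scheme separator).
    static int countPathSegments(String uri) {
        int sep = uri.indexOf("//");
        int from = (sep >= 0) ? sep + 2 : 0;
        int count = 0;
        for (int i = from; i < uri.length(); i++) {
            if (uri.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    // "Over a given threshold" is read here as strictly greater than.
    static boolean tooManyPathSegments(String uri, int maxSegments) {
        return countPathSegments(uri) > maxSegments;
    }

    // Each character of hopsPath stands for one traversed link of any type.
    static boolean tooManyHops(String hopsPath, int maxHops) {
        return hopsPath.length() > maxHops;
    }

    public static void main(String[] args) {
        System.out.println(countPathSegments("http://example.com/a/b/c"));
        System.out.println(tooManyPathSegments("http://example.com/a/b/c", 2));
        System.out.println(tooManyHops("LLE", 3));
    }
}
```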
U
- ULTRA_SUFFIX_WHOIS_SERVER - Static variable in class org.archive.modules.fetcher.FetchWhois
- UNCALCULATED - Static variable in class org.archive.modules.CrawlURI
- underscoreSet - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
If a directory name appears (case-insensitive) in this list then an underscore is placed before it.
- UNKNOWN - org.archive.modules.CrawlURI.FetchType
- updateMetadataAfterWrite(CrawlURI, WARCWriter, long) - Method in class org.archive.modules.writer.BaseWARCWriterProcessor
- updateRobots(CrawlURI) - Method in class org.archive.modules.net.CrawlServer
-
Update the server's robotstxt
- uri - Variable in class org.archive.modules.extractor.StringExtractorTestBase.TestData
- URI_HISTORY_DBNAME - Static variable in class org.archive.modules.recrawl.PersistProcessor
-
name of history Database
- UriCanonicalizationPolicy - Class in org.archive.modules.canonicalize
-
URI Canonicalization Policy
- UriCanonicalizationPolicy() - Constructor for class org.archive.modules.canonicalize.UriCanonicalizationPolicy
- uriCount - Variable in class org.archive.modules.Processor
-
The number of URIs processed by this processor.
- UriErrorLoggerModule - Interface in org.archive.modules.extractor
- URL_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
- urlsWritten - Variable in class org.archive.modules.writer.BaseWARCWriterProcessor
- UserAgentProvider - Interface in org.archive.modules.fetcher
V
- validate(Pattern, String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
- VALIDATOR - Static variable in class org.archive.modules.CrawlMetadata
- validRobots - Variable in class org.archive.modules.net.CrawlServer
- value - Variable in class org.archive.modules.forms.HTMLForm.FormInput
- value - Variable in class org.archive.modules.forms.HTMLForm.NameValue
- valueOf(String) - Static method in enum org.archive.modules.CrawlURI.FetchType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.archive.modules.credential.HtmlFormCredential.Method
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.archive.modules.deciderules.DecideResult
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.archive.modules.extractor.Hop
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.archive.modules.fetcher.FetchStats.Stage
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.archive.modules.fetcher.FetchWhois.UrlStatus
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.archive.modules.ProcessResult.ProcessStatus
-
Returns the enum constant of this type with the specified name.
- values() - Static method in enum org.archive.modules.CrawlURI.FetchType
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.archive.modules.credential.HtmlFormCredential.Method
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.archive.modules.deciderules.DecideResult
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.archive.modules.extractor.Hop
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.archive.modules.fetcher.FetchStats.Stage
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.archive.modules.fetcher.FetchWhois.UrlStatus
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.archive.modules.ProcessResult.ProcessStatus
-
Returns an array containing the constants of this enum type, in the order they are declared.
- verifySerialization(Object, byte[], Object, byte[]) - Method in class org.archive.state.ModuleTestBase
-
Verifies that serialization was successful.
- ViaSurtPrefixedDecideRule - Class in org.archive.modules.deciderules
-
Rule applies the configured decision for any URI which has a 'via' whose surtform matches any surt specified in the surtPrefixes list
- ViaSurtPrefixedDecideRule() - Constructor for class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
- VIDEO - org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
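The valueOf(String) and values() entries above are the static methods that the Java compiler generates for every enum type. The sketch below shows their standard behavior on a stand-in enum: valueOf requires an exact, case-sensitive name and throws IllegalArgumentException otherwise, while values() returns the constants in declaration order. The enum Decision and the helper parseOrDefault are hypothetical and exist only for this illustration; they are not the enums listed in this index.

```java
import java.util.Arrays;

class EnumLookupSketch {

    // Hypothetical stand-in enum; the constants are chosen for illustration only.
    enum Decision { ACCEPT, NONE, REJECT }

    // valueOf matches the constant name exactly, so fall back on a mismatch.
    static Decision parseOrDefault(String name, Decision fallback) {
        try {
            return Decision.valueOf(name);
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // values() lists constants in the order they are declared.
        System.out.println(Arrays.toString(Decision.values()));
        System.out.println(parseOrDefault("REJECT", Decision.NONE));
        System.out.println(parseOrDefault("reject", Decision.NONE));
    }
}
```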
W
- WARC_NOVEL_CONTENT_BYTES - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- WARC_NOVEL_URLS - Static variable in class org.archive.crawler.util.CrawledBytesHistotable
- warcHeaderFor(String) - Method in class org.archive.modules.forms.FormLoginProcessor
- WARCRecordBuilder - Interface in org.archive.modules.warc
-
Implementations of this interface are each responsible for building a particular type of WARC record.
- WARCWriterChainProcessor - Class in org.archive.modules.writer
-
WARC writer processor.
- WARCWriterChainProcessor() - Constructor for class org.archive.modules.writer.WARCWriterChainProcessor
- WARCWriterProcessor - Class in org.archive.modules.writer
-
Deprecated in favor of WARCWriterChainProcessor.
- WARCWriterProcessor() - Constructor for class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- WHOIS_SERVER_REGEX - Static variable in class org.archive.modules.fetcher.FetchWhois
- WhoisResponseRecordBuilder - Class in org.archive.modules.warc
- WhoisResponseRecordBuilder() - Constructor for class org.archive.modules.warc.WhoisResponseRecordBuilder
- wildcardDirectives - Variable in class org.archive.modules.net.Robotstxt
- write(String, CrawlURI) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- write(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
- write(CrawlURI, long, InputStream, String) - Method in class org.archive.modules.writer.ARCWriterProcessor
- writeArchiveInfoPart(String, CrawlURI, ReplayInputStream, OutputStream) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- writeBufferSize - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Size of buffer in front of disk-writing.
- writeContentPart(String, CrawlURI, ReplayInputStream, OutputStream) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- writeDnsRecords(CrawlURI, WARCWriter, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeFtpControlConversation(WARCWriter, String, URI, CrawlURI, ANVLRecord, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeFtpRecords(WARCWriter, CrawlURI, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeHeaderPart(String, ReplayInputStream, OutputStream) - Method in class org.archive.modules.writer.Kw3WriterProcessor
- writeHttpRecords(CrawlURI, WARCWriter, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeMetadata(WARCWriter, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeMimeFile(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
The actual writing of the Kulturarw3 MIME-file.
- writeRecords(CrawlURI, WARCWriter) - Method in class org.archive.modules.writer.WARCWriterChainProcessor
- writeRequest(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeResource(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeResponse(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeRevisit(WARCWriter, String, String, URI, CrawlURI, ANVLRecord) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- writeRevisit(WARCWriter, String, String, URI, CrawlURI, ANVLRecord, long) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- WriterPoolProcessor - Class in org.archive.modules.writer
-
Abstract implementation of a file pool processor.
- WriterPoolProcessor() - Constructor for class org.archive.modules.writer.WriterPoolProcessor
- writeWhoisRecords(WARCWriter, CrawlURI, URI, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.