- A_ANNOTATIONS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
shorthand string tokens indicating notable occurrences,
separated by commas
- A_CONTENT_DIGEST - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
content digest
- A_CONTENT_DIGEST_COUNT - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
number of times we've seen this content digest (1 original + n duplicates)
- A_CONTENT_DIGEST_HISTORY - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
content digest history map
- A_CONTENT_TYPE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Extracted MIME type of fetched content; should be
set immediately by fetching module if possible
(rather than waiting for a later analyzer)
- A_CREDENTIALS_KEY - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Key to get credential avatars from A_LIST.
- A_DELAY_FACTOR - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Multiplier of last fetch duration to wait before
fetching another item of the same class (eg host)
- A_DISTANCE_FROM_SEED - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_DNS_FETCH_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_DNS_SERVER_IP_LABEL - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_ETAG_HEADER - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
header name (and AList key) for ETag
- A_FETCH_BEGAN_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_FETCH_COMPLETED_TIME - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_FETCH_HISTORY - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
fetch history array
- A_FORCE_RETIRE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
flag indicating the containing queue should be retired
- A_FORM_OFFSETS - Static variable in class org.archive.modules.extractor.ExtractorHTML
-
- A_FTP_CONTROL_CONVERSATION - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_FTP_FETCH_STATUS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_HERITABLE_KEYS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Key to (optional) attribute specifying a list of keys that
are passed to CandidateURIs that 'descend' (are discovered
via) this URI.
- A_HREF - Static variable in class org.archive.modules.extractor.HTMLLinkContext
-
- A_HTML_BASE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_HTML_FORM_OBJECTS - Static variable in class org.archive.modules.forms.ExtractorHTMLForms
-
- A_HTTP_AUTH_CHALLENGES - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_HTTP_PROXY_HOST - Static variable in interface org.archive.modules.CoreAttributeConstants
-
local override of proxy host
- A_HTTP_PROXY_PORT - Static variable in interface org.archive.modules.CoreAttributeConstants
-
local override of proxy port
- A_HTTP_RESPONSE_HEADERS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_LAST_MODIFIED_HEADER - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
header name (and AList key) for last-modified timestamp
- A_META_ROBOTS - Static variable in class org.archive.modules.extractor.ExtractorHTML
-
- A_MINIMUM_DELAY - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Minimum delay before fetching another item of the
same class (eg host).
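
The A_DELAY_FACTOR and A_MINIMUM_DELAY entries above describe a multiplier of the last fetch duration and a floor on the wait. A minimal sketch of how the two values might combine is shown here. The class and method names are hypothetical, and the rule itself (scale, then clamp to the minimum) is an assumption for illustration, not the crawler's actual politeness code.

```java
// Hypothetical helper, not part of org.archive.modules.
final class PolitenessDelaySketch {

    /**
     * Returns the wait in milliseconds before the next fetch of the same
     * class (e.g. host): the last fetch duration scaled by the delay
     * factor, but never less than the minimum delay.
     */
    static long delayMs(long lastFetchDurationMs, double delayFactor, long minimumDelayMs) {
        long scaled = Math.round(lastFetchDurationMs * delayFactor);
        return Math.max(scaled, minimumDelayMs);
    }

    public static void main(String[] args) {
        // 400 ms * 5.0 = 2000 ms, below the 3000 ms floor
        System.out.println(delayMs(400, 5.0, 3000));  // prints 3000
        // 1200 ms * 5.0 = 6000 ms, above the floor
        System.out.println(delayMs(1200, 5.0, 3000)); // prints 6000
    }
}
```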
- A_MIRROR_PATH - Static variable in interface org.archive.modules.CoreAttributeConstants
-
Define for org.archive.crawler.writer.MirrorWriterProcessor.
- A_MIRROR_PATH - Static variable in class org.archive.modules.writer.MirrorWriterProcessor
-
- A_NONFATAL_ERRORS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_ORIGINAL_DATE - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
date content payload was written
- A_ORIGINAL_URL - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
url that the content payload was written for
- A_PRECALC_PRECEDENCE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
key to attribute containing pre-calculated precedence
- A_PREREQUISITE_URI - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_REFERENCE_LENGTH - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
reference length (content length or virtual length)
- A_RETRY_DELAY - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_RRECORD_SET_LABEL - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_RUNTIME_EXCEPTION - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_SOURCE_TAG - Static variable in interface org.archive.modules.CoreAttributeConstants
-
a 'source' (usu.
- A_STATUS - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
key for status (when in history)
- A_SUBMIT_DATA - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_SUBMIT_ENCTYPE - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_VIA_DIGEST - Static variable in class org.archive.modules.extractor.TrapSuppressExtractor
-
AList attribute key for carrying-forward content-digest from 'via'
- A_WARC_FILE_OFFSET - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
offset into warc file of warc record with content payload
- A_WARC_FILENAME - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
warc filename containing the content payload
- A_WARC_RECORD_ID - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
warc record id of warc record with the content payload
- A_WARC_RESPONSE_HEADERS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_WARC_STATS - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_WHOIS_SERVER_IP - Static variable in interface org.archive.modules.CoreAttributeConstants
-
- A_WRITE_TAG - Static variable in interface org.archive.modules.recrawl.RecrawlAttributeConstants
-
Writer processors of all types are encouraged to put a 'writeTag'
(analogous to HTTP 'etag') in the CrawlURI state.
- aboutToLog() - Method in class org.archive.modules.CrawlURI
-
Notify CrawlURI it is about to be logged; opportunity
for self-annotation
- ABS_HTTP_URI_PATTERN - Static variable in class org.archive.modules.extractor.ExtractorURI
-
- AbstractContentDigestHistory - Class in org.archive.modules.recrawl
-
Represents a store of information, presumably persistent, keyed by content
digest.
- AbstractContentDigestHistory() - Constructor for class org.archive.modules.recrawl.AbstractContentDigestHistory
-
- AbstractCookieStore - Class in org.archive.modules.fetcher
-
- AbstractCookieStore() - Constructor for class org.archive.modules.fetcher.AbstractCookieStore
-
- AbstractCookieStore.LimitedCookieStoreFacade - Class in org.archive.modules.fetcher
-
- AbstractPersistProcessor - Class in org.archive.modules.recrawl
-
- AbstractPersistProcessor() - Constructor for class org.archive.modules.recrawl.AbstractPersistProcessor
-
- AbstractProfile - Class in org.archive.modules.revisit
-
- AbstractProfile() - Constructor for class org.archive.modules.revisit.AbstractProfile
-
- AcceptDecideRule - Class in org.archive.modules.deciderules
-
- AcceptDecideRule() - Constructor for class org.archive.modules.deciderules.AcceptDecideRule
-
- accepts(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
-
- accumulate(CrawlURI) - Method in class org.archive.crawler.util.CrawledBytesHistotable
-
- action - Variable in class org.archive.modules.forms.HTMLForm
-
- actions - Variable in class org.archive.modules.extractor.CustomSWFTags
-
- actOn(File) - Method in class org.archive.modules.seeds.SeedModule
-
- actOn(File) - Method in class org.archive.modules.seeds.TextSeedModule
-
Treat the given file as a source of additional seeds,
announcing to SeedListeners.
- add(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
-
- add(T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- add(int, T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- addAll(Collection<? extends T>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- addAll(int, Collection<? extends T>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- addAllow(String) - Method in class org.archive.modules.net.RobotsDirectives
-
- addAnnotations(CrawlURI, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
-
- addContentLocationHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- addCookie(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- addCookie(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
-
- addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.BdbCookieStore
-
- addCookieImpl(Cookie) - Method in class org.archive.modules.fetcher.SimpleCookieStore
-
- addCredential(Credential) - Method in class org.archive.modules.net.CrawlServer
-
Add an avatar.
- addDataPersistentMember(String) - Static method in class org.archive.modules.CrawlURI
-
Add the key of data map items you want to persist across
processings.
- addDisallow(String) - Method in class org.archive.modules.net.RobotsDirectives
-
- addedCredentials - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
-
- addedSeed(CrawlURI) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
If appropriate, convert seed notification into prefix-addition.
- addedSeed(CrawlURI) - Method in interface org.archive.modules.seeds.SeedListener
-
- addExtraInfo(String, Object) - Method in class org.archive.modules.CrawlURI
-
- addField(String, String, String, boolean) - Method in class org.archive.modules.forms.HTMLForm
-
Add a discovered INPUT, tracking it as potential
username/password receiver.
- addField(String, String, String) - Method in class org.archive.modules.forms.HTMLForm
-
Add a discovered INPUT, tracking it as potential
username/password receiver.
- addHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- addHeaderLink(CrawlURI, String, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- addIfNotBlank(ANVLRecord, String, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- addLinkFromString(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- addOutlink(CrawlURI, String, LinkContext, Hop) - Method in class org.archive.modules.extractor.Extractor
-
Create and add a 'Link' to the CrawlURI with given URI/context/hop-type
- addOutlink(CrawlURI, UURI, LinkContext, Hop) - Method in class org.archive.modules.extractor.Extractor
-
- addPersistentDataMapKey(String) - Method in class org.archive.modules.CrawlURI
-
Add the key of items you want to persist across
processings.
- AddRedirectFromRootServerToScope - Class in org.archive.modules.deciderules
-
- AddRedirectFromRootServerToScope() - Constructor for class org.archive.modules.deciderules.AddRedirectFromRootServerToScope
-
- addRefreshHeaderLink(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- addRelativeToBase(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
-
- addRelativeToVia(CrawlURI, int, String, LinkContext, Hop) - Static method in class org.archive.modules.extractor.Extractor
-
- addResponseContent(HttpResponse, CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
This method populates curi with response status and
content type.
- addSeed(CrawlURI) - Method in class org.archive.modules.seeds.SeedModule
-
- addSeed(CrawlURI) - Method in class org.archive.modules.seeds.TextSeedModule
-
Add a new seed to scope.
- addSeedListener(SeedListener) - Method in class org.archive.modules.seeds.SeedModule
-
- addStats(Map<String, Map<String, Long>>) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- addWhoisLink(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchWhois
-
- addWhoisLinks(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
-
Adds outlinks to whois:{domain} and whois:{ipAddress}
- afterPropertiesSet() - Method in class org.archive.modules.CrawlMetadata
-
- afterPropertiesSet() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- afterPropertiesSet() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- afterPropertiesSet() - Method in class org.archive.modules.ScriptedProcessor
-
- agentsToDirectives - Variable in class org.archive.modules.net.Robotstxt
-
- AggressiveExtractorHTML - Class in org.archive.modules.extractor
-
Extended version of ExtractorHTML with more aggressive javascript link
extraction where javascript code is parsed first with general HTML tags
regex, and then by javascript speculative link regex.
- AggressiveExtractorHTML() - Constructor for class org.archive.modules.extractor.AggressiveExtractorHTML
-
- allInputs - Variable in class org.archive.modules.forms.HTMLForm
-
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.CustomRobotsPolicy
-
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
-
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.IgnoreRobotsPolicy
-
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
-
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.ObeyRobotsPolicy
-
- allows - Variable in class org.archive.modules.net.RobotsDirectives
-
- allows(String) - Method in class org.archive.modules.net.RobotsDirectives
-
- allows(String, CrawlURI, Robotstxt) - Method in class org.archive.modules.net.RobotsPolicy
-
- allowsAll() - Method in class org.archive.modules.net.Robotstxt
-
Does this policy effectively allow everything? (No
disallows or timing (crawl-delay) directives?)
- analyze(CrawlURI, CharSequence) - Method in class org.archive.modules.forms.ExtractorHTMLForms
-
Run analysis: find form METHOD, ACTION, and all INPUT names/values.
Log as configured.
- ANNOTATION_UNWRITTEN - Static variable in class org.archive.modules.writer.WriterPoolProcessor
-
CrawlURI annotation indicating no record was written.
- announceSeeds() - Method in class org.archive.modules.seeds.SeedModule
-
- announceSeeds() - Method in class org.archive.modules.seeds.TextSeedModule
-
Announce all seeds from configured source to SeedListeners
(including nonseed lines mixed in).
- announceSeeds(CountDownLatch) - Method in class org.archive.modules.seeds.TextSeedModule
-
- announceSeedsFromReader(BufferedReader, CountDownLatch) - Method in class org.archive.modules.seeds.TextSeedModule
-
Announce all seeds (and nonseed possible-directive lines) from
the given Reader
- appCtx - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
-
- appCtx - Variable in class org.archive.modules.ScriptedProcessor
-
- ARCHIVE_TIME_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
-
- ARCWriterProcessor - Class in org.archive.modules.writer
-
Processor module for writing the results of successful fetches (and
perhaps someday, certain kinds of network failures) to the Internet Archive
ARC file format.
- ARCWriterProcessor() - Constructor for class org.archive.modules.writer.ARCWriterProcessor
-
- asAnnotation() - Method in class org.archive.modules.forms.HTMLForm
-
Provide abbreviated annotation, of the form...
- assertNoSideEffects(CrawlURI) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Asserts that the given URI has no URI errors, no localized errors, and
no annotations.
- atProcessor(Processor) - Method in interface org.archive.modules.ProcessorChain.ChainStatusReceiver
-
- attach(CrawlURI) - Method in class org.archive.modules.credential.Credential
-
Attach this credentials avatar to the passed curi.
- ATTR_MAX_BYTES_WRITTEN - Static variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Max size for each file. Key for the maximum ARC bytes to write attribute.
- audience - Variable in class org.archive.modules.CrawlMetadata
-
- AUTH_SCHEME_REGISTRY - Static variable in class org.archive.modules.fetcher.FetchHTTP
-
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.CrawlURI
-
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.CrawlHost
-
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.CrawlServer
-
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.RobotsDirectives
-
- autoregisterTo(AutoKryo) - Static method in class org.archive.modules.net.Robotstxt
-
- availableRobotsPolicies - Variable in class org.archive.modules.CrawlMetadata
-
Map of all available RobotsPolicies, by name, to choose from.
- calcOutputDirs() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- CandidateChain - Class in org.archive.modules
-
- CandidateChain() - Constructor for class org.archive.modules.CandidateChain
-
- candidatePasswordInputs - Variable in class org.archive.modules.forms.HTMLForm
-
- candidateUserAgents - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
-
list of user-agents to try; if any are allowed, a URI will be crawled
- candidateUserAgents - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
-
list of user-agents to try; if any are allowed, a URI will be crawled
- candidateUsernameInputs - Variable in class org.archive.modules.forms.HTMLForm
-
- CanonicalizationRule - Interface in org.archive.modules.canonicalize
-
A rule to apply when canonicalizing a URL.
- canonicalize(String) - Method in interface org.archive.modules.canonicalize.CanonicalizationRule
-
Apply this canonicalization rule.
- canonicalize(String) - Method in class org.archive.modules.canonicalize.FixupQueryString
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.LowercaseRule
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.RegexRule
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
-
Run the passed uuri through the list of rules.
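
Running a URI string through a list of rules, as the RulesCanonicalizationPolicy entry above describes, can be sketched as a simple fold over string-to-string functions. Everything named below is hypothetical: `UnaryOperator<String>` stands in for a rule, and the two sample rules are simplified illustrations that do not reproduce the behavior of LowercaseRule or StripWWWRule.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not part of org.archive.modules.canonicalize.
final class CanonicalizationChainSketch {

    // Simplified stand-in rule: lowercase the whole string.
    static final UnaryOperator<String> LOWERCASE =
            s -> s.toLowerCase(Locale.ROOT);

    // Simplified stand-in rule: drop a leading "www." after the scheme.
    static final UnaryOperator<String> STRIP_WWW =
            s -> s.replaceFirst("^(https?://)www\\.", "$1");

    /** Applies each rule in order, feeding the output of one into the next. */
    static String canonicalize(String url, List<UnaryOperator<String>> rules) {
        String result = url;
        for (UnaryOperator<String> rule : rules) {
            result = rule.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> rules = Arrays.asList(LOWERCASE, STRIP_WWW);
        // prints http://example.com/path
        System.out.println(canonicalize("HTTP://WWW.Example.COM/Path", rules));
    }
}
```

Rule order matters in this sketch: STRIP_WWW only matches a lowercase scheme and prefix, so it relies on LOWERCASE having run first.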
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripExtraSlashes
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripSessionCFIDs
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripSessionIDs
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripUserinfoRule
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripWWWNRule
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.StripWWWRule
-
- canonicalize(String) - Method in class org.archive.modules.canonicalize.UriCanonicalizationPolicy
-
- canonicalString - Variable in class org.archive.modules.CrawlURI
-
- caseSensitiveFilesystem - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
True if the file system is case-sensitive, like UNIX.
- catalog - Variable in class org.archive.modules.extractor.PDFParser
-
- characterMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
This list is grouped in pairs.
- checkBytesWritten() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- checked - Variable in class org.archive.modules.forms.HTMLForm.FormInput
-
- checkMidfetchAbort(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- chmod - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Should permissions be changed for the newly created dirs.
- chmodValue - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
What should the permissions be set to.
- chooseAuthScheme(Map<String, String>, String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- cleanup(CrawlURI, Exception, String, int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Cleanup after a failed method execute.
- clear() - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- clear() - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
-
- clear() - Method in class org.archive.modules.fetcher.BdbCookieStore
-
- clear() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- clear() - Method in class org.archive.modules.fetcher.SimpleCookieStore
-
- clearExpired(Date) - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
-
- clearExpired(Date) - Method in class org.archive.modules.fetcher.BdbCookieStore
-
- clearExpired(Date) - Method in class org.archive.modules.fetcher.SimpleCookieStore
-
- clearPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
-
Clear prerequisite, if any.
- close() - Method in class org.archive.modules.fetcher.DefaultServerCache
-
Called when shutting down the cache so we can do clean up.
- close() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
-
- collection - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Name of collection.
- COLLECTION_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
-
- comment - Variable in class org.archive.modules.deciderules.DecideRule
-
- compareTo(CrawlURI) - Method in class org.archive.modules.CrawlURI
-
- compress - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Whether to gzip-compress files when writing to disk;
by default true, meaning do-compress.
- concludedSeedBatch() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- concludedSeedBatch() - Method in interface org.archive.modules.seeds.SeedListener
-
- configureHttpClientBuilder() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
- configureRequest() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
- configureRequestHeaders() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
- connectTimeoutMs - Variable in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- connMan - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
-
- consecutiveConnectionErrors - Variable in class org.archive.modules.net.CrawlServer
-
- considerIfLikelyUri(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Consider whether a given string is URI-like.
- considerQueryStringValues(CrawlURI, CharSequence, CharSequence, Hop) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Consider a query-string-like collection of key=value[&key=value]
pairs for URI-like strings in the values.
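
The idea behind considerQueryStringValues, scanning key=value pairs for values that look like URIs, can be illustrated with a small standalone sketch. The class, method names, and the `looksLikeUri` heuristic are all assumptions made for this example; the heuristic ExtractorHTML actually applies is not shown in this index and may differ, and percent-decoding is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of org.archive.modules.extractor.
final class QueryStringValuesSketch {

    /** Returns the values of key=value pairs that pass the URI-like check. */
    static List<String> uriLikeValues(String query) {
        List<String> found = new ArrayList<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // no '=' means there is no value to inspect
            }
            String value = pair.substring(eq + 1);
            if (looksLikeUri(value)) {
                found.add(value);
            }
        }
        return found;
    }

    // Deliberately crude, assumed heuristic: absolute http(s) URLs or rooted paths.
    static boolean looksLikeUri(String s) {
        return s.startsWith("http://") || s.startsWith("https://") || s.startsWith("/");
    }

    public static void main(String[] args) {
        // prints [/login, http://example.com/x]
        System.out.println(uriLikeValues("a=1&next=/login&u=http://example.com/x"));
    }
}
```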
- considerString(Extractor, CrawlURI, boolean, String) - Method in class org.archive.modules.extractor.ExtractorJS
-
- considerStringAsUri(String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
-
- considerStrings(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorJS
-
- considerStrings(Extractor, CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorJS
-
- considerStrings(Extractor, CrawlURI, CharSequence, boolean) - Method in class org.archive.modules.extractor.ExtractorJS
-
- constructRegex(int) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
-
- contains(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- containsAll(Collection<?>) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- containsContentTypeCharsetDeclaration() - Method in class org.archive.modules.CrawlURI
-
- containsDataKey(String) - Method in class org.archive.modules.CrawlURI
-
- containsHost(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
-
- containsServer(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
-
- CONTENT_LENGTH_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
-
- CONTENT_MD5_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
-
- CONTENT_TYPE_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
-
- contentDigestHistory - Variable in class org.archive.modules.recrawl.ContentDigestHistoryLoader
-
- contentDigestHistory - Variable in class org.archive.modules.recrawl.ContentDigestHistoryStorer
-
- ContentDigestHistoryLoader - Class in org.archive.modules.recrawl
-
- ContentDigestHistoryLoader() - Constructor for class org.archive.modules.recrawl.ContentDigestHistoryLoader
-
- ContentDigestHistoryStorer - Class in org.archive.modules.recrawl
-
- ContentDigestHistoryStorer() - Constructor for class org.archive.modules.recrawl.ContentDigestHistoryStorer
-
- ContentExtractor - Class in org.archive.modules.extractor
-
Extracts links from the fetched content of a URI, as opposed to its headers.
- ContentExtractor() - Constructor for class org.archive.modules.extractor.ContentExtractor
-
- ContentExtractorTestBase - Class in org.archive.modules.extractor
-
Abstract base class for unit testing ContentExtractor implementations.
- ContentExtractorTestBase() - Constructor for class org.archive.modules.extractor.ContentExtractorTestBase
-
- ContentLengthDecideRule - Class in org.archive.modules.deciderules
-
- ContentLengthDecideRule() - Constructor for class org.archive.modules.deciderules.ContentLengthDecideRule
-
Usual constructor.
- contentTypeMap - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
This list is grouped in pairs.
- ContentTypeMatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
DecideRule whose decision is applied if the URI's content-type
is present and matches the supplied regular expression.
- ContentTypeMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule
-
- ContentTypeNotMatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
DecideRule whose decision is applied if the URI's content-type
is present and does not match the supplied regular expression.
- ContentTypeNotMatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule
-
- cookieComparator - Static variable in class org.archive.modules.fetcher.AbstractCookieStore
-
- COOKIEDB_NAME - Static variable in class org.archive.modules.fetcher.BdbCookieStore
-
- cookies - Variable in class org.archive.modules.fetcher.SimpleCookieStore
-
- cookiesLoadFile - Variable in class org.archive.modules.fetcher.AbstractCookieStore
-
- cookiesSaveFile - Variable in class org.archive.modules.fetcher.AbstractCookieStore
-
- cookieStore - Variable in class org.archive.modules.fetcher.FetchHTTP
-
- cookieStoreFor(CrawlURI) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- cookieStoreFor(String) - Method in class org.archive.modules.fetcher.BdbCookieStore
-
Returns a LimitedCookieStoreFacade whose
LimitedCookieStoreFacade#getCookies() method returns only cookies
from host and its parent domains, if applicable.
- cookieStoreFor(String) - Method in interface org.archive.modules.fetcher.FetchHTTPCookieStore
-
Returns a CookieStore whose CookieStore.getCookies() returns all the
cookies from host and each of its parent domains, if applicable.
- cookieStoreFor(CrawlURI) - Method in interface org.archive.modules.fetcher.FetchHTTPCookieStore
-
Returns a CookieStore whose CookieStore.getCookies() returns all the
cookies that could possibly apply to curi.
- cookieStoreFor(String) - Method in class org.archive.modules.fetcher.SimpleCookieStore
-
- copyForwardWriteTagIfDupe(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
If this fetch is identical to the last written (archived) fetch, then
copy forward the writeTag.
- copyPersistSourceToHistoryMap(File, StoredSortedMap<String, Map>) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Populates a given StoredSortedMap (history map) from an old
environment db or a persist log.
- copyPersistSourceToHistoryMap(URL, StoredSortedMap<String, Map>) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Populates a given StoredSortedMap (history map) from an old persist log.
- copyStats(Map<String, Map<String, Long>>) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- CoreAttributeConstants - Interface in org.archive.modules
-
Attribute keys and constant strings used by the core crawler
classes.
- countryCodes - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
Country code name.
- crawlDelay - Variable in class org.archive.modules.net.RobotsDirectives
-
- CrawledBytesHistotable - Class in org.archive.crawler.util
-
- CrawledBytesHistotable() - Constructor for class org.archive.crawler.util.CrawledBytesHistotable
-
- CrawlHost - Class in org.archive.modules.net
-
Represents a single remote "host".
- CrawlHost(String) - Constructor for class org.archive.modules.net.CrawlHost
-
Create a new CrawlHost object.
- CrawlHost(String, String) - Constructor for class org.archive.modules.net.CrawlHost
-
Create a new CrawlHost object.
- CrawlMetadata - Class in org.archive.modules
-
Basic crawl metadata, as consulted by functional modules and
recorded in ARCs/WARCs.
- CrawlMetadata() - Constructor for class org.archive.modules.CrawlMetadata
-
- CrawlServer - Class in org.archive.modules.net
-
Represents a single remote "server".
- CrawlServer(String) - Constructor for class org.archive.modules.net.CrawlServer
-
Creates a new CrawlServer object.
- CrawlURI - Class in org.archive.modules
-
Represents a candidate URI and the associated state it
collects as it is crawled.
- CrawlURI(UURI) - Constructor for class org.archive.modules.CrawlURI
-
Create a new instance of CrawlURI from a UURI.
- CrawlURI(UURI, String, UURI, LinkContext) - Constructor for class org.archive.modules.CrawlURI
-
- CrawlURI.FetchType - Enum in org.archive.modules
-
- CrawlUriSWFAction(CrawlURI, Extractor) - Constructor for class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
-
- createCrawlURI(UURI, LinkContext, Hop) - Method in class org.archive.modules.CrawlURI
-
Utility method for creating CrawlURIs that were found as outlinks
from the current CrawlURI.
- createCrawlURI(String, LinkContext, Hop) - Method in class org.archive.modules.CrawlURI
-
- createCrawlURI(UURI, LinkContext, Hop, int, boolean) - Method in class org.archive.modules.CrawlURI
-
Utility method for creation of CrawlURIs found extracting
links from this CrawlURI.
- createFormSubmissionAttempt(CrawlURI, HTMLForm, String) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- createHostDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Create a subdirectory named for the host in the URI.
- createPortDirectory - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Create a subdirectory named for the port in the URI.
- createRecorder(String) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Deprecated.
- createRecorder(String, String) - Static method in class org.archive.modules.extractor.ContentExtractorTestBase
-
- createSocket() - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- createSocket(String, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- createSocket(InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- createSocket(String, int, InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- createSocket(InetAddress, int, InetAddress, int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- Credential - Class in org.archive.modules.credential
-
Credential type.
- Credential() - Constructor for class org.archive.modules.credential.Credential
-
Constructor.
- CredentialStore - Class in org.archive.modules.credential
-
Front door to the credential store.
- CredentialStore() - Constructor for class org.archive.modules.credential.CredentialStore
-
Constructor.
- CSS_BACKSLASH_ESCAPE - Static variable in class org.archive.modules.extractor.ExtractorCSS
-
- CSS_URI_EXTRACTOR - Static variable in class org.archive.modules.extractor.ExtractorCSS
-
CSS URL extractor pattern.
- curi - Variable in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
-
- curi - Variable in class org.archive.modules.fetcher.FetchHTTPRequest
-
- customRobots - Variable in class org.archive.modules.net.CustomRobotsPolicy
-
textual alternate robots.txt rules to follow
- CustomRobotsPolicy - Class in org.archive.modules.net
-
Follow a custom-written robots policy, rather than the site's own declarations.
Does not support overlays of different custom-robots; instead it is
recommended each custom policy be declared as a separate bean, with a
distinct name.
- CustomRobotsPolicy() - Constructor for class org.archive.modules.net.CustomRobotsPolicy
-
- customRobotstxt - Variable in class org.archive.modules.net.CustomRobotsPolicy
-
- CustomSWFTags - Class in org.archive.modules.extractor
-
Overwrite action tags that may hold URIs, so that they use the
CrawlUriSWFAction action.
- CustomSWFTags(SWFActions) - Constructor for class org.archive.modules.extractor.CustomSWFTags
-
- elementContext(CharSequence, CharSequence) - Static method in class org.archive.modules.extractor.ExtractorHTML
-
Create a suitable XPath-like context from an element name and optional
attribute name.
- eligibleFormsAttemptsCount - Variable in class org.archive.modules.forms.FormLoginProcessor
-
- eligibleFormsSeenCount - Variable in class org.archive.modules.forms.FormLoginProcessor
-
- EMBED_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for embeds without other context.
- encounteredReferences - Variable in class org.archive.modules.extractor.PDFParser
-
- enctype - Variable in class org.archive.modules.forms.HTMLForm
-
- engineName - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
-
engine name; default "beanshell"
- engineName - Variable in class org.archive.modules.ScriptedProcessor
-
engine name; default "beanshell"
- ensureStandardPoliciesAvailable() - Method in class org.archive.modules.CrawlMetadata
-
- equals(Object) - Method in class org.archive.modules.CrawlURI
-
- equals(Object) - Method in class org.archive.modules.extractor.LinkContext
-
- equals(Object) - Method in class org.archive.modules.net.CrawlHost
-
- equals(Object) - Method in class org.archive.modules.net.CrawlServer
-
- escapeForMultipart(String) - Static method in class org.archive.modules.fetcher.FetchHTTPRequest
-
Returns a copy of the string with non-ASCII characters replaced by their
HTML numeric character reference in decimal (e.g.
- eTag - Variable in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.AddRedirectFromRootServerToScope
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule
-
Evaluate whether given object's string version does not match
configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
-
Evaluate whether given object is equal to the configured status
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule
-
Evaluate whether given object's FetchStatus does not match
configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.HasViaDecideRule
-
Evaluate whether the given object has a via URI.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
Evaluate whether given object's string version
matches configured regexes
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
-
Evaluate whether given object's string version
matches configured regex
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Returns "true" if the provided CrawlURI has a fetch status that falls
within this instance's specified range.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesFilePatternDecideRule
-
Evaluate whether given object's string version does not match
configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesListRegexDecideRule
-
Evaluate whether given object's string version does not match
configured regexes (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesRegexDecideRule
-
Evaluate whether given object's string version does not match
configured regex (by reversing the superclass's answer).
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
-
Returns "true" if the provided CrawlURI has a fetch status that does not
fall within this instance's specified range.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
-
Evaluate whether given CrawlURI's revisit profile has been set to identical digest
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
Evaluate whether the given object's scheme is not in the configured set.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
-
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotOnDomainsDecideRule
-
Evaluate whether given object's URI is NOT in the set of
domains -- simply reverse superclass's determination
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotOnHostsDecideRule
-
Evaluate whether given object's URI is NOT in the set of
hosts -- simply reverse superclass's determination
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule
-
Evaluate whether given object's URI is NOT in the SURT
prefix set -- simply reverse superclass's determination
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Evaluate whether given object's URI is covered by the SURT prefix set
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
-
Evaluate whether given object is over the threshold number of
hops.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
-
Evaluate whether given object is over the threshold number of
path-segments.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
-
Evaluate whether given object is within the acceptable thresholds of
transitive hops.
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
-
Evaluate whether given object's surt form
matches one of the supplied surts
- execute() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
- expectContinue() - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
-
- expectedResult - Variable in class org.archive.modules.extractor.StringExtractorTestBase.TestData
-
- extendHopsPath(String, char) - Static method in class org.archive.modules.CrawlURI
-
Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols),
keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED.
- ExternalGeoLocationDecideRule - Class in org.archive.modules.deciderules
-
A rule that can be configured to take alternate implementations
of the ExternalGeoLookupInterface.
- ExternalGeoLocationDecideRule() - Constructor for class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- ExternalGeoLookupInterface - Interface in org.archive.modules.deciderules
-
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Extracts links
- extract(CrawlURI) - Method in class org.archive.modules.extractor.Extractor
-
Extracts links from the given URI.
- extract(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Run extractor.
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
Perform usual extraction on a CrawlURI
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
-
Perform usual extraction on a CrawlURI
- extract(CrawlURI, CharSequence) - Method in class org.archive.modules.extractor.JerichoExtractorHTML
-
Run extractor.
- extract(CrawlURI) - Method in class org.archive.modules.forms.ExtractorHTMLForms
-
- extractChallenges(HttpResponse, CrawlURI, AuthenticationStrategy) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- extractImplied(CharSequence, Pattern, String) - Static method in class org.archive.modules.extractor.ExtractorImpliedURI
-
Utility method for extracting 'implied' URI given a source uri,
trigger pattern, and build pattern.
- extractLink(CrawlURI, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
-
Consider a single Link for internal URIs
- extractor - Variable in class org.archive.modules.extractor.ContentExtractorTestBase
-
An extractor created during the setUp.
- Extractor - Class in org.archive.modules.extractor
-
Extracts links from fetched URIs.
- Extractor() - Constructor for class org.archive.modules.extractor.Extractor
-
- ExtractorCSS - Class in org.archive.modules.extractor
-
This extractor parses URIs from CSS files.
- ExtractorCSS() - Constructor for class org.archive.modules.extractor.ExtractorCSS
-
- ExtractorDOC - Class in org.archive.modules.extractor
-
This class allows the caller to extract href-style links from Word 97-format Word documents.
- ExtractorDOC() - Constructor for class org.archive.modules.extractor.ExtractorDOC
-
- ExtractorHTML - Class in org.archive.modules.extractor
-
Basic link-extraction, from an HTML content-body,
using regular expressions.
- ExtractorHTML() - Constructor for class org.archive.modules.extractor.ExtractorHTML
-
- ExtractorHTMLForms - Class in org.archive.modules.forms
-
Extracts extra information about FORMs in HTML, loading this
into the CrawlURI (for potential later use by FormLoginProcessor)
and adding a small annotation to the crawl.log.
- ExtractorHTMLForms() - Constructor for class org.archive.modules.forms.ExtractorHTMLForms
-
- ExtractorHTTP - Class in org.archive.modules.extractor
-
Extracts URIs from HTTP response headers.
- ExtractorHTTP() - Constructor for class org.archive.modules.extractor.ExtractorHTTP
-
- ExtractorImpliedURI - Class in org.archive.modules.extractor
-
An extractor for finding 'implied' URIs inside other URIs.
- ExtractorImpliedURI() - Constructor for class org.archive.modules.extractor.ExtractorImpliedURI
-
Constructor.
- extractorJS - Variable in class org.archive.modules.extractor.ExtractorHTML
-
JavaScript extractor used to process inline JavaScript.
- ExtractorJS - Class in org.archive.modules.extractor
-
Processes JavaScript files for strings that are likely to be
crawlable URIs.
- ExtractorJS() - Constructor for class org.archive.modules.extractor.ExtractorJS
-
- extractorJS - Variable in class org.archive.modules.extractor.ExtractorSWF
-
JavaScript extractor used to process inline JavaScript.
- ExtractorMultipleRegex - Class in org.archive.modules.extractor
-
An extractor that uses regular expressions to find strings in the fetched
content of a URI, and constructs outlink URIs from those strings.
- ExtractorMultipleRegex() - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex
-
- ExtractorMultipleRegex.GroupList - Class in org.archive.modules.extractor
-
- ExtractorMultipleRegex.MatchList - Class in org.archive.modules.extractor
-
- extractorParameters - Variable in class org.archive.modules.extractor.Extractor
-
- ExtractorParameters - Interface in org.archive.modules.extractor
-
Bean interface for parameters consulted by multiple Extractors, and
thus provided by some shared object.
- ExtractorPDF - Class in org.archive.modules.extractor
-
Allows the caller to process a CrawlURI representing a PDF
for the purpose of extracting URIs
- ExtractorPDF() - Constructor for class org.archive.modules.extractor.ExtractorPDF
-
- ExtractorSWF - Class in org.archive.modules.extractor
-
Extracts URIs from SWF (flash/shockwave) files.
- ExtractorSWF() - Constructor for class org.archive.modules.extractor.ExtractorSWF
-
- ExtractorSWF.CrawlUriSWFAction - Class in org.archive.modules.extractor
-
SWF action that handles discovered URIs.
- ExtractorSWF.ExtractorTagParser - Class in org.archive.modules.extractor
-
TagParser customized to ignore SWFTags that
will never contain extractable URIs.
- ExtractorTagParser(SWFTagTypes) - Constructor for class org.archive.modules.extractor.ExtractorSWF.ExtractorTagParser
-
- ExtractorUniversal - Class in org.archive.modules.extractor
-
A last-ditch extractor that will look at the raw byte code and try to extract
anything that looks like a link.
- ExtractorUniversal() - Constructor for class org.archive.modules.extractor.ExtractorUniversal
-
Constructor.
- ExtractorURI - Class in org.archive.modules.extractor
-
An extractor for finding URIs inside other URIs.
- ExtractorURI() - Constructor for class org.archive.modules.extractor.ExtractorURI
-
Constructor
- ExtractorXML - Class in org.archive.modules.extractor
-
A simple extractor which finds HTTP URIs inside XML/RSS files,
inside attribute values and simple elements (those with only
whitespace + HTTP URI + whitespace as contents).
- ExtractorXML() - Constructor for class org.archive.modules.extractor.ExtractorXML
-
- extractQueryStringLinks(UURI) - Static method in class org.archive.modules.extractor.ExtractorURI
-
Look for URIs inside the supplied UURI.
- extractURIs() - Method in class org.archive.modules.extractor.PDFParser
-
Extract URIs from all objects found in a PDF document's catalog.
- extractURIs(PdfObject) - Method in class org.archive.modules.extractor.PDFParser
-
Parse a PdfDictionary, looking for URIs recursively and adding
them to foundURIs
- extraInfo - Variable in class org.archive.modules.CrawlURI
-
- generator - Variable in class org.archive.modules.writer.WARCWriterProcessor
-
Generator for record IDs
- get(Object, String) - Method in class org.archive.modules.credential.CredentialStore
-
- get(CharSequence, CharSequence) - Static method in class org.archive.modules.extractor.HTMLLinkContext
-
Return an instance of HTMLLinkContext for attribute attr in element el.
- get(String) - Static method in class org.archive.modules.extractor.HTMLLinkContext
-
Return an instance of HTMLLinkContext for path path.
- get(int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- getAcceptCompression() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getAcceptHeaders() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getAcceptNonDnsResolves() - Method in class org.archive.modules.fetcher.FetchDNS
-
- getAction() - Method in class org.archive.modules.forms.HTMLForm
-
- getAll() - Method in class org.archive.modules.credential.CredentialStore
-
- getAlsoCheckVia() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- getAnnotations() - Method in class org.archive.modules.CrawlURI
-
Get the annotations set for this uri.
- getApplicableSurtPrefix() - Method in class org.archive.modules.forms.FormLoginProcessor
-
- getAttributeEither(CrawlURI, String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Get a value either from inside the CrawlURI instance, or from
settings (module attributes).
- getAudience() - Method in class org.archive.modules.CrawlMetadata
-
- getAvailableRobotsPolicies() - Method in class org.archive.modules.CrawlMetadata
-
- getBaseURI() - Method in class org.archive.modules.CrawlURI
-
Get the (HTML) Base URI used for derelativizing internal URIs.
- getBeanName() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- getBeanName() - Method in class org.archive.modules.Processor
-
- getBlockAwaitingSeedLines() - Method in class org.archive.modules.seeds.TextSeedModule
-
- getByRealm(Set<Credential>, String, CrawlURI) - Static method in class org.archive.modules.credential.HttpAuthenticationCredential
-
Convenience method that does a lookup on the passed set, using the realm as the key.
- getCandidateUserAgents() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
-
- getCandidateUserAgents() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
-
- getCanonicalString() - Method in class org.archive.modules.CrawlURI
-
- getCaseSensitiveFilesystem() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getCharacterMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getChmod() - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- getChmodValue() - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- getClassKey() - Method in class org.archive.modules.CrawlURI
-
Get the token (usually the hostname + port) which indicates
what "class" this CrawlURI should be grouped with,
for the purposes of ensuring only one item of the
class is processed at once, all items of the class
are held for a politeness period, etc.
- getCollection() - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- getComment() - Method in class org.archive.modules.deciderules.DecideRule
-
- getCompress() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getConfiguredHttpVersion() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getConnectTimeoutMs() - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- getContentDeclaredCharset(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getContentDeclaredCharset(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorXML
-
- getContentDigest() - Method in class org.archive.modules.CrawlURI
-
Return the retained content-digest value, if any.
- getContentDigestHistory() - Method in class org.archive.modules.CrawlURI
-
- getContentDigestSchemeString() - Method in class org.archive.modules.CrawlURI
-
- getContentDigestString() - Method in class org.archive.modules.CrawlURI
-
- getContentLength() - Method in class org.archive.modules.CrawlURI
-
For completed HTTP transactions, the length of the content-body.
- getContentLengthThreshold() - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
-
- getContentLengthThreshold() - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
-
- getContentRegexes() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- getContentSize() - Method in class org.archive.modules.CrawlURI
-
Get the size in bytes of this URI's recorded content, inclusive
of things like protocol headers.
- getContentType() - Method in class org.archive.modules.CrawlURI
-
Get the content type of this URI.
- getContentTypeMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getCookies() - Method in class org.archive.modules.fetcher.AbstractCookieStore.LimitedCookieStoreFacade
-
- getCookies() - Method in class org.archive.modules.fetcher.BdbCookieStore
-
- getCookies() - Method in class org.archive.modules.fetcher.SimpleCookieStore
-
- getCookiesLoadFile() - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- getCookiesSaveFile() - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- getCookieStore() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getCountryCode() - Method in class org.archive.modules.net.CrawlHost
-
Get country code of this host
- getCountryCodes() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- getCrawlDelay() - Method in class org.archive.modules.net.RobotsDirectives
-
- getCreateHostDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getCreatePortDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getCredentials() - Method in class org.archive.modules.CrawlURI
-
- getCredentials() - Method in class org.archive.modules.credential.CredentialStore
-
- getCredentials(CrawlURI, Class<?>) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getCredentials() - Method in class org.archive.modules.net.CrawlServer
-
- getCredentialStore() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getCredentialTypes() - Static method in class org.archive.modules.credential.CredentialStore
-
- getCustomRobots() - Method in class org.archive.modules.net.CustomRobotsPolicy
-
- getData() - Method in class org.archive.modules.CrawlURI
-
- getDataList(String) - Method in class org.archive.modules.CrawlURI
-
Convenience method: return (creating if necessary) list at
given data key
- getDecision() - Method in class org.archive.modules.deciderules.PredicatedDecideRule
-
- getDefaultCharset() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getDefaultEncoding() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getDefaultMaxFileSize() - Method in class org.archive.modules.writer.ARCWriterProcessor
-
- getDefaultMaxFileSize() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- getDefaultMaxFileSize() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getDefaultRules() - Static method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
-
A reasonable set of default rules to use, if no others are
provided by operator configuration.
- getDefaultStorePaths() - Method in class org.archive.modules.writer.ARCWriterProcessor
-
- getDefaultStorePaths() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- getDefaultStorePaths() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getDeferrals() - Method in class org.archive.modules.CrawlURI
-
Get the deferral count.
- getDescription() - Method in class org.archive.modules.CrawlMetadata
-
- getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchDNS
-
- getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getDigestAlgorithm() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getDigestContent() - Method in class org.archive.modules.fetcher.FetchDNS
-
- getDigestContent() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getDigestContent() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getDirectivesFor(String, boolean) - Method in class org.archive.modules.net.Robotstxt
-
Return the RobotsDirectives, if any, appropriate for the given User-Agent
string.
- getDirectivesFor(String) - Method in class org.archive.modules.net.Robotstxt
-
Return directives to use for the given User-Agent, resorting to wildcard
rules or the default no-directives if necessary.
- getDirectory() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getDirectoryFile() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getDisableJavaDnsResolves() - Method in class org.archive.modules.fetcher.FetchDNS
-
- getDNSRecord(long, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
-
- getDNSServerIPLabel() - Method in class org.archive.modules.CrawlURI
-
- getDomain() - Method in class org.archive.modules.credential.Credential
-
- getDotBegin() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getDotEnd() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getDupByHashBytes() - Method in class org.archive.modules.fetcher.FetchStats
-
- getDupByHashUrls() - Method in class org.archive.modules.fetcher.FetchStats
-
- getEarliestNextURIEmitTime() - Method in class org.archive.modules.net.CrawlHost
-
Get the earliest time a URI for this host could be emitted.
- getEmbedHopCount() - Method in class org.archive.modules.CrawlURI
-
Get the embed hop count.
- getEnabled() - Method in class org.archive.modules.canonicalize.BaseRule
-
- getEnabled() - Method in interface org.archive.modules.canonicalize.CanonicalizationRule
-
- getEnabled() - Method in class org.archive.modules.deciderules.DecideRule
-
- getEnabled() - Method in class org.archive.modules.Processor
-
- getEnctype() - Method in class org.archive.modules.forms.HTMLForm
-
- getEngine() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
Get the proper ScriptEngine instance -- either shared or local
to this thread.
- getEngine() - Method in class org.archive.modules.ScriptedProcessor
-
Get the proper ScriptEngine instance -- either shared or local
to this thread.
- getEngineName() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- getEngineName() - Method in class org.archive.modules.ScriptedProcessor
-
- getEntity() - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
-
- getETag() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- getExtract404s() - Method in interface org.archive.modules.extractor.ExtractorParameters
-
Whether to extract links from responses with a 404 'not found' response
code.
- getExtractAllForms() - Method in class org.archive.modules.forms.ExtractorHTMLForms
-
- getExtractFromDirs() - Method in class org.archive.modules.fetcher.FetchFTP
-
Returns the extract.from.dirs attribute for this FetchFTP and the given curi.
- getExtractIndependently() - Method in interface org.archive.modules.extractor.ExtractorParameters
-
Whether each extractor should make an independent decision as to whether
it can extract links from a URI's content (when value is true), or
whether a previous extractor's success (marking the URI as
hasBeenLinkExtracted) should cancel later extractors (when value is
false).
- getExtractJavascript() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getExtractOnlyFormGets() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getExtractorJS() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getExtractorJS() - Method in class org.archive.modules.extractor.ExtractorSWF
-
- getExtractorParameters() - Method in class org.archive.modules.extractor.Extractor
-
- getExtractParent() - Method in class org.archive.modules.fetcher.FetchFTP
-
Returns the extract.parent attribute for this FetchFTP and the given curi.
- getExtractValueAttributes() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getExtraInfo() - Method in class org.archive.modules.CrawlURI
-
- getFetchAttempts() - Method in class org.archive.modules.CrawlURI
-
Get the count of attempts (trips through the processing
loop) at getting the document referenced by this URI.
- getFetchBeginTime() - Method in class org.archive.modules.CrawlURI
-
- getFetchCompletedTime() - Method in class org.archive.modules.CrawlURI
-
- getFetchDisregards() - Method in class org.archive.modules.fetcher.FetchStats
-
- getFetchDuration() - Method in class org.archive.modules.CrawlURI
-
- getFetchHistory() - Method in class org.archive.modules.CrawlURI
-
- getFetchNonResponses() - Method in class org.archive.modules.fetcher.FetchStats
-
- getFetchResponses() - Method in class org.archive.modules.fetcher.FetchStats
-
- getFetchStatus() - Method in class org.archive.modules.CrawlURI
-
Return the overall/fetch status of this CrawlURI for its
current trip through the processing loop.
- getFetchSuccesses() - Method in class org.archive.modules.fetcher.FetchStats
-
- getFetchType() - Method in class org.archive.modules.CrawlURI
-
- getFirstARecord(Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
-
- getFormat() - Method in class org.archive.modules.canonicalize.RegexRule
-
- getFormat() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
- getFormItems() - Method in class org.archive.modules.credential.HtmlFormCredential
-
- getFormProvince(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
-
Get the 'form province' - either the configured (applicableSurtPrefix)
or inferred (full current server) range of URIs that is considered
covered by one form login
- getFrequentFlushes() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getFrom() - Method in class org.archive.modules.CrawlMetadata
-
- getFrom() - Method in interface org.archive.modules.fetcher.UserAgentProvider
-
- getFullVia() - Method in class org.archive.modules.CrawlURI
-
- getHarvester() - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- getHistoryDbName() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- getHistoryDbName() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
-
- getHistoryLength() - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
- getHolder() - Method in class org.archive.modules.CrawlURI
-
Return the 'holder' for the convenience of
an external facility.
- getHolderCost() - Method in class org.archive.modules.CrawlURI
-
Return the 'holderCost' for convenience of external facility (frontier)
- getHolderKey() - Method in class org.archive.modules.CrawlURI
-
Return the 'holderKey' for convenience of
an external facility (Frontier).
- getHopChar() - Method in enum org.archive.modules.extractor.Hop
-
Returns a hop character suitable for display in logs.
- getHopCount() - Method in class org.archive.modules.CrawlURI
-
Get total hops from seed.
- getHopString() - Method in enum org.archive.modules.extractor.Hop
-
- getHostAddress(CrawlURI) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
from WriterPoolProcessor
- getHostAddress(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Return IP address of given URI suitable for recording (as in a
classic ARC 5-field header line).
- getHostFor(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
-
- getHostFor(String) - Method in class org.archive.modules.net.ServerCache
-
- getHostFor(UURI) - Method in class org.archive.modules.net.ServerCache
-
- getHostMap() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getHostName() - Method in class org.archive.modules.net.CrawlHost
-
Get the host name.
- getHttpAuthChallenges() - Method in class org.archive.modules.CrawlURI
-
- getHttpAuthChallenges() - Method in class org.archive.modules.net.CrawlServer
-
- getHttpBindAddress() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getHttpMethod() - Method in class org.archive.modules.credential.HtmlFormCredential
-
- getHttpProxyHost() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getHttpProxyPassword() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getHttpProxyPort() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getHttpProxyUser() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getHttpResponseHeader(String) - Method in class org.archive.modules.CrawlURI
-
- getId() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
-
- getIgnoreCookies() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getIgnoreFormActionUrls() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getIgnoreUnexpectedHtml() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getInferRootPage() - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- getInFromFile(String) - Method in class org.archive.modules.extractor.PDFParser
-
Read a file named 'doc' and store its bytes for later processing.
- getIP() - Method in class org.archive.modules.net.CrawlHost
-
Get the IP address for this host.
- getIpAddresses() - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
- getIpFetched() - Method in class org.archive.modules.net.CrawlHost
-
Get the time when the IP address for this host was last looked up.
- getIpTTL() - Method in class org.archive.modules.net.CrawlHost
-
Get the TTL value from the DNS record for this host.
- getIsolateThreads() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- getIsolateThreads() - Method in class org.archive.modules.ScriptedProcessor
-
- getJobName() - Method in class org.archive.modules.CrawlMetadata
-
- getJumpTarget() - Method in class org.archive.modules.ProcessResult
-
- getKey() - Method in class org.archive.modules.credential.Credential
-
- getKey() - Method in class org.archive.modules.credential.HtmlFormCredential
-
- getKey() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- getKey() - Method in class org.archive.modules.net.CrawlHost
-
- getKey() - Method in class org.archive.modules.net.CrawlServer
-
- getKeyedProperties() - Method in class org.archive.modules.canonicalize.BaseRule
-
- getKeyedProperties() - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
-
- getKeyedProperties() - Method in class org.archive.modules.CrawlMetadata
-
- getKeyedProperties() - Method in class org.archive.modules.credential.CredentialStore
-
- getKeyedProperties() - Method in class org.archive.modules.deciderules.DecideRule
-
- getKeyedProperties() - Method in class org.archive.modules.Processor
-
- getKeyedProperties() - Method in class org.archive.modules.ProcessorChain
-
- getLastHop() - Method in class org.archive.modules.CrawlURI
-
convenience access to last hop character, as string
- getLastModified() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- getLastSuccessTime() - Method in class org.archive.modules.fetcher.FetchStats
-
- getLinkCount() - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
-
- getLinkHopCount() - Method in class org.archive.modules.CrawlURI
-
Get the link hop count.
- getListLogicalOr() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
- getLogExtraInfo() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- getLogFile() - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- getLoggerModule() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- getLoggerModule() - Method in class org.archive.modules.extractor.Extractor
-
- getLoggerModule() - Method in class org.archive.modules.forms.FormLoginProcessor
-
- getLogin() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- getLoginPassword() - Method in class org.archive.modules.forms.FormLoginProcessor
-
- getLoginUri() - Method in class org.archive.modules.credential.HtmlFormCredential
-
- getLoginUsername() - Method in class org.archive.modules.forms.FormLoginProcessor
-
- getLogToFile() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- getLookup() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- getLowerBound() - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Returns the lower bound on the range of acceptable status codes.
- getLowerBound() - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
-
- getMaxAttributeNameLength() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getMaxAttributeValLength() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getMaxElementLength() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getMaxFetchKBSec() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getMaxFileSizeBytes() - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- getMaxFileSizeBytes() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getMaxHops() - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
-
- getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getMaxLengthBytes() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getMaxOutlinks() - Method in interface org.archive.modules.extractor.ExtractorParameters
-
The maximum number of outlinks to discover from any URI's content.
- getMaxPathDepth() - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
-
- getMaxPathLength() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getMaxRepetitions() - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
-
- getMaxSegLength() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getMaxSizeToDigest() - Method in class org.archive.modules.extractor.HTTPContentDigest
-
- getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorPDF
-
- getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorUniversal
-
- getMaxSpeculativeHops() - Method in class org.archive.modules.deciderules.TransclusionDecideRule
-
- getMaxTotalBytesToWrite() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getMaxTransHops() - Method in class org.archive.modules.deciderules.TransclusionDecideRule
-
- getMaxWaitForIdleMs() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getMetadata() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getMetadata() - Method in class org.archive.modules.writer.ARCWriterProcessor
-
- getMetadata() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- getMetadata() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getMetadataProvider() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getModuleClass() - Method in class org.archive.state.ModuleTestBase
-
Returns the class of the module to test.
- getName() - Method in class org.archive.modules.net.CrawlServer
-
- getNamedUserAgents() - Method in class org.archive.modules.net.Robotstxt
-
- getNonFatalFailures() - Method in class org.archive.modules.CrawlURI
-
- getNotModifiedBytes() - Method in class org.archive.modules.fetcher.FetchStats
-
- getNotModifiedUrls() - Method in class org.archive.modules.fetcher.FetchStats
-
- getNovelBytes() - Method in class org.archive.modules.fetcher.FetchStats
-
- getNovelUrls() - Method in class org.archive.modules.fetcher.FetchStats
-
- getOnlyStoreIfWriteTagPresent() - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
-
- getOperator() - Method in class org.archive.modules.CrawlMetadata
-
- getOperatorContactUrl() - Method in class org.archive.modules.CrawlMetadata
-
- getOperatorFrom() - Method in class org.archive.modules.CrawlMetadata
-
- getOrdinal() - Method in class org.archive.modules.CrawlURI
-
Get the ordinal (serial number) assigned at creation.
- getOrganization() - Method in class org.archive.modules.CrawlMetadata
-
- getOtherDupBytes() - Method in class org.archive.modules.fetcher.FetchStats
-
- getOtherDupUrls() - Method in class org.archive.modules.fetcher.FetchStats
-
- getOutLinks() - Method in class org.archive.modules.CrawlURI
-
Returns discovered links.
- getOverlayMap(String) - Method in class org.archive.modules.CrawlURI
-
- getOverlayNames() - Method in class org.archive.modules.CrawlURI
-
- getPassword() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- getPassword() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getPath() - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- getPath() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getPathFromSeed() - Method in class org.archive.modules.CrawlURI
-
- getPathQuery(CrawlURI) - Method in class org.archive.modules.net.RobotsPolicy
-
- getPattern() - Method in enum org.archive.modules.deciderules.MatchesFilePatternDecideRule.Preset
-
- getPayloadDigest() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
-
- getPayloadDigest() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- getPersistentDataKeys() - Static method in class org.archive.modules.CrawlURI
-
- getPersistentDataMap() - Method in class org.archive.modules.CrawlURI
-
- getPolicyBasisUURI() - Method in class org.archive.modules.CrawlURI
-
Get the UURI that should be used as the basis of policy/overlay
decisions.
- getPolitenessDelay() - Method in class org.archive.modules.CrawlURI
-
- getPool() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getPoolMaxActive() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getPort() - Method in class org.archive.modules.net.CrawlServer
-
Get the port number for this server.
- getPrecedence() - Method in class org.archive.modules.CrawlURI
-
- getPrefix() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getPreloadSource() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
-
- getPreloadSourceUrl() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
-
- getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
-
Return the authentication URI, either absolute or relative, that serves
as prerequisite for the passed curi.
- getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
-
- getPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- getPrerequisiteUri() - Method in class org.archive.modules.CrawlURI
-
Get the prerequisite for this URI.
- getProcessors() - Method in class org.archive.modules.ProcessorChain
-
- getProcessStatus() - Method in class org.archive.modules.ProcessResult
-
- getProfileName() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
-
- getProfileName() - Method in interface org.archive.modules.revisit.RevisitProfile
-
- getProfileName() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- getProtocolVersion() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
-
Returns the HTTP protocol version to be used for this request.
- getRealm() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- getRecordedFinishes() - Method in class org.archive.modules.fetcher.FetchStats
-
- getRecordedSize() - Method in class org.archive.modules.CrawlURI
-
Get size of data recorded (transferred)
- getRecordedSize(CrawlURI) - Static method in class org.archive.modules.Processor
-
- getRecorder() - Method in class org.archive.modules.CrawlURI
-
Get the http recorder associated with this uri.
- getRecorder() - Method in class org.archive.state.ModuleTestBase
-
- getRecordID() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- getRecordIDGenerator() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- getRefersToDate() - Method in class org.archive.modules.revisit.AbstractProfile
-
- getRefersToRecordID() - Method in class org.archive.modules.revisit.AbstractProfile
-
- getRefersToTargetURI() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
-
- getRegex() - Method in class org.archive.modules.canonicalize.RegexRule
-
- getRegex() - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
-
Use a preset if configured to do so.
- getRegex() - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
-
- getRegex() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
- getRegexList() - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
- getRemaining() - Method in class org.archive.modules.fetcher.FetchStats
-
- getRemoveTriggerUris() - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
- getRequestLine() - Method in class org.archive.modules.fetcher.BasicExecutionAwareRequest
-
Returns the request line of this request.
- getRescheduleTime() - Method in class org.archive.modules.CrawlURI
-
- getResourceDir() - Method in class org.archive.state.ModuleTestBase
-
Returns the location of the Java resources directory for your project.
- getRevisitProfile() - Method in class org.archive.modules.CrawlURI
-
- getRobotsDenials() - Method in class org.archive.modules.fetcher.FetchStats
-
- getRobotsPolicy() - Method in class org.archive.modules.CrawlMetadata
-
Get the currently-effective RobotsPolicy, as specified by the
string name and chosen from the full available map.
- getRobotsPolicyName() - Method in class org.archive.modules.CrawlMetadata
-
- getRobotstxt() - Method in class org.archive.modules.net.CrawlServer
-
- getRules() - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
-
- getRules() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- getSchedulingDirective() - Method in class org.archive.modules.CrawlURI
-
- getSchemes() - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
- getScratchDisk() - Method in interface org.archive.modules.extractor.TempDirProvider
-
- getScratchDisk() - Method in class org.archive.modules.net.DefaultTempDirProvider
-
- getScriptSource() - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- getScriptSource() - Method in class org.archive.modules.ScriptedProcessor
-
- getSeedListeners() - Method in class org.archive.modules.seeds.SeedModule
-
- getSeeds() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- getSeedsAsSurtPrefixes() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- getSendConnectionClose() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getSendIfModifiedSince() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getSendIfNoneMatch() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getSendRange() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getSendReferer() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getSerialNo() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getServerCache() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- getServerCache() - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- getServerCache() - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
- getServerCache() - Method in class org.archive.modules.fetcher.FetchDNS
-
- getServerCache() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getServerCache() - Method in class org.archive.modules.fetcher.FetchWhois
-
- getServerCache() - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- getServerCache() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getServerFor(String) - Method in class org.archive.modules.fetcher.DefaultServerCache
-
- getServerFor(String) - Method in class org.archive.modules.net.ServerCache
-
- getServerFor(UURI) - Method in class org.archive.modules.net.ServerCache
-
- getServerKey(CrawlURI) - Static method in class org.archive.modules.fetcher.FetchHTTP
-
- getServerKey(UURI) - Static method in class org.archive.modules.net.CrawlServer
-
Get key to use doing lookup on server instances.
- getShouldFetchBodyRule() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getShouldMasquerade() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
-
- getShouldMasquerade() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
-
- getShouldProcessRule() - Method in class org.archive.modules.Processor
-
- getSkipIdenticalDigests() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getSocket() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
-
- getSocketInputStream(Socket) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
-
- getSocketOutputStream(Socket) - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
-
- getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getSoTimeoutMs() - Method in class org.archive.modules.fetcher.FetchWhois
-
- getSourceCodeDir() - Method in class org.archive.state.ModuleTestBase
-
Returns the location of the source code directory for your project.
- getSourceSeeds() - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
-
- getSourceTag() - Method in class org.archive.modules.CrawlURI
-
- getSourceTagSeeds() - Method in class org.archive.modules.seeds.SeedModule
-
- getSSLSession() - Method in class org.archive.modules.fetcher.FetchHTTPRequest.RecordingHttpClientConnection
-
- getSslTrustLevel() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getStartNewFilesOnCheckpoint() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getStats() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- getStatusCodes() - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
-
- getStorePaths() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule
-
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule
-
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule
-
- getString(CrawlURI) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
-
- getStripRegex() - Method in class org.archive.modules.extractor.HTTPContentDigest
-
- getSubstats() - Method in interface org.archive.modules.fetcher.FetchStats.HasFetchStats
-
- getSubstats() - Method in class org.archive.modules.net.CrawlHost
-
- getSubstats() - Method in class org.archive.modules.net.CrawlServer
-
- getSuccessBytes() - Method in class org.archive.modules.fetcher.FetchStats
-
- getSuffixAtEnd() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getSurtPrefixes() - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
-
- getSurtsDumpFile() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- getSurtsSource() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- getSurtsSourceFile() - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- getTemplate() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- getTemplate() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getTextSource() - Method in class org.archive.modules.seeds.TextSeedModule
-
- getThreadNumber() - Method in class org.archive.modules.CrawlURI
-
Get the number of the ToeThread responsible for processing this uri.
- getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getTimeoutSeconds() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getTooLongDirectory() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getTotalBytes() - Method in class org.archive.crawler.util.CrawledBytesHistotable
-
- getTotalBytes() - Method in class org.archive.modules.fetcher.FetchStats
-
- getTotalBytesWritten() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getTotalScheduled() - Method in class org.archive.modules.fetcher.FetchStats
-
- getTotalUrls() - Method in class org.archive.crawler.util.CrawledBytesHistotable
-
- getTransHops() - Method in class org.archive.modules.CrawlURI
-
Tally up the number of transitive (non-simple-link) hops at
the end of this CrawlURI's pathFromSeed.
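The tally described above can be illustrated with a minimal sketch. It assumes the path from seed is a string with one character per hop and that 'L' marks a simple navigational link; the class `TransHopsSketch`, the constant `NAVLINK`, and the method `countTransHops` are hypothetical names for illustration, not the Heritrix implementation.

```java
/** Hypothetical illustration; not part of Heritrix. */
final class TransHopsSketch {
    /** Assumed hop character for a simple navigational link. */
    static final char NAVLINK = 'L';

    /**
     * Counts the hops at the end of the path that come after the
     * last simple link, scanning backwards from the final hop.
     */
    static int countTransHops(String pathFromSeed) {
        int count = 0;
        for (int i = pathFromSeed.length() - 1; i >= 0; i--) {
            if (pathFromSeed.charAt(i) == NAVLINK) {
                break; // a simple link ends the trailing run
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two simple links followed by two other hops.
        System.out.println(countTransHops("LLEX"));
    }
}
```

Scanning from the end means only the trailing run is counted; non-link hops earlier in the path do not contribute.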
- getTreatFramesAsEmbedLinks() - Method in class org.archive.modules.extractor.ExtractorHTML
-
- getUnderscoreSet() - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- getUpperBound() - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Returns the upper bound on the range of acceptable status codes.
- getUpperBound() - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
-
Returns the upper bound on the range of acceptable status codes.
- getUpperBound() - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
-
- getURI() - Method in class org.archive.modules.CrawlURI
-
- getURICount() - Method in class org.archive.modules.Processor
-
Returns the number of URIs this processor has handled.
- getUriRegex() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- getURIs() - Method in class org.archive.modules.extractor.PDFParser
-
Get a list of URIs retrieved from the PDF during the
extractURIs operation.
- getURL(String, String) - Method in class org.archive.modules.extractor.ExtractorSWF.CrawlUriSWFAction
-
Overwrite handling of discovered URIs.
- getUseHeaderLength() - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
-
- getUseHTTP11() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getUsePreset() - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
-
- getUserAgent() - Method in class org.archive.modules.CrawlMetadata
-
- getUserAgent() - Method in class org.archive.modules.CrawlURI
-
Get the user agent to use for crawling this URI.
- getUserAgent() - Method in interface org.archive.modules.fetcher.UserAgentProvider
-
- getUserAgentProvider() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- getUserAgentTemplate() - Method in class org.archive.modules.CrawlMetadata
-
- getUsername() - Method in class org.archive.modules.fetcher.FetchFTP
-
- getUURI() - Method in class org.archive.modules.CrawlURI
-
- getValidator() - Method in class org.archive.modules.CrawlMetadata
-
- getValidTestData() - Method in class org.archive.modules.extractor.StringExtractorTestBase
-
Returns an array of valid test data pairs.
- getVia() - Method in class org.archive.modules.CrawlURI
-
- getViaContext() - Method in class org.archive.modules.CrawlURI
-
- getWarcHeaders() - Method in class org.archive.modules.revisit.AbstractProfile
-
- getWarcHeaders() - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
-
- getWarcHeaders() - Method in interface org.archive.modules.revisit.RevisitProfile
-
- getWarcHeaders() - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- getWhoisQuery(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
-
- getWhoisServer(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
-
- getWriteBufferSize() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- getWriteMetadata() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- getWriteRequests() - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- groovyTemplate() - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- groovyTemplates - Variable in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- GroupList(MatchResult) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.GroupList
-
- IdenticalDigestDecideRule - Class in org.archive.modules.deciderules.recrawl
-
Rule applies configured decision to any CrawlURIs whose revisit profile is set with a profile matching
WARCConstants.PROFILE_REVISIT_IDENTICAL_DIGEST
- IdenticalDigestDecideRule() - Constructor for class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule
-
Usual constructor.
- IdenticalPayloadDigestRevisit - Class in org.archive.modules.revisit
-
- IdenticalPayloadDigestRevisit(String) - Constructor for class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
-
Minimal constructor.
- IgnoreRobotsPolicy - Class in org.archive.modules.net
-
Policy to ignore robots.
- IgnoreRobotsPolicy() - Constructor for class org.archive.modules.net.IgnoreRobotsPolicy
-
- IMG_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
-
- IMG_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
-
- includesRetireDirective() - Method in class org.archive.modules.CrawlURI
-
- incrementConsecutiveConnectionErrors() - Method in class org.archive.modules.net.CrawlServer
-
- incrementDeferrals() - Method in class org.archive.modules.CrawlURI
-
Increment the deferral count.
- incrementDiscardedOutLinks() - Method in class org.archive.modules.CrawlURI
-
- incrementFetchAttempts() - Method in class org.archive.modules.CrawlURI
-
Increment the count of attempts (trips through the processing
loop) at getting the document referenced by this URI.
- indexOf(Object) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- INFERRED_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for inferred urls without other context.
- inferRootPage - Variable in class org.archive.modules.extractor.ExtractorHTTP
-
should all HTTP URIs be used to infer a link to the site's root?
- inheritFrom(CrawlURI) - Method in class org.archive.modules.CrawlURI
-
Inherit (copy) the relevant keys-values from the ancestor.
- initHttpClientBuilder() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
- initialize() - Method in class org.archive.modules.extractor.PDFParser
-
Initialize opens the document for reading.
- initializeFromReader(Reader) - Method in class org.archive.modules.net.Robotstxt
-
- initOutputStream(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
Get the OutputStream for the file to write to.
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.AcceptDecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.PrerequisiteAcceptDecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.RejectDecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- innerDecide(CrawlURI) - Method in class org.archive.modules.deciderules.SeedAcceptDecideRule
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Actually extracts links.
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorCSS
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorDOC
-
Processes a Word document and extracts any hyperlinks from it.
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorJS
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDF
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorUniversal
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorXML
-
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.TrapSuppressExtractor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.extractor.Extractor
-
Processes the given URI.
- innerProcess(CrawlURI) - Method in class org.archive.modules.extractor.HTTPContentDigest
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchDNS
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchFTP
-
Processes the given URI.
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.Processor
-
Actually performs the process.
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistStoreProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.ScriptedProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- innerProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
-
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.Processor
-
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.ARCWriterProcessor
-
Writes a CrawlURI and its associated data to store file.
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Writes a CrawlURI and its associated data to store file.
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- innerRejectProcess(CrawlURI) - Method in class org.archive.modules.Processor
-
Invoked after a URI has been rejected.
- innerRejectProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- INSTANCE - Static variable in class org.archive.modules.net.IgnoreRobotsPolicy
-
- INSTANCE - Static variable in class org.archive.modules.net.ObeyRobotsPolicy
-
- invert(DecideResult) - Static method in enum org.archive.modules.deciderules.DecideResult
-
- IP_ADDRESS - Static variable in class org.archive.modules.extractor.ExtractorUniversal
-
Matches any string that begins with http:// or https:// followed by
something that looks like an IP address (four numbers, none longer than
3 chars, separated by 3 dots).
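A minimal sketch of the kind of pattern this entry describes: a string that begins with http:// or https:// followed by four dot-separated groups of one to three digits. The regular expression and the class name IpUrlSketch are assumptions made for illustration; they are not the actual value of ExtractorUniversal.IP_ADDRESS.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the pattern Heritrix actually uses.
final class IpUrlSketch {
    // Scheme, then four groups of 1-3 digits separated by three dots,
    // then anything (the description only constrains how the string begins).
    static final Pattern IP_URL = Pattern.compile(
            "^https?://\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}.*",
            Pattern.DOTALL);

    static boolean looksLikeIpUrl(String s) {
        return IP_URL.matcher(s).matches();
    }
}
```

For instance, IpUrlSketch.looksLikeIpUrl("http://192.168.0.1/index.html") is true, while a host-name URL such as "http://example.org/" is not matched.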
- IP_ADDRESS_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
-
- IP_ADDRESS_REGEX - Static variable in class org.archive.modules.fetcher.FetchWhois
-
- IP_NEVER_EXPIRES - Static variable in class org.archive.modules.net.CrawlHost
-
Flag value indicating always-valid IP
- IP_NEVER_LOOKED_UP - Static variable in class org.archive.modules.net.CrawlHost
-
Flag value indicating an IP has not yet been looked up
- IpAddressSetDecideRule - Class in org.archive.modules.deciderules
-
IpAddressSetDecideRule must be used with
org.archive.crawler.prefetch.Preselector#setRecheckScope(boolean) set
to true because it relies on Heritrix's DNS lookup to establish the IP address
for a URI before it can run.
- IpAddressSetDecideRule() - Constructor for class org.archive.modules.deciderules.IpAddressSetDecideRule
-
- is2XXSuccess() - Method in class org.archive.modules.CrawlURI
-
- isCheckpointRecovery - Variable in class org.archive.modules.fetcher.BdbCookieStore
-
are we a checkpoint recovery? (in which case, reuse stored cookie data?)
- isCheckpointRecovery - Variable in class org.archive.modules.net.BdbServerCache
-
- isCookieCountMaxedForDomain(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- isDisableSNI() - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
- isEmpty() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- isEveryTime() - Method in class org.archive.modules.credential.Credential
-
- isEveryTime() - Method in class org.archive.modules.credential.HtmlFormCredential
-
- isEveryTime() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- isHtmlExpectedHere(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
-
Test whether this HTML is so unexpected (eg in place of a GIF URI)
that it shouldn't be scanned for links.
- isHttpTransaction() - Method in class org.archive.modules.CrawlURI
-
Returns true if this is an HTTP transaction.
- isLocation() - Method in class org.archive.modules.CrawlURI
-
- isMultipleFormSubmitInputs(String) - Method in class org.archive.modules.forms.HTMLForm
-
- isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.CustomRobotsPolicy
-
- isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
-
- isObeyMetaRobotsNofollow() - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
-
- isolateThreads - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
-
Whether each ToeThread should get its own independent script
engine, or they should share synchronized access to one
engine.
- isolateThreads - Variable in class org.archive.modules.ScriptedProcessor
-
Whether each ToeThread should get its own independent script
engine, or they should share synchronized access to one
engine.
- isPost() - Method in class org.archive.modules.credential.Credential
-
- isPost() - Method in class org.archive.modules.credential.HtmlFormCredential
-
- isPost() - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- isPrerequisite() - Method in class org.archive.modules.CrawlURI
-
Returns true if this CrawlURI is a prerequisite.
- isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.Credential
-
- isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HtmlFormCredential
-
- isPrerequisite(CrawlURI) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- isQuadAddress(CrawlURI, String, CrawlHost) - Method in class org.archive.modules.fetcher.FetchDNS
-
- isRevisit() - Method in class org.archive.modules.CrawlURI
-
Indicates if this CrawlURI object has been deemed a revisit.
- isRobotsExpired(int) - Method in class org.archive.modules.net.CrawlServer
-
Is the robots policy expired.
- isRunning - Variable in class org.archive.modules.deciderules.DecideRuleSequence
-
- isRunning() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- isRunning - Variable in class org.archive.modules.fetcher.AbstractCookieStore
-
- isRunning() - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- isRunning() - Method in class org.archive.modules.fetcher.FetchWhois
-
- isRunning - Variable in class org.archive.modules.net.BdbServerCache
-
- isRunning() - Method in class org.archive.modules.net.BdbServerCache
-
- isRunning - Variable in class org.archive.modules.Processor
-
- isRunning() - Method in class org.archive.modules.Processor
-
- isRunning - Variable in class org.archive.modules.ProcessorChain
-
- isRunning() - Method in class org.archive.modules.ProcessorChain
-
- isRunning() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- isRunning() - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- isRunning() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
-
- isSeed() - Method in class org.archive.modules.CrawlURI
-
- isSuccess() - Method in class org.archive.modules.CrawlURI
-
Ask this URI if it was a success or not.
- isSuccess(CrawlURI) - Static method in class org.archive.modules.Processor
-
- isValidRobots() - Method in class org.archive.modules.net.CrawlServer
-
If true then valid robots.txt information has been retrieved.
- iterator() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- iterator() - Method in class org.archive.modules.ProcessorChain
-
- main(String[]) - Static method in class org.archive.modules.extractor.PDFParser
-
- main(String[]) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
Utility main for importing a log into a BDB-JE environment or moving a
database between environments (2 arguments), or simply dumping a log
to stderr in a more readable format (1 argument).
- makeBindings(Map<String, ExtractorMultipleRegex.MatchList>, String[], int) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- makeCrawlURI(String) - Method in class org.archive.state.ModuleTestBase
-
- makeData(String, String) - Method in class org.archive.modules.extractor.StringExtractorTestBase
-
- makeDirty() - Method in class org.archive.modules.net.CrawlHost
-
- makeDirty() - Method in class org.archive.modules.net.CrawlServer
-
- makeExtractor() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
Subclasses should return an Extractor instance to test.
- makeHeritable(String) - Method in class org.archive.modules.CrawlURI
-
Make the given key 'heritable', meaning its value will be
added to descendant CrawlURIs.
- makeModule() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
- makeModule() - Method in class org.archive.state.ModuleTestBase
-
Return an example instance of the module.
- makeNonHeritable(String) - Method in class org.archive.modules.CrawlURI
-
Make the given key non-'heritable', meaning its value will
not be added to descendant CrawlURIs.
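The heritable-key idea behind makeHeritable and makeNonHeritable can be sketched roughly as follows. HeritableSketch is a hypothetical stand-in, not CrawlURI; in particular, passing the set of heritable keys on to the child is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a URI carrying keyed data; not CrawlURI itself.
final class HeritableSketch {
    final Map<String, Object> data = new HashMap<>();
    final Set<String> heritableKeys = new HashSet<>();

    void makeHeritable(String key) { heritableKeys.add(key); }

    void makeNonHeritable(String key) { heritableKeys.remove(key); }

    // Create a descendant that receives only the values of heritable keys.
    HeritableSketch createChild() {
        HeritableSketch child = new HeritableSketch();
        for (String key : heritableKeys) {
            if (data.containsKey(key)) {
                child.data.put(key, data.get(key));
                // Assumption: heritability itself is passed down as well.
                child.heritableKeys.add(key);
            }
        }
        return child;
    }
}
```

With this sketch, a value stored under a key marked heritable shows up in the child's data map, and stops being copied once makeNonHeritable is called for that key.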
- makeTempDir() - Static method in class org.archive.modules.net.DefaultTempDirProvider
-
- makeWhoisUrl(String, String) - Method in class org.archive.modules.fetcher.FetchWhois
-
- markAsSeen(int, int) - Method in class org.archive.modules.extractor.PDFParser
-
Note that an object (id/generation pair) has been seen by this parser
so that it can be handled differently when it is encountered again.
- markPrerequisite(String) - Method in class org.archive.modules.CrawlURI
-
Do all actions associated with setting a CrawlURI as
requiring a prerequisite.
- MatchesFilePatternDecideRule - Class in org.archive.modules.deciderules
-
Compares suffix of a passed CrawlURI, UURI, or String against a regular
expression pattern, applying its configured decision to all matches.
- MatchesFilePatternDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesFilePatternDecideRule
-
Usual constructor.
- MatchesFilePatternDecideRule.Preset - Enum in org.archive.modules.deciderules
-
- MatchesListRegexDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regexes.
- MatchesListRegexDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
Usual constructor.
- MatchesRegexDecideRule - Class in org.archive.modules.deciderules
-
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regex.
- MatchesRegexDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesRegexDecideRule
-
Usual constructor.
- MatchesStatusCodeDecideRule - Class in org.archive.modules.deciderules
-
Provides a rule that returns "true" for any CrawlURIs which have a fetch
status code that falls within the provided inclusive range.
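The inclusive range test described for MatchesStatusCodeDecideRule amounts to a two-sided comparison. The sketch below uses hypothetical names (StatusRangeSketch, lowerBound, upperBound) and is not the class's real configuration interface.

```java
// Hypothetical sketch of an inclusive status-code range check.
final class StatusRangeSketch {
    final int lowerBound;
    final int upperBound;

    StatusRangeSketch(int lowerBound, int upperBound) {
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    // True when the status lies in [lowerBound, upperBound], both ends included.
    boolean matches(int fetchStatus) {
        return fetchStatus >= lowerBound && fetchStatus <= upperBound;
    }
}
```

A range of 200 to 299 therefore matches both 200 and 299 but not 300.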
- MatchesStatusCodeDecideRule() - Constructor for class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Creates a new MatchesStatusCodeDecideRule instance.
- MatchList(String, CharSequence) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.MatchList
-
- MatchList(ExtractorMultipleRegex.GroupList...) - Constructor for class org.archive.modules.extractor.ExtractorMultipleRegex.MatchList
-
- MAX_COOKIES_FOR_DOMAIN - Static variable in class org.archive.modules.fetcher.AbstractCookieStore
-
- MAX_SIZE - Static variable in class org.archive.modules.net.Robotstxt
-
- maxFileSizeBytes - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
Max size for each file.
- maxFileSizeBytes - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Max size of each file.
- maxPathLength - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Maximum file system path length.
- maxSegLength - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
Maximum file system path segment length.
- maxTotalBytesToWrite - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Total file bytes to write to disk.
- maxWaitForIdleMs - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Maximum time to wait on idle writer before (possibly) creating an
additional instance.
- maybeAddConditionalGetHeader(boolean, String, String) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
Add the given conditional-GET header, if the setting is enabled and
a suitable value is available in the URI history.
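A rough sketch of the conditional-GET logic described for maybeAddConditionalGetHeader: the header is added only when the setting is enabled and the history holds a usable value. The method shape, the map-based history, and the name ConditionalGetSketch are assumptions for illustration and do not reflect the real FetchHTTPRequest signature. The standard HTTP pairings are ETag with If-None-Match and Last-Modified with If-Modified-Since.

```java
import java.util.Map;

// Hypothetical sketch; the real method takes (boolean, String, String).
final class ConditionalGetSketch {
    // Adds requestHeader (for example "If-None-Match") when enabled and
    // the history holds a non-empty value under historyKey (for example "ETag").
    static boolean maybeAdd(boolean enabled,
                            Map<String, String> history,
                            String historyKey,
                            String requestHeader,
                            Map<String, String> requestHeaders) {
        if (!enabled) {
            return false;
        }
        String value = history.get(historyKey);
        if (value == null || value.isEmpty()) {
            return false;
        }
        requestHeaders.put(requestHeader, value);
        return true;
    }
}
```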
- maybeMidfetchAbort(CrawlURI, AbstractExecutionAwareRequest) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- MEDIUM - Static variable in class org.archive.modules.SchedulingConstants
-
Medium priority.
- META - Static variable in class org.archive.modules.extractor.HTMLLinkContext
-
- META_HREF - Static variable in class org.archive.modules.extractor.HTMLLinkContext
-
- metadata - Variable in class org.archive.modules.extractor.ExtractorHTML
-
CrawlMetadata provides the robots honoring policy to use when
considering a robots META tag.
- method - Variable in class org.archive.modules.forms.HTMLForm
-
- MIN_ROBOTS_RETRIES - Static variable in class org.archive.modules.net.CrawlServer
-
only check if robots-fetch is perhaps superfluous
after this many tries
- MirrorWriterProcessor - Class in org.archive.modules.writer
-
Processor module that writes the results of successful fetches to
files on disk.
- MirrorWriterProcessor() - Constructor for class org.archive.modules.writer.MirrorWriterProcessor
-
- ModuleTestBase - Class in org.archive.state
-
Base class for unit testing Module implementations.
- ModuleTestBase() - Constructor for class org.archive.state.ModuleTestBase
-
Magical constructor that attempts to auto-create static key field
descriptions for your module class.
- MostFavoredRobotsPolicy - Class in org.archive.modules.net
-
Follow a most-favored robots policy -- allowing a URL if either the
conventionally-configured User-Agent, or any of a number of alternate
User-Agents (from the candidateUserAgents list) would be allowed.
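The "most favored" rule can be sketched as a logical OR over the configured User-Agent and each candidate. In the sketch below, robotsAllows is a hypothetical stand-in for whatever robots.txt check is available, and MostFavoredSketch is not the Heritrix class.

```java
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch of a most-favored robots decision.
final class MostFavoredSketch {
    // robotsAllows: (userAgent, url) -> true if robots rules permit the fetch.
    static boolean allows(String configuredAgent,
                          List<String> candidateUserAgents,
                          String url,
                          BiPredicate<String, String> robotsAllows) {
        if (robotsAllows.test(configuredAgent, url)) {
            return true;
        }
        for (String agent : candidateUserAgents) {
            if (robotsAllows.test(agent, url)) {
                return true; // one permitted agent is enough
            }
        }
        return false;
    }
}
```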
- MostFavoredRobotsPolicy() - Constructor for class org.archive.modules.net.MostFavoredRobotsPolicy
-
- S_BLOCKED_BY_CUSTOM_PROCESSOR - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Blocked by custom prefetcher processor.
- S_BLOCKED_BY_QUOTA - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Blocked due to exceeding an established quota.
- S_BLOCKED_BY_RUNTIME_LIMIT - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Blocked due to exceeding an established runtime.
- S_BLOCKED_BY_USER - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
blocked from fetch by user setting.
- S_CONNECT_FAILED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP connect failed
- S_CONNECT_LOST - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP connect broken
- S_DEEMED_CHAFF - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
'chaff' detection of traps/content of negligible value applied
- S_DEEMED_NOT_FOUND - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
synthetic status, used when some other status (such as connection-lost)
is considered by policy the same as a document-not-found
- S_DEFERRED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
temporary status assigned URIs awaiting preconditions; appearance in
logs is a bug
- S_DELETED_BY_USER - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
deleted from frontier by user
- S_DNS_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
DNS success
- S_DOMAIN_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
DNS prerequisite failed, precluding attempt
- S_DOMAIN_UNRESOLVABLE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
DNS lookup failed
- S_GETBYNAME_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
InetAddress.getByName success
- S_NOT_FOUND - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP 404 NOT FOUND
- S_OTHER_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Some other (non-DNS, non-robots) prerequisite failed, precluding attempt
- S_OUT_OF_SCOPE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
out-of-scope upon reexamination (only when scope changes during
crawl)
- S_PREREQUISITE_UNSCHEDULABLE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
A prerequisite could not be scheduled, precluding attempt
- S_PROCESSING_THREAD_KILLED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Processing thread was killed
- S_ROBOTS_PRECLUDED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
robots rules precluded fetch
- S_ROBOTS_PREREQUISITE_FAILURE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Robots prerequisite failed, precluding attempt
- S_RUNTIME_EXCEPTION - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Unexpected runtime exception; see runtime-errors.log
- S_SERIOUS_ERROR - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
severe java 'Error' conditions (OutOfMemoryError, StackOverflowError,
etc.) during URI processing
- S_TIMEOUT - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
HTTP timeout (before any meaningful response received)
- S_TOO_MANY_EMBED_HOPS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
overstepped embed/trans hops
- S_TOO_MANY_LINK_HOPS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
overstepped link hops
- S_TOO_MANY_RETRIES - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
multiple retries all failed
- S_UNATTEMPTED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
fetch never tried (perhaps protocol unsupported or illegal URI)
- S_UNFETCHABLE_URI - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
URI recognized as unsupported or illegal
- S_UNQUEUEABLE - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
URI could not be queued in Frontier; when URIs are properly
filtered for format, should never occur
- S_WHOIS_GENERIC_FINISHED - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
Finished all fetches for serverless WHOIS url (whois:foo.org)
- S_WHOIS_SUCCESS - Static variable in interface org.archive.modules.fetcher.FetchStatusCodes
-
WHOIS success
- saveCookies() - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- saveCookies(String) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- saveHeader(CrawlURI, Map<String, Object>, String) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
Save a header from the given HTTP operation into the Map.
- saveHeader(CrawlURI, ANVLRecord, String, String) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Saves a header from the given HTTP operation into the
provided headers under a new name.
- SchedulingConstants - Class in org.archive.modules
-
- SchemeNotInSetDecideRule - Class in org.archive.modules.deciderules
-
Rule applies the configured decision (default REJECT) for any URI which
has a URI-scheme NOT contained in the configured Set.
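A minimal sketch of the scheme test described for SchemeNotInSetDecideRule. The local Decision enum and the choice to return NONE when the scheme is in the set are assumptions of this sketch, not a statement about the Heritrix implementation.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the Heritrix rule class.
final class SchemeNotInSetSketch {
    // Local stand-in for a decision value.
    enum Decision { ACCEPT, REJECT, NONE }

    // Applies the configured decision when the URI's scheme is NOT in the set.
    static Decision decide(String uri, Set<String> schemes, Decision configured) {
        String scheme = URI.create(uri).getScheme();
        boolean inSet = scheme != null
                && schemes.contains(scheme.toLowerCase(Locale.ROOT));
        return inSet ? Decision.NONE : configured;
    }
}
```

With a set of {"http", "https"} and a configured decision of REJECT, a mailto: URI yields REJECT while an http: URI yields NONE.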
- SchemeNotInSetDecideRule() - Constructor for class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
Usual constructor.
- schemes - Variable in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
set of schemes to test URI scheme
- SCRIPT_SRC - Static variable in class org.archive.modules.extractor.HTMLLinkContext
-
- ScriptedDecideRule - Class in org.archive.modules.deciderules
-
Rule which runs a JSR-223 script to make its decision.
- ScriptedDecideRule() - Constructor for class org.archive.modules.deciderules.ScriptedDecideRule
-
- ScriptedProcessor - Class in org.archive.modules
-
A processor which runs a JSR-223 script on the CrawlURI.
- ScriptedProcessor() - Constructor for class org.archive.modules.ScriptedProcessor
-
Constructor.
- scriptSource - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
-
- scriptSource - Variable in class org.archive.modules.ScriptedProcessor
-
- SeedAcceptDecideRule - Class in org.archive.modules.deciderules
-
Rule which ACCEPTs all 'seed' URIs (those for which
isSeed is true).
- SeedAcceptDecideRule() - Constructor for class org.archive.modules.deciderules.SeedAcceptDecideRule
-
- seedLine(String) - Method in class org.archive.modules.seeds.TextSeedModule
-
Handle a read line that is probably a seed.
- SeedListener - Interface in org.archive.modules.seeds
-
Implemented by components which want notifications of
seed list changes.
- seedListeners - Variable in class org.archive.modules.seeds.SeedModule
-
- SeedModule - Class in org.archive.modules.seeds
-
- SeedModule() - Constructor for class org.archive.modules.seeds.SeedModule
-
- seeds - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- seedsAsSurtPrefixes - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Should seeds also be interpreted as SURT prefixes.
- seemsLoginForm() - Method in class org.archive.modules.forms.HTMLForm
-
For now, we consider a POST form with only 1 password
field and 1 potential username field (type text or email)
to be a likely login form.
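The heuristic stated for seemsLoginForm can be sketched as a count over input types. Representing a form as a method string plus a list of input type attributes is an assumption made here for illustration; HTMLForm's real data model may differ.

```java
import java.util.List;

// Hypothetical sketch of the login-form heuristic.
final class LoginFormSketch {
    // inputTypes holds each input's type attribute, e.g. "text", "email", "password".
    static boolean seemsLoginForm(String method, List<String> inputTypes) {
        if (!"post".equalsIgnoreCase(method)) {
            return false;
        }
        int passwords = 0;
        int usernames = 0;
        for (String type : inputTypes) {
            if ("password".equalsIgnoreCase(type)) {
                passwords++;
            } else if ("text".equalsIgnoreCase(type) || "email".equalsIgnoreCase(type)) {
                usernames++; // potential username field
            }
        }
        return passwords == 1 && usernames == 1;
    }
}
```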
- serverCache - Variable in class org.archive.modules.deciderules.DecideRuleSequence
-
- serverCache - Variable in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- serverCache - Variable in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
- serverCache - Variable in class org.archive.modules.fetcher.FetchDNS
-
Used to do DNS lookups.
- serverCache - Variable in class org.archive.modules.fetcher.FetchHTTP
-
- serverCache - Variable in class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
-
- serverCache - Variable in class org.archive.modules.fetcher.FetchWhois
-
- ServerCache - Class in org.archive.modules.net
-
Abstract class for crawl-global registry of CrawlServer (host:port) and
CrawlHost (hostname) objects.
- ServerCache() - Constructor for class org.archive.modules.net.ServerCache
-
- serverCache - Variable in class org.archive.modules.writer.Kw3WriterProcessor
-
The server cache to use.
- serverCache - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
- ServerCacheResolver(ServerCache) - Constructor for class org.archive.modules.fetcher.FetchHTTPRequest.ServerCacheResolver
-
- serverInetAddr - Variable in class org.archive.modules.fetcher.FetchDNS
-
- ServerNotModifiedRevisit - Class in org.archive.modules.revisit
-
- ServerNotModifiedRevisit() - Constructor for class org.archive.modules.revisit.ServerNotModifiedRevisit
-
Minimal constructor.
- servers - Variable in class org.archive.modules.fetcher.DefaultServerCache
-
hostname[:port] -> CrawlServer.
- set(int, T) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- setAcceptCompression(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Set headers to accept compressed responses.
- setAcceptHeaders(List<String>) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Accept Headers to include in each request.
- setAcceptNonDnsResolves(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
-
- setAction(String) - Method in class org.archive.modules.forms.HTMLForm
-
- setAlsoCheckVia(boolean) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- setApplicableSurtPrefix(String) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- setApplicationContext(ApplicationContext) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- setApplicationContext(ApplicationContext) - Method in class org.archive.modules.ScriptedProcessor
-
- setAudience(String) - Method in class org.archive.modules.CrawlMetadata
-
- setAvailableRobotsPolicies(Map<String, RobotsPolicy>) - Method in class org.archive.modules.CrawlMetadata
-
- setBaseURI(String) - Method in class org.archive.modules.CrawlURI
-
Set the (HTML) Base URI used for derelativizing internal URIs.
- setBaseURI(UURI) - Method in class org.archive.modules.CrawlURI
-
- setBdbModule(BdbModule) - Method in class org.archive.modules.fetcher.BdbCookieStore
-
- setBdbModule(BdbModule) - Method in class org.archive.modules.fetcher.FetchWhois
-
- setBdbModule(BdbModule) - Method in class org.archive.modules.net.BdbServerCache
-
- setBdbModule(BdbModule) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- setBdbModule(BdbModule) - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
-
- setBeanName(String) - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- setBeanName(String) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- setBeanName(String) - Method in class org.archive.modules.Processor
-
- setBlockAwaitingSeedLines(int) - Method in class org.archive.modules.seeds.TextSeedModule
-
- setCandidateUserAgents(List<String>) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
-
- setCandidateUserAgents(List<String>) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
-
- setCanonicalString(String) - Method in class org.archive.modules.CrawlURI
-
- setCaseSensitiveFilesystem(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setCharacterEncoding(CrawlURI, Recorder, HttpResponse) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Set the character encoding based on the result headers or default.
- setCharacterMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setChmod(boolean) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- setChmodValue(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- setClassKey(String) - Method in class org.archive.modules.CrawlURI
-
- setCollection(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- setComment(String) - Method in class org.archive.modules.deciderules.DecideRule
-
- setCompress(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setConnectTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- setContentDigest(byte[]) - Method in class org.archive.modules.CrawlURI
-
- setContentDigest(String, byte[]) - Method in class org.archive.modules.CrawlURI
-
- setContentDigestHistory(AbstractContentDigestHistory) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
-
- setContentDigestHistory(AbstractContentDigestHistory) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
-
- setContentLengthThreshold(long) - Method in class org.archive.modules.deciderules.ContentLengthDecideRule
-
- setContentLengthThreshold(long) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
-
- setContentRegexes(Map<String, String>) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
A map of { name => regex }.
- setContentSize(long) - Method in class org.archive.modules.CrawlURI
-
Sets the 'content size' for the URI, which is considered inclusive of all
recorded material (such as protocol headers) or even material
'virtually' considered (as in material from a previous fetch
confirmed unchanged with a server).
- setContentType(String) - Method in class org.archive.modules.CrawlURI
-
Set a fetched uri's content type.
- setContentTypeMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setCookiesLoadFile(ConfigFile) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- setCookiesSaveFile(ConfigPath) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- setCookieStore(AbstractCookieStore) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- setCountryCode(String) - Method in class org.archive.modules.net.CrawlHost
-
Set country code for this host.
- setCountryCodes(List<String>) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- setCrawlDelay(float) - Method in class org.archive.modules.net.RobotsDirectives
-
- setCreateHostDirectory(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setCreatePortDirectory(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setCredentials(Map<String, Credential>) - Method in class org.archive.modules.credential.CredentialStore
-
- setCredentialStore(CredentialStore) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Used to store credentials.
- setCustomRobots(ReadSource) - Method in class org.archive.modules.net.CustomRobotsPolicy
-
- setDecision(DecideResult) - Method in class org.archive.modules.deciderules.PredicatedDecideRule
-
- setDefaultEncoding(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
The character encoding to use for files that do not have one specified in
the HTTP response headers.
- setDescription(String) - Method in class org.archive.modules.CrawlMetadata
-
- setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchDNS
-
- setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setDigestAlgorithm(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Which algorithm (for example MD5 or SHA-1) to use to perform an
on-the-fly digest hash of retrieved content-bodies.
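An on-the-fly digest of a content body can be computed with the standard java.security classes, as in the sketch below. OnTheFlyDigestSketch and digestWhileReading are hypothetical names; how FetchHTTP actually wires the digest into its recording streams is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of hashing a body while it is being read.
final class OnTheFlyDigestSketch {
    // Reads the stream to its end, updating the digest as bytes pass through.
    static byte[] digestWhileReading(InputStream body, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm); // e.g. "MD5" or "SHA-1"
        try (DigestInputStream in = new DigestInputStream(body, md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // A real fetcher would record or parse the bytes here.
            }
        }
        return md.digest();
    }
}
```

Because the digest is updated as a side effect of reading, the body never needs to be buffered in full just to hash it.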
- setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
-
- setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setDigestContent(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Whether or not to perform an on-the-fly digest hash of retrieved
content-bodies.
- setDirectory(ConfigPath) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setDirectoryFile(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setDisableJavaDnsResolves(boolean) - Method in class org.archive.modules.fetcher.FetchDNS
-
- setDisableSNI(boolean) - Method in class org.archive.modules.fetcher.FetchHTTPRequest
-
- setDNSServerIPLabel(String) - Method in class org.archive.modules.CrawlURI
-
- setDomain(String) - Method in class org.archive.modules.credential.Credential
-
- setDotBegin(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setDotEnd(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setEarliestNextURIEmitTime(long) - Method in class org.archive.modules.net.CrawlHost
-
Set the earliest time a URI for this host could be emitted.
- setEnabled(boolean) - Method in class org.archive.modules.canonicalize.BaseRule
-
- setEnabled(boolean) - Method in class org.archive.modules.deciderules.DecideRule
-
- setEnabled(boolean) - Method in class org.archive.modules.Processor
-
- setEnctype(String) - Method in class org.archive.modules.forms.HTMLForm
-
- setEngineName(String) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- setEngineName(String) - Method in class org.archive.modules.ScriptedProcessor
-
- setEntity(HttpEntity) - Method in class org.archive.modules.fetcher.BasicExecutionAwareEntityEnclosingRequest
-
- setError(String) - Method in class org.archive.modules.CrawlURI
-
- setETag(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- setExtractAllForms(boolean) - Method in class org.archive.modules.forms.ExtractorHTMLForms
-
- setExtractFromDirs(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setExtractJavascript(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setExtractOnlyFormGets(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setExtractorJS(ExtractorJS) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setExtractorJS(ExtractorJS) - Method in class org.archive.modules.extractor.ExtractorSWF
-
- setExtractorParameters(ExtractorParameters) - Method in class org.archive.modules.extractor.Extractor
-
- setExtractParent(boolean) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setExtractValueAttributes(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setFetchBeginTime(long) - Method in class org.archive.modules.CrawlURI
-
- setFetchCompletedTime(long) - Method in class org.archive.modules.CrawlURI
-
- setFetchStatus(int) - Method in class org.archive.modules.CrawlURI
-
Set the overall/fetch status of this CrawlURI for
its current trip through the processing loop.
- setFetchType(CrawlURI.FetchType) - Method in class org.archive.modules.CrawlURI
-
- setForceFetch(boolean) - Method in class org.archive.modules.CrawlURI
-
Method to signal that this URI should be fetched even though
it already has been crawled.
- setForceRetire(boolean) - Method in class org.archive.modules.CrawlURI
-
- setFormat(String) - Method in class org.archive.modules.canonicalize.RegexRule
-
- setFormat(String) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
- setFormItems(Map<String, String>) - Method in class org.archive.modules.credential.HtmlFormCredential
-
- setFrequentFlushes(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setFullVia(CrawlURI) - Method in class org.archive.modules.CrawlURI
-
- setHarvester(String) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- setHistoryDbName(String) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- setHistoryDbName(String) - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
-
- setHistoryLength(int) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
- setHolder(Object) - Method in class org.archive.modules.CrawlURI
-
Remember a 'holder' to which some enclosing/queueing
facility has assigned this CrawlURI.
- setHolderCost(int) - Method in class org.archive.modules.CrawlURI
-
Remember a 'holderCost' which some enclosing/queueing
facility has assigned this CrawlURI
- setHolderKey(Object) - Method in class org.archive.modules.CrawlURI
-
Remember a 'holderKey' which some enclosing/queueing
facility has assigned this CrawlURI.
- setHostMap(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setHttpAuthChallenges(Map<String, String>) - Method in class org.archive.modules.CrawlURI
-
- setHttpAuthChallenges(Map<String, String>) - Method in class org.archive.modules.net.CrawlServer
-
- setHttpBindAddress(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Local IP address or hostname to use when making connections (binding
sockets).
- setHttpMethod(HtmlFormCredential.Method) - Method in class org.archive.modules.credential.HtmlFormCredential
-
- setHttpProxyHost(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy host IP (set only if needed).
- setHttpProxyPassword(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy password (set only if needed).
- setHttpProxyPort(Integer) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy port (set only if needed).
- setHttpProxyUser(String) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Proxy user (set only if needed).
- setIdentityCache(ObjectIdentityCache<?>) - Method in class org.archive.modules.net.CrawlHost
-
- setIdentityCache(ObjectIdentityCache<?>) - Method in class org.archive.modules.net.CrawlServer
-
- setIgnoreCookies(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Disable cookie handling.
- setIgnoreFormActionUrls(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setIgnoreUnexpectedHtml(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setInferRootPage(boolean) - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- setIP(InetAddress, long) - Method in class org.archive.modules.net.CrawlHost
-
Set the IP address for this host.
- setIpAddresses(Set<String>) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
- setIsolateThreads(boolean) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- setIsolateThreads(boolean) - Method in class org.archive.modules.ScriptedProcessor
-
- setJobName(String) - Method in class org.archive.modules.CrawlMetadata
-
- setLastModified(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- setListLogicalOr(boolean) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
- setLogExtraInfo(boolean) - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- setLogFile(ConfigPath) - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- setLoggerModule(SimpleFileLoggerProvider) - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- setLoggerModule(UriErrorLoggerModule) - Method in class org.archive.modules.extractor.Extractor
-
- setLoggerModule(UriErrorLoggerModule) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- setLogin(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- setLoginPassword(String) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- setLoginUri(String) - Method in class org.archive.modules.credential.HtmlFormCredential
-
- setLoginUsername(String) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- setLogToFile(boolean) - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- setLookup(ExternalGeoLookupInterface) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- setLowerBound(Integer) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Sets the lower bound on the range of acceptable status codes.
- setLowerBound(long) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
-
The rule will apply if the url has been fetched and content body length
is greater than or equal to this number of bytes.
- setMaxAttributeNameLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setMaxAttributeValLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setMaxElementLength(int) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setMaxFetchKBSec(int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
The maximum KB/sec to use when fetching data from a server.
- setMaxFileSizeBytes(long) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- setMaxFileSizeBytes(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setMaxHops(int) - Method in class org.archive.modules.deciderules.TooManyHopsDecideRule
-
- setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setMaxLengthBytes(long) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Maximum length in bytes to fetch.
- setMaxPathDepth(int) - Method in class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule
-
- setMaxPathLength(int) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setMaxRepetitions(int) - Method in class org.archive.modules.deciderules.PathologicalPathDecideRule
-
- setMaxSegLength(int) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setMaxSizeToDigest(long) - Method in class org.archive.modules.extractor.HTTPContentDigest
-
- setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorPDF
-
- setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorUniversal
-
- setMaxSpeculativeHops(int) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
-
- setMaxTotalBytesToWrite(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setMaxTransHops(int) - Method in class org.archive.modules.deciderules.TransclusionDecideRule
-
- setMaxWaitForIdleMs(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setMetadata(CrawlMetadata) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setMetadataProvider(CrawlMetadata) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setMethod(String) - Method in class org.archive.modules.forms.HTMLForm
-
- setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.CustomRobotsPolicy
-
- setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
-
- setObeyMetaRobotsNofollow(boolean) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
-
- setOnlyStoreIfWriteTagPresent(boolean) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
-
- setOperator(String) - Method in class org.archive.modules.CrawlMetadata
-
- setOperatorContactUrl(String) - Method in class org.archive.modules.CrawlMetadata
-
- setOperatorFrom(String) - Method in class org.archive.modules.CrawlMetadata
-
- setOrdinal(long) - Method in class org.archive.modules.CrawlURI
-
- setOrganization(String) - Method in class org.archive.modules.CrawlMetadata
-
- setOtherCodings(CrawlURI, Recorder, HttpResponse) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Set the transfer and content encodings based on headers (if necessary).
- setOverlayMapsSource(OverlayMapsSource) - Method in class org.archive.modules.CrawlURI
-
- setPassword(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- setPassword(String) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setPath(ConfigPath) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- setPath(ConfigPath) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setPayloadDigest(String) - Method in class org.archive.modules.revisit.ServerNotModifiedRevisit
-
- setPolitenessDelay(long) - Method in class org.archive.modules.CrawlURI
-
- setPool(WriterPool) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setPoolMaxActive(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setPrecedence(int) - Method in class org.archive.modules.CrawlURI
-
- setPrefix(String) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setPreloadSource(ConfigPath) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
-
- setPreloadSourceUrl(String) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
-
- setPrerequisite(boolean) - Method in class org.archive.modules.CrawlURI
-
Set if this CrawlURI is itself a prerequisite URI.
- setPrerequisiteUri(CrawlURI) - Method in class org.archive.modules.CrawlURI
-
Set a prerequisite for this URI.
- setProcessors(List<Processor>) - Method in class org.archive.modules.ProcessorChain
-
- setRealm(String) - Method in class org.archive.modules.credential.HttpAuthenticationCredential
-
- setRecorder(Recorder) - Method in class org.archive.modules.CrawlURI
-
Set the http recorder to be associated with this uri.
- setRecordIDGenerator(RecordIDGenerator) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
-
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
-
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
-
- setRecoveryCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
-
- setRefersToDate(String) - Method in class org.archive.modules.revisit.AbstractProfile
-
Set the refers to date
- setRefersToDate(long) - Method in class org.archive.modules.revisit.AbstractProfile
-
Set the refers to date
- setRefersToRecordID(String) - Method in class org.archive.modules.revisit.AbstractProfile
-
- setRefersToTargetURI(String) - Method in class org.archive.modules.revisit.IdenticalPayloadDigestRevisit
-
- setRegex(Pattern) - Method in class org.archive.modules.canonicalize.RegexRule
-
- setRegex(Pattern) - Method in class org.archive.modules.deciderules.MatchesRegexDecideRule
-
- setRegex(Pattern) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
- setRegexList(List<Pattern>) - Method in class org.archive.modules.deciderules.MatchesListRegexDecideRule
-
- setRemoveTriggerUris(boolean) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
- setRescheduleTime(long) - Method in class org.archive.modules.CrawlURI
-
- setRevisitProfile(RevisitProfile) - Method in class org.archive.modules.CrawlURI
-
- setRobotsPolicyName(String) - Method in class org.archive.modules.CrawlMetadata
-
- setRules(List<CanonicalizationRule>) - Method in class org.archive.modules.canonicalize.RulesCanonicalizationPolicy
-
- setRules(List<DecideRule>) - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- setSchedulingDirective(int) - Method in class org.archive.modules.CrawlURI
-
- setSchemes(Set<String>) - Method in class org.archive.modules.deciderules.SchemeNotInSetDecideRule
-
- setScriptSource(ReadSource) - Method in class org.archive.modules.deciderules.ScriptedDecideRule
-
- setScriptSource(ReadSource) - Method in class org.archive.modules.ScriptedProcessor
-
- setSeed(boolean) - Method in class org.archive.modules.CrawlURI
-
Set the isSeed attribute of this URI.
- setSeedListeners(Set<SeedListener>) - Method in class org.archive.modules.seeds.SeedModule
-
- setSeeds(SeedModule) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- setSeedsAsSurtPrefixes(boolean) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- setSendConnectionClose(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'Connection: close' header with every request.
- setSendIfModifiedSince(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'If-Modified-Since' header, if previous 'Last-Modified' fetch
history information is available in URI history.
- setSendIfNoneMatch(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'If-None-Match' header, if previous 'Etag' fetch history information
is available in URI history.
- setSendRange(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- setSendReferer(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Send 'Referer' header with every request.
- setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.ExternalGeoLocationDecideRule
-
- setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.IpAddressSetDecideRule
-
- setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchDNS
-
- setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Used to do DNS lookups.
- setServerCache(ServerCache) - Method in class org.archive.modules.fetcher.FetchWhois
-
- setServerCache(ServerCache) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- setServerCache(ServerCache) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setShouldFetchBodyRule(DecideRule) - Method in class org.archive.modules.fetcher.FetchHTTP
-
DecideRules applied after receipt of HTTP response headers but before we
start to download the body.
- setShouldMasquerade(boolean) - Method in class org.archive.modules.net.FirstNamedRobotsPolicy
-
- setShouldMasquerade(boolean) - Method in class org.archive.modules.net.MostFavoredRobotsPolicy
-
- setShouldProcessRule(DecideRule) - Method in class org.archive.modules.Processor
-
- setSizes(CrawlURI, Recorder) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Update CrawlURI internal sizes based on current transaction (and
in the case of 304s, history)
- setSkipIdenticalDigests(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
If the socket is unresponsive for this number of milliseconds, give up.
- setSoTimeoutMs(int) - Method in class org.archive.modules.fetcher.FetchWhois
-
- setSourceSeeds(Set<String>) - Method in class org.archive.modules.deciderules.SourceSeedDecideRule
-
- setSourceTag(String) - Method in class org.archive.modules.CrawlURI
-
- setSourceTagSeeds(boolean) - Method in class org.archive.modules.seeds.SeedModule
-
- setSpecialQueryTemplates(Map<String, String>) - Method in class org.archive.modules.fetcher.FetchWhois
-
- setSslTrustLevel(ConfigurableX509TrustManager.TrustLevel) - Method in class org.archive.modules.fetcher.FetchHTTP
-
SSL certificate trust level.
- setStartNewFilesOnCheckpoint(boolean) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Whether to close output files and start new ones on checkpoint.
- setStatusCodes(List<Integer>) - Method in class org.archive.modules.deciderules.FetchStatusDecideRule
-
- setStorePaths(List<ConfigPath>) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setStripRegex(String) - Method in class org.archive.modules.extractor.HTTPContentDigest
-
- setSuffixAtEnd(boolean) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setSurtPrefixes(List<String>) - Method in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
-
- setSurtsDumpFile(ConfigFile) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- setSurtsSource(ReadSource) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- setSurtsSourceFile(ConfigFile) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Deprecated.
- setTemplate(String) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
URI-building template.
- setTemplate(String) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setTextSource(ReadSource) - Method in class org.archive.modules.seeds.TextSeedModule
-
- setThreadNumber(int) - Method in class org.archive.modules.CrawlURI
-
Set the number of the ToeThread responsible for processing this uri.
- setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setTimeoutSeconds(int) - Method in class org.archive.modules.fetcher.FetchHTTP
-
If the fetch is not completed in this number of seconds, give up (and
retry later).
- setTooLongDirectory(String) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setTotalBytesWritten(long) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setTreatFramesAsEmbedLinks(boolean) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- setUnderscoreSet(List<String>) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- setUnresolvable(CrawlURI, CrawlHost) - Method in class org.archive.modules.fetcher.FetchDNS
-
- setUp() - Method in class org.archive.modules.extractor.ContentExtractorTestBase
-
- setupCopyEnvironment(File) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
- setupCopyEnvironment(File, boolean) - Static method in class org.archive.modules.recrawl.PersistProcessor
-
- setUpperBound(Integer) - Method in class org.archive.modules.deciderules.MatchesStatusCodeDecideRule
-
Sets the upper bound on the range of acceptable status codes.
- setUpperBound(Integer) - Method in class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule
-
Sets the upper bound on the range of acceptable status codes.
- setUpperBound(long) - Method in class org.archive.modules.deciderules.ResponseContentLengthDecideRule
-
The rule will apply if the url has been fetched and content body length
is less than or equal to this number of bytes.
- setupPool(AtomicInteger) - Method in class org.archive.modules.writer.ARCWriterProcessor
-
- setupPool(AtomicInteger) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- setupPool(AtomicInteger) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Set up pool of files.
- setupSimpleLog(String) - Method in interface org.archive.modules.SimpleFileLoggerProvider
-
- setUriRegex(String) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
Regular expression against which to match the URI.
- setUseHeaderLength(boolean) - Method in class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule
-
- setUseHTTP11(boolean) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Use HTTP/1.1.
- setUsePreset(MatchesFilePatternDecideRule.Preset) - Method in class org.archive.modules.deciderules.MatchesFilePatternDecideRule
-
- setUserAgent(String) - Method in class org.archive.modules.CrawlURI
-
Set the user agent to use when crawling this URI.
- setUserAgentProvider(UserAgentProvider) - Method in class org.archive.modules.fetcher.FetchHTTP
-
- setUserAgentTemplate(String) - Method in class org.archive.modules.CrawlMetadata
-
- setUsername(String) - Method in class org.archive.modules.fetcher.FetchFTP
-
- setVia(UURI) - Method in class org.archive.modules.CrawlURI
-
- setWriteBufferSize(int) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- setWriteMetadata(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- setWriteRequests(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
- setWriteRevisitForIdenticalDigests(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- setWriteRevisitForNotModified(boolean) - Method in class org.archive.modules.writer.WARCWriterProcessor
-
Deprecated.
- sharedEngine - Variable in class org.archive.modules.deciderules.ScriptedDecideRule
-
- sharedEngine - Variable in class org.archive.modules.ScriptedProcessor
-
- shortReportLegend() - Method in class org.archive.modules.CrawlURI
-
- shortReportLegend() - Method in class org.archive.modules.fetcher.FetchStats
-
- shortReportLegend() - Method in class org.archive.modules.ProcessorChain
-
- shortReportLine() - Method in class org.archive.modules.CrawlURI
-
- shortReportLine() - Method in class org.archive.modules.fetcher.FetchStats
-
- shortReportLineTo(PrintWriter) - Method in class org.archive.modules.CrawlURI
-
- shortReportLineTo(PrintWriter) - Method in class org.archive.modules.fetcher.FetchStats
-
- shortReportLineTo(PrintWriter) - Method in class org.archive.modules.ProcessorChain
-
- shortReportMap() - Method in class org.archive.modules.CrawlURI
-
- shortReportMap() - Method in class org.archive.modules.fetcher.FetchStats
-
- shortReportMap() - Method in class org.archive.modules.ProcessorChain
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Determines if otherwise valid URIs should have links extracted or not.
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorCSS
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorDOC
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTML
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorJS
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDF
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorSWF
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorUniversal
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorXML
-
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.TrapSuppressExtractor
-
- shouldLoad(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
-
Whether the current CrawlURI's state should be loaded
- shouldMasquerade - Variable in class org.archive.modules.net.FirstNamedRobotsPolicy
-
whether to adopt the user-agent that is allowed for the fetch
- shouldMasquerade - Variable in class org.archive.modules.net.MostFavoredRobotsPolicy
-
whether to adopt the user-agent that is allowed for the fetch
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ContentExtractor
-
Determines if links should be extracted from the given URI.
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorHTTP
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorImpliedURI
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorMultipleRegex
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorURI
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.HTTPContentDigest
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchDNS
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchFTP
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchHTTP
-
Whether this processor can fetch the given CrawlURI.
- shouldProcess(CrawlURI) - Method in class org.archive.modules.fetcher.FetchWhois
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.forms.ExtractorHTMLForms
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.Processor
-
Determines whether the given uri should be processed by this
processor.
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryLoader
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.ContentDigestHistoryStorer
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.FetchHistoryProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLoadProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.PersistStoreProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.ScriptedProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.Kw3WriterProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.MirrorWriterProcessor
-
- shouldProcess(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- shouldStore(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractPersistProcessor
-
Whether the current CrawlURI's state should be persisted (to log or
direct to database)
- shouldWrite(CrawlURI) - Method in class org.archive.modules.writer.WriterPoolProcessor
-
Whether the given CrawlURI should be written to archive files.
- SimpleCookieStore - Class in org.archive.modules.fetcher
-
In-memory cookie store, mostly for testing.
- SimpleCookieStore() - Constructor for class org.archive.modules.fetcher.SimpleCookieStore
-
- SimpleFileLoggerProvider - Interface in org.archive.modules
-
- SimpleLinkContext(String) - Constructor for class org.archive.modules.extractor.LinkContext.SimpleLinkContext
-
- size() - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- size() - Method in class org.archive.modules.ProcessorChain
-
- skipIdenticalDigests - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Whether to skip the writing of a record when URI history information is
available and indicates the prior fetch had an identical content digest.
- socketFactory - Variable in class org.archive.modules.fetcher.FetchFTP
-
- SocketFactoryWithTimeout() - Constructor for class org.archive.modules.fetcher.FetchFTP.SocketFactoryWithTimeout
-
- sortableKey(Cookie) - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
Returns a string that uniquely identifies the cookie. The format
of the key is "normalizedDomain;name;path".
- SOURCE_SRCSET - Static variable in class org.archive.modules.extractor.HTMLLinkContext
-
- SourceSeedDecideRule - Class in org.archive.modules.deciderules
-
Rule applies the configured decision for any URI discovered from one of
the seeds in sourceSeeds.
- SourceSeedDecideRule() - Constructor for class org.archive.modules.deciderules.SourceSeedDecideRule
-
- sourceSeeds - Variable in class org.archive.modules.deciderules.SourceSeedDecideRule
-
- sourceTagSeeds - Variable in class org.archive.modules.seeds.SeedModule
-
Whether to tag seeds with their own URI as a heritable 'source' String,
which will be carried-forward to all URIs discovered on paths originating
from that seed.
- specialQueryTemplates - Variable in class org.archive.modules.fetcher.FetchWhois
-
- SPECULATIVE_MISC - Static variable in class org.archive.modules.extractor.LinkContext
-
Stand-in value for speculative/aggressively extracted urls without
other context.
- sslContext - Variable in class org.archive.modules.fetcher.FetchHTTP
-
- sslContext() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- sslTrustLevel - Variable in class org.archive.modules.fetcher.FetchHTTP
-
- STANDARD_POLICIES - Static variable in class org.archive.modules.net.RobotsPolicy
-
- start() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- start() - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- start() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- start() - Method in class org.archive.modules.fetcher.FetchWhois
-
- start() - Method in class org.archive.modules.net.BdbServerCache
-
- start() - Method in class org.archive.modules.Processor
-
- start() - Method in class org.archive.modules.ProcessorChain
-
- start() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- start() - Method in class org.archive.modules.recrawl.PersistLoadProcessor
-
- start() - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- start() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
-
- start() - Method in interface org.archive.modules.SimpleFileLoggerProvider
-
- start() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.BdbCookieStore
-
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.fetcher.SimpleCookieStore
-
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.net.BdbServerCache
-
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.Processor
-
- startCheckpoint(Checkpoint) - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- startNewFilesOnCheckpoint - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
- STATUS_CODE_KEY - Static variable in interface org.archive.modules.writer.Kw3Constants
-
- statusCodes - Variable in class org.archive.modules.deciderules.FetchStatusDecideRule
-
- stop() - Method in class org.archive.modules.deciderules.DecideRuleSequence
-
- stop() - Method in class org.archive.modules.fetcher.AbstractCookieStore
-
- stop() - Method in class org.archive.modules.fetcher.FetchHTTP
-
- stop() - Method in class org.archive.modules.fetcher.FetchWhois
-
- stop() - Method in class org.archive.modules.net.BdbServerCache
-
- stop() - Method in class org.archive.modules.Processor
-
- stop() - Method in class org.archive.modules.ProcessorChain
-
- stop() - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- stop() - Method in class org.archive.modules.recrawl.PersistLogProcessor
-
- stop() - Method in class org.archive.modules.recrawl.PersistOnlineProcessor
-
- stop() - Method in class org.archive.modules.writer.WriterPoolProcessor
-
- store(CrawlURI) - Method in class org.archive.modules.recrawl.AbstractContentDigestHistory
-
Stores curi.getContentDigestHistory() for the key persistKeyFor(curi).
- store - Variable in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- store(CrawlURI) - Method in class org.archive.modules.recrawl.BdbContentDigestHistory
-
- store - Variable in class org.archive.modules.recrawl.PersistOnlineProcessor
-
- storeDNSRecord(CrawlURI, String, CrawlHost, Record[]) - Method in class org.archive.modules.fetcher.FetchDNS
-
- storePaths - Variable in class org.archive.modules.writer.WriterPoolProcessor
-
Where to save files.
- StringExtractorTestBase - Class in org.archive.modules.extractor
-
- StringExtractorTestBase() - Constructor for class org.archive.modules.extractor.StringExtractorTestBase
-
- StringExtractorTestBase.TestData - Class in org.archive.modules.extractor
-
- StripExtraSlashes - Class in org.archive.modules.canonicalize
-
Strip any extra slashes, '/', found in the path.
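As an illustration of the rule described above, here is a minimal sketch that collapses runs of slashes in the path only. It is not the Heritrix implementation: the class name `ExtraSlashSketch`, the method `stripExtraSlashes`, and the regular expression are assumptions made for this example, and the real class may treat edge cases differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; NOT org.archive.modules.canonicalize.StripExtraSlashes.
final class ExtraSlashSketch {
    // Group 1: scheme and authority; group 2: path; group 3: query and/or fragment (may be empty).
    private static final Pattern PARTS =
            Pattern.compile("^([a-zA-Z][a-zA-Z0-9+.-]*://[^/?#]*)([^?#]*)(.*)$");

    static String stripExtraSlashes(String url) {
        Matcher m = PARTS.matcher(url);
        if (!m.matches()) {
            return url; // no "scheme://authority" prefix: leave the input untouched
        }
        // Collapse two or more consecutive slashes in the path into one.
        String path = m.group(2).replaceAll("/{2,}", "/");
        // The query and fragment are passed through unchanged.
        return m.group(1) + path + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(stripExtraSlashes("http://example.com//a///b/?q=x//y"));
    }
}
```

In this sketch the slashes inside the query string are deliberately preserved, since the index entry only mentions slashes "found in the path".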
- StripExtraSlashes() - Constructor for class org.archive.modules.canonicalize.StripExtraSlashes
-
- StripSessionCFIDs - Class in org.archive.modules.canonicalize
-
Strip ColdFusion session ids.
- StripSessionCFIDs() - Constructor for class org.archive.modules.canonicalize.StripSessionCFIDs
-
- StripSessionIDs - Class in org.archive.modules.canonicalize
-
Strip known session ids.
- StripSessionIDs() - Constructor for class org.archive.modules.canonicalize.StripSessionIDs
-
- stripToMinimal() - Method in class org.archive.modules.CrawlURI
-
Remove all attributes set on this uri.
- StripUserinfoRule - Class in org.archive.modules.canonicalize
-
Strip any 'userinfo' found on http/https URLs.
- StripUserinfoRule() - Constructor for class org.archive.modules.canonicalize.StripUserinfoRule
-
- StripWWWNRule - Class in org.archive.modules.canonicalize
-
Strip any 'www[0-9]*' found on http/https URLs IF they have some
path/query component (content after third slash).
- StripWWWNRule() - Constructor for class org.archive.modules.canonicalize.StripWWWNRule
-
- StripWWWRule - Class in org.archive.modules.canonicalize
-
Strip any 'www' found on http/https URLs, IF they have some
path/query component (content after third slash).
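The condition shared by StripWWWNRule and StripWWWRule (strip the prefix only when something follows the third slash) can be sketched with a single regular expression. This is an illustrative assumption, not the code of either class: `StripWwwSketch`, `stripWwwN`, and the pattern itself are hypothetical names and choices for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; NOT the Heritrix StripWWWNRule or StripWWWRule class.
final class StripWwwSketch {
    // Group 1: "http://" or "https://".
    // Non-capturing group: the "www", "www2", ... label plus its trailing dot.
    // Group 2: rest of the host, the third slash, and at least one more character.
    private static final Pattern WWW_N = Pattern.compile(
            "^(https?://)(?:www[0-9]*\\.)([^/]*/.+)$", Pattern.CASE_INSENSITIVE);

    static String stripWwwN(String url) {
        Matcher m = WWW_N.matcher(url);
        // Without content after the third slash the pattern does not match,
        // so the URL is returned unchanged.
        return m.matches() ? m.group(1) + m.group(2) : url;
    }

    public static void main(String[] args) {
        System.out.println(stripWwwN("http://www2.example.com/index.html"));
        System.out.println(stripWwwN("http://www.example.com/"));
    }
}
```

Restricting the digit part of the pattern to plain `www` would give the narrower behavior described for StripWWWRule.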
- StripWWWRule() - Constructor for class org.archive.modules.canonicalize.StripWWWRule
-
- subList(int, int) - Method in class org.archive.modules.fetcher.BdbCookieStore.RestrictedCollectionWrappedList
-
- submitStatusFor(String) - Method in class org.archive.modules.forms.FormLoginProcessor
-
- subset(CrawlURI, Class<?>) - Method in class org.archive.modules.credential.CredentialStore
-
Return set made up of all credentials of the passed type.
- subset(CrawlURI, Class<?>, String) - Method in class org.archive.modules.credential.CredentialStore
-
Return set made up of all credentials of the passed type.
- substats - Variable in class org.archive.modules.net.CrawlHost
-
- substats - Variable in class org.archive.modules.net.CrawlServer
-
- SUCCESS_BYTES - Static variable in class org.archive.modules.fetcher.FetchStats
-
- suffixAtEnd - Variable in class org.archive.modules.writer.MirrorWriterProcessor
-
If true, the suffix is placed at the end of the path, after the query (if
any).
- summary() - Method in class org.archive.crawler.util.CrawledBytesHistotable
-
- SurtPrefixedDecideRule - Class in org.archive.modules.deciderules.surt
-
Rule applies configured decision to any URIs that, when
expressed in SURT form, begin with one of the prefixes
in the configured set.
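To make the prefix test above concrete, here is a simplified sketch of converting a URI to SURT form (host labels reversed and comma-separated inside parentheses) and checking it against a set of prefixes. It is an assumption-laden illustration, not the Heritrix implementation: `SurtPrefixSketch`, `toSurt`, and `matchesAnyPrefix` are hypothetical names, and the sketch ignores ports, userinfo, and the other normalization a real SURT conversion performs.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; NOT org.archive.modules.deciderules.surt.SurtPrefixedDecideRule.
final class SurtPrefixSketch {
    // Simplified SURT form: scheme://(tld,domain,sub,)path?query
    static String toSurt(String url) {
        URI u = URI.create(url);
        String host = u.getHost();
        if (host == null || u.getScheme() == null) {
            return url; // nothing to reorder; return the input as-is
        }
        List<String> labels = Arrays.asList(host.toLowerCase(Locale.ROOT).split("\\."));
        Collections.reverse(labels); // www.archive.org -> org, archive, www
        StringBuilder sb = new StringBuilder(u.getScheme()).append("://(");
        for (String label : labels) {
            sb.append(label).append(',');
        }
        sb.append(')');
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();
    }

    // True if the SURT form of the URL begins with any configured prefix.
    static boolean matchesAnyPrefix(String url, List<String> prefixes) {
        String surt = toSurt(url);
        for (String prefix : prefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> prefixes = Arrays.asList("http://(org,archive,");
        System.out.println(toSurt("http://www.archive.org/movies/"));
        System.out.println(matchesAnyPrefix("http://www.archive.org/movies/", prefixes));
    }
}
```

Because the host labels are reversed, a single prefix such as `http://(org,archive,` covers the domain and all of its subdomains, which is what makes a plain string-prefix comparison sufficient in this sketch.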
- SurtPrefixedDecideRule() - Constructor for class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- surtPrefixes - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
- surtPrefixes - Variable in class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule
-
- surtsDumpFile - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Dump file to save SURT prefixes actually used; useful for debugging SURTs.
- surtsSource - Variable in class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule
-
Text from which to infer SURT prefixes.