Serialized Form
-
Package org.archive.crawler.util
-
Class org.archive.crawler.util.CrawledBytesHistotable extends org.archive.util.Histotable<String> implements Serializable
- serialVersionUID:
- 7923431123239026213L
-
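Each class entry in this listing records the serialVersionUID the class declares. As a minimal sketch of that pattern, the hypothetical ExampleHistotable class below (not part of Heritrix; the value is copied from the entry above purely for illustration) pins a serialVersionUID and reads it back through java.io.ObjectStreamClass.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical stand-in class; only the serialVersionUID pattern matters here.
final class ExampleHistotable implements Serializable {
    // Pinning the value keeps previously serialized data readable
    // after compatible changes to the class.
    private static final long serialVersionUID = 7923431123239026213L;

    /** Returns the serialVersionUID the serialization runtime sees for a Serializable class. */
    static long declaredUid(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(declaredUid(ExampleHistotable.class));
    }
}
```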
-
Package org.archive.modules
-
Class org.archive.modules.CrawlMetadata extends Object implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
audience
String audience
-
availableRobotsPolicies
Map<String,RobotsPolicy> availableRobotsPolicies
Map of all available RobotsPolicies, by name, to choose from. Assembled from declared instances in configuration plus the standard 'obey' (aka 'classic') and 'ignore' policies. -
description
String description
-
jobName
String jobName
-
kp
org.archive.spring.KeyedProperties kp
-
operator
String operator
-
organization
String organization
-
-
Class org.archive.modules.CrawlURI extends Object implements Serializable
- serialVersionUID:
- 4L
-
Serialization Methods
-
readObject
- Throws:
IOException
ClassNotFoundException
-
writeObject
- Throws:
IOException
-
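The readObject and writeObject methods listed here pair with the field notes for this class, which describe uuri and via as transient. The sketch below shows the general Java pattern only: a transient field is skipped by default serialization and written by hand in a compact form. The CompactUriHolder class, the use of java.net.URI in place of UURI, and the choice to write the URI as a string are all assumptions, not the actual CrawlURI implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.URI;

// Hypothetical class illustrating custom serialization of a transient field.
final class CompactUriHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: default serialization skips this field.
    private transient URI uri;

    CompactUriHolder(URI uri) {
        this.uri = uri;
    }

    URI getUri() {
        return uri;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();     // writes any non-transient fields
        out.writeUTF(uri.toString()); // writes the URI by hand as a string
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        uri = URI.create(in.readUTF()); // rebuilds the transient field
    }

    /** Serializes and deserializes the holder in memory. */
    static CompactUriHolder roundTrip(CompactUriHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CompactUriHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```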
-
Serialized Fields
-
canonicalString
String canonicalString
-
classKey
String classKey
Frontier/Scheduler lifecycle info. This is an identifier set by the Frontier for its purposes. Usually it's the name of the Frontier queue this URI gets queued to. Values can be host + port or IP, etc. -
contentDigest
byte[] contentDigest
A digest (hash, usually SHA1) of retrieved content-body. -
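As an illustration of how a SHA-1 digest of a content body can be produced as a byte[], here is a minimal sketch using the JDK's java.security.MessageDigest. The ContentDigestSketch class and its method names are hypothetical, and the hex rendering is only for display; it is not necessarily the encoding the crawler uses in its logs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of the crawler's API.
final class ContentDigestSketch {
    /** Returns the 20-byte SHA-1 digest of the given content body. */
    static byte[] sha1(byte[] body) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(body);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }

    /** Renders a digest as lowercase hexadecimal. */
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] body = "abc".getBytes(StandardCharsets.US_ASCII);
        // prints a9993e364706816aba3e25717850c26c9cd0d89d
        System.out.println(toHex(sha1(body)));
    }
}
```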
contentDigestScheme
String contentDigestScheme
-
contentLength
long contentLength
-
contentSize
long contentSize
-
contentType
String contentType
Content type of a successfully fetched URI. May be null even on successfully fetched URI. -
data
Map<String,Object> data
Flexible dynamic attributes list. The attribute list is a flexible map of key/value pairs for storing status of this URI for use by other processors. By convention the attribute list is keyed by constants found in the CoreAttributeConstants interface. Use this list to carry data or state produced by custom processors rather than changing the CrawlURI class itself. -
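The convention described above, a Map<String,Object> keyed by shared constants, might look like the following sketch. The AttributeMapSketch class and the A_EXAMPLE_LANGUAGE key are hypothetical; real keys would come from CoreAttributeConstants.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a constant-keyed attribute map.
final class AttributeMapSketch {
    // Hypothetical key, standing in for a CoreAttributeConstants entry.
    static final String A_EXAMPLE_LANGUAGE = "example-language";

    private final Map<String, Object> data = new HashMap<>();

    /** Stores state produced by one processor for later processors to read. */
    void put(String key, Object value) {
        data.put(key, value);
    }

    /** Returns the attribute as the requested type, or null if absent or of another type. */
    <T> T get(String key, Class<T> type) {
        Object value = data.get(key);
        return type.isInstance(value) ? type.cast(value) : null;
    }
}
```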
deferrals
int deferrals
-
extraInfo
org.json.JSONObject extraInfo
-
fetchAttempts
int fetchAttempts
-
fetchStatus
int fetchStatus
-
fetchType
CrawlURI.FetchType fetchType
specified fetch-type: GET, POST, or not-yet-known -
forceRevisit
boolean forceRevisit
-
holderCost
int holderCost
Spot for an integer cost to be placed by an external facility (frontier). The cost is truncated to 8 bits at times, so it should not exceed 255. -
isSeed
boolean isSeed
Seed status -
ordinal
long ordinal
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering. Will sometimes be truncated to 48 bits, so behavior beyond roughly 281 trillion instantiated CrawlURIs may be buggy. -
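The limits quoted for holderCost (8 bits, at most 255) and ordinal (48 bits, about 281 trillion) follow from simple bit arithmetic. The sketch below assumes truncation keeps the low-order bits by masking; how the frontier actually truncates is not stated here, and the TruncationSketch class is hypothetical.

```java
// Hypothetical illustration of 8-bit and 48-bit truncation by masking.
final class TruncationSketch {
    static final long ORDINAL_MASK = (1L << 48) - 1;

    /** Keeps only the low 8 bits of a cost. */
    static int truncateCost(int holderCost) {
        return holderCost & 0xFF;
    }

    /** Keeps only the low 48 bits of an ordinal. */
    static long truncateOrdinal(long ordinal) {
        return ordinal & ORDINAL_MASK;
    }

    public static void main(String[] args) {
        System.out.println(truncateCost(255));         // 255: fits in 8 bits
        System.out.println(truncateCost(300));         // 44: high bits are lost
        System.out.println(1L << 48);                  // 281474976710656
        System.out.println(truncateOrdinal(1L << 48)); // 0: wraps around
    }
}
```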
pathFromSeed
String pathFromSeed
String of letters indicating how this URI was reached from a seed:
P precondition
R redirection
E embedded (as frame, src, link, codebase, etc.)
X speculative embed (as from javascript, some alternate-format extractors)
L link
For example, LLLE (an embedded image on a page 3 links from seed). -
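A hop path such as LLLE can be decoded letter by letter. The sketch below covers only the five letters documented for this field; the HopPathSketch class is hypothetical, and any other letter is rejected rather than guessed at.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder for the hop letters documented for pathFromSeed.
final class HopPathSketch {
    /** Maps one hop letter to its documented meaning. */
    static String describeHop(char hop) {
        switch (hop) {
            case 'P': return "precondition";
            case 'R': return "redirection";
            case 'E': return "embedded";
            case 'X': return "speculative embed";
            case 'L': return "link";
            default:
                throw new IllegalArgumentException("Unknown hop letter: " + hop);
        }
    }

    /** Decodes a whole path; an empty path yields an empty list. */
    static List<String> describe(String pathFromSeed) {
        List<String> hops = new ArrayList<>();
        for (char hop : pathFromSeed.toCharArray()) {
            hops.add(describeHop(hop));
        }
        return hops;
    }

    /** Counts the plain link hops ('L') in a path. */
    static int countLinkHops(String pathFromSeed) {
        int links = 0;
        for (char hop : pathFromSeed.toCharArray()) {
            if (hop == 'L') {
                links++;
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(describe("LLLE"));      // [link, link, link, embedded]
        System.out.println(countLinkHops("LLLE")); // 3
    }
}
```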
politenessDelay
long politenessDelay
-
precedence
int precedence
assigned precedence -
prerequisite
boolean prerequisite
True if this CrawlURI has been deemed a prerequisite by the org.archive.crawler.prefetch.PreconditionEnforcer. This flag is used at least inside the precondition enforcer so that subsequent prerequisite tests know to let this CrawlURI through because it's a prerequisite needed by an earlier prerequisite test (e.g. if this is a robots.txt, then the subsequent login credentials prereq test must not throw it out because it's not a login curi). -
rescheduleTime
long rescheduleTime
A future time at which this CrawlURI should be reenqueued. -
schedulingDirective
int schedulingDirective
-
userAgent
String userAgent
-
uuri
org.archive.net.UURI uuri
The URI being crawled. It's transient to save space when storing to BDB. -
via
org.archive.net.UURI via
Where this URI was (presently) discovered. Transient to allow more efficient custom serialization. -
viaContext
LinkContext viaContext
Context of URI's discovery, as per the 'context' in Link
-
-
-
Package org.archive.modules.canonicalize
-
Class org.archive.modules.canonicalize.BaseRule extends Object implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
kp
org.archive.spring.KeyedProperties kp
-
-
Class org.archive.modules.canonicalize.FixupQueryString extends BaseRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.canonicalize.LowercaseRule extends BaseRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.canonicalize.RegexRule extends BaseRule implements Serializable
- serialVersionUID:
- -3L
-
Class org.archive.modules.canonicalize.StripExtraSlashes extends BaseRule implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.canonicalize.StripSessionCFIDs extends BaseRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.canonicalize.StripSessionIDs extends BaseRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.canonicalize.StripUserinfoRule extends BaseRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.canonicalize.StripWWWNRule extends BaseRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.canonicalize.StripWWWRule extends BaseRule implements Serializable
- serialVersionUID:
- 3L
-
-
Package org.archive.modules.credential
-
Class org.archive.modules.credential.Credential extends Object implements Serializable
- serialVersionUID:
- 2L
-
Serialized Fields
-
domain
String domain
The root domain this credential goes against, e.g. www.archive.org
-
-
Class org.archive.modules.credential.CredentialStore extends Object implements Serializable
- serialVersionUID:
- 3L
-
Serialized Fields
-
kp
org.archive.spring.KeyedProperties kp
-
-
Class org.archive.modules.credential.HtmlFormCredential extends Credential implements Serializable
- serialVersionUID:
- -4L
-
Serialized Fields
-
formItems
Map<String,String> formItems
Form items. -
httpMethod
HtmlFormCredential.Method httpMethod
Deprecated. Ignored; always POST. -
loginUri
String loginUri
Full URI of the page that contains the HTML login form we're to apply these credentials to, e.g. http://www.archive.org
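Since the form items are a Map<String,String> and the method is always POST, one plausible way to turn them into a request body is standard application/x-www-form-urlencoded encoding. The sketch below uses only the JDK's java.net.URLEncoder; the FormBodySketch class and the sample item names are hypothetical, and this is not necessarily how the fetcher builds its request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that URL-encodes form items into a POST body.
final class FormBodySketch {
    /** Joins the items as key=value pairs separated by '&', in map order. */
    static String encode(Map<String, String> formItems) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> item : formItems.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(item.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(item.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 unavailable", e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> formItems = new LinkedHashMap<>();
        formItems.put("login", "archivist");      // hypothetical item names
        formItems.put("password", "p@ss word");
        // prints login=archivist&password=p%40ss+word
        System.out.println(encode(formItems));
    }
}
```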
-
-
Class org.archive.modules.credential.HttpAuthenticationCredential extends Credential implements Serializable
- serialVersionUID:
- 4L
-
-
Package org.archive.modules.deciderules
-
Class org.archive.modules.deciderules.AcceptDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.AddRedirectFromRootServerToScope extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.ContentLengthDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule extends MatchesRegexDecideRule implements Serializable
- serialVersionUID:
- -2066930281015155843L
-
Class org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule extends ContentTypeMatchesRegexDecideRule implements Serializable
- serialVersionUID:
- 4729800377757426137L
-
Class org.archive.modules.deciderules.DecideRule extends Object implements Serializable
-
Serialized Fields
-
comment
String comment
-
kp
org.archive.spring.KeyedProperties kp
-
-
-
Class org.archive.modules.deciderules.DecideRuleSequence extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Serialized Fields
-
beanName
String beanName
-
isRunning
boolean isRunning
-
logExtraInfo
boolean logExtraInfo
Whether to include the "extra info" field for each entry in crawl.log. "Extra info" is a json object with entries "host", "via", "source" and "hopPath". -
loggerModule
SimpleFileLoggerProvider loggerModule
-
serverCache
ServerCache serverCache
-
-
Class org.archive.modules.deciderules.ExternalGeoLocationDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Serialized Fields
-
countryCodes
List<String> countryCodes
Country code name. -
lookup
ExternalGeoLookupInterface lookup
-
serverCache
ServerCache serverCache
-
-
Class org.archive.modules.deciderules.FetchStatusDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule extends MatchesRegexDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule extends FetchStatusMatchesRegexDecideRule implements Serializable
- serialVersionUID:
- -2220182698344063577L
-
Class org.archive.modules.deciderules.HasViaDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule extends MatchesRegexDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.IpAddressSetDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- -3670434739183271441L
-
Class org.archive.modules.deciderules.MatchesFilePatternDecideRule extends MatchesRegexDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.MatchesListRegexDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.MatchesRegexDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 2L
-
Class org.archive.modules.deciderules.MatchesStatusCodeDecideRule extends PredicatedDecideRule implements Serializable
-
Class org.archive.modules.deciderules.NotMatchesFilePatternDecideRule extends MatchesFilePatternDecideRule implements Serializable
- serialVersionUID:
- -8161371026787859554L
-
Class org.archive.modules.deciderules.NotMatchesListRegexDecideRule extends MatchesListRegexDecideRule implements Serializable
- serialVersionUID:
- 8691360087063555583L
-
Class org.archive.modules.deciderules.NotMatchesRegexDecideRule extends MatchesRegexDecideRule implements Serializable
- serialVersionUID:
- -2085313401991694306L
-
Class org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule extends MatchesStatusCodeDecideRule implements Serializable
-
Class org.archive.modules.deciderules.PathologicalPathDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.PredicatedDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.deciderules.PrerequisiteAcceptDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.RejectDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.ResourceLongerThanDecideRule extends ResourceNoLongerThanDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.ResourceNoLongerThanDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- -8774160016195991876L
-
Class org.archive.modules.deciderules.ResponseContentLengthDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.deciderules.SchemeNotInSetDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.ScriptedDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Serialized Fields
-
appCtx
org.springframework.context.ApplicationContext appCtx
-
engineName
String engineName
engine name; default "beanshell" -
isolateThreads
boolean isolateThreads
Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine. Default is true, meaning each thread gets its own isolated engine. -
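The two modes described for isolateThreads, one engine per thread versus synchronized access to a single shared engine, can be sketched with a ThreadLocal. The EngineHolder class below is hypothetical and generic over an arbitrary engine type E; it does not use javax.script or any Heritrix class.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical holder showing per-thread versus shared engine access.
final class EngineHolder<E> {
    private final boolean isolateThreads;
    private final ThreadLocal<E> perThread;
    private final E shared;

    EngineHolder(boolean isolateThreads, Supplier<E> factory) {
        this.isolateThreads = isolateThreads;
        // Each thread lazily creates its own engine on first use.
        this.perThread = ThreadLocal.withInitial(factory);
        // The shared engine is only created when threads are not isolated.
        this.shared = isolateThreads ? null : factory.get();
    }

    /** Runs the work against this thread's engine, or the shared one under a lock. */
    <R> R withEngine(Function<E, R> work) {
        if (isolateThreads) {
            return work.apply(perThread.get());
        }
        synchronized (shared) {
            return work.apply(shared);
        }
    }

    /** Returns the engine instance that a freshly started thread would use. */
    static <E> E engineSeenByNewThread(EngineHolder<E> holder) {
        AtomicReference<E> seen = new AtomicReference<>();
        Thread thread = new Thread(() -> seen.set(holder.<E>withEngine(e -> e)));
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return seen.get();
    }
}
```

With isolation on, no locking is needed because no engine is ever visible to two threads; with isolation off, the synchronized block serializes access to the one engine.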
scriptSource
org.archive.io.ReadSource scriptSource
-
-
Class org.archive.modules.deciderules.SeedAcceptDecideRule extends DecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.SourceSeedDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.deciderules.TooManyHopsDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.TooManyPathSegmentsDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.TransclusionDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- -3975688876990558918L
-
Class org.archive.modules.deciderules.ViaSurtPrefixedDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
surtPrefixes
org.archive.util.SurtPrefixSet surtPrefixes
-
-
-
Package org.archive.modules.deciderules.recrawl
-
Class org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 4275993790856626949L
-
-
Package org.archive.modules.deciderules.surt
-
Class org.archive.modules.deciderules.surt.NotOnDomainsDecideRule extends OnDomainsDecideRule implements Serializable
- serialVersionUID:
- -1634035244888724934L
-
Class org.archive.modules.deciderules.surt.NotOnHostsDecideRule extends OnHostsDecideRule implements Serializable
- serialVersionUID:
- 1512825197255050412L
-
Class org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule extends SurtPrefixedDecideRule implements Serializable
- serialVersionUID:
- -7491388438128566377L
-
Class org.archive.modules.deciderules.surt.OnDomainsDecideRule extends SurtPrefixedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.surt.OnHostsDecideRule extends SurtPrefixedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Class org.archive.modules.deciderules.surt.SurtPrefixedDecideRule extends PredicatedDecideRule implements Serializable
- serialVersionUID:
- 3L
-
Serialized Fields
-
beanName
String beanName
-
recoveryCheckpoint
org.archive.checkpointing.Checkpoint recoveryCheckpoint
-
seeds
SeedModule seeds
-
seedsAsSurtPrefixes
boolean seedsAsSurtPrefixes
Should seeds also be interpreted as SURT prefixes. -
surtPrefixes
org.archive.util.SurtPrefixSet surtPrefixes
-
surtsDumpFile
org.archive.spring.ConfigFile surtsDumpFile
Dump file to save SURT prefixes actually used. Useful for debugging SURTs. -
surtsSource
org.archive.io.ReadSource surtsSource
Text from which to infer SURT prefixes. Any URLs will be converted to the implied SURT prefix, and literal SURT prefixes may be listed on lines beginning with a '+' character.
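The input format described for surtsSource distinguishes literal SURT prefixes (lines beginning with '+') from URLs that are to be converted. The sketch below only separates the two kinds of line; the URL-to-SURT conversion itself is left out, skipping blank lines is an assumption, and the SurtSourceSketch class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classifier for lines of a SURT prefix source text.
final class SurtSourceSketch {
    final List<String> literalSurtPrefixes = new ArrayList<>();
    final List<String> urlsToConvert = new ArrayList<>();

    /** Splits the text into lines and sorts each line into one of the two lists. */
    static SurtSourceSketch parse(String text) {
        SurtSourceSketch result = new SurtSourceSketch();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no meaning
            }
            if (line.startsWith("+")) {
                // Literal SURT prefix: keep everything after the '+'.
                result.literalSurtPrefixes.add(line.substring(1).trim());
            } else {
                // A URL whose implied SURT prefix would be computed elsewhere.
                result.urlsToConvert.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SurtSourceSketch parsed = parse("http://www.archive.org/\n+http://(org,archive,");
        System.out.println(parsed.urlsToConvert);       // [http://www.archive.org/]
        System.out.println(parsed.literalSurtPrefixes); // [http://(org,archive,]
    }
}
```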
-
-
-
Package org.archive.modules.extractor
-
Class org.archive.modules.extractor.ExtractorMultipleRegex.GroupList extends LinkedList<String> implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.extractor.ExtractorMultipleRegex.MatchList extends LinkedList<ExtractorMultipleRegex.GroupList> implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.extractor.HTMLLinkContext extends LinkContext implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
path
String path
The HTML path to the URL.
-
-
Class org.archive.modules.extractor.LinkContext extends Object implements Serializable
- serialVersionUID:
- 4117965561244539334L
-
Class org.archive.modules.extractor.LinkContext.SimpleLinkContext extends LinkContext implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
desc
String desc
-
-
-
Package org.archive.modules.fetcher
-
Class org.archive.modules.fetcher.DefaultServerCache extends ServerCache implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
hosts
org.archive.util.ObjectIdentityCache<CrawlHost> hosts
hostname -> CrawlHost. Set in the initialization. -
servers
org.archive.util.ObjectIdentityCache<CrawlServer> servers
hostname[:port] -> CrawlServer. Set in the initialization.
-
-
Class org.archive.modules.fetcher.FetchStats extends CrawledBytesHistotable implements Serializable
- serialVersionUID:
- 2L
-
Serialized Fields
-
lastSuccessTime
long lastSuccessTime
-
-
-
Package org.archive.modules.net
-
Class org.archive.modules.net.BdbServerCache extends DefaultServerCache implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
bdb
org.archive.bdb.BdbModule bdb
-
isCheckpointRecovery
boolean isCheckpointRecovery
-
isRunning
boolean isRunning
-
-
Class org.archive.modules.net.CrawlHost extends Object implements Serializable
- serialVersionUID:
- -5494573967890942895L
-
Serialized Fields
-
countryCode
String countryCode
-
earliestNextURIEmitTime
long earliestNextURIEmitTime
-
hostname
String hostname
-
ip
InetAddress ip
-
ipFetched
long ipFetched
-
ipTTL
long ipTTL
TTL obtained from the DNS record. From RFC 1035: "TTL: a 32 bit unsigned integer that specifies the time interval (in seconds) that the resource record may be cached before it should be discarded. Zero values are interpreted to mean that the RR can only be used for the transaction in progress, and should not be cached."
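The TTL semantics quoted above can be turned into a small expiry check. The sketch assumes ipFetched is a timestamp in milliseconds and ipTTL is in seconds, and it models only the quoted rule; any special sentinel values CrawlHost may use are not covered. The DnsTtlSketch class is hypothetical.

```java
// Hypothetical expiry check based on the quoted TTL definition.
final class DnsTtlSketch {
    /** Largest value of a 32-bit unsigned integer. */
    static final long MAX_TTL_SECONDS = 0xFFFFFFFFL;

    /**
     * Returns true if a record fetched at ipFetchedMillis with the given TTL
     * should no longer be used from cache at nowMillis.
     */
    static boolean isExpired(long ipFetchedMillis, long ttlSeconds, long nowMillis) {
        if (ttlSeconds < 0 || ttlSeconds > MAX_TTL_SECONDS) {
            throw new IllegalArgumentException("TTL outside unsigned 32-bit range: " + ttlSeconds);
        }
        if (ttlSeconds == 0) {
            return true; // zero TTL: valid only for the transaction in progress
        }
        return nowMillis >= ipFetchedMillis + ttlSeconds * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(isExpired(0L, 60L, 59_999L)); // false
        System.out.println(isExpired(0L, 60L, 60_000L)); // true
        System.out.println(isExpired(0L, 0L, 0L));       // true
    }
}
```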
-
substats
FetchStats substats
-
-
Class org.archive.modules.net.CrawlServer extends Object implements Serializable
- serialVersionUID:
- 3L
-
Serialized Fields
-
consecutiveConnectionErrors
int consecutiveConnectionErrors
-
port
int port
-
robotsFetched
long robotsFetched
-
robotstxt
Robotstxt robotstxt
-
server
String server
-
substats
FetchStats substats
-
validRobots
boolean validRobots
-
-
Class org.archive.modules.net.DefaultTempDirProvider extends Object implements Serializable
- serialVersionUID:
- 1L
-
Class org.archive.modules.net.RobotsDirectives extends Object implements Serializable
- serialVersionUID:
- 5386542759286155383L
-
Serialized Fields
-
allows
ConcurrentSkipListSet<String> allows
-
crawlDelay
float crawlDelay
-
disallows
ConcurrentSkipListSet<String> disallows
-
-
Class org.archive.modules.net.Robotstxt extends Object implements Serializable
- serialVersionUID:
- 7025386509301303890L
-
Serialized Fields
-
agentsToDirectives
Map<String,RobotsDirectives> agentsToDirectives
-
hasErrors
boolean hasErrors
-
namedUserAgents
LinkedList<String> namedUserAgents
-
wildcardDirectives
RobotsDirectives wildcardDirectives
-
-
-
Package org.archive.modules.seeds
-
Class org.archive.modules.seeds.SeedModule extends Object implements Serializable
- serialVersionUID:
- 1L
-
Serialized Fields
-
seedListeners
Set<SeedListener> seedListeners
-
sourceTagSeeds
boolean sourceTagSeeds
Whether to tag seeds with their own URI as a heritable 'source' String, which will be carried-forward to all URIs discovered on paths originating from that seed. When present, such source tags appear in the second-to-last crawl.log field.
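The heritable 'source' tag described above can be pictured as a value set once at the seed and copied to every URI discovered from it. The sketch below is a hypothetical model (SourceTagSketch is not a Heritrix class) and ignores everything else a CrawlURI carries.

```java
// Hypothetical model of a source tag inherited along discovery paths.
final class SourceTagSketch {
    final String uri;
    final String sourceTag; // null when no tag is carried

    private SourceTagSketch(String uri, String sourceTag) {
        this.uri = uri;
        this.sourceTag = sourceTag;
    }

    /** Creates a seed; with sourceTagSeeds on, the seed is tagged with its own URI. */
    static SourceTagSketch seed(String uri, boolean sourceTagSeeds) {
        return new SourceTagSketch(uri, sourceTagSeeds ? uri : null);
    }

    /** Creates a URI discovered from this one; it inherits the tag unchanged. */
    SourceTagSketch discovered(String childUri) {
        return new SourceTagSketch(childUri, sourceTag);
    }

    public static void main(String[] args) {
        SourceTagSketch leaf = seed("http://example.com/", true)
                .discovered("http://example.com/a")
                .discovered("http://other.example/b");
        System.out.println(leaf.sourceTag); // http://example.com/
    }
}
```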
-
-
Class org.archive.modules.seeds.TextSeedModule extends SeedModule implements Serializable
- serialVersionUID:
- 3L
-
Serialized Fields
-
blockAwaitingSeedLines
int blockAwaitingSeedLines
Number of lines of seeds-source to read on initial load before proceeding with crawl. Default is -1, meaning all. Any other value will cause that number of lines to be loaded before fetching begins, while all extra lines continue to be processed in the background. Generally, this should only be changed when working with very large seed lists, and scopes that do *not* depend on reading all seeds. -
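The behavior described for blockAwaitingSeedLines, waiting for the first N lines while the rest load in the background, resembles a latch that opens after N items. The sketch below models it with java.util.concurrent.CountDownLatch; the SeedLoadSketch class is hypothetical and says nothing about how TextSeedModule is actually implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical model: block on the first N seed lines, load the rest in the background.
final class SeedLoadSketch {
    private final List<String> loaded = new CopyOnWriteArrayList<>();
    private final CountDownLatch initialBlock;
    private final Thread worker;

    private SeedLoadSketch(List<String> seedLines, int blockAwaitingSeedLines) {
        // -1 (any negative value here) means: wait for all lines.
        int blocking = blockAwaitingSeedLines < 0
                ? seedLines.size()
                : Math.min(blockAwaitingSeedLines, seedLines.size());
        CountDownLatch latch = new CountDownLatch(blocking);
        this.initialBlock = latch;
        this.worker = new Thread(() -> {
            for (String line : seedLines) {
                loaded.add(line);
                latch.countDown(); // has no effect once the count is zero
            }
        });
    }

    /** Starts background loading and returns a handle to wait on. */
    static SeedLoadSketch start(List<String> seedLines, int blockAwaitingSeedLines) {
        SeedLoadSketch sketch = new SeedLoadSketch(seedLines, blockAwaitingSeedLines);
        sketch.worker.start();
        return sketch;
    }

    /** Blocks until the initial lines are loaded; returns how many are loaded so far. */
    int awaitInitialBlock() {
        try {
            initialBlock.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return loaded.size();
    }

    /** Blocks until every line is loaded; returns the total. */
    int awaitAll() {
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return loaded.size();
    }

    public static void main(String[] args) {
        SeedLoadSketch load = start(Arrays.asList("a", "b", "c"), 2);
        System.out.println(load.awaitInitialBlock() >= 2); // true
        System.out.println(load.awaitAll());               // 3
    }
}
```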
textSource
org.archive.io.ReadSource textSource
Text from which to extract seeds
-
-