Class CrawlURI
- All Implemented Interfaces:
Serializable, Comparable<CrawlURI>, org.archive.spring.OverlayContext, org.archive.util.Reporter
public class CrawlURI extends Object implements org.archive.util.Reporter, Serializable, org.archive.spring.OverlayContext, Comparable<CrawlURI>
Core state is kept in instance variables, but a flexible attribute list is also available. Use this 'bucket' to carry custom extracted data and processing state across CrawlURI processing. See getData(), etc.
Note: getHttpMethod() has been removed starting with Heritrix 3.3.0. HTTP response headers are available using getHttpResponseHeader(String). (HTTP fetchers are responsible for setting the values using putHttpResponseHeader(String, String).)
- Author:
- Gordon Mohr
- See Also:
- Serialized Form
-
Nested Class Summary
static class CrawlURI.FetchType
-
Field Summary
static String A_FETCH_HISTORY
    Fetch history array.
protected String canonicalString
protected Map<String,Object> data
    Flexible dynamic attributes list.
protected org.json.JSONObject extraInfo
protected CrawlURI fullVia
protected Object holder
protected int holderCost
    Spot for an integer cost to be placed by an external facility (frontier).
protected Object holderKey
protected long ordinal
    Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering.
protected Collection<CrawlURI> outLinks
    All discovered outbound URLs as CrawlURIs (navlinks, embeds, etc.).
protected org.archive.spring.OverlayMapsSource overlayMapsSource
protected ArrayList<String> overlayNames
protected long politenessDelay
protected long rescheduleTime
    A future time at which this CrawlURI should be reenqueued.
static int UNCALCULATED
-
Constructor Summary
CrawlURI(org.archive.net.UURI uuri)
    Create a new instance of CrawlURI from a UURI.
CrawlURI(org.archive.net.UURI u, String pathFromSeed, org.archive.net.UURI via, LinkContext viaContext)
-
Method Summary
void aboutToLog()
    Notify CrawlURI it is about to be logged; an opportunity for self-annotation.
void addExtraInfo(String key, Object value)
static void autoregisterTo(org.archive.bdb.AutoKryo kryo)
CrawlURI clearPrerequisiteUri()
    Clear prerequisite, if any.
int compareTo(CrawlURI o)
boolean containsContentTypeCharsetDeclaration()
boolean containsDataKey(String key)
CrawlURI createCrawlURI(String destination, LinkContext context, Hop hop)
CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop)
    Utility method for creating CrawlURIs found as outlinks while extracting links from this CrawlURI.
CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop, int scheduling, boolean seed)
    Utility method for creation of CrawlURIs found extracting links from this CrawlURI.
boolean equals(Object o)
static String extendHopsPath(String pathFromSeed, char hopChar)
    Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols), keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED.
static String fetchStatusCodesToString(int code)
    Takes a status code and converts it into a human-readable string.
String flattenVia()
    Returns the string version of this URI's referral URI.
boolean forceFetch()
    If this method returns true, this URI should be fetched even though it has already been crawled.
static CrawlURI fromHopsViaString(String uriHopsViaContext)
Collection<String> getAnnotations()
    Get the annotations set for this URI.
org.archive.net.UURI getBaseURI()
    Get the (HTML) base URI used for derelativizing internal URIs.
String getCanonicalString()
String getClassKey()
    Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.
byte[] getContentDigest()
    Return the retained content-digest value, if any.
HashMap<String,Object> getContentDigestHistory()
String getContentDigestSchemeString()
String getContentDigestString()
long getContentLength()
    For completed HTTP transactions, the length of the content-body.
long getContentSize()
    Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers.
String getContentType()
    Get the content type of this URI.
Set<Credential> getCredentials()
Map<String,Object> getData()
List<Object> getDataList(String key)
    Convenience method: return (creating if necessary) the list at the given data key.
int getDeferrals()
    Get the deferral count.
int getEmbedHopCount()
    Get the embed hop count.
org.json.JSONObject getExtraInfo()
int getFetchAttempts()
    Get the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
long getFetchBeginTime()
long getFetchCompletedTime()
long getFetchDuration()
HashMap<String,Object>[] getFetchHistory()
int getFetchStatus()
    Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
CrawlURI.FetchType getFetchType()
CrawlURI getFullVia()
Object getHolder()
    Return the 'holder', for the convenience of an external facility.
int getHolderCost()
    Return the 'holderCost', for the convenience of an external facility (frontier).
Object getHolderKey()
    Return the 'holderKey', for the convenience of an external facility (Frontier).
int getHopCount()
    Get total hops from seed.
Map<String,String> getHttpAuthChallenges()
String getHttpResponseHeader(String key)
String getLastHop()
    Convenience access to the last hop character, as a string.
int getLinkHopCount()
    Get the link hop count.
Collection<Throwable> getNonFatalFailures()
long getOrdinal()
    Get the ordinal (serial number) assigned at creation.
Collection<CrawlURI> getOutLinks()
    Returns discovered links.
Map<String,Object> getOverlayMap(String name)
ArrayList<String> getOverlayNames()
String getPathFromSeed()
org.archive.net.UURI getPolicyBasisUURI()
    Get the UURI that should be used as the basis of policy/overlay decisions.
long getPolitenessDelay()
int getPrecedence()
CrawlURI getPrerequisiteUri()
    Get the prerequisite for this URI.
long getRecordedSize()
    Get the size of data recorded (transferred).
org.archive.util.Recorder getRecorder()
    Get the HTTP recorder associated with this URI.
long getRescheduleTime()
RevisitProfile getRevisitProfile()
int getSchedulingDirective()
String getServerIP()
    Returns the IP address the request was fetched against, or null if unavailable.
String getSourceTag()
int getThreadNumber()
    Get the number of the ToeThread responsible for processing this URI.
int getTransHops()
    Tally up the number of transitive (non-simple-link) hops at the end of this CrawlURI's pathFromSeed.
String getURI()
String getUserAgent()
    Get the user agent to use for crawling this URI.
org.archive.net.UURI getUURI()
org.archive.net.UURI getVia()
LinkContext getViaContext()
boolean hasBeenLinkExtracted()
    If true, a link extractor has already claimed this CrawlURI and performed link extraction on the document content.
boolean hasContentDigestHistory()
boolean hasCredentials()
int hashCode()
boolean hasPrerequisiteUri()
boolean hasRfc2617Credential()
boolean haveOverlayNamesBeenSet()
boolean includesRetireDirective()
void incrementDeferrals()
    Increment the deferral count.
void incrementDiscardedOutLinks()
void incrementFetchAttempts()
    Increment the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
protected void inheritFrom(CrawlURI ancestor)
    Inherit (copy) the relevant key-values from the ancestor.
boolean is2XXSuccess()
boolean isHttpTransaction()
    Return true if this is an HTTP transaction.
boolean isLocation()
boolean isPrerequisite()
    Returns true if this CrawlURI is a prerequisite.
boolean isRevisit()
    Indicates whether this CrawlURI has been deemed a revisit.
boolean isSeed()
boolean isSuccess()
    Ask this URI if it was a success or not.
void linkExtractorFinished()
    Note that link extraction has been performed on this CrawlURI.
void makeHeritable(String key)
    Make the given key 'heritable', meaning its value will be added to descendant CrawlURIs.
void makeNonHeritable(String key)
    Make the given key non-'heritable', meaning its value will not be added to descendant CrawlURIs.
CrawlURI markPrerequisite(String preq)
    Do all actions associated with setting a CrawlURI as requiring a prerequisite.
void processingCleanup()
    Clean up after a run through the processing chain.
void putHttpResponseHeader(String key, String value)
protected org.archive.net.UURI readUuri(String u)
    Read a UURI from a String, handling a null or URIException.
void reportTo(PrintWriter writer)
void resetDeferrals()
    Reset the deferrals counter.
void resetFetchAttempts()
    Reset the fetchAttempts counter.
void resetForRescheduling()
    Reset state that should not persist when a URI is rescheduled for a specific future time.
void setBaseURI(String baseHref)
    Set the (HTML) base URI used for derelativizing internal URIs.
void setBaseURI(org.archive.net.UURI base)
void setCanonicalString(String canonical)
void setClassKey(String key)
void setContentDigest(byte[] digestValue)
    Deprecated.
void setContentDigest(String scheme, byte[] digestValue)
void setContentSize(long l)
    Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) and even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server).
void setContentType(String ct)
    Set a fetched URI's content type.
void setError(String msg)
void setFetchBeginTime(long time)
void setFetchCompletedTime(long time)
void setFetchHistory(Map<String,Object>[] history)
void setFetchStatus(int newstatus)
    Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
void setFetchType(CrawlURI.FetchType type)
void setForceFetch(boolean b)
    Signal that this URI should be fetched even though it has already been crawled.
void setForceRetire(boolean b)
void setFullVia(CrawlURI curi)
void setHolder(Object obj)
    Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
void setHolderCost(int cost)
    Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI.
void setHolderKey(Object obj)
    Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
void setHttpAuthChallenges(Map<String,String> httpAuthChallenges)
void setOrdinal(long o)
void setOverlayMapsSource(org.archive.spring.OverlayMapsSource overrideMapsSource)
void setPolitenessDelay(long polite)
void setPrecedence(int precedence)
void setPrerequisite(boolean prerequisite)
    Set whether this CrawlURI is itself a prerequisite URI.
void setPrerequisiteUri(CrawlURI pre)
    Set a prerequisite for this URI.
void setRecorder(org.archive.util.Recorder httpRecorder)
    Set the HTTP recorder to be associated with this URI.
void setRescheduleTime(long time)
void setRevisitProfile(RevisitProfile revisitProfile)
void setSchedulingDirective(int priority)
void setSeed(boolean b)
    Set the isSeed attribute of this URI.
void setServerIP(String serverIP)
void setSourceTag(String sourceTag)
void setThreadNumber(int i)
    Set the number of the ToeThread responsible for processing this URI.
void setUserAgent(String string)
    Set the user agent to use when crawling this URI.
void setVia(org.archive.net.UURI via)
String shortReportLegend()
String shortReportLine()
void shortReportLineTo(PrintWriter w)
Map<String,Object> shortReportMap()
void stripToMinimal()
    Remove all attributes set on this URI.
String toString()
-
Field Details
-
UNCALCULATED
public static final int UNCALCULATED
- See Also:
- Constant Field Values
-
A_FETCH_HISTORY
Fetch history array.
- See Also:
- Constant Field Values
-
data
protected Map<String,Object> data
Flexible dynamic attributes list. The attribute list is a flexible map of key/value pairs for storing the status of this URI for use by other processors. By convention the attribute list is keyed by constants found in the CoreAttributeConstants interface. Use this list to carry data or state produced by custom processors rather than changing this class, CrawlURI. -
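The getData()/getDataList(String) pattern described above amounts to a create-if-absent lookup on a plain map. Below is a minimal sketch of that pattern; the class name and the data key are illustrative, not Heritrix API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataBucketSketch {
    // Simulates CrawlURI's flexible attribute map ('bucket').
    private final Map<String, Object> data = new HashMap<>();

    public Map<String, Object> getData() {
        return data;
    }

    // Create-if-absent list lookup, as getDataList(String) is described.
    @SuppressWarnings("unchecked")
    public List<Object> getDataList(String key) {
        return (List<Object>) data.computeIfAbsent(key, k -> new ArrayList<>());
    }

    public static void main(String[] args) {
        DataBucketSketch curi = new DataBucketSketch();
        // A custom processor can append findings without pre-initializing the list.
        curi.getDataList("A_CUSTOM_FINDINGS").add("mailto:user@example.com");
        curi.getDataList("A_CUSTOM_FINDINGS").add("tel:+1-555-0100");
        System.out.println(curi.getDataList("A_CUSTOM_FINDINGS").size()); // prints 2
    }
}
```

By convention a real processor would key the map with constants from CoreAttributeConstants rather than ad-hoc strings.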
ordinal
protected long ordinal
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering. Will sometimes be truncated to 48 bits, so behavior over 281 trillion instantiated CrawlURIs may be buggy. -
holder
-
holderKey
-
holderCost
protected int holderCost
Spot for an integer cost to be placed by an external facility (frontier). The cost is truncated to 8 bits at times, so it should not exceed 255. -
outLinks
All discovered outbound urls as CrawlURIs (navlinks, embeds, etc.) -
overlayNames
-
overlayMapsSource
protected transient org.archive.spring.OverlayMapsSource overlayMapsSource -
canonicalString
-
politenessDelay
protected long politenessDelay -
fullVia
-
rescheduleTime
protected long rescheduleTime
A future time at which this CrawlURI should be reenqueued. -
extraInfo
protected org.json.JSONObject extraInfo
-
-
Constructor Details
-
CrawlURI
public CrawlURI(org.archive.net.UURI uuri)
Create a new instance of CrawlURI from a UURI.
- Parameters:
uuri
- the UURI to base this CrawlURI on.
-
CrawlURI
public CrawlURI(org.archive.net.UURI u, String pathFromSeed, org.archive.net.UURI via, LinkContext viaContext)
- Parameters:
u - UURI instance this CrawlURI wraps.
pathFromSeed -
via -
viaContext
-
-
-
Method Details
-
fromHopsViaString
public static CrawlURI fromHopsViaString(String uriHopsViaContext) throws org.apache.commons.httpclient.URIException- Throws:
org.apache.commons.httpclient.URIException
-
getSchedulingDirective
public int getSchedulingDirective()- Returns:
- Returns the schedulingDirective.
-
setSchedulingDirective
public void setSchedulingDirective(int priority)- Parameters:
priority
- The schedulingDirective to set.
-
containsDataKey
-
fetchStatusCodesToString
Takes a status code and converts it into a human readable string.- Parameters:
code
- the status code- Returns:
- a human readable string declaring what the status code is.
-
getFetchStatus
public int getFetchStatus()Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.- Returns:
- a value from FetchStatusCodes
-
setFetchStatus
public void setFetchStatus(int newstatus)Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.- Parameters:
newstatus
- a value from FetchStatusCodes
-
getFetchAttempts
public int getFetchAttempts()Get the count of attempts (trips through the processing loop) at getting the document referenced by this URI. Compared against a configured maximum to determine when to stop retrying. TODO: Consider renaming as something more generic, as all processing-loops do not necessarily include an attempted network-fetch (for example, when processing is aborted early to enqueue a prerequisite), and this counter may be reset if a URI is starting a fresh series of tries (as when rescheduled at a future time). Perhaps simply 'tryCount' or 'attempts'?- Returns:
- attempts count
-
incrementFetchAttempts
public void incrementFetchAttempts()Increment the count of attempts (trips through the processing loop) at getting the document referenced by this URI. -
resetFetchAttempts
public void resetFetchAttempts()Reset fetchAttempts counter. -
resetDeferrals
public void resetDeferrals()Reset deferrals counter. -
setPrerequisiteUri
Set a prerequisite for this URI. A prerequisite is a URI that must be crawled before this URI can be crawled.
- Parameters:
pre
- Link to set as prereq.
-
getPrerequisiteUri
Get the prerequisite for this URI. A prerequisite is a URI that must be crawled before this URI can be crawled.
- Returns:
- the prerequisite for this URI or null if no prerequisite.
-
clearPrerequisiteUri
Clear prerequisite, if any. -
hasPrerequisiteUri
public boolean hasPrerequisiteUri()- Returns:
- True if this CrawlURI has a prerequisite.
-
isPrerequisite
public boolean isPrerequisite()
Returns true if this CrawlURI is a prerequisite. TODO:FIXME: code elsewhere is confused about whether this means that this CrawlURI is a prerequisite for another, or *has* a prerequisite; clean up and rename as necessary.
- Returns:
- true if this CrawlURI is a prerequisite.
-
setPrerequisite
public void setPrerequisite(boolean prerequisite)Set if this CrawlURI is itself a prerequisite URI.- Parameters:
prerequisite
- True if this CrawlURI is itself a prerequisite URI.
-
getContentType
Get the content type of this URI.- Returns:
- Fetched URIs content type. May be null.
-
setContentType
Set a fetched uri's content type.- Parameters:
ct
- Contenttype.
-
setThreadNumber
public void setThreadNumber(int i)Set the number of the ToeThread responsible for processing this uri.- Parameters:
i
- the ToeThread number.
-
getThreadNumber
public int getThreadNumber()Get the number of the ToeThread responsible for processing this uri.- Returns:
- the ToeThread number.
-
incrementDeferrals
public void incrementDeferrals()Increment the deferral count. -
getDeferrals
public int getDeferrals()Get the deferral count.- Returns:
- the deferral count.
-
stripToMinimal
public void stripToMinimal()
Remove all attributes set on this URI. This method removes the attribute list.
-
getContentSize
public long getContentSize()Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers. It is the responsibility of the classes which fetch the URI to set this value accordingly -- it is not calculated/verified within CrawlURI. This value is consulted in reporting/logging/writing-decisions.- Returns:
- contentSize
- See Also:
setContentSize(long)
-
getAnnotations
Get the annotations set for this uri.- Returns:
- the annotations set for this uri.
-
getHopCount
public int getHopCount()Get total hops from seed.- Returns:
- int hops count
-
getEmbedHopCount
public int getEmbedHopCount()Get the embed hop count.- Returns:
- the embed hop count.
-
getLinkHopCount
public int getLinkHopCount()Get the link hop count.- Returns:
- the link hop count.
-
getUserAgent
Get the user agent to use for crawling this URI. If null the global setting should be used.- Returns:
- user agent or null
-
setUserAgent
Set the user agent to use when crawling this URI. If not set the global settings should be used.- Parameters:
string
- user agent to use
-
getContentLength
public long getContentLength()For completed HTTP transactions, the length of the content-body.- Returns:
- For completed HTTP transactions, the length of the content-body.
-
getRecordedSize
public long getRecordedSize()Get size of data recorded (transferred)- Returns:
- recorded data size
-
setContentSize
public void setContentSize(long l)
Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) and even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server). (In contrast, content-length matches the HTTP definition: that of the enclosed content-body.) Should be set by a fetcher or other processor as soon as the final size of recorded content is known. Setting it to an artificial/incorrect value may affect other reporting/processing. -
hasBeenLinkExtracted
public boolean hasBeenLinkExtracted()
If true, a link extractor has already claimed this CrawlURI and performed link extraction on the document content. This does not preclude other link extractors that may have an interest in this CrawlURI from also doing link extraction. There is an onus on link extractors to set this flag if they have run.
- Returns:
- True if a processor has performed link extraction on this CrawlURI
- See Also:
linkExtractorFinished()
-
linkExtractorFinished
public void linkExtractorFinished()
Note that link extraction has been performed on this CrawlURI. A processor doing link extraction should invoke this method once it has finished its work. It should invoke it even if no links are extracted, but only if the link extraction was performed on the document body (not the HTTP headers etc.).
- See Also:
hasBeenLinkExtracted()
-
aboutToLog
public void aboutToLog()Notify CrawlURI it is about to be logged; opportunity for self-annotation -
getRecorder
public org.archive.util.Recorder getRecorder()Get the http recorder associated with this uri.- Returns:
- Returns the httpRecorder. May be null, but it is set early in FetchHTTP, so there is a problem if it is null.
-
setRecorder
public void setRecorder(org.archive.util.Recorder httpRecorder)Set the http recorder to be associated with this uri.- Parameters:
httpRecorder
- The httpRecorder to set.
-
isHttpTransaction
public boolean isHttpTransaction()Return true if this is a http transaction.- Returns:
- True if this is a http transaction.
-
processingCleanup
public void processingCleanup()Clean up after a run through the processing chain. Called on the end of processing chain by Frontier#finish. Null out any state gathered during processing. -
getCredentials
- Returns:
- Credential avatars. Null if none set.
-
hasCredentials
public boolean hasCredentials()- Returns:
- True if there are avatars attached to this instance.
-
isSuccess
public boolean isSuccess()
Ask this URI if it was a success or not. Only makes sense to call this method after execution of HttpMethod#execute. Regard any status larger than 0 as success, except for the caveat below regarding 401s. Use is2XXSuccess() if looking for a status code in the 200 range.
401s caveat: if any RFC 2617 credential data is present and we got a 401, assume the credentials were loaded in FetchHTTP on the expectation that we're to go around the processing chain again. Report this condition as a failure so we get another crack at the processing chain, only this time we'll be making use of the loaded credential data.
- Returns:
- True if this URI has been successfully processed.
- See Also:
is2XXSuccess()
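The success rule described above (any positive status counts, except a 401 when RFC 2617 credential data is present) can be restated as a small sketch. This is an illustrative reimplementation of the documented rule, not the actual Heritrix code; the class and parameter names are made up for the example.

```java
public class SuccessRuleSketch {
    static final int HTTP_UNAUTHORIZED = 401;

    // Illustrative restatement of the documented isSuccess() rule:
    // any fetch status > 0 counts as success, except a 401 when RFC 2617
    // credential data is present (the crawler is expected to retry with
    // credentials, so that 401 is reported as a failure for this trip).
    static boolean isSuccess(int fetchStatus, boolean hasRfc2617Credential) {
        if (fetchStatus == HTTP_UNAUTHORIZED && hasRfc2617Credential) {
            return false;
        }
        return fetchStatus > 0;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(200, false)); // true
        System.out.println(isSuccess(401, true));  // false: retry with credentials first
        System.out.println(isSuccess(-6, false));  // false: negative failure codes
    }
}
```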
-
is2XXSuccess
public boolean is2XXSuccess()- Returns:
- True if status code is in the 2xx range.
- See Also:
isSuccess()
-
hasRfc2617Credential
public boolean hasRfc2617Credential()- Returns:
- True if we have an rfc2617 payload.
-
setContentDigest
public void setContentDigest(byte[] digestValue)
Deprecated. Set the retained content-digest value (usually SHA-1).
- Parameters:
digestValue
-
-
setContentDigest
-
getContentDigestSchemeString
-
getContentDigest
public byte[] getContentDigest()Return the retained content-digest value, if any.- Returns:
- Digest value.
-
getContentDigestString
-
setHolder
Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI .- Parameters:
obj
-
-
getHolder
Return the 'holder' for the convenience of an external facility.- Returns:
- holder
-
setHolderKey
Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI .- Parameters:
obj
-
-
getHolderKey
Return the 'holderKey' for convenience of an external facility (Frontier).- Returns:
- holderKey
-
getOrdinal
public long getOrdinal()Get the ordinal (serial number) assigned at creation.- Returns:
- ordinal
-
setOrdinal
public void setOrdinal(long o) -
getHolderCost
public int getHolderCost()Return the 'holderCost' for convenience of external facility (frontier)- Returns:
- value of holderCost
-
setHolderCost
public void setHolderCost(int cost)Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI- Parameters:
cost
- value to remember
-
getOutLinks
Returns discovered links. The returned collection might be empty if no links were discovered, or if something like LinksScoper promoted the links to CrawlURIs.- Returns:
- Collection of all discovered outbound links
-
setBaseURI
Set the (HTML) Base URI used for derelativizing internal URIs.- Parameters:
baseHref
- String base href to use- Throws:
org.apache.commons.httpclient.URIException
- if supplied string cannot be interpreted as URI
-
getBaseURI
public org.archive.net.UURI getBaseURI()Get the (HTML) Base URI used for derelativizing internal URIs.- Returns:
- UURI base URI previously set
-
readUuri
Read a UURI from a String, handling a null or URIException- Parameters:
u
- String or null from which to create UURI- Returns:
- the best UURI instance creatable
-
getServerIP
Returns the IP address the request was fetched against or null if unavailable. -
getFetchBeginTime
public long getFetchBeginTime() -
getFetchCompletedTime
public long getFetchCompletedTime() -
getFetchDuration
public long getFetchDuration() -
getFetchType
-
getNonFatalFailures
-
setServerIP
-
setError
-
setFetchBeginTime
public void setFetchBeginTime(long time) -
setFetchCompletedTime
public void setFetchCompletedTime(long time) -
setFetchType
-
setForceRetire
public void setForceRetire(boolean b) -
setBaseURI
public void setBaseURI(org.archive.net.UURI base) -
getData
-
getDataList
Convenience method: return (creating if necessary) list at given data key- Parameters:
key
-- Returns:
- List
-
setSeed
public void setSeed(boolean b)Set the isSeed attribute of this URI.- Parameters:
b
- Is this URI a seed, true or false.
-
isSeed
public boolean isSeed()- Returns:
- Whether seeded.
-
getUURI
public org.archive.net.UURI getUURI()- Returns:
- UURI
-
getURI
- Returns:
- String of URI
-
getPathFromSeed
- Returns:
- path (hop-types) from seed
-
getLastHop
convenience access to last hop character, as string -
getVia
public org.archive.net.UURI getVia()- Returns:
- URI via which this one was discovered
-
setVia
public void setVia(org.archive.net.UURI via) -
getViaContext
- Returns:
- CharSequence context in which this one was discovered
-
isLocation
public boolean isLocation()- Returns:
- True if this CrawlURI was the result of a redirect: i.e. its parent URI redirected here, and this URI is what was in the 'Location:' or 'Content-Location:' HTTP header.
-
shortReportLine
-
shortReportMap
- Specified by:
shortReportMap
in interfaceorg.archive.util.Reporter
-
shortReportLineTo
- Specified by:
shortReportLineTo
in interfaceorg.archive.util.Reporter
-
shortReportLegend
- Specified by:
shortReportLegend
in interfaceorg.archive.util.Reporter
-
reportTo
- Specified by:
reportTo
in interfaceorg.archive.util.Reporter
- Throws:
IOException
-
flattenVia
Method returns string version of this URI's referral URI.- Returns:
- String version of referral URI
-
getSourceTag
-
setSourceTag
-
makeHeritable
Make the given key 'heritable', meaning its value will be added to descendant CrawlURIs. Only keys with immutable values should be made heritable -- the value instance may be shared until the data map is serialized/deserialized.- Parameters:
key
- to make heritable
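A simplified model of the heritability contract described above: heritable keys (and only those) are copied to descendants by inheritFrom. This sketch assumes a separate heritable-key set for clarity; Heritrix's actual storage may differ, and all names here are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HeritabilitySketch {
    final Map<String, Object> data = new HashMap<>();
    private final Set<String> heritableKeys = new HashSet<>();

    void makeHeritable(String key) { heritableKeys.add(key); }
    void makeNonHeritable(String key) { heritableKeys.remove(key); }

    // Mirrors the described inheritFrom(CrawlURI): copy only heritable
    // key-values from the ancestor. Note the value instance is shared,
    // which is why only immutable values should be made heritable.
    void inheritFrom(HeritabilitySketch ancestor) {
        for (String key : ancestor.heritableKeys) {
            heritableKeys.add(key);
            if (ancestor.data.containsKey(key)) {
                data.put(key, ancestor.data.get(key));
            }
        }
    }

    public static void main(String[] args) {
        HeritabilitySketch parent = new HeritabilitySketch();
        parent.data.put("sourceTag", "seed-list-1");
        parent.data.put("scratch", "not inherited");
        parent.makeHeritable("sourceTag");

        HeritabilitySketch child = new HeritabilitySketch();
        child.inheritFrom(parent);
        System.out.println(child.data.get("sourceTag"));      // seed-list-1
        System.out.println(child.data.containsKey("scratch")); // false
    }
}
```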
-
makeNonHeritable
Make the given key non-'heritable', meaning its value will not be added to descendant CrawlURIs. Only meaningful if key was previously made heritable.- Parameters:
key
- to make non-heritable
-
getClassKey
Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.- Returns:
- Token (usually the hostname) which indicates what "class" this CrawlURI should be grouped with.
-
setClassKey
-
forceFetch
public boolean forceFetch()If this method returns true, this URI should be fetched even though it already has been crawled. This also implies that this URI will be scheduled for crawl before any other waiting URIs for the same host. This value is used to refetch any expired robots.txt or dns-lookups.- Returns:
- true if crawling of this URI should be forced
-
setForceFetch
public void setForceFetch(boolean b)Method to signal that this URI should be fetched even though it already has been crawled. Setting this to true also implies that this URI will be scheduled for crawl before any other waiting URIs for the same host. This value is used to refetch any expired robots.txt or dns-lookups.- Parameters:
b
- set to true to enforce the crawling of this URI
-
getTransHops
public int getTransHops()
Tally up the number of transitive (non-simple-link) hops at the end of this CrawlURI's pathFromSeed. In some cases, URIs with greater than zero but fewer than some threshold of such hops are treated specially.
TODO: consider moving link-count in here as well, caching the calculation, and refactoring CrawlScope.exceedsMaxHops() to use this.
- Returns:
- Transhop count.
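The tally described above amounts to counting trailing non-link hop symbols in the pathFromSeed. An illustrative sketch follows; the hop characters ('L' link, 'E' embed, 'R' redirect, 'P' prerequisite, 'X' speculative, 'I' inferred) are assumed from Heritrix's conventions, and this is a sketch of the idea rather than the exact implementation.

```java
public class TransHopsSketch {
    // Count hop-type characters at the end of the pathFromSeed that are
    // not plain navigation links ('L'); those trailing non-'L' symbols
    // are the "transitive" hops the documentation describes.
    static int getTransHops(String pathFromSeed) {
        int transHops = 0;
        for (int i = pathFromSeed.length() - 1; i >= 0; i--) {
            if (pathFromSeed.charAt(i) == 'L') {
                break; // a simple link ends the trailing transitive run
            }
            transHops++;
        }
        return transHops;
    }

    public static void main(String[] args) {
        System.out.println(getTransHops("LLL"));  // 0: simple links only
        System.out.println(getTransHops("LLRE")); // 2: redirect then embed at the end
    }
}
```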
-
inheritFrom
Inherit (copy) the relevant keys-values from the ancestor.- Parameters:
ancestor
-
-
createCrawlURI
public CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop) throws org.apache.commons.httpclient.URIException
Utility method for creating CrawlURIs found as outlinks while extracting links from this CrawlURI. Any relative URIs will be treated as relative to this CrawlURI's UURI.
- Parameters:
destination
- The new URI, possibly a relative URIcontext
-hop
-- Returns:
- New CrawlURI with the current CrawlURI set as the one it inherits from
- Throws:
org.apache.commons.httpclient.URIException
-
createCrawlURI
public CrawlURI createCrawlURI(String destination, LinkContext context, Hop hop) throws org.apache.commons.httpclient.URIException- Throws:
org.apache.commons.httpclient.URIException
-
extendHopsPath
Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols), keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED. For longer hops paths, precede the string with an integer and '+', then the displayed hops.
- Parameters:
pathFromSeed
-hopChar
-
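A simplified sketch of the truncation idea, assuming MAX_HOPS_DISPLAYED = 50 purely for illustration (the real method also folds an existing "N+" prefix into the count, which this sketch omits):

```java
public class HopsPathSketch {
    static final int MAX_HOPS_DISPLAYED = 50; // assumed value for illustration

    // Append the new hop-type symbol; once the path exceeds
    // MAX_HOPS_DISPLAYED symbols, show only the most recent symbols,
    // preceded by a count of the elided ones and '+'.
    static String extendHopsPath(String pathFromSeed, char hopChar) {
        String extended = pathFromSeed + hopChar;
        if (extended.length() <= MAX_HOPS_DISPLAYED) {
            return extended;
        }
        int elided = extended.length() - MAX_HOPS_DISPLAYED;
        return elided + "+" + extended.substring(elided);
    }

    public static void main(String[] args) {
        System.out.println(extendHopsPath("LLE", 'R')); // LLER
    }
}
```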
-
createCrawlURI
public CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop, int scheduling, boolean seed) throws org.apache.commons.httpclient.URIExceptionUtility method for creation of CrawlURIs found extracting links from this CrawlURI.- Throws:
org.apache.commons.httpclient.URIException
-
toString
-
incrementDiscardedOutLinks
public void incrementDiscardedOutLinks() -
getPrecedence
public int getPrecedence()- Returns:
- the precedence
-
setPrecedence
public void setPrecedence(int precedence)- Parameters:
precedence
- the precedence to set
-
getPolicyBasisUURI
public org.archive.net.UURI getPolicyBasisUURI()Get the UURI that should be used as the basis of policy/overlay decisions. In the case of prerequisites, this is the URI that triggered the prerequisite -- the 'via' -- so that the prerequisite lands in the same queue, with the same overlay values, as the triggering URI.- Returns:
- UURI to use for policy decisions
-
haveOverlayNamesBeenSet
public boolean haveOverlayNamesBeenSet()- Specified by:
haveOverlayNamesBeenSet
in interfaceorg.archive.spring.OverlayContext
-
getOverlayNames
- Specified by:
getOverlayNames
in interfaceorg.archive.spring.OverlayContext
-
getOverlayMap
- Specified by:
getOverlayMap
in interfaceorg.archive.spring.OverlayContext
-
setOverlayMapsSource
public void setOverlayMapsSource(org.archive.spring.OverlayMapsSource overrideMapsSource) -
setCanonicalString
-
getCanonicalString
-
setPolitenessDelay
public void setPolitenessDelay(long polite) -
getPolitenessDelay
public long getPolitenessDelay() -
setFullVia
-
getFullVia
-
setRescheduleTime
public void setRescheduleTime(long time) -
getRescheduleTime
public long getRescheduleTime() -
resetForRescheduling
public void resetForRescheduling()
Reset state that should not persist when a URI is rescheduled for a specific future time. -
includesRetireDirective
public boolean includesRetireDirective() -
getExtraInfo
public org.json.JSONObject getExtraInfo() -
addExtraInfo
-
autoregisterTo
public static void autoregisterTo(org.archive.bdb.AutoKryo kryo) -
markPrerequisite
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
- Returns:
- the newly created prerequisite CrawlURI
- Throws:
org.apache.commons.httpclient.URIException
-
containsContentTypeCharsetDeclaration
public boolean containsContentTypeCharsetDeclaration() -
getHttpResponseHeader
- Parameters:
key
- http response header key (case-insensitive)- Returns:
- value of the header or null if there is no such header
- Since:
- 3.3.0
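The case-insensitive key behavior documented above can be modeled with a TreeMap ordered by String.CASE_INSENSITIVE_ORDER. This is a sketch of the documented contract, not Heritrix's internal storage.

```java
import java.util.Map;
import java.util.TreeMap;

public class HeaderStoreSketch {
    // Case-insensitive header storage, matching the documented contract
    // that getHttpResponseHeader(String) treats its key case-insensitively.
    private final Map<String, String> headers =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void putHttpResponseHeader(String key, String value) {
        headers.put(key, value);
    }

    public String getHttpResponseHeader(String key) {
        return headers.get(key); // null if there is no such header
    }

    public static void main(String[] args) {
        HeaderStoreSketch curi = new HeaderStoreSketch();
        curi.putHttpResponseHeader("Content-Type", "text/html; charset=UTF-8");
        System.out.println(curi.getHttpResponseHeader("content-type")); // text/html; charset=UTF-8
        System.out.println(curi.getHttpResponseHeader("ETag"));         // null
    }
}
```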
-
putHttpResponseHeader
- Since:
- 3.3.0
-
getHttpAuthChallenges
-
setHttpAuthChallenges
-
getFetchHistory
-
setFetchHistory
-
getContentDigestHistory
-
hasContentDigestHistory
public boolean hasContentDigestHistory() -
isRevisit
public boolean isRevisit()Indicates if this CrawlURI object has been deemed a revisit. -
getRevisitProfile
-
setRevisitProfile
-
compareTo
- Specified by:
compareTo
in interfaceComparable<CrawlURI>
-
hashCode
public int hashCode() -
equals
-