Class CrawlURI
- All Implemented Interfaces:
Serializable, Comparable<CrawlURI>, org.archive.spring.OverlayContext, org.archive.util.Reporter
public class CrawlURI extends Object implements org.archive.util.Reporter, Serializable, org.archive.spring.OverlayContext, Comparable<CrawlURI>
Core state is kept in instance variables, but a flexible attribute list is also available. Use this 'bucket' to carry custom extracted data and processing state across CrawlURI processing. See getData(), etc.
Note: getHttpMethod() has been removed starting with Heritrix 3.3.0. HTTP response headers are available using getHttpResponseHeader(String). (HTTP fetchers are responsible for setting the values using putHttpResponseHeader(String, String).)
- Author:
- Gordon Mohr
- See Also:
- Serialized Form
-
Nested Class Summary
static class CrawlURI.FetchType
-
Field Summary
static String A_FETCH_HISTORY
    Fetch history array.
protected String canonicalString
protected Map<String,Object> data
    Flexible dynamic attributes list.
protected org.json.JSONObject extraInfo
protected CrawlURI fullVia
protected Object holder
protected int holderCost
    Spot for an integer cost to be placed by an external facility (frontier).
protected Object holderKey
protected long ordinal
    Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering.
protected Collection<CrawlURI> outLinks
    All discovered outbound URLs as CrawlURIs (navlinks, embeds, etc.).
protected org.archive.spring.OverlayMapsSource overlayMapsSource
protected ArrayList<String> overlayNames
protected long politenessDelay
protected long rescheduleTime
    A future time at which this CrawlURI should be reenqueued.
static int UNCALCULATED
-
Constructor Summary
CrawlURI(org.archive.net.UURI uuri)
    Create a new instance of CrawlURI from a UURI.
CrawlURI(org.archive.net.UURI u, String pathFromSeed, org.archive.net.UURI via, LinkContext viaContext)
-
Method Summary
void aboutToLog()
    Notify CrawlURI it is about to be logged; an opportunity for self-annotation.
void addExtraInfo(String key, Object value)
static void autoregisterTo(org.archive.bdb.AutoKryo kryo)
CrawlURI clearPrerequisiteUri()
    Clear prerequisite, if any.
int compareTo(CrawlURI o)
boolean containsContentTypeCharsetDeclaration()
boolean containsDataKey(String key)
CrawlURI createCrawlURI(String destination, LinkContext context, Hop hop)
CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop)
    Utility method for creating CrawlURIs found as outlinks while extracting links from this CrawlURI.
CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop, int scheduling, boolean seed)
    Utility method for creation of CrawlURIs found extracting links from this CrawlURI.
boolean equals(Object o)
static String extendHopsPath(String pathFromSeed, char hopChar)
    Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols), keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED.
static String fetchStatusCodesToString(int code)
    Takes a status code and converts it into a human-readable string.
String flattenVia()
    Returns the string version of this URI's referral URI.
boolean forceFetch()
    If this method returns true, this URI should be fetched even though it has already been crawled.
static CrawlURI fromHopsViaString(String uriHopsViaContext)
Collection<String> getAnnotations()
    Get the annotations set for this URI.
org.archive.net.UURI getBaseURI()
    Get the (HTML) base URI used for derelativizing internal URIs.
String getCanonicalString()
String getClassKey()
    Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.
byte[] getContentDigest()
    Return the retained content-digest value, if any.
HashMap<String,Object> getContentDigestHistory()
String getContentDigestSchemeString()
String getContentDigestString()
long getContentLength()
    For completed HTTP transactions, the length of the content-body.
long getContentSize()
    Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers.
String getContentType()
    Get the content type of this URI.
Set<Credential> getCredentials()
Map<String,Object> getData()
List<Object> getDataList(String key)
    Convenience method: return (creating if necessary) the list at the given data key.
int getDeferrals()
    Get the deferral count.
int getEmbedHopCount()
    Get the embed hop count.
org.json.JSONObject getExtraInfo()
int getFetchAttempts()
    Get the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
long getFetchBeginTime()
long getFetchCompletedTime()
long getFetchDuration()
HashMap<String,Object>[] getFetchHistory()
int getFetchStatus()
    Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
CrawlURI.FetchType getFetchType()
CrawlURI getFullVia()
Object getHolder()
    Return the 'holder', for the convenience of an external facility.
int getHolderCost()
    Return the 'holderCost', for the convenience of an external facility (frontier).
Object getHolderKey()
    Return the 'holderKey', for the convenience of an external facility (Frontier).
int getHopCount()
    Get total hops from seed.
Map<String,String> getHttpAuthChallenges()
String getHttpResponseHeader(String key)
String getLastHop()
    Convenience access to the last hop character, as a string.
int getLinkHopCount()
    Get the link hop count.
Collection<Throwable> getNonFatalFailures()
long getOrdinal()
    Get the ordinal (serial number) assigned at creation.
Collection<CrawlURI> getOutLinks()
    Returns discovered links.
Map<String,Object> getOverlayMap(String name)
ArrayList<String> getOverlayNames()
String getPathFromSeed()
org.archive.net.UURI getPolicyBasisUURI()
    Get the UURI that should be used as the basis of policy/overlay decisions.
long getPolitenessDelay()
int getPrecedence()
CrawlURI getPrerequisiteUri()
    Get the prerequisite for this URI.
long getRecordedSize()
    Get the size of data recorded (transferred).
org.archive.util.Recorder getRecorder()
    Get the HTTP recorder associated with this URI.
long getRescheduleTime()
RevisitProfile getRevisitProfile()
int getSchedulingDirective()
String getServerIP()
    Returns the IP address the request was fetched against, or null if unavailable.
String getSourceTag()
int getThreadNumber()
    Get the number of the ToeThread responsible for processing this URI.
int getTransHops()
    Tally up the number of transitive (non-simple-link) hops at the end of this CrawlURI's pathFromSeed.
String getURI()
String getUserAgent()
    Get the user agent to use for crawling this URI.
org.archive.net.UURI getUURI()
org.archive.net.UURI getVia()
LinkContext getViaContext()
boolean hasBeenLinkExtracted()
    If true, a link extractor has already claimed this CrawlURI and performed link extraction on the document content.
boolean hasContentDigestHistory()
boolean hasCredentials()
int hashCode()
boolean hasPrerequisiteUri()
boolean hasRfc2617Credential()
boolean haveOverlayNamesBeenSet()
boolean includesRetireDirective()
void incrementDeferrals()
    Increment the deferral count.
void incrementDiscardedOutLinks()
void incrementFetchAttempts()
    Increment the count of attempts (trips through the processing loop) at getting the document referenced by this URI.
protected void inheritFrom(CrawlURI ancestor)
    Inherit (copy) the relevant key-values from the ancestor.
boolean is2XXSuccess()
boolean isHttpTransaction()
    Return true if this is an HTTP transaction.
boolean isLocation()
boolean isPrerequisite()
    Returns true if this CrawlURI is a prerequisite.
boolean isRevisit()
    Indicates whether this CrawlURI has been deemed a revisit.
boolean isSeed()
boolean isSuccess()
    Ask this URI if it was a success or not.
void linkExtractorFinished()
    Note that link extraction has been performed on this CrawlURI.
void makeHeritable(String key)
    Make the given key 'heritable', meaning its value will be added to descendant CrawlURIs.
void makeNonHeritable(String key)
    Make the given key non-'heritable', meaning its value will not be added to descendant CrawlURIs.
CrawlURI markPrerequisite(String preq)
    Do all actions associated with setting a CrawlURI as requiring a prerequisite.
void processingCleanup()
    Clean up after a run through the processing chain.
void putHttpResponseHeader(String key, String value)
protected org.archive.net.UURI readUuri(String u)
    Read a UURI from a String, handling a null or URIException.
void reportTo(PrintWriter writer)
void resetDeferrals()
    Reset the deferrals counter.
void resetFetchAttempts()
    Reset the fetchAttempts counter.
void resetForRescheduling()
    Reset state that should not persist when a URI is rescheduled for a specific future time.
void setBaseURI(String baseHref)
    Set the (HTML) base URI used for derelativizing internal URIs.
void setBaseURI(org.archive.net.UURI base)
void setCanonicalString(String canonical)
void setClassKey(String key)
void setContentDigest(byte[] digestValue)
    Deprecated.
void setContentDigest(String scheme, byte[] digestValue)
void setContentSize(long l)
    Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) and even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server).
void setContentType(String ct)
    Set a fetched URI's content type.
void setError(String msg)
void setFetchBeginTime(long time)
void setFetchCompletedTime(long time)
void setFetchHistory(Map<String,Object>[] history)
void setFetchStatus(int newstatus)
    Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
void setFetchType(CrawlURI.FetchType type)
void setForceFetch(boolean b)
    Signal that this URI should be fetched even though it has already been crawled.
void setForceRetire(boolean b)
void setFullVia(CrawlURI curi)
void setHolder(Object obj)
    Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
void setHolderCost(int cost)
    Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI.
void setHolderKey(Object obj)
    Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
void setHttpAuthChallenges(Map<String,String> httpAuthChallenges)
void setOrdinal(long o)
void setOverlayMapsSource(org.archive.spring.OverlayMapsSource overrideMapsSource)
void setPolitenessDelay(long polite)
void setPrecedence(int precedence)
void setPrerequisite(boolean prerequisite)
    Set whether this CrawlURI is itself a prerequisite URI.
void setPrerequisiteUri(CrawlURI pre)
    Set a prerequisite for this URI.
void setRecorder(org.archive.util.Recorder httpRecorder)
    Set the HTTP recorder to be associated with this URI.
void setRescheduleTime(long time)
void setRevisitProfile(RevisitProfile revisitProfile)
void setSchedulingDirective(int priority)
void setSeed(boolean b)
    Set the isSeed attribute of this URI.
void setServerIP(String serverIP)
void setSourceTag(String sourceTag)
void setThreadNumber(int i)
    Set the number of the ToeThread responsible for processing this URI.
void setUserAgent(String string)
    Set the user agent to use when crawling this URI.
void setVia(org.archive.net.UURI via)
String shortReportLegend()
String shortReportLine()
void shortReportLineTo(PrintWriter w)
Map<String,Object> shortReportMap()
void stripToMinimal()
    Remove all attributes set on this URI.
String toString()
-
Field Details
-
UNCALCULATED
public static final int UNCALCULATED
- See Also:
- Constant Field Values
-
A_FETCH_HISTORY
Fetch history array.
- See Also:
- Constant Field Values
-
data
protected Map<String,Object> data
Flexible dynamic attributes list. The attribute list is a flexible map of key/value pairs for storing the status of this URI for use by other processors. By convention the attribute list is keyed by constants found in the CoreAttributeConstants interface. Use this list to carry data or state produced by custom processors rather than changing this class, CrawlURI. -
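The getData()/getDataList(String) pattern described above amounts to a create-if-absent lookup on a plain map. Below is a minimal sketch of that pattern; the class name and the data key are illustrative, not Heritrix API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataBucketSketch {
    // Simulates CrawlURI's flexible attribute map ('bucket').
    private final Map<String, Object> data = new HashMap<>();

    public Map<String, Object> getData() {
        return data;
    }

    // Create-if-absent list lookup, as getDataList(String) is described.
    @SuppressWarnings("unchecked")
    public List<Object> getDataList(String key) {
        return (List<Object>) data.computeIfAbsent(key, k -> new ArrayList<>());
    }

    public static void main(String[] args) {
        DataBucketSketch curi = new DataBucketSketch();
        // A custom processor can append findings without pre-initializing the list.
        curi.getDataList("A_CUSTOM_FINDINGS").add("mailto:user@example.com");
        curi.getDataList("A_CUSTOM_FINDINGS").add("tel:+1-555-0100");
        System.out.println(curi.getDataList("A_CUSTOM_FINDINGS").size()); // prints 2
    }
}
```

By convention a real processor would key the map with constants from CoreAttributeConstants rather than ad-hoc strings.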
ordinal
protected long ordinal
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering. Will sometimes be truncated to 48 bits, so behavior over 281 trillion instantiated CrawlURIs may be buggy. -
holder
-
holderKey
-
holderCost
protected int holderCost
Spot for an integer cost to be placed by an external facility (frontier). The cost is truncated to 8 bits at times, so it should not exceed 255. -
outLinks
All discovered outbound urls as CrawlURIs (navlinks, embeds, etc.) -
overlayNames
-
overlayMapsSource
protected transient org.archive.spring.OverlayMapsSource overlayMapsSource -
canonicalString
-
politenessDelay
protected long politenessDelay -
fullVia
-
rescheduleTime
protected long rescheduleTime
A future time at which this CrawlURI should be reenqueued. -
extraInfo
protected org.json.JSONObject extraInfo
-
-
Constructor Details
-
CrawlURI
public CrawlURI(org.archive.net.UURI uuri)
Create a new instance of CrawlURI from a UURI.
- Parameters:
uuri
- the UURI to base this CrawlURI on.
-
CrawlURI
public CrawlURI(org.archive.net.UURI u, String pathFromSeed, org.archive.net.UURI via, LinkContext viaContext)
- Parameters:
u - UURI instance this CrawlURI wraps.
pathFromSeed -
via -
viaContext
-
-
-
Method Details
-
fromHopsViaString
public static CrawlURI fromHopsViaString(String uriHopsViaContext) throws org.apache.commons.httpclient.URIException- Throws:
org.apache.commons.httpclient.URIException
-
getSchedulingDirective
public int getSchedulingDirective()- Returns:
- Returns the schedulingDirective.
-
setSchedulingDirective
public void setSchedulingDirective(int priority)- Parameters:
priority
- The schedulingDirective to set.
-
containsDataKey
-
fetchStatusCodesToString
Takes a status code and converts it into a human readable string.- Parameters:
code
- the status code- Returns:
- a human readable string declaring what the status code is.
-
getFetchStatus
public int getFetchStatus()Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.- Returns:
- a value from FetchStatusCodes
-
setFetchStatus
public void setFetchStatus(int newstatus)Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.- Parameters:
newstatus
- a value from FetchStatusCodes
-
getFetchAttempts
public int getFetchAttempts()Get the count of attempts (trips through the processing loop) at getting the document referenced by this URI. Compared against a configured maximum to determine when to stop retrying. TODO: Consider renaming as something more generic, as all processing-loops do not necessarily include an attempted network-fetch (for example, when processing is aborted early to enqueue a prerequisite), and this counter may be reset if a URI is starting a fresh series of tries (as when rescheduled at a future time). Perhaps simply 'tryCount' or 'attempts'?- Returns:
- attempts count
-
incrementFetchAttempts
public void incrementFetchAttempts()Increment the count of attempts (trips through the processing loop) at getting the document referenced by this URI. -
resetFetchAttempts
public void resetFetchAttempts()Reset fetchAttempts counter. -
resetDeferrals
public void resetDeferrals()Reset deferrals counter. -
setPrerequisiteUri
Set a prerequisite for this URI. A prerequisite is a URI that must be crawled before this URI can be crawled.
- Parameters:
pre
- Link to set as prereq.
-
getPrerequisiteUri
Get the prerequisite for this URI. A prerequisite is a URI that must be crawled before this URI can be crawled.
- Returns:
- the prerequisite for this URI or null if no prerequisite.
-
clearPrerequisiteUri
Clear prerequisite, if any. -
hasPrerequisiteUri
public boolean hasPrerequisiteUri()- Returns:
- True if this CrawlURI has a prerequisite.
-
isPrerequisite
public boolean isPrerequisite()
Returns true if this CrawlURI is a prerequisite. TODO:FIXME: code elsewhere is confused about whether this means that this CrawlURI is a prerequisite for another, or *has* a prerequisite; clean up and rename as necessary.
- Returns:
- true if this CrawlURI is a prerequisite.
-
setPrerequisite
public void setPrerequisite(boolean prerequisite)Set if this CrawlURI is itself a prerequisite URI.- Parameters:
prerequisite
- True if this CrawlURI is itself a prerequisite URI.
-
getContentType
Get the content type of this URI.- Returns:
- Fetched URIs content type. May be null.
-
setContentType
Set a fetched uri's content type.- Parameters:
ct
- Contenttype.
-
setThreadNumber
public void setThreadNumber(int i)Set the number of the ToeThread responsible for processing this uri.- Parameters:
i
- the ToeThread number.
-
getThreadNumber
public int getThreadNumber()Get the number of the ToeThread responsible for processing this uri.- Returns:
- the ToeThread number.
-
incrementDeferrals
public void incrementDeferrals()Increment the deferral count. -
getDeferrals
public int getDeferrals()Get the deferral count.- Returns:
- the deferral count.
-
stripToMinimal
public void stripToMinimal()
Remove all attributes set on this URI. This method removes the attribute list.
-
getContentSize
public long getContentSize()Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers. It is the responsibility of the classes which fetch the URI to set this value accordingly -- it is not calculated/verified within CrawlURI. This value is consulted in reporting/logging/writing-decisions.- Returns:
- contentSize
- See Also:
setContentSize(long)
-
getAnnotations
Get the annotations set for this uri.- Returns:
- the annotations set for this uri.
-
getHopCount
public int getHopCount()Get total hops from seed.- Returns:
- int hops count
-
getEmbedHopCount
public int getEmbedHopCount()Get the embed hop count.- Returns:
- the embed hop count.
-
getLinkHopCount
public int getLinkHopCount()Get the link hop count.- Returns:
- the link hop count.
-
getUserAgent
Get the user agent to use for crawling this URI. If null the global setting should be used.- Returns:
- user agent or null
-
setUserAgent
Set the user agent to use when crawling this URI. If not set the global settings should be used.- Parameters:
string
- user agent to use
-
getContentLength
public long getContentLength()For completed HTTP transactions, the length of the content-body.- Returns:
- For completed HTTP transactions, the length of the content-body.
-
getRecordedSize
public long getRecordedSize()Get size of data recorded (transferred)- Returns:
- recorded data size
-
setContentSize
public void setContentSize(long l)
Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) and even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server). (In contrast, content-length matches the HTTP definition: that of the enclosed content-body.) Should be set by a fetcher or other processor as soon as the final size of recorded content is known. Setting it to an artificial/incorrect value may affect other reporting/processing. -
hasBeenLinkExtracted
public boolean hasBeenLinkExtracted()
If true, a link extractor has already claimed this CrawlURI and performed link extraction on the document content. This does not preclude other link extractors that may have an interest in this CrawlURI from also doing link extraction. There is an onus on link extractors to set this flag if they have run.
- Returns:
- True if a processor has performed link extraction on this CrawlURI
- See Also:
linkExtractorFinished()
-
linkExtractorFinished
public void linkExtractorFinished()
Note that link extraction has been performed on this CrawlURI. A processor doing link extraction should invoke this method once it has finished its work. It should invoke it even if no links are extracted, but only if the link extraction was performed on the document body (not the HTTP headers etc.).
- See Also:
hasBeenLinkExtracted()
-
aboutToLog
public void aboutToLog()Notify CrawlURI it is about to be logged; opportunity for self-annotation -
getRecorder
public org.archive.util.Recorder getRecorder()Get the http recorder associated with this uri.- Returns:
- Returns the httpRecorder. May be null, but it is set early in FetchHTTP, so there is a problem if it is null.
-
setRecorder
public void setRecorder(org.archive.util.Recorder httpRecorder)Set the http recorder to be associated with this uri.- Parameters:
httpRecorder
- The httpRecorder to set.
-
isHttpTransaction
public boolean isHttpTransaction()Return true if this is a http transaction.- Returns:
- True if this is a http transaction.
-
processingCleanup
public void processingCleanup()Clean up after a run through the processing chain. Called on the end of processing chain by Frontier#finish. Null out any state gathered during processing. -
getCredentials
- Returns:
- Credential avatars. Null if none set.
-
hasCredentials
public boolean hasCredentials()- Returns:
- True if there are avatars attached to this instance.
-
isSuccess
public boolean isSuccess()
Ask this URI if it was a success or not. Only makes sense to call this method after execution of HttpMethod#execute. Regard any status larger than 0 as success, except for the caveat below regarding 401s. Use is2XXSuccess() if looking for a status code in the 200 range.
401s caveat: if any RFC 2617 credential data is present and we got a 401, assume the credentials were loaded in FetchHTTP on the expectation that we're to go around the processing chain again. Report this condition as a failure so we get another crack at the processing chain, only this time we'll be making use of the loaded credential data.
- Returns:
- True if this URI has been successfully processed.
- See Also:
is2XXSuccess()
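The success rule described above (any positive status counts, except a 401 when RFC 2617 credential data is present) can be restated as a small sketch. This is an illustrative reimplementation of the documented rule, not the actual Heritrix code; the class and parameter names are made up for the example.

```java
public class SuccessRuleSketch {
    static final int HTTP_UNAUTHORIZED = 401;

    // Illustrative restatement of the documented isSuccess() rule:
    // any fetch status > 0 counts as success, except a 401 when RFC 2617
    // credential data is present (the crawler is expected to retry with
    // credentials, so that 401 is reported as a failure for this trip).
    static boolean isSuccess(int fetchStatus, boolean hasRfc2617Credential) {
        if (fetchStatus == HTTP_UNAUTHORIZED && hasRfc2617Credential) {
            return false;
        }
        return fetchStatus > 0;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(200, false)); // true
        System.out.println(isSuccess(401, true));  // false: retry with credentials first
        System.out.println(isSuccess(-6, false));  // false: negative failure codes
    }
}
```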
-
is2XXSuccess
public boolean is2XXSuccess()- Returns:
- True if status code is in the 2xx range.
- See Also:
isSuccess()
-
hasRfc2617Credential
public boolean hasRfc2617Credential()- Returns:
- True if we have an rfc2617 payload.
-
setContentDigest
public void setContentDigest(byte[] digestValue)
Deprecated. Set the retained content-digest value (usually SHA-1).
- Parameters:
digestValue
-
-
setContentDigest
-
getContentDigestSchemeString
-
getContentDigest
public byte[] getContentDigest()Return the retained content-digest value, if any.- Returns:
- Digest value.
-
getContentDigestString
-
setHolder
Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI .- Parameters:
obj
-
-
getHolder
Return the 'holder' for the convenience of an external facility.- Returns:
- holder
-
setHolderKey
Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI .- Parameters:
obj
-
-
getHolderKey
Return the 'holderKey' for convenience of an external facility (Frontier).- Returns:
- holderKey
-
getOrdinal
public long getOrdinal()Get the ordinal (serial number) assigned at creation.- Returns:
- ordinal
-
setOrdinal
public void setOrdinal(long o) -
getHolderCost
public int getHolderCost()Return the 'holderCost' for convenience of external facility (frontier)- Returns:
- value of holderCost
-
setHolderCost
public void setHolderCost(int cost)Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI- Parameters:
cost
- value to remember
-
getOutLinks
Returns discovered links. The returned collection might be empty if no links were discovered, or if something like LinksScoper promoted the links to CrawlURIs.- Returns:
- Collection of all discovered outbound links
-
setBaseURI
Set the (HTML) Base URI used for derelativizing internal URIs.- Parameters:
baseHref
- String base href to use- Throws:
org.apache.commons.httpclient.URIException
- if supplied string cannot be interpreted as URI
-
getBaseURI
public org.archive.net.UURI getBaseURI()Get the (HTML) Base URI used for derelativizing internal URIs.- Returns:
- UURI base URI previously set
-
readUuri
Read a UURI from a String, handling a null or URIException- Parameters:
u
- String or null from which to create UURI- Returns:
- the best UURI instance creatable
-
getServerIP
Returns the IP address the request was fetched against or null if unavailable. -
getFetchBeginTime
public long getFetchBeginTime() -
getFetchCompletedTime
public long getFetchCompletedTime() -
getFetchDuration
public long getFetchDuration() -
getFetchType
-
getNonFatalFailures
-
setServerIP
-
setError
-
setFetchBeginTime
public void setFetchBeginTime(long time) -
setFetchCompletedTime
public void setFetchCompletedTime(long time) -
setFetchType
-
setForceRetire
public void setForceRetire(boolean b) -
setBaseURI
public void setBaseURI(org.archive.net.UURI base) -
getData
-
getDataList
Convenience method: return (creating if necessary) list at given data key- Parameters:
key
-- Returns:
- List
-
setSeed
public void setSeed(boolean b)Set the isSeed attribute of this URI.- Parameters:
b
- Is this URI a seed, true or false.
-
isSeed
public boolean isSeed()- Returns:
- Whether seeded.
-
getUURI
public org.archive.net.UURI getUURI()- Returns:
- UURI
-
getURI
- Returns:
- String of URI
-
getPathFromSeed
- Returns:
- path (hop-types) from seed
-
getLastHop
convenience access to last hop character, as string -
getVia
public org.archive.net.UURI getVia()- Returns:
- URI via which this one was discovered
-
setVia
public void setVia(org.archive.net.UURI via) -
getViaContext
- Returns:
- CharSequence context in which this one was discovered
-
isLocation
public boolean isLocation()- Returns:
- True if this CrawlURI was the result of a redirect: i.e. its parent URI redirected here, and this URI is what was in the 'Location:' or 'Content-Location:' HTTP header.
-
shortReportLine
-
shortReportMap
- Specified by:
shortReportMap
in interfaceorg.archive.util.Reporter
-
shortReportLineTo
- Specified by:
shortReportLineTo
in interfaceorg.archive.util.Reporter
-
shortReportLegend
- Specified by:
shortReportLegend
in interfaceorg.archive.util.Reporter
-
reportTo
- Specified by:
reportTo
in interfaceorg.archive.util.Reporter
- Throws:
IOException
-
flattenVia
Method returns string version of this URI's referral URI.- Returns:
- String version of referral URI
-
getSourceTag
-
setSourceTag
-
makeHeritable
Make the given key 'heritable', meaning its value will be added to descendant CrawlURIs. Only keys with immutable values should be made heritable -- the value instance may be shared until the data map is serialized/deserialized.- Parameters:
key
- to make heritable
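A simplified model of the heritability contract described above: heritable keys (and only those) are copied to descendants by inheritFrom. This sketch assumes a separate heritable-key set for clarity; Heritrix's actual storage may differ, and all names here are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HeritabilitySketch {
    final Map<String, Object> data = new HashMap<>();
    private final Set<String> heritableKeys = new HashSet<>();

    void makeHeritable(String key) { heritableKeys.add(key); }
    void makeNonHeritable(String key) { heritableKeys.remove(key); }

    // Mirrors the described inheritFrom(CrawlURI): copy only heritable
    // key-values from the ancestor. Note the value instance is shared,
    // which is why only immutable values should be made heritable.
    void inheritFrom(HeritabilitySketch ancestor) {
        for (String key : ancestor.heritableKeys) {
            heritableKeys.add(key);
            if (ancestor.data.containsKey(key)) {
                data.put(key, ancestor.data.get(key));
            }
        }
    }

    public static void main(String[] args) {
        HeritabilitySketch parent = new HeritabilitySketch();
        parent.data.put("sourceTag", "seed-list-1");
        parent.data.put("scratch", "not inherited");
        parent.makeHeritable("sourceTag");

        HeritabilitySketch child = new HeritabilitySketch();
        child.inheritFrom(parent);
        System.out.println(child.data.get("sourceTag"));      // seed-list-1
        System.out.println(child.data.containsKey("scratch")); // false
    }
}
```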
-
makeNonHeritable
Make the given key non-'heritable', meaning its value will not be added to descendant CrawlURIs. Only meaningful if key was previously made heritable.- Parameters:
key
- to make non-heritable
-
getClassKey
Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.- Returns:
- Token (usually the hostname) which indicates what "class" this CrawlURI should be grouped with.
-
setClassKey
-
forceFetch
public boolean forceFetch()If this method returns true, this URI should be fetched even though it already has been crawled. This also implies that this URI will be scheduled for crawl before any other waiting URIs for the same host. This value is used to refetch any expired robots.txt or dns-lookups.- Returns:
- true if crawling of this URI should be forced
-
setForceFetch
public void setForceFetch(boolean b)Method to signal that this URI should be fetched even though it already has been crawled. Setting this to true also implies that this URI will be scheduled for crawl before any other waiting URIs for the same host. This value is used to refetch any expired robots.txt or dns-lookups.- Parameters:
b
- set to true to enforce the crawling of this URI
-
getTransHops
public int getTransHops()
Tally up the number of transitive (non-simple-link) hops at the end of this CrawlURI's pathFromSeed. In some cases, URIs with greater than zero but fewer than some threshold of such hops are treated specially.
TODO: consider moving link-count in here as well, caching the calculation, and refactoring CrawlScope.exceedsMaxHops() to use this.
- Returns:
- Transhop count.
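The tally described above amounts to counting trailing non-link hop symbols in the pathFromSeed. An illustrative sketch follows; the hop characters ('L' link, 'E' embed, 'R' redirect, 'P' prerequisite, 'X' speculative, 'I' inferred) are assumed from Heritrix's conventions, and this is a sketch of the idea rather than the exact implementation.

```java
public class TransHopsSketch {
    // Count hop-type characters at the end of the pathFromSeed that are
    // not plain navigation links ('L'); those trailing non-'L' symbols
    // are the "transitive" hops the documentation describes.
    static int getTransHops(String pathFromSeed) {
        int transHops = 0;
        for (int i = pathFromSeed.length() - 1; i >= 0; i--) {
            if (pathFromSeed.charAt(i) == 'L') {
                break; // a simple link ends the trailing transitive run
            }
            transHops++;
        }
        return transHops;
    }

    public static void main(String[] args) {
        System.out.println(getTransHops("LLL"));  // 0: simple links only
        System.out.println(getTransHops("LLRE")); // 2: redirect then embed at the end
    }
}
```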
-
inheritFrom
Inherit (copy) the relevant keys-values from the ancestor.- Parameters:
ancestor
-
-
createCrawlURI
public CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop) throws org.apache.commons.httpclient.URIException
Utility method for creating CrawlURIs found as outlinks while extracting links from this CrawlURI. Any relative URIs will be treated as relative to this CrawlURI's UURI.
- Parameters:
destination
- The new URI, possibly a relative URIcontext
-hop
-- Returns:
- New CrawlURI with the current CrawlURI set as the one it inherits from
- Throws:
org.apache.commons.httpclient.URIException
-
createCrawlURI
public CrawlURI createCrawlURI(String destination, LinkContext context, Hop hop) throws org.apache.commons.httpclient.URIException- Throws:
org.apache.commons.httpclient.URIException
-
extendHopsPath
Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols), keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED. For longer hops paths, precede the string with an integer and '+', then the displayed hops.
- Parameters:
pathFromSeed
-hopChar
-
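A simplified sketch of the truncation idea, assuming MAX_HOPS_DISPLAYED = 50 purely for illustration (the real method also folds an existing "N+" prefix into the count, which this sketch omits):

```java
public class HopsPathSketch {
    static final int MAX_HOPS_DISPLAYED = 50; // assumed value for illustration

    // Append the new hop-type symbol; once the path exceeds
    // MAX_HOPS_DISPLAYED symbols, show only the most recent symbols,
    // preceded by a count of the elided ones and '+'.
    static String extendHopsPath(String pathFromSeed, char hopChar) {
        String extended = pathFromSeed + hopChar;
        if (extended.length() <= MAX_HOPS_DISPLAYED) {
            return extended;
        }
        int elided = extended.length() - MAX_HOPS_DISPLAYED;
        return elided + "+" + extended.substring(elided);
    }

    public static void main(String[] args) {
        System.out.println(extendHopsPath("LLE", 'R')); // LLER
    }
}
```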
-
createCrawlURI
public CrawlURI createCrawlURI(org.archive.net.UURI destination, LinkContext context, Hop hop, int scheduling, boolean seed) throws org.apache.commons.httpclient.URIExceptionUtility method for creation of CrawlURIs found extracting links from this CrawlURI.- Throws:
org.apache.commons.httpclient.URIException
-
toString
-
incrementDiscardedOutLinks
public void incrementDiscardedOutLinks() -
getPrecedence
public int getPrecedence()- Returns:
- the precedence
-
setPrecedence
public void setPrecedence(int precedence)- Parameters:
precedence
- the precedence to set
-
getPolicyBasisUURI
public org.archive.net.UURI getPolicyBasisUURI()Get the UURI that should be used as the basis of policy/overlay decisions. In the case of prerequisites, this is the URI that triggered the prerequisite -- the 'via' -- so that the prerequisite lands in the same queue, with the same overlay values, as the triggering URI.- Returns:
- UURI to use for policy decisions
-
haveOverlayNamesBeenSet
public boolean haveOverlayNamesBeenSet()- Specified by:
haveOverlayNamesBeenSet
in interfaceorg.archive.spring.OverlayContext
-
getOverlayNames
- Specified by:
getOverlayNames
in interfaceorg.archive.spring.OverlayContext
-
getOverlayMap
- Specified by:
getOverlayMap
in interfaceorg.archive.spring.OverlayContext
-
setOverlayMapsSource
public void setOverlayMapsSource(org.archive.spring.OverlayMapsSource overrideMapsSource) -
setCanonicalString
-
getCanonicalString
-
setPolitenessDelay
public void setPolitenessDelay(long polite) -
getPolitenessDelay
public long getPolitenessDelay() -
setFullVia
-
getFullVia
-
setRescheduleTime
public void setRescheduleTime(long time) -
getRescheduleTime
public long getRescheduleTime() -
resetForRescheduling
public void resetForRescheduling()
Reset state that should not persist when a URI is rescheduled for a specific future time. -
includesRetireDirective
public boolean includesRetireDirective() -
getExtraInfo
public org.json.JSONObject getExtraInfo() -
addExtraInfo
-
autoregisterTo
public static void autoregisterTo(org.archive.bdb.AutoKryo kryo) -
markPrerequisite
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
- Returns:
- the newly created prerequisite CrawlURI
- Throws:
org.apache.commons.httpclient.URIException
-
containsContentTypeCharsetDeclaration
public boolean containsContentTypeCharsetDeclaration() -
getHttpResponseHeader
- Parameters:
key
- http response header key (case-insensitive)- Returns:
- value of the header or null if there is no such header
- Since:
- 3.3.0
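The case-insensitive key behavior documented above can be modeled with a TreeMap ordered by String.CASE_INSENSITIVE_ORDER. This is a sketch of the documented contract, not Heritrix's internal storage.

```java
import java.util.Map;
import java.util.TreeMap;

public class HeaderStoreSketch {
    // Case-insensitive header storage, matching the documented contract
    // that getHttpResponseHeader(String) treats its key case-insensitively.
    private final Map<String, String> headers =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void putHttpResponseHeader(String key, String value) {
        headers.put(key, value);
    }

    public String getHttpResponseHeader(String key) {
        return headers.get(key); // null if there is no such header
    }

    public static void main(String[] args) {
        HeaderStoreSketch curi = new HeaderStoreSketch();
        curi.putHttpResponseHeader("Content-Type", "text/html; charset=UTF-8");
        System.out.println(curi.getHttpResponseHeader("content-type")); // text/html; charset=UTF-8
        System.out.println(curi.getHttpResponseHeader("ETag"));         // null
    }
}
```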
-
putHttpResponseHeader
- Since:
- 3.3.0
-
getHttpAuthChallenges
-
setHttpAuthChallenges
-
getFetchHistory
-
setFetchHistory
-
getContentDigestHistory
-
hasContentDigestHistory
public boolean hasContentDigestHistory() -
isRevisit
public boolean isRevisit()Indicates if this CrawlURI object has been deemed a revisit. -
getRevisitProfile
-
setRevisitProfile
-
compareTo
- Specified by:
compareTo
in interfaceComparable<CrawlURI>
-
hashCode
public int hashCode() -
equals
-