public class CrawlURI extends Object implements Reporter, Serializable, OverlayContext, Comparable<CrawlURI>
Core state is held in instance variables, but a flexible
attribute list is also available. Use this 'bucket' to carry
custom, processor-extracted data and state across CrawlURI
processing. See getData(), etc.
Note: getHttpMethod() has been removed starting with Heritrix 3.3.0. HTTP
response headers are available using getHttpResponseHeader(String).
(HTTP fetchers are responsible for setting the values using
putHttpResponseHeader(String, String).)
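The put/get pair in the note above, together with the parameter description of getHttpResponseHeader(String) on this page (which calls the key case-insensitive), can be illustrated with a minimal sketch. `HeaderStoreSketch` is a hypothetical stand-in class, not Heritrix code, and the `TreeMap` with `String.CASE_INSENSITIVE_ORDER` is only an assumption about one way to obtain case-insensitive lookup.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for illustration; not the Heritrix implementation.
class HeaderStoreSketch {
    // Assumption: a comparator-ordered map makes "Content-Type" and
    // "content-type" resolve to the same entry.
    private final Map<String, String> headers =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Called by a fetcher once the response headers are known.
    void putHttpResponseHeader(String key, String value) {
        headers.put(key, value);
    }

    // Called by later processors; returns null when the header was never set.
    String getHttpResponseHeader(String key) {
        return headers.get(key);
    }

    public static void main(String[] args) {
        HeaderStoreSketch store = new HeaderStoreSketch();
        store.putHttpResponseHeader("Content-Type", "text/html");
        System.out.println(store.getHttpResponseHeader("content-type"));
    }
}
```

In this sketch a fetcher writes each header once and any later processor can read it without caring about the capitalisation the server used.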
Modifier and Type | Class and Description |
---|---|
static class |
CrawlURI.FetchType |
Modifier and Type | Field and Description |
---|---|
static String |
A_FETCH_HISTORY
fetch history array
|
protected String |
canonicalString |
protected Map<String,Object> |
data
Flexible dynamic attributes list.
|
protected org.json.JSONObject |
extraInfo |
protected CrawlURI |
fullVia |
protected Object |
holder |
protected int |
holderCost
spot for an integer cost to be placed by external facility (frontier).
|
protected Object |
holderKey |
protected long |
ordinal
Monotonically increasing number within a crawl;
useful for tending towards breadth-first ordering.
|
protected Collection<CrawlURI> |
outLinks
All discovered outbound urls as CrawlURIs (navlinks, embeds, etc.)
|
protected OverlayMapsSource |
overlayMapsSource |
protected ArrayList<String> |
overlayNames |
protected long |
politenessDelay |
protected long |
rescheduleTime
A future time at which this CrawlURI should be reenqueued.
|
static int |
UNCALCULATED |
Constructor and Description |
---|
CrawlURI(UURI uuri)
Create a new instance of CrawlURI from a UURI. |
CrawlURI(UURI u,
String pathFromSeed,
UURI via,
LinkContext viaContext) |
Modifier and Type | Method and Description |
---|---|
void |
aboutToLog()
Notify CrawlURI it is about to be logged; opportunity
for self-annotation
|
void |
addExtraInfo(String key,
Object value) |
static void |
autoregisterTo(AutoKryo kryo) |
CrawlURI |
clearPrerequisiteUri()
Clear prerequisite, if any.
|
int |
compareTo(CrawlURI o) |
boolean |
containsContentTypeCharsetDeclaration() |
boolean |
containsDataKey(String key) |
CrawlURI |
createCrawlURI(String destination,
LinkContext context,
Hop hop) |
CrawlURI |
createCrawlURI(UURI destination,
LinkContext context,
Hop hop)
Utility method for creating CrawlURIs that were found as out links from this CrawlURI.
|
CrawlURI |
createCrawlURI(UURI destination,
LinkContext context,
Hop hop,
int scheduling,
boolean seed)
Utility method for creation of CrawlURIs found extracting
links from this CrawlURI.
|
boolean |
equals(Object o) |
static String |
extendHopsPath(String pathFromSeed,
char hopChar)
Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols),
keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED.
|
static String |
fetchStatusCodesToString(int code)
Takes a status code and converts it into a human readable string.
|
String |
flattenVia()
Method returns string version of this URI's referral URI.
|
boolean |
forceFetch()
If this method returns true, this URI should be fetched even though
it already has been crawled.
|
static CrawlURI |
fromHopsViaString(String uriHopsViaContext) |
Collection<String> |
getAnnotations()
Get the annotations set for this uri.
|
UURI |
getBaseURI()
Get the (HTML) Base URI used for derelativizing internal URIs.
|
String |
getCanonicalString() |
String |
getClassKey()
Get the token (usually the hostname + port) which indicates
what "class" this CrawlURI should be grouped with,
for the purposes of ensuring only one item of the
class is processed at once, all items of the class
are held for a politeness period, etc.
|
byte[] |
getContentDigest()
Return the retained content-digest value, if any.
|
HashMap<String,Object> |
getContentDigestHistory() |
String |
getContentDigestSchemeString() |
String |
getContentDigestString() |
long |
getContentLength()
For completed HTTP transactions, the length of the content-body.
|
long |
getContentSize()
Get the size in bytes of this URI's recorded content, inclusive
of things like protocol headers.
|
String |
getContentType()
Get the content type of this URI.
|
Set<Credential> |
getCredentials() |
Map<String,Object> |
getData() |
List<Object> |
getDataList(String key)
Convenience method: return (creating if necessary) list at
given data key
|
int |
getDeferrals()
Get the deferral count.
|
int |
getEmbedHopCount()
Get the embed hop count.
|
org.json.JSONObject |
getExtraInfo() |
int |
getFetchAttempts()
Get the count of attempts (trips through the processing
loop) at getting the document referenced by this URI.
|
long |
getFetchBeginTime() |
long |
getFetchCompletedTime() |
long |
getFetchDuration() |
HashMap<String,Object>[] |
getFetchHistory() |
int |
getFetchStatus()
Return the overall/fetch status of this CrawlURI for its
current trip through the processing loop.
|
CrawlURI.FetchType |
getFetchType() |
CrawlURI |
getFullVia() |
Object |
getHolder()
Return the 'holder' for the convenience of
an external facility.
|
int |
getHolderCost()
Return the 'holderCost' for convenience of external facility (frontier)
|
Object |
getHolderKey()
Return the 'holderKey' for convenience of
an external facility (Frontier).
|
int |
getHopCount()
Get total hops from seed.
|
Map<String,String> |
getHttpAuthChallenges() |
String |
getHttpResponseHeader(String key) |
String |
getLastHop()
convenience access to last hop character, as string
|
int |
getLinkHopCount()
Get the link hop count.
|
Collection<Throwable> |
getNonFatalFailures() |
long |
getOrdinal()
Get the ordinal (serial number) assigned at creation.
|
Collection<CrawlURI> |
getOutLinks()
Returns discovered links.
|
Map<String,Object> |
getOverlayMap(String name) |
ArrayList<String> |
getOverlayNames() |
String |
getPathFromSeed() |
UURI |
getPolicyBasisUURI()
Get the UURI that should be used as the basis of policy/overlay
decisions.
|
long |
getPolitenessDelay() |
int |
getPrecedence() |
CrawlURI |
getPrerequisiteUri()
Get the prerequisite for this URI.
|
long |
getRecordedSize()
Get size of data recorded (transferred)
|
Recorder |
getRecorder()
Get the http recorder associated with this uri.
|
long |
getRescheduleTime() |
RevisitProfile |
getRevisitProfile() |
int |
getSchedulingDirective() |
String |
getServerIP()
Returns the IP address the request was fetched against or null if unavailable.
|
String |
getSourceTag() |
int |
getThreadNumber()
Get the number of the ToeThread responsible for processing this uri.
|
int |
getTransHops()
Tally up the number of transitive (non-simple-link) hops at
the end of this CrawlURI's pathFromSeed.
|
String |
getURI() |
String |
getUserAgent()
Get the user agent to use for crawling this URI.
|
UURI |
getUURI() |
UURI |
getVia() |
LinkContext |
getViaContext() |
boolean |
hasBeenLinkExtracted()
If true then a link extractor has already claimed this CrawlURI and
performed link extraction on the document content.
|
boolean |
hasContentDigestHistory() |
boolean |
hasCredentials() |
int |
hashCode() |
boolean |
hasPrerequisiteUri() |
boolean |
hasRfc2617Credential() |
boolean |
haveOverlayNamesBeenSet() |
boolean |
includesRetireDirective() |
void |
incrementDeferrals()
Increment the deferral count.
|
void |
incrementDiscardedOutLinks() |
void |
incrementFetchAttempts()
Increment the count of attempts (trips through the processing
loop) at getting the document referenced by this URI.
|
protected void |
inheritFrom(CrawlURI ancestor)
Inherit (copy) the relevant keys-values from the ancestor.
|
boolean |
is2XXSuccess() |
boolean |
isHttpTransaction()
Return true if this is a http transaction.
|
boolean |
isLocation() |
boolean |
isPrerequisite()
Returns true if this CrawlURI is a prerequisite.
|
boolean |
isRevisit()
Indicates if this CrawlURI object has been deemed a revisit.
|
boolean |
isSeed() |
boolean |
isSuccess()
Ask this URI if it was a success or not.
|
void |
linkExtractorFinished()
Note that link extraction has been performed on this CrawlURI.
|
void |
makeHeritable(String key)
Make the given key 'heritable', meaning its value will be
added to descendant CrawlURIs.
|
void |
makeNonHeritable(String key)
Make the given key non-'heritable', meaning its value will
not be added to descendant CrawlURIs.
|
CrawlURI |
markPrerequisite(String preq)
Do all actions associated with setting a CrawlURI as requiring a prerequisite. |
void |
processingCleanup()
Clean up after a run through the processing chain.
|
void |
putHttpResponseHeader(String key,
String value) |
protected UURI |
readUuri(String u)
Read a UURI from a String, handling a null or URIException
|
void |
reportTo(PrintWriter writer) |
void |
resetDeferrals()
Reset deferrals counter.
|
void |
resetFetchAttempts()
Reset fetchAttempts counter.
|
void |
resetForRescheduling()
Reset state that should not persist when a URI is
rescheduled for a specific future time.
|
void |
setBaseURI(String baseHref)
Set the (HTML) Base URI used for derelativizing internal URIs.
|
void |
setBaseURI(UURI base) |
void |
setCanonicalString(String canonical) |
void |
setClassKey(String key) |
void |
setContentDigest(byte[] digestValue)
Deprecated.
|
void |
setContentDigest(String scheme,
byte[] digestValue) |
void |
setContentSize(long l)
Sets the 'content size' for the URI, which is considered inclusive
of all recorded material (such as protocol headers) or even material
'virtually' considered (as in material from a previous fetch
confirmed unchanged with a server).
|
void |
setContentType(String ct)
Set a fetched uri's content type.
|
void |
setError(String msg) |
void |
setFetchBeginTime(long time) |
void |
setFetchCompletedTime(long time) |
void |
setFetchHistory(Map<String,Object>[] history) |
void |
setFetchStatus(int newstatus)
Set the overall/fetch status of this CrawlURI for
its current trip through the processing loop.
|
void |
setFetchType(CrawlURI.FetchType type) |
void |
setForceFetch(boolean b)
Method to signal that this URI should be fetched even though
it already has been crawled.
|
void |
setForceRetire(boolean b) |
void |
setFullVia(CrawlURI curi) |
void |
setHolder(Object obj)
Remember a 'holder' to which some enclosing/queueing
facility has assigned this CrawlURI.
|
void |
setHolderCost(int cost)
Remember a 'holderCost' which some enclosing/queueing
facility has assigned this CrawlURI
|
void |
setHolderKey(Object obj)
Remember a 'holderKey' which some enclosing/queueing
facility has assigned this CrawlURI.
|
void |
setHttpAuthChallenges(Map<String,String> httpAuthChallenges) |
void |
setOrdinal(long o) |
void |
setOverlayMapsSource(OverlayMapsSource overrideMapsSource) |
void |
setPolitenessDelay(long polite) |
void |
setPrecedence(int precedence) |
void |
setPrerequisite(boolean prerequisite)
Set if this CrawlURI is itself a prerequisite URI.
|
void |
setPrerequisiteUri(CrawlURI pre)
Set a prerequisite for this URI.
|
void |
setRecorder(Recorder httpRecorder)
Set the http recorder to be associated with this uri.
|
void |
setRescheduleTime(long time) |
void |
setRevisitProfile(RevisitProfile revisitProfile) |
void |
setSchedulingDirective(int priority) |
void |
setSeed(boolean b)
Set the isSeed attribute of this URI.
|
void |
setServerIP(String serverIP) |
void |
setSourceTag(String sourceTag) |
void |
setThreadNumber(int i)
Set the number of the ToeThread responsible for processing this uri.
|
void |
setUserAgent(String string)
Set the user agent to use when crawling this URI.
|
void |
setVia(UURI via) |
String |
shortReportLegend() |
String |
shortReportLine() |
void |
shortReportLineTo(PrintWriter w) |
Map<String,Object> |
shortReportMap() |
void |
stripToMinimal()
Remove all attributes set on this uri.
|
String |
toString() |
public static final int UNCALCULATED
public static final String A_FETCH_HISTORY
protected Map<String,Object> data
The attribute list is a flexible map of key/value pairs for storing
status of this URI for use by other processors. By convention the
attribute list is keyed by constants found in the
CoreAttributeConstants interface. Use this list to carry
data or state produced by custom processors rather than changing
the CrawlURI class itself.
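As an illustration of the attribute-list idea, the sketch below models the 'bucket' as a plain map and shows the "return, creating if necessary" behaviour that the method summary gives for getDataList(String). `AttributeBucketSketch` and the key `"my-processor.notes"` are hypothetical names invented for this example; the code is a stand-in under those assumptions, not the Heritrix implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the attribute 'bucket'; not Heritrix code.
class AttributeBucketSketch {
    private final Map<String, Object> data = new HashMap<>();

    Map<String, Object> getData() {
        return data;
    }

    boolean containsDataKey(String key) {
        return data.containsKey(key);
    }

    // Return the list stored under key, creating and storing an empty
    // list first if the key is absent. Throws ClassCastException if a
    // non-list value is already stored under the key.
    @SuppressWarnings("unchecked")
    List<Object> getDataList(String key) {
        return (List<Object>) data.computeIfAbsent(key, k -> new ArrayList<Object>());
    }

    public static void main(String[] args) {
        AttributeBucketSketch bucket = new AttributeBucketSketch();
        // A custom processor records state without changing the class.
        bucket.getDataList("my-processor.notes").add("saw a login form");
        System.out.println(bucket.getData());
    }
}
```

A later processor can read the same key back from the map, which is the point of carrying state in the bucket instead of adding fields.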
protected long ordinal
protected transient Object holder
protected transient Object holderKey
protected int holderCost
protected transient Collection<CrawlURI> outLinks
protected transient OverlayMapsSource overlayMapsSource
protected String canonicalString
protected long politenessDelay
protected transient CrawlURI fullVia
protected long rescheduleTime
protected org.json.JSONObject extraInfo
public CrawlURI(UURI uuri)
Create a new instance of CrawlURI from a UURI.
uuri
- the UURI to base this CrawlURI on.
public CrawlURI(UURI u, String pathFromSeed, UURI via, LinkContext viaContext)
u
- uuri instance this CrawlURI wraps.
pathFromSeed
-
via
-
viaContext
-
public static CrawlURI fromHopsViaString(String uriHopsViaContext) throws org.apache.commons.httpclient.URIException
org.apache.commons.httpclient.URIException
public int getSchedulingDirective()
public void setSchedulingDirective(int priority)
priority
- The schedulingDirective to set.
public boolean containsDataKey(String key)
public static String fetchStatusCodesToString(int code)
code
- the status code
public int getFetchStatus()
public void setFetchStatus(int newstatus)
newstatus
- a value from FetchStatusCodes
public int getFetchAttempts()
public void incrementFetchAttempts()
public void resetFetchAttempts()
public void resetDeferrals()
public void setPrerequisiteUri(CrawlURI pre)
A prerequisite is a URI that must be crawled before this URI can be crawled.
pre
- Link to set as prereq.
public CrawlURI getPrerequisiteUri()
A prerequisite is a URI that must be crawled before this URI can be crawled.
public CrawlURI clearPrerequisiteUri()
public boolean hasPrerequisiteUri()
public boolean isPrerequisite()
public void setPrerequisite(boolean prerequisite)
prerequisite
- True if this CrawlURI is itself a prerequisite URI.
public String getContentType()
public void setContentType(String ct)
ct
- Content type.
public void setThreadNumber(int i)
i
- the ToeThread number.
public int getThreadNumber()
public void incrementDeferrals()
public int getDeferrals()
public void stripToMinimal()
This method removes the attribute list.
public long getContentSize()
setContentSize(long)
public Collection<String> getAnnotations()
public int getHopCount()
public int getEmbedHopCount()
public int getLinkHopCount()
public String getUserAgent()
public void setUserAgent(String string)
string
- user agent to use
public long getContentLength()
public long getRecordedSize()
public void setContentSize(long l)
public boolean hasBeenLinkExtracted()
There is an onus on link extractors to set this flag if they have run.
linkExtractorFinished()
public void linkExtractorFinished()
hasBeenLinkExtracted()
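The contract between hasBeenLinkExtracted() and linkExtractorFinished() can be sketched as follows: an extractor checks the flag before doing any work and sets it once it has run. `ExtractionFlagSketch`, its nested `Uri` class, and `runExtractor` are hypothetical stand-ins invented for this example, not Heritrix classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in illustrating the flag contract; not Heritrix code.
class ExtractionFlagSketch {
    // Models only the part of a CrawlURI that tracks link extraction.
    static class Uri {
        private boolean linkExtracted = false;
        final List<String> outLinks = new ArrayList<>();

        boolean hasBeenLinkExtracted() {
            return linkExtracted;
        }

        void linkExtractorFinished() {
            linkExtracted = true;
        }
    }

    // A hypothetical extractor: skip URIs another extractor already claimed,
    // otherwise record the links it found and set the flag.
    static boolean runExtractor(Uri uri, List<String> found) {
        if (uri.hasBeenLinkExtracted()) {
            return false; // already handled, do nothing
        }
        uri.outLinks.addAll(found);
        uri.linkExtractorFinished(); // the onus is on the extractor to set this
        return true;
    }

    public static void main(String[] args) {
        Uri uri = new Uri();
        System.out.println(runExtractor(uri, List.of("http://example.com/a")));
        System.out.println(runExtractor(uri, List.of("http://example.com/b")));
    }
}
```

If an extractor forgets the final call, a later extractor in the chain would process the same content again, which is why the flag has to be set by whoever ran.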
public void aboutToLog()
public Recorder getRecorder()
public void setRecorder(Recorder httpRecorder)
httpRecorder
- The httpRecorder to set.
public boolean isHttpTransaction()
public void processingCleanup()
public Set<Credential> getCredentials()
public boolean hasCredentials()
public boolean isSuccess()
Use is2XXSuccess() if looking for a status code in the 200 range.
401s caveat: if any RFC 2617 credential data is present and we got a 401, assume the credential was loaded in FetchHTTP on the expectation that the URI will go around the processing chain again. This condition is reported as a failure so that the URI gets another pass through the processing chain, this time making use of the loaded credential data.
is2XXSuccess()
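The difference between the two checks, including the 401 caveat, can be sketched as below. `FetchOutcomeSketch` is a hypothetical stand-in, not Heritrix code. The rule that any positive status counts as a success is an assumption made for this example (it is not stated in the text above); only the 200-range test and the 401 caveat come from this page.

```java
// Hypothetical stand-in illustrating the two success checks; not Heritrix code.
class FetchOutcomeSketch {
    // True only for status codes in the 200 range.
    static boolean is2XXSuccess(int status) {
        return status >= 200 && status < 300;
    }

    // ASSUMPTION: any positive status counts as success.
    // Caveat from the documentation: a 401 received while RFC 2617
    // credential data is loaded is reported as a failure, so the URI
    // goes through the processing chain again with the credential.
    static boolean isSuccess(int status, boolean hasRfc2617Credential) {
        if (status <= 0) {
            return false;
        }
        if (status == 401 && hasRfc2617Credential) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(404, false) + " " + is2XXSuccess(404));
        System.out.println(isSuccess(401, true) + " " + isSuccess(401, false));
    }
}
```

Under these assumptions a 404 is a "success" in the sense that the fetch completed, while is2XXSuccess stays false for it.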
public boolean is2XXSuccess()
isSuccess()
public boolean hasRfc2617Credential()
public void setContentDigest(byte[] digestValue)
Deprecated. Use setContentDigest(String, byte[]) instead.
digestValue
-
public void setContentDigest(String scheme, byte[] digestValue)
public String getContentDigestSchemeString()
public byte[] getContentDigest()
public String getContentDigestString()
public void setHolder(Object obj)
obj
-
public Object getHolder()
public void setHolderKey(Object obj)
obj
-
public Object getHolderKey()
public long getOrdinal()
public void setOrdinal(long o)
public int getHolderCost()
public void setHolderCost(int cost)
cost
- value to remember
public Collection<CrawlURI> getOutLinks()
public void setBaseURI(String baseHref) throws org.apache.commons.httpclient.URIException
baseHref
- String base href to use
org.apache.commons.httpclient.URIException
- if supplied string cannot be interpreted as URI
public UURI getBaseURI()
protected UURI readUuri(String u)
u
- String or null from which to create UURI
public String getServerIP()
public long getFetchBeginTime()
public long getFetchCompletedTime()
public long getFetchDuration()
public CrawlURI.FetchType getFetchType()
public Collection<Throwable> getNonFatalFailures()
public void setServerIP(String serverIP)
public void setError(String msg)
public void setFetchBeginTime(long time)
public void setFetchCompletedTime(long time)
public void setFetchType(CrawlURI.FetchType type)
public void setForceRetire(boolean b)
public void setBaseURI(UURI base)
public List<Object> getDataList(String key)
key
-
public void setSeed(boolean b)
b
- Is this URI a seed, true or false.
public boolean isSeed()
public UURI getUURI()
public String getURI()
public String getPathFromSeed()
public String getLastHop()
public UURI getVia()
public void setVia(UURI via)
public LinkContext getViaContext()
public boolean isLocation()
public String shortReportLine()
public Map<String,Object> shortReportMap()
shortReportMap
in interface Reporter
public void shortReportLineTo(PrintWriter w)
shortReportLineTo
in interface Reporter
public String shortReportLegend()
shortReportLegend
in interface Reporter
public void reportTo(PrintWriter writer) throws IOException
reportTo
in interface Reporter
IOException
public String flattenVia()
public String getSourceTag()
public void setSourceTag(String sourceTag)
public void makeHeritable(String key)
key
- to make heritable
public void makeNonHeritable(String key)
key
- to make non-heritable
public String getClassKey()
public void setClassKey(String key)
public boolean forceFetch()
public void setForceFetch(boolean b)
b
- set to true to enforce the crawling of this URI
public int getTransHops()
TODO: consider moving link-count in here as well, caching calculation, and refactoring CrawlScope.exceedsMaxHops() to use this.
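The tally that the method summary describes for getTransHops() (counting the non-simple-link hops at the end of pathFromSeed) can be sketched as a backwards scan over the hops string. `TransHopsSketch` is a hypothetical stand-in, and the use of `'L'` as the simple-link symbol is an assumption made for this example, not something this page states.

```java
// Hypothetical stand-in illustrating the trailing-hop tally; not Heritrix code.
class TransHopsSketch {
    // ASSUMPTION: 'L' is the hop symbol for a simple navigational link.
    static final char SIMPLE_LINK = 'L';

    // Count hop symbols from the end of the path until a simple link
    // (or the start of the string) is reached.
    static int transHops(String pathFromSeed) {
        int count = 0;
        for (int i = pathFromSeed.length() - 1; i >= 0; i--) {
            if (pathFromSeed.charAt(i) == SIMPLE_LINK) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two simple links followed by two non-link hops.
        System.out.println(transHops("LLEE"));
    }
}
```

Only the trailing run is counted: a non-link hop earlier in the path, followed by a simple link, does not contribute to the result.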
protected void inheritFrom(CrawlURI ancestor)
ancestor
-
public CrawlURI createCrawlURI(UURI destination, LinkContext context, Hop hop) throws org.apache.commons.httpclient.URIException
Any relative URIs will be treated as relative to this CrawlURI's UURI.
destination
- The new URI, possibly a relative URI
context
-
hop
-
org.apache.commons.httpclient.URIException
public CrawlURI createCrawlURI(String destination, LinkContext context, Hop hop) throws org.apache.commons.httpclient.URIException
org.apache.commons.httpclient.URIException
public static String extendHopsPath(String pathFromSeed, char hopChar)
pathFromSeed
-
hopChar
-
public CrawlURI createCrawlURI(UURI destination, LinkContext context, Hop hop, int scheduling, boolean seed) throws org.apache.commons.httpclient.URIException
org.apache.commons.httpclient.URIException
public String toString()
public void incrementDiscardedOutLinks()
public int getPrecedence()
public void setPrecedence(int precedence)
precedence
- the precedence to set
public UURI getPolicyBasisUURI()
public boolean haveOverlayNamesBeenSet()
haveOverlayNamesBeenSet
in interface OverlayContext
public ArrayList<String> getOverlayNames()
getOverlayNames
in interface OverlayContext
public Map<String,Object> getOverlayMap(String name)
getOverlayMap
in interface OverlayContext
public void setOverlayMapsSource(OverlayMapsSource overrideMapsSource)
public void setCanonicalString(String canonical)
public String getCanonicalString()
public void setPolitenessDelay(long polite)
public long getPolitenessDelay()
public void setFullVia(CrawlURI curi)
public CrawlURI getFullVia()
public void setRescheduleTime(long time)
public long getRescheduleTime()
public void resetForRescheduling()
public boolean includesRetireDirective()
public org.json.JSONObject getExtraInfo()
public static void autoregisterTo(AutoKryo kryo)
public CrawlURI markPrerequisite(String preq) throws org.apache.commons.httpclient.URIException
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
org.apache.commons.httpclient.URIException
public boolean containsContentTypeCharsetDeclaration()
public String getHttpResponseHeader(String key)
key
- http response header key (case-insensitive)
public boolean hasContentDigestHistory()
public boolean isRevisit()
public RevisitProfile getRevisitProfile()
public void setRevisitProfile(RevisitProfile revisitProfile)
public int compareTo(CrawlURI o)
compareTo
in interface Comparable<CrawlURI>
Copyright © 2003–2022 Internet Archive. All rights reserved.