public class CrawlServer extends Object implements Serializable, FetchStats.HasFetchStats, IdentityCacheable
Modifier and Type | Field and Description |
---|---|
`protected int` | `consecutiveConnectionErrors` |
`static long` | `MIN_ROBOTS_RETRIES`: only check whether a robots fetch is perhaps superfluous after this many tries |
`static long` | `ROBOTS_NOT_FETCHED` |
`protected long` | `robotsFetched` |
`protected Robotstxt` | `robotstxt` |
`protected FetchStats` | `substats` |
`protected boolean` | `validRobots` |

Constructor and Description |
---|
`CrawlServer(String h)`: Creates a new CrawlServer object. |
Modifier and Type | Method and Description |
---|---|
`void` | `addCredential(Credential cred)`: Add a credential avatar. |
`static void` | `autoregisterTo(AutoKryo kryo)` |
`boolean` | `equals(Object obj)` |
`Set<Credential>` | `getCredentials()` |
`Map<String,String>` | `getHttpAuthChallenges()` |
`String` | `getKey()` |
`String` | `getName()` |
`int` | `getPort()`: Get the port number for this server. |
`Robotstxt` | `getRobotstxt()` |
`static String` | `getServerKey(UURI uuri)`: Get the key to use when doing lookups on server instances. |
`FetchStats` | `getSubstats()` |
`boolean` | `hasCredentials()` |
`int` | `hashCode()` |
`void` | `incrementConsecutiveConnectionErrors()` |
`boolean` | `isRobotsExpired(int validityDuration)`: Is the robots policy expired? |
`boolean` | `isValidRobots()`: If true then valid robots.txt information has been retrieved. |
`void` | `makeDirty()` |
`void` | `resetConsecutiveConnectionErrors()` |
`void` | `setHttpAuthChallenges(Map<String,String> httpAuthChallenges)` |
`void` | `setIdentityCache(ObjectIdentityCache<?> cache)` |
`String` | `toString()` |
`void` | `updateRobots(CrawlURI curi)`: Update the server's robots.txt. |
public static final long ROBOTS_NOT_FETCHED
public static final long MIN_ROBOTS_RETRIES
protected Robotstxt robotstxt
protected long robotsFetched
protected boolean validRobots
protected FetchStats substats
protected int consecutiveConnectionErrors
public CrawlServer(String h)
Creates a new CrawlServer object.
Parameters:
h - the host string for the server.
public Robotstxt getRobotstxt()
public void updateRobots(CrawlURI curi)
Update the server's robotstxt.
Heritrix policy on robots.txt HTTP responses:
For comparison, Google's policy as of Oct 2017:
Parameters:
curi - the crawl URI containing the fetched robots.txt
public String getName()
public int getPort()
public void incrementConsecutiveConnectionErrors()
public void resetConsecutiveConnectionErrors()
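Taken together, these two methods imply the usual bookkeeping pattern: bump the counter on each connection failure, and zero it on a success. A minimal self-contained sketch of that pattern (the class name and accessor below are illustrative, not part of the Heritrix API):

```java
// Sketch of the consecutive-connection-error bookkeeping implied by
// incrementConsecutiveConnectionErrors()/resetConsecutiveConnectionErrors().
// ConnectionErrorTracker and getConsecutiveConnectionErrors() are
// hypothetical names used only for this illustration.
public class ConnectionErrorTracker {
    private int consecutiveConnectionErrors = 0;

    public void incrementConsecutiveConnectionErrors() {
        // Called after a failed connection attempt.
        consecutiveConnectionErrors++;
    }

    public void resetConsecutiveConnectionErrors() {
        // Called after a successful connection; the streak is broken.
        consecutiveConnectionErrors = 0;
    }

    public int getConsecutiveConnectionErrors() {
        return consecutiveConnectionErrors;
    }
}
```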
public Set<Credential> getCredentials()
public boolean hasCredentials()
public void addCredential(Credential cred)
Add an avatar.
Parameters:
cred - Credential avatar to add to set of avatars.
public boolean isValidRobots()
If true then valid robots.txt information has been retrieved.
public static String getServerKey(UURI uuri) throws org.apache.commons.httpclient.URIException
Get key to use doing lookup on server instances.
Throws:
org.apache.commons.httpclient.URIException
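The lookup key is derived from the URI's authority. The sketch below illustrates one plausible scheme using `java.net.URI` instead of Heritrix's `UURI`; in particular, appending an explicit `:443` for https URIs without a port is an assumption made here so that https servers stay distinct from http servers on the same host, not a statement about the actual implementation.

```java
import java.net.URI;

// Illustrative sketch, in the spirit of getServerKey(UURI), of turning
// a URI into a server lookup key. ServerKeySketch and serverKey() are
// hypothetical names; the https default-port handling is an assumption.
public class ServerKeySketch {
    public static String serverKey(String uriString) {
        URI uri = URI.create(uriString);
        String key = uri.getHost();
        int port = uri.getPort(); // -1 when no explicit port is present
        if (port != -1) {
            key = key + ":" + port;
        } else if ("https".equalsIgnoreCase(uri.getScheme())) {
            // Keep https servers distinct from http ones on the same host.
            key = key + ":443";
        }
        return key;
    }
}
```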
public FetchStats getSubstats()
Specified by:
getSubstats in interface FetchStats.HasFetchStats
public boolean isRobotsExpired(int validityDuration)
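Given the `robotsFetched` timestamp and the `ROBOTS_NOT_FETCHED` sentinel among the fields above, the expiry test presumably compares the fetch time plus the validity window against the current time. A self-contained sketch under those assumptions (the sentinel value, the interpretation of `validityDuration` as seconds, and the explicit clock parameter are all illustrative choices, not the actual implementation):

```java
// Sketch of the robots-expiry check implied by isRobotsExpired(int):
// robots info counts as expired if it was never fetched, or if more
// than validityDuration seconds have elapsed since the fetch.
// The sentinel value and the passed-in clock are assumptions.
public class RobotsExpirySketch {
    static final long ROBOTS_NOT_FETCHED = -1L; // assumed sentinel

    public static boolean isRobotsExpired(long robotsFetchedMillis,
                                          int validityDurationSeconds,
                                          long nowMillis) {
        if (robotsFetchedMillis == ROBOTS_NOT_FETCHED) {
            return true; // never fetched: treat as expired
        }
        return robotsFetchedMillis + validityDurationSeconds * 1000L < nowMillis;
    }
}
```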
public static void autoregisterTo(AutoKryo kryo)
public String getKey()
Specified by:
getKey in interface IdentityCacheable
public void makeDirty()
Specified by:
makeDirty in interface IdentityCacheable
public void setIdentityCache(ObjectIdentityCache<?> cache)
Specified by:
setIdentityCache in interface IdentityCacheable
Copyright © 2003–2019 Internet Archive. All rights reserved.