Package org.archive.modules.recrawl.wbm
Class WbmPersistLoadProcessor
java.lang.Object
org.archive.modules.Processor
org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
A Processor for retrieving recrawl info from a remote Wayback Machine index.
This is currently at an early, experimental stage. Both the low-level protocol and the WBM API semantics will certainly undergo several revisions.
Current interface:
http://web-beta.archive.org/cdx/search/cdx?url=archive.org&startDate=1999 will return raw CDX lines for archive.org, since 1999-01-01 00:00:00.
As the index is updated by a separate batch processing job, there is no "Store" counterpart.
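The query form shown above can be sketched as a small URL builder. This is a minimal sketch, not the actual buildURL implementation of this class: the class name CdxQueryUrlSketch and the method buildQueryUrl are hypothetical, and it assumes only the url and startDate parameters that appear in the example URL.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Heritrix: builds a CDX query URL of the
// form shown in the class description.
class CdxQueryUrlSketch {

    // endpoint:  CDX search endpoint, without a query string
    // url:       the URL to look up; it is percent-encoded before use
    // startDate: optional lower bound such as "1999"; skipped when null or empty
    static String buildQueryUrl(String endpoint, String url, String startDate) {
        StringBuilder sb = new StringBuilder(endpoint);
        sb.append("?url=").append(URLEncoder.encode(url, StandardCharsets.UTF_8));
        if (startDate != null && !startDate.isEmpty()) {
            sb.append("&startDate=")
              .append(URLEncoder.encode(startDate, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the example URL from the class description.
        System.out.println(buildQueryUrl(
                "http://web-beta.archive.org/cdx/search/cdx", "archive.org", "1999"));
    }
}
```

Encoding the url value matters once the looked-up URL carries its own query string, since an unescaped ? or & would otherwise be read as part of the CDX query.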
- Author:
- Kenji Nagahashi.
-
Method Summary
Modifier and Type, Method, Description
protected String buildURL
protected InputStream getCDX
int getConnectionTimeout()
long getCumulativeFetchTime() - Total milliseconds spent in API call.
long getErrorCount() - Number of times cdx-server API call failed.
int getHistoryLength()
org.apache.http.client.HttpClient getHttpClient()
long getLoadedCount() - Number of times successfully loaded recrawl info.
int getMaxConnections()
long getMissedCount() - Number of times getting no recrawl info.
long getQueryRangeSecs()
int getSocketTimeout()
protected void innerProcess(CrawlURI uri) - Unused.
protected ProcessResult innerProcessResult(CrawlURI curi)
boolean isGzipAccepted()
static void main(String[] args) - Main entry point for quick test.
void setConnectionTimeout(int connectionTimeout) - Connection timeout for HTTP client in milliseconds.
void setContentDigestScheme(String contentDigestScheme) - Set Content-Digest scheme string to prepend to the hash string found in CDX.
void setGzipAccepted(boolean gzipAccepted) - If set to true, WbmPersistLoadProcessor adds a header Accept-Encoding: gzip to HTTP requests.
void setHistoryLength(int historyLength)
void setHttpClient(org.apache.http.client.HttpClient client)
void setMaxConnections(int maxConnections)
void setQueryRangeSecs(long queryRangeSecs)
void setQueryURL(String queryURL)
void setRequestHeaders(Map<String, String> requestHeaders) - All key-value pairs in this map will be added as HTTP headers.
void setSocketTimeout(int socketTimeout) - Socket timeout (SO_TIMEOUT) for HTTP client in milliseconds.
protected boolean shouldProcess(CrawlURI uri)
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerRejectProcess, isRunning, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop, toCheckpointJson
-
Constructor Details
-
WbmPersistLoadProcessor
public WbmPersistLoadProcessor()
-
-
Method Details
-
setHistoryLength
public void setHistoryLength(int historyLength) -
getHistoryLength
public int getHistoryLength() -
setQueryURL
-
getQueryURL
-
setContentDigestScheme
Set Content-Digest scheme string to prepend to the hash string found in CDX. Heritrix's Content-Digest comparison includes this part. "sha1:" by default.
- Parameters:
contentDigestScheme
-
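As an illustration of why the scheme prefix matters for setContentDigestScheme, here is a minimal sketch of the comparison. The class ContentDigestSketch and both methods are hypothetical and are not the code Heritrix runs; the sample hash is the Base32 SHA-1 of empty content, used only as a placeholder value.

```java
// Hypothetical illustration, not part of Heritrix: shows the effect of
// prepending the Content-Digest scheme to a bare hash taken from CDX.
class ContentDigestSketch {

    // CDX carries only the hash; prepending the scheme gives "scheme:hash".
    static String withScheme(String contentDigestScheme, String cdxHash) {
        return contentDigestScheme + cdxHash;
    }

    // The comparison covers the scheme part too, so a wrong or missing
    // scheme makes identical content look different.
    static boolean sameContent(String contentDigestScheme, String cdxHash,
                               String fetchedDigest) {
        return withScheme(contentDigestScheme, cdxHash).equals(fetchedDigest);
    }

    public static void main(String[] args) {
        String hash = "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"; // SHA-1 of empty content
        System.out.println(sameContent("sha1:", hash, "sha1:" + hash)); // prints true
        System.out.println(sameContent("", hash, "sha1:" + hash));      // prints false
    }
}
```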
-
getContentDigestScheme
-
setSocketTimeout
public void setSocketTimeout(int socketTimeout)
Socket timeout (SO_TIMEOUT) for HTTP client in milliseconds.
-
getSocketTimeout
public int getSocketTimeout() -
setConnectionTimeout
public void setConnectionTimeout(int connectionTimeout)
Connection timeout for HTTP client in milliseconds.
- Parameters:
connectionTimeout
-
-
getConnectionTimeout
public int getConnectionTimeout() -
getMaxConnections
public int getMaxConnections() -
setMaxConnections
public void setMaxConnections(int maxConnections) -
isGzipAccepted
public boolean isGzipAccepted() -
setGzipAccepted
public void setGzipAccepted(boolean gzipAccepted)
If set to true, WbmPersistLoadProcessor adds a header Accept-Encoding: gzip to HTTP requests. The new CDX server checks this header to decide whether to compress the response. It is also possible to override the gzipAccepted=true setting with the gzip=false request parameter. It is off by default, as it makes little sense to compress a single line of CDX.
- Parameters:
gzipAccepted - true to allow gzip compression.
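When gzip is accepted, the client has to be ready for either a compressed or a plain response body. The following is a minimal sketch of that decision using only the JDK. GzipResponseSketch and its methods are hypothetical, the sample line is a placeholder rather than real CDX, and the actual processor goes through its HttpClient, which may handle decoding differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not part of Heritrix.
class GzipResponseSketch {

    // Reads a response body as text, unwrapping gzip only when the
    // Content-Encoding value says the server actually compressed it.
    static String readBody(InputStream body, String contentEncoding) {
        try {
            InputStream in = body;
            if ("gzip".equalsIgnoreCase(contentEncoding)) {
                in = new GZIPInputStream(body);
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper standing in for a server that compresses its response.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        String line = "sample cdx line\n"; // placeholder, not a real CDX record
        String decoded = readBody(new ByteArrayInputStream(gzip(line)), "gzip");
        System.out.println(decoded.equals(line)); // prints true
    }
}
```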
-
getRequestHeaders
-
setRequestHeaders
All key-value pairs in this map will be added as HTTP headers. Typically used for providing authentication cookies. This method makes a copy of requestHeaders. Note: this property may be dropped in the future if I come up with a better interface.
- Parameters:
requestHeaders - map of <header-name, header-value>.
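The copy semantics of setRequestHeaders can be illustrated with a minimal sketch. RequestHeadersSketch is a hypothetical stand-in for the processor, assuming a plain HashMap copy; the cookie value is made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the processor, showing only the copy behaviour
// described for setRequestHeaders.
class RequestHeadersSketch {
    private Map<String, String> requestHeaders = new HashMap<>();

    // Stores a copy, so later changes to the caller's map have no effect here.
    void setRequestHeaders(Map<String, String> requestHeaders) {
        this.requestHeaders = new HashMap<>(requestHeaders);
    }

    Map<String, String> getRequestHeaders() {
        return requestHeaders;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Cookie", "session=example"); // made-up authentication cookie

        RequestHeadersSketch p = new RequestHeadersSketch();
        p.setRequestHeaders(headers);
        headers.put("X-Extra", "1"); // added after the call, not picked up

        System.out.println(p.getRequestHeaders().containsKey("X-Extra")); // prints false
    }
}
```

One consequence of the copy is that headers have to be set again through setRequestHeaders whenever they change, for example when a cookie is renewed.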
-
getLoadedCount
public long getLoadedCount()
Number of times successfully loaded recrawl info.
- Returns:
long
-
getMissedCount
public long getMissedCount()
Number of times getting no recrawl info.
- Returns:
long
-
getErrorCount
public long getErrorCount()
Number of times cdx-server API call failed.
- Returns:
long
-
getCumulativeFetchTime
public long getCumulativeFetchTime()
Total milliseconds spent in API calls. It is the sum of the time spent waiting for the next available connection and the actual HTTP request-response round trip, across all threads.
-
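For getCumulativeFetchTime, a total summed across all threads can be kept with a shared atomic counter. This is a minimal sketch of that idea, not the processor's actual bookkeeping: FetchTimerSketch, timed and demo are hypothetical names, and Thread.sleep stands in for an API call.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration, not part of Heritrix.
class FetchTimerSketch {
    private final AtomicLong cumulativeFetchTime = new AtomicLong();

    // Runs the call and adds its elapsed wall-clock milliseconds to the
    // shared total, even when the call throws.
    void timed(Runnable apiCall) {
        long t0 = System.currentTimeMillis();
        try {
            apiCall.run();
        } finally {
            cumulativeFetchTime.addAndGet(System.currentTimeMillis() - t0);
        }
    }

    long getCumulativeFetchTime() {
        return cumulativeFetchTime.get();
    }

    // Two threads each spend about 50 ms in a simulated API call.
    static long demo() {
        FetchTimerSketch timer = new FetchTimerSketch();
        Runnable slowCall = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread a = new Thread(() -> timer.timed(slowCall));
        Thread b = new Thread(() -> timer.timed(slowCall));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return timer.getCumulativeFetchTime();
    }

    public static void main(String[] args) {
        System.out.println(demo() + " ms spent across both threads");
    }
}
```

Because the total is summed per thread, it can exceed elapsed wall-clock time when calls overlap, so it is better read as aggregate thread time than as crawl delay.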
setHttpClient
public void setHttpClient(org.apache.http.client.HttpClient client) -
getHttpClient
public org.apache.http.client.HttpClient getHttpClient() -
setQueryRangeSecs
public void setQueryRangeSecs(long queryRangeSecs) - Parameters:
queryRangeSecs
-
-
getQueryRangeSecs
public long getQueryRangeSecs() -
buildURL
-
getCDX
- Throws:
InterruptedException
IOException
-
innerProcessResult
- Overrides:
innerProcessResult in class Processor
- Throws:
InterruptedException
-
getLastCrawl
- Throws:
IOException
-
innerProcess
Unused.
- Specified by:
innerProcess in class Processor
- Throws:
InterruptedException
-
shouldProcess
- Specified by:
shouldProcess in class Processor
-
main
Main entry point for quick test.
- Parameters:
args
-
- Throws:
Exception
-