Class WbmPersistLoadProcessor

java.lang.Object
org.archive.modules.Processor
org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class WbmPersistLoadProcessor extends Processor
A Processor that retrieves recrawl info from a remote Wayback Machine index. This is currently at an early experimental stage. Both the low-level protocol and the WBM API semantics will certainly undergo several revisions.

Current interface:

http://web-beta.archive.org/cdx/search/cdx?url=archive.org&startDate=1999 returns raw CDX lines for archive.org from 1999-01-01 00:00:00 onward.

Because the index is updated in a separate batch-processing job, there is no "Store" counterpart.
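
A minimal sketch of composing a query in the form shown under "Current interface", using only the JDK. The class name CdxQueryUrlSketch, the method buildQueryUrl, and the decision to URL-encode both values are assumptions made for illustration; this is not the processor's own buildURL implementation, which is not documented here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CdxQueryUrlSketch {
    // Endpoint copied from the class description; it may change as the API is revised.
    public static final String BASE = "http://web-beta.archive.org/cdx/search/cdx";

    // Hypothetical helper: builds a query URL from a target URL and a start date.
    public static String buildQueryUrl(String url, String startDate) {
        return BASE
                + "?url=" + URLEncoder.encode(url, StandardCharsets.UTF_8)
                + "&startDate=" + URLEncoder.encode(startDate, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Reproduces the example query from the class description.
        System.out.println(buildQueryUrl("archive.org", "1999"));
    }
}
```

Encoding the values keeps the query well-formed when the target URL itself contains characters such as & or ?.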

Author:
Kenji Nagahashi.
  • Constructor Details

    • WbmPersistLoadProcessor

      public WbmPersistLoadProcessor()
  • Method Details

    • setHistoryLength

      public void setHistoryLength(int historyLength)
    • getHistoryLength

      public int getHistoryLength()
    • setQueryURL

      public void setQueryURL(String queryURL)
    • getQueryURL

      public String getQueryURL()
    • setContentDigestScheme

      public void setContentDigestScheme(String contentDigestScheme)
      Sets the Content-Digest scheme string to prepend to the hash string found in CDX. Heritrix's Content-Digest comparison includes this part. "sha1:" by default.
      Parameters:
      contentDigestScheme - scheme string to prepend, for example "sha1:"
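
To illustrate why the scheme prefix matters, here is a small sketch of prepending it to a hash taken from a CDX line before comparing digests. The class DigestSchemeSketch, the method withScheme, and the sample hash value are all hypothetical; the processor's actual comparison logic is not shown in this documentation.

```java
public class DigestSchemeSketch {
    // Hypothetical helper: prepends the scheme (e.g. "sha1:") to a bare CDX hash.
    public static String withScheme(String contentDigestScheme, String cdxHash) {
        return contentDigestScheme + cdxHash;
    }

    public static void main(String[] args) {
        String cdxHash = "EXAMPLEHASH";            // placeholder, not a real digest
        String fetchedDigest = "sha1:EXAMPLEHASH"; // digest form assumed to include the scheme

        // Without the prefix the two strings differ even though the hash is the same.
        System.out.println(fetchedDigest.equals(cdxHash));
        // With the prefix the comparison succeeds.
        System.out.println(fetchedDigest.equals(withScheme("sha1:", cdxHash)));
    }
}
```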
    • getContentDigestScheme

      public String getContentDigestScheme()
    • setSocketTimeout

      public void setSocketTimeout(int socketTimeout)
      Socket timeout (SO_TIMEOUT) for the HTTP client, in milliseconds.
    • getSocketTimeout

      public int getSocketTimeout()
    • setConnectionTimeout

      public void setConnectionTimeout(int connectionTimeout)
      Connection timeout for the HTTP client, in milliseconds.
      Parameters:
      connectionTimeout - connection timeout in milliseconds
    • getConnectionTimeout

      public int getConnectionTimeout()
    • getMaxConnections

      public int getMaxConnections()
    • setMaxConnections

      public void setMaxConnections(int maxConnections)
    • isGzipAccepted

      public boolean isGzipAccepted()
    • setGzipAccepted

      public void setGzipAccepted(boolean gzipAccepted)
      If set to true, WbmPersistLoadProcessor adds an Accept-Encoding: gzip header to HTTP requests. The new CDX server checks this header to decide whether to compress the response. It is also possible to override the gzipAccepted=true setting with the gzip=false request parameter. It is off by default, as it makes little sense to compress a single line of CDX.
      Parameters:
      gzipAccepted - true to allow gzip compression.
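
As a rough illustration of the flag's effect, the sketch below decides which header to send based on gzipAccepted. The class GzipHeaderSketch and the method headersFor are hypothetical names; the real processor issues requests through Apache HttpClient, which is deliberately left out to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GzipHeaderSketch {
    // Hypothetical helper: returns the headers implied by the gzipAccepted flag.
    public static Map<String, String> headersFor(boolean gzipAccepted) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (gzipAccepted) {
            // The server may use this header to decide whether to compress the response.
            headers.put("Accept-Encoding", "gzip");
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(headersFor(false)); // default: no Accept-Encoding header
        System.out.println(headersFor(true));
    }
}
```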
    • getRequestHeaders

      public Map<String,String> getRequestHeaders()
    • setRequestHeaders

      public void setRequestHeaders(Map<String,String> requestHeaders)
      All key-value pairs in this map are added as HTTP headers. Typically used for providing authentication cookies. This method makes a copy of requestHeaders. Note: this property may be dropped in the future if I come up with a better interface.
      Parameters:
      requestHeaders - map of <header-name, header-value>.
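
The copy-on-set behavior can be sketched as follows. RequestHeadersCopySketch is a hypothetical stand-in class, not the processor itself, and the cookie value is a placeholder; the point is only that later changes to the caller's map do not leak into the stored headers.

```java
import java.util.HashMap;
import java.util.Map;

public class RequestHeadersCopySketch {
    private Map<String, String> requestHeaders = new HashMap<>();

    // Stores a copy, so the caller's map can change afterwards without side effects.
    public void setRequestHeaders(Map<String, String> requestHeaders) {
        this.requestHeaders = new HashMap<>(requestHeaders);
    }

    public Map<String, String> getRequestHeaders() {
        return requestHeaders;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Cookie", "session=example"); // placeholder authentication cookie

        RequestHeadersCopySketch sketch = new RequestHeadersCopySketch();
        sketch.setRequestHeaders(headers);

        headers.put("X-Extra", "1"); // mutate the original after the call
        System.out.println(sketch.getRequestHeaders().containsKey("X-Extra"));
    }
}
```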
    • getLoadedCount

      public long getLoadedCount()
      Number of times recrawl info was successfully loaded.
      Returns:
      long
    • getMissedCount

      public long getMissedCount()
      Number of times no recrawl info was returned.
      Returns:
      long
    • getErrorCount

      public long getErrorCount()
      Number of times the cdx-server API call failed.
      Returns:
      long
    • getCumulativeFetchTime

      public long getCumulativeFetchTime()
      Total milliseconds spent in API calls. It is the sum of the time spent waiting for the next available connection and the actual HTTP request-response round trip, across all threads.
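
One way to accumulate such a total safely across threads is an AtomicLong, sketched below. FetchTimeSketch and its record method are hypothetical, and the millisecond values are made up for the demonstration; how the processor actually tracks this time is not stated in this documentation.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FetchTimeSketch {
    // Shared by all worker threads; AtomicLong avoids lost updates without locking.
    private final AtomicLong cumulativeFetchTime = new AtomicLong();

    // Adds connection wait time plus request-response round-trip time for one call.
    public void record(long waitMillis, long roundTripMillis) {
        cumulativeFetchTime.addAndGet(waitMillis + roundTripMillis);
    }

    public long getCumulativeFetchTime() {
        return cumulativeFetchTime.get();
    }

    public static void main(String[] args) throws InterruptedException {
        FetchTimeSketch stats = new FetchTimeSketch();
        Thread a = new Thread(() -> stats.record(5, 120));
        Thread b = new Thread(() -> stats.record(0, 80));
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(stats.getCumulativeFetchTime()); // prints 205
    }
}
```

Because the value sums time across threads, it can exceed wall-clock time when several calls overlap.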
    • setHttpClient

      public void setHttpClient(org.apache.http.client.HttpClient client)
    • getHttpClient

      public org.apache.http.client.HttpClient getHttpClient()
    • setQueryRangeSecs

      public void setQueryRangeSecs(long queryRangeSecs)
      Parameters:
      queryRangeSecs -
    • getQueryRangeSecs

      public long getQueryRangeSecs()
    • buildURL

      protected String buildURL(String url)
    • getCDX

      protected InputStream getCDX(String qurl) throws InterruptedException, IOException
      Throws:
      InterruptedException
      IOException
    • innerProcessResult

      protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException
      Overrides:
      innerProcessResult in class Processor
      Throws:
      InterruptedException
    • getLastCrawl

      protected HashMap<String,Object> getLastCrawl(InputStream is) throws IOException
      Throws:
      IOException
    • innerProcess

      protected void innerProcess(CrawlURI uri) throws InterruptedException
      unused.
      Specified by:
      innerProcess in class Processor
      Throws:
      InterruptedException
    • shouldProcess

      protected boolean shouldProcess(CrawlURI uri)
      Specified by:
      shouldProcess in class Processor
    • main

      public static void main(String[] args) throws Exception
      Main entry point for a quick test.
      Parameters:
      args -
      Throws:
      Exception