Class KafkaCrawlLogFeed

java.lang.Object
  org.archive.modules.Processor
    org.archive.modules.postprocessor.KafkaCrawlLogFeed
All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class KafkaCrawlLogFeed extends Processor implements org.springframework.context.Lifecycle
For Kafka 0.8.x. Sends messages in asynchronous mode (producer.type=async) and does not wait for acknowledgment from Kafka (request.required.acks=0). Messages are sent with no key. These settings could be made configurable if needed.
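For illustration only, a minimal configuration sketch using the setters documented below. In a real Heritrix job this processor is declared as a bean in the Spring crawler-beans configuration and the frontier and server cache are @Autowired; the broker list, topic, and extra-field values shown here are placeholders, not defaults.

    import java.util.Collections;
    import org.archive.modules.postprocessor.KafkaCrawlLogFeed;

    KafkaCrawlLogFeed feed = new KafkaCrawlLogFeed();
    feed.setBrokerList("localhost:9092");                                // placeholder broker list
    feed.setTopic("heritrix-crawl-log");                                 // placeholder topic name
    feed.setDumpPendingAtClose(true);                                    // also emit queued URLs (status 0) at stop
    feed.setExtraFields(Collections.singletonMap("crawlJob", "my-job")); // illustrative extra field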
Author:
nlevitt
  • Field Details

    • logger

      protected static final Logger logger
    • frontier

      protected Frontier frontier
    • serverCache

      protected ServerCache serverCache
    • extraFields

      protected Map<String,String> extraFields
    • dumpPendingAtClose

      protected boolean dumpPendingAtClose
    • brokerList

      protected String brokerList
    • topic

      protected String topic
    • kafkaProducer

      protected transient org.apache.kafka.clients.producer.KafkaProducer<String,byte[]> kafkaProducer
    • stats

  • Constructor Details

    • KafkaCrawlLogFeed

      public KafkaCrawlLogFeed()
  • Method Details

    • getFrontier

      public Frontier getFrontier()
    • setFrontier

      @Autowired public void setFrontier(Frontier frontier)
      Autowired frontier, needed to determine when a URL is finished.
    • getServerCache

      public ServerCache getServerCache()
    • setServerCache

      @Autowired public void setServerCache(ServerCache serverCache)
    • getExtraFields

      public Map<String,String> getExtraFields()
    • setExtraFields

      public void setExtraFields(Map<String,String> extraFields)
    • getDumpPendingAtClose

      public boolean getDumpPendingAtClose()
    • setDumpPendingAtClose

      public void setDumpPendingAtClose(boolean dumpPendingAtClose)
      If true, publish all pending URLs (i.e. queued URLs still in the frontier) when the crawl job is stopping. They are recognizable by the status field, which has the value 0.
    • setBrokerList

      public void setBrokerList(String brokerList)
      Kafka broker list (Kafka property "metadata.broker.list"), i.e. a comma-separated list of host:port pairs, for example "broker1:9092,broker2:9092".
    • getBrokerList

      public String getBrokerList()
    • setTopic

      public void setTopic(String topic)
    • getTopic

      public String getTopic()
    • buildMessage

      protected byte[] buildMessage(CrawlURI curi)
    • shouldProcess

      protected boolean shouldProcess(CrawlURI curi)
      Specified by:
      shouldProcess in class Processor
    • stop

      public void stop()
      Specified by:
      stop in interface org.springframework.context.Lifecycle
      Overrides:
      stop in class Processor
    • kafkaProducer

      protected org.apache.kafka.clients.producer.KafkaProducer<String,byte[]> kafkaProducer()
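      A hedged sketch of how a producer matching the behavior described above (byte[] values, fire-and-forget sends) is typically constructed with the Kafka client API. The actual properties set by this method may differ; the new-producer property names used here (bootstrap.servers, acks) are assumptions.

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;

        Properties props = new Properties();
        props.put("bootstrap.servers", getBrokerList());  // brokers from setBrokerList()
        props.put("acks", "0");                           // do not wait for acknowledgment
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);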
    • innerProcess

      protected void innerProcess(CrawlURI curi) throws InterruptedException
      Specified by:
      innerProcess in class Processor
      Throws:
      InterruptedException
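      For orientation, a hedged sketch of the kind of fire-and-forget send this processor performs per finished URL, using only methods documented on this page. This illustrates the behavior described in the class comment (message sent with no key); it is not the actual implementation.

        import org.apache.kafka.clients.producer.ProducerRecord;

        byte[] message = buildMessage(curi);   // serialize the crawl-log entry for this CrawlURI
        kafkaProducer().send(new ProducerRecord<String, byte[]>(getTopic(), message));  // no key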