Class KafkaCrawlLogFeed
java.lang.Object
org.archive.modules.Processor
org.archive.modules.postprocessor.KafkaCrawlLogFeed
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
For Kafka 0.8.x. Sends messages in asynchronous mode (producer.type=async)
and does not wait for acknowledgment from Kafka (request.required.acks=0).
Messages are sent with no key. These settings are currently fixed but could be made configurable if needed.
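As a rough illustration of the producer settings named in this description, the sketch below collects them in a java.util.Properties object. It is not taken from this class's source: the class name KafkaFeedConfigSketch, the method name producerProperties, and the broker address are hypothetical, and only the three property names come from this page.

```java
import java.util.Properties;

// Hypothetical helper, not part of KafkaCrawlLogFeed: it only gathers the
// Kafka 0.8.x producer properties that the class description mentions.
public class KafkaFeedConfigSketch {

    /** Builds producer properties for the given broker list. */
    public static Properties producerProperties(String brokerList) {
        Properties props = new Properties();
        // Broker list, as set through setBrokerList(String).
        props.setProperty("metadata.broker.list", brokerList);
        // Asynchronous sending: messages are batched in the background.
        props.setProperty("producer.type", "async");
        // Do not wait for any acknowledgment from the broker.
        props.setProperty("request.required.acks", "0");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder broker address.
        Properties props = producerProperties("localhost:9092");
        props.list(System.out);
    }
}
```

With request.required.acks=0 the producer does not learn whether a message reached the broker, so consumers of the feed should tolerate occasional missing records.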
- Author: nlevitt
Nested Class Summary
Nested Classes -
Field Summary
Fields (modifier and type, then field name):
protected String brokerList
protected boolean dumpPendingAtClose
protected Map<String, String> extraFields
protected Frontier frontier
protected org.apache.kafka.clients.producer.KafkaProducer<String, byte[]> kafkaProducer
protected static final Logger logger
protected ServerCache serverCache
protected KafkaCrawlLogFeed.StatsCallback stats
protected String topic
-
Constructor Summary
Constructors:
KafkaCrawlLogFeed()
Method Summary
Methods (modifier and type, then method, then description):
protected byte[] buildMessage(CrawlURI curi)
String getBrokerList()
boolean getDumpPendingAtClose()
Map<String, String> getExtraFields()
Frontier getFrontier()
ServerCache getServerCache()
String getTopic()
protected void innerProcess(CrawlURI curi)
protected org.apache.kafka.clients.producer.KafkaProducer<String, byte[]> kafkaProducer()
void setBrokerList(String brokerList): Kafka broker list (Kafka property "metadata.broker.list").
void setDumpPendingAtClose(boolean dumpPendingAtClose): If true, publishes all pending URLs (i.e. queued URLs still in the frontier) when the crawl job is stopping.
void setExtraFields(Map<String, String> extraFields)
void setFrontier(Frontier frontier): Autowired frontier, needed to determine when a URL is finished.
void setServerCache(ServerCache serverCache)
void setTopic(String topic)
protected boolean shouldProcess(CrawlURI curi)
void stop()
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, toCheckpointJson
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.springframework.context.Lifecycle
isRunning, start
-
Field Details
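- logger
protected static final Logger logger
- frontier
protected Frontier frontier
- serverCache
protected ServerCache serverCache
- extraFields
protected Map<String, String> extraFields
- dumpPendingAtClose
protected boolean dumpPendingAtClose
- brokerList
protected String brokerList
- topic
protected String topic
- kafkaProducer
protected org.apache.kafka.clients.producer.KafkaProducer<String, byte[]> kafkaProducer
- stats
protected KafkaCrawlLogFeed.StatsCallback stats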
Constructor Details
-
KafkaCrawlLogFeed
public KafkaCrawlLogFeed()
-
-
Method Details
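- getFrontier
public Frontier getFrontier()
- setFrontier
public void setFrontier(Frontier frontier)
Autowired frontier, needed to determine when a URL is finished.
- getServerCache
public ServerCache getServerCache()
- setServerCache
public void setServerCache(ServerCache serverCache)
- getExtraFields
public Map<String, String> getExtraFields()
- setExtraFields
public void setExtraFields(Map<String, String> extraFields)
- getDumpPendingAtClose
public boolean getDumpPendingAtClose()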
- setDumpPendingAtClose
public void setDumpPendingAtClose(boolean dumpPendingAtClose)
If true, publishes all pending URLs (i.e. queued URLs still in the frontier) when the crawl job is stopping. These records are recognizable by their status field, which has the value 0.
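On the consuming side, the status check described for setDumpPendingAtClose might look like the sketch below. It assumes the consumer has already decoded a message into a Map and that the status value is stored under the key "status". The message encoding is not specified on this page, and the class PendingUrlFilterSketch and its isPending method are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical consumer-side helper, not part of KafkaCrawlLogFeed.
public class PendingUrlFilterSketch {

    /**
     * Returns true if the decoded record describes a pending URL, that is,
     * its "status" entry is a number equal to 0. The key name "status" is
     * an assumption based on the method description.
     */
    public static boolean isPending(Map<String, Object> record) {
        Object status = record.get("status");
        return status instanceof Number && ((Number) status).intValue() == 0;
    }

    public static void main(String[] args) {
        Map<String, Object> pending = new HashMap<>();
        pending.put("status", 0);

        Map<String, Object> fetched = new HashMap<>();
        fetched.put("status", 200);

        System.out.println(isPending(pending));
        System.out.println(isPending(fetched));
    }
}
```

A record with no status entry is treated as not pending here; a real consumer may prefer to reject such records outright.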
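- setBrokerList
public void setBrokerList(String brokerList)
Kafka broker list (Kafka property "metadata.broker.list").
- getBrokerList
public String getBrokerList()
- setTopic
public void setTopic(String topic)
- getTopic
public String getTopic()
- buildMessage
protected byte[] buildMessage(CrawlURI curi)
- shouldProcess
protected boolean shouldProcess(CrawlURI curi)
Specified by: shouldProcess in class Processor
- stop
public void stop()
- kafkaProducer
protected org.apache.kafka.clients.producer.KafkaProducer<String, byte[]> kafkaProducer()
- innerProcess
protected void innerProcess(CrawlURI curi) throws InterruptedException
Specified by: innerProcess in class Processor
Throws: InterruptedException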