Index

A

A_RECEIVED_FROM_AMQP - Static variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
A_SENT_TO_AMQP - Static variable in class org.archive.modules.AMQPPublishProcessor
 
A_TIMESTAMP - Static variable in class org.archive.modules.recrawl.FetchHistoryHelper
Key for storing the timestamp in the crawl history map.
addPreferredOutlinks(CrawlURI, LinkedHashMap<String, String>) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
 
addVideoOutlink(CrawlURI, String, int, int) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
AMQPCrawlLogFeed - Class in org.archive.modules.postprocessor
 
AMQPCrawlLogFeed() - Constructor for class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
amqpMessageProperties() - Method in class org.archive.modules.AMQPProducerProcessor
 
amqpMessageProperties() - Method in class org.archive.modules.AMQPPublishProcessor
 
amqpMessageProperties() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
amqpProducer - Variable in class org.archive.modules.AMQPProducerProcessor
 
amqpProducer - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
amqpProducer() - Method in class org.archive.modules.AMQPProducerProcessor
 
amqpProducer() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
AMQPProducer - Class in org.archive.modules
 
AMQPProducer(String, String, String) - Constructor for class org.archive.modules.AMQPProducer
 
AMQPProducerProcessor - Class in org.archive.modules
 
AMQPProducerProcessor() - Constructor for class org.archive.modules.AMQPProducerProcessor
 
AMQPPublishProcessor - Class in org.archive.modules
 
AMQPPublishProcessor() - Constructor for class org.archive.modules.AMQPPublishProcessor
 
amqpUri - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
amqpUri - Variable in class org.archive.modules.AMQPProducer
 
amqpUri - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
AMQPUrlPublishedEvent - Class in org.archive.crawler.event
ApplicationEvent published when Heritrix sends a URL to AMQP.
AMQPUrlPublishedEvent(AMQPPublishProcessor, CrawlURI) - Constructor for class org.archive.crawler.event.AMQPUrlPublishedEvent
 
AMQPUrlReceivedEvent - Class in org.archive.crawler.event
ApplicationEvent published when AMQPUrlReceiver receives a URL.
AMQPUrlReceivedEvent(AMQPUrlReceiver, CrawlURI) - Constructor for class org.archive.crawler.event.AMQPUrlReceivedEvent
 
AMQPUrlReceiver - Class in org.archive.crawler.frontier
 
AMQPUrlReceiver() - Constructor for class org.archive.crawler.frontier.AMQPUrlReceiver
 
AMQPUrlReceiver.UrlConsumer - Class in org.archive.crawler.frontier
 
AMQPUrlWaiter - Class in org.archive.modules
Bean to enforce a wait for Umbra's AMQP queue.
AMQPUrlWaiter() - Constructor for class org.archive.modules.AMQPUrlWaiter
 
appCtx - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
appCtx - Variable in class org.archive.modules.AMQPPublishProcessor
 
applyToSubdomains - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
 

B

baseURI - Variable in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
 
BATCH_MAX_SIZE - Static variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
BATCH_MAX_TIME_MS - Static variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
brokerList - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
buildJson(CrawlURI, int, DecideRule, DecideResult) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
buildJson(CrawlURI, Map<String, String>, ServerCache) - Static method in class org.archive.modules.postprocessor.CrawlLogJsonBuilder
 
buildJsonMessage(CrawlURI) - Method in class org.archive.modules.AMQPPublishProcessor
Constructs the JSON to send via AMQP.
buildMessage(CrawlURI) - Method in class org.archive.modules.AMQPProducerProcessor
 
buildMessage(CrawlURI) - Method in class org.archive.modules.AMQPPublishProcessor
 
buildMessage(CrawlURI) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
buildMessage(CrawlURI) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
buildRecord(CrawlURI, URI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
buildURL(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 

C

candidates - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
channel - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
channel() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
channel() - Method in class org.archive.modules.AMQPProducer
 
checkAMQPUrlWait() - Method in class org.archive.modules.AMQPUrlWaiter
 
checkForNull(Object) - Static method in class org.archive.modules.postprocessor.CrawlLogJsonBuilder
 
closeLocalTempFile() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
connection - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
connection - Variable in class org.archive.modules.AMQPProducer
 
connection() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
considerStrings(Extractor, CrawlURI, CharSequence, boolean) - Method in class org.archive.modules.extractor.KnowledgableExtractorJS
 
controller - Variable in class org.archive.modules.AMQPUrlWaiter
 
controller - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
controller - Variable in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
crawledBatch - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
crawledBatchLastTime - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
crawlerLoggerModule - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
CrawlLogJsonBuilder - Class in org.archive.modules.postprocessor
 
CrawlLogJsonBuilder() - Constructor for class org.archive.modules.postprocessor.CrawlLogJsonBuilder
 
createCrawlURI(UURI, LinkContext, Hop) - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
Delegates to wrapped CrawlURI
curi - Variable in class org.archive.crawler.event.AMQPUrlPublishedEvent
 
curi - Variable in class org.archive.crawler.event.AMQPUrlReceivedEvent
 
CustomizedCrawlURIFacade(CrawlURI, UURI) - Constructor for class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
 

D

DecideRuleSequenceWithAMQPFeed - Class in org.archive.modules.deciderules
 
DecideRuleSequenceWithAMQPFeed() - Constructor for class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
decisionMade(CrawlURI, DecideRule, int, DecideResult) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
dirtySegments - Variable in class org.archive.trough.TroughClient
 
doRedirectInheritance(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
dumpPendingAtClose - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
dumpPendingAtClose - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 

E

errors - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
 
evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ExpressionDecideRule
 
exchange - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
exchange - Variable in class org.archive.modules.AMQPProducer
 
exchange - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
ExpressionDecideRule - Class in org.archive.modules.deciderules
Example usage:
ExpressionDecideRule() - Constructor for class org.archive.modules.deciderules.ExpressionDecideRule
 
extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream
 
extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
If the URI is annotated "youtube-dl" and is a 3xx (redirect), find the redirect among the outlinks and add the "youtube-dl" annotation to it as well, and also make a note of the containing page inside the CrawlURI.
extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
 
ExtractorPDFContent - Class in org.archive.modules.extractor
PDF Content Extractor.
ExtractorPDFContent() - Constructor for class org.archive.modules.extractor.ExtractorPDFContent
 
ExtractorYoutubeChannelFormatStream - Class in org.archive.modules.extractor
 
ExtractorYoutubeChannelFormatStream() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream
 
ExtractorYoutubeDL - Class in org.archive.modules.extractor
Extracts links to media by running yt-dlp in a subprocess.
ExtractorYoutubeDL() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeDL
 
ExtractorYoutubeDL.NullOutputStream - Class in org.archive.modules.extractor
Dummy output stream to swallow bytes without storing anything.
ExtractorYoutubeDL.YoutubeDLResults - Class in org.archive.modules.extractor
 
ExtractorYoutubeFormatStream - Class in org.archive.modules.extractor
Youtube stream URI extractor.
ExtractorYoutubeFormatStream() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeFormatStream
 
extraFields - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
extraFields - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 

F

fail(CrawlURI, byte[], AMQP.BasicProperties, Throwable) - Method in class org.archive.modules.AMQPProducerProcessor
 
FetchHistoryHelper - Class in org.archive.modules.recrawl
Collection of utility methods useful for loading and storing crawl history.
FetchHistoryHelper() - Constructor for class org.archive.modules.recrawl.FetchHistoryHelper
 
findYdlAnnotation(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
frontier - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
frontier - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
frontier - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 

G

getAmqpUri() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
getAmqpUri() - Method in class org.archive.modules.AMQPProducerProcessor
 
getAmqpUri() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
getApplyToSubdomains() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
getBaseURI() - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
 
getBrokerList() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
getCandidates() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
getCDX(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getClientId() - Method in class org.archive.modules.AMQPPublishProcessor
 
getConnectionTimeout() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getContentDigestScheme() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getCrawlController() - Method in class org.archive.modules.AMQPUrlWaiter
 
getCrawlController() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
getCrawlerLoggerModule() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
getCumulativeFetchTime() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Total milliseconds spent in API calls.
getCuri() - Method in class org.archive.crawler.event.AMQPUrlPublishedEvent
 
getCuri() - Method in class org.archive.crawler.event.AMQPUrlReceivedEvent
 
getDumpPendingAtClose() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
getDumpPendingAtClose() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
getErrorCount() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Number of times the cdx-server API call failed.
getExchange() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
getExchange() - Method in class org.archive.modules.AMQPProducerProcessor
 
getExchange() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
getExtractLimit() - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
 
getExtraFields() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
getExtraFields() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
getExtraInfo() - Method in class org.archive.modules.AMQPPublishProcessor
 
getFetchHistory(CrawlURI, long, int) - Static method in class org.archive.modules.recrawl.FetchHistoryHelper
Returns a Map to store recrawl data, positioned properly in the CrawlURI's fetch history array according to timestamp. This makes it possible to import crawl history data from multiple sources.
getFilename() - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
 
getFrontier() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
getFrontier() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
getFrontier() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
getGroovyExpression() - Method in class org.archive.modules.deciderules.ExpressionDecideRule
 
getHistoryLength() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getHost() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
getHttpClient() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getItagPriority() - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
 
getKeyedProperties() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
getKeyedProperties() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
getLastCrawl(InputStream) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getLimits() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
getLoadedCount() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Number of times recrawl info was successfully loaded.
getLocalTempFile() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
getLogMetadataRecord() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
getMaxConnections() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorPDFContent
 
getMissedCount() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Number of times no recrawl info was found.
getOutLinks() - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
Delegates to wrapped CrawlURI
getProcessArguments() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
getQueryRangeSecs() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getQueryURL() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getQueueName() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
getQuotas() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
getQuotas() - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
getRequestHeaders() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getRethinkUrl() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
getRethinkUrl() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
getRoutingKey() - Method in class org.archive.modules.AMQPProducerProcessor
 
getRoutingKey() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
getScheduledDate() - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
 
getSegmentId() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
getSegmentId() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
getServerCache() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
getServerCache() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
getServerCache() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
getServerCache() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
getServerCache() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
getSocketTimeout() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
getSourceTag() - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
getStatisticsTracker() - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
getTopic() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
getWarcWriter() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
getWarcWriters() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
groovyTemplate() - Method in class org.archive.modules.deciderules.ExpressionDecideRule
 
groovyTemplates - Variable in class org.archive.modules.deciderules.ExpressionDecideRule
 

H

handleDelivery(String, Envelope, AMQP.BasicProperties, byte[]) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
 
handleShutdownSignal(String, ShutdownSignalException) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
 
host - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
HostQuotaEnforcer - Class in org.archive.crawler.prefetch
Enforces quotas on a host.
HostQuotaEnforcer() - Constructor for class org.archive.crawler.prefetch.HostQuotaEnforcer
 
httpRequest(String, String, String, String, int) - Static method in class org.archive.trough.TroughClient
 

I

incrementDiscardedOutLinks() - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
Delegates to wrapped CrawlURI
innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDFContent
 
innerProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
innerProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
innerProcess(CrawlURI) - Method in class org.archive.modules.AMQPProducerProcessor
 
innerProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
innerProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
innerProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Unused.
innerProcessResult(CrawlURI) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
innerProcessResult(CrawlURI) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
innerProcessResult(CrawlURI) - Method in class org.archive.modules.AMQPProducerProcessor
 
innerProcessResult(CrawlURI) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
isAutoDelete() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
isDurable() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
isForceFetch() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
isGzipAccepted() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
isOpen(RandomAccessFile) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
isRunning - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
isRunning() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 

J

JSON_MIMETYPE - Static variable in class org.archive.trough.TroughClient
 

K

KafkaCrawlLogFeed - Class in org.archive.modules.postprocessor
For Kafka 0.8.x.
KafkaCrawlLogFeed() - Constructor for class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
KafkaCrawlLogFeed.StatsCallback - Class in org.archive.modules.postprocessor
 
kafkaProducer - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
kafkaProducer() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
KnowledgableExtractorJS - Class in org.archive.modules.extractor
A subclass of ExtractorJS that has some customized behavior for specific kinds of web pages.
KnowledgableExtractorJS() - Constructor for class org.archive.modules.extractor.KnowledgableExtractorJS
 
KnowledgableExtractorJS.CustomizedCrawlURIFacade - Class in org.archive.modules.extractor
Wraps a CrawlURI, allowing baseURI to be overridden, without changing the underlying CrawlURI.
kp - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
kp - Variable in class org.archive.modules.recrawl.TroughContentDigestHistory
 

L

limits - Variable in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
load(CrawlURI) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
logCapturedVideo(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
logContainingPage(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
logger - Static variable in class org.archive.modules.AMQPProducer
 
logger - Variable in class org.archive.modules.AMQPProducerProcessor
 
logger - Static variable in class org.archive.modules.AMQPUrlWaiter
 
logger - Static variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
logger - Static variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 

M

main(String[]) - Static method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
main(String[]) - Static method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Main entry point for a quick test.
makeCrawlUri(JSONObject) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
 
MAX_VIDEOS_PER_PAGE - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 

N

NullOutputStream() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeDL.NullOutputStream
 

O

onApplicationEvent(CrawlStateEvent) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
onApplicationEvent(CrawlStateEvent) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
onApplicationEvent(ApplicationEvent) - Method in class org.archive.modules.AMQPUrlWaiter
 
onCompletion(RecordMetadata, Exception) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
 
openNewTempFile() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
org.archive.crawler.event - package org.archive.crawler.event
 
org.archive.crawler.frontier - package org.archive.crawler.frontier
 
org.archive.crawler.prefetch - package org.archive.crawler.prefetch
 
org.archive.crawler.reporting - package org.archive.crawler.reporting
 
org.archive.modules - package org.archive.modules
 
org.archive.modules.deciderules - package org.archive.modules.deciderules
 
org.archive.modules.extractor - package org.archive.modules.extractor
 
org.archive.modules.postprocessor - package org.archive.modules.postprocessor
 
org.archive.modules.recrawl - package org.archive.modules.recrawl
 
org.archive.modules.recrawl.wbm - package org.archive.modules.recrawl.wbm
 
org.archive.trough - package org.archive.trough
 

P

parseRethinkdbUrl(String) - Method in class org.archive.trough.TroughClient
Parses a URL of the form rethinkdb://server1:port,server2:port/database/table and sets the fields rethinkServers, rethinkDb, and rethinkTable.
parseStreamMap(String) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
 
populateHeritableMetadata(CrawlURI, JSONObject) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
 
postCrawledBatch() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
postUncrawledBatch() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
postWrite(WARCRecordInfo, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
Because we are writing an additional WARC Metadata Record for the json video info, there is no CrawlURI for that record, and thus nothing ever goes through the frontier to be logged to the crawl.log.
print(StringBuilder, String[]) - Method in interface org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor.FormatSegment
 
promote(String) - Method in class org.archive.trough.TroughClient
 
promoteDirtySegments() - Method in class org.archive.trough.TroughClient
 
promotionInterval - Variable in class org.archive.trough.TroughClient
 
promotrix - Variable in class org.archive.trough.TroughClient
 
Promotrix() - Constructor for class org.archive.trough.TroughClient.Promotrix
 
props - Variable in class org.archive.modules.AMQPPublishProcessor
 
props - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
props - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
publishMessage(byte[], AMQP.BasicProperties) - Method in class org.archive.modules.AMQPProducer
Publish the message with the supplied properties.

Q

queueName - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
quotas - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
quotas - Variable in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 

R

r - Static variable in class org.archive.trough.TroughClient
 
rand - Variable in class org.archive.trough.TroughClient
 
read(String, String, Object[]) - Method in class org.archive.trough.TroughClient
 
readToEnd(Reader) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
readUrl(String) - Method in class org.archive.trough.TroughClient
 
readUrlCache - Variable in class org.archive.trough.TroughClient
 
readUrlNoCache(String) - Method in class org.archive.trough.TroughClient
 
registerSchema(String, String) - Method in class org.archive.trough.TroughClient
 
REQUEST_HEADER_BLACKLIST - Static variable in class org.archive.crawler.frontier.AMQPUrlReceiver
 
responsePayload(HttpURLConnection) - Static method in class org.archive.trough.TroughClient
 
rethinkDb - Variable in class org.archive.trough.TroughClient
 
rethinkQuery(ReqlExpr, Integer) - Method in class org.archive.trough.TroughClient
Run a rethinkdb query.
rethinkServers - Variable in class org.archive.trough.TroughClient
 
routingKey - Variable in class org.archive.modules.AMQPProducer
 
routingKey - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
run() - Method in class org.archive.trough.TroughClient.Promotrix
 
runYoutubeDL(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
Writes output to this.tempFile.get().

S

SCHEMA_ID - Static variable in class org.archive.modules.recrawl.TroughContentDigestHistory
 
SCHEMA_SQL - Static variable in class org.archive.modules.recrawl.TroughContentDigestHistory
 
segmentManagerUrl() - Method in class org.archive.trough.TroughClient
 
segmentManagerUrl(String) - Method in class org.archive.trough.TroughClient
 
serverCache - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
serverCache - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
serverCache - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
serverCache - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
serverCache - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
setAmqpUri(String) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
setAmqpUri(String) - Method in class org.archive.modules.AMQPProducerProcessor
 
setAmqpUri(String) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
setApplicationContext(ApplicationContext) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
setApplicationContext(ApplicationContext) - Method in class org.archive.modules.AMQPPublishProcessor
 
setApplyToSubdomains(boolean) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
Whether to apply the quotas to each subdomain of HostQuotaEnforcer.host (separately, not cumulatively).
setAutoDelete(boolean) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
Should queues be marked as auto-delete?
setBrokerList(String) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
Kafka broker list (kafka property "metadata.broker.list").
setCandidates(CandidatesProcessor) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
Received urls are run through the supplied CandidatesProcessor, which checks scope and schedules the urls.
setClientId(String) - Method in class org.archive.modules.AMQPPublishProcessor
Client id to include in the json payload.
setConnectionTimeout(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Connection timeout for the HTTP client, in milliseconds.
setContentDigestScheme(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Sets the Content-Digest scheme string to prepend to the hash string found in CDX.
setCrawlController(CrawlController) - Method in class org.archive.modules.AMQPUrlWaiter
 
setCrawlController(CrawlController) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
setCrawlController(CrawlController) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
setCrawlerLoggerModule(CrawlerLoggerModule) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
setDumpPendingAtClose(boolean) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
If true, publish all pending urls (i.e. queued urls still in the frontier) when crawl job is stopping.
setDumpPendingAtClose(boolean) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
If true, publish all pending urls (i.e. queued urls still in the frontier) when crawl job is stopping.
setDurable(boolean) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
Should queues be marked as durable?
setExchange(String) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
setExchange(String) - Method in class org.archive.modules.AMQPProducerProcessor
 
setExchange(String) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
setExtractLimit(Integer) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
Maximum number of video urls to extract.
setExtraFields(Map<String, String>) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
setExtraFields(Map<String, String>) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
setExtraInfo(Map<String, Object>) - Method in class org.archive.modules.AMQPPublishProcessor
Arbitrary additional information to include in the json payload.
setForceFetch(boolean) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
setFrontier(Frontier) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
Autowired frontier, needed to determine when a url is finished.
setFrontier(Frontier) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
Autowired frontier, needed to determine when a url is finished.
setFrontier(Frontier) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
Autowired frontier, needed to determine when a url is finished.
setGroovyExpression(String) - Method in class org.archive.modules.deciderules.ExpressionDecideRule
 
setGzipAccepted(boolean) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
If set to true, WbmPersistLoadProcessor adds an Accept-Encoding: gzip header to HTTP requests.
setHistoryLength(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
setHost(String) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
setHttpClient(HttpClient) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
setItagPriority(List<String>) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
Itag priority list.
setLimits(Map<String, Map<String, Long>>) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
Should match structure of BaseWARCWriterProcessor.getStats()
setLogMetadataRecord(boolean) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
Whether or not to create a crawl.log entry for any WARC Metadata Records written.
setMaxConnections(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorPDFContent
The maximum size of PDF files to consider.
setProcessArguments(List<String>) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
setQueryRangeSecs(long) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
setQueryURL(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
setQueueName(String) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
setQuotas(Map<String, Long>) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
Keys can be any of the FetchStats keys.
setQuotas(Map<String, Long>) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
Keys can be any of the CrawledBytesHistotable keys.
setRequestHeaders(Map<String, String>) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
All key-value pairs in this map will be added as HTTP headers.
setRethinkUrl(String) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
setRethinkUrl(String) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
setRoutingKey(String) - Method in class org.archive.modules.AMQPProducerProcessor
 
setRoutingKey(String) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
setScheduledDate(String) - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
 
setSegmentId(String) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
setSegmentId(String) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
setServerCache(ServerCache) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
setServerCache(ServerCache) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
setServerCache(ServerCache) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
setServerCache(ServerCache) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
setSocketTimeout(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
Socket timeout (SO_TIMEOUT) for the HTTP client, in milliseconds.
setSourceTag(String) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
setStatisticsTracker(StatisticsTracker) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
setTopic(String) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
setWarcWriter(BaseWARCWriterProcessor) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
setWarcWriters(List<BaseWARCWriterProcessor>) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDFContent
 
shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
Returns true if we should run yt-dlp on this url.
shouldProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
 
shouldProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.AMQPPublishProcessor
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
SIX_HOURS_MS - Static variable in class org.archive.trough.TroughClient
 
SourceQuotaEnforcer - Class in org.archive.crawler.prefetch
Processor for enforcing quotas by source tag (normally the seed url if enabled).
SourceQuotaEnforcer() - Constructor for class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
sourceTag - Variable in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
SQL_MIMETYPE - Static variable in class org.archive.trough.TroughClient
 
sqlValue(Object) - Static method in class org.archive.trough.TroughClient
 
start() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
start() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
 
start() - Method in class org.archive.trough.TroughClient
 
statisticsTracker - Variable in class org.archive.crawler.prefetch.SourceQuotaEnforcer
 
stats - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
StatsCallback() - Constructor for class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
 
stop() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
 
stop() - Method in class org.archive.modules.AMQPProducer
 
stop() - Method in class org.archive.modules.AMQPProducerProcessor
 
stop() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
 
stop() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
 
stop() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
stop() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
stop() - Method in class org.archive.trough.TroughClient
 
store(CrawlURI) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
streamYdlOutput(InputStream, ExtractorYoutubeDL.YoutubeDLResults) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
Streams through yt-dlp json output.
success(CrawlURI, byte[], AMQP.BasicProperties) - Method in class org.archive.modules.AMQPProducerProcessor
 
success(CrawlURI, byte[], AMQP.BasicProperties) - Method in class org.archive.modules.AMQPPublishProcessor
 

T

tempfile - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
TEN_MINUTES_MS - Static variable in class org.archive.trough.TroughClient
 
threadChannel - Variable in class org.archive.modules.AMQPProducer
 
topic - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
 
total - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
 
troughClient - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
troughClient - Variable in class org.archive.modules.recrawl.TroughContentDigestHistory
 
troughClient() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
troughClient() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
 
TroughClient - Class in org.archive.trough
 
TroughClient(String) - Constructor for class org.archive.trough.TroughClient
 
TroughClient(String, Integer) - Constructor for class org.archive.trough.TroughClient
 
TroughClient.Promotrix - Class in org.archive.trough
 
TroughClient.TroughException - Exception in org.archive.trough
 
TroughClient.TroughNoReadUrlException - Exception in org.archive.trough
 
TroughContentDigestHistory - Class in org.archive.modules.recrawl
AbstractContentDigestHistory implementation for trough.
TroughContentDigestHistory() - Constructor for class org.archive.modules.recrawl.TroughContentDigestHistory
 
TroughCrawlLogFeed - Class in org.archive.modules.postprocessor
Post insert statements for these two tables.
TroughCrawlLogFeed() - Constructor for class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
TroughException(Exception) - Constructor for exception org.archive.trough.TroughClient.TroughException
 
TroughException(String) - Constructor for exception org.archive.trough.TroughClient.TroughException
 
TroughException(String, Throwable) - Constructor for exception org.archive.trough.TroughClient.TroughException
 
TroughNoReadUrlException(String) - Constructor for exception org.archive.trough.TroughClient.TroughNoReadUrlException
 

U

uncrawledBatch - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
uncrawledBatchLastTime - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
 
UrlConsumer(Channel) - Constructor for class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
 
URLPattern - Static variable in class org.archive.modules.extractor.ExtractorPDFContent
 
urlsPublished - Variable in class org.archive.modules.AMQPUrlWaiter
 
urlsReceived - Variable in class org.archive.modules.AMQPUrlWaiter
 

W

WARCLimitEnforcer - Class in org.archive.modules.postprocessor
 
WARCLimitEnforcer() - Constructor for class org.archive.modules.postprocessor.WARCLimitEnforcer
 
warcWriter - Variable in class org.archive.modules.postprocessor.WARCLimitEnforcer
 
WbmPersistLoadProcessor - Class in org.archive.modules.recrawl.wbm
A Processor for retrieving recrawl info from a remote Wayback Machine index.
WbmPersistLoadProcessor() - Constructor for class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
 
WbmPersistLoadProcessor.FormatSegment - Interface in org.archive.modules.recrawl.wbm
 
wrapped - Variable in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
 
write(int) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL.NullOutputStream
 
write(PrintWriter, StatisticsTracker) - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
 
write(String, String, Object[]) - Method in class org.archive.trough.TroughClient
 
write(String, String, Object[], String) - Method in class org.archive.trough.TroughClient
 
WRITE_SQL_TMPL - Static variable in class org.archive.modules.recrawl.TroughContentDigestHistory
 
writeUrl(String, String) - Method in class org.archive.trough.TroughClient
 
writeUrlCache - Variable in class org.archive.trough.TroughClient
 
writeUrlNoCache(String, String) - Method in class org.archive.trough.TroughClient
 

X

XmlCrawlSummaryReport - Class in org.archive.crawler.reporting
 
XmlCrawlSummaryReport() - Constructor for class org.archive.crawler.reporting.XmlCrawlSummaryReport
 

Y

YDL_CONTAINING_PAGE_DIGEST - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
YDL_CONTAINING_PAGE_TIMESTAMP - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
YDL_CONTAINING_PAGE_URI - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
YDL_JSON_FILE_DIGEST - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
ydlLogger - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
 
YoutubeDLResults(RandomAccessFile) - Constructor for class org.archive.modules.extractor.ExtractorYoutubeDL.YoutubeDLResults
 