Index
A
- A_RECEIVED_FROM_AMQP - Static variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- A_SENT_TO_AMQP - Static variable in class org.archive.modules.AMQPPublishProcessor
- A_TIMESTAMP - Static variable in class org.archive.modules.recrawl.FetchHistoryHelper
-
Key for storing timestamp in crawl history map.
- addPreferredOutlinks(CrawlURI, LinkedHashMap<String, String>) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
- addVideoOutlink(CrawlURI, String, int, int) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- AMQPCrawlLogFeed - Class in org.archive.modules.postprocessor
- AMQPCrawlLogFeed() - Constructor for class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- amqpMessageProperties() - Method in class org.archive.modules.AMQPProducerProcessor
- amqpMessageProperties() - Method in class org.archive.modules.AMQPPublishProcessor
- amqpMessageProperties() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- amqpProducer - Variable in class org.archive.modules.AMQPProducerProcessor
- amqpProducer - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- amqpProducer() - Method in class org.archive.modules.AMQPProducerProcessor
- amqpProducer() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- AMQPProducer - Class in org.archive.modules
- AMQPProducer(String, String, String) - Constructor for class org.archive.modules.AMQPProducer
- AMQPProducerProcessor - Class in org.archive.modules
- AMQPProducerProcessor() - Constructor for class org.archive.modules.AMQPProducerProcessor
- AMQPPublishProcessor - Class in org.archive.modules
- AMQPPublishProcessor() - Constructor for class org.archive.modules.AMQPPublishProcessor
- amqpUri - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- amqpUri - Variable in class org.archive.modules.AMQPProducer
- amqpUri - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- AMQPUrlPublishedEvent - Class in org.archive.crawler.event
-
ApplicationEvent published when Heritrix sends a URL to AMQP.
- AMQPUrlPublishedEvent(AMQPPublishProcessor, CrawlURI) - Constructor for class org.archive.crawler.event.AMQPUrlPublishedEvent
- AMQPUrlReceivedEvent - Class in org.archive.crawler.event
-
ApplicationEvent published when AMQPUrlReceiver receives a URL.
- AMQPUrlReceivedEvent(AMQPUrlReceiver, CrawlURI) - Constructor for class org.archive.crawler.event.AMQPUrlReceivedEvent
- AMQPUrlReceiver - Class in org.archive.crawler.frontier
- AMQPUrlReceiver() - Constructor for class org.archive.crawler.frontier.AMQPUrlReceiver
- AMQPUrlReceiver.UrlConsumer - Class in org.archive.crawler.frontier
- AMQPUrlWaiter - Class in org.archive.modules
-
Bean to enforce a wait for Umbra's amqp queue
- AMQPUrlWaiter() - Constructor for class org.archive.modules.AMQPUrlWaiter
- appCtx - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- appCtx - Variable in class org.archive.modules.AMQPPublishProcessor
- applyToSubdomains - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
B
- baseURI - Variable in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
- BATCH_MAX_SIZE - Static variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- BATCH_MAX_TIME_MS - Static variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- brokerList - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- buildJson(CrawlURI, int, DecideRule, DecideResult) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- buildJson(CrawlURI, Map<String, String>, ServerCache) - Static method in class org.archive.modules.postprocessor.CrawlLogJsonBuilder
- buildJsonMessage(CrawlURI) - Method in class org.archive.modules.AMQPPublishProcessor
-
Constructs the json to send via AMQP.
- buildMessage(CrawlURI) - Method in class org.archive.modules.AMQPProducerProcessor
- buildMessage(CrawlURI) - Method in class org.archive.modules.AMQPPublishProcessor
- buildMessage(CrawlURI) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- buildMessage(CrawlURI) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- buildRecord(CrawlURI, URI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- buildURL(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
C
- candidates - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- channel - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- channel() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- channel() - Method in class org.archive.modules.AMQPProducer
- checkAMQPUrlWait() - Method in class org.archive.modules.AMQPUrlWaiter
- checkForNull(Object) - Static method in class org.archive.modules.postprocessor.CrawlLogJsonBuilder
- closeLocalTempFile() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- connection - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- connection - Variable in class org.archive.modules.AMQPProducer
- connection() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- considerStrings(Extractor, CrawlURI, CharSequence, boolean) - Method in class org.archive.modules.extractor.KnowledgableExtractorJS
- controller - Variable in class org.archive.modules.AMQPUrlWaiter
- controller - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- controller - Variable in class org.archive.modules.postprocessor.WARCLimitEnforcer
- crawledBatch - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- crawledBatchLastTime - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- crawlerLoggerModule - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- CrawlLogJsonBuilder - Class in org.archive.modules.postprocessor
- CrawlLogJsonBuilder() - Constructor for class org.archive.modules.postprocessor.CrawlLogJsonBuilder
- createCrawlURI(UURI, LinkContext, Hop) - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
-
Delegates to wrapped CrawlURI
- curi - Variable in class org.archive.crawler.event.AMQPUrlPublishedEvent
- curi - Variable in class org.archive.crawler.event.AMQPUrlReceivedEvent
- CustomizedCrawlURIFacade(CrawlURI, UURI) - Constructor for class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
D
- DecideRuleSequenceWithAMQPFeed - Class in org.archive.modules.deciderules
- DecideRuleSequenceWithAMQPFeed() - Constructor for class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- decisionMade(CrawlURI, DecideRule, int, DecideResult) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- dirtySegments - Variable in class org.archive.trough.TroughClient
- doRedirectInheritance(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- dumpPendingAtClose - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- dumpPendingAtClose - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
E
- errors - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
- evaluate(CrawlURI) - Method in class org.archive.modules.deciderules.ExpressionDecideRule
- exchange - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- exchange - Variable in class org.archive.modules.AMQPProducer
- exchange - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- ExpressionDecideRule - Class in org.archive.modules.deciderules
-
Example usage:
- ExpressionDecideRule() - Constructor for class org.archive.modules.deciderules.ExpressionDecideRule
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
-
If uri is annotated "youtube-dl" and is a 3xx (redirect), find the redirect among the outlinks and add the "youtube-dl" annotation to it as well, and also make a note of the containing page inside the CrawlURI.
- extract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
- ExtractorPDFContent - Class in org.archive.modules.extractor
-
PDF Content Extractor.
- ExtractorPDFContent() - Constructor for class org.archive.modules.extractor.ExtractorPDFContent
- ExtractorYoutubeChannelFormatStream - Class in org.archive.modules.extractor
- ExtractorYoutubeChannelFormatStream() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream
- ExtractorYoutubeDL - Class in org.archive.modules.extractor
-
Extracts links to media by running yt-dlp in a subprocess.
- ExtractorYoutubeDL() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeDL
- ExtractorYoutubeDL.NullOutputStream - Class in org.archive.modules.extractor
-
Dummy output stream to swallow bytes without storing anything.
- ExtractorYoutubeDL.YoutubeDLResults - Class in org.archive.modules.extractor
- ExtractorYoutubeFormatStream - Class in org.archive.modules.extractor
-
Youtube stream URI extractor.
- ExtractorYoutubeFormatStream() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeFormatStream
- extraFields - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- extraFields - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
F
- fail(CrawlURI, byte[], AMQP.BasicProperties, Throwable) - Method in class org.archive.modules.AMQPProducerProcessor
- FetchHistoryHelper - Class in org.archive.modules.recrawl
-
Collection of utility methods useful for loading and storing crawl history.
- FetchHistoryHelper() - Constructor for class org.archive.modules.recrawl.FetchHistoryHelper
- findYdlAnnotation(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- frontier - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- frontier - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- frontier - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
G
- getAmqpUri() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- getAmqpUri() - Method in class org.archive.modules.AMQPProducerProcessor
- getAmqpUri() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- getApplyToSubdomains() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- getBaseURI() - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
- getBrokerList() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- getCandidates() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- getCDX(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getClientId() - Method in class org.archive.modules.AMQPPublishProcessor
- getConnectionTimeout() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getContentDigestScheme() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getCrawlController() - Method in class org.archive.modules.AMQPUrlWaiter
- getCrawlController() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- getCrawlerLoggerModule() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- getCumulativeFetchTime() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Total milliseconds spent in API calls.
- getCuri() - Method in class org.archive.crawler.event.AMQPUrlPublishedEvent
- getCuri() - Method in class org.archive.crawler.event.AMQPUrlReceivedEvent
- getDumpPendingAtClose() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- getDumpPendingAtClose() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- getErrorCount() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Number of times the cdx-server API call failed.
- getExchange() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- getExchange() - Method in class org.archive.modules.AMQPProducerProcessor
- getExchange() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- getExtractLimit() - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
- getExtraFields() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- getExtraFields() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- getExtraInfo() - Method in class org.archive.modules.AMQPPublishProcessor
- getFetchHistory(CrawlURI, long, int) - Static method in class org.archive.modules.recrawl.FetchHistoryHelper
-
Returns a Map to store recrawl data, positioned properly in CrawlURI's fetch history array, according to timestamp. This makes it possible to import crawl history data from multiple sources.
- getFilename() - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
- getFrontier() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- getFrontier() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- getFrontier() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- getGroovyExpression() - Method in class org.archive.modules.deciderules.ExpressionDecideRule
- getHistoryLength() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getHost() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- getHttpClient() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getItagPriority() - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
- getKeyedProperties() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- getKeyedProperties() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- getLastCrawl(InputStream) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getLimits() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- getLoadedCount() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Number of times recrawl info was successfully loaded.
- getLocalTempFile() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- getLogMetadataRecord() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- getMaxConnections() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getMaxSizeToParse() - Method in class org.archive.modules.extractor.ExtractorPDFContent
- getMissedCount() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Number of times no recrawl info was found.
- getOutLinks() - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
-
Delegates to wrapped CrawlURI
- getProcessArguments() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- getQueryRangeSecs() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getQueryURL() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getQueueName() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- getQuotas() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- getQuotas() - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- getRequestHeaders() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getRethinkUrl() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- getRethinkUrl() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- getRoutingKey() - Method in class org.archive.modules.AMQPProducerProcessor
- getRoutingKey() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- getScheduledDate() - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
- getSegmentId() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- getSegmentId() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- getServerCache() - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- getServerCache() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- getServerCache() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- getServerCache() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- getServerCache() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- getSocketTimeout() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- getSourceTag() - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- getStatisticsTracker() - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- getTopic() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- getWarcWriter() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- getWarcWriters() - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- groovyTemplate() - Method in class org.archive.modules.deciderules.ExpressionDecideRule
- groovyTemplates - Variable in class org.archive.modules.deciderules.ExpressionDecideRule
H
- handleDelivery(String, Envelope, AMQP.BasicProperties, byte[]) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
- handleShutdownSignal(String, ShutdownSignalException) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
- host - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
- HostQuotaEnforcer - Class in org.archive.crawler.prefetch
-
Enforces quotas on a host.
- HostQuotaEnforcer() - Constructor for class org.archive.crawler.prefetch.HostQuotaEnforcer
- httpRequest(String, String, String, String, int) - Static method in class org.archive.trough.TroughClient
I
- incrementDiscardedOutLinks() - Method in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
-
Delegates to wrapped CrawlURI
- innerExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDFContent
- innerProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- innerProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- innerProcess(CrawlURI) - Method in class org.archive.modules.AMQPProducerProcessor
- innerProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- innerProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- innerProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- innerProcess(CrawlURI) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Unused.
- innerProcessResult(CrawlURI) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- innerProcessResult(CrawlURI) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.AMQPProducerProcessor
- innerProcessResult(CrawlURI) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- isAutoDelete() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- isDurable() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- isForceFetch() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- isGzipAccepted() - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- isOpen(RandomAccessFile) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- isRunning - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- isRunning() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
J
- JSON_MIMETYPE - Static variable in class org.archive.trough.TroughClient
K
- KafkaCrawlLogFeed - Class in org.archive.modules.postprocessor
-
For Kafka 0.8.x.
- KafkaCrawlLogFeed() - Constructor for class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- KafkaCrawlLogFeed.StatsCallback - Class in org.archive.modules.postprocessor
- kafkaProducer - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- kafkaProducer() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- KnowledgableExtractorJS - Class in org.archive.modules.extractor
-
A subclass of ExtractorJS that has some customized behavior for specific kinds of web pages.
- KnowledgableExtractorJS() - Constructor for class org.archive.modules.extractor.KnowledgableExtractorJS
- KnowledgableExtractorJS.CustomizedCrawlURIFacade - Class in org.archive.modules.extractor
-
Wraps a CrawlURI, allowing baseURI to be overridden, without changing the underlying CrawlURI.
- kp - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- kp - Variable in class org.archive.modules.recrawl.TroughContentDigestHistory
L
- limits - Variable in class org.archive.modules.postprocessor.WARCLimitEnforcer
- load(CrawlURI) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- logCapturedVideo(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- logContainingPage(CrawlURI, String) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- logger - Static variable in class org.archive.modules.AMQPProducer
- logger - Variable in class org.archive.modules.AMQPProducerProcessor
- logger - Static variable in class org.archive.modules.AMQPUrlWaiter
- logger - Static variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- logger - Static variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
M
- main(String[]) - Static method in class org.archive.modules.extractor.ExtractorYoutubeDL
- main(String[]) - Static method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Main entry point for quick test.
- makeCrawlUri(JSONObject) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
- MAX_VIDEOS_PER_PAGE - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
N
- NullOutputStream() - Constructor for class org.archive.modules.extractor.ExtractorYoutubeDL.NullOutputStream
O
- onApplicationEvent(CrawlStateEvent) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- onApplicationEvent(CrawlStateEvent) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- onApplicationEvent(ApplicationEvent) - Method in class org.archive.modules.AMQPUrlWaiter
- onCompletion(RecordMetadata, Exception) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
- openNewTempFile() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- org.archive.crawler.event - package org.archive.crawler.event
- org.archive.crawler.frontier - package org.archive.crawler.frontier
- org.archive.crawler.prefetch - package org.archive.crawler.prefetch
- org.archive.crawler.reporting - package org.archive.crawler.reporting
- org.archive.modules - package org.archive.modules
- org.archive.modules.deciderules - package org.archive.modules.deciderules
- org.archive.modules.extractor - package org.archive.modules.extractor
- org.archive.modules.postprocessor - package org.archive.modules.postprocessor
- org.archive.modules.recrawl - package org.archive.modules.recrawl
- org.archive.modules.recrawl.wbm - package org.archive.modules.recrawl.wbm
- org.archive.trough - package org.archive.trough
P
- parseRethinkdbUrl(String) - Method in class org.archive.trough.TroughClient
-
Parses a url like this: rethinkdb://server1:port,server2:port/database/table. Sets fields rethinkServers, rethinkDb, rethinkTable.
- parseStreamMap(String) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
- populateHeritableMetadata(CrawlURI, JSONObject) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
- postCrawledBatch() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- postUncrawledBatch() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- postWrite(WARCRecordInfo, CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
-
Because we are writing an additional WARC Metadata Record for the json video info, there is no CrawlURI for that record, and thus nothing ever goes through the frontier to be logged to the crawl.log.
- print(StringBuilder, String[]) - Method in interface org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor.FormatSegment
- promote(String) - Method in class org.archive.trough.TroughClient
- promoteDirtySegments() - Method in class org.archive.trough.TroughClient
- promotionInterval - Variable in class org.archive.trough.TroughClient
- promotrix - Variable in class org.archive.trough.TroughClient
- Promotrix() - Constructor for class org.archive.trough.TroughClient.Promotrix
- props - Variable in class org.archive.modules.AMQPPublishProcessor
- props - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- props - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- publishMessage(byte[], AMQP.BasicProperties) - Method in class org.archive.modules.AMQPProducer
-
Publish the message with the supplied properties.
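The URL shape given for parseRethinkdbUrl(String) above (rethinkdb://server1:port,server2:port/database/table) can be illustrated with a small standalone sketch. This is not the TroughClient implementation: the class name RethinkUrlSketch, its fields, and its error handling are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not the TroughClient implementation.
public class RethinkUrlSketch {
    public final List<String> servers; // "host:port" entries, in URL order
    public final String db;
    public final String table;

    private RethinkUrlSketch(List<String> servers, String db, String table) {
        this.servers = Collections.unmodifiableList(servers);
        this.db = db;
        this.table = table;
    }

    // Splits rethinkdb://server1:port,server2:port/database/table into its parts.
    public static RethinkUrlSketch parse(String url) {
        final String prefix = "rethinkdb://";
        if (url == null || !url.startsWith(prefix)) {
            throw new IllegalArgumentException("expected a rethinkdb:// url: " + url);
        }
        // After the scheme we expect exactly three slash-separated segments.
        String[] parts = url.substring(prefix.length()).split("/");
        if (parts.length != 3 || parts[1].isEmpty() || parts[2].isEmpty()) {
            throw new IllegalArgumentException("expected servers/database/table: " + url);
        }
        // The first segment is a comma-separated server list.
        List<String> servers = new ArrayList<>();
        for (String server : parts[0].split(",")) {
            if (!server.isEmpty()) {
                servers.add(server);
            }
        }
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no servers in url: " + url);
        }
        return new RethinkUrlSketch(servers, parts[1], parts[2]);
    }

    public static void main(String[] args) {
        // Host names, port, database and table here are made-up sample values.
        RethinkUrlSketch parsed =
                parse("rethinkdb://server1:28015,server2:28015/mydb/mytable");
        System.out.println(parsed.servers + " " + parsed.db + " " + parsed.table);
    }
}
```

The sketch returns an immutable value instead of setting fields on the enclosing object, which keeps it self-contained; the documented method is described as setting rethinkServers, rethinkDb and rethinkTable on the client itself.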
Q
- queueName - Variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- quotas - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
- quotas - Variable in class org.archive.crawler.prefetch.SourceQuotaEnforcer
R
- r - Static variable in class org.archive.trough.TroughClient
- rand - Variable in class org.archive.trough.TroughClient
- read(String, String, Object[]) - Method in class org.archive.trough.TroughClient
- readToEnd(Reader) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- readUrl(String) - Method in class org.archive.trough.TroughClient
- readUrlCache - Variable in class org.archive.trough.TroughClient
- readUrlNoCache(String) - Method in class org.archive.trough.TroughClient
- registerSchema(String, String) - Method in class org.archive.trough.TroughClient
- REQUEST_HEADER_BLACKLIST - Static variable in class org.archive.crawler.frontier.AMQPUrlReceiver
- responsePayload(HttpURLConnection) - Static method in class org.archive.trough.TroughClient
- rethinkDb - Variable in class org.archive.trough.TroughClient
- rethinkQuery(ReqlExpr, Integer) - Method in class org.archive.trough.TroughClient
-
Run a rethinkdb query.
- rethinkServers - Variable in class org.archive.trough.TroughClient
- routingKey - Variable in class org.archive.modules.AMQPProducer
- routingKey - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- run() - Method in class org.archive.trough.TroughClient.Promotrix
- runYoutubeDL(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
-
Writes output to this.tempFile.get().
S
- SCHEMA_ID - Static variable in class org.archive.modules.recrawl.TroughContentDigestHistory
- SCHEMA_SQL - Static variable in class org.archive.modules.recrawl.TroughContentDigestHistory
- segmentManagerUrl() - Method in class org.archive.trough.TroughClient
- segmentManagerUrl(String) - Method in class org.archive.trough.TroughClient
- serverCache - Variable in class org.archive.crawler.prefetch.HostQuotaEnforcer
- serverCache - Variable in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- serverCache - Variable in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- serverCache - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- serverCache - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- setAmqpUri(String) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- setAmqpUri(String) - Method in class org.archive.modules.AMQPProducerProcessor
- setAmqpUri(String) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- setApplicationContext(ApplicationContext) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- setApplicationContext(ApplicationContext) - Method in class org.archive.modules.AMQPPublishProcessor
- setApplyToSubdomains(boolean) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
-
Whether to apply the quotas to each subdomain of HostQuotaEnforcer.host (separately, not cumulatively).
- setAutoDelete(boolean) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
-
Should queues be marked as auto-delete?
- setBrokerList(String) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
-
Kafka broker list (kafka property "metadata.broker.list").
- setCandidates(CandidatesProcessor) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
-
Received urls are run through the supplied CandidatesProcessor, which checks scope and schedules the urls.
- setClientId(String) - Method in class org.archive.modules.AMQPPublishProcessor
-
Client id to include in the json payload.
- setConnectionTimeout(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Connection timeout for HTTP client in milliseconds.
- setContentDigestScheme(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
Sets the Content-Digest scheme string to prepend to the hash string found in CDX.
- setCrawlController(CrawlController) - Method in class org.archive.modules.AMQPUrlWaiter
- setCrawlController(CrawlController) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- setCrawlController(CrawlController) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- setCrawlerLoggerModule(CrawlerLoggerModule) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- setDumpPendingAtClose(boolean) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
-
If true, publish all pending urls (i.e. queued urls still in the frontier) when crawl job is stopping.
- setDumpPendingAtClose(boolean) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
-
If true, publish all pending urls (i.e. queued urls still in the frontier) when crawl job is stopping.
- setDurable(boolean) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
-
Should queues be marked as durable?
- setExchange(String) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- setExchange(String) - Method in class org.archive.modules.AMQPProducerProcessor
- setExchange(String) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- setExtractLimit(Integer) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
-
Maximum number of video urls to extract.
- setExtraFields(Map<String, String>) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- setExtraFields(Map<String, String>) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- setExtraInfo(Map<String, Object>) - Method in class org.archive.modules.AMQPPublishProcessor
-
Arbitrary additional information to include in the json payload.
- setForceFetch(boolean) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- setFrontier(Frontier) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
-
Autowired frontier, needed to determine when a url is finished.
- setFrontier(Frontier) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
-
Autowired frontier, needed to determine when a url is finished.
- setFrontier(Frontier) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
-
Autowired frontier, needed to determine when a url is finished.
- setGroovyExpression(String) - Method in class org.archive.modules.deciderules.ExpressionDecideRule
- setGzipAccepted(boolean) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
if set to true, WbmPersistLoadProcessor adds a header Accept-Encoding: gzip to HTTP requests.
- setHistoryLength(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- setHost(String) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- setHttpClient(HttpClient) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- setItagPriority(List<String>) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
-
Itag priority list.
- setLimits(Map<String, Map<String, Long>>) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
-
Should match structure of BaseWARCWriterProcessor.getStats()
- setLogMetadataRecord(boolean) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
-
Whether or not to create a crawl.log entry for any WARC Metadata Records written.
- setMaxConnections(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- setMaxSizeToParse(long) - Method in class org.archive.modules.extractor.ExtractorPDFContent
-
The maximum size of PDF files to consider.
- setProcessArguments(List<String>) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- setQueryRangeSecs(long) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- setQueryURL(String) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- setQueueName(String) - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- setQuotas(Map<String, Long>) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
-
Keys can be any of the FetchStats keys.
- setQuotas(Map<String, Long>) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
-
Keys can be any of the CrawledBytesHistotable keys.
- setRequestHeaders(Map<String, String>) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
all key-value pairs in this map will be added as HTTP headers.
- setRethinkUrl(String) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- setRethinkUrl(String) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- setRoutingKey(String) - Method in class org.archive.modules.AMQPProducerProcessor
- setRoutingKey(String) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- setScheduledDate(String) - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
- setSegmentId(String) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- setSegmentId(String) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- setServerCache(ServerCache) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- setServerCache(ServerCache) - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- setServerCache(ServerCache) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- setServerCache(ServerCache) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- setServerCache(ServerCache) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- setSocketTimeout(int) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
-
socket timeout (SO_TIMEOUT) for HTTP client in milliseconds.
- setSourceTag(String) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- setStatisticsTracker(StatisticsTracker) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- setTopic(String) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- setWarcWriter(BaseWARCWriterProcessor) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- setWarcWriters(List<BaseWARCWriterProcessor>) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- shouldBuildRecord(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorPDFContent
- shouldExtract(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
-
Returns true if we should run yt-dlp on this url.
- shouldProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.HostQuotaEnforcer
- shouldProcess(CrawlURI) - Method in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- shouldProcess(CrawlURI) - Method in class org.archive.modules.AMQPPublishProcessor
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- shouldProcess(CrawlURI) - Method in class org.archive.modules.extractor.ExtractorYoutubeFormatStream
- shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- shouldProcess(CrawlURI) - Method in class org.archive.modules.postprocessor.WARCLimitEnforcer
- shouldProcess(CrawlURI) - Method in class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- SIX_HOURS_MS - Static variable in class org.archive.trough.TroughClient
- SourceQuotaEnforcer - Class in org.archive.crawler.prefetch
-
Processor for enforcing quotas by source tag (normally the seed url if enabled).
- SourceQuotaEnforcer() - Constructor for class org.archive.crawler.prefetch.SourceQuotaEnforcer
- sourceTag - Variable in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- SQL_MIMETYPE - Static variable in class org.archive.trough.TroughClient
- sqlValue(Object) - Static method in class org.archive.trough.TroughClient
- start() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- start() - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
- start() - Method in class org.archive.trough.TroughClient
- statisticsTracker - Variable in class org.archive.crawler.prefetch.SourceQuotaEnforcer
- stats - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- StatsCallback() - Constructor for class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
- stop() - Method in class org.archive.crawler.frontier.AMQPUrlReceiver
- stop() - Method in class org.archive.modules.AMQPProducer
- stop() - Method in class org.archive.modules.AMQPProducerProcessor
- stop() - Method in class org.archive.modules.deciderules.DecideRuleSequenceWithAMQPFeed
- stop() - Method in class org.archive.modules.postprocessor.AMQPCrawlLogFeed
- stop() - Method in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- stop() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- stop() - Method in class org.archive.trough.TroughClient
- store(CrawlURI) - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- streamYdlOutput(InputStream, ExtractorYoutubeDL.YoutubeDLResults) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL
-
Streams through yt-dlp json output.
- success(CrawlURI, byte[], AMQP.BasicProperties) - Method in class org.archive.modules.AMQPProducerProcessor
- success(CrawlURI, byte[], AMQP.BasicProperties) - Method in class org.archive.modules.AMQPPublishProcessor
T
- tempfile - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- TEN_MINUTES_MS - Static variable in class org.archive.trough.TroughClient
- threadChannel - Variable in class org.archive.modules.AMQPProducer
- topic - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed
- total - Variable in class org.archive.modules.postprocessor.KafkaCrawlLogFeed.StatsCallback
- troughClient - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- troughClient - Variable in class org.archive.modules.recrawl.TroughContentDigestHistory
- troughClient() - Method in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- troughClient() - Method in class org.archive.modules.recrawl.TroughContentDigestHistory
- TroughClient - Class in org.archive.trough
- TroughClient(String) - Constructor for class org.archive.trough.TroughClient
- TroughClient(String, Integer) - Constructor for class org.archive.trough.TroughClient
- TroughClient.Promotrix - Class in org.archive.trough
- TroughClient.TroughException - Exception in org.archive.trough
- TroughClient.TroughNoReadUrlException - Exception in org.archive.trough
- TroughContentDigestHistory - Class in org.archive.modules.recrawl
-
AbstractContentDigestHistory implementation for trough.
- TroughContentDigestHistory() - Constructor for class org.archive.modules.recrawl.TroughContentDigestHistory
- TroughCrawlLogFeed - Class in org.archive.modules.postprocessor
-
Post insert statements for these two tables.
- TroughCrawlLogFeed() - Constructor for class org.archive.modules.postprocessor.TroughCrawlLogFeed
- TroughException(Exception) - Constructor for exception org.archive.trough.TroughClient.TroughException
- TroughException(String) - Constructor for exception org.archive.trough.TroughClient.TroughException
- TroughException(String, Throwable) - Constructor for exception org.archive.trough.TroughClient.TroughException
- TroughNoReadUrlException(String) - Constructor for exception org.archive.trough.TroughClient.TroughNoReadUrlException
U
- uncrawledBatch - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- uncrawledBatchLastTime - Variable in class org.archive.modules.postprocessor.TroughCrawlLogFeed
- UrlConsumer(Channel) - Constructor for class org.archive.crawler.frontier.AMQPUrlReceiver.UrlConsumer
- URLPattern - Static variable in class org.archive.modules.extractor.ExtractorPDFContent
- urlsPublished - Variable in class org.archive.modules.AMQPUrlWaiter
- urlsReceived - Variable in class org.archive.modules.AMQPUrlWaiter
W
- WARCLimitEnforcer - Class in org.archive.modules.postprocessor
- WARCLimitEnforcer() - Constructor for class org.archive.modules.postprocessor.WARCLimitEnforcer
- warcWriter - Variable in class org.archive.modules.postprocessor.WARCLimitEnforcer
- WbmPersistLoadProcessor - Class in org.archive.modules.recrawl.wbm
-
A Processor for retrieving recrawl info from remote Wayback Machine index.
- WbmPersistLoadProcessor() - Constructor for class org.archive.modules.recrawl.wbm.WbmPersistLoadProcessor
- WbmPersistLoadProcessor.FormatSegment - Interface in org.archive.modules.recrawl.wbm
- wrapped - Variable in class org.archive.modules.extractor.KnowledgableExtractorJS.CustomizedCrawlURIFacade
- write(int) - Method in class org.archive.modules.extractor.ExtractorYoutubeDL.NullOutputStream
- write(PrintWriter, StatisticsTracker) - Method in class org.archive.crawler.reporting.XmlCrawlSummaryReport
- write(String, String, Object[]) - Method in class org.archive.trough.TroughClient
- write(String, String, Object[], String) - Method in class org.archive.trough.TroughClient
- WRITE_SQL_TMPL - Static variable in class org.archive.modules.recrawl.TroughContentDigestHistory
- writeUrl(String, String) - Method in class org.archive.trough.TroughClient
- writeUrlCache - Variable in class org.archive.trough.TroughClient
- writeUrlNoCache(String, String) - Method in class org.archive.trough.TroughClient
X
- XmlCrawlSummaryReport - Class in org.archive.crawler.reporting
- XmlCrawlSummaryReport() - Constructor for class org.archive.crawler.reporting.XmlCrawlSummaryReport
Y
- YDL_CONTAINING_PAGE_DIGEST - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- YDL_CONTAINING_PAGE_TIMESTAMP - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- YDL_CONTAINING_PAGE_URI - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- YDL_JSON_FILE_DIGEST - Static variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- ydlLogger - Variable in class org.archive.modules.extractor.ExtractorYoutubeDL
- YoutubeDLResults(RandomAccessFile) - Constructor for class org.archive.modules.extractor.ExtractorYoutubeDL.YoutubeDLResults