An abstract class for compactible metadata logs.
An ordered collection of offsets, used to track the progress of processing data from one or more Sources that are present in a streaming query. This is similar to a simplified, single-instance vector clock that must progress linearly forward.
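As a rough illustration of the vector-clock comparison this implies, here is a hypothetical sketch (the class name `CompositeOffset` and the `isAtOrBeyond` method are assumptions for illustration, not the actual Spark API): each source maps to a per-source offset, and one composite offset is at or beyond another only if every component has moved forward.

```scala
// Hypothetical sketch: a composite offset as a map from source name to a
// per-source offset, compared component-wise like a single-instance vector
// clock that must progress linearly forward.
case class CompositeOffset(offsets: Map[String, Long]) {
  // This offset is at or beyond `other` only if every component is >= its
  // counterpart; a missing component counts as not yet started.
  def isAtOrBeyond(other: CompositeOffset): Boolean =
    other.offsets.forall { case (source, off) =>
      offsets.getOrElse(source, -1L) >= off
    }
}

val earlier = CompositeOffset(Map("fileSource" -> 3L, "socketSource" -> 7L))
val later   = CompositeOffset(Map("fileSource" -> 5L, "socketSource" -> 7L))
```

The component-wise comparison is what makes progress "linear": no component may move backward between two composite offsets.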
User specified options for file streams.
A sink that writes out results to parquet files. Each batch is written out to a unique directory. After all of the files in a batch have been successfully written, the list of file paths is appended to the log atomically. In the case of partial failures, some duplicate data may be present in the target directory, but only one copy of each file will be present in the log.
A special log for FileStreamSink. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of SinkFileStatus.
Because reading from many small files is usually slow, FileStreamSinkLog compacts log files every "spark.sql.sink.file.log.compactLen" batches into a single large file. When doing a compaction, it reads all old log files and merges them with the new batch; it also drops the files that have been deleted (as marked by SinkFileStatus.action). When a reader uses allFiles to list all files, the method returns only the visible files and drops the deleted ones.
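The merge-and-drop step during compaction can be sketched as follows. This is a simplified illustration, not the actual Spark implementation: `FileEntry`, `compact`, and the flat string `action` field are assumptions standing in for the real log entry type.

```scala
// Hypothetical sketch of sink-log compaction: flatten all prior log batches
// together with the new batch, keep only the latest action recorded for each
// path, and drop any path whose latest action is "delete", so that a reader
// listing all files only sees the visible ones.
case class FileEntry(path: String, action: String) // action: "add" or "delete"

def compact(oldBatches: Seq[Seq[FileEntry]], newBatch: Seq[FileEntry]): Seq[FileEntry] = {
  val all = oldBatches.flatten ++ newBatch
  // Later entries overwrite earlier ones for the same path.
  val latest = all.foldLeft(Map.empty[String, FileEntry]) { (m, e) => m + (e.path -> e) }
  latest.values.filter(_.action == "add").toSeq.sortBy(_.path)
}
```

After compaction, a single file answers the "list all visible files" query without touching the many small per-batch logs.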
Writes data given to a FileStreamSink to the given basePath in the given fileFormat, partitioned by the given partitionColumnNames. This writer always appends data to the directory if it already has data.
A very simple source that reads files from the given directory as they appear.
A Sink that forwards all data into ForeachWriter according to the contract defined by ForeachWriter.
The expected type of the sink.
A MetadataLog implementation based on HDFS. HDFSMetadataLog uses the specified path as the metadata storage.
When writing a new batch, HDFSMetadataLog first writes to a temp file and then renames it to the final batch file. If the rename step fails, there must have been multiple concurrent writers; only one of them will succeed and the others will fail.
Note: HDFSMetadataLog doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.
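The write-then-rename pattern can be sketched with plain `java.nio.file` calls. This is a local-filesystem illustration of the idea only (the real class goes through the Hadoop FileSystem API); `writeBatch` and its race-handling are assumptions for this sketch.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical sketch: write batch metadata to a temp file first, then
// atomically move it to the final batch file. If the batch file already
// exists, another writer has committed this batch and we report failure.
def writeBatch(dir: Path, batchId: Long, metadata: String): Boolean = {
  val target = dir.resolve(batchId.toString)
  if (Files.exists(target)) return false // another writer already committed
  val tmp = Files.createTempFile(dir, s".$batchId", ".tmp")
  Files.write(tmp, metadata.getBytes(StandardCharsets.UTF_8))
  try {
    // The atomic rename is what makes a batch appear all-at-once to readers.
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
    true
  } catch {
    case _: java.nio.file.FileAlreadyExistsException =>
      Files.deleteIfExists(tmp)
      false
  }
}
```

The atomicity of the final rename is exactly the property the note above says S3-like stores cannot be relied on for.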
A variant of QueryExecution that allows the given LogicalPlan to be executed incrementally, possibly preserving state between executions.
A simple offset for sources that produce a single linear stream of data.
Used to query the data that has been written into a MemorySink.
A sink that stores the results in memory. This Sink is primarily intended for use in unit tests and does not provide durability.
A Source that produces the values stored in memory as they are added by the user.
A general MetadataLog that supports the following features:
- Allow the user to store a metadata object for each batch.
- Allow the user to query the latest batch id.
- Allow the user to retrieve metadata objects from any batch id.
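An in-memory sketch of such a log might look like the following. This is an illustration only, not the actual trait: `InMemoryMetadataLog` and its method names are assumptions chosen to mirror the described features.

```scala
// Hypothetical in-memory sketch of a MetadataLog: store one metadata object
// per batch, query the latest batch id, and fetch metadata for any batch.
class InMemoryMetadataLog[T] {
  private var entries = Map.empty[Long, T]

  // Adding the same batch twice returns false, mirroring one-writer-wins.
  def add(batchId: Long, metadata: T): Boolean =
    if (entries.contains(batchId)) false
    else { entries += batchId -> metadata; true }

  def get(batchId: Long): Option[T] = entries.get(batchId)

  def getLatest(): Option[(Long, T)] =
    entries.keys.reduceOption(_ max _).map(id => (id, entries(id)))
}
```

A durable implementation replaces the in-memory map with files named after batch ids, as in the HDFS-backed log described above.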
A FileCatalog that generates the list of files to process by reading them from the metadata log files generated by the FileStreamSink.
An offset is a monotonically increasing metric used to track progress in the computation of a stream.
Used to identify the state store for a given operator.
A trigger executor that runs a batch every intervalMs milliseconds.
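The scheduling arithmetic behind such an executor reduces to aligning the next batch start to an interval boundary. A minimal sketch (the function name `nextBatchTime` is an assumption for illustration):

```scala
// Hypothetical sketch: given the current time, compute the next batch start
// time aligned to the trigger interval. A batch that finishes early waits
// until the next boundary; a batch that finishes exactly on a boundary still
// targets the following one.
def nextBatchTime(now: Long, intervalMs: Long): Long =
  (now / intervalMs + 1) * intervalMs
```

For example, with a 100 ms interval, finishing at t = 105 schedules the next batch at t = 200 rather than drifting to t = 205.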
An interface for systems that can collect the results of a streaming query. In order to preserve exactly-once semantics, a sink must be idempotent in the face of multiple attempts to add the same batch.
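The idempotence requirement can be illustrated with a toy sink that remembers which batch ids it has committed. This is a sketch under assumed names (`IdempotentSink`, `addBatch` over plain strings), not the actual Sink trait:

```scala
// Hypothetical sketch of an idempotent sink: a retried batch id is a no-op,
// so re-attempts after a failure cannot duplicate data, which is what
// preserves exactly-once semantics end to end.
class IdempotentSink {
  private var committed = Set.empty[Long]
  private var rows = Vector.empty[String]

  def addBatch(batchId: Long, data: Seq[String]): Unit =
    if (!committed.contains(batchId)) {
      rows ++= data
      committed += batchId
    }

  def allRows: Seq[String] = rows
}
```

The key design point: the engine may call addBatch for the same batch more than once (for example after a driver restart), so deduplication must live in the sink, keyed by batch id.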
The status of a file output by FileStreamSink. A file is visible only if it appears in the sink log and its action is not "delete".
the file path.
the file size.
whether this file is a directory.
the file last modification time.
the block replication.
the block size.
the file action. Must be either "add" or "delete".
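The fields listed above can be gathered into a record like the following. This is a sketch of the shape only; the field names and the `isVisible` helper are assumptions chosen to match the descriptions, not the exact Spark definition:

```scala
// Hypothetical sketch of a sink file status record; a file is visible only
// if its action is not "delete".
case class SinkFileStatus(
    path: String,             // the file path
    size: Long,               // the file size in bytes
    isDir: Boolean,           // whether this entry is a directory
    modificationTime: Long,   // last modification time (epoch millis)
    blockReplication: Int,    // the block replication factor
    blockSize: Long,          // the block size in bytes
    action: String) {         // must be either "add" or "delete"
  require(action == "add" || action == "delete", s"invalid action: $action")
  def isVisible: Boolean = action != "delete"
}
```

One such record is serialized per JSON line in each sink log file, as described for FileStreamSinkLog above.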
A source of continually arriving data for a streaming query.
For each input tuple, the key is calculated and the value from the StateStore is added to the stream (in addition to the input tuple) if present.
For each input tuple, the key is calculated and the tuple is put into the StateStore.
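The restore-then-save pattern described by these two operators can be sketched as follows. This is a simplified stand-in (`SimpleStateStore`, `updateCounts` are assumed names), not the actual StateStore API:

```scala
// Hypothetical sketch of the save/restore pattern around a key-value state
// store: restore looks up the previous value for each input key, save writes
// the merged tuple back.
class SimpleStateStore {
  private var state = Map.empty[String, Long]

  def save(key: String, value: Long): Unit = state += key -> value

  def restore(key: String): Option[Long] = state.get(key)
}

// For a stream of (key, count) tuples, merge each count with the previously
// stored count for that key, as a streaming aggregation would.
def updateCounts(store: SimpleStateStore, input: Seq[(String, Long)]): Seq[(String, Long)] =
  input.map { case (k, v) =>
    val merged = store.restore(k).getOrElse(0L) + v
    store.save(k, merged)
    (k, merged)
  }
```

Splitting restore and save into separate operators lets the aggregation in between stay oblivious to state management.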
An operator that saves or restores state from the StateStore. The OperatorStateId should be filled in by prepareForExecution in IncrementalExecution.
Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any Source present in the query plan. Whenever new data arrives, a QueryExecution is created and the results are committed transactionally to the given Sink.
A special thread used to run the stream query. Some code must run in the StreamExecutionThread and will use classOf[StreamExecutionThread] to check.
Class that manages all the metrics related to a StreamingQuery. It does the following:
- Calculates metrics (rates, latencies, etc.) based on information reported by StreamExecution.
- Allows the current metric values to be queried.
- Serves some of the metrics through Codahale/DropWizard metrics.
A helper class that looks like a Map[Source, Offset].
Used to link a streaming Source of data into a org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.
A bus to forward events to StreamingQueryListeners. This one will send received StreamingQueryListener.Events to the Spark listener bus. It also registers itself with Spark listener bus, so that it can receive StreamingQueryListener.Events and dispatch them to StreamingQueryListener.
Used to link a streaming DataSource into a org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader. It should be used to create a Source and converted to a StreamingExecutionRelation when passed to StreamExecution to run a query.
A dummy physical plan for StreamingRelation to support org.apache.spark.sql.Dataset.explain
A source that reads text lines through a TCP socket, designed only for tutorials and debugging. This source will *not* work in production applications due to multiple reasons, including no support for fault recovery and keeping all of the text read in memory forever.
An abstract class for compactible metadata logs. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple serialized metadata lines following.
Because reading from many small files is usually slow, and too many small files in one directory can degrade the file system, CompactibleFileStreamLog will compact log files (every 10 batches by default) into a single large file. When doing a compaction, it will read all old log files and merge them with the new batch.