Package org.apache.spark.sql.execution.streaming

package streaming

Type Members

  1. abstract class CompactibleFileStreamLog[T <: AnyRef] extends HDFSMetadataLog[Array[T]]

    An abstract class for compactible metadata logs. It writes one log file for each batch. The first line of the log file is the version number, followed by multiple serialized metadata lines.

    Because reading many small files is usually slow, and too many small files in one directory strain the file system, CompactibleFileStreamLog compacts log files into one large file every 10 batches by default. During a compaction, it reads all old log files and merges them with the new batch.
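
    The compaction schedule described above can be sketched as follows. This is an illustrative model, not Spark's API: the names CompactionSchedule, isCompactionBatch, and batchesToMerge are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CompactionSchedule {
    // A batch triggers compaction when it completes a window of `interval`
    // batches: with interval 10, batches 9, 19, 29, ... are compaction batches.
    static boolean isCompactionBatch(long batchId, int interval) {
        return (batchId + 1) % interval == 0;
    }

    // Log files merged into the compacted file for `batchId`: the previous
    // compacted file (if any) plus every batch after it, up to `batchId`.
    static List<Long> batchesToMerge(long batchId, int interval) {
        long previousCompaction = batchId - interval; // negative means "none yet"
        List<Long> ids = new ArrayList<>();
        for (long b = Math.max(0, previousCompaction); b <= batchId; b++) {
            ids.add(b);
        }
        return ids;
    }
}
```

    With the default interval of 10, batch 9 compacts batches 0-9, and batch 19 merges the compacted file for batch 9 with batches 10-19.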

  2. class ConsoleSink extends Sink with Logging

  3. class ConsoleSinkProvider extends StreamSinkProvider with DataSourceRegister

  4. case class EventTimeStats(max: Long, min: Long, sum: Long, count: Long) extends Product with Serializable

    Class for collecting event time stats with an accumulator.

  5. class EventTimeStatsAccum extends AccumulatorV2[Long, EventTimeStats]


    Accumulator that collects stats on event time in a batch.

  6. case class EventTimeWatermarkExec(eventTime: Attribute, delay: CalendarInterval, child: SparkPlan) extends SparkPlan with Product with Serializable

    Used to mark a column as containing the event time for a given record. In addition to adding appropriate metadata to this column, this operator also tracks the maximum observed event time. Based on the maximum observed time and a user-specified delay, we can calculate the watermark after which we assume we will no longer see late records for a particular time period. Note that event time is measured in milliseconds.
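
    The watermark rule can be illustrated with a minimal sketch (WatermarkSketch is a hypothetical name; EventTimeWatermarkExec itself operates on SparkPlan rows):

```java
class WatermarkSketch {
    private long maxEventTimeMs = Long.MIN_VALUE;

    // Track the maximum event time seen so far in the stream.
    void observe(long eventTimeMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
    }

    // Watermark = max observed event time minus the user-specified delay.
    // Records with event time at or below this value are treated as late.
    long watermarkMs(long delayMs) {
        return maxEventTimeMs - delayMs;
    }
}
```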

  7. class FileStreamOptions extends Logging


    User specified options for file streams.

  8. class FileStreamSink extends Sink with Logging

    A sink that writes out results to parquet files. Each batch is written out to a unique directory. After all of the files in a batch have been successfully written, the list of file paths is appended to the log atomically. In the case of partial failures, some duplicate data may be present in the target directory, but only one copy of each file will be present in the log.

  9. class FileStreamSinkLog extends CompactibleFileStreamLog[SinkFileStatus]

    A special log for FileStreamSink. It writes one log file for each batch. The first line of the log file is the version number, followed by multiple JSON lines. Each JSON line is the JSON representation of a SinkFileStatus.

    Because reading many small files is usually slow, FileStreamSinkLog compacts log files into one large file every "spark.sql.streaming.fileSink.log.compactInterval" batches. During a compaction, it reads all old log files and merges them with the new batch, also dropping entries for files that have been deleted (marked by SinkFileStatus.action). When the reader uses allFiles to list all files, only the visible files are returned (deleted files are dropped).
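
    The visibility rule for allFiles can be sketched as a log replay; the types here are simplified stand-ins for SinkFileStatus, not Spark's classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SinkLogReplay {
    // Replay log entries (path, action) in order; the latest action for a
    // path wins, and only paths whose latest action is not "delete" are visible.
    static List<String> visibleFiles(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] entry : log) {
            latest.put(entry[0], entry[1]);
        }
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            if (!"delete".equals(e.getValue())) visible.add(e.getKey());
        }
        return visible;
    }
}
```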

  10. class FileStreamSource extends Source with Logging


    A very simple source that reads files from the given directory as they appear.

  11. class FileStreamSourceLog extends CompactibleFileStreamLog[FileEntry]

  12. case class FileStreamSourceOffset(logOffset: Long) extends Offset with Product with Serializable

    Offset for the FileStreamSource.

    logOffset

    Position in the FileStreamSourceLog

  13. class ForeachSink[T] extends Sink with Serializable

    A Sink that forwards all data into ForeachWriter according to the contract defined by ForeachWriter.

    T

    The expected type of the sink.

  14. class HDFSMetadataLog[T <: AnyRef] extends MetadataLog[T] with Logging

    A MetadataLog implementation based on HDFS. HDFSMetadataLog uses the specified path as the metadata storage.

    When writing a new batch, HDFSMetadataLog first writes to a temp file and then renames it to the final batch file. If the rename step fails, there must be multiple concurrent writers; only one of them will succeed and the others will fail.

    Note: HDFSMetadataLog doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.
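
    A minimal sketch of the write-then-rename protocol, using java.nio on a local filesystem for illustration (HDFSMetadataLog itself goes through the Hadoop FileSystem API, where rename fails if the destination exists):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicBatchWrite {
    // Write the batch payload to a temp file, then rename it to the final
    // batch file. HDFS rename fails when the destination exists; on a local
    // filesystem we approximate that with an existence check.
    static boolean writeBatch(Path dir, long batchId, byte[] payload) throws IOException {
        Path tmp = dir.resolve("." + batchId + ".tmp");
        Path fin = dir.resolve(Long.toString(batchId));
        Files.write(tmp, payload);
        if (Files.exists(fin)) {          // another writer already won the race
            Files.deleteIfExists(tmp);
            return false;
        }
        Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);
        return true;
    }
}
```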

  15. class IncrementalExecution extends QueryExecution with Logging

    A variant of QueryExecution that allows the execution of the given LogicalPlan incrementally, possibly preserving state in between each execution.

  16. case class LongOffset(offset: Long) extends Offset with Product with Serializable


    A simple offset for sources that produce a single linear stream of data.

  17. class ManifestFileCommitProtocol extends FileCommitProtocol with Serializable with Logging


    A FileCommitProtocol that tracks the list of valid files in a manifest file, used in structured streaming.

  18. case class MemoryPlan(sink: MemorySink, output: Seq[Attribute]) extends LeafNode with Product with Serializable


    Used to query the data that has been written into a MemorySink.

  19. class MemorySink extends Sink with Logging

    A sink that stores the results in memory. This Sink is primarily intended for use in unit tests and does not provide durability.

  20. case class MemoryStream[A](id: Int, sqlContext: SQLContext)(implicit evidence$2: Encoder[A]) extends Source with Logging with Product with Serializable

    A Source that produces values stored in memory as they are added by the user. This Source is primarily intended for use in unit tests as it can only replay data when the object is still available.

  21. trait MetadataLog[T] extends AnyRef

    A general MetadataLog that supports the following features:

    • Allow the user to store a metadata object for each batch.
    • Allow the user to query the latest batch id.
    • Allow the user to query the metadata object of a specified batch id.
    • Allow the user to query metadata objects in a range of batch ids.
    • Allow the user to remove obsolete metadata.
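
    A minimal in-memory sketch of this contract (illustrative only; the names and signatures are hypothetical, not Spark's MetadataLog trait):

```java
import java.util.OptionalLong;
import java.util.SortedMap;
import java.util.TreeMap;

class InMemoryMetadataLog<T> {
    private final TreeMap<Long, T> batches = new TreeMap<>();

    // Store a metadata object for a batch; returns false if one already exists.
    boolean add(long batchId, T metadata) {
        return batches.putIfAbsent(batchId, metadata) == null;
    }

    // Latest batch id, if any batch has been logged.
    OptionalLong latestBatchId() {
        return batches.isEmpty() ? OptionalLong.empty() : OptionalLong.of(batches.lastKey());
    }

    // Metadata for a single batch id (null if absent).
    T get(long batchId) {
        return batches.get(batchId);
    }

    // Metadata objects for an inclusive range of batch ids.
    SortedMap<Long, T> range(long start, long end) {
        return batches.subMap(start, true, end, true);
    }

    // Remove obsolete metadata for batches strictly below the threshold.
    void purge(long thresholdBatchId) {
        batches.headMap(thresholdBatchId).clear();
    }
}
```
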
  22. class MetadataLogFileIndex extends PartitioningAwareFileIndex

    A FileIndex that generates the list of files to process by reading them from the metadata log files generated by the FileStreamSink.

  23. class MetricsReporter extends metrics.source.Source with Logging

    Serves metrics from an org.apache.spark.sql.streaming.StreamingQuery to Codahale/DropWizard metrics.

  24. abstract class Offset extends AnyRef

    An offset is a monotonically increasing metric used to track progress in the computation of a stream. Since offsets are retrieved from a Source by a single thread, we know the global ordering of two Offset instances. We assume that if two offsets are equal, then no new data has arrived.

  25. case class OffsetSeq(offsets: Seq[Option[Offset]], metadata: Option[OffsetSeqMetadata] = None) extends Product with Serializable

    An ordered collection of offsets, used to track the progress of processing data from one or more Sources that are present in a streaming query. This is similar to a simplified, single-instance vector clock that must progress linearly forward.

  26. class OffsetSeqLog extends HDFSMetadataLog[OffsetSeq]

    This class is used to log offsets to persistent files in HDFS. Each file corresponds to a specific batch of offsets. The file format contains a version string in the first line, followed by the JSON string representations of the offsets, separated by newline characters. If a source offset is missing, that line contains the string value defined in the SERIALIZED_VOID_OFFSET variable in the OffsetSeqLog companion object. For instance, when dealing with LongOffset types:

      v1          // version 1
      metadata
      {0}         // LongOffset 0
      {3}         // LongOffset 3
      -           // No offset for this source, i.e., an invalid JSON string
      {2}         // LongOffset 2
      ...
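
    The serialization scheme can be sketched as follows; "-" stands in for SERIALIZED_VOID_OFFSET, the JSON shapes are illustrative, and the method name is hypothetical:

```java
import java.util.List;
import java.util.Optional;

class OffsetSeqFormat {
    // First line: version string; then one line per source, with "-" standing
    // in for a missing offset (the SERIALIZED_VOID_OFFSET placeholder).
    static String serialize(String version, List<Optional<Long>> offsets) {
        StringBuilder sb = new StringBuilder(version);
        for (Optional<Long> o : offsets) {
            sb.append('\n').append(o.map(v -> "{" + v + "}").orElse("-"));
        }
        return sb.toString();
    }
}
```
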
  27. case class OffsetSeqMetadata(batchWatermarkMs: Long = 0, batchTimestampMs: Long = 0) extends Product with Serializable

    Contains metadata associated with an OffsetSeq. This information is persisted to the offset log in the checkpoint location via the OffsetSeq metadata field.

  28. case class OperatorStateId(checkpointLocation: String, operatorId: Long, batchId: Long) extends Product with Serializable


    Used to identify the state store for a given operator.

  29. case class ProcessingTimeExecutor(processingTime: ProcessingTime, clock: Clock = new SystemClock()) extends TriggerExecutor with Logging with Product with Serializable


    A trigger executor that runs a batch every intervalMs milliseconds.
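
    The fixed-interval schedule can be sketched with a small helper (illustrative only, not Spark's TriggerExecutor API): each trigger fires on an interval boundary, so a batch that overruns its slot waits for the next boundary rather than firing twice in a row.

```java
class IntervalTrigger {
    // Next trigger fires at the next multiple of the interval strictly
    // after `nowMs`.
    static long nextTriggerTimeMs(long nowMs, long intervalMs) {
        return (nowMs / intervalMs + 1) * intervalMs;
    }
}
```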

  30. trait ProgressReporter extends Logging

    Responsible for continually reporting statistics about the amount of data processed, as well as latency, for a streaming query. This trait is designed to be mixed into the StreamExecution, which is responsible for calling startTrigger and finishTrigger at the appropriate times. Additionally, the status can be updated with updateStatusMessage to allow reporting on the stream's current state (e.g. "Fetching more data").

  31. case class SerializedOffset(json: String) extends Offset with Product with Serializable

    Used when loading a JSON serialized offset from external storage. We are currently not responsible for converting JSON serialized data into an internal (i.e., object) representation. Sources should define a factory method in their source Offset companion objects that accepts a SerializedOffset for doing the conversion.

  32. trait Sink extends AnyRef

    An interface for systems that can collect the results of a streaming query. In order to preserve exactly-once semantics, a sink must be idempotent in the face of multiple attempts to add the same batch.
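
    The idempotency requirement can be sketched as follows (illustrative only; Spark's Sink.addBatch takes a DataFrame, not a list of strings):

```java
import java.util.ArrayList;
import java.util.List;

class IdempotentSink {
    private long latestBatchId = -1;
    private final List<String> committed = new ArrayList<>();

    // Re-delivering an already-committed batch is a no-op, so a retry after
    // a partial failure cannot duplicate data in the sink.
    void addBatch(long batchId, List<String> data) {
        if (batchId <= latestBatchId) return;
        committed.addAll(data);
        latestBatchId = batchId;
    }

    List<String> contents() {
        return committed;
    }
}
```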

  33. case class SinkFileStatus(path: String, size: Long, isDir: Boolean, modificationTime: Long, blockReplication: Int, blockSize: Long, action: String) extends Product with Serializable

    The status of a file output by FileStreamSink. A file is visible only if it appears in the sink log and its action is not "delete".

    path

    the file path.

    size

    the file size.

    isDir

    whether this file is a directory.

    modificationTime

    the file last modification time.

    blockReplication

    the block replication.

    blockSize

    the block size.

    action

    the file action. Must be either "add" or "delete".

  34. trait Source extends AnyRef

    A source of continually arriving data for a streaming query. A Source must have a monotonically increasing notion of progress that can be represented as an Offset. Spark will regularly query each Source to see if any more data is available.

  35. case class StateStoreRestoreExec(keyExpressions: Seq[Attribute], stateId: Option[OperatorStateId], child: SparkPlan) extends SparkPlan with UnaryExecNode with StatefulOperator with Product with Serializable


    For each input tuple, the key is calculated and the value from the StateStore is added to the stream (in addition to the input tuple) if present.

  36. case class StateStoreSaveExec(keyExpressions: Seq[Attribute], stateId: Option[OperatorStateId] = None, outputMode: Option[OutputMode] = None, eventTimeWatermark: Option[Long] = None, child: SparkPlan) extends SparkPlan with UnaryExecNode with StatefulOperator with Product with Serializable


    For each input tuple, the key is calculated and the tuple is put into the StateStore.

  37. trait StatefulOperator extends SparkPlan

    An operator that saves or restores state from the StateStore. The OperatorStateId should be filled in by prepareForExecution in IncrementalExecution.

  38. class StreamExecution extends StreamingQuery with ProgressReporter with Logging

    Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any Source present in the query plan. Whenever new data arrives, a QueryExecution is created and the results are committed transactionally to the given Sink.

  39. abstract class StreamExecutionThread extends UninterruptibleThread

    A special thread used to run the stream query. Some code is required to run in a StreamExecutionThread and uses classOf[StreamExecutionThread] to check for it.

  40. case class StreamMetadata(id: String) extends Product with Serializable

    Contains metadata associated with a StreamingQuery. This information is written to the checkpoint location the first time a query is started and recovered every time the query is restarted.

    id

    unique id of the StreamingQuery that needs to be persisted across restarts

  41. class StreamProgress extends Map[Source, Offset]


    A helper class that looks like a Map[Source, Offset].

  42. case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode with Product with Serializable


    Used to link a streaming Source of data into a org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.

  43. class StreamingQueryListenerBus extends SparkListener with ListenerBus[StreamingQueryListener, Event]

    A bus to forward events to StreamingQueryListeners. This bus sends received StreamingQueryListener.Events to the Spark listener bus. It also registers itself with the Spark listener bus, so that it can receive StreamingQueryListener.Events and dispatch them to StreamingQueryListeners.

    Note that each bus and its registered listeners are associated with a single SparkSession and StreamingQueryManager. So this bus will dispatch events to registered listeners for only those queries that were started in the associated SparkSession.

  44. class StreamingQueryWrapper extends StreamingQuery with Serializable

    Wraps the non-serializable StreamExecution to make the query serializable, as it is easy for it to get captured with normal usage. It is safe to capture the query, but not to use it on executors; if the user tries to call its methods there, it will throw an IllegalStateException.

  45. case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute]) extends LeafNode with Product with Serializable

    Used to link a streaming DataSource into a org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader. It should be used to create a Source and converted to a StreamingExecutionRelation when passed to StreamExecution to run a query.

  46. case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) extends SparkPlan with LeafExecNode with Product with Serializable

    A dummy physical plan for StreamingRelation to support org.apache.spark.sql.Dataset.explain.

  47. class TextSocketSource extends Source with Logging

    A source that reads text lines through a TCP socket, designed only for tutorials and debugging. This source will *not* work in production applications due to multiple reasons, including no support for fault recovery and keeping all of the text read in memory forever.

  48. class TextSocketSourceProvider extends StreamSourceProvider with DataSourceRegister with Logging

  49. trait TriggerExecutor extends AnyRef


Value Members

  1. object CompactibleFileStreamLog
  2. object EventTimeStats extends Serializable
  3. object FileStreamSink
  4. object FileStreamSinkLog
  5. object FileStreamSource
  6. object FileStreamSourceLog
  7. object FileStreamSourceOffset extends Serializable
  8. object HDFSMetadataLog
  9. object LongOffset extends Serializable
  10. object MemoryStream extends Serializable
  11. object OffsetSeq extends Serializable
  12. object OffsetSeqLog
  13. object OffsetSeqMetadata extends Serializable
  14. object SinkFileStatus extends Serializable
  15. object StreamMetadata extends Logging with Serializable
  16. object StreamingExecutionRelation extends Serializable
  17. object StreamingRelation extends Serializable
  18. object TextSocketSource
  19. package state