Packages

package streaming

Type Members

  1. trait CheckpointFileManager extends AnyRef

    An interface to abstract out all operations related to streaming checkpoints. Most importantly, this interface provides createAtomic(path, overwrite), which returns a CancellableFSDataOutputStream. That method is used by HDFSMetadataLog and StateStore implementations to write a complete checkpoint file atomically (i.e. no partial file is ever visible), with or without overwrite.

    This higher-level interface above the Hadoop FileSystem is necessary because different implementations of FileSystem/FileContext may need different combinations of operations to provide the desired atomicity guarantees (e.g. write-to-temp-file-and-rename, direct-write-and-cancel-on-failure), and this abstraction allows different implementations while keeping the usage simple (createAtomic -> close or cancel).
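
    A minimal sketch of the createAtomic -> close-or-cancel pattern described above (the helper function and error handling are illustrative, not taken from Spark's sources):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.execution.streaming.CheckpointFileManager

    def writeCheckpointFile(fm: CheckpointFileManager, path: Path, bytes: Array[Byte]): Unit = {
      // createAtomic returns a CancellableFSDataOutputStream; nothing is visible at `path`
      // until close() succeeds.
      val out = fm.createAtomic(path, false)   // second argument: overwrite
      try {
        out.write(bytes)
        out.close()    // commit: the complete file becomes visible atomically
      } catch {
        case t: Throwable =>
          out.cancel() // abort: no partial file is left behind
          throw t
      }
    }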

  2. class CommitLog extends HDFSMetadataLog[CommitMetadata]

    Used to write log files that represent batch commit points in structured streaming. A commit log file will be written immediately after the successful completion of a batch, and before processing the next batch. Here is an execution summary:

    • trigger batch 1
    • obtain batch 1 offsets and write to offset log
    • process batch 1
    • write batch 1 to completion log
    • trigger batch 2
    • obtain batch 2 offsets and write to offset log
    • process batch 2
    • write batch 2 to completion log
    • ...

    The current format of the batch completion log is:
    line 1: version
    line 2: metadata (optional json string)
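
    For illustration, assuming the default CommitMetadata (no watermark set), a commit log file would then look roughly like:

    v1
    {"nextBatchWatermarkMs":0}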

  3. case class CommitMetadata(nextBatchWatermarkMs: Long = 0) extends Product with Serializable
  4. abstract class CompactibleFileStreamLog[T <: AnyRef] extends HDFSMetadataLog[Array[T]]

    An abstract class for compactible metadata logs. It will write one log file for each batch. The first line of the log file is the version number, and multiple serialized metadata lines follow.

    Because reading from many small files is usually slow, and too many small files in one directory can overwhelm the file system, CompactibleFileStreamLog compacts log files into a single large file every 10 batches by default. When doing a compaction, it reads all old log files and merges them with the new batch.

  5. case class ConsoleRelation(sqlContext: SQLContext, data: DataFrame) extends BaseRelation with Product with Serializable
  6. class ConsoleSinkProvider extends SimpleTableProvider with DataSourceRegister with CreatableRelationProvider
  7. class ContinuousRecordEndpoint extends ThreadSafeRpcEndpoint

    An RPC endpoint for continuous readers to poll for records from the driver.

  8. case class ContinuousRecordPartitionOffset(partitionId: Int, offset: Int) extends PartitionOffset with Product with Serializable
  9. case class ContinuousTrigger(intervalMs: Long) extends Trigger with Product with Serializable

    A Trigger that continuously processes streaming data, asynchronously checkpointing at the specified interval.

  10. case class EventTimeStats(max: Long, min: Long, avg: Double, count: Long) extends Product with Serializable

    Class for collecting event time stats with an accumulator

  11. class EventTimeStatsAccum extends AccumulatorV2[Long, EventTimeStats]

    Accumulator that collects stats on event time in a batch.

  12. case class EventTimeWatermarkExec(eventTime: Attribute, delay: CalendarInterval, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Used to mark a column as containing the event time for a given record. In addition to adding appropriate metadata to this column, this operator also tracks the maximum observed event time. Based on the maximum observed time and a user-specified delay, we can calculate the watermark after which we assume we will no longer see late records for a particular time period. Note that event time is measured in milliseconds.
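
    This operator is introduced by the public Dataset.withWatermark API; a minimal sketch (the column name and delay are illustrative):

    // Marks "eventTime" as the event-time column with a 10-minute delay;
    // the watermark is then roughly max(observed event time) - 10 minutes.
    val withEventTime = df.withWatermark("eventTime", "10 minutes")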

  13. class FileContextBasedCheckpointFileManager extends CheckpointFileManager with RenameHelperMethods with Logging

    An implementation of CheckpointFileManager using Hadoop's FileContext API.

  14. class FileStreamOptions extends Logging

    User specified options for file streams.

  15. class FileStreamSink extends Sink with Logging

    A sink that writes out results to parquet files. Each batch is written out to a unique directory. After all of the files in a batch have been successfully written, the list of file paths is appended to the log atomically. In the case of partial failures, some duplicate data may be present in the target directory, but only one copy of each file will be present in the log.
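
    A typical way to create this sink via the public DataStreamWriter API (the DataFrame name and paths are placeholders):

    val query = streamingDF.writeStream
      .format("parquet")
      .option("path", "/tmp/parquet-output")        // output directory (placeholder)
      .option("checkpointLocation", "/tmp/ckpt")    // location of the sink log (placeholder)
      .start()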

  16. class FileStreamSinkLog extends CompactibleFileStreamLog[SinkFileStatus]

    A special log for FileStreamSink. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of SinkFileStatus.

    As reading from many small files is usually pretty slow, FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compaction, it will read all old log files and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by SinkFileStatus.action). When the reader uses allFiles to list all files, this method only returns the visible files (drops the deleted files).

  17. class FileStreamSource extends SupportsAdmissionControl with Source with Logging

    A very simple source that reads files from the given directory as they appear.

  18. class FileStreamSourceLog extends CompactibleFileStreamLog[FileEntry]
  19. case class FileStreamSourceOffset(logOffset: Long) extends Offset with Product with Serializable

    Offset for the FileStreamSource.

    logOffset

    Position in the FileStreamSourceLog

  20. class FileSystemBasedCheckpointFileManager extends CheckpointFileManager with RenameHelperMethods with Logging

    An implementation of CheckpointFileManager using Hadoop's FileSystem API.

  21. case class FlatMapGroupsWithStateExec(func: (Any, Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any], keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], outputObjAttr: Attribute, stateInfo: Option[StatefulOperatorStateInfo], stateEncoder: ExpressionEncoder[Any], stateFormatVersion: Int, outputMode: OutputMode, timeoutConf: GroupStateTimeout, batchTimestampMs: Option[Long], eventTimeWatermark: Option[Long], child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with StateStoreWriter with WatermarkSupport with Product with Serializable

    Physical operator for executing FlatMapGroupsWithState. A usage sketch via the public API follows the parameter descriptions below.

    func

    function called on each group

    keyDeserializer

    used to extract the key object for each group.

    valueDeserializer

    used to extract the items in the iterator from an input row.

    groupingAttributes

    used to group the data

    dataAttributes

    used to read the data

    outputObjAttr

    Defines the output object

    stateEncoder

    used to serialize/deserialize state before calling func

    outputMode

    the output mode of func

    timeoutConf

    used to timeout groups that have not received data in a while

    batchTimestampMs

    processing timestamp of the current batch.
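
    As referenced above, a hedged sketch of a query that is planned into this operator, using the public flatMapGroupsWithState API (the event/count types and the counting logic are illustrative; assumes spark.implicits._ is in scope):

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(user: String, value: Long)
    case class RunningCount(user: String, count: Long)

    // Keeps a per-user count as state and emits the updated count on every trigger.
    val counts = events                      // events: Dataset[Event], read from a stream
      .groupByKey(_.user)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(
        (user: String, rows: Iterator[Event], state: GroupState[Long]) => {
          val newCount = state.getOption.getOrElse(0L) + rows.size
          state.update(newCount)
          Iterator(RunningCount(user, newCount))
        })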

  22. case class GetRecord(offset: ContinuousRecordPartitionOffset) extends Product with Serializable
  23. class HDFSMetadataLog[T <: AnyRef] extends MetadataLog[T] with Logging

    A MetadataLog implementation based on HDFS. HDFSMetadataLog uses the specified path as the metadata storage.

    When writing a new batch, HDFSMetadataLog will first write to a temp file and then rename it to the final batch file. If the rename step fails, there must be multiple concurrent writers; only one of them will succeed and the others will fail.

    Note: HDFSMetadataLog doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.

  24. class IncrementalExecution extends QueryExecution with Logging

    A variant of QueryExecution that allows the execution of the given LogicalPlan incrementally, possibly preserving state in between each execution.

  25. case class LongOffset(offset: Long) extends Offset with Product with Serializable

    A simple offset for sources that produce a single linear stream of data.

  26. class ManifestFileCommitProtocol extends FileCommitProtocol with Serializable with Logging

    A FileCommitProtocol that tracks the list of valid files in a manifest file, used in structured streaming.

  27. case class MemoryStream[A](id: Int, sqlContext: SQLContext, numPartitions: Option[Int] = None)(implicit evidence$4: Encoder[A]) extends MemoryStreamBase[A] with MicroBatchStream with Logging with Product with Serializable

    A Source that produces the values stored in memory as they are added by the user. This Source is intended for use in unit tests as it can only replay data when the object is still available.

    If numPartitions is provided, the rows will be redistributed to the given number of partitions.
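
    Typical test-only usage, assuming an active SparkSession named spark with spark.implicits._ in scope:

    implicit val sqlCtx = spark.sqlContext

    val input = MemoryStream[Int]            // in-memory source for tests
    input.addData(1, 2, 3)                   // becomes available to the next batch

    val query = input.toDF()
      .writeStream
      .format("memory")
      .queryName("numbers")
      .start()
    query.processAllAvailable()              // block until all added data is processed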

  28. abstract class MemoryStreamBase[A] extends SparkDataStream

    A base class for memory stream implementations. Supports adding data and resetting.

  29. class MemoryStreamInputPartition extends InputPartition
  30. class MemoryStreamScanBuilder extends ScanBuilder with Scan
  31. class MemoryStreamTable extends Table with SupportsRead
  32. trait MetadataLog[T] extends AnyRef

    A general MetadataLog that supports the following features (a sketch of the interface follows the list):

    • Allow the user to store a metadata object for each batch.
    • Allow the user to query the latest batch id.
    • Allow the user to query the metadata object of a specified batch id.
    • Allow the user to query metadata objects in a range of batch ids.
    • Allow the user to remove obsolete metadata
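
    A hedged sketch of what this interface looks like, reconstructed from the feature list above (signatures are an approximation, not copied verbatim from Spark's sources):

    trait MetadataLog[T] {
      // Store a metadata object for the given batch; returns false if the batch already exists.
      def add(batchId: Long, metadata: T): Boolean

      // Metadata for one batch, if it has been written.
      def get(batchId: Long): Option[T]

      // Metadata for a (possibly open-ended) range of batch ids.
      def get(startId: Option[Long], endId: Option[Long]): Array[(Long, T)]

      // The latest committed batch id and its metadata, if any.
      def getLatest(): Option[(Long, T)]

      // Remove metadata for batches older than the given threshold.
      def purge(thresholdBatchId: Long): Unit
    }
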
  33. class MetadataLogFileIndex extends PartitioningAwareFileIndex

    A FileIndex that generates the list of files to process by reading them from the metadata log files generated by the FileStreamSink.

  34. class MetricsReporter extends metrics.source.Source with Logging

    Serves metrics from an org.apache.spark.sql.streaming.StreamingQuery to Codahale/DropWizard metrics.

  35. class MicroBatchExecution extends StreamExecution
  36. sealed trait MultipleWatermarkPolicy extends AnyRef

    Policy to define how to choose a new global watermark value if there are multiple watermark operators in a streaming query.

  37. abstract class Offset extends connector.read.streaming.Offset

    This class is an alias of org.apache.spark.sql.connector.read.streaming.Offset. It's internal and deprecated. New streaming data source implementations should use the data source v2 API, which will be supported in the long term.

    This class will be removed in a future release.

  38. case class OffsetHolder(start: connector.read.streaming.Offset, end: connector.read.streaming.Offset) extends LeafNode with Product with Serializable
  39. case class OffsetSeq(offsets: Seq[Option[connector.read.streaming.Offset]], metadata: Option[OffsetSeqMetadata] = None) extends Product with Serializable

    An ordered collection of offsets, used to track the progress of processing data from one or more Sources that are present in a streaming query. This is similar to a simplified, single-instance vector clock that must progress linearly forward.

  40. class OffsetSeqLog extends HDFSMetadataLog[OffsetSeq]

    This class is used to log offsets to persistent files in HDFS. Each file corresponds to a specific batch of offsets. The file format contains a version string in the first line, followed by the JSON string representation of the offsets, separated by newline characters. If a source offset is missing, then that line will contain a string value defined in the SERIALIZED_VOID_OFFSET variable in the OffsetSeqLog companion object. For instance, when dealing with LongOffset types:

    v1          // version 1
    metadata
    {0}         // LongOffset 0
    {3}         // LongOffset 3
    -           // No offset for this source, i.e., an invalid JSON string
    {2}         // LongOffset 2
    ...
  41. case class OffsetSeqMetadata(batchWatermarkMs: Long = 0, batchTimestampMs: Long = 0, conf: Map[String, String] = Map.empty) extends Product with Serializable

    Contains metadata associated with an OffsetSeq. This information is persisted to the offset log in the checkpoint location via the OffsetSeq metadata field.

  42. case class OneTimeExecutor() extends TriggerExecutor with Product with Serializable

    A trigger executor that runs a single batch only, then terminates.

  43. case class ProcessingTimeExecutor(processingTimeTrigger: ProcessingTimeTrigger, clock: Clock = new SystemClock()) extends TriggerExecutor with Logging with Product with Serializable

    A trigger executor that runs a batch every intervalMs milliseconds.

  44. case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger with Product with Serializable

    A Trigger that runs a query periodically based on the processing time. If interval is 0, the query will run as fast as possible.
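
    This class backs the public Trigger.ProcessingTime factory; for example (the interval value and DataFrame name are illustrative):

    import org.apache.spark.sql.streaming.Trigger

    val query = streamingDF.writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))   // start a micro-batch every 10 seconds
      .format("console")
      .start()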

  45. trait ProgressReporter extends Logging

    Responsible for continually reporting statistics about the amount of data processed as well as latency for a streaming query. This trait is designed to be mixed into the StreamExecution, which is responsible for calling startTrigger and finishTrigger at the appropriate times. Additionally, the status can be updated with updateStatusMessage to allow reporting on the stream's current state (e.g. "Fetching more data").

  46. abstract class QueryExecutionThread extends UninterruptibleThread

    A special thread used to run the stream query. Some code is required to run in the QueryExecutionThread and uses classOf[QueryExecutionThread] to check.

  47. case class RateStreamOffset(partitionToValueAndRunTimeMs: Map[Int, ValueRunTimeMsPair]) extends connector.read.streaming.Offset with Product with Serializable
  48. case class SerializedOffset(json: String) extends Offset with Product with Serializable

    Used when loading a JSON serialized offset from external storage. We are currently not responsible for converting JSON serialized data into an internal (i.e., object) representation. Sources should define a factory method in their source Offset companion objects that accepts a SerializedOffset for doing the conversion.
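
    A hedged sketch of that factory-method pattern for a hypothetical source offset type (LongOffset in Spark follows the same shape):

    case class MyOffset(position: Long) extends Offset {
      override def json: String = position.toString
    }

    object MyOffset {
      // Convert a JSON-serialized offset loaded from the offset log back into a typed offset.
      def apply(serialized: SerializedOffset): MyOffset = MyOffset(serialized.json.toLong)
    }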

  49. trait Sink extends Table

    An interface for systems that can collect the results of a streaming query. In order to preserve exactly-once semantics, a sink must be idempotent in the face of multiple attempts to add the same batch.

    Note that this trait extends Table to keep the v1 streaming sink API compatible with data source v2.
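
    The central method of this interface is addBatch(batchId, data); a hedged sketch of an idempotent sink (the batch-id tracking and println output are illustrative):

    import org.apache.spark.sql.DataFrame

    class LoggingSink extends Sink {
      @volatile private var latestBatchId = -1L

      // addBatch may be called again for a batch that was already handled (e.g. after a
      // failure and restart); skipping such batches keeps the sink idempotent.
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId > latestBatchId) {
          data.collect().foreach(row => println(row))
          latestBatchId = batchId
        }
      }
    }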

  50. case class SinkFileStatus(path: String, size: Long, isDir: Boolean, modificationTime: Long, blockReplication: Int, blockSize: Long, action: String) extends Product with Serializable

    The status of a file written out by FileStreamSink. A file is visible only if it appears in the sink log and its action is not "delete".

    path

    the file path.

    size

    the file size.

    isDir

    whether this file is a directory.

    modificationTime

    the file last modification time.

    blockReplication

    the block replication.

    blockSize

    the block size.

    action

    the file action. Must be either "add" or "delete".

  51. trait Source extends SparkDataStream

    A source of continually arriving data for a streaming query. A Source must have a monotonically increasing notion of progress that can be represented as an Offset. Spark will regularly query each Source to see if any more data is available.

    Note that this trait extends SparkDataStream to keep the v1 streaming source API compatible with data source v2.
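
    A hedged outline of the v1 Source contract described above (signatures are an approximation, not copied verbatim from Spark's sources):

    // Types come from org.apache.spark.sql and org.apache.spark.sql.types.
    trait Source extends SparkDataStream {
      // Fixed schema of the data produced by this source.
      def schema: StructType

      // The most recent available offset, or None if no data has arrived yet.
      def getOffset: Option[Offset]

      // The data between start (exclusive; None means the very beginning) and end (inclusive).
      def getBatch(start: Option[Offset], end: Offset): DataFrame

      // Informs the source that data up to end has been processed and may be discarded.
      def commit(end: Offset): Unit

      // Release any resources held by the source.
      def stop(): Unit
    }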

  52. trait State extends AnyRef

    States for StreamExecution's lifecycle.

  53. trait StateStoreReader extends SparkPlan with StatefulOperator

    An operator that reads from a StateStore.

  54. case class StateStoreRestoreExec(keyExpressions: Seq[Attribute], stateInfo: Option[StatefulOperatorStateInfo], stateFormatVersion: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with StateStoreReader with Product with Serializable

    For each input tuple, the key is calculated and the value from the StateStore is added to the stream (in addition to the input tuple) if present.

  55. case class StateStoreSaveExec(keyExpressions: Seq[Attribute], stateInfo: Option[StatefulOperatorStateInfo] = None, outputMode: Option[OutputMode] = None, eventTimeWatermark: Option[Long] = None, stateFormatVersion: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with StateStoreWriter with WatermarkSupport with Product with Serializable

    For each input tuple, the key is calculated and the tuple is put into the StateStore.

  56. trait StateStoreWriter extends SparkPlan with StatefulOperator

    An operator that writes to a StateStore.

  57. trait StatefulOperator extends SparkPlan

    An operator that reads or writes state from the StateStore. The StatefulOperatorStateInfo should be filled in by prepareForExecution in IncrementalExecution.

  58. case class StatefulOperatorStateInfo(checkpointLocation: String, queryRunId: UUID, operatorId: Long, storeVersion: Long, numPartitions: Int) extends Product with Serializable

    Used to identify the state store for a given operator.

  59. abstract class StreamExecution extends StreamingQuery with ProgressReporter with Logging

    Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any Source present in the query plan. Whenever new data arrives, a QueryExecution is created and the results are committed transactionally to the given Sink.

  60. case class StreamMetadata(id: String) extends Product with Serializable

    Contains metadata associated with a StreamingQuery. This information is written in the checkpoint location the first time a query is started and recovered every time the query is restarted.

    id

    unique id of the StreamingQuery that needs to be persisted across restarts

  61. class StreamProgress extends Map[SparkDataStream, connector.read.streaming.Offset]

    A helper class that looks like a Map[Source, Offset].

  62. case class StreamingDeduplicateExec(keyExpressions: Seq[Attribute], child: SparkPlan, stateInfo: Option[StatefulOperatorStateInfo] = None, eventTimeWatermark: Option[Long] = None) extends SparkPlan with UnaryExecNode with StateStoreWriter with WatermarkSupport with Product with Serializable

    Physical operator for executing streaming Deduplicate.

  63. case class StreamingExecutionRelation(source: SparkDataStream, output: Seq[Attribute])(session: SparkSession) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a streaming Source of data into an org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.

  64. case class StreamingGlobalLimitExec(streamLimit: Long, child: SparkPlan, stateInfo: Option[StatefulOperatorStateInfo] = None, outputMode: Option[OutputMode] = None) extends SparkPlan with UnaryExecNode with StateStoreWriter with Product with Serializable

    A physical operator for executing a streaming limit, which makes sure no more than streamLimit rows are returned. This physical operator is only meant for logical limit operations that will get an input stream of rows that are effectively appends. For example:

    • a limit on any query in append mode
    • a limit before the aggregation in a streaming aggregation query in complete mode

  65. case class StreamingLocalLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with LimitExec with Product with Serializable

    A physical operator for executing limits locally on each partition. The main difference from LocalLimitExec is that this will fully consume the child plan's iterators to ensure that any stateful operation within the child commits all the state changes (many stateful operations commit state changes only after the iterator is consumed).

  66. class StreamingQueryListenerBus extends SparkListener with ListenerBus[StreamingQueryListener, Event]

    A bus to forward events to StreamingQueryListeners. This one will send received StreamingQueryListener.Events to the Spark listener bus. It also registers itself with the Spark listener bus, so that it can receive StreamingQueryListener.Events and dispatch them to StreamingQueryListeners.

    Note that each bus and its registered listeners are associated with a single SparkSession and StreamingQueryManager. So this bus will dispatch events to registered listeners for only those queries that were started in the associated SparkSession.

  67. class StreamingQueryWrapper extends StreamingQuery with Serializable

    Wraps the non-serializable StreamExecution to make the query serializable, as it's easy for it to get captured in normal usage. It's safe to capture the query but not to use it in executors. However, if the user tries to call its methods, it will throw IllegalStateException.

  68. case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute]) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a streaming DataSource into an org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader. It should be used to create a Source and converted to a StreamingExecutionRelation when being passed to StreamExecution to run a query.

  69. case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) extends SparkPlan with LeafExecNode with Product with Serializable

    A dummy physical plan for StreamingRelation to support org.apache.spark.sql.Dataset.explain

  70. case class StreamingRelationV2(source: TableProvider, sourceName: String, table: Table, extraOptions: CaseInsensitiveStringMap, output: Seq[Attribute], v1Relation: Option[StreamingRelation])(session: SparkSession) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a TableProvider into a streaming org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader, and should be converted before passing to StreamExecution.

  71. case class StreamingSymmetricHashJoinExec(leftKeys: Seq[Expression], rightKeys: Seq[Expression], joinType: JoinType, condition: JoinConditionSplitPredicates, stateInfo: Option[StatefulOperatorStateInfo], eventTimeWatermark: Option[Long], stateWatermarkPredicates: JoinStateWatermarkPredicates, stateFormatVersion: Int, left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with StateStoreWriter with Product with Serializable

    Performs a stream-stream join using the symmetric hash join algorithm. It works as follows.

                                 /-----------------------\
        left side input -------->|    left side state    |------\
                                 \-----------------------/      |
                                                                 |--------> joined output
                                 /-----------------------\      |
        right side input ------->|    right side state   |------/
                                 \-----------------------/

    Each join side buffers past input rows as streaming state so that the past input can be joined with future input on the other side. This buffer state is effectively a multi-map: equi-join key -> list of past input rows received with the join key

    For each input row in each side, the following operations take place:

    • Calculate the join key from the row.
    • Use the join key to append the row to the buffer state of the side that the row came from.
    • Find past buffered values for the key from the other side. For each such value, emit the "joined row" (left-row, right-row).
    • Apply the optional condition to filter the joined rows as the final output.

    If a timestamp column with an event time watermark is present in the join keys or in the input data, then the operator uses the watermark to figure out which rows in the buffer will not join with the new data and can therefore be discarded. Depending on the provided query conditions, we can define thresholds on both the state key (i.e. joining keys) and the state value (i.e. input rows). There are three kinds of queries possible regarding this, as explained below. Assume that a watermark has been defined on both the leftTime and rightTime columns used below.

    1. When timestamp/time-window + watermark is in the join keys. Example (pseudo-SQL):

    SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = window(rightTime, "1 hour") // 1hr tumbling windows

    In this case, this operator will join rows newer than watermark which fall in the same 1 hour window. Say the event-time watermark is "12:34" (both left and right input). Then input rows can only have time > 12:34. Hence, they can only join with buffered rows where window >= 12:00 - 1:00 and all buffered rows with join window < 12:00 can be discarded. In other words, the operator will discard all state where window in state key (i.e. join key) < event time watermark. This threshold is called State Key Watermark.

    2. When timestamp range conditions are provided (no time/window + watermark in join keys). E.g.

    SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR

    In this case, the event-time watermark and the BETWEEN condition can be used to calculate a state watermark, i.e., time threshold for the state rows that can be discarded. For example, say each join side has a time column, named "leftTime" and "rightTime", and there is a join condition "leftTime > rightTime - 8 min". While processing, say the watermark on right input is "12:34". This means that from henceforth, only right inputs rows with "rightTime > 12:34" will be processed, and any older rows will be considered as "too late" and therefore dropped. Then, the left side buffer only needs to keep rows where "leftTime > rightTime - 8 min > 12:34 - 8m > 12:26". That is, the left state watermark is 12:26, and any rows older than that can be dropped from the state. In other words, the operator will discard all state where timestamp in state value (input rows) < state watermark. This threshold is called State Value Watermark (to distinguish from the state key watermark).

    Note:

    • The event watermark value of one side is used to calculate the state watermark of the other side. That is, a condition ~ "leftTime > rightTime + X" with the right side event watermark is used to calculate the left side state watermark. Conversely, a condition ~ "leftTime < rightTime + Y" with the left side event watermark is used to calculate the right side state watermark.
    • Depending on the conditions, the state watermark may be different for the left and right side. In the above example, leftTime > 12:26 AND rightTime > 12:34 - 1 hour = 11:34.
    • State can be dropped from BOTH sides only when there are conditions of the above forms that define time bounds on timestamp in both directions.

    3. When both window in join key and time range conditions are present, case 1 + 2. In this case, since window equality is a stricter condition than the time range, we can use the State Key Watermark = event time watermark to discard state (similar to case 1).
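
    For reference, the time-range join in case 2 can be written with the public DataFrame API roughly as follows (the DataFrame names, watermark delays, and columns are illustrative and match the pseudo-SQL above):

    import org.apache.spark.sql.functions.expr

    val joined = leftTable
      .withWatermark("leftTime", "20 minutes")          // illustrative watermark delay
      .join(
        rightTable.withWatermark("rightTime", "30 minutes"),
        expr("""
          leftKey = rightKey AND
          leftTime > rightTime - interval 8 minutes AND
          leftTime < rightTime + interval 1 hour
        """))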

    leftKeys

    Expression to generate key rows for joining from left input

    rightKeys

    Expression to generate key rows for joining from right input

    joinType

    Type of join (inner, left outer, etc.)

    condition

    Conditions to filter rows, split by left, right, and joined. See JoinConditionSplitPredicates

    stateInfo

    Version information required to read join state (buffered rows)

    eventTimeWatermark

    Watermark of input event, same for both sides

    stateWatermarkPredicates

    Predicates for removal of state, see JoinStateWatermarkPredicates

    left

    Left child plan

    right

    Right child plan

  72. trait TriggerExecutor extends AnyRef
  73. case class ValueRunTimeMsPair(value: Long, runTimeMs: Long) extends Product with Serializable
  74. trait WatermarkSupport extends SparkPlan with UnaryExecNode

    An operator that supports watermark.

  75. case class WatermarkTracker(policy: MultipleWatermarkPolicy) extends Logging with Product with Serializable

    Tracks the watermark value of a streaming query based on a given policy

Value Members

  1. object ACTIVE extends State with Product with Serializable
  2. object CheckpointFileManager extends Logging
  3. object CleanSourceMode extends Enumeration
  4. object CommitLog
  5. object CommitMetadata extends Serializable
  6. object CompactibleFileStreamLog
  7. object ConsoleTable extends Table with SupportsWrite
  8. object ContinuousTrigger extends Serializable
  9. object EventTimeStats extends Serializable
  10. object FileStreamSink extends Logging
  11. object FileStreamSinkLog
  12. object FileStreamSource
  13. object FileStreamSourceLog
  14. object FileStreamSourceOffset extends Serializable
  15. object HDFSMetadataLog
  16. object INITIALIZING extends State with Product with Serializable
  17. object LongOffset extends Serializable
  18. object MaxWatermark extends MultipleWatermarkPolicy with Product with Serializable

    Policy to choose the *max* of the operator watermark values as the global watermark value. So the global watermark will advance if any of the individual operator watermarks has advanced. In other words, in a streaming query with multiple input streams and watermarks defined on all of them, the global watermark will advance as fast as the fastest input. So if there is watermark based state cleanup or late-data dropping, then this policy is the most aggressive one and may lead to unexpected behavior if the data of the slow stream is delayed.

  19. object MemoryStream extends Serializable
  20. object MemoryStreamReaderFactory extends PartitionReaderFactory
  21. object MemoryStreamTableProvider extends SimpleTableProvider
  22. object MicroBatchExecution
  23. object MinWatermark extends MultipleWatermarkPolicy with Product with Serializable

    Policy to choose the *min* of the operator watermark values as the global watermark value. Note that this is the safe (hence default) policy as the global watermark will advance only if all the individual operator watermarks have advanced. In other words, in a streaming query with multiple input streams and watermarks defined on all of them, the global watermark will advance as slowly as the slowest input. So if there is watermark based state cleanup or late-data dropping, then this policy is the most conservative one.
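
    The policy is chosen through the spark.sql.streaming.multipleWatermarkPolicy SQL configuration ("min" by default); for example:

    // Opt in to the more aggressive max policy for queries with multiple watermarked inputs.
    spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")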

  24. object MultipleWatermarkPolicy
  25. object OffsetSeq extends Serializable
  26. object OffsetSeqLog
  27. object OffsetSeqMetadata extends Logging with Serializable
  28. object OneTimeTrigger extends Trigger with Product with Serializable

    A Trigger that processes only one batch of data in a streaming query then terminates the query.

  29. object ProcessingTimeTrigger extends Serializable
  30. object RECONFIGURING extends State with Product with Serializable
  31. object SinkFileStatus extends Serializable
  32. object StreamExecution
  33. object StreamMetadata extends Logging with Serializable
  34. object StreamingDeduplicateExec extends Serializable
  35. object StreamingExecutionRelation extends Serializable
  36. object StreamingQueryListenerBus
  37. object StreamingRelation extends Serializable
  38. object StreamingSymmetricHashJoinHelper extends Logging

    Helper object for StreamingSymmetricHashJoinExec. See that object for more details.

  39. object TERMINATED extends State with Product with Serializable
  40. object WatermarkSupport extends Serializable
  41. object WatermarkTracker extends Serializable
