package streaming
Type Members
- trait CheckpointFileManager extends AnyRef
An interface to abstract out all operations related to streaming checkpoints. Most importantly, the key operation this interface provides is createAtomic(path, overwrite), which returns a CancellableFSDataOutputStream. This method is used by HDFSMetadataLog and StateStore implementations to write a complete checkpoint file atomically (i.e. no partial file will be visible), with or without overwrite.
This higher-level interface above the Hadoop FileSystem is necessary because different implementations of FileSystem/FileContext may provide different combinations of operations to achieve the desired atomic guarantees (e.g. write-to-temp-file-and-rename, direct-write-and-cancel-on-failure), and this abstraction allows different implementations while keeping the usage simple (createAtomic -> close or cancel).
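A minimal sketch of the createAtomic -> close or cancel pattern described above. The companion's create(path, conf) factory and the exact createAtomic argument shape are assumptions and may differ across Spark versions.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.streaming.CheckpointFileManager

// Sketch only: the factory and argument names below are assumed, not guaranteed.
val hadoopConf = new Configuration()
val checkpointDir = new Path("/tmp/checkpoint-example")
val fm = CheckpointFileManager.create(checkpointDir, hadoopConf)

// createAtomic returns a CancellableFSDataOutputStream: close() commits the file
// atomically, cancel() aborts so no partial file is ever visible.
val out = fm.createAtomic(new Path(checkpointDir, "metadata"), false)
try {
  out.write("v1".getBytes("UTF-8"))
  out.close()
} catch {
  case e: Throwable =>
    out.cancel()
    throw e
}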
- class CommitLog extends HDFSMetadataLog[CommitMetadata]
Used to write log files that represent batch commit points in structured streaming. A commit log file is written immediately after the successful completion of a batch, and before processing the next batch. The execution summary is: trigger batch 1; obtain batch 1 offsets and write them to the offset log; process batch 1; write batch 1 to the completion log; trigger batch 2; obtain batch 2 offsets and write them to the offset log; process batch 2; write batch 2 to the completion log; and so on.
The current format of the batch completion log is: line 1: version; line 2: metadata (optional JSON string).
- case class CommitMetadata(nextBatchWatermarkMs: Long = 0) extends Product with Serializable
- abstract class CompactibleFileStreamLog[T <: AnyRef] extends HDFSMetadataLog[Array[T]]
An abstract class for compactible metadata logs. It writes one log file for each batch. The first line of the log file is the version number, followed by multiple serialized metadata lines.
Because reading from many small files is usually slow, and too many small files in one folder can overwhelm the file system, CompactibleFileStreamLog compacts log files every 10 batches by default into a big file. When doing a compaction, it reads all old log files and merges them with the new batch.
- case class ConsoleRelation(sqlContext: SQLContext, data: DataFrame) extends BaseRelation with Product with Serializable
- class ConsoleSinkProvider extends SimpleTableProvider with DataSourceRegister with CreatableRelationProvider
- class ContinuousRecordEndpoint extends ThreadSafeRpcEndpoint
An RPC endpoint for continuous readers to poll for records from the driver.
- case class ContinuousRecordPartitionOffset(partitionId: Int, offset: Int) extends PartitionOffset with Product with Serializable
- case class ContinuousTrigger(intervalMs: Long) extends Trigger with Product with Serializable
A Trigger that continuously processes streaming data, asynchronously checkpointing at the specified interval.
- case class EventTimeStats(max: Long, min: Long, avg: Double, count: Long) extends Product with Serializable
Class for collecting event time stats with an accumulator.
- class EventTimeStatsAccum extends AccumulatorV2[Long, EventTimeStats]
Accumulator that collects stats on event time in a batch.
- case class EventTimeWatermarkExec(eventTime: Attribute, delay: CalendarInterval, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Used to mark a column as containing the event time for a given record. In addition to adding appropriate metadata to this column, this operator also tracks the maximum observed event time. Based on the maximum observed time and a user-specified delay, we can calculate the watermark after which we assume we will no longer see late records for a particular time period. Note that event time is measured in milliseconds.
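For reference, this operator is planned when the public withWatermark API is used. A small example of the user-facing side, using the built-in rate source as a stand-in input:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().master("local[*]").appName("watermark-example").getOrCreate()
import spark.implicits._

// The rate source is a built-in test source with `timestamp` and `value` columns.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// withWatermark plans an event-time watermark node; the 10-minute delay is how late
// records may arrive before their window is considered final.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()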
- class FileContextBasedCheckpointFileManager extends CheckpointFileManager with RenameHelperMethods with Logging
An implementation of CheckpointFileManager using Hadoop's FileContext API.
- class FileStreamOptions extends Logging
User-specified options for file streams.
- class FileStreamSink extends Sink with Logging
A sink that writes out results to parquet files. Each batch is written out to a unique directory. After all of the files in a batch have been successfully written, the list of file paths is appended to the log atomically. In the case of partial failures, some duplicate data may be present in the target directory, but only one copy of each file will be present in the log.
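A short example of the user-facing side: writing a stream with a file-based format (parquet here) goes through FileStreamSink. The paths below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("file-sink-example").getOrCreate()

val input = spark.readStream.format("rate").load()

// The output and checkpoint locations are required options for the file sink.
val query = input.writeStream
  .format("parquet")
  .option("path", "/tmp/stream-output")
  .option("checkpointLocation", "/tmp/stream-checkpoint")
  .start()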
- class FileStreamSinkLog extends CompactibleFileStreamLog[SinkFileStatus]
A special log for FileStreamSink. It writes one log file for each batch. The first line of the log file is the version number, and multiple JSON lines follow. Each JSON line is the JSON representation of a SinkFileStatus.
As reading from many small files is usually pretty slow, FileStreamSinkLog compacts log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compaction, it reads all old log files and merges them with the new batch. During the compaction, it also deletes the files that have been deleted (marked by SinkFileStatus.action). When the reader uses allFiles to list all files, this method only returns the visible files (dropping the deleted files).
- class FileStreamSource extends SupportsAdmissionControl with Source with Logging
A very simple source that reads files from the given directory as they appear.
- class FileStreamSourceLog extends CompactibleFileStreamLog[FileEntry]
- case class FileStreamSourceOffset(logOffset: Long) extends Offset with Product with Serializable
Offset for the FileStreamSource.
- logOffset
Position in the FileStreamSourceLog
- class FileSystemBasedCheckpointFileManager extends CheckpointFileManager with RenameHelperMethods with Logging
An implementation of CheckpointFileManager using Hadoop's FileSystem API.
- case class FlatMapGroupsWithStateExec(func: (Any, Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any], keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], outputObjAttr: Attribute, stateInfo: Option[StatefulOperatorStateInfo], stateEncoder: ExpressionEncoder[Any], stateFormatVersion: Int, outputMode: OutputMode, timeoutConf: GroupStateTimeout, batchTimestampMs: Option[Long], eventTimeWatermark: Option[Long], child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with StateStoreWriter with WatermarkSupport with Product with Serializable
Physical operator for executing FlatMapGroupsWithState. A usage sketch of the corresponding user-facing API follows the parameter list below.
- func
function called on each group
- keyDeserializer
used to extract the key object for each group.
- valueDeserializer
used to extract the items in the iterator from an input row.
- groupingAttributes
used to group the data
- dataAttributes
used to read the data
- outputObjAttr
Defines the output object
- stateEncoder
used to serialize/deserialize state before calling func
- outputMode
the output mode of func
- timeoutConf
used to timeout groups that have not received data in a while
- batchTimestampMs
processing timestamp of the current batch.
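As mentioned above, this node is planned from the public flatMapGroupsWithState API. A hedged sketch of that user-facing call; the Event record type and countEvents function are illustrative only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder().master("local[*]").appName("fmgws-example").getOrCreate()
import spark.implicits._

// Hypothetical record type for the example.
case class Event(user: String, action: String)

// State function: keeps a running count per key in the state store.
def countEvents(
    user: String,
    events: Iterator[Event],
    state: GroupState[Long]): Iterator[(String, Long)] = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)        // persisted between micro-batches via the state store
  Iterator((user, newCount))
}

val input = spark.readStream.format("rate").load()
  .selectExpr("CAST(value AS STRING) AS user", "'click' AS action")
  .as[Event]

// The planner turns this call into a FlatMapGroupsWithStateExec node.
val counts = input
  .groupByKey(_.user)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(countEvents)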
- case class GetRecord(offset: ContinuousRecordPartitionOffset) extends Product with Serializable
- class HDFSMetadataLog[T <: AnyRef] extends MetadataLog[T] with Logging
A MetadataLog implementation based on HDFS. HDFSMetadataLog uses the specified path as the metadata storage.
When writing a new batch, HDFSMetadataLog will first write to a temp file and then rename it to the final batch file. If the rename step fails, there must be multiple writers; only one of them will succeed and the others will fail.
Note: HDFSMetadataLog doesn't support S3-like file systems, as they don't guarantee that listing files in a directory always shows the latest files.
- class IncrementalExecution extends QueryExecution with Logging
A variant of QueryExecution that allows the execution of the given LogicalPlan plan incrementally, possibly preserving state in between each execution.
- case class LongOffset(offset: Long) extends Offset with Product with Serializable
A simple offset for sources that produce a single linear stream of data.
- class ManifestFileCommitProtocol extends FileCommitProtocol with Serializable with Logging
A FileCommitProtocol that tracks the list of valid files in a manifest file, used in structured streaming.
- case class MemoryStream[A](id: Int, sqlContext: SQLContext, numPartitions: Option[Int] = None)(implicit evidence$4: Encoder[A]) extends MemoryStreamBase[A] with MicroBatchStream with Logging with Product with Serializable
A Source that produces values stored in memory as they are added by the user.
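MemoryStream is typically constructed through its companion object in tests. A sketch of that usage, assuming the usual apply factory that takes an implicit Encoder and SQLContext:
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[*]").appName("memory-stream-example").getOrCreate()
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext   // required by the assumed factory

// Data added via addData becomes available to the next micro-batch.
val input = MemoryStream[Int]
val query = input.toDS()
  .map(_ * 2)
  .writeStream
  .format("memory")        // in-memory sink, handy for tests
  .queryName("doubled")
  .outputMode("append")
  .start()

input.addData(1, 2, 3)
query.processAllAvailable()
spark.sql("SELECT * FROM doubled").show()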
- abstract class MemoryStreamBase[A] extends SparkDataStream
A base class for memory stream implementations. Supports adding data and resetting.
- class MemoryStreamInputPartition extends InputPartition
- class MemoryStreamScanBuilder extends ScanBuilder with Scan
- class MemoryStreamTable extends Table with SupportsRead
- trait MetadataLog[T] extends AnyRef
A general MetadataLog that supports the following features (a usage sketch follows this list):
- Allow the user to store a metadata object for each batch.
- Allow the user to query the latest batch id.
- Allow the user to query the metadata object of a specified batch id.
- Allow the user to query metadata objects in a range of batch ids.
- Allow the user to remove obsolete metadata.
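A usage sketch of how these features map onto the trait. The method shapes below (add, get, getLatest, purge) are assumptions based on Spark's internal MetadataLog and may differ across versions.
import org.apache.spark.sql.execution.streaming.MetadataLog

// Assumed method shapes; each call corresponds to one feature in the list above.
def exerciseLog(log: MetadataLog[String]): Unit = {
  log.add(0L, "metadata for batch 0")                              // store a metadata object per batch
  val latest: Option[(Long, String)] = log.getLatest()             // latest batch id and its metadata
  val one: Option[String] = log.get(0L)                            // metadata of a specific batch id
  val range: Array[(Long, String)] = log.get(Some(0L), Some(10L))  // metadata for a range of batch ids
  log.purge(5L)                                                    // remove obsolete metadata below batch 5
}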
- class MetadataLogFileIndex extends PartitioningAwareFileIndex
A FileIndex that generates the list of files to process by reading them from the metadata log files generated by the FileStreamSink.
- class MetricsReporter extends metrics.source.Source with Logging
Serves metrics from an org.apache.spark.sql.streaming.StreamingQuery to Codahale/DropWizard metrics.
- class MicroBatchExecution extends StreamExecution
- sealed trait MultipleWatermarkPolicy extends AnyRef
Policy to define how to choose a new global watermark value if there are multiple watermark operators in a streaming query.
- abstract class Offset extends connector.read.streaming.Offset
This class is an alias of org.apache.spark.sql.connector.read.streaming.Offset. It's internal and deprecated. New streaming data source implementations should use the data source v2 API, which will be supported in the long term. This class will be removed in a future release.
- case class OffsetHolder(start: connector.read.streaming.Offset, end: connector.read.streaming.Offset) extends LeafNode with Product with Serializable
- case class OffsetSeq(offsets: Seq[Option[connector.read.streaming.Offset]], metadata: Option[OffsetSeqMetadata] = None) extends Product with Serializable
An ordered collection of offsets, used to track the progress of processing data from one or more Sources that are present in a streaming query. This is similar to a simplified, single-instance vector clock that must progress linearly forward.
- class OffsetSeqLog extends HDFSMetadataLog[OffsetSeq]
This class is used to log offsets to persistent files in HDFS. Each file corresponds to a specific batch of offsets. The file format contains a version string in the first line, followed by the JSON string representations of the offsets, separated by newline characters. If a source offset is missing, that line will contain a string value defined in the SERIALIZED_VOID_OFFSET variable in the OffsetSeqLog companion object. For instance, when dealing with LongOffset types:
v1          // version 1
metadata
{0}         // LongOffset 0
{3}         // LongOffset 3
-           // No offset for this source, i.e., an invalid JSON string
{2}         // LongOffset 2
...
- case class OffsetSeqMetadata(batchWatermarkMs: Long = 0, batchTimestampMs: Long = 0, conf: Map[String, String] = Map.empty) extends Product with Serializable
Contains metadata associated with an OffsetSeq.
- case class OneTimeExecutor() extends TriggerExecutor with Product with Serializable
A trigger executor that runs a single batch only, then terminates.
- case class ProcessingTimeExecutor(processingTimeTrigger: ProcessingTimeTrigger, clock: Clock = new SystemClock()) extends TriggerExecutor with Logging with Product with Serializable
A trigger executor that runs a batch every intervalMs milliseconds.
- case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger with Product with Serializable
A Trigger that runs a query periodically based on the processing time. If the interval is 0, the query will run as fast as possible.
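ProcessingTimeTrigger backs the public Trigger.ProcessingTime API. A brief example of how a query selects it:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").appName("trigger-example").getOrCreate()

val input = spark.readStream.format("rate").load()

// Trigger.ProcessingTime is the user-facing API; an interval of 0 means
// "run micro-batches as fast as possible".
val query = input.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()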
- trait ProgressReporter extends Logging
Responsible for continually reporting statistics about the amount of data processed as well as latency for a streaming query. This trait is designed to be mixed into the StreamExecution, which is responsible for calling startTrigger and finishTrigger at the appropriate times. Additionally, the status can be updated with updateStatusMessage to allow reporting on the stream's current state (i.e. "Fetching more data").
- abstract class QueryExecutionThread extends UninterruptibleThread
A special thread to run the stream query. Some code requires running in the QueryExecutionThread and will use classOf[QueryExecutionThread] to check.
- case class RateStreamOffset(partitionToValueAndRunTimeMs: Map[Int, ValueRunTimeMsPair]) extends connector.read.streaming.Offset with Product with Serializable
- case class SerializedOffset(json: String) extends Offset with Product with Serializable
Used when loading a JSON-serialized offset from external storage. We are currently not responsible for converting JSON-serialized data into an internal (i.e., object) representation. Sources should define a factory method in their source Offset companion objects that accepts a SerializedOffset for doing the conversion.
- trait Sink extends Table
An interface for systems that can collect the results of a streaming query. In order to preserve exactly-once semantics a sink must be idempotent in the face of multiple attempts to add the same batch.
Note that we extend Table here to make the v1 streaming sink API compatible with data source v2.
- case class SinkFileStatus(path: String, size: Long, isDir: Boolean, modificationTime: Long, blockReplication: Int, blockSize: Long, action: String) extends Product with Serializable
The status of a file outputted by FileStreamSink. A file is visible only if it appears in the sink log and its action is not "delete".
- path
the file path.
- size
the file size.
- isDir
whether this file is a directory.
- modificationTime
the file last modification time.
- blockReplication
the block replication.
- blockSize
the block size.
- action
the file action. Must be either "add" or "delete".
- trait Source extends SparkDataStream
A source of continually arriving data for a streaming query. A Source must have a monotonically increasing notion of progress that can be represented as an Offset. Spark will regularly query each Source to see if any more data is available.
Note that we extend SparkDataStream here to make the v1 streaming source API compatible with data source v2.
trait
State extends AnyRef
States for StreamExecution's lifecycle.
- trait StateStoreReader extends SparkPlan with StatefulOperator
An operator that reads from a StateStore.
- case class StateStoreRestoreExec(keyExpressions: Seq[Attribute], stateInfo: Option[StatefulOperatorStateInfo], stateFormatVersion: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with StateStoreReader with Product with Serializable
For each input tuple, the key is calculated and the value from the StateStore is added to the stream (in addition to the input tuple) if present.
- case class StateStoreSaveExec(keyExpressions: Seq[Attribute], stateInfo: Option[StatefulOperatorStateInfo] = None, outputMode: Option[OutputMode] = None, eventTimeWatermark: Option[Long] = None, stateFormatVersion: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with StateStoreWriter with WatermarkSupport with Product with Serializable
For each input tuple, the key is calculated and the tuple is put into the StateStore.
trait
StateStoreWriter extends SparkPlan with StatefulOperator
An operator that writes to a StateStore.
- trait StatefulOperator extends SparkPlan
An operator that reads or writes state from the StateStore. The StatefulOperatorStateInfo should be filled in by prepareForExecution in IncrementalExecution.
case class
StatefulOperatorStateInfo(checkpointLocation: String, queryRunId: UUID, operatorId: Long, storeVersion: Long, numPartitions: Int) extends Product with Serializable
Used to identify the state store for a given operator.
- abstract class StreamExecution extends StreamingQuery with ProgressReporter with Logging
Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any Source present in the query plan. Whenever new data arrives, a QueryExecution is created and the results are committed transactionally to the given Sink.
- case class StreamMetadata(id: String) extends Product with Serializable
Contains metadata associated with a StreamingQuery. This information is written to the checkpoint location the first time a query is started and recovered every time the query is restarted.
- id
unique id of the StreamingQuery that needs to be persisted across restarts
- class StreamProgress extends Map[SparkDataStream, connector.read.streaming.Offset]
A helper class that looks like a Map[Source, Offset].
- case class StreamingDeduplicateExec(keyExpressions: Seq[Attribute], child: SparkPlan, stateInfo: Option[StatefulOperatorStateInfo] = None, eventTimeWatermark: Option[Long] = None) extends SparkPlan with UnaryExecNode with StateStoreWriter with WatermarkSupport with Product with Serializable
Physical operator for executing streaming Deduplicate.
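This operator is planned when dropDuplicates is applied to a streaming Dataset. A small user-facing example; the watermark bounds how long per-key deduplication state is retained:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dedup-example").getOrCreate()

val events = spark.readStream.format("rate").load()
  .selectExpr("timestamp AS eventTime", "CAST(value AS STRING) AS id")

// Streaming dropDuplicates keeps state per key; the watermark lets old state be evicted.
val deduped = events
  .withWatermark("eventTime", "1 hour")
  .dropDuplicates("id", "eventTime")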
- case class StreamingExecutionRelation(source: SparkDataStream, output: Seq[Attribute])(session: SparkSession) extends LeafNode with MultiInstanceRelation with Product with Serializable
Used to link a streaming Source of data into an org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.
- case class StreamingGlobalLimitExec(streamLimit: Long, child: SparkPlan, stateInfo: Option[StatefulOperatorStateInfo] = None, outputMode: Option[OutputMode] = None) extends SparkPlan with UnaryExecNode with StateStoreWriter with Product with Serializable
A physical operator for executing a streaming limit, which makes sure no more than streamLimit rows are returned. This physical operator is only meant for logical limit operations that will get an input stream of rows that are effectively appends. For example: a limit on any query in append mode, or a limit before the aggregation in a streaming aggregation query in complete mode.
- case class StreamingLocalLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with LimitExec with Product with Serializable
A physical operator for executing limits locally on each partition. The main difference from LocalLimitExec is that this will fully consume the child plan's iterators to ensure that any stateful operation within child commits all the state changes (many stateful operations commit state changes only after the iterator is consumed).
- class StreamingQueryListenerBus extends SparkListener with ListenerBus[StreamingQueryListener, Event]
A bus to forward events to StreamingQueryListeners. This one will send received StreamingQueryListener.Events to the Spark listener bus. It also registers itself with the Spark listener bus, so that it can receive StreamingQueryListener.Events and dispatch them to StreamingQueryListeners.
Note that each bus and its registered listeners are associated with a single SparkSession and StreamingQueryManager. So this bus will dispatch events to registered listeners for only those queries that were started in the associated SparkSession.
- class StreamingQueryWrapper extends StreamingQuery with Serializable
Wraps the non-serializable StreamExecution to make the query serializable, as it's easy for it to get captured with normal usage. It's safe to capture the query but not to use it in executors. However, if the user tries to call its methods, it will throw IllegalStateException.
- case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute]) extends LeafNode with MultiInstanceRelation with Product with Serializable
Used to link a streaming DataSource into an org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader. It should be used to create a Source and converted to a StreamingExecutionRelation when passed to StreamExecution to run a query.
- case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) extends SparkPlan with LeafExecNode with Product with Serializable
A dummy physical plan for StreamingRelation to support org.apache.spark.sql.Dataset.explain.
- case class StreamingRelationV2(source: TableProvider, sourceName: String, table: Table, extraOptions: CaseInsensitiveStringMap, output: Seq[Attribute], v1Relation: Option[StreamingRelation])(session: SparkSession) extends LeafNode with MultiInstanceRelation with Product with Serializable
Used to link a TableProvider into a streaming org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader, and should be converted before passing to StreamExecution.
- case class StreamingSymmetricHashJoinExec(leftKeys: Seq[Expression], rightKeys: Seq[Expression], joinType: JoinType, condition: JoinConditionSplitPredicates, stateInfo: Option[StatefulOperatorStateInfo], eventTimeWatermark: Option[Long], stateWatermarkPredicates: JoinStateWatermarkPredicates, stateFormatVersion: Int, left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with StateStoreWriter with Product with Serializable
Performs a stream-stream join using the symmetric hash join algorithm. It works as follows (a user-facing example follows the parameter list below).
                            /-----------------------\
  left side input --------->|    left side state    |------\
                            \-----------------------/      |
                                                            |--------> joined output
                            /-----------------------\      |
  right side input -------->|    right side state   |------/
                            \-----------------------/
Each join side buffers past input rows as streaming state so that the past input can be joined with future input on the other side. This buffer state is effectively a multi-map: equi-join key -> list of past input rows received with that join key.
For each input row on each side, the following operations take place:
- Calculate the join key from the row.
- Use the join key to append the row to the buffer state of the side that the row came from.
- Find past buffered values for the key from the other side. For each such value, emit the "joined row" (left-row, right-row).
- Apply the optional condition to filter the joined rows as the final output.
If a timestamp column with an event time watermark is present in the join keys or in the input data, then the operator uses the watermark to figure out which rows in the buffer will not join with the new data, and can therefore be discarded. Depending on the provided query conditions, we can define thresholds on both the state key (i.e. joining keys) and the state value (i.e. input rows). There are three kinds of queries possible regarding this, as explained below. Assume that a watermark has been defined on both the leftTime and rightTime columns used below.
1. When timestamp/time-window + watermark is in the join keys. Example (pseudo-SQL):
SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = window(rightTime, "1 hour") // 1hr tumbling windows
In this case, this operator will join rows newer than watermark which fall in the same 1 hour window. Say the event-time watermark is "12:34" (both left and right input). Then input rows can only have time > 12:34. Hence, they can only join with buffered rows where window >= 12:00 - 1:00 and all buffered rows with join window < 12:00 can be discarded. In other words, the operator will discard all state where window in state key (i.e. join key) < event time watermark. This threshold is called State Key Watermark.
2. When timestamp range conditions are provided (no time/window + watermark in join keys). E.g.
SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR
In this case, the event-time watermark and the BETWEEN condition can be used to calculate a state watermark, i.e., time threshold for the state rows that can be discarded. For example, say each join side has a time column, named "leftTime" and "rightTime", and there is a join condition "leftTime > rightTime - 8 min". While processing, say the watermark on right input is "12:34". This means that from henceforth, only right inputs rows with "rightTime > 12:34" will be processed, and any older rows will be considered as "too late" and therefore dropped. Then, the left side buffer only needs to keep rows where "leftTime > rightTime - 8 min > 12:34 - 8m > 12:26". That is, the left state watermark is 12:26, and any rows older than that can be dropped from the state. In other words, the operator will discard all state where timestamp in state value (input rows) < state watermark. This threshold is called State Value Watermark (to distinguish from the state key watermark).
Note:
- The event watermark value of one side is used to calculate the state watermark of the other side. That is, a condition ~ "leftTime > rightTime + X" with the right side's event watermark is used to calculate the left side's state watermark. Conversely, a condition ~ "leftTime < rightTime + Y" with the left side's event watermark is used to calculate the right side's state watermark.
- Depending on the conditions, the state watermark may be different for the left and right sides. In the above example, leftTime > 12:26 AND rightTime > 12:34 - 1 hour = 11:34.
- State can be dropped from BOTH sides only when there are conditions of the above forms that define time bounds on timestamp in both directions.
3. When both window in join key and time range conditions are present, case 1 + 2. In this case, since window equality is a stricter condition than the time range, we can use the State Key Watermark = event time watermark to discard state (similar to case 1).
- leftKeys
Expression to generate key rows for joining from left input
- rightKeys
Expression to generate key rows for joining from right input
- joinType
Type of join (inner, left outer, etc.)
- condition
Conditions to filter rows, split by left, right, and joined. See JoinConditionSplitPredicates
- stateInfo
Version information required to read join state (buffered rows)
- eventTimeWatermark
Watermark of input event, same for both sides
- stateWatermarkPredicates
Predicates for removal of state, see JoinStateWatermarkPredicates
- left
Left child plan
- right
Right child plan
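A user-facing example corresponding to case 2 above (a time-range join condition from which the engine derives state watermarks); the column names are illustrative:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("stream-join-example").getOrCreate()

val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "2 hours")

val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "3 hours")

// The time-range condition plus the watermarks lets the engine bound how long each
// side's join state must be kept; the join is planned as a stream-stream join.
val joined = impressions.join(
  clicks,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))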
- trait TriggerExecutor extends AnyRef
- case class ValueRunTimeMsPair(value: Long, runTimeMs: Long) extends Product with Serializable
- trait WatermarkSupport extends SparkPlan with UnaryExecNode
An operator that supports watermark.
- case class WatermarkTracker(policy: MultipleWatermarkPolicy) extends Logging with Product with Serializable
Tracks the watermark value of a streaming query based on a given policy.
Value Members
- object ACTIVE extends State with Product with Serializable
- object CheckpointFileManager extends Logging
- object CleanSourceMode extends Enumeration
- object CommitLog
- object CommitMetadata extends Serializable
- object CompactibleFileStreamLog
- object ConsoleTable extends Table with SupportsWrite
- object ContinuousTrigger extends Serializable
- object EventTimeStats extends Serializable
- object FileStreamSink extends Logging
- object FileStreamSinkLog
- object FileStreamSource
- object FileStreamSourceLog
- object FileStreamSourceOffset extends Serializable
- object HDFSMetadataLog
- object INITIALIZING extends State with Product with Serializable
- object LongOffset extends Serializable
- object MaxWatermark extends MultipleWatermarkPolicy with Product with Serializable
Policy to choose the *max* of the operator watermark values as the global watermark value. The global watermark will advance if any of the individual operator watermarks has advanced. In other words, in a streaming query with multiple input streams and watermarks defined on all of them, the global watermark will advance as fast as the fastest input. So if there is watermark-based state cleanup or late-data dropping, then this policy is the most aggressive one and may lead to unexpected behavior if the data of the slow stream is delayed.
- object MemoryStream extends Serializable
- object MemoryStreamReaderFactory extends PartitionReaderFactory
- object MemoryStreamTableProvider extends SimpleTableProvider
- object MicroBatchExecution
- object MinWatermark extends MultipleWatermarkPolicy with Product with Serializable
Policy to choose the *min* of the operator watermark values as the global watermark value. Note that this is the safe (hence default) policy, as the global watermark will advance only if all the individual operator watermarks have advanced. In other words, in a streaming query with multiple input streams and watermarks defined on all of them, the global watermark will advance as slowly as the slowest input. So if there is watermark-based state cleanup or late-data dropping, then this policy is the most conservative one.
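The policy is selected per query through a SQL configuration. A brief example, assuming the spark.sql.streaming.multipleWatermarkPolicy key used by recent Spark versions ("min" is the default, "max" selects MaxWatermark):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("watermark-policy-example").getOrCreate()

// Assumed config key; "min" -> MinWatermark (default), "max" -> MaxWatermark.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")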
- object MultipleWatermarkPolicy
- object OffsetSeq extends Serializable
- object OffsetSeqLog
- object OffsetSeqMetadata extends Logging with Serializable
- object OneTimeTrigger extends Trigger with Product with Serializable
A Trigger that processes only one batch of data in a streaming query and then terminates the query.
- object ProcessingTimeTrigger extends Serializable
- object RECONFIGURING extends State with Product with Serializable
- object SinkFileStatus extends Serializable
- object StreamExecution
- object StreamMetadata extends Logging with Serializable
- object StreamingDeduplicateExec extends Serializable
- object StreamingExecutionRelation extends Serializable
- object StreamingQueryListenerBus
- object StreamingRelation extends Serializable
- object StreamingSymmetricHashJoinHelper extends Logging
Helper object for StreamingSymmetricHashJoinExec. See that object for more details.
- object TERMINATED extends State with Product with Serializable
- object WatermarkSupport extends Serializable
- object WatermarkTracker extends Serializable