Packages

package streaming

Type Members

  1. trait CheckpointFileManager extends AnyRef

    An interface to abstract out all operations related to streaming checkpoints. Most importantly, this interface provides createAtomic(path, overwrite), which returns a CancellableFSDataOutputStream. That method is used by HDFSMetadataLog and StateStore implementations to write a complete checkpoint file atomically (i.e. no partial file is ever visible), with or without overwrite.

    This higher-level interface above the Hadoop FileSystem is necessary because different implementations of FileSystem/FileContext may need different combinations of operations to provide the desired atomicity guarantees (e.g. write-to-temp-file-and-rename, direct-write-and-cancel-on-failure), and this abstraction allows different implementations while keeping the usage simple (createAtomic -> close or cancel).
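
    A minimal sketch of the createAtomic -> close-or-cancel pattern described above (the helper function and error handling are illustrative, not taken from Spark's sources):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.execution.streaming.CheckpointFileManager

    def writeCheckpointFile(fm: CheckpointFileManager, path: Path, bytes: Array[Byte]): Unit = {
      // createAtomic returns a CancellableFSDataOutputStream; nothing is visible at `path`
      // until close() succeeds.
      val out = fm.createAtomic(path, false)   // second argument: overwrite
      try {
        out.write(bytes)
        out.close()    // commit: the complete file becomes visible atomically
      } catch {
        case t: Throwable =>
          out.cancel() // abort: no partial file is left behind
          throw t
      }
    }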

  2. class CommitLog extends HDFSMetadataLog[CommitMetadata]

    Used to write log files that represent batch commit points in structured streaming. A commit log file will be written immediately after the successful completion of a batch, and before processing the next batch. Here is an execution summary:

    • trigger batch 1
    • obtain batch 1 offsets and write to offset log
    • process batch 1
    • write batch 1 to completion log
    • trigger batch 2
    • obtain batch 2 offsets and write to offset log
    • process batch 2
    • write batch 2 to completion log
    • ...

    The current format of the batch completion log is:
    line 1: version
    line 2: metadata (optional json string)
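
    For illustration, assuming the default CommitMetadata (no watermark set), a commit log file would then look roughly like:

    v1
    {"nextBatchWatermarkMs":0}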

  3. case class CommitMetadata(nextBatchWatermarkMs: Long = 0) extends Product with Serializable
  4. abstract class CompactibleFileStreamLog[T <: AnyRef] extends HDFSMetadataLog[Array[T]]

    An abstract class for compactible metadata logs. It will write one log file for each batch. The first line of the log file is the version number, and multiple serialized metadata lines follow.

    Because reading from many small files is usually slow, and too many small files in one directory can overwhelm the file system, CompactibleFileStreamLog compacts log files into a single large file every 10 batches by default. When doing a compaction, it reads all old log files and merges them with the new batch.

  5. case class ConsoleRelation(sqlContext: SQLContext, data: DataFrame) extends BaseRelation with Product with Serializable
  6. class ConsoleSinkProvider extends SimpleTableProvider with DataSourceRegister with CreatableRelationProvider
  7. class ContinuousRecordEndpoint extends ThreadSafeRpcEndpoint

    An RPC endpoint for continuous readers to poll for records from the driver.

  8. case class ContinuousRecordPartitionOffset(partitionId: Int, offset: Int) extends PartitionOffset with Product with Serializable
  9. case class ContinuousTrigger(intervalMs: Long) extends Trigger with Product with Serializable

    A Trigger that continuously processes streaming data, asynchronously checkpointing at the specified interval.

  10. case class EventTimeStats(max: Long, min: Long, avg: Double, count: Long) extends Product with Serializable

    Class for collecting event time stats with an accumulator

  11. class EventTimeStatsAccum extends AccumulatorV2[Long, EventTimeStats]

    Accumulator that collects stats on event time in a batch.

  12. case class EventTimeWatermarkExec(eventTime: Attribute, delay: CalendarInterval, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Used to mark a column as containing the event time for a given record. In addition to adding appropriate metadata to this column, this operator also tracks the maximum observed event time. Based on the maximum observed time and a user-specified delay, we can calculate the watermark after which we assume we will no longer see late records for a particular time period. Note that event time is measured in milliseconds.
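
    This operator is introduced by the public Dataset.withWatermark API; a minimal sketch (the column name and delay are illustrative):

    // Marks "eventTime" as the event-time column with a 10-minute delay;
    // the watermark is then roughly max(observed event time) - 10 minutes.
    val withEventTime = df.withWatermark("eventTime", "10 minutes")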

  13. class FileContextBasedCheckpointFileManager extends CheckpointFileManager with RenameHelperMethods with Logging

    An implementation of CheckpointFileManager using Hadoop's FileContext API.

  14. class FileStreamOptions extends Logging

    User specified options for file streams.

  15. class FileStreamSink extends Sink with Logging

    A sink that writes out results to parquet files. Each batch is written out to a unique directory. After all of the files in a batch have been successfully written, the list of file paths is appended to the log atomically. In the case of partial failures, some duplicate data may be present in the target directory, but only one copy of each file will be present in the log.
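
    A typical way to create this sink via the public DataStreamWriter API (the DataFrame name and paths are placeholders):

    val query = streamingDF.writeStream
      .format("parquet")
      .option("path", "/tmp/parquet-output")        // output directory (placeholder)
      .option("checkpointLocation", "/tmp/ckpt")    // location of the sink log (placeholder)
      .start()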

  16. class FileStreamSinkLog extends CompactibleFileStreamLog[SinkFileStatus]

    A special log for FileStreamSink. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of SinkFileStatus.

    As reading from many small files is usually pretty slow, FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compaction, it will read all old log files and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by SinkFileStatus.action). When the reader uses allFiles to list all files, this method only returns the visible files (drops the deleted files).

  17. class FileStreamSource extends SupportsAdmissionControl with Source with Logging

    A very simple source that reads files from the given directory as they appear.

  18. class FileStreamSourceLog extends CompactibleFileStreamLog[FileEntry]
  19. case class FileStreamSourceOffset(logOffset: Long) extends Offset with Product with Serializable

    Offset for the FileStreamSource.

    logOffset

    Position in the FileStreamSourceLog

  20. class FileSystemBasedCheckpointFileManager extends CheckpointFileManager with RenameHelperMethods with Logging

    An implementation of CheckpointFileManager using Hadoop's FileSystem API.

  21. case class FlatMapGroupsWithStateExec(func: (Any, Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any], keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], outputObjAttr: Attribute, stateInfo: Option[StatefulOperatorStateInfo], stateEncoder: ExpressionEncoder[Any], stateFormatVersion: Int, outputMode: OutputMode, timeoutConf: GroupStateTimeout, batchTimestampMs: Option[Long], eventTimeWatermark: Option[Long], child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with StateStoreWriter with WatermarkSupport with Product with Serializable

    Physical operator for executing FlatMapGroupsWithState. A usage sketch via the public API follows the parameter descriptions below.

    func

    function called on each group

    keyDeserializer

    used to extract the key object for each group.

    valueDeserializer

    used to extract the items in the iterator from an input row.

    groupingAttributes

    used to group the data

    dataAttributes

    used to read the data

    outputObjAttr

    Defines the output object

    stateEncoder

    used to serialize/deserialize state before calling func

    outputMode

    the output mode of func

    timeoutConf

    used to timeout groups that have not received data in a while

    batchTimestampMs

    processing timestamp of the current batch.
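
    As referenced above, a hedged sketch of a query that is planned into this operator, using the public flatMapGroupsWithState API (the event/count types and the counting logic are illustrative; assumes spark.implicits._ is in scope):

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(user: String, value: Long)
    case class RunningCount(user: String, count: Long)

    // Keeps a per-user count as state and emits the updated count on every trigger.
    val counts = events                      // events: Dataset[Event], read from a stream
      .groupByKey(_.user)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(
        (user: String, rows: Iterator[Event], state: GroupState[Long]) => {
          val newCount = state.getOption.getOrElse(0L) + rows.size
          state.update(newCount)
          Iterator(RunningCount(user, newCount))
        })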

  22. case class GetRecord(offset: ContinuousRecordPartitionOffset) extends Product with Serializable
  23. class HDFSMetadataLog[T <: AnyRef] extends MetadataLog[T] with Logging

    A MetadataLog implementation based on HDFS. HDFSMetadataLog uses the specified path as the metadata storage.

    When writing a new batch, HDFSMetadataLog will first write to a temp file and then rename it to the final batch file. If the rename step fails, there must be multiple concurrent writers; only one of them will succeed and the others will fail.

    Note: HDFSMetadataLog doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.

  24. class IncrementalExecution extends QueryExecution with Logging

    A variant of QueryExecution that allows the execution of the given LogicalPlan incrementally, possibly preserving state in between each execution.

  25. case class LongOffset(offset: Long) extends Offset with Product with Serializable

    A simple offset for sources that produce a single linear stream of data.

  26. class ManifestFileCommitProtocol extends FileCommitProtocol with Serializable with Logging

    A FileCommitProtocol that tracks the list of valid files in a manifest file, used in structured streaming.

  27. case class MemoryStream[A](id: Int, sqlContext: SQLContext, numPartitions: Option[Int] = None)(implicit evidence$4: Encoder[A]) extends MemoryStreamBase[A] with MicroBatchStream with Logging with Product with Serializable

    A Source that produces the values stored in memory as they are added by the user. This Source is intended for use in unit tests as it can only replay data when the object is still available.

    If numPartitions is provided, the rows will be redistributed to the given number of partitions.
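
    Typical test-only usage, assuming an active SparkSession named spark with spark.implicits._ in scope:

    implicit val sqlCtx = spark.sqlContext

    val input = MemoryStream[Int]            // in-memory source for tests
    input.addData(1, 2, 3)                   // becomes available to the next batch

    val query = input.toDF()
      .writeStream
      .format("memory")
      .queryName("numbers")
      .start()
    query.processAllAvailable()              // block until all added data is processed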

  28. abstract class MemoryStreamBase[A] extends SparkDataStream

    A base class for memory stream implementations. Supports adding data and resetting.

  29. class MemoryStreamInputPartition extends InputPartition
  30. class MemoryStreamScanBuilder extends ScanBuilder with Scan
  31. class MemoryStreamTable extends Table with SupportsRead
  32. trait MetadataLog[T] extends AnyRef

    A general MetadataLog that supports the following features (a sketch of the interface follows the list):

    • Allow the user to store a metadata object for each batch.
    • Allow the user to query the latest batch id.
    • Allow the user to query the metadata object of a specified batch id.
    • Allow the user to query metadata objects in a range of batch ids.
    • Allow the user to remove obsolete metadata
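
    A hedged sketch of what this interface looks like, reconstructed from the feature list above (signatures are an approximation, not copied verbatim from Spark's sources):

    trait MetadataLog[T] {
      // Store a metadata object for the given batch; returns false if the batch already exists.
      def add(batchId: Long, metadata: T): Boolean

      // Metadata for one batch, if it has been written.
      def get(batchId: Long): Option[T]

      // Metadata for a (possibly open-ended) range of batch ids.
      def get(startId: Option[Long], endId: Option[Long]): Array[(Long, T)]

      // The latest committed batch id and its metadata, if any.
      def getLatest(): Option[(Long, T)]

      // Remove metadata for batches older than the given threshold.
      def purge(thresholdBatchId: Long): Unit
    }
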
  33. class MetadataLogFileIndex extends PartitioningAwareFileIndex

    A FileIndex that generates the list of files to process by reading them from the metadata log files generated by the FileStreamSink.

  34. class MetricsReporter extends metrics.source.Source with Logging

    Serves metrics from an org.apache.spark.sql.streaming.StreamingQuery to Codahale/DropWizard metrics.

  35. class MicroBatchExecution extends StreamExecution
  36. sealed trait MultipleWatermarkPolicy extends AnyRef

    Policy to define how to choose a new global watermark value if there are multiple watermark operators in a streaming query.

  37. abstract class Offset extends connector.read.streaming.Offset

    This class is an alias of org.apache.spark.sql.connector.read.streaming.Offset. It's internal and deprecated. New streaming data source implementations should use the data source v2 API, which will be supported in the long term.

    This class will be removed in a future release.

  38. case class OffsetHolder(start: connector.read.streaming.Offset, end: connector.read.streaming.Offset) extends LeafNode with Product with Serializable
  39. case class OffsetSeq(offsets: Seq[Option[connector.read.streaming.Offset]], metadata: Option[OffsetSeqMetadata] = None) extends Product with Serializable

    An ordered collection of offsets, used to track the progress of processing data from one or more Sources that are present in a streaming query. This is similar to a simplified, single-instance vector clock that must progress linearly forward.

  40. class OffsetSeqLog extends HDFSMetadataLog[OffsetSeq]

    This class is used to log offsets to persistent files in HDFS. Each file corresponds to a specific batch of offsets. The file format contains a version string in the first line, followed by the JSON string representation of the offsets, separated by newline characters. If a source offset is missing, then that line will contain a string value defined in the SERIALIZED_VOID_OFFSET variable in the OffsetSeqLog companion object. For instance, when dealing with LongOffset types:

    v1          // version 1
    metadata
    {0}         // LongOffset 0
    {3}         // LongOffset 3
    -           // No offset for this source, i.e., an invalid JSON string
    {2}         // LongOffset 2
    ...
  41. case class OffsetSeqMetadata(batchWatermarkMs: Long = 0, batchTimestampMs: Long = 0, conf: Map[String, String] = Map.empty) extends Product with Serializable

    Contains metadata associated with an OffsetSeq. This information is persisted to the offset log in the checkpoint location via the OffsetSeq metadata field.

  42. case class OneTimeExecutor() extends TriggerExecutor with Product with Serializable

    A trigger executor that runs a single batch only, then terminates.

  43. case class ProcessingTimeExecutor(processingTimeTrigger: ProcessingTimeTrigger, clock: Clock = new SystemClock()) extends TriggerExecutor with Logging with Product with Serializable

    A trigger executor that runs a batch every intervalMs milliseconds.

  44. case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger with Product with Serializable

    A Trigger that runs a query periodically based on the processing time. If interval is 0, the query will run as fast as possible.
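
    This class backs the public Trigger.ProcessingTime factory; for example (the interval value and DataFrame name are illustrative):

    import org.apache.spark.sql.streaming.Trigger

    val query = streamingDF.writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))   // start a micro-batch every 10 seconds
      .format("console")
      .start()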

  45. trait ProgressReporter extends Logging

    Responsible for continually reporting statistics about the amount of data processed as well as latency for a streaming query. This trait is designed to be mixed into the StreamExecution, which is responsible for calling startTrigger and finishTrigger at the appropriate times. Additionally, the status can be updated with updateStatusMessage to allow reporting on the stream's current state (e.g. "Fetching more data").

  46. abstract class QueryExecutionThread extends UninterruptibleThread

    A special thread used to run the stream query. Some code is required to run in the QueryExecutionThread and uses classOf[QueryExecutionThread] to check.

  47. case class RateStreamOffset(partitionToValueAndRunTimeMs: Map[Int, ValueRunTimeMsPair]) extends connector.read.streaming.Offset with Product with Serializable
  48. case class SerializedOffset(json: String) extends Offset with Product with Serializable

    Used when loading a JSON serialized offset from external storage. We are currently not responsible for converting JSON serialized data into an internal (i.e., object) representation. Sources should define a factory method in their source Offset companion objects that accepts a SerializedOffset for doing the conversion.
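
    A hedged sketch of that factory-method pattern for a hypothetical source offset type (LongOffset in Spark follows the same shape):

    case class MyOffset(position: Long) extends Offset {
      override def json: String = position.toString
    }

    object MyOffset {
      // Convert a JSON-serialized offset loaded from the offset log back into a typed offset.
      def apply(serialized: SerializedOffset): MyOffset = MyOffset(serialized.json.toLong)
    }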

  49. trait Sink extends Table

    An interface for systems that can collect the results of a streaming query. In order to preserve exactly-once semantics, a sink must be idempotent in the face of multiple attempts to add the same batch.

    Note that this trait extends Table to keep the v1 streaming sink API compatible with data source v2.
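
    The central method of this interface is addBatch(batchId, data); a hedged sketch of an idempotent sink (the batch-id tracking and println output are illustrative):

    import org.apache.spark.sql.DataFrame

    class LoggingSink extends Sink {
      @volatile private var latestBatchId = -1L

      // addBatch may be called again for a batch that was already handled (e.g. after a
      // failure and restart); skipping such batches keeps the sink idempotent.
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId > latestBatchId) {
          data.collect().foreach(row => println(row))
          latestBatchId = batchId
        }
      }
    }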

  50. case class SinkFileStatus(path: String, size: Long, isDir: Boolean, modificationTime: Long, blockReplication: Int, blockSize: Long, action: String) extends Product with Serializable

    The status of a file written out by FileStreamSink. A file is visible only if it appears in the sink log and its action is not "delete".

    path

    the file path.

    size

    the file size.

    isDir

    whether this file is a directory.

    modificationTime

    the file last modification time.

    blockReplication

    the block replication.

    blockSize

    the block size.

    action

    the file action. Must be either "add" or "delete".

  51. trait Source extends SparkDataStream

    A source of continually arriving data for a streaming query. A Source must have a monotonically increasing notion of progress that can be represented as an Offset. Spark will regularly query each Source to see if any more data is available.

    Note that this trait extends SparkDataStream to keep the v1 streaming source API compatible with data source v2.
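
    A hedged outline of the v1 Source contract described above (signatures are an approximation, not copied verbatim from Spark's sources):

    // Types come from org.apache.spark.sql and org.apache.spark.sql.types.
    trait Source extends SparkDataStream {
      // Fixed schema of the data produced by this source.
      def schema: StructType

      // The most recent available offset, or None if no data has arrived yet.
      def getOffset: Option[Offset]

      // The data between start (exclusive; None means the very beginning) and end (inclusive).
      def getBatch(start: Option[Offset], end: Offset): DataFrame

      // Informs the source that data up to end has been processed and may be discarded.
      def commit(end: Offset): Unit

      // Release any resources held by the source.
      def stop(): Unit
    }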

  52. trait State extends AnyRef

    States for StreamExecution's lifecycle.

  53. trait StateStoreReader extends SparkPlan with StatefulOperator

    An operator that reads from a StateStore.

  54. case class StateStoreRestoreExec(keyExpressions: Seq[Attribute], stateInfo: Option[StatefulOperatorStateInfo], stateFormatVersion: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with StateStoreReader with Product with Serializable

    For each input tuple, the key is calculated and the value from the StateStore is added to the stream (in addition to the input tuple) if present.

  55. case class StateStoreSaveExec(keyExpressions: Seq[Attribute], stateInfo: Option[StatefulOperatorStateInfo] = None, outputMode: Option[OutputMode] = None, eventTimeWatermark: Option[Long] = None, stateFormatVersion: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with StateStoreWriter with WatermarkSupport with Product with Serializable

    For each input tuple, the key is calculated and the tuple is put into the StateStore.

  56. trait StateStoreWriter extends SparkPlan with StatefulOperator

    An operator that writes to a StateStore.

  57. trait StatefulOperator extends SparkPlan

    An operator that reads or writes state from the StateStore. The StatefulOperatorStateInfo should be filled in by prepareForExecution in IncrementalExecution.

  58. case class StatefulOperatorStateInfo(checkpointLocation: String, queryRunId: UUID, operatorId: Long, storeVersion: Long, numPartitions: Int) extends Product with Serializable

    Used to identify the state store for a given operator.

  59. abstract class StreamExecution extends StreamingQuery with ProgressReporter with Logging

    Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any Source present in the query plan. Whenever new data arrives, a QueryExecution is created and the results are committed transactionally to the given Sink.

  60. case class StreamMetadata(id: String) extends Product with Serializable

    Contains metadata associated with a StreamingQuery. This information is written in the checkpoint location the first time a query is started and recovered every time the query is restarted.

    id

    unique id of the StreamingQuery that needs to be persisted across restarts

  61. class StreamProgress extends Map[SparkDataStream, connector.read.streaming.Offset]

    A helper class that looks like a Map[Source, Offset].

  62. case class StreamingDeduplicateExec(keyExpressions: Seq[Attribute], child: SparkPlan, stateInfo: Option[StatefulOperatorStateInfo] = None, eventTimeWatermark: Option[Long] = None) extends SparkPlan with UnaryExecNode with StateStoreWriter with WatermarkSupport with Product with Serializable

    Physical operator for executing streaming Deduplicate.

  63. case class StreamingExecutionRelation(source: SparkDataStream, output: Seq[Attribute])(session: SparkSession) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a streaming Source of data into an org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.

  64. case class StreamingGlobalLimitExec(streamLimit: Long, child: SparkPlan, stateInfo: Option[StatefulOperatorStateInfo] = None, outputMode: Option[OutputMode] = None) extends SparkPlan with UnaryExecNode with StateStoreWriter with Product with Serializable

    A physical operator for executing a streaming limit, which makes sure no more than streamLimit rows are returned. This physical operator is only meant for logical limit operations that will get an input stream of rows that are effectively appends. For example:

    • a limit on any query in append mode
    • a limit before the aggregation in a streaming aggregation query in complete mode

  65. case class StreamingLocalLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with LimitExec with Product with Serializable

    A physical operator for executing limits locally on each partition. The main difference from LocalLimitExec is that this will fully consume the child plan's iterators to ensure that any stateful operation within the child commits all the state changes (many stateful operations commit state changes only after the iterator is consumed).

  66. class StreamingQueryListenerBus extends SparkListener with ListenerBus[StreamingQueryListener, Event]

    A bus to forward events to StreamingQueryListeners. This one will send received StreamingQueryListener.Events to the Spark listener bus. It also registers itself with the Spark listener bus, so that it can receive StreamingQueryListener.Events and dispatch them to StreamingQueryListeners.

    Note that each bus and its registered listeners are associated with a single SparkSession and StreamingQueryManager. So this bus will dispatch events to registered listeners for only those queries that were started in the associated SparkSession.

  67. class StreamingQueryWrapper extends StreamingQuery with Serializable

    Wraps the non-serializable StreamExecution to make the query serializable, as it's easy for it to get captured in normal usage. It's safe to capture the query but not to use it in executors. However, if the user tries to call its methods, it will throw IllegalStateException.

  68. case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute]) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a streaming DataSource into an org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader. It should be used to create a Source and converted to a StreamingExecutionRelation when being passed to StreamExecution to run a query.

  69. case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) extends SparkPlan with LeafExecNode with Product with Serializable

    A dummy physical plan for StreamingRelation to support org.apache.spark.sql.Dataset.explain

  70. case class StreamingRelationV2(source: TableProvider, sourceName: String, table: Table, extraOptions: CaseInsensitiveStringMap, output: Seq[Attribute], v1Relation: Option[StreamingRelation])(session: SparkSession) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Used to link a TableProvider into a streaming org.apache.spark.sql.catalyst.plans.logical.LogicalPlan. This is only used for creating a streaming org.apache.spark.sql.DataFrame from org.apache.spark.sql.DataFrameReader, and should be converted before passing to StreamExecution.

  71. case class StreamingSymmetricHashJoinExec(leftKeys: Seq[Expression], rightKeys: Seq[Expression], joinType: JoinType, condition: JoinConditionSplitPredicates, stateInfo: Option[StatefulOperatorStateInfo], eventTimeWatermark: Option[Long], stateWatermarkPredicates: JoinStateWatermarkPredicates, stateFormatVersion: Int, left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with StateStoreWriter with Product with Serializable

    Performs a stream-stream join using the symmetric hash join algorithm. It works as follows.

                                 /-----------------------\
        left side input -------->|    left side state    |------\
                                 \-----------------------/      |
                                                                 |--------> joined output
                                 /-----------------------\      |
        right side input ------->|    right side state   |------/
                                 \-----------------------/

    Each join side buffers past input rows as streaming state so that the past input can be joined with future input on the other side. This buffer state is effectively a multi-map: equi-join key -> list of past input rows received with the join key

    For each input row in each side, the following operations take place:

    • Calculate the join key from the row.
    • Use the join key to append the row to the buffer state of the side that the row came from.
    • Find past buffered values for the key from the other side. For each such value, emit the "joined row" (left-row, right-row).
    • Apply the optional condition to filter the joined rows as the final output.

    If a timestamp column with an event time watermark is present in the join keys or in the input data, then the operator uses the watermark to figure out which rows in the buffer will not join with the new data and can therefore be discarded. Depending on the provided query conditions, we can define thresholds on both the state key (i.e. joining keys) and the state value (i.e. input rows). There are three kinds of queries possible regarding this, as explained below. Assume that a watermark has been defined on both the leftTime and rightTime columns used below.

    1. When timestamp/time-window + watermark is in the join keys. Example (pseudo-SQL):

    SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = window(rightTime, "1 hour") // 1hr tumbling windows

    In this case, this operator will join rows newer than watermark which fall in the same 1 hour window. Say the event-time watermark is "12:34" (both left and right input). Then input rows can only have time > 12:34. Hence, they can only join with buffered rows where window >= 12:00 - 1:00 and all buffered rows with join window < 12:00 can be discarded. In other words, the operator will discard all state where window in state key (i.e. join key) < event time watermark. This threshold is called State Key Watermark.

    2. When timestamp range conditions are provided (no time/window + watermark in join keys). E.g.

    SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR

    In this case, the event-time watermark and the BETWEEN condition can be used to calculate a state watermark, i.e., time threshold for the state rows that can be discarded. For example, say each join side has a time column, named "leftTime" and "rightTime", and there is a join condition "leftTime > rightTime - 8 min". While processing, say the watermark on right input is "12:34". This means that from henceforth, only right inputs rows with "rightTime > 12:34" will be processed, and any older rows will be considered as "too late" and therefore dropped. Then, the left side buffer only needs to keep rows where "leftTime > rightTime - 8 min > 12:34 - 8m > 12:26". That is, the left state watermark is 12:26, and any rows older than that can be dropped from the state. In other words, the operator will discard all state where timestamp in state value (input rows) < state watermark. This threshold is called State Value Watermark (to distinguish from the state key watermark).

    Note:

    • The event watermark value of one side is used to calculate the state watermark of the other side. That is, a condition ~ "leftTime > rightTime + X" with the right side event watermark is used to calculate the left side state watermark. Conversely, a condition ~ "leftTime < rightTime + Y" with the left side event watermark is used to calculate the right side state watermark.
    • Depending on the conditions, the state watermark may be different for the left and right side. In the above example, leftTime > 12:26 AND rightTime > 12:34 - 1 hour = 11:34.
    • State can be dropped from BOTH sides only when there are conditions of the above forms that define time bounds on timestamp in both directions.

    3. When both window in join key and time range conditions are present, case 1 + 2. In this case, since window equality is a stricter condition than the time range, we can use the State Key Watermark = event time watermark to discard state (similar to case 1).
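
    For reference, the time-range join in case 2 can be written with the public DataFrame API roughly as follows (the DataFrame names, watermark delays, and columns are illustrative and match the pseudo-SQL above):

    import org.apache.spark.sql.functions.expr

    val joined = leftTable
      .withWatermark("leftTime", "20 minutes")          // illustrative watermark delay
      .join(
        rightTable.withWatermark("rightTime", "30 minutes"),
        expr("""
          leftKey = rightKey AND
          leftTime > rightTime - interval 8 minutes AND
          leftTime < rightTime + interval 1 hour
        """))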

    leftKeys

    Expression to generate key rows for joining from left input

    rightKeys

    Expression to generate key rows for joining from right input

    joinType

    Type of join (inner, left outer, etc.)

    condition

    Conditions to filter rows, split by left, right, and joined. See JoinConditionSplitPredicates

    stateInfo

    Version information required to read join state (buffered rows)

    eventTimeWatermark

    Watermark of input event, same for both sides

    stateWatermarkPredicates

    Predicates for removal of state, see JoinStateWatermarkPredicates

    left

    Left child plan

    right

    Right child plan

  72. trait TriggerExecutor extends AnyRef
  73. case class ValueRunTimeMsPair(value: Long, runTimeMs: Long) extends Product with Serializable
  74. trait WatermarkSupport extends SparkPlan with UnaryExecNode

    An operator that supports watermark.

  75. case class WatermarkTracker(policy: MultipleWatermarkPolicy) extends Logging with Product with Serializable

    Tracks the watermark value of a streaming query based on a given policy

Value Members

  1. object ACTIVE extends State with Product with Serializable
  2. object CheckpointFileManager extends Logging
  3. object CleanSourceMode extends Enumeration
  4. object CommitLog
  5. object CommitMetadata extends Serializable
  6. object CompactibleFileStreamLog
  7. object ConsoleTable extends Table with SupportsWrite
  8. object ContinuousTrigger extends Serializable
  9. object EventTimeStats extends Serializable
  10. object FileStreamSink extends Logging
  11. object FileStreamSinkLog
  12. object FileStreamSource
  13. object FileStreamSourceLog
  14. object FileStreamSourceOffset extends Serializable
  15. object HDFSMetadataLog
  16. object INITIALIZING extends State with Product with Serializable
  17. object LongOffset extends Serializable
  18. object MaxWatermark extends MultipleWatermarkPolicy with Product with Serializable

    Policy to choose the *max* of the operator watermark values as the global watermark value. So the global watermark will advance if any of the individual operator watermarks has advanced. In other words, in a streaming query with multiple input streams and watermarks defined on all of them, the global watermark will advance as fast as the fastest input. So if there is watermark based state cleanup or late-data dropping, then this policy is the most aggressive one and may lead to unexpected behavior if the data of the slow stream is delayed.

  19. object MemoryStream extends Serializable
  20. object MemoryStreamReaderFactory extends PartitionReaderFactory
  21. object MemoryStreamTableProvider extends SimpleTableProvider
  22. object MicroBatchExecution
  23. object MinWatermark extends MultipleWatermarkPolicy with Product with Serializable

    Policy to choose the *min* of the operator watermark values as the global watermark value. Note that this is the safe (hence default) policy as the global watermark will advance only if all the individual operator watermarks have advanced. In other words, in a streaming query with multiple input streams and watermarks defined on all of them, the global watermark will advance as slowly as the slowest input. So if there is watermark based state cleanup or late-data dropping, then this policy is the most conservative one.
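
    The policy is chosen through the spark.sql.streaming.multipleWatermarkPolicy SQL configuration ("min" by default); for example:

    // Opt in to the more aggressive max policy for queries with multiple watermarked inputs.
    spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")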

  24. object MultipleWatermarkPolicy
  25. object OffsetSeq extends Serializable
  26. object OffsetSeqLog
  27. object OffsetSeqMetadata extends Logging with Serializable
  28. object OneTimeTrigger extends Trigger with Product with Serializable

    A Trigger that processes only one batch of data in a streaming query then terminates the query.

  29. object ProcessingTimeTrigger extends Serializable
  30. object RECONFIGURING extends State with Product with Serializable
  31. object SinkFileStatus extends Serializable
  32. object StreamExecution
  33. object StreamMetadata extends Logging with Serializable
  34. object StreamingDeduplicateExec extends Serializable
  35. object StreamingExecutionRelation extends Serializable
  36. object StreamingQueryListenerBus
  37. object StreamingRelation extends Serializable
  38. object StreamingSymmetricHashJoinHelper extends Logging

    Helper object for StreamingSymmetricHashJoinExec. See that object for more details.

  39. object TERMINATED extends State with Product with Serializable
  40. object WatermarkSupport extends Serializable
  41. object WatermarkTracker extends Serializable
