class FileStreamSinkLog extends CompactibleFileStreamLog[SinkFileStatus]

A special log for FileStreamSink. It writes one log file per batch. The first line of each log file is the version number; each subsequent line is the JSON serialization of a SinkFileStatus.

As reading many small files is usually slow, FileStreamSinkLog compacts log files every "spark.sql.sink.file.log.compactLen" batches into one big file. A compaction reads all old log files and merges them with the new batch, dropping entries for files marked as deleted (via SinkFileStatus.action). When a reader calls allFiles to list files, only the visible files are returned (deleted files are dropped).
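The file layout described above can be sketched in plain Scala. This is a minimal sketch, assuming a "v"-prefixed version line and illustrative JSON lines; it is not the exact on-disk SinkFileStatus schema.

```scala
// Minimal sketch of the sink log file layout: a version line followed by
// one JSON object per line. The "v" prefix and the shape of the JSON
// lines are illustrative assumptions, not the exact on-disk schema.
object SinkLogFormat {
  private val VersionPrefix = "v"

  def serialize(version: Int, jsonLines: Seq[String]): String =
    (s"$VersionPrefix$version" +: jsonLines).mkString("\n")

  def deserialize(content: String): (Int, Seq[String]) = {
    val lines = content.split("\n").toSeq
    require(lines.nonEmpty && lines.head.startsWith(VersionPrefix),
      "log file must start with a version line")
    (lines.head.stripPrefix(VersionPrefix).toInt, lines.tail)
  }
}
```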

Linear Supertypes

CompactibleFileStreamLog[SinkFileStatus], HDFSMetadataLog[Array[SinkFileStatus]], Logging, MetadataLog[Array[SinkFileStatus]], AnyRef, Any

Instance Constructors

  1. new FileStreamSinkLog(metadataLogVersion: Int, sparkSession: SparkSession, path: String, _retentionMs: Option[Long] = None)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def add(batchId: Long, logs: Array[SinkFileStatus]): Boolean

    Store the metadata for the specified batchId and return true if successful. If the batchId's metadata has already been stored, this method will return false.

    Definition Classes
    CompactibleFileStreamLog → HDFSMetadataLog → MetadataLog
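The add contract (true on the first write, false when the batchId is already stored) can be modeled with an in-memory stand-in. This is a sketch of the semantics only, not the HDFS-backed implementation; the class name is hypothetical.

```scala
import scala.collection.mutable

// In-memory model of the MetadataLog add/get contract: the first add for
// a batchId succeeds and returns true; a repeated add returns false and
// leaves the stored metadata untouched.
class InMemoryMetadataLog[T] {
  private val batches = mutable.Map.empty[Long, T]

  def add(batchId: Long, metadata: T): Boolean =
    if (batches.contains(batchId)) false
    else { batches(batchId) = metadata; true }

  def get(batchId: Long): Option[T] = batches.get(batchId)

  def getLatest(): Option[(Long, T)] =
    batches.keys.reduceOption(_ max _).map(id => (id, batches(id)))
}
```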
  5. def addNewBatchByStream(batchId: Long)(fn: (OutputStream) => Unit): Boolean

    Store the metadata for the specified batchId and return true if successful. This method fills the content of metadata via executing function. If the function throws an exception, writing will be automatically cancelled and this method will propagate the exception.

    If the batchId's metadata has already been stored, this method will return false.

    Writing the metadata is done by writing the batch to a temp file and then renaming it to the batch file.

    There may be multiple HDFSMetadataLog instances using the same metadata path. Although this is not valid behavior, we still need to prevent them from destroying the files.

    Definition Classes
    HDFSMetadataLog
  6. def allFiles(): Array[SinkFileStatus]

    Returns all files except the deleted ones.

    Definition Classes
    CompactibleFileStreamLog
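The compaction-and-visibility behavior behind allFiles can be sketched with plain collections. FileEntry and its action strings are illustrative stand-ins for SinkFileStatus, not the real type.

```scala
import scala.collection.mutable

// Illustrative stand-in for SinkFileStatus: only path and action.
case class FileEntry(path: String, action: String) // "add" or "delete"

// Compaction merges all batches, keeping the latest entry per path.
def compact(batches: Seq[Seq[FileEntry]]): Seq[FileEntry] = {
  val latest = mutable.LinkedHashMap.empty[String, FileEntry]
  batches.flatten.foreach(e => latest(e.path) = e)
  latest.values.toSeq
}

// allFiles returns only the visible files, dropping deleted ones.
def allFiles(entries: Seq[FileEntry]): Seq[FileEntry] =
  entries.filter(_.action != "delete")
```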
  7. def applyFnToBatchByStream[RET](batchId: Long, skipExistingCheck: Boolean = false)(fn: (InputStream) => RET): RET

    Apply provided function to each entry in the specific batch metadata log.

    Unlike get, which materializes all entries into memory, this method streams the entries and processes them as they are read (READ-AND-PROCESS). This helps to avoid memory issues with huge metadata log files.

    NOTE: This no longer fails early on corruption. The caller should handle the exception properly and make sure the logic is not affected by failing in the middle.

    Definition Classes
    HDFSMetadataLog
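The read-and-process idea can be sketched with java.io streams: entries are folded one line at a time rather than materialized as a whole. The helper name is hypothetical.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}

// Hypothetical sketch of READ-AND-PROCESS: fold over the lines of the
// stream one at a time instead of loading them all into memory.
def processEntries[R](in: InputStream, zero: R)(fn: (R, String) => R): R = {
  val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
  try {
    Iterator.continually(reader.readLine())
      .takeWhile(_ != null)
      .foldLeft(zero)(fn)
  } finally reader.close()
}
```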
  8. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  9. val batchCache: Map[Long, Array[SinkFileStatus]]

    Cache the latest two batches. StreamExecution usually just accesses the latest two batches when committing offsets, this cache will save some file system operations.

    Attributes
    protected[sql]
    Definition Classes
    HDFSMetadataLog
  10. val batchFilesFilter: PathFilter

    A PathFilter to filter only batch files

    Attributes
    protected
    Definition Classes
    HDFSMetadataLog
  11. def batchIdToPath(batchId: Long): Path
  12. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @native()
  13. final lazy val compactInterval: Int
    Attributes
    protected
    Definition Classes
    CompactibleFileStreamLog
  14. val defaultCompactInterval: Int
    Attributes
    protected
    Definition Classes
    FileStreamSinkLog → CompactibleFileStreamLog
  15. def deserialize(in: InputStream): Array[SinkFileStatus]

    Read and deserialize the metadata from input stream. If this method is overridden in a subclass, the overriding method should not close the given input stream, as it will be closed in the caller.

    Definition Classes
    CompactibleFileStreamLog → HDFSMetadataLog
  16. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  17. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  18. val fileCleanupDelayMs: Long

    If we delete the old files immediately after compaction, there is a race condition on S3: other processes may see that the old files are deleted but still not see the compaction file via "list". allFiles handles this by looking for the next compaction file directly; however, a live lock may occur if compaction happens too frequently: one process keeps deleting old files while another keeps retrying. Setting a reasonable cleanup delay avoids this.

    Attributes
    protected
    Definition Classes
    FileStreamSinkLog → CompactibleFileStreamLog
  19. val fileManager: CheckpointFileManager
    Attributes
    protected
    Definition Classes
    HDFSMetadataLog
  20. def filterInBatch(batchId: Long)(predicate: (SinkFileStatus) => Boolean): Option[Array[SinkFileStatus]]

    Apply filter on all entries in the specific batch.

    Definition Classes
    CompactibleFileStreamLog
  21. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable])
  22. def foreachInBatch(batchId: Long)(fn: (SinkFileStatus) => Unit): Unit

    Apply function on all entries in the specific batch. The method will throw FileNotFoundException if the metadata log file doesn't exist.

    NOTE: This doesn't fail early on corruption. The caller should handle the exception properly and make sure the logic is not affected by failing in the middle.

    Definition Classes
    CompactibleFileStreamLog
  23. def get(startId: Option[Long], endId: Option[Long]): Array[(Long, Array[SinkFileStatus])]

    Return metadata for batches between startId (inclusive) and endId (inclusive). If startId is None, just return all batches before endId (inclusive).

    Definition Classes
    HDFSMetadataLog → MetadataLog
  24. def get(batchId: Long): Option[Array[SinkFileStatus]]

    Return the metadata for the specified batchId if it's stored. Otherwise, return None.

    Definition Classes
    HDFSMetadataLog → MetadataLog
  25. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  26. def getLatest(): Option[(Long, Array[SinkFileStatus])]

    Return the latest batch id and its metadata, if they exist.

    Definition Classes
    HDFSMetadataLog → MetadataLog
  27. def getLatestBatchId(): Option[Long]

    Return the latest batch id without reading the file.

    Definition Classes
    HDFSMetadataLog
  28. def getOrderedBatchFiles(): Array[FileStatus]

    Get an array of FileStatus referencing batch files, sorted from most recent to oldest.

    Definition Classes
    HDFSMetadataLog
  29. def getPrevBatchFromStorage(batchId: Long): Option[Long]

    Get the id of the previous batch from storage.

    batchId

    the batch id to get the previous batch of

    Definition Classes
    HDFSMetadataLog
  30. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  31. def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  32. def initializeLogIfNecessary(isInterpreter: Boolean): Unit
    Attributes
    protected
    Definition Classes
    Logging
  33. def isBatchFile(path: Path): Boolean
  34. val isDeletingExpiredLog: Boolean
    Attributes
    protected
    Definition Classes
    FileStreamSinkLog → CompactibleFileStreamLog
  35. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  36. def isTraceEnabled(): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  37. def listBatches: Array[Long]

    List the available batches on file system.

    Attributes
    protected
    Definition Classes
    HDFSMetadataLog
  38. def listBatchesOnDisk: Array[Long]

    List the batches persisted to storage.

    returns

    an array of batch ids

    Definition Classes
    HDFSMetadataLog
  39. def log: Logger
    Attributes
    protected
    Definition Classes
    Logging
  40. def logDebug(msg: => String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  41. def logDebug(msg: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  42. def logError(msg: => String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  43. def logError(msg: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  44. def logInfo(msg: => String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  45. def logInfo(msg: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  46. def logName: String
    Attributes
    protected
    Definition Classes
    Logging
  47. def logTrace(msg: => String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  48. def logTrace(msg: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  49. def logWarning(msg: => String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  50. def logWarning(msg: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  51. val metadataCacheEnabled: Boolean
    Attributes
    protected
    Definition Classes
    HDFSMetadataLog
  52. val metadataPath: Path
    Definition Classes
    HDFSMetadataLog
  53. val minBatchesToRetain: Int
    Attributes
    protected
    Definition Classes
    CompactibleFileStreamLog
  54. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  55. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  56. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  57. def pathToBatchId(path: Path): Long
  58. def purge(thresholdBatchId: Long): Unit

    CompactibleFileStreamLog maintains logs by itself, and manual purging might break internal state, specifically which latest compaction batch is purged.

    To simplify the situation, this method just throws UnsupportedOperationException regardless of the given parameter, and lets CompactibleFileStreamLog handle purging by itself.

    Definition Classes
    CompactibleFileStreamLog → HDFSMetadataLog → MetadataLog
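A minimal sketch of the purge contract described above; the class name and message are modeled on the description, not copied from the implementation.

```scala
// Models the documented behavior: manual purging is rejected so the
// compactible log can manage retention itself.
class CompactibleLogModel {
  def purge(thresholdBatchId: Long): Unit =
    throw new UnsupportedOperationException(
      "Cannot purge manually; CompactibleFileStreamLog handles purging by itself")
}
```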
  59. def purgeAfter(thresholdBatchId: Long): Unit

    Removes all log entries later than thresholdBatchId (exclusive).

    Definition Classes
    HDFSMetadataLog
  60. val retentionMs: Long
  61. def serialize(logData: Array[SinkFileStatus], out: OutputStream): Unit

    Serialize the metadata and write to the output stream. If this method is overridden in a subclass, the overriding method should not close the given output stream, as it will be closed in the caller.

    Definition Classes
    CompactibleFileStreamLog → HDFSMetadataLog
  62. def shouldRetain(log: SinkFileStatus, currentTime: Long): Boolean

    Determine whether the log should be retained or not.

    Default implementation retains all log entries. Implementations should override the method to change the behavior.

    Definition Classes
    FileStreamSinkLog → CompactibleFileStreamLog
  63. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  64. def toString(): String
    Definition Classes
    AnyRef → Any
  65. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  66. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  67. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  68. def write(batchMetadataFile: Path, fn: (OutputStream) => Unit): Unit
    Attributes
    protected
    Definition Classes
    HDFSMetadataLog

Inherited from HDFSMetadataLog[Array[SinkFileStatus]]

Inherited from Logging

Inherited from MetadataLog[Array[SinkFileStatus]]

Inherited from AnyRef

Inherited from Any
