Filter out the obsolete logs.
Deserialize the string into a data object.
If we delete the old files immediately after compaction, there is a race condition in S3: other
processes may see that the old files are deleted but still cannot see the compaction file using
"list". allFiles handles this by looking for the next compaction file directly. However,
a livelock may occur if compaction happens too frequently: one process keeps deleting
old files while another keeps retrying. Setting a reasonable cleanup delay can avoid it.
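The effect of a cleanup delay can be illustrated with a minimal sketch. Everything here is hypothetical (the `LogFile` type, the `expired` helper, and the millisecond parameters are assumptions for illustration, not the library's API): a file becomes eligible for deletion only if it precedes the latest compaction batch and has been untouched for at least the delay, which gives slower readers time to find the compaction file.

```scala
// Hypothetical sketch; names are illustrative, not the library's API.
final case class LogFile(batchId: Long, modificationTimeMs: Long)

object CleanupSketch {
  /** Files older than the latest compaction batch whose age is at least the delay. */
  def expired(
      files: Seq[LogFile],
      latestCompactionBatchId: Long,
      nowMs: Long,
      cleanupDelayMs: Long): Seq[LogFile] =
    files.filter { f =>
      f.batchId < latestCompactionBatchId &&
        nowMs - f.modificationTimeMs >= cleanupDelayMs
    }
}
```

A larger delay trades disk usage for a wider window in which concurrent readers can still open the old files.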
Serialize the data into an encoded string.
Store the metadata for the specified batchId and return true if successful. If the batchId's
metadata has already been stored, this method will return false.
Note that this method must be called on an org.apache.spark.util.UninterruptibleThread
so that interrupts can be disabled while writing the batch file. This is because there is a
potential deadlock in Hadoop "Shell.runCommand" before 2.5.0 (HADOOP-10622). If the thread
running "Shell.runCommand" is interrupted, then the thread can get deadlocked. In our
case, writeBatch creates a file using the HDFS API and calls "Shell.runCommand" to set the
file permissions, and can get deadlocked if the stream execution thread is stopped by
interrupt. Hence, we make sure that this method is called on UninterruptibleThread, which
allows us to disable interrupts here. Also see SPARK-14131.
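The true/false return contract described above can be sketched with an in-memory stand-in. The class `InMemoryLog` and its method names are assumptions made for illustration; the sketch ignores file I/O and the interrupt handling entirely.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a metadata log; not the real implementation.
final class InMemoryLog[T] {
  private val entries = mutable.TreeMap.empty[Long, T]

  /** Stores metadata for batchId; returns false if that batchId is already stored. */
  def add(batchId: Long, metadata: T): Boolean =
    if (entries.contains(batchId)) false
    else { entries.put(batchId, metadata); true }

  /** Returns the metadata for batchId if stored, otherwise None. */
  def get(batchId: Long): Option[T] = entries.get(batchId)
}
```

Returning false instead of overwriting means a second writer for the same batch can detect that it lost the race.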
Returns all files except the deleted ones.
Return metadata for batches between startId (inclusive) and endId (inclusive). If startId
is None, just return all batches up to endId (inclusive).
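The range semantics can be shown with a small sketch over an in-memory map. The object `RangeSketch` and the signature of `range` are hypothetical, and `endId` is assumed here to be a plain `Long`.

```scala
// Hypothetical sketch of the inclusive range lookup; not the library's API.
object RangeSketch {
  /** Batches with startId <= id <= endId; no lower bound when startId is None. */
  def range[T](log: Map[Long, T], startId: Option[Long], endId: Long): Seq[(Long, T)] =
    log.toSeq
      .filter { case (id, _) => startId.forall(_ <= id) && id <= endId }
      .sortBy(_._1)
}
```

`Option.forall` returns true for `None`, which is what removes the lower bound when no startId is given.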
Return the metadata for the specified batchId if it's stored. Otherwise, return None.
Return the latest batch id and its metadata, if any exist.
Removes all log entries earlier than thresholdBatchId (exclusive).
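As a minimal illustration of the exclusive threshold, the following sketch purges an in-memory map. `PurgeSketch` and `purge` are hypothetical names; the entry at thresholdBatchId itself is kept.

```scala
// Hypothetical sketch of purge semantics; not the library's API.
object PurgeSketch {
  /** Keeps only entries with batchId >= thresholdBatchId. */
  def purge[T](log: Map[Long, T], thresholdBatchId: Long): Map[Long, T] =
    log.filter { case (id, _) => id >= thresholdBatchId }
}
```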
An abstract class for compactible metadata logs. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple serialized metadata lines following.
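The file layout described above (a version line followed by serialized metadata lines) might look like the sketch below. The version string `"v1"`, the object name, and the use of plain strings as already-serialized entries are all assumptions for illustration.

```scala
// Hypothetical sketch of the log file layout; the version string is an assumption.
object LogFormatSketch {
  val Version = "v1"

  /** First line is the version; each following line is one serialized entry. */
  def serialize(entries: Seq[String]): String =
    (Version +: entries).mkString("\n")

  /** Checks the version line and returns the remaining entry lines. */
  def deserialize(content: String): Seq[String] = {
    val lines = content.split("\n").toSeq
    require(lines.headOption.contains(Version), s"Unexpected version: ${lines.headOption}")
    lines.tail
  }
}
```

Checking the version before reading any entry lets a reader fail fast on a file written in a format it does not understand.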
Reading from many small files is usually slow, and too many small files in one folder put a strain on the file system, so CompactibleFileStreamLog compacts log files into one big file every 10 batches by default. When doing a compaction, it reads all old log files and merges them with the new batch.
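The compaction schedule can be sketched as follows. This assumes batch ids start at 0 and that a compaction batch is the last batch of each interval (so 9, 19, 29, ... for an interval of 10); the helper names are hypothetical. The batches to merge include the previous compaction file, since it already holds everything before it.

```scala
// Hypothetical sketch of the compaction schedule; assumes batch ids start at 0.
object CompactionSketch {
  /** True when batchId is the last batch of a compaction interval. */
  def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
    (batchId + 1) % compactInterval == 0

  /** Batch ids whose log files are read when compacting at compactionBatchId. */
  def batchesToMerge(compactionBatchId: Long, compactInterval: Int): Seq[Long] = {
    require(isCompactionBatch(compactionBatchId, compactInterval))
    // Start at the previous compaction batch, or at 0 for the first compaction.
    val start = math.max(0L, compactionBatchId - compactInterval)
    start until compactionBatchId
  }
}
```

With an interval of 10, compacting at batch 19 reads batches 9 through 18 and merges them with the new batch's entries.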