Class StreamingTombstoneHistogramBuilder


  • public class StreamingTombstoneHistogramBuilder
    extends java.lang.Object
    Histogram that can be built from a stream of data. It is used, for example, to retrieve the number of droppable tombstones via SSTableReader.getDroppableTombstonesBefore(long).

    When an sstable is written (or streamed), this histogram builder receives the "local deletion timestamp" as a long via update(long). Negative values are not supported.

    Algorithm: the histogram is represented as a collection of {point, weight} pairs. When a new point p with weight m is added:

    1. If point p already exists in the collection, add m to the recorded weight of point p.
    2. If there is no point p in the collection, add point p with weight m.
    3. If a point was added and the collection size became larger than maxBinSize:
       1. Find the two nearest points p1 and p2 (with weights m1 and m2) in the collection.
       2. Replace these two points with one weighted point p3 = (p1*m1+p2*m2)/(m1+m2), which carries the combined weight m1+m2.
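
The update-and-merge steps above can be sketched as follows. This is a minimal illustration, not the Cassandra implementation: the class name SimpleStreamingHistogram and its methods are hypothetical, a TreeMap stands in for the real bin storage, points are assumed to be non-negative, and overflow of p*m is ignored.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the {point, weight} algorithm; not the real builder.
class SimpleStreamingHistogram {
    private final int maxBinSize;
    // point -> accumulated weight, kept sorted so nearest points are adjacent
    private final TreeMap<Long, Long> bins = new TreeMap<>();

    SimpleStreamingHistogram(int maxBinSize) {
        if (maxBinSize < 1) throw new IllegalArgumentException("maxBinSize must be >= 1");
        this.maxBinSize = maxBinSize;
    }

    void update(long p, long m) {
        // Steps 1 and 2: add m to an existing point, or insert a new point
        bins.merge(p, m, Long::sum);
        // Step 3: shrink back to maxBinSize by merging the closest pair
        if (bins.size() > maxBinSize) mergeNearestPair();
    }

    private void mergeNearestPair() {
        long left = 0, right = 0, bestGap = Long.MAX_VALUE;
        Long prev = null;
        // Scan adjacent keys to find the pair with the smallest distance
        for (long cur : bins.keySet()) {
            if (prev != null && cur - prev < bestGap) {
                bestGap = cur - prev;
                left = prev;
                right = cur;
            }
            prev = cur;
        }
        long m1 = bins.remove(left);
        long m2 = bins.remove(right);
        // Weighted average of the two points (integer division in this sketch)
        long p3 = (left * m1 + right * m2) / (m1 + m2);
        bins.merge(p3, m1 + m2, Long::sum);
    }

    Map<Long, Long> bins() { return bins; }
}
```

With maxBinSize = 2, adding point 10 with total weight 3, then 20 and 100 with weight 1 each, merges 10 and 20 into the point (10*3 + 20*1) / 4 = 12 with weight 4, leaving the point 100 untouched.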

    There are several optimizations that make the histogram builder faster:

    1. Spool: a large map that avoids excessive merging of small bins. This map can contain up to maxSpoolSize points and accumulates the weight of identical points. For example, if spoolSize=100, binSize=10 and there are only 50 distinct points, there will be only 40 merges regardless of how many points are added.
    2. The spool is organized as an open-addressing primitive hash map in which odd elements are points and even elements are values. The spool cannot resize, so when the number of collisions exceeds a threshold or the size grows larger than array_size/2, the spool is drained into the bin.
    3. The bin is organized as sorted arrays. This reduces garbage collection pressure and allows elements to be found in log(binSize) time via binary search.
    4. To use the existing Arrays.binarySearch, each {point, value} pair in the bin is packed into one long.
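
The packing idea in items 3 and 4 can be sketched roughly as below. The exact bit layout used by the builder is not stated here, so this sketch assumes the point occupies the high 32 bits and the value the low 32 bits, and that both are non-negative; the class PackedBins and its methods are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch: {point, value} pairs packed into longs and kept sorted.
class PackedBins {
    // Assumed layout: point in the high 32 bits, value in the low 32 bits
    static long pack(int point, int value) {
        return ((long) point << 32) | (value & 0xFFFFFFFFL);
    }

    static int point(long packed) { return (int) (packed >>> 32); }

    static int value(long packed) { return (int) packed; }

    /**
     * Returns the index of the bin holding the given point, or a negative
     * number (as Arrays.binarySearch does) when the point is absent.
     * Assumes each point occurs at most once in sortedBins[0..size).
     */
    static int find(long[] sortedBins, int size, int point) {
        // Search for the smallest possible key carrying this point (value 0)
        int idx = Arrays.binarySearch(sortedBins, 0, size, pack(point, 0));
        if (idx >= 0) return idx;
        // Not an exact hit: the bin for this point, if any, sits at the insertion index
        int insertion = -idx - 1;
        if (insertion < size && point(sortedBins[insertion]) == point) return insertion;
        return idx;
    }
}
```

Because the point sits in the high bits, sorting the packed longs sorts by point, so a single primitive long[] supports binary search without allocating pair objects.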

    The original algorithm is taken from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A Streaming Parallel Decision Tree Algorithm" (2010) http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf

    • Method Summary

      Modifier and Type Method Description
      TombstoneHistogram build()
      Creates a 'finished' snapshot of the current state of the histogram, but leaves this builder instance open for subsequent additions to the histogram.
      void flushHistogram()
      Drain the temporary spool into the final bins
      void releaseBuffers()
      Release inner spool buffers.
      static int saturatingCastToInt​(long value)  
      static long saturatingCastToLong​(long value)  
      static long saturatingCastToMaxDeletionTime​(long value)
      Cast to a long with a maximum value of Cell.MAX_DELETION_TIME to avoid representing values that aren't tombstones.
      void update​(long point)
      Adds a new point to this histogram with a default value of 1.
      void update​(long point, int value)
      Adds a new point (point) with the given value (value) to this histogram.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • StreamingTombstoneHistogramBuilder

        public StreamingTombstoneHistogramBuilder​(int maxBinSize,
                                                  int maxSpoolSize,
                                                  int roundSeconds)
    • Method Detail

      • update

        public void update​(long point)
        Adds a new point to this histogram with a default value of 1.
        Parameters:
        point - the point to be added
      • update

        public void update​(long point,
                           int value)
        Adds a new point (point) with the given value (value) to this histogram.
      • flushHistogram

        public void flushHistogram()
        Drain the temporary spool into the final bins
      • releaseBuffers

        public void releaseBuffers()
        Release inner spool buffers. The histogram remains readable and writable, but with reduced performance. Not intended for use before finalization.
      • build

        public TombstoneHistogram build()
        Creates a 'finished' snapshot of the current state of the histogram, but leaves this builder instance open for subsequent additions to the histogram. This gives reasonably consistent behavior with respect to sstable early open.
      • saturatingCastToInt

        public static int saturatingCastToInt​(long value)
      • saturatingCastToLong

        public static long saturatingCastToLong​(long value)
      • saturatingCastToMaxDeletionTime

        public static long saturatingCastToMaxDeletionTime​(long value)
        Cast to a long with a maximum value of Cell.MAX_DELETION_TIME to avoid representing values that aren't tombstones.
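
The saturating casts above are not documented in detail, so the following is only a sketch of what a saturating (clamping) cast usually means: values above the target maximum are pinned to that maximum instead of wrapping around. The names clampToInt and clampToCap are hypothetical, and the cap parameter stands in for Cell.MAX_DELETION_TIME, whose value is not given here. Only the upper bound is clamped, on the assumption that negative inputs do not occur.

```java
// Hypothetical sketch of saturating casts; not the methods documented above.
class SaturatingCasts {
    // Clamp a long into the int range from above; a plain (int) cast would wrap instead
    static int clampToInt(long value) {
        return value > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) value;
    }

    // Clamp a long to an arbitrary upper bound (stand-in for a maximum deletion time)
    static long clampToCap(long value, long cap) {
        return Math.min(value, cap);
    }
}
```

For example, clampToInt(1L << 40) yields Integer.MAX_VALUE, whereas the plain cast (int) (1L << 40) yields 0.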