org.apache.spark.util.collection

package collection


Type Members

  1. class AppendOnlyMap[K, V] extends Iterable[(K, V)] with Serializable

    :: DeveloperApi :: A simple open hash table optimized for the append-only use case, where keys are never removed, but the value for each key may be changed.

    This implementation uses quadratic probing with a power-of-2 hash table size, which is guaranteed to explore all spaces for each key (see http://en.wikipedia.org/wiki/Quadratic_probing).

    TODO: Cache the hash values of each key? java.util.HashMap does that.

    Annotations
    @DeveloperApi()
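
    A minimal sketch of the append-only, quadratic-probing scheme described above (illustrative only, not Spark's actual AppendOnlyMap; the class name, default capacity, and update signature are assumptions, and growth and null-key handling are omitted):

```scala
// Illustrative append-only open hash table with quadratic probing and a
// power-of-2 capacity (a sketch, not Spark's AppendOnlyMap).
class ProbeMap[K, V](capacity: Int = 64) {
  require((capacity & (capacity - 1)) == 0, "capacity must be a power of 2")
  private val mask = capacity - 1
  private val keys = new Array[Any](capacity)
  private val values = new Array[Any](capacity)

  // Insert a new value or update an existing one in place
  // (keys are never removed, matching the append-only use case).
  def changeValue(key: K, updateFunc: Option[V] => V): V = {
    var pos = key.hashCode() & mask
    var i = 1
    while (true) {
      val curKey = keys(pos)
      if (curKey == null) {
        keys(pos) = key
        val newVal = updateFunc(None)
        values(pos) = newVal
        return newVal
      } else if (curKey == key) {
        val newVal = updateFunc(Some(values(pos).asInstanceOf[V]))
        values(pos) = newVal
        return newVal
      } else {
        // Quadratic probing: the step grows by 1 each iteration, which with a
        // power-of-2 table size is guaranteed to visit every slot.
        pos = (pos + i) & mask
        i += 1
      }
    }
    throw new IllegalStateException("table full") // unreachable while slots remain
  }
}
```
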
  2. class BitSet extends Serializable

    A simple, fixed-size bit set implementation. This implementation is fast because it avoids safety/bounds checking.
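
    A minimal sketch of such a fixed-size, Long-array-backed bit set (illustrative only; the class and method names here are assumptions, not Spark's BitSet API):

```scala
// Fixed-size bit set backed by an Array[Long]; like the class described above,
// it does no explicit bounds checking of its own.
class SimpleBitSet(numBits: Int) {
  private val words = new Array[Long]((numBits + 63) >> 6)

  def set(index: Int): Unit = {
    val w = index >> 6                           // which 64-bit word
    words(w) = words(w) | (1L << (index & 63))   // which bit within it
  }

  def get(index: Int): Boolean =
    (words(index >> 6) & (1L << (index & 63))) != 0

  def cardinality(): Int =
    words.map(w => java.lang.Long.bitCount(w)).sum
}
```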

  3. class ExternalAppendOnlyMap[K, V, C] extends Iterable[(K, C)] with Serializable with Logging with Spillable[SizeTracker]

    :: DeveloperApi :: An append-only map that spills sorted content to disk when there is insufficient space for it to grow.

    This map takes two passes over the data:

    (1) Values are merged into combiners, which are sorted and spilled to disk as necessary.
    (2) Combiners are read from disk and merged together.

    The setting of the spill threshold faces the following trade-off: If the spill threshold is too high, the in-memory map may occupy more memory than is available, resulting in OOM. However, if the spill threshold is too low, we spill frequently and incur unnecessary disk writes. This may lead to a performance regression compared to the normal case of using the non-spilling AppendOnlyMap.

    Two parameters control the memory threshold:

    spark.shuffle.memoryFraction specifies the collective amount of memory used for storing these maps as a fraction of the executor's total memory. Since each concurrently running task maintains one map, the actual threshold for each map is this quantity divided by the number of running tasks.

    spark.shuffle.safetyFraction specifies an additional margin of safety as a fraction of this threshold, in case map size estimation is not sufficiently accurate.
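
    Putting the two settings together, the per-map threshold works out to roughly executorMemory * memoryFraction * safetyFraction / numRunningTasks. A back-of-envelope sketch (illustrative arithmetic only; perMapThreshold and its parameter names are assumptions, not Spark internals):

```scala
// Rough per-map spill threshold implied by the two settings described above.
def perMapThreshold(
    executorMemoryBytes: Long,
    memoryFraction: Double,  // spark.shuffle.memoryFraction
    safetyFraction: Double,  // spark.shuffle.safetyFraction
    numRunningTasks: Int): Long = {
  val collective = executorMemoryBytes * memoryFraction * safetyFraction
  (collective / numRunningTasks).toLong
}
```

    For example, with a 1 GB executor, memoryFraction = 0.2, safetyFraction = 0.8, and 4 concurrently running tasks, each map may grow to roughly 40 MB before spilling.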

    Annotations
    @DeveloperApi()
