com.twitter.scalding

MapsideReduce

class MapsideReduce[V] extends BaseOperation[SummingCache[Tuple, V]] with Function[SummingCache[Tuple, V]] with ScaldingPrepare[SummingCache[Tuple, V]]

An implementation of map-side combining that is appropriate for associative and commutative functions. If a cacheSize is given, it is used; otherwise the config is queried for cascading.aggregateby.threshold (the standard Cascading parameter for the equivalent case); failing that, a default value of 100,000 is used.
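The fallback order for the cache size can be sketched as follows. This is a minimal illustration, not the class's actual code: `CacheSizeSketch`, `resolveCacheSize`, and the `Map`-based `config` argument are hypothetical stand-ins (the real method takes a `FlowProcess`), and only the config key and the default value come from the description above.

```scala
object CacheSizeSketch {
  // Both values are taken from the class description.
  val SizeConfigKey: String = "cascading.aggregateby.threshold"
  val DefaultCacheSize: Int = 100000

  // `config` is a hypothetical stand-in for the job configuration lookup.
  // Order of precedence: explicit size, then config value, then default.
  def resolveCacheSize(explicit: Option[Int], config: Map[String, String]): Int =
    explicit
      .orElse(config.get(SizeConfigKey).map(_.toInt))
      .getOrElse(DefaultCacheSize)
}
```

With this sketch, an explicit `Some(n)` always wins, and the default is reached only when neither an explicit size nor a config entry is present.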

This keeps a cache of keys up to the cache size, summing values as keys collide. On eviction, or on completion of this Operation, the key-value pairs are put into outputCollector.

This NEVER spills to disk and should generally never be a performance penalty. If the keys have poor locality, you get no benefit, but there is little added cost.

Note that this means you may still have repeated keys in the output, even on a single mapper, since the key space may be too large to fit in the cache all at once.
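The caching behaviour described above, including the repeated keys, can be illustrated with a small self-contained sketch. It does not use Algebird's `SummingCache`: `BoundedSummer` and `BoundedSummerDemo` are hypothetical names, and evicting the oldest-inserted key is an assumption made for simplicity, not a statement about the real eviction policy.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the bounded cache; not Algebird's SummingCache.
final class BoundedSummer[K, V](capacity: Int, plus: (V, V) => V) {
  require(capacity > 0, "capacity must be positive")
  private val cache = mutable.LinkedHashMap.empty[K, V]

  // Merge (k, v) into the cache; return the pair evicted to make room, if any.
  def put(k: K, v: V): Option[(K, V)] =
    cache.get(k) match {
      case Some(old) =>
        cache.update(k, plus(old, v)) // key collision: sum in place
        None
      case None =>
        val evicted =
          if (cache.size >= capacity) {
            val oldest = cache.head // assumption: evict the oldest-inserted key
            cache.remove(oldest._1)
            Some(oldest)
          } else None
        cache.update(k, v)
        evicted
    }

  // Drain whatever is left, as happens on completion of the operation.
  def flush(): List[(K, V)] = {
    val out = cache.toList
    cache.clear()
    out
  }
}

object BoundedSummerDemo {
  // Emits evicted pairs as they occur, followed by the final flush.
  def run(input: Seq[(String, Int)], capacity: Int): List[(String, Int)] = {
    val summer = new BoundedSummer[String, Int](capacity, _ + _)
    val evicted = input.flatMap { case (k, v) => summer.put(k, v) }.toList
    evicted ++ summer.flush()
  }
}
```

For the input `a→1, b→2, a→3, c→4, a→5` with a capacity of 2, this sketch emits `(a,4), (b,2), (c,4), (a,5)`: the key `a` appears twice because it was evicted before its last value arrived. The downstream reducer still sees the correct total of 9 for `a`, which is why the combining function needs to be associative and commutative. With a capacity large enough to hold every key, each key is emitted exactly once.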

You can use this with the Fields-API by doing:

val msr = new MapsideReduce(Semigroup.from(fn), 'key, 'value, None)
// MUST map onto the same key,value space (may be multiple fields)
val mapSideReduced = pipe.eachTo(('key, 'value) -> ('key, 'value)) { _ => msr }

That said, this is equivalent to AggregateBy, and its only advantage is that it is much simpler. AggregateBy assumes several parallel reductions are happening, and thus has many loops and array lookups to deal with that. Since this does far fewer allocations and has a shorter code path, it may be faster for the typed API.

Linear Supertypes
ScaldingPrepare[SummingCache[Tuple, V]], Function[SummingCache[Tuple, V]], BaseOperation[SummingCache[Tuple, V]], Operation[SummingCache[Tuple, V]], DeclaresResults, Serializable, AnyRef, Any

Instance Constructors

  1. new MapsideReduce(commutativeSemigroup: Semigroup[V], keyFields: Fields, valueFields: Fields, cacheSize: Option[Int])(implicit conv: TupleConverter[V], set: TupleSetter[V])

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. val DEFAULT_CACHE_SIZE: Int

  7. val SIZE_CONFIG_KEY: String

  8. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  9. val boxedSemigroup: Externalizer[Semigroup[V]]

  10. def cacheSize(fp: FlowProcess[_]): Int

  11. def cleanup(flowProcess: FlowProcess[_], operationCall: OperationCall[SummingCache[Tuple, V]]): Unit

    Definition Classes
    MapsideReduce → BaseOperation → Operation
  12. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  13. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  14. def equals(arg0: Any): Boolean

    Definition Classes
    BaseOperation → AnyRef → Any
  15. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  16. def flush(flowProcess: FlowProcess[_], operationCall: OperationCall[SummingCache[Tuple, V]]): Unit

    Definition Classes
    MapsideReduce → BaseOperation → Operation
  17. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  18. def getFieldDeclaration(): Fields

    Definition Classes
    BaseOperation → Operation → DeclaresResults
  19. def getNumArgs(): Int

    Definition Classes
    BaseOperation → Operation
  20. def getTrace(): String

    Definition Classes
    BaseOperation
  21. def hashCode(): Int

    Definition Classes
    BaseOperation → AnyRef → Any
  22. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  23. def isSafe(): Boolean

    Definition Classes
    BaseOperation → Operation
  24. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  25. final def notify(): Unit

    Definition Classes
    AnyRef
  26. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  27. def operate(flowProcess: FlowProcess[_], functionCall: FunctionCall[SummingCache[Tuple, V]]): Unit

    Definition Classes
    MapsideReduce → Function
  28. def prepare(flowProcess: FlowProcess[_], operationCall: OperationCall[SummingCache[Tuple, V]]): Unit

    Definition Classes
    MapsideReduce → ScaldingPrepare → BaseOperation → Operation
  29. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  30. def toString(): String

    Definition Classes
    BaseOperation → AnyRef → Any
  31. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  32. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  33. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
