com.twitter.algebird

TopNCMSMonoid

class TopNCMSMonoid[K] extends TopCMSMonoid[K]

Monoid for top-N based TopCMS sketches. Use with care! (see warning below)

Warning: Adding top-N CMS instances (++) is an unsafe operation

Top-N computations are not associative. As a result, merging top-N CMS instances (e.g. via ++) is subject to an ordering bias with regard to heavy hitters: the outcome depends on the order in which the CMS instances, and therefore their heavy hitters, are merged. Merged heavy hitters may thus be incorrect. As a rule of thumb, the earlier a set of heavy hitters is merged, the more likely the end result is biased towards those heavy hitters.

The warning above only applies when adding CMS instances (think: cms1 ++ cms2). In comparison, heavy hitters are correctly computed when:

See the discussion in Algebird issue 353 for further details.
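To make the ordering bias concrete, here is a minimal toy model. It is not Algebird's implementation: exact counts stand in for the CMS, and ToyTopN, fromCounts, ToyDemo and the sample data are hypothetical names introduced only for this sketch. It assumes that a merge re-ranks only the heavy hitters already tracked by either side, which is enough to make the result depend on merge order.

```scala
// Toy model only: `counts` is an exact stand-in for the CMS,
// `hh` holds at most `n` tracked heavy hitters.
final case class ToyTopN(counts: Map[String, Long], hh: Set[String], n: Int) {
  def ++(other: ToyTopN): ToyTopN = {
    // Counts merge exactly (this part is associative).
    val merged = other.counts.foldLeft(counts) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
    // Only items already tracked on either side are candidates,
    // so an item dropped earlier can never come back.
    val candidates = hh ++ other.hh
    val top = candidates.toSeq.sortBy(k => (-merged(k), k)).take(n).toSet
    ToyTopN(merged, top, n)
  }
}

object ToyTopN {
  // Builds a sketch from a single stream's counts; here the top-n is exact.
  def fromCounts(counts: Map[String, Long], n: Int): ToyTopN = {
    val top = counts.toSeq.sortBy { case (k, v) => (-v, k) }.take(n).map(_._1).toSet
    ToyTopN(counts, top, n)
  }
}

object ToyDemo {
  val a = ToyTopN.fromCounts(Map("x" -> 5L), 1)
  val b = ToyTopN.fromCounts(Map("y" -> 4L), 1)
  val c = ToyTopN.fromCounts(Map("z" -> 4L, "y" -> 3L), 1)

  val left = (a ++ b) ++ c  // "y" is dropped in a ++ b and never re-enters
  val right = a ++ (b ++ c) // "y" survives b ++ c and wins the final merge
}
```

In this toy run both groupings end up with identical counts (x = 5, y = 7, z = 4), yet left tracks x while right tracks y, the true top item.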

Alternatives

Given the warning above, the following alternative data structures may be better choices than a top-N based CMS:

Usage

The type K is the type of items you want to count. You must provide an implicit CMSHasher[K] for K, and Algebird ships with several such implicits for commonly used types such as Long and BigInt:

import com.twitter.algebird.CMSHasherImplicits._

If your type K is not supported out of the box, you have two options: 1) Provide a "translation" function that converts items of your (unsupported) type K to a supported type such as Long, and then use the contramap function of CMSHasher to create the required CMSHasher[K] for your type (see the documentation of CMSHasher for an example); 2) Implement a CMSHasher[K] from scratch, using the existing CMSHasher implementations as a starting point.
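As a rough sketch of option 1, the snippet below derives a hasher for a custom type from the Long hasher. UserId and UserIdHashing are hypothetical names, and the snippet assumes the Algebird version documented here, in which CMSHasherImplicits provides an implicit CMSHasher[Long] and CMSHasher exposes contramap as described above.

```scala
import com.twitter.algebird.CMSHasher
import com.twitter.algebird.CMSHasherImplicits._

// Hypothetical domain type with no CMSHasher out of the box.
final case class UserId(value: Long)

object UserIdHashing {
  // Reuse the existing Long hasher by translating UserId => Long.
  implicit val userIdHasher: CMSHasher[UserId] =
    implicitly[CMSHasher[Long]].contramap((u: UserId) => u.value)
}
```

With UserIdHashing.userIdHasher in implicit scope, UserId should be usable as the type K wherever a CMSHasher[K] is required.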

Note: Because Arrays in Scala/Java do not have sane equals and hashCode implementations, you cannot safely use types such as Array[Byte] directly; extra work is required. For example, you may convert an Array[T] to a Seq[T] via toSeq, or you can provide appropriate wrapper classes. Algebird provides one such wrapper class, Bytes, to safely wrap an Array[Byte] for use with CMS.
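The problem with Arrays can be seen in plain Scala, without any Algebird code. The object name ArrayEqualityDemo is only illustrative.

```scala
object ArrayEqualityDemo {
  val a1: Array[Byte] = Array[Byte](1, 2, 3)
  val a2: Array[Byte] = Array[Byte](1, 2, 3)

  // Arrays compare by reference, so two arrays with equal contents differ.
  val arraysEqual: Boolean = a1 == a2

  // Converting to Seq gives structural equality and a consistent hashCode.
  val seqsEqual: Boolean = a1.toSeq == a2.toSeq
  val seqHashesEqual: Boolean = a1.toSeq.hashCode == a2.toSeq.hashCode
}
```

Here arraysEqual is false while seqsEqual and seqHashesEqual are true, which is why a Seq or a wrapper such as Bytes is the safer key type.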

K

The type used to identify the elements to be counted. For example, if you want to count the occurrences of user names, you could map each user name to a unique numeric ID expressed as a Long, and then count the occurrences of those Longs with a CMS of type K=Long. Note that this mapping between the elements of your problem domain and the identifiers used for counting via CMS should be bijective. K requires a CMSHasher context bound; see CMSHasherImplicits for the available implicits that can be imported. Which type K should you pick in practice? For domains that have fewer than 2^64 unique elements, you would typically use Long. For larger domains you can try BigInt, for example.
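One simple way to obtain such a bijective mapping for a small, known set of user names is a pair of lookup tables. This is only an illustration in plain Scala; UsernameIds and the sample data are hypothetical, and a real system would more likely use existing database IDs.

```scala
object UsernameIds {
  val usernames: Seq[String] = Seq("alice", "bob", "alice", "carol", "bob", "alice")

  // Assign each distinct user name a unique Long ID (first-seen order).
  val toId: Map[String, Long] =
    usernames.distinct.zipWithIndex.map { case (name, i) => name -> i.toLong }.toMap

  // Inverse table, used to translate heavy-hitter IDs back to user names.
  val fromId: Map[Long, String] = toId.map(_.swap)

  // The stream of Long keys that would be fed to a CMS with K = Long.
  val ids: Seq[Long] = usernames.map(toId)
}
```

Because every name has exactly one ID and every ID exactly one name, counts reported for an ID can be attributed to a single user name.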

Linear Supertypes
TopCMSMonoid[K], Monoid[TopCMS[K]], Semigroup[TopCMS[K]], Serializable, AnyRef, Any

Instance Constructors

  1. new TopNCMSMonoid(cms: CMS[K], heavyHittersN: Int = 100)

    cms

    A CMS instance, which is used for the counting and the frequency estimation performed by this class.

    heavyHittersN

    The maximum number of heavy hitters to track.
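A minimal sketch of using the constructor might look as follows. The eps, delta and seed values are illustrative only, TopNConstructorSketch is a hypothetical name, and the snippet assumes that the CMS companion object offers an apply(eps, delta, seed) factory returning an empty CMS[K], as well as the heavyHitters, totalCount and frequency members on TopCMS.

```scala
import com.twitter.algebird.{CMS, TopCMS, TopNCMSMonoid}
import com.twitter.algebird.CMSHasherImplicits._

object TopNConstructorSketch {
  // Illustrative accuracy parameters; tune them for your own workload.
  val eps = 0.001
  val delta = 1e-10
  val seed = 1

  // Assumption: CMS[Long](eps, delta, seed) yields an empty CMS instance.
  val monoid = new TopNCMSMonoid[Long](CMS[Long](eps, delta, seed), heavyHittersN = 2)

  // Build one sketch from a single stream, the case the warning above does not cover.
  val data: Seq[Long] = Seq(1L, 1L, 1L, 2L, 2L, 3L)
  val sketch: TopCMS[Long] = monoid.create(data)

  val total: Long = sketch.totalCount                    // number of items counted
  val top: Set[Long] = sketch.heavyHitters               // at most heavyHittersN items
  val estimateFor1: Long = sketch.frequency(1L).estimate // approximate count of item 1
}
```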

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def assertNotZero(v: TopCMS[K]): Unit

    Definition Classes
    Monoid
  8. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. def create(data: Seq[K]): TopCMS[K]

    Creates a sketch out of multiple items.

    Definition Classes
    TopCMSMonoid
  10. def create(item: K): TopCMS[K]

    Creates a sketch out of a single item.

    Definition Classes
    TopCMSMonoid
  11. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  12. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  13. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  16. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  17. def isNonZero(v: TopCMS[K]): Boolean

    Definition Classes
    Monoid
  18. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. def nonZeroOption(v: TopCMS[K]): Option[TopCMS[K]]

    Definition Classes
    Monoid
  20. final def notify(): Unit

    Definition Classes
    AnyRef
  21. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  22. val params: TopCMSParams[K]

    Definition Classes
    TopCMSMonoid
  23. def plus(left: TopCMS[K], right: TopCMS[K]): TopCMS[K]

    Combines the two sketches.

    The sketches must use the same hash functions.

    Definition Classes
    TopCMSMonoid → Semigroup
  24. def sum(vs: TraversableOnce[TopCMS[K]]): TopCMS[K]

    Definition Classes
    Monoid
  25. def sumOption(iter: TraversableOnce[TopCMS[K]]): Option[TopCMS[K]]

    Override this if there is a faster way to compute this sum than reduceLeftOption on plus.

    Definition Classes
    Semigroup
  26. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  27. def toString(): String

    Definition Classes
    AnyRef → Any
  28. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  29. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  30. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  31. val zero: TopCMS[K]

    Definition Classes
    TopCMSMonoid → Monoid
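The plus, sum and zero members listed above can be exercised roughly as follows. This is a hedged sketch: TopNMergeSketch and the parameter values are hypothetical, and CMS[Long](eps, delta, seed) is assumed to return an empty CMS. Both sketches come from the same monoid, so they share the same hash functions as plus requires. Note that the merged heavy hitters are subject to the ordering bias described in the warning at the top of this page; only the total counts are relied on here.

```scala
import com.twitter.algebird.{CMS, TopCMS, TopNCMSMonoid}
import com.twitter.algebird.CMSHasherImplicits._

object TopNMergeSketch {
  // Assumption: illustrative parameters; one monoid so that hash functions match.
  val monoid = new TopNCMSMonoid[Long](CMS[Long](0.001, 1e-10, 1), heavyHittersN = 2)

  val left: TopCMS[Long] = monoid.create(Seq(1L, 1L, 2L))
  val right: TopCMS[Long] = monoid.create(Seq(2L, 3L))

  // Pairwise merge via plus.
  val merged: TopCMS[Long] = monoid.plus(left, right)

  // Merging a whole collection via sum.
  val viaSum: TopCMS[Long] = monoid.sum(Seq(left, right))

  // zero is the empty sketch, so it leaves the total count unchanged.
  val withZero: TopCMS[Long] = monoid.plus(monoid.zero, left)
}
```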
