org.apache.commons.collections4.bloomfilter (Apache Commons Collections 4.5.0-M1 API)

package org.apache.commons.collections4.bloomfilter

Collects extensible Bloom filter classes and interfaces.

Background:

The Bloom filter is a probabilistic data structure that indicates where things are not: a negative answer is definite, while a positive answer may be a false positive. Conceptually it is a bit vector. You create a Bloom filter by creating hashes and converting those to enabled bits in the vector. Multiple Bloom filters may be merged together into one Bloom filter. It is possible to test if a filter B has merged into another filter A by verifying that (A & B) == B.
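As a minimal sketch of the bit-vector view described above, the following plain Java class treats a filter as an array of longs. It is an illustration only, not this package's API; the class name BitVectorSketch and its methods are hypothetical. Merging is a bitwise OR, and the "has B been merged into A" test is (A & B) == B applied word by word.

```java
// Hypothetical illustration only; not part of org.apache.commons.collections4.bloomfilter.
final class BitVectorSketch {
    final long[] words;

    BitVectorSketch(int numberOfBits) {
        // One long holds 64 bits; round up to whole words.
        words = new long[(numberOfBits + 63) / 64];
    }

    // Enable the bit at the given index.
    void set(int index) {
        words[index >> 6] |= 1L << index; // Java uses only the low 6 bits of the shift distance
    }

    // Merge another filter of the same size into this one (bitwise OR).
    void merge(BitVectorSketch other) {
        for (int i = 0; i < words.length; i++) {
            words[i] |= other.words[i];
        }
    }

    // True if every bit enabled in other is also enabled here: (this & other) == other.
    boolean contains(BitVectorSketch other) {
        for (int i = 0; i < words.length; i++) {
            if ((words[i] & other.words[i]) != other.words[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitVectorSketch a = new BitVectorSketch(128);
        BitVectorSketch b = new BitVectorSketch(128);
        b.set(3);
        b.set(70);
        System.out.println(a.contains(b)); // false: b has not been merged yet
        a.merge(b);
        System.out.println(a.contains(b)); // true
    }
}
```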

Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes. For example, most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malware). When you enter a URL, the browser builds a Bloom filter from it and checks whether it is "in" the bad URL filter. If it is not, the URL is good. If it matches, the expensive lookup on a remote system is made to see whether the URL actually is in the list. There are many other uses, and in most cases the reason is to perform a fast check as a gateway for a longer operation.
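The gateway pattern can be sketched in plain Java as follows. This is an assumption-laden illustration, not how any browser or this package implements it: GatewaySketch, its index function, and the in-memory HashSet standing in for the remote list are all hypothetical.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of a Bloom filter used as a gateway; not this package's API.
final class GatewaySketch {
    private static final int BITS = 1 << 16;
    private static final int HASHES = 3;

    private final BitSet filter = new BitSet(BITS);
    private final Set<String> expensiveStore = new HashSet<>(); // stands in for the remote list
    int expensiveLookups; // counts how often the slow path ran

    // Derive the i-th bit index for a key from two simple hash values (assumed for illustration).
    private static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.reverse(h1) | 1;
        return Math.floorMod(h1 + i * h2, BITS);
    }

    void addBad(String url) {
        expensiveStore.add(url);
        for (int i = 0; i < HASHES; i++) {
            filter.set(index(url, i));
        }
    }

    boolean isBad(String url) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(index(url, i))) {
                return false; // definitely not in the list: skip the slow lookup
            }
        }
        expensiveLookups++;
        return expensiveStore.contains(url); // filter matched: confirm against the real list
    }
}
```

A URL that was never added is rejected either by the filter (the common case) or, on a false positive, by the confirming lookup, so the answer is always correct and only its cost varies.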

Some Bloom filters (e.g. CountingBloomFilter) use counters rather than bits. In this case each counter is called a cell.

BloomFilter

The Bloom filter architecture here is designed for speed of execution, so some methods like merge, remove, add, and subtract may throw exceptions. Once an exception is thrown the state of the Bloom filter is unknown. The choice not to use atomic transactions was made to achieve maximum performance under correct usage.

In addition the architecture is designed so that the implementation of the storage of bits is abstracted. Programs that utilize the Bloom filters may use the BitMapProducer or IndexProducer to retrieve a representation of the internal structure. Additional methods are available in the BitMap to assist in manipulation of the representations.
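The two representations mentioned above (a sequence of bit map longs and a sequence of bit indices) carry the same information. The following sketch converts between them in plain Java. The class BitMapSketch and its method names are hypothetical and are not claimed to match the methods of the BitMap class.

```java
import java.util.stream.IntStream;

// Hypothetical helpers mapping bit indices onto 64-bit words and back; not this package's BitMap.
final class BitMapSketch {
    // Position of the long word that holds the given bit index (index / 64).
    static int wordIndex(int bitIndex) {
        return bitIndex >> 6;
    }

    // Mask with only that bit set inside its word; the shift distance is taken modulo 64.
    static long bitMask(int bitIndex) {
        return 1L << bitIndex;
    }

    // Number of longs needed to hold numberOfBits bits.
    static int numberOfWords(int numberOfBits) {
        return (numberOfBits + 63) >>> 6;
    }

    // Index representation -> bit map representation.
    static long[] toBitMap(int numberOfBits, int[] indices) {
        long[] words = new long[numberOfWords(numberOfBits)];
        for (int index : indices) {
            words[wordIndex(index)] |= bitMask(index);
        }
        return words;
    }

    // Bit map representation -> ascending index representation.
    static int[] toIndices(long[] words) {
        IntStream.Builder builder = IntStream.builder();
        for (int i = 0; i < words.length; i++) {
            long word = words[i];
            while (word != 0) {
                builder.add(i * 64 + Long.numberOfTrailingZeros(word));
                word &= word - 1; // clear the lowest set bit
            }
        }
        return builder.build().toArray();
    }
}
```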

The Bloom filter code is an interface that requires implementation of the following methods:

BloomFilter.cardinality(), which returns the number of bits enabled in the Bloom filter.
BloomFilter.characteristics(), which returns an integer of characteristics flags.
BloomFilter.clear(), which resets the Bloom filter to its initial empty state.
BloomFilter.contains(IndexProducer), which returns true if the bits specified by the indices generated by IndexProducer are enabled in the Bloom filter.
BloomFilter.copy(), which returns a fresh copy of the Bloom filter.
BloomFilter.getShape(), which returns the shape the Bloom filter was created with.
BloomFilter.merge(BitMapProducer), which merges the BitMaps from the BitMapProducer into the internal representation of the Bloom filter.
BloomFilter.merge(IndexProducer), which merges the indices from the IndexProducer into the internal representation of the Bloom filter.

Other methods should be implemented where they can be done so more efficiently than the default implementations.

CountingBloomFilter

The counting Bloom filter extends the Bloom filter by counting the number of times a specific bit has been enabled or disabled. This allows the removal (opposite of merge) of Bloom filters at the expense of additional overhead.
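A minimal sketch of the counting idea, assuming one int cell per bit position and items given as sets of distinct indices. CountingSketch is hypothetical and is not the package's ArrayCountingBloomFilter. Note that remove may throw after some cells have already been changed, which mirrors the "state is unknown after an exception" trade-off described earlier.

```java
// Hypothetical sketch of counter cells; not this package's CountingBloomFilter implementation.
final class CountingSketch {
    private final int[] cells;

    CountingSketch(int numberOfBits) {
        cells = new int[numberOfBits];
    }

    // Merging an item increments the cell at each of its (distinct) indices.
    void add(int... indices) {
        for (int index : indices) {
            cells[index]++;
        }
    }

    // Removing an item decrements its cells; fails if a cell is already zero.
    void remove(int... indices) {
        for (int index : indices) {
            if (cells[index] == 0) {
                throw new IllegalStateException("cell " + index + " is already zero");
            }
            cells[index]--;
        }
    }

    // An index counts as an enabled bit while its cell is non-zero.
    boolean contains(int... indices) {
        for (int index : indices) {
            if (cells[index] == 0) {
                return false;
            }
        }
        return true;
    }

    // Number of non-zero cells, i.e. the number of enabled bits.
    int cardinality() {
        int count = 0;
        for (int cell : cells) {
            if (cell != 0) {
                count++;
            }
        }
        return count;
    }
}
```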

LayeredBloomFilter

The layered Bloom filter extends the Bloom filter by creating layers of Bloom filters that can be queried as a single filter or as a set of filters. This adds the ability to perform windowing on streams of data.
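Windowing with layers can be sketched as a bounded queue of bit sets: new items go into the newest layer, and the oldest layer is discarded once the depth limit is exceeded. LayeredSketch and its fixed-depth policy are assumptions made for illustration; they do not reproduce the LayerManager API.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Hypothetical sketch of layering; not this package's LayeredBloomFilter or LayerManager.
final class LayeredSketch {
    private final Deque<BitSet> layers = new ArrayDeque<>();
    private final int numberOfBits;
    private final int maxDepth;

    LayeredSketch(int numberOfBits, int maxDepth) {
        this.numberOfBits = numberOfBits;
        this.maxDepth = maxDepth;
        layers.addLast(new BitSet(numberOfBits));
    }

    // Start a new layer (for example a new time window) and drop the oldest if too deep.
    void next() {
        layers.addLast(new BitSet(numberOfBits));
        if (layers.size() > maxDepth) {
            layers.removeFirst();
        }
    }

    // Items are always merged into the newest layer.
    void add(int... indices) {
        for (int index : indices) {
            layers.getLast().set(index);
        }
    }

    // Query as a set of filters: true if any single layer has all the indices enabled.
    boolean contains(int... indices) {
        for (BitSet layer : layers) {
            if (hasAll(layer, indices)) {
                return true;
            }
        }
        return false;
    }

    // Query as a single filter: the bitwise OR of all layers.
    BitSet flatten() {
        BitSet merged = new BitSet(numberOfBits);
        for (BitSet layer : layers) {
            merged.or(layer);
        }
        return merged;
    }

    int depth() {
        return layers.size();
    }

    private static boolean hasAll(BitSet layer, int[] indices) {
        for (int index : indices) {
            if (!layer.get(index)) {
                return false;
            }
        }
        return true;
    }
}
```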

Shape

The Shape describes the Bloom filter using the number of bits and the number of hash functions.
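To make the two numbers concrete, the sketch below derives them from an expected item count n and a target false positive probability p using the standard Bloom filter sizing formulas m = ceil(-n ln p / (ln 2)^2) and k = round((m / n) ln 2). ShapeSketch is hypothetical; whether the package's Shape uses exactly these roundings is not claimed here.

```java
// Hypothetical sketch of shape arithmetic; not this package's Shape class.
final class ShapeSketch {
    final int numberOfBits;          // m
    final int numberOfHashFunctions; // k

    ShapeSketch(int numberOfBits, int numberOfHashFunctions) {
        this.numberOfBits = numberOfBits;
        this.numberOfHashFunctions = numberOfHashFunctions;
    }

    // Size a filter for n expected items and a target false positive probability p.
    static ShapeSketch fromItemsAndProbability(int n, double p) {
        double ln2 = Math.log(2);
        int m = (int) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
        int k = Math.max(1, (int) Math.round((double) m / n * ln2));
        return new ShapeSketch(m, k);
    }

    // Estimated false positive probability after n items: (1 - e^(-kn/m))^k.
    double falsePositiveRate(int n) {
        double exponent = -(double) numberOfHashFunctions * n / numberOfBits;
        return Math.pow(1 - Math.exp(exponent), numberOfHashFunctions);
    }

    public static void main(String[] args) {
        ShapeSketch shape = fromItemsAndProbability(1000, 0.01);
        System.out.println(shape.numberOfBits + " bits, " + shape.numberOfHashFunctions + " hash functions");
    }
}
```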

Hasher

A Hasher converts bytes into a series of integers based on a Shape. Each hasher represents one item being added to the Bloom filter.

The EnhancedDoubleHasher uses a combinatorial generation technique to create the integers. It is easily initialized by using a byte array returned by the standard MessageDigest or other hash function to initialize the Hasher. Alternatively, a pair of long values may also be used.
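The technique can be sketched with the textbook form of enhanced double hashing: start from two hash values, add the second to the first at each step, and grow the increment itself by the step number. This is a sketch under stated assumptions, not the code of EnhancedDoubleHasher; the class name, the signed floorMod reduction, and the way the 16 MD5 bytes are split into two longs are all choices made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of enhanced double hashing; not this package's EnhancedDoubleHasher.
final class EnhancedDoubleHashSketch {
    // Produce k indices in [0, m) from a pair of long hash values.
    static int[] indices(long h1, long h2, int k, int m) {
        int[] result = new int[k];
        long x = Math.floorMod(h1, (long) m);
        long y = Math.floorMod(h2, (long) m);
        for (int i = 0; i < k; i++) {
            result[i] = (int) x;
            x = (x + y) % m;     // ordinary double hashing step
            y = (y + i + 1) % m; // enhancement: the increment grows on every step
        }
        return result;
    }

    // Initialize from a MessageDigest byte array by reading its 16 bytes as two longs.
    static int[] indices(String item, int k, int m) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(item.getBytes(StandardCharsets.UTF_8));
            ByteBuffer buffer = ByteBuffer.wrap(digest);
            long h1 = buffer.getLong();
            long h2 = buffer.getLong();
            return indices(h1, h2, k, m);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

With h1 = 10, h2 = 3, k = 4 and m = 17 the sketch yields the indices 10, 13, 0, 6. The indices are not guaranteed to be distinct, which is why a duplicate filter is needed when unique indices are required.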

Other implementations of the Hasher are easy to write. They should filter out duplicate indices wherever unique indices are required, for example with the IndexFilter convenience class listed in the class summary for this package.
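Filtering duplicates amounts to remembering which indices have already been passed on. The following sketch wraps an IntPredicate consumer with a BitSet of seen indices. UniqueIndexSketch is hypothetical, and the convention that the predicate returns true to continue is an assumption made for this example.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical sketch of duplicate filtering; not this package's filter helper classes.
final class UniqueIndexSketch {
    // Wrap a consumer so that each index reaches it at most once.
    static IntPredicate unique(int numberOfBits, IntPredicate consumer) {
        BitSet seen = new BitSet(numberOfBits);
        return index -> {
            if (seen.get(index)) {
                return true; // duplicate: skip it and keep iterating
            }
            seen.set(index);
            return consumer.test(index);
        };
    }
}
```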

References

https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java#L60

Since: 4.5

Related Packages

Package

Description

org.apache.commons.collections4

Interfaces and utilities shared across all packages.

Class

Description

ArrayCountingBloomFilter

A counting Bloom filter using an int array to track cells for each enabled bit.

BitMap

Contains functions to convert int indices into Bloom filter bit positions and vice versa.

BitMapProducer

Produces bit map longs for a Bloom filter.

BloomFilter

The interface that describes a Bloom filter.

BloomFilterProducer

Produces Bloom filters from a collection.

CellProducer

Some Bloom filter implementations use a count rather than a bit flag.

CellProducer.CellConsumer

Represents an operation that accepts an <index, count> pair.

CountingBloomFilter

The interface that describes a Bloom filter that associates a count with each bit index rather than a bit.

EnhancedDoubleHasher

A Hasher that implements combinatorial hashing as described by Kirsch and Mitzenmacher using the enhanced double hashing technique described in the Wikipedia article Double Hashing.

Hasher

A Hasher creates an IndexProducer based on the hash implementation and the provided Shape.

IndexFilter

A convenience class for Hasher implementations to filter out duplicate indices.

IndexProducer

An object that produces indices of a Bloom filter.

LayeredBloomFilter

Layered Bloom filters are described in Zhiwang, Cen; Jungang, Xu; Jian, Sun (2010), "A multi-layer Bloom filter for duplicated URL detection", Proc.

LayerManager

Implementation of the methods to manage the layers in a layered Bloom filter.

LayerManager.Builder

Builder to create a LayerManager.

LayerManager.Cleanup

Static methods to create a Consumer of a LinkedList of BloomFilter that performs tests on whether to reduce the collection of Bloom filters.

LayerManager.ExtendCheck

A collection of common ExtendCheck implementations to test whether to extend the depth of a LayerManager.

LongBiPredicate

Represents a function that accepts two long-valued arguments and produces a binary result.

SetOperations

Implementations of set operations on BitMapProducers.

Shape

The definition of a Bloom filter shape.

SimpleBloomFilter

A Bloom filter using an array of bit maps to track enabled bits.

SparseBloomFilter

A Bloom filter using a TreeSet of integers to track enabled bits.

WrappedBloomFilter

An abstract class to assist in implementing Bloom filter decorators.
