Probabilistic Filters By Example

Probablistic filters are high-speed, space-efficient data structures that support set-membership tests with a one-sided error. These filters can claim that a given entry is definitely not represented in a set of entries, or might be represented in the set. That is, negative responses are conclusive, whereas positive responses incur a small false positive probability (FPP).

The trade-off for this one-sided error is space-efficiency. Cuckoo Filters and Bloom Filters require approximately 7 bits per entry at 3% FPP, regardless of the size of the entries. This makes them useful for applictations where the volume of original data makes traditional storage impractical.

Bloom filters have been in use since the 1970s and are well understood. Implementations are widely available. Variants exist that support deletion and counting, though with expanded storage requirements.

Cuckoo filters were described in Cuckoo Filter: Practically Better Than Bloom, a paper by researchers at CMU in 2014. Cuckoo filters improve on Bloom filters by supporting deletion, limited counting, and bounded FPP with similar storage efficiency as a standard Bloom filter.

Below is side-by-side simulation of the inner workings of Cuckoo and Bloom filters.

Probablistic Filters Visualized

Cuckoo Filter

Bloom Filter

Buckets

FPP: 0%

In filter:

Fingerprint (-bit):

FPP: 0%

In filter:

Enter a string:

MultiSet:

How do they work?

Cuckoo Filters operate by hashing an entry with one hash function, and inserting a small f-bit fingerprint of the entry into an open position in either of two alternate buckets. When both buckets are full, the filter recursively kicks existing entries to their alternate buckets until space is found or attempts are exhausted. Lookups repeat the hash function and check both buckets for the fingerprint. When a matching fingerprint is not found, the entry is definitely not in the filter. When a matching fingerprint is found in either bucket, the entry might be in the filter. False positives occur when another entry inserted a matching fingerprint into either of the two checked buckets. Deletion is supported by removing one instance of an entry's fingerprint from either bucket. Counting is supported by inserting multiple fingerprints of the same value into the same pair of buckets.

Bloom Filters operate by hashing an entry with k hash functions, and setting k bits within a bit vector upon insertion. Lookups repeat the k hash functions and check the corresponding bits. When any checked bit is not set, the entry is definitely not in the filter. When all checked bits are set, the entry might be in the filter. False positives occur when all checked bits happen to be set by any combination of previously inserted entries.

What are they good for?

Probabilistic filters are used in a variety of applications where slow or expensive operations can be avoided prior to execution by a consulting comparitavely fast or cheap set membership test.

DB query optimization: data stored in a database can be inserted into a probabilistic filter. Later, prior to querying for the data, the filter can be consulted to test whether the data exists. When the filter response is negative, an unnecessary DB query can be avoided.

Edge filtering: filter data can be distributed to the edges of networks, similar to edge caching, where queries for potentially non-existent data received at the edge can be filtered out quickly. Unlike a cache, a copy of original data is not stored in the filter.

Which should I choose?

Choose Cuckoo, if available, unless your application is timing sensitive on insertion. Cuckoo filters outperform Bloom filters on insertion at first - both are O(1), though Cuckoo Filters compute just one or two hashes vs Bloom Filters fixed k hashes. However Cuckoo Filter insertion performance drops off as load increases due to recursively "kicking" entries.

...for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do). -Michael Mitzenmacher (2014)

	Cuckoo Filter	Standard Bloom Filter	Counting Bloom Filter
Insert	O(1), amortized variable, longer as load factor approaches capacity	O(1) fixed, performs constant k hashes	O(1) fixed, performs constant k hashes
As load increases	FPP trends toward desired max insertions may be rejected if counting or deletion support is enabled	FPP trends toward 100% insertions cannot be rejected	FPP trends toward 100% insertions may be rejected
Lookup	O(1) maximum of two buckets to check	O(1) performs constant k hashes	O(1) performs constant k hashes
Count	O(1) minimal suport: max == entries per bucket X 2	unsupported	O(1) performs constant k hashes
Delete	O(1) maximum of two buckets to inspect	unsupported	O(1) performs constant k hashes
Bits per entry	smaller when desired FPP <= 3%	smaller when desired FPP > 3%	larger than Cuckoo & Standard Bloom multiplied by number of bits per counter
Bits per entry	1.05 [ log₂(1/FPP) + log₂(2b) ] best when FPP <= 0.5% "semi-sort cuckoo" best when FPP <= 3%	1.44 log₂(1/FPP) best when FPP > 0.5%	c [ 1.44 log₂(1/FPP) ] where c is the number of bits per counter, e.g. 4
Availability	limited (as of early 2016) cpp java go	widely available	widely available

Probabilistic Filters By Example

Probablistic Filters Visualized

How do they work?

What are they good for?

Which should I choose?

See also