# Probabilistic Filters By Example

Probablistic filters are high-speed, space-efficient data structures that support set-membership
tests with a one-sided error. These filters can claim that a given entry is *definitely not*
represented in a set of entries, or *might* be represented in the set. That is, negative responses are
conclusive, whereas positive responses incur a small false positive probability (FPP).

The trade-off for this one-sided error is space-efficiency. Cuckoo Filters and Bloom Filters require approximately 7 bits per entry at 3% FPP, regardless of the size of the entries. This makes them useful for applictations where the volume of original data makes traditional storage impractical.

**Bloom filters** have been in use since the 1970s and are well understood. Implementations are widely
available. Variants exist that support deletion and counting, though with expanded storage requirements.

**Cuckoo filters** were described in
Cuckoo Filter: Practically Better Than Bloom,
a paper by researchers at CMU in 2014. Cuckoo filters improve on Bloom filters by supporting deletion,
limited counting, and bounded FPP with similar storage efficiency as a standard Bloom filter.

Below is side-by-side simulation of the inner workings of Cuckoo and Bloom filters.

## Probablistic Filters Visualized

Cuckoo Filter | Bloom Filter |
---|---|

Buckets
FPP: 0% In filter: Fingerprint (-bit): |
FPP: 0% In filter: |

Enter a string:

MultiSet:

## How do they work?

**Cuckoo Filters** operate by hashing an entry with one hash function, and inserting a small f-bit fingerprint of
the entry into an open position in either of two alternate buckets. When both buckets are full, the filter recursively
kicks existing entries to their alternate buckets until
space is found or attempts are exhausted. Lookups repeat the hash function and check both buckets for the fingerprint. When a
matching fingerprint is not found, the entry is *definitely not* in the filter. When a matching fingerprint is found in
either bucket, the entry *might* be in the filter. False positives occur when another entry inserted a mathcing
fingerprint into either of the two checked buckets. Deletion is supported by removing one instance of an entry's fingerprint from
either bucket. Counting is supported by inserting multiple fingerprints of the same value into the same pair of buckets.

**Bloom Filters** operate by hashing an entry with k hash functions, and setting k bits within a bit vector upon
insertion. Lookups repeat the k hash functions and check the corresponding bits. When any checked bit is not set,
the entry is *definitely not* in the filter. When all checked bits are set, the entry *might* be in the
filter. False positives occur when all checked bits happen to be set by any combination of previously inserted entries.

## What are they good for?

Probabilistic filters are used in a variety of applications where slow or expensive operations can be avoided prior to execution by a consulting comparitavely fast or cheap set membership test.

**DB query optimization:** data stored in a database can be inserted into a probabilistic filter.
Later, prior to querying for the data, the filter can be consulted to test whether the data exists. When the
filter response is negative, an unnecessary DB query can be avoided.

**Edge filtering: **
filter data can be distributed to the edges of networks, similar to edge caching, where queries for
potentially non-existent data received at the edge can be filtered out quickly. Unlike a cache, a copy of original
data is not stored in the filter.

## Which should I choose?

Choose Cuckoo, if available, unless your application is timing sensitive on insertion. Though Cuckoo filters outperform Bloom filters on insertion at first - O(1) vs O(k), respectively - their insertion performance drops off due to recursive entry "kicking" as the filter approaches its max capacity.

...for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do). -Michael Mitzenmacher (2014)

Cuckoo Filter | Standard Bloom Filter | Counting Bloom Filter | |
---|---|---|---|

Insert | Variable. O(1) amortized longer as load factor approaches capacity |
Fixed. O(k) | Fixed. O(k) |

As load increases | FPP trends toward desired max Insertions |
FPP trends toward 100% Insertions cannot be rejected |
FPP trends toward 100% Insertions |

Lookup | O(1) Maximum of two buckets to check |
O(k) | O(k) |

Count | O(1) minimal suport: max == entries per bucket X 2 |
unsupported |
O(k) |

Delete | O(1) Maximum of two buckets to inspect |
unsupported |
O(k) |

Bits per entry | smaller when desired FPP <= 3% | smaller when desired FPP > 3% | larger than Cuckoo & Standard Bloom multiplied by number of bits per counter |

Bits per entry | 1.05 [ log_{2}(1/FPP) + log_{2}(2b) ]best when FPP <= 0.5% "semi-sort cuckoo" best when FPP <= 3% |
1.44 log_{2}(1/FPP)best when FPP > 0.5% |
c [ 1.44 log_{2}(1/FPP) ]where c is the number of bits per counter, e.g. 4 |

Availability | limited (as of early 2016)
cpp java go |
widely available | widely available |

## See also

- Cuckoo Filter: Practically Better Than Bloom, Bin Fan, et al.
- Cuckoo Filters, Michael Mitzenmacher
- Bloom Filters by Example, Bill Mill
- Bloom filter on Wikipedia