By Saksham_Sahgal, history, 9 months ago, In English

Bloom Filters


Ever wondered how

  • a website with a massive user base can determine in milliseconds whether a username is taken?
  • Chrome checks whether the URL you are visiting is malicious without comparing it against a bulky list of URLs?
  • Quora filters out posts you have already seen?

The underlying problem is the same: you have a set of billions of strings, and a server needs to check whether a string passed to it by a client exists in the set or not.

Straightforward but not-so-good ways: suppose we have $$$N$$$ strings with an average length of $$$L$$$ each.

Linear search

  • $$$O(N \cdot L)$$$ search time: iterate over all the strings and compare each one against the query.
  • $$$O(N \cdot L)$$$ space.

Binary search

  • backed by a Red-Black tree / balanced BST
  • $$$O(L \cdot \log N)$$$ search time: each comparison in the tree takes $$$O(L)$$$, and the tree height is $$$O(\log N)$$$.
  • $$$O(N \cdot L)$$$ space.

Trie

  • $$$O(S)$$$ search time, where $$$S$$$ is the length of the search string.
  • $$$O(N \cdot L)$$$ space in the worst case (all strings are unique with no shared prefixes).

The problem here is that at enterprise scale, even an $$$O(S)$$$ operation per request is costly for the server doing the comparison, because it has to handle billions of such requests per second.


Here is where a Bloom Filter can be helpful:

Concept:

  • A Bloom filter is a probabilistic data structure that uses bits to determine whether an item might be in a set or definitely is not, using far less space than a full list or hash set.
  • Rather than storing the data items directly, Bloom filters represent them using a bit array. Since multiple items can map to the same bits through hashing, hash collisions are not only possible, they're expected.

Working:

  • The client has a set of hash functions $$$h_1, h_2, \ldots, h_k$$$.
  • We have a bit array of $$$m$$$ bits, all initially set to zero: $$$[0,0,0,\ldots,0]$$$, where $$$m \ll N \cdot L$$$ (far smaller than storing the strings themselves).
  • Each hash function generates an index $$$i \in [0, m-1]$$$ in $$$O(1)$$$ time.
  • The hash function is deterministic, so it generates the same index for the same string.
  • So instead of passing the entire string to the server, the client passes the $$$k$$$ indexes generated by the $$$k$$$ hash functions, computed on the client side itself in $$$O(k)$$$ time.
  • This also reduces network latency by shrinking the payload size.
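The client-side step above can be sketched in Python. Here the $$$k$$$ hash functions are simulated by salting SHA-256 with the function number, an illustrative choice (double hashing is a common alternative), and the name `bloom_indexes` is hypothetical:

```python
import hashlib

def bloom_indexes(s: str, k: int, m: int) -> list[int]:
    """Return the k bit indexes in [0, m-1] for string s.

    Deterministic: the same string always maps to the same indexes.
    """
    indexes = []
    for i in range(k):
        # Salting the hash with i simulates k independent hash functions.
        digest = hashlib.sha256(f"{i}:{s}".encode()).hexdigest()
        indexes.append(int(digest, 16) % m)
    return indexes
```

The client would then send `bloom_indexes(username, k, m)` over the network instead of the username itself.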

Search Operation:

  • The hash operation is offloaded to the client side, so there is even less load on the server.
  • The bit indexes passed by the client act as the filter.
  • The server iterates over the indices received from the client; if any one of those bits is unset, the string is definitely not present in the list.
  • This searching takes $$$O(k)$$$.
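A minimal sketch of this check, assuming the same salted-SHA-256 indexing scheme described above (an illustrative choice, not the only one):

```python
import hashlib

def indexes(s: str, k: int, m: int) -> list[int]:
    # k salted hashes -> k bit positions (the client-side step).
    return [int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def might_contain(bits: list[int], idxs: list[int]) -> bool:
    # If any bit is 0 the string is definitely absent;
    # all 1s only means "maybe present".
    return all(bits[i] for i in idxs)
```

The loop over `idxs` is exactly the $$$O(k)$$$ search described above.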

What if all the bits are set?

  • Well, in that case, the string may or may not be present in the list, because the same bits could have been set cumulatively by several other strings.
  • But this way we are eliminating a large set of strings that are definitely not present.
  • In this case, we actually have to do a search, probably with a trie.

How can we reduce the false-positive probability?

  • By tuning the number of hash functions $$$k$$$ (more functions help up to an optimum; past it they fill the array and hurt).
  • By increasing the size of bit array $$$(↑m)$$$
  • By using a hash function that generates uniformly distributed/un-skewed indexes.
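Under the standard uniform-hashing analysis (with $$$n$$$ inserted items, $$$m$$$ bits, and $$$k$$$ hash functions), the false-positive rate is approximately $$$(1 - e^{-kn/m})^k$$$, minimized at $$$k = \frac{m}{n}\ln 2$$$. A small sketch for sizing a filter:

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    # Standard approximation: probability that a random absent string
    # finds all k of its bits already set.
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(n: int, m: int) -> int:
    # The k minimizing the rate above is (m/n) * ln 2, rounded.
    return max(1, round(m / n * math.log(2)))
```

For example, giving the filter $$$m = 10n$$$ bits yields $$$k \approx 7$$$ and a false-positive rate under $$$1\%$$$.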

Insert Operation

  • The server sets all $$$k$$$ bit indexes passed by the client.
  • The insert operation takes $$$O(k)$$$ time.
  • Though it's expected that some of the passed bit indexes were already set by previous strings.
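A sketch of the insert step, under the same assumed salted-hash indexing scheme as before:

```python
import hashlib

def indexes(s: str, k: int, m: int) -> list[int]:
    # Same salted-hash scheme as on the client side.
    return [int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def insert(bits: list[int], idxs: list[int]) -> None:
    # Setting an already-set bit is harmless; overlap between strings
    # is expected and is exactly what causes false positives.
    for i in idxs:
        bits[i] = 1
```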

Delete Operation

  • The drawback of the Bloom filter is that we cannot delete elements from the list by just clearing their hash bits.

Why won't unsetting bits work?

  • If we clear the bits set by a string, we might end up clearing a subset of bits of some other string.
  • E.g., suppose the string "Hi i'm Saksham" generates the hash bit indexes [1,2,3] and another string "Hello World!" generates [3,7,12].
  • If we now delete "Hello World!" by clearing bits [3,7,12], we also end up clearing bit 3 (the overlap).
  • A later search for "Hi i'm Saksham" will then fail because of the unset bit 3.
  • So this way we lose the property of definitely not present.
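The overlap can be reproduced directly; the indexes [1,2,3] and [3,7,12] below are the illustrative ones from the example above, not real hash outputs:

```python
bits = [0] * 16

for i in [1, 2, 3]:       # insert "Hi i'm Saksham"
    bits[i] = 1
for i in [3, 7, 12]:      # insert "Hello World!"
    bits[i] = 1

# Naive delete of "Hello World!": clear its bits.
for i in [3, 7, 12]:
    bits[i] = 0

# "Hi i'm Saksham" now looks absent even though it was never deleted.
print(all(bits[i] for i in [1, 2, 3]))  # False: bit 3 was shared
```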