Bloom Filters
Ever wondered how
- a Website with a massive user base could determine whether a username is taken or not in milliseconds?
- How does
Chromechecks if the url you are visiting isn't malicious without comparing it against a bulky list of URLs? - How
Quorafilters out posts you have already seen ?
So the problem here is ultimately you have a set of billions of strings, and a server needs to check whether a string passed to it by the client exists in the set or not.
Straightforward but Not so Good Ways : Considering we have N strings and the average length of each string is L.
Linear search
- O(N*L) Search Time Complexity, Iterate over all the strings and compare and see if the string is present.
- O(N*L) space complexity.
Binary search
- backed by a
Red-Black Tree / balanced BST - O(L × log n) Search Time Complexity, Each comparison in the tree takes O(L), and the tree height is O(log n)
- O(N*L) space complexity.
trie
- O(S) Search Time Complexity, where $$$S$$$ is the length of the search string
- O(n×L), space complexity. In the worst case (all strings are unique with no shared prefixes)
The problem Here is when we are operating at an enterprise scale; even an $$$O(S)$$$ operation/request is costly to the server doing the comparison because it has to do billions of such requests/second.
Here is where a Bloom Filter can be helpful:
Concept:
- Bloom Filter is a
probabilistic data structurethat uses bits to determine whether an item might be in a set or definitely is not, using far less space than a full list or hash set. - Rather than storing the data items directly, Bloom filters represent them using a bit array. Since multiple items can map to the same bits through hashing, hash collisions are not only possible, they're expected.
Working:
- The client has a set of hash functions $$$h_1 + h_2 + h_3 ... h_k$$$
- We have a bit array of m bits, all set to zero initially. $$$[0,0,0,0,0,0,0]$$$ where $$$m « S$$$
- Each hash function generates an index $$$∀i∈[0,m-1]$$$ in O(1) time.
- The hash function is deterministic, so it generates the same index for the same string.
- So instead of passing the entire string to the server, the client instead passes the $$$k$$$ indexes generated from the $$$k$$$ hash functions calculated on the client side itself in $$$O(k)$$$ time.
- This way we are also reducing the network latency by reducing the payload size.
Search Operation:
- The hash operation is offloaded to the client side, so there is even less load on the server.
- The
set bits indexespassed to the server will act as thefilter. - It iterates over all the indices got from the client, and if any one of the bits is unset, then the string is definitely not present in the list.
- This searching takes $$$O(k)$$$.
What if all the bits are present?
- Well, in that case, the string
might/might notbe present in the list, because the same set of bit indexes could be generated byany/Cumulatively manystring as well. - But this way we are eliminating a large set of strings that are definitely not present.
- In this case, we actually have to do a search, probably with a
trie.
How can we reduce false positive probability ?
- By increasing the number of Hash Functions $$$(↑k)$$$
- By increasing the size of bit array $$$(↑m)$$$
- By using a hash function that generates uniformly distributed/un-skewed indexes.
Insert Operation
- The
set bits indexespassed to the server will all be set. - The insert operation takes $$$O(k)$$$ time.
- Thought it's expected that some of the hash bit indexes passed are already set by some previous string.
Delete Opetation
- The drawback of the Bloom filter is we cannot delete elements from the list by just clearing the hash bits.
Why unsetting bits Won't work ?
- If we clear the bits set by a string, we might end up clearing a subset of bits of some other string.
- Eg — Let's say a string
"Hi i'm Saksham"generated hash bit indexes as [1,2,3] and another string"Hello World!"generated hash bit indexes as [3,7,12]. - Now if we want to delete the string
"Hello World!"by clearing the bits [3,7,12], we will end up clearing the bit3(overlap). - Now if we ever search for the string
"Hi i'm Saksham"we won't find it because of the unset but3. - So this way we lose the property of
definitely not present.







