By Saksham_Sahgal, history, 9 months ago, In English

Bloom Filters


Ever wondered how

  • a website with a massive user base can determine in milliseconds whether a username is taken?
  • Chrome checks whether the URL you are visiting is malicious without comparing it against a bulky list of URLs?
  • Quora filters out posts you have already seen?

The underlying problem is the same: you have a set of billions of strings, and a server needs to check whether a string passed to it by a client exists in the set or not.

Straightforward but not-so-good ways: suppose we have $$$N$$$ strings with an average length of $$$L$$$ each.

Linear search

  • $$$O(N \cdot L)$$$ search time: iterate over all the strings and compare each one against the query.
  • $$$O(N \cdot L)$$$ space.

Binary search

  • backed by a Red-Black tree / balanced BST
  • $$$O(L \cdot \log N)$$$ search time: each comparison in the tree takes $$$O(L)$$$, and the tree height is $$$O(\log N)$$$.
  • $$$O(N \cdot L)$$$ space.

Trie

  • $$$O(S)$$$ search time, where $$$S$$$ is the length of the search string.
  • $$$O(N \cdot L)$$$ space in the worst case (all strings are unique with no shared prefixes).

The problem here is that at enterprise scale, even an $$$O(S)$$$ operation per request is costly for the server doing the comparison, because it has to handle billions of such requests per second.


Here is where a Bloom Filter can be helpful:

Concept:

  • A Bloom filter is a probabilistic data structure that uses bits to determine whether an item might be in a set or definitely is not, using far less space than a full list or hash set.
  • Rather than storing the data items directly, Bloom filters represent them using a bit array. Since multiple items can map to the same bits through hashing, hash collisions are not only possible, they're expected.

Working:

  • The client has a set of hash functions $$$h_1, h_2, \ldots, h_k$$$.
  • We have a bit array of $$$m$$$ bits, all initially set to zero: $$$[0,0,0,\ldots,0]$$$, where $$$m \ll N \cdot L$$$ (far smaller than storing the strings themselves).
  • Each hash function generates an index $$$i \in [0, m-1]$$$ in $$$O(1)$$$ time.
  • The hash function is deterministic, so it generates the same index for the same string.
  • So instead of passing the entire string to the server, the client passes the $$$k$$$ indexes generated by the $$$k$$$ hash functions, computed on the client side itself in $$$O(k)$$$ time.
  • This also reduces network latency by shrinking the payload size.
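The client-side step above can be sketched in Python. Here the $$$k$$$ hash functions are simulated by salting SHA-256 with the function number, an illustrative choice (double hashing is a common alternative), and the name `bloom_indexes` is hypothetical:

```python
import hashlib

def bloom_indexes(s: str, k: int, m: int) -> list[int]:
    """Return the k bit indexes in [0, m-1] for string s.

    Deterministic: the same string always maps to the same indexes.
    """
    indexes = []
    for i in range(k):
        # Salting the hash with i simulates k independent hash functions.
        digest = hashlib.sha256(f"{i}:{s}".encode()).hexdigest()
        indexes.append(int(digest, 16) % m)
    return indexes
```

The client would then send `bloom_indexes(username, k, m)` over the network instead of the username itself.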

Search Operation:

  • The hash operation is offloaded to the client side, so there is even less load on the server.
  • The bit indexes passed by the client act as the filter.
  • The server iterates over the indices received from the client; if any one of those bits is unset, the string is definitely not present in the list.
  • This searching takes $$$O(k)$$$.
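A minimal sketch of this check, assuming the same salted-SHA-256 indexing scheme described above (an illustrative choice, not the only one):

```python
import hashlib

def indexes(s: str, k: int, m: int) -> list[int]:
    # k salted hashes -> k bit positions (the client-side step).
    return [int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def might_contain(bits: list[int], idxs: list[int]) -> bool:
    # If any bit is 0 the string is definitely absent;
    # all 1s only means "maybe present".
    return all(bits[i] for i in idxs)
```

The loop over `idxs` is exactly the $$$O(k)$$$ search described above.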

What if all the bits are set?

  • Well, in that case, the string may or may not be present in the list, because the same bits could have been set cumulatively by several other strings.
  • But this way we are eliminating a large set of strings that are definitely not present.
  • In this case, we actually have to do a search, probably with a trie.

How can we reduce the false-positive probability?

  • By tuning the number of hash functions $$$k$$$ (more functions help up to an optimum; past it they fill the array and hurt).
  • By increasing the size of bit array $$$(↑m)$$$
  • By using a hash function that generates uniformly distributed/un-skewed indexes.
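Under the standard uniform-hashing analysis (with $$$n$$$ inserted items, $$$m$$$ bits, and $$$k$$$ hash functions), the false-positive rate is approximately $$$(1 - e^{-kn/m})^k$$$, minimized at $$$k = \frac{m}{n}\ln 2$$$. A small sketch for sizing a filter:

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    # Standard approximation: probability that a random absent string
    # finds all k of its bits already set.
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(n: int, m: int) -> int:
    # The k minimizing the rate above is (m/n) * ln 2, rounded.
    return max(1, round(m / n * math.log(2)))
```

For example, giving the filter $$$m = 10n$$$ bits yields $$$k \approx 7$$$ and a false-positive rate under $$$1\%$$$.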

Insert Operation

  • The server sets all $$$k$$$ bit indexes passed by the client.
  • The insert operation takes $$$O(k)$$$ time.
  • Though it's expected that some of the passed bit indexes were already set by previous strings.
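A sketch of the insert step, under the same assumed salted-hash indexing scheme as before:

```python
import hashlib

def indexes(s: str, k: int, m: int) -> list[int]:
    # Same salted-hash scheme as on the client side.
    return [int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def insert(bits: list[int], idxs: list[int]) -> None:
    # Setting an already-set bit is harmless; overlap between strings
    # is expected and is exactly what causes false positives.
    for i in idxs:
        bits[i] = 1
```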

Delete Operation

  • The drawback of the Bloom filter is that we cannot delete elements from the list by just clearing their hash bits.

Why won't unsetting bits work?

  • If we clear the bits set by a string, we might end up clearing a subset of bits of some other string.
  • E.g., suppose the string "Hi i'm Saksham" generates the hash bit indexes [1,2,3] and another string "Hello World!" generates [3,7,12].
  • If we now delete "Hello World!" by clearing bits [3,7,12], we also end up clearing bit 3 (the overlap).
  • A later search for "Hi i'm Saksham" will then fail because of the unset bit 3.
  • So this way we lose the property of definitely not present.
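The overlap can be reproduced directly; the indexes [1,2,3] and [3,7,12] below are the illustrative ones from the example above, not real hash outputs:

```python
bits = [0] * 16

for i in [1, 2, 3]:       # insert "Hi i'm Saksham"
    bits[i] = 1
for i in [3, 7, 12]:      # insert "Hello World!"
    bits[i] = 1

# Naive delete of "Hello World!": clear its bits.
for i in [3, 7, 12]:
    bits[i] = 0

# "Hi i'm Saksham" now looks absent even though it was never deleted.
print(all(bits[i] for i in [1, 2, 3]))  # False: bit 3 was shared
```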