Блог пользователя mentalist

Автор mentalist, 11 лет назад, По-английски

I have been studying the Rabin-Karp algorithm for string matching lately, and I am impressed with its simplicity and average efficiency. I learned that its best-case complexity is O(n), and that its worst-case complexity is O(nm). I would like to find a way to eliminate this worst-case scenario, but have an error probability. If the error probability is low enough, then I will be able to use it and assume that the result will always be correct (unless I'm extremely unlucky). Below there is some pseudocode which implements the rabin-karp algorithm:

1| rabin-karp(text, pat) {
2|   n = size(text), m = size(pat);
3|   text_hash = hash(text[1..m]), pat_hash = hash(pat[1..m]);
4|   for i=1 to n-m+1 {
5|     if text_hash = pat_hash
6|       if text[i..i+m-1] = pat         // this step takes O(m) time
7|         print "Index: " i;
8|     text_hash = hash(text[i+1..i+m]); // this is computed in O(1)
9|   print "done";

To achieve my goal, I have to eliminate line six. To be able to do that, I need to be sure that this will not affect the correctness of the algorithm too much. In other words, I need to be sure that my hash function will be good enough to make the error probability low, but I can't come up with something clever. Could somebody help me? Thanks in advance.

  • Проголосовать: нравится
  • +6
  • Проголосовать: не нравится

»
11 лет назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

Correct me if I am wrong!!!

What about generating many(lets say 10) different completely different hashes (by choosing different bases and mod values) for pattern and texts. If all 10 hashes are equal for some substring of text and the pattern, then there will be high chances that they are same.

Additionally to be completely sure that there is a possible match, you can run a O(m) loop to verify that. Now the chances that all 10 hashes are same are rare and the verification step is carried on with very less probability.

Here's the code for 2 different hashes for clarity

»
11 лет назад, # |
Rev. 2   Проголосовать: нравится +1 Проголосовать: не нравится

I think there's a small mistake in line 3 of your pseudocode: text_hash = hash(text[1..m]) (given m <= n).

»
11 лет назад, # |
  Проголосовать: нравится +13 Проголосовать: не нравится

All that you may need to know.

Let

A probability of collision when checking two random strings is about

A mean number of collisions in a set of N random strings is

If I know your a and p, your solution is just a hundred points for me on CF. But if smb don't know that, you may assume the mean case.

You may random your hash constants, e.g. use 2 different pairs (a1, p) and (a2, p) where p = 10^9+7 and check both hashes whether them match or not.

»
11 лет назад, # |
Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

You can use a 2^64 as your mod, so you can even get rid of the % on the code using just unsigned long long int. I don't know how to prove it, but the guy who taught me this hash function, proved it once that the probability of having collisions with a prime base using 2^64 as its module is about 1/10⁶, so the chance of hitting a collision is like winning on the lottery.

I've never had a problem using this hash function, so for me it's quite safe using it.