szawinis's blog

By szawinis, history, 7 years ago, In English

Picking the base and mod for string hashing is very important in decreasing the probability of hash collisions. How much are the guidelines affected by the problem itself? So far, I've read that the base should be larger than the alphabet, and the mod should be really large (but not overflow). But is there anything else?

  • Vote: I like it
  • +14
  • Vote: I do not like it

»
7 years ago, # |
  Vote: I like it +18 Vote: I do not like it

You sound like you are on the right track. The base and mod must be coprime — usually, you just choose a prime base and prime mod anyway. Furthermore, you usually need hash two or more mods to stay safe. This is because of the birthday paradox. Normally people use base 137 and hash in 1e9+7 and 1e9+9.

  • »
    »
    7 years ago, # ^ |
      Vote: I like it +13 Vote: I do not like it

    Base should also be strictly greater than the number of different values an individual element can be. E.g. 137 is enough for English latters, but not enough for arbitrary chars.

    Also, it's good to ensure that all codes of individual elements are between 0 and base-1 so they do not collide modulo base for some random reason.

  • »
    »
    5 years ago, # ^ |
    Rev. 2   Vote: I like it -24 Vote: I do not like it

    :)

»
7 years ago, # |
  Vote: I like it +21 Vote: I do not like it