I was looking for a formal analysis of the error rate of Zobrist hashing (used in 1977D - XORificator; there's a tutorial in XOR Hashing [TUTORIAL]).

But I couldn't find any. (If you find any, feel free to post it here.)

Luckily, I managed to derive an upper bound on the error rate. I'm writing it up here so you can tell me if there's any problem in my derivation.

The problem statement:

You are given a set of $m$-bit integers $A=[a_1,a_2,...,a_N]$, and $n$ sets $[S_1,S_2,...,S_n]$, each of which is a subset of $A$, with $S_i\ne S_j$ for every $1\le i < j \le n$. Denote $f(S_i)=\bigoplus\limits_{x\in S_i}x$, the xor sum of the elements in $S_i$. Now, if each of $a_1,a_2,...,a_N$ is assigned an $m$-bit value independently and uniformly at random, what is (an upper bound on) the probability that there exists some $1\le i < j \le n$ such that $f(S_i)=f(S_j)$?
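To make the setup concrete, here is a minimal sketch of it in Python. The function names `assign_random_values` and `xor_hash`, the seed, and the two example subsets are my own choices for illustration, and subsets are stored as sets of indices into $A$.

```python
import random

def assign_random_values(num_elements, m, rng):
    """Give each element of A an independent, uniform m-bit value."""
    return [rng.getrandbits(m) for _ in range(num_elements)]

def xor_hash(subset, values):
    """f(S): xor of the values assigned to the element indices in S."""
    h = 0
    for idx in subset:
        h ^= values[idx]
    return h

rng = random.Random(2024)          # fixed seed only so the sketch is reproducible
values = assign_random_values(5, 64, rng)

s1 = {0, 2, 3}                     # hypothetical subsets, as sets of indices into A
s2 = {1, 2, 4}
h1, h2 = xor_hash(s1, values), xor_hash(s2, values)
print(hex(h1), hex(h2))
```

Elements shared by two subsets cancel under xor, so `h1 ^ h2` equals `xor_hash(s1 ^ s2, values)`, the hash of the symmetric difference.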

The analysis:

One observation is that $f(S_i)=f(S_j)$ holds if and only if $f(S_i\cup S_j-S_i\cap S_j) = 0$, because elements in both sets cancel under xor. So let's denote the symmetric difference by $T_{ij}=S_i\cup S_j-S_i\cap S_j$. Now we want to estimate the probability that some $T_{ij}$ satisfies $f(T_{ij})=0$. For a specific pair $i$ and $j$, $T_{ij}$ is nonempty because $S_i\ne S_j$, so $f(T_{ij})$ is the xor of at least one independent uniform $m$-bit value and is itself uniform; hence $P(f(T_{ij})=0)=\frac{1}{2^m}$. However, there are $\frac{n(n-1)}{2}$ pairs $(i,j)$, and the $T_{ij}$ are not independent of each other. For example, take $m=1$ (all integers in $A$ are 1-bit) and $A=[a_1,a_2],S_1=[a_1],S_2=[a_2],S_3=[a_1,a_2]$. Then no matter how you assign $a_1$ and $a_2$, at least one of $f(T_{1,2})=a_1\oplus a_2$, $f(T_{2,3})=a_1$, $f(T_{1,3})=a_2$ must be $0$. So you cannot treat the $f(T_{ij})$ as independent random variables across different pairs $(i,j)$.
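The $m=1$ example is small enough to check by brute force. The sketch below enumerates all four assignments of $(a_1,a_2)$ and computes $f(T_{ij})$ for every pair; the index-based representation of the sets and the helper name `xor_hash` are my own.

```python
from itertools import combinations, product

# The example from the text: A = [a_1, a_2], stored as indices 0 and 1.
sets = [{0}, {1}, {0, 1}]          # S_1, S_2, S_3

def xor_hash(subset, values):
    """f(S): xor of the values assigned to the element indices in S."""
    h = 0
    for idx in subset:
        h ^= values[idx]
    return h

outcomes = []
for values in product((0, 1), repeat=2):   # every 1-bit assignment of (a_1, a_2)
    # f(T_ij) for each pair i < j, where T_ij is the symmetric difference S_i ^ S_j
    f_values = [xor_hash(si ^ sj, values) for si, sj in combinations(sets, 2)]
    outcomes.append((values, f_values))
    print(values, f_values)

some_zero_every_time = all(0 in f_values for _, f_values in outcomes)
print(some_zero_every_time)
```

If the three values were independent, all three would be nonzero with probability $\frac{1}{8}$; here that never happens.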

My derivation uses linearity of expectation, which holds even when the random variables are dependent. First, the expected number of pairs $1\le i<j\le n$ satisfying $f(T_{ij})=0$ is

$\mathbf{E}\sum\limits_{i<j}[f(T_{ij})=0] = \sum\limits_{i<j}\mathbf{E}[f(T_{ij})=0] = \sum\limits_{i<j}P(f(T_{ij})=0) = \sum\limits_{i<j}\frac{1}{2^m} = \frac{n(n-1)}{2^{m+1}}$

The same expectation can also be written as $P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2}$, where $P_k$ is the probability that exactly $k$ pairs $1\le i<j\le n$ satisfy $f(T_{ij})=0$.

So, $P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{\frac{n(n-1)}{2}}=\frac{n(n-1)}{2^{m+1}}$.

So, the probability that there exists some $1\le i < j \le n$ such that $f(S_i)=f(S_j)$ is

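$P_1 + P_2 + ... + P_k + ... + P_{n(n-1)/2} \le P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2} = \frac{n(n-1)}{2^{m+1}}$.

The inequality is not strict in general: when $n=2$, only $P_1$ can be nonzero and both sides are equal. This is the same bound that a union bound over all pairs gives.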
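As a sanity check of the derivation, a tiny instance can be enumerated exhaustively. The sketch below uses a hypothetical instance of my own choosing ($m=3$, $N=3$, four distinct subsets), computes the exact expected number of colliding pairs and the exact collision probability, and compares both with $\frac{n(n-1)}{2^{m+1}}$.

```python
from fractions import Fraction
from itertools import combinations, product

m = 3                                  # bits per value (hypothetical small instance)
num_elements = 3                       # N
sets = [{0}, {1}, {2}, {0, 1}]         # distinct subsets, as sets of indices into A
n = len(sets)

def xor_hash(subset, values):
    """f(S): xor of the values assigned to the element indices in S."""
    h = 0
    for idx in subset:
        h ^= values[idx]
    return h

total = 0                # number of assignments enumerated
with_collision = 0       # assignments with at least one colliding pair
colliding_pairs = 0      # colliding pairs summed over all assignments
for values in product(range(2 ** m), repeat=num_elements):
    hashes = [xor_hash(s, values) for s in sets]
    pairs = sum(1 for hi, hj in combinations(hashes, 2) if hi == hj)
    total += 1
    colliding_pairs += pairs
    if pairs > 0:
        with_collision += 1

expected_pairs = Fraction(colliding_pairs, total)   # exact E[# colliding pairs]
collision_prob = Fraction(with_collision, total)    # exact P(some collision)
bound = Fraction(n * (n - 1), 2 ** (m + 1))
print(expected_pairs, collision_prob, bound)
```

On this instance the expected number of colliding pairs equals the bound exactly, and the collision probability does not exceed it.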

If $n=3\times 10^5$ and $m=64$, the above upper bound is about $2.4\times 10^{-9}$.
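The number can be reproduced with a few lines of Python. The function name `collision_bound` is my own; the extra $m=32$ call is only there to show how strongly the bound depends on $m$.

```python
def collision_bound(n, m):
    """Upper bound n(n-1)/2^(m+1) on the probability of any hash collision."""
    return n * (n - 1) / 2 ** (m + 1)

bound_64 = collision_bound(300_000, 64)
bound_32 = collision_bound(300_000, 32)
print(f"{bound_64:.2e}")   # 2.44e-09
print(bound_32 > 1)        # True: with m=32 the bound exceeds 1 and says nothing
```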

I'd be glad to hear about it if you find something wrong. :)

