I was trying to find some formal analysis on the error rate of Zobrist Hashing (used in 1977D - XORificator, and there's a tutorial in XOR Hashing [TUTORIAL]).

But I couldn't find any. (If you find any, feel free to post it here.)

Luckily, I managed to derive an upper bound on the error rate myself. I'll write it here, and you can tell me if there's any problem in my derivation.

**The problem statement:**

You are given a set of $$$m$$$-bit integers $$$A=[a_1,a_2,...,a_N]$$$ and $$$n$$$ sets $$$[S_1,S_2,...,S_n]$$$, each of which is a subset of $$$A$$$, with $$$S_i\ne S_j$$$ for every $$$1\le i < j \le n$$$. Denote $$$f(S_i)=\bigoplus\limits_{x\in S_i}x$$$, the xor sum of the elements of $$$S_i$$$. Now, if each of $$$a_1,a_2,...,a_N$$$ is assigned independently and uniformly at random from $$$[0,2^m)$$$, what is (an upper bound on) the probability that there exists some $$$1\le i < j \le n$$$ such that $$$f(S_i)=f(S_j)$$$?

**The analysis:**

One observation is that $$$f(S_i)=f(S_j)$$$ holds if and only if $$$f(S_i\cup S_j-S_i\cap S_j) = 0$$$, because the elements of $$$S_i\cap S_j$$$ cancel under xor. So let's denote $$$T_{ij}=S_i\cup S_j-S_i\cap S_j$$$ (the symmetric difference), which is non-empty since $$$S_i\ne S_j$$$. Now we want to estimate the probability that there is some $$$T_{ij}$$$ satisfying $$$f(T_{ij})=0$$$. For a specific $$$i$$$ and $$$j$$$, $$$P(f(T_{ij})=0)=\frac{1}{2^m}$$$, because the xor of a non-empty set of independent, uniformly random $$$m$$$-bit integers is itself uniformly random. However, there are $$$\frac{n(n-1)}{2}$$$ pairs $$$(i,j)$$$, and the events for different $$$T_{ij}$$$ are not independent. For example, let $$$m=1$$$ (all integers in $$$A$$$ are 1-bit) and $$$A=[a_1,a_2],S_1=[a_1],S_2=[a_2],S_3=[a_1,a_2]$$$. Then no matter how you assign $$$a_1$$$ and $$$a_2$$$, at least one of $$$f(T_{1,2})=a_1\oplus a_2$$$, $$$f(T_{2,3})=a_1$$$, $$$f(T_{1,3})=a_2$$$ must be $$$0$$$. So you cannot treat the $$$f(T_{ij})$$$ as independent random variables for different $$$(i,j)$$$.
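The 1-bit example above is small enough to check by brute force. The following is only an illustrative sketch: the index-set encoding of $$$S_1,S_2,S_3$$$ and the helper names `xor_sum` and `zero_pairs` are my own choices, not part of any library.

```python
from itertools import product

# Sets from the example, stored as index sets into A = [a_1, a_2]
# (index 0 stands for a_1, index 1 for a_2).
S = {1: {0}, 2: {1}, 3: {0, 1}}

def xor_sum(values, indices):
    """Xor of values[i] over the given indices (0 for the empty set)."""
    acc = 0
    for i in indices:
        acc ^= values[i]
    return acc

def zero_pairs(values):
    """Return all pairs (i, j), i < j, with f(T_ij) = 0 for this assignment."""
    pairs = []
    for i in S:
        for j in S:
            if i < j:
                t = S[i] ^ S[j]  # symmetric difference T_ij
                if xor_sum(values, t) == 0:
                    pairs.append((i, j))
    return pairs

# Try all four assignments of the two 1-bit values.
results = {a: zero_pairs(a) for a in product((0, 1), repeat=2)}
for a, pairs in results.items():
    print(a, pairs)
```

Every one of the four assignments yields at least one pair with $$$f(T_{ij})=0$$$, which would be impossible if the three events were independent with probability $$$\frac{1}{2}$$$ each.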

My derivation uses linearity of expectation, which holds even when the random variables are dependent. Let's first compute the expected number of pairs $$$1\le i<j\le n$$$ satisfying $$$f(T_{ij})=0$$$:

$$$\mathbf{E}\sum\limits_{i<j}[f(T_{ij})=0] = \sum\limits_{i<j}\mathbf{E}[f(T_{ij})=0] = \sum\limits_{i<j}P(f(T_{ij})=0) = \sum\limits_{i<j}\frac{1}{2^m} = \frac{n(n-1)}{2^{m+1}}$$$

The expected number of pairs $$$1\le i<j\le n$$$ satisfying $$$f(T_{ij})=0$$$ can also be written as $$$P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2}$$$, where $$$P_k$$$ is the probability that there are exactly $$$k$$$ pairs $$$1\le i<j\le n$$$ satisfying $$$f(T_{ij})=0$$$.

So, $$$P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{\frac{n(n-1)}{2}}=\frac{n(n-1)}{2^{m+1}}$$$.

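Since $$$P_k\le kP_k$$$ for every $$$k\ge 1$$$, the probability that there exists some $$$1\le i < j \le n$$$ such that $$$f(S_i)=f(S_j)$$$ is

$$$P_1 + P_2 + ... + P_k + ... + P_{n(n-1)/2} \le P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2} = \frac{n(n-1)}{2^{m+1}}$$$.

The inequality is not strict in general: for $$$n=2$$$ there is only one pair, and both sides are equal.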
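As a sanity check, both the expectation and the bound can be verified exactly on a tiny instance by enumerating every assignment. This is a sketch under assumptions: the toy instance (3 elements, 3 bits, 4 subsets) and the function name `exact_stats` are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def exact_stats(sets, num_elems, m):
    """Enumerate all (2^m)^num_elems assignments and return
    (expected number of colliding pairs, probability of any collision)."""
    total = 0
    pair_count = 0
    any_count = 0
    for values in product(range(1 << m), repeat=num_elems):
        hashes = []
        for s in sets:
            h = 0
            for idx in s:
                h ^= values[idx]
            hashes.append(h)
        # Count pairs i < j with f(S_i) = f(S_j).
        colliding = sum(1 for x, y in combinations(hashes, 2) if x == y)
        pair_count += colliding
        any_count += 1 if colliding > 0 else 0
        total += 1
    return Fraction(pair_count, total), Fraction(any_count, total)

# Hypothetical toy instance: distinct subsets given as index sets.
sets = [{0}, {1}, {0, 1}, {1, 2}]
m = 3
n = len(sets)

expected_pairs, p_collision = exact_stats(sets, 3, m)
bound = Fraction(n * (n - 1), 2 ** (m + 1))
print(expected_pairs, p_collision, bound)
```

Here the expected number of colliding pairs matches $$$\frac{n(n-1)}{2^{m+1}}$$$ exactly, while the true collision probability can only be at most that value.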

If $$$n=3\times 10^5$$$ and $$$m=64$$$, the above upper bound is about $$$2.4\times 10^{-9}$$$.
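To plug in other values of $$$n$$$ and $$$m$$$, the bound can be evaluated with exact arithmetic. A minimal sketch (the function name `collision_bound` is my own):

```python
from fractions import Fraction

def collision_bound(n, m):
    """Upper bound n(n-1) / 2^(m+1) on the collision probability."""
    return Fraction(n * (n - 1), 2 ** (m + 1))

b64 = collision_bound(300_000, 64)
b32 = collision_bound(300_000, 32)

print(f"m=64: {float(b64):.2e}")  # roughly 2.4e-09
print(f"m=32: {float(b32):.2e}")  # greater than 1, so the bound says nothing here
```

Note that with $$$m=32$$$ and the same $$$n$$$ the bound exceeds $$$1$$$, so it gives no guarantee at all for 32-bit hashes at this scale.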

I'd be glad to hear about it if you find something wrong. :)