An analysis on the error rate of Zobrist Hashing (Xor Hashing)

I was trying to find some formal analysis on the error rate of Zobrist Hashing (used in 1977D - XORificator, and there's a tutorial in XOR Hashing [TUTORIAL]).

But I couldn't find any. (If you know of one, feel free to post it here.)

Luckily, I managed to derive an upper bound on the error rate. I'm writing it up here so you can tell me if there's any problem in my derivation.

The problem statement:

You are given a set of $N$ elements $A=[a_1,a_2,...,a_N]$ , each to be assigned an $m$ -bit integer, and $n$ sets $[S_1,S_2,...,S_n]$ , each of which is a subset of $A$ , with the sets pairwise distinct: $S_i\ne S_j$ for every $1\le i < j \le n$ . Denote $f(S_i)=\bigoplus\limits_{x\in S_i}x$ , the xor sum of the elements in $S_i$ . Now, if each of $a_1,a_2,...,a_N$ is assigned an $m$ -bit value independently and uniformly at random, what is (an upper bound on) the probability that there exist some $1\le i < j \le n$ such that $f(S_i)=f(S_j)$ ?
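To make the setup concrete, here is a minimal Python sketch. It assumes the elements of $A$ are identified by indices $0,...,N-1$ and that each set $S_i$ is given as a set of such indices; the helper names `assign_random_values`, `xor_hash` and `has_collision` are made up for illustration and are not from any library.

```python
import random

def assign_random_values(N, m, rng):
    """Draw N independent, uniformly random m-bit integers, one per element of A."""
    return [rng.getrandbits(m) for _ in range(N)]

def xor_hash(subset, values):
    """f(S): xor of the random values assigned to the element indices in `subset`."""
    h = 0
    for i in subset:
        h ^= values[i]
    return h

def has_collision(sets, values):
    """True if two of the given sets receive the same hash f(S_i) = f(S_j)."""
    hashes = [xor_hash(s, values) for s in sets]
    return len(set(hashes)) < len(hashes)

rng = random.Random(2024)  # fixed seed only so the sketch is reproducible
values = assign_random_values(N=4, m=64, rng=rng)
sets = [{0, 1}, {1, 2}, {0, 1, 2, 3}]  # three pairwise distinct subsets (assumed example)
print(has_collision(sets, values))
```

The question in the statement is how often `has_collision` would return `True` over the random choice of `values`.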

The analysis:

One observation is that $f(S_i)=f(S_j)$ holds if and only if $f(S_i\cup S_j-S_i\cap S_j)=0$ , because elements that belong to both sets cancel under xor. So let's denote the symmetric difference by $T_{ij}=S_i\cup S_j-S_i\cap S_j$ . Now we want to estimate the probability that there is some $T_{ij}$ satisfying $f(T_{ij})=0$ . For a specific pair $i$ and $j$ , $P(f(T_{ij})=0)=\frac{1}{2^m}$ : since $S_i\ne S_j$ , $T_{ij}$ is non-empty, so pick any $a_k\in T_{ij}$ ; whatever values the other elements take, $f(T_{ij})=0$ requires $a_k$ to equal the xor of the rest, and an independent, uniformly random $a_k$ hits that single value with probability $\frac{1}{2^m}$ . However, there are $\frac{n(n-1)}{2}$ pairs $(i,j)$ , and the events for different $T_{ij}$ are not independent. For example, let $m=1$ (all integers in $A$ are 1-bit), and $A=[a_1,a_2],S_1=[a_1],S_2=[a_2],S_3=[a_1,a_2]$ . Then no matter how you assign $a_1$ and $a_2$ , among $f(T_{1,2})=a_1\oplus a_2,f(T_{2,3})=a_1,f(T_{1,3})=a_2$ there must be at least one $0$ . So you cannot treat the $f(T_{ij})$ as independent random variables for different pairs $(i,j)$ .
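The $m=1$ example can be checked exhaustively. The sketch below (0-based indices, helper name `f` chosen to mirror the notation) enumerates all four assignments of $(a_1,a_2)$ and computes, exactly, the per-pair probability and the probability that at least one pair collides.

```python
from fractions import Fraction
from itertools import combinations, product

m = 1
sets = [{0}, {1}, {0, 1}]  # S_1=[a_1], S_2=[a_2], S_3=[a_1,a_2] as 0-based index sets
pairs = list(combinations(range(len(sets)), 2))
assignments = list(product(range(2 ** m), repeat=2))  # every possible (a_1, a_2)

def f(subset, values):
    """Xor sum of the values assigned to the indices in `subset`."""
    h = 0
    for i in subset:
        h ^= values[i]
    return h

# Exact probability that f(T_ij) = 0 for each pair; `^` on Python sets is the
# symmetric difference, i.e. T_ij.
per_pair = {
    (i, j): Fraction(
        sum(f(sets[i] ^ sets[j], a) == 0 for a in assignments), len(assignments)
    )
    for i, j in pairs
}

# Exact probability that at least one pair has f(T_ij) = 0.
any_zero = Fraction(
    sum(any(f(sets[i] ^ sets[j], a) == 0 for i, j in pairs) for a in assignments),
    len(assignments),
)

print(per_pair)  # every pair has probability 1/2 = 1/2^m
print(any_zero)  # 1, whereas three independent events would give 1 - (1/2)^3 = 7/8
```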

My derivation uses linearity of expectation, which holds even when the events are dependent. Let's first compute the expected number of pairs $1\le i<j\le n$ satisfying $f(T_{ij})=0$ , where the sums below run over all $1\le i<j\le n$ :

$\mathbf{E}\sum\limits_{i,j}[f(T_{ij})=0] = \sum\limits_{i,j}\mathbf{E}[f(T_{ij})=0] = \sum\limits_{i,j}P(f(T_{ij})=0) = \sum\limits_{i,j}\frac{1}{2^m} = \frac{n(n-1)}{2^{m+1}}$
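As a sanity check on this identity, the following sketch computes the expectation exactly on a tiny instance by enumerating every assignment. The instance is the dependent $m=1$ example from above, and the function name `expected_colliding_pairs` is an illustrative choice; brute force like this is only feasible for very small $N$ and $m$ .

```python
from fractions import Fraction
from itertools import combinations, product

def xor_hash(subset, values):
    """f(S): xor of the values assigned to the indices in `subset`."""
    h = 0
    for i in subset:
        h ^= values[i]
    return h

def expected_colliding_pairs(sets, N, m):
    """Exact expected number of pairs i<j with f(S_i) = f(S_j), over all (2^m)^N assignments."""
    total_pairs = 0
    num_assignments = 0
    for values in product(range(2 ** m), repeat=N):
        num_assignments += 1
        hashes = [xor_hash(s, values) for s in sets]
        total_pairs += sum(hi == hj for hi, hj in combinations(hashes, 2))
    return Fraction(total_pairs, num_assignments)

sets = [{0}, {1}, {0, 1}]  # the dependent example with m = 1
n, N, m = len(sets), 2, 1
expected = expected_colliding_pairs(sets, N, m)
print(expected)  # 3/2
assert expected == Fraction(n * (n - 1), 2 ** (m + 1))
```

Even though the three events are dependent here, the expectation still matches $\frac{n(n-1)}{2^{m+1}}$ exactly.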

The expected number of pairs $1\le i<j\le n$ satisfying $f(T_{ij})=0$ can also be written as $P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2}$ , where $P_k$ is the probability that exactly $k$ pairs $1\le i<j\le n$ satisfy $f(T_{ij})=0$ .

So, $P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{\frac{n(n-1)}{2}}=\frac{n(n-1)}{2^{m+1}}$ .

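So, the probability that there exists some $1\le i < j \le n$ such that $f(S_i)=f(S_j)$ is

$P_1 + P_2 + ... + P_k + ... + P_{n(n-1)/2} \le P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2} = \frac{n(n-1)}{2^{m+1}}$ .

The inequality is not strict in general: equality holds when $P_k=0$ for all $k\ge 2$ , for example when $n=2$ .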
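To see how tight the bound is on a small case, the sketch below computes the exact collision probability by brute force and compares it with $\frac{n(n-1)}{2^{m+1}}$ . The instance ( $N=3$ , $m=3$ , three hand-picked sets) and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def xor_hash(subset, values):
    """f(S): xor of the values assigned to the indices in `subset`."""
    h = 0
    for i in subset:
        h ^= values[i]
    return h

def exact_collision_probability(sets, N, m):
    """Fraction of the (2^m)^N assignments in which some f(S_i) = f(S_j), i < j."""
    bad = 0
    total = 0
    for values in product(range(2 ** m), repeat=N):
        total += 1
        hashes = [xor_hash(s, values) for s in sets]
        if len(set(hashes)) < len(hashes):
            bad += 1
    return Fraction(bad, total)

def collision_bound(n, m):
    """The upper bound n(n-1)/2^(m+1) derived above."""
    return Fraction(n * (n - 1), 2 ** (m + 1))

N, m = 3, 3
sets = [{0}, {1}, {0, 1, 2}]  # pairwise distinct subsets of a 3-element A
p = exact_collision_probability(sets, N, m)
bound = collision_bound(len(sets), m)
print(p, bound)  # 11/32 3/8
assert p <= bound
```

Here a collision happens exactly when two of the three values coincide, so the true probability is $\frac{11}{32}$ , slightly below the bound $\frac{3}{8}$ .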

If $n=3\times 10^5$ and $m=64$ , the above upper bound is about $2.4\times 10^{-9}$ .
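The arithmetic can be reproduced with a few lines of Python. The function name `collision_bound` is illustrative, and the $m=32$ line is an extra comparison that is not part of the derivation above.

```python
from fractions import Fraction

def collision_bound(n, m):
    """Upper bound n(n-1)/2^(m+1) on the probability of any hash collision."""
    return Fraction(n * (n - 1), 2 ** (m + 1))

b64 = collision_bound(300_000, 64)
b32 = collision_bound(300_000, 32)
print(float(b64))  # roughly 2.44e-09
print(float(b32))  # greater than 1, so the bound says nothing for 32-bit values
```

With the same $n$ and only 32-bit values the bound exceeds $1$ , so this analysis gives no guarantee in that case.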

I'd be glad to hear about it if you find something wrong. :)

	Rev.	Lang.	By	When	Δ	Comment
	en2		kinoud	2024-05-29 15:46:17	38
	en1		kinoud	2024-05-29 15:37:39	2893	Initial revision (published)
