I was trying to find some formal analysis on the error rate of Zobrist Hashing (used in 1977D - XORificator, and there's a tutorial in XOR Hashing [TUTORIAL]).
But I couldn't find any. (If you find any, feel free to post it here.)
Luckily, I managed to derive some upper bound on the error rate. I write it here and you can tell me if there's any problem in my derivation.
The problem statement:
You are given a set of m-bit integers A=[a1,a2,...,aN], and n sets [S1,S2,...,Sn] , each of which is a subset of A, and for every 1≤i<j≤n, Si≠Sj. Denote f(Si)=⨁x∈Six, which is the xor sum of elements in Si. Now, if I randomly assign each a1,a2,...,aN in A, what is the (estimated upper bound) probability that there exists some 1≤i<j≤n such that f(Si)=f(Sj)?
The analysis:
One observation is that if f(Si)=f(Sj) holds, then f(Si∪Sj−Si∩Sj) will be 0. So let's denote Tij=Si∪Sj−Si∩Sj. Now we want to estimate the probability that there's some Tij, satisfying f(Tij)=0. For some specific i and j, it is obvious that P(f(Tij=0))=12m because elements in A are randomly generated. However, there are n(n−1)2 pairs of (i,j), and all these Tij are actually not independent. For example, if m=1 (all integers in A are 1-bit), and A=[a1,a2],S1=[a1],S2=[a2],S3=[a1,a2]. Then no matter how you assign a1 and a2, for f(T1,2)=a1⊕a2,f(T2,3)=a1,f(T1,3)=a2, there must be at least one 0. So you can not treat the f(Tij) as independent random variables for different ij.
The linearity property of expectation is used in my derivation. Let's first derive the expectation of how many pairs of 1≤i<j≤n satisfying f(Tij)=0 as
E∑i,j[f(Tij)=0]=∑i,jE[f(Tij)=0]=∑i,jP(f(Tij)=0)=∑i,j12m=n(n−1)2m+1
The expectation of how many pairs of 1≤i<j≤n satisfying f(Tij)=0 can also be written as P1+2P2+...+kPk+...+n(n−1)2Pn(n−1)/2 , where Pk is the probability that there're exactly k pairs of 1≤i<j≤n satisfying f(Tij)=0.
So, P1+2P2+...+kPk+...+n(n−1)2Pn(n−1)2=n(n−1)2m+1.
So, the probability that there exists some 1≤i<j≤n such that f(Si)=f(Sj) is
P1+P2+...+Pk+...+Pn(n−1)/2<P1+2P2+...+kPk+...+n(n−1)2Pn(n−1)/2=n(n−1)2m+1.
If n=3×105 and m=64, the above upper bound is 2.4×10−9.
I'm glad to know if you find something wrong. :)