Let us start with 2014H - Robin Hood Archery. A common solution assigns each value a random integer in $$$[0,L)$$$ and XORs the values in the interval. If every value appears an even number of times, the result is $$$0$$$. Otherwise, the result is $$$0$$$ with probability at most $$$1/L$$$.
Inspired by the hack used on 282284449 with xorshift, I found that many other submissions can also be hacked. The pitfalls of XOR hashing with mt19937_64 do not seem to be widely known, so I wrote this blog.
If we view both the internal state and the outputs as vectors over $$$\mathbb{F}_2$$$, then the $$$i$$$-th output of mt19937_64 with initial state $$$x$$$ can be written as $$$o_i = BA^i x$$$, where $$$A$$$ is the state transition matrix and $$$B$$$ is the output matrix over $$$\mathbb{F}_2$$$.
Thus, to make the XOR zero for every seed, it is enough to find a nonzero polynomial $$$p(t)=\sum_i a_i t^i$$$ such that $$$p(A)=0$$$. Then
Such a polynomial can be the minimal polynomial of $$$A$$$, whose degree is $$$19937$$$ for mt19937_64. By observing the least significant bit of the first $$$2 \times 19937$$$ outputs, we can recover the recurrence of this bit sequence using Berlekamp–Massey. This recurrence divides the minimal polynomial. Since mt19937_64 is designed to have maximal period, the corresponding polynomial is primitive, hence irreducible. Thus any nonzero recovered recurrence must be the full minimal polynomial.
The following is the test case generator.
Using uniform_int_distribution(0, L) incorrectly can still cause the same issue. For example, the implementation is unsafe on common standard-library implementations when $$$L = 2^n-1$$$, because the distribution may degenerate into taking the low $$$n$$$ bits of rng(), which is still linear over $$$\mathbb{F}_2$$$.









Does it means we can hack a fixed seeded solution or even randomized seed? I can't understand the math part ><
Even randomized seed. For example, 282285096 can be hacked. You can check the
mainfunction in the test case generator: for any seed, the hash is always $$$0$$$.So is this an issue only for XOR hashing or for most randomized algorithms which use the Mersenne Twister? I guess the solution would be to use other PRNGs like splitmix for most purposes? I'm not that good at math :/
when the man............... comes around.
Hear the trumpets, hear the pipers......
one hundred million angels singing!
Multitudes are marching to the big kettle drum!
I would've loved to continue but I got an exam in an hour :)
Voices calling voices crying; some are born and some are dying; it's Alpha and Omega's Kingdom Come.
Also, good luck on your exam. (not part of the song)
Time to switch to splitmix64
Just use aesctr. fast and (mostly) secure https://github.com/lemire/testingRNG/tree/master#visual-summary
(In comparison,
std::mt19937_64uses 2906 ms).i just learned xor hashing bro why :(
It is easy to avoid being hacked, just make some non-bitwise operation on rng() randomly.
Like this:
Do not just divide or multiply it by $$$2^k$$$ or it will just degenerate into taking a subset of rng(). (which is still hackable)
oh okay thanks !!!
Will shuffling the indexes in the beginning solve it?
Yes. Most small modifications can fix it.
So what is the “correct” xor hashing?
Most approaches are correct:
rng() % xorrng() / x, where $$$x$$$ is not a power of two, e.g.rng() / 3;rng() + x, where $$$x \ne 0$$$, e.g.rng() + 1;Interestingly, though, the incorrect approach is the most commonly used one.
rng() + 1is actually not a good choice, because then the non-linear part has very low entropyWhat is entropy?
Entropy (information theory). To put it simpler, just consider
x ^ x + 1withxsampled asrng(). It is usually very small, so the xor sum over a hashed set is usually very small. So the xor difference between the "patched"rng() + 1and the "broken"rng()hash values is usually very small, and since on a set constructed as described in the blog the latter is zero, the former will usually be very smallYeah, you're right. I found a test case with $$$n, q \le 2 \cdot 10^5$$$ where using
rng() + 1leads to a hash collision with probability over 99%. :(