It seems this problem hasn't seen much further development since its introduction, so my solution is still the fastest after nearly a year: https://judge.yosupo.jp/submission/268155.
There are only two main optimizations: AVX input, and back-substitution using the Method of Four Russians (this is how I took the lead over adamant).
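For anyone unfamiliar with the Four Russians trick, here is a minimal sketch of the idea applied to back-substitution over GF(2). It is not the code from the submission: the function name `back_substitute_m4r`, the block size `k`, and the single-word row layout (n <= 63, with the right-hand side stored in bit n) are all assumptions made to keep the example short. A real solution would use multi-word rows, where one table lookup replaces up to `k` whole-row XORs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: solve U x = b over GF(2), where U is n x n upper
// triangular with ones on the diagonal (assumes n <= 63 and 1 <= k < 32).
// Row i is packed into one word: bit j = U[i][j], bit n = b[i].
// Returns x packed into bits 0..n-1.
std::uint64_t back_substitute_m4r(std::vector<std::uint64_t> rows, int n, int k = 4) {
    std::vector<std::uint64_t> table(std::size_t(1) << k);
    // Walk over column blocks [lo, hi) from the last block to the first.
    for (int hi = n; hi > 0; hi -= k) {
        int lo = hi > k ? hi - k : 0;
        int width = hi - lo;

        // Step 1: reduce the diagonal block to the identity, naively.
        for (int j = hi - 1; j > lo; --j)
            for (int i = lo; i < j; ++i)
                if ((rows[i] >> j) & 1) rows[i] ^= rows[j];

        // Step 2: table[mask] = XOR of the block rows selected by mask,
        // built with one XOR per entry.
        table[0] = 0;
        for (int t = 0; t < width; ++t)
            for (unsigned mask = 0; mask < (1u << t); ++mask)
                table[mask | (1u << t)] = table[mask] ^ rows[lo + t];

        // Step 3: clear columns lo..hi-1 in every row above the block
        // with a single table lookup per row.
        for (int i = 0; i < lo; ++i) {
            unsigned mask = static_cast<unsigned>(rows[i] >> lo) & ((1u << width) - 1);
            rows[i] ^= table[mask];
        }
    }
    // Every row i is now e_i plus its solved right-hand side bit.
    std::uint64_t x = 0;
    for (int i = 0; i < n; ++i) x |= ((rows[i] >> n) & 1) << i;
    return x;
}
```

With rows wider than one word, the table costs about 2^k row XORs per block, while each row above the block needs one XOR instead of up to k, which is where the saving comes from.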
Is there an even faster solution to this problem?



