Here is my best implementation for 472G: 331895855. It works the same way as 8014415, just with AVX2 intrinsics instead of SSE code. It's about 30% faster than popcnt (which is already fast) and 2-3x faster than the old SSE code. It's currently the fastest on CF.
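For anyone who doesn't want to dig through the submission, here is a rough sketch of the general idea. It is not the code from 331895855. It's just one common way to do a popcount with AVX2, using a nibble lookup table via `_mm256_shuffle_epi8`, applied to the XOR of two word-aligned bit arrays. The function names (`hamming_scalar`, `hamming_avx2`, `hamming`) are made up for this example, and it assumes GCC or Clang on x86-64. The real problem also needs handling for substrings that start at arbitrary bit offsets, which is left out here.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scalar reference: popcount of a[i] XOR b[i] over n 64-bit words.
static uint64_t hamming_scalar(const uint64_t* a, const uint64_t* b, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) total += __builtin_popcountll(a[i] ^ b[i]);
    return total;
}

// AVX2 sketch: processes 4 words (256 bits) per iteration.
__attribute__((target("avx2")))
static uint64_t hamming_avx2(const uint64_t* a, const uint64_t* b, size_t n) {
    // Popcount of every 4-bit value, repeated for both 128-bit halves.
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;  // four 64-bit partial sums

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256i x = _mm256_xor_si256(
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i)),
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i)));
        // Split each byte into its low and high nibble, look both up.
        __m256i lo = _mm256_and_si256(x, low_mask);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(x, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        // Sum the 8 byte counts in each 64-bit lane and accumulate.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, zero));
    }

    uint64_t lanes[4];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
    uint64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Leftover words that do not fill a whole 256-bit vector.
    for (; i < n; ++i) total += __builtin_popcountll(a[i] ^ b[i]);
    return total;
}

// Uses the AVX2 path only when the CPU reports support for it.
static uint64_t hamming(const uint64_t* a, const uint64_t* b, size_t n) {
    if (__builtin_cpu_supports("avx2")) return hamming_avx2(a, b, n);
    return hamming_scalar(a, b, n);
}
```

Calling `hamming(a, b, n)` on two arrays of `n` packed 64-bit words should give the same result as `hamming_scalar(a, b, n)`. The runtime check is only there so the sketch doesn't crash on machines without AVX2. In a contest submission you would more likely just put `#pragma GCC target("avx2")` at the top.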
Why were the top solutions using fast I/O libraries?







