The fastest approach for 472G

Here is my best implementation for 472G: 331895855. It works the same way as 8014415, just with AVX2 intrinsics instead of SSE code. It's about 30% faster than popcnt (which is already fast) and 2-3x faster than the old SSE code. It's currently the fastest on CF.
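
For readers who haven't seen the two techniques being compared, here is a minimal sketch of both. It is not the code from either submission. It assumes the strings are already packed into 64-bit words and only shows the inner kernel: XOR two blocks and count the set bits. The AVX2 path uses the well-known nibble lookup via `_mm256_shuffle_epi8`; whether 331895855 counts bits exactly this way is an assumption. The function names (`hamming_popcnt`, `hamming_avx2`, `hamming`) are made up for illustration, and the code relies on GCC/Clang builtins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAMMING_X86 1
#endif

// Baseline: Hamming distance of two bit-packed arrays with scalar popcount.
static uint64_t hamming_popcnt(const uint64_t* a, const uint64_t* b, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += __builtin_popcountll(a[i] ^ b[i]);
    return total;
}

#ifdef HAMMING_X86
// AVX2 sketch: processes 4 words (256 bits) per iteration.
// Each byte is split into two nibbles, and a 16-entry table gives the
// bit count of each nibble.
__attribute__((target("avx2")))
static uint64_t hamming_avx2(const uint64_t* a, const uint64_t* b, size_t n) {
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;  // four 64-bit partial sums
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        __m256i x = _mm256_xor_si256(va, vb);
        __m256i lo = _mm256_and_si256(x, low_mask);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(x, 4), low_mask);
        // Per-byte bit counts (at most 8 each, so no 8-bit overflow).
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        // Sum each group of 8 bytes into a 64-bit lane.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, zero));
    }
    alignas(32) uint64_t lanes[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    uint64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)  // leftover words
        total += __builtin_popcountll(a[i] ^ b[i]);
    return total;
}
#endif

// Picks the AVX2 path when the CPU supports it, otherwise the baseline.
static uint64_t hamming(const std::vector<uint64_t>& a,
                        const std::vector<uint64_t>& b) {
    assert(a.size() == b.size());
#ifdef HAMMING_X86
    if (__builtin_cpu_supports("avx2"))
        return hamming_avx2(a.data(), b.data(), a.size());
#endif
    return hamming_popcnt(a.data(), b.data(), a.size());
}
```

On Codeforces you would normally skip the runtime check and put `#pragma GCC target("avx2")` at the top of the file instead. A full solution to 472G also has to handle substrings that start at arbitrary bit offsets, which this sketch leaves out.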

Why were the top solutions using fast I/O libraries?

i_love_sqrt_decomp's blog