№	Пользователь	Рейтинг
1	tourist	3985
2	jiangly	3741
3	jqdai0815	3682
4	Benq	3529
5	orzdevinwang	3526
6	ksun48	3489
7	Radewoosh	3483
8	Kevin114514	3443
9	ecnerwala	3392
9	Um_nik	3392

№	Пользователь	Вклад
1	cry	167
2	Um_nik	163
3	maomao90	162
3	atcoder_official	162
5	adamant	158
5	-is-this-fft-	158
7	awoo	156
8	djm03178	155
9	TheScrasse	154
10	Dominater069	153

Finally, I finished the implementation of an improved version of TrueSkill rating system that EbTech named "TrueSkill from St.Petersburg".

TL;DR

Results are here. Full repository is here.

Important notes on the results

Note that those results were obtained by running on the EbTech's testset (rounds only before Codeforces Round #645 and team contests excluded) and thus do not represent the current standings on Codeforces. Also, the file contains only users with at least 10 contests and at least one contest in 2020 (otherwise, the file would be very large). However, you may obtain full rating by running the algorithm on your computer, it should take about a minute. Also note that the results may seem not very impressive since I chose the first parameters out of the blue and they should be changed of course.

What is this rating system?

It is like original TrueSkill, but without its problems, like multiway ties. What is TrueSkill? It is a generalization of Elo/Glicko to the case when many teams compete simultaneously. However, the original TrueSkill was made to deal with many players, large teams and not so many ties, and therefore should not work correctly in our case. The improved version of TrueSkill was designed for the sport version of "What? Where? When?", thus it solves the original TrueSkill problems and perfectly match Codeforces-like competitions.

Pros

It is mathematically perfect. It does exactly Bayesian inference on the model with many simultaneous players (just like Codeforces), no approximate inference or any other tweaking of the results. Therefore, no other rating system could be more precise. However, there is a room for customisation to fit your personal preferences (the stability of the rating, etc.).
It is very fast. As I have already mentioned, it processes almost all Codeforces history in less than a minute. The complexity of processing contest with $$$n$$$ players is $$$\mathcal{O}\left(n \log \frac{1}{\varepsilon} \right)$$$ where $$$\varepsilon$$$ is the designated precision of users' ratings.
Teams out-of-the-box. Despite the fact that results above were obtained on tests with team contests excluded, it is not the rating system issue. The only reason why I excluded team contests from the testset is that CF-loader I took from EbTech's repository doesn't process team contests and I don't want to spend time for making a loader by myself. However, now team performance is computed as sum of player performances, but it could be easily changed to any other formula.
Reduced rating inflation, explicit performances, visible rating feature and more.
tourist on the top of the rating!

Cons

Some formulas lack of numerical stability. That happened because formulas from the original article contain some mistakes (at least they fail on some obvious tests like $$$(-m) \cdot \chi_{[-\varepsilon, \varepsilon]} = -\left(m \cdot \chi_{[-\varepsilon, \varepsilon]}\right)$$$) and I failed to retrace the author's steps in the computations. However, they seem to be somewhere near the truth and they are numerically stable, so I believe that it could be fixed.

Further discussion

Rating system parameters

As I already mentioned, I haven't spent lots of time choosing the best parameters for Codeforces case, so there is still some work with data ahead.

Normal vs logistic distribution

Which distribution expresses player rating and performance better? Any choice is very debatable. Currently my implementation uses normal distribution, but it could be easily replaced with logistic or any other distribution, the only thing you need to do is to implement operators like multiplication, division, addition and two multiplication on indicator operators (last two are more tricky than the others, but still not very hard, see article appendix for explanation what to do and this file as an example for normal distributions).

Large multiway tie on the last place

Codeforces standings have an unusual artifact of a large multiway tie of a players with zero score which does not perfectly match the theory and possibly affect those players' ratings too much. There are many ways to deal with this problem, e.g. exclude them from rating at all, make separate tie-$$$\varepsilon$$$ for them, etc.

Acknowledgements

I'd like to thank EbTech for starting this discussion that led us here, S. Nikolenko for pointing the solution out to me and MikeMirzayanov for great Codeforces system that crashed during my experiments only twice.

Rev.	Кто	Когда	Δ	Комментарий
en10	nikgaevoy	2020-07-04 21:59:43	0	(published)
en9	nikgaevoy	2020-07-04 21:59:24	351	(saved to drafts)
en8	nikgaevoy	2020-07-04 21:37:27	0	(published)
en7	nikgaevoy	2020-07-04 21:36:57	101
en6	nikgaevoy	2020-07-04 21:32:20	5	Tiny change: 'case when when many' -> 'case when many'
en5	nikgaevoy	2020-07-04 21:31:51	3	Tiny change: '. Also, this file cont' -> '. Also, the file cont'
en4	nikgaevoy	2020-07-04 21:30:49	488	Tiny change: '5.jpg)\n\nPros\n' -> '5.jpg)\n\n[cut]\n\nPros\n'
en3	nikgaevoy	2020-07-04 17:28:43	264	Tiny change: 'n[cut]\n\n### Pr' -> 'n[cut]\n\n\n### Pr'
en2	nikgaevoy	2020-07-04 17:23:38	5
en1	nikgaevoy	2020-07-04 17:21:47	4673	Initial revision (saved to drafts)