Codeforces and Polygon may be unavailable from December 6, 19:00 (UTC) to December 6, 21:00 (UTC) due to technical maintenance. ×

Yet Another Rating System for CodeForces

Revision en1, by nikgaevoy, 2020-07-04 17:21:47

Finally, I finished the implementation of an improved version of TrueSkill rating system that EbTech named "TrueSkill from St.Petersburg".

TL;DR

Results are here. Full repository is here.

Important notes on results

Note that those results were obtained by running on the EbTech's testset (rounds only before Codeforces Round #645 and team contests excluded) and could not represent current standings. Also, this file contains only users with at least 10 contests and at least one contest in 2020 (otherwise, the file would be very large). However, you may obtain full rating by running the algorithm in your computer, it would take about a minute. Also note that results may seem not very impressive since I chose the first parameters that come into my mind and they are surely should be changed.

What is this rating system?

It is like original TrueSkill, but without its problems, like multiway ties. What is TrueSkill? It is a generalisation of Elo/Glicko to the case of many simultaneous teams competing. However, original TrueSkill was made to deal with many players, large teams and not so many ties and should not work correctly in our case. The improved version of TrueSkill was designed for the sport version of "What? Where? When?" which means it solved original TrueSkill problems and perfectly match CodeForces-like competitions.

Pros

  • It is mathematically perfect. It does exactly Bayesian inference on the model with many simultaneous players (just like CodeForces). Therefore, no other rating system could be more precise. However, there is a room for customisation

  • It is very fast. As I already mentioned, it processes almost all CodeForces history in less than a minute. The complexity of processing contest with $$$n$$$ players is $$$\mathcal{O}\left(n \log \frac{1}{\varepsilon} \right)$$$ where $$$\varepsilon$$$ is the designated precision of users' ratings.

  • Teams out-of-the-box. Despite the fact that results above were obtained on tests with team contests excluded, it is not the rating system issue. The only reason why I excluded team contests from the testset is that CF-loader I took from EbTech's repository doesn't process team contests and I don't want to make loader by myself. However, now team performance is computed as sum of player performances, but it could be easily changed to any other formula.

  • Reduced rating inflation, explicit performances, visible rating feature and more.

  • tourist on the top of the rating!

Cons

  • Some formulas lack of numerical stability. That happened because formulas from the original article contain some mistakes (at least they fail on some obvious tests like $$$(-m) \cdot \chi_{[-\varepsilon, \varepsilon]} = -\left(m \cdot \chi_{[-\varepsilon, \varepsilon]}\right)$$$) and I can't find the way they were achieved. However, they seem to be near the truth and they are numerically stable, so I believe that the could be fixed.

Further discussion

Rating system parameters

As I already mentioned, I spent not so many time choosing the best parameters for CodeForces case, so there is still some work with data ahead.

Normal vs logistic distribution

Which distribution expresses player rating and performance better? Any choice is very debatable. Currently my implementation uses normal distribution inside, but it could be easily replaced with logistic or any other distribution, the only thing you need to do is to implement operators like multiplication, division, addition and two multiplication on indicator operators (last two are more tricky than others, but still not very hard, see article appendix for explanation what to do and this file as an example for normal distributions).

Large multiway tie on the last place

Codeforces standings has an unusual artifact of a large multiway tie of a players with zero score which does not perfectly match the theory and possibly affect those players ratings too much. There are many ways to deal with this problem, e.g. exclude them from rating at all, make separate tie-$$$\varepsilon$$$ for them, etc.

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en10 English nikgaevoy 2020-07-04 21:59:43 0 (published)
en9 English nikgaevoy 2020-07-04 21:59:24 351 (saved to drafts)
en8 English nikgaevoy 2020-07-04 21:37:27 0 (published)
en7 English nikgaevoy 2020-07-04 21:36:57 101
en6 English nikgaevoy 2020-07-04 21:32:20 5 Tiny change: 'case when when many' -> 'case when many'
en5 English nikgaevoy 2020-07-04 21:31:51 3 Tiny change: '. Also, this file cont' -> '. Also, the file cont'
en4 English nikgaevoy 2020-07-04 21:30:49 488 Tiny change: '5.jpg)\n\nPros\n' -> '5.jpg)\n\n[cut]\n\nPros\n'
en3 English nikgaevoy 2020-07-04 17:28:43 264 Tiny change: 'n[cut]\n\n### Pr' -> 'n[cut]\n\n\n### Pr'
en2 English nikgaevoy 2020-07-04 17:23:38 5
en1 English nikgaevoy 2020-07-04 17:21:47 4673 Initial revision (saved to drafts)