Rating the Difficulty of Codeforces Problems

#	User	Rating
1	Benq	3792
2	VivaciousAubergine	3647
3	Kevin114514	3611
4	jiangly	3583
5	strapple	3515
6	tourist	3470
7	Radewoosh	3415
8	Um_nik	3376
9	maroonrk	3361
10	XVIII	3345

#	User	Contrib.
1	Qingyu	162
2	adamant	148
3	Um_nik	146
4	Dominater069	143
5	errorgorn	141
6	cry	138
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
10	soullless	133

There are several heuristics that exist for finding suitable practice problems on Codeforces. One common piece of advice is to sort all problems by number of solvers and to work down the list. However, the number of solvers depends on the recency of the contest, the day of the week the contest was held, the time of the day the contest was held, and numerous other factors. Another heuristic is to try to solve problems at a certain level, for instance, all Div. 1 C problems. However, the difficulty of such problems can vary greatly from contest to contest.

We decided to create a better measure of problem difficulty, which we’ll call “problem rating”. Problem rating is an integral value, with higher problem ratings reflecting harder problems. In order for the rating to be meaningful, we made it compatible (in the mathematical sense) with user ratings. Technically, problem rating is defined as the rating of a contestant whose expected place in a contest equals the number of participants who actually solved the problem. In practice, if you meet a problem whose rating is equal to your rating, you are expected to solve it in half of your contests.

To give you some intuition into this number, say there is a contest with 10 contestants rated 2000. If 5 of those contestants solved it (and 5 did not), the problem would be rated 2000. If 7 solved the problem, it would be rated 1852. If 9 solved the problem, it would be rated 1618. For convenience, we give problems solved by all users a rating of 0 and problems solved by no users a rating of 5000.

Of course, there are some shortcomings to this approach. For instance, not all problems are equally attempted by all contestants during a contest; most of us are much more likely to attempt earlier problems than later ones. Furthermore, since users who don’t solve any problems don’t count as participants, our ratings are deflated, especially when the first problem of a problem set is difficult. These are issues we hope to address as we refine our ratings in the future.

We’ve provided a list of problem ratings at our github page.

How much more difficult are Div. 1 questions compared to Div. 2 questions?

As one would expect, the distribution of Div. 1 problem rating (solid black line) are shifted to the right compared to Div. 2 problem ratings (dashed black line), with combined Div. 1 + Div. 2 contests (dashed gray line) somewhere in the middle. However, there is quite a bit of overlap between the Div. 1 and Div. 2 problem ratings. Do we actually get more points for solving harder questions?

There is a clear linear relationship between the point value of problems and their rating, although there is quite a bit of spread for each point value. A Div. 2 500-point question could be as easy as gray or as hard as blue, and a Div. 1 500-point question ranges from gray all the way to yellow.

How stable are problem ratings and contestant ratings over time?

In this plot, each dot indicates one problem. The hardest problem in each contest is colored red, and the easiest problem in each contest is colored blue. Both the hardest and easiest problems in Div. 1 have been creeping upwards over time, while easy problem in Div. 2 have been much more stable. The average rating of Div. 1 participants (solid black line) and Div. 2 participants (dashed black line) appear to be rather stable over time, with a jump recently due to the recategorization of Div. 1 and Div. 2.

Which types of problems are the hardest to solve?

While it is difficult to categorize problems accurately with only a few tags, there might still be interesting patterns in the distribution of problem ratings with respect to problem tags. Categories here are sorted by increasing median problem ratings. There is a significant amount of spread within each tag. Unsurprisingly, problems tagged with “FFT” appear to be the most difficult, but are also infrequently seen. “Flows”, “matrices”, “divide and conquer”, and “string suffix structures” are next, followed by problems without any tags. “Implementation” problems appear to be the easiest, but still quite a few did not get any solves in-contest.

Are problem ratings stable between divisions?

Since a few problems are shared between Div. 1 and Div. 2 contests, we decided to see how well Div. 1 scores correlated with Div. 2 scores for these problems. Here, each dot is a problem shared between two divisions, and they are colored by their Div. 1 ratings. We see a strong linear relationship between the two, and as expected, Div. 1 problem ratings tend to be lower than Div. 2 ratings. This is especially apparently in a few problems that are unsolved by Div. 2 participants but have been solved by Div. 1 participants. This pattern could be attributed to a number of different factors: Div. 2 participants are both less likely to solve more difficult question than Div. 1 participants, and they also tend to see these problems later in the contest, leaving them less time to solve them.

What’s next?

Once we have a way of calculating the difficulty of problems, we can visualize a user’s rating trajectory superimposed on top of problems they’ve done either in contest or for practice. Here’s a profile of I_love_Tanya_Romanova, a long-time champion on the problemset leaderboard.

We hope that we can use these visualizations to make predictions on how users can best improve their ratings.