TL;DR
I built an enterprise-grade machine learning model to predict a user's rating 6 months in the future. The model has a mean absolute error of 65.15 (that is, $$$ E \left [ \left | \hat{x} - x \right | \right ] = 65.15 $$$). You can use it here:
https://fbrunodr.com/predict-codeforces-rating
Motivation:
Check this thread: https://mirror.codeforces.com/blog/entry/143626?#comment-1282206
Before going forward with this post I have to admit a pretty important thing: I did not do what I promised, as I did not build a foundational model on top of codeforces data. Reasons:
It takes too much time to train on my personal laptop (or money to rent GPUs, which I am not willing to spend on a toy project).
I still almost went down the path of finetuning some feature extractor model (such as this one), but then I remembered I had to deploy this somewhere. My website runs on a small dedicated server (I don't do serverless to avoid unexpected bills), so running a medium-sized language model there was not a viable option. I also did not feel like renting GPUs for that (again because of money).
So I did not strictly build a state of the art machine learning rating predictor... But I did the next best thing: feature engineering + a decision tree model. I describe in detail how I did this in the next sections, and at the end I explain how you could train an actual foundational model for this task (if you are actually willing to waste time or money on this).
Data collection
This is actually the most important section, as you need lots of data to train a machine learning model. I heavily used the codeforces API for that (even getting IP banned a couple times). Anyway, here is what I did:
Used https://mirror.codeforces.com/api/user.ratedList?activeOnly=false&includeRetired=false to get non-retired users (people who have logged in at least once in the past month).
Selected the top 1.5 k users + 50 k uniformly random users from the previous step.
Collected submissions data + rating data + blogs data from each selected user and saved all the data.
Just to give you an idea, the total data I collected from codeforces adds up to 26.06 GB!! (and I didn't even use all the users).
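The collection loop above can be sketched roughly like this. This is a simplified sketch, not the actual scraper: the helper names are mine, and the 2-second sleep is just a guess at a polite rate (the post itself shows what happens when you go too fast).

```python
import json
import time
import urllib.parse
import urllib.request

API_BASE = "https://codeforces.com/api/"

def build_url(method: str, **params) -> str:
    """Build a Codeforces API URL such as user.status?handle=tourist."""
    query = urllib.parse.urlencode(sorted(params.items()))
    return API_BASE + method + ("?" + query if query else "")

def api_get(method: str, **params):
    """Call the API and return the 'result' payload, sleeping between
    calls to stay polite and (hopefully) avoid an IP ban."""
    with urllib.request.urlopen(build_url(method, **params)) as resp:
        payload = json.load(resp)
    time.sleep(2)  # crude rate limiting; tune to taste
    if payload["status"] != "OK":
        raise RuntimeError(payload.get("comment", "API error"))
    return payload["result"]

def fetch_user_bundle(handle: str) -> dict:
    """Submissions + rating history + blogs for one user, as in the post."""
    return {
        "submissions": api_get("user.status", handle=handle),
        "rating_changes": api_get("user.rating", handle=handle),
        "blogs": api_get("user.blogEntries", handle=handle),
    }
```

You would then just loop `fetch_user_bundle` over the ~51.5 k selected handles and dump each result to disk.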
Label preparation
As I said in the previous section, I used a decision tree machine learning model to predict rating. A decision tree model works like this:
$$$ \text{model}: \text{tabular data} \rightarrow \text{prediction} $$$
So we need to get labeled data in tabular format to train the model. For that I did the following:
For each user, on the 1st day of each month from Jan 2021 to Nov 2024, we extract features from this user as well as their rating 6 months in the future (6 × 30 × 24 × 60 × 60 seconds in the future, to be precise).
The extracted features are the tabular data and the rating 6 months in the future is the label. By the way, what are those features? Well, you can set them to any data you want (beware of data leakage: we don't want the model seeing into the future). For this specific model the chosen features can be seen here. Here are some of the most important features to give you an idea:
Number of problems done during a contest in the last 3 months
Time since account creation
Number of ACs on problems much above current user rating in the last 3 months
User's region
Delta rating in the past 12 months
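The monthly-snapshot labeling described above can be sketched like this. It's a simplified sketch under my own assumptions: rating history is a sorted list of `(update_time, new_rating)` pairs, and `extract_features` stands in for the real feature extraction (which must only look at data up to the snapshot time).

```python
from datetime import datetime, timezone

SIX_MONTHS = 6 * 30 * 24 * 60 * 60  # label horizon, in seconds

def rating_at(rating_changes, ts):
    """Rating a user had at unix time ts, given their rating-change
    history as (update_time, new_rating) pairs sorted by time.
    Returns None if the user was still unrated at ts."""
    rating = None
    for update_time, new_rating in rating_changes:
        if update_time > ts:
            break
        rating = new_rating
    return rating

def make_examples(rating_changes, extract_features, first, last):
    """Yield (features, label) pairs for the 1st of each month in
    [first, last], where first/last are (year, month) tuples.
    extract_features(ts) must only use data with timestamp <= ts."""
    year, month = first
    while (year, month) <= last:
        ts = int(datetime(year, month, 1, tzinfo=timezone.utc).timestamp())
        label = rating_at(rating_changes, ts + SIX_MONTHS)
        if label is not None:  # skip snapshots without a future rating
            yield extract_features(ts), label
        month += 1
        if month == 13:
            year, month = year + 1, 1
```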
Model training
After collecting and preparing all the data it is time to train the model. First I separated the users into training and validation groups, using an 80:20 split. The idea is that we only train on the training users and only evaluate on the validation users. The patterns the model learns in the training group should apply to the validation group, even though the model has never "seen" those users before (which is exactly what will happen in production).

We also separate training and validation chronologically: we only trained with data from [2021-01-01, 2023-11-01] and only validated on data from [2023-01-01, 2024-11-01]. We do this because we also want the model to be robust to time variation (the data used in production is going to be at least 6 months more recent than the data used to train, so if the model overfits to a time window it's not going to be so useful in production). Notice I had to cap the validation at 2024-11-01: the labels are always 6 months in the future, and 6 months after that is 2025-05-01, which is close to the current date (maybe I could have gone one more month, as I collected this data in June anyway, but it doesn't make that big of a difference).
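The split above (80:20 by user, plus a chronological filter) can be sketched like this. A sketch under my own assumptions, not the post's actual code: I hash the handle so the user split is deterministic, and I filter snapshots by their ISO date strings.

```python
import hashlib

def split_user(handle: str, valid_frac: float = 0.2) -> str:
    """Deterministically assign a user to 'train' or 'valid' by hashing
    the handle, so the 80:20 split is stable across reruns."""
    bucket = int(hashlib.sha256(handle.encode()).hexdigest(), 16) % 100
    return "valid" if bucket < int(valid_frac * 100) else "train"

def keep_example(group: str, snapshot_date: str) -> bool:
    """Chronological filter on top of the user split: training snapshots
    come from [2021-01-01, 2023-11-01], validation snapshots from
    [2023-01-01, 2024-11-01]. ISO date strings compare correctly."""
    if group == "train":
        return "2021-01-01" <= snapshot_date <= "2023-11-01"
    return "2023-01-01" <= snapshot_date <= "2024-11-01"
```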
Anyway, after doing all this work (yeah bro, data collection and processing is 90 % of the job) we are finally ready to actually train the model: simply feed (feature, label) pairs to a decision tree model, choose some hyperparameters, choose a loss function and let it do its work. I used CatBoost, by the way, for three reasons:
It's basically as good as XGBoost and LightGBM, the other state-of-the-art gradient-boosted decision tree libraries.
It has support for categorical features out of the box (so I don't have to one-hot encode things, which is annoying).
😸s (yes, I meant cats. This weighed a lot when deciding between 😸boost and LightGBM).
Also, I used mean absolute error as the loss function, as this is the thing I was trying to optimize in the first place (basically, how far off the model's predictions are on average). Finally, after 1 min 15 s of training (yes, that is how fast and cheap decision tree models are) we got a model that predicts a user's rating 6 months in the future with a mean absolute error of 65.15! By the way, how good is that?
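The training step boils down to a few lines of CatBoost. The hyperparameters below are illustrative placeholders, not the ones actually used; only the MAE loss and the native categorical-feature support come from the post.

```python
# Illustrative hyperparameters; the post only specifies the MAE loss.
CATBOOST_PARAMS = {
    "loss_function": "MAE",   # optimize mean absolute error directly
    "iterations": 2000,
    "learning_rate": 0.05,
    "early_stopping_rounds": 100,
}

def train_model(X_train, y_train, X_valid, y_valid, cat_features):
    """Fit a CatBoost regressor; cat_features lists the categorical
    columns (e.g. region) so no one-hot encoding is needed."""
    from catboost import CatBoostRegressor, Pool  # lazy import

    train_pool = Pool(X_train, y_train, cat_features=cat_features)
    valid_pool = Pool(X_valid, y_valid, cat_features=cat_features)
    model = CatBoostRegressor(**CATBOOST_PARAMS)
    model.fit(train_pool, eval_set=valid_pool, verbose=200)
    return model
```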
Caveats on rating prediction
Rating prediction is a fundamentally hard problem. Suppose a model was trying to predict my rating 6 months in the future, and that date happened to be January 3rd, 2025. My rating on that day was 2029, so a perfect model would predict this value, right? Now suppose I run another prediction just one day later, after basically nothing had changed. My rating 6 months in the future is now 1894. Had the model not changed its prediction (because basically nothing changed), it would have an error of 135, which is the rating I lost on Hello 2025. And it gets worse (or better for me): just 1 week later I recovered most of the lost rating in a single contest... As you can see, rating is super volatile (you have probably noticed that yourself anyway). So a model fundamentally can't get a mean absolute error as close to 0 as one wants (unless it predicts the future perfectly, in which case it would have better uses than predicting codeforces ratings).
That said, how do we even know if the mean absolute error of our model is good? Is it good because we think an average error of 65 ∆ is acceptable? Did the model even learn anything at all? We need a baseline. One obvious baseline is assuming your rating 6 months in the future is going to be exactly the rating you have now. This baseline gives a mean absolute error of... 72.34, which is not that far from our model in the first place. One may ask: is predicting one's rating 6 months in the future only 7 points better than the baseline even good? Well, it's actually hard to tell, because of the reasons outlined in the previous paragraph: one can never get even close to 0 because rating is super volatile. Most people have deltas above 50 in absolute value at some point (even excluding the first 5 contests), which means a model could easily get an error of 50 while being almost optimal. Let's say the optimal rating prediction model has mean absolute error $$$L$$$. We already know the baseline is $$$72.34$$$. This means a model can be at most $$$72.34 - L$$$ better than the baseline. If $$$L = 40$$$ then our model is far from optimal, as $$$72.34 - 65.15 = 7.19$$$ is a far worse improvement than $$$72.34 - 40 = 32.34$$$. That said, if $$$L = 60$$$ then we have a pretty good model, about as good as it is possible to have. Sadly it's not easy to estimate $$$L$$$, so we won't do that here; I just wanted to add this important observation.
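The "no change" baseline above is trivial to compute. A minimal sketch, assuming the evaluation set is a list of (current rating, rating 6 months later) pairs:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def baseline_mae(examples):
    """examples: (current_rating, rating_6_months_later) pairs.
    The baseline simply predicts 'your rating will not change'."""
    current = [c for c, _ in examples]
    future = [f for _, f in examples]
    return mean_absolute_error(future, current)
```

Run on the real validation snapshots, this is the computation that yields the 72.34 figure.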
How to improve your rating (according to data science)
Predicting one's rating is not the most fun part of using a machine learning model. I would say the best part is using the patterns the model learned to learn something yourself. For example, most codeforces' users are interested in increasing their rating, but what are people who increase their rating actually doing? Analyzing the shap plot of the trained model we see this:

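A SHAP summary plot like the one above can be produced in a few lines. A sketch, assuming the `shap` package is installed and `model` is the trained CatBoost regressor:

```python
def plot_feature_importance(model, X_valid):
    """Summary plot of SHAP values over the validation set: each dot is
    one example, colored by feature value, with features ordered top to
    bottom by overall importance."""
    import shap  # lazy import; pip install shap

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_valid)
    shap.summary_plot(shap_values, X_valid)
```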
Some important conclusions we can take:
People who have done lots of problems during rated contests in the last 3 months are the most likely to increase their ratings, while people who are doing almost no contests are likely to decrease. If you want to increase your rating you have to participate in more contests, and this is the single most important factor.
People with new accounts likely haven't converged, so there is still a lot of room for growth; while older accounts usually don't grow as much (completely expected).
The second most important thing you can do is solve problems with rating much above yours (see the third feature in the graph). Here "much above" means $$$ \text{problem rating} - \text{your rating} \gt 400 $$$. Of course solving 3000-rated problems while you are 1200 is not going to help (as you won't understand a thing), but the data suggests that solving 1700/1800-rated problems while being 1200 is the second most important thing you can do after contests.
I will comment about region separately.
The 5th thing is a bit confusing.
rating_delta_12m is actually $$$\text{rating 12 months ago} - \text{current rating}$$$. This implies that people who had a higher rating 12 months ago are likely to keep losing rating in the next 6 months, while people who improved their rating a lot in the previous 12 months are likely to keep improving. This means competitive programming is like the gym: you have to keep working out to keep your gains.
This one is similar to number 3, but instead of problems rated more than 400 above your current rating, this means $$$ 200 \lt \text{problem rating} - \text{your rating} \le 400 $$$. The overall advice here is to focus on doing problems at least 200 above your current rating to improve.
Similar to number 2. People who have done lots of contests have already converged.
This highlights how volatile rating is. If you had a much higher rating 5 contests ago, you are more likely to have a higher rating in the next 6 months. This means you likely lost rating due to volatility and will get it back soon; no need to worry that much.
Not much to comment.
Same as 8: people who had a higher peak are more likely to gain their rating back, as rating is highly volatile.
Point 5 is kind of contradictory with points 8 and 10, but I think they play distinct roles: if you used to be much better in the past you can easily improve back again, but chances are you will just keep losing rating slowly if you don't take action (again like the gym: it's easier to put muscle back on than to put it on for the first time, but you will keep losing it if you don't do anything).
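To make the "much above your rating" feature from point 3 concrete, here is a sketch of how such a counter could be computed. The tuple layout and helper name are my assumptions; the real feature definitions are in the code linked earlier.

```python
MUCH_ABOVE = 400                      # "much above" threshold from the post
THREE_MONTHS = 90 * 24 * 60 * 60      # lookback window, in seconds

def hard_acs_last_3m(submissions, user_rating, now):
    """Count accepted submissions in the last 3 months on problems rated
    more than 400 points above the user's rating.
    submissions: (unix_time, problem_rating, verdict) tuples."""
    return sum(
        1
        for t, prob_rating, verdict in submissions
        if verdict == "OK"
        and now - THREE_MONTHS <= t <= now
        and prob_rating - user_rating > MUCH_ABOVE
    )
```

The point-6 feature is the same counter with the condition changed to $$$200 \lt \text{problem rating} - \text{your rating} \le 400$$$.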
I believe you are smart and can analyze the remaining features yourself, but be aware they get less and less important down the list. Before moving on, why is region so high? After some data engineering I separated the countries according to their participants' competitive programming performance and cultural similarity (or whatever was going on in my mind; this was not exactly scientific and I asked ChatGPT to do most of the work). The mapping between country and region can be found here. At the end we can see how each region influences your predicted delta rating in 6 months:

First of all, sorry if this offends anyone; it was not my intention. Second: people from countries with lots of good competitors ("Super nerds" is China, USA, Japan, South Korea, Russia, Hong Kong, Taiwan and Singapore) have only a slight advantage in improving their rating compared to others. In the end the only real difference is whether or not you set your country on codeforces (I believe this is kind of an engagement thing: if you engage with codeforces you are more likely to keep improving) (mind you, correlation ≠ causation).
Training a foundational model to get SOTA performance
Finally, the thing I promised and did not deliver. How could you build a foundational model over codeforces data to train a state of the art rating prediction model? First you must collect all the available data from a lot of users (like I did and described in the data collection section). For each user, order their data by time and train a language model to predict which event the user will do next. The idea here is that the model will only predict the next event relatively well if it has some understanding of what is actually going on (think of it as a world model of what happens on codeforces). If the model "understands what is going on", the embeddings generated by that model over some user's data hold deep information about the user. Those embeddings can be fed to a decision tree model to predict a user's future rating. The pipeline looks like this:
Notice that the embeddings generated by the foundational model can be used in a lot of distinct downstream tasks (a downstream task is basically another task that consumes the output of this one, like the flow of a river). That is why those models are called foundational: they hold some deeper level of understanding of the data that can be used in lots of distinct tasks.
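The two-stage pipeline described above can be sketched at the interface level. Everything here is hypothetical structure, since the foundational model was never built: `encode` stands in for the sequence model producing an embedding, and `tree_model` for the downstream decision tree.

```python
def predict_rating(events, encode, tree_model):
    """Two-stage pipeline: a foundational sequence model encodes a
    user's chronologically ordered event history into an embedding,
    then a decision tree model maps that embedding to the predicted
    rating 6 months out.

    events:     list of dicts, each with at least a "time" key
    encode:     events -> list[float]  (the foundation model, assumed)
    tree_model: object with a .predict(features) method (e.g. CatBoost)
    """
    history = sorted(events, key=lambda e: e["time"])  # chronological order
    embedding = encode(history)
    return tree_model.predict(embedding)
```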
Anyway, I didn't do that because of two details: time and money 😅. The main disadvantage of this method is that it's hard to draw the conclusions we did in the section "How to improve your rating" or do model explainability in general. The model could be severely biased and you would never know. The main advantage is state-of-the-art performance + no feature engineering.
Anyway, hope you guys liked the blog. Comment what rating the model predicted for you, so we can check later if it is doing well. Hope the new wave of geniuses that came with O3 pro is not going to impact the ratings a lot.
Update: added used data to kaggle (so you don't have to abuse codeforces' servers like I did): https://www.kaggle.com/datasets/fbrunodr/codeforces-users-data