As you may (or may not) know, OpenAI recently claimed that their latest o3 model can reach a staggering Codeforces rating of around 2700, posing a serious threat to the integrity of online CP contests.
My friend and I even had a debate about whether or not this would put an end to CP in the next decade, and we concluded that, as advanced LLMs become widely available to the public, traditional CP formats like Codeforces contests won't stand the test of time.
Imho, no cheating detection mechanism is sufficient to catch cheaters who actually know what they're doing. For example, they can ask an LLM to produce the textual solution and step-by-step instructions on how to implement it, thereby avoiding any suspicion. The point is, mass cheating with LLMs is inevitable, so the only real solution is to mitigate the effects of cheaters on people who just want to grind CP and have a good time.
One possible way to tackle cheater-induced rating inflation came to mind:
- Suppose that future LLMs consistently perform at GM level. We'll devise a fine-tuned model to take in the problem statement and evaluate the expected difficulty of that problem (from the human perspective) along with some potential solutions.
- Codeforces coordinators and problem-setters can later refine the output to reduce overfitting and biases, ensuring that the expected difficulty matches the real distribution.
- Since cheaters can distort the final standings and make them unreliable, we can switch to a new system where individual contest performance is calculated from the expected difficulty of the problems a participant solved, instead of from their relative standing. In other words, we'll switch from:
where $$$P$$$ is the contest performance and $$$d(...)$$$ denotes the expected difficulty of a problemset.
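To make the idea a bit more concrete, here is a rough sketch of one way $$$d(...)$$$ could be computed for a set of solved problems. Everything in it is my own assumption for illustration: the names `solve_probability` and `problemset_difficulty` are hypothetical, the Elo-style logistic curve with a 400-point scale is just a convenient choice, and "the rating at which a participant solves the whole set with probability 0.5" is only one of many possible definitions.

```python
from math import prod


def solve_probability(rating, difficulty):
    """Assumed Elo-style chance that a participant with `rating`
    solves a problem whose expected difficulty is `difficulty`."""
    return 1.0 / (1.0 + 10 ** ((difficulty - rating) / 400.0))


def problemset_difficulty(difficulties, lo=0.0, hi=5000.0, iters=60):
    """Hypothetical d(S): the rating at which the probability of solving
    every problem in S is 0.5, found by binary search on the rating."""
    if not difficulties:
        raise ValueError("need at least one solved problem")
    for _ in range(iters):
        mid = (lo + hi) / 2
        # The product grows with rating, so the search is monotone.
        if prod(solve_probability(mid, d) for d in difficulties) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Performance would then depend only on what was solved, not on standings.
performance = problemset_difficulty([800, 1200, 1500])
```

With this definition, a single solved problem gives back its own expected difficulty, and every extra solved problem pushes the value up a little. No other participant's result enters the computation, which is the property the proposal is after.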