Hi :)
I am sure you have recently encountered a bunch of blogs about AI cheaters (e.g., this one), and you may have mixed feelings about it: anger, fear, or hopelessness. This blog is an extended version of my comment, in which I try to propose some feasible, small changes that may save online CP (for now).
Changes
In short, I describe some changes enabling a skill-based competition in which using AI is allowed. These changes make problems harder for (current) AI and make the human contribution more prominent.
No testing results during the contest
This is not a new idea at all; in fact, this is how Topcoder used to work back in the day, and even now, COCI works like this. There is a spectrum of how much feedback contests reveal, from old Topcoder or COCI, which show you nothing besides compilation and sample tests, to IOI or AtCoder contests, which show your result on every test (or subtask). Thinking of corner cases, writing generators, and debugging entirely by yourself is a skill that (current) LLMs are not very good at. A simple piece of evidence: they needed 10K submissions on IOI to reach a gold medal, while any gold medalist (or even any medalist) can write a simple generator and validator for their solution (especially if they code as fast as an LLM) and get effectively unlimited local "submissions". This is a sign of how LLMs are simply hacking CP as a search problem.
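The generator-plus-validator workflow described above can be sketched as a simple stress-test loop. The two solve functions here (maximum subarray sum via brute force vs. Kadane's algorithm) are hypothetical stand-ins for a contestant's slow reference solution and fast solution under test:

```python
import random

def brute_force(a):
    # Slow reference solution: O(n^2) maximum subarray sum, trivially correct.
    return max(sum(a[i:j]) for i in range(len(a)) for j in range(i + 1, len(a) + 1))

def candidate(a):
    # Fast solution under test: Kadane's algorithm, O(n).
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def generator(rng):
    # Small random tests: short arrays with small values hit corner cases quickly.
    n = rng.randint(1, 8)
    return [rng.randint(-10, 10) for _ in range(n)]

def stress_test(trials=1000, seed=0):
    # Run both solutions on random inputs and report the first mismatch, if any.
    rng = random.Random(seed)
    for _ in range(trials):
        a = generator(rng)
        expected, got = brute_force(a), candidate(a)
        if expected != got:
            return f"FAIL on {a}: expected {expected}, got {got}"
    return "OK"
```

A medalist who can write this loop in a few minutes gets, in effect, unlimited private "submissions" before ever touching the judge.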
Not showing the limitations (or showing a part of it)
This is a bit odd and annoying, but I have spent quite a lot of time reading LLMs' chains of thought and comparing them to mine. I think they are not good at complexity analysis (check out my experiences 1, 2, 3; they are a bit unrelated, but they show how poorly LLMs perform at analyzing and validating time complexity). So if a problem does not state the limits for some (or all) of the parameters and instead asks the contestant to solve it with the best complexity (or complexities) they can, skilled people with AI will outperform others.
Harder problems
Back when I started CP, there were only Div 2 and Div 1, so I will talk only about them. Based on this paper, even current strong models perform awfully in Div 1 contests. They are fast and strong on Div 2 or classical problems (with no human help, of course), but harder problems are out of their reach (another clue of search hacking). A simple change would be to turn Div 2 into Div 1 and Div 1 into Div 0 (i.e., starting from Div 1 B or C). This may sound cruel to beginners, but with good use of AI during the contest, green coders might get D1A accepted within 2 hours (maybe). Also, I always hated that I never had enough time to think about D1C or D1D because I was too occupied with implementing D1A and D1B; with this change, people have more time to think about hard problems during contests.
New problem scoring system
I believe the Codeforces scoring system is designed to reward fast coders more than problem solvers. For example, in this Div 1 contest, if you solve B in the first minute without wrong attempts, you get a higher score than a person who solves G in the last minute with a couple of wrong attempts. And because LLMs are fast, they will beat all of us in any Div 2 contest. Making the scoring system like AtCoder's (where a problem's score does not decrease during the contest, but a penalty is charged for wrong submissions), and even making scores grow exponentially (not linearly) with the problem index, would ensure that a person who can solve a very hard problem always ranks above fast solvers of easy problems (which is all that (current) LLMs are).
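To illustrate the proposed scoring, here is a minimal sketch in Python; the base score, growth factor, and penalty values are hypothetical placeholders, not a concrete proposal:

```python
def problem_score(index, base=100, growth=2.0):
    # Score grows exponentially with the problem index, so solving problem
    # k+1 outweighs solving all of problems 0..k combined.
    return int(base * growth ** index)

def contest_score(solved_indices, wrong_attempts, penalty_per_wa=50):
    # AtCoder-style: score never decays over time; wrong attempts cost
    # a flat penalty instead.
    score = sum(problem_score(i) for i in solved_indices)
    return score - penalty_per_wa * wrong_attempts
```

Under these (made-up) numbers, solving only G (index 6) with two wrong attempts gives 6400 - 100 = 6300 points, while instantly solving A and B (indices 0 and 1) gives only 300, so the hard-problem solver always ranks higher regardless of speed.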
Rolling out
My idea is not to change Codeforces overnight but to start testing these (or similar) small changes in unofficial contests where people are allowed to use LLMs. We see how it goes; if it is too easy or too hard, we adjust. If everything goes well, everybody could have a separate rating for AI-allowed contests (like Topcoder's separate ratings) (and maybe people's names would be displayed in two colors :D). It will be a long process with a lot of trial and error, which I think CP deserves.
Why is change needed
Cheaters and bots are everywhere, and I won't bet on AI code detection, because it is getting harder to distinguish human-generated code, text, voice, art, etc., from AI-generated. After all, that is exactly what these models have been trained to do: to be like us!
From another point of view, I see using AI as a skill. If you don't think it is a skill in CP, just check this blog and see how ingeniously people can use these tools.
Why it might work
CP is a search problem of size roughly $$$2^{50K}$$$ (I am sure much better approximations can be made). Any code generator with infinite time beats tourist (and, transitively, all of us).
Current models without chains of thought are only capable of solving problems close to their training distribution. With test-time compute they can do impressive things, but it gets exponentially harder for them to go a little deeper (check out ARC-AGI). So a simple countermeasure is to make the search domain bigger and harder: it will stop them for quite some time.
I know AI is getting better and better, and trillions of dollars have been spent on it and will be spent! But as long as the human mind has any advantage over models, we should be able to make a version of CP that measures that human ability.
Regulations
I guess there should be some regulation on which AI tools people can use in these types of contests, but (for now) skilled people with free models do much better than unskilled people with huge models. And cheating by spending money is the best kind of cheating, because it does not scale to thousands of people.

You shouldn't change CF; develop anti-cheat systems instead. It's as simple as phone verification, ID checks, and an anti-cheat app; that's all that's required to stop more than half the cheaters.
I don't see how this helps anything. LLMs are already good at solving problems in the first submission. And certainly LLM users will be able to write generators and checkers much faster than human users. And not showing the limits, really? What if the problem has N, M, K, author intends N to be large, while users think maybe it's M and spend forever solving with a wrong complexity? How do you know if you should constant optimize your solution vs. second-guessing your complexity? This all sounds like it would ruin the user experience while not even doing anything against AI usage.
Your citations are also very outdated. 10k submissions per problem was with o1. o3 is 900 elo higher. CodeElo bench hasn't been updated for months and has none of the SOTA models.
In the described contests, everyone is allowed to use LLMs, there are no pure human vs LLM cheaters.
I didn't find any public resources describing how o3 achieved this ELO; if you have found new public resources, please share.
it's 2700 elo
https://arxiv.org/pdf/2502.06807
only 50 submissions per problem were needed for gold medal performance on ioi
Thank you for sharing the paper. Based on the paper, the gold medal and ~2700 Elo are achieved with pass@10 over 1K ranked samples. That is, the model generated 1K solutions, ranked them at test time, and submitted the top 10 (for IOI, the top 50), reaching the reported rating.
If the first idea is applied and each participant has to choose one solution per problem, the model's performance drops to pass@1, which is much worse than ~2700 (you can check it in Table 1).
As a reminder, generating 1K o3 solutions would also be very expensive.
But "fair again" sounds like it will just be people herding AI...
I hope it won't be like this; at least it's worth testing in unrated contests.
Let me tell you a story. Once, there was a game called chess. And some games were played over many days or even months, with the moves sent over letters and later emails. This was called correspondence chess. And there were no chess-playing programs in the beginning, and all was well. Then there were public chess engines available which were quite strong, but not as strong as the strongest players. And there was no way to prevent their use in correspondence chess, so it was permitted in some contexts and disallowed in others (ineffectively). But the humans still had to think, because the engines were weak and did not have positional knowledge. So life went on. And one day the engines became stronger than all humans in normal time controls. But the engines still lacked some positional knowledge and intuition, so blindly copying the engine moves would not always prevent loss; the humans still had to think, and the best humans were still differentiated from each other. And one day the engines became much stronger than all humans. And today there is very little correspondence chess, because they have gotten so strong to the point that blindly copying the engine moves will never bring a loss, so all games are draws, barring mistaken inputs.
What if Codeforces contests could somehow be proctored?
Unfortunately, the second and third points make little sense. Not showing the limits will be both confusing to contestants and not really effective, as it's easy to brute-force: just ask an LLM to generate a solution for $$$n=10^{18}$$$; if that doesn't make sense, go down to $$$n=10^9$$$, etc. And the third one basically just assumes harder problems will always remain unbeatable, which is an assumption that isn't obviously true to me.
The first one should be pretty effective though.
Part of what has to be done certainly involves taking cheating more seriously in general — e.g. making ban evasion a valid reason to ban someone (just check today's blog about a red cheater, where they were immediately able to create a new account to comment with no punishment). Plagiarism checker needs to be way more aggressive, reporting users should be an easily accessible feature, and I would probably argue for a mandatory PC monitoring software during contests (people do play games with anticheats, after all).
The second one is almost useless if the first one is not applied. But if it is, then it makes the problems way harder for LLMs to hack. If you show the limits, they will always somehow end up with a solution that they believe has the proper complexity; and if the first change is applied, you have no way to check whether it is correct other than reading it yourself and generating big tests alongside small ones to be sure.
I am not suggesting doing this for all problems, but it is a tool for designing harder, hard-to-hack problems with minimal effort from problem setters.
You can check things with your eyes, it's not like cheaters are unable to read code. Also, making a contest marginally harder for LLMs while also making it an absolutely miserable experience for everyone is a poor strategy.
There's a lot of organizational stuff that can be done first. Chess didn't reinvent the rules to combat cheating, and neither should we.
Other than the scoring changes, which I understand and agree with, this blog suggests some wild (and bad) changes.
You are not proposing to make the contest hard for LLMs, but rather to make it hard for humans. I would have to stress test every problem and make sure it passes before submitting. This is not something CF currently encourages, and I do not think a lot of people would enjoy it. (Except for Chinese NOI enjoyers, who exist, I guess, but are a very small minority.)
You made a small bug? Great, you now get 0.
What the fuck? How am I supposed to know whether my level of solution is good enough for the problem or not? N could be up to 20, indicating a non-polynomial brute force, or maybe it's up to 10^5, indicating a linear approach. The two problems are very different?? This is extremely unreasonable in any contest shorter than 7 days (where one could reasonably try to solve as optimally as they can).
No. Just no. You want to save Codeforces by killing 99% of its competitors. I guess it works if your only aim is the top 1%.
And people don't solve Div1A with AI. The AI solves it while they look at its brilliant chain of thought.
On a final note, please remember what makes it fun for most people. The changes you propose would absolutely reverse that.
I like the "no testing results during the contest" suggestion and was thinking of making a blog about it. It would greatly punish those whose main method of solving is guessing, and I'm all for it.
As for the scoring changes, they are very much needed. This situation is like what happened with AlphaStar, imo (my read on that is that the AI wasn't necessarily better at strategy, just better at exploiting its inhuman reaction and input speed), so now more than ever there's a need to give less importance to solving fast and more to solving more.
"No testing results during the contest" changes nothing except that it makes me implement a stress test every time, because for anything beyond Div2A, even if I have a proof, I will write a stress test. And thus the situation becomes the same whether I guess or not, since I can just as easily verify my guess. The only difference: I need an extra 5-10 minutes per problem writing a stupid brute force.
Read the research paper evaluating o3. They gave it a rating of 2700 after taking the median score over all solved problems, not the actual score it would have gotten. It beat me in nearly every single contest from that period.
I think the proper title is
Infeasible drastic changes aimed towards saving CP from AI (which fail)