AI language models and cheating in online contests

Hello, Codeforces!

I am one of the moderators of Luogu, the largest competitive programming online judge and online community in China. We hold online contests frequently.
To uphold academic integrity, we run an anti-plagiarism system to catch cheaters after every contest.
Traditionally, every case of cheating looked like “person A takes person B’s code and submits it directly in the contest”. But as more and more people learn how to use ChatGPT and other AI language models, the situation is becoming more complex.
Now, after every sufficiently easy contest (we also have Div. 3 and Div. 4), the easy problems that ChatGPT is capable of solving receive a number of AI-generated submissions.

I would like to ask other contest-hosting platforms: what is your solution to this problem? How should we deal with these AI-generated submissions? Should we regard them as cheating and disqualify the submitters? If so, is there a risk of many false positives, where participants wrote their code themselves but are misjudged as having used AI?

Here on Codeforces, does anyone know what Codeforces does? Or what AtCoder does, given that the first few problems of AtCoder Beginner Contests are easy enough? If you organize contests on another online platform and have some experience, please comment below! Thanks to everyone who discusses constructively!

Comments (40)

FiniteMoves

3 years ago

+67

Off-topic: why not make Luogu an international platform, just like AtCoder? AtCoder was also only for Japanese users at first, and then it became an international platform.

thelastdinner

3 years ago

-51

In fact, people from anywhere can use Luogu at www.luogu.com.cn.

PsychoPinkQ

+85

but they probably can't read Chinese LOL

PinkieRabbit

+77

Actually, Luogu, as a company, has plans to go international, but this depends on our CEO’s decision; I’m just a student moderator. Because the community and user base would change completely overseas, the plan is currently suspended. You may not know that Luogu’s CEO has visited Japan, where he had a friendly conversation with AtCoder’s CEO, Takahashi-kun (his handle is chokudai on Codeforces), and asked him for advice.

stash

chokudai

liboya5074

21 months ago

+11

The Luogu dev team is working on it.

The new international site is at https://www.luogu.com, but it is only half done so far.

VLamarca

+55

My opinion: using AI should not disqualify users. Why bother with it? Too much work with false positives. Let's just hope AI does not get good enough for Div1/Div2 :)

t0uris

+15

PinkieRabbit, why don't you make it English??

incra

2 years ago

It's going to be English.

EnDeRBeaT

+50

I don't think Codeforces does anything about AI-generated code, because its rules state that generated code is allowed if:

the code is generated using tools that were written and published/distributed before the start of the round.

And ChatGPT definitely counts as one of those tools.

+27

That’s very convincing; this rule is very tolerant. Though I wonder whether this “pre-ChatGPT” rule should be considered outdated now?

Also, on Luogu we have a different rule:

比赛期间，选手可以使用自己在比赛开始前编写好的代码；禁止使用他人编写的代码，无论这些代码是否在比赛前编写完成

which means “during the contest, participants may use code they wrote themselves before the contest began; using code written by others is prohibited, regardless of whether that code was written before the contest.”

This rule forbids participants from directly using code templates written by others (tourist’s template, for example, unless you, the reader, are tourist himself). Actually, USACO has similar rules. Whether or not you agree with this rule, Codeforces’s rule and its consequences are obviously not applicable here.

+16

Even if the rule is out of touch with current reality, there is little you can do about ChatGPT. Tools that predict whether a text was AI-generated have existed for half a year, and what? They flag the US Constitution or the Bible as AI-generated. They are very unreliable: any semi-official text is likely to be labeled as AI-generated, just because of how slick ChatGPT is.

It will probably be worse for code, since there are far fewer individual quirks in programming and they are much more subtle; training a model to distinguish them is a hell of a job.

+19

The situation is that ChatGPT may give very similar code for similar prompts, and that triggers our anti-plagiarism system, which makes things very subtle. Caught participants may complain that they weren’t copying others’ code, but ChatGPT’s. Though if we interpret the word “others” as including ChatGPT, those participants have no argument.

Zain__Mansour

It's not a problem. In Div. 3 and Div. 4, most people and AI models can solve A, B, C, but only a real person can solve D, E, F. In my opinion, the problem setters can add a small trick to the problems that makes the AI models unable to solve them.

Tom66

Make problems difficult enough, or have them contain corner cases (or something else), so that ChatGPT won't get AC.

mt19937

+14

The sole purpose of such rounds is to encourage newcomers to take part in contests.
I don't see making them harder as a good solution, but maybe that's the best we have...

Perpetually_Purple

+43

If chatgpt can solve the problem correctly then even a monkey can.

+23

That is not the case. Today a Luogu Monthly contest (Div. 1 and Div. 2) ended. Problem B of Div. 2 is:

You are given a sequence $$$a_1, a_2, \ldots, a_n$$$ of length $$$n$$$ ($$$1 \le n \le {10}^5$$$) consisting of non-negative integers ($$$0 \le a_i \lt 2^{20}$$$).
If a subsegment $$$a_l \sim a_r$$$ ($$$1 \le l \le r \le n$$$) satisfies $$$\bigoplus_{i = l}^{r} a_i = 0$$$ (that is, the XOR-sum of the subsegment is zero), then it is called a good subsegment.
In one operation, you can choose a good subsegment $$$a_l \sim a_r$$$ and reverse it, i.e., the subsegment becomes $$$a_r, a_{r - 1}, \ldots, a_{l + 1}, a_l$$$ after the operation.
You can perform the operation zero, one, or multiple times. The goal is to maximize the number of good subsegments in the end.

And ChatGPT solves this problem.

You can think of the problem for a while. I don’t think this problem is easy enough for a monkey to solve.
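
For readers checking their reasoning on the problem above, here is a minimal sketch in Python (not the official Luogu solution). It relies on one observation: with prefix XORs $$$p_0 = 0$$$ and $$$p_i = a_1 \oplus \ldots \oplus a_i$$$, a subsegment $$$a_l \sim a_r$$$ is good exactly when $$$p_{l-1} = p_r$$$, and reversing a good subsegment only reverses the order of $$$p_{l-1}, \ldots, p_r$$$, whose two endpoints are equal. The multiset of prefix XORs therefore never changes, so the answer is simply the number of good subsegments in the original sequence. The function name `count_good_subsegments` is an illustrative choice, not taken from the contest.

```python
from collections import Counter
from itertools import accumulate
from operator import xor


def count_good_subsegments(a):
    """Count subsegments of `a` whose XOR-sum is zero.

    A subsegment a[l..r] has XOR zero iff prefix[l-1] == prefix[r],
    so we count pairs of equal prefix XOR values.
    """
    prefix = [0] + list(accumulate(a, xor))  # prefix[i] = a[0] ^ ... ^ a[i-1]
    counts = Counter(prefix)
    # Each value occurring c times contributes c * (c - 1) / 2 pairs.
    return sum(c * (c - 1) // 2 for c in counts.values())


# Reversing the good subsegment [1, 2, 3] leaves the count unchanged.
before = [1, 2, 3, 0]
after = [3, 2, 1, 0]
print(count_good_subsegments(before), count_good_subsegments(after))
```

This runs in linear time, which fits $$$n \le 10^5$$$ comfortably.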

TwentyOneHundredOrBust

+73

I just fed this to chatgpt. It threw up a bunch of code and a completely nonsense explanation and because the problem is a trick question it just happens that it's right...

Maybe the lesson is not to give trick questions?

If you are worried, why not just check if the problem is solvable by chatgpt before the contest? You can use the api, which they claim is not used for training data and deleted after 30 days. (Of course if someone works at openai and knows your account that's another problem.)

Golovanov399

Can a language model do an anti-plagiarism check?

randomalt194738294

I am not sure if that is possible with code

How do you think it is done now?

There even was a contest where participants had to write their own anti-plagiarism check: https://mirror.codeforces.com/contest/537

When it is generating sentences, it is easy to detect, but when someone generates code they can easily change things, so detection wouldn't be accurate. Plus, they can still get the idea from GPT and code it on their own.

abhishekJr

PinkieRabbit, would you share which algorithm you implement in the anti-plagiarism module? I googled but could not find a promising solution. We can't use Jaccard or cosine similarity, as they do not capture token order, which leads to more false positives. I could only think of LCS on tokenized code submissions. How do you deal with false positives? There are thousands of submissions, and there is a high chance of two users coming up with the same sequences.
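
The LCS-on-tokens idea mentioned above can be sketched as follows. This is only an illustration under assumptions, not Luogu's actual anti-plagiarism algorithm: the tokenizer, the tiny `KEYWORDS` set, and the score `2 * LCS / (len_a + len_b)` are all hypothetical choices. Identifiers and numbers are collapsed to placeholders so that renaming variables does not lower the score.

```python
import re

# Hypothetical, deliberately small keyword list; a real tool would use
# the full keyword set of the submission language.
KEYWORDS = {"int", "long", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split code into tokens, normalising identifiers and numbers."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if tok in KEYWORDS else "ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)  # punctuation and operators kept as-is
    return tokens


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(code_a, code_b):
    """Score in [0, 1]; 1.0 means identical normalised token sequences."""
    ta, tb = tokenize(code_a), tokenize(code_b)
    if not ta and not tb:
        return 1.0
    return 2 * lcs_length(ta, tb) / (len(ta) + len(tb))


print(similarity("int a = 1; return a + b;", "int x = 2; return x + y;"))
```

The DP is quadratic in the token counts, so comparing every pair among thousands of submissions this way would likely need a cheaper pre-filter first. A high score alone does not settle the false-positive question raised above, since short solutions to easy problems naturally look alike.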

BELEB

-13

I've been contemplating sharing my observations on this topic, and I believe it's time to share them anyway.

I have conducted several experiments using LLMs to assess their performance in various programming contests. Among the models I tested, GPT-4 has consistently stood out, hence I predominantly utilized it. To enhance its capabilities, I integrated an agent on top of GPT-4 that automates the processes of code writing, testing, and evaluation. Additionally, I employed prompt engineering to simulate a chain of thought.

The results have been quite astounding. I allowed the model to participate in three contests autonomously, and it achieved impressive scores of +1900 and +1600 in Div. 3 and Div. 4 contests, respectively (it fully solved a Div. 4). I've also observed that the manner in which the problem statement is presented to the model significantly impacts its performance. On occasion, minor interventions are required to guide the model to the correct solution.

For those interested in replicating my findings, I recommend using ChatGPT Plus combined with a code interpreter. The outcomes should be commendable. As a side note, my current configuration of the model has a rating of +2000 on LeetCode and even achieved a +2400 performance rating in one of the rounds.

I'm sharing this to highlight the potential of such models. While many believe these models don't make a significant difference, I speculate that an advanced model beyond GPT-4 could potentially solve even Div1 problems.
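
The write, test, evaluate loop described in the comment above can be sketched roughly as below. This is not the commenter's actual agent: `generate` is a hypothetical stand-in for a model call that receives the failing samples from the previous attempt and returns a new candidate, and candidates are modelled as plain Python callables from input text to output text rather than as submitted programs.

```python
from typing import Callable, List, Optional, Tuple

Solver = Callable[[str], str]        # input text -> output text
Sample = Tuple[str, str]             # (input, expected output)
Failure = Tuple[str, str, str]       # (input, expected, actual)


def failing_samples(solver: Solver, samples: List[Sample]) -> List[Failure]:
    """Run the candidate on every sample and collect the mismatches."""
    failures = []
    for given, expected in samples:
        try:
            actual = solver(given)
        except Exception as exc:  # a crash counts as a failed sample
            actual = f"<raised {type(exc).__name__}>"
        # Compare token-wise so trailing whitespace does not matter.
        if actual.split() != expected.split():
            failures.append((given, expected, actual))
    return failures


def solve_with_feedback(
    generate: Callable[[List[Failure]], Solver],  # hypothetical model call
    samples: List[Sample],
    max_attempts: int = 3,
) -> Optional[Solver]:
    """Ask for candidates until one passes all samples or attempts run out."""
    feedback: List[Failure] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = failing_samples(candidate, samples)
        if not feedback:
            return candidate
    return None
```

Passing the samples is of course weaker than getting AC on the judge's hidden tests, which is part of why the replies ask for a recorded live run.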

Congrats, this is the best troll I have read in weeks :)

This is one of the reasons I didn't want to post such a comment. I'm not here to argue, but anyway, I will record a screencast of GPT-4 solving a Div. 3 or Div. 4 contest and share it publicly, and we'll see if it's a troll.

+44

How long does it take to solve before there is no improvement? A couple of minutes at most? If you can show video of gpt4 autonomously full solving a div3/4 in live contest in 15 minutes (and not "I solved the problem in my brain as a human and now I'm handholding it through writing the code") I'll gift you 20 bucks.

I don't recall claiming that it can solve the contest in 15 minutes. Are you now challenging whether it can solve the problem, or are you questioning the speed at which it can solve the problem? Either way, at least the last time, it managed to ace a Div4 contest in around an hour.

+34

Ok, then just show me it autonomously full solving the live contest with no time limit and I'll gift you 20 bucks. Also claim your 1000 free citations and international fame.

Watergirl

+25

proof or it didn't happen

No, really. You see OpenAI themselves saying that GPT-4 got a whopping 400 rating on Codeforces, and now someone comes claiming it gets to 1600+ rating? I don't (necessarily) accuse you of lying, but such bold claims must be supported by equally robust evidence.

I agree that these are bold claims that require proof, but I don't think OpenAI is interested in or investing time in making their model more optimal for programming contests. As for the 400 rating, I'm sure it's based on the raw output from OpenAI's API. If you understand my comment, you'll see that I'm not claiming it always provides the correct solution on the first try. However, in many cases, the way the user feeds the model samples and prompts helps the model debug or fix the code. In any case, I will post the proof as soon as I can.

+13

I admit that I was too quick to judge. Though, it still sounds to me as too good (or too bad?) to be true, so I'm eager to see your results in more detail.

Ok... I don’t think this is reliable...

My main concern is that ChatGPT may give different participants similar code, and that will be caught by the system. In this case, should we accept participants’ appeals claiming that they weren’t cheating and were just using AI instead?

Hk16

PinkieRabbit, there is a YouTuber who uploads solutions as YouTube Shorts during contests. Many cheaters have taken help from them. Can't we stop them?

You should contact MikeMirzayanov for this.

Block_Cipher

I guess you could use an AI model to check a particular contestant's submission against how he/she wrote code in previous submissions, including all the previous contests he/she has taken part in.

DON_F

ChatGPT can solve some problems that most people can solve, but I think in the future it will be able to solve difficult problems that need advanced techniques, so we must resolve these issues before that happens.

disponat

My opinion is that using language models to generate code is not cheating. They are legitimate tools for code generation, even if their performance is lacking at the moment. All they do is lower the "skill floor" of programming.

If you think about it, higher level languages like python already do that: you can write powerful code without thinking about memory allocation or integer overflow or ...

They can be used just as well in contests as in real-world problems outside contests.

aviralarpan3301

2 years ago

+273

Thanks PinkieRabbit for taking the steps to prevent cheating, truly amazing!

PinkieRabbit's blog