A Codeforces dataset? Thoughts and Discussions

Today I stumbled upon Mike’s recent blog post, and I want to start by thanking him for maintaining such an incredible platform. As Mike notes, it’s surprisingly hard to stop people from crawling Codeforces, and this got me thinking…

Why are people crawling Codeforces anyway?

My guess is that, with the boom in LLM research, many groups are scraping Codeforces to train cutting-edge AI models. For example, I found at least five recent arXiv papers whose authors appear to have used Codeforces data:

Is CAPTCHA really the answer?

Is it bad if researchers use Codeforces data for AI?
I mean, research is ... research... It ultimately brings benefits to humanity. Many of those papers actually complain that they only got partial problem sets (no checker code, missing statements, etc.), which arguably hinders AI research rather than helps it.
Can CAPTCHA stop a frontier lab’s LLM agent?
If these groups really want the data, my hunch is that no CAPTCHA—however elaborate—can keep them out.
Can we stop the development of AI to prevent AI cheating?
Well, you can try. I am not too worried about AI cheating, because there is no way to stop it. Sooner or later, AI is going to ace competitive programming. The question is: should we try to stop what is inevitable? Or should we, as a community, contribute our accumulated intelligence fossil to the development of AGI?

An alternative: an official dataset

What if Codeforces released a curated dataset—complete problem statements, checker code, tests, metadata—under a clear license for research use?
- Researchers would get everything they need, without resorting to ad-hoc scraping.
- Codeforces could control access, ensure credit, even track citations.
- CAPTCHAs and other blockades could be simplified or removed.
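
To make the idea a bit more concrete, here is a rough sketch of what a single record in such a dataset might look like. Every field name, the `research-only` license label, and the JSON-lines layout are hypothetical choices made for illustration; Codeforces does not publish a schema like this.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class TestCase:
    input_data: str
    expected_output: str


@dataclass
class ProblemRecord:
    # All field names are hypothetical; this is not an official format.
    problem_id: str        # e.g. contest id plus problem index
    statement: str         # complete problem statement
    checker_source: str    # source code of the checker
    tests: list[TestCase] = field(default_factory=list)
    metadata: dict[str, object] = field(default_factory=dict)
    license: str = "research-only"  # placeholder license label


def to_json_line(record: ProblemRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


# A made-up example record, not a real Codeforces problem.
example = ProblemRecord(
    problem_id="0000A",
    statement="Given two integers a and b, print a + b.",
    checker_source="// exact-match checker (placeholder)",
    tests=[TestCase(input_data="1 2\n", expected_output="3\n")],
    metadata={"rating": 800, "tags": ["math"]},
)
line = to_json_line(example)
```

Shipping the statement, checker, tests, and license together in one record is the point: it is exactly the combination the papers complain they cannot get from scraping.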

But wait! This is just robbery?

I also realize the following logic stands:

Why should Codeforces give away its data just because researchers want it? Isn’t that akin to saying I want to steal your money for my research because: 1. Research is good. 2. I can steal it if you don’t let me have it.

Fair point, and nobody should question that logic. But in practice, both points are simply true, and they are already "stealing". So why not offer a reasonable, license-based middle ground? For example:

- Enterprise licenses for large companies (at market rates).
- Academic licenses at reduced cost or free for non-commercial research.

That way, everyone wins: Codeforces retains control and funding, researchers get reliable access, and the arms race of ever-more-elaborate scraping is defused. But maybe I overlooked something here.

I’m curious to hear what the community thinks. Is an official research dataset a viable compromise? Or is there another solution I’ve overlooked?

(I used GPT-4o to polish my writing.)

Comments (11)

PalintR

9 months ago

-58

it's 7 AM, what are you doing now

kostka

+78

From Codeforces Terms and Conditions:

You are expressly and emphatically restricted from all of the following:

using this Website in any way that is, or may be, damaging to this Website;
using this Website in any way that impacts user access to this Website;

So without explicit permission, in my opinion, scraping is against the terms and conditions. Cloudflare protection to prevent automated data collection is an example of measures against that.

Is it bad if researchers use Codeforces data for AI?
I mean, research is ... research...

Sure, then please ask Facebook, TikTok, or Google to share their datasets and/or algorithms.

I kinda want to see Codeforces suing all these scrapers.

I can steal it if you don’t let me have it.

Amazing attitude.

WuHongxun

9 months ago

But that’s exactly the unpopular point I’m making here.

If there is a group of thieves that we cannot stop, should we keep fighting them or should we prioritize our lives and find other ways?

I would also like to see Codeforces sue them, but it’s nearly impossible to prove the theft, especially when they use the data only for RLVR instead of SFT.

And yes, big companies like Google and Meta do release research datasets, like https://research.google/resources/datasets/ and https://ai.meta.com/datasets/

GPT4-B

Bingo, that's what I was talking about. RLVR only needs statements, because we don't have access to test cases anyway.

Mindeveloped

+16

I mean, research is ... research... It ultimately brings benefits to humanity.

No more such shit.

Anyway, these crawlers heavily increasing the server load during contests is the real problem. Creating a license won't help that. And even if you do that, there's no guarantee that they will follow it. Otherwise Codeforces could just say "1e18 coins per submission" and those AI companies would stop.

-9

Why? Could you elaborate? I sincerely think AI companies and AI research bring more benefit than harm to humanity.

Because you have no proof for this claim. In my opinion most companies solely care about money.

Though it was supposed to be lighthearted, and the paragraph below it is the point I want to stress.

Oh, I agree they only care about money and are maximally greedy. But the fact that they care solely about money doesn’t mean the endeavor they are advancing is not important and beneficial to humanity.

About the crawling, my theory is that if the data is available to them through a proper channel with a fair price, the crawling should naturally stop.

Muaath_5

reasonable, license-based middle ground

If its cost is higher than the cost of bots/IPs/servers, then it won't prevent scraping.

manchik

datasets are already available at https://huggingface.co/datasets/open-r1/codeforces and https://huggingface.co/datasets/open-r1/codeforces-submissions

Wind_Eagle

I totally agree. If you upload such a dataset, there will be no need to download data from Codeforces.

Instead of discussing the harm or benefit of AI, it is better to discuss the feasibility of all these protective measures. The reality is that they can be easily circumvented, especially if you are a large company or a major AI researcher. Why do we need measures that prevent nothing, but at the same time harm users? It's much better to post a complete dataset to stop this endless stream of crawling.

WuHongxun's blog
