A Codeforces dataset? Thoughts and discussion

Today I stumbled upon Mike’s recent blog post, and I want to start by thanking him for maintaining such an incredible platform. As Mike notes, it’s surprisingly hard to stop people from crawling Codeforces, and this got me thinking…

Why are people crawling Codeforces anyway?

My guess is that, with the boom in LLM research, many groups are scraping Codeforces to train cutting-edge AI models. For example, I found at least five recent arXiv papers whose authors appear to have harvested Codeforces data.

Is CAPTCHA really the answer?

Is it bad if researchers use Codeforces data for AI?
I mean, research is research, and it ultimately benefits humanity. Many of those papers actually complain that they only got partial problem sets (no checker code, missing statements, etc.), and incomplete data arguably hinders AI research rather than helping it.
Can CAPTCHA stop a frontier lab’s LLM agent?
If these groups really want the data, my hunch is that no CAPTCHA, however elaborate, can keep them out.

An alternative: an official dataset

What if Codeforces released a curated dataset—complete problem statements, checker code, tests, metadata—under a clear license for research use?
- Researchers would get everything they need, without resorting to ad-hoc scraping.
- Codeforces could control access, ensure credit, even track citations.
- CAPTCHAs and other blockades could be simplified or removed.

But wait! Isn't this just robbery?

I also realize the following objection stands:

Why should Codeforces give away its data just because researchers want it? Isn’t that akin to saying "I want to steal your money for my research, because (1) research is good, and (2) I can steal it anyway if you don’t let me have it"?

Fair point, and nobody should dismiss that objection. But in practice, both points are simply true, and these groups are already "stealing". So why not offer a reasonable, license-based middle ground? For example:

- Enterprise licenses for large companies (at market rates).
- Academic licenses at reduced cost or free for non-commercial research.

That way, everyone wins: Codeforces retains control and funding, researchers get reliable access, and the arms race of ever-more-elaborate scraping is defused. But maybe I overlooked something here.

I’m curious to hear what the community thinks. Is an official research dataset a viable compromise? Or is there another solution I’ve overlooked?

(I used GPT-4o to polish my writing.)

Rev.	By	When	Δ	Comment
en5	WuHongxun	2025-07-14 02:09:48	2	Tiny change: 'ating?\n Well ' -> 'ating?\n\n Well '
en4	WuHongxun	2025-07-14 02:09:25	2	Tiny change: 'ating?\n Well ' -> 'ating?\n\n Well '
en3	WuHongxun	2025-07-14 02:08:50	402
en2	WuHongxun	2025-07-14 02:01:08	83
en1	WuHongxun	2025-07-14 01:55:41	2877	Initial revision (published)
