A Codeforces dataset? Thoughts and Discussions

Revision en2, by WuHongxun, 2025-07-14 02:01:08

Today I stumbled upon Mike’s recent blog post, and I want to start by thanking him for maintaining such an incredible platform. As Mike notes, it’s surprisingly hard to stop people from crawling Codeforces, and this got me thinking…


Why are people crawling Codeforces anyway?

My guess is that, with the boom in LLM research, many groups are scraping Codeforces to train cutting-edge AI models. For example, I found at least five recent arXiv papers whose authors appear to have harvested Codeforces data.


Is CAPTCHA really the answer?

  1. Is it bad if researchers use Codeforces data for AI?
    I mean, research is... research. It ultimately benefits humanity. Many of those papers actually complain that they only got partial problem sets (no checker code, missing statements, etc.), which arguably hinders AI research rather than helps it.

  2. Can CAPTCHA stop a frontier lab’s LLM agent?
    If these groups really want the data, my hunch is that no CAPTCHA—however elaborate—can keep them out.


An alternative: an official dataset

What if Codeforces released a curated dataset—complete problem statements, checker code, tests, metadata—under a clear license for research use?
- Researchers would get everything they need, without resorting to ad-hoc scraping.
- Codeforces could control access, ensure credit, even track citations.
- CAPTCHAs and other blockades could be simplified or removed.


But wait! Isn’t this just robbery?

I also realize the following logic stands:

Why should Codeforces give away its data just because researchers want it? Isn’t that akin to saying "I want to steal your money for my research" because (1) research is good, and (2) I can steal it anyway if you don’t let me have it?

Fair point. Nobody should question that logic. But in practice, both points are simply true, and people are already "stealing" the data. So why not offer a reasonable, license-based middle ground? For example:

  • Enterprise licenses for large companies (at market rates).
  • Academic licenses at reduced cost or free for non-commercial research.

That way, everyone wins: Codeforces retains control and funding, researchers get reliable access, and the arms race of ever-more-elaborate scraping is defused. But maybe I overlooked something here.


I’m curious to hear what the community thinks. Is an official research dataset a viable compromise? Or is there another solution I’ve overlooked?

(I used GPT-4o to polish my writing.)

Tags #discussion, community, dataset
