WuHongxun's blog

By WuHongxun, history, 9 months ago, In English

Today I stumbled upon Mike’s recent blog post, and I want to start by thanking him for maintaining such an incredible platform. As Mike notes, it’s surprisingly hard to stop people from crawling Codeforces, and this got me thinking…


Why are people crawling Codeforces anyway?

My guess is that, with the boom in LLM research, many groups are scraping Codeforces to train cutting-edge AI models. For example, I found at least five recent arXiv papers whose authors appear to have used Codeforces data:


Is CAPTCHA really the answer?

  1. Is it bad if researchers use Codeforces data for AI?
    I mean, research is ... research... It ultimately brings benefits to humanity. Many of those papers actually complain that they only got partial problem sets (no checker code, missing statements, etc.), which arguably hinders AI research rather than helps it.

  2. Can CAPTCHA stop a frontier lab’s LLM agent?
    If these groups really want the data, my hunch is that no CAPTCHA—however elaborate—can keep them out.

  3. Can we stop the development of AI to prevent AI cheating?
    Well, you can try. I am not too worried about AI cheating, because there is no way to stop it. Sooner or later, AI is going to ace competitive programming. The question is: should we try to stop what is inevitable? Or should we, as a community, contribute our accumulated intelligence fossil to the development of AGI?


An alternative: an official dataset

What if Codeforces released a curated dataset—complete problem statements, checker code, tests, metadata—under a clear license for research use?
- Researchers would get everything they need, without resorting to ad-hoc scraping.
- Codeforces could control access, ensure credit, even track citations.
- CAPTCHAs and other blockades could be simplified or removed.
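
To make the idea a bit more concrete, here is a rough sketch of what a single record in such a dataset might look like. Everything in it is an assumption for illustration only: the names `ProblemRecord`, `TestCase`, and `to_json_line`, the field names, and the `"research-only"` license tag are hypothetical and do not describe any existing Codeforces format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class TestCase:
    """One input/expected-output pair (hypothetical layout)."""
    input_data: str
    expected_output: str


@dataclass
class ProblemRecord:
    """One problem in the proposed dataset; all field names are assumptions."""
    problem_id: str                 # made-up identifier, e.g. "demo-1"
    statement: str                  # complete problem statement
    checker_source: str             # source code of the checker
    tests: List[TestCase] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)
    license: str = "research-only"  # placeholder license tag


def to_json_line(record: ProblemRecord) -> str:
    """Serialize a record as one line of JSON (JSON Lines style)."""
    return json.dumps(asdict(record), ensure_ascii=False)


if __name__ == "__main__":
    demo = ProblemRecord(
        problem_id="demo-1",
        statement="Add two integers.",
        checker_source="// compare output token by token",
        tests=[TestCase("1 2\n", "3\n")],
        metadata={"tags": "math"},
    )
    print(to_json_line(demo))
```

Storing one record per line would let researchers stream the data without scraping, and the explicit `license` field is where an academic or enterprise license could be recorded.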


But wait! Isn’t this just robbery?

I also realize the following logic stands:

Why should Codeforces give away its data just because researchers want it? Isn’t that akin to saying I want to steal your money for my research because: 1. Research is good. 2. I can steal it if you don’t let me have it.

Fair point. Nobody should question that logic. But in practice, both points are simply true, and people are already "stealing". So why not offer a reasonable, license-based middle ground? For example:

  • Enterprise licenses for large companies (at market rates).
  • Academic licenses at reduced cost or free for non-commercial research.

That way, everyone wins: Codeforces retains control and funding, researchers get reliable access, and the arms race of ever-more-elaborate scraping is defused. But maybe I overlooked something here.


I’m curious to hear what the community thinks. Is an official research dataset a viable compromise? Or is there another solution I’ve overlooked?

(I used GPT-4o to polish my writing.)

Votes: +66

»
9 months ago
Votes: -58

it's 7 AM, what are you doing now

»
9 months ago
Rev. 2
Votes: +78

From Codeforces Terms and Conditions:

You are expressly and emphatically restricted from all of the following:

  • using this Website in any way that is, or may be, damaging to this Website;
  • using this Website in any way that impacts user access to this Website;

So without explicit permission, in my opinion, scraping is against the terms and conditions. Cloudflare protection to prevent automated data collection is an example of measures against that.

"Is it bad if researchers use Codeforces data for AI?"

"I mean, research is ... research..."

Sure, then please ask Facebook, TikTok, or Google to share their datasets and/or algorithms.

I kinda want to see Codeforces suing all these scrapers.

"I can steal it if you don’t let me have it."

Amazing attitude.

»
9 months ago
Votes: +16

"I mean, research is ... research... It ultimately brings benefits to humanity."

No more such shit.

Anyway, these crawlers heavily increasing the server load during contests is the real problem. Creating a license won't help that. And even if you do that, there's no guarantee that they will follow it. Otherwise Codeforces could just say "1e18 coins per submission" and those AI companies would stop.

»
9 months ago
Votes: +3

I totally agree. If you upload such a dataset, there will be no need to download data from Codeforces.

Instead of discussing the harm or benefit of AI, it is better to discuss the feasibility of all these protective measures. The reality is that they can be easily circumvented, especially if you are a large company or a major AI researcher. Why do we need measures that prevent nothing, but at the same time harm users? It's much better to post a complete dataset to stop this endless stream of crawling.