WuHongxun's blog

By WuHongxun, history, 9 months ago, In English

Today I stumbled upon Mike’s recent blog post, and I want to start by thanking him for maintaining such an incredible platform. As Mike notes, it’s surprisingly hard to stop people from crawling Codeforces, and this got me thinking…


Why are people crawling Codeforces anyway?

My guess is that, with the boom in LLM research, many groups are scraping Codeforces to train cutting-edge AI models. For example, I found at least five recent arXiv papers whose authors appear to have used Codeforces data.


Is CAPTCHA really the answer?

  1. Is it bad if researchers use Codeforces data for AI?
    I mean, research is ... research... It ultimately brings benefits to humanity. Many of those papers actually complain that they only got partial problem sets (no checker code, missing statements, etc.), which arguably hinders AI research rather than helps it.

  2. Can CAPTCHA stop a frontier lab’s LLM agent?
    If these groups really want the data, my hunch is that no CAPTCHA—however elaborate—can keep them out.

  3. Can we stop the development of AI to prevent AI cheating?
    Well, you can try... I am not too worried about AI cheating, because there is no way to stop it. Sooner or later, AI is going to ace competitive programming. The question is: should we try to stop the inevitable, or should we, as a community, contribute our accumulated intelligence fossil to the development of AGI?


An alternative: an official dataset

What if Codeforces released a curated dataset—complete problem statements, checker code, tests, metadata—under a clear license for research use?
- Researchers would get everything they need, without resorting to ad-hoc scraping.
- Codeforces could control access, ensure credit, even track citations.
- CAPTCHAs and other blockades could be simplified or removed.


But wait! Isn't this just robbery?

I also recognize that the following counterargument stands:

Why should Codeforces give away its data just because researchers want it? Isn't that akin to saying, "I want to steal your money for my research, because: 1. research is good, and 2. I can steal it anyway if you don't hand it over"?

Fair point, and the logic itself is sound. But in practice both premises are simply true, and the data is already being "stolen", so why not offer a reasonable, license-based middle ground? For example:

  • Enterprise licenses for large companies (at market rates).
  • Academic licenses at reduced cost or free for non-commercial research.

That way, everyone wins: Codeforces retains control and funding, researchers get reliable access, and the arms race of ever-more-elaborate scraping is defused. But maybe I overlooked something here.


I’m curious to hear what the community thinks. Is an official research dataset a viable compromise? Or is there another solution I’ve overlooked?

(I used gpt-4o to polish my writing.)


By WuHongxun, history, 8 years ago, In English

Recently I gave a second thought to the solution of 98E and got confused by a technical detail. It occurred to me that there may be something wrong with the solution.

In the editorial, it is claimed that the expected utility of "bluffing and moving on" should be 1 - P(n, m-1), and the only justification given is "In the same manner we fill other cells of the matrix." That is obviously not a compelling reason.

I guess the writer's proof went something like this: since the opponent chooses to move on, it must lose whenever I am not bluffing, so it behaves optimally only in the case that I am bluffing, which is equivalent to the case where I showed and then discarded this card.

The problem is that the two cases are actually not equivalent, because I did not discard that card. Since f[n][m] may not coincide with 1 - f[m][n], I can bluff about this card even though the opponent knows I'm bluffing: I waste a step and change the state to one of value 1 - f[m][n] instead, and I may actually benefit from it. The cards known by the opponent but not yet discarded matter!
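If I read the objection correctly, it can be summarized as follows. Here I assume, as a reading of the editorial's symbols, that f[n][m] denotes the winning probability of the player to move holding n cards against an opponent holding m, and P is the matrix filled by the editorial:

```latex
\begin{align*}
\text{editorial:}\quad & V_{\text{bluff, move on}} = 1 - P(n,\, m-1)
  && \text{(as if the bluffed card were discarded)}\\
\text{objection:}\quad & V_{\text{bluff, move on}} = 1 - f[m][n]
  && \text{(card revealed, but still in hand)}\\
\text{and, in general,}\quad & f[n][m] \neq 1 - f[m][n].
\end{align*}
```

So, under this reading, deliberately wasting a step on a bluff the opponent can see through may still be profitable, which is exactly what the editorial's one-line justification glosses over.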

So now the claim made in the editorial seems quite suspicious. I'm looking forward to a clear and intuitive explanation. Any ideas?
