Hello Codeforces Community!
I’m currently working on an AI/ML project aimed at training language models specifically for competitive programming tasks. My goal is to fine-tune an open-source large language model (like StarCoder or CodeLlama) on Codeforces problems so that it can better understand problem statements, generate solutions, or even explain problems like a human would.
What I Need
I am looking to build or find a structured dataset of Codeforces problems that includes:
Problem Statements: Full problem descriptions (title, statement, input/output format, constraints, examples, explanations).
Metadata: Problem tags (like "math", "dp", "graphs"), problem rating (difficulty), contest ID, problem index (A, B, C...).
Editorials: If possible, links to editorial articles or parsed editorial text.
Solutions (Optional but Helpful):
Accepted solutions (preferably in C++, Python, Java, etc.).
Testcases (Optional): Public testcases shown in the problem statement.
Dataset Format Preferred
For fine-tuning, the dataset would ideally be in JSON, JSONL, CSV, or Parquet format.
Example JSON entry for a problem:
{
  "contestId": 1560,
  "index": "A",
  "title": "Dislike of Threes",
  "tags": ["implementation", "math"],
  "rating": 800,
  "statement": {
    "text": "Let's define a sequence ...",
    "input": "The first line contains...",
    "output": "For each test case...",
    "examples": [
      { "input": "3\n7\n10\n21", "output": "9\n12\n28" }
    ],
    "constraints": "1 ≤ t ≤ 1000, 1 ≤ k ≤ 1000"
  },
  "editorial": "https://mirror.codeforces.com/blog/entry/???",
  "solutions": [
    { "language": "C++", "code": "#include ..." }
  ]
}
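To make the format concrete, here is a minimal Python sketch of how a record shaped like the example entry could be checked and stored as JSONL (one JSON object per line). The helper names (`validate_record`, `write_jsonl`, `read_jsonl`), the file name, and the choice of which keys count as required are my own assumptions for illustration, not an established schema.

```python
import json
import os
import tempfile

# Top-level keys taken from the example entry.
# Treating all of them as required is an assumption.
REQUIRED_KEYS = {"contestId", "index", "title", "tags", "rating", "statement"}


def validate_record(record):
    """Return a sorted list of required keys missing from a record."""
    return sorted(REQUIRED_KEYS - set(record))


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII symbols such as '≤'."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


example = {
    "contestId": 1560,
    "index": "A",
    "title": "Dislike of Threes",
    "tags": ["implementation", "math"],
    "rating": 800,
    "statement": {
        "text": "Let's define a sequence ...",
        "input": "The first line contains...",
        "output": "For each test case...",
        "examples": [{"input": "3\n7\n10\n21", "output": "9\n12\n28"}],
        "constraints": "1 ≤ t ≤ 1000, 1 ≤ k ≤ 1000",
    },
    "solutions": [{"language": "C++", "code": "#include ..."}],
}

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.gettempdir(), "codeforces_sample.jsonl")
write_jsonl([example], path)
loaded = read_jsonl(path)
missing = validate_record(example)
```

JSONL is convenient here because records can be appended one at a time while scraping and streamed line by line during training, without loading the whole file into memory.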
Why This Dataset?
This dataset can help train models that:
Understand problem statements in natural language.
Generate correct or partially correct code solutions.
Help beginners by providing hints or explaining steps.
Solve problems in multiple programming languages.
It can benefit not only research in AI for competitive programming but also educational tools, coding assistants, and learning platforms.
Technical Details for Fine-Tuning (For Those Interested)
For fine-tuning a language model on Codeforces problems, typical dataset requirements include:
High-quality natural language text: problem statements and explanations.
Structured input/output format: so the model can learn to parse constraints, examples, and expected outputs.
Paired code data (if available): aligns the problem with human-written code, improving code generation ability.
Consistent formatting: JSON or JSONL format is best for pipeline compatibility.
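As an illustration of the "paired code data" point, here is a rough Python sketch that turns one problem record into a prompt/completion pair. The prompt layout, the function name `build_training_pair`, and the `prompt`/`completion` keys are assumptions of mine; the template a given model or training library expects may differ.

```python
def build_training_pair(record, language="C++"):
    """Turn one problem record into a prompt/completion pair.

    Returns None when the record has no solution in the requested language.
    The prompt layout below is a hypothetical template, not a standard.
    """
    st = record["statement"]
    parts = [
        f"Problem {record['contestId']}{record['index']}: {record['title']}",
        f"Tags: {', '.join(record['tags'])}",
        f"Rating: {record['rating']}",
        "",
        st["text"],
        "",
        "Input: " + st["input"],
        "Output: " + st["output"],
        "Constraints: " + st["constraints"],
    ]
    # Append each public example from the statement.
    for i, ex in enumerate(st.get("examples", []), start=1):
        parts += [
            "",
            f"Example {i} input:",
            ex["input"],
            f"Example {i} output:",
            ex["output"],
        ]
    prompt = "\n".join(parts) + f"\n\nWrite a solution in {language}.\n"

    # Pick the first stored solution written in the requested language.
    solution = next(
        (s for s in record.get("solutions", []) if s["language"] == language),
        None,
    )
    if solution is None:
        return None
    return {"prompt": prompt, "completion": solution["code"]}


sample = {
    "contestId": 1560,
    "index": "A",
    "title": "Dislike of Threes",
    "tags": ["implementation", "math"],
    "rating": 800,
    "statement": {
        "text": "Let's define a sequence ...",
        "input": "The first line contains...",
        "output": "For each test case...",
        "examples": [{"input": "3\n7\n10\n21", "output": "9\n12\n28"}],
        "constraints": "1 ≤ t ≤ 1000, 1 ≤ k ≤ 1000",
    },
    "solutions": [{"language": "C++", "code": "#include ..."}],
}

pair = build_training_pair(sample)
```

Keeping tags and rating in the prompt is a design choice: it lets the model condition on difficulty, but they could just as well be dropped if the goal is to solve problems from the statement alone.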
How You Can Help
Sharing Codeforces data you have scraped in the past, if you are willing.
Suggestions on whether there are existing public datasets.
Tips on the best way to scrape this data efficiently (using the Codeforces API or web scraping).
Help in parsing problem statements into structured data.
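For the metadata part, a possible starting point is the public Codeforces API method `problemset.problems`, which returns problem names, tags, and ratings as JSON. Below is a small Python sketch under that assumption. The helper names are mine, the `name` field is renamed to `title` to match the example entry, and the network call is left commented out so the snippet runs offline on a hand-written sample payload. As far as I know, the API does not return statement text, so statements would still need to be parsed from the problem pages.

```python
import json
import urllib.request

API_URL = "https://codeforces.com/api/problemset.problems"


def fetch_problemset(url=API_URL, timeout=30):
    """Download the problem list from the Codeforces API (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def extract_metadata(payload):
    """Convert an API response into metadata rows shaped like the example entry."""
    if payload.get("status") != "OK":
        raise ValueError(f"API call failed: {payload.get('comment', 'unknown error')}")
    rows = []
    for p in payload["result"]["problems"]:
        rows.append({
            "contestId": p.get("contestId"),
            "index": p["index"],
            "title": p["name"],         # the API calls this field "name"
            "tags": p.get("tags", []),
            "rating": p.get("rating"),  # can be absent for unrated problems
        })
    return rows


# Hand-written sample in the shape of an API response, used instead of a live call.
sample_payload = {
    "status": "OK",
    "result": {
        "problems": [
            {
                "contestId": 1560,
                "index": "A",
                "name": "Dislike of Threes",
                "type": "PROGRAMMING",
                "rating": 800,
                "tags": ["implementation", "math"],
            }
        ],
        "problemStatistics": [],
    },
}

rows = extract_metadata(sample_payload)
# For live data: rows = extract_metadata(fetch_problemset())
```

When calling the live API repeatedly, it is worth checking the API documentation for rate limits and adding a delay between requests.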
Contact / Collaboration
Mail: siri43667@gmail.com
Discord: siri43667
I’m open to collaborating with anyone interested in this project. Whether you're a machine learning enthusiast, a fellow competitive programmer, or someone with scraping expertise, let's connect!
Please feel free to reply here or message me.
Thank you, Codeforces community! Let's push the boundaries of coding and AI together!



