Request for Dataset: Building an AI Model for Competitive Programming with Codeforces Data

Hello Codeforces Community!

I’m currently working on an exciting AI/ML project aimed at training language models specifically for competitive programming tasks. My goal is to fine-tune an open-source large language model (like StarCoder or CodeLlama) on Codeforces problems so that it can better understand problem statements, generate solutions, or even explain problems like a human would.

What I Need

I am looking to build or find a structured dataset of Codeforces problems that includes:

Problem Statements: Full problem descriptions (title, statement, input/output format, constraints, examples, explanations).

Metadata: Problem tags (like "math", "dp", "graphs"), problem rating (difficulty), contest ID, problem index (A, B, C...).

Editorials: If possible, links to editorial articles or parsed editorial text.

Solutions (Optional but Helpful): Accepted solutions (preferably in C++, Python, Java, etc.).

Testcases (Optional): Public testcases shown in the problem statement.

Preferred Dataset Format

For fine-tuning, the dataset would ideally be in JSON, JSONL, CSV, or Parquet format.

Example JSON entry for a problem:

{
  "contestId": 1560,
  "index": "A",
  "title": "Dislike of Threes",
  "tags": ["implementation", "math"],
  "rating": 800,
  "statement": {
    "text": "Let's define a sequence ...",
    "input": "The first line contains...",
    "output": "For each test case...",
    "examples": [
      { "input": "3\n7\n10\n21", "output": "9\n12\n28" }
    ],
    "constraints": "1 ≤ t ≤ 1000, 1 ≤ k ≤ 1000"
  },
  "editorial": "https://mirror.codeforces.com/blog/entry/???",
  "solutions": [
    { "language": "C++", "code": "#include ..." }
  ]
}
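To make the proposed entry format easier to check, here is a minimal Python sketch that validates one record and serializes it as a JSONL line. The required-key set is taken from the example entry above. The function names `validate_record` and `to_jsonl_line` are hypothetical, and the checks are only a starting point, not a full schema.

```python
import json

# Top-level keys taken from the example entry; "editorial" and "solutions"
# are treated as optional here (an assumption, adjust as needed).
REQUIRED_KEYS = {"contestId", "index", "title", "tags", "rating", "statement"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty means it looks fine)."""
    errors = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if not isinstance(record.get("contestId"), int):
        errors.append("contestId should be an integer")
    if not isinstance(record.get("tags"), list):
        errors.append("tags should be a list of strings")
    # Every example needs both halves so it can be used as a test case.
    for i, example in enumerate(record.get("statement", {}).get("examples", [])):
        if not {"input", "output"} <= example.keys():
            errors.append(f"example {i} needs both input and output")
    return errors


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single line, keeping characters such as '≤' readable."""
    return json.dumps(record, ensure_ascii=False)
```

Running `validate_record` over every entry before training would catch records with missing fields early, and one `to_jsonl_line` result per line gives a JSONL file.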

Why This Dataset?

This dataset can help train models that:

Understand problem statements in natural language.

Generate correct or partially correct code solutions.

Help beginners by providing hints or explaining steps.

Solve problems in multiple programming languages.

It can benefit not only research in AI for competitive programming but also educational tools, coding assistants, and learning platforms.

Technical Details for Fine-Tuning (For Those Interested)

For fine-tuning a language model on Codeforces problems, typical dataset requirements include:

High-quality natural language text: problem statements and explanations.

Structured input/output format: so the model can learn to parse constraints, examples, and expected outputs.

Paired code data (if available): aligns the problem with human-written code, improving code generation ability.

Consistent formatting: JSON or JSONL format is best for pipeline compatibility.
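As an illustration of the paired-data and consistent-formatting points, the sketch below turns one problem record (in the JSON shape proposed earlier in this post) into prompt/completion pairs and writes them as JSONL. The `prompt` and `completion` field names and the prompt layout are assumptions; different fine-tuning pipelines expect different keys, so treat this as a template rather than a fixed format.

```python
import json


def build_prompt(record: dict) -> str:
    """Flatten one problem record into a plain-text prompt."""
    st = record["statement"]
    parts = [
        f"Problem {record['contestId']}{record['index']}: {record['title']}",
        st["text"],
        "Input: " + st["input"],
        "Output: " + st["output"],
    ]
    for example in st.get("examples", []):
        parts.append("Example input:\n" + example["input"])
        parts.append("Example output:\n" + example["output"])
    return "\n\n".join(parts)


def to_training_pairs(record: dict, language: str = "C++") -> list[dict]:
    """Pair the prompt with each solution written in the requested language."""
    prompt = build_prompt(record)
    return [
        {"prompt": prompt, "completion": sol["code"]}
        for sol in record.get("solutions", [])
        if sol.get("language") == language
    ]


def write_jsonl(pairs: list[dict], path: str) -> None:
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Keeping the prompt builder separate from the writer makes it easy to change the prompt layout later without touching the stored problem records.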

How You Can Help

Sharing any Codeforces data you have scraped in the past, if you are willing to.

Suggestions on whether there are existing public datasets.

Tips on the best way to scrape this data efficiently (using the Codeforces API or web scraping).

Help in parsing problem statements into structured data.
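For the metadata part, the official Codeforces API method `problemset.problems` already returns contest ID, index, name, tags, and rating for each problem. Below is a minimal sketch of fetching it and mapping it onto the field names used in this post. The function names and the output field names are my own choices. As far as I know, the API does not return statement text, so statements would still need a separate source.

```python
import json
import urllib.request

API_URL = "https://codeforces.com/api/problemset.problems"


def fetch_problemset(url: str = API_URL) -> dict:
    """Download the problem list as parsed JSON (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def problems_to_records(payload: dict) -> list[dict]:
    """Map the API response onto the metadata fields proposed in this post."""
    if payload.get("status") != "OK":
        raise ValueError(f"API error: {payload.get('comment')}")
    records = []
    for problem in payload["result"]["problems"]:
        records.append({
            "contestId": problem.get("contestId"),
            "index": problem["index"],
            "title": problem["name"],
            "tags": problem.get("tags", []),
            "rating": problem.get("rating"),  # None when the problem has no rating
        })
    return records
```

Typical use would be `records = problems_to_records(fetch_problemset())`, a single request that covers the metadata for the whole problemset.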

Contact / Collaboration

Mail: siri43667@gmail.com
Discord: siri43667

I’m open to collaborating with anyone interested in this project. Whether you're a machine learning enthusiast, a fellow competitive programmer, or someone with scraping expertise, let's connect!

Please feel free to reply here or message me.

Thank you, Codeforces community! Let's push the boundaries of coding and AI together!

	Rev.	Lang.	By	When	Δ	Comment
	en2		Hidden-Ninja	2025-06-27 15:18:49	6	(published)
	en1		Hidden-Ninja	2025-06-27 15:17:46	3636	Initial revision (saved to drafts)
