OpenAI o1 IOI submissions

Hi Codeforces! I am a member of the reasoning team at OpenAI. We are especially excited to see your interest in the OpenAI o1 model launch, many of us being Codeforces users ourselves (chenmark, meret, qwerty787788, among others). Given the curiosity around the IOI results, we wanted to share the submissions that scored 362.14—above the gold medal threshold—from the research blog post with you. These were the highest scoring among 10,000 submissions, so still a ways to go until top human performance, but we aspire to be there one day.

The following C++ programs (including comments!) are written entirely by the model. Special thanks to PavelKunyavskiy for maintaining the IOI mirror, which we used to check our scores. We hope you enjoy taking a look!

nile (100/100)

Submission (100/100)

message (79.64/100)

Submission (79.64/100; subtask 1 and partial credit on subtask 2)

tree (30/100)

Submission 1 (17/100; subtasks 1 and 4)
Submission 2 (13/100; subtask 2)

hieroglyphs (44/100)

Submission 1 (34/100; subtasks 1, 2, and 4)
Submission 2 (10/100; subtask 3)

mosaic (37/100)

Submission 1 (22/100; subtasks 1, 2, and 4)
Submission 2 (20/100; subtasks 1, 3, and 5)

sphinx (71.5/100)

Submission 1 (50/100; 50% partial credit on all subtasks)
Submission 2 (43/100; subtasks 1, 2, and 3)
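
The per-task totals above are consistent with IOI-style scoring, in which each subtask counts the best result achieved on it by any submission, and the task total is the sum of those per-subtask bests. Below is a minimal Python sketch for sphinx, assuming that rule and a 100-point task. The two-group split (subtasks 1 to 3 worth 43 points together, the remaining subtasks worth 57) is inferred from the scores listed here, not taken from the official task statement, and `task_total` is a hypothetical helper name.

```python
def task_total(group_points, submissions):
    """Sum, over subtask groups, of the best score any submission got on that group.

    group_points: points available in each group of subtasks.
    submissions: one list per submission giving the fraction (0.0 to 1.0)
                 of each group's points that the submission earned.
    """
    total = 0.0
    for i, points in enumerate(group_points):
        best_fraction = max(sub[i] for sub in submissions)
        total += points * best_fraction
    return total


# Assumed grouping for sphinx: subtasks 1-3 (43 points), the rest (57 points).
sphinx_groups = [43, 57]
submission_1 = [0.5, 0.5]  # 50% partial credit everywhere -> 50 points
submission_2 = [1.0, 0.0]  # full credit on subtasks 1-3 only -> 43 points

sphinx_total = task_total(sphinx_groups, [submission_1, submission_2])
print(sphinx_total)  # 71.5
```

The same rule would explain why the two mosaic submissions (22 and 20) total 37 rather than 42: both cover subtask 1, which is counted only once.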

Lastly, we hope you find the new model magical and delightful—we can’t wait to hear about the amazing things you’ll build with it. (But please don’t use it to cheat on Codeforces!)

Comments (30)

GusterGoose27

2 months ago

+62

Great work!

o1 has extremely impressive scores all around. Its most impressive score is probably hieroglyphs, where 44 points would place it fourth relative to onsite contestants! It seems the model was able to decipher some of the subtasks that we could not!

caustique

And how was the performance of the model on Codeforces problems measured? Did it participate in rated rounds? Is it possible to reveal a username of the model on Codeforces?

yummy

2 months ago

We evaluated the Codeforces performance of the model via simulation, doing a best effort to approximate how the model would have performed had it participated live. With our Codeforces eval, the model is limited to 10 submissions per problem. We use these submissions to simulate the score; from the score we get a ranking; and from the ranking we estimate the model's rating.
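
The last step, going from a ranking to a rating estimate, can be illustrated with an Elo-style expected-rank model. This is only a sketch of one common approach, similar in spirit to publicly described Codeforces rating calculations. The comment above does not say which procedure was actually used, and the function names, the 400-point scale, the search bounds, and the sample field are all assumptions made for illustration.

```python
def expected_rank(rating, opponent_ratings):
    """Expected rank (1 = best) of a player with `rating` against the field.

    Each opponent contributes the Elo-style probability that they
    finish ahead of the player.
    """
    return 1.0 + sum(
        1.0 / (1.0 + 10 ** ((rating - r) / 400.0)) for r in opponent_ratings
    )


def performance_rating(rank, opponent_ratings, lo=0.0, hi=5000.0):
    """Binary-search the rating whose expected rank equals the observed rank."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > rank:
            lo = mid  # expected rank is worse than observed: rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical field of five opponents.
field = [1200, 1500, 1800, 2100, 2400]
estimate_rank_2 = performance_rating(2, field)
estimate_rank_4 = performance_rating(4, field)
print(round(estimate_rank_2), round(estimate_rank_4))
```

A better simulated rank maps to a higher estimated rating, since the expected rank decreases monotonically as the rating increases.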

zera.zhang

4 weeks ago

Does 'simulation' mean the model did not participate in a live contest and only took part virtually? Is the ranking based solely on the results from the IOI_contest_codeforces? How is the model's rating estimated? Does it use the Elo system for just that one contest?

In the problem sphinx, Submission 1 scored 50 and Submission 2 scored 43. How is the total score of 71.5 calculated?

entropy07

Could be a naive question but: Do you guys (OpenAI) plan on watermarking the code generated by future models? It could make the process of detecting AI generated code much easier.

VLamarca

Watermarking 50 lines of code seems impossible, unlike watermarking an image.

+18

I can't speak to future plans for OpenAI. That said, speaking for myself (and not OpenAI), I think watermarking is a cool research direction but not a panacea. For many problems, all AC solutions fall into a few broad buckets, and within those buckets, it is difficult to identify AI vs. non-AI solutions if one is allowed to rewrite/obfuscate code.

kostka

Out of curiosity: can you share if there are any endeavors in problem setting?

+20

We don't have any results on problem setting, and I could imagine that writing creative problems is a bit out of reach of current models. (I struggle to even get them to tell me a new joke :)) But synthetic problems have been used in the training of models, e.g. AlphaGeometry.

+13

This is probably obvious, but I want to ask: did the AI use stress testing locally to check for correctness? I suppose it is capable of writing a brute-force solution and testing against it locally. Just curious whether the 10k submissions could be avoided (or whether this could even improve the performance).

Maybe you didn't want to do that because adding human heuristics on top of the AI just for the sake of performance is not the goal?
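
For readers unfamiliar with the term, stress testing here means comparing a fast candidate solution against a slow but obviously correct brute force on many small random inputs. The sketch below shows the idea in Python on a stand-in problem (maximum subarray sum). The problem choice, function names, and input ranges are illustrative assumptions and have nothing to do with the IOI tasks or with how the model was actually evaluated.

```python
import random


def brute_max_subarray(a):
    """O(n^2) reference: try every non-empty contiguous subarray."""
    best = a[0]
    for i in range(len(a)):
        running = 0
        for j in range(i, len(a)):
            running += a[j]
            best = max(best, running)
    return best


def fast_max_subarray(a):
    """O(n) candidate (Kadane's algorithm) that we want to validate."""
    best = current = a[0]
    for x in a[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def stress_test(candidate, reference, rounds=500, seed=0):
    """Return the first random input where the two disagree, or None."""
    rng = random.Random(seed)
    for _ in range(rounds):
        n = rng.randint(1, 8)
        a = [rng.randint(-10, 10) for _ in range(n)]
        if candidate(a) != reference(a):
            return a
    return None


counterexample = stress_test(fast_max_subarray, brute_max_subarray)
print(counterexample)  # None means no mismatch was found in these rounds
```

Small inputs are deliberate: they keep the brute force fast and make any counterexample easy to inspect by hand.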

+12

In the blog post, we discussed this a little:

For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

It would be super cool if one day the AI could do stress testing without human heuristics on top!
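
As a rough illustration of what a test-time selection strategy could look like, the Python sketch below ranks candidates by the three signals named in the quote and keeps the top 50. How the signals were actually combined is not stated, so the lexicographic ordering, the field names (`public_pass`, `generated_pass`, `learned_score`), and the function name are all hypothetical.

```python
def select_submissions(candidates, k=50):
    """Pick the k candidates with the best combined score.

    Each candidate is a dict with hypothetical fields:
      public_pass:    fraction of public test cases passed
      generated_pass: fraction of model-generated test cases passed
      learned_score:  output of some learned scoring function
    Signals are compared lexicographically in that order (an assumption).
    """
    def key(c):
        return (c["public_pass"], c["generated_pass"], c["learned_score"])

    return sorted(candidates, key=key, reverse=True)[:k]


# Tiny made-up pool, selecting 2 instead of 50 for readability.
pool = [
    {"id": "a", "public_pass": 1.0, "generated_pass": 0.7, "learned_score": 0.2},
    {"id": "b", "public_pass": 1.0, "generated_pass": 0.9, "learned_score": 0.1},
    {"id": "c", "public_pass": 0.5, "generated_pass": 1.0, "learned_score": 0.9},
]
chosen_ids = [c["id"] for c in select_submissions(pool, k=2)]
print(chosen_ids)  # ['b', 'a']
```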

Edit: Oh, I got it. The model only submitted 50 solutions, as is the competition constraint. It generated thousands of solutions, but it only submitted 50.

cebolinha

+36

I think you misunderstood here...

There are actually 3 different results:

submitting 50 random submissions: 156 points
strategically choosing 50 submissions: 213 points
up to 10k submissions: 362 points

It can be seen on this webpage: https://openai.com/index/learning-to-reason-with-llms/#coding

+11

Oh, thanks! I was just too lazy to read about it. I prefer to read the Codeforces comments :)

So I keep my position: I would expect that a sophisticated heuristic on top of the model with stress testing would, in most cases, be as accurate as the real verdict. That is, score should not improve by allowing more submissions.

egoi202e

When do you think AI will be able to solve Master level problems? Or is that even possible?

returnAC

No way competitive programmers are the ones trying to ruin the sport

SuperJ6

Creating 1 algorithm to solve all problems is the ultimate challenge.

vjudge1

+10

It would be even more impressive if the score distribution histogram of the 10,000 submissions could be shared for each task!

Olympia

How much computing power was used?

smy_mind

What was the prompt after seeing that the code was failing? Did it generate some test cases somehow?

zfnu

What is the effective context size for the o1-ioi model (in tokens)? I assume that since competitive programming doesn't require major decomposition of tasks (and the tasks themselves are small), it should be much lower. Or, even here, does a bigger context size always lead to better results, with no diminishing returns so far?
The problem with RLHF is that it generally tries to optimize the human vibe (humans liking the answer), not some clear final metric. When tuning o1-ioi, have you tried using the rating or number of points as a reward function instead of human feedback?

bigSchrodinger

Why competitive programming?

MikeMirzayanov

+82

Thanks for posting such details. You guys do such interesting things!

Enchom

+27

Very insightful.

Edit: Seems like its solution is in fact correct, so 1-0 for the AI against me

Original: In particular I find the results on "message" interesting: it seems like its basic idea of determining a known safe column is not really correct, but given 10,000 submissions I imagine it tried a lot of different ways to communicate a safe column and eventually one got past the grader. That gives one view of why more submissions can be more helpful. I haven't examined the sphinx code, but I imagine in principle a similar thing is possible there, too.

ksun48

What's wrong with it?

Actually, you're right, I seem to have misunderstood its approach. Not sure why so many people agreed with me.

Are you suggesting it can be hacked? Can you hack it?

Meron3r

But what about cheaters who use AI?

s-lissov

+24

It is not much different from cheaters who use their friends/submit from multiple accounts. New technology, but the same old problem.

amsen

Amazing things you are doing :)

yummy's blog