Analyzing how good O1-Mini actually is

№	Пользователь	Рейтинг
1	Benq	3792
2	VivaciousAubergine	3647
3	Kevin114514	3611
4	jiangly	3583
5	strapple	3515
6	tourist	3470
7	Radewoosh	3415
8	Um_nik	3376
9	maroonrk	3361
10	XVIII	3345

№	Пользователь	Вклад
1	Qingyu	162
2	adamant	148
3	Um_nik	145
4	Dominater069	143
5	errorgorn	141
6	cry	138
7	Proof_by_QED	135
7	YuukiS	135
9	chromate00	134
10	soullless	132

I would like to thank redpanda and rewhile for their great help in writing and reviewing this blog. They are also the users who queried O1-mini for all the following problems.

We tried O1-mini on several problems, from a variety of sources. Let's list the results first.

Note :

Some of the WA verdicts here actually means that the AI just "stopped thinking" which means the AI thought for a long enough time without any useable results so it ran into an error.
All ratings mentioned are Codeforces ratings, the Atcoder ratings have been converted to codeforces rating. To measure the approximate codeforces rating of the atcoder problems mentioned here, you can use https://kenkoooo.com/atcoder/#/table/ + https://silverfoxxxy.github.io/rating-converter.

D2ABs

1762B - Make Array Good [1100] AC
1998B - Minimize Equal Sum Subarrays [1000] AC
1998A - Find K Distinct Points with Fixed Center [800] AC
1831A - Twin Permutations [800] AC
1831B - Array merging [1000] AC
1905B - Begginer's Zelda [1100] AC (this one even used 4o)
1905A - Constructive Problems [800] AC
2009C - The Legend of Freya the Frog [1100] AC
1816A - Ian Visits Mary [800] WA
1944B - Equal XOR [1100] WA
1758B - XOR = Average [900] WA

Standard problems

2000E - Photoshoot for Gorillas [1400] AC
13E - Holes [2700] AC after giving hidden test
abc365_e — Xor Sigma Problem [1544] AC
abc293_g — Triple Index [2037] AC
abc258_g — Triangle [2181] AC
abc342_g — Retroactive Range Chmax [2305] WA
abc362_g — Count Substring Query [1940] TLE
1891F - A Growing Tree [2000] WA

ARC AB Problems

D2C+ (Non-standard) problems

1904D1 - Set To Max (Easy Version) [1600] WA
1904C - Array Game [1400] WA
1967A - Permutation Counting [1400] WA
1965A - Everything Nim [1400] WA
1943A - MEX Game 1 [1300] WA
1893A - Anonymous Informant [1400] WA
agc051_a — Dodecagon [1842] WA
agc043_a — Range Flip Find Route [1417] WA
the famous 1188B - Count Pairs [2300] WA

Conclusion

Overview

gpt o1-mini is very good and faster than humans at some standard problems ≤ atcoder blue, these types of problems can't appear in contests if we want fair contests where AI doesnt determine ranking
gpt o1-mini is very good at most easy "adhoc" problems like D2AB and is faster than or as fast as tourist at these problems, I'm not certain about how to make D2A's that ai cant solve and D2B has limited options too.
the AI struggles a lot on harder "adhoc" problems , especially multistep problems where the steps aren't standard
Div1 people are completely unaffected (for now at least). Some extra care needs to go into choosing d2Bs such that O1-mini can't solve them, but otherwise Div2 contests are also mostly unaffected.

D2AB Ad-Hoc problems

It can "solve" most of the Div2AB level problems in under a minute. Why do i double quote around "solve"? Often, even when the AI can't really reason the solution, if the solution is simple enough like $$$min(a, b)$$$, it gets there by brute forcing.

It is hard to create problems which are suitable for such positions and still not solvable by beginners, but I don't believe it is impossible. We have 3 examples listed above ($$$1$$$ d2A and $$$2$$$ d2B), which are all constructive problems (and the construction isn't as simple as print $$$1, 2, ... n$$$). Constructive problems are one of the weaknesses of the AI, or more generally, problems where the ad-hoc part cannot be reached easily through bruteforcing.

It also failed arc183_a — Median of Good Sequences and 1943A - MEX Game 1, which are easy enough to be D2Bs.

$$$2$$$ tips in particular for easy problems i would like to stress on for problemsetters :

Don't have them be a stupid (even if it is cute) result. For example, output $$$a + b$$$ or print $$$1, 2, .. n$$$.
Try to have problems where genuine insight can't be easily replaced by guessing.

I know these are hard to do, especially for D2A. D2B might still be possible....and it is important if we try to make it as fair as possible.

Another option is to remove D2A and start at D2B (with registered => rated) with a less steep difficulty progress over the contest. There won't be a big difference for most participants however maybe completely new beginners will feel bad however if they solve no problems.

Standard Contests (ABC/Div3/Div4)

AI can perform exceedingly well in such contests. It is trained on infinite problems, and can recognize techniques with ease. For, example it solved the MO's problem in only $$$27$$$ seconds, while it took minutes for several Ad hoc problems (even when it managed to solve it)

These contests simply aren't fair in the current format. Even if you ban AI, i can always just understand and rewrite it's solution. I strongly urge one of the following $$$3$$$ actions :

Remove them
Have them unrated and solely for practice
Rework them to not have so many Standard problems

$$$2$$$-nd option probably isn't very feasible, and you may not want to do the $$$1$$$-st. Thus, I suggest the $$$3$$$-rd.

Div2C+ Ad-hoc Problems

These problems generally need multi steps, and thus, the AI has a very low probability to solve it. It failed each and every one of the last $$$5$$$ D1As under 1400 rating.

Problems requiring some thinking which can't be easily replaced by brute-forcing solutions is the way to beat it, along with having multistep problems. With every non-trivial step (or even a trivial step given how the AI works) reduces the probability for the AI to solve the problem exponentially.

A meta shift needs to happen in the recent future, where we have to solely focus on making multistep problems with several thinking non-trivial steps. However, we are not there yet.

Rating Approximations

(These are purely approximations from the data we have, and is my personal opinion. They could be more accurate with more data but unfortunately we only have limited queries.)

ABC, Div3, Div4 contests : [1900+] according to CF rating
Div2 contests : [1500]
ARC : [1300] according to CF rating
Div1 & AGC : -

Final Remarks

We are mostly fine. Some amount of care needs to go to build easy problems which AI can't solve but other than that not majorly affected. Banning AI probably is the right move now however.

In the future, I don't see this type of "generate $$$10^9$$$ solutions and find the right one" AI being good at solving harder problems, especially if we try to counter it. An AI that can actually reason is much more dangerous imo...

Комментарии (49)

Показать архивные | Написать комментарий?

Wandoka

20 месяцев назад, скрыть # |

← Rev. 2 →

+61

I really want to see a codeforces account, that will be run by AI. I am still salty about that Deep mind post, really wanted to see something like this in action

→ Ответить

Milil

17 месяцев назад, скрыть # ^ |

← Rev. 3 →

Yeah, they mentioned that they have 3 versions of that pesky o1 model — mini, standard (or something like that) and "fine-tuned for IOI", the latter by their words is kinda cool at solving problems and can reach around 1700 CF rating or so, but it's too soon to say that because it could have achieved that by solving simple problems very quickly and not solving slightly harder ones at all. Interesting that they never shared the account they tested their model with.

DNR

-45

I don't think completely changing the competitive programming meta (which one would have to do as AI keeps getting better) just so that it can remain "competitive" is a good thing to do. A lot of people engage in cp primarily because they enjoy solving problems (including "standard" ones), with contests being a secondary motivation.

We should perhaps instead just form some "trusted communities" consisting of people who we know would not use AI assistance during contests and only compete within these communities (we can have seperate browser extensions/discord bots to view our relative ranks/performances within these communities).

20 месяцев назад, скрыть # ^ |

-86

in the short term, we can take drastic measures like not providing samples for problems and not giving feedback for submissions until the end of the contest to prevent random gpt greys from conquering the leaderboards

safarisoul

Glad to know that I'm still useful for non-standard problems.
I guess an open door for AI to participate as AI could help prevent AI to "pretend" human to participate.

BiggestOtaku

+16

If AI wants to practice they can just do it after the contest

nik_Um

Or unrated

0mar

Did you test o1-preview which is supposed to have a rating of 1258 or o1-mini which has a rating of 1600+? I didn’t know that o1-mini was available now.Edit: nvm it is available.

fippo

I've checked the Div2 C problem from one of latest contests (https://mirror.codeforces.com/contest/2007/problem/C) on o1-preview, and it solved it from the second attempt (giving hidden test as an example). So, at least some of the problems Div2C+ are in danger. Probably need to ensure task is not solvable by AI before contests.

Dominater069

+13

I dont disagree that certain C are obviously at risk but that example in particular is because it was a knowledge problem about bezouts thereom

The performance of such AI is directly correlated with

standardness of the problem
how easy the final solution is

This problem fulfills both

rangerscowboys

+11

I have the feeling that the AI doesn't do it well the first time, but if we keep querying it, it will get AC. (maybe you need to clarify what the problem is stating, which cheaters could unfortunately do)

triple__a

I am curious how codeforces will react to such "revolution". Maybe it is time to reject problems standard enough that can be solved by AI?

TheScrasse

+43

We already used to reject div2A problems solvable by ChatGPT.

Wind_Eagle

Yes, but what shall we do when all D2A and D2B will be solvable by AI? Cancel div2 rounds? :)

+66

I think it's up to Mike and KAN... Personally, I claim removing D2A and D2B from Codeforces rounds will happen with high probability in the next few months.

tickbird

at this point cf may consider stop rating contest for div3/4 I think.

it really become who prompt fast basically..

tfg

+347

I see only one solution. Make more div1 contests!

future_2100

+36

Purpose of participating in contests on Codeforces:-

A — Enhance problem-solving skills. B — Rating

If your answer is option A, then don’t be afraid of AI.

TwentyOneHundredOrBust

+38

This blog probably isn't representative of the true skill of the AI. This is just individual problems being asked just one time, and we're not sure what the prompts look like. They claimed the 1650-1800 ratings were made in conditions similar to in-contest. Presumably that means that the models were thinking for the entire 2 hours and generating multiple codes, brute force tests etc., similar to what was done for the IOI evaluation; they also said that the submissions were based on consensus of the best generated codes.

Don't forget that after the AI was given 10k submissions per IOI problem it went from just below bronze to a gold medal. It seems that generating a lot of solutions is a pretty decent strategy.

+29

The predicted ratings in this blog dont differ widely from the ratings given by open AI unless they solely took part in div1s, so i think its fair to say it wont improve much further with just allowing few more submissions (they allowed upto 10 i think)

Unfortunately testing more submissions is hard due to the limits

As for the 10k stat, i find that a very weird stat to quote. Its like saying in 1 out of 1000 times, an expert will perform at a grand master level out of random chance. Yes and?

+26

It would improve greatly with 100x thinking time. Did you see the graph of AIME score to compute power used? It has a couple submits but potentially generates hundreds of candidates from which it chooses the best ones to submit.

For the 10k stat, it's extremely useful. It means that if you have enough computational power you can pretty much buy red thinking power. It's not like the expert randomly becoming red because the AI generating submissions is much more deterministic. As compute increases, all you have to do to solve a practical unsolved problem may just be writing really good tests.

It would improve greatly with 100x thinking time

I think I agree with more time but I dont think thats shown by the 10k submissions performance in IOI. I also disagree with the measure greatly (depending on the definition of greatly)

It is not a computationally hard task to stress test 10k submissions (at least for most problems, most stresses take 0.01s?), so if the correct solution did exist in the first 10k submissions, it should be easily found

The thing that will actually improve with more time is simply being able to try more solutions.

mt19937

10k stat seems like an ugly strategy, but it doesn't matter (to me). If reasoning in such contests can be crunched down by "dumb" computation. It still seems like a big deal to me, You cannot discredit it for being ugly.
This is something AI did with chess.

redpanda

For the prompts i copypasted them into gpt 4o and asked it to format it nicely then manually checked if it was correct compared to the actual statement on the online judge. After that I put it into gpt o1-mini either wrote "please solve in c++" or "please solve this competitive programming problem in c++" then copypasted the nicely formatted statement below.

amsen

In the case of their performance in the IOI contest. I believe we should take a look at their solutions but Open(ironic)AI is not good at sharing things.

My guess is in IOI, as a human you have to decide whether to take time and solve each subtask and append it to your solutions, and get part of the score, or solve the bigger subtask (or the main problem). The decision is important because we code slowly and we make bugs. AI codes fast. And if made bugs, it has infinite submission to fix the bug (in the case of gold medal).

Based on radoslav11's comment, the AI only did one task completely.

In conclusion, I claim two things:

IOI, because of the existence of subtasks and problems like that are optimizable is the most convenient contest for AI to do.
Any CM person, with a text -> code AI (say github Co-pilot) and infinite submissions (even 100 submissions) can achieve a gold medal in at least this year's IOI.

As we can see, AI does much worse in not optimizable contests like codeforces and I believe the ultimate test to check whether we are cooked or not is the ICPC world-finals (hoping there are less heavy code less thinking problems).

shorya1835

+32

If i am being really honest, from my experience, most of the models that have been developed in the past few years have only gotten better at standard problems similar to the ones in its training data or identifying basic math/logic when it comes to any reasoning task, even then the logic and observations are usually something very similar to what it has already seen in its training data. I think we are still very far from a model that can solve recent D2C's with even 50 percent accuracy, simply because these models are very bad at any form of reasoning ability and the only real progress is there due to better data representation/memorization with better quality data, training methods and bigger models.

aviralarpan3301

it solved C and E1 in last div2. they are 1800 and 2100 rated on clist respectively. it was also solving D and the coordinators had to change it.

Source?

https://mirror.codeforces.com/contest/2005/submission/281146546

https://mirror.codeforces.com/contest/2005/submission/281156553

Just looked at the problems, i don't think those problems are 1800 and 2100 rated, clist ratings are off a lot of times and also for E1 the higher rating is due to the position, both are pretty simple dp problems (atmost 1700), but still its more than i expected the model to be able to do. It seems the C solution got hacked but it was able to solve E1, i liked the analysis dominator gave but i think there is a need to check how the models does with better prompts. A lot of times people are able to semi solve the problem by making a few observations, a back and forth with the AI using both the prompter's and the AI's observations might achieve better results than expected on non standard problems.

What about C and E1 of round 972? They are rated 1800 and 2100 on clist respectively.

Refer to the section on standard problems :)

indian_rounds_LULW

-8

It seems that AI also managed to solve problem D, keep in mind that the problem literally had to be made harder minutes before the contest just so AI would not be able to solve it. And yes, all that guys submissions are the work of an AI. Thoughts? And please don't tell me to "refer to the section on standard problems".

I don't think he used it to solve D. I though so at first, but you can see that this AI submission of D which got WA is quite different. https://mirror.codeforces.com/contest/2005/submission/281197540

here you go, i collected some data for you (most are last few contests d2C, but unfortunately somehow last few div2s excluding the last one have been actually good at C-D level, so i went back a bit more to make my point)

All the below problems have been asked to chatgpt once only, and i haven't tried to fix any of its bug (i am sorry it already tested my patience as it is, you can try if you want)

I am listing problems sorted by their rating, how I feel about them and chatgpt's result

2003C - Turtle and Good Pairs [1200] : Very nice problem, only issue is kinda guessable. O1 : WA on test 1
1979C - Earning on Bets [1200] : Again a very nice problem, liked very much. O1 : "finished thinking"
1993C - Light Switches [1400] : Boring and standard optimization. O1 : AC
2007C - Dora and C++ [1500] : Bad problem, hardest part is knowledge/guessing bezout's thereom. O1 : AC
2001C - Guess The Tree [1500] : Really enjoyable and good problem. O1 : "finished thinking"
1982D - Beauty of the mountains [1700] : Yet another bad knowledge problem, again on bezouts thereom (+2d prefix sums if you want to count that). O1 : AC
1995C - Squaring [1800] : solution is obvious, only part is handling overflows. I dont like that. O1 : WA on sample (disappointed tbh, hoped it would solve it, it got close but it couldn't handle the overflow correctly)
2005C - Lazy Narek [1800] : Why not mention this too? A very straight forward standard exercise on dynamic programming. O1 : AC (as we all know)

Certainly a very small dataset, i apologize for that but I am lazy. If you want, you can ask me my opinion about other problems and then feed them to O1 and check.

I think it's fairly obvious that O1 performs much better on standard problems. I don't claim that it absolutely cannot solve the problems which i mentioned as good above. Maybe it can with the help of more prompts. The point is to demonstrate how much superior it is in stupid standard problems.

Set stupid problems, win stupid prizes....Last contest was terrible from my perspective : B2, C, D, E1 are all bad problems. A and E2 are above average and B1 is good. For months, I have complained about lower divisions having standard problems, which should not appear in rated contests. O1 is only proving me right....

My codechef rounds

Ofcourse since i am judging other people, I should also self-reflect. Let us take a look at my last codechef round ( https://www.codechef.com/START150A )

Equal Pairs (Hard) [d2B level] : Solved by AI. It is somewhat standard, but more just easy imo. Either way, I agree it is a below average problem.
Replacing Game [d2C level] : WA by AI. I really liked this problem (I didn't author it)
Minimum Operations (Easy version) [d2C.5 level] : TLE by AI. Honestly, I don't like this problem, I only used it to try to balance the gap between 1 and 2 (guess how that turned out)
Minimum Operations (Hard version) [d2D level] : Not solved by AI.

I won't claim to be perfect, and you can surely find bad examples of problems by me in codechef.

I'm curious why you don't like 2007C — Dora and C++. I know it's a standard Bezout's theorem problem, but don't you need to first build some intuition or proof that you can apply it despite not being able to directly subtract. I personally liked it, but maybe that's because I haven't solved many Bezout's theorems (only the two on your list lol) problems so maybe that's a standard idea and I'm just unaware. It also seems like you mainly like adhoc problems. Would you agree with that characterization?

Update:

Also just solved 1979C — Earning on Bets. I personally really disliked the problem. The only reason its rated 1200 is because a large (maybe more than 50%?) of the people that submitted the solution completely guessed it. What are your thoughts?

but don't you need to first build some intuition or proof that you can apply it despite not being able to directly subtract

IMO fairly trivial compared to the difficulty of knowing bezout's. I only came to know bezout's thereom at like 2100.

It also seems like you mainly like adhoc problems. Would you agree with that characterization?

Absolutely

large (maybe more than 50%?) of the people that submitted the solution completely guessed it.

I find the proof very easy...check out my solution to it in this blog https://mirror.codeforces.com/blog/entry/133289. In general, I dont mind guessable problems if they have a simple proof as well (where simple is decided relative to difficulty)

Just read your proof, which is very well written and actually does make me appreciate the problem a bit more.

I still think that it should be rated higher than 1200, and is therefore misleading to use for your AI point.

Why its rated at only 1200

i didnt think it was that standard but i am weak so......

C, D, E1 are all insta solves to me because they have the same repititve ideas (direct dp, log times gcd changes, direct win loss dp + easy optimization). Here insta means < 2 mins after understanding problem.

Before you claim it is because I am an IGM, it is not true. There are several examples of me taking way too long on simple problems, the most prominent being in Div1/combined rounds (since the quality of problems is higher on average)

yeah i get the point.

some of these ideas are difficult to spot for me, but they are standard

practise :) Eventually, you will become good at it.

yesyesyesyes

abcsumits

Yesyesyes

Rohak

The only way to not die in this era is to grind harder than ever to reach atleast CM so the AI solutions dont affect you.

culver0412

+14

OMG My problem beat AI (1816A - Ian Visits Mary)

BedwarKing

Have you tried to copy the problem statement and tell the AI to use the problem's algorithm.

If haven't, I wish you can test it out and make another blog to illustrate the result that AI give. Thank you!!

yellow_13

An AI that can actually reason is much more dangerous imo

Want to know your thoughts on this, do you think it is likely to happen? If yes then how long do you think it'll take?

Also just a side question, I didn't really read much about the AI itself, so I just wanted to know, how are we so sure that it's not using some kind of reasoning already? Like how do we know that it's currently just generating $$$10^9$$$ solutions and finding the right one, is this mentioned somewhere in some sort of release notes for the AI?

17 месяцев назад, скрыть # |

can someone do this with o1 now that it is out?

Блог пользователя Dominater069

D2ABs

Standard problems

ARC AB Problems

D2C+ (Non-standard) problems

Conclusion

Overview

D2AB Ad-Hoc problems

Standard Contests (ABC/Div3/Div4)

Div2C+ Ad-hoc Problems

Rating Approximations

Final Remarks

Conclusion

My codechef rounds