sberens's blog

By sberens, history, 5 hours ago, In English

From https://openai.com/index/learning-to-reason-with-llms/:

We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.

For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.

Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors.
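
Purely as an illustration of what a test-time selection strategy of this shape might look like, here is a minimal Python sketch. The blend weights and field names are invented; OpenAI has not published its actual implementation:

    import random
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        code: str
        public_score: float     # fraction of IOI public tests passed
        generated_score: float  # fraction of model-generated tests passed
        learned_score: float    # output of a learned scoring function, in [0, 1]

    def select_submissions(candidates, budget=50,
                           w_public=0.5, w_generated=0.3, w_learned=0.2):
        """Rank candidates by a weighted blend of the three signals the post
        mentions, then keep the top `budget`. The weights are made up."""
        blended = lambda c: (w_public * c.public_score
                             + w_generated * c.generated_score
                             + w_learned * c.learned_score)
        return sorted(candidates, key=blended, reverse=True)[:budget]

    def random_baseline(candidates, budget=50):
        """The comparison baseline from the post: submit 50 at random
        (reported to score ~156 points on average vs. 213 with selection)."""
        return random.sample(candidates, min(budget, len(candidates)))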

  Vote: I like it +94 Vote: I do not like it

»
5 hours ago, # |
  Vote: I like it +9 Vote: I do not like it

This is so much more impressive than what I thought would be achieved even in the next four years.

  • »
    »
    4 hours ago, # ^ |
    Rev. 2   Vote: I like it +3 Vote: I do not like it

    So, to make it clear which models have which ratings:

    https://openai.com/index/learning-to-reason-with-llms/

    https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/

    • The strongest model, o1-ioi, was fine-tuned on competitive programming problems. 1807 rating. Not available to the public at the moment.

    • The next strongest model is o1, with a 1673 rating. Not available to the public.

    • o1-mini is publicly available, with a 1650 rating.

    • o1-preview is publicly available, with a 1258 rating.

    The naming is kind of confusing, because you would expect o1-preview to be stronger than o1-mini, but o1-mini is actually the stronger of the two public models.

»
5 hours ago, # |
  Vote: I like it +19 Vote: I do not like it

Time to massively overhaul the cheating detector... or prepare for some insane rating inflation.

  • »
    »
    5 hours ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I don't think using AI is even considered cheating currently...

    • »
      »
      »
      5 hours ago, # ^ |
        Vote: I like it +6 Vote: I do not like it

      Yeah, I'm thinking it's not just about having the AI literally spit out the full solution.

      "Hey ChatGPT, here's a problem statement, what algorithm should I use to solve it?" would allow a (current rating system) 1000-level to solve 1800-level problems. It might even allow a 2000-level to solve 2800-level problems by reducing the search space — even if an AI's rating is 1800, it can still help a human do better on harder problems. Of course, it's 1800 now, maybe it will be 4800 soon?

      And there's really nothing we can do about it; it will just force programming contests to evolve in some way, with some possibility that online contests become completely useless. (But we are not there yet.)

»
5 hours ago, # |
  Vote: I like it +11 Vote: I do not like it

i guess my 1600 is not so special anymore

  • »
    »
    4 hours ago, # ^ |
      Vote: I like it +9 Vote: I do not like it

    It never was, frankly... neither is mine :/

    • »
      »
      »
      4 hours ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      at least it felt good achieving a blue rank back then :(

»
4 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

The article says "we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users", so the 1258-rated version is coming soon.

»
4 hours ago, # |
  Vote: I like it +83 Vote: I do not like it

I can imagine a day when I wake up, only to read the news that I basically have a lower rating than a bot.

  • »
    »
    4 hours ago, # ^ |
      Vote: I like it +33 Vote: I do not like it

    i already woke up to this

  • »
    »
    4 hours ago, # ^ |
      Vote: I like it +28 Vote: I do not like it

    This happened to me before bots existed.

  • »
    »
    4 hours ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    That already happened in chess long ago. Before that, no one thought a bot could reach GM level. It will eventually come to CP too; it's just a matter of time.

  • »
    »
    3 hours ago, # ^ |
      Vote: I like it +4 Vote: I do not like it

    this happened to me today :(

»
4 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

This is concerning

»
4 hours ago, # |
  Vote: I like it +6 Vote: I do not like it

We are all doomed :/

  • »
    »
    4 hours ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    It will not take away your ability to enjoy solving problems.

    • »
      »
      »
      4 hours ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      But no one enjoys solving problems when they're starving... hehe, just kidding... or..?

»
4 hours ago, # |
  Vote: I like it +31 Vote: I do not like it

well, fuck

»
4 hours ago, # |
Rev. 3   Vote: I like it 0 Vote: I do not like it

Maybe o1-ioi's handle is Scripted1234?

  • »
    »
    3 hours ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    No, this seems to be a genuine profile, unless they used his name.

»
4 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

So my dream of becoming an Expert is gone?

»
4 hours ago, # |
  Vote: I like it +49 Vote: I do not like it

It would've been great to get more information about the IOI performance: especially how the 362.14 score was distributed, how the code looked, and what kind of training was put on top of o1 to produce o1-ioi (and whether that was unbiased).

Maybe I am wrong, but even with an unlimited number of submissions, I find gold at the IOI extremely impressive compared to 1800 on Codeforces.

  • »
    »
    3 hours ago, # ^ |
      Vote: I like it +24 Vote: I do not like it

    "and whether that was unbiased"

    We will find out soon. Hacker Cup starts in 7 days and allows LLM submissions, so we no longer need a custom environment that may or may not be biased.

    "I find gold at the IOI extremely impressive compared to 1800 on Codeforces."

    Agreed.

    • »
      »
      »
      67 minutes ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      It would be amazing if they participated in Hacker Cup's AI track, but I highly doubt they would do that. By their own metrics, the public o1 models aren't particularly impressive at competitive programming.

  • »
    »
    3 hours ago, # ^ |
      Vote: I like it +7 Vote: I do not like it

    I agree. Also, I wish OpenAI were more transparent (it has 'open' in the name...). How are we living in a world where Mark Zuckerberg is releasing the best open-source AI models? :))

»
3 hours ago, # |
  Vote: I like it -7 Vote: I do not like it

This model is constrained by inference cost, so this rating should be viewed as a lower bound on what's possible. It's highly likely that iterating for longer during reasoning will produce even better results. It would be fun to see an actual paper showing whether that's true and, if so, what the diminishing returns on reasoning time look like.
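
For illustration, here is a toy model assuming score grows logarithmically with reasoning compute; the constants are made up, not fit to any real data:

    import math

    def toy_score(compute_units, base=1000.0, gain_per_doubling=120.0):
        """Hypothetical scaling curve: a rating-like score that grows
        logarithmically with reasoning compute. Both constants are invented."""
        return base + gain_per_doubling * math.log2(compute_units)

    for c in [1, 2, 4, 8, 16, 32]:
        # Each doubling adds a flat +120, so going 16x -> 32x costs 16 units
        # of compute for the same gain that 1x -> 2x bought with one unit.
        print(f"{c:2d}x compute -> score {toy_score(c):.0f}")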

»
3 hours ago, # |
Rev. 2   Vote: I like it +1 Vote: I do not like it

Sadly, the day is not far off when a bot will overtake the Masters. More worrying is that these models (o1-mini) are available to the public.

img source

»
3 hours ago, # |
  Vote: I like it +14 Vote: I do not like it

training on test is all you need!

»
3 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

Are the CF contest submissions public?

»
3 hours ago, # |
  Vote: I like it -20 Vote: I do not like it

I would say that the Google model that got a silver medal at the IMO was much more impressive than this.

  • »
    »
    2 hours ago, # ^ |
      Vote: I like it +15 Vote: I do not like it

    Well, it abused the modern ways of solving geometry (Wu's method). As far as I remember, the model is still not capable of solving any IMO-level combinatorics task.

  • »
    »
    102 minutes ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Just you wait until they become LGMs (maybe even Tourists, who knows).

  • »
    »
    72 minutes ago, # ^ |
    Rev. 2   Vote: I like it +15 Vote: I do not like it

    Couldn't disagree more.

    There's a long history of automated theorem proving and automatic ways to solve geometry. 2D geometry in particular is very uninteresting, and I don't think it's particularly surprising that computers can solve it. I'm sure if you went through past IMOs, there were tools that could solve some problems long before these models. It is impressive that they scored 4/6, but the two problems they didn't solve were specifically the combinatorics ones, which I would argue require more interesting argumentation rather than algebraic bashing.

    For competitive programming tasks, I hadn't seen anything that came close to being coherent prior to this past year. I remember efforts to design such an AI advertised here on CF as far back as 10 years ago, but solving any IOI problem was a ridiculous idea.

»
3 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

Has anyone run a test to confirm that the public o1-mini can actually solve problems up to 1650 rating, as claimed?

  • »
    »
    43 minutes ago, # ^ |
      Vote: I like it +2 Vote: I do not like it

    Yes, it's impressive (although I'm not sure whether it has access to the editorials).

    I gave it five 1600-rated problems (2004D, 1996E, 1994C, 1985G, 1980E), and it solved three out of the five (submission links: 2004D, 1996E, 1985G) in one try (after using C++).

    Here's the conversation link

»
2 hours ago, # |
  Vote: I like it +6 Vote: I do not like it

Does anyone know where to buy cows? I'm thinking of starting a farm before cow prices skyrocket due to GPT making us redundant.

»
2 hours ago, # |
Rev. 3   Vote: I like it +26 Vote: I do not like it

This is crazy. Maybe it's the end of the online competitive programming era.

Some users may argue this is not a big deal, using Go as an example. In Go, an AI beat the world No. 1 player Ke Jie decisively in 2017. Did Go die? No, because in real competitions, players cannot use AI.

However, on Codeforces, everyone competes remotely, which means you don't know whether someone is cheating or using AI. Imagine you are an Expert: in a Div. 2 round you solve A and B in 15 minutes, then discover you are ranked 10000+, because many, many newbies just copy the problem statement, paste it into ChatGPT, and solve up to D, with both C and D having 10000+ accepted solutions. How would you feel?

It's as if you were a top marathon runner and a crowd of cheaters rode motorcycles to finish the race in 30 minutes, breaking the world record, and the committee simply declared their results valid, ranking you behind them.

I can solve the problems just for fun, but cheaters will not. And it is not fun if cheaters beat me every time I participate and make my rating drop from Master to Newbie.

It's time for the Codeforces officials to do something to handle this situation. I am serious.

  • »
    »
    105 minutes ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    While I agree with the general sentiment, isn't this similar to the situation in online strategy games like chess? Any player can quite easily cheat: all it takes is using a few engine moves when needed, looking at opening prep, or even just having access to the computer evaluation. You could even argue that as chess engines have gotten better, online cheating hasn't increased all that much (I'm not sure if there are statistics that disprove this).

    The main reason is that everyone knows you can cheat on chess.com or lichess, so your chess.com and lichess ratings hold very little value. There is not much incentive to cheat because there isn't much value in the rating itself.

    With that being said, I just realized that there is a concrete difference between chess and CP. In chess, there are a lot of official FIDE tournaments where your rating actually matters. There isn't really an equivalent for CP; the only equivalents are in-person contests like ICPC and IOI, which are pretty hard to get to or compete in and are restricted to specific age brackets.

    • »
      »
      »
      88 minutes ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      I totally agree with your last paragraph.

      You know, in LeetCode contests the cheating is much more severe, because recruiters in some countries know LeetCode, and if you get a high rank there (for example, the Guardian badge), you are more likely to pass resume screening. In chess, even a very high ranking will not get you a job.

      And in-person programming contests are very sparse, and there are none for people who are already working. Nearly all contests are for high school (IOI) or undergraduate (ICPC) students; once you are past that stage, you can only participate in the Meta Hacker Cup for fun. I really hope in-person contests become available in the future, especially for all ages.

  • »
    »
    104 minutes ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    But is there really some way to handle this situation? Any suggestions?

  • »
    »
    84 minutes ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    The problem with current cheating is that it happens in closed channels: the shared information is not public, which puts other participants at a disadvantage. LLMs are public and can be used by everyone as just another tool, like an IDE, plugins, or prewritten algorithms, so that's fair.

  • »
    »
    81 minute(s) ago, # ^ |
      Vote: I like it +5 Vote: I do not like it

    If the model is public, what if authors run their problem ideas through GPT until they find ones it can't solve? This might reduce the number of rounds and slightly increase the average difficulty, but the integrity of contests would be preserved.
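
    A rough sketch of what that vetting loop could look like, assuming the official openai Python package; judge() and the problem list are hypothetical placeholders:

        from openai import OpenAI  # assumes the official openai Python package

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def judge(cpp_code: str) -> bool:
            """Hypothetical placeholder: compile the code and run it against
            the author's tests, returning True if it passes."""
            return False  # replace with a real compile-and-test harness

        def model_solves(statement: str, attempts: int = 3) -> bool:
            """Give the public model a few tries at the problem."""
            for _ in range(attempts):
                resp = client.chat.completions.create(
                    model="o1-mini",  # the public model discussed in this thread
                    messages=[{"role": "user",
                               "content": "Solve this problem in C++:\n" + statement}],
                )
                if judge(resp.choices[0].message.content):
                    return True
            return False

        problem_ideas = ["<problem statement 1>", "<problem statement 2>"]
        usable = [p for p in problem_ideas if not model_solves(p)]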

    • »
      »
      »
      68 minutes ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      This will help in the current phase, while ChatGPT still has limitations. That is, make sure ChatGPT at least cannot solve the problem within the first few attempts, even with guidance.

      However, if an AI becomes as strong as tourist, such problems will not exist, and if they do exist, they will be 3000+ problems.

      • »
        »
        »
        »
        54 minutes ago, # ^ |
          Vote: I like it +3 Vote: I do not like it

        I think a lot of the world's problems could be solved if we had an infinite number of tourists. The death of CP would be a small price to pay.

        • »
          »
          »
          »
          »
          38 minutes ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Yes, just move on to another hobby if that day comes. Life has more wonderful things to offer.

»
2 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

I'm also not sure what they mean by a "simulation" of a contest. Did o1 participate in the latest contests, with unique problems?

Did they make sure that the Codeforces problemset and solutions are not in the training data?

  • »
    »
    1 minute ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    That's the real question: did they have the editorials in the training set?

»
2 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

A point I noticed while reading the blog: the claim that the model is in the 93rd percentile is based on a CF blog (linked in the references) that is now 5 years old. I wonder whether anything would change given the current rating distribution.
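
For what it's worth, the percentile is easy to recheck against today's distribution using the public Codeforces API; a quick sketch using the user.ratedList method:

    import requests

    # Fetch all active rated users from the public Codeforces API
    # (https://codeforces.com/apiHelp).
    resp = requests.get("https://codeforces.com/api/user.ratedList",
                        params={"activeOnly": "true"})
    users = resp.json()["result"]
    ratings = [u["rating"] for u in users if "rating" in u]

    target = 1807  # the reported rating of o1-ioi
    pct = 100.0 * sum(r < target for r in ratings) / len(ratings)
    print(f"{target} is above {pct:.1f}% of active rated users today")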

»
116 minutes ago, # |
  Vote: I like it 0 Vote: I do not like it

Y'all gotta test it in the next contest to see whether the 1600 rating is real or inflated by a leaked test set; the 1600-Elo version is publicly available, and we can be sure the new problems won't have leaked into the training data.

»
106 minutes ago, # |
  Vote: I like it 0 Vote: I do not like it

Oh mother fu--

Well, it's personal now

»
105 minutes ago, # |
  Vote: I like it 0 Vote: I do not like it

A rating of 800 for GPT-4o (the current version) seems quite accurate. If o1 becomes accessible to the public, I think it will fundamentally change competition below Div. 1; if everyone uses o1, then blue will be the new gray.

»
93 minutes ago, # |
  Vote: I like it +6 Vote: I do not like it

I'm still skeptical. I will be a full believer when they demonstrate it consistently performing at the 1800 level in live contests.

»
84 minutes ago, # |
  Vote: I like it 0 Vote: I do not like it

Frick, I'm 2 points ahead of o1-ioi. Better step my game up.

»
48 minutes ago, # |
  Vote: I like it 0 Vote: I do not like it

Now I have to email organizers, correct?

»
26 minutes ago, # |
  Vote: I like it +1 Vote: I do not like it

The death knell for Codeforces has already sounded.