piaoyun's blog

By piaoyun, history, 2 months ago, In English

I must say that I have no ideas about the details how OpenAI tested o1 model in IOI and Codeforces contests. This framework may not work or they have tried it.

Here are some facts:

  1. o1 performs relatively poor in IOI with 50 tries each.

  2. o1 achieves IOI Gold Medal with 10000 tries each.

  3. o1 only achieves 1600+ rating (far from IOI Gold Medal) on Codeforces.

  4. According to the survey by community (https://mirror.codeforces.com/blog/entry/133887), o1 can solve very hard problem (2700) but also fail some very easy problems (800)

  5. Codeforces's rule prohibit o1 from having too many tries.

4 and 5 may be the reason why o1 only achieve 1600 on Codeforces. The difference between IOI Gold and 1600 is, that IOI rules provide a no-cost validation so its final score is max(for each try).

I believe, OpenAI didn't pay much attention to how to conquer the submission limitation of Codeforces. They may also independently generate 50 or 10000 codes. Thus the potential of AI cheating is suppressed and can soon threat to higher rating players.

The point is, is there a way to validate each piece of code without submitting it? YE5.

Any well-trained CPers / OIers may easily come up with their practice in some contests where participants can only submit once. They write a pretest generator, a true but slow brute-force solution and their final solution. Keep comparing the results of both until after a bunch of tests there is a difference or not.

Brute-force is always easier to write, some extremely slow brute-force like exponential algorithms can hardly be wrong. Solving problems iteratively is the common experience of us.

So the simple framework works like this:

  1. generate and validate an exponential solution can pass all given pretests.

  2. generate larger pretest and use the exponential solution to validate newly generated n^2 solution.

...

  1. generate total scale pretest and use previous fast solution to validate final solution.

  2. submit

If it's stuck at step 2 for a long time. The exponential solution is wrong, generate a new one and ask for more human-made pretests. The validation process may consume much time and should be accelerated with multi-threads strategy. Also next stage solutions and be generated and validated parallel.

  • Vote: I like it
  • +14
  • Vote: I do not like it

»
2 months ago, # |
  Vote: I like it -8 Vote: I do not like it

There's rumors that they have collaborated with devin for o1. Wouldn't CPers from devin have considered this?

  • »
    »
    2 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Maybe. Because the only use of this self-validated framework is cheating better on Codeforces. The problem it only conquers is the submission limitation.

    The reason is:

    1. Brute-force result can hardly be integrated into Chain-of-Thought (I guess).

    2. All these things, generating pretests, comparing run on CPU and can't be merged with attention mechanism. It is more similar to AI-Agent and just doesn't exist in o1.

    My total point is. Of course they can come up with it. But it is not important for AGI. They may not care and save their precious time not to focus on such practical application.

»
2 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Won't work. o1-mini\preview has 30-50 queries per week limit for the 20$ subscription, with approach like that user will run out of queries or will have to use api, which is going to cost way more for nothing (except some imaginary rating points).

  • »
    »
    2 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    It would soon be cheaper just like other models. This is unscapable.

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +1 Vote: I do not like it

    Another practical possibility is Meta quickly follows up with new generation Llama-o1 which can be deployed locally.

    • »
      »
      »
      2 months ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      This is more realistic but even with quantization top llama model requires several gpus + probably it will require CoT prompting. Although if someone is able to pull off a setup like that I don't think it counts as "cheating" at this point. This is really hard to do and pretty impressive if someone achieves even a 1800 rating with this setup.

»
2 months ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

Long ago I started list of thinking strategies in my practice blog in "extra advice to think" section. The original use was to build a framework for something like this at a hackathon. I think branching prompts to attempt different strategies and combine them, as well as testing ideas from different strategies by generating a brute force then running and receiving feedback, can create a very powerful solver already out of o1 that works a lot like how a human can. Perhaps i will finally try an hard attempt soon, or someone else can do it better.

Just saw the comment above. While that is still a limitation, u can probably use gpt-o4 for a lot of individual one shot tests, then work only use a few prompts on overview and combination with o1.

  • »
    »
    2 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    cool!

  • »
    »
    2 months ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    I was very surprised in dominator's o1 test blog some of the problem o1 could not solve. I particularly tried out 1904C - Array Game myself to see where it went wrong, and while o1 is phenomenal at pattern matching to a standard pattern, at least for that problem it just got stuck trying to force particular path when it mistakenly thought it saw a pattern even when suggesting it should try something else (a lot like humans can do).

    However, I believe prompting gpt to force a fixed set of different strategy attempts on a problem and creating code to experiment and see results on the attempts it comes up with should be able to get over the previously explained hurtle for simple problems like this. Hopefully at least one prompt gets it down the right path, and o1 can combine the attempts and see which one is matching correctly.