I must say that I have no idea how exactly OpenAI tested the o1 model on IOI and Codeforces contests. This framework may not work, or they may have already tried it.
Here are some facts:
1. o1 performs relatively poorly at IOI with 50 tries per problem.
2. o1 achieves an IOI gold medal with 10000 tries per problem.
3. o1 only achieves a 1600+ rating on Codeforces (far from IOI gold medal level).
4. According to a community survey (https://mirror.codeforces.com/blog/entry/133887), o1 can solve a very hard problem (2700) but also fails some very easy problems (800).
5. Codeforces's rules prohibit o1 from having too many tries.
Facts 4 and 5 may be the reason why o1 only achieves 1600 on Codeforces. The difference between IOI gold and a 1600 rating is that IOI rules provide no-cost validation, so the final score is the maximum over all tries.
I believe OpenAI didn't pay much attention to overcoming Codeforces's submission limit. They may also have simply generated 50 or 10000 programs independently. For now this keeps the potential of AI cheating suppressed, but it could soon become a threat to higher-rated players.
The point is: is there a way to validate each piece of code without submitting it? Yes.
Any well-trained CPers / OIers will readily recall their practice in contests where participants can only submit once. They write a pretest generator, a correct but slow brute-force solution, and their final solution. Then they keep comparing the outputs of the two solutions on generated tests until either a difference shows up or a large batch of tests has passed.
Brute force is always easier to write, and some extremely slow brute-force solutions, such as exponential algorithms, can hardly be wrong. Solving problems iteratively is a common experience for all of us.
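The comparison loop described above can be sketched in a few lines. This is only a minimal illustration under assumptions: the problem (maximum subarray sum) and all function names are hypothetical choices for the example, not anything OpenAI or Codeforces uses.

```python
import random

def brute_force(a):
    # Deliberately slow but hard to get wrong: try every non-empty subarray.
    return max(sum(a[i:j]) for i in range(len(a)) for j in range(i + 1, len(a) + 1))

def fast_solution(a):
    # Candidate under test: Kadane's algorithm, O(n).
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def buggy_solution(a):
    # A plausible wrong attempt: it silently allows the empty subarray.
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

def generate_test(rng, max_n=8, max_abs=10):
    # Small random inputs keep the brute force fast and counterexamples readable.
    n = rng.randint(1, max_n)
    return [rng.randint(-max_abs, max_abs) for _ in range(n)]

def stress_test(candidate, reference, rounds=1000, seed=0):
    # Return the first input on which the two solutions disagree, or None.
    rng = random.Random(seed)
    for _ in range(rounds):
        a = generate_test(rng)
        if candidate(a) != reference(a):
            return a
    return None
```

Calling `stress_test(fast_solution, brute_force)` finds no disagreement, while `stress_test(buggy_solution, brute_force)` returns a small failing input (an array of negative numbers) without a single submission being spent.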
So the simple framework works like this:
1. Generate an exponential solution and validate that it passes all given pretests.
2. Generate larger pretests and use the exponential solution to validate a newly generated n^2 solution.
3. ...
4. Generate full-scale pretests and use the previous fast solution to validate the final solution.
5. Submit.
If it gets stuck at step 2 for a long time, the exponential solution is probably wrong: generate a new one and ask for more human-made pretests. The validation process may consume a lot of time and should be accelerated with multithreading. Next-stage solutions can also be generated and validated in parallel.
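As a rough sketch of the staged framework, the code below chains three hand-written solutions for one example problem (longest strictly increasing subsequence) and validates each against the previously trusted one. Everything here is an assumption made for illustration: the problem, the names `run_pipeline`, `validate`, `STAGES` and `PRETESTS`, the test sizes, and the use of hand-written functions in place of model-generated programs.

```python
import random
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def lis_exponential(a):
    # Stage 1: check every subsequence, longest first. O(2^n), hard to get wrong.
    for k in range(len(a), 0, -1):
        for sub in combinations(a, k):
            if all(x < y for x, y in zip(sub, sub[1:])):
                return k
    return 0

def lis_quadratic(a):
    # Stage 2: classic O(n^2) dynamic programming.
    dp = [1] * len(a)
    for i in range(len(a)):
        for j in range(i):
            if a[j] < a[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp, default=0)

def lis_fast(a):
    # Final stage: O(n log n) with binary search over tail values.
    tails = []
    for x in a:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

def agree(candidate, reference, n, seed):
    # One random test of size n; True if both solutions give the same answer.
    rng = random.Random(seed)
    a = [rng.randint(1, 2 * n) for _ in range(n)]
    return candidate(a) == reference(a)

def validate(candidate, reference, n, rounds=30, workers=4):
    # Run the comparisons concurrently; True only if every round agrees.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(lambda s: agree(candidate, reference, n, s), range(rounds)))

def run_pipeline(stages, pretests):
    # stages: list of (solution, test_size), slowest solution first.
    trusted, _ = stages[0]
    if any(trusted(inp) != out for inp, out in pretests):
        return None  # step 1 failed: the slowest solution is already wrong
    for candidate, n in stages[1:]:
        if not validate(candidate, trusted, n):
            return None  # stuck: regenerate this stage (or the trusted one)
        trusted = candidate
    return trusted  # the solution that would be submitted

# Hypothetical given pretests and stage sizes, chosen so each reference stays fast.
PRETESTS = [([1, 3, 2, 4], 3), ([5, 4, 3], 1)]
STAGES = [(lis_exponential, 0), (lis_quadratic, 12), (lis_fast, 200)]

if __name__ == "__main__":
    result = run_pipeline(STAGES, PRETESTS)
    print(result.__name__ if result else "validation failed")
```

Threads are used here only to mirror the multithreading idea; in CPython they give little speedup for pure-Python work, so a real setup would more likely run compiled solutions as separate processes.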