ratherSimple1's blog

By ratherSimple1, history, 13 months ago, In English

Quoting directly from the technical report of Google's new AlphaCode 2 model,

"We found AlphaCode 2 solved 43% of these competition problems, a close to 2× improvement over the prior record-setting AlphaCode system, which solved 25%. Mapping this to competition rankings, we estimate that AlphaCode 2 sits at the 85th percentile on average – i.e. it performs better than 85% of entrants, ranking just between the ‘Expert’ and ‘Candidate Master’ categories on Codeforces"

What do you think it means for the future? If they figure out how to improve it like their AlphaGo/AlphaZero chess models, then we might have something magical on our hands pretty soon.

  • Vote: I like it
  • +237
  • Vote: I do not like it

»
13 months ago, # |
Rev. 11   Vote: I like it +129 Vote: I do not like it

Sounds interesting, though taking 10 attempts to solve 43% of all problems does not seem to be around the 85th percentile if you think about WA/TLE penalties (they did seem to take into account time penalties though).

Did they correct for test data leakage, since they mentioned that they fine-tuned on the Codeforces dataset? They mentioned that Gemini was the model that made all the difference, so I am curious as to whether that model was trained on code or editorials for the remaining problems (since it is easy to overfit on the tiniest of data when you have huge models).

Also curious which divisions the contests where AlphaCode 2 performed better than 99.5% of the participants were from. I would be really impressed if it could do well on the latter half of a Div. 1 contest without any test data leakage, since Div. 2/3 contests seem to have problems with more limited ideas than Div. 1 (especially Edu Rounds, which are supposed to be standard in some sense, and thus more "learnable" by machine learning models).

To be honest, sampling a million concrete code samples and sifting through them to make clusters (which is what they do) sounds like a really bad approach for a human or any algorithm that claims to have "learnt" something — at some point it just devolves into guessing and tuning your ranking functions. This is in the same vein as "generative" models not generating things based on reasoning, but based on some transfer distribution that comes from probabilities, which is a far cry from being verifiable. TL;DR — in its current form, it seems like proof by AC and that the models are not really learning anything in a rigorous sense. I don't expect any PAC-style estimates relating these models to mathematical rigor any time soon, but would be pleasantly surprised if there exist such estimates. Of course, the ideal scenario is to solve the IMO Grand Challenge.

Edit: there is also the AIMO Prize with a prize fund of $10^7.

Edit: Since this is the top comment, it is worth noting here that there is a strong chance of test data leakage, because of the following:

The AI solution advertised: https://mirror.codeforces.com/contest/1810/submission/234586634

Someone's solution: https://mirror.codeforces.com/contest/1810/submission/204262433

Someone else's solution: https://mirror.codeforces.com/contest/1810/submission/200078083

I'm told that the original code for the last of the above submissions is by a Chinese competitive programmer and the code was posted on a Chinese blog website (seems very plausible given that this happens for a lot of hard problems, though I do not have an exact link to validate the claim), so test data leakage from other websites is also possible.

I'm fairly sure there is a solution even closer to the AI's than the ones above, but the structure being this similar (even though other solutions with a distinct structure exist) already provides compelling evidence to start investigating whether these claims suffer from test data leakage.

I also looked into the other AI submissions (supposedly from their other alts), and they seem to be minor variations on top of the submissions above.

Obviously, the best way to see whether their model can successfully solve competitive programming problems is to test its performance in real contests.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    In the report they say that they tested on recent Div. 2 and combined Div. 1 + Div. 2 contests. If the model solves 43%, then roughly speaking it solves 3-4 problems in Div. 2 rounds and at most 4-5 problems in combined Div. 1 + Div. 2 rounds. If so, then this model can solve at most 2 problems in Div. 1 rounds.

    • »
      »
      »
      13 months ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      That is assuming that the same kind of semantic limitations that apply to human beings also apply to AI. Comparing it to chess, it seems that there would be competitive programming problems that would inherently be easier for an AI to solve (owing to their solution methodology), but not for a human, and vice versa.

      • »
        »
        »
        »
        13 months ago, # ^ |
          Vote: I like it +3 Vote: I do not like it

        Agreed, it might happen; my estimates are not precise. Personally I hope that the model that surpasses humans in CP will not sample millions of "solutions" and select the best one. Although, as I understand it, AlphaZero (as well as Stockfish) just computes many chess lines and uses a neural network to evaluate the positions at the end of every line, which is also a kind of brute force.

        • »
          »
          »
          »
          »
          13 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          I agree, though I guess most people will agree that the best way to measure the performance of any model in competitive programming is to actually try it out during contests and let it get to a stable rating.

          Still not the best measure for reasoning, since you can guess solutions for a lot of easier problems reasonably quickly.

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      There is also the possibility that it solves a harder problem but fails problem A in a Div. 3 simply because it does not understand the statement. According to their presentation, it could solve problem G from the Div. 1 + Div. 2 CodeTON Round, which only 21 people solved during the round.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Probably the fact that they used the last rounds for inference indicates that they tried to avoid leakage?

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      As TwentyOneHundredOrBust points out below, it is possible that they used tags. But if they just use it as an augmentation technique, it seems it avoids leakage through those means.

      Also, it is possible that the last few rounds are not really that recent (it is completely possible that the data used for training included those samples too).

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +4 Vote: I do not like it

    This is one of the authors' response to concerns about test data leakage: https://twitter.com/RemiLeblond/status/1732677521290789235

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Thanks for the link, I had already seen this tweet before commenting on the official response on this blog below. I can only wait till they substantiate their claims with live contest performance.

      I am just curious whether the claims in their paper hold up, because I am not very convinced by their arguments (it is very hard to avoid contamination when it comes to the internet). The best way is to test on something that is impossible to have seen yet: the future, unless someone creates a time machine.

      • »
        »
        »
        »
        13 months ago, # ^ |
          Vote: I like it +5 Vote: I do not like it

        I think we all agree with you about live contest validation. Personally I am also a bit skeptical of their claims (though it wouldn't be the first time I underestimated LLMs recently).

»
13 months ago, # |
Rev. 2   Vote: I like it +40 Vote: I do not like it

We also randomize targeted metadata included in the prompt, such as the problem difficulty rating and its categorical tags

Isn't that answer leakage?

Where is the appendix explaining how the evaluation was done and how training data leakage was avoided? Did this gigantic foundational Gemini model already see the editorials in its training data? Does solving 43% of problems really correspond to 1800 Elo? (Solving A-C fast in Div. 2 doesn't seem like it would.)

This would hold a lot more credibility if they just let the model run in a couple of live contests.

This report is very disappointing. The AlphaCode 1 paper was significantly higher-quality than this. Hopefully they release a much more thorough version.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    1) Probably the fact that they used the last rounds for inference indicates that they tried to avoid leakage? 2) If they sample millions of solutions, then tag information does not matter much, since they can sample, say, 100 million solutions to cover all tags.

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +9 Vote: I do not like it

      The Gemini foundational model is trained on a ton of data. Trying to avoid leakage isn't the same as avoiding it. How recent were these rounds? Weeks, months? We don't know, because it wasn't described. They mentioned combined "Div. 1 + 2" rounds; the second-to-last one of those was in September. Do we know that Gemini is at least 3 months out of date?

      It would be much more reliable to see it in live contests, where it's guaranteed that there's no leakage.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +41 Vote: I do not like it

    The "randomized tags" means that when they sampled, they randomly tried giving the model different subsets of tags (e.g. "try solving this as a dp problem", "try solving this as a greedy problem") as a way of introducing diversity into the samples. I'm pretty sure they didn't "cheat" by knowing the correct tags; they just tried them all. AlphaCode 1 did the same thing.

»
13 months ago, # |
  Vote: I like it +22 Vote: I do not like it

Looks like this could be the model in question (thanks to this comment)

Curious how it can flop on problems A-C and make 5+ WA submissions in a row (an average participant would have quit at that point), yet solve a 3000+ rated problem on the first try.

Nonetheless, that is some impressive stuff, but until these LLMs learn how to apply binary search on the answer, I think competitive programming as we know it is relatively safe.

Waiting for a more detailed report from DeepMind, or just an opportunity to test the Gemini model in action (which I believe is just a foundation for AlphaCode 2)

»
13 months ago, # |
  Vote: I like it +3 Vote: I do not like it

I wonder what it will be able to do a decade from now. So much improvement has happened in less than 2 years.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +32 Vote: I do not like it

    I do not see any improvements tbh. Just something that was fed more training data.

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +1 Vote: I do not like it

      I also don't see any progress here, literally the same thing but just higher rating. My bet is that in a decade these models will perform much better, but there will be no real improvement.

»
13 months ago, # |
Rev. 2   Vote: I like it +11 Vote: I do not like it

I think we can believe these data only when they make one account, take part in contests, and continuously achieve good ranks.

Yeah, and one more thing: the contest testing process should now include AIs as well, at least for Div. 2 C and D. I know we can't do much about Div. 2 A and B level problems, because it already seems possible to solve them using AI.

Remembering Elon Musk here: "Mark my words, AI is far more dangerous."

»
13 months ago, # |
Rev. 3   Vote: I like it 0 Vote: I do not like it

I think the problem they tested on was from the past CodeTON Round 4.

So there is a possibility that it already knew the solution, since submissions and the editorial are publicly available (and obviously readable as part of the training data).

So let's see if it can solve problems in a live contest.

  • »
    »
    13 months ago, # ^ |
    Rev. 3   Vote: I like it +11 Vote: I do not like it

    It also seems like they cherry-picked this example, since in its virtual participation the bot was only able to solve A, C and G out of the 8 problems in the set, taking 8 submissions to get AC on C.

    My hypothesis is that it was able to solve A on its own, needed a lot of effort on C (which is still commendable) despite the possibility of test data leakage, and was able to regurgitate its learnt solution for G.

    Also found it really funny that AI tries to hack itself/hardcodes solutions to samples: https://mirror.codeforces.com/contest/1810/submission/234586193

    Clearly shows the biases it gets from training on CF code, and might point towards test data leakage if the code is anywhere close to being relevant to that problem.

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      It trying to "hack itself" 234586193 could just be because that is an easy way for it to get AC on the sample. I don't think that has anything to do with data leakage.

      • »
        »
        »
        »
        13 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Yeah, just that the existence of a self-hacking submission that is otherwise mostly correct would strongly suggest data leakage.

        It doesn't seem to be the case for this solution, since the self-hack is an idiosyncrasy of the solution filtering process.

»
13 months ago, # |
  Vote: I like it 0 Vote: I do not like it

At this rate, someday the term "Competitive Programming" will evolve from "competing to solve programming problems" to "competing to create AI to solve programming problems" lol

»
13 months ago, # |
  Vote: I like it +10 Vote: I do not like it

cAn aI s0Ive nPc Prob15m?

»
13 months ago, # |
  Vote: I like it +192 Vote: I do not like it

Here is the submission 229535002 for the 3200 rated problem 1810G - The Maximum Prefix shown in Google's video

for (long long i = 0; i <= n; i++) cin >> h[i];
for (long long i = 0; i <= n; i++) dp[0][i] = h[i];
for (long long i = 1; i <= n; i++) {
  for (long long j = 0; j <= n; j++) {
    dp[i][j] = ((p[i] * dp[i - 1][j + 1]) % mod + ((1 - p[i] + mod) * dp[i - 1][max(0LL, j - 1)]) % mod) % mod;
  }
}

Here is the 8 months old editorial for the same problem

for(int i = 0;i <= n;i++) scanf("%d",&h[i]);
for(int i = 0;i <= n;i++) f[0][i] = h[i];
for(int i = 1;i <= n;i++) {
  for(int j = 0;j <= n;j++) {
    f[i][j] = (1LL*p[i]*f[i - 1][j + 1] + 1LL*q[i]*f[i - 1][max(0 , j - 1)] ) % mod;
  }
}

Is it time to bring back this meme?

»
13 months ago, # |
  Vote: I like it +61 Vote: I do not like it

I just realized AdamantChicken2 shortens to AC2 -> AlphaCode 2.

»
13 months ago, # |
  Vote: I like it +27 Vote: I do not like it

Whenever I read that AIs solve human tests (CP, university exams, school exams, etc.), I cannot help but find the discussions misinterpreting the results.

Computer programs are fast. If AlphaCode 2 can solve 43% of the problems within contest time, it may not be able to solve 44% even given a year: it effectively already had near-infinite time. Now imagine how AlphaCode 2 would place in a contest that runs for a year, where very unique problems with long solutions appear. Not well. Under those conditions, the AI may not even reach pupil status.

Also note that according to the report, the AI is not consistently Expert-CM. It placed in the top 0.5% in 2 contests, meaning it has contests where it places like a red coder, but therefore also contests where it in fact places like a pupil. This should not shock anyone; the report clearly states that the AI is basically good at guessing which of the millions of solutions to random problems work for the current problem.

»
13 months ago, # |
  Vote: I like it +27 Vote: I do not like it

Why don't they simply let the AI do some rated contests, which would guarantee the problems are new, instead of letting it virtually participate in old contests and spark unnecessary arguments?

  • »
    »
    13 months ago, # ^ |
    Rev. 2   Vote: I like it +15 Vote: I do not like it

    Desperate attempts at publicity and staying relevant. It is suspicious how they don't address test set leakage either, at all.

    Edit: a bit less cynical response: Google releases have traditionally been quite controlled, so it makes sense to not expect them to release the bot in the wild for open tests as of yet. The small parts of the machinery behind what is AlphaCode 2 might be really good, but the AI itself doesn't seem to be quite there yet — not implying that this can't change in the future.

»
13 months ago, # |
  Vote: I like it +132 Vote: I do not like it

Thanks for the interest!

To address some questions in the comments:

Regarding possible data leakage: this was indeed a huge concern for us and we went to extensive lengths to avoid contamination. Please see this post on X for more details on how we did it. TL;DR: we're confident that AlphaCode 2 did not see this specific problem (or its editorial/solutions); however, given that prefix sums appear regularly in competitive programming, the model has likely seen somewhat related (but distinct) problems and managed to adapt this knowledge to come up with its own solution.

An additional data point: AlphaCode 2 has a solve rate of nearly 100% on problems it was actually trained on, which is very far from its performance on any held-out contest we evaluated on.

On the potential leak of tags: we are using randomly-chosen tags for each sample we generate. Interestingly, for that particular sample the tags did not contain "dp" :)

PS: Extra kudos to TwentyOneHundredOrBust for figuring out the easter egg in the account name :)

  • »
    »
    13 months ago, # ^ |
    Rev. 2   Vote: I like it +162 Vote: I do not like it

    Thanks for addressing some of the concerns, including mine.

    It would be really cool if AlphaCode 2 participates in a statistically significant number (enough to be able to derive narrow enough 95% confidence intervals) of the upcoming rated Codeforces contests, since I am still not completely convinced given that the code is so similar to other solutions.

    And it would be the best form of testing too, since Codeforces problems are scrutinized extremely carefully during their preparation, often by veterans in the community who have seen a multitude of problems and can usually detect with a high amount of accuracy whether any problems are duplicates of past problems or not.

    If done, this would truly be a parallel to Deep Blue's match with Kasparov, which consisted of multiple games of chess, with the number of them being barely enough to be able to draw conclusions.

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +56 Vote: I do not like it

      +1 support to the proposal

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +37 Vote: I do not like it

      +1, I want to see AlphaCode 2 compete in some rounds to see how good it really is when it faces new problems.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +18 Vote: I do not like it

    The progress made by AlphaCode 2 is impressive. However, with its potential release to the public, there is a growing concern about how it could affect the competitive programming scene. What measures might be implemented to preserve the integrity of online competitions, particularly on platforms like Codeforces, against the sophisticated capabilities of models like AlphaCode 2?

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +347 Vote: I do not like it

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +83 Vote: I do not like it

    So is your explanation that this is just a remarkable coincidence? Your submissions use both the same variable names and the same formatting as the editorial; see for example 229535001 and 228987836. They are so similar that had two people submitted these in a competition, I would have thought they were cheating.

    Btw it really would be great if you test on live contests / fresh problems. That way there wouldn't be any doubt about contamination.

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      It's a problem with such a short solution that some correct solutions are bound to look almost exactly the same

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +39 Vote: I do not like it

    You were able to come up with an AI model that solves CF problems, but couldn't think of running it on a live contest and instead had to go to "extensive lengths to avoid contamination"?

    • »
      »
      »
      13 months ago, # ^ |
      Rev. 2   Vote: I like it +2 Vote: I do not like it

      you are mixing up 2 different things.

      Announcing an AI's strength should come with live contest results. I don't know why they did not do that either, and it makes their research look unprofessional, especially considering it's Google.

      But training an AI has to be done in an offline setting. It would not be feasible to train the AI only every 3 days for 2 hours, whenever a live contest happens. In order to train an AI, you have to be able to judge its performance, and for that it is necessary to go to "extensive lengths to avoid contamination".

      • »
        »
        »
        »
        13 months ago, # ^ |
        Rev. 2   Vote: I like it +27 Vote: I do not like it

        I think you are misunderstanding what contamination means.

        They are saying that they first trained their model on lots of problems and then selected some new problems to test the model on. Here they went to great lengths to ensure that the problems they test on are not contaminated, that is, the problems are completely new and the model never saw them during the training phase.

        So what I wanted to say was: why go to such lengths when you can test your model on a live contest, where the problems are guaranteed to be new and never seen before? For training you can use as much historical data as you want.

        • »
          »
          »
          »
          »
          13 months ago, # ^ |
            Vote: I like it -20 Vote: I do not like it

          I already answered why the researchers have to ensure test data is not contaminated. Let me try again.

          If the test data is contaminated, your resulting AI will be weaker than possible. This is obvious, right?

          You cannot train an AI on live contests, depending on the AI, that may take centuries. This is obvious, too, right?

          • »
            »
            »
            »
            »
            »
            13 months ago, # ^ |
              Vote: I like it +11 Vote: I do not like it

            No one said training should be done on live contests. Evaluation should though. The model here doesn't seem to be trained via reinforcement learning (at least not a major part of it), which is the misconception you seem to have.

            A garden variety simplified model training process looks like this (if you ignore cross validation and similar stuff):

            • You have a training set that is used to train your model.
            • You leave out a validation set for things like early stopping and so on — this is different from training but you still implicitly optimize on this set.
            • You have a test set on which you compute results of your model.

            All of these sets should be completely disjoint (at least independent), but should also come from the same data distribution, in order for the results to make sense.

            The disjoint-ness of training+validation and testing data is what the researchers at Google claim, and this claim is being questioned in a lot of the above comments.

            The solution a lot of the above comments are suggesting is to draw the test set from data that cannot possibly have been contaminated, which, barring time machines, means the future: live contests.

  • »
    »
    13 months ago, # ^ |
    Rev. 2   Vote: I like it +22 Vote: I do not like it

    If you're so confident in your model, why don't you make it compete with actual human competitors in a real live contest? Then we'll see if the claims are really true; otherwise there isn't much to back up the claim.

»
13 months ago, # |
  Vote: I like it +79 Vote: I do not like it

So the summary is: the bot solved A, C, and G from a Div. 1 + 2 round, nothing else. Its participation was virtual. G's solution is extremely similar to the author's solution. They promise there was no leakage. To me this is pathetic.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +13 Vote: I do not like it

    Difficulty measure for such a model may be quite different than what we perceive as humans. G's solution is very short and its statement is written in clear mathematical terms rather than some Alice getting arrays as birthday presents. To me this is believable

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +32 Vote: I do not like it

      Extraordinary claims require extraordinary evidence

      • »
        »
        »
        »
        13 months ago, # ^ |
          Vote: I like it +11 Vote: I do not like it

        I don't think this is extraordinary. Length of the solution and formality of the statement are probably the two factors that I would expect to be affecting the performance the most

        • »
          »
          »
          »
          »
          13 months ago, # ^ |
            Vote: I like it +13 Vote: I do not like it

          You do realise that the bot participated in the round 6 months after it happened, right? Why the heck don't they make it participate in a live round? I do agree that the solution is very short and more likely to be guessed by a brute-force approach. But the truth is that Google showed a bunch of unprofessional signs in the way these "research" results were presented.

          • »
            »
            »
            »
            »
            »
            13 months ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            I do realise that. I am far from convinced that it will revolutionize the playing field and show enough consistency, and participating in an actual rated contest would be the only way to convince me. But that was not the topic of your original comment, which concerned strictly the believability of what they have already claimed about solving that particular problem G. That is far from obviously false to me, as it seems to be to you.

»
13 months ago, # |
  Vote: I like it 0 Vote: I do not like it

High time to ask, "Is it the end of Competitive Programming? Should I quit grinding and hustling?"

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +4 Vote: I do not like it

    No

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +2 Vote: I do not like it

      Considering AlphaCode 2, the moment it is made available to the public, don't you think it will kill this domain?

      Like, even if it is only able to solve Div. 2 problems and not hard Div. 1 problems, it would destroy CP for folks in the 0-1900 rating range.

      • »
        »
        »
        »
        13 months ago, # ^ |
          Vote: I like it +22 Vote: I do not like it

        I'm not convinced that any such tool currently is good enough to beat a 1400 rated programmer with a win rate of > 50%. This number is what official claims seem to indicate, but it could very well be 1000 (reason below).

        There is way too much contamination, and the general avoidance of temporally separated testing mechanisms (like how tests on CF work) inflates their performance metrics. There have been papers about this phenomenon of true out-of-sample decay, especially regarding Project Euler and Codeforces problems for ChatGPT.

»
13 months ago, # |
  Vote: I like it +70 Vote: I do not like it

Why not evaluate on their own Code Jam/Kickstart problemset :)

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +17 Vote: I do not like it

    Code jam is dead... unless you mean this is a good excuse to bring it back. The PR stunt could be: beat AlphaCode2 in a contest and win a tshirt?

»
13 months ago, # |
  Vote: I like it 0 Vote: I do not like it

In this video, it is claimed that Gemini can solve a 3200 Elo problem. Is this an existential crisis for newbie competitive programmers like me?

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +30 Vote: I do not like it

    Depends on whether you can implement a DP solution given the editorial code.

  • »
    »
    13 months ago, # ^ |
      Vote: I like it +1 Vote: I do not like it

    No. This problem seems to be a total outlier, and it is beyond me why they feature it in their video. The second-highest-rated problem the AI (AdamantChicken2) solved is 2100.

    And besides that, the AI has NOT been run under live contest constraints. For example, this submission is for problem G. You can see the AI (purringgrattini) abusing the submission to gain more output information. This is not anything you could ever do inside a live contest.

»
13 months ago, # |
  Vote: I like it +25 Vote: I do not like it

Interestingly, in AdamantChicken2's code, the "quick power" is written as "qmi" (quick mi), it seems that AdamantChicken2 also knows Chinese phonetics.

  • »
    »
    13 months ago, # ^ |
    Rev. 2   Vote: I like it +71 Vote: I do not like it

    I found some Chinese in one of their submissions 232778386.

    I also found that they are using a function called Gamal() 232777058 232778378 for cin.tie stuff, which seems to straight up be taken from the template of Gamal74 205625212.

    Note that all of these submissions I linked are for the same exact problem.

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +16 Vote: I do not like it

      No test data leakage confirmed!

    • »
      »
      »
      13 months ago, # ^ |
        Vote: I like it +8 Vote: I do not like it

      If it's a template then it could appear in earlier training data as well.

      • »
        »
        »
        »
        13 months ago, # ^ |
        Rev. 7   Vote: I like it +42 Vote: I do not like it

        Agreed. It seems that Gamal74 started using the Gamal() template in July 2022. This is the oldest submission using it that I could find: 162836061. So the training data is at least as recent as July 2022.

        However it is a bit suspicious that the bot is using Gamal() on that specific problem. For example, it is the only problem that Gamal74 has solved in that entire contest. Also, I've never seen the bot use Gamal() when solving other problems. It seems to only do it for this problem.

        EDIT: I just did a more thorough check. Some of the bot submissions 232777056 232777056 use a newer version of the template. The first time I can see this version of the template being used by Gamal74 is in late September 2022: 173629356.

»
13 months ago, # |
  Vote: I like it +14 Vote: I do not like it

Would like to see future untrained problem performance

»
13 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Not implying that this is impossible, but it actually seemed a bit unnatural to me that an AI could accurately perform tasks that require a high degree of precision for correctness, like solving a very specific math problem or a programming problem, with a very high confidence rate.

For drawing, writing, and the like it makes sense, since only the overall pattern/structure matters...

»
13 months ago, # |
Rev. 3   Vote: I like it +39 Vote: I do not like it

I guess the comments above talk about plausibility, but I honestly don't care that much whether it is true or not. I am pretty sure it will happen at some point anyway, so whatever. And I am not surprised by this in any way! We know that the most important thing in competitive programming is practice, and that's exactly the thing machines are much better at than we are.

I genuinely believe that if you gave a chicken a million years to practice, it could become a candidate master or a master. I am not trying to insult candidate masters with this comment; it is nevertheless hard to become one! I mean it in the same way that if a chicken studied chess for a million years, it would become better than Magnus Carlsen. It's just a matter of resources. If a 3500 legendary grandmaster would call a problem "an exercise" from their mountain of experience, why can't a computer learn to do such exercises?

The reason I still am not impressed is that these are exactly the problems you just need to grind through, and the real stuff begins later, when a 3500 has to start thinking. And I don't see AI being able to solve any of those problems any time soon, so whatever...

  • »
    »
    12 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

But what we humans need is a machine that can think rather than an omnipotent database. If a new kind of CP problem (requiring some new algorithm) comes out in the future, would a database find the corresponding answer? No, it would just try every combination of the algorithms in its database and eventually tell us it doesn't know.

    • »
      »
      »
      12 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

Idk. If so, it would be a good shake-up, pushing authors to come up with novel ideas.

»
13 months ago, # |
  Vote: I like it +42 Vote: I do not like it

If I understand correctly, this large language model is like a clever monkey typing in the infinite monkey theorem? It seems like it's just randomly putting together some code and then trying the results one by one, using some clustering method to speed up the process.

  • »
    »
    12 months ago, # ^ |
      Vote: I like it -8 Vote: I do not like it

It is nowhere near as brute-force as you are describing. Machine learning is much more like human learning: the model has a vast number of parameters that enable it to make a lot of accurate guesses. This particular model then submits the 10 candidates ranked as most likely to be correct, and is considered to have solved the problem if any of the ten gets AC.

    • »
      »
      »
      12 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

Except that in this case, the vast number of parameters leads to overfitting on the data and, in extreme cases like this, to learning the data itself rather than the distribution/properties that would help it generalize. After this sort of training is done, the process devolves into merely querying the resulting database with a random generation algorithm and trying the outputs one by one.

»
12 months ago, # |
  Vote: I like it 0 Vote: I do not like it

so guys, should i stop grinding or no?

  • »
    »
    12 months ago, # ^ |
      Vote: I like it +19 Vote: I do not like it

Just because AI can do these things doesn't mean you should stop learning. Try to grind more effectively. After all, the point of CP is upgrading your reasoning skills. If you give up on learning, AI will come for everything.

»
12 months ago, # |
  Vote: I like it +9 Vote: I do not like it

Probably the rules of using code generators will have to be modified. Even if they publish the code and the weights of the neural networks, the majority of people will not have sufficient resources to run this code during the contest.

»
12 months ago, # |
  Vote: I like it +9 Vote: I do not like it

We don't need AI. We need EI: Emotional Intelligence