pajenegod's blog

By pajenegod, history, 10 months ago, In English

Yesterday, investigating Strange TLE by cin using GNU C++20 (64), I found an easy and reproducable way to trigger a slowdown bug that I believe has been plaguing Codeforces for some time now. So I'm making this blog to raise awarness of it. MikeMirzayanov, please take a look at this!

Here is how to trigger the slowness bug:

  1. Take any problem with a relatively large input on Codeforces ($$$2 \cdot 10^5$$$ ints is enough).
  2. Take a random AC C++ submission that uses std::cin.
  3. Add the line vector<vector<int>> TLE(40000, vector<int>(7)); somewhere in global space.
  4. Submit using either C++20(64 bit) or C++17(64 bit).
  5. ???
  6. TLE

For example take tourist's solution to problem 1936-D - Bitwise Paradox. With the vector of death added to the code, it gets TLE on TC5 (taking $$$> 5$$$ s). While without the deadly vector, the submission takes 155 ms on TC5.

Here is a stand alone example with the slowdown (credit to kostia244). It runs 100 times slower with the vector of death.

#include<bits/stdc++.h>
using namespace std;

vector<vector<int>> TLE(40000, vector<int>(7));

int main() {
    string s;
    for(int i = 0; i < 17; i++) s += to_string(i) + " ";
    
    for (int j = 0; j < 60000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}
What is causing this?
Other ways to trigger the slowdown bug
Other blogs on this topic
  • Vote: I like it
  • +658
  • Vote: I do not like it

»
10 months ago, # |
Rev. 3   Vote: I like it +10 Vote: I do not like it

I was able to reduce kostia's code to this:

#include<bits/stdc++.h>
using namespace std;

int main() {
    for (int i = 0; i < 40001; i++) {
        new int[7]; // 10 works consistently too
    }

    string s;
    for(int i = 0; i < 17*30000; i++) s += to_string(i) + " ";
    istringstream ss(s); int x; while (ss >> x);
}

Length of 40000 or 40002 doesn't work, only 40001 works. Without that loop the code gets WA in 77ms.

»
10 months ago, # |
Rev. 3   Vote: I like it +27 Vote: I do not like it

This takes 500ms for sizes from 7 to 10, <=15ms for any other size.

#include <vector>
using namespace std;

int main() {
    vector<vector<int>> TLE(40000, vector<int>(7));
    for (int j = 0; j < 60000; ++j) {
        vector<int>(7); // any size from 7 to 10 works
    }
}
  • »
    »
    10 months ago, # ^ |
      Vote: I like it +14 Vote: I do not like it

    TLE vector probably triggers some especially bad case for the allocator.

»
10 months ago, # |
Rev. 2   Vote: I like it -19 Vote: I do not like it

is the slowness bug random cause I tried the same as told by pajenegod

without death vector:248597380 78 ms

with death vector:249350892 93 ms

I ran the codes on C++20(64bit) and input is 2.10^5 but still not a big change in time of both codes.

  • »
    »
    10 months ago, # ^ |
    Rev. 4   Vote: I like it +32 Vote: I do not like it

    You have #define int long long in your code. The death vector needs to be vector<vector<int>> and not vector<vector<long long>>. Your solution takes 1996 ms / 2 s 249353007 with the actual death vector.

    • »
      »
      »
      10 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      OH SORRY.MY BAD!! This really is a concerning issue . PS: Thanks for the blog

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +13 Vote: I do not like it

    WHY ARE YOU TAKE MY PICTURE

»
10 months ago, # |
Rev. 2   Vote: I like it +16 Vote: I do not like it

A simple experiment shows that a VECTOR only affects the cin during its lifetime, and once the VECTOR is released, its effect disappears.

That's why any global space or main will have an effect on this.

In addition, vector<vector<long long>> TLE(40000, vector<long long>(5)); also has an effect, It doesn't seem to matter what the innermost type is, but rather the byte count.

  • »
    »
    10 months ago, # ^ |
      Vote: I like it -45 Vote: I do not like it

    upvoted. orz

    • »
      »
      »
      9 months ago, # ^ |
        Vote: I like it +2 Vote: I do not like it

      stop asking for contribution man!! o_o

      • »
        »
        »
        »
        9 months ago, # ^ |
        Rev. 2   Vote: I like it -20 Vote: I do not like it

        orz is respect word. It's not asking for contribution. I value this user's insights regarding the death vector! I also speculate he played a WA24 scheme before, in this so he helped catch many cheater too! check

»
10 months ago, # |
Rev. 2   Vote: I like it -91 Vote: I do not like it

unrelated to the topic, but

what is benefit of using C++20(64) over c++17 for CF, I haven't seen any use case (feature that present C++20(64) but not available in c++17, besides obvious one like synthetic sugar of course)

just curious

  • »
    »
    10 months ago, # ^ |
      Vote: I like it -50 Vote: I do not like it

    downvoted

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +26 Vote: I do not like it

    one thing that i can think of is the contains() method for map and set. it's not there in c++17

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +13 Vote: I do not like it

    in C++20, you can use ranges, countl_zero, bitceil, ... etc

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    you can use std::lcm and std::gcd ,it's more simple than C++17

    • »
      »
      »
      9 months ago, # ^ |
        Vote: I like it +8 Vote: I do not like it

      but these two are already added in STL with C++17

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I like std::range and priority_queue<sth,vector<sth>,decltype(lambda)>.

    • »
      »
      »
      9 months ago, # ^ |
      Rev. 2   Vote: I like it +3 Vote: I do not like it

      As an alternative, you can use std::make_heap, std::push_heap and std::pop_heap which use a (sub)range of STL containers (e.g. vector) to construct a heap in place:

      int main() {
      	std::vector<int> v1{ 3, 2, 4, 1, 5, 9 };
      	std::make_heap(v1.begin(), v1.end()); // 9 5 4 1 2 3
      	std::pop_heap(v1.begin(), v1.end());  // 5 3 4 1 2 9
      	auto top = v1.back();    // 9
      	v1.pop_back();   // 5 3 4 1 2
      
      	std::vector<int> v2{ 3, 2, 4, 5, 9 };
      	std::make_heap(v2.begin(), v2.end(), std::greater<>{}); // 2 3 4 5 9
      	v2.push_back(1);	// 2 3 4 5 9 1
      	std::push_heap(v2.begin(), v2.end(), std::greater<>{});  // 1 3 2 5 9 4
      }
      

      These functions also have their C++20 "std::ranges" version like std::ranges::make_heap(v); though, albeit the C++11 version is already not so difficult to use.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it +27 Vote: I do not like it

    __int128_t is very useful too, and isn't present on 32 bits versions

»
10 months ago, # |
Rev. 2   Vote: I like it -35 Vote: I do not like it

[deleted]

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +44 Vote: I do not like it

    Not really the same thing, it is expected that different compilers make programs with different run times. Your code is just slower by a small fraction. The issue is a slowdown of 10 to 100 times on something that shouldn't be an issue.

    • »
      »
      »
      10 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Sorry, you are right they aren't the same.

      it's just that I am upset that someone can fail the system testing due to things like this.

»
10 months ago, # |
Rev. 2   Vote: I like it +16 Vote: I do not like it

Thanks for the investigative work. I was also able to recreate the problem in my blog (having a set of size 99904) in this problem: TLE on 9 (takes 10 seconds in my mashup), doesn't happen in 32 bits, doesn't happen with other sizes (in this case i tried 99910).

The difference between mine and your problem is that i can't just do it by adding a set of a magic size, i can't do the insertions all at once. This makes me think that it is related to allocating a bunch of small data in random places in memory, because a vector of vectors probably just allocates randomly every time (it is a nested structure and is less likely to have any extra optimizations built in) while there could be an optimization for sets where data gets allocated roughly at the same spot if it detects you're doing it all at once (this is also backed by the fact that doing new int[7] also works for your case but doing 7 arrays of size 40000 doesn't, even though they occupy the same total ammount of memory).

Maybe it has to do with memory fragmentation? Maybe cin is trying to do something and fails? I have no idea but it will be fun to see how this rabbit hole will end up.

»
10 months ago, # |
Rev. 2   Vote: I like it +113 Vote: I do not like it

Here are the results of my investigation.

Base code for C++17 (64-bit):

int main() {
    for (int i = 0; i < 40001; i++) {
        char* a = new char[25];
    }
    for (int j = 0; j < 60000; ++j) {
        char* b = new char[25];
        delete[] b;
    }
}

It is also possible to repeat this with C++20 (64-bit) or with malloc/free. However, since I don't actually use this memory, the compiler does its magic. Fun fact: in some OJ with while(true); in your code, you will not get TL (at least not because of while).

This code runs in ~500ms and 0ms if 40001 is changed to 40000.

After playing with the sizes of 'a' and 'b', I found that as long as they are within the range (25, 40), not necessarily the same, it produces the same result. So it may have something to do with allocating memory in the heap. Sizes 24 and 40 are common sizes of chunks in bins (32 is skipped, though).

Then I checked the addresses of allocated memory. 'b' is always in the same place. Usually, the distance between 'a' chunks is 48. But the 40001st and 40002nd chunks are 112 bytes far. And so are 41365-41366, 42729-42730. And guess what? They all cause slowdown!! Big numbers of the form 445 + 1364 * k work. So vector<vector<int>> TLE(444 + 1364 * k, vector<int>(7)); all cause slowdown.

One more observation: code with 40001 uses 1924KB, with 40002 uses 1924KB, and so on until 40001+1364, which uses 1988KB. Maybe the reason is how Codeforces works with memory? Maybe pages?

I am curious to see how this progresses and the reason behind the slowdown.

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +32 Vote: I do not like it

    Wow, nice find! I just tried using this with $$$k=41$$$ in a hack on problem 1936-C - Pokémon Arena, and it totally works.

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +34 Vote: I do not like it

    Btw, I just checked and it seems $$$k$$$ needs to be $$$\geq 8$$$ for the slowdown bug to happen. So the smallest death vector is vector<vector<int>> TLE(11356, vector<int>(7)).

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +43 Vote: I do not like it

    ‎‎I pinned the issue down to one of the two functions V6_HeapAlloc (__sbh_alloc_block) and V5_HeapAlloc (__old_sbh_alloc_block) in sbheap.c in MSVCRT in my blog. My guess is the allocation algorithm is very bad at some edge cases, for example a linked list of allocated chunks followed by a free chunk (I haven't read the entire code, so this is only a guess of how it can be slow).

»
10 months ago, # |
  Vote: I like it -146 Vote: I do not like it

I think maybe it's just because std::cin is just slow. Try scanf or fast read instead.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    cin with ios::sync_with_stdio(false) and cin.tie(nullptr) is usually faster than scanf.

    • »
      »
      »
      9 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      What's more, the "scanf-performance" in those MSVCRT 64-bit compilers is significantly worse than cin without tie and sync.

      See 240611683 that takes 1013ms to run, TLE if used scanf; 32-bit version 240612877 takes 1500ms to run, almost the same run time if used scanf.

»
10 months ago, # |
  Vote: I like it +47 Vote: I do not like it

I made a post about the situation.

»
10 months ago, # |
  Vote: I like it +1 Vote: I do not like it

Me a python user after seeing this blog — Yes

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +10 Vote: I do not like it

    It was surprising to me to see C++ having this kind of performance bug, since performance bugs like these are normally something you'd find in PyPy. See for example https://github.com/pypy/pypy/issues/3794 . You could easily cook up some example in Python similar to the death vector where PyPy gets TLE for no apparent reason.

    • »
      »
      »
      10 months ago, # ^ |
      Rev. 2   Vote: I like it -10 Vote: I do not like it

      After seeing that "j%6" bug that you just shared, I am convinced that we pythoners were once part of Lucifers army and python is hated by God

»
10 months ago, # |
  Vote: I like it +15 Vote: I do not like it

Dang, this is gonna be great for hacking

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +54 Vote: I do not like it

    Unfortunately, yes.

    In my opinion in-contest hacking should be turned off for problems where this bug can be abused for hacking (grid / matrix problems) until this issue has been fixed.

»
10 months ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

is this possibly related to how size_t is still 32 bits even when the language chosen claims to be C++ (64 bit) ?

edit: this is false. size_t is indeed 64 bits. i confused it with long

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    What? C++ language is neither 32 bit nor 64 bit.

    • »
      »
      »
      10 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      there are many ways to compile a single c++ source code. you can compile for 32-bit machines, or 64-bit machines, or 5-bit machines if you write your own compiler.

»
10 months ago, # |
Rev. 2   Vote: I like it +145 Vote: I do not like it

Thank you for this investigation. I have introduced a new language: C++20 (GCC 13-64). Please test it. My tests show that this issue has been fixed in this version.

The current plan is to hide all other 64-bit C++ compilers except for the new C++20 (GCC 13-64).

  • »
    »
    10 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Is it stable enough to be used in the contest today?

    • »
      »
      »
      10 months ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      I think so. If you're not sure, just stick to the time-tested 32-bit versions.

  • »
    »
    10 months ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    UPD: sorry, I forgot to submit it with C++20 (GCC-13) :(

  • »
    »
    10 months ago, # ^ |
    Rev. 2   Vote: I like it -8 Vote: I do not like it

    I think only winlibs with UCRT (not MSVCRT) fixes this issue, so to fix this the solution should be to install UCRT and then winlibs with UCRT. seems to have been fixed?

  • »
    »
    10 months ago, # ^ |
      Vote: I like it +41 Vote: I do not like it

    Hi, bits/extc++.h is a header which is like bits/stdc++.h except it includes more stuff, making it a strictly better bits/stdc++.h in the sense that you also don't have to add includes when using, for example, pbds things.

    It currently does not compile on the new language, although it did compile on the 64 bit version of c++17. Could you please take a look at it?

    Compilation Message
    • »
      »
      »
      10 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      btw, it also doesn't compile for old version of c++ 20 compiler.

  • »
    »
    10 months ago, # ^ |
    Rev. 3   Vote: I like it -9 Vote: I do not like it

  • »
    »
    10 months ago, # ^ |
    Rev. 2   Vote: I like it +44 Vote: I do not like it

    Unfortunately it seems that GCC 13-64 still has this issue, but cases when it manifests have been shuffled around. Identical submissions: TL with GCC 13, OK with GCC 11.

    • »
      »
      »
      10 months ago, # ^ |
        Vote: I like it +26 Vote: I do not like it

      Curious, why does your example give TLE when submitted, but not during custom invocation?

      • »
        »
        »
        »
        10 months ago, # ^ |
        Rev. 2   Vote: I like it +21 Vote: I do not like it

        I get different sets of "evil" constants in custom invocation and submitting in a mashup. Both sets seem to reproduce perfectly, but I've no idea why they're different.

        UPD An example of "custom invocation-evil" constant for GCC 13 is 12495.

    • »
      »
      »
      10 months ago, # ^ |
      Rev. 2   Vote: I like it +49 Vote: I do not like it

      The following code takes 9s in custom test, with C++20 (64, GCC 13):

      Code

      Now the issue suddenly becomes 100x more confusing to me, because:

      1. In my previous research I concluded that malloc in C++17 (64) and C++20 (64, GCC 11) is not a direct HeapAlloc call. Replacing malloc with HeapAlloc solved it, so I assumed a bad library implementation caused the slow that case.
      2. 98% of the time in the code above is spent in alloc, so HeapAlloc causes the slow in this case.
      3. The above code is only slow in C++20 (64, GCC 13). However, the sequence of allocations/deallocations seems to be the same in C++17 and C++20 (64, GCC 11).
      4. If I remove the printf line or the cout line, it runs fast again. printf takes no time. The sequence of allocations/deallocations is not affected by printf.

      I have no clue about anything now.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it +12 Vote: I do not like it
    It's very sad that
  • »
    »
    9 months ago, # ^ |
      Vote: I like it +14 Vote: I do not like it

    would it be possible to keep at least one compiler for c++20?

    • »
      »
      »
      9 months ago, # ^ |
        Vote: I like it +16 Vote: I do not like it

      Yeah -- I have a good chunk of boilerplate code that depends on C++20, and it's a pain to backport them to C++17.

      I would like to ask the admins to announce the (un)availability of C++20 before a contest starts, so we can decide (not) to participate.

»
9 months ago, # |
  Vote: I like it +35 Vote: I do not like it

I hope to find the cause of the relevant problem and fix it as soon as possible.
GNU C++20(64) is faster than C++14/17(32) when using variable type long long, and we connot use variable type __int128.
Some problem is hard to be solved without GNU C++20(64). :(

»
9 months ago, # |
  Vote: I like it +25 Vote: I do not like it

Maybe not just ban them. While fixing this bug, you can still make C++17(64) and C++20(64) online and useable.

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Just noticed c++20 is back, does it means the bug is fixed?

  • »
    »
    9 months ago, # ^ |
      Vote: I like it +8 Vote: I do not like it

    No. See ToxicPie9's test code:

    #include <iostream>
    #include <windows.h>
    
    using namespace std;
    
    void *alloc(size_t s) {
        printf("");
        auto p = (void *)HeapAlloc(GetProcessHeap(), 0, s ? s : 1);
        asm volatile("" ::"r"(p));
        return p;
    }
    
    void dealloc(void *p) {
        HeapFree(GetProcessHeap(), 0, p);
    }
    
    int main() {
        int evil = 12495;
        int count = 1e6;
        for (int i = 0; i < evil; i++) {
            alloc(0x28);
            dealloc(alloc(0x1c));
        }
        for (int i = evil; i < count; i++) {
            dealloc(alloc(0x1c));
        }
        cout << "hi\n";
    }
    

    It takes about 9 seconds on the CF custom test, while testing this code locally using UCRT mingw64 gcc13.2, it takes no time, showing that this version is still the buggy MSVCRT one.

    • »
      »
      »
      9 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Just tested it after Mike migrate the server to newer version, and it runs about 100ms. Seems this bug is gone :)

      • »
        »
        »
        »
        7 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        No, it's not fixed at least for C++.

        My further tests show that the bug is just hidden somewhere else. Need some other methods for this bug. See my test and conclusions.

    • »
      »
      »
      8 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      So why not fix this by just migrating to UCRT? :(

      • »
        »
        »
        »
        8 months ago, # ^ |
        Rev. 3   Vote: I like it 0 Vote: I do not like it

        UPD: It's not fixed at least for C++.

        Further investigation shows that the bug is just hidden somewhere else. Need some other methods for this bug. See my test and conclusions.

        Original post↓


        It is fixed by updating the system to Windows Server 2022, those legacy 32-bit compilers seemed to benefit from this (you can try copying 253400416 to custom test, and TLE would not happen). As MikeMirzayanov said, they are working on resolving the situation more systematically.

        MikeMirzayanov did not reveal how they fixed the problem, but I suppose they also migrated the 64-bit compilers to the UCRT version of the mingw64 toolchain.

»
9 months ago, # |
  Vote: I like it +2 Vote: I do not like it

MikeMirzayanov I reported (cf blog) This simple program does not compile on my local setup bug to GCC ~8 months ago. It's not fixed it. Open bug link on bugzilla

Today's GCC version upgrade has introduced this bug to codeforces. Sample compilation error submission — 252744779

  • »
    »
    8 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    The easiest way to avoid it is to remove those pragmas, especially "target" directives. It's a bug affecting every GCC13 on any platform, and Clang if it's using GCC13's libstdc++.

    C++ is the tier-0 fast language for competitive programming in Codeforces, I think it's not particularly useful to do so when this language already has a big advantage over others.