pajenegod's blog

By pajenegod, history, 2 years ago, In English

Yesterday, investigating Strange TLE by cin using GNU C++20 (64), I found an easy and reproducable way to trigger a slowdown bug that I believe has been plaguing Codeforces for some time now. So I'm making this blog to raise awarness of it. MikeMirzayanov, please take a look at this!

Here is how to trigger the slowness bug:

  1. Take any problem with a relatively large input on Codeforces ($$$2 \cdot 10^5$$$ ints is enough).
  2. Take a random AC C++ submission that uses std::cin.
  3. Add the line vector<vector<int>> TLE(40000, vector<int>(7)); somewhere in global space.
  4. Submit using either C++20(64 bit) or C++17(64 bit).
  5. ???
  6. TLE

For example take tourist's solution to problem 1936-D - Bitwise Paradox. With the vector of death added to the code, it gets TLE on TC5 (taking $$$ \gt 5$$$ s). While without the deadly vector, the submission takes 155 ms on TC5.

Here is a stand alone example with the slowdown (credit to kostia244). It runs 100 times slower with the vector of death.

#include<bits/stdc++.h>
using namespace std;

vector<vector<int>> TLE(40000, vector<int>(7));

int main() {
    string s;
    for(int i = 0; i < 17; i++) s += to_string(i) + " ";
    
    for (int j = 0; j < 60000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}
What is causing this?
Other ways to trigger the slowdown bug
Other blogs on this topic
  • Vote: I like it
  • +658
  • Vote: I do not like it

| Write comment?
»
2 years ago, hide # |
Rev. 3  
Vote: I like it +10 Vote: I do not like it

I was able to reduce kostia's code to this:

#include<bits/stdc++.h>
using namespace std;

int main() {
    for (int i = 0; i < 40001; i++) {
        new int[7]; // 10 works consistently too
    }

    string s;
    for(int i = 0; i < 17*30000; i++) s += to_string(i) + " ";
    istringstream ss(s); int x; while (ss >> x);
}

Length of 40000 or 40002 doesn't work, only 40001 works. Without that loop the code gets WA in 77ms.

»
2 years ago, hide # |
Rev. 3  
Vote: I like it +27 Vote: I do not like it

This takes 500ms for sizes from 7 to 10, <=15ms for any other size.

#include <vector>
using namespace std;

int main() {
    vector<vector<int>> TLE(40000, vector<int>(7));
    for (int j = 0; j < 60000; ++j) {
        vector<int>(7); // any size from 7 to 10 works
    }
}
»
2 years ago, hide # |
Rev. 2  
Vote: I like it -19 Vote: I do not like it

is the slowness bug random cause I tried the same as told by pajenegod

without death vector:248597380 78 ms

with death vector:249350892 93 ms

I ran the codes on C++20(64bit) and input is 2.10^5 but still not a big change in time of both codes.

»
2 years ago, hide # |
Rev. 2  
Vote: I like it +16 Vote: I do not like it

A simple experiment shows that a VECTOR only affects the cin during its lifetime, and once the VECTOR is released, its effect disappears.

That's why any global space or main will have an effect on this.

In addition, vector<vector<long long>> TLE(40000, vector<long long>(5)); also has an effect, It doesn't seem to matter what the innermost type is, but rather the byte count.

»
2 years ago, hide # |
Rev. 2  
Vote: I like it -91 Vote: I do not like it

unrelated to the topic, but

what is benefit of using C++20(64) over c++17 for CF, I haven't seen any use case (feature that present C++20(64) but not available in c++17, besides obvious one like synthetic sugar of course)

just curious

»
2 years ago, hide # |
Rev. 2  
Vote: I like it -35 Vote: I do not like it

[deleted]

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +44 Vote: I do not like it

    Not really the same thing, it is expected that different compilers make programs with different run times. Your code is just slower by a small fraction. The issue is a slowdown of 10 to 100 times on something that shouldn't be an issue.

    • »
      »
      »
      2 years ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      Sorry, you are right they aren't the same.

      it's just that I am upset that someone can fail the system testing due to things like this.

»
2 years ago, hide # |
Rev. 2  
Vote: I like it +16 Vote: I do not like it

Thanks for the investigative work. I was also able to recreate the problem in my blog (having a set of size 99904) in this problem: TLE on 9 (takes 10 seconds in my mashup), doesn't happen in 32 bits, doesn't happen with other sizes (in this case i tried 99910).

The difference between mine and your problem is that i can't just do it by adding a set of a magic size, i can't do the insertions all at once. This makes me think that it is related to allocating a bunch of small data in random places in memory, because a vector of vectors probably just allocates randomly every time (it is a nested structure and is less likely to have any extra optimizations built in) while there could be an optimization for sets where data gets allocated roughly at the same spot if it detects you're doing it all at once (this is also backed by the fact that doing new int[7] also works for your case but doing 7 arrays of size 40000 doesn't, even though they occupy the same total ammount of memory).

Maybe it has to do with memory fragmentation? Maybe cin is trying to do something and fails? I have no idea but it will be fun to see how this rabbit hole will end up.

»
2 years ago, hide # |
Rev. 2  
Vote: I like it +113 Vote: I do not like it

Here are the results of my investigation.

Base code for C++17 (64-bit):

int main() {
    for (int i = 0; i < 40001; i++) {
        char* a = new char[25];
    }
    for (int j = 0; j < 60000; ++j) {
        char* b = new char[25];
        delete[] b;
    }
}

It is also possible to repeat this with C++20 (64-bit) or with malloc/free. However, since I don't actually use this memory, the compiler does its magic. Fun fact: in some OJ with while(true); in your code, you will not get TL (at least not because of while).

This code runs in ~500ms and 0ms if 40001 is changed to 40000.

After playing with the sizes of 'a' and 'b', I found that as long as they are within the range (25, 40), not necessarily the same, it produces the same result. So it may have something to do with allocating memory in the heap. Sizes 24 and 40 are common sizes of chunks in bins (32 is skipped, though).

Then I checked the addresses of allocated memory. 'b' is always in the same place. Usually, the distance between 'a' chunks is 48. But the 40001st and 40002nd chunks are 112 bytes far. And so are 41365-41366, 42729-42730. And guess what? They all cause slowdown!! Big numbers of the form 445 + 1364 * k work. So vector<vector<int>> TLE(444 + 1364 * k, vector<int>(7)); all cause slowdown.

One more observation: code with 40001 uses 1924KB, with 40002 uses 1924KB, and so on until 40001+1364, which uses 1988KB. Maybe the reason is how Codeforces works with memory? Maybe pages?

I am curious to see how this progresses and the reason behind the slowdown.

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +32 Vote: I do not like it

    Wow, nice find! I just tried using this with $$$k=41$$$ in a hack on problem 1936-C - Арена покемонов, and it totally works.

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +34 Vote: I do not like it

    Btw, I just checked and it seems $$$k$$$ needs to be $$$\geq 8$$$ for the slowdown bug to happen. So the smallest death vector is vector<vector<int>> TLE(11356, vector<int>(7)).

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +43 Vote: I do not like it

    ‎‎I pinned the issue down to one of the two functions V6_HeapAlloc (__sbh_alloc_block) and V5_HeapAlloc (__old_sbh_alloc_block) in sbheap.c in MSVCRT in my blog. My guess is the allocation algorithm is very bad at some edge cases, for example a linked list of allocated chunks followed by a free chunk (I haven't read the entire code, so this is only a guess of how it can be slow).

»
2 years ago, hide # |
 
Vote: I like it -146 Vote: I do not like it

I think maybe it's just because std::cin is just slow. Try scanf or fast read instead.

»
2 years ago, hide # |
 
Vote: I like it +47 Vote: I do not like it

I made a post about the situation.

»
2 years ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

Me a python user after seeing this blog — Yes

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +10 Vote: I do not like it

    It was surprising to me to see C++ having this kind of performance bug, since performance bugs like these are normally something you'd find in PyPy. See for example https://github.com/pypy/pypy/issues/3794 . You could easily cook up some example in Python similar to the death vector where PyPy gets TLE for no apparent reason.

»
2 years ago, hide # |
 
Vote: I like it +15 Vote: I do not like it

Dang, this is gonna be great for hacking

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +54 Vote: I do not like it

    Unfortunately, yes.

    In my opinion in-contest hacking should be turned off for problems where this bug can be abused for hacking (grid / matrix problems) until this issue has been fixed.

»
2 years ago, hide # |
Rev. 2  
Vote: I like it 0 Vote: I do not like it

is this possibly related to how size_t is still 32 bits even when the language chosen claims to be C++ (64 bit) ?

edit: this is false. size_t is indeed 64 bits. i confused it with long

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +3 Vote: I do not like it

    What? C++ language is neither 32 bit nor 64 bit.

    • »
      »
      »
      2 years ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      there are many ways to compile a single c++ source code. you can compile for 32-bit machines, or 64-bit machines, or 5-bit machines if you write your own compiler.

»
2 years ago, hide # |
Rev. 2  
Vote: I like it +145 Vote: I do not like it

Thank you for this investigation. I have introduced a new language: C++20 (GCC 13-64). Please test it. My tests show that this issue has been fixed in this version.

The current plan is to hide all other 64-bit C++ compilers except for the new C++20 (GCC 13-64).

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    Is it stable enough to be used in the contest today?

  • »
    »
    2 years ago, hide # ^ |
    Rev. 2  
    Vote: I like it 0 Vote: I do not like it

    UPD: sorry, I forgot to submit it with C++20 (GCC-13) :(

  • »
    »
    2 years ago, hide # ^ |
    Rev. 2  
    Vote: I like it -8 Vote: I do not like it

    I think only winlibs with UCRT (not MSVCRT) fixes this issue, so to fix this the solution should be to install UCRT and then winlibs with UCRT. seems to have been fixed?

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +41 Vote: I do not like it

    Hi, bits/extc++.h is a header which is like bits/stdc++.h except it includes more stuff, making it a strictly better bits/stdc++.h in the sense that you also don't have to add includes when using, for example, pbds things.

    It currently does not compile on the new language, although it did compile on the 64 bit version of c++17. Could you please take a look at it?

    Compilation Message
  • »
    »
    2 years ago, hide # ^ |
    Rev. 3  
    Vote: I like it -9 Vote: I do not like it

  • »
    »
    2 years ago, hide # ^ |
    Rev. 2  
    Vote: I like it +44 Vote: I do not like it

    Unfortunately it seems that GCC 13-64 still has this issue, but cases when it manifests have been shuffled around. Identical submissions: TL with GCC 13, OK with GCC 11.

    • »
      »
      »
      2 years ago, hide # ^ |
       
      Vote: I like it +26 Vote: I do not like it

      Curious, why does your example give TLE when submitted, but not during custom invocation?

      • »
        »
        »
        »
        2 years ago, hide # ^ |
        Rev. 2  
        Vote: I like it +21 Vote: I do not like it

        I get different sets of "evil" constants in custom invocation and submitting in a mashup. Both sets seem to reproduce perfectly, but I've no idea why they're different.

        UPD An example of "custom invocation-evil" constant for GCC 13 is 12495.

    • »
      »
      »
      2 years ago, hide # ^ |
      Rev. 2  
      Vote: I like it +49 Vote: I do not like it

      The following code takes 9s in custom test, with C++20 (64, GCC 13):

      Code

      Now the issue suddenly becomes 100x more confusing to me, because:

      1. In my previous research I concluded that malloc in C++17 (64) and C++20 (64, GCC 11) is not a direct HeapAlloc call. Replacing malloc with HeapAlloc solved it, so I assumed a bad library implementation caused the slow that case.
      2. 98% of the time in the code above is spent in alloc, so HeapAlloc causes the slow in this case.
      3. The above code is only slow in C++20 (64, GCC 13). However, the sequence of allocations/deallocations seems to be the same in C++17 and C++20 (64, GCC 11).
      4. If I remove the printf line or the cout line, it runs fast again. printf takes no time. The sequence of allocations/deallocations is not affected by printf.

      I have no clue about anything now.

    • »
      »
      »
      9 months ago, hide # ^ |
       
      Vote: I like it +10 Vote: I do not like it

      This is quite late, but was this resolved? I tried submitting in the same language as you did (C++20 (GCC 13-64)) and it passed: 330539517

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +12 Vote: I do not like it
    It's very sad that
  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +14 Vote: I do not like it

    would it be possible to keep at least one compiler for c++20?

    • »
      »
      »
      2 years ago, hide # ^ |
       
      Vote: I like it +16 Vote: I do not like it

      Yeah -- I have a good chunk of boilerplate code that depends on C++20, and it's a pain to backport them to C++17.

      I would like to ask the admins to announce the (un)availability of C++20 before a contest starts, so we can decide (not) to participate.

»
2 years ago, hide # |
 
Vote: I like it +35 Vote: I do not like it

I hope to find the cause of the relevant problem and fix it as soon as possible.
GNU C++20(64) is faster than C++14/17(32) when using variable type long long, and we connot use variable type __int128.
Some problem is hard to be solved without GNU C++20(64). :(

»
2 years ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

Just noticed c++20 is back, does it means the bug is fixed?

  • »
    »
    2 years ago, hide # ^ |
     
    Vote: I like it +8 Vote: I do not like it

    No. See ToxicPie9's test code:

    #include <iostream>
    #include <windows.h>
    
    using namespace std;
    
    void *alloc(size_t s) {
        printf("");
        auto p = (void *)HeapAlloc(GetProcessHeap(), 0, s ? s : 1);
        asm volatile("" ::"r"(p));
        return p;
    }
    
    void dealloc(void *p) {
        HeapFree(GetProcessHeap(), 0, p);
    }
    
    int main() {
        int evil = 12495;
        int count = 1e6;
        for (int i = 0; i < evil; i++) {
            alloc(0x28);
            dealloc(alloc(0x1c));
        }
        for (int i = evil; i < count; i++) {
            dealloc(alloc(0x1c));
        }
        cout << "hi\n";
    }
    

    It takes about 9 seconds on the CF custom test, while testing this code locally using UCRT mingw64 gcc13.2, it takes no time, showing that this version is still the buggy MSVCRT one.

»
2 years ago, hide # |
 
Vote: I like it +2 Vote: I do not like it

MikeMirzayanov I reported (cf blog) This simple program does not compile on my local setup bug to GCC ~8 months ago. It's not fixed it. Open bug link on bugzilla

Today's GCC version upgrade has introduced this bug to codeforces. Sample compilation error submission — 252744779