pajenegod's blog

By pajenegod, history, 2 months ago, In English

Yesterday, investigating Strange TLE by cin using GNU C++20 (64), I found an easy and reproducible way to trigger a slowdown bug that I believe has been plaguing Codeforces for some time now. So I'm making this blog to raise awareness of it. MikeMirzayanov, please take a look at this!

Here is how to trigger the slowness bug:

  1. Take any problem with a relatively large input on Codeforces ($$$2 \cdot 10^5$$$ ints is enough).
  2. Take a random AC C++ submission that uses std::cin.
  3. Add the line vector<vector<int>> TLE(40000, vector<int>(7)); somewhere at global scope.
  4. Submit using either C++20(64 bit) or C++17(64 bit).
  5. ???
  6. TLE

For example, take tourist's solution to problem 1936-D - Bitwise Paradox. With the vector of death added to the code, it gets TLE on TC5 (taking $$$> 5$$$ s), while without the deadly vector the submission takes 155 ms on TC5.

Here is a stand-alone example of the slowdown (credit to kostia244). It runs 100 times slower with the vector of death.

#include<bits/stdc++.h>
using namespace std;

vector<vector<int>> TLE(40000, vector<int>(7));

int main() {
    string s;
    for(int i = 0; i < 17; i++) s += to_string(i) + " ";
    
    for (int j = 0; j < 60000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}
What is causing this?
Other ways to trigger the slowdown bug
Other blogs on this topic
  • Vote: +658

»
2 months ago, # |
Rev. 3   Vote: +10

I was able to reduce kostia's code to this:

#include<bits/stdc++.h>
using namespace std;

int main() {
    for (int i = 0; i < 40001; i++) {
        new int[7]; // 10 works consistently too
    }

    string s;
    for(int i = 0; i < 17*30000; i++) s += to_string(i) + " ";
    istringstream ss(s); int x; while (ss >> x);
}

A length of 40000 or 40002 doesn't work; only 40001 does. Without that loop, the code gets WA in 77 ms.

»
2 months ago, # |
Rev. 3   Vote: +27

This takes 500 ms for sizes from 7 to 10, and <= 15 ms for any other size.

#include <vector>
using namespace std;

int main() {
    vector<vector<int>> TLE(40000, vector<int>(7));
    for (int j = 0; j < 60000; ++j) {
        vector<int>(7); // any size from 7 to 10 works
    }
}
  • »
    »
    2 months ago, # ^ |
      Vote: +14

    TLE vector probably triggers some especially bad case for the allocator.

»
2 months ago, # |
Rev. 2   Vote: -19

Is the slowness bug random? I tried the same thing pajenegod described:

without death vector: 248597380, 78 ms

with death vector: 249350892, 93 ms

I ran the codes with C++20 (64 bit) and the input size is $$$2 \cdot 10^5$$$, but there is still no big change in the run times of the two codes.

  • »
    »
    2 months ago, # ^ |
    Rev. 4   Vote: +32

    You have #define int long long in your code. The death vector needs to be vector<vector<int>> and not vector<vector<long long>>. Your solution takes 1996 ms / 2 s 249353007 with the actual death vector.

  • »
    »
    2 months ago, # ^ |
      Vote: +13

    WHY ARE YOU TAKE MY PICTURE

»
2 months ago, # |
Rev. 2   Vote: +16

A simple experiment shows that the vector only affects cin during its lifetime; once the vector is released, its effect disappears.

That's why declaring it either at global scope or inside main has this effect.

In addition, vector<vector<long long>> TLE(40000, vector<long long>(5)); also has an effect. It doesn't seem to matter what the innermost type is, but rather the byte count.

»
2 months ago, # |
Rev. 2   Vote: -91

unrelated to the topic, but

what is the benefit of using C++20 (64) over C++17 on CF? I haven't seen any use case (a feature present in C++20 (64) but not available in C++17, besides obvious things like syntactic sugar, of course)

just curious

  • »
    »
    2 months ago, # ^ |
      Vote: -50

    downvoted

  • »
    »
    2 months ago, # ^ |
      Vote: +26

    One thing that I can think of is the contains() method for map and set; it's not there in C++17.

  • »
    »
    2 months ago, # ^ |
      Vote: +13

    In C++20, you can use std::ranges, std::countl_zero, std::bit_ceil, etc.

  • »
    »
    2 months ago, # ^ |
      Vote: 0

    You can use std::lcm and std::gcd; it's simpler than in C++17.

  • »
    »
    2 months ago, # ^ |
      Vote: 0

    I like std::ranges and priority_queue<sth, vector<sth>, decltype(lambda)>.

    • »
      »
      »
      2 months ago, # ^ |
      Rev. 2   Vote: +3

      As an alternative, you can use std::make_heap, std::push_heap and std::pop_heap which use a (sub)range of STL containers (e.g. vector) to construct a heap in place:

      int main() {
      	std::vector<int> v1{ 3, 2, 4, 1, 5, 9 };
      	std::make_heap(v1.begin(), v1.end()); // 9 5 4 1 2 3
      	std::pop_heap(v1.begin(), v1.end());  // 5 3 4 1 2 9
      	auto top = v1.back();    // 9
      	v1.pop_back();   // 5 3 4 1 2
      
      	std::vector<int> v2{ 3, 2, 4, 5, 9 };
      	std::make_heap(v2.begin(), v2.end(), std::greater<>{}); // 2 3 4 5 9
      	v2.push_back(1);	// 2 3 4 5 9 1
      	std::push_heap(v2.begin(), v2.end(), std::greater<>{});  // 1 3 2 5 9 4
      }
      

      These functions also have C++20 "std::ranges" versions, like std::ranges::make_heap(v), though the C++11 version is already not difficult to use.

  • »
    »
    2 months ago, # ^ |
      Vote: +27

    __int128_t is very useful too, and isn't available in the 32-bit versions

»
2 months ago, # |
Rev. 2   Vote: -35

[deleted]

  • »
    »
    2 months ago, # ^ |
      Vote: +44

    Not really the same thing; it is expected that different compilers produce programs with different run times, and your code is just slower by a small fraction. The issue here is a slowdown of 10 to 100 times on something that shouldn't be an issue at all.

    • »
      »
      »
      2 months ago, # ^ |
        Vote: 0

      Sorry, you are right, they aren't the same.

      It's just that I am upset that someone can fail system testing due to things like this.

»
2 months ago, # |
Rev. 2   Vote: +16

Thanks for the investigative work. I was also able to recreate the problem from my blog (having a set of size 99904) in this problem: TLE on test 9 (it takes 10 seconds in my mashup). It doesn't happen in 32 bits, and doesn't happen with other sizes (in this case I tried 99910).

The difference between my problem and yours is that I can't trigger it just by adding a set of a magic size; I can't do the insertions all at once. This makes me think it is related to allocating a bunch of small pieces of data at random places in memory, because a vector of vectors probably just allocates randomly every time (it is a nested structure and is less likely to have any extra optimizations built in), while there could be an optimization for sets where data gets allocated at roughly the same spot if it detects you're doing it all at once (this is also backed by the fact that doing new int[7] also works in your case, but doing 7 arrays of size 40000 doesn't, even though they occupy the same total amount of memory).

Maybe it has to do with memory fragmentation? Maybe cin is trying to do something and fails? I have no idea, but it will be fun to see how this rabbit hole ends up.

»
2 months ago, # |
Rev. 2   Vote: +113

Here are the results of my investigation.

Base code for C++17 (64-bit):

int main() {
    for (int i = 0; i < 40001; i++) {
        char* a = new char[25];
    }
    for (int j = 0; j < 60000; ++j) {
        char* b = new char[25];
        delete[] b;
    }
}

It is also possible to reproduce this with C++20 (64-bit) or with malloc/free. However, since I don't actually use this memory, the compiler does its magic. Fun fact: on some OJs, having while(true); in your code will not get you TL (at least not because of the while).

This code runs in ~500 ms, and in 0 ms if 40001 is changed to 40000.

After playing with the sizes of 'a' and 'b', I found that as long as they are within the range (25, 40), not necessarily equal, the result is the same. So it may have something to do with how memory is allocated on the heap. Sizes 24 and 40 are common sizes of chunks in bins (32 is skipped, though).

Then I checked the addresses of the allocated memory. 'b' is always in the same place. Usually the distance between consecutive 'a' chunks is 48 bytes, but the 40001st and 40002nd chunks are 112 bytes apart. And so are 41365-41366 and 42729-42730. And guess what? They all cause the slowdown!! Big numbers of the form 445 + 1364 * k work, so vector<vector<int>> TLE(444 + 1364 * k, vector<int>(7)); all cause the slowdown.

One more observation: the code with 40001 uses 1924 KB, with 40002 it also uses 1924 KB, and so on until 40001 + 1364, which uses 1988 KB. Maybe the reason is how Codeforces works with memory? Maybe pages?

I am curious to see how this progresses and the reason behind the slowdown.

  • »
    »
    2 months ago, # ^ |
      Vote: +32

    Wow, nice find! I just tried using this with $$$k=41$$$ in a hack on problem 1936-C - Pokémon Arena, and it totally works.

  • »
    »
    2 months ago, # ^ |
      Vote: +34

    Btw, I just checked and it seems $$$k$$$ needs to be $$$\geq 8$$$ for the slowdown bug to happen. So the smallest death vector is vector<vector<int>> TLE(11356, vector<int>(7)).

  • »
    »
    2 months ago, # ^ |
      Vote: +43

    I pinned the issue down to one of the two functions V6_HeapAlloc (__sbh_alloc_block) and V5_HeapAlloc (__old_sbh_alloc_block) in sbheap.c in MSVCRT in my blog. My guess is that the allocation algorithm is very bad in some edge cases, for example a linked list of allocated chunks followed by a free chunk (I haven't read the entire code, so this is only a guess at how it could be slow).

»
2 months ago, # |
  Vote: -146

I think maybe it's just because std::cin is slow. Try scanf or a fast read instead.

  • »
    »
    2 months ago, # ^ |
      Vote: 0

    cin with ios::sync_with_stdio(false) and cin.tie(nullptr) is usually faster than scanf.

    • »
      »
      »
      2 months ago, # ^ |
        Vote: 0

      What's more, scanf performance with those MSVCRT 64-bit compilers is significantly worse than cin without tie and sync.

      See 240611683, which takes 1013 ms to run (TLE if scanf is used); the 32-bit version 240612877 takes 1500 ms, with almost the same run time if scanf is used.

»
2 months ago, # |
  Vote: +47

I made a post about the situation.

»
2 months ago, # |
  Vote: +1

Me a python user after seeing this blog — Yes

  • »
    »
    2 months ago, # ^ |
      Vote: +10

    It was surprising to me to see C++ having this kind of performance bug, since performance bugs like these are normally something you'd find in PyPy. See for example https://github.com/pypy/pypy/issues/3794 . You could easily cook up some example in Python similar to the death vector where PyPy gets TLE for no apparent reason.

    • »
      »
      »
      2 months ago, # ^ |
      Rev. 2   Vote: -10

      After seeing that "j%6" bug you just shared, I am convinced that we Pythoners were once part of Lucifer's army and Python is hated by God

»
2 months ago, # |
  Vote: +15

Dang, this is gonna be great for hacking

  • »
    »
    2 months ago, # ^ |
      Vote: +54

    Unfortunately, yes.

    In my opinion, in-contest hacking should be turned off for problems where this bug can be abused for hacking (grid / matrix problems) until the issue has been fixed.

»
2 months ago, # |
Rev. 2   Vote: 0

Is this possibly related to size_t still being 32 bits even when the chosen language claims to be C++ (64 bit)?

edit: this is false. size_t is indeed 64 bits; I confused it with long.

  • »
    »
    2 months ago, # ^ |
      Vote: +3

    What? The C++ language is neither 32-bit nor 64-bit.

    • »
      »
      »
      2 months ago, # ^ |
        Vote: 0

      There are many ways to compile a single C++ source file: you can compile for 32-bit machines, 64-bit machines, or even 5-bit machines if you write your own compiler.

»
2 months ago, # |
Rev. 2   Vote: +145

Thank you for this investigation. I have introduced a new language: C++20 (GCC 13-64). Please test it. My tests show that this issue has been fixed in this version.

The current plan is to hide all other 64-bit C++ compilers except for the new C++20 (GCC 13-64).

  • »
    »
    2 months ago, # ^ |
      Vote: 0

    Is it stable enough to be used in the contest today?

  • »
    »
    2 months ago, # ^ |
    Rev. 2   Vote: 0

    UPD: sorry, I forgot to submit it with C++20 (GCC-13) :(

  • »
    »
    2 months ago, # ^ |
    Rev. 2   Vote: -8

    I think only winlibs with UCRT (not MSVCRT) fixes this issue, so the solution should be to install UCRT and then winlibs with UCRT. Seems to have been fixed?

  • »
    »
    2 months ago, # ^ |
      Vote: +41

    Hi, bits/extc++.h is a header like bits/stdc++.h, except that it includes more stuff, making it a strictly better bits/stdc++.h in the sense that you also don't have to add includes when using, for example, pbds things.

    It currently does not compile with the new language, although it did compile with the 64-bit version of C++17. Could you please take a look at it?

    Compilation Message
  • »
    »
    2 months ago, # ^ |
    Rev. 3   Vote: -9

  • »
    »
    2 months ago, # ^ |
    Rev. 2   Vote: +44

    Unfortunately, it seems that GCC 13-64 still has this issue, but the cases in which it manifests have been shuffled around. Identical submissions: TL with GCC 13, OK with GCC 11.

    • »
      »
      »
      2 months ago, # ^ |
        Vote: +26

      Curious, why does your example give TLE when submitted, but not during custom invocation?

      • »
        »
        »
        »
        2 months ago, # ^ |
        Rev. 2   Vote: +21

        I get different sets of "evil" constants in custom invocation and when submitting in a mashup. Both sets seem to reproduce perfectly, but I have no idea why they are different.

        UPD: An example of a "custom-invocation-evil" constant for GCC 13 is 12495.

    • »
      »
      »
      2 months ago, # ^ |
      Rev. 2   Vote: +49

      The following code takes 9s in custom test, with C++20 (64, GCC 13):

      Code

      Now the issue suddenly becomes 100x more confusing to me, because:

      1. In my previous research I concluded that malloc in C++17 (64) and C++20 (64, GCC 11) is not a direct HeapAlloc call. Replacing malloc with HeapAlloc solved it, so I assumed a bad library implementation caused the slowness in that case.
      2. 98% of the time in the code above is spent in alloc, so HeapAlloc causes the slowness in this case.
      3. The above code is only slow in C++20 (64, GCC 13), yet the sequence of allocations/deallocations seems to be the same as in C++17 and C++20 (64, GCC 11).
      4. If I remove the printf line or the cout line, it runs fast again. printf takes no time, and the sequence of allocations/deallocations is not affected by printf.

      I have no clue about anything now.

  • »
    »
    2 months ago, # ^ |
      Vote: +12
    It's very sad that
  • »
    »
    2 months ago, # ^ |
      Vote: +14

    Would it be possible to keep at least one compiler for C++20?

    • »
      »
      »
      2 months ago, # ^ |
        Vote: +16

      Yeah -- I have a good chunk of boilerplate code that depends on C++20, and it's a pain to backport it to C++17.

      I would like to ask the admins to announce the (un)availability of C++20 before a contest starts, so we can decide (not) to participate.

»
2 months ago, # |
  Vote: +35

I hope the cause of this problem is found and fixed as soon as possible.
GNU C++20 (64) is faster than C++14/17 (32) when using the long long type, and in 32 bits we cannot use the __int128 type.
Some problems are hard to solve without GNU C++20 (64). :(

»
2 months ago, # |
  Vote: +25

Maybe don't just ban them: while fixing this bug, you could still keep C++17 (64) and C++20 (64) online and usable.

»
7 weeks ago, # |
  Vote: 0

Just noticed C++20 is back; does that mean the bug is fixed?

  • »
    »
    7 weeks ago, # ^ |
      Vote: +8

    No. See ToxicPie9's test code:

    #include <iostream>
    #include <windows.h>
    
    using namespace std;
    
    void *alloc(size_t s) {
        printf("");
        auto p = (void *)HeapAlloc(GetProcessHeap(), 0, s ? s : 1);
        asm volatile("" ::"r"(p));
        return p;
    }
    
    void dealloc(void *p) {
        HeapFree(GetProcessHeap(), 0, p);
    }
    
    int main() {
        int evil = 12495;
        int count = 1e6;
        for (int i = 0; i < evil; i++) {
            alloc(0x28);
            dealloc(alloc(0x1c));
        }
        for (int i = evil; i < count; i++) {
            dealloc(alloc(0x1c));
        }
        cout << "hi\n";
    }
    

    It takes about 9 seconds in the CF custom test, while running this code locally with UCRT mingw64 GCC 13.2 takes no time, showing that the version on CF is still the buggy MSVCRT one.

    • »
      »
      »
      5 weeks ago, # ^ |
        Vote: 0

      I tested it again after Mike migrated the server to a newer version, and it now runs in about 100 ms. Seems this bug is gone :)

      • »
        »
        »
        »
        66 minutes ago, # ^ |
          Vote: 0

        No, it's not fixed, at least for C++.

        My further tests show that the bug is just hidden somewhere else, and some other methods are needed to trigger it. See my test and conclusions.

    • »
      »
      »
      4 weeks ago, # ^ |
        Vote: 0

      So why not fix this by just migrating to UCRT? :(

      • »
        »
        »
        »
        4 weeks ago, # ^ |
        Rev. 3   Vote: 0

        UPD: It's not fixed, at least for C++.

        Further investigation shows that the bug is just hidden somewhere else, and some other methods are needed to trigger it. See my test and conclusions.

        Original post↓


        It was fixed by updating the system to Windows Server 2022; the legacy 32-bit compilers seemed to benefit from this (you can try copying 253400416 to the custom test, and TLE no longer happens). As MikeMirzayanov said, they are working on resolving the situation more systematically.

        MikeMirzayanov did not reveal how they fixed the problem, but I suppose they also migrated the 64-bit compilers to the UCRT version of the mingw-w64 toolchain.

»
7 weeks ago, # |
  Vote: +2

MikeMirzayanov, I reported the "This simple program does not compile on my local setup" bug (cf blog) to GCC ~8 months ago. It's still not fixed. Open bug link on bugzilla

Today's GCC version upgrade has introduced this bug to Codeforces. Sample compilation error submission — 252744779

  • »
    »
    4 weeks ago, # ^ |
      Vote: 0

    The easiest way to avoid it is to remove those pragmas, especially the "target" directives. It's a bug affecting every GCC 13 on any platform, and also Clang when it uses GCC 13's libstdc++.

    C++ is the tier-0 fast language for competitive programming on Codeforces; I don't think it's particularly useful to do so when the language already has such a big advantage over others.