Yesterday, investigating Strange TLE by cin using GNU C++20 (64), I found an easy and reproducable way to trigger a slowdown bug that I believe has been plaguing Codeforces for some time now. So I'm making this blog to raise awarness of it. MikeMirzayanov, please take a look at this!
Here is how to trigger the slowness bug:
- Take any problem with a relatively large input on Codeforces ($$$2 \cdot 10^5$$$ ints is enough).
- Take a random AC C++ submission that uses
std::cin
. - Add the line
vector<vector<int>> TLE(40000, vector<int>(7));
somewhere in global space. - Submit using either C++20(64 bit) or C++17(64 bit).
- ???
- TLE
For example take tourist's solution to problem 1936-D - Bitwise Paradox. With the vector of death added to the code, it gets TLE on TC5 (taking $$$> 5$$$ s). While without the deadly vector, the submission takes 155 ms on TC5.
Here is a stand alone example with the slowdown (credit to kostia244). It runs 100 times slower with the vector of death.
#include<bits/stdc++.h>
using namespace std;
vector<vector<int>> TLE(40000, vector<int>(7));
int main() {
string s;
for(int i = 0; i < 17; i++) s += to_string(i) + " ";
for (int j = 0; j < 60000; ++j) {
istringstream ss(s);
int x;
while (ss >> x);
}
}
What is causing this?
Other ways to trigger the slowdown bug
Other blogs on this topic
I was able to reduce kostia's code to this:
Length of
40000
or40002
doesn't work, only40001
works. Without that loop the code gets WA in 77ms.This takes 500ms for sizes from 7 to 10, <=15ms for any other size.
TLE vector probably triggers some especially bad case for the allocator.
is the slowness bug random cause I tried the same as told by pajenegod
without death vector:248597380 78 ms
with death vector:249350892 93 ms
I ran the codes on C++20(64bit) and input is 2.10^5 but still not a big change in time of both codes.
You have
#define int long long
in your code. The death vector needs to bevector<vector<int>>
and notvector<vector<long long>>
. Your solution takes 1996 ms / 2 s 249353007 with the actual death vector.OH SORRY.MY BAD!! This really is a concerning issue . PS: Thanks for the blog
WHY ARE YOU TAKE MY PICTURE
A simple experiment shows that a VECTOR only affects the cin during its lifetime, and once the VECTOR is released, its effect disappears.
That's why any global space or main will have an effect on this.
In addition,
vector<vector<long long>> TLE(40000, vector<long long>(5));
also has an effect, It doesn't seem to matter what the innermost type is, but rather the byte count.upvoted. orz
stop asking for contribution man!! o_o
orz is respect word. It's not asking for contribution. I value this user's insights regarding the death vector! I also speculate he played a WA24 scheme before, in this so he helped catch many cheater too! check
unrelated to the topic, but
what is benefit of using C++20(64) over c++17 for CF, I haven't seen any use case (feature that present C++20(64) but not available in c++17, besides obvious one like synthetic sugar of course)
just curious
downvoted
...
one thing that i can think of is the contains() method for map and set. it's not there in c++17
in C++20, you can use ranges, countl_zero, bitceil, ... etc
you can use
std::lcm
andstd::gcd
,it's more simple than C++17but these two are already added in STL with C++17
I like
std::range
andpriority_queue<sth,vector<sth>,decltype(lambda)>
.As an alternative, you can use std::make_heap, std::push_heap and std::pop_heap which use a (sub)range of STL containers (e.g. vector) to construct a heap in place:
These functions also have their C++20 "std::ranges" version like
std::ranges::make_heap(v);
though, albeit the C++11 version is already not so difficult to use.__int128_t is very useful too, and isn't present on 32 bits versions
[deleted]
Not really the same thing, it is expected that different compilers make programs with different run times. Your code is just slower by a small fraction. The issue is a slowdown of 10 to 100 times on something that shouldn't be an issue.
Sorry, you are right they aren't the same.
it's just that I am upset that someone can fail the system testing due to things like this.
Thanks for the investigative work. I was also able to recreate the problem in my blog (having a set of size 99904) in this problem: TLE on 9 (takes 10 seconds in my mashup), doesn't happen in 32 bits, doesn't happen with other sizes (in this case i tried 99910).
The difference between mine and your problem is that i can't just do it by adding a set of a magic size, i can't do the insertions all at once. This makes me think that it is related to allocating a bunch of small data in random places in memory, because a vector of vectors probably just allocates randomly every time (it is a nested structure and is less likely to have any extra optimizations built in) while there could be an optimization for sets where data gets allocated roughly at the same spot if it detects you're doing it all at once (this is also backed by the fact that doing new int[7] also works for your case but doing 7 arrays of size 40000 doesn't, even though they occupy the same total ammount of memory).
Maybe it has to do with memory fragmentation? Maybe cin is trying to do something and fails? I have no idea but it will be fun to see how this rabbit hole will end up.
Here are the results of my investigation.
Base code for C++17 (64-bit):
It is also possible to repeat this with C++20 (64-bit) or with malloc/free. However, since I don't actually use this memory, the compiler does its magic. Fun fact: in some OJ with
while(true);
in your code, you will not get TL (at least not because of while).This code runs in ~500ms and 0ms if 40001 is changed to 40000.
After playing with the sizes of 'a' and 'b', I found that as long as they are within the range (25, 40), not necessarily the same, it produces the same result. So it may have something to do with allocating memory in the heap. Sizes 24 and 40 are common sizes of chunks in bins (32 is skipped, though).
Then I checked the addresses of allocated memory. 'b' is always in the same place. Usually, the distance between 'a' chunks is 48. But the 40001st and 40002nd chunks are 112 bytes far. And so are 41365-41366, 42729-42730. And guess what? They all cause slowdown!! Big numbers of the form
445 + 1364 * k
work. Sovector<vector<int>> TLE(444 + 1364 * k, vector<int>(7));
all cause slowdown.One more observation: code with 40001 uses 1924KB, with 40002 uses 1924KB, and so on until 40001+1364, which uses 1988KB. Maybe the reason is how Codeforces works with memory? Maybe pages?
I am curious to see how this progresses and the reason behind the slowdown.
Wow, nice find! I just tried using this with $$$k=41$$$ in a hack on problem 1936-C - Pokémon Arena, and it totally works.
Btw, I just checked and it seems $$$k$$$ needs to be $$$\geq 8$$$ for the slowdown bug to happen. So the smallest death vector is
vector<vector<int>> TLE(11356, vector<int>(7))
.I pinned the issue down to one of the two functions
V6_HeapAlloc
(__sbh_alloc_block
) andV5_HeapAlloc
(__old_sbh_alloc_block
) insbheap.c
in MSVCRT in my blog. My guess is the allocation algorithm is very bad at some edge cases, for example a linked list of allocated chunks followed by a free chunk (I haven't read the entire code, so this is only a guess of how it can be slow).I think maybe it's just because
std::cin
is just slow. Tryscanf
or fast read instead.cin
withios::sync_with_stdio(false)
andcin.tie(nullptr)
is usually faster thanscanf
.What's more, the "scanf-performance" in those MSVCRT 64-bit compilers is significantly worse than cin without tie and sync.
See 240611683 that takes 1013ms to run, TLE if used scanf; 32-bit version 240612877 takes 1500ms to run, almost the same run time if used scanf.
I made a post about the situation.
Me a python user after seeing this blog — Yes
It was surprising to me to see C++ having this kind of performance bug, since performance bugs like these are normally something you'd find in PyPy. See for example https://github.com/pypy/pypy/issues/3794 . You could easily cook up some example in Python similar to the death vector where PyPy gets TLE for no apparent reason.
After seeing that "j%6" bug that you just shared, I am convinced that we pythoners were once part of Lucifers army and python is hated by God
Dang, this is gonna be great for hacking
Unfortunately, yes.
In my opinion in-contest hacking should be turned off for problems where this bug can be abused for hacking (grid / matrix problems) until this issue has been fixed.
is this possibly related to how
size_t
is still 32 bits even when the language chosen claims to be C++ (64 bit) ?edit: this is false.
size_t
is indeed 64 bits. i confused it withlong
What? C++ language is neither 32 bit nor 64 bit.
there are many ways to compile a single c++ source code. you can compile for 32-bit machines, or 64-bit machines, or 5-bit machines if you write your own compiler.
Thank you for this investigation. I have introduced a new language: C++20 (GCC 13-64). Please test it. My tests show that this issue has been fixed in this version.
The current plan is to hide all other 64-bit C++ compilers except for the new C++20 (GCC 13-64).
Is it stable enough to be used in the contest today?
I think so. If you're not sure, just stick to the time-tested 32-bit versions.
UPD: sorry, I forgot to submit it with C++20 (GCC-13) :(
I think only winlibs with UCRT (not MSVCRT) fixes this issue, so to fix this the solution should be to install UCRT and then winlibs with UCRT.seems to have been fixed?Hi,
bits/extc++.h
is a header which is likebits/stdc++.h
except it includes more stuff, making it a strictly betterbits/stdc++.h
in the sense that you also don't have to add includes when using, for example, pbds things.It currently does not compile on the new language, although it did compile on the 64 bit version of c++17. Could you please take a look at it?
When compiling (on the new c++20 64 bit language):
btw, it also doesn't compile for old version of c++ 20 compiler.
Unfortunately it seems that GCC 13-64 still has this issue, but cases when it manifests have been shuffled around. Identical submissions: TL with GCC 13, OK with GCC 11.
Curious, why does your example give TLE when submitted, but not during custom invocation?
I get different sets of "evil" constants in custom invocation and submitting in a mashup. Both sets seem to reproduce perfectly, but I've no idea why they're different.
UPD An example of "custom invocation-evil" constant for GCC 13 is 12495.
The following code takes 9s in custom test, with C++20 (64, GCC 13):
Now the issue suddenly becomes 100x more confusing to me, because:
malloc
in C++17 (64) and C++20 (64, GCC 11) is not a directHeapAlloc
call. Replacingmalloc
withHeapAlloc
solved it, so I assumed a bad library implementation caused the slow that case.alloc
, soHeapAlloc
causes the slow in this case.printf
line or thecout
line, it runs fast again.printf
takes no time. The sequence of allocations/deallocations is not affected byprintf
.I have no clue about anything now.
could it be the windows heap ?
https://stackoverflow.com/questions/36762516/why-does-repeated-memory-allocation-in-windows-slow-down
Maybe not, because UCRTmingw64 gcc13.2 and msvc2022 would not trigger this weird 9s bug. But the MSVCRTmingw64 gcc13.2 (including the one recently back) would do.
We can't use
__int128
now.If there's any solution, please tell me. Thank you.
Use python
would it be possible to keep at least one compiler for c++20?
Yeah -- I have a good chunk of boilerplate code that depends on C++20, and it's a pain to backport them to C++17.
I would like to ask the admins to announce the (un)availability of C++20 before a contest starts, so we can decide (not) to participate.
I hope to find the cause of the relevant problem and fix it as soon as possible.
GNU C++20(64) is faster than C++14/17(32) when using variable type
long long
, and we connot use variable type__int128
.Some problem is hard to be solved without GNU C++20(64). :(
Maybe not just ban them. While fixing this bug, you can still make C++17(64) and C++20(64) online and useable.
There will be unexpected hacks.
Just noticed c++20 is back, does it means the bug is fixed?
No. See ToxicPie9's test code:
It takes about 9 seconds on the CF custom test, while testing this code locally using UCRT mingw64 gcc13.2, it takes no time, showing that this version is still the buggy MSVCRT one.
Just tested it after Mike migrate the server to newer version, and it runs about 100ms. Seems this bug is gone :)
No, it's not fixed at least for C++.
My further tests show that the bug is just hidden somewhere else. Need some other methods for this bug. See my test and conclusions.
So why not fix this by just migrating to UCRT? :(
UPD: It's not fixed at least for C++.
Further investigation shows that the bug is just hidden somewhere else. Need some other methods for this bug. See my test and conclusions.
Original post↓
It is fixed by updating the system to Windows Server 2022, those legacy 32-bit compilers seemed to benefit from this (you can try copying 253400416 to custom test, and TLE would not happen). As MikeMirzayanov said, they are working on resolving the situation more systematically.
MikeMirzayanov did not reveal how they fixed the problem, but I suppose they also migrated the 64-bit compilers to the UCRT version of the mingw64 toolchain.
MikeMirzayanov I reported (cf blog) This simple program does not compile on my local setup bug to GCC ~8 months ago. It's not fixed it. Open bug link on bugzilla
Today's GCC version upgrade has introduced this bug to codeforces. Sample compilation error submission — 252744779
The easiest way to avoid it is to remove those pragmas, especially "target" directives. It's a bug affecting every GCC13 on any platform, and Clang if it's using GCC13's libstdc++.
C++ is the tier-0 fast language for competitive programming in Codeforces, I think it's not particularly useful to do so when this language already has a big advantage over others.