Slowdown bug affecting C++ (64 bit) on Codeforces

#	User	Rating
1	Benq	3792
2	VivaciousAubergine	3647
3	jiangly	3631
4	Kevin114514	3574
5	maroonrk	3521
6	strapple	3515
7	Radewoosh	3461
8	tourist	3428
9	turmax	3378
10	Um_nik	3376

#	User	Contrib.
1	Qingyu	162
2	adamant	148
3	Um_nik	146
4	Dominater069	143
5	errorgorn	141
6	cry	138
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
10	soullless	133

Yesterday, investigating Strange TLE by cin using GNU C++20 (64), I found an easy and reproducable way to trigger a slowdown bug that I believe has been plaguing Codeforces for some time now. So I'm making this blog to raise awarness of it. MikeMirzayanov, please take a look at this!

Here is how to trigger the slowness bug:

Take any problem with a relatively large input on Codeforces ($$$2 \cdot 10^5$$$ ints is enough).
Take a random AC C++ submission that uses std::cin.
Add the line vector<vector<int>> TLE(40000, vector<int>(7)); somewhere in global space.
Submit using either C++20(64 bit) or C++17(64 bit).
???
TLE

For example take tourist's solution to problem 1936-D - Bitwise Paradox. With the vector of death added to the code, it gets TLE on TC5 (taking $$$ \gt 5$$$ s). While without the deadly vector, the submission takes 155 ms on TC5.

Here is a stand alone example with the slowdown (credit to kostia244). It runs 100 times slower with the vector of death.

#include<bits/stdc++.h>
using namespace std;

vector<vector<int>> TLE(40000, vector<int>(7));

int main() {
    string s;
    for(int i = 0; i < 17; i++) s += to_string(i) + " ";
    
    for (int j = 0; j < 60000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}

What is causing this?

Other ways to trigger the slowdown bug

Other blogs on this topic

#include<bits/stdc++.h> using namespace std; int main() { for (int i = 0; i < 40001; i++) { new int[7]; // 10 works consistently too } string s; for(int i = 0; i < 17*30000; i++) s += to_string(i) + " "; istringstream ss(s); int x; while (ss >> x); }

#include <vector> using namespace std; int main() { vector<vector<int>> TLE(40000, vector<int>(7)); for (int j = 0; j < 60000; ++j) { vector<int>(7); // any size from 7 to 10 works } }

int main() { std::vector<int> v1{ 3, 2, 4, 1, 5, 9 }; std::make_heap(v1.begin(), v1.end()); // 9 5 4 1 2 3 std::pop_heap(v1.begin(), v1.end()); // 5 3 4 1 2 9 auto top = v1.back(); // 9 v1.pop_back(); // 5 3 4 1 2 std::vector<int> v2{ 3, 2, 4, 5, 9 }; std::make_heap(v2.begin(), v2.end(), std::greater<>{}); // 2 3 4 5 9 v2.push_back(1); // 2 3 4 5 9 1 std::push_heap(v2.begin(), v2.end(), std::greater<>{}); // 1 3 2 5 9 4 }

#include <iostream> #include <windows.h> using namespace std; void *alloc(size_t s) { printf(""); auto p = (void *)HeapAlloc(GetProcessHeap(), 0, s ? s : 1); asm volatile("" ::"r"(p)); return p; } void dealloc(void *p) { HeapFree(GetProcessHeap(), 0, p); } int main() { int evil = 12495; int count = 1e6; for (int i = 0; i < evil; i++) { alloc(0x28); dealloc(alloc(0x1c)); } for (int i = evil; i < count; i++) { dealloc(alloc(0x1c)); } cout << "hi\n"; }

Comments (63)

Show archived | Write comment?

drdilyor

2 years ago, hide # |

← Rev. 3 →

+10

I was able to reduce kostia's code to this:

Length of 40000 or 40002 doesn't work, only 40001 works. Without that loop the code gets WA in 77ms.

→ Reply

bkbtpout

+27

This takes 500ms for sizes from 7 to 10, <=15ms for any other size.

2 years ago, hide # ^ |

+14

TLE vector probably triggers some especially bad case for the allocator.

not_insync

← Rev. 2 →

-19

is the slowness bug random cause I tried the same as told by pajenegod

without death vector:248597380 78 ms

with death vector:249350892 93 ms

I ran the codes on C++20(64bit) and input is 2.10^5 but still not a big change in time of both codes.

pajenegod

← Rev. 4 →

+32

You have #define int long long in your code. The death vector needs to be vector<vector<int>> and not vector<vector<long long>>. Your solution takes 1996 ms / 2 s 249353007 with the actual death vector.

OH SORRY.MY BAD!! This really is a concerning issue . PS: Thanks for the blog

nonrice

+13

WHY ARE YOU TAKE MY PICTURE

down_07

+16

A simple experiment shows that a VECTOR only affects the cin during its lifetime, and once the VECTOR is released, its effect disappears.

That's why any global space or main will have an effect on this.

In addition, vector<vector<long long>> TLE(40000, vector<long long>(5)); also has an effect, It doesn't seem to matter what the innermost type is, but rather the byte count.

Heap_OverFlow

-45

upvoted. orz

attractors

stop asking for contribution man!! o_o

-20

orz is respect word. It's not asking for contribution. I value this user's insights regarding the death vector! I also speculate he played a WA24 scheme before, in this so he helped catch many cheater too! check

tickbird

-91

unrelated to the topic, but

what is benefit of using C++20(64) over c++17 for CF, I haven't seen any use case (feature that present C++20(64) but not available in c++17, besides obvious one like synthetic sugar of course)

just curious

-50

downvoted

-119

...

boxlesscat

+26

one thing that i can think of is the contains() method for map and set. it's not there in c++17

I_AM_MOSOL

in C++20, you can use ranges, countl_zero, bitceil, ... etc

Stronger_Than_Kdlyh

you can use std::lcm and std::gcd ,it's more simple than C++17

tatianyi

but these two are already added in STL with C++17

MrPython

I like std::range and priority_queue<sth,vector<sth>,decltype(lambda)>.

As an alternative, you can use std::make_heap, std::push_heap and std::pop_heap which use a (sub)range of STL containers (e.g. vector) to construct a heap in place:

These functions also have their C++20 "std::ranges" version like std::ranges::make_heap(v); though, albeit the C++11 version is already not so difficult to use.

Giga_Cronos

__int128_t is very useful too, and isn't present on 32 bits versions

Bit-Ghifo

-35

[deleted]

defnotmee

+44

Not really the same thing, it is expected that different compilers make programs with different run times. Your code is just slower by a small fraction. The issue is a slowdown of 10 to 100 times on something that shouldn't be an issue.

Sorry, you are right they aren't the same.

it's just that I am upset that someone can fail the system testing due to things like this.

Thanks for the investigative work. I was also able to recreate the problem in my blog (having a set of size 99904) in this problem: TLE on 9 (takes 10 seconds in my mashup), doesn't happen in 32 bits, doesn't happen with other sizes (in this case i tried 99910).

The difference between mine and your problem is that i can't just do it by adding a set of a magic size, i can't do the insertions all at once. This makes me think that it is related to allocating a bunch of small data in random places in memory, because a vector of vectors probably just allocates randomly every time (it is a nested structure and is less likely to have any extra optimizations built in) while there could be an optimization for sets where data gets allocated roughly at the same spot if it detects you're doing it all at once (this is also backed by the fact that doing new int[7] also works for your case but doing 7 arrays of size 40000 doesn't, even though they occupy the same total ammount of memory).

Maybe it has to do with memory fragmentation? Maybe cin is trying to do something and fails? I have no idea but it will be fun to see how this rabbit hole will end up.

998batrr

+113

Here are the results of my investigation.

Base code for C++17 (64-bit):

int main() {
    for (int i = 0; i < 40001; i++) {
        char* a = new char[25];
    }
    for (int j = 0; j < 60000; ++j) {
        char* b = new char[25];
        delete[] b;
    }
}

It is also possible to repeat this with C++20 (64-bit) or with malloc/free. However, since I don't actually use this memory, the compiler does its magic. Fun fact: in some OJ with while(true); in your code, you will not get TL (at least not because of while).

This code runs in ~500ms and 0ms if 40001 is changed to 40000.

After playing with the sizes of 'a' and 'b', I found that as long as they are within the range (25, 40), not necessarily the same, it produces the same result. So it may have something to do with allocating memory in the heap. Sizes 24 and 40 are common sizes of chunks in bins (32 is skipped, though).

Then I checked the addresses of allocated memory. 'b' is always in the same place. Usually, the distance between 'a' chunks is 48. But the 40001st and 40002nd chunks are 112 bytes far. And so are 41365-41366, 42729-42730. And guess what? They all cause slowdown!! Big numbers of the form 445 + 1364 * k work. So vector<vector<int>> TLE(444 + 1364 * k, vector<int>(7)); all cause slowdown.

One more observation: code with 40001 uses 1924KB, with 40002 uses 1924KB, and so on until 40001+1364, which uses 1988KB. Maybe the reason is how Codeforces works with memory? Maybe pages?

I am curious to see how this progresses and the reason behind the slowdown.

Wow, nice find! I just tried using this with $$$k=41$$$ in a hack on problem 1936-C - Pokémon Arena, and it totally works.

+34

Btw, I just checked and it seems $$$k$$$ needs to be $$$\geq 8$$$ for the slowdown bug to happen. So the smallest death vector is vector<vector<int>> TLE(11356, vector<int>(7)).

ToxicPie9

+43

‎‎I pinned the issue down to one of the two functions V6_HeapAlloc (__sbh_alloc_block) and V5_HeapAlloc (__old_sbh_alloc_block) in sbheap.c in MSVCRT in my blog. My guess is the allocation algorithm is very bad at some edge cases, for example a linked list of allocated chunks followed by a free chunk (I haven't read the entire code, so this is only a guess of how it can be slow).

Skywalker_OIer

-146

I think maybe it's just because std::cin is just slow. Try scanf or fast read instead.

cin with ios::sync_with_stdio(false) and cin.tie(nullptr) is usually faster than scanf.

+47

I made a post about the situation.

theAyconic1

Me a python user after seeing this blog — Yes

It was surprising to me to see C++ having this kind of performance bug, since performance bugs like these are normally something you'd find in PyPy. See for example https://github.com/pypy/pypy/issues/3794 . You could easily cook up some example in Python similar to the death vector where PyPy gets TLE for no apparent reason.

camc

+15

Dang, this is gonna be great for hacking

+54

Unfortunately, yes.

In my opinion in-contest hacking should be turned off for problems where this bug can be abused for hacking (grid / matrix problems) until this issue has been fixed.

ehnryx

is this possibly related to how size_t is still 32 bits even when the language chosen claims to be C++ (64 bit) ?

edit: this is false. size_t is indeed 64 bits. i confused it with long

Actium

What? C++ language is neither 32 bit nor 64 bit.

there are many ways to compile a single c++ source code. you can compile for 32-bit machines, or 64-bit machines, or 5-bit machines if you write your own compiler.

MikeMirzayanov

+145

Thank you for this investigation. I have introduced a new language: C++20 (GCC 13-64). Please test it. My tests show that this issue has been fixed in this version.

The current plan is to hide all other 64-bit C++ compilers except for the new C++20 (GCC 13-64).

AkibAzmain

Is it stable enough to be used in the contest today?

I think so. If you're not sure, just stick to the time-tested 32-bit versions.

tril0713

UPD: sorry, I forgot to submit it with C++20 (GCC-13) :(

chromate00

-8

~~I think only winlibs with UCRT (not MSVCRT) fixes this issue, so to fix this the solution should be to install UCRT and then winlibs with UCRT.~~ seems to have been fixed?

skittles1412

+41

Hi, bits/extc++.h is a header which is like bits/stdc++.h except it includes more stuff, making it a strictly better bits/stdc++.h in the sense that you also don't have to add includes when using, for example, pbds things.

It currently does not compile on the new language, although it did compile on the 64 bit version of c++17. Could you please take a look at it?

Compilation Message

When compiling (on the new c++20 64 bit language):

#include "bits/extc++.h"

int main() {}

Invocation failed [COMPILATION_ERROR]
Can't compile file:
In file included from C:/Programs/gcc13-64-winlibs/include/c++/13.2.0/x86_64-w64-mingw32/bits/extc++.h:83,
                 from program.cpp:1:
C:/Programs/gcc13-64-winlibs/include/c++/13.2.0/ext/codecvt_specializations.h:42:10: fatal error: iconv.h: No such file or directory
   42 | #include <iconv.h>
      |          ^~~~~~~~~
compilation terminated.

btw, it also doesn't compile for old version of c++ 20 compiler.

Anshul_Johri

-9

maxplus

Unfortunately it seems that GCC 13-64 still has this issue, but cases when it manifests have been shuffled around. Identical submissions: TL with GCC 13, OK with GCC 11.

Curious, why does your example give TLE when submitted, but not during custom invocation?

+21

I get different sets of "evil" constants in custom invocation and submitting in a mashup. Both sets seem to reproduce perfectly, but I've no idea why they're different.

UPD An example of "custom invocation-evil" constant for GCC 13 is 12495.

+49

The following code takes 9s in custom test, with C++20 (64, GCC 13):

Code

Now the issue suddenly becomes 100x more confusing to me, because:

In my previous research I concluded that malloc in C++17 (64) and C++20 (64, GCC 11) is not a direct HeapAlloc call. Replacing malloc with HeapAlloc solved it, so I assumed a bad library implementation caused the slow that case.
98% of the time in the code above is spent in alloc, so HeapAlloc causes the slow in this case.
The above code is only slow in C++20 (64, GCC 13). However, the sequence of allocations/deallocations seems to be the same in C++17 and C++20 (64, GCC 11).
If I remove the printf line or the cout line, it runs fast again. printf takes no time. The sequence of allocations/deallocations is not affected by printf.

I have no clue about anything now.

could it be the windows heap ?

https://stackoverflow.com/questions/36762516/why-does-repeated-memory-allocation-in-windows-slow-down

Benq

10 months ago, hide # ^ |

This is quite late, but was this resolved? I tried submitting in the same language as you did (C++20 (GCC 13-64)) and it passed: 330539517

+20

No, it wasn't. TL with GCC 13-64, OK with GCC 14-64

FFTotoro

+12

It's very sad that

would it be possible to keep at least one compiler for c++20?

_Kee

Yeah -- I have a good chunk of boilerplate code that depends on C++20, and it's a pain to backport them to C++17.

I would like to ask the admins to announce the (un)availability of C++20 before a contest starts, so we can decide (not) to participate.

NetSpeed1

+35

I hope to find the cause of the relevant problem and fix it as soon as possible.
GNU C++20(64) is faster than C++14/17(32) when using variable type long long, and we connot use variable type __int128.
Some problem is hard to be solved without GNU C++20(64). :(

Misuki

Just noticed c++20 is back, does it means the bug is fixed?

No. See ToxicPie9's test code:

It takes about 9 seconds on the CF custom test, while testing this code locally using UCRT mingw64 gcc13.2, it takes no time, showing that this version is still the buggy MSVCRT one.

Just tested it after Mike migrate the server to newer version, and it runs about 100ms. Seems this bug is gone :)

CaveNightingale

So why not fix this by just migrating to UCRT? :(

aryanc403

MikeMirzayanov I reported (cf blog) This simple program does not compile on my local setup bug to GCC ~8 months ago. It's not fixed it. Open bug link on bugzilla

Today's GCC version upgrade has introduced this bug to codeforces. Sample compilation error submission — 252744779

pajenegod's blog