You won't believe how this simple trick defeated the unexplained bug destroying every top LGM

#	User	Rating
1	Benq	3792
2	VivaciousAubergine	3647
3	Kevin114514	3611
4	jiangly	3583
5	strapple	3515
6	tourist	3470
7	dXqwq	3436
8	Radewoosh	3415
9	Otomachi_Una	3413
10	Um_nik	3376

#	User	Contrib.
1	Qingyu	161
2	adamant	150
3	Um_nik	146
4	Dominater069	144
5	errorgorn	141
6	cry	139
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
9	TheScrasse	134

TL;DR

Currently people are discussing a slowdown bug on Codeforces that seems to happen randomly, and can cause code to run 100x slower and get TLE. More details in pajenegod's blog post.

In this article, I present a mitigation: add the following to your code.

#include <windows.h>
void *operator new(size_t size) {
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}
void operator delete(void *ptr) {
    HeapFree(GetProcessHeap(), 0, ptr);
}

If you use malloc/free in C++ (you shouldn't), also change them.

void *my_malloc(size_t size) {
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}
void my_free(void *ptr) {
    HeapFree(GetProcessHeap(), 0, ptr);
}

If you use aligned new/delete in C++, also change them.

void *operator new(size_t size, align_val_t al) {
    auto align = (uintptr_t)al;
    void *mem = ::operator new(size + align + sizeof(void *));
    void **ptr = (void **)(((uintptr_t)mem + align + sizeof(void *)) & ~(align - 1));
    ptr[-1] = mem;
    return ptr;
}
void operator delete(void *ptr, align_val_t al) noexcept {
    ::operator delete(((void **)ptr)[-1]);
}

DISCLAIMER

TL;DR, part 2

This section is for Codeforces admins including MikeMirzayanov.

I have very high confidence that some poorly implemented functions in Microsoft's C runtime is the root cause of the bug which is outside contestants' control, creating unfair disadvantages in contests. The bug has likely existed since C++17 (64-bit) first came out on Codeforces.

Every C++ submission on Codeforces (and also the entire Codeforces infrastructure) currently depends on that library, on top of the internals of Windows that very few people has any clue about. Both of these are infamous for having bad implementations in many places, and I don't think it's a good idea to rely on those for one of the online judges with the most users.

I ask Codeforces developers to seriously consider other, more reliable possibilities, for example an option for judgment on Linux machines, possibly using one of the plenty judge server implementations already available. An increased stability and robustness would hopefully make contest and judging experience on Codeforces a lot better.

What's happening?

There has been a bug on Codeforces causing random TLEs recently. Sometimes when a seemingly unrelated line of code is added, or the order of two lines are swapped, or a vector size is changed, etc., the code suddenly becomes 100x slower on certain inputs. The bug got some attention because of its unexplained behavior, and the fact that many submissions (include tourist's) that should have passed a problem got TLE or were hacked because of this. An example showing the bug (by kostia244):

#include <bits/stdc++.h>

using namespace std;

int main() {
    vector<vector<int>> TLE(40000, vector<int>(7));

    string s;
    for (int i = 0; i < 17; i++) s += to_string(i) + " ";

    for (int j = 0; j < 10000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}

A loop of about 10000 * 17 iterations takes 1.5s to complete. The weird and funny thing about the bug is that if 40000 or 7 is changed slightly, or if the vector is moved to the middle, then nothing happens and the code runs normally. This peculiar behavior indicates something is buggy and TLE is not caused by the code's authors. It only happens on Codeforces, and only with 64-bit C++. No one was able to reproduce it anywhere else or explain why.

After pajenegod also posted a blog on the situation, people in the comments (drdilyor, bkbtpout, et al.) quickly discovered that it's only the (de-)allocations that matter. Here's my attempt at reducing the code to only new and delete:

#include <bits/stdc++.h>
using namespace std;

void *alloc(size_t s) {
    auto c = new char[s];
    asm volatile("" ::"r"(c));
    return c;
}

int main() {
    auto c = alloc(0x1c);
    for (int i = 0; i < 40000; i++) {
        alloc(0x1c);
    }
    delete[] c;
    for (int i = 0; i < 100000; i++) {
        delete[] alloc(0x1c);
    }
}

This consistently takes around 900ms in both C++17 (64) and C++20 (64).

An overview of what I investigated in the past few hours

The vector<int>(7) in one case immediately caught my attention, because it makes an allocation of 28 bytes. 28 is special because it's $$$32 - 4$$$, where $$$32$$$ is a common chunk size of various memory allocation algorithms, and $$$4$$$ is the size of a pointer used to store chunks (if you didn't know, on Codeforces malloc uses 32-bit libraries and returns 32-bit pointers, even with 64-bit C++. you are scammed). This combined with the fact that changing the 7 makes the bug disappear, hints that something is wrong with allocations.

By default new and delete with GCC simply call the malloc and free functions from the C standard library. I tried adding some time measurements to see if malloc is really at fault. And surely enough, it is.

Code

#include <bits/stdc++.h>

using namespace std;
using namespace std::chrono;

nanoseconds alloc_time, dealloc_time;

void *alloc(size_t s) {
    auto t = high_resolution_clock::now();
    auto p = (void *)malloc(s);
    alloc_time += high_resolution_clock::now() - t;
    asm volatile("" ::"r"(p));
    return p;
}

void dealloc(void *p) {
    auto t = high_resolution_clock::now();
    free(p);
    dealloc_time += high_resolution_clock::now() - t;
}

int main() {
    auto c = alloc(0x1c);
    for (int i = 0; i < 40000; i++) {
        alloc(0x1c);
    }
    dealloc(c);
    for (int i = 0; i < 100000; i++) {
        dealloc(alloc(0x1c));
    }
    printf("alloc: %lld ms\n", round<milliseconds>(alloc_time).count());
    printf("dealloc: %lld ms\n", round<milliseconds>(dealloc_time).count());
}

Result in custom test

alloc: 899 ms
dealloc: 0 ms

=====
Used: 904 ms, 1904 KB

Knowing that it's not caused by the C++ STL themselves, but the allocation functions called when creating them, debugging became a little easier.

With my superficial knowledge of heaps, I can quickly guess what is going on. The 40000 mallocs eats the top chunk to a specific size, then the first free sends a chunk into the tcache, or the 0x20 fastbin if that's not a thing. Then the next sequence of malloc/frees keep taking and returning the same chunk, and this combined with the 40000 mallocs somehow breaks the allocation algorithm, making it take a lot more time, is what happened, right? Oh, wait. uh. no...

...wait. WAIT. I forgor 💀

The above is what might happen in GNU's C library. Codeforces runs on Windows, using Microsoft's C library. What does that do?

Codeforces's C++20 (64) uses WinLibs, which uses either MSVCRT (Microsoft Visual C++ Runtime) or UCRT (Universal C Runtime) as the C standard library. We don't know what exact setup Codeforces has, but we can first investigate what each library does. Fortunately, the source code of both are available.

Let's look at UCRT first. In non-debug mode, malloc simply calls _malloc_base, which is just a wrapper to the Windows API HeapAlloc. Similarly free calls HeapFree. These memory management functions are provided by the operating system, could they be the source of bugs? We can test this by replacing malloc and free calls by the API calls directly.

Test code

#include <bits/stdc++.h>
#include <windows.h>

using namespace std;

void *operator new(size_t size) {
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}

void operator delete(void *ptr) {
    HeapFree(GetProcessHeap(), 0, ptr);
}

void *alloc(size_t s) {
    auto c = new char[s];
    asm volatile("" ::"r"(c));
    return c;
}

int main() {
    auto c = alloc(0x1c);
    for (int i = 0; i < 40000; i++) {
        alloc(0x1c);
    }
    delete[] c;
    for (int i = 0; i < 100000; i++) {
        delete[] alloc(0x1c);
    }
}

This runs normally, which means that UCRT is unlikely the reason of the bug.

Then we look at MSVCRT's source code. It also uses Windows API functions, except when it doesn't. Depending on the system version, there are 2 more implementation of memory allocation algorithms, V6_HeapAlloc and V5_HeapAlloc, which seem to allocate memory by looking through a linked list for free chunks. I did not have enough time or energy to read the implementation, but it is possible that the function has bad performance in certain cases.

Then I found this interesting article, which shows that MSVCRT's malloc might really be some terrible implementation, and Codeforces users are not its first victims.

If we can know for sure which C library C++20 (winlibs) and C++17 (msys2) use, we can basically pin the possibility down to one bad function in a library. However, it's difficult to get the exact setup on Codeforces even with custom test. If only there was a way to download compiled binaries from Codeforces... wait, there is. On Polygon, full problem packages include compiled versions of checkers, validators, etc. I can simply add a generator to a random problem and download the package to obtain an executable file (although only C++17). Fortunately, it has some debug info, and is statically linked.

After briefly inspecting the 3MB, impossible-to-reverse-engineer PE32+ executable, I can confirm that it indeed uses MSVCRT's malloc. Now we can basically be sure that the cause of the weirdest slowdown bugs in Codeforces history is the poor implementation of the C runtime.

A possible fix?

Now we know the exact issue that caused the mysterious slowdowns: Microsoft's astronomically disastrous dumpster fire C library. How to fix this? It's easy, just replace it with memory management libraries that are known to work well, for example jemalloc, mimalloc (ironically made by MS themselves to replace their library functions they themselves know are terrible), or ptmalloc. One could also replace the entire library with, say, UCRT or GNU libc.

But as regular Codeforces users we can't control what libraries we use; only Codeforces is able to change that. However, we can rewrite our code to not use the C library to allocate at all. It's possible to do it by calling raw Windows API functions, as shown above. The following is copy-pasted from the TL;DR section:

#include <windows.h>
void *operator new(size_t size) {
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}
void operator delete(void *ptr) {
    HeapFree(GetProcessHeap(), 0, ptr);
}

If you use malloc/free in C++ (you shouldn't), also change them.

void *my_malloc(size_t size) {
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}
void my_free(void *ptr) {
    HeapFree(GetProcessHeap(), 0, ptr);
}

If you use aligned new/delete in C++, also change them.

void *operator new(size_t size, align_val_t al) {
    auto align = (uintptr_t)al;
    void *mem = ::operator new(size + align + sizeof(void *));
    void **ptr = (void **)(((uintptr_t)mem + align + sizeof(void *)) & ~(align - 1));
    ptr[-1] = mem;
    return ptr;
}
void operator delete(void *ptr, align_val_t al) noexcept {
    ::operator delete(((void **)ptr)[-1]);
}

DISCLAIMER

Does this fix the slowdown bug? No one will likely ever know, because unlike Microsoft's C libraries or STL, the lack of source code (and the sheer complexity) makes the Windows API impossible to understand. But at least we now know the bug is probably caused by library functions and can pray that skipping the library solves it.

In a way you could say that the bug is not Codeforces's fault... but it's not really not their fault either, as there is a very simple solution (that would also solve countless other issues reported by users before): run GNU/Linux, like literally every other online judge of considerable scale.

Update: About hacks

After the bug was discovered, people are hacking submissions by using special inputs that cause a certain sequence of allocations to happen. Since this is a problem of the library and not of contestants' codes, I don't think it's fair doing that.

A similar case is hacking bad std::unordered_map, compared to, e.g., hacking bad rolling hashes. The latter is a problem with contestant implementations, and I enjoy hacking them. But the former is just bad implementation from C++ STL, and isn't really the contestants' fault, so I don't usually hack them. This problem is more well-known though, and has many simple fixes available, so it's probably OK to hack them.

Before this slowdown bug and its fixes are well-known, I think hacking using the bug is equivalent to exploiting n-day vulnerabilities on Codeforces to manipulate verdicts, and I'm not very supportive of that. Let's hope a patch is deployed soon.

Afterwords

I'm not touching Windows or PE for the next 6 months. Leave a comment if you play CTFs and understand my feeling.

Subscribe for more posts like this where i solve CP bugaboos with the most absurd techniques

Comments (18)

Show archived | Write comment?

newminh

2 years ago, hide # |

+43

How so orz sir how can I be as good at low level stuffs as you

→ Reply

oToToT

2 years ago, hide # ^ |

+41

play ctf and read solutions.

Abhimanyu_Sharma

+13

Should we use C++17 until it is fixed?

shiny_shine

Yes, in fact currently Codeforces disabled GNU C++20(64) language option.

L_Wave

-21

Also I once heard that the gp_hash_table in pbds contains bugs. Not sure if it's true or not, I decided not to use it any more.

beaten_by_ai

← Rev. 2 →

What alternative do you use for it (asking because I also use it sometimes to squeeze in unintended sols, mostly failing, but sometimes it is useful)?

cc_hash_table, of course.

pajenegod

The only "bug" I know of is that doing swap(a,b) on two gp_hash_tables is O(n), while it is O(1) using unordered_map. As seen in this comment.

gp_hash_table is also slightly easier to anti-hash hack than unordered_map since gp_hash_table mods the hash by some power of 2. So if someone is using gp_hash_table without a custom hash, then they can be hacked by simply being given multiples of some sufficiently large power of two.

Igor_Parfenov

I just checked. Remind, that I have MinGW version 11.2.0 from here and I didn't encounter slow execution. A comment in output asm file says, that it uses UCRT.

But I have interesting question not related to this topic.

In asm code the malloc is replaced with mysterious _Z5allocy (with functions _Znay and _ZdlPv), which is declared through .globl. I coundn't find any information about _Z5allocy neither in Google, nor in MinGW repo, nor in UCRT repo, nor in my local mingw64 directory. How it can be possible?

Input C

Output Asm

    .file   "main.cpp"
    .text
    .p2align 4
    .def    __tcf_1;    .scl    3;  .type   32; .endef
    .seh_proc   __tcf_1
__tcf_1:
.LFB13889:
    .seh_endprologue
    leaq    _ZStL8__ioinit(%rip), %rcx
    jmp _ZNSt8ios_base4InitD1Ev
    .seh_endproc
    .p2align 4
    .globl  _Z5allocy
    .def    _Z5allocy;  .scl    2;  .type   32; .endef
    .seh_proc   _Z5allocy
_Z5allocy:
.LFB12013:
    subq    $40, %rsp
    .seh_stackalloc 40
    .seh_endprologue
    call    _Znay
    addq    $40, %rsp
    ret
    .seh_endproc
    .def    __main; .scl    2;  .type   32; .endef
    .section    .text.startup,"x"
    .p2align 4
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
.LFB12014:
    pushq   %r12
    .seh_pushreg    %r12
    pushq   %rbx
    .seh_pushreg    %rbx
    subq    $40, %rsp
    .seh_stackalloc 40
    .seh_endprologue
    call    __main
    movl    $28, %ecx
    call    _Znay
    movq    %rax, %r12
    movl    $40000, %ebx
    .p2align 4,,10
    .p2align 3
.L5:
    movl    $28, %ecx
    call    _Znay
    subl    $1, %ebx
    jne .L5
    movq    %r12, %rcx
    movl    $100000, %ebx
    call    _ZdlPv
    .p2align 4,,10
    .p2align 3
.L6:
    movl    $28, %ecx
    call    _Znay
    movq    %rax, %rcx
    call    _ZdlPv
    subl    $1, %ebx
    jne .L6
    xorl    %eax, %eax
    addq    $40, %rsp
    popq    %rbx
    popq    %r12
    ret
    .seh_endproc
    .p2align 4
    .def    _GLOBAL__sub_I__Z5allocy;   .scl    3;  .type   32; .endef
    .seh_proc   _GLOBAL__sub_I__Z5allocy
_GLOBAL__sub_I__Z5allocy:
.LFB13912:
    subq    $40, %rsp
    .seh_stackalloc 40
    .seh_endprologue
    leaq    _ZStL8__ioinit(%rip), %rcx
    call    _ZNSt8ios_base4InitC1Ev
    leaq    __tcf_1(%rip), %rcx
    addq    $40, %rsp
    jmp atexit
    .seh_endproc
    .section    .ctors,"w"
    .align 8
    .quad   _GLOBAL__sub_I__Z5allocy
.lcomm _ZStL8__ioinit,1,1
    .ident  "GCC: (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders) 11.2.0"
    .def    _ZNSt8ios_base4InitD1Ev;    .scl    2;  .type   32; .endef
    .def    _Znay;  .scl    2;  .type   32; .endef
    .def    _ZdlPv; .scl    2;  .type   32; .endef
    .def    _ZNSt8ios_base4InitC1Ev;    .scl    2;  .type   32; .endef
    .def    atexit; .scl    2;  .type   32; .endef

_Z5allocy is just alloc(unsigned long long) (a.k.a alloc(size_t)).
These are mangled function name. To demangle these kind of functions, you can try execute c++filt _Z5allocy on your computer (c++filt is part of the binutils project)

Thanks.

I continued a little and used objdump on compiled to .exe program. There is _Znay function, which simply calls __imp__Znay function.

There is also _imp_Znay label, but it doesn't look like a function: all __imp__* labels have size 8 bytes with second byte equal to a5 and bytes 3-8 equal to 00. I guess, this is where the pointer to dynamically linked function is substituted, and first byte is something like id.

ToxicPie9

I found a simple way to determine whether HeapAlloc/HeapFree are used. If they are, then since malloc and HeapAlloc are basically equivalent, the following code would run normally:

#include <stdlib.h>
#include <windows.h>

int main() {
    HeapFree(GetProcessHeap(), 0, malloc(69));
}

Using Codeforces custom test:

In C++17 (64) and C++20 (64), it returns with a STATUS_HEAP_CORRUPTION error, indicating that malloc's result did not come from HeapAlloc directly and was possibly handled in another library function.
In C11 (32) and C++17 (32), it returns with a STATUS_ACCESS_VIOLATION error (sometimes a larger program is needed to trigger), indicating something similar. A different behavior can mean that they use a different implementation from 64-bit C++. Since they don't seem to be affected by the slowdown bug, this might be the case.
It works normally in Visual C++ 2017 (larger programs with only delete replaced also works), which means that its implementation calls HeapAlloc internally.

defnotmee

+29

Great blog mister, things make a lot more sense now.

Yet another L for closed source, it is crazy how bad they can fumble such a low level API call and how that can go unnoticed. Hope mike actually does something this time now that we have much better evidence that there is a problem.

Bit_Slayer

How can I attain proficiency in fundamental concepts like you ?

bweah

+20

Kindly elucidate, esteemed sir, upon the strategies and techniques through which I may aspire to achieve a level of competence akin to yours in the realm of tasks that are of a more rudimentary nature.