It's almost 2025, why are you still choosing GCC 7 (32-bit)?

#	User	Rating
1	tourist	3985
2	jiangly	3814
3	jqdai0815	3682
4	Benq	3529
5	orzdevinwang	3526
6	ksun48	3517
7	Radewoosh	3410
8	hos.lyric	3399
9	ecnerwala	3392
9	Um_nik	3392

#	User	Contrib.
1	cry	169
2	maomao90	162
2	Um_nik	162
4	atcoder_official	161
5	djm03178	158
6	-is-this-fft-	157
7	adamant	155
8	awoo	154
8	Dominater069	154
10	nor	150

This is my answer to https://mirror.codeforces.com/blog/entry/137484, in which -is-this-fft- asked a bizarre question. This answer is too big for a comment so I made a blog post instead.

The question

The submission 297353626 behaves normally with 180 MiB memory usage while 297352244 got a runtime error due to stack overflow (with 256 MiB stack). Both submissions have identical code, don't seem to contain bugs, and an assertion exists to make sure the recursion is at most $$$2\times10^5$$$ layers deep. What's happening?

TL;DR

Notice that the solution with stack overflow uses GCC 7 (32-bit) while the accepted ones use GCC 13 (64-bit). This is not a coincidence.

The main reason of unreasonable stack usage boils down to two points: GCC 7, and 32-bit.

32-bit

First of all, the code is compiled and run in 32-bit mode. Compiling the submission with GCC 7.5 in 32-bit on godbolt shows that the long long solve(long long, long long) function has a gargantuan stack frame, seen from its prelude:

_Z5solvexx:
    push    ebp
    push    edi
    push    esi
    push    ebx
    sub     esp, 1948

This means that each call to solve uses $$$8\times2+4+4\times4+1948=1984$$$ bytes of stack space (2 8-byte arguments + 4-byte return address + 4 4-byte registers + 1948 bytes for stack variables). In a $$$2\times10^5$$$-layer recursion, the stack usage would be about 378 MiB, which probably overflows the stack. However, solve looks like a very simple function. Why does it need that much memory?

In 32-bit x86, there are less registers to use than in 64-bit mode, and each register only stores 32 bits. In the submission, every integer variable is a 64-bit long long, including loop indices, answer values, etc. Since 64-bit arithmetic cannot be done directly, all those variables have to be stored somewhere on the stack, and when they are used, only half of a variable (32 bits) is pulled out and processed each time. If 64-bit intermediate values are required in the middle of a computation, they must be stored as well.

If we count all the variables, including temporary values used in statements, it can be seen that they use a good amount of space (at least a couple hundred bytes) but nowhere close to 1948 bytes. What else is happening?

cdecl my beloved

Another disastrous part comes from function calling. Let's take the function long long solve(long long, long long) as an example.

In 64-bit x86, the main calling conventions used in C are Microsoft x64 and System V. Since there are many registers, the first few (small) arguments and the return value can be put in registers for faster access. In both conventions, the 2 64-bit arguments are placed in 2 registers, and the 64-bit return value is stored in a register.

In 32-bit, the main calling convention in C is cdecl. The 16 bytes of function arguments are pushed onto the stack in reverse order, as well as an extra pointer. This extra pointer points to 8 bytes of stack-allocated space and is used to store the return value. This causes every solve call itself to use 24 more bytes of stack space.

The problem gets worse when functions get a bit bigger, for example:

pair<long long, long long> op_min(pair<long long, long long> p, pair<long long, long long> q): in 64-bit, a pair is stored in 2 registers for both arguments and the return value, so no extra stack is used; in 32-bit, they are all on the stack, taking up 48 bytes.
pair<long long, long long> Segtree::query(Segtree *this, ll ql, ll qr, ll u): left as exercise for the reader.

GCC

You might think compiler optimizations help by optimizing register usage, reusing stack space, etc., but in this case, GCC actually made the problem a lot worse.

Looking at the compiled code closely, you can see that the solve function has too much calls to Segtree::query. 28, to be exact. This is because of function inlining — to reduce the overhead of function calls, compilers sometimes replace a function call with the code of the function itself. This can be done recursively, for example if the query calls in the query function were to be replaced by itself twice, it results in a large function with 8 query calls.

With inlining, the short solve function gets expanded into a lot of code. Each function call and its surrounding code requires stack space to operate, and it seems that the older GCC 7 is less efficient at utilizing stack space for different parts. For example, it can be seen that the results of different query or op_min calls are all stored in variables in different places, but their values are not used together. While the optimization seems to work well for speed, it multiplies the stack usage problems from previous sections.

GCC 14 does the optimizing job a lot better. In 32-bit mode, a lot of code is inlined too, but stack variables are reused more efficiently, and it uses only 380 bytes instead of 1948. In 64-bit mode, even more stack space is saved, requiring only 88 bytes.

Conclusion

It's almost 2025, consider using something newer than GCC 7 (32-bit) on Codeforces. Also to the clown developers/end users who keep shipping/running 32-bit executables on Windows for "compatibility issues" or other reasons: please stop.

Comments (3)

Write comment?

conqueror_of_tourist

7 hours ago, # |

+57

Literally 1984.

→ Reply

Thunder8971

6 hours ago, # |

It's a very smart response. Where could I learn more about that topic?

bthero

5 hours ago, # |

+14

I also used GCC 7 (32-bit) on the contest yesterday, but my submission passed without any issues: 297315717. Arguably, my code is even worse because I am creating and returning a new std::array of size $$$60$$$ on each recursion call. So my first thought was "did I just get lucky with the stack?"

After a bit of experimenting, here is what I can share:

Adding __attribute__ ((noinline)) to the Segtree.query() function indeed resolves the stack issue: 297387859. So inlining is playing a big role here. But how come it didn't affect my solution?
The crucial difference between the code of -is-this-fft- and mine is that my segment tree is just hand-written without any structs. And the culprit here is the op() function that is passed as a template argument. When you replace it with a plain min() call, it also gets accepted: 297388251.

The takeaway? I honestly don't know. But I will definitely stick to GCC 14 (64 bit) from now on.