Hi everyone!
As you may know, it is often recommended to put these lines in your header:
#pragma GCC optimize("Ofast,unroll-loops")
#pragma GCC target("avx2")
And they're usually great![1] Well, until they aren't. For the last 2-3 days I was riddled with an issue: the CP-Algorithms Library would perform very well on Library Checker, but whenever I tried to also use it on Codeforces, it would fail to speed up the code, and sometimes would even make it much slower. As I finally figured out the cause, I'd like to write a blog about it, so that other people don't waste as much time.
Mitigating these issues improved my running time on 1975G - Zimpha Fan Club from 1530 ms in 316602668 to 203 ms in 352602775.
Tl;dr
See the "Takeaway" section at the bottom.
Example problem
I will reproduce the issue on a simple example:
911G - Mass Change Queries: You are given an array of $$$n$$$ integers. Process $$$q$$$ queries, where each query is to go over $$$a_i$$$ for $$$l \leq i \leq r$$$ such that $$$a_i = x$$$ and change it to $$$a_i = y$$$.
A very simple solution to this problem is 352724261, which looks like this:
#pragma GCC optimize("Ofast")
...
while(q--) {
    int l, r, x, y;
    cin >> l >> r >> x >> y;
    uint8_t ux = x, uy = y;
    for(int i = l - 1; i < r; i++) {
        a[i] = a[i] == ux ? uy : a[i];
    }
}
...
It runs in 2093 ms, and if we inspect the generated assembly, we will see that the hot loop is vectorized with xmm registers, which are 128 bits wide.
Inlining failures with [[gnu::target("avx2")]]
Now, we can enable 256-bit wide ymm registers in 352724979, expecting a 2x speedup:
#pragma GCC optimize("Ofast")
#pragma GCC target("avx2")
...
Wait, we can't, we get a Compilation Error! If we look at the compilation log, we see:
C:/Programs/gcc14-64-msys2/mingw64/include/c++/14.2.0/bits/allocator.h:182:7: error: inlining failed in call to 'always_inline' 'constexpr std::allocator< <template-parameter-1-1> >::~allocator() noexcept [with _Tp = unsigned char]': target specific option mismatch
182 | ~allocator() _GLIBCXX_NOTHROW { }
| ^
It is a known bug in GCC 13 and 14. It has been fixed in GCC 15, but unfortunately not back-ported to earlier versions.
Alright, what can we do about it? One of the recommendations you may find in Codeforces comments is to try a special function attribute that enables wide registers for just our function. Let's try it in 352726371:
#pragma GCC optimize("Ofast")
...
[[gnu::target("avx2")]]
void solve() {
...
}
...
Wait, what the heck?! It gets Time Limit Exceeded! Does it mean avx2 is bad? Well, not quite.
There is another recommendation you may find in Codeforces blogs and comments, namely to apply the pragma after including bits/allocator.h, so that we circumvent the bug. Let's try it in 352726771:
#pragma GCC optimize("Ofast")
#include <bits/allocator.h>
#pragma GCC target("avx2")
#include <bits/stdc++.h>
...
And we now get Accepted in 1031 ms, meaning that it finally works!
Okay, it works, so everyone is happy now, right? Well, not quite. What was that about [[gnu::target("avx2")]]? Why would it make the program run slower than running it without any attribute? If we inspect the generated assembly, we will see that the reason for the slowdown is that, somehow, vector's operator [] is no longer inlined, preventing the hot loop from vectorizing!
Now, you may assume, like I did, that it is some peculiarity of [[gnu::target(...)]], so we shouldn't use it, but it's actually a peculiarity of Ofast! Indeed, let's "downgrade" it to O3 in 352727561:
#pragma GCC optimize("O3")
...
[[gnu::target("avx2")]]
void solve() {
    ...
}
And we get Accepted in 1140 ms! Looking further into it, we will see that #pragma GCC optimize("O3,fast-math") will also prevent solve from inlining vector's operator [], so the culprit is really -ffast-math, particularly when it is enabled via pragma, as this doesn't happen when enabling it via command-line arguments.
Inlining failures with lambdas
Reading the above, you may think "okay, well, I shouldn't use [[gnu::target("avx2")]] when using Ofast". Unfortunately, it's not the only thing that fails. Our motivational problem is not the best example of it, but assume that for some reason you would like to factor out the hot part of our loop in a lambda, like I do in 352729139:
...
auto update = [&](auto &i) {
    a[i] = a[i] == ux ? uy : a[i];
};
for(int i = l - 1; i < r; i++) {
    update(i);
}
...
We get Time Limit Exceeded again! Looking at the assembly, we see that now the issue is that the compiler fails to inline the update lambda. You may assume that it means you shouldn't use lambdas in hot loops, but there are still two ways to make it work.
First of all, we can, again, downgrade optimization type to O3 in 352729593:
#pragma GCC optimize("O3")
#include <bits/allocator.h>
#pragma GCC target("avx2")
#include <bits/stdc++.h>
...
And it gets Accepted in 1062 ms again! But if you really want to keep Ofast, there is another way.
We can apply __attribute__((always_inline)) to the lambda in 352730006:
...
auto update = [&](auto &i) __attribute__((always_inline)) {
    a[i] = a[i] == ux ? uy : a[i];
};
for(int i = l - 1; i < r; i++) {
    update(i);
}
...
And it will pass again in 1093 ms. It is important to note that you must use this specific syntax on the specific lambda that you want to inline. You can't use [[gnu::always_inline]], the modern attribute syntax, because this way it doesn't apply to lambdas. You also can't just add inline to your optimize pragma; it will not help.
Takeaway
Okay, let's summarize now.
What triggers the issue?
- Using Ofast or fast-math specifically via #pragma optimize rather than the command line, and
- Putting #pragma GCC target("avx2") after some other include, in particular bits/allocator.h.
What is affected?
- Inlining of standard functions in any function marked by [[gnu::target(...)]], and
- Inlining of any lambdas.
How to mitigate?
- Use O3 instead of Ofast, or
- Put #pragma GCC target("avx2") before any includes (needs a custom allocator to use standard containers, e.g. this), or
- Don't use the [[gnu::target(...)]] attribute, and
- Mark lambdas you want to be inlined with __attribute__((always_inline)).
What am I missing out on when using O3 instead of Ofast?
In G++, -Ofast is equivalent to the combination of the following:
-O3 -ffast-math -fallow-store-data-races -fno-semantic-interposition
Of these flags, -fno-semantic-interposition is only relevant for shared libraries, and -fallow-store-data-races should theoretically optimize some code in ways that are unsafe in a multi-threaded environment. In particular, it should try to make some control-flow expressions branchless, but I could not reproduce this behavior at all, so its practical value is dubious.
In other words, for competitive programming purposes, turning -Ofast into -O3 mostly means losing -ffast-math. Now, what fast-math does is allow the compiler to treat floating-point numbers as mathematical real numbers, making assumptions that are true mathematically but may not hold for the floating-point type. This includes assuming associativity of floating-point expressions, assuming that nan and inf do not exist, etc.
In other words, it may slightly help if your program is really heavy on floating-point computation (depending on how you write it), but it is pretty much useless otherwise, and may be detrimental in pretty unexpected ways, even beyond things related to floating-point computation.
Because of this, I will try to use O3 instead of Ofast in my own code. Whether you want to do the same is up to you, but at least you should keep the above in mind, and be aware of the things you need to mitigate if you stick to Ofast.
How does the CP-Algorithms Library handle this?
I decided to remove all [[gnu::target("avx2")]] from the code, and instead now wrap sensitive files in
#pragma GCC push_options
#pragma GCC target("avx2")
...
#pragma GCC pop_options
Additionally, I explicitly marked lambdas used in some hot loops with __attribute__((always_inline)) to ensure they are unaffected. I also try to use a custom allocator whenever I need to work with a standard container.
This way, the library should largely benefit from avx2, regardless of whether target("avx2") or even optimize("Ofast") is added on top of it, or whether the code is compiled with -march=native or -mavx2. It should also not prevent compilation with a global avx2 target due to the bug in bits/allocator.h, as it now uses a custom allocator.
References
- nor. “GCC Optimization Pragmas”, 26 October 2021.
Acknowledgements
Thanks a lot to purplesyringa for pointing out to me that the culprit is actually fast-math, and that the same optimization flag was the reason for a similar issue in the past.