Hello Codeforces!
I am in the process of improving and updating the judging machines. I've read the post https://mirror.codeforces.com/blog/entry/94587 and I think it might be a good idea to use a compilation line like `-O3 -funroll-loops -march=native -mtune=native`. I haven't done any research confirming that it is definitely better than `-O2` and best in the general case for CP solutions. On the one hand, this will only widen the gap from Python/PyPy/Java; on the other hand, you can set all of this up with pragmas anyway. What do you think? What are your suggestions for the command line?
P.S. You got it right. Yes, gcc11 `std=c++20`, pypy3 64-bit, and more are coming.
yeah c++20 will be great
I think `-O2` is enough in CP, regardless of whether there are better compilation options. I even think we should not allow `#pragma`s. Too much compiler optimization may let a bad solution get Accepted, which will make CP boring.

How do you disable pragmas when the user has them in code? (Genuine question, I don't know anything about pragmas other than that they are some dark magic used to speed up code.)
I am not quite sure about the details myself, but I remember hearing somewhere that other OJs are able to "stop" pragmas. Correct me if I'm wrong, but USACO judges do this. I also know that the TeamsCode OJ used for the recent TeamsCode round also stopped the effect of pragmas. Nonetheless, I do not know this topic well enough, so I might be wrong.
Where do we need to write this?
Mike can write this somewhere inside Codeforces code
Agreed ...
For different optimizations, you can refer to this.
I don't think this works. One can use `_Pragma("GCC optimize(3)")` instead. Or you can use `??=` as a synonym for `#` and write `??=pragma`.
Then the contestants will just use function attributes, which have exactly the same effect as pragmas, but for individual functions:
Here's an example: 128071115. The fast_function define can also be obfuscated a little bit to avoid trivial regexp-based detection. There may be other loopholes.
Disabling AVX instructions support on the judge machines may be a useful equalizer option (AVX is the biggest contributor when the pragmas help). Some people on the Internet are saying that disabling AVX is possible in Windows: https://newbedev.com/how-can-i-check-whether-intel-s-avx-is-enabled-on-my-computer
Actually, we can disable `#pragma`s at the compiler level, like this, but I think it's not necessary.

I think that -O2 is a fine standard. For example, in the Belarusian OI we have -O2 as the standard.
One can use pragmas to get -O3, -funroll-loops and other features.
what you need to do is
could you please explain what each of them does?
no-one cares
fine
Deleted.
The optimization is really good news for C++ programmers.
But I have seen that it can sometimes be a pessimization instead.
So I suggest that users should be able to decide whether to use it themselves.
Ahhh!!! c++20 noice noice
Xellos claims that O3 can sometimes be slower than O2 in https://mirror.codeforces.com/blog/entry/66279. That thread is also a great resource for pragmas.
I think rather than asking people, you could do this more scientifically by rerunning existing submissions under both the old and new flags. It will be interesting to see the stats on how many ACs become MLE or WA (due to some UB being exposed?), which types of problems benefit the most from O3/unroll, and which ones it actually slows down. It would make an interesting blog post if you get those stats!
[Deleted]
It's the constraints that are wrong, not the solution. The one to blame is the author.

Yeah, you're kind of right. We are given a problem, and we have to implement a solution which works correctly and fits into the time (and memory) limit. But what do the contest authors actually want?
The correctness of the algorithm is checked using test cases. The time and memory limits check how good the solution is. Very often, the quality is just the complexity. The editorials show us the intended complexity. They show neither the intended language nor the intended pragmas.
Apparently, the main thing is not to allow too slow (i.e. incorrect) solutions to pass. That is why time limits can be tight. Then it can be difficult to get a solution with the correct complexity accepted, particularly if it has a big constant factor (say, it is not written in C++). But tight limits are discouraged on Codeforces.
You are offered a list of languages you can write a solution in. I believe that they are alternatives: we can pick any of them, translate our correct idea, and get accepted. I believe that choosing C++ shouldn't be a must (in a perfect world). The correct idea must be accepted whether it is written in C++ or not. But the languages differ in speed. The way to check only the asymptotics would be to set enormous limits, say, $$$n \le 10^9$$$ for an $$$O(n\log{n})$$$ algorithm and about 300 seconds of time limit. But that is impossible.
Consider an example. Problemsetters decide that the solution must work in $$$O(n\log^2{n})$$$ and that $$$O(n^2/w)$$$ should not pass. They implement the best $$$O(n^2/w)$$$ solution and a common $$$O(n\log^2{n})$$$ solution, and set the time limit somewhere between their running times. The larger the gap, the more other languages and non-optimal solutions are allowed. Forbidding pragmas increases the size of the gap. It simplifies the problemsetters' work and increases the probability of getting a good problem. Yes, it's their fault, but do you want to help them prepare high-quality problems or not?
P.S. Yes, I have used pragmas three or four times, and every time the program worked slower. So you can read my comment as "I do not know where (and how) I should use pragmas. Please forbid all the contestants from using them!" And to add, putting some random lines at the beginning of the file to make the program faster hardly correlates in my mind with thinking about the best solution for the problem.
[Deleted]
Looking forward to the release of the new standard of C++.
Wow, thanks! That is a very useful and meaningful idea.
pypy3 64 bit is coming!! I couldn't be happier :D
Psyched for C++ 20 :D
Sometimes (for example, when there are no auto-vectorized loops), pragmas can be useless and even slow down the program (I have had some examples of this). Maybe there are some that never slow things down?
I think the optimization of compiler would be great, but on the other hand, this would give less pleasure in optimizing the code itself. By the way, thanks for the update!
Probably the wrong place to ask this, but I don't know where else to ask, so sorry for it :)
Are there any upcoming lectures for the EDU section?
O2 is enough. A Chinese OJ called POJ has not even turned on O2 optimization, and it only supports the C++98 standard QAQ
POJ might be the single worst online judge that exists today... The website seems like it is stuck in the 1990s.
What about `c++2b`? That would be even better.

As far as I know, `c++20` itself isn't fully supported in any compiler as of yet. It'll be a long way to get it to work on CF anyway, so `c++2b` doesn't make sense. If your comment was sarcasm, well...
It wasn't sarcastic in the slightest.
C++20 is not fully supported, right. C++23 neither. That does not mean we can't use features that are supported, however, does it?
Also, I'm not sure why making C++20 work would be difficult. It only requires updating the compiler, as far as I am aware.
Ah I see. The rest of the comment will reference this.
My point about C++20 was that if compilers don't fully support all features, it might be a great surprise for some people if some features won't work on CF (but work locally). For instance, this post: https://mirror.codeforces.com/blog/entry/94457, where even C++17 features don't work, since the version of GCC is 9.2.0, while most people on Ubuntu use 9.3.0 (and people on rolling-release distros use much more recent versions, like 11.1.0). However, the majority of "good" C++20 features have already been implemented by GCC, so it might be less of an issue for C++20. Some good features (like constexpr std::string and std::vector) still haven't been implemented on GCC though, so I would still prefer waiting for them to be implemented.
The problem would be worse for C++23 (since most C++23 features in the STL aren't implemented in compilers anyway), so people who have updated compilers might face unexpected issues in due course of time as the compiler gets outdated.
As far as updating the compiler is concerned, I suppose it takes some non-trivial amount of effort which might lead to maintenance delays (even for half an hour, that's quite a bit of a delay). I don't think CF would want to follow a rolling-release pattern and update compilers as and when more features are implemented, which would keep us stuck with compilers that don't fully implement the standard (as in the case of 9.2.0). I hope that clears up the reasons behind my concerns.
I think a lot of people will be happy to use set.contains(x) though. (And no, I don't know why it took this long to add a .contains method.)
GCC is known to support newer language features even when the selected standard is quite old. Structured bindings, for instance, are supported even with the `-std=c++11` option: GCC will still compile them and only raise a warning. So why don't we do a similar thing here? Make a compiler called 'GCC (C++20, partial C++2b support)', or maybe even 'GCC (C++20)' if you think minor lies are allowable, that uses the `-std=c++2b` CLI option. In this case, if you use C++20 features only, you're in the clear. And if you use newer features, you'll either get an error because the feature is not implemented yet, or it will compile correctly and you won't lose precious seconds rewriting the code. It's a win-win situation. Participants who are aware of this little trick can benefit from it. Participants who aren't will get the same results as before, except they get luckier, so to say, in pushing the boundaries if they accidentally use features that are too new.
Do you think the only problem is the updating difficulty, and that if it was automated, it'd work out?
Sure, that can be done I guess (not sure about all the technical details about deploying these updates on CF though).
It's just a matter of personal taste to wait for a more polished implementation of C++20 (or maybe even C++23). However, if there is a guarantee that compiler updates can happen frequently enough without breaking stuff very frequently, I would go with the newer compilers too, since that mimics my own update cycle as well. Note that there tend to be bugs in new releases of compilers too, so that might be a minor inconvenience here and there.
pypy3 64-bit ,now python users don't have to switch in python and pypy. Now we are going to use pypy3 always.
I think this is good, because it will reduce the impact of factors other than the program's complexity on the judging results. :)
I think this option will be good as a separate compiler option when you submit a task.
Only if `-O3` is necessary to beat Rust-ers.

I understand that it was a joke, but I still want to answer :)
From my experience, the Rust standard library is incredibly slow even in comparison with the `-O2` version of the STL. At least `BTreeSet`, which seems to be the counterpart to `std::set` in Rust. Would be happy to hear from Rust experts that I am wrong.
Let's insert 5000000 32-bit integers into a sorted set and compute some polynomial hash.
Rust 128344948 submission runs in 1637 ms and uses 49200 KB of memory.
GNU C++17 (64) 128344998 submission runs in 5256 ms and uses 239500 KB of memory.
Why this happens, and why `BTreeSet` is slower than a red-black tree for small sets, you can read in the manual.

According to the following blog, Speed up Code executions with help of Pragma in C/C++, the optimization flag `-Ofast` is a superset of `-O3`, and its performance is slightly better than `-O3`.

This blog is misleading about the -Ofast option and doesn't mention the downsides. This is what the official documentation says: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
If people don't mind deviating from strict standards compliance, then they can always use pragmas to enforce -Ofast for their code. But everyone else would be rightfully upset if their valid code doesn't behave as expected, especially if this results in a WA verdict.
Thanks for sharing the gcc link. It seems that I was lucky enough not to get a WA verdict when I tried this pragma optimization flag.
This is a great idea except for one thing. I am strongly against `-O2` and `-O3` keys in the command line. (Also, sometimes it is not fair to people not using C++.) I have an example:
test.cpp:
Then run `g++ test.cpp -O1 -o test && ./test` in a command shell. I got such a result:
But if I run `g++ test.cpp -O2 -o test && ./test` or `g++ test.cpp -O3 -o test && ./test`, I will get a program that goes into an infinite loop. A part of its output:
So, I think `-O1` would be better for the command line.
Possible explanation (I'm not an expert):
Unsigned comparison is a bit faster than signed, because signed comparison is not just a "stupid" comparison: the processor needs to look at the first bit, then choose how it will compare the numbers (it depends on the first bit), etc. The compiler does not consider the context that the value can be negative and just uses unsigned comparison to improve speed.
Overflow of signed integers is undefined behaviour in C++. That's why your program gets stuck at -O2/-O3.
Your code is just incorrect. Add the `-fsanitize=undefined` option to the gcc command line when compiling, run your program, and it will tell you what's wrong. Yes, C++ is a very difficult programming language, and it's often criticized for that. If you want an infinite loop and no unexpected surprises, then you can use Python.

I'd say that calling a function that doesn't exist and not getting an error until the function is called, or having a typo in a variable name and the program not treating that as an error, is an unexpected surprise by itself.
Yet I don't see any Python users posting blogs about such errors and being unable to troubleshoot them. The biggest source of actual confusion may be something like the somewhat unpredictable performance of string concatenation in CPython. There's also Rust for those who prefer statically typed programming languages and catching more errors at the compilation stage.
As far as I understand, `-O3` can be faster and also can be slower than `-O2`, but it is more often faster than slower. The only reason why the standard compilation line for big competitions (such as ICPC) contains `-O2` is that `-O3` allows the compiler to do some optimizations that are not present in the C++ standard (or maybe even do something contradicting the standard, assuming it does not change the program behaviour, but I am not sure about that). On the other hand, `-O2` is guaranteed to perform almost all optimizations from the standard, so it is usually enough for any reasonable solution, and you can always add pragmas whenever you specifically need `-O3`. Anyway, it is impossible to make the best command line to deal with pragmas once and forever, because there will always be something like `sse` or `avx` that is not added by `-march=native -mtune=native` but can still be added manually via pragmas and make some difference in running time.

In my experience it's more often slower than faster, since regular CP code isn't bottlenecked by how instructions are generated.
To be very honest, I think it is more often more or less equal. Also, I bet that it depends not only on the problem but also on your preferred code style. E.g. I tend to write very template-heavy code that relies on inlining optimizations that differ between `-O2` and `-O3` (according to godbolt). I can't quite tell which one is better, though. In general, I think that if the optimizations have gone that far, then you'd better simply check both `-O2` and `-O3`.

I do check; that's how I know. However, even if it were 50/50 on which flag gives better results, O2 would be the one to use as the default.
That's the thing: you get different code; that doesn't mean you get faster code. It should depend on the problem type rather than code style, though (unless we're talking about a bad code style that needs O3). For example, randomly accessing memory is bottlenecked by cache misses, and no amount of compiler optimisation can change that.
Great resource: https://wiki.gentoo.org/wiki/GCC_optimization.
Basically, feel free to use machine-specific options (`-march=native` is enough) since that's where the code will run, but keep optimisation at `-O2`. Also note `-pipe`, which just makes compilation faster, but that's useful too.

Here is an alternative opinion from another Linux distro: https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html
Looks like you are right, and -march=native also implies -mtune=native, at least for the x86 target. As for -funroll-loops/-funroll-all-loops, the GCC documentation says that:
It's important to note that popular open source software typically has a long history. Years have been spent ironing out bugs and improving performance. Many of the performance-critical parts have already been identified and optimized using hand-written assembly or intrinsics. Many loops that could actually benefit from unrolling have been unrolled manually. There are relatively few low-hanging fruits left for the compilers. This limits the usefulness of the -ftree-vectorize and -funroll-loops optimization options when used with such code.
But competitive programming solutions are usually small and somewhat messy, and there's not much time available for manual loop unrolling due to the short contest duration. This provides more optimization opportunities for the compiler. Pre-written library code is the only exception.
Keep in mind that compute-intensive in open-source applications usually means things like video processing, not complex algorithms. The vast majority don't have performance-critical parts at all.
The tradeoff isn't simply "no time to do this"; it's "no time to do this probably useless thing". Copy-pasting the insides of a loop and renaming variables isn't hard when you know it's useful, but when you don't know, there's a good chance it's not useful.
Manual loop unrolling doesn't make any sense anymore, unless it is a non-trivial rearrangement of program logic that the compiler can't detect. At O3, the compiler is smart enough to conservatively unroll the loops it deems fit (for instance, this: https://godbolt.org/z/hzKvnsGfz; try it without `-funroll-loops`). `-funroll-loops` usually leads to better performance (especially for a lightweight tight loop), but again, compiler options don't always do what you want them to do, and this doesn't unroll all loops either, since the unrolling is heuristic-based. Compilers might even re-roll your loops if they find that some of your loops are hand-unrolled and are pretty bad, due to the absence of such a guarantee (this is just a hypothetical). Compilers are smart enough at micro-optimizations (more so than most programmers), so I believe we shouldn't rule out the usefulness of the loop handling that the compiler does.
This is probably the most sane thing to do.
I would however add that if `-march=native` isn't set already, then `bmi`/`bmi2` (or, failing that, `popcnt`, `lzcnt`, `tzcnt`) should be considered for non-trivial bit operations (like `__builtin_popcount`, `__builtin_ctz`, `__builtin_clz`, `__lg` and so on). These would be pretty helpful for such bitwise operations and would also speed up bitsets a bit. `Ofast` is unsafe for floating-point operations (I remember someone saying that it breaks some solutions to geometry problems, which was the biggest reason I stopped using `Ofast`), and `O3` might make code slower (from my experience on CF as well, but I tend to include it anyway since that happens rarely). One good thing about `O3` (as pointed out above) is the larger bit vectors that you get (but you can get that by enabling AVX2 anyway).
The point is, optimized code should not be necessary to pass, although it's nice to have. `-march=native` or `-mtune=native` being the only addition to the compiler options should be fine in my opinion. The reason why it's not used in production is that people want to build for other architectures too, but since all code compilation and execution happens on one system in the case of competitive programming, it makes sense to actually get as much performance out of it as possible. Some might argue it's even the best thing, since it allows you to use instructions from most supported ISAs, which makes it fair game for people who don't actually want to use pragmas (and for people who are afraid of using pragmas due to the fear that their submission will get an RE due to SIGILL).
I would actually be interested in the flags that are supported on Codeforces (the flags turned on by `-march=native` can be found by running `g++ -march=native -E -v - </dev/null 2>&1 | grep cc1`). For instance, my machine gives the following output:
). For instance, my machine gives the following output./usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/cc1 -E -quiet -v — -march=skylake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -msgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=9216 -mtune=skylake -dumpbase -
Still, in an ideal scenario, such optimization options shouldn't be needed (and perhaps should be banned from code too), but cheesing problems using SIMD is way too fun.
There's actually another way to get free performance: write high-level idiomatic code and constexpr as much as you can.
Better to show with an example: https://gcc.godbolt.org/z/YocGTro41
High-level doesn't mean slow. The example above is overly high-level: it computes some nontrivial thing (the sum of all node labels in some implicitly defined graph that is not stored in memory), and it compiles to a single machine instruction that just moves the result into a register.
You can do memory allocations at no cost (one can think of the memory for constants/precomputed stuff as registers); nasty cases and the like can be elegantly expressed in this way.
Not sure if this is common knowledge in CP.
That's not taking any input, so it's not applicable to programming contests.
The question isn't whether a speedup trick is possible in some situation that can arise with non-zero chance. Plenty of those exist. It has to be useful in contests.
[Deleted]
As a somewhat strong competitor who mainly uses a language other than C++: Don't worry about widening the performance gap between C++ and slower languages with a change like this one. Most of what these flags will accomplish was already possible with pragmas, and SIMD/vectorization only provides a large advantage in some fairly specific situations, which are often simply absent in the performance bottlenecks of even somewhat complex code. The main effect of changing the compilation flags would be to make these optimizations more accessible, and harder for problem-setters to overlook.
Most* of the pressure I feel to sometimes switch to C++ comes from the small minority of problems that have rather tight time constraints, and I think the same goes for pretty much every contestant not using a reliably-very-fast language like C++. So, unless this change meaningfully affects the frequency or severity of such occurrences, I think it won't meaningfully alter competitors' decisions. And while this change will certainly lead to tighter constraints on some problems, I intuitively expect that in most situations where this is enough to create serious difficulties for slower languages, it will come with the flip-side of also removing unintended C++ solutions (often much simpler than the alternatives), and those are themselves a strong incentive to switch languages for that problem.
The rest of this pressure comes from the relative inconvenience of describing some mutation-heavy algorithms in a pure language like Haskell. But this comes mainly from (in decreasing order of importance) (1) my tendency to not realize how stupid and improvable my ideas are until I've started implementing them, (2) the unique characteristics of fully-immutable coding, and (3) my lack of skill and confidence at the programming part of competitive programming; and none of those have anything to do with C++ in particular. I could just as easily switch to some other language (say, Kotlin) on a problem for this reason.
P.S. A belated thanks for the (unannounced?) change 6 months ago removing `-XSafe` from the GHC compilation settings. It would also be nice if the version of GHC shown in the Languages drop-down (currently 8.10.1) agreed with the version of GHC actually being used (currently 8.6.3).

In the long run, I think the best way to narrow the gap is by adding more languages that can be run as 64-bit executables. This is why I am excited by the introduction of PyPy3 64-bit even though I will probably rarely use it myself. I think many share this sentiment.
Oh wow reaching GM with Haskell. orz
Please don't; like 90% of the time the `-O3` flag is slower by a lot. `-O2` is best. If users want, they can upgrade to `-O3` with a pragma.

Can you provide any examples of real Codeforces submissions that get slower with -O3? (Since you write "90% of the time" and "by a lot", you'll probably need multiple examples.) I've never seen any cases where -O3 is significantly slower. (But adding -funroll-loops to the default command line is probably a bad idea.)
Add swift please
Java SE 17 (LTS) is coming soon, and hopefully you will add it as well.
I believe complexity is one of the most important things in CP. A program that uses a correct algorithm with acceptable complexity but fails the judgment just because it doesn't use tricks like fast input or other code-level optimizations is a situation that should never happen in rated contests, because it is unfair. I do believe automatic `-O2` or `-O3` can avoid this in certain conditions. I believe there should be a judgment program to decide which one to use.