# Introduction

I'm writing this blog because of the large number of blogs asking about why they get strange floating arithmetic behaviour in C++. For example:

"WA using GNU C++17 (64) and AC using GNU C++17" https://mirror.codeforces.com/blog/entry/78094

"The curious case of the pow function" https://mirror.codeforces.com/blog/entry/21844

"Why does this happen?" https://mirror.codeforces.com/blog/entry/51884

"Why can this code work strangely?" https://mirror.codeforces.com/blog/entry/18005

and many many more.

# Example

Here is a simple example of the kind of weird behaviour I'm talking about

**Example showing the issue**

**Output for 32 bit g++**

**Output for 64 bit g++**

Looking at this example, the output that one would expect from $$$10 * 10 - 10^{-15}$$$ is exactly $$$100$$$ since $$$100$$$ is the closest representable value of a double. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden *excess precision* causing the output to only sometimes(???) be $$$100$$$.

# Explanation

In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see (https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support). You can detect which one is being used by the value of `FLT_EVAL_METHOD`

found in `cfloat`

. In mode 2 (which is what 32 bit g++ uses by default) **all** floating point arithmetic is done using long double. Note that in this mode numbers are temporarily stored as long doubles while being operated on, this can / will cause a kind of excess precision. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.

# Detecting and turning on/off excess precision

Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774)

**Test for detecting excess precision**

If b is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation. MSVC(C++17), Clang and g++17(64bit) all use mode 0 and round b to 0, while g++11, g++14 and g++17 as expected all use mode 2 and b = 8e-17.

The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to on top of this use the SSE instruction set (version 2 or later), which supports both float and double arithmetic. On GCC you can turn this on with the flags `-mfpmath=sse -msse2`

. This will not change the value of `FLT_EVAL_METHOD`

, but it will effectively turn off excess precision, see 81993714.

It is also possible to effectively turn on excess precision with `-mfpmath=387`

, see 81993724.

# Fun exercise

Using your newfound knowledge of excess precision, try to find a compiler + input to "hack" this

**Try to hack this**

# Conclusion / TLDR

32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviours. 64 bit g++ does not have this issue.

So what is the suggested solution for CP contests?

`#pragma GCC target("fpmath=387") // Turns on excess precision`

=> This seems very brittle and platform-dependent.Does using

`long double`

for all calculations also work?Solution: Don't do stupid things with floating point. Don't check floating point for equality. Be aware how floating point behaves wrt precision.

If you need extended precision you can explicitly use it, but in most cases it won't be needed if you take some care with how you do floating point operations (or avoid floating point when it's not needed).

I explicitly use

`-Wfloat-equal`

to ensure I never end up equating floating points.However, at least in the example codes in the blog, a very simple subtraction operation is done. While I am aware of floating point issues, I would say that extended precision is more preferable than thinking about whether this particular line may lead to WA later.

Is there any substantial difference between

`a - b == 0`

and`a == b`

? The code in the blog does`a - b`

and realizes it can give either 0 or stupidly close to zero. This is the exact same phenomenon as trying to compare two floats. I.e. not doing equality checks resolves it.Yes, using long double everywhere works, doing it like that means you won't have to worry about excess precision. In general simply using 64 bit C++ means you don't have to worry about excess precision.

I wrote this blog to inform people about how floating arithmetic is done in C++. That does not mean I think excess precision is a good idea. The way I see it, excess precision is a thing of the past and I'm happy I don't have to bother with it when I use 64 bit C++.

Well explained! But the real problem is people comparing floats using a == b :)

Doesn't this violate IEEE 754 since it requires basic floating point operations to be correctly rounded (results in closest representable value)? Is it related to https://mirror.codeforces.com/blog/entry/21844?

Edit: Seems that IEEE 754 allows excess precision.

After testing I'm pretty sure https://mirror.codeforces.com/blog/entry/21844 is related to excess precision. I found a minimum working example that I believe has the same issue.

Example showing the issueOutput for 32 bit g++Output for 64 bit g++As you can see, excess precision is able to "leak out" of

`f`

.The thing I'm not sure of is if pow having excess precision should be seen as a bug or not. From testing on cf, only when submitting under g++11 does pow have excess precision. It seems to not have excess precision on any later version. So probably pow having excess precision should be categorized as a bug, and it has now been fixed.

Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).The Code in (Detecting and turning on/off excess precision) Section:

using g++20 in Clion Gives 0 and the

`FLT_EVAL_METHOD`

says that the mode is 2 but when I change`double b = a*a`

to`long double b = a*a`

it give 8e-17 as I understand from you they should work the same or did i get something wrong?and The code in (Fun exercise) Section :

how does the

`int w = pow(y, 2);`

affect the result?The way mode 2 works is that all consecutive floating point calculations are done using

`long doubles`

. But when the floating point numbers are stored in memory, they are stored as their respecitive type.The pow call forces

`y`

to be stored in memory, which rounds`y`

. For example, if`y`

had the value $$$1-2^{-60}$$$, then the pow call rounds it to exactly`1.0`

since`1.0`

is the closest representable double to $$$1-2^{-60}$$$.