# Introduction↵
↵
I'm writing this blog because of the large number of blogs asking why their code shows strange floating point arithmetic behavior in C++.
For example:↵
↵
"WA using GNU C++17 (64) and AC using GNU C++17" https://mirror.codeforces.com/blog/entry/78094↵
↵
"The curious case of the pow function" https://mirror.codeforces.com/blog/entry/21844↵
↵
"Why does this happen?" https://mirror.codeforces.com/blog/entry/51884↵
↵
"Why can this code work strangely?" https://mirror.codeforces.com/blog/entry/18005↵
↵
and many many more.↵
↵
# Example↵
↵
Here is a simple example of the kind of weird behavior I'm talking about:
↵
<spoiler summary="Example showing the issue">↵
↵
~~~↵
#include <cstdlib>  // atof
#include <iostream>
using namespace std;

double f(double a, double b) {
    return a * a - b;
}

int main() {
    cout.precision(60);

    // Calculate 10*10 - 1e-15
    double ans;
    ans = f(atof("10"), atof("1e-15"));
    cout << (long double) ans << '\n';
    cout << (int) ans << "\n\n";

    ans = f(atof("10"), atof("1e-15"));
    cout << (int) ans << '\n';
    cout << (double) ans << "\n\n";

    ans = f(atof("10"), atof("1e-15"));
    cout << (double) ans << '\n';
    cout << (long double) ans << "\n\n";

    ans = f(atof("10"), atof("1e-15"));
    cout << (long double) ans << '\n';
    cout << (double) ans << "\n\n";
    return 0;
}
~~~↵
↵
</spoiler>↵
↵
<spoiler summary="Output for 32 bit g++">↵
~~~↵
99.99999999999999900079927783735911361873149871826171875
100↵
↵
99↵
100↵
↵
100↵
100↵
↵
99.99999999999999900079927783735911361873149871826171875↵
100↵
~~~↵
</spoiler>↵
↵
<spoiler summary="Output for 64 bit g++">↵
~~~↵
100↵
100↵
↵
100↵
100↵
↵
100↵
100↵
↵
100↵
100↵
~~~↵
</spoiler>↵
↵
Looking at this example, the output that one would expect from $10 * 10 - 10^{-15}$ is exactly $100$, since $100$ is the double closest to the exact result. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden _excess precision_ causing the output to only sometimes(???) be $100$.
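
As a rough sanity check of both outputs, here is a back-of-the-envelope calculation. It is my own arithmetic, and it assumes a double has a 53 bit mantissa and the x87 long double has a 64 bit mantissa. Both $100$ and the exact result lie in $[64, 128)$, so the spacing between neighboring representable values is

$$\mathrm{ulp}_{\text{double}} = 2^{6-52} = 2^{-46} \approx 1.4 \cdot 10^{-14}, \qquad \mathrm{ulp}_{\text{long double}} = 2^{6-63} = 2^{-57} \approx 6.9 \cdot 10^{-18}.$$

Since $10^{-15} < \frac{1}{2} \cdot 2^{-46}$, the exact result rounds to $100$ as a double. As a long double, $10^{-15} \approx 144.1 \cdot 2^{-57}$, so the nearest representable value is

$$100 - 144 \cdot 2^{-57} = 100 - 9 \cdot 2^{-53} \approx 100 - 9.992 \cdot 10^{-16}.$$

This agrees with the $99.999999999999999000799\ldots$ value that shows up in the 32 bit output.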
↵
# Explanation↵
↵
In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support. You can detect which one is being used from the value of `FLT_EVAL_METHOD`, found in `cfloat`. In mode 2 (which is what 32 bit g++ uses by default) **all** floating point arithmetic is done using long double. In this mode numbers are temporarily stored as long doubles while being operated on, and they are only rounded to their declared type when they are written back to memory. This is what causes the excess precision, and since the compiler decides when a value is written back, it only shows up sometimes. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.
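
To make the difference between the two modes concrete, here is a minimal sketch that writes out by hand what mode 2 effectively does. It should behave the same on any compiler where long double is wider than double. The helper names (`eval_wide`, `to_int_wide`, `to_int_rounded`, `long_double_is_wider`) are made up for this illustration, and the `volatile` store is just one assumed way of forcing the value to really be rounded to a double.

```cpp
#include <cfloat>

// Hypothetical helper names for illustration; they are not from any library.

// Mode 2 written out by hand: a * a - b is evaluated as a long double.
long double eval_wide(double a, double b) {
    return static_cast<long double>(a) * a - b;
}

// Converts the wide result to int directly, so the excess precision is
// still visible to the conversion.
int to_int_wide(double a, double b) {
    return static_cast<int>(eval_wide(a, b));
}

// Rounds the wide result to double before converting to int. The volatile
// store is an assumed trick to force a real 64 bit store, so that the
// rounding also happens when the compiler itself runs in mode 2.
int to_int_rounded(double a, double b) {
    volatile double narrow = static_cast<double>(eval_wide(a, b));
    return static_cast<int>(narrow);
}

// True when long double carries more mantissa bits than double. This holds
// for the 80 bit x87 long double, but not where long double equals double.
constexpr bool long_double_is_wider = LDBL_MANT_DIG > DBL_MANT_DIG;
```

With a long double that is wider than double, `to_int_wide(10, 1e-15)` gives 99 while `to_int_rounded(10, 1e-15)` gives 100. These are the same two answers that 32 bit g++ switches between in the example, depending on whether the value happened to be written back to memory before the conversion.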
↵
# Detecting and turning on/off excess precision↵
↵
Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774):
↵
<spoiler summary="Test for detecting excess precision">↵
↵
~~~~~↵
// #pragma GCC target("fpmath=sse,sse2") // Turns off excess precision↵
// #pragma GCC target("fpmath=387") // Turns on excess precision↵
↵
#include <iostream>↵
#include <cstdlib>↵
#include <cfloat>↵
using namespace std;↵
↵
int main() {
    cout << "This is compiled in mode " << FLT_EVAL_METHOD << '\n';
    cout << "0 means no excess precision.\n";
    cout << "2 means there is excess precision.\n\n";

    cout << "The following test detects excess precision\n";
    cout << "0 if no excess precision, or 8e-17 if there is excess precision.\n";
    double a = atof("1.2345678");
    double b = a * a;
    cout << b - 1.52415765279683990130 << '\n';
    return 0;
}
↵
~~~~~↵
↵
</spoiler>↵
↵
If `b` is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation. MSVC (C++17), Clang and g++17 (64 bit) all use mode 0 and print 0, while the 32 bit g++11, g++14 and g++17 all, as expected, use mode 2 and print 8e-17.
↵
The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to use the SSE instruction set (version 2 or later) alongside it, which supports both float and double arithmetic. On GCC you can turn this on with the flags `-mfpmath=sse -msse2`. This will not change the value of `FLT_EVAL_METHOD`, but it will effectively turn off excess precision, see [submission:81993714].
↵
It is also possible to effectively turn on excess precision with `-mfpmath=387`, see [submission:81993724].↵
↵
# Conclusion / TLDR↵
32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviors. 64 bit g++ does not have this issue.
↵