### pajenegod's blog

By pajenegod, history, 4 months ago, I don't think blogs like this one should be normal on Codeforces.

I know that sometimes comments can be frustrating. Sometimes it is the commenter's fault, and sometimes it is the community's fault for upvoting mean-spirited comments. I understand that in some cases, the criticism the commenter receives is fair and well-deserved.

But there is a fine line between criticism, saying that you didn't like the comment, and hanging a comment out on the front page and letting everyone attack it. The former two are acceptable (and sometimes even needed, because commenters have to improve and learn from their mistakes); the latter, in my opinion, should not be acceptable.

I think that blogs discussing toxicity on Codeforces are a good thing, but singling out a comment made by some kid and putting it on the front page of Codeforces is a very inconsiderate and irresponsible thing to do.

By pajenegod, history, 5 months ago, Hi Codeforces!

I have something exciting to tell you guys about today! I have recently come up with a really neat and simple recursive algorithm for multiplying polynomials in $O(n \log n)$ time. It is so neat and simple that I think it might possibly revolutionize the way that fast polynomial multiplication is taught and coded. You don't need to know anything about FFT to understand and implement this algorithm.

Big thanks to nor, c1729 and Spheniscine for discussing the contents of the blog with me and coming up with ideas for how to improve the blog =).

I've split this blog up into two parts. The first part is intended for anyone to be able to read and understand. The second part is advanced and goes into a ton of interesting ideas and concepts related to this algorithm.

Prerequisite: Polynomial quotient and remainder, see Wiki article and this Stackexchange example.

You are given two polynomials $P$ and $Q$, an integer $n$, and a non-zero complex number $c$, where degree $P < n$ and degree $Q < n$. Your task is to calculate the polynomial $P(x) \, Q(x) \% (x^n - c)$ in $O(n \log n)$ time. You may assume that $n$ is a power of two.

### Solution:

We can create a divide and conquer algorithm for $P(x) \, Q(x) \% (x^n - c)$ based on the difference of squares formula. Assuming $n$ is even, then $(x^n - c) = (x^{n/2} - \sqrt{c}) (x^{n/2} + \sqrt{c})$. The idea behind the algorithm is to calculate $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$ using 2 recursive calls, and then use that result to calculate $P(x) \, Q(x) \% (x^n - c)$.

So how do we actually calculate $P(x) \, Q(x) \% (x^n - c)$ using $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$?

Well, we can use the following formula:

$$\begin{aligned} A(x) \% (x^n - c) = &\frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) \, + \\ &\frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right). \end{aligned}$$
Proof of the formula
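A sketch of one way to verify the formula (not necessarily the blog's original argument): call the right-hand side $R(x)$. Modulo $x^{n/2} - \sqrt{c}$ we have $x^{n/2} \equiv \sqrt{c}$, so the first weight becomes $\frac{1}{2}(1 + 1) = 1$ and the second becomes $\frac{1}{2}(1 - 1) = 0$, giving

$$R(x) \equiv A(x) \% (x^{n/2} - \sqrt{c}) \equiv A(x) \pmod{x^{n/2} - \sqrt{c}}.$$

Symmetrically, modulo $x^{n/2} + \sqrt{c}$ we have $x^{n/2} \equiv -\sqrt{c}$, so $R(x) \equiv A(x)$ there too. Since $R$ has degree $< n$ and agrees with $A$ modulo both coprime factors of $x^n - c = (x^{n/2} - \sqrt{c})(x^{n/2} + \sqrt{c})$, the Chinese remainder theorem for polynomials gives $R(x) = A(x) \% (x^n - c)$.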

This formula is very useful. If we substitute $A(x)$ with $P(x) \, Q(x)$, then the formula tells us how to calculate $P(x) \, Q(x) \% (x^n - c)$ from $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$ in linear time. With this we have the recipe for implementing an $O(n \log n)$ divide and conquer algorithm:

Input:

• Integer $n$ (power of 2),
• Non-zero complex number $c$,
• Two polynomials $P(x) \% (x^n - c)$ and $Q(x) \% (x^n - c)$.

Output:

• The polynomial $P(x) \, Q(x) \% (x^n - c)$.

Algorithm:

Step 1. (Base case) If $n = 1$, then return $P(0) \cdot Q(0)$. Otherwise:

Step 2. Starting from $P(x) \% (x^n - c)$ and $Q(x) \% (x^n - c)$, in $O(n)$ time calculate

$$\begin{aligned} & P(x) \% (x^{n/2} - \sqrt{c}), \\ & Q(x) \% (x^{n/2} - \sqrt{c}), \\ & P(x) \% (x^{n/2} + \sqrt{c}) \text{ and} \\ & Q(x) \% (x^{n/2} + \sqrt{c}). \end{aligned}$$

Step 3. Make two recursive calls to calculate $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$.

Step 4. Using the formula, calculate $P(x) \, Q(x) \% (x^n - c)$ in $O(n)$ time. Return the result.

Here is a Python implementation following this recipe:
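A minimal sketch of such an implementation (illustrative code, assuming $P$ and $Q$ are given as coefficient lists of length $n$, lowest degree first; `polymult_mod` is an illustrative name):

```python
import cmath

def polymult_mod(P, Q, n, c):
    # Returns P(x) * Q(x) % (x^n - c).
    # P and Q are coefficient lists of length n, n is a power of 2, c != 0.
    if n == 1:
        return [P[0] * Q[0]]  # Base case: product of two constants
    m = n // 2
    s = cmath.sqrt(c)
    # Step 2: reduce mod (x^m - s) and (x^m + s), using x^m = +s and x^m = -s
    P_minus = [P[i] + s * P[i + m] for i in range(m)]
    Q_minus = [Q[i] + s * Q[i + m] for i in range(m)]
    P_plus  = [P[i] - s * P[i + m] for i in range(m)]
    Q_plus  = [Q[i] - s * Q[i + m] for i in range(m)]
    # Step 3: two recursive calls of half the size
    R_minus = polymult_mod(P_minus, Q_minus, m, s)   # mod (x^m - s)
    R_plus  = polymult_mod(P_plus,  Q_plus,  m, -s)  # mod (x^m + s)
    # Step 4: combine the two halves using the formula
    return [(R_minus[i] + R_plus[i]) / 2 for i in range(m)] + \
           [(R_minus[i] - R_plus[i]) / (2 * s) for i in range(m)]
```

For example, `polymult_mod([1, 2], [3, 4], 2, 1)` should return values within floating point error of `[11, 10]`, since $(1 + 2x)(3 + 4x) = 3 + 10x + 8x^2 \equiv 11 + 10x \pmod{x^2 - 1}$.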

One final thing that I want to mention before going into the advanced section is that this algorithm can also be used to do fast unmodded polynomial multiplication, i.e. given polynomials $P(x)$ and $Q(x)$ calculate $P(x) \, Q(x)$. The trick is simply to pick $n$ large enough such that $P(x) \, Q(x) = P(x) \, Q(x) \% (x^n - c)$, and then use the exact same algorithm as before. $c$ can be arbitrarily picked (any non-zero complex number works).

Python implementation for general Fast polynomial multiplication
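As a sketch, the padding trick can be combined with the recursive routine like this (illustrative code: a compact copy of the mod-multiplication recursion plus the padding; integer coefficients are assumed so the result can be rounded back to integers):

```python
import cmath

def polymult_mod(P, Q, n, c):
    # P(x) * Q(x) % (x^n - c) via the divide and conquer recipe above
    if n == 1:
        return [P[0] * Q[0]]
    m, s = n // 2, cmath.sqrt(c)
    Pm = [P[i] + s * P[i + m] for i in range(m)]
    Qm = [Q[i] + s * Q[i + m] for i in range(m)]
    Pp = [P[i] - s * P[i + m] for i in range(m)]
    Qp = [Q[i] - s * Q[i + m] for i in range(m)]
    Rm = polymult_mod(Pm, Qm, m, s)
    Rp = polymult_mod(Pp, Qp, m, -s)
    return [(Rm[i] + Rp[i]) / 2 for i in range(m)] + \
           [(Rm[i] - Rp[i]) / (2 * s) for i in range(m)]

def polymult(P, Q):
    # Pad to a power of two n with deg(P*Q) < n, so that
    # P(x) * Q(x) % (x^n - c) equals P(x) * Q(x) exactly.
    n = 1
    while n < len(P) + len(Q) - 1:
        n *= 2
    A = P + [0] * (n - len(P))
    B = Q + [0] * (n - len(Q))
    C = polymult_mod(A, B, n, 1)  # c = 1 here, but any non-zero c works
    return [round(x.real) for x in C[:len(P) + len(Q) - 1]]
```

For instance, `polymult([1, 2, 3], [4, 5])` should give `[4, 13, 22, 15]`, matching $(1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3$.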

If you want to try out implementing this algorithm yourself, then here is a very simple problem to test out your implementation on: SPOJ:POLYMUL.

### (Advanced) Speeding up the algorithm

This section will be about tricks that can be used to speed up the algorithm. The first two tricks will speed up the algorithm by a factor of 2 each. The last trick is advanced, and it has the potential to both speed up the algorithm and also make it more numerically stable.

$n$ doesn't actually need to be a power of 2
Imaginary-cyclic convolution
Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, n, 1) (reweight technique)

This algorithm is actually FFT in disguise. But it is also different compared to any other FFT algorithm that I've seen in the past (for example the Cooley–Tukey FFT algorithm).

Using this algorithm to calculate FFT
This algorithm is not the same algorithm as Cooley–Tukey
FFT implementation in Python based on this algorithm
FFT implementation in C++ based on this algorithm

### (Advanced) Connection between this algorithm and NTT

Just like how there is FFT and NTT, there are two variants of this algorithm too: one using complex floating point numbers, and the other working modulo a prime (or more generally, modulo an odd composite number).

Using modulo integers instead of complex numbers
Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, 2*n, 1)
This algorithm works to some degree even for bad NTT primes
NTT implementation in Python based on this algorithm
NTT implementation in C++ based on this algorithm
Blazingly fast NTT C++ implementation

### (Advanced) Shorter implementations ("Codegolfed version")

It is possible to make really short but slightly less natural implementations of this algorithm. Originally I was thinking of using this shorter version in the blog, but in the end I didn't do it. So here they are. If you want a short implementation of this algorithm to use in practice, then I would recommend taking one of these implementations and porting it to C++.

Short Python implementation without any speedup tricks
Short Python implementation supporting odd and even $n$ (making it up to 2 times faster)
Short Python implementation supporting odd and even $n$ and imaginary cyclic convolution (making it up to 4 times faster)
By pajenegod, history, 8 months ago, Take a look at this C++ submission 199864568:

#import <bits/stdc++.h>
using namespace std;
int main()
{
    int a;
    cin >> a;
    cout << ((a%2==0 && a>2) ? "YES" : "NO");
}

Don't see it?

Spoiler
By pajenegod, history, 15 months ago, I recently had a very interesting idea for how to greatly speed up convolution (a.k.a. polynomial multiplication).

def convolution(A, B):
    C = [0] * (len(A) + len(B) - 1)
    for i in range(len(A)):
        for j in range(len(B)):
            C[i + j] += A[i] * B[j]
    return C


The standard technique for doing convolution fast is to make use of cyclic convolution (polynomial multiplication mod $x^n - 1$).

def cyclic_convolution(A, B):
    n = len(A)  # A and B need to have the same size
    C = [0] * n
    for i in range(n):
        for j in range(n):
            C[(i + j) % n] += A[i] * B[j]
    return C


Cyclic convolution can be calculated in $O(n \log n)$ using FFT, which is really fast. The issue here is that in order to do convolution using cyclic convolution, we need to pad with a lot of 0s to not be affected by the wrap around. All this 0-padding feels very inefficient.
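To make the padding concrete, here is a small sketch (`convolution_via_cyclic` is an illustrative name; the quadratic cyclic convolution is included for self-containment). Note that the cyclic size $n$ must be nearly double the length of either input:

```python
def cyclic_convolution(A, B):
    # Polynomial multiplication mod x^n - 1 (quadratic version, for clarity)
    n = len(A)
    C = [0] * n
    for i in range(n):
        for j in range(n):
            C[(i + j) % n] += A[i] * B[j]
    return C

def convolution_via_cyclic(A, B):
    # Pad with 0s so that no non-zero product wraps around:
    # the cyclic size must be at least len(A) + len(B) - 1.
    n = len(A) + len(B) - 1
    return cyclic_convolution(A + [0] * (n - len(A)), B + [0] * (n - len(B)))
```

For example, `convolution_via_cyclic([1, 2], [3, 4])` gives `[3, 10, 8]`, i.e. $(1 + 2x)(3 + 4x) = 3 + 10x + 8x^2$, at the cost of running the cyclic convolution at size 3 instead of 2.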

So here is my idea. What if we do polynomial multiplication mod $x^n - i$ instead of mod $x^n - 1$? Then when we get wrap around, it will be multiplied by the imaginary unit, so it won't interfere with the real part! I call this the imaginary cyclic convolution.

def imaginary_cyclic_convolution(A, B):
    n = len(A)  # A and B need to have the same size
    C = [0] * n
    for i in range(n):
        for j in range(n):
            C[(i + j) % n] += (1 if i + j < n else 1j) * A[i] * B[j]
    return C


Imaginary cyclic convolution is the perfect algorithm to use for implementing convolution. Using it, we no longer need to do copious amounts of 0-padding, since the imaginary unit takes care of the wrap around. In fact, the size (the value of $n$) required is exactly half of what we would need with ordinary cyclic convolution.

One question still remains, how do we implement imaginary cyclic convolution efficiently?

The trick is rather simple. Let $\omega = i^{\frac{1}{n}}$. Now note that if $C(\omega x) = A(\omega x) B(\omega x) \mod (x^n - 1)$, then $C(x) = A(x) B(x) \mod (x^n - i)$. So here is the algorithm:

def imaginary_cyclic_convolution(A, B):
    n = len(A)  # A and B need to have the same size
    w = (1j)**(1/n)  # n-th root of the imaginary unit

    # Transform the polynomials A(x) -> A(wx) and B(x) -> B(wx)
    A = [A[i] * w**i for i in range(n)]
    B = [B[i] * w**i for i in range(n)]

    C = cyclic_convolution(A, B)

    # Transform the polynomial C(wx) -> C(x)
    C = [C[i] / w**i for i in range(n)]
    return C
By pajenegod, history, 15 months ago, Spoiler

By pajenegod, history, 18 months ago, Hi CF! During this past weekend I was reading up on the Montgomery transformation, which is a really interesting and useful technique for doing fast modular multiplication. However, all of the explanations I could find online felt very unintuitive to me, so I decided to write my own blog on the subject. A big thanks to kostia244, nor, nskybytskyi and -is-this-fft- for reading this blog and giving me some feedback =).

### Fast modular multiplication

Let $P=10^9+7$ and let $a$ and $b$ be two numbers in $[0,P)$. Our goal is to calculate $a \cdot b \, \% \, P$ without ever actually calling $\% \, P$. This is because calling $\% \, P$ is very costly.

If you haven't noticed that calling $\% \, P$ is really slow, then the reason you haven't noticed it is likely because the compiler automatically optimizes away the $\% \, P$ call if $P$ is known at compile time. But if $P$ is not known at compile time, then the compiler will have to call $\% \, P$, which is really really slow.

### Montgomery reduction of $a \cdot b$

It turns out that the trick to calculate $a \cdot b \, \% \, P$ efficiently is to calculate $a \cdot b \cdot 2^{-32} \, \% \, P$ efficiently. So the goal for this section will be to figure out how to calculate $a \cdot b \cdot 2^{-32} \, \% \, P$ efficiently. $a \cdot b \cdot 2^{-32} \, \% \, P$ is called the Montgomery reduction of $a \cdot b$, denoted by $\text{m_reduce}(a \cdot b)$.

#### Idea (easy case)

Suppose that $a \cdot b$ just happens to be divisible by $2^{32}$. Then $(a \cdot b \cdot 2^{-32}) \, \% \, P = (a \cdot b) \gg 32$, which runs super fast!

#### Idea (general case)

Can we do something similar if $a \cdot b$ is not divisible by $2^{32}$? The answer is yes! The trick is to find some integer $m$ such that $(a \cdot b + m \cdot P)$ is divisible by $2^{32}$. Then $a \cdot b \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \gg 32$.

So how do we find such an integer $m$? We want $(a \cdot b + m \cdot P) \,\%\, 2^{32} = 0$, so $m = (-a \cdot b \cdot P^{-1}) \,\%\, 2^{32}$. So if we precalculate $(-P^{-1}) \,\%\, 2^{32}$, then calculating $m$ can be done blazingly fast.

### Montgomery transformation

Since the Montgomery reduction divides $a \cdot b$ by $2^{32}$, we would like some way of multiplying by $2^{32}$ modulo $P$. The operation $x \cdot 2^{32} \, \% \, P$ is called the Montgomery transform of $x$, denoted by $\text{m_transform}(x)$.

The trick to implement $\text{m_transform}$ efficiently is to make use of the Montgomery reduction. Note that $\text{m_transform}(x) = \text{m_reduce}(x \cdot (2^{64} \, \% \, P))$, so if we precalculate $2^{64} \, \% \, P$, then $\text{m_transform}$ also runs blazingly fast.

### Montgomery multiplication

Using $\text{m_reduce}$ and $\text{m_transform}$ there are multiple ways of calculating $a \cdot b \, \% \, P$ efficiently. One way is to run $\text{m_transform}(\text{m_reduce}(a \cdot b))$. This results in two calls to $\text{m_reduce}$ per multiplication.

Another common way to do it is to always keep all integers transformed in the so-called Montgomery space. If $a' = \text{m_transform}(a)$ and $b' = \text{m_transform}(b)$ then $\text{m_transform}(a \cdot b \, \% \, P) = \text{m_reduce}(a' \cdot b')$. This effectively results in one call to $\text{m_reduce}$ per multiplication, however you now have to pay to move integers into and out of the Montgomery space.

### Example implementation

Here is a Python 3.8 implementation of Montgomery multiplication. This implementation is just meant to serve as a basic example. Implement it in C++ if you want it to run fast.

P = 10**9 + 7
r = 2**32
r2 = r * r % P
Pinv = pow(-P, -1, r)  # (-P^-1) % r

def m_reduce(ab):
    m = ab * Pinv % r
    return (ab + m * P) // r

def m_transform(a):
    return m_reduce(a * r2)

# Example of how to use it
a = 123456789
b = 35
a_prim = m_transform(a)  # mult a by 2^32
b_prim = m_transform(b)  # mult b by 2^32
prod_prim = m_reduce(a_prim * b_prim)  # divide a' * b' by 2^32
prod = m_reduce(prod_prim)  # divide prod' by 2^32
print('%d * %d %% %d = %d' % (a, b, P, prod))  # prints 123456789 * 35 % 1000000007 = 320987587


### Final remarks

One important issue that I've so far swept under the rug is that the output of m_reduce is actually in $[0, 2 P)$ and not $[0, P)$. I just want to end by discussing this issue. I can see two ways of handling it:

• Alternative 1. You can force $\text{m_reduce}(a \cdot b)$ to be in $[0, P)$ for $a$ and $b$ in $[0, P)$ by adding an if-statement to the output of m_reduce. This will work for any odd integer $P < 2^{31}$.
Fixed implementation of m_reduce
• Alternative 2. Assuming $P$ is an odd integer $< 2^{30}$, if $a$ and $b$ $\in [0, 2 P)$ then you can show that the output of $\text{m_reduce}(a \cdot b)$ is also in $[0, 2 P)$. So if you are fine with working in $[0, 2 P)$ everywhere, then you don't need any if-statements. Nyaan's github has a nice C++ implementation of Montgomery multiplication using this style of implementation.

By pajenegod, history, 3 years ago, I've always liked using Python (PyPy) for solving problems in competitive programming. And most problems are very doable, even in Python. What I've found is that the most difficult problems to solve in Python are those requiring 64 bit integers.

The reason why 64 bit integers are problematic is that CF runs Windows, and PyPy only supports 32 bit on Windows. So whenever a problem involves integers that cannot fit inside a signed 32 bit int, PyPy switches to big integers (which run insanely slowly, sometimes a factor of 20 times slower).

What I currently have to do to get around big integers

However with the latest PyPy version (version 7.3.4) PyPy has finally switched to 64 bit on Windows! So upgrading PyPy would mean no more problems with big integers. This would make PyPy far more usable and more beginner friendly. So if possible please update PyPy's version on CF to 7.3.4! MikeMirzayanov

Edit: Reading the Results of 2020 [list some changes and improvements] blog, I realized that I should probably be tagging geranazavr555, kuviman and cannor147 too.
By pajenegod, history, 3 years ago, Let me tell you the story of how I made \$2200 from doing competitive programming.

Spoiler

Once many many fortnights ago Hackerrank held one of its regular competitions, "World CodeSprint 9". This was back when Hackerrank actually sent out its prizes. The competition was very unusual in that one of its hardest problems was a score-based approximation problem. This competition was also the first time that I placed in the top 100! Using my beloved Python :) As I recall, the prize for placing 4 to 100 was a t-shirt and \$75. More precisely, these \$75 were sent either in Bitcoin or Amazon gift cards depending on where the prize winners lived, and in my case I got Bitcoin. I received the \$75 in Bitcoin on the 21st of March 2017.

Prices 2017

When I got them, I didn't really know how to do anything with them, so I kind of just forgot about them. Turns out that the value of Bitcoin has increased a bit since then:

Prices 2017-2021

30 times more to be precise! So today I just sold them for a bit over \$2200 (sold when the price hit 26640€/BTC). Not too shabby for a 34th place finish in a regular competition! =D

By pajenegod, history, 4 years ago,

# Introduction

I'm writing this blog because of the large number of blogs asking why they get strange floating point arithmetic behaviour in C++. For example:

"WA using GNU C++17 (64) and AC using GNU C++17" https://mirror.codeforces.com/blog/entry/78094

"The curious case of the pow function" https://mirror.codeforces.com/blog/entry/21844

"Why does this happen?" https://mirror.codeforces.com/blog/entry/51884

"Why can this code work strangely?" https://mirror.codeforces.com/blog/entry/18005

and many many more.

# Example

Here is a simple example of the kind of weird behaviour I'm talking about

Example showing the issue
Output for 32 bit g++
Output for 64 bit g++

Looking at this example, the output that one would expect from $10 * 10 - 10^{-15}$ is exactly $100$ since $100$ is the closest representable value of a double. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden excess precision causing the output to only sometimes(???) be $100$.

# Explanation

In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support. You can detect which one is being used from the value of FLT_EVAL_METHOD found in cfloat. In mode 2 (which is what 32 bit g++ uses by default) all floating point arithmetic is done using long double. Note that in this mode numbers are temporarily stored as long doubles while being operated on, which can and will cause a kind of excess precision. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.

# Detecting and turning on/off excess precision

Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774)

Test for detecting excess precision

If b is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation. MSVC(C++17), Clang and g++17(64bit) all use mode 0 and round b to 0, while g++11, g++14 and g++17 as expected all use mode 2 and b = 8e-17.

The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to use the SSE instruction set (version 2 or later) on top of it, which supports both float and double arithmetic. On GCC you can turn this on with the flags -mfpmath=sse -msse2. This will not change the value of FLT_EVAL_METHOD, but it will effectively turn off excess precision, see 81993714.

It is also possible to effectively turn on excess precision with -mfpmath=387, see 81993724.

# Fun exercise

Using your newfound knowledge of excess precision, try to find a compiler + input to "hack" this

Try to hack this

# Conclusion / TLDR

32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviours. 64 bit g++ does not have this issue. 