Hi Codeforces!
I have something exciting to tell you guys about today! I have recently come up with a really neat and simple recursive algorithm for multiplying polynomials in $$$O(n \log n)$$$ time. It is so neat and simple that I think it might possibly revolutionize the way that fast polynomial multiplication is taught and coded. You don't need to know anything about FFT to understand and implement this algorithm.
Big thanks to nor, c1729 and Spheniscine for discussing the contents of the blog with me and coming up with ideas for how to improve it =).
I've split this blog up into two parts. The first part is intended for anyone to be able to read and understand. The second part is advanced and goes into a ton of interesting ideas and concepts related to this algorithm.
Prerequisite: Polynomial quotient and remainder, see the Wiki article and this Stackexchange example.
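As a small worked example: dividing $$$x^3 + 2x^2 + 3x + 4$$$ by $$$x^2 - c$$$ gives
$$$ x^3 + 2x^2 + 3x + 4 = (x + 2)(x^2 - c) + (c + 3)x + (2c + 4), $$$
so $$$(x^3 + 2x^2 + 3x + 4) \% (x^2 - c) = (c + 3)x + (2c + 4)$$$. In other words, the coefficients of $$$x^2$$$ and $$$x^3$$$ simply get multiplied by $$$c$$$ and folded down onto $$$1$$$ and $$$x$$$.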
Task:
Given two polynomials $$$P$$$ and $$$Q$$$, an integer $$$n$$$ and a non-zero complex number $$$c$$$, where degree $$$P < n$$$ and degree $$$Q < n$$$. Your task is to calculate the polynomial $$$P(x) \, Q(x) \% (x^n - c)$$$ in $$$O(n \log n)$$$ time. You may assume that $$$n$$$ is a power of two.
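For example, for $$$n = 2$$$ and $$$c = 1$$$: $$$(1 + x)(2 + x) = 2 + 3x + x^2 \equiv 3 + 3x \pmod{x^2 - 1}$$$, since $$$x^2 \equiv 1$$$.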
Solution:
We can create a divide and conquer algorithm for $$$P(x) \, Q(x) \% (x^n - c)$$$ based on the difference of squares formula. Assuming $$$n$$$ is even, then $$$(x^n - c) = (x^{n/2} - \sqrt{c}) (x^{n/2} + \sqrt{c})$$$. The idea behind the algorithm is to calculate $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$ using 2 recursive calls, and then use that result to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$.
So how do we actually calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ using $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$?
Well, we can use the following formula:
$$$ \begin{aligned} A(x) \% (x^n - c) = &\frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) \, + \\ &\frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right). \end{aligned} $$$
Proof of the formula
Note that
$$$ A(x) = \frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) A(x) + \frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) A(x). $$$
Let $$$Q^-(x)$$$ denote the quotient of $$$A(x)$$$ divided by $$$(x^{n/2} - \sqrt{c})$$$ and let $$$Q^+(x)$$$ denote the quotient of $$$A(x)$$$ divided by $$$(x^{n/2} + \sqrt{c})$$$. Then
$$$ \begin{aligned} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) A(x) &= \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left((A(x) \% (x^{n/2} - \sqrt{c})) + Q^-(x) (x^{n/2} - \sqrt{c})\right) \\ &= \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) + \frac{1}{\sqrt{c}} Q^-(x) \, (x^n - c) \end{aligned} $$$
and
$$$ \begin{aligned} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) A(x) &= \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left((A(x) \% (x^{n/2} + \sqrt{c})) + Q^+(x) (x^{n/2} + \sqrt{c})\right) \\ &= \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right) - \frac{1}{\sqrt{c}} Q^+(x) (x^n - c). \end{aligned} $$$
With this we have shown that
$$$ \begin{aligned} A(x) = &\frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) \, + \\ &\frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right) \, + \\ &\frac{1}{\sqrt{c}} \frac{Q^-(x) - Q^+(x)}{2} (x^n - c). \end{aligned} $$$
Here $$$A(x)$$$ is expressed as remainder + quotient times $$$(x^n - c)$$$. So we have proven the formula.
This formula is very useful. If we substitute $$$A(x)$$$ by $$$P(x) Q(x)$$$, then the formula tells us how to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ using $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$ in linear time. With this we have the recipe for implementing an $$$O(n \log n)$$$ divide and conquer algorithm:
Input:
- Integer $$$n$$$ (power of 2),
- Non-zero complex number $$$c$$$,
- Two polynomials $$$P(x) \% (x^n - c)$$$ and $$$Q(x) \% (x^n - c)$$$.
Output:
- The polynomial $$$P(x) \, Q(x) \% (x^n - c)$$$.
Algorithm:
Step 1. (Base case) If $$$n = 1$$$, then return $$$P(0) \cdot Q(0)$$$. Otherwise:
Step 2. Starting from $$$P(x) \% (x^n - c)$$$ and $$$Q(x) \% (x^n - c)$$$, in $$$O(n)$$$ time calculate
$$$ \begin{align} & P(x) \% (x^{n/2} - \sqrt{c}), \\ & Q(x) \% (x^{n/2} - \sqrt{c}), \\ & P(x) \% (x^{n/2} + \sqrt{c}) \text{ and} \\ & Q(x) \% (x^{n/2} + \sqrt{c}). \end{align} $$$
Step 3. Make two recursive calls to calculate $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$.
Step 4. Using the formula, calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ in $$$O(n)$$$ time. Return the result.
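Note that the reduction in Step 2 is just a fold: modulo $$$(x^{n/2} \mp \sqrt{c})$$$ we have $$$x^{n/2} \equiv \pm \sqrt{c}$$$, so writing $$$P(x) = P_{\text{low}}(x) + x^{n/2} \, P_{\text{high}}(x)$$$ with $$$\deg P_{\text{low}}, \deg P_{\text{high}} < n/2$$$ gives
$$$ P(x) \% (x^{n/2} \mp \sqrt{c}) = P_{\text{low}}(x) \pm \sqrt{c} \, P_{\text{high}}(x). $$$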
Here is a Python implementation following this recipe:
Python solution to the task
"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time
Input:
  n: Integer, needs to be power of 2
  c: Non-zero complex floating point number
  P: A list of length n representing a polynomial P(x) % (x^n - c)
  Q: A list of length n representing a polynomial Q(x) % (x^n - c)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod(P, Q, n, c):
    assert len(P) == n and len(Q) == n
    # Base case
    if n == 1:
        return [P[0] * Q[0]]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Calculate P_minus := P mod (x^(n/2) - sqrt(c))
    #           Q_minus := Q mod (x^(n/2) - sqrt(c))
    P_minus = [p1 + sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_minus = [q1 + sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Calculate P_plus := P mod (x^(n/2) + sqrt(c))
    #           Q_plus := Q mod (x^(n/2) + sqrt(c))
    P_plus = [p1 - sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_plus = [q1 - sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod(P_minus, Q_minus, n//2, sqrtc)
    PQ_plus = fast_polymult_mod(P_plus, Q_plus, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
One final thing that I want to mention before going into the advanced section is that this algorithm can also be used to do fast unmodded polynomial multiplication, i.e. given polynomials $$$P(x)$$$ and $$$Q(x)$$$ calculate $$$P(x) \, Q(x)$$$. The trick is simply to pick $$$n$$$ large enough such that $$$P(x) \, Q(x) = P(x) \, Q(x) \% (x^n - c)$$$, and then use the exact same algorithm as before. $$$c$$$ can be arbitrarily picked (any non-zero complex number works).
Python implementation for general fast polynomial multiplication
"""
Calculates P(x) * Q(x)
Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2
    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    # Pick non-zero c arbitrarily =)
    c = 123.24
    # Calculate P*Q mod (x^n - c)
    PQ = fast_polymult_mod(P, Q, n, c)
    # Remove extra 0 padding and return
    return PQ[:res_len]
If you want to try out implementing this algorithm yourself, then here is a very simple problem to test out your implementation on: SPOJ:POLYMUL.
(Advanced) Speeding up the algorithm
This section will be about tricks that can be used to speed up the algorithm. The first two tricks speed up the algorithm by a factor of 2 each. The last trick is advanced, and it has the potential to both speed up the algorithm and make it more numerically stable.
$n$ doesn't actually need to be a power of 2
We don't actually need the assumption that $$$n$$$ is a power of 2. If $$$n$$$ ever becomes odd during the recursion, then we have two choices: either fall back to an $$$O(n^2)$$$ algorithm, or fall back to the unmodded $$$O(n \log{n})$$$ polynomial multiplication algorithm.
Let us discuss the run time of falling back to the $$$O(n^2)$$$ algorithm when $$$n$$$ becomes odd. Assume that $$$n = a \cdot 2^b$$$, where $$$a$$$ is an odd integer and $$$b$$$ is an integer. Think of the recursive algorithm as having layers, one layer for each possible value of $$$n$$$. The first $$$b$$$ layers will all take $$$O(n)$$$ time each. In the $$$(b+1)$$$-th layer the value of $$$n$$$ is $$$a$$$. Using the $$$O(n^2)$$$ polynomial multiplication algorithm leads to this layer taking $$$O(n/a \cdot a^2) = O(n \cdot a)$$$ time. The final time complexity comes out to be $$$O((a + b) \, n)$$$.
Python implementation that works for both odd and even $n$
"""
Calculates P(x) * Q(x) % (x^n - c) in O((a + b) * n) time, where n = a*2^b.
Input:
  n: Integer
  c: Non-zero complex floating point number
  P: A list of length n representing a polynomial P(x) % (x^n - c)
  Q: A list of length n representing a polynomial Q(x) % (x^n - c)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod2(P, Q, n, c):
    assert len(P) == n and len(Q) == n
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Calculate P_minus := P mod (x^(n/2) - sqrt(c))
    #           Q_minus := Q mod (x^(n/2) - sqrt(c))
    P_minus = [p1 + sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_minus = [q1 + sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Calculate P_plus := P mod (x^(n/2) + sqrt(c))
    #           Q_plus := Q mod (x^(n/2) + sqrt(c))
    P_plus = [p1 - sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_plus = [q1 - sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod2(P_minus, Q_minus, n//2, sqrtc)
    PQ_plus = fast_polymult_mod2(P_plus, Q_plus, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
The reason why this is super useful is that it allows us to speed up the fast unmodded polynomial multiplication algorithm. As long as we are fine with $$$a$$$ being less than, say, $$$10$$$, we might be able to choose a significantly smaller $$$n$$$ compared to what would be possible if we were only allowed to choose powers of two. This trick has the potential to make the fast unmodded polynomial multiplication algorithm run twice as fast.
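For example, if the result has length $$$1\,100\,000$$$, then padding to a power of two forces $$$n = 2^{21} = 2\,097\,152$$$, while allowing $$$a \le 10$$$ gives $$$n = 9 \cdot 2^{17} = 1\,179\,648$$$, almost half the size.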
Python implementation for more efficient fast unmodded polynomial multiplication
"""
Calculates P(x) * Q(x)
Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult2(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    # Pick n = a * 2^b sufficiently big, with a <= alim
    b = 0
    alim = 10
    while alim * 2**b < res_len:
        b += 1
    a = (res_len - 1) // 2**b + 1
    n = a * 2**b
    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    # Pick non-zero c arbitrarily =)
    c = 123.24
    # Calculate P*Q mod (x^n - c)
    PQ = fast_polymult_mod2(P, Q, n, c)
    # Remove extra 0 padding and return
    return PQ[:res_len]
Imaginary-cyclic convolution
Suppose that $$$P(x)$$$ and $$$Q(x)$$$ are two real polynomials, and that we want to calculate $$$P(x) \, Q(x)$$$. As discussed earlier, we can calculate the unmodded polynomial product by picking $$$n$$$ large enough such that $$$(P(x) \, Q(x)) \% (x^n - c) = P(x) \, Q(x)$$$ (here $$$c$$$ is any non-zero complex number), and then running the divide and conquer algorithm. But it turns out there is something smarter that we can do.
If we use $$$c = \text{i}$$$ (the imaginary unit) as the initial value of $$$c$$$, then this will allow us to pick an even smaller value for $$$n$$$. The reason for this is that if we get "overflow" from $$$n$$$ being too small, then that overflow will be placed into the imaginary part of the result $$$(P(x) \, Q(x)) \% (x^n - \text{i})$$$. This means that by using $$$c = \text{i}$$$ we are allowed to pick $$$n$$$ as half the size compared to if we weren't using $$$c = \text{i}$$$. So this trick speeds up the fast unmodded polynomial multiplication algorithm by exactly a factor of 2.
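Here is a minimal sketch of the trick (the wrapper name is mine; it reuses fast_polymult_mod from the first Python snippet, so $$$n$$$ should be a power of two and $$$P$$$, $$$Q$$$ real coefficient lists of length $$$n$$$):
def real_polymult_via_imag_cyclic(P, Q, n):
    PQ = fast_polymult_mod(P, Q, n, 1j)  # P * Q % (x^n - i)
    # Modulo (x^n - i) we have x^(n+k) = i * x^k, so for real inputs
    # coefficient k of P*Q lands in the real part and
    # coefficient k + n lands in the imaginary part
    return [z.real for z in PQ] + [z.imag for z in PQ]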
Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, n, 1) (reweight technique)
There is a somewhat well-known technique called "reweighting" that allows us to switch between working with $$$\% (x^n - c)$$$ and working with $$$\% (x^n - 1)$$$. I've previously written a blog explaining this technique, see here. The technique uses the $$$(2^b)$$$-th root of $$$c$$$ to switch variables from $$$\frac{x^n}{c}$$$ to $$$y^n$$$, where $$$n = a \cdot 2^b$$$.
So why would we be interested in switching from $$$\% (x^n - c)$$$ to $$$\% (x^n - 1)$$$? The reason is that with $$$c = 1$$$ we don't need to bother with multiplying or dividing by $$$c$$$ or $$$\sqrt{c}$$$ anywhere, since $$$c = \sqrt{c} = 1$$$. Additionally, if $$$c = -1$$$, $$$c = \text{i}$$$ or $$$c = -\text{i}$$$, then multiplying or dividing by $$$c$$$ can be done very efficiently. So whenever $$$c$$$ becomes something other than $$$1, -1, \text{i}$$$ or $$$-\text{i}$$$, it makes sense to use the reweight trick to switch back to $$$c = 1$$$. This significantly reduces the number of floating point operations used by the algorithm. Fewer floating point operations means the algorithm has the potential to be both faster and more numerically stable. So reweighting is definitely something to consider if you want to create a heavily optimized polynomial multiplication implementation.
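As a rough illustration of the idea (a sketch; the helper name is mine, and for simplicity it uses an $$$n$$$-th root of $$$c$$$ rather than the $$$(2^b)$$$-th root mentioned above): substituting $$$x = w y$$$ with $$$w^n = c$$$ turns $$$x^n - c$$$ into $$$c \, (y^n - 1)$$$, so we can multiply reweighted polynomials mod $$$(y^n - 1)$$$ and undo the weights afterwards.
import cmath

def polymult_mod_via_reweight(P, Q, n, c):
    # Sketch: reduce the general c case to the c = 1 case
    w = cmath.exp(cmath.log(c) / n)  # some n-th root of c
    Pw = [p * w**i for i, p in enumerate(P)]
    Qw = [q * w**i for i, q in enumerate(Q)]
    R = fast_polymult_mod(Pw, Qw, n, 1)  # P(wy) * Q(wy) % (y^n - 1)
    return [r / w**i for i, r in enumerate(R)]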
This algorithm is actually FFT in disguise. But it is also different from any other FFT algorithm that I've seen in the past (for example the Cooley–Tukey FFT algorithm).
Using this algorithm to calculate FFT
In the tail of the recursion (i.e. when $$$n$$$ reaches 1), you are calculating $$$P(x) \, Q(x) \% (x - c)$$$ for some non-zero complex number $$$c$$$. This is in fact the same thing as evaluating the polynomial $$$P(x) \, Q(x)$$$ at $$$x = c$$$. Furthermore, if you initially started with $$$c = 1$$$, then the $$$c$$$ in the tail will be some $$$n$$$-th root of unity. If you analyze it more carefully, you will see that each tail corresponds to a different $$$n$$$-th root of unity. So what the algorithm is actually doing is evaluating $$$P(x) \, Q(x)$$$ at all possible $$$n$$$-th roots of unity.
The $$$n$$$-th order FFT of a polynomial is defined as the polynomial evaluated at all $$$n$$$-th roots of unity. This means that the algorithm is in fact an FFT algorithm. However, if you want to use it to calculate FFT, make sure you order the $$$n$$$-th roots of unity according to the standard order used for FFT algorithms. The standard order is $$$\exp{(\frac{2 \pi \text{i}}{n} 0)}, \exp{(\frac{2 \pi \text{i}}{n} 1)}, ..., \exp{(\frac{2 \pi \text{i}}{n} (n-1))}$$$. To get the ordering correct, you will probably need to do a "bit reversal" at the end.
This algorithm is not the same algorithm as Cooley–Tukey
The Cooley–Tukey algorithm is the standard algorithm for calculating FFT. It is for example used in this blog: [Tutorial] FFT by -is-this-fft-. The idea behind the algorithm is to split up the polynomial $$$P(x)$$$ into an even part $$$P_{\text{even}}(x^2)$$$ and an odd part $$$x \, P_{\text{odd}}(x^2)$$$. You can calculate the FFT of $$$P(x)$$$ using the FFTs of $$$P_{\text{even}}(x)$$$ and $$$P_{\text{odd}}(x)$$$. So Cooley–Tukey is an $$$O(n \log{n})$$$ divide and conquer algorithm that repeatedly splits up the polynomial into odd and even parts.
The wiki article for Cooley–Tukey has a nice description of the algorithm:
$$$ \begin{align} X_k &= E_k + e^{- \frac{2 \pi \text{i}}{n} k} O_k, \\ X_{k+\frac{n}{2}} &= E_k - e^{- \frac{2 \pi \text{i}}{n} k} O_k. \end{align} $$$
If you compare this to calculating FFT using the divide and conquer polynomial mod method, you instead get
$$$ \begin{align} X_k &= E_k + c \, O_k, \\ X_{k+\frac{n}{2}} &= E_k - c \, O_k, \end{align} $$$
where $$$c$$$ is an $$$n$$$-th root of unity that is independent of $$$k$$$. This is very different from Cooley–Tukey, since $$$c$$$ doesn't have a dependence on $$$k$$$ unlike $$$e^{- \frac{2 \pi \text{i}}{n} k}$$$. In fact, $$$c$$$ being constant means that the polynomial mod method has the potential to be faster than Cooley–Tukey.
FFT implementation in Python based on this algorithm
Here is an FFT implementation. A codegolfed version of the same code can be found on Pyrival.
"""
Calculates FFT(P) in O(n log n) time.
This implementation is based on the
polynomial modulo multiplication algorithm.
Input:
P: A list of length n representing a polynomial P(x).
n needs to be a power of 2.
Output:
A list of length n representing the FFT of the polynomial P,
i.e. the list [P(exp(2j pi / n * i) for i in range(n)]
"""
rt = [1] # List used to store roots of unity
def FFT(P):
    n = len(P)
    # Assert n is a power of 2
    assert n and (n - 1) & n == 0
    # Make a copy of P to not modify the original P
    P = P[:]
    # Precalculate the roots
    while 2 * len(rt) < n:
        # 4*len(rt)-th root of unity
        import cmath
        root = cmath.exp(2j * cmath.pi / (4 * len(rt)))
        rt.extend([r * root for r in rt])
    # Transform P
    k = n
    while k > 1:
        for i in range(n//k):
            r = rt[i]
            for j1 in range(i*k, i*k + k//2):
                j2 = j1 + k//2
                z = r * P[j2]
                P[j2] = P[j1] - z
                P[j1] += z
        k //= 2
    # Bit reverse P before returning
    rev = [0] * n
    for i in range(1, n):
        rev[i] = rev[i // 2] // 2 + (i & 1) * n // 2
    return [P[r] for r in rev]
# Inverse of FFT(P) using a standard trick
def inverse_FFT(fft_P):
    n = len(fft_P)
    return FFT([fft_P[-i]/n for i in range(n)])
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult_using_FFT(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
n = 1
while n < res_len:
n *= 2
# Pad with extra 0s to reach length n
P = P + [0] * (n - n1)
Q = Q + [0] * (n - n2)
# Transform P and Q
fft_P = FFT(P)
fft_Q = FFT(Q)
# Calculate FFT of P*Q
fft_PQ = [p*q for p,q in zip(fft_P,fft_Q)]
# Inverse FFT
PQ = inverse_FFT(fft_PQ)
# Remove padding and return
return PQ[:res_len]
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult_using_FFT(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
n = 1
while n < res_len:
n *= 2
# Pad with extra 0s to reach length n
P = P + [0] * (n - n1)
Q = Q + [0] * (n - n2)
# Transform P and Q
fft_P = FFT(P)
fft_Q = FFT(Q)
# Calculate FFT of P*Q
fft_PQ = [p*q for p,q in zip(fft_P,fft_Q)]
# Inverse FFT
PQ = inverse_FFT(fft_PQ)
# Remove padding and return
return PQ[:res_len]
FFT implementation in C++ based on this algorithm
Here is an FFT implementation. It is coded in the same style as in KACTL.
typedef complex<double> C;
typedef vector<double> vd;
void fft(vector<C>& a) {
    int n = sz(a);
    static vector R{1.L + 0il};
    static vector rt{1. + 0i};
    for (static int k = 2; k < n; k *= 2) {
        R.resize(n/2); rt.resize(n/2);
        auto x = pow(1il, 2./k);
        rep(i,k/2,k) rt[i] = R[i] = R[i-k/2] * x;
    }
    for (int k = n; k > 1; k /= 2) rep(i,0,n/k) rep(j,i*k,i*k + k/2) {
        C &u = a[j], &v = a[j+k/2], &r = rt[i];
        C z(v.real()*r.real() - v.imag()*r.imag(),
            v.real()*r.imag() + v.imag()*r.real());
        v = u - z;
        u = u + z;
    }
    vi rev(n);
    rep(i,0,n) rev[i] = rev[i / 2] / 2 + (i & 1) * n / 2;
    rep(i,0,n) if (i < rev[i]) swap(a[i], a[rev[i]]);
}
vd conv(const vd& a, const vd& b) {
    if (a.empty() || b.empty()) return {};
    vd res(sz(a) + sz(b) - 1);
    int L = 32 - __builtin_clz(sz(res)), n = 1 << L;
    vector<C> in(n), out(n);
    copy(all(a), begin(in));
    rep(i,0,sz(b)) in[i].imag(b[i]);
    fft(in);
    for (C& x : in) x *= x;
    rep(i,0,n) out[i] = in[-i & (n - 1)] - conj(in[i]);
    fft(out);
    rep(i,0,sz(res)) res[i] = imag(out[i]) / (4 * n);
    return res;
}
(Advanced) Connection between this algorithm and NTT
Just like how there is FFT and NTT, there are two variants of this algorithm too: one using complex floating point numbers, and the other working modulo a prime (or more generally, modulo an odd composite number).
Using modulo integers instead of complex numbers
This algorithm requires three properties: first, it needs to be possible to divide by $$$2$$$; second, $$$\sqrt{c}$$$ needs to exist; and third, it needs to be possible to divide by $$$\sqrt{c}$$$. This means that we don't technically need complex numbers, we could also use other number systems (like working modulo a prime or modulo an odd composite number).
Primes that work nicely for this purpose are called "NTT primes", meaning that $$$p - 1$$$ is divisible by a large power of $$$2$$$. Common examples of NTT primes are $$$998244353 = 119 \cdot 2^{23} + 1$$$, $$$167772161 = 5 \cdot 2^{25} + 1$$$ and $$$469762049 = 7 \cdot 2^{26} + 1$$$.
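All three properties are cheap to get modulo such a prime. A small illustration (using the well-known fact that $$$g = 3$$$ generates the multiplicative group mod $$$998244353$$$):
p, g = 998244353, 3
inv2 = pow(2, p - 2, p)  # dividing by 2 = multiplying by inv2 (Fermat)
n = 1 << 10
c = pow(g, (p - 1) // n, p)  # an n-th root of unity
sqrt_c = pow(g, (p - 1) // (2 * n), p)  # exists since 2n divides p - 1
assert 2 * inv2 % p == 1 and sqrt_c * sqrt_c % p == c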
Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, 2*n, 1)
One very important remark is that it is possible to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ by multiplying $$$P(x) \, Q(x)$$$ unmodded, and afterwards modding the result by $$$(x^n - c)$$$. The reason why this is useful is that if $$$\sqrt{c}$$$ doesn't exist, then this trick allows us to go back to having $$$c = 1$$$. This can be costly time-wise, but it opens up the possibility of using "bad NTT primes"! For example $$$13 \cdot 2^{10} + 1$$$ is a pretty bad NTT prime, but it can still be used by this algorithm to efficiently multiply polynomials.
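In code, the remark amounts to something like this sketch (the helper name is mine; it reuses fast_polymult from earlier and the fold $$$x^{n+k} \equiv c \, x^k \pmod{x^n - c}$$$):
def polymult_mod_via_unmodded(P, Q, n, c):
    # Multiply without any modulus, then reduce by (x^n - c) in one fold
    R = fast_polymult(P, Q)       # length 2n - 1 when len(P) == len(Q) == n
    R = R + [0] * (2*n - len(R))  # pad to length exactly 2n
    return [r1 + c * r2 for r1, r2 in zip(R[:n], R[n:])]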
This algorithm works to some degree even for bad NTT primes
One of the things I dislike about NTT is that for NTT to be defined, there needs to exist an $$$n$$$-th root of unity. Usually problems involving NTT are designed so that this is never an issue. But if you want to use NTT where it hasn't been designed to magically work, then this is a really big issue. The NTT can become undefined!
Note that this algorithm does not exactly share the same drawback of being undefined. The reason for this is that if $$$\sqrt{c}$$$ doesn't exist, then the algorithm can simply choose to either switch over to an $$$O(n^2)$$$ polynomial multiplication algorithm, or fall back to fast unmodded polynomial multiplication. The implication is that this algorithm can do fast modded polynomial multiplication even if it is given a relatively bad NTT prime. I just find this property to be really cool!
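Checking whether $$$\sqrt{c}$$$ exists modulo an odd prime $$$p$$$ is cheap, using Euler's criterion (a small helper sketch):
def has_sqrt_mod(c, p):
    # Euler's criterion: non-zero c is a square mod an odd prime p
    # iff c^((p-1)/2) == 1 (mod p)
    return pow(c, (p - 1) // 2, p) == 1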
A good example of when NTT becomes undefined is the yosupo problem convolution_mod_large. Here the NTT mod is $$$998244353 = 119 \cdot 2^{23} + 1$$$. The tricky thing about the problem is that $$$n = 2^{24}$$$. Since $$$998244353 - 1 = 119 \cdot 2^{23}$$$, there won't exist any $$$n$$$-th root of unity, so the NTT of length $$$n$$$ is undefined. However, the divide and conquer approach from this blog can easily solve the problem by falling back to the $$$O(n^2)$$$ algorithm.
At the time of writing this, I currently hold the record for the fastest solution to Convolution (Large) at 2572 ms. But even a simple recursive version is enough to get AC at 9292 ms.
NTT implementation in Python based on this algorithm
Here is an NTT implementation. A codegolfed version of the same code can be found on Pyrival.
# Mod used for NTT
# Requirement: Any odd integer > 2
# It is important that MOD - 1 is
# divisible by lots of 2s
MOD = (119 << 23) + 1
assert MOD & 1
# Precalculate a non-quadratic residue (used by the NTT)
non_quad_res = 2
while pow(non_quad_res, MOD//2, MOD) != MOD - 1:
    non_quad_res += 1
rt = [1]
"""
Calculates NTT(P) in O(n log n) time.
This implementation is based on the
polynomial modulo multiplication algorithm.
Input:
P: A list of length n representing a polynomial P(x).
n needs to be a power of 2.
Output:
A list of length n representing the NTT of the polynomial P,
i.e. the list [P(root**i) % MOD for i in range(n)]
where root is an n-th root of unity mod MOD
"""
def NTT(P):
    n = len(P)
    # Assert n is a power of 2
    assert n and (n - 1) & n == 0
    # Check that NTT is defined for this n
    assert (MOD - 1) % n == 0
    # Make a copy of P to not modify the original P
    P = P[:]
    # Precalculate the roots
    while 2 * len(rt) < n:
        # 4*len(rt)-th root of unity
        root = pow(non_quad_res, MOD//(4 * len(rt)), MOD)
        rt.extend([r * root % MOD for r in rt])
    # Transform P
    k = n
    while k > 1:
        for i in range(n//k):
            r = rt[i]
            for j1 in range(i*k, i*k + k//2):
                j2 = j1 + k//2
                z = r * P[j2]
                P[j2] = (P[j1] - z) % MOD
                P[j1] = (P[j1] + z) % MOD
        k //= 2
    # Bit reverse P before returning
    rev = [0] * n
    for i in range(1, n):
        rev[i] = rev[i // 2] // 2 + (i & 1) * n // 2
    return [P[r] for r in rev]
# Inverse of NTT(P) using a standard trick
def inverse_NTT(ntt_P):
    n = len(ntt_P)
    n_inv = pow(n, -1, MOD) # Requires Python 3.8
    # The following works in any Python version, but requires MOD to be prime
    # n_inv = pow(n, MOD - 2, MOD)
    assert n * n_inv % MOD == 1
    return NTT([ntt_P[-i] * n_inv % MOD for i in range(n)])
"""
Calculates P(x) * Q(x) (where the coeffiecents are returned % MOD)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x) (with coeffients % MOD)
"""
def fast_polymult_using_NTT(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2
    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    # Transform P and Q
    ntt_P = NTT(P)
    ntt_Q = NTT(Q)
    # Calculate NTT of P*Q
    ntt_PQ = [p * q % MOD for p,q in zip(ntt_P,ntt_Q)]
    # Inverse NTT
    PQ = inverse_NTT(ntt_PQ)
    # Remove padding and return
    return PQ[:res_len]
NTT implementation in C++ based on this algorithm
Here is an NTT implementation. It is coded in the same style as in KACTL.
const ll mod = (119 << 23) + 1; // = 998244353
// For p < 2^30 there is also e.g. 5 << 25, 7 << 26, 479 << 21
// and 483 << 21. The last two are > 10^9.
typedef vector<ll> vl;
#include "../number-theory/ModPow.h"
void ntt(vl &a) {
    int n = sz(a);
    static ll r = 3;
    while (modpow(r, mod/2) + 1 < mod) ++r;
    static vl rt{1};
    for (static int k = 2; k < n; k *= 2) {
        rt.resize(n/2);
        ll x = modpow(r, mod/2/k);
        rep(i,k/2,k) rt[i] = rt[i-k/2] * x % mod;
    }
    for (int k = n; k > 1; k /= 2) rep(i,0,n/k) rep(j,i*k,i*k + k/2) {
        ll &u = a[j], &v = a[j+k/2], z = rt[i] * v % mod;
        v = u - z + (u < z ? mod : 0);
        u = u + z - (u + z >= mod ? mod : 0);
    }
    vi rev(n);
    rep(i,0,n) rev[i] = rev[i / 2] / 2 + (i & 1) * n / 2;
    rep(i,0,n) if (i < rev[i]) swap(a[i], a[rev[i]]);
}
vl conv(vl a, vl b) {
    if (a.empty() || b.empty()) return {};
    int s = sz(a) + sz(b) - 1, B = 32 - __builtin_clz(s), n = 1 << B;
    int inv = modpow(n, mod - 2);
    vl out(n);
    a.resize(n); b.resize(n);
    ntt(a), ntt(b);
    rep(i,0,n) out[-i & (n - 1)] = (ll)a[i] * b[i] % mod * inv % mod;
    ntt(out);
    return {out.begin(), out.begin() + s};
}
Blazingly fast NTT C++ implementation
At the time of writing this, I currently hold the record for the fastest solution to Convolution (Large) at 2572 ms. If what you want is a super fast C++ implementation, then I would suggest using this implementation.
(Advanced) Shorter implementations ("Codegolfed version")
It is possible to make really short but slightly less natural implementations of this algorithm. Originally I was thinking of using this shorter version in the blog, but in the end I didn't do it. So here they are. If you want a short implementation of this algorithm to use in practice, then I would recommend taking one of these implementations and porting it to C++.
Short Python implementation without any speedup tricks
"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time
Input:
  n: Integer, needs to be power of 2
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^(2n) - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^(2n) - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod3(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n
    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    # Base case
    if n == 1:
        return [P[0] * Q[0]]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod3(P, Q, n//2, sqrtc)
    PQ_plus = fast_polymult_mod3(P, Q, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult3(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
n = 1
while n < res_len:
n *= 2
# Pad with extra 0s to reach length 2*n
P = P + [0] * (2*n - n1)
Q = Q + [0] * (2*n - n2)
# Pick non-zero c arbitrarily =)
c = 123.24
# Calculate P*Q mod x^n - c
PQ = fast_polymult_mod3(P, Q, n, c)
# Remove extra 0 padding and return
return PQ[:res_len]
Short Python implementation supporting odd and even $n$ (making it up to 2 times faster)
"""
Calculates P(x) * Q(x) % (x^n - c) in O((a + b) * n) time, where n = a*2^b
Input:
  n: Integer
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^(2n) - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^(2n) - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod4(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n
    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod4(P, Q, n//2, sqrtc)
    PQ_plus = fast_polymult_mod4(P, Q, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult4(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
b = 0
alim = 10
while alim * 2**b < res_len:
b += 1
a = (res_len - 1) // 2**b + 1
n = a * 2**b
# Pad with extra 0s to reach length 2*n
P = P + [0] * (2*n - n1)
Q = Q + [0] * (2*n - n2)
# Pick non-zero c arbitrarily =)
c = 123.24
# Calculate P*Q mod x^n - c
PQ = fast_polymult_mod4(P, Q, n, c)
# Remove extra 0 padding and return
return PQ[:res_len]
Short Python implementation supporting odd and even $n$ and imaginary-cyclic convolution (making it up to 4 times faster)
"""
Calculates P(x) * Q(x) % (x^n - c) in O((a + b) * n) time, where n = a*2^b
Input:
  n: Integer
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^(2n) - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^(2n) - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod4(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n
    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod4(P, Q, n//2, sqrtc)
    PQ_plus = fast_polymult_mod4(P, Q, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
"""
Calculates P(x) * Q(x) of two real polynomials
Input:
P: A list representing a real polynomial P(x)
Q: A list representing a real polynomial Q(x)
Output:
A list representing the real polynomial P(x) * Q(x)
"""
def fast_polymult5(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
b = 1
alim = 10
while alim * 2**b < res_len:
b += 1
a = (res_len - 1) // 2**b + 1
n = a * 2**b
# Pick c = i (imaginary unit)
c = 1j
# and decrease the size of n by a factor of 2
n //= 2
# Pad with extra 0s to reach length 2*n
P = P + [0] * (2*n - n1)
Q = Q + [0] * (2*n - n2)
# Calculate P*Q mod x^n - i
PQ = fast_polymult_mod4(P, Q, n, c)
# The imaginary part contains the "overflow"
PQ = [pq.real for pq in PQ] + [pq.imag for pq in PQ]
# Remove extra 0 padding and return
return PQ[:res_len]
A somewhat direct and compact way to obtain the key formula is as follows.
Note that $$$x^n-c \equiv -2c \pmod{x^n+c}$$$ and $$$x^n+c \equiv 2c \pmod{x^n-c}$$$. Thus, the expression
$$$ \frac{(x^n+c)\,Q(x) - (x^n-c)\,P(x)}{2c} $$$
has remainder $$$Q(x)$$$ modulo $$$(x^n-c)$$$ and $$$P(x)$$$ modulo $$$(x^n+c)$$$. This is very similar to the Chinese remainder theorem.
But since FFT is commonly used as a black box, it's very hard to estimate how useful this recursion really is until it's checked on problems designed to compare various FFT implementations with one another, such as Convolution or Convolution (large)...
I tested out the suggested NTT implementation vs the current KACTL implementation.
So, it seems roughly the same, but maybe the approach from this blog would be better if it's used for convolution directly, rather than through NTT? Unfortunately, the blog doesn't provide a C++ implementation for this case...
The two implementations take about the same time because computationally they are almost the same algorithm.
However, it is possible to make something that runs faster than standard NTT. Here are two things to consider:
1. You need almost no extra padding using the method from this blog compared to standard NTT.
2. You can make the base case be something like `if (n <= 64)`, and then switch to a fast $$$O(n^2)$$$ polynomial multiplication algorithm. This can be significantly faster than going down all the way to $$$n = 1$$$.
Yes, I agree. Because of that I have been working on making efficient C++ implementations.
Here is a simple recursive implementation that makes use of 1. and 2.: (239 ms), and here is a slightly more complicated iterative version: (205 ms). Adding Montgomery and FastIO to the iterative version brings the time down to 63 ms. This same implementation also holds the current record for the fastest solution to Convolution (Large).
Great blog!
Below I'll try to compare this blog's algorithm and the standard FFT that splits into even and odd indices (probably that's the Cooley–Tukey algorithm, or only a special case of it, I don't know).
Let $$$n = n_1 n_2$$$. I was told in my university algorithms course that every discrete Fourier transform of size $$$n$$$ (that is, calculating $$$\sum\limits_{i = 0}^{n - 1}a_i \omega^{ji}$$$ for $$$0 \le j < n$$$ where $$$\omega^n = 1$$$) could be done via $$$n_2$$$ discrete Fourier transforms of size $$$n_1$$$, followed by $$$n_1$$$ transforms of size $$$n_2$$$ on the result (plus some $$$\mathcal{O}(n)$$$ stuff). It decomposes indices and powers into the form $$$in_2 + j$$$ or $$$k + n_1l$$$ and uses the cyclicity of $$$\omega$$$.
Let $$$a \in \mathbb{C}^n, DFT_{n,\omega} := \left[\omega^{ij}\right]_{i=0..n}^{j=0..n}, \quad \omega_1 := \omega^{n_1}, \omega_2 := \omega^{n_2}$$$. Then
is an $$$n_2 \times n_1$$$ matrix with the values of $$$DFT_{n,\omega} \cdot a$$$ written row by row.
Here $$$\left[x_{i,j}\right]_{i=0..m}^{j=0..k}$$$ denotes $$$m\times k$$$ matrix for $$$0 \le i < m$$$ and $$$0 \le j < k$$$; $$$\circ$$$ is element-wise product.
Of course, the equality holds only if $$$\omega^n = 1$$$.
Curiously enough, all ways to factorize $$$n$$$ (to small enough factors) result in the same complexity. For example, $$$T(n) = 2\cdot T\left(\frac{n}{2}\right) + \frac{n}{2}\cdot T(2) + \mathcal{O}(n)$$$ and $$$T(n) = 2\sqrt{n}\cdot T(\sqrt{n}) + \mathcal{O}(n)$$$ both resolve to $$$T(n) = \mathcal{O}(n\log n)$$$.
Introduce $$$D(n) = \frac{T(n)}{n}$$$ and rewrite the recurrence in terms of it.
So, one has some freedom while calculating a discrete Fourier transform of size $$$n$$$. The most common scheme is to split the elements into two parts, odd and even, go into recursion and then merge the results. It corresponds to $$$n_1 = \frac{n}{2}, n_2 = 2$$$: firstly you calculate $$$2$$$ independent DFTs of size $$$n_1$$$ and then merge them, in fact by $$$n_1$$$ DFTs of size $$$2$$$.
The algorithm from this blog is different. For $$$c = 1$$$ it's the same as $$$n_1 = 2$$$ and $$$n_2 = \frac{n}{2}$$$: firstly you process the data by taking the polynomials modulo $$$x^{n/2} \pm \sqrt{c}$$$ (in fact, doing $$$n_2$$$ DFTs of size 2), then go into recursion (and there is no need to process the results afterwards). In the recursion $$$c$$$ alters, but the reweight technique mentioned in the blog shows us that it doesn't matter.
Nevertheless, it was really unexpected to me that this interpretation is somehow able to deal with arbitrary $$$c$$$ in such a natural way. I don't know (putting aside general calculations for arbitrary $$$n_1, n_2$$$) which algorithm is easier for a beginner to understand. The main fundamental difference, as far as I can see, is that this interpretation is more "polynomial": you don't calculate values at some points and then interpolate, but take polynomials modulo. This allows more general tricks such as "bad NTT primes" and "avoiding padding too many zeroes to make n a power of 2". These tricks are impossible with even-odd FFT, aren't they?
The last thing to notice is that I implemented in C++ both even-odd FFT (iterative (723 ms), recursive (879 ms)) and the $$$x^n - c$$$ version (iterative (537 ms), recursive (699 ms)). Putting aside that they don't show high performance, in my implementation this blog's algorithm is a bit faster than the standard one.
Actually there is a way to do this with even-odd FFT (Cooley-Tukey) too.
Suppose $$$n = a \, 2^b$$$. Start by splitting up the coefficients of your polynomials depending on their index mod `a`. This will split up $$$P$$$ into $$$a$$$ polynomials, and $$$Q$$$ into $$$a$$$ polynomials. Note that it is possible to construct $$$P \, Q$$$ by stitching together the products between all pairs of $$$P$$$ polynomials and $$$Q$$$ polynomials. If implemented efficiently, this will run in $$$O((a + b) \, n)$$$. In comparison, using the algorithm from this blog results in $$$O((a + b) \, n)$$$.
So the conclusion is that there are some analogues to "bad NTT primes" and "avoiding padding too many zeroes to make n a power of 2" with even-odd FFT. But I think that using even-odd FFT will result in a more complicated and slower running algorithm.
About performance: I think there are two things that you are missing which will make this algorithm a lot faster than even-odd FFT (Cooley–Tukey).
Firstly, pick $$$n = a \, 2^b$$$ in order to minimize padding. This won't matter for the Yosupo problem you benchmarked on, since the input constraints are "nice" (Yosupo uses a power of 2). But in general this will save a factor of two in time and memory over even-odd FFT.
Secondly, try using a more advanced base case than $$$n = 1$$$. If you instead make the base case something like `if (n <= 64)`, then you will have a much faster running algorithm. The reason for this is that if you have low degree polynomials, then the fastest thing to do is to just multiply them in $$$O(n^2)$$$ time. This turns out to give a fairly significant speed boost.
My implementations take: recursive (239 ms), iterative (205 ms). Adding Montgomery and FastIO to the iterative version brings the time down to 63 ms. This same implementation also holds the current record for the fastest solution to Convolution (Large).
Hm, indeed, it's possible to implement even-odd FFT in $$$\mathcal{O}((a + b)n)$$$ using DFTs instead of multiplications as the subroutine.
Regarding performance: yes, I see now (and could have easily guessed before) that there is much room to optimize. I tried to push it a little more (based on your code), and it indeed becomes a bit faster, but not much. So, I just wanted to show that in my particular implementation this blog's algorithm runs ~20% faster than even-odd FFT, but given its inefficiency, the result doesn't really matter.
I ain’t reading allat
I was not exactly sure about how it worked, but it seems that I have had a very similar (if not exactly the same) implementation of this NTT transform in our ICPC notebook from two years ago. I think I stole it from the yosupo judge when looking for a fast but easy to code implementation of NTT.
Also, this implementation seems to not require bit reversal at all.
I think my reasoning at that time was that the Hadamard linear transform that appears when doing polynomial evaluation can be written as a product of elementary "butterfly" matrices, but they can be evaluated in any order, and this implementation just chooses to compute it in reverse (large to small instead of small to large). I might be wrong, though, as my memory is really foggy nowadays.
Oh cool. That's a really nice and short implementation!
The way I understand it is that FFT algorithms require bit reversal if they work in-place in memory. Your implementation gets around this by not working in-place. So in a sense, `swap(a, b);` is your bit reversal.
Yes, that's true. It's nice that it makes the code shorter, and it also seemed to be a bit faster than bit reversal solutions in my experiments (probably due to cache friendliness).
I tried implementing this style of NTT combined with the algorithm from this blog. The code becomes really neat and short, but it also seems about 30% slower compared to the in-place version.
Even if it turns out to be slower, I think the added simplicity is likely worth it for CP. It is also possible to codegolf it down a ton, for example you can switch `(MOD - 1) / 2` to `MOD / 2` and `(MOD - 1) / (n / step)` to `MOD / n * step`. You also don't need to explicitly invert $$$g$$$, instead you could just do `ModInt w = g.pow(rev ? MOD - 1 - MOD / n * step : MOD / n * step)`.
Discussed it with some polynomial-pros and found out that this algorithm is well known in China:
https://www.luogu.com.cn/blog/chtholly-willem/you-hua-DIT-DIF-NTT
https://blog.seniorious.cc/2023/relearn-FFT/
yet another "well known in China" time
Is there some funny name for it, like for other things that are well known in China?
it's called DIF-DIT, not funny enough for a name though