Hi Codeforces!
I have something exciting to tell you guys about today! I have recently come up with a really neat and simple recursive algorithm for multiplying polynomials in $$$O(n \log n)$$$ time. It is so neat and simple that I think it might possibly revolutionize the way that fast polynomial multiplication is taught and coded. You don't need to know anything about FFT to understand and implement this algorithm.
Big thanks to nor, c1729 and Spheniscine for discussing the contents of the blog with me and coming up with ideas for how to improve it =).
I've split this blog up into two parts. The first part is intended for anyone to be able to read and understand. The second part is advanced and goes into a ton of interesting ideas and concepts related to this algorithm.
Prerequisite: Polynomial quotient and remainder, see the Wiki article and this Stackexchange example.
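As a small worked example: dividing $$$x^3 + 2x^2 + 3x + 4$$$ by $$$x^2 - c$$$ gives
$$$ x^3 + 2x^2 + 3x + 4 = (x + 2)(x^2 - c) + (c + 3)x + (2c + 4), $$$
so $$$(x^3 + 2x^2 + 3x + 4) \% (x^2 - c) = (c + 3)x + (2c + 4)$$$. In other words, the coefficients of $$$x^2$$$ and $$$x^3$$$ simply get multiplied by $$$c$$$ and folded down onto $$$1$$$ and $$$x$$$.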
Task:
Given two polynomials $$$P$$$ and $$$Q$$$, an integer $$$n$$$ and a non-zero complex number $$$c$$$, where degree $$$P < n$$$ and degree $$$Q < n$$$. Your task is to calculate the polynomial $$$P(x) \, Q(x) \% (x^n - c)$$$ in $$$O(n \log n)$$$ time. You may assume that $$$n$$$ is a power of two.
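For example, for $$$n = 2$$$ and $$$c = 1$$$: $$$(1 + x)(2 + x) = 2 + 3x + x^2 \equiv 3 + 3x \pmod{x^2 - 1}$$$, since $$$x^2 \equiv 1$$$.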
Solution:
We can create a divide and conquer algorithm for $$$P(x) \, Q(x) \% (x^n - c)$$$ based on the difference of squares formula. Assuming $$$n$$$ is even, then $$$(x^n - c) = (x^{n/2} - \sqrt{c}) (x^{n/2} + \sqrt{c})$$$. The idea behind the algorithm is to calculate $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$ using 2 recursive calls, and then use that result to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$.
So how do we actually calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ using $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$?
Well, we can use the following formula:
$$$ \begin{aligned} A(x) \% (x^n - c) = &\frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) \, + \\ &\frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right). \end{aligned} $$$
Proof of the formula
Note that
$$$ A(x) = \frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) A(x) + \frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) A(x). $$$
Let $$$Q^-(x)$$$ denote the quotient of $$$A(x)$$$ divided by $$$(x^{n/2} - \sqrt{c})$$$ and let $$$Q^+(x)$$$ denote the quotient of $$$A(x)$$$ divided by $$$(x^{n/2} + \sqrt{c})$$$. Then
$$$ \begin{aligned} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) A(x) &= \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left((A(x) \% (x^{n/2} - \sqrt{c})) + Q^-(x) (x^{n/2} - \sqrt{c})\right) \\ &= \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) + \frac{1}{\sqrt{c}} Q^-(x) \, (x^n - c) \end{aligned} $$$
and
$$$ \begin{aligned} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) A(x) &= \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left((A(x) \% (x^{n/2} + \sqrt{c})) + Q^+(x) (x^{n/2} + \sqrt{c})\right) \\ &= \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right) - \frac{1}{\sqrt{c}} Q^+(x) (x^n - c). \end{aligned} $$$
With this we have shown that
$$$ \begin{aligned} A(x) = &\frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) \, + \\ &\frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right) \, + \\ &\frac{1}{\sqrt{c}} \frac{Q^-(x) - Q^+(x)}{2} (x^n - c). \end{aligned} $$$
Here $$$A(x)$$$ is expressed as remainder + quotient times $$$(x^n - c)$$$. So we have proven the formula.
This formula is very useful. If we substitute $$$A(x)$$$ by $$$P(x) Q(x)$$$, then the formula tells us how to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ using $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$ in linear time. With this we have the recipe for implementing an $$$O(n \log n)$$$ divide and conquer algorithm:
Input:
- Integer $$$n$$$ (power of 2),
- Non-zero complex number $$$c$$$,
- Two polynomials $$$P(x) \% (x^n - c)$$$ and $$$Q(x) \% (x^n - c)$$$.
Output:
- The polynomial $$$P(x) \, Q(x) \% (x^n - c)$$$.
Algorithm:
Step 1. (Base case) If $$$n = 1$$$, then return $$$P(0) \cdot Q(0)$$$. Otherwise:
Step 2. Starting from $$$P(x) \% (x^n - c)$$$ and $$$Q(x) \% (x^n - c)$$$, in $$$O(n)$$$ time calculate
$$$ \begin{align} & P(x) \% (x^{n/2} - \sqrt{c}), \\ & Q(x) \% (x^{n/2} - \sqrt{c}), \\ & P(x) \% (x^{n/2} + \sqrt{c}) \text{ and} \\ & Q(x) \% (x^{n/2} + \sqrt{c}). \end{align} $$$
Step 3. Make two recursive calls to calculate $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$.
Step 4. Using the formula, calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ in $$$O(n)$$$ time. Return the result.
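Note that the reduction in Step 2 is just a fold: modulo $$$(x^{n/2} \mp \sqrt{c})$$$ we have $$$x^{n/2} \equiv \pm \sqrt{c}$$$, so writing $$$P(x) = P_{\text{low}}(x) + x^{n/2} \, P_{\text{high}}(x)$$$ with $$$\deg P_{\text{low}}, \deg P_{\text{high}} < n/2$$$ gives
$$$ P(x) \% (x^{n/2} \mp \sqrt{c}) = P_{\text{low}}(x) \pm \sqrt{c} \, P_{\text{high}}(x). $$$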
Here is a Python implementation following this recipe:
Python solution to the task
"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time
Input:
  n: Integer, needs to be power of 2
  c: Non-zero complex floating point number
  P: A list of length n representing a polynomial P(x) % (x^n - c)
  Q: A list of length n representing a polynomial Q(x) % (x^n - c)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod(P, Q, n, c):
    assert len(P) == n and len(Q) == n
    # Base case
    if n == 1:
        return [P[0] * Q[0]]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Calculate P_minus := P mod (x^(n/2) - sqrt(c))
    #           Q_minus := Q mod (x^(n/2) - sqrt(c))
    P_minus = [p1 + sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_minus = [q1 + sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Calculate P_plus := P mod (x^(n/2) + sqrt(c))
    #           Q_plus := Q mod (x^(n/2) + sqrt(c))
    P_plus = [p1 - sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_plus = [q1 - sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod(P_minus, Q_minus, n//2, sqrtc)
    PQ_plus = fast_polymult_mod(P_plus, Q_plus, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
One final thing that I want to mention before going into the advanced section is that this algorithm can also be used to do fast unmodded polynomial multiplication, i.e. given polynomials $$$P(x)$$$ and $$$Q(x)$$$ calculate $$$P(x) \, Q(x)$$$. The trick is simply to pick $$$n$$$ large enough such that $$$P(x) \, Q(x) = P(x) \, Q(x) \% (x^n - c)$$$, and then use the exact same algorithm as before. $$$c$$$ can be arbitrarily picked (any non-zero complex number works).
Python implementation for general fast polynomial multiplication
"""
Calculates P(x) * Q(x)
Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2
    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    # Pick non-zero c arbitrarily =)
    c = 123.24
    # Calculate P*Q mod (x^n - c)
    PQ = fast_polymult_mod(P, Q, n, c)
    # Remove extra 0 padding and return
    return PQ[:res_len]
If you want to try out implementing this algorithm yourself, then here is a very simple problem to test out your implementation on: SPOJ:POLYMUL.
(Advanced) Speeding up the algorithm
This section will be about tricks that can be used to speed up the algorithm. The first two tricks speed up the algorithm by a factor of 2 each. The last trick is advanced, and it has the potential to both speed up the algorithm and make it more numerically stable.
$n$ doesn't actually need to be a power of 2
We don't actually need the assumption that $$$n$$$ is a power of 2. If $$$n$$$ ever becomes odd during the recursion, then we have two choices: either fall back to an $$$O(n^2)$$$ algorithm, or fall back to the unmodded $$$O(n \log{n})$$$ polynomial multiplication algorithm.
Let us discuss the run time of falling back to the $$$O(n^2)$$$ algorithm when $$$n$$$ becomes odd. Assume that $$$n = a \cdot 2^b$$$, where $$$a$$$ is an odd integer and $$$b$$$ is an integer. Think of the recursive algorithm as having layers, one layer for each possible value of $$$n$$$. The first $$$b$$$ layers will all take $$$O(n)$$$ time each. In the $$$(b+1)$$$-th layer the value of $$$n$$$ is $$$a$$$. Using the $$$O(n^2)$$$ polynomial multiplication algorithm leads to this layer taking $$$O(n/a \cdot a^2) = O(n \cdot a)$$$ time. The final time complexity comes out to be $$$O((a + b) \, n)$$$.
Python implementation that works for both odd and even $n$
"""
Calculates P(x) * Q(x) % (x^n - c) in O((a + b) * n) time, where n = a*2^b.
Input:
  n: Integer
  c: Non-zero complex floating point number
  P: A list of length n representing a polynomial P(x) % (x^n - c)
  Q: A list of length n representing a polynomial Q(x) % (x^n - c)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod2(P, Q, n, c):
    assert len(P) == n and len(Q) == n
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Calculate P_minus := P mod (x^(n/2) - sqrt(c))
    #           Q_minus := Q mod (x^(n/2) - sqrt(c))
    P_minus = [p1 + sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_minus = [q1 + sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Calculate P_plus := P mod (x^(n/2) + sqrt(c))
    #           Q_plus := Q mod (x^(n/2) + sqrt(c))
    P_plus = [p1 - sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_plus = [q1 - sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod2(P_minus, Q_minus, n//2, sqrtc)
    PQ_plus = fast_polymult_mod2(P_plus, Q_plus, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
The reason why this is super useful is that it allows us to speed up the fast unmodded polynomial multiplication algorithm. As long as we are fine with $$$a$$$ being less than, say, $$$10$$$, we might be able to choose a significantly smaller $$$n$$$ compared to what would be possible if we were only allowed to choose powers of two. This trick has the potential to make the fast unmodded polynomial multiplication algorithm run twice as fast.
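For example, if the result has length $$$1\,100\,000$$$, then padding to a power of two forces $$$n = 2^{21} = 2\,097\,152$$$, while allowing $$$a \le 10$$$ gives $$$n = 9 \cdot 2^{17} = 1\,179\,648$$$, almost half the size.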
Python implementation for more efficient fast unmodded polynomial multiplication
"""
Calculates P(x) * Q(x)
Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult2(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    # Pick n = a * 2^b sufficiently big, with a <= alim
    b = 0
    alim = 10
    while alim * 2**b < res_len:
        b += 1
    a = (res_len - 1) // 2**b + 1
    n = a * 2**b
    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    # Pick non-zero c arbitrarily =)
    c = 123.24
    # Calculate P*Q mod (x^n - c)
    PQ = fast_polymult_mod2(P, Q, n, c)
    # Remove extra 0 padding and return
    return PQ[:res_len]
Imaginary-cyclic convolution
Suppose that $$$P(x)$$$ and $$$Q(x)$$$ are two real polynomials, and that we want to calculate $$$P(x) \, Q(x)$$$. As discussed earlier, we can calculate the unmodded polynomial product by picking $$$n$$$ large enough such that $$$(P(x) \, Q(x)) \% (x^n - c) = P(x) \, Q(x)$$$ (here $$$c$$$ is any non-zero complex number), and then running the divide and conquer algorithm. But it turns out there is something smarter that we can do.
If we use $$$c = \text{i}$$$ (the imaginary unit) as the initial value of $$$c$$$, then this will allow us to pick an even smaller value for $$$n$$$. The reason for this is that if we get "overflow" from $$$n$$$ being too small, then that overflow will be placed into the imaginary part of the result $$$(P(x) \, Q(x)) \% (x^n - \text{i})$$$. This means that by using $$$c = \text{i}$$$ we are allowed to pick $$$n$$$ as half the size compared to if we weren't using $$$c = \text{i}$$$. So this trick speeds up the fast unmodded polynomial multiplication algorithm by exactly a factor of 2.
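Here is a minimal sketch of the trick (the wrapper name is mine; it reuses fast_polymult_mod from the first Python snippet, so $$$n$$$ should be a power of two and $$$P$$$, $$$Q$$$ real coefficient lists of length $$$n$$$):
def real_polymult_via_imag_cyclic(P, Q, n):
    PQ = fast_polymult_mod(P, Q, n, 1j)  # P * Q % (x^n - i)
    # Modulo (x^n - i) we have x^(n+k) = i * x^k, so for real inputs
    # coefficient k of P*Q lands in the real part and
    # coefficient k + n lands in the imaginary part
    return [z.real for z in PQ] + [z.imag for z in PQ]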
Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, n, 1) (reweight technique)
There is a somewhat well-known technique called "reweighting" that allows us to switch between working with $$$\% (x^n - c)$$$ and working with $$$\% (x^n - 1)$$$. I've previously written a blog explaining this technique, see here. The technique uses the $$$(2^b)$$$-th root of $$$c$$$ to switch variables from $$$\frac{x^n}{c}$$$ to $$$y^n$$$, where $$$n = a \cdot 2^b$$$.
So why would we be interested in switching from $$$\% (x^n - c)$$$ to $$$\% (x^n - 1)$$$? The reason is that with $$$c = 1$$$ we don't need to bother with multiplying or dividing by $$$c$$$ or $$$\sqrt{c}$$$ anywhere, since $$$c = \sqrt{c} = 1$$$. Additionally, if $$$c = -1$$$, $$$c = \text{i}$$$ or $$$c = -\text{i}$$$, then multiplying or dividing by $$$c$$$ can be done very efficiently. So whenever $$$c$$$ becomes something other than $$$1, -1, \text{i}$$$ or $$$-\text{i}$$$, it makes sense to use the reweight trick to switch back to $$$c = 1$$$. This significantly reduces the number of floating point operations used by the algorithm. Fewer floating point operations means the algorithm has the potential to be both faster and more numerically stable. So reweighting is definitely something to consider if you want to create a heavily optimized polynomial multiplication implementation.
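As a rough illustration of the idea (a sketch; the helper name is mine, and for simplicity it uses an $$$n$$$-th root of $$$c$$$ rather than the $$$(2^b)$$$-th root mentioned above): substituting $$$x = w y$$$ with $$$w^n = c$$$ turns $$$x^n - c$$$ into $$$c \, (y^n - 1)$$$, so we can multiply reweighted polynomials mod $$$(y^n - 1)$$$ and undo the weights afterwards.
import cmath

def polymult_mod_via_reweight(P, Q, n, c):
    # Sketch: reduce the general c case to the c = 1 case
    w = cmath.exp(cmath.log(c) / n)  # some n-th root of c
    Pw = [p * w**i for i, p in enumerate(P)]
    Qw = [q * w**i for i, q in enumerate(Q)]
    R = fast_polymult_mod(Pw, Qw, n, 1)  # P(wy) * Q(wy) % (y^n - 1)
    return [r / w**i for i, r in enumerate(R)]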
This algorithm is actually FFT in disguise. But it is also different from any other FFT algorithm that I've seen in the past (for example the Cooley–Tukey FFT algorithm).
Using this algorithm to calculate FFT
In the tail of the recursion (i.e. when $$$n$$$ reaches 1), you are calculating $$$P(x) \, Q(x) \% (x - c)$$$ for some non-zero complex number $$$c$$$. This is in fact the same thing as evaluating the polynomial $$$P(x) \, Q(x)$$$ at $$$x = c$$$. Furthermore, if you initially started with $$$c = 1$$$, then the $$$c$$$ in the tail will be some $$$n$$$-th root of unity. If you analyze it more carefully, you will see that each tail corresponds to a different $$$n$$$-th root of unity. So what the algorithm is actually doing is evaluating $$$P(x) \, Q(x)$$$ at all possible $$$n$$$-th roots of unity.
The $$$n$$$-th order FFT of a polynomial is defined as the polynomial evaluated at all $$$n$$$-th roots of unity. This means that the algorithm is in fact an FFT algorithm. However, if you want to use it to calculate FFT, make sure you order the $$$n$$$-th roots of unity according to the standard order used for FFT algorithms. The standard order is $$$\exp{(\frac{2 \pi \text{i}}{n} 0)}, \exp{(\frac{2 \pi \text{i}}{n} 1)}, ..., \exp{(\frac{2 \pi \text{i}}{n} (n-1))}$$$. To get the ordering correct, you will probably need to do a "bit reversal" at the end.
This algorithm is not the same algorithm as Cooley–Tukey
The Cooley–Tukey algorithm is the standard algorithm for calculating FFT. It is for example used in this blog: [Tutorial] FFT by -is-this-fft-. The idea behind the algorithm is to split up the polynomial $$$P(x)$$$ into an even part $$$P_{\text{even}}(x^2)$$$ and an odd part $$$x \, P_{\text{odd}}(x^2)$$$. You can calculate the FFT of $$$P(x)$$$ using the FFTs of $$$P_{\text{even}}(x)$$$ and $$$P_{\text{odd}}(x)$$$. So Cooley–Tukey is an $$$O(n \log{n})$$$ divide and conquer algorithm that repeatedly splits up the polynomial into odd and even parts.
The wiki article for Cooley–Tukey has a nice description of the algorithm:
$$$ \begin{align} X_k &= E_k + e^{- \frac{2 \pi \text{i}}{n} k} O_k, \\ X_{k+\frac{n}{2}} &= E_k - e^{- \frac{2 \pi \text{i}}{n} k} O_k. \end{align} $$$
If you compare this to calculating FFT using the divide and conquer polynomial mod method, you instead get
$$$ \begin{align} X_k &= E_k + c \, O_k, \\ X_{k+\frac{n}{2}} &= E_k - c \, O_k, \end{align} $$$
where $$$c$$$ is an $$$n$$$-th root of unity that is independent of $$$k$$$. This is very different from Cooley–Tukey, since $$$c$$$ doesn't have a dependence on $$$k$$$ unlike $$$e^{- \frac{2 \pi \text{i}}{n} k}$$$. In fact, $$$c$$$ being constant means that the polynomial mod method has the potential to be faster than Cooley–Tukey.
FFT implementation in Python based on this algorithm
Here is an FFT implementation. A codegolfed version of the same code can be found on Pyrival.
"""
Calculates FFT(P) in O(n log n) time.
This implementation is based on the
polynomial modulo multiplication algorithm.
Input:
P: A list of length n representing a polynomial P(x).
n needs to be a power of 2.
Output:
A list of length n representing the FFT of the polynomial P,
i.e. the list [P(exp(2j pi / n * i) for i in range(n)]
"""
rt = [1] # List used to store roots of unity
def FFT(P):
    n = len(P)
    # Assert n is a power of 2
    assert n and (n - 1) & n == 0
    # Make a copy of P to not modify the original P
    P = P[:]
    # Precalculate the roots
    while 2 * len(rt) < n:
        # 4*len(rt)-th root of unity
        import cmath
        root = cmath.exp(2j * cmath.pi / (4 * len(rt)))
        rt.extend([r * root for r in rt])
    # Transform P
    k = n
    while k > 1:
        for i in range(n//k):
            r = rt[i]
            for j1 in range(i*k, i*k + k//2):
                j2 = j1 + k//2
                z = r * P[j2]
                P[j2] = P[j1] - z
                P[j1] += z
        k //= 2
    # Bit reverse P before returning
    rev = [0] * n
    for i in range(1, n):
        rev[i] = rev[i // 2] // 2 + (i & 1) * n // 2
    return [P[r] for r in rev]
# Inverse of FFT(P) using a standard trick
def inverse_FFT(fft_P):
    n = len(fft_P)
    return FFT([fft_P[-i]/n for i in range(n)])
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult_using_FFT(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
n = 1
while n < res_len:
n *= 2
# Pad with extra 0s to reach length n
P = P + [0] * (n - n1)
Q = Q + [0] * (n - n2)
# Transform P and Q
fft_P = FFT(P)
fft_Q = FFT(Q)
# Calculate FFT of P*Q
fft_PQ = [p*q for p,q in zip(fft_P,fft_Q)]
# Inverse FFT
PQ = inverse_FFT(fft_PQ)
# Remove padding and return
return PQ[:res_len]
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult_using_FFT(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
n = 1
while n < res_len:
n *= 2
# Pad with extra 0s to reach length n
P = P + [0] * (n - n1)
Q = Q + [0] * (n - n2)
# Transform P and Q
fft_P = FFT(P)
fft_Q = FFT(Q)
# Calculate FFT of P*Q
fft_PQ = [p*q for p,q in zip(fft_P,fft_Q)]
# Inverse FFT
PQ = inverse_FFT(fft_PQ)
# Remove padding and return
return PQ[:res_len]
FFT implementation in C++ based on this algorithm
Here is an FFT implementation. It is coded in the same style as in KACTL.
typedef complex<double> C;
typedef vector<double> vd;
void fft(vector<C>& a) {
    int n = sz(a);
    static vector R{1.L + 0il};
    static vector rt{1. + 0i};
    for (static int k = 2; k < n; k *= 2) {
        R.resize(n/2); rt.resize(n/2);
        auto x = pow(1il, 2./k);
        rep(i,k/2,k) rt[i] = R[i] = R[i-k/2] * x;
    }
    for (int k = n; k > 1; k /= 2) rep(i,0,n/k) rep(j,i*k,i*k + k/2) {
        C &u = a[j], &v = a[j+k/2], &r = rt[i];
        C z(v.real()*r.real() - v.imag()*r.imag(),
            v.real()*r.imag() + v.imag()*r.real());
        v = u - z;
        u = u + z;
    }
    vi rev(n);
    rep(i,0,n) rev[i] = rev[i / 2] / 2 + (i & 1) * n / 2;
    rep(i,0,n) if (i < rev[i]) swap(a[i], a[rev[i]]);
}
vd conv(const vd& a, const vd& b) {
    if (a.empty() || b.empty()) return {};
    vd res(sz(a) + sz(b) - 1);
    int L = 32 - __builtin_clz(sz(res)), n = 1 << L;
    vector<C> in(n), out(n);
    copy(all(a), begin(in));
    rep(i,0,sz(b)) in[i].imag(b[i]);
    fft(in);
    for (C& x : in) x *= x;
    rep(i,0,n) out[i] = in[-i & (n - 1)] - conj(in[i]);
    fft(out);
    rep(i,0,sz(res)) res[i] = imag(out[i]) / (4 * n);
    return res;
}
(Advanced) Connection between this algorithm and NTT
Just like how there is FFT and NTT, there are two variants of this algorithm too: one using complex floating point numbers, and the other working modulo a prime (or more generally, modulo an odd composite number).
Using modulo integers instead of complex numbers
This algorithm requires three properties: first, it needs to be possible to divide by $$$2$$$; second, $$$\sqrt{c}$$$ needs to exist; and third, it needs to be possible to divide by $$$\sqrt{c}$$$. This means that we don't technically need complex numbers, we could also use other number systems (like working modulo a prime or modulo an odd composite number).
Primes that work nicely for this purpose are called "NTT primes", meaning that $$$p - 1$$$ is divisible by a large power of $$$2$$$. Common examples of NTT primes are $$$998244353 = 119 \cdot 2^{23} + 1$$$, $$$167772161 = 5 \cdot 2^{25} + 1$$$ and $$$469762049 = 7 \cdot 2^{26} + 1$$$.
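All three properties are cheap to get modulo such a prime. A small illustration (using the well-known fact that $$$g = 3$$$ generates the multiplicative group mod $$$998244353$$$):
p, g = 998244353, 3
inv2 = pow(2, p - 2, p)  # dividing by 2 = multiplying by inv2 (Fermat)
n = 1 << 10
c = pow(g, (p - 1) // n, p)  # an n-th root of unity
sqrt_c = pow(g, (p - 1) // (2 * n), p)  # exists since 2n divides p - 1
assert 2 * inv2 % p == 1 and sqrt_c * sqrt_c % p == c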
Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, 2*n, 1)
One very important remark is that it is possible to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ by multiplying $$$P(x) \, Q(x)$$$ unmodded, and afterwards modding the result by $$$(x^n - c)$$$. The reason why this is useful is that if $$$\sqrt{c}$$$ doesn't exist, then this trick allows us to go back to having $$$c = 1$$$. This can be costly time-wise, but it opens up the possibility of using "bad NTT primes"! For example $$$13 \cdot 2^{10} + 1$$$ is a pretty bad NTT prime, but it can still be used by this algorithm to efficiently multiply polynomials.
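In code, the remark amounts to something like this sketch (the helper name is mine; it reuses fast_polymult from earlier and the fold $$$x^{n+k} \equiv c \, x^k \pmod{x^n - c}$$$):
def polymult_mod_via_unmodded(P, Q, n, c):
    # Multiply without any modulus, then reduce by (x^n - c) in one fold
    R = fast_polymult(P, Q)       # length 2n - 1 when len(P) == len(Q) == n
    R = R + [0] * (2*n - len(R))  # pad to length exactly 2n
    return [r1 + c * r2 for r1, r2 in zip(R[:n], R[n:])]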
This algorithm works to some degree even for bad NTT primes
One of the things I dislike about NTT is that for NTT to be defined, there needs to exist an $$$n$$$-th root of unity. Usually problems involving NTT are designed so that this is never an issue. But if you want to use NTT where it hasn't been designed to magically work, then this is a really big issue. The NTT can become undefined!
Note that this algorithm does not exactly share the same drawback of being undefined. The reason for this is that if $$$\sqrt{c}$$$ doesn't exist, then the algorithm can simply choose to either switch over to an $$$O(n^2)$$$ polynomial multiplication algorithm, or fall back to fast unmodded polynomial multiplication. The implication is that this algorithm can do fast modded polynomial multiplication even if it is given a relatively bad NTT prime. I just find this property to be really cool!
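Checking whether $$$\sqrt{c}$$$ exists modulo an odd prime $$$p$$$ is cheap, using Euler's criterion (a small helper sketch):
def has_sqrt_mod(c, p):
    # Euler's criterion: non-zero c is a square mod an odd prime p
    # iff c^((p-1)/2) == 1 (mod p)
    return pow(c, (p - 1) // 2, p) == 1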
A good example of when NTT becomes undefined is the yosupo problem convolution_mod_large. Here the NTT mod is $$$998244353 = 119 \cdot 2^{23} + 1$$$. The tricky thing about the problem is that $$$n = 2^{24}$$$. Since $$$998244353 - 1 = 119 \cdot 2^{23}$$$, there won't exist any $$$n$$$-th root of unity, so the NTT of length $$$n$$$ is undefined. However, the divide and conquer approach from this blog can easily solve the problem by falling back to the $$$O(n^2)$$$ algorithm.
At the time of writing this, I currently hold the record for the fastest solution to Convolution (Large) at 2572 ms. But even a simple recursive version is enough to get AC at 9292 ms.
NTT implementation in Python based on this algorithm
Here is an NTT implementation. A codegolfed version of the same code can be found on Pyrival.
# Mod used for NTT
# Requirement: Any odd integer > 2
# It is important that MOD - 1 is
# divisible by lots of 2s
MOD = (119 << 23) + 1
assert MOD & 1
# Precalculate a non-quadratic residue (used by the NTT)
non_quad_res = 2
while pow(non_quad_res, MOD//2, MOD) != MOD - 1:
    non_quad_res += 1
rt = [1]
"""
Calculates NTT(P) in O(n log n) time.
This implementation is based on the
polynomial modulo multiplication algorithm.
Input:
P: A list of length n representing a polynomial P(x).
n needs to be a power of 2.
Output:
A list of length n representing the NTT of the polynomial P,
i.e. the list [P(root**i) % MOD for i in range(n)]
where root is an n-th root of unity mod MOD
"""
def NTT(P):
    n = len(P)
    # Assert n is a power of 2
    assert n and (n - 1) & n == 0
    # Check that NTT is defined for this n
    assert (MOD - 1) % n == 0
    # Make a copy of P to not modify the original P
    P = P[:]
    # Precalculate the roots
    while 2 * len(rt) < n:
        # 4*len(rt)-th root of unity
        root = pow(non_quad_res, MOD//(4 * len(rt)), MOD)
        rt.extend([r * root % MOD for r in rt])
    # Transform P
    k = n
    while k > 1:
        for i in range(n//k):
            r = rt[i]
            for j1 in range(i*k, i*k + k//2):
                j2 = j1 + k//2
                z = r * P[j2]
                P[j2] = (P[j1] - z) % MOD
                P[j1] = (P[j1] + z) % MOD
        k //= 2
    # Bit reverse P before returning
    rev = [0] * n
    for i in range(1, n):
        rev[i] = rev[i // 2] // 2 + (i & 1) * n // 2
    return [P[r] for r in rev]
# Inverse of NTT(P) using a standard trick
def inverse_NTT(ntt_P):
    n = len(ntt_P)
    n_inv = pow(n, -1, MOD) # Requires Python 3.8
    # The following works in any Python version, but requires MOD to be prime
    # n_inv = pow(n, MOD - 2, MOD)
    assert n * n_inv % MOD == 1
    return NTT([ntt_P[-i] * n_inv % MOD for i in range(n)])
"""
Calculates P(x) * Q(x) (where the coeffiecents are returned % MOD)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x) (with coeffients % MOD)
"""
def fast_polymult_using_NTT(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2
    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    # Transform P and Q
    ntt_P = NTT(P)
    ntt_Q = NTT(Q)
    # Calculate NTT of P*Q
    ntt_PQ = [p * q % MOD for p,q in zip(ntt_P,ntt_Q)]
    # Inverse NTT
    PQ = inverse_NTT(ntt_PQ)
    # Remove padding and return
    return PQ[:res_len]
NTT implementation in C++ based on this algorithm
Here is an NTT implementation. It is coded in the same style as in KACTL.
const ll mod = (119 << 23) + 1; // = 998244353
// For p < 2^30 there is also e.g. 5 << 25, 7 << 26, 479 << 21
// and 483 << 21. The last two are > 10^9.
typedef vector<ll> vl;
#include "../number-theory/ModPow.h"
void ntt(vl &a) {
    int n = sz(a);
    static ll r = 3;
    while (modpow(r, mod/2) + 1 < mod) ++r;
    static vl rt{1};
    for (static int k = 2; k < n; k *= 2) {
        rt.resize(n/2);
        ll x = modpow(r, mod/2/k);
        rep(i,k/2,k) rt[i] = rt[i-k/2] * x % mod;
    }
    for (int k = n; k > 1; k /= 2) rep(i,0,n/k) rep(j,i*k,i*k + k/2) {
        ll &u = a[j], &v = a[j+k/2], z = rt[i] * v % mod;
        v = u - z + (u < z ? mod : 0);
        u = u + z - (u + z >= mod ? mod : 0);
    }
    vi rev(n);
    rep(i,0,n) rev[i] = rev[i / 2] / 2 + (i & 1) * n / 2;
    rep(i,0,n) if (i < rev[i]) swap(a[i], a[rev[i]]);
}
vl conv(vl a, vl b) {
    if (a.empty() || b.empty()) return {};
    int s = sz(a) + sz(b) - 1, B = 32 - __builtin_clz(s), n = 1 << B;
    int inv = modpow(n, mod - 2);
    vl out(n);
    a.resize(n); b.resize(n);
    ntt(a), ntt(b);
    rep(i,0,n) out[-i & (n - 1)] = (ll)a[i] * b[i] % mod * inv % mod;
    ntt(out);
    return {out.begin(), out.begin() + s};
}
Blazingly fast NTT C++ implementation
At the time of writing this, I currently hold the record for the fastest solution to Convolution (Large) at 2572 ms. If what you want is a super fast C++ implementation, then I would suggest using this implementation.
(Advanced) Shorter implementations ("Codegolfed version")
It is possible to make really short but slightly less natural implementations of this algorithm. Originally I was thinking of using this shorter version in the blog, but in the end I didn't do it. So here they are. If you want a short implementation of this algorithm to use in practice, then I would recommend taking one of these implementations and porting it to C++.
Short Python implementation without any speedup tricks
"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time
Input:
  n: Integer, needs to be power of 2
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^(2n) - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^(2n) - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod3(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n
    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    # Base case
    if n == 1:
        return [P[0] * Q[0]]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod3(P, Q, n//2, sqrtc)
    PQ_plus = fast_polymult_mod3(P, Q, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult3(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
n = 1
while n < res_len:
n *= 2
# Pad with extra 0s to reach length 2*n
P = P + [0] * (2*n - n1)
Q = Q + [0] * (2*n - n2)
# Pick non-zero c arbitrarily =)
c = 123.24
# Calculate P*Q mod x^n - c
PQ = fast_polymult_mod3(P, Q, n, c)
# Remove extra 0 padding and return
return PQ[:res_len]
Short Python implementation supporting odd and even $n$ (making it up to 2 times faster)
"""
Calculates P(x) * Q(x) % (x^n - c) in O((a + b) * n) time, where n = a*2^b
Input:
  n: Integer
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^(2n) - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^(2n) - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod4(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n
    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod4(P, Q, n//2, sqrtc)
    PQ_plus = fast_polymult_mod4(P, Q, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
"""
Calculates P(x) * Q(x)
Input:
P: A list representing a polynomial P(x)
Q: A list representing a polynomial Q(x)
Output:
A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult4(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
b = 0
alim = 10
while alim * 2**b < res_len:
b += 1
a = (res_len - 1) // 2**b + 1
n = a * 2**b
# Pad with extra 0s to reach length 2*n
P = P + [0] * (2*n - n1)
Q = Q + [0] * (2*n - n2)
# Pick non-zero c arbitrarily =)
c = 123.24
# Calculate P*Q mod x^n - c
PQ = fast_polymult_mod4(P, Q, n, c)
# Remove extra 0 padding and return
return PQ[:res_len]
Short Python implementation supporting odd and even $n$ and imaginary-cyclic convolution (making it up to 4 times faster)
"""
Calculates P(x) * Q(x) % (x^n - c) in O((a + b) * n) time, where n = a*2^b
Input:
  n: Integer
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^(2n) - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^(2n) - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod4(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n
    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)
    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    PQ_minus = fast_polymult_mod4(P, Q, n//2, sqrtc)
    PQ_plus = fast_polymult_mod4(P, Q, n//2, -sqrtc)
    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2 for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    return PQ
"""
Calculates P(x) * Q(x) of two real polynomials
Input:
P: A list representing a real polynomial P(x)
Q: A list representing a real polynomial Q(x)
Output:
A list representing the real polynomial P(x) * Q(x)
"""
def fast_polymult5(P, Q):
# Calculate length of the list representing P*Q
n1 = len(P)
n2 = len(Q)
res_len = n1 + n2 - 1
# Pick n sufficiently big
b = 1
alim = 10
while alim * 2**b < res_len:
b += 1
a = (res_len - 1) // 2**b + 1
n = a * 2**b
# Pick c = i (imaginary unit)
c = 1j
# and decrease the size of n by a factor of 2
n //= 2
# Pad with extra 0s to reach length 2*n
P = P + [0] * (2*n - n1)
Q = Q + [0] * (2*n - n2)
# Calculate P*Q mod x^n - i
PQ = fast_polymult_mod4(P, Q, n, c)
# The imaginary part contains the "overflow"
PQ = [pq.real for pq in PQ] + [pq.imag for pq in PQ]
# Remove extra 0 padding and return
return PQ[:res_len]
A somewhat direct and compact way to obtain the key formula is as follows.
Note that $$$x^n-c \equiv -2c \pmod{x^n+c}$$$ and $$$x^n+c \equiv 2c \pmod{x^n-c}$$$. Thus, the expression
$$$ \frac{(x^n+c)\,Q(x) - (x^n-c)\,P(x)}{2c} $$$
has remainder $$$Q(x)$$$ modulo $$$(x^n-c)$$$ and $$$P(x)$$$ modulo $$$(x^n+c)$$$. This is very similar to the Chinese remainder theorem.
But since FFT is commonly used as a black box, it's very hard to estimate how useful this recursion really is until it's checked on problems designed to compare various FFT implementations with one another, such as Convolution or Convolution (large)...
I tested out the suggested NTT implementation vs the current KACTL implementation.
So, it seems roughly the same, but maybe the approach from this blog would be better if it's used for convolution directly, rather than through NTT? Unfortunately, the blog doesn't provide a C++ implementation for this case...
The two implementations take about the same time because computationally they are almost the same algorithm.
However, it is possible to make something that runs faster than standard NTT. Here are two things to consider:
1. You need almost no extra padding using the method from this blog compared to standard NTT.
2. You can make the base case be something like `if (n <= 64)`, and then switch to a fast $$$O(n^2)$$$ polynomial multiplication algorithm. This can be significantly faster than going down all the way to $$$n = 1$$$.
Yes, I agree. Because of that I have been working on making efficient C++ implementations.
Here is a simple recursive implementation that makes use of 1. and 2.: (239 ms), and here is a slightly more complicated iterative version: (205 ms). Adding Montgomery and FastIO to the iterative version brings the time down to 63 ms. This same implementation also holds the current record for the fastest solution to Convolution (Large).
Great blog!
Below I'll try to compare this blog's algorithm and the standard FFT that splits into even and odd indices (probably that's the Cooley–Tukey algorithm, or only a special case of it, I don't know).
Let $$$n = n_1 n_2$$$. I was told in my university algorithms course that every discrete Fourier transform of size $$$n$$$ (that is, calculating $$$\sum\limits_{i = 0}^{n - 1}a_i \omega^{ji}$$$ for $$$0 \le j < n$$$ where $$$\omega^n = 1$$$) could be done via $$$n_2$$$ discrete Fourier transforms of size $$$n_1$$$, followed by $$$n_1$$$ transforms of size $$$n_2$$$ on the result (plus some $$$\mathcal{O}(n)$$$ stuff). It decomposes indices and powers into the form $$$in_2 + j$$$ or $$$k + n_1l$$$ and uses the cyclicity of $$$\omega$$$.
Let $$$a \in \mathbb{C}^n, DFT_{n,\omega} := \left[\omega^{ij}\right]_{i=0..n}^{j=0..n}, \quad \omega_1 := \omega^{n_1}, \omega_2 := \omega^{n_2}$$$. Then
is an $$$n_2 \times n_1$$$ matrix with the values of $$$DFT_{n,\omega} \cdot a$$$ written row by row.
Here $$$\left[x_{i,j}\right]_{i=0..m}^{j=0..k}$$$ denotes $$$m\times k$$$ matrix for $$$0 \le i < m$$$ and $$$0 \le j < k$$$; $$$\circ$$$ is element-wise product.
Of course, the equality holds only if $$$\omega^n = 1$$$.
Curiously enough, all ways to factorize $$$n$$$ (to small enough factors) result in the same complexity. For example, $$$T(n) = 2\cdot T\left(\frac{n}{2}\right) + \frac{n}{2}\cdot T(2) + \mathcal{O}(n)$$$ and $$$T(n) = 2\sqrt{n}\cdot T(\sqrt{n}) + \mathcal{O}(n)$$$ both resolve to $$$T(n) = \mathcal{O}(n\log n)$$$.
Introduce $$$D(n) = \frac{T(n)}{n}$$$ and rewrite the recurrence in terms of it.
So, one has some freedom while calculating a discrete Fourier transform of size $$$n$$$. The most common scheme is to split the elements into two parts, odd and even, go into recursion and then merge the results. It corresponds to $$$n_1 = \frac{n}{2}, n_2 = 2$$$: firstly you calculate $$$2$$$ independent DFTs of size $$$n_1$$$ and then merge them, in fact by $$$n_1$$$ DFTs of size $$$2$$$.
The algorithm from this blog is different. For $$$c = 1$$$ it's the same as $$$n_1 = 2$$$ and $$$n_2 = \frac{n}{2}$$$: firstly you process the data by taking the polynomials modulo $$$x^{n/2} \pm \sqrt{c}$$$ (in fact, doing $$$n_2$$$ DFTs of size 2), then go into recursion (and there is no need to process the results afterwards). In the recursion $$$c$$$ alters, but the reweight technique mentioned in the blog shows us that it doesn't matter.
Nevertheless, it was really unexpected to me that this interpretation is somehow able to deal with arbitrary $$$c$$$ in such a natural way. I don't know (putting aside general calculations for arbitrary $$$n_1, n_2$$$) which algorithm is easier for a beginner to understand. The main fundamental difference, as far as I can see, is that this interpretation is more "polynomial": you don't calculate values at some points and then interpolate, but take polynomials modulo. This allows more general tricks such as "bad NTT primes" and "avoiding padding too many zeroes to make n a power of 2". These tricks are impossible with even-odd FFT, aren't they?
The last thing to notice is that I implemented in C++ both even-odd FFT (iterative (723 ms), recursive (879 ms)) and the $$$x^n - c$$$ version (iterative (537 ms), recursive (699 ms)). Putting aside that they don't show high performance, in my implementation this blog's algorithm is a bit faster than the standard one.
Actually there is a way to do this with even-odd FFT (Cooley-Tukey) too.
Suppose $$$n = a \, 2^b$$$. Start by splitting up the coefficients of your polynomials depending on their index mod `a`. This will split up $$$P$$$ into $$$a$$$ polynomials, and $$$Q$$$ into $$$a$$$ polynomials. Note that it is possible to construct $$$P \, Q$$$ by stitching together the products between all pairs of $$$P$$$ polynomials and $$$Q$$$ polynomials. If implemented efficiently, this will run in $$$O((a + b) \, n)$$$. In comparison, using the algorithm from this blog results in $$$O((a + b) \, n)$$$.
So the conclusion is that there are some analogues to "bad NTT primes" and "avoiding padding too many zeroes to make n a power of 2" with even-odd FFT. But I think that using even-odd FFT will result in a more complicated and slower running algorithm.
About performance: I think there are two things that you are missing which will make this algorithm a lot faster than even-odd FFT (Cooley–Tukey).
Firstly, pick $$$n = a \, 2^b$$$ in order to minimize padding. This won't matter for the Yosupo problem you benchmarked on, since the input constraints are "nice" (Yosupo uses a power of 2). But in general this will save a factor of two in time and memory over even-odd FFT.
Secondly, try using a more advanced base case than $$$n = 1$$$. If you instead make the base case something like `if (n <= 64)`, then you will have a much faster running algorithm. The reason for this is that if you have low degree polynomials, then the fastest thing to do is to just multiply them in $$$O(n^2)$$$ time. This turns out to give a fairly significant speed boost.
My implementations take: recursive (239 ms), iterative (205 ms). Adding Montgomery and FastIO to the iterative version brings the time down to 63 ms. This same implementation also holds the current record for the fastest solution to Convolution (Large).
Hm, indeed, it's possible to implement even-odd FFT in $$$\mathcal{O}((a + b)n)$$$ using DFTs instead of multiplications as the subroutine.
Regarding performance: yes, I see now (and could have easily guessed before) that there is much room to optimize. I tried to push it a little more (based on your code), and it indeed becomes a bit faster, but not much. So, I just wanted to show that in my particular implementation this blog's algorithm runs ~20% faster than even-odd FFT, but given its inefficiency, the result doesn't really matter.
I ain’t reading allat
I was not exactly sure about how it worked, but it seems that I have had a very similar (if not exactly the same) implementation of this NTT transform in our ICPC notebook from two years ago. I think I stole it from the yosupo judge when looking for a fast but easy to code implementation of NTT.
Also, this implementation seems to not require bit reversal at all.
I think my reasoning at that time was that the Hadamard linear transform that appears when doing polynomial evaluation can be written as a product of elementary "butterfly" matrices, but they can be evaluated in any order, and this implementation just chooses to compute it in reverse (large to small instead of small to large). I might be wrong, though, as my memory is really foggy nowadays.
Oh cool. That's a really nice and short implementation!
The way I understand it is that FFT algorithms require bit reversal if they work in-place in memory. Your implementation gets around this by not working in-place. So in a sense, `swap(a, b);` is your bit reversal.
Yes, that's true. It's nice that it makes the code shorter, and it also seemed to be a bit faster than bit reversal solutions in my experiments (probably due to cache friendliness).
I tried implementing this style of NTT combined with the algorithm from this blog. The code becomes really neat and short, but it also seems about 30% slower compared to the in-place version.
Even if it turns out to be slower, I think the added simplicity is likely worth it for CP. It is also possible to codegolf it down a ton, for example you can switch `(MOD - 1) / 2` to `MOD / 2` and `(MOD - 1) / (n / step)` to `MOD / n * step`. You also don't need to explicitly invert $$$g$$$, instead you could just do `ModInt w = g.pow(rev ? MOD - 1 - MOD / n * step : MOD / n * step)`.
Discussed it with some polynomial-pros and found out that this algorithm is well known in China:
https://www.luogu.com.cn/blog/chtholly-willem/you-hua-DIT-DIF-NTT
https://blog.seniorious.cc/2023/relearn-FFT/
yet another "well known in China" time
Is there some funny name for it, like for other things that are well known in China?
it's called DIF-DIT, not funny enough for a name though