... Probably.
Hi everyone!
As some of you might already know, I don't like NTT. Primary reasons for this are:
- I don't like modular arithmetic optimizations (Montgomery, etc).
- I don't like NTT mods, and prefer the good old $$$10^9+7$$$.
- I do like algebra of complex numbers.
But, to my regret, the overwhelming mainstream in modern competitive programming is NTT. Primary reasons are:
- It's allegedly faster.
- It needs less memory.
- It has no precision issues.
- People don't like algebra of complex numbers.
There is nothing I can do about the last, but today I'll address the first three. Optimizing complex FFT was on my mind for quite some time, but I didn't really work on it that much until the blog by Qwerty1232 on optimizing NTT dropped, which motivated me to actually put some serious effort into this.
In this blog, we will primarily focus on optimizations that are specific to complex arithmetic. Of course, people already mostly know of its challenges, as mentioned above, but it also offers some unique benefits that might alleviate them, and even help get the upper hand in certain cases! My main examples for this:
- The FFT solution to Wildcard Pattern Matching runs in only 5 ms for input strings of length up to $$$2^{19}$$$. At the moment of writing this blog, it is the top-1 solution for the problem, with a close runner-up of 9 ms by Qwerty1232 using NTT. This is achieved by tightly packing the arrays into complex numbers, so that the whole task boils down to a single imaginary-cyclic convolution of length $$$\frac{n}{2}$$$.
- The current top-1 solution for Convolution (mod $$$10^9+7$$$) is also mine, FFT based, with running time of 36 ms, and with the runner up being 48 ms, again, by Qwerty1232 using NTT. Moreover, even on Convolution modulo $$$998\;244\;353$$$, complex based FFT ends up being in top-8 (if we dedup solutions by same authors).
- I was finally able to get AC on Convolution (Large) with complex FFT in 2393 ms. This puts the solution in the top-8 by speed (after dedup), while previously I didn't even think it was possible to get AC on this task with complex FFT at all.
I would like to express my sincerest thanks to:
- purplesyringa for all the in-depth discussions of the topic, and help with shaving those last milliseconds off Wildcard Pattern Matching,
- Qwerty1232 for discussions, help with identifying adversarial inputs to complex FFT, and of course for the inspiration to do all this,
- jeroenodb for helping me understand details of error bounds for FFT,
- pajenegod for introducing imaginary cyclic convolution and poly-mod based approach to FFT to the community.
- sslotin for Algorithmica, which I unjustly neglected in the past, and which was an amazing intro to constant optimizations.
- magnus.hegdahl for introducing GNU vector extensions to me.
Challenges
Okay, first of all, let's summarize the main challenges that we will need to tackle. I would split them into the following categories:
- Precision;
- Time consumption;
- Memory consumption.
They are, of course, all intertwined: e.g. optimizing memory may help reduce time consumption, while trying to make things more precise might increase it, so it is important to find a balance.
Precision
The first practical limitation we encounter is that, if we use the double type for our computations, it can only represent integers exactly when they are in the range $$$(-2^{53}, 2^{53})$$$. So, we have a hard cap of $$$2^{53} \approx 9 \cdot 10^{15}$$$ on the size of the numbers we can work with. For practical purposes it's even more limited: for reasons that will become apparent later on, we will have to operate within the $$$(-2^{51}, 2^{51})$$$ range, which further limits our numbers to $$$2^{51} \approx 2.25 \cdot 10^{15}$$$.
Now, if we need to find a convolution of two arrays of size $$$n$$$ with values being up to $$$C$$$, the resulting array might have values up to $$$n C^2$$$, so if we want to find a convolution modulo $$$M = 10^9+7$$$ for $$$n=2^{19}$$$, the resulting numbers would be around $$$5.24 \cdot 10^{23}$$$, which far exceeds our acceptable boundaries. A common way to mitigate this is to pick a threshold $$$m$$$, and then split numbers modulo $$$M$$$ into $$$x = x_0 + m x_1$$$, after which we can compute their product as
$$$x y = x_0 y_0 + m (x_0 y_1 + x_1 y_0) + m^2 x_1 y_1.$$$
So, assume that $$$m \approx \sqrt{M}$$$, then we have $$$|x_0|, |x_1| \leq \sqrt{M}$$$, and thus the resulting numbers would be up to $$$nM \approx 5.24 \cdot 10^{14}$$$, which is still quite large, but at least theoretically representable in double.
However, we can limit the numbers we work with even further. For this, let's assume that $$$x \in (-\frac{M}{2}, \frac{M}{2})$$$ instead of $$$x \in [0, M)$$$. Now, if we keep $$$m \approx \sqrt M$$$, we will automatically get $$$|x_1| \leq \frac{\sqrt{M}}{2}$$$, but additionally we can also pick $$$x_0$$$ to be in the range $$$(-\frac{m}{2}, \frac{m}{2})$$$ instead of $$$[0,m)$$$ to make sure that also $$$|x_0| \leq \frac{\sqrt{M}}{2}$$$, thus the resulting numbers would be up to $$$\frac{nM}{4} \approx 1.31 \cdot 10^{14}$$$.
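To make the splitting concrete, here is a small sketch of the balanced split. The constant $$$m = 2^{15}$$$ and the helper name are assumptions of this sketch, with $$$m$$$ chosen close to $$$\sqrt{M}$$$:

```cpp
#include <cstdint>
#include <utility>

// Sketch: represent x mod M as x0 + m*x1 with both |x0| and |x1|
// on the order of sqrt(M)/2. The constant m = 2^15 is an assumption
// of this sketch, chosen close to sqrt(10^9 + 7).
constexpr int64_t MOD = 1'000'000'007;
constexpr int64_t m = 1 << 15;

std::pair<int64_t, int64_t> balanced_split(int64_t x) {
    // move x from [0, M) into the balanced range around zero
    if (x > MOD / 2) x -= MOD;
    // round x / m to the nearest integer (symmetrically for negative x)
    int64_t x1 = x >= 0 ? (x + m / 2) / m : -((-x + m / 2) / m);
    int64_t x0 = x - m * x1; // now |x0| <= m/2
    return {x0, x1};
}
```

Reconstructing $$$x_0 + m x_1$$$ modulo $$$M$$$ gives back the original residue, while both halves stay small enough for the precision budget discussed above.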
This way, for convolutions of size $$$2^{19}$$$ we will have around $$$6$$$ "spare" bits in the significand that we can afford to lose to various accumulated precision errors. But is it really enough for us?..
FFT precision bounds
Generally, FFT is an outlier among most floating-point algorithms you might encounter, as it is known to be outstandingly stable. But how stable exactly? Will just $$$6$$$ bits be enough for us?.. To answer this, we will have to learn the basics of error estimation in numerical analysis. If you're not familiar with it at all, that's fine! Neither am I, so we will have to learn on the fly.
We will assume IEEE 754 double, which represents numbers in scientific notation, with $$$52$$$ bits allocated for the significand, $$$11$$$ bits for the exponent, and $$$1$$$ bit for the sign. In practice, this means $$$53$$$ bits of precision, as the leading bit of a normalized significand is always $$$1$$$ and thus implied, which is why we mentioned earlier that all integers in $$$(-2^{53}, 2^{53})$$$ can be represented exactly.
When it comes to actual computation, IEEE 754 requires that arithmetic operations are performed as if we had infinite precision, with the result rounded to the nearest representable number in the end. This means that the results of basic operations, such as addition and multiplication (and, with a good math library, even $$$\sin$$$ and $$$\cos$$$), have relative precision of $$$\varepsilon = 2^{-53}$$$, due to rounding to the nearest of the two "candidate" values representable with a $$$52$$$-bit significand.
By relative precision here we mean that if the exact number is $$$x$$$, the resulting number would be equal to $$$x(1+\delta)$$$, where $$$|\delta| \leq \varepsilon$$$.
Note: The $$$\varepsilon$$$, as defined here, is different from machine epsilon, the smallest representable increment to $$$1.0$$$, which you get e.g. by calling std::numeric_limits<double>::epsilon(). This number typically evaluates to $$$2^{-52}$$$.
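As a quick sanity check of both limits, one can verify them directly in code (the constant names here are illustrative):

```cpp
#include <cmath>
#include <limits>

// 2^53: the first power of two past which doubles can no longer
// represent every consecutive integer exactly
constexpr double TWO53 = 9007199254740992.0;

// "machine epsilon" as reported by the standard library: the gap between
// 1.0 and the next representable double, i.e. 2^-52 (not the rounding
// error bound eps = 2^-53 used in the text)
const double machine_eps = std::numeric_limits<double>::epsilon();
```

Adding $$$1$$$ to $$$2^{53}$$$ rounds back to $$$2^{53}$$$ (ties-to-even), while $$$2^{53}-1$$$ is still exact.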
Floating-point operations
As a general convention, we will use $$$'$$$ to mark imprecisely computed values. The following is generally true, assuming $$$x$$$ and $$$y$$$ are represented exactly in the double type:
- $$$(x\pm y)' = (x \pm y) (1+\delta)$$$,
- $$$(x y)' = (xy) (1+\delta)$$$,
- $$$(\cos x)' = (\cos x)(1+\delta)$$$ and $$$(\sin y)' = (\sin y) (1+\delta)$$$,
where $$$|\delta| \leq \varepsilon$$$ in each individual case. In other words, we assume that the relative error of any of the above operations is at most $$$\varepsilon$$$.
Complex operations
When dealing with complex numbers, it would make most sense to assume that $$$\delta$$$ is also a complex number, and estimate $$$|\delta|=\sqrt{\delta_x^2+\delta_y^2}$$$, the complex absolute value of $$$\delta$$$.
Complex addition. Let $$$z = z_1 \pm z_2 = x + iy$$$, then we have
$$$z' = x(1+\delta_x) + iy(1+\delta_y),$$$
where $$$|\delta_x|, |\delta_y| \leq \varepsilon$$$, and if we want to estimate $$$\delta = z' - z$$$, we get
$$$|\delta| = \sqrt{x^2 \delta_x^2 + y^2 \delta_y^2} \leq \varepsilon \sqrt{x^2+y^2} = \varepsilon |z|,$$$
meaning that the relative error of complex addition is at most $$$\varepsilon$$$.
Roots of unity. For a root of unity $$$\omega$$$, it holds that $$$|\omega|=1$$$, so the absolute and the relative error are one and the same for them. Note that as $$$|\omega_x|, |\omega_y| \lt 1$$$, their floating-point exponents are at most $$$-1$$$, meaning that each component can be computed with absolute precision $$$\frac{\varepsilon}{2}$$$, thus
$$$|\delta| \leq \sqrt{\left(\frac{\varepsilon}{2}\right)^2 + \left(\frac{\varepsilon}{2}\right)^2} = \frac{\varepsilon}{\sqrt{2}},$$$
so we have $$$|\delta| \leq \frac{\varepsilon}{\sqrt{2}}$$$ for them.
Complex multiplication. Assume the standard formula for multiplication:
$$$z_1 z_2 = (x_1 x_2 - y_1 y_2) + i(x_1 y_2 + x_2 y_1).$$$
It can be shown that $$$|\delta| \leq \sqrt{5} \varepsilon$$$. The details of the derivation are quite complicated, as the authors consider a lot of various corner cases. But to better understand how error estimates generally work, we can derive a simpler bound of $$$|\delta| \leq \sqrt{8} \varepsilon$$$:
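Independently of the derivation, the $$$\sqrt 5 \varepsilon$$$ bound is easy to probe empirically by comparing the standard formula in double against a higher-precision reference. The following sketch (the long double reference and the function name are assumptions of mine) measures the worst observed error relative to $$$|z_1| \cdot |z_2|$$$:

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <random>

// Empirical check of the multiplication bound: the error of the standard
// formula, measured against a long double reference, should stay below
// sqrt(5) * eps relative to |z1| * |z2|.
long double max_relative_mul_error(int iters, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    long double worst = 0;
    for (int i = 0; i < iters; i++) {
        double x1 = dist(rng), y1 = dist(rng), x2 = dist(rng), y2 = dist(rng);
        double re = x1 * x2 - y1 * y2; // standard formula, in double
        double im = x1 * y2 + y1 * x2;
        long double rl = (long double)x1 * x2 - (long double)y1 * y2;
        long double il = (long double)x1 * y2 + (long double)y1 * x2;
        long double err = std::hypot(re - rl, im - il);
        long double norm = std::hypot((long double)x1, (long double)y1)
                         * std::hypot((long double)x2, (long double)y2);
        if (norm > 0) worst = std::max(worst, err / norm);
    }
    return worst;
}
```

On random inputs the observed error is typically well below the proven worst case.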
Butterfly transformations
The key idea in analyzing FFT precision is, instead of estimating individual errors, to estimate the total error of the resulting transform. So, assume $$$A_0, \dots, A_{n-1}$$$ is the DFT vector; we want to estimate
$$$|A'-A| = \sqrt{\sum\limits_{k=0}^{n-1} |A_k' - A_k|^2}.$$$
Note that the maximum error in individual terms can never surpass the square root of the sum-of-squares error.
To estimate the sum-of-squares error, we notice that the core of the Cooley-Tukey scheme consists of the following transforms:
- $$$(a, b) \mapsto (a+b\omega, a-b\omega)$$$;
- $$$(a, b) \mapsto (a+b, \frac{a-b}{\omega})$$$.
The first one is typically used in "direct" FFT, while the second is its inverse. Additionally, radix-4 FFT uses the following butterfly-like transforms:
We will get to them in detail later on; in essence, they just process two FFT layers at once. We will also need to use them for speed, so it would be prudent to estimate their precision as well.
Note that, generally, instead of direct division by $$$\omega$$$, it makes more sense to multiply by its conjugate $$$\omega^*$$$, so essentially it is still a multiplication by a root of unity.
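A minimal sketch of the two butterflies (the names are mine), which also shows why the conjugate trick works: composing the direct and the inverse butterfly with the same $$$\omega$$$ returns $$$(2a, 2b)$$$, which is where the overall division by $$$n$$$ in the inverse FFT comes from.

```cpp
#include <algorithm>
#include <complex>

using cd = std::complex<double>;

// "direct" butterfly: (a, b) -> (a + b*w, a - b*w)
void butterfly_fwd(cd& a, cd& b, cd w) {
    cd t = b * w;
    b = a - t;
    a = a + t;
}

// inverse butterfly: (a, b) -> (a + b, (a - b) / w); since |w| = 1,
// the division is implemented as multiplication by conj(w)
void butterfly_inv(cd& a, cd& b, cd w) {
    cd s = a + b;
    cd t = (a - b) * std::conj(w);
    a = s;
    b = t;
}

// round-trip helper: max deviation of (fwd; inv) from (2a, 2b)
double butterfly_roundtrip_error(cd a, cd b, cd w) {
    cd a1 = a, b1 = b;
    butterfly_fwd(a1, b1, w);
    butterfly_inv(a1, b1, w);
    return std::max(std::abs(a1 - 2.0 * a), std::abs(b1 - 2.0 * b));
}
```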
Now, when we compute $$$b \omega$$$, the error factor of the multiplication, up to $$$(1+\sqrt 5 \varepsilon)$$$, and the error factor of the root of unity itself, up to $$$(1+\frac{\varepsilon}{\sqrt 2})$$$, effectively multiply with each other, which allows us to estimate the error of the whole $$$b \omega$$$ term as the product of the corresponding factors.
Addition, however, may introduce unbounded relative error if e.g. $$$a+b\omega$$$ is close to zero. The way to mitigate this is to note that for $$$s=a+b\omega$$$ and $$$t = a - b\omega$$$, it holds that $$$|s|^2 + |t|^2 = 2(|a|^2+|b|^2)$$$, which allows us to bypass the issue, estimating the error $$$|(s',t')-(s,t)|$$$ instead of individual errors $$$|s'-s|$$$ and $$$|t'-t|$$$.
To be precise and generic, let us formulate the following statement:
Multiplicativity of relative error in linear transformations. Let $$$x'$$$ be a vector such that $$$|x-x'| \leq \delta |x|$$$, where $$$|\cdot|$$$ on vectors denotes the Euclidean length of the vector, and let $$$L$$$ be a linear transformation such that $$$|(Lv)'-Lv| \leq \varepsilon |Lv|$$$ and $$$|Lv|=C|v|$$$ for any $$$v$$$. Then, it holds that
$$$|(Lx')' - Lx| \leq (\varepsilon + \delta + \varepsilon\delta) \, C |x|.$$$
Proof:
Now, we can rewrite $$$|Lx'| = |Lx + L(x'-x)|$$$ to get
$$$|Lx'| \leq C|x| + C\delta|x| = (1+\delta)C|x|,$$$
and therefore
$$$|(Lx')' - Lx| \leq |(Lx')' - Lx'| + |Lx' - Lx| \leq \varepsilon(1+\delta)C|x| + \delta C|x| = (\varepsilon + \delta + \varepsilon\delta) C |x|.$$$
Note that $$$(\varepsilon + \delta + \varepsilon \delta) = (1+\varepsilon)(1+\delta)-1$$$, meaning that, in essence, we can just multiply relative error components when repeatedly applying linear transformations, and subtract $$$1$$$ in the end to get the actual error estimate.
In other words, in terms of butterfly transformations, we get the per-layer relative error factor
$$$(1+\varepsilon)\left(1+\sqrt{5}\varepsilon\right)\left(1+\frac{\varepsilon}{\sqrt{2}}\right).$$$
In a similar manner, for radix-4 transformations we get
$$$(1+\varepsilon)^2\left(1+\sqrt{5}\varepsilon\right)\left(1+\frac{\varepsilon}{\sqrt{2}}\right).$$$
Note that we use the squared $$$(1+\varepsilon)$$$ here because, addition-wise, we do the equivalent of two radix-2 transforms, but at the same time we shave off one of the two multiplication error factors, as we now multiply each element by a root of unity only once.
Putting it all together
Note that the error above is relative to the individual $$$(s, t)$$$ or $$$(s,t,u,v)$$$ block, and on the whole layer of FFT transformations it accumulates with the same factor into the error of the whole array. In other words, we end up with the estimate
$$$|A' - A| \leq \left(\left((1+\varepsilon)^2 \left(1+\sqrt{5}\varepsilon\right)\left(1+\frac{\varepsilon}{\sqrt{2}}\right)\right)^{k} - 1\right) |A|,$$$
where $$$k = \lceil \log_4 n \rceil$$$. The next step in FFT is to find $$$C'$$$, the Hadamard (i.e. component-wise) product of $$$A'$$$ and $$$B'$$$. Estimating the Euclidean norm $$$|C' - C|$$$ directly would be quite tricky. Instead, we can use the Cauchy-Schwarz inequality to get
$$$\sum\limits_i |C_i' - C_i| \leq |A'-A| \cdot |B'| + |A| \cdot |B'-B| + \sqrt{5}\varepsilon \, |A'| \cdot |B'|.$$$
The left-hand side in the expression above is, in fact, $$$|C'-C|_1$$$, the $$$L_1$$$-norm of the error of $$$C$$$. It generally holds that $$$|v|_1 \geq |v|_2$$$, where $$$|v|_2 = |v|$$$ is the Euclidean norm, hence applying the results above to the inverse transform, we arrive at the final estimate
where $$$x$$$ and $$$y$$$ are the input vectors of complex values, and $$$z$$$ is their cyclic convolution. This bound also trivially serves as an upper bound for the maximum individual error.
Additionally, we can neglect higher-order powers of $$$\varepsilon$$$ by using $$$(1+x)^n \approx 1+xn$$$ when $$$x \to 0$$$, thus we get
Assuming $$$k \geq 19$$$ (lower values will have higher precision due to smaller $$$|x|$$$ and $$$|y|$$$ anyway), we can rewrite this as
Making the bound simpler, and adding some padding to account for particular implementation inefficiencies, we can use
The bound above should be safe to use when estimating precision of a single complex valued cyclic convolution.
Now, what about convolutions modulo $$$M$$$? As we want $$$|z'-z| \leq \frac{1}{2}$$$ for correct rounding, we need
for rounding in the end to be precise enough. So, what are the values $$$|x|, |y|$$$?
In the actual implementation, which we will discuss in more detail later on, we use imaginary cyclic convolution (also called right angle cyclic convolution in the literature) to pack two numbers into a single complex value in the input. This leads to the input numbers being of the form $$$a+ib$$$, where $$$|a|, |b| \leq \frac{\sqrt{M}}{2}$$$, meaning that the squared norm of each element of the complex input is at most $$$\frac{M}{2}$$$, and thus we have
$$$|x| \cdot |y| \leq \sqrt{n \cdot \frac{M}{2}} \cdot \sqrt{n \cdot \frac{M}{2}} = \frac{nM}{2},$$$
meaning that for a convolution modulo $$$M$$$ we need to satisfy
At the same time, the value we get for convolution tasks on Library Checker at $$$n \leq 2^{19}$$$ is... $$$2^{19} \cdot 10 \cdot 10^9 \approx 5.24 \cdot 10^{15}$$$? Not good. Not good at all.
On one hand, the achieved bound is very pessimistic, and there might still be some room to tighten it for the worst-case by a factor of $$$2$$$, or $$$4$$$, or even $$$8$$$... On the other hand, there's no way we'd really want to use something so tight, as any deviation from the specific algorithm that we analyzed, any attempt to cut corners will cost us more precision loss. So... What can we do about this?
Randomization
It is generally known that, when executed on random inputs, the error of FFT is much better. As a matter of fact, by a factor of $$$\sqrt n$$$. Typical explanations you might find online to this phenomenon claim that it is due to accumulation of errors modeling a random walk, and thus there should be a lot of summands with different signs, which, on average, would cancel each other, and lead to much smaller absolute value.
The issue with such reasoning is that as we see above, the actual length of a random walk in the error accumulation is not proportional to $$$n$$$, it's only proportional to $$$\log n$$$. Meaning that this rationale can only really account for improvement in the precision by the factor of $$$\sqrt{\log n}$$$, but not more. So, what's going on really?
I'd say there are two general observations here that may help explain the phenomenon. And, quite importantly, both of them require that the multiplied numbers are not only random, but also "balanced", that is, their expected value should be $$$0$$$: numerical experiments imply that the improvements are minuscule when the numbers are random but, e.g., all positive.
Size of the output
If we look just at how large the numbers in the output can be, the trivial estimate is $$$n C^2$$$, where $$$C$$$ is the maximum possible absolute value in the input. However, if the numbers in the input are all picked uniformly at random, then the output itself, by the central limit theorem, behaves as a random walk, meaning that the expected absolute value of the numbers in the output is $$$\sqrt n C^2$$$, rather than $$$n C^2$$$.
This is also consistent with general observations in numerical experiments that the error relative to the magnitude of the output tends to be the same and quite small, meaning that by reducing the size of the output by a factor $$$\sqrt n$$$, we should as well reduce the absolute error by the same factor, as the relative error would presumably stay the same.
There are some caveats with the reasoning above, as I was not able to derive any rigorous bound on the error linked to the magnitude of the output $$$|z|$$$, rather than to $$$|x| \cdot |y|$$$, which may be much larger than $$$|z|$$$. The main reason it's difficult to get such a bound is that, for cyclic convolutions, there are non-zero arrays whose cyclic convolution is precisely $$$0$$$, for example $$$(1, 1, \dots)$$$ and $$$(1, -1, \dots)$$$. Due to this, I had to abandon the pursuit of a bound on the error relative to the magnitude of the output.
Distribution of errors
But there is still another practical observation, that would also allow us to explain the factor of $$$\sqrt n$$$ improvement in precision, in a somewhat better grounded manner.
The place in our estimates that particularly doesn't allow us to replace $$$|x| \cdot |y|$$$ with $$$|z|$$$ is the Hadamard product of $$$A'$$$ and $$$B'$$$. In our estimates above, we figured out that $$$|C'-C|_1 \leq \delta n \log n$$$, where $$$\delta$$$ is a constant on the scale of $$$\varepsilon$$$. Then, we applied this estimate directly to $$$|C'-C|$$$, assuming that in the worst case it is as big as $$$|C'-C|_1$$$.
Well... It is true that it can be as big in the worst case. But in practice it would mean that there is one term whose error is particularly large, while for all the other terms the error would have to be particularly small to compensate for it in the sum. And what appears to happen on random balanced inputs is that the error is spread much more uniformly, so $$$|C'_i-C_i|$$$ for an individual component of the Hadamard product would be closer to $$$\delta \log n$$$ than to $$$\delta n \log n$$$. But knowing this, we can now re-evaluate
$$$|C' - C| = \sqrt{\sum\limits_i |C_i' - C_i|^2} \approx \sqrt{n \cdot (\delta \log n)^2} = \delta \sqrt{n} \log n,$$$
meaning that we can estimate $$$|C'-C| \leq \delta \sqrt n \log n$$$ instead of $$$\delta n \log n$$$, which would explain the $$$\sqrt n$$$ improvement.
Numerical experiments
Of course, the explanation above is merely a heuristic, so I wanted to get a stronger indication of whether this is actually the case. For this, as suggested by Qwerty1232, I ran some experiments to estimate the precision of FFT on certain inputs. As error accumulation in FFT and iFFT is straightforward, my focus was on how the error behaves in the Hadamard product after applying FFT to the input arrays.
Experiments ran as follows:
- Only fill first half of the array to emulate "real" convolution, rather than cyclic convolution;
- Duplicate computations in double and __float128 to track their precision;
- Estimate $$$|C'-C|_1$$$, $$$|C'-C| = |C'-C|_2$$$ and $$$|C'-C|_\infty = \max\limits_i |C'_i-C_i|$$$;
Experiment 1: Put numbers to be around $$$30k \pm 10$$$. This emulates input that was specifically crafted to be as bad as possible.
Experiment 2: Put numbers to be uniformly picked from $$$[0, 30k]$$$. This emulates random input biased into the first quadrant of $$$\mathbb C$$$.
Experiment 3: Put numbers to be uniformly picked from $$$[-15k, 15k]$$$. This emulates balanced random input.
In all three experiments, the $$$L_1$$$ norm of the error closely follows the theoretical upper bound (notably, still with quite a large gap, but with the same slope). At the same time, in the first two experiments, $$$L_2 \approx L_\infty$$$, and they also mostly follow $$$L_1$$$: in the first case almost precisely, and in the second with a slight gap, but with the same slope.
But the situation is completely different in the third chart! We see here that $$$L_2$$$ grows proportional to $$$\sqrt n$$$, rather than $$$n$$$, and growth in $$$L_\infty$$$ is almost non-existent (essentially proportional to $$$\log n$$$), which confirms our assumptions about the nature of drastically improved error.
Input blinding
You will, of course, say: well, why should we care about performance in the "average" case? After all, in competitive programming, we are taught to care about the worst case. Fair enough! But what if I told you that there is a way to force any input to behave in a manner similar to random, especially when dealing with convolutions modulo $$$M$$$?
Generally, the idea of trying to randomize the inputs in this task is not new, and is also mentioned in the KACTL write-up. There, it is suggested to multiply the first array by a random number $$$x$$$, the second array by a random number $$$y$$$, and then the resulting array by $$$(xy)^{-1}$$$. Unfortunately, this randomization is generally insufficient to trigger the $$$\sqrt n$$$ improvement: e.g. when the input mostly consists of equal coefficients, this property is preserved, which prevents us from treating the input numbers as independent of each other.
But, at the same time, the approach in the write-up is on the right track. What we can do instead is to pick a random number $$$k$$$, and then multiply $$$a_i$$$ and $$$b_i$$$ by $$$k^i$$$, and correspondingly divide the resulting $$$c_i$$$ by $$$k^i$$$. This would work because, effectively, we compute $$$C(kx) = A(kx) B(kx)$$$ via FFT, and then "undo" this operation to go back from $$$C(kx)$$$ to $$$C(x)$$$.
As the powers $$$1, k, k^2, \dots$$$, when taken modulo $$$M$$$, would typically behave in an unpredictable manner for large $$$k$$$, we assume that the practical effect of doing this would be quite similar to multiplying each $$$a_i$$$ with an independently chosen random value.
We can additionally combine this approach with multiplying both arrays by a dedicated constant, to make sure that the values picked for the first input array and the second input array are independent from each other.
One possible attack vector here could be to introduce a lot of zeros in the input, as multiplying them by anything still wouldn't change them at all, but at the same time having zeros in the input is more likely to increase precision, rather than reduce it (e.g. it reduces the norm of the input vectors, allows for more precise compute, etc).
... Okay, I know, I know. It is a heuristic, not something proven rigorously. Analyzing this rigorously is pretty hard, but the few sections above this one are my honest attempt to provide a rationale for why it should work, and so far it has been outstandingly stable in practice. That being said, I'll be more than happy if someone figures out a valid attack vector on this approach! (or maybe finds a rigorous proof that it should work?)
From a practical perspective, I tried a bunch of various testing strategies, but in all cases the error behaved in the "randomized input" pattern, thus leaving quite a lot of buffer for correct rounding at the final step even for $$$n$$$ up to $$$2^{24}$$$, the constraint of Convolution (Large) on Library Checker; hence I was unable to observe a single failure among the thousands of tests that I ran. Again, this is far from rigorous evidence, but it should be another sound rationale for why it should always work at the scale of competitive programming inputs.
Memory management
Now that we've mostly figured out the precision-related questions, let's focus on optimizing time and memory consumption. I'm not going to sugarcoat it: FFT in complex numbers requires a lot of memory, and until this is addressed, its allocation will be a major contributor to time consumption as well.
Huge pages
When breaking down time usage, we will surprisingly find out that initializing vectors of complex numbers can take as much, or even more, time than the FFT itself. There is a common misconception, but a misconception nonetheless, that this time is spent initializing the memory with zeros, and that it would be better to allocate uninitialized memory first, and then fill it with the correct values.
This, in particular, is the principle on which various "bump allocators" work: they allocate a huge buffer of uninitialized memory in advance, and then carve allocations from this buffer instead of calling new. In practice, however, if we try to use such a bump allocator, we will notice that, indeed, the allocation itself is now much faster, but somehow the first pass of filling the memory with specific values takes much more time. Suspiciously, by roughly as much as we "saved" on the allocation. So, what is really happening here?
The answer is, it was not filling the array with zeros that took so much time, but rather the first-time access of newly allocated memory. So, by allocating a huge uninitialized chunk, we didn't really solve the problem, but rather delayed it until the first actual usage of the memory. So... What exactly takes so much time when we first access newly allocated memory?..
Page faults. Memory pages only actually get mapped into your virtual address space on the first write to them. By default, memory pages are 4 KiB in size, so you can imagine that initializing a few MiB of memory takes quite a while, as every single page must be faulted in and mapped.
Is there a solution to it? Yes! Huge pages. It is possible to request that the operating system place your memory on so-called huge pages, sized 2 MiB each, instead of the default 4 KiB pages. This process is, unfortunately, operating-system specific, and as Library Checker runs on Linux, I have so far only made a Linux version, which goes as follows:
#include <cstddef>
#include <memory>
#include <new>
#include <sys/mman.h>

template <typename T>
class big_alloc: public std::allocator<T> {
public:
    using value_type = T;
    using base = std::allocator<T>;

    big_alloc() noexcept = default;
    template <typename U>
    big_alloc(const big_alloc<U>&) noexcept {}

    [[nodiscard]] T* allocate(std::size_t n) {
        // small allocations are better served by regular pages
        if (n * sizeof(T) < 1024 * 1024) {
            return base::allocate(n);
        }
        n *= sizeof(T);
        void* raw = mmap(nullptr, n,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS,
                         -1, 0);
        if (raw == MAP_FAILED) {
            throw std::bad_alloc();
        }
        madvise(raw, n, MADV_HUGEPAGE);       // back the region with 2 MiB pages
        madvise(raw, n, MADV_POPULATE_WRITE); // pre-fault the pages right away
        return static_cast<T*>(raw);
    }

    void deallocate(T* p, std::size_t n) noexcept {
        if (n * sizeof(T) < 1024 * 1024) {
            return base::deallocate(p, n);
        }
        if (p) {
            munmap(p, n * sizeof(T));
        }
    }
};
This way, you can use big_alloc<T>::allocate(n) to allocate and populate memory for n instances of type T, using huge pages if necessary. Note that for small allocations it's best to use small pages, hence we fall back to regular allocations if the needed memory is less than 1 MiB.
For added convenience, you can still use std::vector with it via std::vector<T, big_alloc<T>>. Overall, this reduces the time needed to allocate and initialize one complex vector of length $$$2^{19}$$$ from around 3 ms to just 0.7 ms.
You can refer here for the cross-platform implementation of the big allocator. For now, it simply falls back to std::allocator when used on a non-POSIX system, but maybe I will figure out how to use huge pages on Windows as well in the future.
Reduce memory consumption
Generally, complex numbers stored in double take up quite a lot of memory. We need to aggressively reuse it, and also reduce its consumption as much as possible. Even if we use imaginary cyclic convolution to pack an output of length $$$2n$$$ into an array of complex numbers of length $$$n$$$, we still need at least $$$4$$$ arrays of complex numbers of length $$$n$$$ to compute the full convolution of two arrays of length $$$n$$$. For $$$n=2^{24}$$$, this alone takes up 1 GiB of RAM, which coincidentally is the tight memory limit for the task on Library Checker.
On top of it, we would need to use some memory for storing integer input/output, as well as an additional array of length $$$n$$$ for the pre-computed array of roots of unity (or suffer huge loss in time due to re-computing them on the fly). This all adds up quickly, and here's what we can do about it.
Pre-computing roots of unity
Unlike in standard DIT/DIF approaches, if we use the poly-mod based approach to FFT, the sequence of roots of unity used within a layer is the same on every FFT layer, and goes as $$$1$$$, then $$$-1$$$, then $$$i$$$, then $$$-i$$$, then $$$\sqrt i$$$, then $$$-\sqrt{i}$$$, then $$$i \sqrt i$$$, then $$$-i \sqrt{i}$$$, and so on. This alone allows us to halve the memory needed for the pre-computed roots, as we only need to store the sequence for the bottom-most layer.
But there's more! Due to the nature of how the sequence is computed, the following holds:
$$$r_{4k}^4 = r_{4k+1}^4 = r_{4k+2}^4 = r_{4k+3}^4 = r_k.$$$
This is due to the fact that the numbers $$$r_{4k}, \dots, r_{4k+3}$$$ are all possible fourth roots of $$$r_k$$$. Utilizing this property, we can reduce the size of the pre-computed array by another factor of $$$4$$$.
But there's more! Since the poly-mod based approach allows us to interrupt the transformation prematurely and fall back to naive multiplication on the bottom layers, we can skip the $$$2$$$ bottom layers entirely (which is quite convenient for us, as they would require a different vectorization approach, to be discussed later on), thus allowing us to reduce the size of the pre-computed roots of unity array by another factor of $$$4$$$.
All in all, this reduces its size from $$$2n$$$ to $$$\frac{n}{8}$$$ for computing an imaginary cyclic convolution of length $$$n$$$. This value is, in fact, small enough to even move the pre-computation to the compilation stage, which further reduces time consumption if we only do a single convolution!
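For illustration, here is one possible way to generate such a root sequence and check the fourth-root property numerically; the particular branch choice in the square root is an assumption of this sketch, not necessarily the one used in the actual implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

// Generates the layer-invariant root sequence
// 1, -1, i, -i, sqrt(i), -sqrt(i), i*sqrt(i), -i*sqrt(i), ...:
// r[2k] is a square root of r[k] (argument mapped into [0, 2*pi)
// before halving), and r[2k+1] = -r[2k].
std::vector<std::complex<double>> root_sequence(int n) {
    std::vector<std::complex<double>> r(n);
    r[0] = 1;
    if (n > 1) r[1] = -1;
    for (int k = 1; 2 * k + 1 < n; k++) {
        double a = std::arg(r[k]);
        if (a < 0) a += 2 * std::acos(-1.0); // map into [0, 2*pi)
        r[2 * k] = std::polar(1.0, a / 2);
        r[2 * k + 1] = -r[2 * k];
    }
    return r;
}

// max deviation from the property that r[4k..4k+3] are fourth roots of r[k]
double fourth_root_property_error(const std::vector<std::complex<double>>& r) {
    double worst = 0;
    for (size_t k = 0; 4 * k + 3 < r.size(); k++)
        for (int j = 0; j < 4; j++)
            worst = std::max(worst, std::abs(std::pow(r[4 * k + j], 4) - r[k]));
    return worst;
}
```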
Incremental convolution
Above, we figured out how to reduce the required pre-computation for a single convolution of length $$$n$$$. But this is still not enough for Convolution (Large), as we need to further reduce the size of the convolutions themselves to fit below 1 GiB. There are two possible approaches here. The simpler one is to use the Karatsuba or Toom-Cook algorithm on the top layers, until the arrays are small enough to fit. The obvious downside is that even if we succeed, we would need to do $$$3$$$ convolutions of size $$$\frac{n}{2}$$$ instead of one convolution of size $$$n$$$, which is roughly a factor of $$$\frac{3}{2}$$$ worse.
What else can we do? Let's think a bit more about what exactly happens when we use the imaginary-cyclic convolution of two arrays of length $$$n$$$. Let's say that $$$C(x) = A(x) B(x)$$$, and
$$$C(x) = C_0(x) + x^n C_1(x),$$$
where $$$C_0$$$ and $$$C_1$$$ are real polynomials of degree below $$$n$$$.
When we do the imaginary-cyclic convolution, we effectively compute $$$C(x)$$$ modulo $$$x^n-i$$$. But at the same time, we can say that
$$$C(x) \equiv C_0(x) + i \, C_1(x) \pmod{x^n-i},$$$
thus we can recover $$$C_0$$$ and $$$C_1$$$ as the real and imaginary parts of $$$C(x) \bmod (x^n-i)$$$ correspondingly. Let's generalize this a bit. Assume that $$$A(x)$$$ and $$$B(x)$$$ are now both of length up to $$$2n$$$, thus we can say that
$$$A(x) = A_0(x) + x^n A_1(x), \qquad B(x) = B_0(x) + x^n B_1(x).$$$
What happens if we find the imaginary-cyclic convolution of $$$A(x)$$$ and $$$B(x)$$$ in this setup? That is, find $$$C(x)$$$ modulo $$$x^n-i$$$? Well, we can notice that
$$$C(x) \equiv (A_0(x) + i A_1(x))(B_0(x) + i B_1(x)) \pmod{x^n-i}.$$$
But at the same time, as $$$x^{2n}+1 = (x^n-i)(x^n+i)$$$ and $$$C(x)$$$ is real, it also holds that
$$$C(x) \equiv \operatorname{Re} P(x) + x^n \operatorname{Im} P(x) \pmod{x^{2n}+1}, \qquad \text{where } P(x) = C(x) \bmod (x^n-i).$$$
Amazing! This means that when $$$A(x)$$$ and $$$B(x)$$$ are real polynomials of length $$$n$$$, we can find their negacyclic convolution (i.e. the product modulo $$$x^n+1$$$) by computing just a single imaginary-cyclic convolution of length $$$\frac{n}{2}$$$, that is, the product modulo $$$x^{\frac{n}{2}}-i$$$!
Do you see why this is helpful yet? If we know $$$C(x) \equiv P(x) \pmod{x^n+1}$$$ and $$$C(x) \equiv Q(x) \pmod{x^n-1}$$$, we can recover $$$C(x) \equiv R(x) \pmod{x^{2n}-1}$$$ as
$$$R(x) = \frac{Q(x)+P(x)}{2} + x^n \cdot \frac{Q(x)-P(x)}{2}.$$$
In this way, to find $$$C(x)$$$, we can do the following:
- Find its remainder modulo $$$x^n+1$$$ via imaginary cyclic convolution of length $$$\frac{n}{2}$$$;
- Find its remainder modulo $$$x^n-1$$$ recursively;
- Combine them to get remainder modulo $$$x^{2n}-1$$$ via the formula above.
This way, all in all we will do a sequence of convolutions modulo $$$x^{\frac{n}{2}}-i$$$, then $$$x^{\frac{n}{4}}-i$$$, then $$$x^{\frac{n}{8}}-i$$$ and so on, which would allow us to further halve the maximum memory usage of complex-valued convolutions that we utilize.
So far so good! Except, $$$A(kx) B(kx)$$$ no longer commutes with finding convolution modulo $$$x^n+1$$$, does it? But, at the same time, we can notice that doing the $$$x \mapsto kx$$$ transform and inverting it in the end is effectively equivalent to computing a convolution modulo $$$x^n+k$$$.
This allows us to generalize the approach. Let $$$P(x)$$$ and $$$Q(x)$$$ be the remainders of $$$C(x)$$$ modulo $$$x^n + k^n$$$ and $$$x^n - k^n$$$ correspondingly, then

$$$$$$R(x) = \frac{P(x)+Q(x)}{2} + \frac{x^n}{k^n} \cdot \frac{Q(x)-P(x)}{2}$$$$$$
is the remainder of $$$C(x)$$$ modulo $$$x^{2n}-k^{2n}$$$. This way, we can utilize our memory reduction trick without sacrificing our input blinding approach! Also, if you're familiar with poly-mod approach to FFT, you would recognize that what we're doing here is, really, just using it on the top level (at which the roots are $$$\pm 1$$$), and falling back to FFT on lower levels.
Note: While the approach above, if done in-place, may reduce memory consumption by around 35%, it also seems to increase running time by around 10% (at least, in my implementation) due to the overhead in the linear part of the algorithm. This is why I only use it on large arrays, for which the only other option appears to be running Karatsuba to reduce memory, which would take up even more time.
Time management
Okay, and now, finally, the juiciest part. Naturally, we will need to vectorize and use SIMD here. So, before we delve into the deep details of vectorization, let's cover some low-hanging fruit.
Imaginary-cyclic convolution
When using complex numbers for FFT and applying it to real input, it always makes sense to try "packing" it into a half-length array by utilizing both real and imaginary parts. There are two basic approaches here: one is doing "2-in-1" FFT, and the other is utilizing the imaginary-cyclic convolution.
For "2-in-1" FFT, you can pack two arrays of reals into one array of complex numbers (the first array in real parts, the second in imaginary parts), and then, after passing it through FFT, you would be able to split the result back into the DFTs of the two separate arrays. This is #16 in my general ideas article from 8 years ago.
Nevertheless, nowadays I think that it's better to use the imaginary-cyclic convolution, because it is easier to integrate with the optimization that we will discuss next, and also because it's better suited for the inverse part of the transform, where we need to do $$$3$$$ inverse transformations. The "2-in-1" approach only reduces the number of inverse transformations from $$$3$$$ to $$$2$$$ of length $$$n$$$, while with the imaginary-cyclic convolution we still do $$$3$$$ inverse transformations, but of halved size, which should be better by around 25% on the inverse part.
Partial DFT
The next thing that we will need to utilize is also heavily related to the poly-mod approach to FFT, as described by pajenegod. The approach introduces the following meaning to the transformation: Let's say that we know the value of our polynomial modulo $$$x^n-c$$$. Then, we can easily find the remainders of the polynomial modulo $$$x^\frac{n}{2}-\sqrt c$$$ and $$$x^\frac{n}{2}+\sqrt c$$$.
Starting with $$$x^n-i$$$ at the top layer, and repeatedly doing this process, on the $$$k$$$-th FFT layer we will have the sequence of remainders modulo $$$x^{n/2^k}-\sqrt[2^k]{i}$$$, where $$$\sqrt[2^k]{i}$$$ is a placeholder meaning all possible $$$2^k$$$-th roots of $$$i$$$ in a specific order.
The order in question is $$$\omega, -\omega, i\omega, -i\omega, \sqrt{i} \omega, -\sqrt{i}\omega, \dots$$$, where $$$\omega$$$ is the primitive $$$2^k$$$-th root of $$$i$$$, and its multipliers are the $$$2^k$$$-th roots of $$$1$$$ which you would get if you start with $$$1$$$, and then recursively break down the current number into $$$\pm \sqrt{c}$$$, so we will have:
- $$$1$$$ and $$$-1$$$ on the first layer, the square roots of $$$1$$$;
- $$$1$$$ and $$$-1$$$ (square roots of $$$1$$$), $$$i$$$ and $$$-i$$$ (square roots of $$$-1$$$) on the second layer;
- $$$1, -1, i, -i, \sqrt{i}, -\sqrt{i}, i\sqrt i, -i\sqrt i$$$ on the third layer, and so on.
I mostly covered this part already in the "pre-computing roots of unity" section. So, now, knowing that on the $$$k$$$-th layer we have the remainders of $$$A(x)$$$ and $$$B(x)$$$ modulo $$$x^{n/2^k} - \sqrt[2^k]{i}$$$, what we can do is interrupt the computation at some layer and compute the product of the remainders naively, after which we reduce it modulo $$$x^{n/2^k}-\sqrt[2^k]{i}$$$ to get the remainder of $$$C(x)$$$ modulo this polynomial.
In my practical experiments, FFT itself generally seems to be very efficient, so it only makes sense to do this at the last $$$2$$$ layers, which would otherwise need a different vectorization approach.
Bitreverseless FFT
It goes without saying, but I will still state it in a dedicated section: we will not do any bit-reversal at all. It is terrible for cache and vectorization. Luckily, the poly-mod approach easily allows us to completely skip it by simply "undoing" the forward transformation in reverse order.
Radix-4 FFT
I mentioned this briefly in the "precision" section, and it makes sense to reiterate on it now. When doing regular FFT, we batch array elements in pairs, and apply the following butterfly-like transformations:
- $$$(a, b) \mapsto (a+b\omega, a-b\omega)$$$;
- $$$(a, b) \mapsto (a+b, \frac{a-b}{\omega})$$$.
We might still need them as a fallback, but for precision and memory access related reasons (see Qwerty1232's blog), it is more efficient to do butterfly transforms $$$2$$$ layers at a time. Utilizing the fact that in poly-mod based FFT, roots of unity come in consecutive blocks of $$$4$$$ that look like $$$\omega, -\omega, i\omega, -i\omega$$$, the specific formulas for the radix-4 transform look as follows:

$$$$$$(a, b, c, d) \mapsto (a + b\omega + c\omega^2 + d\omega^3,\; a - b\omega + c\omega^2 - d\omega^3,\; a + ib\omega - c\omega^2 - id\omega^3,\; a - ib\omega - c\omega^2 + id\omega^3).$$$$$$
Please note that the formulas might be slightly different in regular DIT/DIF FFT!
DFS transformation order
I will just refer you to Step F of Qwerty1232's blog, as it already covers this in sufficient detail. In my experience, it didn't have any effect at $$$n=2^{19}$$$, but it does sizeably improve the runtime on Convolution (Large).
Vectorization
Okay. We got to it. First of all, a brief intro in case you have no idea what vectorization is at all. This is really not something to be scared of! The core idea of vectorization, as I already described in my efficient linear algebra for dummies blog, is very similar to that of a bitset or a xor basis. In essence, modern CPUs have so-called wide registers that typically allow working with up to 256-bit types at once.
However, what sets them apart from regular registers is that the data in these "wide" registers is interpreted not as a single data type, but rather as a pack (or "vector") of smaller, 32- or 64-bit types. For example, if interpreted as a vector of 64-bit integers, the data stored in a 256-bit register would be equivalent to $$$x[0],\dots,x[3]$$$, where each $$$x[i]$$$ is a single 64-bit integer.
Correspondingly, the supported assembly operations on these registers typically operate on all of these packed values at once. For example, we can add two 64-bit integer vectors $$$x$$$ and $$$y$$$ to get a vector $$$x[0]+y[0], \dots, x[3]+y[3]$$$, all in just one assembly instruction.
The instructions themselves can typically be invoked directly via intrinsics that have convoluted, obscure names, such as _mm256_loadu_si256, etc. This is the part that typically scares away neophytes, and it certainly scared me away at first. Coincidentally, this is also what you will most likely find if you look into the solutions of other people who work with vector types because... Reasons?
Well, jokes aside, I asked people why they do that, and the common answer is that they want to have as much control as possible over what exactly the compiler will do. However, luckily for simple people like myself, there are much simpler and more intuitive ways.
Specifically, beyond directly using intrinsics, there are two distinct wrappers that we can use here:
- GNU vector extensions. As always, available in GCC and Clang, but probably not in MSVC.
- std::experimental::simd, a to-be-standardized wrapper in the standard library.
GNU vector extensions
As I have tried both of them, I ultimately converged on using GNU vector extensions for two main reasons:
- They appear to have better performance when the desired instruction set is unavailable, e.g. when running on Codeforces without #pragma GCC target("avx2");
- It's easier to apply intrinsics to them directly if we really have to.
Generally, vector extensions work in a surprisingly simple manner, essentially mirroring that of std::valarray. See the following example:
using v4si [[gnu::vector_size(4 * sizeof(int))]] = int;
v4si a, b, c;
a = b + 1; /* a = b + {1,1,1,1}; */
a = 2 * b; /* a = {2,2,2,2} * b; */
c = a + b; /* c = {a[0] + b[0], ..., a[3] + b[3]} */
In other words, we can generally just apply most regular arithmetic operations directly to the vector types, and they will mostly behave in the way we expect them to (i.e., doing component-wise operations).
To make our life a bit easier, they also provide the following functions:
- __builtin_convertvector(a, type) converts the vector value $$$a$$$ into another vector value of type type. The types should have the same number of elements, e.g. we can use this to convert a vector of int64_t into a vector of double, etc.
- __builtin_shufflevector(a, b, index...) treats the vectors $$$a$$$ and $$$b$$$ as if concatenated, and returns the elements specified by the index sequence. For example, if a is a vector of length 4, __builtin_shufflevector(a, a, 1, 2, 3, 0) will cyclically rotate its values and return the new value.
Note that direct cast from one vector type to another is essentially equivalent to reinterpret_cast, and must be used with care.
On top of that, the actual SIMD types used in intrinsics are typically defined internally by the compiler as extended vector types, meaning that, if the need arises, we may apply intrinsics directly to vector types (*possibly* having to do a conversion first).
Vectorizing complex operations
Vector extensions are defined for simple types, such as int, double, etc. Naturally, we cannot request a vectorized version of std::complex<double> directly. Thus comes the question: how exactly should we vectorize complex numbers?
The decision that I ultimately converged to is to define my own complex number type, because using std::complex with anything that is not a standard floating-point type is unspecified behavior, but besides that, it might have some issues with inlining, and it doesn't allow modifying the real or imag parts by reference, which is a serious concern when working with complex scalars.
After that, I defined a bunch of aliases for commonly used SIMD types:
template<typename T, size_t len>
using simd [[gnu::vector_size(len * sizeof(T))]] = T;
using i64x4 = simd<int64_t, 4>;
using u64x4 = simd<uint64_t, 4>;
using u32x8 = simd<uint32_t, 8>;
using i32x4 = simd<int32_t, 4>;
using u32x4 = simd<uint32_t, 4>;
using dx4 = simd<double, 4>;
and used cp_algo::complex<dx4> as the main type for Fourier transform. Over extended experimentation, I also figured that the following implementations are the most effective for certain operations that we might need along the way:
dx4 abs(dx4 a) {
    return a < 0 ? -a : a; // ?: on vector types is an element-wise select
}
// https://stackoverflow.com/a/77376595
// works for ints in (-2^51, 2^51)
static constexpr dx4 magic = dx4() + (3ULL << 51);
i64x4 lround(dx4 x) {
    return i64x4(x + magic) - i64x4(magic);
}
dx4 to_double(i64x4 x) {
    return dx4(x + i64x4(magic)) - magic;
}
inline dx4 round(dx4 a) {
    // per-lane rounding; compiles to a single vroundpd when available
    return dx4{
        std::nearbyint(a[0]),
        std::nearbyint(a[1]),
        std::nearbyint(a[2]),
        std::nearbyint(a[3])
    };
}
dx4 rotate_right(dx4 x) {
    // cyclic rotation: (x[0], x[1], x[2], x[3]) -> (x[3], x[0], x[1], x[2])
    static constexpr u64x4 shuffler = {3, 0, 1, 2};
    return __builtin_shuffle(x, shuffler);
}
These implementations are optimal in the sense that:
- If an appropriate instruction set is available, they will compile to use the minimum required number of intrinsics;
- If the instruction set is unavailable, I trust compiler to do the next best thing in compliance with what the functions should do.
Note that while the implementations might look simple enough, I had to check what they compile to on the compiler explorer, to make sure that they are equivalent to directly calling corresponding intrinsics, when those are available.
Implementing FFT in complex numbers
Combining all the above together, the actual implementation of FFT and iFFT is reasonably straightforward.
Probably the only part that is not as straightforward as simply doing arithmetic operations on $$$4$$$ consecutive complex numbers at once is the computation of the component-wise products modulo $$$x^{n/2^k}-\sqrt[2^k]{i}$$$. In my implementation, it is done in the following manner:
void dot(cvector const& t) {
    size_t n = this->size();
    exec_on_evals<1>(n / flen, [&](size_t k, point rt) {
        k *= flen;
        auto [Ax, Ay] = at(k);
        auto Bv = t.at(k);
        vpoint res = vz;
        for (size_t i = 0; i < flen; i++) {
            // accumulate A[i] * (x^i * B mod (x^flen - rt))
            res += vpoint(vz + Ax[i], vz + Ay[i]) * Bv;
            // multiply Bv by x modulo x^flen - rt: cyclic shift right...
            real(Bv) = rotate_right(real(Bv));
            imag(Bv) = rotate_right(imag(Bv));
            // ...and multiply the wrapped-around element by rt
            auto x = real(Bv)[0], y = imag(Bv)[0];
            real(Bv)[0] = x * real(rt) - y * imag(rt);
            imag(Bv)[0] = x * imag(rt) + y * real(rt);
        }
        set(k, res);
    });
    checkpoint("dot");
}
In other words, we repeatedly shift the second remainder to the right, multiplying its first element by $$$\sqrt[2^k]{i}$$$, and sum up all the intermediate products. I will have to be rather brief when describing this, and hope that the code is somewhat self-explanatory.
Conversion between modular and complex numbers
This part is, frankly, quite painful, and an optimal implementation of it requires using integer-specific vectorization approaches that we have tried so hard to avoid so far.
The main issue with the vectorization of integer operations is that, until AVX512 (which is not available on most online judges), many standard operations, such as even the product of two vectors of 64-bit integers, have no corresponding intrinsics.
Naturally, the same goes for reducing numbers modulo $$$M$$$, which would typically be done using Barrett reduction in the scalar versions of the corresponding operations: there is no vectorized version of it when the number being reduced is initially 64-bit.
The only saving grace for us is the _mm256_mul_epu32 intrinsic, which takes two vectors of 64-bit integers, ignores the higher 32 bits of each individual integer, multiplies the lower halves as full 64-bit numbers, and returns the results as another vector of 64-bit integers.
It is only thanks to this one operation that we are able to vectorize Montgomery multiplication, in the following manner:
u64x4 montgomery_reduce(u64x4 x, u64x4 mod, u64x4 imod) {
    // per-lane product of the low 32-bit halves: x * imod mod 2^32
    auto x_ninv = u64x4(u32x8(x) * u32x8(imod));
#ifdef __AVX2__
    x += u64x4(_mm256_mul_epu32(__m256i(x_ninv), __m256i(mod)));
#else
    x += x_ninv * mod;
#endif
    // the low 32 bits of x are now zero, so this is an exact division by 2^32
    return x >> 32;
}
u64x4 montgomery_mul(u64x4 x, u64x4 y, u64x4 mod, u64x4 imod) {
#ifdef __AVX2__
    return montgomery_reduce(u64x4(_mm256_mul_epu32(__m256i(x), __m256i(y))), mod, imod);
#else
    return montgomery_reduce(x * y, mod, imod);
#endif
}
Note the non-__AVX2__ part. It is designed to be exactly equivalent to the __AVX2__ part in terms of the actual result. This way, you can hopefully see much more easily that even vectorized Montgomery multiplication is not as "magic" as it could appear from looking at other people's solutions; essentially, it just does the same thing as its scalar version.
Unfortunately, I still require the #ifdef, because while the two versions are identical in their actual output, I couldn't find any way at all to hint the compiler to use _mm256_mul_epu32 instead of emulating the component-wise product in full 64-bit arithmetic.
I'd be very grateful if someone could help with this!
Finally, the actual part that handles the conversion between modular and complex numbers is quite messy, but it still implements the approaches described above in a somewhat straightforward manner, using the snippet above for Montgomery multiplication when I really need to compute products modulo $$$M$$$ (for example, when doing the $$$A(kx) B(kx)$$$ substitution and its inverse).
Conclusion
Well, that took much more effort to write up than I initially anticipated. I really hope that it will be useful to someone, and maybe reignite interest in complex-number-based FFT at least for some people in the community, as I think it is a beautiful algorithm and concept that is neglected and avoided far more than it really deserves.
If you actually have read even a few paragraphs from the text above, thank you very much, you're my hero!