How to Optimize C Code for NTT -- POLYMUL on SPOJ

№	Пользователь	Рейтинг
1	Benq	3792
2	VivaciousAubergine	3647
3	Kevin114514	3611
4	jiangly	3583
5	strapple	3515
6	tourist	3470
7	dXqwq	3436
8	Radewoosh	3415
9	Otomachi_Una	3413
10	Um_nik	3376

№	Пользователь	Вклад
1	Qingyu	163
2	adamant	150
3	Um_nik	146
4	Dominater069	144
5	errorgorn	141
6	cry	139
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
9	TheScrasse	134

On SPOJ, there is a problem named POLYMUL, which is pretty straightforward: Multiply two polynomials, which have integer coefficients in the interval $$$[0, 1000]$$$ and can each have a degree of up to $$$10000$$$.

Currently, I am trying to solve this problem with NTT and my solution can be found here. As the solution shows, I am doing NTT with the primes of both $$$998244353$$$ and $$$2013265921$$$, and then using Chinese Remainder Theorem in order to find the true coefficients without any modulo (I am assuming the coefficients are less than $$$1000^2\cdot 10000=10^{10}$$$, which is much less than the product of $$$998244353$$$ and $$$2013265921$$$). On my machine, it takes about 3 seconds for this solution to solve a test case with $$$T=10$$$ (i.e. $$$10$$$ pairs of polynomial) where each polynomial in each pair is of degree $$$10000$$$ and the coefficients are integers randomly picked from the interval $$$[0, 1000]$$$.

However, the time limit is 1 second, and I do not know what to do to optimize my solution more. Does any one have any suggestions on how to optimize my NTT algorithm or how to optimize my C code in general? Any solutions would be very welcome. Thanks!

Комментарии (4)

Показать архивные | Написать комментарий?

Chilli

7 лет назад, скрыть # |

← Rev. 3 →

+27

A couple different observations, in decreasing order of "how much does this matter".

First off, you don't even need NTT for this problem (since there are no modulos). You can just use regular FFT.

Second, your NTT is recursive and isn't based off bit reversals. That'll make it much slower.

Third, you have too many mod operations. If you start off with 2 values that have already been modded, you can do a+b%MOD = (a+b>=MOD ? a+b-MOD : a+b). Similarly for subtract.

Fourth, you're using 64 bit integers. Throughout most of the computation, those are unnecessary. The only operation you need 64 bit integers for is multiplication. Only upcast then.

You can see my implementation here: https://gist.github.com/Chillee/4982ba0840745f63f3771bd84f280557#file-ntt-cpp

I might write a blog post sometime about writing a fast FFT for CP.

→ Ответить

RezwanArefin01

7 лет назад, скрыть # ^ |

Adding another one: You are calculating powers of root of unity on the fly, while you can pre-calculate all of them before hand, before calling NTT.

Noble_Mushtak

Thank you both for your suggestions. First, I implemented just Rezwan's suggestion and got accepted in 0.62 seconds: https://pastebin.com/raw/JatXqYJE

Then, I implemented both your second and third suggestions and got accepted in 0.18 seconds: https://pastebin.com/raw/bgLtgqP3

However, why do you suggest using regular FFT over NTT? Personally, I am confused about the advantages of one over the other, and I mostly just used NTT because it seemed easier to implement. Is FFT faster than NTT? When would you choose using FFT over NTT and vice versa?

FFT is faster, but has the potential to run into precision issues with large enough inputs. NTT works entirely in integers so has no precision issues, but is slower and requires certain specific mods. In this case, both work because the coefficients never become too big.

Блог пользователя Noble_Mushtak