Hi everyone!
There is an article on cp-algorithms about how to compute n! modulo prime number p. Today I was wondering, how to do it if we need to compute modulo pk rather than just p. Turns out one can do it efficiently if p is small.
Motivational example
Xiaoxu Guo Contest 3 — Binomial Coefficient. Given 1≤n,k≤1018, find \binom{n}{k} modulo 2^{32}.
There are also some projecteuler problems that require it, including with several queries of distinct n and k.
Task formulation and outline of result
To clarify the task a bit, our ultimate goal here is to be able to compute e.g. binomial coefficients modulo p^k. Thus, what we really need is to represent n! = p^t a, where \gcd(a, p)=1, and then report t and a modulo p^k. This is sufficient to also compute \binom{n}{r} modulo p^k.
We will show that, assuming that polynomial multiplication of size n requires O(n \log n) operations, and assuming that arithmetic operations modulo p^k take O(1), we can find t and a in O(d^2 + dk\log k), where d=\log_p n. It requires O(pk\log^2 k) pre-computation that takes O(pk \log k) memory.
Algorithm when p^k is also small
First, let's solve a simpler version of this problem.
Library Judge — Binomial Coefficient. Given n, k \leq 10^{18} and m \leq 10^6, print \binom{n}{k} \bmod m for T \leq 2 \cdot 10^5 test cases.
Let's denote n!_p as the product of all numbers from 1 to n that are co-prime with p. That is,
In this notion, assuming k=\lfloor \frac{n}{p} \rfloor, we can represent n! as n! = p^k k! n!_p, which expands recursively into
On practice, we only care about i \leq \log_p n, and we may pre-compute all possible values of a!_p modulo p^k in advance in O(p^k) for all a < p^k, which then allows us to compute the full formula in O(d), where d = \log_p n. To do this, we note that
and (p^k-1)!_p \equiv -1 \pmod{p^k} for p > 2 and p^k = 4. For other values with p=2 it is 1 instead of -1.
Please find the implementation in Python below for details:
It works 5.5s on the aforementioned Library Judge problem.
Algorithm with polynomials
Great thanks to Endagorion for sharing this algorithm with me!
Let's look further into the formulas from the section above. There is a part contributed by numbers divisible by p, which can be reduced to computing k!, and a part contributed by numbers not divisible by p. Let n = a_0 + a_1 p + \dots + a_d p^d, then
To compute it quickly, we can define a family of polynomial P_{i,a}(x) for 0 \leq a \leq p such that
so the value of the factorial would be represented as
and expanding k! it then rewrites into
which simplifies as
Now, what would it take us to use this setup? First of all, notice that P_{i,a}(x) can be computed from one another:
Note that for shorter and more consistent implementation, this recurrent formula also mostly works for i=1 if we put P_{0, p}(x) = x+1, but for P_{1,p} we should go up to p-2 instead of p-1. We should also note that for P_{i,a}(x), we only care for coefficients up to x^{\lfloor x/i \rfloor}, as the larger ones are divisible by p^k.
This allows to compute P_{i,a}(x) in O(\frac{pk \log k}{i}) for all a for a given i. Over all i from 1 to k it sums up to O(pk \log^2 k) time and O(p k \log k) memory. Then, evaluating all the polynomials for any specific a requires O(d^2 + dk \log k) operations, where d = \log_p n.
As it requires some manipulations with polynomials, I implemented it in Python with sympy just as a proof of concept:
Algorithm with p-adic logarithms and exponents
The algorithm above requires some heavy machinery on polynomials. I also found out an alternative approach that allows to compute t and a for any given n divisible by p in O(dk^2) with O(pk) precomputation, where d = \log_p n. It doesn't rely on polynomial operations, except for Lagrange interpolation.
Let n=pt+b, where 0 \leq b < p, then we can represent its factorial as
We can further rewrite it as
Let's learn how to compute the product
as with it we can represent the factorial as
We can take a p-adic logarithm (please refer to this article for definitions and properties) from the product to get
We can precompute sums of i^{-z} for each z up to k and b up to p-1 in O(pk), and we can find the sum of j^z for j from 0 to t in O(z) via Lagrange interpolation. Therefore, with O(pk) precomputation, we can compute \log A(b, t) in O(k^2), which then allows to find n! in O(dk^2), where d = \log_p n. I implemented it in Python, as a proof of concept (incl. fast sums over i and j):
Note: Due to inherent properties of p-adic logarithms and exponents, the algorithm only works with p>2, and it might return -n! instead with p=2. This is because \exp \log z = -z with p=2 when z has remainder 3 modulo 4. It is accounted for in the code by tracking the number of integers below n that have remainder 3 modulo 4.
I also tested the solution on the motivational problem and got AC in Python: submission.
Some other approaches
There is also a blog by prabowo, based apparently on another blog by Min_25, and a scholarly article by Andrew Granville. I'm not sure whether they talk about the same algorithms as described here.