### pajenegod's blog

By pajenegod, history, 21 month(s) ago,

Hi CF! During this past weekend I was reading up on Montgomery transformation, which is a really interesting and useful technique to do fast modular multiplication. However, all of the explanations I could find online felt very unintuitive for me, so I decided to write my own blog on the subject. A big thanks to kostia244, nor, nskybytskyi and -is-this-fft- for reading this blog and giving me some feedback =).

### Fast modular multiplication

Let $P=10^9+7$ and let $a$ and $b$ be two numbers in $[0,P)$. Our goal is to calculate $a \cdot b \, \% \, P$ without ever actually calling $\% \, P$. This is because calling $\% \, P$ is very costly.

If you haven't noticed that calling $\% \, P$ is really slow, then the reason you haven't noticed it is likely because the compiler automatically optimizes away the $\% \, P$ call if $P$ is known at compile time. But if $P$ is not known at compile time, then the compiler will have to call $\% \, P$, which is really really slow.

### Montgomery reduction of $a \cdot b$

It turns out that the trick to calculate $a \cdot b \, \% \, P$ efficently is to calculate $a \cdot b \cdot 2^{-32} \, \% \, P$ efficiently. So the goal for this section will be to figure out how to calculate $a \cdot b \cdot 2^{-32} \, \% \, P$ efficently. $a \cdot b \cdot 2^{-32} \, \% \, P$ is called the Montgomery reduction of $a \cdot b$, denoted by $\text{m_reduce}(a \cdot b)$.

#### Idea (easy case)

Suppose that $a \cdot b$ just happens to be divisible by $2^{32}$. Then $(a \cdot b \cdot 2^{-32}) \, \% \, P = (a \cdot b) \gg 32$, which runs super fast!

#### Idea (general case)

Can we do something similar if $a \cdot b$ is not divisible by $2^{32}$? The answer is yes! The trick is to find some integer $m$ such that $(a \cdot b + m \cdot P)$ is divisible by $2^{32}$. Then $a \cdot b \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \gg 32$.

So how do we find such an integer $m$? We want $(a \cdot b + m \cdot P) \,\%\, 2^{32} = 0$ so $m = (-a \cdot b \cdot P^{-1}) \,\%\, 2^{32}$. So if we precalculate $(-P^{-1}) \,\%\, 2^{32}$ then calculating $m$ can be done blazingly fast.

### Montgomery transformation

Since the Montgomery reduction divides $a \cdot b$ by $2^{32}$, we would like some some way of multiplying by $2^{32}$ modulo $P$. The operation $x \cdot 2^{32} \, \% \, P$ is called the Montgomery transform of $x$, denoted by $\text{m_transform}(x)$.

The trick to implement $\text{m_transform}$ efficently is to make use of the Montgomery reduction. Note that $\text{m_transform}(x) = \text{m_reduce}(x \cdot (2^{64} \, \% \, P))$, so if we precalculate $2^{64} \, \% \, P$, then $\text{m_transform}$ also runs blazingly fast.

### Montgomery multiplication

Using $\text{m_reduce}$ and $\text{m_transform}$ there are multiple different ways of calculating $a \cdot b \, \% \, P$ effectively. One way is to run $\text{m_transform}(\text{m_reduce}(a \cdot b))$. This results in two calls to $\text{m_reduce}$ per multiplication.

Another common way to do it is to always keep all integers transformed in the so called Montgomery space. If $a' = \text{m_transform}(a)$ and $b' = \text{m_transform}(b)$ then $\text{m_transform}(a \cdot b \, \% \, P) = \text{m_reduce}(a' \cdot b')$. This effectively results in one call to $\text{m_reduce}$ per multiplication, however you now have to pay to move integers in to and out of the Montgomery space.

### Example implementation

Here is a Python 3.8 implementation of Montgomery multiplication. This implementation is just meant to serve as a basic example. Implement it in C++ if you want it to run fast.

P = 10**9 + 7
r = 2**32
r2 = r * r % P
Pinv = pow(-P, -1, r) # (-P^-1) % r

def m_reduce(ab):
m = ab * Pinv % r
return (ab + m * P) // r

def m_transform(a):
return m_reduce(a * r2)

# Example of how to use it
a = 123456789
b = 35
a_prim = m_transform(a) # mult a by 2^32
b_prim = m_transform(b) # mult b by 2^32
prod_prim = m_reduce(a_prim * b_prim) # divide a' * b' by 2^32
prod = m_reduce(prod_prim) # divide prod' by 2^32
print('%d * %d %% %d = %d' % (a, b, P, prod)) # prints 123456789 * 35 % 1000000007 = 320987587


### Final remarks

One important issue that I've so far swept under the rug is that the output of m_reduce is actually in $[0, 2 P)$ and not $[0, P)$. I just want end by discussing this issue. I can see two ways of handling this:

• Alternative 1. You can force $\text{m_reduce}(a \cdot b)$ to be in $[0, P)$ for $a$ and $b$ in $[0, P)$ by adding an if-stament to the output of m_reduce. This will work for any odd integer $P < 2^{31}$.
Fixed implementation of m_reduce
• Alternative 2. Assuming $P$ is an odd integer $< 2^{30}$ then if $a$ and $b$ $\in [0, 2 P)$ you can show that the output of $\text{m_reduce}(a \cdot b)$ is also in $[0,2 P)$. So if you are fine working with $[0, 2 P) \vphantom]$ everywhere then you don't need any if-statements. Nyaan's github has a nice C++ implementation of Montgomery multiplication using this style of implementation.
• +92

 » 21 month(s) ago, # |   0 Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
 » 21 month(s) ago, # |   +42 I could read and understand all of what you wrote in 10 minutes. Learning something new so quickly is rare, at least for me. Your explanation is very clear and easy to follow, congratulations.Do you happen to have a comparison of the performances of Montgomery multiplication and Barrett reduction for computing $a\cdot b\ \%\ P$? At the very beginning (in "Idea (easy case)"), when you mention $a\cdot b\cdot 2^{-32}\ \%\ P$ for the first time, I had to stop and think about what you meant with $2^{-32}$. You may want to write explicitly that it represent the inverse of $2^{32}$ modulo $P$.
•  » » 3 months ago, # ^ |   0 How could you understand all of this in 10 minutes?I am still stuck at the very top line It turns out that the trick to calculate a⋅b % P efficently is to calculate a⋅b⋅2^(−32) % P efficiently. Don't see where this comes from at all.
 » 21 month(s) ago, # | ← Rev. 2 →   +22 The "mod-switching" or "integer division under mod" trick is useful in more mathematical ways too. One way it's useful is computing $a^{-1} \mod n$ for composite $n$. def inv(a, n): if(a == 1): return 1 return (1 - inv(n % a, a) * n) / a + n This works because you can just divide by $a$ normally, even in the space $\mod n$ given that the numerator is divisible by $a$ inv_mod_m = [1] * n for i in range(2, n): inv_mod_m[i] = (m - m / i) * inv[m % i] % m This works because $(m - m \% i) \times \text{inv}[m \% i] \equiv -1 \mod m$$\frac{m - m\%i}{i} \times \text{inv}[m\%i] = -i^{-1} \mod m$$\lfloor\frac{m}{i}\rfloor\text{inv}[m\%i] = -i^{-1} \mod m$$(m - \lfloor\frac{m}{i}\rfloor) \times \text{inv}[m\%i] = i^{-1} \mod m$Essentially, we needed an integer $x = 1 \mod m$ and $x = 0 \mod i$. We know $(m - m \% i)$ is $0 \mod i$, and we also know it's inverse already. So then $(m - m \% i) \times (m - m \% i)^{-1}$ is $1 \mod m$ and $0 \mod i$ as we needed.
•  » » 15 months ago, # ^ |   0 We can use extended euclidean algorithm right? Is this faster compared to ExtendedGcd?
•  » » » 15 months ago, # ^ |   0 We can. This is both faster and slower.
•  » » » » 15 months ago, # ^ |   0 when it is slower?
•  » » » » » 15 months ago, # ^ | ← Rev. 4 →   0 The given algorithm precomputes first n modular inverses in $O(n)$, while gcd takes about $O(\log{n})$ iterations for a specific value.
•  » » » 15 months ago, # ^ |   0
 » 3 months ago, # | ← Rev. 8 →   0 Pa.Nic told me this interesting trick involving Montgomery multiplication a while back:When implementing Pollard's rho algorithm for factorization, you can use Montgomery multiplication to implement the pseudorandom step function $g(x)$, and transformation is not necessary. (As long as $N$, the number you're factoring, is odd, but you can easily purge all factors of $2$ first)Why? Turns out there are two arguments: $\gcd(2^w, N) = 1 \implies \gcd (a, N) = \gcd (a \cdot 2^w\ \text{ mod } N, N)$ Pollard's rho is actually not very sensitive to the choice of the step function $g(x)$; the primary requirements are that it is done modulo $N$ (so that it's simultaneously modulo $d$ for any of the divisors you're searching for), and that it is non-linear/non-bijective. Therefore $g(x) = 2^{-w} \cdot x^2 + C\ \text{ mod } N$ or even $g(x) = 2^{-w} (x^2 + C)\ \text{ mod } N$ is about as good as $g(x) = x^2 + C\ \text{ mod } N$. ($C$ can be chosen randomly to discourage/prevent brute-force searching for unusually long sequences)
•  » » 3 months ago, # ^ | ← Rev. 3 →   0 That is a good point. This is reminiscent of another technique that I've used in the past to speed up Pollard's rho, where you greately reduce the number of gcd computations. For example the kactl Pollard's rho implementation only computes gcd once every 40 iterations. The reason for this is that computing gcd is expensive and you don't need to check after every iteration if you've found a non-trivial factor of n or not. It suffices to check the product of potential non-trivial factors only every once in a while.
•  » » » 3 months ago, # ^ |   0 I've also used this trick, and Montgomery multiplication can also be used for the intermediate product, as extraneous factors of $2^{-w}$ don't affect the gcd.