If mod(%) are so expensive why not make own modulus? - Codeforces

Qualified's blog

By Qualified, 4 years ago, In English

I saw this article, and at the bottom it says that modulus operators are expensive, so they implemented a slightly faster version of the Euclidean algorithm. Why not write a more efficient mod?

int mod(int a, int b) { // computes a % b;
	return (a - b * (a / b));
}

Comments (38)

quadraticreciprocity

4 years ago, # | +72

afaik / and % are expensive compared to + and *

4 years ago, # | +219

Because / is approximately as expensive as %.

4 years ago, # | +130

Actually, on x86 the division instruction DIV (or IDIV for signed integers) computes both the quotient and the remainder and just stores them in different registers. So your "more efficient" version can't be faster than a % b;
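To illustrate the comment above: the C++ standard library exposes the "one division, two results" idea through `std::div`. The sketch below is only an illustration; the helper name `quot_rem` is made up for this example, and whether a single `idiv` is actually emitted depends on the compiler, target, and optimization level.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>

// Hypothetical helper (not from the thread): returns {a / b, a % b}.
// On x86, one IDIV leaves the quotient in EAX and the remainder in EDX,
// so both values can typically be served by a single division.
std::pair<int, int> quot_rem(int a, int b) {
    assert(b != 0);                         // division by zero is undefined behaviour
    const std::div_t qr = std::div(a, b);   // quotient truncates toward zero
    return {qr.quot, qr.rem};
}
```

For any valid pair, `a - b * (a / b)` and `a % b` are the same value by definition, which is consistent with the identical assembly shown further down the thread.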

4 years ago, # | +32

I liked your blog! As explained in the comments, this apparently would not be faster, but I agree with you that the mod operator is famously said to be slow. I never stopped to think that the division operator should be just as slow! I'm sorry that your blog got downvoted. It gave me a better understanding of how to predict the running time of code.

4 years ago, # ^ | -16

You're welcome. :D

4 years ago, # | 0

Intel wants to know your location.......

On a serious note, it helps to view these not as O(1) operations on 32 bits but in terms of asymptotic complexity on a variable number of bits. Then you'll appreciate that modulus isn't much of a different problem from division.

4 years ago, # | -48

You can do a binary lifting/binary search mod operation. I really don’t know whether it’s faster or not.
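One way to read the "binary lifting" idea above is the classic shift-and-subtract remainder. The sketch below is an assumption about what the commenter meant, and the name `mod_by_doubling` is made up for this example. It runs up to 32 compare-and-subtract steps for 32-bit inputs, so it is very unlikely to beat a hardware division.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shift-and-subtract remainder for 32-bit unsigned values.
// 64-bit temporaries are used so that doubling the divisor cannot overflow.
std::uint32_t mod_by_doubling(std::uint32_t a, std::uint32_t b) {
    assert(b != 0);
    std::uint64_t r = a;
    std::uint64_t d = b;
    int shifts = 0;
    // Find the largest b * 2^shifts that does not exceed a.
    while ((d << 1) <= r) {
        d <<= 1;
        ++shifts;
    }
    // Walk back down, subtracting each multiple that still fits.
    for (; shifts >= 0; --shifts, d >>= 1) {
        if (r >= d) r -= d;
    }
    return static_cast<std::uint32_t>(r);
}
```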

dmitry.dolgopolov

4 years ago, # | +60

https://godbolt.org/z/7W35Me

Spoiler

int mod0(int a, int b)
{
    return a % b;
}

int mod1(int a, int b)
{
	return a - b * (a / b);
}

mod0(int, int):
        mov     eax, edi
        cdq
        idiv    esi
        mov     eax, edx
        ret
mod1(int, int):
        mov     eax, edi
        cdq
        idiv    esi
        mov     eax, edx
        ret

LanceTheDragonTrainer

4 years ago, # | 0

You could get a speed up sometimes by doing this:

a = a >= b ? a % b : a;

The more versatile option, which always works, is to write assembly instructions to perform the modulus operation. It gives quite a bit of speedup.
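The first suggestion above can be sketched as a small helper. This is a minimal sketch under the assumption that `a` is non-negative and `b` is positive; the name `mod_skip` is made up for this example. For a negative `a` the shortcut returns `a` unchanged, which can differ from `a % b` (for instance, `-7 % 5` is `-2` in C++).

```cpp
#include <cassert>

// Hypothetical helper: skip the division when a is already reduced.
// Only valid for a >= 0 and b > 0; see the note about negative values above.
int mod_skip(int a, int b) {
    assert(a >= 0 && b > 0);
    return a >= b ? a % b : a;
}
```

Whether this helps depends on how often the division is actually skipped and on how predictable the comparison is, so it is worth measuring on the real workload.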

4 years ago, # ^ | +4

If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?

LanceTheDragonTrainer

4 years ago, # ^ | 0

Not sure. But the key is to always benchmark. We cannot trust compilers to do magic all the time, can we? Anyway, I have personally tried the assembly trick before and it worked pretty well.

4 years ago, # ^ | +4

Can you share your results? Would be nice to see the methodology and the final numbers to understand better.

4 years ago, # ^ | +95

bruh your shirt is orange again

4 years ago, # ^ | +3

When will your shirt become red?

4 years ago, # ^ | 0

I don't paint my shirts

LanceTheDragonTrainer

4 years ago, # ^ | +65

Disclaimer: Please try not to bash me for the sample size. I don't have time to record data for different problems (but I have tried the trick on several other problems before). Hence, I have only presented one below. You may perform your own benchmarks too. Lastly, the assembly code is not mine; I shamelessly lifted it from KAIST's online ICPC team notebook.

Firstly, here is a sample problem from CF.

To give some context, I read about a failing submission for this problem from Petr's blog due to a large number of modulo operations. In particular, this sentence:

maroonrk's C failed due to trying to squeeze $$$10^8$$$ modulo operations in the time limit

The optimization(s)
I only modified this line of code in the main function (which was identified by Petr to be causing the TLE):

code

int d=(v+m-i)%m;

Note that I have used 2 optimizations (I call them optimization 1 and optimization 2 below):

Optimization 1 refers to writing modulo in the form a = a >= b ? a % b : a;
Optimization 2 refers to writing modulo in assembly code.

The benchmarks

The original code which uses modulo operation naively. Result: TLE on test 49 (>1000ms).

Using optimization 1 with C++11: Code. Result: TLE on test 51 (>1000ms)

Using optimization 1 + 2 with C++11: Code. Result: TLE on test 51 (>1000ms)

Using optimization 1 with C++17 (32-bit): Code. Result: TLE on test 51 (>1000ms)

Using optimization 1 + 2 with C++17 (32-bit): Code. Result: AC (982ms). Due to the closeness of the runtime to the time limit, I submitted twice to be sure. Both submissions yielded the same runtime.

Using optimization 1 with C++17 (64-bit): Code. Result: TLE on test 51 (>1000ms)

Using optimization 1 + 2 with C++17 (64-bit): Code. Result: AC (545ms)

Oh, yes. Here is the modulo subroutine in case you are interested to test it out yourself:

Code

inline void fasterLLDivMod(unsigned long long x, unsigned y, unsigned &out_d, unsigned &out_m) {
    unsigned xh = (unsigned)(x >> 32), xl = (unsigned)x, d, m;
    #ifdef __GNUC__
    asm(
        "divl %4 \n\t"
        : "=a" (d), "=d" (m)
        : "d" (xh), "a" (xl), "r" (y)
    );
    #else
    __asm {
        mov edx, dword ptr[xh];
        mov eax, dword ptr[xl];
        div dword ptr[y];
        mov dword ptr[d], eax;
        mov dword ptr[m], edx;
    };
    #endif
    out_d = d; out_m = m;
}
inline unsigned mod(unsigned long long x, unsigned y) {
    unsigned dummy, r;
    fasterLLDivMod(x, y, dummy, r);
    return r;
}

4 years ago, # ^ | +7

If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?

Because you waste time on comparison and branching. Similarly, it isn't easy to say if sort() should first check if the sequence is already sorted and then finish in $$$O(n)$$$.

4 years ago, # ^ | 0

You are talking about the conditional if a >= b version that LanceTheDragonTrainer said sometimes works. I was asking about the assembly version that LTDT said always works.

4 years ago, # ^ | +14

right, sorry

4 years ago, # ^ | +19

The compiler has to produce code that works correctly for all possible int values; we don't. In particular, I believe this case has some odd corner cases if numbers are allowed to be negative.

Just tell the compiler that you are modding unsigned integers, and you get code that runs at about the same speed as (even slightly faster than) Lance's assembly version: 88112311
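As an illustration of the unsigned suggestion above, here is a hedged rewrite of the line `int d=(v+m-i)%m;` quoted earlier in the thread. It is not taken from submission 88112311; the function name `shifted_index` and the stated input ranges are assumptions made for this sketch.

```cpp
#include <cassert>

// Hypothetical rewrite of `int d = (v + m - i) % m;` with unsigned operands.
// Assumes m > 0, v >= 0 and 0 <= i <= v + m, so the value never goes negative.
int shifted_index(int v, int m, int i) {
    assert(m > 0 && v >= 0 && i >= 0 && i <= v + m);
    const unsigned um = static_cast<unsigned>(m);
    const unsigned x = static_cast<unsigned>(v) + um - static_cast<unsigned>(i);
    // With unsigned operands the compiler does not have to handle negative
    // dividends, so on x86 it can typically use DIV instead of CDQ + IDIV.
    return static_cast<int>(x % um);
}
```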

4 years ago, # ^ | -9

FYI Branches are much more expensive than integer division/modulo operators.

4 years ago, # ^ | +17

Implementing addition of two values $$$a, b \in [0, P-1]$$$ as return a+b<P ? a+b : a+b-P; is actually faster than (a+b)%P.
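The conditional-subtraction addition described above can be sketched as follows. The function name `add_mod` and the bound `P <= 2^31` (so that `a + b` cannot wrap in 32 bits) are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Modular addition without a division.
// Assumes 0 <= a, b < P and P <= 2^31, so a + b fits in 32 bits.
std::uint32_t add_mod(std::uint32_t a, std::uint32_t b, std::uint32_t P) {
    assert(a < P && b < P);
    const std::uint32_t s = a + b;   // s is in [0, 2P - 2]
    return s < P ? s : s - P;        // at most one subtraction is ever needed
}
```

Because both inputs are already reduced, the sum is below `2P`, which is why a single comparison replaces the `%`.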

4 years ago, # ^ | +13

Benchmarks:
14700593 CPU ticks with a+b<P ? ... : ...
11126168 CPU ticks with just (a+b) % P

Code

#include <bits/stdc++.h>

constexpr unsigned P = 1e8;

unsigned f(unsigned a, unsigned b)
{
    //return (a + b) % P;
    return a + b < P ? a + b : a + b - P;
}

int main()
{
    srandom(0);

    const auto t0 = clock();

    unsigned s = 0;
    for (size_t i = 0; i < 1000000000; ++i) {
        s += f(random() % P, 1 + random() % P);
    }

    const auto t1 = clock();

    std::cout << s << '\n';
    std::cout << (t1 - t0) << '\n';

    return 0;
}

4 years ago, # ^ | +26

Your solution spends most of its time computing random() % P, and that includes generating the random value. Running your program multiple times gave me inconsistent results, but the x?y:z version was usually faster by a few percent.

The x?y:z version is more than twice as fast when it really is the bottleneck of a solution: https://ideone.com/8m0qWb (0.56s vs. 1.46s)
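One way to keep random-number generation out of the measurement, as the comment above suggests, is to generate the operands first and time only the loop under test. The sketch below is not the code behind the ideone link; all names (`make_inputs`, `sum_with_mod`, `sum_with_branch`, `time_ms`) are made up for this example, and it assumes C++17 for structured bindings.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

constexpr std::uint32_t P = 100000000;  // same modulus as the benchmark above

using Inputs = std::vector<std::pair<std::uint32_t, std::uint32_t>>;

// Generate operands in [0, P) up front so RNG cost stays out of the timed loop.
Inputs make_inputs(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> dist(0, P - 1);
    Inputs v(n);
    for (auto& p : v) p = {dist(rng), dist(rng)};
    return v;
}

std::uint32_t sum_with_mod(const Inputs& v) {
    std::uint32_t s = 0;
    for (const auto& [a, b] : v) s += (a + b) % P;
    return s;
}

std::uint32_t sum_with_branch(const Inputs& v) {
    std::uint32_t s = 0;
    for (const auto& [a, b] : v) s += (a + b < P ? a + b : a + b - P);
    return s;
}

// Wall-clock time of one call in milliseconds; the checksum goes to `out`
// so the compiler cannot discard the loop.
template <class F>
double time_ms(F f, const Inputs& v, std::uint32_t& out) {
    const auto t0 = std::chrono::steady_clock::now();
    out = f(v);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Comparing the two checksums guards against timing variants that compute different things. Note that with a large input vector the memory traffic can itself become a significant part of the measured time.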

4 years ago, # ^ | -15

Well, your solution spends most time on data access :)

Actually, it does not matter where most of the time is spent as long as it is the same for both versions, because you can always subtract it from the total times and compare what remains.

BTW, I think we need another test.

4 years ago, # ^ | -14

Kamil, unfortunately your test code cannot be used for benchmarking branch-misses (details under spoilers).

(a+b)%P

$ perf stat ./a.out 
979631356

 Performance counter stats for './a.out':

          1,231.31 msec task-clock                #    1.000 CPUs utilized          
                 2      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             1,091      page-faults               #    0.886 K/sec                  
     5,087,658,741      cycles                    #    4.132 GHz                    
     6,370,937,337      instructions              #    1.25  insn per cycle         
       471,001,344      branches                  #  382.520 M/sec                  
           204,400      branch-misses             #    0.04% of all branches

(a+b)<P

$ perf stat ./a.out 
979631356

 Performance counter stats for './a.out':

            333.17 msec task-clock                #    0.999 CPUs utilized          
                 5      context-switches          #    0.015 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             1,088      page-faults               #    0.003 M/sec                  
     1,502,169,022      cycles                    #    4.509 GHz                    
     3,675,878,460      instructions              #    2.45  insn per cycle         
       470,814,463      branches                  # 1413.130 M/sec                  
           164,599      branch-misses             #    0.03% of all branches

4 years ago, # ^ | 0

Branches are not expensive if there is a pattern that the branch predictor can learn. When implementing addition, the result will not overflow most of the time, so a predictor that always outputs false would be good enough.

There are a lot of things at play, such as speculative execution and other low-level CPU details. Modern CPUs are too complicated for a simple rule of thumb.
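When the comparison is not predictable, one option is to remove the branch from the source altogether. The sketch below is a swapped-in illustration rather than anything proposed in the thread; the name `add_mod_branchless` is made up, and it assumes `0 <= a, b < P` with `P <= 2^31`. Compilers often turn the plain ternary into a conditional move anyway, so this may make no measurable difference.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free modular addition using a mask.
// Assumes 0 <= a, b < P and P <= 2^31, so a + b fits in 32 bits.
std::uint32_t add_mod_branchless(std::uint32_t a, std::uint32_t b, std::uint32_t P) {
    assert(a < P && b < P);
    const std::uint32_t s = a + b;
    // mask is all ones when s >= P and zero otherwise.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(s >= P);
    return s - (P & mask);
}
```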

7 months ago, # ^ | 0

This doesn't always produce the correct result.

4 years ago, # | +3

I don't know if this works, but it might help. It's a fast way to reduce a%b under some loose constraints (Barrett reduction).

4 years ago, # ^ | 0

So smart your link is!

4 years ago, # ^ | +10

Slightly faster than the original %: 827 ms vs. 643 ms on my computer.

#include <bits/stdc++.h>
#define watch(x) std::cout << (#x) << " is " << (x) << std::endl
using LL = long long;
constexpr LL M  = 1e9 + 7;
constexpr int  k = std::__lg(M) + 2;
constexpr LL m = (1LL << k) / M;

const int N = 1e8 + 2;
LL fac[N];
void init1(){
	fac[0] = 1;
	for (int i = 1; i < N; ++i) fac[i] = fac[i - 1] * i % M;
}
void init2() {
	auto mod = [&](LL &a) {
		LL r = a - ((a * m) >> k) * M;
		if (r >= M) r -= M;
	};
	fac[0] = 1;
	for (int i = 1; i < N; ++i) mod(fac[i] = fac[i - 1] * i);
}
int main() {
	//freopen("in","r",stdin);
	std::ios::sync_with_stdio(false);
	std::cin.tie(nullptr);

	auto start1 = std::chrono::high_resolution_clock::now();
	init1();
	auto end1 = std::chrono::high_resolution_clock::now();
	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count() << " (ms)" << std::endl;

	auto start2 = std::chrono::high_resolution_clock::now();
	init2();
	auto end2 = std::chrono::high_resolution_clock::now();
	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count() << " (ms)" << std::endl;

	return 0;
}

4 years ago, # ^ | 0

It won't be faster... sorry

dmitry.dolgopolov

4 years ago, # ^ | +1

Since init1() and init2() produce different results, I think there is no point in comparing their runtimes.
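Two likely reasons for the mismatch: the lambda in init2() computes `r` but never stores it back into `a`, and with `k = 31` the precomputed `m` equals 2, which is too coarse an estimate of `2^k / M` for products near 10^17. Below is a hedged sketch of a Barrett-style reduction that does agree with `%`. It is a common formulation rather than the code from the linked article; it relies on `unsigned __int128`, a GCC/Clang extension, and the struct name `Barrett` is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Barrett-style reduction for a fixed modulus (sketch, GCC/Clang only).
struct Barrett {
    std::uint64_t mod;  // the modulus M, assumed 1 <= M < 2^63
    std::uint64_t inv;  // floor((2^64 - 1) / M), computed once

    explicit Barrett(std::uint64_t m) : mod(m), inv(~std::uint64_t{0} / m) {
        assert(m >= 1);
    }

    // Returns a % mod for any 64-bit a, using a multiplication instead of a division.
    std::uint64_t reduce(std::uint64_t a) const {
        // q is either floor(a / mod) or one less than that.
        const std::uint64_t q = static_cast<std::uint64_t>(
            (static_cast<unsigned __int128>(a) * inv) >> 64);
        std::uint64_t r = a - q * mod;  // r is in [0, 2 * mod)
        if (r >= mod) r -= mod;         // one correction step is enough
        return r;                       // the caller must store this result
    }
};
```

In a factorial loop like the one above, the usage would be along the lines of `fac[i] = bm.reduce(fac[i - 1] * i)` with `Barrett bm(M)` constructed once outside the loop.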

4 years ago, # ^ | 0

Thanks ~

4 years ago, # | +13

Barrett Reduction

4 years ago, # | 0

Well, I see a quite different scenario with python.

In [1]: %timeit (503043530435 % 232039042)
Out[1]: 8.24 ns ± 0.207 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In [2]: %timeit mod(503043530435, 232039042)
Out[2]: 231 ns ± 8.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I used IPython's %timeit to calculate the time difference and found mod(a, b) to be more expensive than %.

4 years ago, # ^ | -12

Can you check the runtime of this

int mod(int a, int b) {
    return a >= b ? a % b : a;
}

BTW, this was proposed by LTDT

4 years ago, # ^ | +8

This is usually not very helpful, because in the worst case a mod is required at every step most of the time. If you are at the point where this is the difference between AC and TLE, it is probably better to remove unnecessary mods (i.e. mod once after several additions rather than after each one).

This also has overhead caused by the potential branch (which, in general, should probably always be used). The article linked by sys. provides a small speedup, but for most cases is probably not necessary.
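The "mod once after several additions" advice can be sketched like this. The modulus `M`, the function name `sum_mod_lazy`, and the input type are assumptions made for this example; the point is only that a 64-bit accumulator has plenty of headroom for values that are already below `M`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t M = 1000000007;

// Sum of values that are each already below M, reducing only once at the end.
// Safe while count * (M - 1) < 2^64, i.e. up to roughly 1.8e10 terms for this M.
std::uint64_t sum_mod_lazy(const std::vector<std::uint32_t>& xs) {
    std::uint64_t acc = 0;
    for (std::uint32_t x : xs) acc += x;  // no % inside the loop
    return acc % M;                       // a single reduction
}
```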
