### Qualified's blog

By Qualified, history, 4 years ago,

I see this article and at the bottom, it says that modulus operators are expensive so they implemented a slightly faster version of Euclidean Algorithm. Why not make a more efficient mod?

int mod(int a, int b) { // computes a % b;
return (a - b * (a / b));
}

• +87

| Write comment?
 » 4 years ago, # |   +72 afaik / and % are expensive compared to + and *
 » 4 years ago, # |   +219 Cause / is approximately as expensive as %.
 » 4 years ago, # |   +130 Actually on x86 division instruction DIV (or IDIV for signed integers) computes both quotient and reminder, just stores them in different registers. So obviously your "more efficient" version can't work faster than a % b;
 » 4 years ago, # |   +32 I liked your blog! As explained in the comments, apparently this would not be faster, but I agree with you that mod operator are famously said to be the ones that are slow, I never stopped to think about the division operator, that should be as slow too! I’m sorry that your blog got downvoted. It made me have a better understanding of predicting the run time of a code
•  » » 4 years ago, # ^ |   -16 Your welcome. :D
 » 4 years ago, # | ← Rev. 2 →   0 Intel wants to know your location.......On a serious note, it will be helpful to see not them as O(1) operations on 32 bit but asymptotic complexity on variable number of bits. Then you'll appreciate that modulus isn't much of a different problem than division.
 » 4 years ago, # |   -48 You can do a binary lifting/binary search mod operation. I really don’t know whether it’s faster or not.
 » 4 years ago, # | ← Rev. 2 →   +60 https://godbolt.org/z/7W35Me Spoilerint mod0(int a, int b) { return a % b; } int mod1(int a, int b) { return a - b * (a / b); }  mod0(int, int): mov eax, edi cdq idiv esi mov eax, edx ret mod1(int, int): mov eax, edi cdq idiv esi mov eax, edx ret 
 » 4 years ago, # |   0 You could get a speed up sometimes by doing this:a = a >= b ? a % b : a;The more versatile option that always works is, write assembly instructions to perform the modulus operation. It gives quite a bit of speed up.
•  » » 4 years ago, # ^ | ← Rev. 2 →   +4 If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?
•  » » » 4 years ago, # ^ | ← Rev. 2 →   0 Not sure. But the key is to always do a benchmark. We cannot trust compilers to do magic all the time can we? Anyway, I have personally tried using the assembly trick before and it worked pretty well.
•  » » » » 4 years ago, # ^ |   +4 Can you share your results? Would be nice to see the methodology and the final numbers to understand better.
•  » » » » » 4 years ago, # ^ |   +95 bruh your shirt is orange again
•  » » » » » » 4 years ago, # ^ |   +3 When will your shirt become red?
•  » » » » » » » 4 years ago, # ^ |   0 I don't paint my shirts
•  » » » » » 4 years ago, # ^ | ← Rev. 4 →   +65 Disclaimer: Please try not to bash me for the sample size. I don't have time to record data for different problems (but I have tried the trick on several other problems before). Hence, I have only presented one below. You may perform your own benchmarks too. Lastly, the assembly code does not belong to me, I shamelessly peeled it off Kaist's online ICPC team notebook).Firstly, here is a sample problem from CF.To give some context, I read about a failing submission for this problem from Petr's blog due to a large number of modulo operations. In particular, this sentence: maroonrk's C failed due to trying to squeeze $10^8$ modulo operations in the time limitThe optimization(s) I only modified this line of code in the main function (which was identified by Petr to be causing the TLE): codeint d=(v+m-i)%m; Note that I have used 2 optimizations (I call them optimization 1 and optimization 2 below):Optimization 1 refers to writing modulo in the form a = a >= b ? a % b : a; Optimization 2 refers to writing modulo in assembly code. The benchmarksThe original code which uses modulo operation naively. Result: TLE on test 49 (>1000ms).Using optimization 1 with C++11: Code. Result: TLE on test 51 (>1000ms)Using optimization 1 + 2 with C++11: Code. Result: TLE on test 51 (>1000ms)Using optimization 1 with C++17 (32-bit): Code. Result: TLE on test 51 (>1000ms)Using optimization 1 + 2 with C++17 (32-bit): Code. Result: AC (982ms). Due to the closeness of the runtime to the time limit, I submitted twice to be sure. Both submissions yielded the same runtime.Using optimization 1 with C++17 (64-bit): Code. Result: TLE on test 51 (>1000ms)Using optimization 1 + 2 with C++17 (64-bit): Code. Result: AC (545ms)Oh, yes. Here is the modulo subroutine in case you are interested to test it out yourself: Codeinline void fasterLLDivMod(unsigned long long x, unsigned y, unsigned &out_d, unsigned &out_m) { unsigned xh = (unsigned)(x >> 32), xl = (unsigned)x, d, m; #ifdef __GNUC__ asm( "divl %4 \n\t" : "=a" (d), "=d" (m) : "d" (xh), "a" (xl), "r" (y) ); #else __asm { mov edx, dword ptr[xh]; mov eax, dword ptr[xl]; div dword ptr[y]; mov dword ptr[d], eax; mov dword ptr[m], edx; }; #endif out_d = d; out_m = m; } inline unsigned mod(unsigned long long x, unsigned y) { unsigned dummy, r; fasterLLDivMod(x, y, dummy, r); return r; } 
•  » » » 4 years ago, # ^ |   +7 If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too? Because you waste time on comparison and branching. Similarly, it isn't easy to say if sort() should first check if the sequence is already sorted and then finish in $O(n)$.
•  » » » » 4 years ago, # ^ | ← Rev. 2 →   0 You are talking about the conditional if a >= b version that LanceTheDragonTrainer said sometimes works. I was asking about the assembly version that LTDT said always works.
•  » » » » » 4 years ago, # ^ |   +14 right, sorry
•  » » » 4 years ago, # ^ |   +19 The compiler has to make sure to produce code that works correctly for all possible int values, we don't. In particular for this case I believe there are some odd corner cases if you allow numbers to be negative.Just tell the compiler that you are modding unsigned integers, and you get a code that runs at around the same speed (slightly faster, even) than Lance's assembly version: 88112311
•  » » 4 years ago, # ^ |   -9 FYI Branches are much more expensive than integer division/modulo operators.
•  » » » 4 years ago, # ^ |   +17 Implementing addition of two values $a, b \in [0, P-1]$ as return a+b
•  » » » » 4 years ago, # ^ | ← Rev. 2 →   +13 Benchmarks: 14700593 CPU ticks with a+b

constexpr unsigned P = 1e8; unsigned f(unsigned a, unsigned b) { //return (a + b) % P; return a + b < P ? a + b : a + b - P; } int main() { srandom(0); const auto t0 = clock(); unsigned s = 0; for (size_t i = 0; i < 1000000000; ++i) { s += f(random() % P, 1 + random() % P); } const auto t1 = clock(); std::cout << s << '\n'; std::cout << (t1 - t0) << '\n'; return 0; } 

•  » » » » » 4 years ago, # ^ |   +26 Your solution spends most time on computing random() % P and that includes computing that random value. Running your program multiple times gave me inconsistent results but the x?y:z version was faster by a few percents usually.The x?y:z version is more than twice faster if it's really a bottleneck of a solution https://ideone.com/8m0qWb (0.56s vs. 1.46s)
•  » » » » » » 4 years ago, # ^ |   -15 Well, your solution spends most time on data access :) Actually it does not matter where most time is spent if it is the same for both versions because you always can subtract it from total times and compare the rests. BTW, I think we need another test.
•  » » » » » » 4 years ago, # ^ |   -14 Kamil, unfortunately your test code cannot be used for benchmarking branch-misses (details under spoilers). (a+b)%P\$ perf stat ./a.out 979631356 Performance counter stats for './a.out': 1,231.31 msec task-clock # 1.000 CPUs utilized 2 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 1,091 page-faults # 0.886 K/sec 5,087,658,741 cycles # 4.132 GHz 6,370,937,337 instructions # 1.25 insn per cycle 471,001,344 branches # 382.520 M/sec 204,400 branch-misses # 0.04% of all branches  (a+b)
•  » » » » 4 years ago, # ^ |   0 Branches are not expensive if there's a pattern that branch predictor can learn. While implementing addition most of the time the result will not overflow so a predictor which outputs false would be good enough for you.There are lot of things at play like speculative execution and other low level CPU stuff. Modern CPU are quite complicated to make a rule of thumb.
•  » » 3 weeks ago, # ^ |   0 this doesn't always produce correct result
 » 4 years ago, # | ← Rev. 2 →   +3 I don't know if this works, but it might help. It's a fast way to reduce a%b under some loose constraints (Barret Reduction).
•  » » 4 years ago, # ^ |   0 So smart your link is!
•  » » 4 years ago, # ^ | ← Rev. 2 →   +10 slightly faster than origin %: 827ms vs 643ms in my computer #include #define watch(x) std::cout << (#x) << " is " << (x) << std::endl using LL = long long; constexpr LL M = 1e9 + 7; constexpr int k = std::__lg(M) + 2; constexpr LL m = (1LL << k) / M; const int N = 1e8 + 2; LL fac[N]; void init1(){ fac[0] = 1; for (int i = 1; i < N; ++i) fac[i] = fac[i - 1] * i % M; } void init2() { auto mod = [&](LL &a) { LL r = a - ((a * m) >> k) * M; if (r >= M) r -= M; }; fac[0] = 1; for (int i = 1; i < N; ++i) mod(fac[i] = fac[i - 1] * i); } int main() { //freopen("in","r",stdin); std::ios::sync_with_stdio(false); std::cin.tie(nullptr); auto start1 = std::chrono::high_resolution_clock::now(); init1(); auto end1 = std::chrono::high_resolution_clock::now(); std::cout << "Time used: " << std::chrono::duration_cast(end1 - start1).count() << " (ms)" << std::endl; auto start2 = std::chrono::high_resolution_clock::now(); init2(); auto end2 = std::chrono::high_resolution_clock::now(); std::cout << "Time used: " << std::chrono::duration_cast(end2 - start2).count() << " (ms)" << std::endl; return 0; } 
•  » » » 4 years ago, # ^ |   0 Won't faster... sorry
•  » » » 4 years ago, # ^ |   +1 Since init1() and init2() generate different results, there is no sense to measure runtime I think.
•  » » » » 4 years ago, # ^ |   0 Thanks ~
 » 4 years ago, # |   +13
 » 4 years ago, # |   0 Well, I see a quite different scenario with python. In [1]: %timeit (503043530435 % 232039042) Out[1]: 8.24 ns ± 0.207 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each) In [2]: %timeit mod(503043530435, 232039042) Out[2]: 231 ns ± 8.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) I used IPython's %timeit to calculate the time difference and found mod(a, b) to be more expensive than %.
•  » » 4 years ago, # ^ |   -12 Can you check the runtime of this int mod(int a, int b) { return a >= b ? a % b : a; } BTW, this was proposed by LTDT
•  » » » 4 years ago, # ^ |   +8 This is usually quite unhelpful because most of the time in the worst case mods are required at every step, and if you are at the point where this is the difference between AC and TLE, then it is probably better to remove unnecessary mods (i. e. just mod once after several additions, rather than after each one). This also has overhead caused by the potential branch (which, in general, should probably always be used). The article linked by sys. provides a small speedup, but for most cases is probably not necessary.