Блог пользователя Qualified

Автор Qualified, история, 4 года назад, По-английски

I see this article and at the bottom, it says that modulus operators are expensive so they implemented a slightly faster version of Euclidean Algorithm. Why not make a more efficient mod?

int mod(int a, int b) { // computes a % b;
	return (a - b * (a / b));
}
  • Проголосовать: нравится
  • +87
  • Проголосовать: не нравится

»
4 года назад, # |
  Проголосовать: нравится +72 Проголосовать: не нравится

afaik / and % are expensive compared to + and *

»
4 года назад, # |
  Проголосовать: нравится +219 Проголосовать: не нравится

Cause / is approximately as expensive as %.

»
4 года назад, # |
  Проголосовать: нравится +130 Проголосовать: не нравится

Actually on x86 division instruction DIV (or IDIV for signed integers) computes both quotient and reminder, just stores them in different registers. So obviously your "more efficient" version can't work faster than a % b;

»
4 года назад, # |
  Проголосовать: нравится +32 Проголосовать: не нравится

I liked your blog! As explained in the comments, apparently this would not be faster, but I agree with you that mod operator are famously said to be the ones that are slow, I never stopped to think about the division operator, that should be as slow too! I’m sorry that your blog got downvoted. It made me have a better understanding of predicting the run time of a code

»
4 года назад, # |
Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

Intel wants to know your location.......

On a serious note, it will be helpful to see not them as O(1) operations on 32 bit but asymptotic complexity on variable number of bits. Then you'll appreciate that modulus isn't much of a different problem than division.

»
4 года назад, # |
  Проголосовать: нравится -48 Проголосовать: не нравится

You can do a binary lifting/binary search mod operation. I really don’t know whether it’s faster or not.

»
4 года назад, # |
Rev. 2   Проголосовать: нравится +60 Проголосовать: не нравится

https://godbolt.org/z/7W35Me

Spoiler
»
4 года назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

You could get a speed up sometimes by doing this:

a = a >= b ? a % b : a;

The more versatile option that always works is, write assembly instructions to perform the modulus operation. It gives quite a bit of speed up.

  • »
    »
    4 года назад, # ^ |
    Rev. 2   Проголосовать: нравится +4 Проголосовать: не нравится

    If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?

    • »
      »
      »
      4 года назад, # ^ |
      Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

      Not sure. But the key is to always do a benchmark. We cannot trust compilers to do magic all the time can we? Anyway, I have personally tried using the assembly trick before and it worked pretty well.

      • »
        »
        »
        »
        4 года назад, # ^ |
          Проголосовать: нравится +4 Проголосовать: не нравится

        Can you share your results? Would be nice to see the methodology and the final numbers to understand better.

        • »
          »
          »
          »
          »
          4 года назад, # ^ |
            Проголосовать: нравится +95 Проголосовать: не нравится

          bruh your shirt is orange again

        • »
          »
          »
          »
          »
          4 года назад, # ^ |
          Rev. 4   Проголосовать: нравится +65 Проголосовать: не нравится

          Disclaimer: Please try not to bash me for the sample size. I don't have time to record data for different problems (but I have tried the trick on several other problems before). Hence, I have only presented one below. You may perform your own benchmarks too. Lastly, the assembly code does not belong to me, I shamelessly peeled it off Kaist's online ICPC team notebook).

          Firstly, here is a sample problem from CF.

          To give some context, I read about a failing submission for this problem from Petr's blog due to a large number of modulo operations. In particular, this sentence:

          maroonrk's C failed due to trying to squeeze $$$10^8$$$ modulo operations in the time limit

          The optimization(s)
          I only modified this line of code in the main function (which was identified by Petr to be causing the TLE):

          code

          Note that I have used 2 optimizations (I call them optimization 1 and optimization 2 below):

          Optimization 1 refers to writing modulo in the form a = a >= b ? a % b : a;
          Optimization 2 refers to writing modulo in assembly code.


          The benchmarks

          The original code which uses modulo operation naively. Result: TLE on test 49 (>1000ms).

          Using optimization 1 with C++11: Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 + 2 with C++11: Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 with C++17 (32-bit): Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 + 2 with C++17 (32-bit): Code. Result: AC (982ms). Due to the closeness of the runtime to the time limit, I submitted twice to be sure. Both submissions yielded the same runtime.

          Using optimization 1 with C++17 (64-bit): Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 + 2 with C++17 (64-bit): Code. Result: AC (545ms)

          Oh, yes. Here is the modulo subroutine in case you are interested to test it out yourself:

          Code
    • »
      »
      »
      4 года назад, # ^ |
        Проголосовать: нравится +7 Проголосовать: не нравится

      If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?

      Because you waste time on comparison and branching. Similarly, it isn't easy to say if sort() should first check if the sequence is already sorted and then finish in $$$O(n)$$$.

      • »
        »
        »
        »
        4 года назад, # ^ |
        Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

        You are talking about the conditional if a >= b version that LanceTheDragonTrainer said sometimes works. I was asking about the assembly version that LTDT said always works.

    • »
      »
      »
      4 года назад, # ^ |
        Проголосовать: нравится +19 Проголосовать: не нравится

      The compiler has to make sure to produce code that works correctly for all possible int values, we don't. In particular for this case I believe there are some odd corner cases if you allow numbers to be negative.

      Just tell the compiler that you are modding unsigned integers, and you get a code that runs at around the same speed (slightly faster, even) than Lance's assembly version: 88112311

  • »
    »
    4 года назад, # ^ |
      Проголосовать: нравится -9 Проголосовать: не нравится

    FYI Branches are much more expensive than integer division/modulo operators.

    • »
      »
      »
      4 года назад, # ^ |
        Проголосовать: нравится +17 Проголосовать: не нравится

      Implementing addition of two values $$$a, b \in [0, P-1]$$$ as return a+b<P ? a+b : a+b-P; is actually faster than (a+b)%P.

      • »
        »
        »
        »
        4 года назад, # ^ |
        Rev. 2   Проголосовать: нравится +13 Проголосовать: не нравится

        Benchmarks:
        14700593 CPU ticks with a+b<P ? ... : ...
        11126168 CPU ticks with just (a+b) % P

        Code
        • »
          »
          »
          »
          »
          4 года назад, # ^ |
            Проголосовать: нравится +26 Проголосовать: не нравится

          Your solution spends most time on computing random() % P and that includes computing that random value. Running your program multiple times gave me inconsistent results but the x?y:z version was faster by a few percents usually.

          The x?y:z version is more than twice faster if it's really a bottleneck of a solution https://ideone.com/8m0qWb (0.56s vs. 1.46s)

          • »
            »
            »
            »
            »
            »
            4 года назад, # ^ |
              Проголосовать: нравится -15 Проголосовать: не нравится

            Well, your solution spends most time on data access :)

            Actually it does not matter where most time is spent if it is the same for both versions because you always can subtract it from total times and compare the rests.

            BTW, I think we need another test.

          • »
            »
            »
            »
            »
            »
            4 года назад, # ^ |
              Проголосовать: нравится -14 Проголосовать: не нравится

            Kamil, unfortunately your test code cannot be used for benchmarking branch-misses (details under spoilers).

            (a+b)%P
            (a+b)<P
      • »
        »
        »
        »
        4 года назад, # ^ |
          Проголосовать: нравится 0 Проголосовать: не нравится

        Branches are not expensive if there's a pattern that branch predictor can learn. While implementing addition most of the time the result will not overflow so a predictor which outputs false would be good enough for you.

        There are lot of things at play like speculative execution and other low level CPU stuff. Modern CPU are quite complicated to make a rule of thumb.

  • »
    »
    6 месяцев назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    this doesn't always produce correct result

»
4 года назад, # |
Rev. 2   Проголосовать: нравится +3 Проголосовать: не нравится

I don't know if this works, but it might help. It's a fast way to reduce a%b under some loose constraints (Barret Reduction).

  • »
    »
    4 года назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    So smart your link is!

  • »
    »
    4 года назад, # ^ |
    Rev. 2   Проголосовать: нравится +10 Проголосовать: не нравится

    slightly faster than origin %: 827ms vs 643ms in my computer

    #include <bits/stdc++.h>
    #define watch(x) std::cout << (#x) << " is " << (x) << std::endl
    using LL = long long;
    constexpr LL M  = 1e9 + 7;
    constexpr int  k = std::__lg(M) + 2;
    constexpr LL m = (1LL << k) / M;
    
    const int N = 1e8 + 2;
    LL fac[N];
    void init1(){
    	fac[0] = 1;
    	for (int i = 1; i < N; ++i) fac[i] = fac[i - 1] * i % M;
    }
    void init2() {
    	auto mod = [&](LL &a) {
    		LL r = a - ((a * m) >> k) * M;
    		if (r >= M) r -= M;
    	};
    	fac[0] = 1;
    	for (int i = 1; i < N; ++i) mod(fac[i] = fac[i - 1] * i);
    }
    int main() {
    	//freopen("in","r",stdin);
    	std::ios::sync_with_stdio(false);
    	std::cin.tie(nullptr);
    
    	auto start1 = std::chrono::high_resolution_clock::now();
    	init1();
    	auto end1 = std::chrono::high_resolution_clock::now();
    	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count() << " (ms)" << std::endl;
    
    	auto start2 = std::chrono::high_resolution_clock::now();
    	init2();
    	auto end2 = std::chrono::high_resolution_clock::now();
    	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count() << " (ms)" << std::endl;
    
    	return 0;
    }
    
»
4 года назад, # |
  Проголосовать: нравится +13 Проголосовать: не нравится
»
4 года назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

Well, I see a quite different scenario with python.

In [1]: %timeit (503043530435 % 232039042)
Out[1]: 8.24 ns ± 0.207 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In [2]: %timeit mod(503043530435, 232039042)
Out[2]: 231 ns ± 8.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I used IPython's %timeit to calculate the time difference and found mod(a, b) to be more expensive than %.

  • »
    »
    4 года назад, # ^ |
      Проголосовать: нравится -12 Проголосовать: не нравится

    Can you check the runtime of this

    int mod(int a, int b) {
        return a >= b ? a % b : a;
    }
    

    BTW, this was proposed by LTDT

    • »
      »
      »
      4 года назад, # ^ |
        Проголосовать: нравится +8 Проголосовать: не нравится

      This is usually quite unhelpful because most of the time in the worst case mods are required at every step, and if you are at the point where this is the difference between AC and TLE, then it is probably better to remove unnecessary mods (i. e. just mod once after several additions, rather than after each one).

      This also has overhead caused by the potential branch (which, in general, should probably always be used). The article linked by sys. provides a small speedup, but for most cases is probably not necessary.