Qualified's blog

By Qualified, history, 4 years ago, In English

Disclaimer: make yourself comfortable before reading this. I was intrigued by the earlier discussion on sieves and decided to benchmark some of them. I benchmarked two implementations, with surprising differences in running time.

One of them can be found at https://cp-algorithms.com/algebra/sieve-of-eratosthenes.html, under "Different optimizations of the Sieve of Eratosthenes"; the optimization in question is sieving only up to the root. The other is the linear sieve from Nisiyama_Suzune's blog https://mirror.codeforces.com/blog/entry/54090, under the section "Linear sieve"; it is a direct copy-paste from that blog.

Now, here comes the surprising part. https://cp-algorithms.com/algebra/prime-sieve-linear.html, under the section "Runtime and Memory", says that "the optimized versions of the sieve run as fast as the algorithm given here." (The optimized version is exactly the one I used in my benchmark.) In other words, it claims the optimized sieve is effectively as fast as the linear one. Yet when I benchmarked both functions, the optimized sieve from cp-algorithms ran more than 2x faster than the linear sieve from the CF blog. Why is this? Both sieves are supposedly linear, yet one is more than 2x faster than the other, which is pretty weird.

Here is the benchmark: https://ideone.com/MfaJCF, with N = 1e8, which is pretty big. For smaller N, like 1e7 or 1e5, the difference is even larger: for N = 1e7 the cp-algorithms sieve is more than 3x faster than the CF blog implementation (https://ideone.com/b7jjyO; as you can see, the only difference between the two benchmarks is the value of N), and for N = 1e6 it is 4.56x faster. I haven't tried N = 1e5 or smaller yet, but I imagine the difference there is massive. I know this is pretty long; happy reading.

Using the other, unimproved implementation from the same cp-algorithms article produces surprising results too: the CF blog's linear sieve and the $$$O(n \log \log n)$$$ implementation from cp-algorithms have basically the same execution time. This makes me wonder: is the CF implementation of the linear sieve actually linear?

What are your thoughts about this?


»
4 years ago, # |

Auto comment: topic has been updated by Qualified (previous revision, new revision, compare).

»
4 years ago, # |
Rev. 3

Have you tried declaring is_composite as: vector<bool> is_composite(N + 1, false);

My guess is that asking for the memory and then doing the fill might have something to do with the performance. Also, vector::size() is known to be terribly slow; have you tried keeping a primeCount variable, increasing it by one whenever you push onto the prime vector, and using that instead of prime.size()?

Also, a % b is really slow, so I would recommend avoiding it.

I remember reading about the linear sieve and implementing my own version of it. I don't remember where I based it on, or I would give credit (I did make some modifications to make it run faster, though).

f[x] holds the smallest prime that divides x. The sieve is linear because we visit every number exactly once, only via the smallest prime that divides it: when we are at number i, we already know its smallest prime factor, so we multiply i by every prime less than or equal to it and mark those products. Another way of thinking about it is that x will only ever be marked by x / f[x].

#define maxN 100000000

// global arrays are zero-initialized; f[x] will hold the smallest prime factor of x
int pc = 0, f[maxN], primes[maxN / 10];
FOR(i, 2, maxN) {
    if (!f[i]) primes[pc++] = i, f[i] = i;  // i is prime
    // cross out i * p exactly once, for each prime p <= f[i]
    for (int j = 0; j < pc && 1LL * i * primes[j] < maxN && primes[j] <= f[i]; j++)
        f[i * primes[j]] = primes[j];
}

And at least on my computer, this code is 5x faster than the oldsieve code you had.

Let me know if this is faster for you as well!

Edit: fixed formatting. I also forgot to mention that I make f a global variable, so that it is initialized to zeros; otherwise you have to set it to zeros manually.

  • »
    »
    4 years ago, # ^ |

    What is FOR defined as in your code? Is it for(int i = 2; i <= maxN; ++i) or for(int i = 2; i < maxN; ++i)?

    • »
      »
      »
      4 years ago, # ^ |
Rev. 2

Forgot to change it to a standard C++ for loop xD

      I use this in my template:

      #define FOR(i, a, b) for (int i = int(a); i < int(b); i++)
      
      • »
        »
        »
        »
        4 years ago, # ^ |

Yeah, this implementation really is faster than the $$$O(n \log \log n)$$$ implementation. Testing with N = 1e9, I couldn't even measure the time; it just output 0. It's fast. Thanks for clearing up the confusion. :)

  • »
    »
    4 years ago, # ^ |

Okay, so I was testing your implementation of the linear sieve, and it seems that the reason it runs so much faster is that you use arrays. Take a look at these two benchmarks: https://ideone.com/i5nu1P and https://ideone.com/N70vwb. The first link is your implementation with arrays, and it obviously runs much faster than the cp-algorithms implementation. Now take a look at the second link: the function newsieve1 is your implementation, but with vectors instead of arrays, just like what I used in oldsieve. There, the difference in runtime is small. Do you have any thoughts about this?

    • »
      »
      »
      4 years ago, # ^ |

The compiler was able to determine that the array version, as it appears in the context of your test, cannot have any visible effects, so it actually optimized the entire function body away. It isn't really performing that computation in 0 seconds.

      • »
        »
        »
        »
        4 years ago, # ^ |

That's true. You can make the function return the last prime it found, and that should fix the problem.

    • »
      »
      »
      4 years ago, # ^ |

It just means that asking for memory is very slow and accounts for most of the execution time. You would need a much bigger N (maybe bigger than 10^10) for the allocation constant to become small enough, relative to the log log N factor, to make a visible impact on the execution time.

This is the reason why it's very common to see most people use arrays with a size big enough to handle every case in the problem, rather than vectors: they are faster by a constant factor (up to 3x in the tests I did back in 2016). In this case, using a global array is way faster because asking for the memory happens before the function runs and is therefore not counted by the timer.

Most problems have N big enough that you can't just optimize the constants of an inefficient solution and make it pass. However, in older ICPC contests I've done exactly that: for example, one problem had N = 10^4 and required an N log N solution, but my N^2/64 solution barely got accepted. Nowadays you would see a minimum of N = 10^5 to avoid this, and constant optimization is no longer considered a required skill for most contests, but IMO it's useful for real-life software engineering, so it's worth trying to see how to make code faster.

»
4 years ago, # |

Auto comment: topic has been updated by Qualified (previous revision, new revision, compare).