sslotin's blog

By sslotin, 5 years ago, In Russian

https://algorithmica.org/en/eytzinger

This will probably be an hour-long read about the CPU memory model that explains 10 lines of code at the very end.

The following code is 4 to 5 times faster than calling std::lower_bound over a sorted array:

#pragma GCC optimize("O3")
#include <bits/stdc++.h>

using namespace std;

const int n = (1<<20);
const int block_size = 16; // 16 ints = 64 bytes = one cache line

// b[1..n] holds the sorted array a[] rearranged into the Eytzinger (BFS) order:
// the children of node k are nodes 2k and 2k+1; b[0] is unused.
alignas(64) int a[n], b[n+1];

// In-order traversal of the implicit tree: node k receives the next element of a[].
int eytzinger(int i = 0, int k = 1) {
    if (k <= n) {
        i = eytzinger(i, 2 * k);
        b[k] = a[i++];
        i = eytzinger(i, 2 * k + 1);
    }
    return i;
}

// Returns the Eytzinger index of the smallest element >= x (0 if there is none).
int search(int x) {
    int k = 1;
    while (k <= n) {
        // prefetch the cache line holding all 16 descendants four levels below k
        __builtin_prefetch(b + k * block_size);
        k = 2 * k + (b[k] < x); // go right if b[k] < x, left otherwise
    }
    // k encodes the descent path bit by bit (0 = left, 1 = right);
    // cancel the trailing right turns to land on the last node where we went left
    k >>= __builtin_ffs(~k);
    return k;
}

tl;dr: it uses a segtree-like cache-friendly array reordering that allows trading off memory bandwidth for latency via prefetching.

This is technically not a drop-in replacement, since it requires some preprocessing, but I can't recall a lot of scenarios where you obtain a sorted array but can't spend linear time on preprocessing.
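
For reference, here is a minimal usage sketch (my addition, assuming it is appended to the snippet above, so it reuses a, b, n, eytzinger and search): it fills a[] with sorted data, runs the linear-time preprocessing once, and cross-checks a few queries against std::lower_bound by value, since search() returns indices in the Eytzinger numbering rather than the sorted one.

int main() {
    for (int i = 0; i < n; i++)
        a[i] = 2 * i + 1;                   // any sorted data will do
    eytzinger();                            // one-time O(n) preprocessing

    for (int x : {0, 5, 6, 2 * n}) {
        int k = search(x);
        int got = (k ? b[k] : INT_MAX);     // k == 0 means "no element >= x"
        auto it = lower_bound(a, a + n, x);
        int expected = (it != a + n ? *it : INT_MAX);
        assert(got == expected);
    }
    return 0;
}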

This method could also be used to speed up heaps, segment trees, and other static binary tree structures.


»
5 years ago, # |

I wanted to test this method, but maybe I am doing it wrong? Link. I got a different check value, and std::lower_bound came out faster. Can you help me fix it?

  • »
    »
    5 years ago, # ^ |

    The check value is wrong because it returns the element's index in its internal numeration, not the original one. I should probably put that in bold red somewhere, though it's possible to restore the original index by spending a few more instructions.
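
    One way to recover it (a sketch of the idea, not the author's code; pos and eytzinger_pos are made-up names): store each element's position in the sorted array while building the layout, at the cost of one extra array of the same size.

    int pos[n + 1];                       // pos[k] = index of b[k] in the sorted array a[]

    int eytzinger_pos(int i = 0, int k = 1) {
        if (k <= n) {
            i = eytzinger_pos(i, 2 * k);
            pos[k] = i;                   // remember where b[k] came from
            b[k] = a[i++];
            i = eytzinger_pos(i, 2 * k + 1);
        }
        return i;
    }

    // After int k = search(x);  the usual lower_bound index is  (k ? pos[k] : n).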

    As for the running time, I am not sure, but I think it probably has something to do with the fact that 32-bit random numbers are used to query an array whose items fit in 20 bits, so most searches just hit the path to the largest element and the whole caching trick stops working.

    I use variations of this code for benchmarking.
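
    For the query-distribution issue specifically, one possible fix (a sketch of my own, not taken from the linked benchmark; make_queries is a made-up name) is to draw queries from the values that are actually stored, so searches follow random paths through the tree instead of always falling off the right edge:

    // Assumes the globals a[] and n from the snippet in the post.
    std::vector<int> make_queries(int q, std::mt19937 &rng) {
        std::vector<int> xs(q);
        for (int &x : xs)
            x = a[rng() % n];   // sample an existing element as the query value
        return xs;
    }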

  • »
    »
    5 years ago, # ^ | Rev. 3

    Idk what k >>= __builtin_ffs(~k); does, but it should be k - n + 1.

    https://ideone.com/IeWTJ1

    Upd: nah, this works only for n = power of 2
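
    For what it's worth, my reading of that line (my own sketch below, not the author's explanation): each iteration appends one bit to k, 0 when the search goes left (b[k] >= x) and 1 when it goes right (b[k] < x), and the answer is the node where it last went left, so the shift just strips the trailing run of right turns together with that final left-turn bit.

    int restore(int k) {
        // after the loop, k looks like <answer j>, then a 0, then t ones (in binary);
        // __builtin_ffs(~k) == t + 1, so shifting by it recovers j;
        // if the search never went left, k is all ones and the result is 0 ("not found")
        return k >> __builtin_ffs(~k);
    }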

    • »
      »
      »
      5 years ago, # ^ |

      No problem, we can pad the array to a power of two with std::numeric_limits<int>::max().
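
      Something like this, for instance (a rough sketch of the padding idea, independent of the fixed-size arrays in the post; pad_to_pow2 is a made-up name):

      #include <bits/stdc++.h>

      // Round the size up to a power of two and fill the tail with "infinity",
      // so the padding elements can never be returned as a lower bound.
      std::vector<int> pad_to_pow2(std::vector<int> v) {
          std::size_t m = 1;
          while (m < v.size())
              m *= 2;
          v.resize(m, std::numeric_limits<int>::max());
          return v;
      }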

      I also tested eytzinger against a recursive template version. Link on ideone.

      Runtime from ideone
      • »
        »
        »
        »
        5 years ago, # ^ |

        Guys, I don't think that benchmarking on ideone is a good idea, and I wouldn't benchmark memory-bound algorithms on non-dedicated shared-memory machines in general.

        The key trick is to trade off (almost all of) memory bandwidth for latency, which is not going to work well when you have noisy neighbours possibly competing for it.

        I am not sure if major online judges provide decent memory isolation, but public ideone definitely doesn't.

        • »
          »
          »
          »
          »
          5 years ago, # ^ |

          Memory isolation is an OS feature that's required for security, and if you have to switch processes or move pages around too often, performance just tanks. This overhead is possible and could be an explanation, but I'm 99% certain that when you run your code on ideone, it doesn't share physical memory.

          • »
            »
            »
            »
            »
            »
            5 years ago, # ^ | Rev. 2

            Yes, they don't share the memory, but they do share the memory bus, and hence the bandwidth. By "memory isolation" I meant that ideally they shouldn't.

            Try running this bad boi in the background on a spare core on your laptop:

            #include <cstring>

            // Keep the memory bus busy: endlessly stream 32 MB from b[] to a[].
            const int n = (1<<23);
            int a[n], b[n];

            int main() {
                while (true)
                    std::memcpy(a, b, sizeof a);

                return 0;
            }


            My running time on the benchmarks went up 3x.

            • »
              »
              »
              »
              »
              »
              »
              5 years ago, # ^ |

              Ok, so that's what you meant. I'm not sure how it works on servers compared to regular PC CPUs (whether there are attempts to mitigate it), but it's a factor.