Tyx2019's blog

By Tyx2019, history, 3 hours ago, In English

Introduction

This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.

The Problem

Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.

For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.

The naive solution

Below is the code I used for testing the naive solution.

#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        V[x].push_back(y);
    }
}

For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).

Of course, we need a control to see how much time it took to generate random numbers. Below is the control

main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
    }
}

Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.

The cache-optimised solution

We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.

Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.

Below is the code I used to test this solution

main(){  
    vector<pair<int, int>> T[1000];
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        T[x/1000].push_back({x, y});
    }
    for(int i=0;i<1000;i++){
        for(auto [a, b]:T[i]){
            V[a].push_back(b);
        }
    }
}

Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.

  • Vote: I like it
  • +46
  • Vote: I do not like it

»
3 hours ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by Tyx2019 (previous revision, new revision, compare).

»
3 hours ago, hide # |
 
Vote: I like it -11 Vote: I do not like it

first

»
3 hours ago, hide # |
 
Vote: I like it -12 Vote: I do not like it

second

»
2 hours ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

third

»
2 hours ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

fourth.

Also, what if instead of a block size of $$$1 \, 000$$$, other numbers are chosen; and what subtask were you trying to do in SEATST day 1?

  • »
    »
    116 minutes ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    I think the optimal block size will depend on the values of $$$N$$$ and $$$Q$$$. I think in general a size of $$$\sqrt N$$$ would be good, but this will depend on $$$N$$$ and $$$Q$$$ e.g. I tested $$$N=10^8, Q=10^7$$$ and it seems like the optimal block size is around $$$5000$$$.

    The subtask was the $$$N=10^4$$$ subtask on Day 1 P3, my time complexity was $$$O(N^2)$$$ but I needed to do this constant optimisation to get the points.

»
103 minutes ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

orz

»
45 minutes ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

fifth

»
43 minutes ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

It feels like this is just a variation on radix sort with the radix set to 1000? It may be faster to just radix-sort the whole array (and thus avoiding the vector reallocations) which will give you all the desired vectors in a contiguous away which should be better for caching.