Square Root Decomposition in Cache Optimisation

Правка en5, от Tyx2019, 2026-05-25 14:00:45

Introduction

This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.

The Problem

Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.

For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.

The naive solution

Below is the code I used for testing the naive solution.

#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        V[x].push_back(y);
    }
}

For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).

Of course, we need a control to see how much time it took to generate random numbers. Below is the control

main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
    }
}

Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.

The cache-optimised solution

We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.

Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.

Below is the code I used to test this solution

main(){  
    vector<pair<int, int>> T[1000];
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        T[x/1000].push_back({x, y});
    }
    for(int i=0;i<1000;i++){
        for(auto [a, b]:T[i]){
            V[a].push_back(b);
        }
    }
}

Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.

Edit: I have run the different pieces of code with the O2 flag enabled, the control now gets optimised away by the compiler so I will only provide the times for the naive and optimised solutions. The naive solution on my computer took 10.8s, and the optimised solution took 3.7s, so the optimisation actually makes the vector operations ~3x faster with O2. Thanks errorgorn for the suggestion!

История

 
 
 
 
Правки
 
 
  Rev. Язык Кто Когда Δ Комментарий
en5 Английский Tyx2019 2026-05-25 14:00:45 415
en4 Английский Tyx2019 2026-05-25 04:54:00 0 (published)
en3 Английский Tyx2019 2026-05-25 04:53:37 1218 Tiny change: ' control\n~~~~~\nm' -> ' control\n\n~~~~~\nm'
en2 Английский Tyx2019 2026-05-25 04:39:39 1157
en1 Английский Tyx2019 2026-05-25 04:08:40 382 Initial revision (saved to drafts)