Square Root Decomposition in Cache Optimisation

Правка en4, от Tyx2019, 2026-05-25 04:54:00

Introduction

This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.

The Problem

Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.

For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.

The naive solution

Below is the code I used for testing the naive solution.

#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        V[x].push_back(y);
    }
}

For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).

Of course, we need a control to see how much time it took to generate random numbers. Below is the control

main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
    }
}

Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.

The cache-optimised solution

We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.

Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.

Below is the code I used to test this solution

main(){  
    vector<pair<int, int>> T[1000];
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        T[x/1000].push_back({x, y});
    }
    for(int i=0;i<1000;i++){
        for(auto [a, b]:T[i]){
            V[a].push_back(b);
        }
    }
}

Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.

История

 
 
 
 
Правки
 
 
  Rev. Язык Кто Когда Δ Комментарий
en4 Английский Tyx2019 2026-05-25 04:54:00 0 (published)
en3 Английский Tyx2019 2026-05-25 04:53:37 1218 Tiny change: ' control\n~~~~~\nm' -> ' control\n\n~~~~~\nm'
en2 Английский Tyx2019 2026-05-25 04:39:39 1157
en1 Английский Tyx2019 2026-05-25 04:08:40 382 Initial revision (saved to drafts)