Square Root Decomposition in Cache Optimisation

№	Пользователь	Рейтинг
1	Benq	3792
2	VivaciousAubergine	3647
3	jiangly	3631
4	Kevin114514	3574
5	maroonrk	3521
6	strapple	3515
7	Radewoosh	3461
8	tourist	3428
9	turmax	3378
10	Um_nik	3376

№	Пользователь	Вклад
1	Qingyu	162
2	adamant	148
3	Um_nik	146
4	Dominater069	143
5	errorgorn	140
6	cry	138
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
10	soullless	133

Introduction

This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.

The Problem

Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.

For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.

The naive solution

Below is the code I used for testing the naive solution.

#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        V[x].push_back(y);
    }
}

For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).

Of course, we need a control to see how much time it took to generate random numbers. Below is the control

main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
    }
}

Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.

The cache-optimised solution

We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.

Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.

Below is the code I used to test this solution

main(){  
    vector<pair<int, int>> T[1000];
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        T[x/1000].push_back({x, y});
    }
    for(int i=0;i<1000;i++){
        for(auto [a, b]:T[i]){
            V[a].push_back(b);
        }
    }
}

Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.

Edit: I have run the different pieces of code with the O2 flag enabled, the control now gets optimised away by the compiler so I will only provide the times for the naive and optimised solutions. The naive solution on my computer took 10.8s, and the optimised solution took 3.7s, so the optimisation actually makes the vector operations ~3x faster with O2. Thanks errorgorn for the suggestion!

Rev.	Кто	Когда	Δ	Комментарий
en5	Tyx2019	2026-05-25 14:00:45	415
en4	Tyx2019	2026-05-25 04:54:00	0	(published)
en3	Tyx2019	2026-05-25 04:53:37	1218	Tiny change: ' control\n~~~~~\nm' -> ' control\n\n~~~~~\nm'
en2	Tyx2019	2026-05-25 04:39:39	1157
en1	Tyx2019	2026-05-25 04:08:40	382	Initial revision (saved to drafts)

Rev.

Язык

Кто

Когда

Комментарий

en5