Square Root Decomposition in Cache Optimisation

№	Пользователь	Рейтинг
1	Benq	3792
2	VivaciousAubergine	3647
3	jiangly	3631
4	Kevin114514	3574
5	maroonrk	3521
6	strapple	3515
7	Radewoosh	3461
8	tourist	3428
9	turmax	3378
10	Um_nik	3376

№	Пользователь	Вклад
1	Qingyu	162
2	adamant	148
3	Um_nik	146
4	Dominater069	143
5	errorgorn	141
6	cry	138
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
10	soullless	133

Introduction↵
------------------↵
This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $1$, to grab a $28$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it. ↵
↵
The Problem↵
------------------↵
Suppose we have an array of $N$ vectors, $V_1$ to $V_N$, and we wish to perform $Q$ insertion operations of the form $(x, y)$, where we insert the value $y$ to the back of vector $V_x$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large. ↵
↵
For the rest of this blog, we shall assume $N=10^6, Q=10^8$. ↵
↵
The naive solution↵
------------------↵
Below is the code I used for testing the naive solution. ↵
↵
~~~~~↵
#include <bits/stdc++.h>↵
using namespace std;↵
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());↵
const int N = 1e6;↵
const int Q = 1e8;↵
vector<int> V[N];↵
main(){  ↵
    int x, y;↵
    for(int i=0;i<Q;i++){↵
        x = rng() % N;↵
        y = rng();↵
        V[x].push_back(y);↵
    }↵
}↵
~~~~~↵
↵
For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).↵
↵
Of course, we need a control to see how much time it took to generate random numbers. Below is the control↵
↵
~~~~~↵
main(){  ↵
    int x, y;↵
    for(int i=0;i<Q;i++){↵
        x = rng() % N;↵
        y = rng();↵
    }↵
}↵
~~~~~↵
↵
Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.↵
↵
The cache-optimised solution↵
------------------↵
↵
We use square-root decomposition to reduce cache misses. To do this, we split the $10^6$ vectors into $1000$ blocks. For each operation, we will store which block it goes into. To do this, we push into $1000$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache. ↵
↵
Now after this, we loop from block $0$ to block $999$, and perform each operation belonging to that block. Since inside each block, we only push to $1000$ vectors, this will also be fast. ↵
↵
Below is the code I used to test this solution ↵
↵
~~~~~↵
main(){  ↵
    vector<pair<int, int>> T[1000];↵
    int x, y;↵
    for(int i=0;i<Q;i++){↵
        x = rng() % N;↵
        y = rng();↵
        T[x/1000].push_back({x, y});↵
    }↵
    for(int i=0;i<1000;i++){↵
        for(auto [a, b]:T[i]){↵
            V[a].push_back(b);↵
        }↵
    }↵
}↵
~~~~~↵
↵
Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.

Rev.	Кто	Когда	Δ	Комментарий
en4	Tyx2019	2026-05-25 04:54:00	0	(published)
en3	Tyx2019	2026-05-25 04:53:37	1218	Tiny change: ' control\n~~~~~\nm' -> ' control\n\n~~~~~\nm'
en2	Tyx2019	2026-05-25 04:39:39	1157
en1	Tyx2019	2026-05-25 04:08:40	382	Initial revision (saved to drafts)

Rev.

Язык

Кто

Когда

Комментарий

en4