Introduction
This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.
The Problem
Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.
For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.
The naive solution
Below is the code I used for testing the naive solution.
#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
V[x].push_back(y);
}
}
For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).
Of course, we need a control to see how much time it took to generate random numbers. Below is the control
main(){
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
}
}
Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.
The cache-optimised solution
We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.
Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.
Below is the code I used to test this solution
main(){
vector<pair<int, int>> T[1000];
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
T[x/1000].push_back({x, y});
}
for(int i=0;i<1000;i++){
for(auto [a, b]:T[i]){
V[a].push_back(b);
}
}
}
Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.




