Introduction
This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.
The Problem
Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.
For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.
The naive solution
Below is the code I used for testing the naive solution.
#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
V[x].push_back(y);
}
}
For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).
Of course, we need a control to see how much time it took to generate random numbers. Below is the control
main(){
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
}
}
Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.
The cache-optimised solution
We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.
Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.
Below is the code I used to test this solution
main(){
vector<pair<int, int>> T[1000];
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
T[x/1000].push_back({x, y});
}
for(int i=0;i<1000;i++){
for(auto [a, b]:T[i]){
V[a].push_back(b);
}
}
}
Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.
Edit: I have run the different pieces of code with the O2 flag enabled, the control now gets optimised away by the compiler so I will only provide the times for the naive and optimised solutions. The naive solution on my computer took 10.8s, and the optimised solution took 3.7s, so the optimisation actually makes the vector operations ~3x faster with O2. Thanks errorgorn for the suggestion!




