Introduction↵
------------------↵
This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $1$, to grab a $28$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it. ↵
↵
The Problem↵
------------------↵
Suppose we have an array of $N$ vectors, $V_1$ to $V_N$, and we wish to perform $Q$ insertion operations of the form $(x, y)$, where we insert the value $y$ to the back of vector $V_x$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large. ↵
↵
For the rest of this blog, we shall assume $N=10^6, Q=10^8$. ↵
↵
The naive solution↵
------------------↵
Below is the code I used for testing the naive solution. ↵
↵
~~~~~↵
#include <bits/stdc++.h>↵
using namespace std;↵
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());↵
const int N = 1e6;↵
const int Q = 1e8;↵
vector<int> V[N];↵
main(){ ↵
int x, y;↵
for(int i=0;i<Q;i++){↵
x = rng() % N;↵
y = rng();↵
V[x].push_back(y);↵
}↵
}↵
~~~~~↵
↵
For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).↵
↵
Of course, we need a control to see how much time it took to generate random numbers. Below is the control↵
↵
~~~~~↵
main(){ ↵
int x, y;↵
for(int i=0;i<Q;i++){↵
x = rng() % N;↵
y = rng();↵
}↵
}↵
~~~~~↵
↵
Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.↵
↵
The cache-optimised solution↵
------------------↵
↵
We use square-root decomposition to reduce cache misses. To do this, we split the $10^6$ vectors into $1000$ blocks. For each operation, we will store which block it goes into. To do this, we push into $1000$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache. ↵
↵
Now after this, we loop from block $0$ to block $999$, and perform each operation belonging to that block. Since inside each block, we only push to $1000$ vectors, this will also be fast. ↵
↵
Below is the code I used to test this solution ↵
↵
~~~~~↵
main(){ ↵
vector<pair<int, int>> T[1000];↵
int x, y;↵
for(int i=0;i<Q;i++){↵
x = rng() % N;↵
y = rng();↵
T[x/1000].push_back({x, y});↵
}↵
for(int i=0;i<1000;i++){↵
for(auto [a, b]:T[i]){↵
V[a].push_back(b);↵
}↵
}↵
}↵
~~~~~↵
↵
Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.
------------------↵
This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $1$, to grab a $28$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it. ↵
↵
The Problem↵
------------------↵
Suppose we have an array of $N$ vectors, $V_1$ to $V_N$, and we wish to perform $Q$ insertion operations of the form $(x, y)$, where we insert the value $y$ to the back of vector $V_x$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large. ↵
↵
For the rest of this blog, we shall assume $N=10^6, Q=10^8$. ↵
↵
The naive solution↵
------------------↵
Below is the code I used for testing the naive solution. ↵
↵
~~~~~↵
#include <bits/stdc++.h>↵
using namespace std;↵
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());↵
const int N = 1e6;↵
const int Q = 1e8;↵
vector<int> V[N];↵
main(){ ↵
int x, y;↵
for(int i=0;i<Q;i++){↵
x = rng() % N;↵
y = rng();↵
V[x].push_back(y);↵
}↵
}↵
~~~~~↵
↵
For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).↵
↵
Of course, we need a control to see how much time it took to generate random numbers. Below is the control↵
↵
~~~~~↵
main(){ ↵
int x, y;↵
for(int i=0;i<Q;i++){↵
x = rng() % N;↵
y = rng();↵
}↵
}↵
~~~~~↵
↵
Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.↵
↵
The cache-optimised solution↵
------------------↵
↵
We use square-root decomposition to reduce cache misses. To do this, we split the $10^6$ vectors into $1000$ blocks. For each operation, we will store which block it goes into. To do this, we push into $1000$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache. ↵
↵
Now after this, we loop from block $0$ to block $999$, and perform each operation belonging to that block. Since inside each block, we only push to $1000$ vectors, this will also be fast. ↵
↵
Below is the code I used to test this solution ↵
↵
~~~~~↵
main(){ ↵
vector<pair<int, int>> T[1000];↵
int x, y;↵
for(int i=0;i<Q;i++){↵
x = rng() % N;↵
y = rng();↵
T[x/1000].push_back({x, y});↵
}↵
for(int i=0;i<1000;i++){↵
for(auto [a, b]:T[i]){↵
V[a].push_back(b);↵
}↵
}↵
}↵
~~~~~↵
↵
Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.




