Introduction
This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.
The Problem
Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.
For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.
The naive solution
Below is the code I used for testing the naive solution.
#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
V[x].push_back(y);
}
}
For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).
Of course, we need a control to see how much time it took to generate random numbers. Below is the control
main(){
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
}
}
Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.
The cache-optimised solution
We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.
Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.
Below is the code I used to test this solution
main(){
vector<pair<int, int>> T[1000];
int x, y;
for(int i=0;i<Q;i++){
x = rng() % N;
y = rng();
T[x/1000].push_back({x, y});
}
for(int i=0;i<1000;i++){
for(auto [a, b]:T[i]){
V[a].push_back(b);
}
}
}
Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.









Auto comment: topic has been updated by Tyx2019 (previous revision, new revision, compare).
first
second
third
fourth.
Also, what if instead of a block size of $$$1 \, 000$$$, other numbers are chosen; and what subtask were you trying to do in SEATST day 1?
I think the optimal block size will depend on the values of $$$N$$$ and $$$Q$$$. I think in general a size of $$$\sqrt N$$$ would be good, but this will depend on $$$N$$$ and $$$Q$$$ e.g. I tested $$$N=10^8, Q=10^7$$$ and it seems like the optimal block size is around $$$5000$$$.
The subtask was the $$$N=10^4$$$ subtask on Day 1 P3, my time complexity was $$$O(N^2)$$$ but I needed to do this constant optimisation to get the points.
orz
fifth
It feels like this is just a variation on radix sort with the radix set to 1000? It may be faster to just radix-sort the whole array (and thus avoiding the vector reallocations) which will give you all the desired vectors in a contiguous away which should be better for caching.