Square Root Decomposition in Cache Optimisation

Introduction

This last weekend, I took part in the Southeast Asia Team Selection Test (SEATST). On day $$$1$$$, to grab a $$$28$$$ point subtask, I used an interesting trick involving cache-optimisation and square root decomposition that I thought was cool, so naturally I am writing my first CF blog to talk about it.

The Problem

Suppose we have an array of $$$N$$$ vectors, $$$V_1$$$ to $$$V_N$$$, and we wish to perform $$$Q$$$ insertion operations of the form $$$(x, y)$$$, where we insert the value $$$y$$$ to the back of vector $$$V_x$$$. For instance, you might want to do this for a counting sort if the log factor from sorting is too large.

For the rest of this blog, we shall assume $$$N=10^6, Q=10^8$$$.

The naive solution

Below is the code I used for testing the naive solution.

#include <bits/stdc++.h>
using namespace std;
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
const int N = 1e6;
const int Q = 1e8;
vector<int> V[N];
main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        V[x].push_back(y);
    }
}

For the other pieces of code in this blog, I will not include the boilerplate (i.e. everything outside of main).

Of course, we need a control to see how much time it took to generate random numbers. Below is the control

main(){  
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
    }
}

Running them on my laptop, the control took 1.7s, and the naive solution took 12.7s. So the vector operations took around 11s.

The cache-optimised solution

We use square-root decomposition to reduce cache misses. To do this, we split the $$$10^6$$$ vectors into $$$1000$$$ blocks. For each operation, we will store which block it goes into. To do this, we push into $$$1000$$$ vectors, which is a small number of vectors so accessing the vector metadata is fast as it can be stored in the cache.

Now after this, we loop from block $$$0$$$ to block $$$999$$$, and perform each operation belonging to that block. Since inside each block, we only push to $$$1000$$$ vectors, this will also be fast.

Below is the code I used to test this solution

main(){  
    vector<pair<int, int>> T[1000];
    int x, y;
    for(int i=0;i<Q;i++){
        x = rng() % N;
        y = rng();
        T[x/1000].push_back({x, y});
    }
    for(int i=0;i<1000;i++){
        for(auto [a, b]:T[i]){
            V[a].push_back(b);
        }
    }
}

Running this code, it took 7.7s, so the vector operations took around 6s. Thus this code is ~2x faster with the cache optimisation.

Комментарии (9)

Написать комментарий?

Tyx2019

3 часа назад, скрыть # |

Auto comment: topic has been updated by Tyx2019 (previous revision, new revision, compare).

→ Ответить

developinpop

-11

first

Dolphy

-12

second

arctano

2 часа назад, скрыть # |

third

omsincoconut

fourth.

Also, what if instead of a block size of $$$1 \, 000$$$, other numbers are chosen; and what subtask were you trying to do in SEATST day 1?

116 минут назад, скрыть # ^ |

I think the optimal block size will depend on the values of $$$N$$$ and $$$Q$$$. I think in general a size of $$$\sqrt N$$$ would be good, but this will depend on $$$N$$$ and $$$Q$$$ e.g. I tested $$$N=10^8, Q=10^7$$$ and it seems like the optimal block size is around $$$5000$$$.

The subtask was the $$$N=10^4$$$ subtask on Day 1 P3, my time complexity was $$$O(N^2)$$$ but I needed to do this constant optimisation to get the points.

Yuichiro17

103 минуты назад, скрыть # |

orz

Divadiva

45 минут назад, скрыть # |

fifth

dooglius

43 минуты назад, скрыть # |

It feels like this is just a variation on radix sort with the radix set to 1000? It may be faster to just radix-sort the whole array (and thus avoiding the vector reallocations) which will give you all the desired vectors in a contiguous away which should be better for caching.

№	Пользователь	Рейтинг
1	Benq	3792
2	VivaciousAubergine	3647
3	jiangly	3631
4	Kevin114514	3574
5	maroonrk	3521
6	strapple	3515
7	Radewoosh	3461
8	tourist	3428
9	turmax	3378
10	Um_nik	3376

№	Пользователь	Вклад
1	Qingyu	162
2	adamant	148
3	Um_nik	146
4	Dominater069	143
5	errorgorn	141
6	cry	138
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
10	soullless	133

Блог пользователя Tyx2019

Introduction

The Problem

The naive solution

The cache-optimised solution