demonicc's blog

By demonicc, history, 6 months ago, In English

You've all seen it. You and a friend both code an $$$O(N^2)$$$ solution. Theirs passes in 500ms, yours gets TLE. You check the algorithm—it's identical. What gives?

The problem isn't your algorithm. It's your implementation. You're ignoring the hardware, and the single biggest killer is cache locality.

1. The 9x Difference: Row-Major vs. Column-Major

Your CPU doesn't read memory byte by byte. It pulls in chunks called cache lines (usually 64 bytes). When you access arr[i][j], the CPU also loads arr[i][j+1], arr[i][j+2], etc., into a super-fast cache, assuming you'll need them next. This is spatial locality.

This is cache-friendly:

// Row-major: Accessing memory sequentially
for(int i = 0; i < n; i++) {
    for(int j = 0; j < m; j++) {
        sum += arr[i][j]; 
    }
}

This is a cache nightmare:

// Column-major: Jumping all over memory
for(int j = 0; j < m; j++) {
    for(int i = 0; i < n; i++) {
        sum += arr[i][j]; 
    }
}

In the second loop, you access arr[0][0], then arr[1][0], then arr[2][0]. These aren't next to each other in memory. For each access, the CPU has to fetch a new 64-byte cache line from slow RAM, throwing away the one it just got.

Benchmark: On a 10000x10000 matrix, the row-major loop runs in ~300ms. The column-major loop takes ~2800ms.

That's a 9x performance hit on the exact same $$$O(N^2)$$$ complexity.

2. Stop Using std::list. Seriously.

"But std::list has $$$O(1)$$$ insertion!"

Irrelevant. That $$$O(1)$$$ is theoretical. In practice, std::list is slow for two reasons:

  1. Allocation Overhead: Every push_back requires a new memory allocation, which is slow. std::vector allocates geometrically (1, 2, 4, 8...), so it's much faster on average.

  2. Pointer Chasing: std::vector stores elements in a single contiguous block of memory (perfect for the cache). std::list scatters elements all over your RAM, connected by pointers.

Iterating a list means jumping from one random memory address to another—the definition of a cache miss.

Benchmark (1M elements):

  • Iterating std::vector: ~2ms
  • Iterating std::list: ~35ms (17x slower)

Unless you are frequently inserting/deleting in the middle of a massive list (which you almost never are in CP), just use std::vector.

3. if Statements are Expensive

Your CPU is smart. It uses branch prediction to guess which way an if statement will go before the condition is even evaluated. If it guesses right, the pipeline stays full and you lose nothing. If it guesses wrong (a misprediction), it has to flush its pipeline and restart, costing 10-20 clock cycles.

If your data is random, this if is unpredictable and slow:

// Unpredictable branch
for(int i = 0; i < n; i++) {
    if(arr[i] % 2 == 0) {
        count++;
    }
}

For random data, the CPU will be wrong ~50% of the time. You can often rewrite this to be branchless:

// Branchless
for(int i = 0; i < n; i++) {
    count += (arr[i] & 1) ^ 1;
}

This version has no if, no guessing, and no misprediction penalty. On random data, it can be 30-50% faster.

4. How Your structs Lie to You (AoS vs. SoA)

Let's say you have a Point struct and you only need to process the x coordinates.

Array of Structs (AoS) — The Bad Way:

struct Point { int x, y, z; };
vector<Point> points; // Memory: [x1,y1,z1], [x2,y2,z2], ...

When you access points[i].x, the CPU loads the entire struct (x1, y1, z1) into the cache line. You only wanted x1, so 2/3 of that cache line is wasted on data you never touch.

Struct of Arrays (SoA) — The Fast Way:

struct Points {
    vector<int> x, y, z;
};
Points points; // Memory: [x1,x2,x3...], [y1,y2,y3...], [z1,z2,z3...]

Now, when you loop through points.x, you are accessing a contiguous block of memory. Every single byte loaded into the cache is data you actually need.

TL;DR

Stop just counting operations. Your $$$O(N \log N)$$$ can be slower than $$$O(N^2)$$$ if you ignore the hardware.

  • Always iterate in row-major order.
  • Use std::vector.
  • Avoid data-dependent ifs in hot loops.
  • Think about data layout (AoS vs. SoA) if you're really pushing time limits.

Your algorithm might be fast, but your implementation is what gets the AC.

thank u

  • Vote: I like it
  • +369
  • Vote: I do not like it

»
6 months ago, hide # |
Rev. 2  
Vote: I like it -6 Vote: I do not like it

Just a note:
This blog is intended for beginners and specialists who are trying to understand why their correct $$$O(N^2)$$$ solution might get TLE.
Hope it helps you!


  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    Ignore the downvotes, this is a banger post. Super practical stuff that actually helps you pass TLE. The 9x benchmark is crazy. Keep it up, bro!

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it -18 Vote: I do not like it

      Codeforces servers are very fast, and most code won't TLE as long as the algorithm is correct

      • »
        »
        »
        »
        5 months ago, hide # ^ |
         
        Vote: I like it 0 Vote: I do not like it

        It's partly true; sometimes Codeforces is too slow (like 30-40%) due to the large queue during div. 3s or 4s, but for div. 2 and div. 1 it's normal because not many noobies spam submit the problems.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +59 Vote: I do not like it

    The content is for people who love AI generated blogs, obviously

»
6 months ago, hide # |
 
Vote: I like it +4 Vote: I do not like it

Learnt something new! Especially the row-major & col-major one :D

»
6 months ago, hide # |
Rev. 2  
Vote: I like it +42 Vote: I do not like it

You know what's even better than vector? Static arrays

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -116 Vote: I do not like it

    Haha, you're 100% right.

    For pure speed in Competitive Programming (CP) where you know the maximum value of $$$N$$$, a global static array (or std::array) is unbeatable — no dynamic allocation overhead at all.

    My main point was just to show why std::vector (contiguous memory) is so much better than std::list (pointer chasing).
    But you're absolutely correct — static arrays are another level up.

»
6 months ago, hide # |
 
Vote: I like it +43 Vote: I do not like it

Point 4 looks a bit hypothetical to me.

If you put something in a struct, like a point in 2D or 3D space, the use case is often to access several parts of each single struct simultaneously, rather than to use a single property on every access. For example, you can sort $$$(x, y, z)$$$ points by their $$$y$$$-coordinate, but this operation probably won't be the bottleneck of your algorithm: why would you need the other two coordinates then.

Other than that, all good points! Nice to see them compiled in a single post.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -13 Vote: I do not like it

    That's a fair point, and you're right.

    In most common use cases (like sorting, finding distances, etc.), you’ll be accessing x, y, and z together. So the standard Array of Structs (AoS) is already quite cache-friendly. The Struct of Arrays (SoA) pattern is more of a niche optimization. It only helps in a specific bottleneck scenario — for example, if you had a hot loop that only needed to process the x-coordinates of millions of points, not y or z.

    Thanks for the feedback! It's a good clarification.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +8 Vote: I do not like it

    It is highly use case dependent, but it's a very common optimization. It can also help with auto-vectorization if you do operations such as f(x[i], y[i], z[i]) on the entire array. Ref: https://en.algorithmica.org/hpc/cpu-cache/aos-soa/ https://odin-lang.org/docs/overview/#soa-data-types

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

I got my first FST on 2159B - Rectangles because of this. I had an inefficient loop where the vector elements were accessed all over the place; I instead made a new vector that stored the elements in the order they'd be accessed, and it passed with time to spare.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -41 Vote: I do not like it

    Wow, thanks for sharing this!
    That's exactly the kind of scenario that causes TLEs or FSTs even when the algorithm is correct.
    A painful way to learn, but honestly, a super valuable fix.
    Great case study for the post!

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

Your recommendation to stop using std::list is absolutely right. std::list has such a huge constant factor that data structures with $$$O(\log N)$$$ time complexity can sometimes be faster.

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

Number 3 is done automatically by the G++ compiler (I'm not sure if it needs -O2, but it is done automatically on Codeforces)

G++ is smart and makes any modulo by a constant value much faster at compile time.

You can check how it does this at https://godbolt.org/

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -44 Vote: I do not like it

    Thanks for the Godbolt tip! G++ does auto-optimize %2 into bit ops like &1, but the real killer is branch mispredictions on unpredictable ifs (10-20 cycles each); going branchless dodges that in hot loops...

    Appreciate the correction—keeps me sharp!

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +46 Vote: I do not like it

    I remember a recent case where I optimized branch prediction out of a dp in some ucup contest by using templates.

    int dp(int u, int v, int t) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    to

    template<const int t>
    int dp(int u, int v) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    The compiler then creates a function for every t and optimizes out the ifs completely. It turned 3s TLE into 1.1s AC.

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      Wait WHATTT????!!!!

      How?? Whyy??

      Please elaborate, I wish to learn this magic.

      • »
        »
        »
        »
        6 months ago, hide # ^ |
        Rev. 3  
        Vote: I like it +18 Vote: I do not like it

        Sure. Let's say our dp is a weird dp solution for the Kadane problem. $$$dp[i][0]$$$ is the best answer starting from index i and the range hasn't been opened yet and $$$dp[i][1]$$$ is the best answer starting from index i and the range has been opened already. $$$dp[i][0] = max(dp[i+1][0], dp[i+1][1] + a[i]), dp[i][1] = max(0, dp[i+1][1] + a[i])$$$

        Code

        You can try the code on cf's custom invocation sending 3000000 as input (using the fast<0>(0) call gets around 300ms and the other gets around 400ms). The fast function is called as fast<0> and fast<1> and the compiler knows in compile-time which fast calls which fast. When compiling, if I understand correctly, it's the same as having 2 functions fast0 and fast1, so the parameter <t> is known in compile time and the ifs that you already know the answer to are optimized (so no operation is executed on that if). Another side benefit is that t isn't passed as an argument so you use less memory on the stack. On a solution with even more ifs and less "predictable" ifs it'd make an even bigger difference. Edit: you can verify that my explanation is correct by looking at the assembly here https://godbolt.org/z/rqecEWKsq.

        Usually this shouldn't be necessary, but if you have a slower-than-expected solution with the same complexity then it can make a difference. I learned this from Stockfish code from my time coding my chess engine. Take a look at the Search::Worker::search function in https://github.com/official-stockfish/Stockfish/blob/master/src/search.cpp (I tried using a link but it didn't work), which uses a template. It took me looking at that, having some knowledge about template programming, and wondering about why it's like that to realize this minor optimization is one of the reasons for that. If I recall correctly this kind of stuff is used especially in move generation.

    • »
      »
      »
      5 months ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      This is some next level optimization!

»
6 months ago, hide # |
 
Vote: I like it -8 Vote: I do not like it

where did you study all these things from?

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +44 Vote: I do not like it

    My guess is from GPT.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    If you actually want to learn about this you can start by looking up Data Oriented Design. I think these topics mostly apply only to game development though.

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it +1 Vote: I do not like it

      I have started trying to learn making games in Unity recently, although that uses C#. These things are also important for algorithmic trading.

  • »
    »
    5 months ago, hide # ^ |
     
    Vote: I like it +5 Vote: I do not like it

    Cache locality is something that is taught in computer architecture courses

»
6 months ago, hide # |
Rev. 2  
Vote: I like it +7 Vote: I do not like it

Thanks for the blog, even though you are using AI they are still helpful tips. I would like to ask if basic_string and vector have differences in performance, as the former has some potentially useful functions like substr() and also has all the functions of vector.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    There are some minor differences such as the size of a basic_string object is 32 bytes, whereas the size of a vector object is 24 bytes. This should not be a problem unless you store and want to iterate over a large collection of basic_string_view objects.

    basic_string does small string optimization, so it can store up to 15 bytes of data inline without doing any heap allocation. This can be useful if you want a lot of dynamic arrays but most of them are very small.

    Additionally, you may want to look into std::span if you want slicing of a vector. It is similar to basic_string_view. Both are views over a contiguous block of elements of type T, and both provide sub-slicing (span::subspan and basic_string_view::substr). Note that these types are cheap to construct so try not to hold a persistent reference to them. For example, if you have a span over a vector and you push_back into the vector then the span might get invalidated. Just construct it when you need it. I tend to do string_view{s}.substr(i, j - i + 1) == s2 if I need to compare a sub-string.

    My advice would be to not pick one favorite and use it everywhere; pick what you need.

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it +3 Vote: I do not like it

      Thank you, I didn't know there were so many small differences between the two types, I'll keep that in mind.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

thanks

»
6 months ago, hide # |
 
Vote: I like it +6 Vote: I do not like it

This AI generated post, accompanied with the GIF at the end, reeks of contribution farming. It is very common on LeetCode. Neither of the platforms benefits from it. I was fine with the blog being AI written, as the content can be very useful and new to a lot of people. But using AI generated replies on comments is just repulsive.

Why does the post have close to 200 upvotes within the span of 18 hours?

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +10 Vote: I do not like it

    Interesting. To me, a good bit of stuff in the post looks "too humanly" :)

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

We can say the same about LinkedList<>() vs ArrayDeque<>() in Java: ArrayDeque is faster.

https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

In vector there is insert() to insert several elements in the middle. Would you like to help me check if it's faster?

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    It will not be a meaningful comparison because list::insert and vector::insert both have different time complexities.

    vector::insert: If reallocation happens, linear in the number of elements of the vector after insertion; otherwise, linear in the number of elements inserted plus std::distance(pos, end()).

    The memory allocation time complexity due to insert in a vector can be considered as amortized $$$O(1)$$$ due to the same reason as push_back. The reason being vector allocates more memory than required and it does not need to reallocate for the subsequent few inserts.

    So, the overhead mostly comes from copying and shifting the elements in [pos, end()) over.

    I wrote a small benchmark. You can increase the no. of insertions to see how the vector insert gets worse. Link: https://quick-bench.com/q/0hMXEe0tvlxqFLSJH1xInYvvneA#

    Inserting in a vector at the beginning is the worst case as all the elements need to be copied over. Inserting at the back is equivalent to push_back. The closer to the end you are inserting, the better the performance.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

Thanks for the information, amazed by the if one. Will all this work for Python too?

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

I found number 1 very useful. Will keep it in mind.

Almost got TLE with this submission: https://mirror.codeforces.com/problemset/submission/2115/345078804

Easy AC by following the advice: https://mirror.codeforces.com/problemset/submission/2115/345079553

From 1843ms to 999ms

»
6 months ago, hide # |
Rev. 2  
Vote: I like it 0 Vote: I do not like it

I got TLE on last div 2. I couldn't understand why, as my code was supposed to perform fewer than 1e8 operations, and computers can perform about 3*1e8 operations in 3 seconds. After the contest, I only changed my vector<2,vector<2e5>> to unordered_map[2] and it got accepted, taking less than 400 ms.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

Thanks! I have just solved a problem which was TLE because of that. It lets me know more about why $$$O(N^2)$$$ solutions TLE.

Also, I wonder who is still using std::list now. (:

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

demonicc I'd like to tell you that __builtin_sqrt is 10X faster than simple sqrt.

»
6 months ago, hide # |
 
Vote: I like it +13 Vote: I do not like it

This leads to a tip when you are doing matrix multiplication in matrix fast exponentiation.

// i-k-j order: the inner loop walks c[i][...] and b[k][...] row-wise
for(int i = 0; i < x; i ++)
    for(int k = 0; k < z; k ++)
        for(int j = 0; j < y; j ++)
            c[i][j] += a[i][k] * b[k][j];

// i-j-k order: the inner loop jumps a whole row of b per iteration
for(int i = 0; i < x; i ++)
    for(int j = 0; j < y; j ++)
        for(int k = 0; k < z; k ++)
            c[i][j] += a[i][k] * b[k][j];

The first one is quicker than the second one.

The second one is a cache nightmare because b[k][j] is accessed in column-major order.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

There's a common mistake that many people make which significantly slows down their code: using endl. While browsing submissions on Codeforces, I've noticed that many people, even at the Expert rank, still get TLE from using endl. I simply replaced endl with '\n' and the code got Accepted (AC).

  • »
    »
    5 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    In fact endl flushes the output stream, which might be more significant in interactive tasks.

    • »
      »
      »
      5 months ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      Yes, I use endl instead of doing cout << flush separately (is it bad practice for non-interactive problems?)