demonicc's blog

By demonicc, history, 6 months ago, In English

You've all seen it. You and a friend both code an $$$O(N^2)$$$ solution. Theirs passes in 500ms, yours gets TLE. You check the algorithm—it's identical. What gives?

The problem isn't your algorithm. It's your implementation. You're ignoring the hardware, and the single biggest killer is poor cache locality.

1. The 9x Difference: Row-Major vs. Column-Major

Your CPU doesn't read memory byte by byte. It pulls in chunks called cache lines (usually 64 bytes). When you access arr[i][j], the CPU also loads arr[i][j+1], arr[i][j+2], etc., into a super-fast cache, assuming you'll need them next. This is spatial locality.

This is cache-friendly:

// Row-major: Accessing memory sequentially
for(int i = 0; i < n; i++) {
    for(int j = 0; j < m; j++) {
        sum += arr[i][j]; 
    }
}

This is a cache nightmare:

// Column-major: Jumping all over memory
for(int j = 0; j < m; j++) {
    for(int i = 0; i < n; i++) {
        sum += arr[i][j]; 
    }
}

In the second loop, you access arr[0][0], then arr[1][0], then arr[2][0]. These aren't next to each other in memory. Each access forces the CPU to fetch a new 64-byte cache line from slow RAM, and by the time the loop comes back around for arr[0][1], the line that held it has long been evicted.

Benchmark: On a 10000x10000 matrix, the row-major loop runs in ~300ms. The column-major loop takes ~2800ms.

That's a 9x performance hit on the exact same $$$O(N^2)$$$ complexity.

2. Stop Using std::list. Seriously.

"But std::list has $$$O(1)$$$ insertion!"

Irrelevant. That $$$O(1)$$$ is theoretical. In practice, std::list is slow for two reasons:

  1. Allocation Overhead: Every std::list push_back allocates a new node on the heap, which is slow. std::vector grows its buffer geometrically (typically doubling), so allocations are rare and push_back is amortized $$$O(1)$$$.

  2. Pointer Chasing: std::vector stores elements in a single contiguous block of memory (perfect for the cache). std::list scatters elements all over your RAM, connected by pointers.

Iterating a list means jumping from one random memory address to another—the definition of a cache miss.

Benchmark (1M elements):

  • Iterating std::vector: ~2ms
  • Iterating std::list: ~35ms (17x slower)

Unless you are frequently inserting/deleting in the middle of a massive list (which you almost never are in CP), just use std::vector.

3. if Statements are Expensive

Your CPU is smart. It uses branch prediction to guess which way an if statement will go before the condition is even evaluated. If it guesses right, the pipeline stays full and you lose nothing. If it guesses wrong (a misprediction), it has to flush its pipeline and restart, costing 10-20 clock cycles.

If your data is random, this if is unpredictable and slow:

// Unpredictable branch
for(int i = 0; i < n; i++) {
    if(arr[i] % 2 == 0) {
        count++;
    }
}

For random data, the CPU will be wrong ~50% of the time. You can often rewrite this to be branchless:

// Branchless
for(int i = 0; i < n; i++) {
    count += (arr[i] & 1) ^ 1;
}

This version has no if, no guessing, and no misprediction penalty. On random data, it can be 30-50% faster.

4. How Your structs Lie to You (AoS vs. SoA)

Let's say you have a Point struct and you only need to process the x coordinates.

Array of Structs (AoS) — The Bad Way:

struct Point { int x, y, z; };
vector<Point> points; // Memory: [x1,y1,z1], [x2,y2,z2], ...

When you access points[i].x, the CPU pulls the entire struct (x1, y1, z1) into the cache. You only wanted x1, so two thirds of every cache line is wasted on data you never touch.

Struct of Arrays (SoA) — The Fast Way:

struct Points {
    vector<int> x, y, z;
};
Points points; // Memory: [x1,x2,x3...], [y1,y2,y3...], [z1,z2,z3...]

Now, when you loop through points.x, you are accessing a contiguous block of memory. Every single byte loaded into the cache is data you actually need.

TL;DR

Stop just counting operations. Your $$$O(N \log N)$$$ can be slower than $$$O(N^2)$$$ if you ignore the hardware.

  • Always iterate in row-major order.
  • Use std::vector.
  • Avoid data-dependent ifs in hot loops.
  • Think about data layout (AoS vs. SoA) if you're really pushing time limits.

Your algorithm might be fast, but your implementation is what gets the AC.

thank u


»
6 months ago

Just a note:
This blog is intended for beginners and specialists who are trying to understand why their correct $$$O(N^2)$$$ solution might get TLE.
Hope it helps you!


»
6 months ago

Learnt something new! Especially the row-major & column-major one :D

»
6 months ago

You know what's even better than vector? Static arrays

»
6 months ago

Point 4 looks a bit hypothetical to me.

If you put something in a struct, like a point in 2D or 3D space, the use case is often to access several parts of each single struct simultaneously, rather than to use a single property on every access. For example, you can sort $$$(x, y, z)$$$ points by their $$$y$$$-coordinate, but this operation probably won't be the bottleneck of your algorithm: why would you need the other two coordinates then.

Other than that, all good points! Nice to see them compiled in a single post.

»
6 months ago

I got my first FST on 2159B - Rectangles because of this: I had an inefficient loop where the vector elements were accessed all over the place. I instead made a new vector that stored the elements in the order they'd be accessed, and it passed with time to spare.

»
6 months ago

Your recommendation to stop using std::list is absolutely right. std::list has such a huge constant factor that data structures with $$$O(\log n)$$$ time complexity can sometimes be faster.

»
6 months ago

Number 3 is done automatically by the G++ compiler (I'm not sure if it needs -O2, but that flag is enabled automatically on Codeforces).

G++ is smart and turns a modulo by a constant into much faster bit operations at compile time.

Check how it does this at https://godbolt.org/

  • »
    6 months ago

    Thanks for the Godbolt tip. G++ does auto-optimize %2 into bit ops like &1, but the real killer is branch misprediction on unpredictable ifs (10-20 cycles each); going branchless dodges that in hot loops...

    Appreciate the correction—keeps me sharp!

  • »
    6 months ago

    I remember a recent case where I optimized branch prediction out of a dp in some ucup contest by using templates.

    int dp(int u, int v, int t) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    to

    template<const int t>
    int dp(int u, int v) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    The compiler then creates a function for every t and optimizes out the ifs completely. It turned 3s TLE into 1.1s AC.

    • »
      6 months ago

      Wait WHATTT????!!!!

      How?? Whyy??

      Please elaborate, I wish to learn this magic.

      • »
        6 months ago

        Sure. Let's say our dp is a weird dp solution for the Kadane problem. $$$dp[i][0]$$$ is the best answer starting from index i and the range hasn't been opened yet and $$$dp[i][1]$$$ is the best answer starting from index i and the range has been opened already. $$$dp[i][0] = max(dp[i+1][0], dp[i+1][1] + a[i]), dp[i][1] = max(0, dp[i+1][1] + a[i])$$$

        Code

        You can try the code on CF's custom invocation, sending 3000000 as input (the fast<0>(0) call gets around 300ms and the other gets around 400ms). The fast function is called as fast<0> and fast<1>, and the compiler knows at compile time which fast calls which fast.

        When compiling, if I understand correctly, it's the same as having two functions fast0 and fast1, so the parameter <t> is known at compile time and the ifs whose answer is already known are optimized out (so no operation is executed on that if). Another side benefit is that t isn't passed as an argument, so you use less memory on the stack. On a solution with even more, less "predictable" ifs, it'd make an even bigger difference.

        Edit: you can verify that my explanation is correct by looking at the assembly here: https://godbolt.org/z/rqecEWKsq.

        Usually this shouldn't be necessary, but if you have a slower-than-expected solution of the same complexity, it can make a difference. I learned this from the Stockfish code during my time coding my chess engine. Take a look at the Search::Worker::search function in https://github.com/official-stockfish/Stockfish/blob/master/src/search.cpp (I tried using a link but it didn't work), which uses a template. It took me looking at that, having some knowledge of template programming, and wondering why it's like that to realize this minor optimization is one of the reasons. If I recall correctly, this kind of thing is used especially in move generation.

    • »
      5 months ago

      This is some next level optimization!

»
6 months ago

Where did you learn all these things from?

»
6 months ago

Thanks for the blog; even though you are using AI, these are still helpful tips. I would like to ask if basic_string and vector have differences in performance, as the former has some potentially useful functions like substr() and also has all the functions of vector.

  • »
    6 months ago

    There are some minor differences, such as the size of a basic_string object being 32 bytes whereas a vector object is 24 bytes. This should not be a problem unless you store and want to iterate over a large collection of basic_string objects.

    basic_string does small string optimization, so it can store up to 15 bytes of data inline without doing any heap allocation. This can be useful if you want a lot of dynamic arrays but most of them are very small.

    Additionally, you may want to look into std::span if you want slicing of a vector. It is similar to basic_string_view. Both are views over a contiguous block of elements of type T, and both provide sub-slicing (span::subspan and basic_string_view::substr). Note that these types are cheap to construct so try not to hold a persistent reference to them. For example, if you have a span over a vector and you push_back into the vector then the span might get invalidated. Just construct it when you need it. I tend to do string_view{s}.substr(i, j - i + 1) == s2 if I need to compare a sub-string.

    My advice would be to not pick a favorite, and use it everywhere. Pick what you need.

»
6 months ago

thanks

»
6 months ago

This AI generated post, accompanied with the GIF at the end, reeks of contribution farming. It is very common on LeetCode. Neither platform benefits from it. I was fine with the blog being AI written, as the content can be very useful and new to a lot of people. But using AI generated replies to comments is just repulsive.

Why does the post have close to 200 upvotes within the span of 18 hours?

»
6 months ago

We can say the same about LinkedList<>() vs ArrayDeque<>() in Java; ArrayDeque is faster.

https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist

»
6 months ago

vector has insert() to insert several elements in the middle. Would you like to help me check if it's faster?

  • »
    6 months ago

    It will not be a meaningful comparison because list::insert and vector::insert have different time complexities.

    vector::insert: If reallocation happens, linear in the number of elements of the vector after insertion; otherwise, linear in the number of elements inserted plus std::distance(pos, end()).

    The memory allocation time complexity due to insert in a vector can be considered as amortized $$$O(1)$$$ due to the same reason as push_back. The reason being vector allocates more memory than required and it does not need to reallocate for the subsequent few inserts.

    So, the overhead mostly comes from copying and shifting the elements in [pos, end()) over.

    I wrote a small benchmark. You can increase the no. of insertions to see how the vector insert gets worse. Link: https://quick-bench.com/q/0hMXEe0tvlxqFLSJH1xInYvvneA#

    Inserting in a vector at the beginning is the worst case as all the elements need to be copied over. Inserting at the back is equivalent to push_back. The closer to the end you are inserting, the better the performance.

»
6 months ago

Thanks for the information, amazed by the if one. Will all this work for Python too?

»
6 months ago

I found number 1 very useful. Will keep it in mind.

Almost got TLE with this submission: https://mirror.codeforces.com/problemset/submission/2115/345078804

Easy AC by following the advice: https://mirror.codeforces.com/problemset/submission/2115/345079553

From 1843ms to 999ms

»
6 months ago

I got TLE on the last Div. 2. I couldn't understand why, as my code was supposed to perform fewer than 1e8 operations, and computers can perform about 3*1e8 operations in 3 seconds. After the contest, I only changed my vector<2,vector<2e5>> to unordered_map[2] and it got accepted, taking less than 400 ms.

»
6 months ago

Thanks! I have just solved a problem which was TLE because of that. It taught me more about why $$$O(N^2)$$$ solutions get TLE.

Also, I wonder who is still using std::list now. (:

»
6 months ago

demonicc I'd like to tell you that __builtin_sqrt is 10X faster than simple sqrt.

»
6 months ago

This leads to a tip when you are doing matrix multiplication in matrix fast exponentiation.

for(int i = 0; i < x; i ++)
    for(int k = 0; k < z; k ++)
        for(int j = 0; j < y; j ++)
            c[i][j] += a[i][k] * b[k][j];
for(int i = 0; i < x; i ++)
    for(int j = 0; j < y; j ++)
        for(int k = 0; k < z; k ++)
            c[i][j] += a[i][k] * b[k][j];

The first one is quicker than the second one.

The second one is a cache nightmare because its innermost loop varies k, so b[k][j] jumps between rows of b: a column-major access pattern.

»
6 months ago

There's a common mistake that many people make which significantly slows down their code: using endl. It doesn't just print a newline; it also flushes the output buffer on every call. While browsing submissions on Codeforces, I've noticed that many people, even at the Expert rank, still get TLE for using endl. I simply replaced endl with '\n' and the code got Accepted (AC).