demonicc's blog

By demonicc, history, 6 months ago, In English

You've all seen it. You and a friend both code an $$$O(N^2)$$$ solution. Theirs passes in 500ms, yours gets TLE. You check the algorithm—it's identical. What gives?

The problem isn't your algorithm. It's your implementation. You're ignoring the hardware, and the single biggest killer is poor cache locality.

1. The 9x Difference: Row-Major vs. Column-Major

Your CPU doesn't read memory byte by byte. It pulls in chunks called cache lines (usually 64 bytes). When you access arr[i][j], the CPU also loads arr[i][j+1], arr[i][j+2], etc., into a super-fast cache, assuming you'll need them next. This is spatial locality.

This is cache-friendly:

// Row-major: Accessing memory sequentially
for(int i = 0; i < n; i++) {
    for(int j = 0; j < m; j++) {
        sum += arr[i][j]; 
    }
}

This is a cache nightmare:

// Column-major: Jumping all over memory
for(int j = 0; j < m; j++) {
    for(int i = 0; i < n; i++) {
        sum += arr[i][j]; 
    }
}

In the second loop, you access arr[0][0], then arr[1][0], then arr[2][0]. These aren't next to each other in memory. Each access forces the CPU to fetch a new 64-byte cache line from slow RAM, and by the time the loop comes back around for arr[0][1], the line that held it has long been evicted.

Benchmark: On a 10000x10000 matrix, the row-major loop runs in ~300ms. The column-major loop takes ~2800ms.

That's a 9x performance hit on the exact same $$$O(N^2)$$$ complexity.

2. Stop Using std::list. Seriously.

"But std::list has $$$O(1)$$$ insertion!"

Irrelevant. That $$$O(1)$$$ is theoretical. In practice, std::list is slow for two reasons:

  1. Allocation Overhead: Every std::list push_back allocates a new node on the heap, which is slow. std::vector grows its buffer geometrically (typically doubling), so allocations are rare and push_back is amortized $$$O(1)$$$.

  2. Pointer Chasing: std::vector stores elements in a single contiguous block of memory (perfect for the cache). std::list scatters elements all over your RAM, connected by pointers.

Iterating a list means jumping from one random memory address to another—the definition of a cache miss.

Benchmark (1M elements):

  • Iterating std::vector: ~2ms
  • Iterating std::list: ~35ms (17x slower)

Unless you are frequently inserting/deleting in the middle of a massive list (which you almost never are in CP), just use std::vector.

3. if Statements are Expensive

Your CPU is smart. It uses branch prediction to guess which way an if statement will go before the condition is even evaluated. If it guesses right, the pipeline stays full and you lose nothing. If it guesses wrong (a misprediction), it has to flush its pipeline and restart, costing 10-20 clock cycles.

If your data is random, this if is unpredictable and slow:

// Unpredictable branch
for(int i = 0; i < n; i++) {
    if(arr[i] % 2 == 0) {
        count++;
    }
}

For random data, the CPU will be wrong ~50% of the time. You can often rewrite this to be branchless:

// Branchless
for(int i = 0; i < n; i++) {
    count += (arr[i] & 1) ^ 1;
}

This version has no if, no guessing, and no misprediction penalty. On random data, it can be 30-50% faster.

4. How Your structs Lie to You (AoS vs. SoA)

Let's say you have a Point struct and you only need to process the x coordinates.

Array of Structs (AoS) — The Bad Way:

struct Point { int x, y, z; };
vector<Point> points; // Memory: [x1,y1,z1], [x2,y2,z2], ...

When you access points[i].x, the CPU pulls the entire struct (x1, y1, z1) into the cache. You only wanted x1, so two thirds of every cache line is wasted on data you never touch.

Struct of Arrays (SoA) — The Fast Way:

struct Points {
    vector<int> x, y, z;
};
Points points; // Memory: [x1,x2,x3...], [y1,y2,y3...], [z1,z2,z3...]

Now, when you loop through points.x, you are accessing a contiguous block of memory. Every single byte loaded into the cache is data you actually need.

TL;DR

Stop just counting operations. Your $$$O(N \log N)$$$ can be slower than $$$O(N^2)$$$ if you ignore the hardware.

  • Always iterate in row-major order.
  • Use std::vector.
  • Avoid data-dependent ifs in hot loops.
  • Think about data layout (AoS vs. SoA) if you're really pushing time limits.

Your algorithm might be fast, but your implementation is what gets the AC.

thank u


»
6 months ago

Just a note:
This blog is intended for beginners and specialists who are trying to understand why their correct $$$O(N^2)$$$ solution might get TLE.
Hope it helps you!


»
6 months ago

Learnt something new! Especially the row-major & column-major one :D

»
6 months ago

You know what's even better than vector? Static arrays

»
6 months ago

Point 4 looks a bit hypothetical to me.

If you put something in a struct, like a point in 2D or 3D space, the use case is often to access several parts of each single struct simultaneously, rather than to use a single property on every access. For example, you can sort $$$(x, y, z)$$$ points by their $$$y$$$-coordinate, but this operation probably won't be the bottleneck of your algorithm: why would you need the other two coordinates then.

Other than that, all good points! Nice to see them compiled in a single post.

»
6 months ago

I got my first FST on 2159B - Rectangles because of this: I had an inefficient loop where the vector elements were accessed all over the place. I instead made a new vector that stored the elements in the order they'd be accessed, and it passed with time to spare.

»
6 months ago

Your recommendation to stop using std::list is absolutely right. std::list has such a huge constant factor that data structures with $$$O(\log n)$$$ time complexity can sometimes be faster.

»
6 months ago

Number 3 is done automatically by the G++ compiler (I'm not sure if it needs -O2, but that flag is enabled automatically on Codeforces).

G++ is smart and turns a modulo by a constant into much faster bit operations at compile time.

Check how it does this at https://godbolt.org/

  • »
    6 months ago

    Thanks for the Godbolt tip. G++ does auto-optimize %2 into bit ops like &1, but the real killer is branch misprediction on unpredictable ifs (10-20 cycles each); going branchless dodges that in hot loops...

    Appreciate the correction—keeps me sharp!

  • »
    6 months ago

    I remember a recent case where I optimized branch prediction out of a dp in some ucup contest by using templates.

    int dp(int u, int v, int t) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    to

    template<const int t>
    int dp(int u, int v) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    The compiler then creates a function for every t and optimizes out the ifs completely. It turned 3s TLE into 1.1s AC.

    • »
      6 months ago

      Wait WHATTT????!!!!

      How?? Whyy??

      Please elaborate, I wish to learn this magic.

      • »
        6 months ago

        Sure. Let's say our dp is a weird dp solution for the Kadane problem. $$$dp[i][0]$$$ is the best answer starting from index i and the range hasn't been opened yet and $$$dp[i][1]$$$ is the best answer starting from index i and the range has been opened already. $$$dp[i][0] = max(dp[i+1][0], dp[i+1][1] + a[i]), dp[i][1] = max(0, dp[i+1][1] + a[i])$$$

        Code

        You can try the code on CF's custom invocation, sending 3000000 as input (the fast<0>(0) call gets around 300ms and the other gets around 400ms). The fast function is called as fast<0> and fast<1>, and the compiler knows at compile time which fast calls which fast.

        When compiling, if I understand correctly, it's the same as having two functions fast0 and fast1, so the parameter <t> is known at compile time and the ifs whose answer is already known are optimized out (so no operation is executed on that if). Another side benefit is that t isn't passed as an argument, so you use less memory on the stack. On a solution with even more, less "predictable" ifs, it'd make an even bigger difference.

        Edit: you can verify that my explanation is correct by looking at the assembly here: https://godbolt.org/z/rqecEWKsq.

        Usually this shouldn't be necessary, but if you have a slower-than-expected solution of the same complexity, it can make a difference. I learned this from the Stockfish code during my time coding my chess engine. Take a look at the Search::Worker::search function in https://github.com/official-stockfish/Stockfish/blob/master/src/search.cpp (I tried using a link but it didn't work), which uses a template. It took me looking at that, having some knowledge of template programming, and wondering why it's like that to realize this minor optimization is one of the reasons. If I recall correctly, this kind of thing is used especially in move generation.

    • »
      5 months ago

      This is some next level optimization!

»
6 months ago

Where did you learn all these things from?

»
6 months ago

Thanks for the blog; even though you are using AI, these are still helpful tips. I would like to ask if basic_string and vector have differences in performance, as the former has some potentially useful functions like substr() and also has all the functions of vector.

  • »
    6 months ago

    There are some minor differences, such as the size of a basic_string object being 32 bytes whereas a vector object is 24 bytes. This should not be a problem unless you store and want to iterate over a large collection of basic_string objects.

    basic_string does small string optimization, so it can store up to 15 bytes of data inline without doing any heap allocation. This can be useful if you want a lot of dynamic arrays but most of them are very small.

    Additionally, you may want to look into std::span if you want slicing of a vector. It is similar to basic_string_view. Both are views over a contiguous block of elements of type T, and both provide sub-slicing (span::subspan and basic_string_view::substr). Note that these types are cheap to construct so try not to hold a persistent reference to them. For example, if you have a span over a vector and you push_back into the vector then the span might get invalidated. Just construct it when you need it. I tend to do string_view{s}.substr(i, j - i + 1) == s2 if I need to compare a sub-string.

    My advice would be to not pick a favorite, and use it everywhere. Pick what you need.

»
6 months ago

thanks

»
6 months ago

This AI generated post, accompanied with the GIF at the end, reeks of contribution farming. It is very common on LeetCode. Neither platform benefits from it. I was fine with the blog being AI written, as the content can be very useful and new to a lot of people. But using AI generated replies to comments is just repulsive.

Why does the post have close to 200 upvotes within the span of 18 hours?

»
6 months ago

We can say the same about LinkedList<>() vs ArrayDeque<>() in Java; ArrayDeque is faster.

https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist

»
6 months ago

vector has insert() to insert several elements in the middle. Would you like to help me check if it's faster?

  • »
    6 months ago

    It will not be a meaningful comparison because list::insert and vector::insert have different time complexities.

    vector::insert: If reallocation happens, linear in the number of elements of the vector after insertion; otherwise, linear in the number of elements inserted plus std::distance(pos, end()).

    The memory allocation time complexity due to insert in a vector can be considered as amortized $$$O(1)$$$ due to the same reason as push_back. The reason being vector allocates more memory than required and it does not need to reallocate for the subsequent few inserts.

    So, the overhead mostly comes from copying and shifting the elements in [pos, end()) over.

    I wrote a small benchmark. You can increase the no. of insertions to see how the vector insert gets worse. Link: https://quick-bench.com/q/0hMXEe0tvlxqFLSJH1xInYvvneA#

    Inserting in a vector at the beginning is the worst case as all the elements need to be copied over. Inserting at the back is equivalent to push_back. The closer to the end you are inserting, the better the performance.

»
6 months ago

Thanks for the information, amazed by the if one. Will all this work for Python too?

»
6 months ago

I found number 1 very useful. Will keep it in mind.

Almost got TLE with this submission: https://mirror.codeforces.com/problemset/submission/2115/345078804

Easy AC by following the advice: https://mirror.codeforces.com/problemset/submission/2115/345079553

From 1843ms to 999ms

»
6 months ago

I got TLE on the last Div. 2. I couldn't understand why, as my code was supposed to perform fewer than 1e8 operations, and computers can perform about 3*1e8 operations in 3 seconds. After the contest, I only changed my vector<2,vector<2e5>> to unordered_map[2] and it got accepted, taking less than 400 ms.

»
6 months ago

Thanks! I have just solved a problem which was TLE because of that. It taught me more about why $$$O(N^2)$$$ solutions get TLE.

Also, I wonder who is still using std::list now. (:

»
6 months ago

demonicc I'd like to tell you that __builtin_sqrt is 10X faster than simple sqrt.

»
6 months ago

This leads to a tip when you are doing matrix multiplication in matrix fast exponentiation.

for(int i = 0; i < x; i ++)
    for(int k = 0; k < z; k ++)
        for(int j = 0; j < y; j ++)
            c[i][j] += a[i][k] * b[k][j];
for(int i = 0; i < x; i ++)
    for(int j = 0; j < y; j ++)
        for(int k = 0; k < z; k ++)
            c[i][j] += a[i][k] * b[k][j];

The first one is quicker than the second one.

The second one is a cache nightmare because its innermost loop varies k, so b[k][j] jumps between rows of b: a column-major access pattern.

»
6 months ago

There's a common mistake that many people make which significantly slows down their code: using endl. It doesn't just print a newline; it also flushes the output buffer on every call. While browsing submissions on Codeforces, I've noticed that many people, even at the Expert rank, still get TLE for using endl. I simply replaced endl with '\n' and the code got Accepted (AC).