demonicc's blog

By demonicc, history, 6 months ago, In English

You've all seen it. You and a friend both code an $$$O(N^2)$$$ solution. Theirs passes in 500ms, yours gets TLE. You check the algorithm—it's identical. What gives?

The problem isn't your algorithm. It's your implementation. You're ignoring the hardware, and the single biggest killer is cache locality.

1. The 9x Difference: Row-Major vs. Column-Major

Your CPU doesn't read memory byte by byte. It pulls in chunks called cache lines (usually 64 bytes). When you access arr[i][j], the CPU also loads arr[i][j+1], arr[i][j+2], etc., into a super-fast cache, assuming you'll need them next. This is spatial locality.

This is cache-friendly:

// Row-major: Accessing memory sequentially
for(int i = 0; i < n; i++) {
    for(int j = 0; j < m; j++) {
        sum += arr[i][j]; 
    }
}

This is a cache nightmare:

// Column-major: Jumping all over memory
for(int j = 0; j < m; j++) {
    for(int i = 0; i < n; i++) {
        sum += arr[i][j]; 
    }
}

In the second loop, you access arr[0][0], then arr[1][0], then arr[2][0]. These aren't next to each other in memory. For each access, the CPU has to fetch a new 64-byte cache line from slow RAM, throwing away the one it just got.

Benchmark: On a 10000x10000 matrix, the row-major loop runs in ~300ms. The column-major loop takes ~2800ms.

That's a 9x performance hit on the exact same $$$O(N^2)$$$ complexity.

2. Stop Using std::list. Seriously.

"But std::list has $$$O(1)$$$ insertion!"

Irrelevant. That $$$O(1)$$$ is theoretical. In practice, std::list is slow for two reasons:

  1. Allocation Overhead: Every push_back requires a new memory allocation, which is slow. std::vector allocates geometrically (1, 2, 4, 8...), so it's much faster on average.

  2. Pointer Chasing: std::vector stores elements in a single contiguous block of memory (perfect for the cache). std::list scatters elements all over your RAM, connected by pointers.

Iterating a list means jumping from one random memory address to another—the definition of a cache miss.

Benchmark (1M elements):

  • Iterating std::vector: ~2ms
  • Iterating std::list: ~35ms (17x slower)

Unless you are frequently inserting/deleting in the middle of a massive list (which you almost never are in CP), just use std::vector.

3. if Statements are Expensive

Your CPU is smart. It uses branch prediction to guess which way an if statement will go before the condition is even evaluated. If it guesses right, the pipeline stays full and you lose nothing. If it guesses wrong (a misprediction), it has to flush its pipeline and restart, costing 10-20 clock cycles.

If your data is random, this if is unpredictable and slow:

// Unpredictable branch
for(int i = 0; i < n; i++) {
    if(arr[i] % 2 == 0) {
        count++;
    }
}

For random data, the CPU will be wrong ~50% of the time. You can often rewrite this to be branchless:

// Branchless
for(int i = 0; i < n; i++) {
    count += (arr[i] & 1) ^ 1;
}

This version has no if, no guessing, and no misprediction penalty. On random data, it can be 30-50% faster.

4. How Your structs Lie to You (AoS vs. SoA)

Let's say you have a Point struct and you only need to process the x coordinates.

Array of Structs (AoS) — The Bad Way:

struct Point { int x, y, z; };
vector<Point> points; // Memory: [x1,y1,z1], [x2,y2,z2], ...

When you access points[i].x, the CPU loads the entire struct (x1, y1, z1) into the cache line. You only wanted x1, so 2/3 of that cache line is wasted on data you never touch.

Struct of Arrays (SoA) — The Fast Way:

struct Points {
    vector<int> x, y, z;
};
Points points; // Memory: [x1,x2,x3...], [y1,y2,y3...], [z1,z2,z3...]

Now, when you loop through points.x, you are accessing a contiguous block of memory. Every single byte loaded into the cache is data you actually need.

TL;DR

Stop just counting operations. Your $$$O(N \log N)$$$ can be slower than $$$O(N^2)$$$ if you ignore the hardware.

  • Always iterate in row-major order.
  • Use std::vector.
  • Avoid data-dependent ifs in hot loops.
  • Think about data layout (AoS vs. SoA) if you're really pushing time limits.

Your algorithm might be fast, but your implementation is what gets the AC.

thank u

  • Vote: I like it
  • +369
  • Vote: I do not like it

»
6 months ago, hide # |
Rev. 2  
Vote: I like it -6 Vote: I do not like it

Just a note:
This blog is intended for beginners and specialists who are trying to understand why their correct $$$O(N^2)$$$ solution might get TLE.
Hope it helps you!


  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    Ignore the downvotes, this is a banger post. Super practical stuff that actually helps you pass TLE. The 9x benchmark is crazy. Keep it up, bro!

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it -18 Vote: I do not like it

      Codeforces servers are very fast, and most code won't TLE as long as the algorithm is correct

      • »
        »
        »
        »
        5 months ago, hide # ^ |
         
        Vote: I like it 0 Vote: I do not like it

        It's partly true; sometimes Codeforces is too slow (like 30-40%) due to the large queue during div. 3s or 4s, but for div. 2 and div. 1 it's normal because not many noobies spam submit the problems.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +59 Vote: I do not like it

    The content is for people who love AI generated blogs, obviously

»
6 months ago, hide # |
 
Vote: I like it +4 Vote: I do not like it

Learnt something new! Especially the row-major & col-major one :D

»
6 months ago, hide # |
Rev. 2  
Vote: I like it +42 Vote: I do not like it

You know what's even better than vector? Static arrays

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -116 Vote: I do not like it

    Haha, you're 100% right.

    For pure speed in Competitive Programming (CP) where you know the maximum value of $$$N$$$, a global static array (or std::array) is unbeatable — no dynamic allocation overhead at all.

    My main point was just to show why std::vector (contiguous memory) is so much better than std::list (pointer chasing).
    But you're absolutely correct — static arrays are another level up.

»
6 months ago, hide # |
 
Vote: I like it +43 Vote: I do not like it

Point 4 looks a bit hypothetical to me.

If you put something in a struct, like a point in 2D or 3D space, the use case is often to access several parts of each single struct simultaneously, rather than to use a single property on every access. For example, you can sort $$$(x, y, z)$$$ points by their $$$y$$$-coordinate, but this operation probably won't be the bottleneck of your algorithm: why would you need the other two coordinates then.

Other than that, all good points! Nice to see them compiled in a single post.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -13 Vote: I do not like it

    That's a fair point, and you're right.

    In most common use cases (like sorting, finding distances, etc.), you’ll be accessing x, y, and z together. So the standard Array of Structs (AoS) is already quite cache-friendly. The Struct of Arrays (SoA) pattern is more of a niche optimization. It only helps in a specific bottleneck scenario — for example, if you had a hot loop that only needed to process the x-coordinates of millions of points, not y or z.

    Thanks for the feedback! It's a good clarification.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +8 Vote: I do not like it

    It is highly use case dependent, but it's a very common optimization. It can also help with auto-vectorization if you do operations such as f(x[i], y[i], z[i]) on the entire array. Ref: https://en.algorithmica.org/hpc/cpu-cache/aos-soa/ https://odin-lang.org/docs/overview/#soa-data-types

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

I got my first FST on 2159B - Rectangles because of this. I had an inefficient loop where the vector elements were accessed all over the place; I instead made a new vector that stored the elements in the order they'd be accessed, and it passed with time to spare.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -41 Vote: I do not like it

    Wow, thanks for sharing this!
    That's exactly the kind of scenario that causes TLEs or FSTs even when the algorithm is correct.
    A painful way to learn, but honestly, a super valuable fix.
    Great case study for the post!

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

Your recommendation to stop using std::list is absolutely right. std::list has such a huge constant factor that data structures with $$$O(\log N)$$$ time complexity can sometimes be faster.

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

Number 3 is done automatically by the G++ compiler (I'm not sure if it needs -O2, but it is done automatically on Codeforces)

G++ is smart and makes any modulo by a constant value much faster at compile time.

You can check how it does this at https://godbolt.org/

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it -44 Vote: I do not like it

    Thanks for the Godbolt tip! G++ does auto-optimize %2 into bit ops like &1, but the real killer is branch mispredictions on unpredictable ifs (10-20 cycles each); going branchless dodges that in hot loops...

    Appreciate the correction—keeps me sharp!

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +46 Vote: I do not like it

    I remember a recent case where I optimized branch prediction out of a dp in some ucup contest by using templates.

    int dp(int u, int v, int t) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    to

    template<const int t>
    int dp(int u, int v) {
      // memo stuff here
      if(t == 0) {
        // do a shit
      } else if(t == 1) {
        // do some shit
      } else {
        // do other shit
      }
    }
    

    The compiler then creates a function for every t and optimizes out the ifs completely. It turned 3s TLE into 1.1s AC.

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      Wait WHATTT????!!!!

      How?? Whyy??

      Please elaborate, I wish to learn this magic.

      • »
        »
        »
        »
        6 months ago, hide # ^ |
        Rev. 3  
        Vote: I like it +18 Vote: I do not like it

        Sure. Let's say our dp is a weird dp solution for the Kadane problem. $$$dp[i][0]$$$ is the best answer starting from index i and the range hasn't been opened yet and $$$dp[i][1]$$$ is the best answer starting from index i and the range has been opened already. $$$dp[i][0] = max(dp[i+1][0], dp[i+1][1] + a[i]), dp[i][1] = max(0, dp[i+1][1] + a[i])$$$

        Code

        You can try the code on cf's custom invocation sending 3000000 as input (using the fast<0>(0) call gets around 300ms and the other gets around 400ms). The fast function is called as fast<0> and fast<1> and the compiler knows in compile-time which fast calls which fast. When compiling, if I understand correctly, it's the same as having 2 functions fast0 and fast1, so the parameter <t> is known in compile time and the ifs that you already know the answer to are optimized (so no operation is executed on that if). Another side benefit is that t isn't passed as an argument so you use less memory on the stack. On a solution with even more ifs and less "predictable" ifs it'd make an even bigger difference. Edit: you can verify that my explanation is correct by looking at the assembly here https://godbolt.org/z/rqecEWKsq.

        Usually this shouldn't be necessary, but if you have a slower-than-expected solution with the same complexity then it can make a difference. I learned this from Stockfish code from my time coding my chess engine. Take a look at the Search::Worker::search function in https://github.com/official-stockfish/Stockfish/blob/master/src/search.cpp (I tried using a link but it didn't work), which uses a template. It took me looking at that, having some knowledge about template programming, and wondering about why it's like that to realize this minor optimization is one of the reasons for that. If I recall correctly this kind of stuff is used especially in move generation.

    • »
      »
      »
      5 months ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      This is some next level optimization!

»
6 months ago, hide # |
 
Vote: I like it -8 Vote: I do not like it

where did you study all these things from?

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +44 Vote: I do not like it

    My guess is from GPT.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    If you actually want to learn about this you can start by looking up Data Oriented Design. I think these topics mostly apply only to game development though.

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it +1 Vote: I do not like it

      I have started trying to learn making games in Unity recently, although that uses C#. These things are also important for algorithmic trading.

  • »
    »
    5 months ago, hide # ^ |
     
    Vote: I like it +5 Vote: I do not like it

    Cache locality is something that is taught in computer architecture courses

»
6 months ago, hide # |
Rev. 2  
Vote: I like it +7 Vote: I do not like it

Thanks for the blog, even though you are using AI they are still helpful tips. I would like to ask if basic_string and vector have differences in performance, as the former has some potentially useful functions like substr() and also has all the functions of vector.

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    There are some minor differences such as the size of a basic_string object is 32 bytes, whereas the size of a vector object is 24 bytes. This should not be a problem unless you store and want to iterate over a large collection of basic_string_view objects.

    basic_string does small string optimization, so it can store up to 15 bytes of data inline without doing any heap allocation. This can be useful if you want a lot of dynamic arrays but most of them are very small.

    Additionally, you may want to look into std::span if you want slicing of a vector. It is similar to basic_string_view. Both are views over a contiguous block of elements of type T, and both provide sub-slicing (span::subspan and basic_string_view::substr). Note that these types are cheap to construct so try not to hold a persistent reference to them. For example, if you have a span over a vector and you push_back into the vector then the span might get invalidated. Just construct it when you need it. I tend to do string_view{s}.substr(i, j - i + 1) == s2 if I need to compare a sub-string.

    My advice would be to not pick one favorite and use it everywhere; pick what you need.

    • »
      »
      »
      6 months ago, hide # ^ |
       
      Vote: I like it +3 Vote: I do not like it

      Thank you, I didn't know there were so many small differences between the two types, I'll keep that in mind.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

thanks

»
6 months ago, hide # |
 
Vote: I like it +6 Vote: I do not like it

This AI generated post, accompanied with the GIF at the end, reeks of contribution farming. It is very common on LeetCode. Neither of the platforms benefits from it. I was fine with the blog being AI written, as the content can be very useful and new to a lot of people. But using AI generated replies on comments is just repulsive.

Why does the post have close to 200 upvotes within the span of 18 hours?

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it +10 Vote: I do not like it

    Interesting. To me, a good bit of stuff in the post looks "too humanly" :)

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

We can say the same about LinkedList<>() vs ArrayDeque<>() in Java: ArrayDeque is faster.

https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

In vector there is insert() to insert several elements in the middle. Would you like to help me check if it's faster?

  • »
    »
    6 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    It will not be a meaningful comparison because list::insert and vector::insert both have different time complexities.

    vector::insert: If reallocation happens, linear in the number of elements of the vector after insertion; otherwise, linear in the number of elements inserted plus std::distance(pos, end()).

    The memory allocation time complexity due to insert in a vector can be considered as amortized $$$O(1)$$$ due to the same reason as push_back. The reason being vector allocates more memory than required and it does not need to reallocate for the subsequent few inserts.

    So, the overhead mostly comes from copying and shifting the elements in [pos, end()) over.

    I wrote a small benchmark. You can increase the no. of insertions to see how the vector insert gets worse. Link: https://quick-bench.com/q/0hMXEe0tvlxqFLSJH1xInYvvneA#

    Inserting in a vector at the beginning is the worst case as all the elements need to be copied over. Inserting at the back is equivalent to push_back. The closer to the end you are inserting, the better the performance.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

Thanks for the information, amazed by the if one. Will all this work for Python too?

»
6 months ago, hide # |
 
Vote: I like it +1 Vote: I do not like it

I found number 1 very useful. Will keep it in mind.

Almost got TLE with this submission: https://mirror.codeforces.com/problemset/submission/2115/345078804

Easy AC by following the advice: https://mirror.codeforces.com/problemset/submission/2115/345079553

From 1843ms to 999ms

»
6 months ago, hide # |
Rev. 2  
Vote: I like it 0 Vote: I do not like it

I got TLE on last div 2. I couldn't understand why, as my code was supposed to perform fewer than 1e8 operations, and computers can perform about 3*1e8 operations in 3 seconds. After the contest, I only changed my vector<2,vector<2e5>> to unordered_map[2] and it got accepted, taking less than 400 ms.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

Thanks! I have just solved a problem which was TLE because of that. It lets me know more about why $$$O(N^2)$$$ solutions TLE.

Also, I wonder who is still using std::list now. (:

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

demonicc I'd like to tell you that __builtin_sqrt is 10X faster than simple sqrt.

»
6 months ago, hide # |
 
Vote: I like it +13 Vote: I do not like it

This leads to a tip when you are doing matrix multiplication in matrix fast exponentiation.

// i-k-j order: the inner loop walks c[i][...] and b[k][...] row-wise
for(int i = 0; i < x; i ++)
    for(int k = 0; k < z; k ++)
        for(int j = 0; j < y; j ++)
            c[i][j] += a[i][k] * b[k][j];

// i-j-k order: the inner loop jumps a whole row of b per iteration
for(int i = 0; i < x; i ++)
    for(int j = 0; j < y; j ++)
        for(int k = 0; k < z; k ++)
            c[i][j] += a[i][k] * b[k][j];

The first one is quicker than the second one.

The second one is a cache nightmare because b[k][j] is accessed in column-major order.

»
6 months ago, hide # |
 
Vote: I like it 0 Vote: I do not like it

There's a common mistake that many people make which significantly slows down their code: using endl. While browsing submissions on Codeforces, I've noticed that many people, even at the Expert rank, still get TLE from using endl. I simply replaced endl with '\n' and the code got Accepted (AC).

  • »
    »
    5 months ago, hide # ^ |
     
    Vote: I like it 0 Vote: I do not like it

    In fact endl flushes the output stream, which might be more significant in interactive tasks.

    • »
      »
      »
      5 months ago, hide # ^ |
       
      Vote: I like it 0 Vote: I do not like it

      Yes, I use endl instead of doing cout << flush separately (is it bad practice for non-interactive problems?)