You've all seen it. You and a friend both code an $$$O(N^2)$$$ solution. Theirs passes in 500ms, yours gets TLE. You check the algorithm—it's identical. What gives?
The problem isn't your algorithm. It's your implementation. You're ignoring the hardware, and the single biggest killer is cache locality.
1. The 9x Difference: Row-Major vs. Column-Major
Your CPU doesn't read memory byte by byte. It pulls in chunks called cache lines (usually 64 bytes). When you access arr[i][j], the CPU also loads arr[i][j+1], arr[i][j+2], etc., into a super-fast cache, assuming you'll need them next. This is spatial locality.
This is cache-friendly:
// Row-major: Accessing memory sequentially
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        sum += arr[i][j];
    }
}
This is a cache nightmare:
// Column-major: Jumping all over memory
for (int j = 0; j < m; j++) {
    for (int i = 0; i < n; i++) {
        sum += arr[i][j];
    }
}
In the second loop, you access arr[0][0], then arr[1][0], then arr[2][0]. These aren't next to each other in memory. For each access, the CPU has to fetch a new 64-byte cache line from slow RAM, throwing away the one it just got.
Benchmark: On a 10000x10000 matrix, the row-major loop runs in ~300ms. The column-major loop takes ~2800ms.
That's a 9x performance hit on the exact same $$$O(N^2)$$$ complexity.
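If you want to reproduce this on your own machine, the two loops above can be wrapped as functions and timed. A minimal sketch (the function names sumRowMajor/sumColMajor are mine, and your exact numbers will differ):

```cpp
#include <cstddef>
#include <vector>

// Cache-friendly traversal: consecutive j's are adjacent in memory,
// so each 64-byte cache line fetched is fully used.
long long sumRowMajor(const std::vector<std::vector<int>>& arr) {
    long long sum = 0;
    for (std::size_t i = 0; i < arr.size(); i++)
        for (std::size_t j = 0; j < arr[i].size(); j++)
            sum += arr[i][j];
    return sum;
}

// Cache-hostile traversal: consecutive i's are a whole row apart,
// so almost every access pulls in a fresh cache line.
long long sumColMajor(const std::vector<std::vector<int>>& arr) {
    long long sum = 0;
    std::size_t n = arr.size(), m = n ? arr[0].size() : 0;
    for (std::size_t j = 0; j < m; j++)
        for (std::size_t i = 0; i < n; i++)
            sum += arr[i][j];
    return sum;
}
```

Wrap each call between two std::chrono::steady_clock::now() timestamps and compare. Compile with -O2 so you're measuring memory behavior, not abstraction overhead.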
2. Stop Using std::list. Seriously.
"But std::list has $$$O(1)$$$ insertion!"
Irrelevant. That $$$O(1)$$$ is theoretical. In practice, std::list is slow for two reasons:
- Allocation Overhead: Every std::list push_back requires a new memory allocation, which is slow. std::vector grows its capacity geometrically (1, 2, 4, 8...), so it allocates rarely and is much faster on average.
- Pointer Chasing: std::vector stores elements in a single contiguous block of memory (perfect for the cache). std::list scatters elements all over your RAM, connected by pointers.
Iterating a list means jumping from one random memory address to another—the definition of a cache miss.
Benchmark (1M elements):
- Iterating std::vector: ~2ms
- Iterating std::list: ~35ms (17x slower)
Unless you are frequently inserting/deleting in the middle of a massive list (which you almost never are in CP), just use std::vector.
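To see the pointer-chasing cost yourself, iterate the same values through both containers. A sketch (the helper names are mine; absolute timings depend on your machine and allocator):

```cpp
#include <list>
#include <vector>

// Contiguous block: the hardware prefetcher streams straight through it.
long long sumVector(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v) sum += x;
    return sum;
}

// One heap node per element: each step dereferences a pointer to a
// (potentially) random address, i.e. a likely cache miss.
long long sumList(const std::list<int>& l) {
    long long sum = 0;
    for (int x : l) sum += x;
    return sum;
}
```

Fill both with the same million random ints, time each sum, and the vector wins by an order of magnitude despite identical $$$O(N)$$$ complexity.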
3. if Statements are Expensive
Your CPU is smart. It uses branch prediction to guess which way an if statement will go before the condition is even evaluated. If it guesses right, execution continues at full speed. If it guesses wrong (a misprediction), it has to flush its pipeline and restart, costing roughly 10-20 clock cycles.
If your data is random, this if is unpredictable and slow:
// Unpredictable branch
for (int i = 0; i < n; i++) {
    if (arr[i] % 2 == 0) {
        count++;
    }
}
For random data, the CPU will be wrong ~50% of the time. You can often rewrite this to be branchless:
// Branchless
for (int i = 0; i < n; i++) {
    count += (arr[i] & 1) ^ 1;
}
This version has no if, no guessing, and no misprediction penalty. On random data, it can be 30-50% faster.
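Both versions count the same thing, so you can benchmark them side by side on random data. A sketch (the function names are mine; the actual speedup depends on your CPU's predictor and on whether the compiler auto-vectorizes the loop):

```cpp
#include <vector>

// Branchy: one data-dependent if per element; on random input the
// predictor is wrong roughly half the time.
int countEvenBranchy(const std::vector<int>& arr) {
    int count = 0;
    for (int x : arr)
        if (x % 2 == 0) count++;
    return count;
}

// Branchless: the parity bit is flipped and added directly, so the
// predictor has nothing to guess. On two's-complement machines this
// handles negatives the same way x % 2 == 0 does.
int countEvenBranchless(const std::vector<int>& arr) {
    int count = 0;
    for (int x : arr) count += (x & 1) ^ 1;
    return count;
}
```

One caveat: if the data is sorted (or otherwise patterned), the predictor learns the pattern and the branchy version is just as fast, so measure before rewriting.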
4. How Your structs Lie to You (AoS vs. SoA)
Let's say you have a Point struct and you only need to process the x coordinates.
Array of Structs (AoS) — The Bad Way:
struct Point { int x, y, z; };
vector<Point> points; // Memory: [x1,y1,z1], [x2,y2,z2], ...
When you access points[i].x, the cache line the CPU fetches also carries y1 and z1. You only wanted x1, so 2/3 of that cache line is wasted.
Struct of Arrays (SoA) — The Fast Way:
struct Points {
    vector<int> x, y, z;
};
Points points; // Memory: [x1,x2,x3...], [y1,y2,y3...], [z1,z2,z3...]
Now, when you loop through points.x, you are accessing a contiguous block of memory. Every single byte loaded into the cache is data you actually need.
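Summing only the x coordinates under the two layouts looks like this (a sketch; the function names are mine, and the measured gap grows with the size of the struct):

```cpp
#include <vector>

struct Point { int x, y, z; };

// AoS: every cache line pulled in while summing x also carries the
// y and z fields you never touch.
long long sumXAoS(const std::vector<Point>& pts) {
    long long sum = 0;
    for (const Point& p : pts) sum += p.x;
    return sum;
}

struct Points { std::vector<int> x, y, z; };

// SoA: the x's sit back-to-back, so every byte fetched is useful.
long long sumXSoA(const Points& pts) {
    long long sum = 0;
    for (int v : pts.x) sum += v;
    return sum;
}
```

The trade-off: SoA is clumsier when you do need all fields of one element at once, so reserve it for hot loops that touch a single field.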
TL;DR
Stop just counting operations. Your $$$O(N \log N)$$$ can be slower than $$$O(N^2)$$$ if you ignore the hardware.
- Always iterate in row-major order.
- Use std::vector.
- Avoid data-dependent ifs in hot loops.
- Think about data layout (AoS vs. SoA) if you're really pushing time limits.
Your algorithm might be fast, but your implementation is what gets the AC.