Hi Codeforces!
I recently tried to solve R — Triples from the AtCoder NDPC, but my submission is getting Time Limit Exceeded (TLE).
Here is my submission on atcoder.jp.
My Approach
Let $$$L = 24$$$ be the maximum number of bits, $$$B = L / 2 = 12$$$, and $$$|x|$$$ denote the popcount of $$$x$$$. Also, let $$$U$$$ be the universal mask (all $$$L$$$ bits set to 1).
Let $$$f_i$$$ be the number of good sequences ending with the tuple $$$(u_i, v_i, w_i)$$$. To compute this efficiently, we need to handle the DP transitions in two main steps:
Step 1: Push the current $$$f_i$$$ to future states in $$$O(2^{B})$$$
We can divide the update into three cases based on the bit counts:
Case 1: $$$|u_i \oplus v_i| \le B$$$ > The number of free bits is small. We can simply iterate through all allowed future masks $$$w_j$$$ directly.
Case 2: $$$|U \setminus v_i| \ge |u_i|$$$ > We maintain two multi-dimensional arrays, $$$A$$$ and $$$B$$$. We divide the mask into two halves: the lower $$$B$$$ bits and the higher $$$B$$$ bits. Then, we pick the half that has more positive bits. > Suppose the lower half $$$T$$$ has more bits, and the higher half is $$$H$$$. We iterate over all masks $$$S$$$ such that $$$T \subseteq S$$$, and add $$$f_i$$$ to $$$A_{S,H}$$$. > The symmetric case (where the higher half has more bits) is handled similarly by updating array $$$B$$$.
Case 3: Otherwise (i.e., $$$|U \setminus v_i| \lt |u_i|$$$) > We do a similar state-splitting trick, but applying the Principle of Inclusion-Exclusion (PIE) on $$$u_i$$$ and $$$U \setminus v_i$$$.
Step 2: Calculate the value of $$$f_i$$$ in $$$O(2^{B})$$$
This part is straightforward. To pull the answer for the current state $$$w_i$$$, we simply sum up the contributions from the direct updates (Case 1) and the queries from our two SOS DP arrays ($$$A$$$ and $$$B$$$).
Complexity
- Time Complexity: $$$O(N \cdot 2^{L / 2})$$$
- Memory Complexity: $$$O(2^L + N)$$$
Both bounds seem perfectly fine for the given constraints ($$$N \le 3 \cdot 10^5, L=24$$$), but I still got a TLE after submitting this solution :-(
Is my complexity analysis fundamentally wrong? Or is there a massive constant factor / cache miss issue hidden in the DP array memory layout that I completely missed?
Any insights or suggestions for constant factor optimization would be greatly appreciated! Thanks in advance!
Note: I used a LLM to polish this blog.








Well, it seems the TL is extremely tight and even my implementation of $$$O(N(\frac{4}{3})^L)$$$ solution pass in 3.4 second...
$$$O(N2^{L/2})$$$ solution probably need lot of optimization in order to pass.
To understand why this solution receives TLE, you could make a rough estimate of execution time based on RAM throughput.
Each update and query accesses scattered locations in an array much larger than the cache, so a worst case assumption can be that each triggers a fetch or a flush of a new cache line. Step 1, Case 1 performs $$$2^B$$$ reads and writes, Step 2 performs $$$2 \times 2^B$$$ reads, approximate RAM read throughput on AtCoder invokers is 40GB/s, so your solution could take $$$N \times 2^B \times 4 \times 64 \mathbin{/} 40 \mathbin{/} 2^{30} \approx 7$$$ seconds to finish. In practice, it likely doesn't need to fetch/flush a new cache line each access, but neither does it likely achieve max throughput, so it's just a very rough estimate, closer to the truth than an estimate based on arithmetic operations.
As to how you could optimize it, general approaches are vectorization (in this case not auto-), unrolling. Neither will reduce RAM traffic, but get you closer to achieving max throughput. If currently your solution does not guarantee often accessing spatially close elements close in time as well, when it could've, such as if your $$$S$$$ dimension is indeed outer with respect to $$$H$$$, fixing this may help. My first solution had a similar structure, but Step 2 could choose between A and B, giving an $$$O(\frac{1}{\sqrt{L}})$$$ expected multiplier to its complexity and taking memory writes upon itself, Step 1 Case 2 and 3 were merged and always did both parts while you seem to do the larger (perhaps a typo, or this is necessary for Step 2, idk), and could be shown to be somewhat faster asymptotically than $$$O(2^B)$$$ in expectation, allowing them to take over some heat from Case 1.