Could someone please help me to optimize my solution in AtCoder Next DP Contest's problem R

Hi Codeforces!

I recently tried to solve R — Triples from the AtCoder NDPC, but my submission is getting Time Limit Exceeded (TLE).

My Approach

Let $$$L = 24$$$ be the maximum number of bits, $$$B = L / 2 = 12$$$, and $$$|x|$$$ denote the popcount of $$$x$$$. Also, let $$$U$$$ be the universal mask (all $$$L$$$ bits set to 1).

Let $$$f_i$$$ be the number of good sequences ending with the tuple $$$(u_i, v_i, w_i)$$$. To compute this efficiently, we need to handle the DP transitions in two main steps:

Step 1: Push the current $$$f_i$$$ to future states in $$$O(2^{B})$$$

We can divide the update into three cases based on the bit counts:

Case 1: $$$|u_i \oplus v_i| \le B$$$ > The number of free bits is small. We can simply iterate through all allowed future masks $$$w_j$$$ directly.
Case 2: $$$|U \setminus v_i| \ge |u_i|$$$ > We maintain two multi-dimensional arrays, $$$A$$$ and $$$B$$$. We divide the mask into two halves: the lower $$$B$$$ bits and the higher $$$B$$$ bits. Then, we pick the half that has more positive bits. > Suppose the lower half $$$T$$$ has more bits, and the higher half is $$$H$$$. We iterate over all masks $$$S$$$ such that $$$T \subseteq S$$$, and add $$$f_i$$$ to $$$A_{S,H}$$$. > The symmetric case (where the higher half has more bits) is handled similarly by updating array $$$B$$$.
Case 3: Otherwise (i.e., $$$|U \setminus v_i| \lt |u_i|$$$) > We do a similar state-splitting trick, but applying the Principle of Inclusion-Exclusion (PIE) on $$$u_i$$$ and $$$U \setminus v_i$$$.

Step 2: Calculate the value of $$$f_i$$$ in $$$O(2^{B})$$$

This part is straightforward. To pull the answer for the current state $$$w_i$$$, we simply sum up the contributions from the direct updates (Case 1) and the queries from our two SOS DP arrays ($$$A$$$ and $$$B$$$).

Complexity

Time Complexity: $$$O(N \cdot 2^{L / 2})$$$
Memory Complexity: $$$O(2^L + N)$$$

Both bounds seem perfectly fine for the given constraints ($$$N \le 3 \cdot 10^5, L=24$$$), but I still got a TLE after submitting this solution :-(

Is my complexity analysis fundamentally wrong? Or is there a massive constant factor / cache miss issue hidden in the DP array memory layout that I completely missed?

Any insights or suggestions for constant factor optimization would be greatly appreciated! Thanks in advance!

Note: I used a LLM to polish this blog.

	Rev.	Язык	Кто	Когда	Δ	Комментарий
	en2		ETO_leader	2026-05-06 17:05:07	2	Tiny change: 'mplexity\n- Time' -> 'mplexity\n\n- Time'
	en1		ETO_leader	2026-05-06 16:57:05	2578	Initial revision (published)

Rev.

Язык

Кто

Когда

Комментарий

en2