[Tutorial] Counting subsequences in a regular language

#	User	Rating
1	Benq	3792
2	VivaciousAubergine	3647
3	jiangly	3631
4	Kevin114514	3574
5	maroonrk	3521
6	strapple	3515
7	Radewoosh	3461
8	tourist	3428
9	turmax	3378
10	Um_nik	3376

#	User	Contrib.
1	Qingyu	162
2	adamant	148
3	Um_nik	146
4	Dominater069	143
5	errorgorn	140
6	cry	138
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
10	soullless	133

Hello everyone! In this blog I want to share a simple but powerful technique: counting subsequences of a string that belong to a given regular language, by adding a DFA state dimension to the DP.

The idea itself is not new, but I haven't seen it written down explicitly as a general technique. Once you see it, you'll start recognizing it in many problems.

Maybe this is well known in China, but I'm happy because I discovered it independently.

Prerequisites

You should be comfortable with:

Dynamic programming.
Basic concepts from the theory of formal languages: alphabets, regular languages, and deterministic finite automata (DFA).

A fun motivating problem

Let's start with something concrete. You have a string $$$s$$$ of length $$$n$$$ over the alphabet $$$\Sigma = {\texttt{a}, \texttt{b}, \ldots, \texttt{z}, \texttt{@}, \texttt{.}}$$$. You want to count how many subsequences of $$$s$$$ form a valid email address.

We define a "valid email" as any string in the language:

$$$L = \{ \; w_1 \texttt{@} w_2 \texttt{.} w_3 \;\mid\; w_1, w_2 \in \{\texttt{a}\text{-}\texttt{z}\}^+, \; w_3 \in \{\texttt{a}\text{-}\texttt{z}\}^3 \;\}$$$

For example, a@b.com is valid. How would you solve this?

Step 1: build the DFA

The language $$$L$$$ is regular, so we can build a DFA $$$\mathcal{A} = (Q, \Sigma, \delta, q_0, F)$$$ that recognizes it. Here's the diagram Email DFA

Step 2: the DP

Now comes the key idea. We define:

$$$dp[i][q] = \text{number of subsequences of } s[0..i-1] \text{ that leave } \mathcal{A} \text{ in state } q$$$

The base case is $$$dp[0][q_0] = 1$$$ (the empty subsequence sits at the start state) and $$$dp[0][q] = 0$$$ for all other states.

When we go from row $$$i$$$ to row $$$i+1$$$, we consider character $$$s[i]$$$ and for each state $$$q$$$:

Skip $$$s[i]$$$: the subsequence doesn't change, so $$$dp[i+1][q]$$$ gets $$$dp[i][q]$$$.
Take $$$s[i]$$$: the subsequence extends by one character, and $$$\mathcal{A}$$$ transitions from $$$q$$$ to $$$\delta(q, s[i])$$$, so $$$dp[i+1][\delta(q, s[i])]$$$ gets $$$dp[i][q]$$$.

More precisely:

$$$dp[i+1][q] = dp[i][q] + \sum_{\substack{q' \in Q \\ \delta(q', s[i]) = q}} dp[i][q']$$$

The answer is $$$\sum_{q \in F} dp[n][q]$$$.

// dp[i][q] = number of subsequences of s[0..i-1] that leave the DFA in state q
vector<vector<long long>> dp(n + 1, vector<long long>(8, 0));
dp[0][0] = 1;

for (int i = 0; i < n; i++) {
    for (int q = 0; q < 8; q++) {
        // skip s[i]: carry over
        dp[i + 1][q] = (dp[i + 1][q] + dp[i][q]) % MOD;

        // take s[i]: transition
        int nq = delta[q][s[i]];
        if (nq != DEAD) {
            dp[i + 1][nq] = (dp[i + 1][nq] + dp[i][q]) % MOD;
        }
    }
}

cout << dp[n][7] << "\n"; // accepting state

Done. $$$O(n \cdot 8)$$$ time, and the table $$$dp[i][q]$$$ makes the structure completely explicit: row $$$i$$$ summarizes everything about the first $$$i$$$ characters.

The general technique

There's nothing special about emails. The same idea works for any regular language $$$L$$$ recognized by a DFA $$$\mathcal{A} = (Q, \Sigma, \delta, q_0, F)$$$.

We define:

$$$dp[i][q] = \text{number of subsequences of } s[0..i-1] \text{ that leave } \mathcal{A} \text{ in state } q$$$

Base case: $$$dp[0][q_0] = 1$$$.

Transition: for each position $$$i$$$ and each state $$$q$$$, we can skip $$$s[i]$$$ (carry $$$dp[i][q]$$$ to $$$dp[i+1][q]$$$) or take $$$s[i]$$$ (add $$$dp[i][q]$$$ to $$$dp[i+1][\delta(q, s[i])]$$$).

Answer: $$$\sum_{q \in F} dp[n][q]$$$.

Complexity: $$$O(n \cdot |Q|)$$$ time, $$$O(n \cdot |Q|)$$$ space (or $$$O(|Q|)$$$ if you only keep two consecutive rows).

That's the whole technique. The rest of this blog is about recognizing where to apply it.

Why does this work?

Each subsequence of $$$s$$$ is itself a string over $$$\Sigma$$$. We want to count how many of those strings belong to $$$L$$$. The DP simulates $$$\mathcal{A}$$$ on all $$$2^n$$$ possible subsequences simultaneously, grouping them by their current DFA state. This is the product construction: the intersection of the "language of all subsequences of $$$s$$$" with $$$L$$$, done implicitly through the DP.

Composing constraints

What if your problem has two constraints on the subsequence?

If both are regular: build the product automaton (states are pairs), or equivalently, add both state dimensions to the DP.
If one is regular and one is not (e.g., context-free): track the regular part with a DFA state, and the non-regular part with whatever other DP dimension it needs.

The dimensions just multiply. This is where the technique really shines.

Application: counting C comments

Here's another example. Given a string $$$s$$$ of length $$$n$$$ over the alphabet $$$\Sigma = {\texttt{/}, \texttt{*}, \texttt{a}\text{-}\texttt{z}}$$$, count the number of subsequences that form a valid C-style comment, that is, strings in the language:

$$$L_{\text{comment}} = \{\; \texttt{/} \texttt{*} \, w \, \texttt{*} \texttt{/} \;\mid\; w \in \Sigma^* \;\}$$$

For example, /* hello */ is valid. The subsequence must start with /*, end with */, and can have anything in between.

The DFA has 5 states. Here is the diagram: Comment DFA

We plug this directly into our technique:

$$$dp[i][q] = \text{number of subsequences of } s[0..i-1] \text{ that leave } \mathcal{A} \text{ in state } q$$$

And the answer is $$$dp[n][4]$$$. Five states, one pass, $$$O(5n)$$$. Done.

Now, what if we wanted to count subsequences that are both a balanced bracket sequence and contain a C comment? Suppose $$$\Sigma = {\texttt{(}, \texttt{)}, \texttt{/}, \texttt{*}}$$$. We just compose: track balance $$$b$$$ for the bracket structure and DFA state $$$q$$$ for the comment pattern:

$$$dp[i][b][q]$$$

The answer is $$$dp[n][0][4]$$$. That's it: composing constraints costs nothing beyond multiplying dimensions.

Application: Sub-RBS (hard version)

Let's see this in action on a harder problem. Consider B2. Sub-RBS (Hard Version):

Given a bracket sequence $$$s$$$ of length $$$n \leq 100$$$. For each subsequence $$$t$$$ of $$$s$$$, define its score as the length of the longest regular bracket subsequence of $$$t$$$ that is "better" than $$$t$$$ (under a custom ordering where ( $$$ \lt $$$ )), or $$$0$$$ if no such subsequence exists or $$$t$$$ is not a regular bracket sequence. Find the sum of scores over all non-empty subsequences, modulo $$$998244353$$$.

The key insight: a regular bracket sequence $$$t$$$ admits a "better" regular subsequence if and only if $$$t$$$ contains the subsequence $$$\texttt{)((}$$$, a ) followed (not necessarily immediately) by two (s. In formal terms, $$$t$$$ belongs to the language:

$$$L_{\text{pattern}} = \Sigma^* \texttt{)} \, \Sigma^* \texttt{(} \, \Sigma^* \texttt{(} \, \Sigma^*$$$

where $$$\Sigma = {\texttt{(}, \texttt{)}}$$$. This is a regular language! A 5-state DFA captures it. Here's the diagram: Sub RBS Automaton

Now we need subsequences that satisfy two conditions simultaneously:

Regular bracket sequence: this is context-free (not regular), but we can track it with a "balance" counter $$$b$$$ bounded by $$$n$$$.
Contains the $$$\texttt{)((}$$$ pattern: regular, tracked by our 5-state DFA.

Our DP becomes:

$$$dp[i][b][q]$$$

where $$$dp[i][b][q]$$$ counts the number of subsequences of $$$s[0..i-1]$$$ with bracket balance $$$b$$$ and DFA state $$$q$$$. We process each character choosing to take or skip, updating both dimensions simultaneously. At the end, we look at $$$dp[n][0][3]$$$: balance $$$= 0$$$ (valid bracket sequence) and DFA state $$$= 3$$$ (pattern confirmed).

But the problem doesn't just ask us to count these subsequences, it asks for the sum of their scores, where the score of a qualifying subsequence of length $$$\ell$$$ is $$$\ell - 2$$$. So we need to track not only how many subsequences reach each state, but also the sum of their lengths.

We add one more dimension: for each $$$(i, b, q)$$$ triplet, we store two values: - $$$dp[i][b][q][0]$$$ = the count of subsequences of $$$s[0..i-1]$$$ with balance $$$b$$$ and DFA state $$$q$$$. - $$$dp[i][b][q][1]$$$ = the sum of lengths of those subsequences.

When we extend a subsequence by taking a character, its length increases by 1. So if we have $$$\text{cnt}$$$ subsequences with total length sum $$$\text{sum}$$$, after extending all of them by one character, the new sum becomes $$$\text{sum} + \text{cnt}$$$ (each of the $$$\text{cnt}$$$ subsequences grew by 1).

At the end, the answer is:

$$$\sum_{\text{qualifying}} (\ell - 2) = dp[n][0][3][1] - 2 \cdot dp[n][0][3][0]$$$

The complexity is $$$O(n \cdot n \cdot 5) = O(n^2)$$$.

Solution code for Sub-RBS (hard version)

Code

#include <bits/stdc++.h>
using namespace std;
using ll = long long;
const ll mod = 998244353;

// DFA transitions for the pattern )((
// delta[q][c]: q = current state, c = 0 for '(', 1 for ')'
//              '('    ')'
// State 0:      0      1
// State 1:      2      1
// State 2:      3      4
// State 3:      3      3   (accepting)
// State 4:      3      4
int delta[5][2] = {
    {0, 1},
    {2, 1},
    {3, 4},
    {3, 3},
    {3, 4}
};

void solve() {
    int n;
    cin >> n;
    string s;
    cin >> s;

    // dp[i][b][q][0] = count of subsequences of s[0..i-1] with balance b and DFA state q
    // dp[i][b][q][1] = sum of lengths of those subsequences
    vector dp(n + 1, vector(n + 2, vector(5, vector<ll>(2, 0))));
    dp[0][0][0][0] = 1;

    for (int i = 0; i < n; i++) {
        int c = (s[i] == '(') ? 0 : 1;
        int db = (s[i] == '(') ? 1 : -1; // balance change when taking this character

        for (int b = 0; b <= n; b++) {
            for (int q = 0; q < 5; q++) {
                if (dp[i][b][q][0] == 0) continue;
                ll cnt = dp[i][b][q][0], sum = dp[i][b][q][1];

                // skip s[i]: carry over
                dp[i + 1][b][q][0] = (dp[i + 1][b][q][0] + cnt) % mod;
                dp[i + 1][b][q][1] = (dp[i + 1][b][q][1] + sum) % mod;

                // take s[i]: transition
                int nb = b + db;
                if (nb >= 0) {
                    int nq = delta[q][c];
                    dp[i + 1][nb][nq][0] = (dp[i + 1][nb][nq][0] + cnt) % mod;
                    dp[i + 1][nb][nq][1] = (dp[i + 1][nb][nq][1] + sum + cnt) % mod;
                }
            }
        }
    }

    ll len_sum = dp[n][0][3][1], cnt = dp[n][0][3][0];
    ll ans = ((len_sum - 2 * cnt) % mod + mod) % mod;
    cout << ans << endl;
}


int main() {
    int t;
    cin >> t;
    while (t--) solve();
    return 0;
}

Final thoughts

Thanks for reading! This is my first blog, so I hope it was useful! I hope this technique becomes a useful tool in your problem-solving toolbox. The recipe is always the same: spot the regular constraint, draw the DFA, add the dimension.

If you found any mistakes, have suggestions to improve the blog, or know other problems where this technique applies, please let me know in the comments. I'd also be curious to hear if this idea has been written up before somewhere else, I'd love to link to it.

Rev.	By	When	Δ	Comment
en5	storrealbac	2026-04-07 00:20:24	0	(published)
en4	storrealbac	2026-04-07 00:15:56	205
en3	storrealbac	2026-04-07 00:14:23	10	Tiny change: 'imgur.com/5tIefdY.png)\n\nW' -> 'imgur.com/IeUheAC.png)\n\nW'
en2	storrealbac	2026-04-07 00:10:00	12	Tiny change: 'imgur.com/fi9OGPR.png)\n\nW' -> 'imgur.com/5tIefdY.png)\n\nW'
en1	storrealbac	2026-04-07 00:06:11	11512	Initial revision (saved to drafts)

Prerequisites

A fun motivating problem

Step 1: build the DFA

Step 2: the DP

The general technique

Why does this work?

Composing constraints

Application: counting C comments

Application: Sub-RBS (hard version)

Solution code for Sub-RBS (hard version)

Final thoughts

History