Exponential Length Substrings in Pattern Matching

Hi all,

I would like to share a part of my undergraduate thesis on a Multi-String Pattern Matcher (MSPM) data structure. In my opinion, it is easy to understand but hard to implement correctly and efficiently. It is competitive with other MSPM data structures (Aho-Corasick and suffix array/automaton/tree, to name a few) when the dictionary is unusually large.

I would also like to enter this post in bashkort's Month of Blog Posts :-)

Abstract

This work describes a hash-based mass-searching algorithm that, for every entry of a dictionary, finds the number of its occurrences in a string $$$s$$$ of length $$$n$$$ and the location of its first match. The presented implementation uses all substrings of $$$s$$$ whose lengths are powers of $$$2$$$ to construct an offline algorithm that can, in some cases, reach a complexity of $$$O(n \log^2n)$$$ even if there are $$$O(n^2)$$$ possible matches. If there is a limit $$$m$$$ on the dictionary size, then the precomputation complexity is $$$O(m + n \log^2n)$$$, and the search complexity is bounded by $$$O(n\log^2n + m\log n)$$$, although in practice it behaves like $$$O(n\log^2n + \sqrt{nm}\log n)$$$. Other applications, such as finding the number of distinct substrings of $$$s$$$ for each length between $$$1$$$ and $$$n$$$, can be handled by the same algorithm in $$$O(n\log^2n)$$$.

Problem Description

We want to write an offline algorithm for the following problem. The input is a string $$$s$$$ of length $$$n$$$ and a dictionary $$$ts = \{t_1, t_2, \ldots, t_{\lvert ts \rvert}\}$$$. The output is, for each string $$$t$$$ in the dictionary, the number of times it occurs in $$$s$$$. We could also ask for the position of the first occurrence of each $$$t$$$ in $$$s$$$, but the paper mainly focuses on the number of matches.
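
To pin down the expected output, here is a brute-force reference in Python. It is only a sketch for checking small cases, not the algorithm from the thesis; the function name `count_occurrences` and the choice of `-1` for "not found" are assumptions made for illustration. Occurrences are counted with overlaps.

```python
def count_occurrences(s, ts):
    """Brute-force reference: (count, first position) for each t in ts.

    Overlapping occurrences are counted; first position is -1 if t is absent.
    """
    result = []
    for t in ts:
        count = 0
        first = -1
        start = s.find(t)
        while start != -1:
            if first == -1:
                first = start
            count += 1
            # Restart one character later so overlapping matches are found.
            start = s.find(t, start + 1)
        result.append((count, first))
    return result


print(count_occurrences("ababab", ["aba", "baba", "abb"]))
```

A faster implementation can be validated against this function on small random inputs.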

Algorithm Description

We will build a DAG in which every node is mapped to a substring of $$$s$$$ whose length is a power of $$$2$$$. We will draw an edge between any two nodes whose substrings are consecutive in $$$s$$$, i.e. the second substring starts right after the first one ends. The DAG has $$$O(n \log n)$$$ nodes and $$$O(n \log^2 n)$$$ edges.
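
The node and edge counts can be checked on small inputs with a direct construction. The sketch below is an illustration only: the name `build_dag` and the representation of a node as a `(start, length)` pair are assumptions, and the thesis implementation may store nodes differently (for example, by hash).

```python
def build_dag(s):
    """Nodes are (start, length) pairs with length a power of two.

    An edge joins two nodes when the second block starts exactly where
    the first one ends.
    """
    n = len(s)
    nodes = []
    starting_at = [[] for _ in range(n + 1)]  # node ids grouped by start index
    length = 1
    while length <= n:
        for start in range(n - length + 1):
            starting_at[start].append(len(nodes))
            nodes.append((start, length))
        length *= 2
    edges = []
    for u, (start, length) in enumerate(nodes):
        for v in starting_at[start + length]:
            edges.append((u, v))
    return nodes, edges


nodes, edges = build_dag("ababab")
print(len(nodes), len(edges))
```

Each start index has at most $$$\log n$$$ block lengths, and each node has at most $$$\log n$$$ successors, which is where the two bounds come from.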

We will break every $$$t_i \in ts$$$ down into a chain of substrings of $$$2$$$-exponential length in strictly decreasing order (e.g. if $$$\lvert t \rvert = 11$$$, we will break it into $$$\{t[1..8], t[9..10], t[11]\}$$$). If $$$t_i$$$ occurs $$$k$$$ times in $$$s$$$, we will find $$$t_i$$$'s chain $$$k$$$ times in the DAG.
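
The decomposition follows the binary representation of $$$\lvert t \rvert$$$. A minimal sketch, where `chain_pieces` is a hypothetical helper name and the indices are 0-based (unlike the 1-based example above):

```python
def chain_pieces(t):
    """Split t into blocks whose lengths are strictly decreasing powers of two."""
    pieces = []
    pos = 0
    remaining = len(t)
    while remaining > 0:
        # Largest power of two that does not exceed the remaining length.
        size = 1 << (remaining.bit_length() - 1)
        pieces.append(t[pos:pos + size])
        pos += size
        remaining -= size
    return pieces


print(chain_pieces("abcdefghijk"))  # lengths 8, 2, 1
```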

Figure 1: The DAG for $$$s = (ab)^3$$$. If $$$ts = \{aba, baba, abb\}$$$, then $$$t_0 = aba$$$ is found twice in the DAG, $$$t_1 = baba$$$ once, and $$$t_2 = abb$$$ zero times.

Redundancy Elimination: Associated Trie, Tokens, Trie Search

A generic search for $$$t_0 = aba$$$ in the DAG checks whether any node labeled $$$ab$$$ has a child labeled $$$a$$$. $$$t_2 = abb$$$ is never found, but a part of its chain ($$$ab$$$) is. We would have to check all $$$ab$$$s to see if any of them continues with a $$$b$$$, but we have already visited every $$$ab$$$ to check whether it continues with an $$$a$$$ for $$$t_0$$$, which makes the second set of checks redundant.

Figure 2: If the chains of $$$t_i$$$ and $$$t_j$$$ have a common prefix, it is inefficient to count the number of occurrences of the prefix twice. We will put all the $$$t_i$$$ chains in a trie. We will keep the hashes of the values on the trie edges.
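
A trie over the chains can be sketched as follows. This is an illustration under assumptions: the polynomial hash, its constants `MOD` and `BASE`, and the names `poly_hash`, `chain_pieces` and `build_trie` are not taken from the thesis. Each trie edge is keyed by the length and hash of a chain block, so chains with a common prefix share nodes.

```python
MOD = (1 << 61) - 1   # assumed modulus for the illustration
BASE = 911382323      # assumed base for the illustration


def poly_hash(piece):
    h = 0
    for ch in piece:
        h = (h * BASE + ord(ch)) % MOD
    return h


def chain_pieces(t):
    """Blocks of t with strictly decreasing power-of-two lengths."""
    pieces, pos, remaining = [], 0, len(t)
    while remaining > 0:
        size = 1 << (remaining.bit_length() - 1)
        pieces.append(t[pos:pos + size])
        pos += size
        remaining -= size
    return pieces


def build_trie(ts):
    children = [{}]  # children[node][(length, hash)] = child node id
    terminal = []    # terminal[i] = trie node where the chain of ts[i] ends
    for t in ts:
        node = 0
        for piece in chain_pieces(t):
            key = (len(piece), poly_hash(piece))
            if key not in children[node]:
                children[node][key] = len(children)
                children.append({})
            node = children[node][key]
        terminal.append(node)
    return children, terminal


children, terminal = build_trie(["aba", "baba", "abb"])
print(len(children), terminal)
```

For the dictionary of Figure 1, the chains of $$$aba$$$ and $$$abb$$$ share the edge $$$ab$$$, so the prefix is stored, and later searched, only once.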

To treat all chain searches in the DAG uniformly, we will add a starter node that points to all other nodes in the DAG. Now all DAG chains begin at the same node.

The actual search traverses the trie and uses the DAG to decide which edges can be followed. The two data structures cooperate through tokens. A token is defined by both its value (the index of the DAG node it is at) and its position (the trie node it is at).

This spoiler shows some Trie Search steps.

Revision en31 by catalystgma, published 2024-10-06 14:47:18.
