Exponential Length Substrings in Pattern Matching

#	User	Rating
1	tourist	3985
2	jiangly	3814
3	jqdai0815	3682
4	Benq	3529
5	orzdevinwang	3526
6	ksun48	3517
7	Radewoosh	3410
8	hos.lyric	3399
9	ecnerwala	3392
9	Um_nik	3392

#	User	Contrib.
1	cry	169
2	maomao90	162
2	Um_nik	162
4	atcoder_official	161
5	djm03178	158
6	-is-this-fft-	157
7	adamant	155
8	awoo	154
8	Dominater069	154
10	luogu_official	150

Hi all,

I would like to share with you a part of my undergraduate thesis on a Multi-String Pattern Matcher data structure. In my opinion, it's easy to understand and hard to implement correctly and efficiently. It's competitive against other MSPM data structures (Aho-Corasick, suffix array/automaton/tree to name a few) when the dictionary size is specifically (uncommonly) large.

I would also like to sign up this entry to bashkort's Month of Blog Posts:-)

Abstract

This work describes a hash-based mass-searching algorithm, finding (count, location of first match) entries from a dictionary against a string $$$s$$$ of length $$$n$$$. The presented implementation makes use of all substrings of $$$s$$$ whose lengths are powers of $$$2$$$ to construct an offline algorithm that can, in some cases, reach a complexity of $$$O(n \log^2n)$$$ even if there are $$$O(n^2)$$$ possible matches. If there is a limit on the dictionary size $$$m$$$, then the precomputation complexity is $$$O(m + n \log^2n)$$$, and the search complexity is bounded by $$$O(n\log^2n + m\log n)$$$, even if it performs in practice like $$$O(n\log^2n + \sqrt{nm}\log n)$$$. Other applications, such as finding the number of distinct substrings of $$$s$$$ for each length between $$$1$$$ and $$$n$$$, can be done with the same algorithm in $$$O(n\log^2n)$$$.

Problem Description

We want to write an offline algorithm for the following problem, which receives as input a string $$$s$$$ of length $$$n$$$, and a dictionary $$$ts = \{t_1, t_2, .., t_{\lvert ts \rvert}\}$$$. As output, it expects for each string $$$t$$$ in the dictionary the number of times it is found in $$$s$$$. We could also ask for the position of the fist occurrence of each $$$t$$$ in $$$s$$$, but the paper mainly focuses on the number of matches.

Algorithm Description

We will build a DAG in which every node is mapped to a substring from $$$s$$$ whose length is a power of $$$2$$$. We will draw edges between any two nodes whose substrings are consecutive in $$$s$$$. The DAG has $$$O(n \log n)$$$ nodes and $$$O(n \log^2 n)$$$ edges.

We will break every $$$t_i \in ts$$$ down into a chain of substrings of 2-exponential length in strictly decreasing order (e.g. if $$$\lvert t \rvert = 11$$$, we will break it into $$$\{t[1..8], t[9..10], t[11]\}$$$). If $$$t_i$$$ occurs $$$k$$$ times in $$$s$$$, we will find $$$t_i$$$'s chain $$$k$$$ times in the DAG.

Rev.	By	When	Δ	Comment
en31	catalystgma	2024-10-06 14:47:18	0	(published)
en30	catalystgma	2024-10-06 14:45:49	459	tag typo
en29	catalystgma	2024-10-06 14:23:07	381	Small changes, tags
en28	catalystgma	2024-10-06 13:10:51	648	update toy LLM test
en27	catalystgma	2024-10-05 19:56:49	95
en26	catalystgma	2024-10-04 22:27:05	391	table floats
en25	catalystgma	2024-10-04 22:00:48	856	Acknowledgements completion
en24	catalystgma	2024-10-04 18:04:25	1430	Acknowledgements. TODO run nlp and reread
en23	catalystgma	2024-10-04 17:37:45	1102	LLM copyright
en22	catalystgma	2024-10-04 14:15:40	1694	Second batch benchmark
en21	catalystgma	2024-10-04 14:02:15	4179	First batch benchmark
en20	catalystgma	2024-10-04 13:21:21	2521	Practical Results part 1
en19	catalystgma	2024-10-02 19:24:23	3730	Complexity computation: with DAG compression
en18	catalystgma	2024-10-02 19:09:40	4996	Property 4.2.8 ended
en17	catalystgma	2024-10-02 18:59:38	1435	upto Property 4.2.6.
en16	catalystgma	2024-10-02 18:55:05	4390	Property 4.2.5 ended
en15	catalystgma	2024-10-02 15:36:08	3267	Complexity computation: with DAG compression (beginning, upto Theorem 4.2.3)
en14	catalystgma	2024-10-02 15:14:13	4221	Corner Case Improvement
en13	catalystgma	2024-10-01 14:51:07	1113	Complexity computation: without DAG compression
en12	catalystgma	2024-10-01 14:43:35	726	Complexity computation: without DAG compression (next table)
en11	catalystgma	2024-10-01 14:37:54	5235	Complexity computation: without DAG compression (half)
en10	catalystgma	2024-10-01 14:04:04	2080	DAG suffix compression theory
en9	catalystgma	2024-10-01 13:31:28	1689	Token count bound and Token propagation complexity
en8	catalystgma	2024-10-01 12:55:22	1272	Trie search comments
en7	catalystgma	2024-10-01 12:25:08	1159	Trie Search images in spoiler
en6	catalystgma	2024-09-30 23:25:19	1380	Redundancy Elimination upto Trie Search
en5	catalystgma	2024-09-30 23:02:01	139	add image of decent size
en4	catalystgma	2024-09-30 22:10:13	107	Tiny change: 'DAG.\n\n![]()' -> 'DAG.\n\n![ ](https://mirror.codeforces.com/5076b6/beamer_img2.png)'
en3	catalystgma	2024-09-30 19:57:10	7	Upload beamer images
en2	catalystgma	2024-09-30 18:59:08	1032
en1	catalystgma	2024-09-30 18:14:49	1342	Initial revision (saved to drafts)

Abstract

Problem Description

Algorithm Description

History