After the recent ABC, I realized I suck at string algorithms (the only one I know is hashing). What are the important string algos I should learn (for cp) and in what order should I learn them?
# | User | Rating |
---|---|---|
1 | tourist | 4009 |
2 | jiangly | 3823 |
3 | Benq | 3738 |
4 | Radewoosh | 3633 |
5 | jqdai0815 | 3620 |
6 | orzdevinwang | 3529 |
7 | ecnerwala | 3446 |
8 | Um_nik | 3396 |
9 | ksun48 | 3390 |
10 | gamegame | 3386 |
# | User | Contrib. |
---|---|---|
1 | cry | 167 |
2 | Um_nik | 163 |
3 | maomao90 | 162 |
3 | atcoder_official | 162 |
5 | adamant | 159 |
6 | -is-this-fft- | 158 |
7 | awoo | 157 |
8 | TheScrasse | 154 |
9 | Dominater069 | 153 |
9 | nor | 153 |
After the recent ABC, I realized I suck at string algorithms (the only one I know is hashing). What are the important string algos I should learn (for cp) and in what order should I learn them?
Name |
---|
Em, String hashing should be enough for most problems, but if you want to go above and beyond, here are some cool things to learn.
Aho-Corasick.
Trie (Prefix Tree).
Suffix Array.
Suffix Tree.
KMP / Z-Algorithm / Manachers.
That should be about it.
I think Suffix Automaton is very important as well.
Thank you for the advice! Is there any particular order I should learn these (in terms of importance/usefulness, but also in terms of difficulty/prerequisites)?
Personally, I would say learn as much as you can in any order, but in terms of broadness I would prioritize Trie and Z-algorithm first because they can be widely applied in a bunch of areas. The rest can be learned after, since they aren't extremely common and implementing them yourself might be tricky.
I see. Thanks again!
FYI you can get red without knowing any string algorithms other than hashing.
This is actually very interesting to me. I wonder why it is that many coders can get very far into their competitive programming journeys and not really use or learn string algorithms. I myself am an example of this, I have been doing cp for over a year now and have not really touched any string algorithms. Its not that string algorithms aren’t useful, I have seen them in plenty of problems and contests, but I just never bothered to learn them for some reason. From what I have observed of other coders, this is true for a decent amount of them as well.
Hard string problems are rare nowadays.
+1, I've still only solved 1 problem in contest that required a "string algorithm" (1466G - Song of the Sirens), and it was a while after I got red. [*]
Still, it's nice to know a few so you don't end up completely stuck on a problem that requires them (and so you don't feel like there's a big gap in your knowledge).
[*] but of course, I'm not sure if there were any problems I should have solved along the way, but lacked the strings knowledge.
Here's the strings problem that appeared in the latest ICPC Regionals from my country (Asia-Kanpur 2019).
Would it be possible to solve this without any fancy strings algorithms?
Combine hashing with the idea that "if the total length of strings is limited with $$$L$$$, the number of different lengths among strings $$$\le \sqrt{2L}$$$" and you can be completely fine for quite a long time. However, at some point, you will understand that in some problems you basically do the same stuff as in, for example, suffix array, but with hashes. At this moment you will know that learning suffix structures is beneficial for you.
Before this magical moment, knowing how suffix structures work will not be as rewarding: even if you learn how they can be applied, solving problems (beyond educational ones) involving them during contests will be extra tough.
Still, learning simple stuff like prefix- or z- function is rewarding at any level.
If you don't mind, could you please elaborate on your statement: "if the total length of strings is limited with L, the number of different lengths among strings ≤√2L." I'm not sure what is meant by this.
If in a problem you have $$$N$$$ strings, usually their total length is limited with some number $$$L$$$ (otherwise you will get like $$$10^5 \times 10^5$$$ characters in the input which is impossible). Then, if you can solve the problem in close-to-linear time for all strings of the same length, you can get something like $$$O(L\sqrt{L})$$$ solution which often can be squeezed in the time limit with a couple heuristics.
This idea helps when you have some string queries. If the intended solution requires, for example, suffix array, time limit that allows slow intended implementations to pass will often allow a fast hash-based implementation to squeeze in.
Got it, thanks!
To give you an example, https://mirror.codeforces.com/contest/547/problem/E.
Along with the suffix array solution, it has a simple hash-based solution involving the described idea.
Oh, I see. So because there are only sqrt(L) different lengths, we dont have to check all substrings — we can just check the sqrt(L) ones. Then we can just use prefix sums and calculate the answer.
Close enough. You cannot store prefix sums explicitly as it will lead to $$$O(NQ)$$$ complexity. You need to optimise it a little bit (think in binary search direction).
Eertree is a cool data structure that deals with palindromes (also called "Palindromic tree").
Usually, you won't need anything sophisticated like this (maybe for hard problems but still, it's rare), the most important string algorithms/data structures are the basic ones like hashing and trie, and maybe the more complicated (but still, not hard) Z-array and Manacher's algorithm.