Meaf's blog

By Meaf, history, 20 months ago, In English

I've recently been working on suffix automaton code and I have an implementation that I believe to be more intuitive and easier to type (certainly shorter) than the implementation I've seen circulating around (mainly adapted from cp-algorithms). It also appears to benchmark a little faster than the cp-algorithms implementation as well, which is a nice perk. Please, let me know if you can find some improvements to make, either making the code shorter, faster, etc.

Edit: The primary purpose of this code is for ICPC-style contests which limit printed material and have no digital materials (unlike codeforces which allows copy/paste during contests). Thus, my primary focus is brevity and speed, not necessarily readability. There's always tradeoffs; for example, if 20 characters would be added to make the code much more readable, than that's a valid consideration, but I want the number of lines and number of characters to be as small as reasonable without making the code completely arcane.

// len is the longest-length substring ending here
// pos is the first index in the string matching here
// term is whether this node is a terminal (aka a suffix)
struct st { int len, pos, term; st *link; map<char, st*> next; };
st *suffixAutomaton(string &str) {
    st *last = new st(), *root = last;
    for(auto c : str) {
        st *p = last, *cur = last = new st{last->len + 1, last->len};
        while(p && !p->next.count(c))
            p->next[c] = cur, p = p->link;
        if (!p) cur->link = root;
        else {
            st *q = p->next[c];
            if (p->len + 1 == q->len) cur->link = q;
            else {
                st *clone = new st{p->len+1, q->pos, 0, q->link, q->next};
                for (; p && p->next[c] == q; p = p->link)
                    p->next[c] = clone;
                q->link = cur->link = clone;
            }
        }
    }
    while(last) last->term = 1, last = last->link;
    return root;
}
  • Vote: I like it
  • +58
  • Vote: I do not like it

»
20 months ago, # |
Rev. 2   Vote: I like it -27 Vote: I do not like it

I personally prefer not to use pointers until you don't have to. You can take a look on my implementation: i think, it should be clear (anyway, I've spent hours to make it such clean and fast as i think(lol))

Spoiler
  • »
    »
    20 months ago, # ^ |
    Rev. 2   Vote: I like it +32 Vote: I do not like it

    I think you might be missing the point of the post.

    Also
    Also 2
    • »
      »
      »
      20 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Thanks for pointing these things out! Suprisingly, I've never thought about just sorting states by their length. And sorry for inconvenience with reading my comment, I've already edited it.

  • »
    »
    20 months ago, # ^ |
      Vote: I like it +24 Vote: I do not like it

    What doing?

    • »
      »
      »
      20 months ago, # ^ |
        Vote: I like it +30 Vote: I do not like it

      Relevant profile pic.

»
20 months ago, # |
  Vote: I like it +8 Vote: I do not like it

I really like this implementation meaf!

»
20 months ago, # |
Rev. 2   Vote: I like it +8 Vote: I do not like it

My typical implementation with fpos looks as follows:

void add_letter(char c) {
    int p = last;
    last = sz++;
    len[last] = fpos[last] = len[p] + 1;
    for(; to[p][c] == 0; p = link[p]) {
        to[p][c] = last;
    }
    int q = to[p][c];
    if(q == last) {
        return;
    }
    if(len[q] == len[p] + 1) {
        link[last] = q;
        return;
    }
    int cl = sz++;
    to[cl] = to[q];
    link[cl] = link[q];
    fpos[cl] = fpos[q];
    len[cl] = len[p] + 1;
    link[q] = link[last] = cl;
    for(; to[p][c] == q; p = link[p]) {
        to[p][c] = cl;
    }
}

I think, in terms of simplicity, it's generally better to have shorter lines (e.g. don't go longer than 80 characters per line), and avoid writing several operations on the same line, unless you're absolutely certain it will make the code better... Besides that, structuring code in a way that earlier conditions allow preliminary termination (vs having few nested levels of conditions) is often considered simpler to comprehend.

P.S. In your code, you use while loop in the first part and for loop in the second, which is inconsistent and probably should be avoided...

  • »
    »
    20 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Ah yes, I forgot to mention that number of lines is important since this is for ICPC book code primarily rather than codeforces, so line count and character count are quite important. I'll take a look at re-replacing the while with the for, though, I see that it might be a little confusing to read during contest in a hackpack

»
20 months ago, # |
  Vote: I like it +3 Vote: I do not like it

If you are concerned about the possibility to implement it from scratch during a contest, then the main improvement point I see is surprisingly replacing pointers with std::vector + indices. Usually, I advocate for using pointers (especially, smart pointers) in tree-like data structures, but here I believe that indices are clearly superior. The idea of the pointer-based approach in general is that you can have the best of both worlds: the speed overhead becomes negligible if you are using a proper allocator, you can make your data structure persistent just by changing two lines of code, and even if on top of that you need to do a lot of deletions and the memory becomes a concern, it is also solvable without rewriting the whole data structure. However, in this particular case pointers don't do anything really useful. The suffix automaton can't really be made persistent (because of the amortization) and the deletion is also not a thing, unless you remove the last character only and this is basically a pop_back or two. On the other hand, indices are easier to debug, which may be important for some variations of suffix automata that you may want to implement.

The problem I am referring to is that there is no single best way to implement things like trie (Aho-Corasick included). The problem is in the variable that you call next. The naive solution is to use map or even unordered_map in each node and hope for the best, but it is not the fastest one. The fastest solution is to use std::array, but it works only if memory is not a problem and $$$\lvert \Sigma \rvert$$$ is small. Another approach that is somewhere in the middle is to make unordered_map that maps pairs (node, char) into nodes and in that case, using pointers makes this part nondeterministic, which is potentially problematic for debugging.

Just for the reference, here is my minimalistic implementation. It is slightly longer, takes 69 lines of code (nice!), but most of the difference comes from the different formatting, the algorithm is essentially the same.

  • »
    »
    20 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    It kind of can be made persistent, if you absolutely need it at all cost, but you'd need a data structure like link-cut tree on top of it to do so, unfortunately. Like, you would need to maintain the suffix link tree as a dynamic tree, and then the changes you do to the automaton are expressed as some updates on a path from a vertex to one of its ancestors. Probably for this specific purpose a dynamic suffix array or suffix tree would make a bit more sense.

    • »
      »
      »
      20 months ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      But even then you store nodes in a link-cut tree and still can survive accessing them through the indices. I don't see it as any kind of problem, especially, having in mind that you usually have this interface in a regular link-cut tree.

      Accessing vertices of the inner tree by index in logarithmic time instead of constant may look as a problem, but, first of all, I think it is unavoidable because of the persistency and also we have bigger problems with the fact that we can't use splay trees anymore (due to similar amortized problems), so our link-cut tree would work in time $$$\Theta(\log^2)$$$.

      But maybe there is a bit more clever solution than simply putting everything in a link-cut tree. At least, our paths have some additional structure that we potentially may use. I think, it should be either known and written somewhere in a literature or a promising topic to research.

    • »
      »
      »
      20 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      If it’s possible to reverse the string, it’s easier to deal with adding/removing from the front of the string, as opposed to the end (or equivalently, working with prefixes instead of suffixes). A nice offline solution for a persistent prefix array where you add and remove from the end of the string works by building a trie on the queries, and running the suffix array algorithm on the prefixes of the trie (similar to binary lifting). Then you can build the full LCS (as opposed to LCP) array of everything as normal, and maintaining the current LCS array in a set as you DFS the trie. Alternatively, you can do it fully online with hashing to compare suffixes and compute LCS.

      There’s a funny problem May I Add A Letter? where the judge solution was persistent suffix automaton, but it ended up getting hacked.

      • »
        »
        »
        »
        20 months ago, # ^ |
        Rev. 2   Vote: I like it +8 Vote: I do not like it

        To correct the record: the real author of that problem is arthur.nascimento and the official solution is some weird dynamic suffix array structure that can't get hacked because it's not amortized (I think it's similar to what you described in your first paragraph).

        However, some contestants used a persistent suffix automaton to get AC because the cases weren't strong enough. At least in Brazil this is partially my fault, since a few weeks before the contest I mentioned using persistent suffix automaton for a different problem (where the string was highly constrained so I don't think it was hackable -- but I didn't prove it).

        • »
          »
          »
          »
          »
          20 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Oh my bad, I thought that user was a judge from this blog. That is one of my favorite problems, thanks to you both for working on it!

  • »
    »
    20 months ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    Regarding pointers, I think the current implementation is pretty nice, apart from the use of new everywhere. Using a simple function to allocate nodes from a deque will make it faster (about as fast as using a vector and indices) and also can be debugged better (if your point about debugging was about having all nodes in one place). Using pointers for trees is also quite idiomatic (and short) and that's also a reason why many ICPC team codebooks (including KACTL) use them despite being slow.

    • »
      »
      »
      20 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Maybe there is a pre-allocation and overload of new happening behind the scenes..

      • »
        »
        »
        »
        20 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        With deques, allocation automatically happens in blocks, so it's better in general.

    • »
      »
      »
      20 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Ironically, I found this implementation to be faster than the standard CP-Algorithms implementation using a vector and indices! And yes, this code is for ICPC hackpack/book code, so I prioritized code length over speed. To that point, is there a way you can think of removing the new keyword?

  • »
    »
    20 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Hi, thanks for the comment! I implemented both and did my best to minimize the size, and I found that using pointers was significantly shorter than using indices. It's also just easier to use, in my opinion :p.

    I'll take a look at your implementation and see if there are any tricks I can incorporate into my code. But yes, the primary concern is the ability to type it from scratch in a contest; I don't really much care about my codeforces rating (look at my profile, lol); instead, I'm focusing on ICPC contests (since I've qualified for the NAC and am World Finals hopeful), so I want my book code to be as short and implementable as possible

    • »
      »
      »
      20 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      In that case, I think that indices are the better option. Unlike things like fft, you would never implement the suffix automaton in its shortest form, instead, it will always contain a dp of some sort. In these cases, debuggability is a major concern (and in ICPC it always is a concern in general) and indices allow you to simply print all vertices and check the correctness of your dp on paper without spending any computer time.

      Also, speaking from experience, I would say that the reliability of your library both in team reference and inside your head is the #1 priority, especially for the WF. So, I don't think that reducing the code size by, let's say, 10% should be a priority of any kind when you are building your team reference document, and all reasonable implementations of the suffix automaton vary in length on that 10% at most (if we ignore the formatting).

      • »
        »
        »
        »
        20 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        I found that for many applications, it suffices to add a single value to the node and then do the dfs / dp, but maybe I need more SAM problems. Do you have any that come to mind?

»
20 months ago, # |
  Vote: I like it +11 Vote: I do not like it

nice implementation