Z Algorithm - Codeforces

This is the first post in my Codeforces blog! I plan to use this as a place where I can write down new algorithms I learned, problems that I found interesting, or stupid mistakes I made during contests.

Today, the topic will be the Z Algorithm, which I learned as a result of failing to solve Problem B of Beta Round 93 (http://mirror.codeforces.com/contest/126/problem/B). There are some other solutions like binary search + hashing, but this one is quite nice. Anyway, first, a description of the algorithm and why it works; it is simple and makes a lot of sense (as all good algorithms are).

Algorithm

Given a string S of length n, the Z Algorithm produces an array Z where Z[i] is the length of the longest substring starting from S[i] which is also a prefix of S, i.e. the maximum k such that S[j] = S[i + j] for all 0 ≤ j < k. Note that Z[i] = 0 means that S[0] ≠ S[i]. For easier terminology, we will refer to substrings which are also a prefix as prefix-substrings.

The algorithm relies on a single, crucial invariant. As we iterate over the letters in the string (index i from 1 to n - 1), we maintain an interval [L, R] which is the interval with maximum R such that 1 ≤ L ≤ i ≤ R and S[L...R] is a prefix-substring (if no such interval exists, just let L = R = - 1). For i = 1, we can simply compute L and R by comparing S[0...] to S[1...]. Moreover, we also get Z[1] during this.

Now suppose we have the correct interval [L, R] for i - 1 and all of the Z values up to i - 1. We will compute Z[i] and the new [L, R] by the following steps:

If i > R, then there does not exist a prefix-substring of S that starts before i and ends at or after i. If such a substring existed, [L, R] would have been the interval for that substring rather than its current value. Thus we "reset" and compute a new [L, R] by comparing S[0...] to S[i...] and get Z[i] at the same time (Z[i] = R - L + 1).
Otherwise, i ≤ R, so the current [L, R] extends at least to i. Let k = i - L. We know that Z[i] ≥ min(Z[k], R - i + 1) because S[i...] matches S[k...] for at least R - i + 1 characters (they are in the [L, R] interval which we know to be a prefix-substring). Now we have a few more cases to consider.
If Z[k] < R - i + 1, then there is no longer prefix-substring starting at S[i] (or else Z[k] would be larger), meaning Z[i] = Z[k] and [L, R] stays the same. The latter is true because [L, R] only changes if there is a prefix-substring starting at S[i] that extends beyond R, which we know is not the case here.
If Z[k] ≥ R - i + 1, then it is possible for S[i...] to match S[0...] for more than R - i + 1 characters (i.e. past position R). Thus we need to update [L, R] by setting L = i and matching from S[R + 1] forward to obtain the new R. Again, we get Z[i] during this.

The process computes all of the Z values in a single pass over the string, so we're done. Correctness is inherent in the algorithm and is pretty intuitively clear.

Analysis

We claim that the algorithm runs in O(n) time, and the argument is straightforward. We never compare characters at positions less than R, and every time we match a character R increases by one, so there are at most n comparisons there. Lastly, we can only mismatch once for each i (it causes R to stop increasing), so that's another at most n comparisons, giving O(n) total.

Code

Simple and short. Note that the optimization L = R = i is used when S[0] ≠ S[i] (it doesn't affect the algorithm since at the next iteration i > R regardless).

int L = 0, R = 0;
for (int i = 1; i < n; i++) {
  if (i > R) {
    L = R = i;
    while (R < n && s[R-L] == s[R]) R++;
    z[i] = R-L; R--;
  } else {
    int k = i-L;
    if (z[k] < R-i+1) z[i] = z[k];
    else {
      L = i;
      while (R < n && s[R-L] == s[R]) R++;
      z[i] = R-L; R--;
    }
  }
}

Application

One application of the Z Algorithm is for the standard string matching problem of finding matches for a pattern T of length m in a string S of length n. We can do this in O(n + m) time by using the Z Algorithm on the string T Φ S (that is, concatenating T, Φ, and S) where Φ is a character that matches nothing. The indices i with Z[i] = m correspond to matches of T in S.

Lastly, to solve Problem B of Beta Round 93, we simply compute Z for the given string S, then iterate from i to n - 1. If Z[i] = n - i then we know the suffix from S[i] is a prefix, and if the largest Z value we've seen so far is at least n - i, then we know some string inside also matches that prefix. That gives the result.

int maxz = 0, res = 0;
for (int i = 1; i < n; i++) {
  if (z[i] == n-i && maxz >= n-i) { res = n-i; break; }
  maxz = max(maxz, z[i]);
}

int L = 0, R = 0; for (int i = 1; i < n; i++) { if (i > R) { L = R = i; while (R < n && s[R-L] == s[R]) R++; z[i] = R-L; R--; }else{ int k = i-L; if (z[k] < R-i+1) z[i] = z[k]; else { L = i; R = R + 1 while (R < n && s[R-L] == s[R]) R++; z[i] = R-L; R--; } }

Комментарии (62)

Показать архивные | Написать комментарий?

thphong

15 лет назад, скрыть # |

+21

The are many good blogs on Codeforces about algorithm. If they are post in same place (as tutorial) is so great.

→ Ответить

kvtoraman

15 лет назад, скрыть # ^ |

-11

so you dont like?

wil93

He simply said that there should be a place where all useful blog posts like this are categorized

Red_Before_GTA6

4 года назад, скрыть # ^ |

Hmmm, like some sort of... catalog?

+28

Interesting, I will check it out. It didn't exist 11 years ago

paladin8

I agree. Maybe systematically tagging them with "tutorial" would do the trick.

Abinash

11 лет назад, скрыть # ^ |

Here I post some tutorial link.

If you know any new tutorial blog post please comment , I will add them .

Thanks paladin8 for his awesome tutorial.

casher

Here also by using Z-algorithm 6780044

ivan.popelyshev

← Rev. 3 →

((e-maxx)) has a site with algorithms, but it's on russian:

http://translate.google.ru/translate?sl=ru&tl=en&js=n&prev=_t&hl=ru&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fe-maxx.ru%2Falgo%2F&act=url

e-maxx has a site with algorithms, but it's on russian:

Empty

14 лет назад, скрыть # ^ |

That site is awesome, if someone could translate it to english, i think that it would be the best algorithm reference ever.

shahidul_brur

← Rev. 2 →

Here is the english version of this site: http://e-maxx-eng.github.io

Also the pdf ebook containing all the tutorials and algo in English can be downloaded from: https://www.dropbox.com/s/j8z839pzamxxqw2/e-maxx_algo.ru.en.pdf?dl=1

Non Russians can use this site to read the tutorials in English: http://e-maxx-eng.github.io

Or if one visit that site using google chrome, then s/he can automaticaly translate in English with the help of google chrome browser.

Also I have a pdf file in English containing all posts on this site. This pdf ebook containing all the tutorials and algo in English can be downloaded from: https://www.dropbox.com/s/j8z839pzamxxqw2/e-maxx_algo.ru.en.pdf?dl=1

Darooha

+16

The KMP algorithm computes memo[i] which is the length of the longest prefix of s that matches the tail of s[1], s[2], ... s[i]. What are the advantages of the z-algorithm over KMP? Here's my code using KMP that solves the Codeforces password problem.

import java.io.*;
import java.util.*;

public class D {
    public static void main(String[] args) throws IOException{
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String s = in.readLine();
    char[] c = s.toCharArray();
    int n = c.length;
    int[] memo = new int[n];
    /* memo[i] will store the length of the longest prefix of s that matches the tail of s1...si */

    for (int i=1; i<n; i++) {
        int j=i;
        while (j>0 && c[i] != c[memo[j-1]]) j = memo[j-1];
        memo[i] = (j>0) ? memo[j-1]+1 : 0;
    }

    int max = 0;
    for (int i=1; i<n-1; i++) max = Math.max(max, memo[i]);
    /* max = Maximum internal match to prefix */
    j = memo[n-1];
    while (j>max) j = memo[j-1];
    System.out.println((j==0)? "Just a legend" : s.substring(0, j));
    }
}

As far as I know, both approaches work fine. I learned KMP before, but it didn't seem very intuitive to me, so I never fully understood string matching. The Z Algorithm is, in my opinion, easier to comprehend (and maybe to code, too). This motivated me to write this blog post :)

MikeMirzayanov

Glad to see you here!

Yep, this problem could be solved using KMP. Moreover Z- and prefix-function are equivalent (one can transform Z<->prefix in O(n)). But for me often it is easier to think in terms of Z-function to solve a problem.

AlexSkidanov

> one can transform Z<->prefix in O(n)

How?

mkagenius

may be this will work for i = 1 to n p[i + z[i]] = max(p[i + z[i]], z[i])

and one more travesing from n to 1. doing p[i] = max(p[i+1] - 1, p[i])

nashir

"and one more travesing from n to 1. doing p[i] = max(p[i+1] — 1, p[i])"

why is this necessary ?

I_love_natalia

A portion of necroposting.

You can solve 2 problems: 1) Create a sample string with given Z-function over an infinite alphabet. Just follow Z-function generation algorithm and store all the equalities in DST. Next, check the result. 2) Create a sample string with given prefix-function over an infinite alphabet. Just follow prefix function generation algorithm and store all the equalities in DST. Next, check the result.

(in Russian: http://contest.samara.ru/ru/problemset/735/)

WannaBe67

6 лет назад, скрыть # ^ |

Z to KMP — for i=1;i<n $$$P[i+Z[i]-1] = max(P[i+Z[i]-1],z[i])$$$

KMP to Z — for i=1;i<n $$$Z[i-P[i]+1] = max(Z[i-P[i]+1],P[i])$$$

liruqi

13 лет назад, скрыть # ^ |

I dont know if your Z<-> prefix transformation is mutual. I've written a post to try to construct z from kmp, http://liruqi.info/post/62511983824/kmp-z (In Chinese). Beside, you may check my code here, https://raw.github.com/liruqi/topcoder/master/HackerRank/SaveHumanity.cpp

RNR	8 лет назад, скрыть # ^ \| 0 Can someone give both transformation implementations? If possible with some explanations. Thanking you in advance. Learn and let learn. → Ответить

Burunduk1

What are the advantages of the z-algorithm over KMP?

No advantages =)

But if we compare Z-function and Prefix-function then:

1) "monotony"

If z[i] = x, it means substrings [0..j) [i..i+j) are equals for all j from 0 to x.

If p[i] = x, it means [0..j) = (i-j..i] for all j = x, p[x], p[p[x]] and so on.

2) LCP (Largest Common Prefix)

Z-function in fact calculates LCP[0,j] for all j. It can be used for not only substring searching.

I also have two examples of problems which, I hope, show advantages Z-function over Prefix-function.

1) Determine number (No.) of the string in its suffix array in O(n).

2) Determine number (amount) of substrings, which have at least two occuarances in the string in O(n²) (of course, you can solve it in O(n) using suffix tree).

kien_coi_1997

12 лет назад, скрыть # ^ |

Your comment is valuable. But you have some error, I will show them, to prevent others from misunderstanding. (Someone reopen the post, so I read your comment)

If z[i] = x, it means substrings [0..j) [i..i+j) are equals for all j from 0 to x-1.

Thanks. Fixed.

shubham1100

KMP requires only the smaller string to be known to allow for computing the prefix function. It is a stream based algorithm,you can work even if characters are given you one by one for the main string. This may be an advantage in certain scenarios.

vlade087

Test your skill using Z algorithm http://www.spoj.pl/problems/QUERYSTR . Good Luck

simp1eton

Wondering, is there any reason why the z-algorithm is not as famous as KMP? This is the first time I heard about it XD. It seems that z-algorithm is easier to understand.

aquamongoose

14 лет назад, скрыть # |

Sorry to revive this old topic, but I have a question. Let's say that i < R and z[i-L] > R-i+1. Wouldn't this mean that z[i] = R-i+1? Here is my reasoning.

s[R+1] != s[R-L+1], because if they were equal then z[L] would be greater than R-L+1.

In order for z[i] to be greater than R-i+1, s[R+1] must equal s[i-L+R-i+1].

This can be rewritten as s[R+1] must equal s[R-L+1], but it doesn't because of the first statement.

Therefore, assuming i < R, if z[i-L] != R-L+1, then it equals min(z[i-L],R-i+1).

Otherwise, it is still necessary to loop and update R and L values.

Is this true?

That argument seems to work. Doesn't change the complexity though :)

brunoja

← Rev. 5 →

Hi, I think that assumption is wrong, take this test case:

string input: ddcdddc  
z array     : 0102310  
your assumpt: 0102110

Is it right or I misunderstood what he said? :o

(sorry for editing, I am new to Markdown :p)

forthright48

Amazing tutorial. Thanks to you and e-maxx I have learned a new algorithm today :D

TarifEzaz

Thanks for this amazing tutorial. Keep writing! :)

piyush006

Very nice explanation of the algorithm , thanks !

__builtin__wolfy

12 лет назад, скрыть # |

-8

One of my favorite posts. Quick question: the algorithm seems to ignore z[0]; shouldn't it be the case that z[0] = n? Thanks.

Prestige

By the algorithm we consider that Z[0] is underfined

shamir0xe

you can add it end of your code, if necessary :)

farnasirim

ghabool dari nemikhaym z[0] ?

Pheonix_0

-9

can you please elaborate on what S[i] stores .

S is the string. S[i] is the symbol at position i. If S = "abcd", then S[0] = 'a', S[1] = 'b' and so on.

11 лет назад, скрыть # |

Just a silly optimization. In the line while (R < n && s[R-L] == s[R]) R++;, isn't the R < n redundant for strings? Because s[R-L] == s[R] will be false when we hit the \0 character. This could also work if we are using a sequence of numbers instead of strings (we have to add a number at the end that is different than all others).

dalex

Not everyone uses C++ char arrays to store strings

that's the point. I'm saying that you can use sequences of any type provided that you insert a sentinel element at the end of your sequence. Then you can save the R < n.

riadwaw

afair, std::string allows s[s.size()] and returns 0

beatoriche

it's guaranteed only for C++11 or later

Decrypto

Guys, this link at youtube might come in handy for a deeper insight and intuition into the algorithm. https://www.youtube.com/watch?v=MFK0WYeVEag

eagle93

Thank you for this awesome tutorial.

this link contains animation for Z-Algorithm, it may be helpful.

_Shakil

awesome animation...... Thank you eagle93

unrealsoul007

Thanks for the link! :)

sierra101

can any one tell me how to solve Problem B of Beta Round 93 with KMP by building Longest common prefix/suffix array :D

kipawa

If you want a much more detailed explanation with examples and complexity analysis, this link will be quite helpful!

vbgfew167

Now we are computing the value at i = k. The values L and R are as shown in the image. Now in the case Z[k] ≥ R - i + 1, since r + 1 and r' + 1 values are not same(if they were to be same, the interval would have been L and R + 1 instead of the current values), why do we need to check the values after r + 1?

diptesh1ce3

Helpful.

Can anyone answer my question above?

yashv

Can the longest sub-string and prefix overlap?

u33how

The O(n) complexity made me interested. Such a beautiful algorithm. Thank you :)

tieuchanlong

9 лет назад, скрыть # |

Does anyone have more problems related to Z Algorithm?I am appreciate if anyone could offer some.

greedy_me

9 лет назад, скрыть # ^ |

try this one

another one

stringray

7 лет назад, скрыть # |

← Rev. 7 →

explanation by my friend,

consider this example

s: A B A A B A B A A B B A B A B

z: 0 0 1 3 0 5 0 1 2 0 0 3 0 2 0

here's how we compute it

first, Z[0] = 0

our rightmost border is 0, starting from index 0

onto index 1. since we passed our border 0, we will compute it "normally" (brute force). we get 0 length. now our rightmost border is still not far enough (still 0), and its index is 1

index 2. compute by brute force, we get 1. now our right border is index 2, starting from index 2 (S[2...2] = S[0...0])

index 3. again we compute by brute force, we get 3. now out right border is index 5, starting from index 3 (S[3...5] = S[0...2])

index 4. this time 4 <= right border, so we use past computations. since 3 is the index of start, we put Z[1] into Z[4]. notice that if this makes the range of equality for 4 exceed the right border, we block it by the right border (so for example, if Z[1] = 3, then the range is 4...6 (length 3), and 6 exceeds the right border (5) so we change it to Z[4] = 2 instead of Z[4] = 3). so now, Z[4] = 0. we try extending by brute force, but it doesn't change our farthest is still starting from index 3 till index 5

index 5. using (3, 5), since 5 <= right border, we can use Z[2] for Z[5]. Z[2] is 1, and it just fits for Z[5] with 1, not to exceed our right border. we then try to extend by brute force, and we get Z[5] = 5. now our start index is 5 and border is 9

index 6. using (5, 9), we can use Z[1] for Z[6]. this results in Z[6] = 0. we then try to extend by brute force, but it keeps Z[6] = 0.

index 7. we use Z[2] for Z[7], getting Z[7] starting with value 1. we can try to extend it by brute force, but it will keep at 1.

index 8. we use Z[3] for Z[8], making Z[8] = 3. however, notice that Z[8] = 3 makes it exceed the right border, 9. therefor we block Z[8] by 2, making it Z[8] = 2. we then try to extend, but it doesn't change.

index 9. we use Z[4] for Z[9], making Z[9] = 0. we can try to extend, but it won't work

index 10. now we don't use past values, since we passed our border 9. so we try brute force, and get Z[10] = 0

index 11. again we can't use past values, so we do brute force. we get Z[11] = 3, and now our border is 13 with start 11.

index 12. using (11, 13), we do Z[12] = Z[1] = 0. brute force from here changes nothing.

index 13. using (11, 13), we do Z[13] = Z[2] = 1. it also doesn't exceed our border 13. we then try brute force to extend, and get Z[13] = 2, with new border 14

index 14. using (13, 14), we do Z[14] = Z[1] = 0. we cannot extend anymore.

O(n) reasoning: the first computation (using past values) is O(1) per index. also observe, that when we manage to extend by brute force, then we also extend our right border. since our right border only increases from 0 to n — 1, it changes O(n) times overall, O(n) computation

i thought like for i<=R case Z[i]>=Z[k] where k=i-L but here is a counter example

s: A A A B A A C

z: 0 2 1 0 2 1 0

as we finish index 4, our right border is 5, so we get the pair (4, 5) now for Z[5] so we use Z[1] for it Z[5] = 2 but Z[5] cannot be 2 or more, as you can see from the string the border means we currently have no information about equalities beyond the border.

now you can get the feel for this inequality z[i]>=min(z[k],R-i+1)

moral of this algorithm : try to use the past computational information by keeping track of the [L,R] (largest R basically and it's corresponding L ) values and using them and continuing the brute force comparision from new position to extend it to as maximum as possible , proof of time complexity is well explained by my friend .

the code mentioned by the author is correct but it is doing redundant work for the else case where Z[k]>=R-i+1 instead as we already know Z[i]>=R-i+1 we can just start our R from R+1 itself so the code now changes to this (we need to add this line R=R+1 that's it . )

}

this should make more sense now