Suffix array tutorial

By Samsam, history, 9 years ago, in English

Hello everybody, could somebody point me to a good tutorial on the suffix array data structure?

Comments (15)

Gogis

9 years ago

Why not Codechef?

Samsam

9 years ago, in reply

Thanks, actually I've read it before, but I need better resources.

Gogis

9 years ago, in reply

What do you mean by better resources?

Samsam

9 years ago, in reply

I felt that it was not clear enough for me.

Samsam

9 years ago, in reply

Did you learn this data structure from that tutorial?

Gogis

9 years ago, in reply

No, I don't know suffix arrays yet, they're still on my TODO list :D But I thought the explanation from kuruma would be pretty clear and detailed.

suxrib

9 years ago

What is the suffix array data structure? Can someone explain it in a few words?

adamant

9 years ago, in reply

It is an array $p$ for a string $s$ such that $s[p_i..] < s[p_{i+1}..]$ for every valid $i$.

lnishan

9 years ago, in reply

Simply put, a suffix array is a sorted array of all suffixes of a given string.
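To make the definition concrete, here is a minimal sketch that builds the array the slow way, by sorting suffix start positions with direct string comparison. The function name `naive_suffix_array` is made up for illustration, and suffixes are stored as 0-based starting indices rather than as copies of the strings.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Returns the 0-based starting indices of all suffixes of s, ordered so that
// the suffixes themselves are in increasing lexicographic order.
// Naive version: O(N log N) comparisons, each scanning up to O(N) characters.
std::vector<int> naive_suffix_array(const std::string& s) {
    std::vector<int> p(s.size());
    std::iota(p.begin(), p.end(), 0);  // p = 0, 1, ..., N-1
    std::sort(p.begin(), p.end(), [&](int a, int b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return p;
}
```

For `"banana"` this returns `{5, 3, 1, 0, 4, 2}`, i.e. the suffixes `a`, `ana`, `anana`, `banana`, `na`, `nana`.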

Xellos

9 years ago

Okay, think about what you need to build a suffix array in better than O(N²) time. You want to sort the suffixes in alphabetical order. Sorting is simple: it takes $O(C \cdot N \log N)$, where comparing 2 suffixes takes O(C) time. The main point is optimising C.

Comparing 2 suffixes = finding their longest common prefix (LCP), then you can just compare the characters that follow that LCP (if one suffix ends right there, it is the smaller one). The LCP can be found using binary search, where you only need to check if 2 substrings are equal. That can be done by comparing their hashes, and the hash of any substring can be computed in O(1) time with O(N) preprocessing. This way, the whole construction takes $O(N \log^2 N)$ time, and it's often sufficient (possibly with some effort to squeeze into the time limit).
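The hashing approach can be sketched roughly as follows. Everything named here (`HashedString`, `suffix_array_by_hashing`, `kMod`, `kBase`) is an illustrative assumption, not something from the comment. A single small modulus keeps the code short, but it makes collisions more likely than the larger or randomised hashes typically used in contests.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Illustrative constants (assumptions): one modulus and a fixed base.
constexpr std::uint64_t kMod = 1000000007ULL;
constexpr std::uint64_t kBase = 911;

struct HashedString {
    std::string s;
    std::vector<std::uint64_t> pref;  // pref[i] = hash of s[0, i)
    std::vector<std::uint64_t> pw;    // pw[i] = kBase^i mod kMod

    explicit HashedString(const std::string& str)
        : s(str), pref(str.size() + 1, 0), pw(str.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            pref[i + 1] = (pref[i] * kBase + static_cast<unsigned char>(s[i]) + 1) % kMod;
            pw[i + 1] = pw[i] * kBase % kMod;
        }
    }

    // Hash of the substring s[l, l + len), O(1) after preprocessing.
    std::uint64_t hash(int l, int len) const {
        return (pref[l + len] + kMod - pref[l] * pw[len] % kMod) % kMod;
    }

    // Longest common prefix of the suffixes starting at a and b (binary search).
    int lcp(int a, int b) const {
        int lo = 0, hi = static_cast<int>(s.size()) - std::max(a, b);
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (hash(a, mid) == hash(b, mid)) lo = mid; else hi = mid - 1;
        }
        return lo;
    }

    // True if the suffix starting at a is lexicographically smaller than the one at b.
    bool less(int a, int b) const {
        if (a == b) return false;
        const int n = static_cast<int>(s.size());
        const int len = lcp(a, b);
        if (a + len == n) return true;   // suffix a is a proper prefix of suffix b
        if (b + len == n) return false;  // suffix b is a proper prefix of suffix a
        return static_cast<unsigned char>(s[a + len]) < static_cast<unsigned char>(s[b + len]);
    }
};

// O(N log^2 N): O(N log N) comparisons, each an O(log N) binary search.
std::vector<int> suffix_array_by_hashing(const std::string& s) {
    HashedString h(s);
    std::vector<int> p(s.size());
    std::iota(p.begin(), p.end(), 0);
    std::sort(p.begin(), p.end(), [&](int a, int b) { return h.less(a, b); });
    return p;
}
```

Note that the result is only correct as long as no two different substrings of equal length happen to share a hash value.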

There's a better approach that makes suffix array construction $O(N \log N)$. The previous approach used a custom comparison operator and any suitable sort(). This one uses radix sort (at least I think that's what it's called): you sort the suffixes by the first character, then by the first 2 characters, by the first 4, 8... up to a sufficient power of 2. In step k, you store suffixes with the same first 2^k characters in buckets, which are split into smaller buckets in the next step using the ordering by these 2^k characters.

How to do that? Number the buckets in increasing alphabetical order of the 2^k characters. Take empty meta-buckets numbered in the same way. Traverse the suffixes in order of non-decreasing bucket number; if suffix s[i..N] is in bucket b[i], then put s[i..N] into the meta-bucket numbered b[i + 2^k] (if i + 2^k > N, that part is empty, so use a meta-bucket that precedes all the others). You're sorting them by the next 2^k characters, basically. Then traverse the suffixes in non-decreasing order of meta-bucket number and put them back in the original buckets in that order. Tada, they're now sorted by the first 2^(k+1) characters! All that's left is splitting the original buckets further, which can be done simply: when 2 successive suffixes in the same bucket came from different meta-buckets, they'll be in different buckets afterwards.

The reason is that when 2 suffixes were in different buckets before, they will be in different buckets (and in the same order) afterwards, and if they were in the same bucket before and in different ones afterwards, then the smaller one will have gone into a smaller meta-bucket. This is just array juggling in O(N) per step, and you can stop when 2^k > N, so the total time is really $O(N \log N)$.
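One way to turn the bucket and meta-bucket description into code is the sketch below. It is not the commenter's own implementation: the names `stable_sort_by_key` and `suffix_array_doubling` are hypothetical, indices are 0-based, and each "put into buckets in order" pass is written as a stable counting sort. Meta-bucket 0 is reserved for suffixes that have no characters left after the first `len` ones.

```cpp
#include <numeric>
#include <string>
#include <vector>

// Stable counting sort: reorders `order` by key[order[i]], keys in [0, num_keys).
void stable_sort_by_key(std::vector<int>& order, const std::vector<int>& key, int num_keys) {
    std::vector<int> start(num_keys + 1, 0);
    for (int i : order) ++start[key[i] + 1];
    for (int k = 0; k < num_keys; ++k) start[k + 1] += start[k];
    std::vector<int> out(order.size());
    for (int i : order) out[start[key[i]]++] = i;  // equal keys keep their relative order
    order.swap(out);
}

std::vector<int> suffix_array_doubling(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> p(n), b(n), meta(n), nb(n);
    std::iota(p.begin(), p.end(), 0);
    if (n == 0) return p;

    // Step 0: the bucket of a suffix is its first character.
    int num = 256;
    for (int i = 0; i < n; ++i) b[i] = static_cast<unsigned char>(s[i]);
    stable_sort_by_key(p, b, num);

    // Invariant: p is sorted by the first len characters, b[i] is the bucket of suffix i.
    for (int len = 1; len < n; len *= 2) {
        // Meta-bucket = bucket of the next len characters (0 if there are none).
        for (int i = 0; i < n; ++i) meta[i] = (i + len < n) ? b[i + len] + 1 : 0;
        stable_sort_by_key(p, meta, num + 1);  // order by the next len characters
        stable_sort_by_key(p, b, num);         // put back into the original buckets

        // Split buckets: neighbours stay together only if both halves agree.
        nb[p[0]] = 0;
        for (int j = 1; j < n; ++j) {
            bool same = b[p[j]] == b[p[j - 1]] && meta[p[j]] == meta[p[j - 1]];
            nb[p[j]] = nb[p[j - 1]] + (same ? 0 : 1);
        }
        b.swap(nb);
        num = b[p[n - 1]] + 1;
        if (num == n) break;  // every suffix is alone in its bucket
    }
    return p;
}
```

Each iteration is two O(N) counting sorts plus an O(N) renumbering pass, and `len` doubles every time, which matches the bound stated above. For `"banana"` the result is `{5, 3, 1, 0, 4, 2}`.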

Suffix arrays can also be constructed in O(N), but why?