I recently solved some problems that involved the concept of Lyndon decomposition. Honestly, most of them were too hard for me to understand. I'm just trying to think out loud about things I've read, in the hope of learning ideas or better takes from smarter people.
Note that I will omit almost all proofs, as I can't do them myself. I believe all unproven claims below are facts, but it is always good to have doubts about anything.
1. Lyndon decomposition, definition, and algorithms
Partly copy-pasted from this link.
A string is called simple (or a Lyndon word), if it is strictly smaller than any of its own nontrivial suffixes. Examples of simple strings are $$$a, b, ab, aab, abb, abcd, abac$$$.
It can be shown that a string is simple if and only if it is strictly smaller than all of its nontrivial cyclic shifts. As a corollary, simple words are never periodic (a simple word is not some word repeated $$$2$$$ or more times).
The Lyndon decomposition of string $$$s$$$ is a factorization $$$s = w_1 w_2 \ldots w_k$$$, where all strings $$$w_i$$$ are simple, and are in non-increasing order $$$w_1 \geq w_2 \geq \ldots \geq w_k$$$.
Alternatively, the Lyndon decomposition of string $$$s$$$ can be represented as $$$s = w_1^{p_1} w_2^{p_2} \ldots w_k^{p_k}$$$. Here, the $$$p_i$$$ are positive integers, and $$$w_i^{p_i}$$$ denotes the string $$$w_i$$$ repeated $$$p_i$$$ times. All strings $$$w_i$$$ are simple, and are in decreasing order $$$w_1 \gt w_2 \gt \ldots \gt w_k$$$. The only difference is that each group of identical factors is collapsed into a chunk $$$w_i^{p_i}$$$.
It is claimed that for any string such a factorization exists and it is unique. However, I can't prove it.
1.1 Algorithm
There are two algorithms that compute the Lyndon decomposition in linear time. The first algorithm is the well-known Duval algorithm. E-maxx has a good explanation on this, so I won't discuss it here.
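For completeness, here is a compact sketch of Duval's algorithm in Python (the function name is mine):

```python
def duval(s):
    """Lyndon decomposition of s in O(|s|) time (Duval's algorithm)."""
    n, i = len(s), 0
    factors = []
    while i < n:
        # grow a pre-simple string s[i:j] whose simple period has length j - k
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # emit full copies of the simple period
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

print(duval("abcab"))   # ['abc', 'ab']
print(duval("aababb"))  # ['aababb'] -- the string itself is simple
```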
Another algorithm is conceptually much simpler. Given a string $$$S$$$, consider the greedy algorithm that repeatedly removes the smallest suffix from $$$S$$$. By definition, the smallest suffix is always a simple word, so the algorithm produces a decomposition consisting of simple words. Since the Lyndon decomposition is unique, the greedy algorithm returns exactly the Lyndon decomposition.
Let's compute the time complexity: the algorithm iterates at most $$$O(N)$$$ times, and finding the smallest suffix naively takes $$$O(N^2)$$$ time, so the naive implementation runs in $$$O(N^3)$$$ time. However, the smallest suffix is just the first entry of the suffix array, so using a linear-time suffix array algorithm optimizes each phase to $$$O(N)$$$, giving an $$$O(N^2)$$$ algorithm.
Should we compute the suffix array from scratch in each phase? The removal of a suffix does change the ordering in the suffix array. For example, $$$abac \lt ac$$$, but $$$aba \gt a$$$.
However, this issue doesn't apply to our application, where we remove the smallest suffix. Given a suffix array $$$SA_0, \ldots, SA_{N - 1}$$$ of the string $$$S$$$, one can simply iterate from $$$SA_0$$$ to $$$SA_{N - 1}$$$ and cut the string at every position that is the leftmost one encountered so far. As the suffix array can be computed in $$$O(N)$$$, this gives an $$$O(N)$$$ algorithm for the Lyndon decomposition. I can't prove why this is true, but this looks like a folklore algorithm, so I believe it's true.
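Here is a sketch of this folklore algorithm. For brevity I build the suffix array naively in $$$O(N^2 \log N)$$$; swap in a linear-time construction (e.g. SA-IS) for the claimed $$$O(N)$$$ bound:

```python
def lyndon_by_suffix_array(s):
    """Lyndon decomposition via one sweep over the suffix array:
    cut at every position that is the leftmost one seen so far."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive; use SA-IS for O(N)
    cuts, leftmost = [], n
    for i in sa:              # from the smallest suffix to the largest
        if i < leftmost:      # new leftmost position -> factor boundary
            cuts.append(i)
            leftmost = i
    cuts.reverse()            # cut positions in increasing order
    cuts.append(n)
    return [s[cuts[t]:cuts[t + 1]] for t in range(len(cuts) - 1)]

print(lyndon_by_suffix_array("banana"))  # ['b', 'an', 'an', 'a']
```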
2. Computing Lyndon decomposition for each substring
For a string of size $$$N$$$, the Lyndon decomposition may consist of $$$\Theta(N)$$$ factors, in which case the above algorithms are already optimal. Hence, in this section, we only discuss finding the smallest suffix of each substring in near-constant time, since it may
- lead to an algorithm for computing Lyndon decomposition in near-linear time on output size, by the above greedy algorithm.
- yield some small implicit structure (tree) that captures the Lyndon decomposition for all interesting substrings
2.1. Lyndon decomposition for all suffixes
The removal of a prefix does not change the ordering in the suffix array. To find the smallest suffix in $$$S[x ...]$$$, just find the first entry in the suffix array such that $$$SA_i \geq x$$$.
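As a sketch, the first suffix-array entry that is $$$\geq x$$$ is just a suffix minimum over ranks, so all answers can be read off in one backward pass (naive suffix array again; names are mine):

```python
def smallest_suffix_of_suffixes(s):
    """start[x] = start index of the lexicographically smallest suffix
    of s[x:], i.e. the position j >= x with the smallest rank."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive; O(N) methods exist
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    start, best = [0] * n, n - 1
    for x in range(n - 1, -1, -1):  # suffix minimum of rank over [x, n)
        if rank[x] < rank[best]:
            best = x
        start[x] = best
    return start

print(smallest_suffix_of_suffixes("abcab"))  # [3, 3, 3, 3, 4]
```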
2.2. Lyndon decomposition for all prefixes
Duval's algorithm is basically incremental since it repeatedly adds a letter $$$s[j]$$$ to the existing structure. This hints that the Lyndon decomposition can be computed for all prefixes, although it's not entirely straightforward.
I came up with an algorithm to compute the min suffixes of all prefixes. There are other algorithms for this, such as the one ecnerwala described in this comment.
Duval's algorithm maintains a pre-simple string in each iteration. Consider the pre-simple string $$$t = ww\ldots w\overline{w}$$$ for the current prefix. Except for the last string $$$\overline{w}$$$, every string is simple. If we take the Lyndon decomposition of $$$\overline{w}$$$, its first element is a prefix of $$$\overline{w}$$$ (hence a proper prefix of $$$w$$$), which is strictly less than $$$w$$$. Since the Lyndon decomposition is unique, the last element of the Lyndon decomposition of $$$\overline{w}$$$ is exactly the smallest suffix of the current prefix.
Thus, the naive algorithm is the following:
- If $$$\overline{w}$$$ is empty, $$$w$$$ is the smallest suffix of the given prefix.
- Otherwise, the smallest suffix of $$$\overline{w}$$$ is the smallest suffix for the given prefix.
However, we don't have to recompute the smallest suffix of $$$\overline{w}$$$ every time. In the decomposition algorithm, we fix the prefix $$$s_1 = s[0 \ldots i)$$$ and compute the decomposition of the suffix $$$s[i \ldots]$$$. For each relevant $$$i$$$, we use dynamic programming. Let $$$MinSuf[j]$$$ be the length of the smallest suffix of $$$S[i \ldots j)$$$ for $$$j \gt i$$$. If $$$\overline{w}$$$ is empty, the smallest suffix is $$$w$$$. Otherwise, since $$$\overline{w}$$$ is exactly the string $$$S[i \ldots i + |\overline{w}|)$$$, we have $$$MinSuf[j] = MinSuf[i + |\overline{w}|]$$$. Therefore we obtain a simple recursive formula.
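Here is a sketch of this DP grafted onto Duval's algorithm (my own formulation of the recurrence above): while scanning the pre-simple string $$$w^c\overline{w}$$$ that starts at position $$$i$$$, the prefix ending at $$$j$$$ has period $$$p = |w|$$$ and $$$|\overline{w}| = (j - i) \bmod p$$$:

```python
def min_suffix_all_prefixes(s):
    """minsuf[j] = length of the lexicographically smallest nonempty
    suffix of the prefix s[:j], for every j = 1..n, in O(n) total."""
    n = len(s)
    minsuf = [0] * (n + 1)
    i = 0
    while i < n:
        j, k = i + 1, i
        while True:
            p = j - k                # period of the pre-simple string s[i:j]
            r = (j - i) % p          # length of the trailing \bar{w}
            minsuf[j] = p if r == 0 else minsuf[i + r]
            if j == n or s[k] > s[j]:
                break
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:                # emit full copies of the period
            i += p
    return minsuf[1:]

print(min_suffix_all_prefixes("banana"))  # [1, 1, 2, 1, 2, 1]
```

For example, for the prefix "banan" the answer is $$$2$$$ ("an"), and for "banana" it is $$$1$$$ ("a").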
2.3 Lyndon decomposition for all substrings?
This paper contains some ideas, so if you are interested, give it a try :)
3. The Runs Theorem
A run is a concept that is useful for solving problems related to repeats. Even if you have never heard the name, anyone who has solved some challenging suffix array problems will be familiar with it.
Given a string $$$S$$$, the tuple $$$(l, r, p)$$$ is a run of string $$$S$$$ if
- $$$0 \le l \lt r \le |S|$$$
- $$$1 \le p \le |S|$$$
- $$$r - l \geq 2p$$$
- $$$p$$$ is the smallest positive integer where $$$S[i] = S[i + p]$$$ holds for all $$$l \le i \lt r - p$$$
- The above four properties do not hold for the tuples $$$(l - 1, r, p)$$$ and $$$(l, r + 1, p)$$$ (the run is maximal in both directions)
Let $$$-S$$$ be the string where all characters are inverted: specifically, we assign s[i] = 'a' + 'z' - s[i] for all characters of $$$S$$$, so that the usual comparison order is reversed, except for the empty character, which still has the lowest priority.
Given a string $$$S$$$, a Lyndon prefix is the longest prefix that is a Lyndon word. Given a suffix array of $$$S$$$, this Lyndon prefix can be easily computed. Recall an algorithm that computes the Lyndon decomposition given a suffix array. Let $$$Rank_i$$$ be the inverse of the suffix array. Then, we can see that the length of the Lyndon prefix is the smallest $$$i$$$ such that $$$Rank_i \lt Rank_0$$$ (or $$$|S|$$$ if such does not exist). Similarly, we can also compute this for all suffixes $$$S[i \ldots]$$$: find the smallest $$$j \gt 0$$$ such that $$$Rank_{i + j} \lt Rank_i$$$.
For each suffix of $$$S$$$ and $$$-S$$$, we compute the Lyndon prefix $$$[i, j)$$$ and take it as a "seed". Starting from the tuple $$$(i, j, j - i)$$$, we extend it in both directions as long as the period persists. Specifically, let $$$k$$$ be the maximum number such that $$$S[i, i + k) = S[j, j + k)$$$ and $$$l$$$ be the maximum number such that $$$S[i - l, i) = S[j - l, j)$$$. Then we obtain a run $$$(i - l, j + k, j - i)$$$. Both $$$k$$$ and $$$l$$$ can be computed in $$$O(\log N)$$$ time with suffix arrays.
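A direct transcription of this procedure (a sketch assuming lowercase input; I use naive scans and comparisons instead of $$$O(\log N)$$$ suffix-array queries, so this runs in $$$O(N^2)$$$ worst case, but the structure is the same):

```python
def runs(s):
    """Enumerate all runs (l, r, p) of s by extending Lyndon-prefix seeds
    taken from every suffix of s and of the inverted string -s."""
    n = len(s)
    found = set()
    for invert in (False, True):
        # -s: map 'a'..'z' onto 'z'..'a' to reverse the comparison order
        t = s if not invert else ''.join(chr(ord('a') + ord('z') - ord(c)) for c in s)
        sa = sorted(range(n), key=lambda i: t[i:])   # naive suffix array
        rank = [0] * n
        for r, i in enumerate(sa):
            rank[i] = r
        for i in range(n):
            # Lyndon prefix of t[i:]: shortest [i, j) with rank[j] < rank[i]
            j = i + 1
            while j < n and rank[j] > rank[i]:
                j += 1
            p = j - i
            l, r = i, j
            while l > 0 and s[l - 1] == s[l - 1 + p]:   # extend left
                l -= 1
            while r < n and s[r] == s[r - p]:           # extend right
                r += 1
            if r - l >= 2 * p:                          # keep only true runs
                found.add((l, r, p))
    return sorted(found)

print(runs("aabaab"))  # [(0, 2, 1), (0, 6, 3), (3, 5, 1)]
```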
It's easy to verify that those elements are actually runs of the string. If we remove all duplicated runs, the following facts hold:
Fact 1. The tuples we computed are exactly the set of all runs.
Fact 2. There are at most $$$n$$$ runs.
Fact 3. The sum of $$$(j - i) / p$$$ over all runs is at most $$$3n$$$.
Fact 4. The sum of 2-repeats ($$$j - i - 2p + 1$$$) over all runs is at most $$$n \log n$$$.
Fact 3 is useful when we want to enumerate all repeats. For example, the string "aaaa" can be considered as a repeat of "a" $$$4$$$ times, but it is also a repeat of "aa" $$$2$$$ times. In this case, we have to enumerate all multiples of $$$p$$$ — but by Fact 3, that does not affect the overall complexity.
Facts 1, 2, 3 can be found in this paper. I think Fact 4 is not hard to prove, but that doesn't mean I've done it, nor do I have a reference that states this fact.
4. Lexicographically minimum substring reverse
Given a string $$$S$$$, you can select $$$0$$$ or more non-overlapping substrings and reverse them. What is the lexicographically minimum result you can obtain from a single iteration of this operation?
Let $$$S^R$$$ be the reverse of $$$S$$$. The answer is to take the Lyndon decomposition of $$$S^R$$$, and reverse each factor at its respective position in $$$S$$$.
I don't know why this works.
Intuitively, we are replacing each prefix of $$$S$$$ with the minimum suffix of $$$S^R$$$. Replacing each prefix with the minimum possible suffix seems like a good trade. Do you agree or disagree? XD
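Assuming the claim above, the whole construction is a few lines on top of Duval's algorithm: if $$$S^R = w_1 w_2 \ldots w_k$$$, then $$$S = w_k^R \ldots w_1^R$$$, and reversing each block in place yields $$$w_k w_{k-1} \ldots w_1$$$ (a sketch; function names are mine):

```python
def duval(s):
    """Lyndon decomposition of s (Duval's algorithm, O(|s|))."""
    n, i = len(s), 0
    factors = []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def min_substring_reverse(s):
    """Smallest string reachable by reversing disjoint substrings of s:
    reverse each Lyndon factor of s[::-1] back in place, which is the
    same as concatenating those factors in reverse order."""
    return ''.join(reversed(duval(s[::-1])))

print(min_substring_reverse("cab"))  # 'acb'
```

For "cab", the factorization of "bac" is ["b", "ac"], so the answer is "ac" + "b" = "acb", obtained by reversing the substring "ca".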
5. Minimal Rotation from Lyndon decomposition
Given a string $$$S$$$, what is the lexicographically minimum result you can obtain by taking a cyclic shift of $$$S$$$?
The answer can be found by finding the smallest suffix of length $$$\gt |S|$$$ of the string $$$S + S$$$, and rotating at the respective position. This suffix can be found with the Lyndon decomposition. Therefore we can solve this in $$$O(n)$$$ time, which is great.
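A folklore way to implement this (a sketch): run Duval's algorithm on $$$S+S$$$ and remember the start of the last factor that begins inside the first copy of $$$S$$$; that start is the rotation point.

```python
def min_rotation(s):
    """Lexicographically smallest cyclic shift of s, via Duval on s + s."""
    ss = s + s
    n, i, ans = len(s), 0, 0
    while i < n:
        ans = i                      # last factor start inside the first copy
        j, k = i + 1, i
        while j < len(ss) and ss[k] <= ss[j]:
            k = i if ss[k] < ss[j] else k + 1
            j += 1
        while i <= k:
            i += j - k
    return s[ans:] + s[:ans]

print(min_rotation("acabab"))  # 'ababac'
```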
What about just reversing a minimum suffix of $$$S$$$? Unfortunately, strings like "acabab" and "dacaba" are counterexamples. If we could reduce this problem to a minimum suffix instance, we could solve this problem for all prefixes, suffixes, and possibly substrings, so that's really unfortunate...
.. or maybe not. For a string $$$S$$$, consider its Lyndon factorization $$$S = w_1^{p_1} w_2^{p_2} w_3^{p_3} \ldots w_k^{p_k}$$$. Clearly, rotating in the middle of a period is a bad idea. But taking only $$$w_k^{p_k}$$$ as a candidate is also wrong.
Then what about trying to crack the tests? Let $$$SFX_j = w_j^{p_j} w_{j+1}^{p_{j + 1}} \ldots w_k^{p_k}$$$. Then, we can try all $$$SFX_j$$$ in range $$$k - 69 \le j \le k + 1$$$ as a candidate. It looks really hard to create an anti-test for this approach.
Lemma. The minimum rotation starts at one of the last $$$\log_2 |S|$$$ candidates $$$SFX_j$$$. (Observation 6)
This provides an algorithm for computing the minimum rotation in $$$O(Q(n) \log n)$$$ time, where $$$Q(n)$$$ is time to compute the minimum suffix.
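Assuming the lemma, here is a sketch: build the grouped factorization with Duval's algorithm, keep only the last $$$\lceil \log_2 |S| \rceil + 1$$$ candidate starts of $$$SFX_j$$$, and compare those rotations naively (each comparison is $$$O(n)$$$, so $$$O(n \log n)$$$ total; function names are mine):

```python
from math import ceil, log2

def duval(s):
    """Lyndon decomposition of s (Duval's algorithm, O(|s|))."""
    n, i = len(s), 0
    factors = []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def min_rotation_by_candidates(s):
    """Try only the rotations starting at the last ~log2|s| grouped
    Lyndon factors SFX_j, per the lemma above."""
    factors = duval(s)
    starts, pos = [], 0
    for t, f in enumerate(factors):
        if t == 0 or f != factors[t - 1]:   # group equal adjacent factors
            starts.append(pos)
        pos += len(f)
    keep = ceil(log2(len(s))) + 1 if len(s) > 1 else 1
    return min(s[i:] + s[:i] for i in starts[-keep:])

print(min_rotation_by_candidates("acabab"))  # 'ababac'
```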
Practice problems
Minimum suffix for each prefix
Run Enumeration
- https://www.acmicpc.net/problem/23495
- https://judge.yosupo.jp/problem/runenumerate
- https://www.acmicpc.net/problem/25111
- https://www.acmicpc.net/problem/19020
- https://uoj.ac/problem/219
Lexicographically minimum substring reverse
Minimum rotation for each substring
- https://www.acmicpc.net/problem/19403
- https://www.acmicpc.net/problem/18985 (This is not exactly the minimum rotation, but the observation from Part 5 can be applied directly.)









Existence: repeatedly take the longest prefix that is a Lyndon word, and add it to the decomposition. We need to prove $$$w_{i} \geq w_{i + 1}$$$. Assume the contrary. Then, the concatenation of $$$w_{i}$$$ and $$$w_{i + 1}$$$ is a Lyndon word, thus $$$w_{i}$$$ wasn't maximal.
Uniqueness: induction on the length of the string; trivial for $$$n = 1$$$. Now, take two arbitrary decompositions of the same string $$$S$$$ of length $$$n$$$, the first of which has words of length $$$x_1[i]$$$ and the second of length $$$x_2[i]$$$ (1-indexed). WLOG assume $$$x_1[1] \lt x_2[1]$$$, as if $$$x_1[1] = x_2[1]$$$, by induction the decompositions are equal. Now, let $$$j$$$ be the maximum index such that $$$y = \sum_{i \leq j} x_1[i] \lt x_2[1]$$$. Then, the string $$$S[y + 1, x_2[1]]$$$ is lexicographically at most $$$S[y + 1, y + x_1[j + 1]]$$$, which is lexicographically at most $$$S[1, x_1[1]]$$$, which is lexicographically at most $$$S[1, x_2[1]]$$$. But $$$S[1, x_2[1]]$$$ is a Lyndon word, so its proper suffix $$$S[y + 1, x_2[1]]$$$ must be strictly greater, a contradiction.
An easier proof might exist, this is just the first thing I came up with.
As the author of https://mirror.codeforces.com/blog/entry/89629 and https://mirror.codeforces.com/blog/entry/90035#duval: when I taught my students, some of them had good questions like that, and I wanted to prove it in a few ways.
I once gave my students a two-liner proof like this:
What it means is that:
Of course, we can explain each of those sentences with math.
If the factorization isn't unique, then:
We want to prove that $$$A_i = B_i\ \forall i=1..n$$$ and $$$n = m$$$, because if not, there must be a position $$$p$$$ such that:
Without loss of generality we can assume $$$\alpha \lt \beta$$$, which also makes $$$\alpha \beta$$$ a Lyndon word.
But then, by writing $$$S' = \lambda \alpha \beta Y$$$, we can see that $$$\alpha \beta$$$ is a prefix of $$$\alpha X$$$.
This is a contradiction (either proof works).
A good construction of the Lyndon decomposition should prove that $$$\alpha = \beta$$$.
By induction, every block coincides in the remaining $$$X$$$ and $$$Y$$$, so we can prove $$$A_i = B_i\ \forall i = 1 \ldots n$$$ and $$$n = m$$$.
Hence $$$S$$$ is uniquely defined as $$$S = A_1 A_2 \ldots A_n$$$.
Today, I solved 594E - Cutting the Line and thought about this.
The operation is as follows: decompose $$$S$$$ as $$$S=X_1\cdots X_K$$$ and obtain $$$T=X_K\cdots X_1$$$. Let $$$S=w_1\cdots w_k$$$ be the Lyndon decomposition.
(1) $$$X_K$$$ must be of the form $$$w_i\cdots w_k$$$ for some $$$i$$$.
If not, then $$$X_K$$$ starts with a proper suffix of some $$$w_i$$$: $$$X_K = w_i' w_{i+1} \cdots w_k$$$, where $$$w_i'$$$ is a proper suffix of $$$w_i$$$.
Since $$$w_i$$$ is a Lyndon word, $$$w_i' \gt w_i$$$. Moreover, $$$w_i$$$ is clearly not a prefix of $$$w_i'$$$. Therefore we have $$$w_i+X \lt w_i'+Y$$$ for every pair of strings $$$X,Y$$$. This implies $$$X_K\cdots X_1 = w_i'\cdots \gt w_i\cdots$$$, so such a choice is never optimal.
(2) For Lyndon Words $$$S,T$$$, we have: $$$S \lt T \iff S+T \lt T+S$$$.
If $$$S$$$ is not a prefix of $$$T$$$, this is straightforward. Assume that $$$S$$$ is a prefix of $$$T$$$ and write $$$T=Sw$$$. Since $$$w$$$ is a suffix of $$$T$$$, $$$T \lt w$$$ and $$$T$$$ is not a prefix of $$$w$$$. Therefore $$$T+X \lt w+Y$$$ for any string $$$X,Y$$$. In particular we have $$$T \lt wS$$$, therefore $$$ST \lt SwS=TS$$$.
By (1), we only need to consider the case where every $$$X_i$$$ is of the form $$$w_l\cdots w_r$$$. So $$$T$$$ is a permutation of the factors $$$w_i$$$. By (2), the lexicographically smallest such concatenation is $$$w_k\cdots w_1$$$.
Your solution to the problem of lexicographically minimal string construction via substring reversal is truly insightful. Interestingly, I had been working independently on this idea for some time before I came across your solution, and I explored the same core concept mentioned below. However, I hesitated to publish or share it earlier, mostly because I wasn't confident I had formalized the logic soundly enough, and I got sidetracked reading other papers about Lyndon words, which contain some interesting problems as well (say, some computer-vision-related tasks that can be sped up via Lyndon words). After reading your explanation, I took the opportunity to revisit and reframe my understanding as a structured, claim-based proof (before that I used an FAQ format or something similar, which is kind of unprofessional). I broke the reasoning into three core claims, like the lemmas in other papers. I believe this perspective complements your solution and should make it easier for other readers to get your ideas intuitively; I also hope to learn something more if someone points out my mistakes or possible improvements.
The statement "you can select 0 or more non-overlapping substrings and reverse them" raises the question: which substrings of $$$S$$$ should we reverse so that their reversals appear early in the output string, minimizing the answer? Knowing how $$$S$$$ relates to $$$S^R$$$ and $$$T$$$ to $$$T^R$$$, we want to show the following.
Basic Definitions:
Lyndon approach: after applying the Lyndon decomposition to $$$S^R$$$, we have that:
Suppose someone suggests a different strategy:
Claim 1: For any $$$1 \leq q \leq p$$$, if $$$X_q$$$ is not a Lyndon word, then $$$X \gt R$$$.
(any proper suffix of a Lyndon word is strictly greater than the word itself)
Claim 2: Every valid $$$X_q$$$ is the reverse of some $$$\omega_j$$$.
Claim 3: Among all permutations of $$$\{ \omega_1^R, \omega_2^R, \ldots, \omega_p^R \}$$$, the minimal concatenation is $$$R = \omega_1^R \omega_2^R \ldots \omega_p^R$$$.
The Lyndon decomposition of a string gives a lexicographically minimal concatenation of the multiset of its Lyndon factors.
TL;DR: I really appreciated your solution on lexicographically minimal substring reversal — it's elegant and insightful. Coincidentally, I had explored a similar direction independently long ago but didn’t publish it, partly due to doubts and partly being sidetracked by Lyndon-related research in other fields (e.g., some interesting connections in vision problems).
After reading your approach, I reframed my work using a more formal, claim-based structure — breaking it into three claims to logically verify that:
I think this structure complements your idea by making certain intuitive parts more explicit for readers. I'm still learning and open to feedback for improvements (please, no random insults from strangers like when I made those 2 blogs; that was toxic and contributed nothing). If anything is still unclear, I may invest some time to explain things better and deliver some sketches or examples via Excalidraw.