Please help me with a string search problem

→ Pay attention

Before contest
Codeforces Round 995 (Div. 3)
01:17:06
Register now »

→ Streams

Codeforces Round 995 Solution Discussion

By aryanc403

Before stream 03:37:04

View all →

→ Top rated

#	User	Rating
1	tourist	3985
2	jiangly	3814
3	jqdai0815	3682
4	Benq	3529
5	orzdevinwang	3526
6	ksun48	3517
7	Radewoosh	3410
8	hos.lyric	3399
9	ecnerwala	3392
9	Um_nik	3392

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	cry	169
2	maomao90	162
2	Um_nik	162
4	atcoder_official	161
5	djm03178	158
6	-is-this-fft-	157
7	adamant	155
8	awoo	154
8	Dominater069	154
10	luogu_official	150

View all →

→ Find user

→ Recent actions

Detailed →

Koyote's blog

Please help me with a string search problem

By Koyote, history, 18 months ago, In English

I encountered this problem at a past local competition. Here is the brief statement:

Given two string $$$s$$$ of length $$$n$$$ and t of length $$$m$$$ ($$$n$$$, $$$m$$$ $$$\le$$$ $$$10^{5}$$$) find the longest, lexicographically smallest substring of $$$s$$$ that appears in string $$$t$$$ at least $$$k$$$ times. The input is given in a way that there is at least one substring satisfies the constraints.

I have tried to find answers on Google and the best I could find is this problem from HackerEarth . The suggested solutions were either hashing with binary search, or suffix tree. And I am not knowledgeable enough to transform the problem into the original problem. So far the best I could have done is to iterate every substring of $$$s$$$ in $$$O(n^2)$$$, then use string search algorithm (KMP/Z-function) to count number of occurrences, yielding total complexity of $$$O(n^2*(n+m))$$$. Please give hints to an approach for this problem.

string, hard, suffix tree, hashing

Koyote
18 months ago
9

Comments (9)

Write comment?

Koyote

18 months ago, # |

I forgot. The first line of input contains 3 positive integers $$$n$$$, $$$m$$$, and $$$k$$$. The second line contains the string $$$s$$$ and the third line contains the string $$$t$$$. Here is an example test case:

Input:

7 19 3
GATTACA
TACATTACGCATTACACAT

Output:

ACA

Explanation: TAC and ACA are the two longest substring of $$$s$$$ that satisfy, and ACA is lexicographically smaller than TAC.

→ Reply

tminh_hk20

14 months ago, # ^ |

this is my code about this problem:

#include<bits/stdc++.h>
using namespace std;
#define int long long
#define ii pair<int,int>
#define fi first
#define se second
const int N=1e5+10,MOD=1e9+7,base=311;
int n,m,k,has1[N],has2[N],p[N];
int l,r,mi;
string s,t;
ii res;

int get1(int i, int j){
    return (has1[j]-has1[i-1]*p[j-i+1]+MOD*MOD)%MOD;
}

int get2(int i, int j){
    return (has2[j]-has2[i-1]*p[j-i+1]+MOD*MOD)%MOD;
}

bool compound(int a, int b, int n){
    if (b==-1) return true;
    int l =0,r=n-1,res=-1;

    while(l<=r){

        int m = (l+r)/2;

        if (get2(a,a+m)==get2(b,b+m)){
            l = m+1;
            res=m;
        }else{
            r =m-1;
        }

    }

    if (res==n-1) return true;
    return (t[a+res+1]<t[b+res+1]);
}

signed main(){

    ios_base::sync_with_stdio(0);
    cin.tie(0); cout.tie(0);

    freopen("GATTACA.INP","r",stdin);
    freopen("GATTACA.OUT","w",stdout);

    cin>>n>>m>>k;
    cin>>s>>t;

    s =" "+s;
    t =" "+t;

    p[0]=1;
    for (int i=1;i<=max(n,m)+1;i++){
        p[i]=(p[i-1]*base)%MOD;
    }

    has1[0]=0;
    for (int i=1;i<=n;i++){
        has1[i]=(has1[i-1]*base+s[i])%MOD;
    }

    has2[0]=0;
    for (int i=1;i<=m;i++){
        has2[i]=(has2[i-1]*base+t[i])%MOD;
    }

    l=1;
    r=1;
    bool div=0;
    res={-1,-1};

    while(l<=r){
        //cout <<l<<" "<<r<<endl;
        if (div)
            mi = (l+r)/2;
        else
            mi = r;

        bool check=0;
        unordered_map<int,int> mp,ap;

        for (int i=1;i<=n-mi+1;i++){
            ap[get1(i,i+mi-1)]++;
        }

        for (int i=1;i<=m-mi+1;i++){
            int tmp=get2(i,i+mi-1);
            mp[tmp]++;

            if (ap[tmp] && mp[tmp]>=k){
                check=1;
                if (mi<res.se-res.fi+1) continue;

                if (mi>res.se-res.fi+1){
                    res={i,i+mi-1};
                    continue;
                }

                if (compound(i,res.fi,mi)){
                    res = {i,i+mi-1};
                }
            }

        }

        if (div){
            if (check)
                l = mi+1;
            else
                r = mi-1;
        }else{
            if (check){
                l =r;
            }else{
                div=1;
            }
            r*=2;
        }


    }

    for (int i=res.fi;i<=res.se;i++)
        cout <<t[i];



}

→ Reply

ayhan37

18 months ago, # |

← Rev. 2 →

I have found O((M+N)*LOG_M*LOG_N) solution using binary search on the length of substring + suffix array for sorting substrings of length X. We also need to use hashing for counting occurences.

I don't know if there is any better solution than this one.

→ Reply

Koyote

18 months ago, # ^ |

That is good enough, please explain more.

→ Reply

ayhan37

18 months ago, # ^ |

We can use binary search on the answer technique. We fix the length of substring. Let's call the mid value X. We can use binary search because suppose that the answer is A(length of the substring). Then, an answer with length (A-1) must exist because you can pop the last character of the valid substring.

We use hashing to calculate the number of occurences (on string t with substring length X). This part is O(MlogM) (due to std::map). Finally , to find the lexicographically minimum substring we need to use suffix array to sort substring of length X. You can learn suffix array and it's applications from CP-ALGO. Also sorry for my poor english.

→ Reply

Koyote

18 months ago, # ^ |

No problem. Thank you for your idea, I'll try to implement it.

→ Reply

ayhan37

18 months ago, # ^ |

Your welcome :)

→ Reply

bicsi

18 months ago, # |

← Rev. 2 →

Translation: find a string of maximal length that has at least one occurrence in $$$s$$$ and at least $$$k$$$ occurrences in $$$t$$$. One way to do that is:

Build a suffix automaton of $$$s\$t$$$
For each node in automaton compute the number of occurrences in $$$s$$$ and in $$$t$$$
Out of all the valid nodes, choose the min lex one

A similar solution can be done with suffix arrays and two pointers technique:

Sort the suffixes of $$$s\$t$$$
For each $$$i \leq n + m$$$ find the smallest $$$j$$$ such that there is at least one index from $$$s$$$ and $$$k$$$ indices from $$$t$$$ in the range $$$sa[i..j]$$$ (via two pointers)
Update the solution if $$$lcp[i..j]$$$ is strictly better than the current best. This makes sure to pick the lex min solution.

→ Reply

Koyote

18 months ago, # ^ |

Thank you so much. That is so clever.

→ Reply