Koyote's blog

By Koyote, history, 18 months ago, In English

I encountered this problem at a past local competition. Here is the brief statement:

Given two string $$$s$$$ of length $$$n$$$ and t of length $$$m$$$ ($$$n$$$, $$$m$$$ $$$\le$$$ $$$10^{5}$$$) find the longest, lexicographically smallest substring of $$$s$$$ that appears in string $$$t$$$ at least $$$k$$$ times. The input is given in a way that there is at least one substring satisfies the constraints.

I have tried to find answers on Google and the best I could find is this problem from HackerEarth . The suggested solutions were either hashing with binary search, or suffix tree. And I am not knowledgeable enough to transform the problem into the original problem. So far the best I could have done is to iterate every substring of $$$s$$$ in $$$O(n^2)$$$, then use string search algorithm (KMP/Z-function) to count number of occurrences, yielding total complexity of $$$O(n^2*(n+m))$$$. Please give hints to an approach for this problem.

  • Vote: I like it
  • +6
  • Vote: I do not like it

»
18 months ago, # |
  Vote: I like it 0 Vote: I do not like it

I forgot. The first line of input contains 3 positive integers $$$n$$$, $$$m$$$, and $$$k$$$. The second line contains the string $$$s$$$ and the third line contains the string $$$t$$$. Here is an example test case:

Input:

7 19 3
GATTACA
TACATTACGCATTACACAT

Output:

ACA

Explanation: TAC and ACA are the two longest substring of $$$s$$$ that satisfy, and ACA is lexicographically smaller than TAC.

  • »
    »
    14 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    this is my code about this problem:

    #include<bits/stdc++.h>
    using namespace std;
    #define int long long
    #define ii pair<int,int>
    #define fi first
    #define se second
    const int N=1e5+10,MOD=1e9+7,base=311;
    int n,m,k,has1[N],has2[N],p[N];
    int l,r,mi;
    string s,t;
    ii res;
    
    int get1(int i, int j){
        return (has1[j]-has1[i-1]*p[j-i+1]+MOD*MOD)%MOD;
    }
    
    int get2(int i, int j){
        return (has2[j]-has2[i-1]*p[j-i+1]+MOD*MOD)%MOD;
    }
    
    bool compound(int a, int b, int n){
        if (b==-1) return true;
        int l =0,r=n-1,res=-1;
    
        while(l<=r){
    
            int m = (l+r)/2;
    
            if (get2(a,a+m)==get2(b,b+m)){
                l = m+1;
                res=m;
            }else{
                r =m-1;
            }
    
        }
    
        if (res==n-1) return true;
        return (t[a+res+1]<t[b+res+1]);
    }
    
    signed main(){
    
        ios_base::sync_with_stdio(0);
        cin.tie(0); cout.tie(0);
    
        freopen("GATTACA.INP","r",stdin);
        freopen("GATTACA.OUT","w",stdout);
    
        cin>>n>>m>>k;
        cin>>s>>t;
    
        s =" "+s;
        t =" "+t;
    
        p[0]=1;
        for (int i=1;i<=max(n,m)+1;i++){
            p[i]=(p[i-1]*base)%MOD;
        }
    
        has1[0]=0;
        for (int i=1;i<=n;i++){
            has1[i]=(has1[i-1]*base+s[i])%MOD;
        }
    
        has2[0]=0;
        for (int i=1;i<=m;i++){
            has2[i]=(has2[i-1]*base+t[i])%MOD;
        }
    
        l=1;
        r=1;
        bool div=0;
        res={-1,-1};
    
        while(l<=r){
            //cout <<l<<" "<<r<<endl;
            if (div)
                mi = (l+r)/2;
            else
                mi = r;
    
            bool check=0;
            unordered_map<int,int> mp,ap;
    
            for (int i=1;i<=n-mi+1;i++){
                ap[get1(i,i+mi-1)]++;
            }
    
            for (int i=1;i<=m-mi+1;i++){
                int tmp=get2(i,i+mi-1);
                mp[tmp]++;
    
                if (ap[tmp] && mp[tmp]>=k){
                    check=1;
                    if (mi<res.se-res.fi+1) continue;
    
                    if (mi>res.se-res.fi+1){
                        res={i,i+mi-1};
                        continue;
                    }
    
                    if (compound(i,res.fi,mi)){
                        res = {i,i+mi-1};
                    }
                }
    
            }
    
            if (div){
                if (check)
                    l = mi+1;
                else
                    r = mi-1;
            }else{
                if (check){
                    l =r;
                }else{
                    div=1;
                }
                r*=2;
            }
    
    
        }
    
        for (int i=res.fi;i<=res.se;i++)
            cout <<t[i];
    
    
    
    }
    
    
    
»
18 months ago, # |
Rev. 2   Vote: I like it +3 Vote: I do not like it

I have found O((M+N)*LOG_M*LOG_N) solution using binary search on the length of substring + suffix array for sorting substrings of length X. We also need to use hashing for counting occurences.

I don't know if there is any better solution than this one.

  • »
    »
    18 months ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    That is good enough, please explain more.

    • »
      »
      »
      18 months ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      We can use binary search on the answer technique. We fix the length of substring. Let's call the mid value X. We can use binary search because suppose that the answer is A(length of the substring). Then, an answer with length (A-1) must exist because you can pop the last character of the valid substring.

      We use hashing to calculate the number of occurences (on string t with substring length X). This part is O(MlogM) (due to std::map). Finally , to find the lexicographically minimum substring we need to use suffix array to sort substring of length X. You can learn suffix array and it's applications from CP-ALGO. Also sorry for my poor english.

      • »
        »
        »
        »
        18 months ago, # ^ |
          Vote: I like it +3 Vote: I do not like it

        No problem. Thank you for your idea, I'll try to implement it.

»
18 months ago, # |
Rev. 2   Vote: I like it +6 Vote: I do not like it

Translation: find a string of maximal length that has at least one occurrence in $$$s$$$ and at least $$$k$$$ occurrences in $$$t$$$. One way to do that is:

  1. Build a suffix automaton of $$$s\$t$$$
  2. For each node in automaton compute the number of occurrences in $$$s$$$ and in $$$t$$$
  3. Out of all the valid nodes, choose the min lex one

A similar solution can be done with suffix arrays and two pointers technique:

  1. Sort the suffixes of $$$s\$t$$$
  2. For each $$$i \leq n + m$$$ find the smallest $$$j$$$ such that there is at least one index from $$$s$$$ and $$$k$$$ indices from $$$t$$$ in the range $$$sa[i..j]$$$ (via two pointers)
  3. Update the solution if $$$lcp[i..j]$$$ is strictly better than the current best. This makes sure to pick the lex min solution.
  • »
    »
    18 months ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    Thank you so much. That is so clever.