Блог пользователя Stecher

Автор Stecher, история, 6 лет назад, По-английски

How to calculate the number of the string S of length N, which does not contain given string T as a substring?

length(S) ~ 10000

I would like to know all the approaches for the following 2 constraints :

  1. length(T) ~ 3 -> I have some DP idea but would appreciate if you can elaborate it more
  2. length(T) ~ 100
  3. length(T) ~ 10000

EDIT: What if we want to find the number of string S which contains given string T as a substring at most K times?

Please let me know all the available approaches.

  • Проголосовать: нравится
  • +52
  • Проголосовать: не нравится

»
6 лет назад, скрыть # |
Rev. 3  
Проголосовать: нравится +11 Проголосовать: не нравится

2) DP[i][j] denoting current string has length i and its suffix shares j characters with prefix of T.

DP[i][j] updates from DP[i-1][j-1] for j > 0.

DP[i][0] updates from DP[i-1][k] for each k.

DP[i][j] can also be updated from DP[i-1][k] where k>j (preprocessing will be required to find)

»
6 лет назад, скрыть # |
Rev. 2  
Проголосовать: нравится 0 Проголосовать: не нравится

3) Let DP[i] denote number of strings of length i such that T ends at i for the first time.

(At last we can add DP[i]*26^(N-i) to get all strings which containt string T and subtract it from 26^N. )

Initialize DP[i] to be 26^(i-len(T)) for i >= len(T).

Now we subtract from it all cases where string can end before i. (Similar to this )

Note that there may be indices j > i — len(T) from which we can update (preprocess needed)

  • »
    »
    6 лет назад, скрыть # ^ |
     
    Проголосовать: нравится 0 Проголосовать: не нравится

    nishant403 I have high doubts regarding the correctness of this solution. Let's consider this problem to one given in the link, Then we have got a straight 1xN single row matrix, with all the blocked cells at (1,i) where i >= len(T). Now, the problem is we are blocking all the blocked cells, but there might be cases when some blocked cells can be available cells when few are already blocked before that..

»
6 лет назад, скрыть # |
Rev. 2  
Проголосовать: нравится +43 Проголосовать: не нравится

I have solved some similar problem sometime ago and would like to share some interesting approaches to this problem.

1) Solution 1: Complexity- O(|S|*|T|*26)

DP[i][j] -> currently set the first 'i' elements of S and 'j' is the length of the longest suffix of S which is a prefix of T. Transition would require you to iterate over all the characters and calculate the updated longest suffix. These transitions can be preprocessed in O(|T|*26) using the prefix function and stored in another dp, TRANS[j][c] where j is the current suffix, c is the character you add after the current suffix and TRANS[j][c] would store the length of longest updated suffix. Note that you should not perform the transition where TRANS[j][c] = |T|. For the bonus part, you can maintain another state which will store the number of times you visit j = |T|.

2) Solution 2: O(|T|^3*log(|S|))

We will create a transition matrix for T and use matrix expo over the length of S to find the answer. Construct a matrix M, where M[i][j] will store the number of characters s.t. the longest suffix changes from i to j on adding the character (again use prefix function on T) and skip those transitions where j = |T| or i = |T|. If you exponentiate this to the power of N, you will realize that the answer is sum of M[0][j] where 0 <= j < |T|.

3) What if you are given multiple patterns ( say P1, P2, ..., Pk) and you have to calculate the number of strings S of length N s.t. none of the patterns occur in the string ?

Solution: We will use aho corasick to solve this problem in O(|N|*|sum of pattern length|). Build aho corasick over all the patterns and mark those nodes where atleast one pattern ends. We will consider these nodes as the states and make transitions only when we are not ending at any marked node. The new dp state becomes dp[i][j] where i is the length of the current string and j is the state in the aho corasick. Transitions will be similar as in Solution 1. You can also use Solution 2 and solve the problem in O(|sum of pattern length|^3*log(|S|)) too.

Here are some problems which uses the last approach : Problem Problem

»
6 лет назад, скрыть # |
Rev. 2  
Проголосовать: нравится 0 Проголосовать: не нравится

EDIT: What if we want to find the number of string S which contains given string T as a substring at most K times?

Again DP[i][len(Alphabet)]. You can make a transition function via KMP, and it will be len(T) * (K + 1), (each len(T) block is offset by K * (T)). You don't want to reach the final state, so if you have string abc, and you want to have it max 2 times you have this — [0,0,0,3,3,3,6,6,6], which means if you already have matched 'abc' once, each "invalid" character brings you back to state 3, not 0.

  • »
    »
    6 лет назад, скрыть # ^ |
    Rev. 2  
    Проголосовать: нравится 0 Проголосовать: не нравится

    zed_b Your explanation is a bit unclear. But I got your point. whenever we are matching at position j and if the character does not match, we reach (j/k) * j the position instead of 0 after applying KMP. maybe that would work. I think that is same as dp[N][K][len(T)]