Блог пользователя cardcounter

Автор cardcounter, история, 7 часов назад, По-английски

I am thinking about an effient algorithm for wildcard searching with * representing any characters with any length.

aa*c, she*he, *she*he

When I am supposed to return, say, the beginning index of the first matching instance.

Say the pattern is of length M and the document is of length N, and the pattern has K '*' signs. I can think of a solution that first uses AC Automation to find all occurences of each chunk in O(N + M), with bitmask.

While converting the bitmask to indexes takes O(N * K)

Then binary search for the last possible beginning positions for each chunk. This could take O (K log N)

So the overall time complexity is still O (N * K), any way to do better?

References

https://mirror.codeforces.com/blog/entry/111380

https://mirror.codeforces.com/problemset/problem/1023/A

https://mirror.codeforces.com/blog/entry/127169

  • Проголосовать: нравится
  • 0
  • Проголосовать: не нравится

»
4 часа назад, # |
  Проголосовать: нравится +6 Проголосовать: не нравится

I'm not acquainted with regex expressions, so the following might be a misinterpretation of your problem, but it seems that what you are asking is essentially to find a series of non-intersecting occurrences of several strings. We should now seek a way to find all occurrences of a certain string, and find the first one past a certain threshold (determined by the last string's position). This can be accomplished by the use of a Suffix Automaton, which turns the problem into online queries about the lower bound upon the values within a node's subtree (Since the fail tree/suffix tree's subtree values represent the endpos set for a string). This can be done by a persistent segment tree and binary search upon its structure, resulting in a time and memory complexity of O(nlogn).

  • »
    »
    4 часа назад, # ^ |
      Проголосовать: нравится +8 Проголосовать: не нравится

    Wait wait wait, if my interpretation is correct, there is no need to be so complex. Each time you just need to find the first occurrence of each string between the '*'s, so you can just run KMP upon each of them (taking O(n) time), and scan thru the document, it should only take O(n+m) time.

    I'm now convinced I've misunderstood. Please give some more explanation :(