KMP Algorithm Explained
KMP algorithm is a string-searching algorithm, commonly used to find if some text contains a given pattern.
The name KMP comes from the initials of its 3 authors: Knuth–Morris–Pratt.
It is encouraged to read its original paper here: FAST PATTERN MATCHING IN STRINGS
Some example of this algorithm:
All indexes are 0-based for this article
- finding
code
inleetcode
should return4
(code
starts at index 4 ofleetcode
) - finding
leet
inleetcode
should return0
- finding
hello
inleetcode
should return-1
(indicating not found)
Most programming languages have this function built-in, for example
- Python str.find
- Java String::indexOf
- C++ string::find
Now let me explain each part of KMP algorithm. Lets assume two variable text
and pattern
, where we want to find the first occurrence of pattern
in text
, i.e
text = "leetcode"
pattern = "code"
text.find(pattern) == 4
Part 0. Naive approach
The naive approach one could come up is: for each index i
of text
, we compare text[i:i+len(pattern)]
with pattern
to see if they are equal, if so, a match is found, we return i
.
def find(self, text: str, pattern: str) -> int:
M, N = len(text), len(pattern)
if M == 0 or N == 0 or M < N:
return -1
for i in range(M):
found = True
for j in range(N): # <---- this part can be optimized
if i + j >= M or text[i + j] != pattern[j]:
found = False
break
if found:
return i
return -1
One obvious downside of this approach is that, for every new index i
, we need to start from the beginning of pattern
to do a character-by-character comparison. Essentially, we are checking pattern
over and over again, without utilizing the fact that we have looped over pattern
in the last loop.
And the authors of KMP algorithm found out a way to optimizing that, they found out that we could pre-process the pattern
, producing an array called next
, which allow us to start from a later index other than 0
, thus saving some comparison operations, whenever we find a mismatch.
Don't worry if you don't understand, we will look into the details of using next
array below.
Part 1. Using next array
Suppose we have this magic array, called next
, it has the same length as our pattern
len(next) == len(pattern)
Here's an example next
array, given pattern = abcabcacab
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
pattern | a | b | c | a | b | c | a | c | a | b |
next | -1 | 0 | 0 | -1 | 0 | 0 | -1 | 4 | -1 | 0 |
As we previously mentioned, without this array, whenever we found a mismatch, we need to start from the beginning of pattern
, i.e. setting j=0
, and check j=0,1,2,3,4,...
Now, with next
, whenever we found a mismatch, we don't have to always set j=0
, we could set j=next[j]
, so each time we don't have to start from j=0
like before, skipping some comparisons.
And what does the value inside next
mean?
Say, current we are comparing text[i]
and pattern[j]
,
- If
text[i] == pattern[j]
, we increase bothi
j
by one, nothing special. - But if there is a mismatch, meaning
index[i] != pattern[j]
, thennext[j]
is the next index ofpattern
we should check. In other words, instead ofj+=1
, we doj=next[j]
- Note this step is recursive/repeated, meaning if there is still a mismatch after
j=next[j]
, we loop again and setj=next[next[j]]
, untilj
becomes-1
, that means reverting back to the naive approach, settingj=0
, and start from beginning forpattern
.
- Note this step is recursive/repeated, meaning if there is still a mismatch after
Translate to code:
i, j = 0, 0
while i < M and j < N:
while j != -1 and text[i] != pattern[j]:
j = next[j]
i += 1
j += 1
Full code:
def kmp(text, pattern) -> int:
M, N = len(text), len(pattern)
if M == 0 or N == 0:
return -1
# Will implement this function later
next = build_next(pattern)
i, j = 0, 0
while i < M and j < N:
while j != -1 and text[i] != pattern[j]:
j = next[j]
i += 1
j += 1
if j == N:
# Match found
return i - N
# Match not found
return -1
Part 2. Building next array
This part is tricky to understand, I suggest you read the original paper
Now, the hard part is building next
array. This is a dynamic programming problem.
We need another helper array f
to build next
.
Let f[i]
be the largest index k
that is both:
0 <= k < i
pattern[:k] == pattern[i-k:i]
We manually set f[0] = -1
to indicate no match. We see empty strings as equal, i.e. k
could be 0
, pattern[:0] == pattern[i:i]
, this means f[i]
is always >=0
(except for f[0]
)
Example:
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
pattern | a | b | c | a | b | c | a | c | a | b |
f | -1 | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
next | -1 | 0 | 0 | -1 | 0 | 0 | -1 | 4 | -1 | 0 |
To compute f[i]
, we need to know f[i - 1]
and next[:i]
let t = f[i - 1]
while t != -1 and pattern[i] != pattern[t]:
t = next[t]
f[i] = t + 1
To compute next[i]
, we need to know only f[i]
next[i] = f[i] if pattern[i] != pattern[f[i]]
next[i] = next[f[i]] if pattern[i] == pattern[f[i]]
Combine this two part together, we can actually reduce f
array to a single variable recording f[i]
def build_next(pattern) -> list[int]:
N = len(pattern)
next = [-1] * N
i = 0
f = -1
while i < N - 1:
while f != -1 and pattern[i] != pattern[f]:
f = next[f]
i += 1
f += 1
if pattern[i] == pattern[f]:
next[i] = next[f]
else:
next[i] = f
return next
Full code for KMP Algorithm
def kmp(a, b) -> int:
M, N = len(a), len(b)
if M == 0 or N == 0:
return -1
next = build_next(b)
i, j = 0, 0
while i < M and j < N:
while j != -1 and a[i] != b[j]:
j = next[j]
i += 1
j += 1
if j == N:
# Match found
return i - N
# Match not found
return -1
# Helper function
def build_next(pattern) -> list[int]:
N = len(pattern)
next = [-1] * N
i = 0
f = -1
while i < N - 1:
while f != -1 and pattern[i] != pattern[f]:
f = next[f]
i += 1
f += 1
if pattern[i] == pattern[f]:
next[i] = next[f]
else:
next[i] = f
return next
You can test this code in this leetcode question: 28. Find the Index of the First Occurrence in a String
Reference
- Original paper about the algorithm FAST PATTERN MATCHING IN STRINGS
- LeetCode 28. Find the Index of the First Occurrence in a String
- 最浅显易懂的 KMP 算法讲解