KMP Algorithm Explained

Jeff posted on  (updated on )

KMP algorithm is a string-searching algorithm, commonly used to find if some text contains a given pattern.

The name KMP comes from the initials of its 3 authors: Knuth–Morris–Pratt.
It is encouraged to read its original paper here: FAST PATTERN MATCHING IN STRINGS

Some example of this algorithm:

All indexes are 0-based for this article

  1. finding code in leetcode should return 4 (code starts at index 4 of leetcode)
  2. finding leet in leetcode should return 0
  3. finding hello in leetcode should return -1 (indicating not found)

Most programming languages have this function built-in, for example

  1. Python str.find
  2. Java String::indexOf
  3. C++ string::find

Now let me explain each part of KMP algorithm. Lets assume two variable text and pattern, where we want to find the first occurrence of pattern in text, i.e

text = "leetcode"
pattern = "code"
text.find(pattern) == 4

Part 0. Naive approach

The naive approach one could come up is: for each index i of text, we compare text[i:i+len(pattern)] with pattern to see if they are equal, if so, a match is found, we return i.

def find(self, text: str, pattern: str) -> int:
	M, N = len(text), len(pattern)
	if M == 0 or N == 0 or M < N:
		return -1

	for i in range(M):
		found = True
		for j in range(N):        # <---- this part can be optimized
			if i + j >= M or text[i + j] != pattern[j]:
				found = False
				break
		if found:
			return i

	return -1

One obvious downside of this approach is that, for every new index i, we need to start from the beginning of pattern to do a character-by-character comparison. Essentially, we are checking pattern over and over again, without utilizing the fact that we have looped over pattern in the last loop.

And the authors of KMP algorithm found out a way to optimizing that, they found out that we could pre-process the pattern, producing an array called next, which allow us to start from a later index other than 0, thus saving some comparison operations, whenever we find a mismatch.

Don't worry if you don't understand, we will look into the details of using next array below.

Part 1. Using next array

Suppose we have this magic array, called next, it has the same length as our pattern

len(next) == len(pattern)

Here's an example next array, given pattern = abcabcacab

Index0123456789
patternabcabcacab
next-100-100-14-10

As we previously mentioned, without this array, whenever we found a mismatch, we need to start from the beginning of pattern , i.e. setting j=0, and check j=0,1,2,3,4,...

Now, with next, whenever we found a mismatch, we don't have to always set j=0, we could set j=next[j], so each time we don't have to start from j=0 like before, skipping some comparisons.

And what does the value inside next mean? Say, current we are comparing text[i] and pattern[j],

  • If text[i] == pattern[j], we increase both i j by one, nothing special.
  • But if there is a mismatch, meaning index[i] != pattern[j], then next[j] is the next index of pattern we should check. In other words, instead of j+=1, we do j=next[j]
    • Note this step is recursive/repeated, meaning if there is still a mismatch after j=next[j], we loop again and set j=next[next[j]], until j becomes -1, that means reverting back to the naive approach, setting j=0, and start from beginning for pattern.

Translate to code:

i, j = 0, 0
while i < M and j < N:
	while j != -1 and text[i] != pattern[j]:
		j = next[j]
	i += 1
	j += 1

Full code:

def kmp(text, pattern) -> int:
    M, N = len(text), len(pattern)
    if M == 0 or N == 0:
        return -1

	# Will implement this function later
    next = build_next(pattern)

    i, j = 0, 0
    while i < M and j < N:
        while j != -1 and text[i] != pattern[j]:
            j = next[j]
        i += 1
        j += 1

    if j == N:
        # Match found
        return i - N
    # Match not found
    return -1

Part 2. Building next array

This part is tricky to understand, I suggest you read the original paper

Now, the hard part is building next array. This is a dynamic programming problem. We need another helper array f to build next. Let f[i] be the largest index k that is both:

  1. 0 <= k < i
  2. pattern[:k] == pattern[i-k:i]

We manually set f[0] = -1 to indicate no match. We see empty strings as equal, i.e. k could be 0, pattern[:0] == pattern[i:i], this means f[i] is always >=0 (except for f[0])

Example:

Index0123456789
patternabcabcacab
f-1000123401
next-100-100-14-10

To compute f[i], we need to know f[i - 1] and next[:i]

let t = f[i - 1]
while t != -1 and pattern[i] != pattern[t]:
	t = next[t]
f[i] = t + 1

To compute next[i], we need to know only f[i]

next[i] = f[i]       if pattern[i] != pattern[f[i]]
next[i] = next[f[i]] if pattern[i] == pattern[f[i]]

Combine this two part together, we can actually reduce f array to a single variable recording f[i]

def build_next(pattern) -> list[int]:
    N = len(pattern)
    next = [-1] * N

    i = 0
    f = -1
    while i < N - 1:
        while f != -1 and pattern[i] != pattern[f]:
            f = next[f]
        i += 1
        f += 1
        if pattern[i] == pattern[f]:
            next[i] = next[f]
        else:
            next[i] = f
    return next

Full code for KMP Algorithm

def kmp(a, b) -> int:
    M, N = len(a), len(b)
    if M == 0 or N == 0:
        return -1

    next = build_next(b)

    i, j = 0, 0
    while i < M and j < N:
        while j != -1 and a[i] != b[j]:
            j = next[j]
        i += 1
        j += 1

    if j == N:
        # Match found
        return i - N
    # Match not found
    return -1

# Helper function
def build_next(pattern) -> list[int]:
    N = len(pattern)
    next = [-1] * N

    i = 0
    f = -1
    while i < N - 1:
        while f != -1 and pattern[i] != pattern[f]:
            f = next[f]
        i += 1
        f += 1
        if pattern[i] == pattern[f]:
            next[i] = next[f]
        else:
            next[i] = f
    return next

You can test this code in this leetcode question: 28. Find the Index of the First Occurrence in a String

Reference

  1. Original paper about the algorithm FAST PATTERN MATCHING IN STRINGS
  2. LeetCode 28. Find the Index of the First Occurrence in a String
  3. 最浅显易懂的 KMP 算法讲解