Rabin-Karp String Search

Concept Overview

The Rabin-Karp algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin. It uses a sliding window hash to find one or more occurrences of a pattern string within a larger text string. Unlike naive string matching algorithms that compare strings character by character, Rabin-Karp calculates a numerical hash value for the pattern and for each successive substring of the text. If the hash values match, it performs a full character-by-character comparison to confirm the match and avoid spurious hits.

Mathematical Definition

The core of the algorithm relies on a rolling hash function. A common choice is the Rabin fingerprint, which treats the characters of the string as digits in a base-b numeral system (where b is often the size of the character set, e.g., 256). To prevent the hash values from overflowing, arithmetic is performed modulo a large prime number q.

For a string S of length m, the hash value H(S) is defined as:

H(S) = (S[0] × b^m-1 + S[1] × b^m-2 + ... + S[m-1] × b⁰) mod q

When the window slides one position to the right (from text index i to i+1), the hash for the new window can be calculated in O(1) time using the hash of the previous window:

H(T[i+1 .. i+m]) = ( (H(T[i .. i+m-1]) - T[i] × b^m-1) × b + T[i+m] ) mod q

Key Concepts

Rolling Hash Function: The algorithm's efficiency stems from the rolling hash. Instead of recalculating the hash from scratch for every window (which would take O(m) time), the rolling hash updates the value in O(1) time by subtracting the contribution of the character leaving the window, multiplying by the base, and adding the character entering the window.
Spurious Hits (Hash Collisions): Because the hash is computed modulo q, two different strings might have the same hash value. This is called a hash collision or a "spurious hit." When H(pattern) == H(window), the algorithm must still verify the match by comparing the strings character-by-character.
Time Complexity: The average and best-case time complexity is O(n + m). However, in the worst case (when there are many spurious hits), the complexity degrades to O(n × m).

Historical Context

The algorithm was introduced by Michael O. Rabin and Richard M. Karp in 1987. It was an early application of randomized algorithms and hashing techniques to string matching problems. While other algorithms like Knuth-Morris-Pratt (KMP) or Boyer-Moore often perform faster for single-pattern searches because they avoid the worst-case O(n × m) scenario, Rabin-Karp's reliance on hashing makes it highly extensible, particularly for multiple-pattern matching.

Real-world Applications

Plagiarism Detection: It is widely used in systems that detect plagiarism by searching for multiple sentences or phrases simultaneously in a large body of text.
Multiple Pattern Search: If multiple patterns of the same length are being searched, their hash values can be stored in a Bloom filter or hash set. A single pass over the text using a rolling hash can then efficiently detect occurrences of any of the patterns.
Bioinformatics: Searching for specific DNA sequences within large genomes where hashing can quickly rule out non-matching segments.

Related Concepts

Knuth-Morris-Pratt (KMP) Algorithm
Bloom Filters
Finite State Machines

Rabin-Karp String Search