Suffix Arrays
Suffix arrays are a clever idea that can be used to solve some tricky problems involving strings. A suffix array is a sorted list of all the suffixes of a string. For example, the suffixes of computer
are listed below on the left. Then we sort them alphabetically, as shown on the right.
computer computer omputer er mputer mputer puter omputer uter puter ter r er ter r uter
Sometimes people store the indices that the suffixes start at instead of the suffixes themselves. But we'll stick to storing the suffixes themselves. Below is a function to create a suffix array from a string. There are approaches with better big O performance that should be used for real applications, but this should be fine for our purposes.
def suffix_array(s): return sorted([s[i:] for i in range(len(s))])
Suffix arrays are good for answering questions about substrings. Below we will look at several applications. For each of them, the key fact that makes suffix arrays useful is the fact that every substring of the main string will appear at the start of some suffix (namely the suffix that starts at the same index that the substring starts at).
put
in computer
. At the first step of the binary search, we would check index 3 of the suffix array, which is omputer
. We compare the first three letters of omputer
, which is omp
, with put
.
Note: This approach requires us to first create a suffix array. If we are just checking a single substring, then the time to create the suffix array would be longer than just doing a slow linear search for the substring. However, if we are doing a lot of substring searches on the same string, then this is a good approach.**For just checking a single substring, there are a number of special string searching algorithms that are better than a linear search.
abcaabcbcbbc
. The suffix array for it is shown below.
aabcbcbbc
abcaabcbcbbc
abcbcbbc
bbc
bc
bcaabcbcbbc
\h{bcbbc}
bcbcbbc
c
caabcbcbbc
cbbc
cbcbbc
Suppose we are looking for a count of the substring bc
. Notice that all the substrings that start with bc
are grouped together. A binary search would find one of them, such as the one highlighted above. Then we could run one loop that goes toward the front of the string and another that goes toward the back of the string to count them all. Those loops will be quick because we can stop once the substrings stop starting with bc
.
abcabacabcabaac
. As we noted in the previous problem, each substring will show up at the start of some suffix, and all occurrences of the substring will be grouped together. The trick to finding the longest repeated substring is to look at the prefixes of each suffix (what they start with), and see what each one has in common with the one before it.
suffix | shared prefix with previous item |
aac | – |
abaac | a |
abacabcabaac | aba |
abcabaac | ab |
abcabacabcabaac | abcaba |
ac | a |
acabcabaac | ac |
baac | – |
bacabcabaac | ba |
bcabaac | b |
bcabacabcabaac | bcaba |
c | – |
cabaac | c |
cabacabcabaac | caba |
cabcabaac | cab |
abcaba
. Below is code that finds it. The code loops through the suffix array and runs another loop for each suffix to find what it shares in common with the one before it.
def longest_repeated_substring(s): array = suffix_array(s) shared_prefixes = [] for i in range(1, len(array)): prefix = '' j = 0 s = array[i-1] t = array[i] while j < len(s) and j < len(t) and s[j] == t[j]: prefix += s[j] j += 1 shared_prefixes.append(prefix) return sorted(shared_prefixes, key=len)[-1]
racecar
or abcddcba
. The trick here is to use the solution for the longest common substring problem, where the two arguments are the string and its reverse.