Detailed explanation of Boyer-Moore string search algorithm from entry to proficiency

Detailed explanation of Boyer-Moore string search algorithm from entry to proficiency

This article is about the Boyer-Moore algorithm. The Boyer-Moore algorithm is a string search algorithm. I want to understand this algorithm when I am interested. I found that this algorithm is quite difficult to understand at the beginning. Maybe my understanding ability is not very good. It took a long time to understand, and I wanted to share it after I understood it, because I think this algorithm is really good. I used to think that KMP in the string search algorithm is very good. I didn t expect there to be better, Boyer- On average, Moore's algorithm is 3-5 times faster than KMP.

The following is my understanding of the algorithm. I have referred to some introductions about the algorithm. Each picture in it is drawn very carefully. I hope to clarify the problem. If there are any errors, questions or unintelligence, please mention it. Come out, learn and progress together! The following text begins.

 

Introduction

Among the algorithms used to find substrings, the BM (Boyer-Moore) algorithm is currently considered the most efficient string search algorithm. It was designed by Bob Boyer and J Strother Moore in 1977. In general, it is 3-5 times faster than the KMP algorithm. This algorithm is often used for search and matching functions in text editors. For example, the well-known GNU grep command uses this algorithm. This is also an important reason why GNU grep is faster than BSD grep~~~

Why is GNU grep so fast. Maybe you can learn some of these ideas and apply them to BSD grep.

#Trick 1: GNU grep is fast because it does not check every byte in the input .

#Trick 2: GNU grep is fast because it performs very few instructions (operations) for every byte that really needs to be checked .

GNU grep uses the very famous Boyer-Moore algorithm

GNU grep also unfolds the inner loop of the Boyer-Moore algorithm, and builds a Boyer-Moore delta table, so that it does not need to perform loop exit judgment at every unfolding step. The result of this is that, in the limit, GNU grep will execute no more than 3 x86 instructions for each input byte that needs to be checked (and many bytes are skipped).

You can take a look at the paper "Fast String Searching" published in "Software Practice & Experience" by Andrew Hume and Daniel Sunday in November 1991. This article discusses the implementation techniques of Boyer-Moore algorithm very well. This article is free PDF online version. (Below is Baidu Cloud link)

Link: pan.baidu.com/s/1kULPQbP Password: en7g

Once you have a quick search, you will find that you need the same fast typing.

GNU grep uses native Unix input system calls and avoids copying data after reading.

Moreover, GNU grep also avoids breaking the input. Finding a newline character will slow down grep several times, because you have to look at every byte to find a newline character!

So GNU grep does not use line-based input, but reads the original data into a large buffer, uses the Boyer-Moore algorithm to search this buffer, and only finds a match before it finds the nearest newline Character (some command parameters, such as -n will prohibit this optimization).

GNU grep also tries to do some very difficult things so that the kernel can also avoid processing each byte of input, such as using mmap() instead of read() for file input. At the time, using read() would cause some extra copies in most Unix versions. Because I no longer GNU grep, it seems that mmap is no longer used by default, but you can still enable it with the parameter -mmap, at least when the buffer of the file system has cached your data, mmap still needs Faster:

1 $ time sh -c'find. -Type f -print | xargs grep -l 123456789abcdef' 2 real 0m1.530s 3 user 0m0.230s 4 sys 0m1.357s 5 $ time sh -c'find. -Type f -print | xargs grep --mmap -l 123456789abcdef' 6 real 0m1.201s 7 user 0m0.330s 8 sys 0m0.929s Copy code

[The input used here is a 648M MH mail folder, containing about 41,000 messages]

So even today, using mmap can still speed up by more than 20%.

summary:

Use the Boyer-Moore algorithm (and expand its inner loop).

-Use native system calls to build your buffered input, avoid copying input bytes before searching. (In any case, it is best to use buffered output, because in the common grep scenario, the output is less than the input, so the overhead of the output buffer copy is small, and it can save many such small unbuffered write operations.)

-Don't look for a newline character until you find a match.

Try to make some settings (such as page alignment buffer, read blocks according to page size, and selectively use mmap) so that the kernel can avoid copying bytes.

The key to making programs faster is to make them do fewer things . ;-)

 

Main features

Assuming that the length of the text string is n and the length of the pattern string is m, the main features of the BM algorithm are:

  • Compare and match from right to left (general string search algorithms such as KMP are matched from left to right);
  • The algorithm is divided into two stages: the preprocessing stage and the search stage;
  • The time and space complexity of the preprocessing stage are both O ( m +), which is the size of the character set, generally 256;
  • The time complexity of the search phase is O ( m n );
  • When the pattern string is aperiodic, in the worst case, the algorithm needs to perform 3n character comparison operations;
  • The algorithm achieves O ( n  /  m ) in the best case . For example, searching for the pattern string am-1b in the text string bn requires only n/m comparisons.

These features first let everyone have a basic understanding of the algorithm, and after you understand the algorithm and then look at these features, there will be some additional gains.

 

Basic idea of algorithm

The conventional matching algorithm moves the pattern string from left to right, and compares it from left to right. The basic framework is:

1 while (j <= strlen(text)-strlen(pattern)) 2 { 3 for (i = 0; i <strlen(pattern) && pattern[i] == text[i + j]; ++i); 4 if (i == strlen(pattern)) 5 { 6 Match; 7 break; 8 } 9 else 10 ++j; 11} Copy code

The BM algorithm moves the pattern string from left to right, and compares it from right to left. The basic framework is:

1 while (j <= strlen(text)-strlen(pattern)) 2 { 3 for (i = strlen(pattern); i >= 0 && pattern[i] == text[i + j]; --i); 4 5 if (i <0)) 6 { 7 Match; 8 break; 9 } 10 else 11 j += BM(); 12} Copy code

The essence of the BM algorithm lies in BM (text, pattern), that is, the BM algorithm can skip more than one character at a time when it does not match. That is, it does not need to compare the characters in the searched string one by one, but will skip some parts. Generally, the longer the search keyword, the faster the algorithm. Its efficiency comes from the fact that for every failed matching attempt, the algorithm can use this information to exclude as many unmatched locations as possible. That is, it makes full use of some characteristics of the string to be searched, speeding up the search step.

The BM algorithm actually contains two parallel algorithms (that is, two heuristic strategies): bad-character shift and good-suffix shift . The purpose of these two algorithms is to make the pattern string move as far as possible to the right each time (that is, the above BM() is as large as possible).

The two algorithms are not directly explained in writing below. In order to make them easier to understand, let's use an example to explain it first. This is the most acceptable way.

 

String search brainstorm

Let s brainstorm: how to speed up string search? For a very simple example, as shown in the figure below, navie represents the general practice, comparing one by one, from right to left, the last character c does not match the d in text, and the pattern is shifted one bit to the right. But what are the characteristics of this d? There is no d in the pattern, so no matter if you shift 1, 2, 3, or 4 bits to the right, it will definitely still not match. Why bother? Wouldn't it be better to shift by 5 (strlen(pattern)) to the right and compare? Okay, just do this. After shifting to the right by 5 bits, the b in text is compared with the c in pattern, and it is still different. What should I do then? b is in the pattern so it can t be shifted to the right by 5 digits at once, can it be shifted to the right by one? No, you can directly move the b in the pattern to the position of b in the text for comparison, but there are two b in the pattern, which b is shifted to the right? The insurance method is to compare the rightmost b with text. Why? The picture below is very clear. Using the leftmost b is too radical, and it is easy to miss the real match. After using the rightmost b in the picture, it is found that all of them are matched successfully. If you use the leftmost b, you will not miss this. Matches? This heuristic search is done by the BM algorithm.

 

But, if you encounter the following situation, the c in the pattern does not match the b in the text, Ok, according to the above rules, move the pattern to the right until the rightmost b is aligned with the b in the text for comparison. Then compare the c in the pattern with the c in the text. The matching continues to compare to the left until the a in the pattern at position 3 does not match the b in the text. According to the heuristic rules mentioned above, the pattern should be The b on the far right is aligned with the b of text, but what do you find at this time? The pattern went back, why? Of course, don't do it, so don't be so stupid. In this case, you only need to move the pattern one step to the right, and insist on not going back!

Okay, this is the so-called "bad character algorithm", simple and easy to understand. The b marked in bold red above is the "bad character", that is, the unmatched character, and the bad character is for text.

Is BM that simple? Is it done with just one heuristic? Of course not, everyone brainstorm again, is there any other way to speed up the string search? For example, the following example

In the beginning, the bad character algorithm was used to move 4 digits at once, which was good. Then I encountered a backtracking. There was no way to move only one bit conservatively, but is it really only possible to move one bit? No, because other positions in the front of the pattern also have the suffix ab that has just been matched, wouldn't it be better to move the ab in front of the pattern to the right of the text that just matched the ab and continue to match forward? In this way, you can move two bits to the right at a time, and there is a good heuristic search rule. Someone may think: What if there is no matching suffix in the front? Is it invalid? Not exactly, it depends on the situation, such as the following example.

 

The suffix cbab has been successfully matched, then b was not successful, and no string like cbab was found before the pattern, so just move one bit conservatively? No, there is ab in front. This is part of the cbab suffix. You can also make good use of it. Move the ab in front of the pattern right to the ab position where the text has been matched and continue to match forward, so that it moves four places to the right at once ,well. Of course, if there is no suffix or part of the suffix that has been successfully matched, such as the first babac, then it really cannot be used.

Okay, this is the so-called "good suffix algorithm", simple and easy to understand. The ab (previous example) and cbab (above example) marked in red above are "good suffixes", and good suffixes are for pattern of.

Below, I will give an example at the end to illustrate what is a bad character and what is a good suffix.

Main string: mahtavaatalomaisema omalomailuun

Pattern string: maisemaomaloma

Bad character: "t" in the main string is a bad character.

Good suffix: Aloma in the pattern string is "good suffix".

BM is that simple? Yes, two heuristic search rules that are easy to understand but not everyone can think of make BM such an excellent algorithm. So there is another problem? How to use these two algorithms, if you have bad characters and good suffixes, when should you use bad characters? When should I use the suffix? Very good question, it depends on which digits are shifted to the right. For example, in the above example, at the beginning, if you use a good suffix, you can only move one bit, but you can move three digits to the right with a bad character. Bad character algorithm. Next, if you continue to use bad characters, you can only move one bit to the right, and use a good suffix to move four bits to the right. What do you say to use at this time? So, these two algorithms are "parallel", whichever is more useful.

It is of course not enough to illustrate with examples alone, it is too shallow, and it may not completely cover all situations, and it is not accurate. The real theoretical discussion begins below.

 

BM algorithm theory discussion

(1) Bad character algorithm

When a bad character appears, the BM algorithm moves the pattern string to the right, so that the rightmost corresponding character in the pattern string is opposite to the bad character, and then continues to match. There are two cases of bad character algorithms.

Case1: When there is a corresponding bad character in the pattern string, let the rightmost corresponding character in the pattern string be opposite to the bad character (PS: BM cannot go back, because if it is going back, the movement distance is negative, which is definitely not the maximum movement. Step count), as shown below.

Case2: There are no bad characters in the pattern string, which is good. Move the entire pattern string to the right by such a large number of steps, as shown in the figure below.

(2) Good suffix algorithm

If the program matches a good suffix, and there is another same suffix or part of the suffix in the pattern, then move the next suffix or part to the current suffix position. Suppose, the last u characters of pattern and text have been matched, but the next character does not match, I need to move to match. If the last u characters have appeared or partially appeared in other positions of the pattern, we move the pattern right to the front u characters or parts and the last u characters or parts are the same, if the last u characters are in other positions of the pattern It doesn't appear at all, very good, just move the whole pattern to the right. In this way, there are three situations for a good suffix algorithm, as shown in the following figure:

Case1: If there is a substring in the pattern string that exactly matches the good suffix, move the rightmost substring to the position of the good suffix to continue matching.

Case2: If there is no substring that exactly matches the good suffix, find the longest substring with the following characteristics in the good suffix, so that P[ms...m]=P[0...s].

Case3: If there is no substring that matches the suffix at all, move the entire pattern string to the right.

(3) Movement rules

The movement rules of the BM algorithm are:

Replace j += BM() in the basic framework of the three algorithms with j += MAX (shift (good suffix), shift (bad character)), that is

The BM algorithm is the maximum value calculated according to the good suffix algorithm and the bad character algorithm for the distance that the pattern string is moved to the right each time.

Shift (good suffix) and shift (bad character) are obtained by simple calculation of the preprocessing array of the pattern string. The preprocessing array of bad character algorithm is bmBc[], and the preprocessing array of good suffix algorithm is bmGs[].

 

Specific implementation of BM algorithm

When the substrings of the BM algorithm are mismatched, the distance required to shift the pattern to the right is calculated according to the bad character algorithm, using the bmBc array, and the distance to the right shift of the pattern according to the good suffix algorithm is calculated using the bmGs array. Let's talk about how to calculate the two preprocessing arrays bmBc[] and bmGs[].

(1) Calculate the bad character array bmBc[]

This calculation should be easy. It seems that bmBc[i] = m 1 i is enough, but this is not correct, because the character at position i may appear in multiple places in the pattern (as shown in the figure below), and we What is needed is the rightmost position, so it needs to be judged every time, which is very troublesome and poor performance. Here is a little trick, which is to use characters as subscripts instead of position numbers as subscripts. This only needs to be traversed once. This seems to be a space-for-time approach, but if it is a pure 8-bit character, only 256 spaces are needed, and for large patterns, the length itself may exceed 256, so it is worth it (This is also one of the reasons why the larger the data, the more efficient the BM algorithm).

As mentioned earlier, the calculation of bmBc[] is divided into two cases, which correspond to the previous one one by one.

Case1: The character appears in the pattern string, bmBc['v'] represents the position of the last occurrence of the character v in the pattern string, and the length from the end of the pattern string, as shown in the figure above.

Case2: The character does not appear in the pattern string. If there is no character v in the pattern string, then BmBc['v'] = strlen(pattern).

The code is also very simple:

1 void PreBmBc(char *pattern, int m, int bmBc[]) 2 { 3 int i; 4 5 for(i = 0; i <256; i++) 6 { 7 bmBc[i] = m; 8 } 9 10 for(i = 0; i <m-1; i++) 11 { 12 bmBc[pattern[i]] = m-1-i; 13} 14} Copy code

To calculate the distance that the pattern needs to shift to the right, we need to use the bmBc array. Is the value of bmBc the distance that the pattern actually needs to shift to the right? No, it's not true if you think about it. For example, the previous example mentioned that the use of bmBc algorithm may also go back, that is, the distance of the right shift is negative, and the value of bmBc is definitely not a negative number, so the two are not equal. So how do you calculate the distance that the pattern actually shifts to the right? This depends on the location of the bad characters in the text. As I said before, the bad character algorithm is for text. Let s look at the picture for a clear view. In the figure, v is a bad character in text (corresponding to position i+j), and the corresponding non-matching position in the pattern is i, then the distance that the pattern actually needs to be shifted to the right is: bmBc['v'] m + 1 + i .

(2) Calculate the suffix array bmGs[]

Here, the subscript of bmGs[] is a number instead of a character, indicating the position of the character in the pattern.

As mentioned earlier, the calculation of the bmGs array is divided into three cases, which correspond to the previous one one by one. Assume that the length of the suffix in the figure is represented by the array suff[].

Case1: Corresponding to the good suffix algorithm case1, as shown in the figure below, j is the position before the good suffix.

Case2: Corresponding to the suffix algorithm case2: As shown in the figure below:

Case3: Correspondence and good suffix algorithm case3, bmGs[i] = strlen(pattern) = m

This is clearer, and the code is easier to write:

1 void PreBmGs(char *pattern, int m, int bmGs[]) 2 { 3 int i, j; 4 int suff[SIZE]; 5 6//Calculate the suffix array 7 suffix(pattern, m, suff); 8 9//First assign all values to m, including Case3 10 for(i = 0; i <m; i++) 11 { 12 bmGs[i] = m; 13} 14 15//Case2 16 j = 0; 17 for(i = m-1; i >= 0; i--) 18 { 19 if(suff[i] == i + 1) 20 { 21 for(; j <m-1-i; j++) 22 { 23 if(bmGs[j] == m) 24 bmGs[j] = m-1-i; 25} 26} 27} 28 29//Case1 30 for(i = 0; i <= m-2; i++) 31 { 32 bmGs[m-1-suff[i]] = m-1-i; 33} 34} Copy code

So easy? Is it over? It's still one step away, what about suff[] here?

When calculating the bmGc array, in order to improve efficiency, first calculate the auxiliary array suff[] to indicate the length of the suffix.

The definition of the suff array: m is the length of the pattern

a. suffix[m-1] = m;

b. suffix[i] = k

    for [pattern[i-k+1] .,pattern[i]] == [pattern[m-1-k+1],pattern[m-1]]

It looks a little obscure, but in fact suff[i] is to find the length of the common suffix string with the character at position i as the suffix and the last character as the suffix in the pattern. I don t know if this is clear, let s give an example:

i: 0 1 2 3 4 5 6 7
pattern: bc ababab

When i=7, by definition suff[7] = strlen(pattern) = 8

When i=6, the suffix string with pattern[6] as the suffix is bcababa, and the suffix string with the last character b as the suffix is bcababab. There is no common suffix string between the two, so suff[6] = 0

When i=5, the suffix string with pattern[5] as the suffix is bcabab, the suffix string with the last character b as the suffix is bcababab, and the common suffix string of the two is abab, so suff[5] = 4

And so on...

When i=0, the suffix string with pattern[0] as the suffix is b, the suffix string with the last character b as the suffix is bcababab, and the common suffix string of the two is b, so suff[0] = 1

It seems that the code is also easy to write:

1 void suffix(char *pattern, int m, int suff[]) 2 { 3 int i, j; 4 int k; 5 6 suff[m-1] = m; 7 8 for(i = m-2; i >= 0; i--) 9 { 10 j = i; 11 while(j >= 0 && pattern[j] == pattern[m-1-i + j]) j--; 12 13 suff[i] = i-j; 14} 15} Copy code

This may be all right, but there are always people who are dissatisfied with this algorithm and feel too violent, so there are smart people who come up with a way to improve the above-mentioned conventional method. The basic scanning is from right to left. The improvement is to use the calculated suff[] value to calculate the suff[] value that is currently being calculated. How to use it, see the picture below:

i is the position where the suff[] value is currently being calculated.

f is the starting position of the last successful matching (not every position can be successfully matched, in fact, there are not many positions that can be successfully matched).

g is the mismatch position of the last successful match.

If i is between g and f, then there must be P[i]=P[m-1-f+i]; and if suff[m-1-f+i] <ig, then suff[i] = suff [m-1-f+i], doesn't this make use of the previous suff.

PS: Some people here may think it should be suff[m-1-f+i] <= i g, because if suff[m-1-f+i] = i g, it still does not exceed suff[f] Range, you can still use the previous suff[], but this is wrong, such as an extreme example:

i: 0 1 2 3 4 5 6 7 8 9
pattern: a aaaabaaa a

suff[4] = 4, where f=4, g=0, when i=3, then suff[m-1=f+i]=suff[8]=3, and suff[3]=4, The two are not equal, because the previous mismatch position g may be matched this time.

Well, after this explanation, the code is relatively simple:

1 void suffix(char *pattern, int m, int suff[]) 2 { 3 int f, g, i; 4 suff[m-1] = m; 5 g = m-1; 6 for (i = m-2; i >= 0; --i) 7 { 8 if (i> g && suff[i + m-1-f] <i-g) 9 suff[i] = suff[i + m-1-f]; 10 else 11 { 12 if (i <g) 13 g = i; 14 f = i; 15 while (g >= 0 && pattern[g] == pattern[g + m-1-f]) 16 --g; 17 suff[i] = f-g; 18} 19} 20} Copy code

ended? OK, it can be said that all the important algorithms are completed. I hope everyone can understand it. In order to verify whether you fully understand it, here is a simple example. Let's count bmBc[], suff[] and bmGs[].

Examples are as follows:

PS: Someone may ask here: How does bmBc['b'] equal to 2, doesn't it appear in the last position of the pattern last? It should be 0 by definition. Please take a closer look at the algorithm of bmBc:

1 for(i = 0; i <m-1; i++) 2 { 3 bmBc[pattern[i]] = m-1-i; 4} Copy code

Here is i <m 1 is not i <m, that is, if the last character has not appeared before, then its bmBc value is m. Why is the last digit not counted in bmBc? It s easy to think, if the bmBc of the character is 0, as mentioned above, the distance the pattern needs to be shifted to the right is bmBc['v']-m+1+i=-m+1+i <= 0, That is, if you stay on the same place or go back, of course you don t do it. The above situation has already been explained very clearly, so here is m-1.

Okay, all is finally finished, let s integrate these algorithms.

1 #include <stdio.h> 2 #include <string.h> 3 4 #define MAX_CHAR 256 5 #define SIZE 256 6 #define MAX(x, y) (x)> (y)? (X): (y) 7 8 void BoyerMoore(char *pattern, int m, char *text, int n); 9 10 int main() 11 { 12 char text[256], pattern[256]; 13 14 while(1) 15 { 16 scanf("%s%s", text, pattern); 17 if(text == 0 || pattern == 0) break; 18 19 BoyerMoore(pattern, strlen(pattern), text, strlen(text)); 20 printf("\n"); 21 } 22 23 return 0; 24 } 25 26 void print(int *array, int n, char *arrayName) 27 { 28 int i; 29 printf("%s: ", arrayName); 30 for(i = 0; i <n; i++) 31 { 32 printf("%d ", array[i]); 33} 34 printf("\n"); 35} 36 37 void PreBmBc(char *pattern, int m, int bmBc[]) 38 { 39 int i; 40 41 for(i = 0; i <MAX_CHAR; i++) 42 { 43 bmBc[i] = m; 44} 45 46 for(i = 0; i <m-1; i++) 47 { 48 bmBc[pattern[i]] = m-1-i; 49} 50 51/* printf("bmBc[]: "); 52 for(i = 0; i <m; i++) 53 { 54 printf("%d ", bmBc[pattern[i]]); 55} 56 printf("\n"); */ 57} 58 59 void suffix_old(char *pattern, int m, int suff[]) 60 { 61 int i, j; 62 63 suff[m-1] = m; 64 65 for(i = m-2; i >= 0; i--) 66 { 67 j = i; 68 while(j >= 0 && pattern[j] == pattern[m-1-i + j]) j--; 69 70 suff[i] = i-j; 71} 72} 73 74 void suffix(char *pattern, int m, int suff[]) { 75 int f, g, i; 76 77 suff[m-1] = m; 78 g = m-1; 79 for (i = m-2; i >= 0; --i) { 80 if (i> g && suff[i + m-1-f] <i-g) 81 suff[i] = suff[i + m-1-f]; 82 else { 83 if (i <g) 84 g = i; 85 f = i; 86 while (g >= 0 && pattern[g] == pattern[g + m-1-f]) 87 --g; 88 suff[i] = f-g; 89} 90} 91 92//print(suff, m, "suff[]"); 93} 94 95 void PreBmGs(char *pattern, int m, int bmGs[]) 96 { 97 int i, j; 98 int suff[SIZE]; 99 100//Calculate the suffix array 101 suffix(pattern, m, suff); 102 103//First assign all values to m, including Case3 104 for(i = 0; i <m; i++) 105 { 106 bmGs[i] = m; 107} 108 109//Case2 110 j = 0; 111 for(i = m-1; i >= 0; i--) 112 { 113 if(suff[i] == i + 1) 114 { 115 for(; j <m-1-i; j++) 116 { 117 if(bmGs[j] == m) 118 bmGs[j] = m-1-i; 119} 120} 121} 122 123//Case1 124 for(i = 0; i <= m-2; i++) 125 { 126 bmGs[m-1-suff[i]] = m-1-i; 127} 128 129//print(bmGs, m, "bmGs[]"); 130} 131 132 void BoyerMoore(char *pattern, int m, char *text, int n) 133 { 134 int i, j, bmBc[MAX_CHAR], bmGs[SIZE]; 135 136//Preprocessing 137 PreBmBc(pattern, m, bmBc); 138 PreBmGs(pattern, m, bmGs); 139 140//Searching 141 j = 0; 142 while(j <= n-m) 143 { 144 for(i = m-1; i >= 0 && pattern[i] == text[i + j]; i--); 145 if(i <0) 146 { 147 printf("Find it, the position is %d\n", j); 148 j += bmGs[0]; 149 return; 150} 151 else 152 { 153 j += MAX(bmBc[text[i + j]]-m + 1 + i, bmGs[i]); 154} 155} 156 157 printf("No find.\n"); 158} Copy code

The operation effect is as follows: