Boyer-Moore Algorithm

24 Sep 2023 in Study / Rosalind on Motifs

For searching motifs in suggested sequences

We should find the patterns which are represented in the suggested sequences
There are various methods for it
Boyer-Moore algorithm is efficient for searching patterns in string
It should use two methods simultaneously
- Bad-character rule (BCR)
- Good suffix rule (GSR)
It efficiently decreases the searching path in comparison to Brute-force algorithm

BCR

Example
A C G G T C A T G C C A (Suggested Seq)
A T C A G (Pattern)
First of all, we should identify the location where there is a mismatch which is the closest to the end of the pattern (T-G)
We should focus on a symbol (T) in suggested seq which is involved in that mismatch
We should find the location of the symbol in pattern. (If there are the symbols more than one, we choose one which is the closest to the end of the pattern)
- Consequently, it is possible to move backward (that is, we need more rule (: GSR) for efficently finding the searching path)
We move the pattern
A C G G T C A T G C C A
………..A T C A G
If the symbol is not in pattern, we just move the pattern by the length of pattern
For this calculation, we should do preprocessing
- We should make the dictionary (BCR_Dictionary) representing each nucleotides’ index which is the closest to the end of pattern
- If pattern is ATCAG, BCR_Dictionary = {‘A’ : 3, ‘T’ : 1, ‘C’ : 2, ‘G’ : 4}
- If a certain nucleotide is not in pattern, the value in BCR_Dictionary is -1 (for calculating the case that the symbol is not in pattern : we just move the pattern by the length of pattern)

GSR

Good suffix : From the end of the pattern to the location where next location is first mismatch

A C G G T C A T G C C A
T G T G T

Good suffix = G T

GSR-Case 1 : There is one more sequence in the pattern which is equal to good suffix

A C G G T C A T G C C A
T G T G T

→

A C G G T C A T G C C A
……..T G T G T

GSR-Case 2: There is sequence in the pattern which is the part of good suffix

A C G G T C A T G C C A
C A T A T C A

→

A C G G T C A T G C C A
………………C A T A T C A

For this calculation, we should do preprocessing
- We should make the list (shift[])
- shift[n] : Assume the good suffix is from n (index) to the end of pattern → list value = the number needed for pattern shifting
  - index n-1 is absolutely mismatch (definition of good suffix)
  - we don’t know whether 1 ~ n-2 is mismatch or not
- For calculating shift[], we should make the list (bpos[])
  - bpos[n] : Assume there is a string from n (index) to the end of pattern → list value = start index of suffix which is equal to prefix
  - Example : ABABA → bpos = 2 (prefix = suffix = ABA)

preProcessingBCR

def preProcessingBCR(pattern):
  BCR_Dictionary = {}
  Nucleotide = ['A', 'G', 'C', 'T']

  for nt in Nucleotide:
    BCR_Dictionary[nt] = -1

  for nt in range(len(pattern)):
    BCR_Dictionary[pattern[nt]] = nt

preProcessingGSR

def preProcessingGSR(pattern):
  bpos = [0] * (len(pattern) + 1)
  shift = [0] * (len(pattern) + 1)

  i = len(pattern)
  j = len(pattern) + 1
  bpos[i] = j

  while i > 0:
    while j <= len(pattern) and pattern[i - 1] != pattern[j - 1]:
      if shift[j] == 0:
        shift[j] = j - i
      j = bpos[j]

    i -= 1
    j -= 1
    bpos[i] = j

  i = bpos[0]
  for j in range(len(pattern)):
    if shift[j] == 0:
      shift[j] = i
    if j == i:
      i = bpos[i]

Implementation for Boyer-Moore

def BoyerMoore(pattern, seq, BCR_Dictionary, shift):
  i = 0
  result = []

  while i <= len(seq) - len(pattern):
    j = len(pattern) - 1

    while j >= 0 and pattern[j] == seq[i + j]:
      j -= 1

    if j < 0:
      result.append(i)
      i += shift[0]
    else:
      i += max(shift[j + 1], j - BCR_Dictionary[seq[i + j]])

  return result

Boyer-Moore Algorithm

BCR

GSR

preProcessingBCR

preProcessingGSR

Implementation for Boyer-Moore

Bioinformatics

Error

BCR

GSR

preProcessingBCR

preProcessingGSR

Implementation for Boyer-Moore

Templates (for web app):

Error