Boyer-Moore Algorithm

For searching motifs in suggested sequences

  • We should find the patterns which are represented in the suggested sequences
  • There are various methods for it
  • Boyer-Moore algorithm is efficient for searching patterns in string
  • It should use two methods simultaneously
    • Bad-character rule (BCR)
    • Good suffix rule (GSR)
  • It efficiently decreases the searching path in comparison to Brute-force algorithm

BCR

  • Example
    A C G G T C A T G C C A (Suggested Seq)
    A T C A G (Pattern)

  • First of all, we should identify the location where there is a mismatch which is the closest to the end of the pattern (T-G)
  • We should focus on a symbol (T) in suggested seq which is involved in that mismatch
  • We should find the location of the symbol in pattern. (If there are the symbols more than one, we choose one which is the closest to the end of the pattern)
    • Consequently, it is possible to move backward (that is, we need more rule (: GSR) for efficently finding the searching path)
  • We move the pattern
    A C G G T C A T G C C A
    ………..A T C A G
  • If the symbol is not in pattern, we just move the pattern by the length of pattern
  • For this calculation, we should do preprocessing
    • We should make the dictionary (BCR_Dictionary) representing each nucleotides’ index which is the closest to the end of pattern
    • If pattern is ATCAG, BCR_Dictionary = {‘A’ : 3, ‘T’ : 1, ‘C’ : 2, ‘G’ : 4}
    • If a certain nucleotide is not in pattern, the value in BCR_Dictionary is -1 (for calculating the case that the symbol is not in pattern : we just move the pattern by the length of pattern)

GSR

  • Good suffix : From the end of the pattern to the location where next location is first mismatch

A C G G T C A T G C C A
T G T G T

Good suffix = G T

  • GSR-Case 1 : There is one more sequence in the pattern which is equal to good suffix

A C G G T C A T G C C A
T G T G T


A C G G T C A T G C C A
……..T G T G T

  • GSR-Case 2: There is sequence in the pattern which is the part of good suffix

A C G G T C A T G C C A
C A T A T C A


A C G G T C A T G C C A
………………C A T A T C A

  • For this calculation, we should do preprocessing
    • We should make the list (shift[])
    • shift[n] : Assume the good suffix is from n (index) to the end of pattern → list value = the number needed for pattern shifting
      • index n-1 is absolutely mismatch (definition of good suffix)
      • we don’t know whether 1 ~ n-2 is mismatch or not
    • For calculating shift[], we should make the list (bpos[])
      • bpos[n] : Assume there is a string from n (index) to the end of pattern → list value = start index of suffix which is equal to prefix
      • Example : ABABA → bpos = 2 (prefix = suffix = ABA)

preProcessingBCR

def preProcessingBCR(pattern):
  BCR_Dictionary = {}
  Nucleotide = ['A', 'G', 'C', 'T']

  for nt in Nucleotide:
    BCR_Dictionary[nt] = -1

  for nt in range(len(pattern)):
    BCR_Dictionary[pattern[nt]] = nt


preProcessingGSR

def preProcessingGSR(pattern):
  bpos = [0] * (len(pattern) + 1)
  shift = [0] * (len(pattern) + 1)

  i = len(pattern)
  j = len(pattern) + 1
  bpos[i] = j

  while i > 0:
    while j <= len(pattern) and pattern[i - 1] != pattern[j - 1]:
      if shift[j] == 0:
        shift[j] = j - i
      j = bpos[j]

    i -= 1
    j -= 1
    bpos[i] = j

  i = bpos[0]
  for j in range(len(pattern)):
    if shift[j] == 0:
      shift[j] = i
    if j == i:
      i = bpos[i]


Implementation for Boyer-Moore

def BoyerMoore(pattern, seq, BCR_Dictionary, shift):
  i = 0
  result = []

  while i <= len(seq) - len(pattern):
    j = len(pattern) - 1

    while j >= 0 and pattern[j] == seq[i + j]:
      j -= 1

    if j < 0:
      result.append(i)
      i += shift[0]
    else:
      i += max(shift[j + 1], j - BCR_Dictionary[seq[i + j]])

  return result

© 2017. All rights reserved.

Powered by Hydejack v조현진