Regular Expression

28 Sep 2023 in Study / Rosalind on Motifs

For searching motifs in suggested sequences

We should find the patterns which are represented in the suggested sequences
There are various methods for it
Regular Expression is simple for it
However, it is not efficient when there are lots of data
After defining the regular expression, we gonna find the pattern which corresponds to it
- Ex) G[CG]A (RE) → GCA, GGA

Symbol representing RE

. : any character
- Ex) A.G → ACG, AGG, AAG, ATG, ….
chr? : X or chr * 1
- Ex) A?G → G, AG
chr* : X or chr * n (n >= 1)
- Ex) A*G → G, AG, AAG, AAAG, ….
chr+ : chr * n (n >= 1)
- Ex) A+G → AG, AAG, AAAG, ….
[chr1chr2chr3…] : one from [chr1, chr2, chr3, ….]
- Ex) [AC]G → AG, CG
- [num1num2num3…] : one from [num1, num2, num3, ….]
- [ACTGactg]* : all DNA sequences
[^chr1chr2chr3…] : one from (chr1, chr2, chr3, …)^C
chr{n} : chr * n
- Ex) GCA(TG){3} → GCATGTGTG
chr{n, m} : chr * α (n <= α <= m)
- Ex) AC{1, 2}G → ACG, ACCG
seq1 or seq2
```
"ATT | A(CG){2}" # → ATT, ACGCG
```
M[ ^_ ]*_ : coding amino acid sequence (_ : stop codon)

Module for RE

import re
re.search(RE, seq) : return object representing first subsequence corresponding to RE from sequence
- object.group() : set of subsequence
  - RE = “(seq1)(seq2)”
  - group(1) : In group(), set of seq1
  - group(2) : In group(), set of seq2
- object.span() : (start_index, end_index + 1)
re.match(RE, seq) : return object representing whether first subsequence (start from index 0) from sequence corresponds to RE
re.findall(RE, seq) : return list representing all subsequence corresponding to RE
- element from list is not object
re.finditer(RE, seq) : return iterator representing all subsequence corresponding to RE
- element from iterator is objcet : using group(), span()

import re

seq = "ACACACCCGGCGCGAGCATCGTCACTGCAGCATCGACTCCTCGAGCACGTTCTCCACCGTTTCACTCACTATCGG"

regexp = "(ACA)(C.C)"

motifs = re.finditer(regexp, seq)

for motif in motifs:
  print(motif.group(), end=' ')
  print(motif.span(), end=' ')
  print(motif.group(1), end='')

result: ACACAC (0, 6) ACA

Lookahead Assertion

전방 탐색 어설션
There are some problems when we use re.findall (or finditer)
- If there are overlaps, it cannot distinguish them
- Ex) seq = “GAGAC”, RE = “GA[GC]”
  - desired result = GAG (start : 0), GAC (start : 2)
  - result = GAG because of the overlap
Lookahead Assertion
- If the form of RE gonna be “(?= original RE)”, it can distinguish overlaps
- However, it knows only start index (not end index, so it will return just start index no matter how you use span())
- Both group() (return nothing) and span() ((start index, start index)) are changed so you cannot use them for original purpose, but the function of group(n) remains
- After slicing the sequence based on start index calculated from lookahead assertion, calculate again with original RE

import re

seq = "ACACACCCGGCGCGAGCATCGTCACTGCAGCATCGACTCCTCGAGCACGTTCTCCACCGTTTCACTCACTATCGG"

regexp = "(?=" + "(ACA)(C.C)" + ")"

motifs = re.finditer(regexp, seq)

for motif in motifs:
  print(motif.group(), end=' / ')
  print(motif.span(), end=' / ')
  print(motif.group(1), end=' / ')

result : / (0, 0) / ACA / / (2, 2) / ACA /

Preprocessing

Before using method from module re, set the RE inside it so that it processes for efficient calculation based on the specific RE
use compile method
- re.compile(RE)

import re
import time

seq = "ACACACCCGGCGCGAGCATCGTCACTGCAGCATCGACTCCTCGAGCACGTTCTCCACCGTTTCACTCACTATCGG"

start = time.time()

regexp = "(ACA)(C.C)"
motifs = re.finditer(regexp, seq)

for motif in motifs:
  print(motif.group(), end=' ')

end = time.time()
print(f"{end - start:.5f} sec", end=' ')

start = time.time()

compileRE = re.compile(regexp)

motifs = compileRE.finditer(seq)

for motif in motifs:
  print(motif.group(), end=' ')

end = time.time()
print(f"{end - start:.5f} sec")

result : ACACAC 0.00126 sec ACACAC 0.00000 sec

Database

Data from database → convert to RE → search it from the suggested sequence
Example : Prosite
The form of motif from Prosite is different compared to that of RE
- ”-“ exists between two symbols
- “x” : any amino acid
- ”()” : equal to “{}” from RE
- Ex) arsR family consensus pattern
- C-x(2)-D-[LIVM]-x(6)-[ST]-x(4)-S-[HYR]-[HQ]

def convert(data):
  data.replace("-", "")
  data.replace("x", ".")
  data.replace("(", "{")
  data.replace(")", "}")

  return data

Regular Expression

Symbol representing RE

Module for RE

Lookahead Assertion

Preprocessing

Database

Bioinformatics

Error

Symbol representing RE

Module for RE

Lookahead Assertion

Preprocessing

Database

Templates (for web app):

Error