RNA Splicing
in Study / Rosalind on Rosalind problem
A problem from rosalind “Bioinformatics Stronghold” category
Rosalind Problem LinkDescription
- pre-mRNA involves several introns that are not coding region
- RNA splicing -> remove the introns
My Solution
- Sketch
- This rosalind problem intends for us not to consider the overlaps of several intron sequences (A given dataset, there is no such a case)
- I am curious there can be such a case in the real world
- If there can be, is it right that only one of overlapped intron sequences is removed? (2024-07-07 My question)
- I need to study more when it comes to it
- I don’t need to consider such a case in this rosalind problem, but I consider that case as well with my thought: “Only one of overlapped intron sequences is removed”
Intron: ACCT, CTTA 5-...ATGACCTTACAGCGGATTAG...-3 ACCT CTTA Possible Result 1: ATGTACAGCGGATTAG Possible Result 2: ATGACCAGCGGATTAG
- I am curious there can be such a case in the real world
- Get the information about intron’s start location and end location within a given DNA sequence through regular expression (ori_intron)
- Through “Lookahead Assertion” and string algorithm, I get the overlapped introns (over_intron) and I am tracking which introns (ori_intron) go through overlapping with these introns (over_intron)
- If there is no overlap, we just remove introns (ori_intron) from the DNA sequence
- If there are overlapped sequences
- Which one do we remove between ori_intron and over_intron: This is the matter of combination (That is, it is the matter of whether one intron is selected or not)
- I use Brute-force algorithm (exhaustive search, 완전탐색) through bitmask
- For all possible combination cases, I get the sequences where I removed the introns satisfying the current combination
- All possible final sequences -> Translate
- This rosalind problem intends for us not to consider the overlaps of several intron sequences (A given dataset, there is no such a case)
- Code
import re
codonTable = {"TTT" : "F", "TTC" : "F", "TTA" : "L", "TTG" : "L", "CTT" : "L", "CTC" : "L", "CTA" : "L", "CTG" : "L", "ATT" : "I", "ATC" : "I", "ATA" : "I", "ATG" : "M", "GTT" : "V", "GTC" : "V", "GTA" : "V", "GTG" : "V", "TCT" : "S", "TCC" : "S", "TCA" : "S", "TCG" : "S", "CCT" : "P", "CCC" : "P", "CCA" : "P", "CCG" : "P", "ACT" : "T", "ACC" : "T", "ACA" : "T", "ACG" : "T", "GCT" : "A", "GCC" : "A", "GCA" : "A", "GCG" : "A", "TAT" : "Y", "TAC" : "Y", "CAT" : "H", "CAC" : "H", "CAA" : "Q", "CAG" : "Q", "AAT" : "N", "AAC" : "N", "AAA" : "K", "AAG" : "K", "GAT" : "D", "GAC" : "D", "GAA" : "E", "GAG" : "E", "TGT" : "C", "TGC" : "C", "TGG" : "W", "CGT" : "R", "CGC" : "R", "CGA" : "R", "CGG" : "R", "AGT" : "S", "AGC" : "S", "AGA" : "R", "AGG" : "R", "GGT" : "G", "GGC" : "G", "GGA" : "G", "GGG" : "G"}
def translate(seq):
startCodon = "(?=" + "ATG" + ")"
compiler = re.compile(startCodon)
startCodonLocations = compiler.finditer(seq)
startCodonList = []
for i in startCodonLocations:
startCodonList.append(i.span()[0])
protein = ""
flag = False
for i in range(len(startCodonList)):
if (flag):
break
for j in range(startCodonList[i], len(seq), 3):
codon = seq[j : j + 3]
if codon not in codonTable.keys():
flag = True
break
else:
protein += codonTable[codon]
return protein
def quick_sort(array):
if len(array) <= 1:
return array
pivot = len(array) // 2
front_arr, pivot_arr, rear_arr = [], [], []
for value in array:
if value.getX() < array[pivot].getX():
front_arr.append(value)
elif value.getX() > array[pivot].getX():
rear_arr.append(value)
else:
pivot_arr.append(value)
if len(front_arr) == 0 and len(rear_arr) == 0 and len(pivot_arr) != 0:
return pivot_arr
return quick_sort(front_arr) + quick_sort(pivot_arr) + quick_sort(rear_arr)
name = input()
f = open(name, 'r')
f.readline()
count = 0
seq = ""
reExp = []
input_intron = ""
for line in f.readlines():
if line[0] != '>':
if count == 0:
seq += line[:-1]
else:
input_intron += line[:-1]
else:
if count == 0:
count += 1
else:
reExp.append(input_intron)
input_intron = ""
reExp.append(input_intron)
candidate_seqs = []
ori_intron_location = []
overlap_intron_location = []
class Intron:
def __init__(self, x, y):
self.__x = x
self.__y = y
self.__pair = []
def getX(self):
return self.__x
def getY(self):
return self.__y
def getPair(self):
return self.__pair
def appendPair(self, other):
self.__pair.append(other)
def __eq__(self, other):
return self.__x == other.__x and self.__y == other.__y
for i in range(len(reExp)):
oneReExp = reExp[i]
compiler = re.compile(oneReExp)
introns = compiler.finditer(seq)
for intron in introns:
x = intron.span()[0]
y = intron.span()[1]
newIntron = Intron(x, y)
ori_intron_location.append(newIntron)
for i in range(len(reExp)):
oneReExp = "(?=" + reExp[i] + ")"
compiler2 = re.compile(oneReExp)
introns = compiler2.finditer(seq)
for intron in introns:
x = intron.span()[0]
y = x + len(reExp[i])
if Intron(x, y) in ori_intron_location:
continue
else:
newIntron = Intron(x, y)
overlap_intron_location.append(newIntron)
removedOriIntron = []
for i in range(len(ori_intron_location)):
ori_intron1 = ori_intron_location[i]
for j in range(i, len(ori_intron_location)):
if i == j:
continue
else:
ori_intron2 = ori_intron_location[j]
if (ori_intron2.getX() >= ori_intron1.getX() and ori_intron2.getX() <= ori_intron1.getY() - 1) or (ori_intron2.getY() >= ori_intron1.getX() and ori_intron2.getY() <= ori_intron1.getY() - 1):
removedOriIntron.append(ori_intron2)
overlap_intron_location.append(ori_intron2)
new_ori_intron_location = []
for i in range(len(ori_intron_location)):
if ori_intron_location[i] not in removedOriIntron:
new_ori_intron_location.append(ori_intron_location[i])
for i in range(len(overlap_intron_location)):
overlap_intron = overlap_intron_location[i]
for j in range(len(new_ori_intron_location)):
new_ori_intron = new_ori_intron_location[j]
if (overlap_intron.getX() >= new_ori_intron.getX() and overlap_intron.getX() <= new_ori_intron.getY() - 1) or (overlap_intron.getY() >= new_ori_intron.getX() and overlap_intron.getY() <= new_ori_intron.getY() - 1):
overlap_intron.appendPair(new_ori_intron)
for i in range(2 ** len(overlap_intron_location)):
possibleSeq = ""
removed_intron = []
pre_ori_intron = -1
for j in range(len(overlap_intron_location)):
if i & (1 << j) != 0:
for k in range(len(overlap_intron_location[j].getPair())):
removed_intron.append(overlap_intron_location[j].getPair()[k])
else:
removed_intron.append(overlap_intron_location[j])
new_ori_intron_location = new_ori_intron_location + overlap_intron_location
new_ori_intron_location = quick_sort(new_ori_intron_location)
for l in range(len(new_ori_intron_location)):
if new_ori_intron_location[l] in removed_intron:
continue
else:
if pre_ori_intron != -1:
possibleSeq += seq[pre_ori_intron : new_ori_intron_location[l].getX()]
else:
possibleSeq += seq[: new_ori_intron_location[l].getX()]
pre_ori_intron = new_ori_intron_location[l].getY()
possibleSeq += seq[pre_ori_intron :]
candidate_seqs.append(possibleSeq)
proteins = []
for i in range(len(candidate_seqs)):
proteins.append(translate(candidate_seqs[i]))
for i in range(len(proteins)):
print(proteins[i])
f.close()
- Code Analysis
def translate(seq):
startCodon = "(?=" + "ATG" + ")"
compiler = re.compile(startCodon)
startCodonLocations = compiler.finditer(seq)
startCodonList = []
for i in startCodonLocations:
startCodonList.append(i.span()[0])
protein = ""
flag = False
for i in range(len(startCodonList)):
if (flag):
break
for j in range(startCodonList[i], len(seq), 3):
codon = seq[j : j + 3]
if codon not in codonTable.keys():
flag = True
break
else:
protein += codonTable[codon]
return protein
def quick_sort(array):
if len(array) <= 1:
return array
pivot = len(array) // 2
front_arr, pivot_arr, rear_arr = [], [], []
for value in array:
if value.getX() < array[pivot].getX():
front_arr.append(value)
elif value.getX() > array[pivot].getX():
rear_arr.append(value)
else:
pivot_arr.append(value)
if len(front_arr) == 0 and len(rear_arr) == 0 and len(pivot_arr) != 0:
return pivot_arr
return quick_sort(front_arr) + quick_sort(pivot_arr) + quick_sort(rear_arr)
translate: Find just one ORF (because the anwser is only one) and translate it
quickSort: When removing introns, it is possible to use linear search if introns are sorted by its start index
name = input()
f = open(name, 'r')
f.readline()
count = 0
seq = ""
reExp = []
input_intron = ""
for line in f.readlines():
if line[0] != '>':
if count == 0:
seq += line[:-1]
else:
input_intron += line[:-1]
else:
if count == 0:
count += 1
else:
reExp.append(input_intron)
input_intron = ""
reExp.append(input_intron)
Extract a given DNA sequence (seq) and several introns (reExp)
candidate_seqs = []
ori_intron_location = []
overlap_intron_location = []
class Intron:
def __init__(self, x, y):
self.__x = x
self.__y = y
self.__pair = []
def getX(self):
return self.__x
def getY(self):
return self.__y
def getPair(self):
return self.__pair
def appendPair(self, other):
self.__pair.append(other)
def __eq__(self, other):
return self.__x == other.__x and self.__y == other.__y
Intron class: appendPair (overlap_intron has pair instance variable and it will have its overlapped ori_intron)
candidate_seqs: DNA Sequences whose introns are removed
ori_intron_location: The location of intron sequences that were extracted through regular expression without “Lookahead Assertion”
overlap_intron_location: The location of intron sequences that were extracted through “Lookahead Assertion” (but, intron which has already been involved in ori_intron_location shouldn’t be involved in) + ori_intron sequences that are ovelpped with other ori_intron (These will be removed from ori_intron_location and move to overlap_intron_location)
Example
Intron: ACCT, CTTA, GCGC
DNA sequence: ATGACCTTACAGCGCGCATTAG
----
---- ----
----
RE (without lookahead assertion): 3 (RE: ACCT), 5 (RE: CTTA), 11 (RE: GCGC) -> ori_intron_location
RE (with lookahead assertion): 3 (RE: ACCT), 5 (RE: CTTA), 11 (RE: GCGC), 13 (RE: GCGC) -> Only 13 is the element of overlap_intron_location
In ori_intron_location, 5 (RE: CTTA) is overlapped with 3 (RE: ACCT)
-> 5 (RE: CTTA) is removed from ori_intron and move to overlap_intron
for i in range(len(reExp)):
oneReExp = reExp[i]
compiler = re.compile(oneReExp)
introns = compiler.finditer(seq)
for intron in introns:
x = intron.span()[0]
y = intron.span()[1]
newIntron = Intron(x, y)
ori_intron_location.append(newIntron)
for i in range(len(reExp)):
oneReExp = "(?=" + reExp[i] + ")"
compiler2 = re.compile(oneReExp)
introns = compiler2.finditer(seq)
for intron in introns:
x = intron.span()[0]
y = x + len(reExp[i])
if Intron(x, y) in ori_intron_location:
continue
else:
newIntron = Intron(x, y)
overlap_intron_location.append(newIntron)
RE (without lookahead assertion) -> ori_intron_location
RE (with lookahead assertion) + remove duplicated intron within ori_intron_location -> overlap_intron_location
removedOriIntron = []
for i in range(len(ori_intron_location)):
ori_intron1 = ori_intron_location[i]
for j in range(i, len(ori_intron_location)):
if i == j:
continue
else:
ori_intron2 = ori_intron_location[j]
if (ori_intron2.getX() >= ori_intron1.getX() and ori_intron2.getX() <= ori_intron1.getY() - 1) or (ori_intron2.getY() >= ori_intron1.getX() and ori_intron2.getY() <= ori_intron1.getY() - 1):
removedOriIntron.append(ori_intron2)
overlap_intron_location.append(ori_intron2)
new_ori_intron_location = []
for i in range(len(ori_intron_location)):
if ori_intron_location[i] not in removedOriIntron:
new_ori_intron_location.append(ori_intron_location[i])
ori_intron sequences that are ovelpped with other ori_intron (These will be removed from ori_intron_location and move to overlap_intron_location)
After this process, ori_intron_location -> new_ori_intron_location
for i in range(len(overlap_intron_location)):
overlap_intron = overlap_intron_location[i]
for j in range(len(new_ori_intron_location)):
new_ori_intron = new_ori_intron_location[j]
if (overlap_intron.getX() >= new_ori_intron.getX() and overlap_intron.getX() <= new_ori_intron.getY() - 1) or (overlap_intron.getY() >= new_ori_intron.getX() and overlap_intron.getY() <= new_ori_intron.getY() - 1):
overlap_intron.appendPair(new_ori_intron)
For overlap_intron, appendPair
for i in range(2 ** len(overlap_intron_location)):
possibleSeq = ""
removed_intron = []
pre_ori_intron = -1
for j in range(len(overlap_intron_location)):
if i & (1 << j) != 0:
for k in range(len(overlap_intron_location[j].getPair())):
removed_intron.append(overlap_intron_location[j].getPair()[k])
else:
removed_intron.append(overlap_intron_location[j])
Use Brute-force algorithm (exhaustive search, 완전탐색) through bitmask
(overlap_intron) intron 1 intron 2 intron 3
i = 0 x x x bit: 000 = 0 -> seq1
i = 1 o x x bit: 100 = 1 -> seq2
i = 2 x o x bit: 010 = 2 -> seq3
i = 3 o o x bit: 110 = 3 -> seq4
i = 4 x x o bit: 001 = 4 -> seq5
i = 5 o x o bit: 101 = 5 -> seq6
i = 6 x o o bit: 011 = 6 -> seq7
i = 7 o o o bit: 111 = 7 -> seq8
possibleSeq: seq1 ~ seq8
removed_intron: its element is the ori_intron & overlap_intron which are determined to remove
pre_ori_intron: When slicing the given DNA sequence for getting mature mRNA sequence, we need the y (last index) value of previous ori_intron
for i in range(2 ** len(overlap_intron_location)):
///
new_ori_intron_location = new_ori_intron_location + overlap_intron_location
new_ori_intron_location = quick_sort(new_ori_intron_location)
for l in range(len(new_ori_intron_location)):
if new_ori_intron_location[l] in removed_intron:
continue
else:
if pre_ori_intron != -1:
possibleSeq += seq[pre_ori_intron : new_ori_intron_location[l].getX()]
else:
possibleSeq += seq[: new_ori_intron_location[l].getX()]
pre_ori_intron = new_ori_intron_location[l].getY()
possibleSeq += seq[pre_ori_intron :]
candidate_seqs.append(possibleSeq)
Doing “for” loop for (new_ori_intron_location + overlap_intron_location) after doing quickSort
If intron is involved in removed_intron, continue
pre_ori_intron == -1 means that it is the first intron
possibleSeq = seq[:intron 1’s x value] + seq[intron 1’s y value : intron 2’s x value] + seq[intron 2’s y value : intron 3’s x value] + …. + seq[last intron’s y value :]
proteins = []
for i in range(len(candidate_seqs)):
proteins.append(translate(candidate_seqs[i]))
for i in range(len(proteins)):
print(proteins[i])
f.close()