Consensus and Profile

A problem from rosalind “Bioinformatics Stronghold” category, String Algorithm

  • A problem from rosalind “Bioinformatics Stronghold” category
    Rosalind Problem Link

  • Description
    • Given several FASTA format data, make the profile and find the consensus
  • Input example


>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT


  • Output example


ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

Biological Background

  • We want to analyze & compare several sequences (same length)
    • They can be motifs from several organisms
  • Making the Profile can make it easy to analyze these
    • Profile : Each symbol’s frequency on ith location (4 x (length of sequence) matrix)
  • Consensus : sequence of symbols that have maximum frequency on each location

My Solution


nucleotides = {'A' : 0, 'C' : 1, 'G' : 2, 'T' : 3}
reverse_nt = {0 : 'A', 1 : 'C', 2 : 'G', 3 : 'T'}
def create_profile_consensus(seqs):
    res1 = [[0 for i in range(len(seqs[0]))] for i in range(4)]
    res2 = ""

    for i in range(len(seqs[0])):
        for j in range(len(seqs)):           
            nt_index = nucleotides[seqs[j][i]]
            res1[nt_index][i] += 1
    for i in range(len(seqs[0])):
        max_symbol = reverse_nt[0]
        max_freq = res1[0][i]
        for j in range(4):
            if max_freq < res1[j][i]:
                max_symbol = reverse_nt[j]
                max_freq = res1[j][i]
        res2 += max_symbol


    return res1, res2


file_name = input()
f = open(file_name, 'r')


Seqs = []
res = ""
for line in f.readlines():
    if line[0] == ">":
        if res != "":
            Seqs.append(res)
            res = ""
        continue
    else:
        res += line[:-1]

Seqs.append(res)

profile, consensus = create_profile_consensus(Seqs)

print(consensus)

for i in range(len(profile)):
    print(reverse_nt[i] + ":", end="")
    for j in range(len(profile[i])):
        print(" " + str(profile[i][j]), end="")
    print()


  • Code Analysis


def create_profile_consensus(seqs):
    res1 = [[0 for i in range(len(seqs[0]))] for i in range(4)]
    res2 = ""

    for i in range(len(seqs[0])):
        for j in range(len(seqs)):           
            nt_index = nucleotides[seqs[j][i]]
            res1[nt_index][i] += 1
    for i in range(len(seqs[0])):
        max_symbol = reverse_nt[0]
        max_freq = res1[0][i]
        for j in range(4):
            if max_freq < res1[j][i]:
                max_symbol = reverse_nt[j]
                max_freq = res1[j][i]
        res2 += max_symbol


    return res1, res2


res1 : profile
res2 : consensus sequence


file_name = input()
f = open(file_name, 'r')


Seqs = []
res = ""
for line in f.readlines():
    if line[0] == ">":
        if res != "":
            Seqs.append(res)
            res = ""
        continue
    else:
        res += line[:-1]

Seqs.append(res)

Distinguish between “> ID” and sequence
line[-1] == ‘\n’


profile, consensus = create_profile_consensus(Seqs)

print(consensus)

for i in range(len(profile)):
    print(reverse_nt[i] + ":", end="")
    for j in range(len(profile[i])):
        print(" " + str(profile[i][j]), end="")
    print()

print out the consensus and profile

Best answer

answer


© 2017. All rights reserved.

Powered by Hydejack v조현진