Skip to content

Commit de6cb4b

Browse files
add tutorial
1 parent 4fa71a7 commit de6cb4b

File tree

1 file changed

+211
-0
lines changed

1 file changed

+211
-0
lines changed

rosalind/10-cons.md

Lines changed: 211 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,211 @@
1+
# Consensus and Profile
2+
3+
🤔 [Problem link](https://rosalind.info/problems/cons/)
4+
5+
!!! warning "The Problem".
6+
7+
A matrix is a rectangular table of values divided into rows and columns.
8+
An m×n matrix has m rows and n columns.
9+
Given a matrix A, we write Ai,j.
10+
to indicate the value found at the intersection of row i and column j.
11+
12+
Say that we have a collection of DNA strings,
13+
all having the same length n.
14+
Their profile matrix is a 4×n matrix P in which P1,
15+
j represents the number of times that 'A' occurs in the jth position of one of the strings,
16+
P2,j represents the number of times that C occurs in the jth position,
17+
and so on (see below).
18+
19+
A consensus string c is a string of length n
20+
formed from our collection by taking the most common symbol at each position;
21+
the jth symbol of c therefore corresponds to the symbol having the maximum value
22+
in the j-th column of the profile matrix.
23+
Of course, there may be more than one most common symbol,
24+
leading to multiple possible consensus strings.
25+
26+
### DNA Strings
27+
A T C C A G C T
28+
G G G C A A C T
29+
A T G G A T C T
30+
A A G C A A C C
31+
T T G G A A C T
32+
A T G C C A T T
33+
A T G G C A C T
34+
35+
### Profile
36+
37+
A 5 1 0 0 5 5 0 0
38+
C 0 0 1 4 2 0 6 1
39+
G 1 1 6 3 0 1 0 0
40+
T 1 5 0 0 0 1 1 6
41+
42+
Consensus A T G C A A C T
43+
44+
Given:
45+
A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
46+
47+
Return:
48+
A consensus string and profile matrix for the collection.
49+
(If several possible consensus strings exist,
50+
then you may return any one of them.)
51+
52+
Sample Dataset
53+
>Rosalind_1
54+
ATCCAGCT
55+
>Rosalind_2
56+
GGGCAACT
57+
>Rosalind_3
58+
ATGGATCT
59+
>Rosalind_4
60+
AAGCAACC
61+
>Rosalind_5
62+
TTGGAACT
63+
>Rosalind_6
64+
ATGCCATT
65+
>Rosalind_7
66+
ATGGCACT
67+
68+
Sample Output
69+
ATGCAACT
70+
A: 5 1 0 0 5 5 0 0
71+
C: 0 0 1 4 2 0 6 1
72+
G: 1 1 6 3 0 1 0 0
73+
T: 1 5 0 0 0 1 1 6
74+
75+
76+
The first thing we will need to do is read in the input fasta.
77+
In this case, we will not be reading in an actual fasta file,
78+
but a set of strings in fasta format.
79+
If we were reading in an actual fasta file,
80+
we could use the [FASTX.jl](https://github.com/BioJulia/FASTX.jl) package to help us with that.
81+
82+
Since the task required here is something that was already demonstrated in the [GC-content tutorial](./05-gc.md),
83+
we can borrow the function from that tutorial.
84+
85+
```julia
86+
87+
fake_file = IOBuffer("""
88+
>Rosalind_1
89+
ATCCAGCT
90+
>Rosalind_2
91+
GGGCAACT
92+
>Rosalind_3
93+
ATGGATCT
94+
>Rosalind_4
95+
AAGCAACC
96+
>Rosalind_5
97+
TTGGAACT
98+
>Rosalind_6
99+
ATGCCATT
100+
>Rosalind_7
101+
ATGGCACT
102+
"""
103+
)
104+
105+
function parse_fasta(buffer)
106+
records = [] # this is a Vector of type `Any`
107+
record_name = ""
108+
sequence = ""
109+
for line in eachline(buffer)
110+
if startswith(line, ">")
111+
!isempty(record_name) && push!(records, (record_name, sequence))
112+
record_name = lstrip(line, '>')
113+
sequence = ""
114+
else
115+
sequence *= line
116+
end
117+
end
118+
push!(records, (record_name, sequence))
119+
return records
120+
end
121+
122+
records = parse_fasta(fake_file)
123+
```
124+
125+
Once the fasta is read in, we can iterate over each sequence/record and store its nucleotide sequence in a data matrix.
126+
127+
From there, we can generate the profile matrix.
128+
We'll need to sum the number of times each nucleotide appears at a particular column of the data matrix.
129+
130+
Then, we can identify the most common nucleotide at each column of the data matrix,
131+
which represent each index of the consensus string.
132+
After doing this for all columns of the data matrix,
133+
we can generate the consensus string.
134+
135+
136+
```julia
137+
using DataFrames
138+
139+
function consensus(fasta_string)
140+
141+
# extract strings from fasta
142+
records = parse_fasta(fasta_string)
143+
144+
# make a vector of sequence strings
145+
data_vector = last.(records)
146+
147+
# convert data_vector to matrix where each column is a character position and each row is a string
148+
data_matrix = reduce(vcat, permutedims.(collect.(data_vector)))
149+
150+
# make profile matrix
151+
consensus_matrix_list = Vector{Int64}[]
152+
for nuc in ['A', 'C', 'G', 'T']
153+
nuc_count = vec(sum(x->x==nuc, data_matrix, dims=1))
154+
push!(consensus_matrix_list, nuc_count)
155+
end
156+
157+
consensus_matrix = vcat(consensus_matrix_list)
158+
159+
# convert matrix to DF and add row names for nucleotides
160+
consensus_df = DataFrame(consensus_matrix, ["A", "C", "G", "T"])
161+
162+
163+
# make column with nucleotide with the max value
164+
# argmax returns the index or key of the first one encountered
165+
nuc_max_df = transform(consensus_df, AsTable(:) => ByRow(argmax) => :MaxColName)
166+
167+
# return consensus string
168+
return join(nuc_max_df.MaxColName)
169+
170+
end
171+
172+
consensus(fake_file)
173+
```
174+
175+
As mentioned in the problem description above,
176+
it is possible that there can be multiple consensus strings,
177+
as some nucleotides may appear the same number of times
178+
in each column of the data matrix.
179+
180+
If this is the case,
181+
the function we are using (`argmax`) returns the nucleotide with the most occurrences that it first encounters.
182+
183+
The way our function is written,
184+
we first scan for 'A', 'C', then 'G' and 'T',
185+
so the final consensus string will be biased towards more A's, then C's, G's and T's.
186+
This is simply based on which nucleotide counts it will encounter first in the profile matrix.
187+
188+
In the example below, there are equal number of sequences that are all `A`'s and `G`'s,
189+
so the consensus string could be either `AAAAAAAA` or `GGGGGGGG`.
190+
However, because our solution scans for `A` first,
191+
the consensus string returned will be `AAAAAAAA`.
192+
193+
```julia
194+
fake_file2 = IOBuffer("""
195+
>Rosalind_1
196+
AAAAAAAA
197+
>Rosalind_2
198+
AAAAAAAA
199+
>Rosalind_3
200+
AAAAAAAA
201+
>Rosalind_4
202+
GGGGGGGG
203+
>Rosalind_5
204+
GGGGGGGG
205+
>Rosalind_6
206+
GGGGGGGG
207+
"""
208+
)
209+
210+
consensus(fake_file2)
211+
```

0 commit comments

Comments
 (0)