Skip to content

Commit 1d9408d

Browse files
Merge pull request #21 from BioJulia/2026-03-02-cons
Consensus string tutorial
2 parents 4fa71a7 + 23e4bec commit 1d9408d

File tree

1 file changed

+215
-0
lines changed

1 file changed

+215
-0
lines changed

rosalind/10-cons.md

Lines changed: 215 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,215 @@
1+
+++
2+
using Dates
3+
date = Date("2026-03-02")
4+
title = "Problem 10: Consensus and Profile"
5+
rss_descr = "Solving Rosalind problem CONS — finding a consensus string from a collection of DNA strings — using base Julia, DataFrames, and matrix operations"
6+
+++
7+
8+
# Consensus and Profile
9+
10+
🤔 [Problem link](https://rosalind.info/problems/cons/)
11+
12+
> **The Problem**
13+
>
14+
> A matrix is a rectangular table of values divided into rows and columns.
15+
> An m×n matrix has m rows and n columns.
16+
> Given a matrix A, we write Ai,j.
17+
> to indicate the value found at the intersection of row i and column j.
18+
19+
> Say that we have a collection of DNA strings,
20+
> all having the same length n.
21+
> Their profile matrix is a 4×n matrix P in which P1,
22+
> j represents the number of times that 'A' occurs in the jth position of one of the strings,
23+
> P2,j represents the number of times that C occurs in the jth position,
24+
> and so on (see below).
25+
26+
> A consensus string c is a string of length n
27+
> formed from our collection by taking the most common symbol at each position;
28+
> the jth symbol of c therefore corresponds to the symbol having the maximum value
29+
> in the j-th column of the profile matrix.
30+
> Of course, there may be more than one most common symbol,
31+
> leading to multiple possible consensus strings.
32+
>
33+
> ### DNA Strings
34+
> ```
35+
> A T C C A G C T
36+
> G G G C A A C T
37+
> A T G G A T C T
38+
> A A G C A A C C
39+
> T T G G A A C T
40+
> A T G C C A T T
41+
> A T G G C A C T
42+
> ```
43+
>
44+
> ### Profile
45+
> ```
46+
> A 5 1 0 0 5 5 0 0
47+
> C 0 0 1 4 2 0 6 1
48+
> G 1 1 6 3 0 1 0 0
49+
> T 1 5 0 0 0 1 1 6
50+
> ```
51+
>
52+
> ### Consensus
53+
> ```A T G C A A C T```
54+
>
55+
> **Given:**
56+
> A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
57+
>
58+
> **Return:**
59+
> A consensus string and profile matrix for the collection.
60+
> (If several possible consensus strings exist,
61+
> then you may return any one of them.)
62+
>
63+
> **Sample Dataset***
64+
>
65+
> ```
66+
> >Rosalind_1
67+
> ATCCAGCT
68+
> >Rosalind_2
69+
> GGGCAACT
70+
> >Rosalind_3
71+
> ATGGATCT
72+
> >Rosalind_4
73+
> AAGCAACC
74+
> >Rosalind_5
75+
> TTGGAACT
76+
> >Rosalind_6
77+
> ATGCCATT
78+
> >Rosalind_7
79+
> ATGGCACT
80+
> ```
81+
>
82+
> **Sample Output**
83+
> ```
84+
> ATGCAACT
85+
> ```
86+
>
87+
> ```
88+
> A: 5 1 0 0 5 5 0 0
89+
> C: 0 0 1 4 2 0 6 1
90+
> G: 1 1 6 3 0 1 0 0
91+
> T: 1 5 0 0 0 1 1 6
92+
> ```
93+
94+
95+
The first thing we will need to do is read in the input fasta.
96+
In this case, we will not be reading in an actual fasta file,
97+
but a set of strings in fasta format.
98+
If we were reading in an actual fasta file,
99+
we could use the [FASTX.jl](https://github.com/BioJulia/FASTX.jl) package to help us with that.
100+
101+
Since the task required here is something that was already demonstrated in the [GC-content tutorial](./05-gc.md),
102+
we can borrow the function from that tutorial.
103+
104+
```julia
105+
106+
fake_file = IOBuffer("""
107+
>Rosalind_1
108+
ATCCAGCT
109+
>Rosalind_2
110+
GGGCAACT
111+
>Rosalind_3
112+
ATGGATCT
113+
>Rosalind_4
114+
AAGCAACC
115+
>Rosalind_5
116+
TTGGAACT
117+
>Rosalind_6
118+
ATGCCATT
119+
>Rosalind_7
120+
ATGGCACT
121+
"""
122+
)
123+
124+
function parse_fasta(buffer)
125+
records = [] # this is a Vector of type `Any`
126+
record_name = ""
127+
sequence = ""
128+
for line in eachline(buffer)
129+
if startswith(line, ">")
130+
!isempty(record_name) && push!(records, (record_name, sequence))
131+
record_name = lstrip(line, '>')
132+
sequence = ""
133+
else
134+
sequence *= line
135+
end
136+
end
137+
push!(records, (record_name, sequence))
138+
return records
139+
end
140+
141+
records = parse_fasta(fake_file)
142+
```
143+
144+
Once the fasta is read in, we can iterate over each sequence/record and store its nucleotide sequence in a data matrix.
145+
146+
From there, we can generate the profile matrix.
147+
We'll need to sum the number of times each nucleotide appears at a particular column of the data matrix.
148+
149+
Then, we can identify the most common nucleotide at each column of the data matrix,
150+
which represent each index of the consensus string.
151+
After doing this for all columns of the data matrix,
152+
we can generate the consensus string.
153+
154+
155+
```julia
156+
function consensus(fasta_string)
157+
# extract strings from fasta
158+
records = parse_fasta(fasta_string)
159+
160+
# make a vector of sequence strings
161+
data_vector = last.(records)
162+
163+
# convert data_vector to matrix where each column is a character position and each row is a string
164+
data_matrix = reduce(vcat, permutedims.(collect.(data_vector)))
165+
166+
# make profile matrix: (num_strings × n) of Chars
167+
# profile is 4×n (each row corresponds to A,C,G,T)
168+
nucs = ['A', 'C', 'G', 'T']
169+
profile = reduce(vcat, (sum(data_matrix .== nuc, dims=1) for nuc in nucs))
170+
171+
# compute the consensus string
172+
consensus_string = join([nucs[argmax(@view profile[:, j])] for j in 1:size(profile, 2)])
173+
return(consensus_string)
174+
end
175+
176+
consensus(fake_file)
177+
```
178+
179+
As mentioned in the problem description above,
180+
it is possible that there can be multiple consensus strings,
181+
as some nucleotides may appear the same number of times
182+
in each column of the data matrix.
183+
184+
If this is the case,
185+
the function we are using (`argmax`) returns the nucleotide with the most occurrences that it first encounters.
186+
187+
The way our function is written,
188+
we first scan for 'A', 'C', then 'G' and 'T',
189+
so the final consensus string will be biased towards more A's, then C's, G's and T's.
190+
This is simply based on which nucleotide counts it will encounter first in the profile matrix.
191+
192+
In the example below, there are equal number of sequences that are all `A`'s and `G`'s,
193+
so the consensus string could be either `AAAAAAAA` or `GGGGGGGG`.
194+
However, because our solution scans for `A` first,
195+
the consensus string returned will be `AAAAAAAA`.
196+
197+
```julia
198+
fake_file2 = IOBuffer("""
199+
>Rosalind_1
200+
AAAAAAAA
201+
>Rosalind_2
202+
AAAAAAAA
203+
>Rosalind_3
204+
AAAAAAAA
205+
>Rosalind_4
206+
GGGGGGGG
207+
>Rosalind_5
208+
GGGGGGGG
209+
>Rosalind_6
210+
GGGGGGGG
211+
"""
212+
)
213+
214+
consensus(fake_file2)
215+
```

0 commit comments

Comments
 (0)