Merge pull request #21 from BioJulia/2026-03-02-cons

danielle-pinto · web-flow · commit 1d9408d9d7ab · 2026-03-02T21:36:39.000-05:00
Consensus string tutorial
diff --git a/rosalind/10-cons.md b/rosalind/10-cons.md
@@ -0,0 +1,215 @@
++++
+using Dates
+date = Date("2026-03-02")
+title = "Problem 10: Consensus and Profile"
+rss_descr = "Solving Rosalind problem CONS — finding a consensus string from a collection of DNA strings — using base Julia, DataFrames, and matrix operations"
++++
+
+# Consensus and Profile
+
+🤔 [Problem link](https://rosalind.info/problems/cons/)
+
+> **The Problem**
+>
+>   A matrix is a rectangular table of values divided into rows and columns.   
+>   An m×n matrix has m rows and n columns.   
+>   Given a matrix A, we write Ai,j.  
+>   to indicate the value found at the intersection of row i and column j.  
+
+>   Say that we have a collection of DNA strings,   
+>   all having the same length n.   
+>   Their profile matrix is a 4×n matrix P in which P1,    
+>   j represents the number of times that 'A' occurs in the jth position of one of the strings,   
+>   P2,j represents the number of times that C occurs in the jth position,   
+>   and so on (see below).
+
+>   A consensus string c is a string of length n
+>   formed from our collection by taking the most common symbol at each position;   
+>   the jth symbol of c therefore corresponds to the symbol having the maximum value 
+>   in the j-th column of the profile matrix.     
+>   Of course, there may be more than one most common symbol, 
+>   leading to multiple possible consensus strings.  
+>
+>    ### DNA Strings
+>    ```
+>    A T C C A G C T
+>    G G G C A A C T
+>    A T G G A T C T
+>    A A G C A A C C
+>    T T G G A A C T
+>    A T G C C A T T
+>    A T G G C A C T
+>    ```
+>
+>    ### Profile
+>    ```
+>    A   5 1 0 0 5 5 0 0
+>	 C   0 0 1 4 2 0 6 1
+>    G   1 1 6 3 0 1 0 0
+>    T   1 5 0 0 0 1 1 6
+>    ```
+>
+>    ### Consensus	
+>   ```A T G C A A C T```
+>
+>    **Given:**   
+>    A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
+>
+>    **Return:**
+>    A consensus string and profile matrix for the collection. 
+>    (If several possible consensus strings exist,   
+>    then you may return any one of them.)
+>
+>    **Sample Dataset***
+>
+>    ```
+>    >Rosalind_1
+>    ATCCAGCT
+>    >Rosalind_2
+>    GGGCAACT
+>    >Rosalind_3
+>    ATGGATCT
+>    >Rosalind_4
+>    AAGCAACC
+>    >Rosalind_5
+>    TTGGAACT
+>    >Rosalind_6
+>    ATGCCATT
+>    >Rosalind_7
+>    ATGGCACT
+>    ```
+>
+>    **Sample Output** 
+>    ```
+>    ATGCAACT
+>    ```
+> 
+>    ```
+>    A: 5 1 0 0 5 5 0 0
+>    C: 0 0 1 4 2 0 6 1
+>    G: 1 1 6 3 0 1 0 0
+>    T: 1 5 0 0 0 1 1 6
+>   ```
+
+
+The first thing we will need to do is read in the input fasta.  
+In this case, we will not be reading in an actual fasta file,   
+but a set of strings in fasta format.
+If we were reading in an actual fasta file,  
+we could use the [FASTX.jl](https://github.com/BioJulia/FASTX.jl) package to help us with that.  
+
+Since the task required here is something that was already demonstrated in the [GC-content tutorial](./05-gc.md),  
+we can borrow the function from that tutorial.
+
+```julia
+
+fake_file = IOBuffer("""
+    >Rosalind_1
+    ATCCAGCT
+    >Rosalind_2
+    GGGCAACT
+    >Rosalind_3
+    ATGGATCT
+    >Rosalind_4
+    AAGCAACC
+    >Rosalind_5
+    TTGGAACT
+    >Rosalind_6
+    ATGCCATT
+    >Rosalind_7
+    ATGGCACT
+    """
+)
+
+function parse_fasta(buffer)
+    records = [] # this is a Vector of type `Any`
+    record_name = ""
+    sequence = ""
+    for line in eachline(buffer)
+        if startswith(line, ">")
+            !isempty(record_name) && push!(records, (record_name, sequence))
+            record_name = lstrip(line, '>')
+            sequence = ""
+        else
+            sequence *= line
+        end
+    end
+    push!(records, (record_name, sequence))
+    return records
+end
+
+records = parse_fasta(fake_file)
+```
+
+Once the fasta is read in, we can iterate over each sequence/record and store its nucleotide sequence in a data matrix.
+
+From there, we can generate the profile matrix.  
+We'll need to sum the number of times each nucleotide appears at a particular column of the data matrix.  
+
+Then, we can identify the most common nucleotide at each column of the data matrix,   
+which represent each index of the consensus string.  
+After doing this for all columns of the data matrix,   
+we can generate the consensus string.  
+
+
+```julia
+function consensus(fasta_string)
+    # extract strings from fasta
+    records = parse_fasta(fasta_string)
+
+    # make a vector of sequence strings
+    data_vector = last.(records)
+
+    # convert data_vector to matrix where each column is a character position and each row is a string
+    data_matrix = reduce(vcat, permutedims.(collect.(data_vector)))
+
+    # make profile matrix: (num_strings × n) of Chars
+    # profile is 4×n (each row corresponds to A,C,G,T)
+    nucs = ['A', 'C', 'G', 'T']
+    profile = reduce(vcat, (sum(data_matrix .== nuc, dims=1) for nuc in nucs))
+
+    # compute the consensus string
+    consensus_string = join([nucs[argmax(@view profile[:, j])] for j in 1:size(profile, 2)])
+    return(consensus_string)
+end
+
+consensus(fake_file)
+```
+
+As mentioned in the problem description above,   
+it is possible that there can be multiple consensus strings,    
+as some nucleotides may appear the same number of times  
+in each column of the data matrix. 
+
+If this is the case,   
+the function we are using (`argmax`) returns the nucleotide with the most occurrences that it first encounters. 
+
+The way our function is written,  
+we first scan for 'A', 'C', then 'G' and 'T',   
+so the final consensus string will be biased towards more A's, then C's, G's and T's.  
+This is simply based on which nucleotide counts it will encounter first in the profile matrix.
+
+In the example below, there are equal number of sequences that are all `A`'s and `G`'s,  
+so the consensus string could be either `AAAAAAAA` or `GGGGGGGG`.   
+However, because our solution scans for `A` first,   
+the consensus string returned will be `AAAAAAAA`.
+
+```julia
+fake_file2 = IOBuffer("""
+    >Rosalind_1
+    AAAAAAAA
+    >Rosalind_2
+    AAAAAAAA
+    >Rosalind_3
+    AAAAAAAA
+    >Rosalind_4
+    GGGGGGGG
+    >Rosalind_5
+    GGGGGGGG
+    >Rosalind_6
+    GGGGGGGG
+    """
+)
+
+consensus(fake_file2)
+```