|
| 1 | ++++ |
| 2 | +using Dates |
| 3 | +date = Date("2026-03-02") |
| 4 | +title = "Problem 10: Consensus and Profile" |
| 5 | +rss_descr = "Solving Rosalind problem CONS — finding a consensus string from a collection of DNA strings — using base Julia, DataFrames, and matrix operations" |
| 6 | ++++ |
| 7 | + |
1 | 8 | # Consensus and Profile |
2 | 9 |
|
3 | 10 | 🤔 [Problem link](https://rosalind.info/problems/cons/) |
4 | 11 |
|
5 | | -!!! warning "The Problem". |
6 | | - |
7 | | - A matrix is a rectangular table of values divided into rows and columns. |
8 | | - An m×n matrix has m rows and n columns. |
9 | | - Given a matrix A, we write Ai,j. |
10 | | - to indicate the value found at the intersection of row i and column j. |
11 | | - |
12 | | - Say that we have a collection of DNA strings, |
13 | | - all having the same length n. |
14 | | - Their profile matrix is a 4×n matrix P in which P1, |
15 | | - j represents the number of times that 'A' occurs in the jth position of one of the strings, |
16 | | - P2,j represents the number of times that C occurs in the jth position, |
17 | | - and so on (see below). |
18 | | - |
19 | | - A consensus string c is a string of length n |
20 | | - formed from our collection by taking the most common symbol at each position; |
21 | | - the jth symbol of c therefore corresponds to the symbol having the maximum value |
22 | | - in the j-th column of the profile matrix. |
23 | | - Of course, there may be more than one most common symbol, |
24 | | - leading to multiple possible consensus strings. |
25 | | - |
26 | | - ### DNA Strings |
27 | | - A T C C A G C T |
28 | | - G G G C A A C T |
29 | | - A T G G A T C T |
30 | | - A A G C A A C C |
31 | | - T T G G A A C T |
32 | | - A T G C C A T T |
33 | | - A T G G C A C T |
34 | | - |
35 | | - ### Profile |
36 | | - |
37 | | - A 5 1 0 0 5 5 0 0 |
38 | | - C 0 0 1 4 2 0 6 1 |
39 | | - G 1 1 6 3 0 1 0 0 |
40 | | - T 1 5 0 0 0 1 1 6 |
41 | | - |
42 | | - Consensus A T G C A A C T |
43 | | - |
44 | | - Given: |
45 | | - A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format. |
46 | | - |
47 | | - Return: |
48 | | - A consensus string and profile matrix for the collection. |
49 | | - (If several possible consensus strings exist, |
50 | | - then you may return any one of them.) |
51 | | - |
52 | | - Sample Dataset |
53 | | - >Rosalind_1 |
54 | | - ATCCAGCT |
55 | | - >Rosalind_2 |
56 | | - GGGCAACT |
57 | | - >Rosalind_3 |
58 | | - ATGGATCT |
59 | | - >Rosalind_4 |
60 | | - AAGCAACC |
61 | | - >Rosalind_5 |
62 | | - TTGGAACT |
63 | | - >Rosalind_6 |
64 | | - ATGCCATT |
65 | | - >Rosalind_7 |
66 | | - ATGGCACT |
67 | | - |
68 | | - Sample Output |
69 | | - ATGCAACT |
70 | | - A: 5 1 0 0 5 5 0 0 |
71 | | - C: 0 0 1 4 2 0 6 1 |
72 | | - G: 1 1 6 3 0 1 0 0 |
73 | | - T: 1 5 0 0 0 1 1 6 |
| 12 | +> **The Problem** |
| 13 | +> |
| 14 | +> A matrix is a rectangular table of values divided into rows and columns. |
| 15 | +> An m×n matrix has m rows and n columns. |
| 16 | +> Given a matrix A, we write Ai,j |
| 17 | +> to indicate the value found at the intersection of row i and column j. |
| 18 | +> |
| 19 | +> Say that we have a collection of DNA strings, |
| 20 | +> all having the same length n. |
| 21 | +> Their profile matrix is a 4×n matrix P in which P1,j |
| 22 | +> represents the number of times that 'A' occurs in the jth position of one of the strings, |
| 23 | +> P2,j represents the number of times that 'C' occurs in the jth position, |
| 24 | +> and so on (see below). |
| 25 | +> |
| 26 | +> A consensus string c is a string of length n |
| 27 | +> formed from our collection by taking the most common symbol at each position; |
| 28 | +> the jth symbol of c therefore corresponds to the symbol having the maximum value |
| 29 | +> in the jth column of the profile matrix. |
| 30 | +> Of course, there may be more than one most common symbol, |
| 31 | +> leading to multiple possible consensus strings. |
| 32 | +> |
| 33 | +> ### DNA Strings |
| 34 | +> ``` |
| 35 | +> A T C C A G C T |
| 36 | +> G G G C A A C T |
| 37 | +> A T G G A T C T |
| 38 | +> A A G C A A C C |
| 39 | +> T T G G A A C T |
| 40 | +> A T G C C A T T |
| 41 | +> A T G G C A C T |
| 42 | +> ``` |
| 43 | +> |
| 44 | +> ### Profile |
| 45 | +> ``` |
| 46 | +> A 5 1 0 0 5 5 0 0 |
| 47 | +> C 0 0 1 4 2 0 6 1 |
| 48 | +> G 1 1 6 3 0 1 0 0 |
| 49 | +> T 1 5 0 0 0 1 1 6 |
| 50 | +> ``` |
| 51 | +> |
| 52 | +> ### Consensus |
| 53 | +> ```A T G C A A C T``` |
| 54 | +> |
| 55 | +> **Given:** |
| 56 | +> A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format. |
| 57 | +> |
| 58 | +> **Return:** |
| 59 | +> A consensus string and profile matrix for the collection. |
| 60 | +> (If several possible consensus strings exist, |
| 61 | +> then you may return any one of them.) |
| 62 | +> |
| 63 | +> **Sample Dataset** |
| 64 | +> |
| 65 | +> ``` |
| 66 | +> >Rosalind_1 |
| 67 | +> ATCCAGCT |
| 68 | +> >Rosalind_2 |
| 69 | +> GGGCAACT |
| 70 | +> >Rosalind_3 |
| 71 | +> ATGGATCT |
| 72 | +> >Rosalind_4 |
| 73 | +> AAGCAACC |
| 74 | +> >Rosalind_5 |
| 75 | +> TTGGAACT |
| 76 | +> >Rosalind_6 |
| 77 | +> ATGCCATT |
| 78 | +> >Rosalind_7 |
| 79 | +> ATGGCACT |
| 80 | +> ``` |
| 81 | +> |
| 82 | +> **Sample Output** |
| 83 | +> ``` |
| 84 | +> ATGCAACT |
| 85 | +> A: 5 1 0 0 5 5 0 0 |
| 86 | +> C: 0 0 1 4 2 0 6 1 |
| 87 | +> G: 1 1 6 3 0 1 0 0 |
| 88 | +> T: 1 5 0 0 0 1 1 6 |
| 89 | +> ``` |
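
To make the definitions in the problem statement concrete, here is a minimal sketch in base Julia that rebuilds the sample profile and consensus directly from the seven sample strings. The variable names (`seqs`, `bases`, `profile`, `consensus`) are illustrative assumptions and not necessarily what the rest of this post uses. Ties are broken in favor of the earliest base in `A, C, G, T` order, which the problem allows.

```julia
# Sample strings copied from the problem statement above.
seqs = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
        "TTGGAACT", "ATGCCATT", "ATGGCACT"]

bases = ['A', 'C', 'G', 'T']   # row order of the profile matrix
n = length(first(seqs))        # all strings share the same length

# profile[i, j] counts how often bases[i] appears at position j.
profile = [count(s -> s[j] == bases[i], seqs) for i in 1:4, j in 1:n]

# argmax on a column returns the first row holding the maximum,
# so ties resolve to the earliest base in `bases`.
consensus = join(bases[argmax(col)] for col in eachcol(profile))

# Print in the format of the sample output.
println(consensus)
for (i, b) in enumerate(bases)
    println(b, ": ", join(profile[i, :], " "))
end
```

Indexing with `s[j]` is safe here only because DNA strings are plain ASCII; for general Unicode text, collecting the characters first would be the more careful choice.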
74 | 90 |
|
75 | 91 |
|
76 | 92 | The first thing we need to do is read in the input FASTA file. |
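
One way to do this with only base Julia is sketched below. `read_fasta` is a hypothetical helper written for illustration, not a function from a package and not necessarily the approach this post goes on to use. It assumes each record starts with a `>` header line and that a sequence may span several lines.

```julia
# Hypothetical helper: parse FASTA text into header => sequence pairs
# using only base Julia.
function read_fasta(io::IO)
    records = Pair{String,String}[]
    header = nothing          # no record has been opened yet
    chunks = String[]         # sequence lines of the current record
    for line in eachline(io)
        line = strip(line)
        isempty(line) && continue
        if startswith(line, ">")
            # A new header closes the previous record, if there is one.
            header === nothing || push!(records, header => join(chunks))
            header = String(line[2:end])
            empty!(chunks)
        else
            push!(chunks, String(line))
        end
    end
    # Close the final record.
    header === nothing || push!(records, header => join(chunks))
    return records
end

# Usage on the first two sample records, read from an in-memory buffer.
records = read_fasta(IOBuffer(">Rosalind_1\nATCCAGCT\n>Rosalind_2\nGGGCAACT\n"))
seqs = last.(records)         # just the sequences, in file order
```

To read from disk instead, the same function can be called as `open(read_fasta, path)`, where `path` is wherever the Rosalind dataset was saved.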
|