File: docs/src/rosalind/08-prot.md
Lines changed: 83 additions & 61 deletions
@@ -4,11 +4,15 @@

 !!! warning "The Problem"

-The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.
+The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet
+(all letters except for B, J, O, U, X, and Z).
+Protein strings are constructed from these 20 symbols.
+Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

 The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

-Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).
+Given: An RNA string s corresponding to a strand of mRNA
+(of length at most 10 kbp).

 Return: The protein string encoded by s.

@@ -25,68 +29,63 @@
 ### DIY solution
 Let's first tackle this problem by writing our own solution.

-First, we will check that this is a coding region by verifying that the string starts with a start codon (`AUG`). If not, we can still convert the string to protein, but we'll throw an error. There may be a frame shift, in which case the returned translation will be incorrect.
-
-We'll also do a check that the string is divisible by three. If it is not, this will likely mean that there was a mutation in the string (addition or deletion). Again, we can still convert as much of the string as possible. However, we should alert the user that this result may be incorrect!
-
-We need to convert this RNA string to a protein string using the RNA codon table. We can convert the RNA codon table into a dictionary, which can map over our codons.
-
-Then, we'll break the string into codons by slicing at every three characters. These codons can be matched to the strings in the RNA codon table to get the corresponding amino acid. We'll append this amino acid to a string.
-
-We'll need to deal with any three-character strings that don't match a codon. This likely means that there was a mutation in the input RNA string! If we get a codon that doesn't match, we can return "X" for that amino acid, and continue translating the rest of the string. However, if we get a string of X's, that will definitely signal to us that there was some kind of frame shift.
-
-Now that we have established an approach, let's turn this into code!
+First, we will check that this is a coding region by verifying that the string starts with a start codon (`AUG`).
+If not, we can still convert the string to protein,
+but we'll throw an error.
+There may be a frame shift,
+in which case the returned translation will be incorrect.
+
+We'll also check that the length of the string is divisible by three.
+If it is not, this will likely mean that there was a mutation in the string
+(addition or deletion).
+Again, we can still convert as much of the string as possible.
+However, we should alert the user that this result may be incorrect!
+
+We need to convert this RNA string to a protein string using the RNA codon table.
+We can convert the RNA codon table into a dictionary,
+which we can use to look up each of our codons.
+Alternatively, we could import this from the BioSequences package,
+as this is already defined [there](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L132).
+
+Then, we'll break the string into codons by slicing at every three characters.
+These codons can be matched to the strings in the RNA codon table to get the corresponding amino acid.
+We'll append this amino acid to a string.
+
+We'll need to deal with any three-character strings that don't match a codon.
+This likely means that there was a mutation in the input RNA string!
+If we get a codon that doesn't match,
+we can return "X" for that amino acid,
+and continue translating the rest of the string.
+However, if we get a string of X's,
+that will be a strong signal that there was some kind of frame shift.
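
The approach described in this hunk can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, and `translate_mrna` are hypothetical, the sketch emits warnings rather than throwing errors so that translation can continue, and it assumes translation should stop at the first stop codon, which the text above does not specify.

```julia
# Standard genetic code, listed in the conventional U, C, A, G order.
# '*' marks a stop codon. (Assumption: the standard code is the one wanted.)
const BASES = "UCAG"
const AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Dictionary mapping each three-letter RNA codon to its amino acid.
const CODON_TABLE = Dict(
    string(b1, b2, b3) => AMINO_ACIDS[16 * (i - 1) + 4 * (j - 1) + k]
    for (i, b1) in enumerate(BASES)
    for (j, b2) in enumerate(BASES)
    for (k, b3) in enumerate(BASES)
)

# Hypothetical helper: translate an ASCII mRNA string into a protein string.
function translate_mrna(rna::AbstractString)
    # Alert the user if the string does not look like a coding region.
    startswith(rna, "AUG") || @warn "Sequence does not start with AUG; the frame may be shifted."
    # Alert the user if there are leftover bases that cannot form a codon.
    length(rna) % 3 == 0 || @warn "Length is not divisible by three; trailing bases are ignored."

    protein = IOBuffer()
    for i in 1:3:(length(rna) - 2)
        codon = rna[i:i+2]                # slice out one codon
        aa = get(CODON_TABLE, codon, 'X') # unknown codons become 'X'
        aa == '*' && break                # assumption: stop at the first stop codon
        print(protein, aa)
    end
    return String(take!(protein))
end

translate_mrna("AUGGCCUGA")  # made-up input; returns "MA"
```

Writing into an `IOBuffer` avoids repeatedly reallocating a string inside the loop; appending with `*` would also work for inputs of this size.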
 However, there are also additional parameters for us to use.
+
+For instance, the function defaults to using the standard genetic code.
+However, if a user wishes to use another codon chart
+(for example, yeast or invertebrate),
+there are others available on [BioSequences.jl](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L130) to choose from.

+By default, `allow_ambiguous_codons` is `true`.
+If a user gives the function an mRNA string with ambiguous codons that may not be found in the standard genetic code,
+these codons will be translated to the narrowest amino acid that covers all
+non-ambiguous codons encompassed by the ambiguous codon.
+If set to `false`, ambiguous codons will cause an error.

+Additionally, `alternative_start` is `false` by default.
+If set to `true`, the first codon will be translated to methionine regardless of which codon it is.

-```
+Similar to our function, the BioSequences function also throws an error if the length of the input mRNA string is not divisible by 3.
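
The keyword arguments described above could be exercised along these lines. This is a hedged sketch assuming BioSequences.jl v3, where `translate` accepts `code`, `allow_ambiguous_codons`, and `alternative_start` as keywords; the input sequences are made up for illustration, and the exact name of the non-standard genetic code constant should be checked against the linked `geneticcode.jl`.

```julia
using BioSequences

mrna = rna"AUGGCCUGA"  # made-up example sequence

# Default call: standard genetic code, ambiguous codons allowed.
translate(mrna)

# Use a different genetic code (assumed constant name; see geneticcode.jl).
translate(mrna; code = BioSequences.yeast_mitochondrial_genetic_code)

# Reject ambiguous codons instead of translating them to an ambiguous amino acid.
translate(mrna; allow_ambiguous_codons = false)

# Treat the first codon as methionine even though it is not AUG.
translate(rna"CUGGCC"; alternative_start = true)
```

Unlike the DIY sketch, `translate` does not stop at a stop codon, so the returned amino acid sequence includes the terminator symbol.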