Commit 3b086db

add BioSequences translate solution
1 parent f0b421b commit 3b086db

1 file changed

Lines changed: 83 additions & 61 deletions

docs/src/rosalind/08-prot.md

@@ -4,11 +4,15 @@
44

55
!!! warning "The Problem"
66

7-
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.
7+
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet
8+
(all letters except for B, J, O, U, X, and Z).
9+
Protein strings are constructed from these 20 symbols.
10+
Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.
811

912
The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.
1013

11-
Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).
14+
Given: An RNA string s corresponding to a strand of mRNA
15+
(of length at most 10 kbp).
1216

1317
Return: The protein string encoded by s.
1418

@@ -25,68 +29,63 @@
2529
### DIY solution
2630
Let's first tackle this problem by writing our own solution.
2731

28-
First, we will check that this is a coding region by verifying that the string starts with a start codon (`AUG`). If not, we can still convert the string to protein, but we'll throw an error. There may be a frame shift, in which case the returned translation will be incorrect.
29-
30-
We'll also do a check that the string is divisible by three. If it is not, this will likely mean that there was a mutation in the string (addition or deletion). Again, we can still convert as much of the the string as possible. However, we should alert the user that this result may be incorrect!
31-
32-
We need to convert this string of DNA to a string of proteins using the RNA codon table. We can convert the RNA codon table into a dictionary, which can map over our codons.
33-
34-
Then, we'll break the string into codons by slicing at every three characters. These codons can be matched to the strings into the RNA codon table to get the corresponding amino acid. We'll append this amino acid to a string.
35-
36-
We'll need to deal with any three-character strings that don't match a codon. This likely means that there was a mutation in the input DNA string! If we get a codon that doesn't match, we can return "X" for that amino acid, and continue translating the rest of the string. However, if we get a string X's, that will definitely signal to us that there was some kind of frame shift.
37-
38-
Now that we have established an approach, let's turn this into code!
32+
First, we will check that this is a coding region by verifying that the string starts with a start codon (`AUG`).
33+
If not, we can still convert the string to protein,
34+
but we'll throw an error.
35+
There may be a frame shift,
36+
in which case the returned translation will be incorrect.
37+
38+
We'll also check that the length of the string is divisible by three.
39+
If it is not, this will likely mean that there was a mutation in the string
40+
(addition or deletion).
41+
Again, we can still convert as much of the string as possible.
42+
However, we should alert the user that this result may be incorrect!
43+
44+
We need to convert this string of RNA to a protein string using the RNA codon table.
45+
We can convert the RNA codon table into a dictionary,
46+
which can map over our codons.
47+
Alternatively, we could also import this from the BioSequences package,
48+
as this is already defined [there](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L132).
49+
50+
Then, we'll break the string into codons by slicing at every three characters.
51+
These codons can be matched against the entries in the RNA codon table to get the corresponding amino acid.
52+
We'll append this amino acid to a string.
53+
54+
We'll need to deal with any three-character strings that don't match a codon.
55+
This likely means that there was a mutation in the input RNA string!
56+
If we get a codon that doesn't match,
57+
we can return "X" for that amino acid,
58+
and continue translating the rest of the string.
59+
However, if we get a string of X's,
60+
that will definitely signal to us that there was some kind of frame shift.
61+
62+
Now that we have established an approach,
63+
let's turn this into code!
3964

4065
```julia
4166

42-
dna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
67+
rna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
4368

4469
# note: this can be created by hand
4570
# or it can be accessed using
46-
codon_table = rna_codon_table = {
47-
# Phenylalanine (F)
48-
'UUU': 'F', 'UUC': 'F',
49-
# Leucine (L)
50-
'UUA': 'L', 'UUG': 'L', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
51-
# Isoleucine (I)
52-
'AUU': 'I', 'AUC': 'I', 'AUA': 'I',
53-
# Methionine (M) - Start Codon
54-
'AUG': 'M',
55-
# Valine (V)
56-
'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
57-
# Serine (S)
58-
'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'AGU': 'S', 'AGC': 'S',
59-
# Proline (P)
60-
'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
61-
# Threonine (T)
62-
'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
63-
# Alanine (A)
64-
'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
65-
# Tyrosine (Y)
66-
'UAU': 'Y', 'UAC': 'Y',
67-
# Stop Codons (*)
68-
'UAA': '*', 'UAG': '*', 'UGA': '*',
69-
# Histidine (H)
70-
'CAU': 'H', 'CAC': 'H',
71-
# Glutamine (Q)
72-
'CAA': 'Q', 'CAG': 'Q',
73-
# Asparagine (N)
74-
'AAU': 'N', 'AAC': 'N',
75-
# Lysine (K)
76-
'AAA': 'K', 'AAG': 'K',
77-
# Aspartic Acid (D)
78-
'GAU': 'D', 'GAC': 'D',
79-
# Glutamic Acid (E)
80-
'GAA': 'E', 'GAG': 'E',
81-
# Cysteine (C)
82-
'UGU': 'C', 'UGC': 'C',
83-
# Tryptophan (W)
84-
'UGG': 'W',
85-
# Arginine (R)
86-
'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'AGA': 'R', 'AGG': 'R',
87-
# Glycine (G)
88-
'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'
89-
}
71+
codon_table = Dict{String,Char}(
72+
"AAA" => 'K', "AAC" => 'N', "AAG" => 'K', "AAU" => 'N',
73+
"ACA" => 'T', "ACC" => 'T', "ACG" => 'T', "ACU" => 'T',
74+
"AGA" => 'R', "AGC" => 'S', "AGG" => 'R', "AGU" => 'S',
75+
"AUA" => 'I', "AUC" => 'I', "AUG" => 'M', "AUU" => 'I',
76+
"CAA" => 'Q', "CAC" => 'H', "CAG" => 'Q', "CAU" => 'H',
77+
"CCA" => 'P', "CCC" => 'P', "CCG" => 'P', "CCU" => 'P',
78+
"CGA" => 'R', "CGC" => 'R', "CGG" => 'R', "CGU" => 'R',
79+
"CUA" => 'L', "CUC" => 'L', "CUG" => 'L', "CUU" => 'L',
80+
"GAA" => 'E', "GAC" => 'D', "GAG" => 'E', "GAU" => 'D',
81+
"GCA" => 'A', "GCC" => 'A', "GCG" => 'A', "GCU" => 'A',
82+
"GGA" => 'G', "GGC" => 'G', "GGG" => 'G', "GGU" => 'G',
83+
"GUA" => 'V', "GUC" => 'V', "GUG" => 'V', "GUU" => 'V',
84+
"UAA" => '*', "UAC" => 'Y', "UAG" => '*', "UAU" => 'Y',
85+
"UCA" => 'S', "UCC" => 'S', "UCG" => 'S', "UCU" => 'S',
86+
"UGA" => '*', "UGC" => 'C', "UGG" => 'W', "UGU" => 'C',
87+
"UUA" => 'L', "UUC" => 'F', "UUG" => 'L', "UUU" => 'F',
88+
)
9089

9190

9291
# check if starts with start codon
@@ -102,12 +101,35 @@ codon_table = rna_codon_table = {
102101
```
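
The steps described in this section can be put together as a minimal, self-contained sketch. It is not the code from this commit: the names `build_codon_table` and `translate_rna` are hypothetical, the table is built from the standard codon ordering (bases in `UCAG` order) rather than written out by hand, the checks emit warnings instead of throwing errors, and translation is assumed to stop at the first stop codon, as the Rosalind problem expects. The input is assumed to be plain ASCII.

```julia
# Build the standard RNA codon table from the conventional UCAG ordering.
# (Hypothetical helper; equivalent to writing the 64 entries by hand.)
function build_codon_table()
    bases = "UCAG"
    amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    table = Dict{String,Char}()
    i = 1
    # The last loop variable (third base) varies fastest.
    for b1 in bases, b2 in bases, b3 in bases
        table[string(b1, b2, b3)] = amino_acids[i]
        i += 1
    end
    return table
end

# Translate an RNA string codon by codon (hypothetical name).
function translate_rna(rna::AbstractString, table::Dict{String,Char})
    # Assumption: warn rather than error, so a translation is still returned.
    startswith(rna, "AUG") ||
        @warn "Sequence does not start with AUG; the reading frame may be shifted."
    length(rna) % 3 == 0 ||
        @warn "Sequence length is not divisible by three; trailing bases are ignored."

    protein = IOBuffer()
    for i in 1:3:(length(rna) - 2)
        codon = rna[i:i+2]
        aa = get(table, codon, 'X')  # unknown codons become 'X'
        aa == '*' && break           # assumption: stop at the first stop codon
        print(protein, aa)
    end
    return String(take!(protein))
end

rna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
println(translate_rna(rna, build_codon_table()))  # prints MAMAPRTEINSTRING
```

Using `get` with a default of `'X'` keeps the loop going when a codon is not in the table, so a run of X's in the output remains visible as a hint that the frame is shifted.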
103102

104103

105-
### Biojulia Solution
104+
### BioSequences Solution
106105

107-
An alternative way to approach this problem would be to leverage an already written, established function from BioJulia.
106+
An alternative way to approach this problem would be to leverage an already written,
107+
established function from the BioSequences package in BioJulia.
108108

109109
```julia
110+
using BioSequences
111+
112+
rna = rna"AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
113+
114+
translate(rna)
115+
116+
```
117+
118+
This function is straightforward to use.
119+
However, there are also additional parameters for us to use.
120+
121+
For instance, the function defaults to using the standard genetic code.
122+
However, if a user wishes to use another codon chart
123+
(for example, yeast or invertebrate),
124+
there are others available on [BioSequences.jl](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L130) to choose from.
110125

126+
By default, `allow_ambiguous_codons` is `true`.
127+
With this setting, if the mRNA string contains ambiguous codons that are not found in the standard genetic code,
128+
these codons will be translated to the narrowest amino acid which covers all
129+
non-ambiguous codons encompassed by the ambiguous codon.
130+
If it is set to `false`, ambiguous codons will cause an error.
111131

132+
Additionally, `alternative_start` is `false` by default.
133+
If set to `true`, the first codon will always be translated to methionine, regardless of which codon it is.
112134

113-
```
135+
Similar to our function, the BioSequences function also throws an error if the length of the input mRNA string is not divisible by 3.
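
The keyword arguments discussed above could be exercised roughly as follows. This is a sketch, not part of the commit. It assumes a BioSequences release in which `translate` accepts the `code`, `allow_ambiguous_codons`, and `alternative_start` keywords, and in which `BioSequences.ncbi_trans_table` gives access to the alternative genetic codes by NCBI table number (table 3 being yeast mitochondrial). The short example sequences are made up for illustration.

```julia
using BioSequences

seq = rna"AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"

# Default settings: standard genetic code; the stop codon appears as `*`.
println(translate(seq))

# Assumption: alternative codes are looked up by their NCBI table number.
yeast_mito = BioSequences.ncbi_trans_table[3]
println(translate(seq, code = yeast_mito))

# GCN is ambiguous, but every GCx codon encodes alanine,
# so it can still be resolved to a single amino acid.
println(translate(rna"AUGGCN"))

# With ambiguity disallowed, the same input is expected to raise an error.
try
    translate(rna"AUGGCN", allow_ambiguous_codons = false)
catch err
    println("translation failed: ", typeof(err))
end

# Treat the first codon as a start codon even though it is not AUG.
println(translate(rna"GUGGCC", alternative_start = true))
```

Unlike the DIY version, `translate` returns an amino acid sequence object rather than a plain `String`, and it does not stop at the first stop codon.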
