Commit 9ad11e7

Merge pull request #17 from BioJulia/2026-01-31-prot
Translating RNA into Protein
2 parents b030ede + aaef36b commit 9ad11e7

2 files changed: 207 additions and 9 deletions

Project.toml

Lines changed: 0 additions & 9 deletions
This file was deleted.

docs/src/rosalind/08-prot.md

Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
1+
# Translating RNA into Protein
2+
3+
🤔 [Problem link](https://rosalind.info/problems/prot/)
4+
5+
!!! warning "The Problem"
6+
7+
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet
8+
(all letters except for B, J, O, U, X, and Z).
9+
Protein strings are constructed from these 20 symbols.
10+
Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.
11+
12+
The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.
13+
14+
Given: An RNA string s corresponding to a strand of mRNA
15+
(of length at most 10 kbp).
16+
17+
Return: The protein string encoded by s.
18+
19+
Sample Dataset
20+
```
21+
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
22+
```
23+
24+
Sample Output
25+
```
26+
MAMAPRTEINSTRING
27+
```
28+
29+
### DIY solution
30+
Let's tackle this problem by writing our own solution,
31+
and then seeing how we can solve it with functions already available in BioJulia.
32+
33+
First, we will check that this is a coding region by verifying that the string starts with a start codon (`AUG`).
34+
If not, we can still convert the string to protein,
35+
but we'll throw a warning to alert the user.
36+
There may be a frame shift,
37+
in which case the returned translation will be incorrect.
38+
39+
We'll also check that the length of the string is divisible by three.
40+
If it is not, this will likely mean that there was a mutation in the string
41+
(insertion or deletion).
42+
Again, we can still convert as much of the string as possible.
43+
However, we should alert the user that the result may be incorrect!
44+
45+
Next, we'll need to convert this string of mRNA to a string of amino acids (a protein string) using the RNA codon table.
46+
We can convert the RNA codon table into a dictionary,
47+
which we can use to look up the amino acid for each codon.
48+
Alternatively, we could also import this from the BioSequences package,
49+
as this is already defined [there](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L132).
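
As a third option, the dictionary can be generated rather than typed out. The sketch below assumes the conventional `UCAG` ordering of the standard codon table, in which the 64 amino acid letters can be written as one string. The variable names `bases` and `amino_acids` are illustrative and not part of the original solution.

```julia
# Nucleotides in the conventional codon-table order.
bases = "UCAG"

# Amino acids of the standard genetic code, listed in the same order as the
# codons UUU, UUC, UUA, UUG, UCU, ... GGG ('*' marks a stop codon).
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# The first `for` is the outermost loop, so the third position varies fastest.
codons = [string(a, b, c) for a in bases for b in bases for c in bases]

# Pair each codon with its amino acid to get a Dict{String, Char}.
codon_table = Dict(zip(codons, amino_acids))

codon_table["AUG"]
```

Generating the table avoids typos in 64 hand-written entries, at the cost of being less readable than the explicit dictionary.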
50+
51+
Then, we'll break the string into codons by slicing it every three characters.
52+
These codons can be matched against the RNA codon table to get the corresponding amino acid.
53+
We'll join all these amino acids together to form the final string.
54+
55+
Lastly, we'll need to deal with any three-character strings that don't match a codon.
56+
This likely means that there was a mutation in the input mRNA string!
57+
If we get a codon that doesn't match,
58+
we can return "X" for that amino acid,
59+
and continue translating the rest of the string.
60+
If we get a string of X's,
61+
that should signal to the user that the input contains ambiguous or invalid characters, since a frame shift alone still yields valid codons.
62+
63+
64+
Now that we have established an approach,
65+
let's turn this into code!
66+
67+
```julia
68+
using Test
69+
70+
rna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
71+
72+
# note: this can be created by hand
73+
# or it can be accessed from the BioSequences package (see link above)
74+
codon_table = Dict(
75+
"AAA" => 'K', "AAC" => 'N', "AAG" => 'K', "AAU" => 'N',
76+
"ACA" => 'T', "ACC" => 'T', "ACG" => 'T', "ACU" => 'T',
77+
"AGA" => 'R', "AGC" => 'S', "AGG" => 'R', "AGU" => 'S',
78+
"AUA" => 'I', "AUC" => 'I', "AUG" => 'M', "AUU" => 'I',
79+
"CAA" => 'Q', "CAC" => 'H', "CAG" => 'Q', "CAU" => 'H',
80+
"CCA" => 'P', "CCC" => 'P', "CCG" => 'P', "CCU" => 'P',
81+
"CGA" => 'R', "CGC" => 'R', "CGG" => 'R', "CGU" => 'R',
82+
"CUA" => 'L', "CUC" => 'L', "CUG" => 'L', "CUU" => 'L',
83+
"GAA" => 'E', "GAC" => 'D', "GAG" => 'E', "GAU" => 'D',
84+
"GCA" => 'A', "GCC" => 'A', "GCG" => 'A', "GCU" => 'A',
85+
"GGA" => 'G', "GGC" => 'G', "GGG" => 'G', "GGU" => 'G',
86+
"GUA" => 'V', "GUC" => 'V', "GUG" => 'V', "GUU" => 'V',
87+
"UAA" => '*', "UAC" => 'Y', "UAG" => '*', "UAU" => 'Y',
88+
"UCA" => 'S', "UCC" => 'S', "UCG" => 'S', "UCU" => 'S',
89+
"UGA" => '*', "UGC" => 'C', "UGG" => 'W', "UGU" => 'C',
90+
"UUA" => 'L', "UUC" => 'F', "UUG" => 'L', "UUU" => 'F',
91+
)
92+
93+
function translate_mrna(seq, codon_table)
94+
95+
# check if starts with start codon
96+
if ! startswith(seq, "AUG")
97+
@warn "this sequence does not start with AUG"
98+
end
99+
# check if string is divisible by three
100+
if rem(length(seq), 3) != 0
101+
@warn "this sequence is not divisible by 3"
102+
end
103+
# separate string into codons
104+
# this makes a generator, which allocates less memory than a vector
105+
codons = (join(chunk) for chunk in Iterators.partition(seq, 3))
106+
107+
# map over codons with codon table, return X if not in codon_table
108+
aa_string = join(get(codon_table, c, 'X') for c in codons)
109+
110+
# return amino acid string
111+
return aa_string
112+
113+
end
114+
115+
translate_mrna(rna, codon_table)
116+
```
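
Note that the function above translates the final `UGA` codon to `*`, so its result for the sample dataset ends in `*`, while the expected sample output does not. One possible way to reconcile the two is sketched below, using a hypothetical helper named `strip_stop` (not part of the original solution) that keeps only the residues before the first stop symbol. It also puts the `Test` import to use.

```julia
using Test

# Hypothetical helper: keep everything before the first stop symbol '*'.
strip_stop(protein) = String(first(split(protein, '*')))

# The string translate_mrna returns for the sample dataset.
translated = "MAMAPRTEINSTRING*"

@test strip_stop(translated) == "MAMAPRTEINSTRING"
@test strip_stop("MAM*APR") == "MAM"   # translation ends at an internal stop
@test strip_stop("MAM") == "MAM"       # no stop symbol: string is unchanged
```

Whether to truncate at the first stop or keep the `*` is a design choice; truncating matches the format Rosalind expects.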
117+
118+
Let's test that our function correctly deals with non-conventional mRNA strings.
119+
120+
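If we change the input string to include a codon that is not present in the codon table,
121+
the function emits no warning, because the string still starts with `AUG` and its length is divisible by 3.
122+
Instead, the unknown codon is translated to the placeholder amino acid "X".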
123+
```julia
124+
translate_mrna("AUGNCG", codon_table)
125+
```
126+
127+
Next, let's confirm that an input mRNA strand with a length that is not divisible by 3 produces the correct warning.
128+
129+
```julia
130+
translate_mrna("AUGGC", codon_table)
131+
```
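
Rather than reading the warnings by eye, they can be asserted with `@test_logs` from the `Test` standard library. The sketch below uses a hypothetical helper, `check_mrna`, that reproduces only the two sanity checks from `translate_mrna`, so that it runs without the codon table.

```julia
using Test

# Hypothetical helper mirroring only the two checks made by translate_mrna.
function check_mrna(seq)
    startswith(seq, "AUG") || @warn "this sequence does not start with AUG"
    length(seq) % 3 == 0 || @warn "this sequence is not divisible by 3"
    return nothing
end

# A length of 5 triggers only the divisibility warning.
@test_logs (:warn, "this sequence is not divisible by 3") check_mrna("AUGGC")

# An unknown codon triggers no warning: start codon and length are both fine.
@test_logs check_mrna("AUGNCG")

# Both checks fail here, so both warnings are expected, in order.
@test_logs (:warn, "this sequence does not start with AUG") (:warn, "this sequence is not divisible by 3") check_mrna("GGCAU")
```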
132+
133+
134+
135+
136+
137+
### BioSequences Solution
138+
139+
An alternative way to approach this problem would be to leverage
140+
an established function from the BioSequences package in BioJulia.
141+
142+
```julia
143+
using BioSequences
144+
145+
translate(rna"AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA")
146+
147+
```
148+
149+
This function is straightforward to use,
150+
especially in the case where the input mRNA has no ambiguous codons
151+
and is divisible by 3.
152+
However, there are also additional parameters available for handling other types of strings.
153+
154+
For instance, the function defaults to using the standard genetic code.
155+
However, if a user wishes to use another codon table
156+
(for example, yeast or invertebrate),
157+
there are others available on [BioSequences.jl](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L130) to choose from.
158+
159+
160+
For example, we can translate the same input mRNA string
161+
using the vertebrate mitochondrial genetic code!
162+
163+
```julia
164+
translate(rna"AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA", code=BioSequences.vertebrate_mitochondrial_genetic_code)
165+
166+
```
167+
168+
By default, `allow_ambiguous_codons` is `true`.
169+
If a user gives the function an mRNA string with ambiguous codons that may not be found in the standard genetic code,
170+
these codons will be translated to the narrowest amino acid
171+
which covers all
172+
non-ambiguous codons encompassed by the ambiguous codon.
173+
If this option is turned off,
174+
ambiguous codons will cause an error.
175+
176+
For example, the input mRNA string below includes the nucleotides `NCG`,
177+
which is an ambiguous codon.
178+
This could potentially stand for `ACG` (threonine),
179+
`CCG` (proline), `UCG` (serine), or `GCG` (alanine),
180+
which code for four different amino acids.
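
The reasoning for `NCG` can be sketched in plain Julia. This is only an illustration of the idea, not how BioSequences implements it: it expands `N` into the four concrete nucleotides, looks up each resulting codon in a small excerpt of the standard codon table, and falls back to `X` when the candidates disagree. The names `ncg_table`, `candidates`, and `possible` are hypothetical.

```julia
# Excerpt of the standard codon table: only the codons that NCG can stand for.
ncg_table = Dict("ACG" => 'T', "CCG" => 'P', "UCG" => 'S', "GCG" => 'A')

# Expand the ambiguous first position N into every concrete nucleotide.
candidates = [string(n, "CG") for n in "ACGU"]

# Collect the distinct amino acids these codons encode.
possible = Set(ncg_table[c] for c in candidates)

# A single shared amino acid would be unambiguous; otherwise use 'X'.
result = length(possible) == 1 ? first(possible) : 'X'
```

Here `possible` contains four different amino acids, so `result` is `'X'`.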
181+
182+
`allow_ambiguous_codons` is `true` by default,
183+
so this mRNA strand is translated to `MX`.
184+
185+
```julia
186+
translate(rna"AUGNCG")
187+
```
188+
189+
However, if `allow_ambiguous_codons` is changed to `false`,
190+
an error is thrown, as ambiguous codons are no longer accepted in the input.
191+
192+
```julia
193+
translate(rna"AUGNCG", allow_ambiguous_codons=false)
194+
```
195+
196+
Additionally, `alternative_start` is `false` by default.
197+
If set to true, the starting amino acid will be Methionine regardless of what the first codon is.
198+
199+
```julia
200+
translate(rna"AUCGAC", alternative_start = true)
201+
```
202+
203+
Unlike our function, which only emits a warning, the BioSequences function throws an error if the length of the input mRNA string is not divisible by 3.
204+
205+
```julia
206+
translate(rna"AUGGA")
207+
```
