| 1 | +# Translating RNA into Protein |
| 2 | + |
| 3 | +🤔 [Problem link](https://rosalind.info/problems/prot/) |
| 4 | + |
| 5 | +!!! warning "The Problem" |
| 6 | + |
| 7 | + The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet |
| 8 | + (all letters except for B, J, O, U, X, and Z). |
| 9 | + Protein strings are constructed from these 20 symbols. |
| 10 | + Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings. |
| 11 | + |
| 12 | + The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet. |
| 13 | + |
| 14 | + Given: An RNA string s corresponding to a strand of mRNA |
| 15 | + (of length at most 10 kbp). |
| 16 | + |
| 17 | + Return: The protein string encoded by s. |
| 18 | + |
| 19 | + Sample Dataset |
| 20 | + ``` |
| 21 | + AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA |
| 22 | + ``` |
| 23 | + |
| 24 | + Sample Output |
| 25 | + ``` |
| 26 | + MAMAPRTEINSTRING |
| 27 | + ``` |
| 28 | + |
| 29 | +### DIY Solution |
| 30 | +Let's tackle this problem by writing our own solution, |
| 31 | +and then seeing how we can solve it with functions already available in BioJulia. |
| 32 | + |
| 33 | +First, we will check that this is a coding region by verifying that the string starts with a start codon (`AUG`). |
| 34 | +If not, we can still convert the string to protein, |
| 35 | +but we'll throw a warning to alert the user. |
| 36 | +There may be a frame shift, |
| 37 | +in which case the returned translation will be incorrect. |
| 38 | + |
| 39 | +We'll also check that the length of the string is divisible by three. |
| 40 | +If it is not, this likely means that there was an indel in the sequence |
| 41 | +(an insertion or deletion). |
| 42 | +Again, we can still convert as much of the string as possible. |
| 43 | +However, we should alert the user that the result may be incorrect! |
| 44 | + |
| 45 | +Next, we'll need to convert this string of mRNA to a string of amino acids using the RNA codon table. |
| 46 | +We can convert the RNA codon table into a dictionary, |
| 47 | +which maps each codon to its amino acid. |
| 48 | +Alternatively, we could also import this from the BioSequences package, |
| 49 | +as this is already defined [there](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L132). |
| 50 | + |
| 51 | +Then, we'll break the string into codons by slicing it every three characters. |
| 52 | +These codons can be matched against the RNA codon table to get the corresponding amino acid. |
| 53 | +We'll join all these amino acids together to form the final string. |
| 54 | + |
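The slicing step can be sketched on its own in base Julia. The helper name `split_codons` is hypothetical, and the sketch assumes an ASCII-only input (A, C, G, U) so that indexing by position is safe. Unlike the function written later in this tutorial, it drops an incomplete trailing chunk instead of keeping it.

```julia
# Hypothetical helper: split an RNA string into consecutive 3-character codons.
# Assumes ASCII input, so character positions equal byte indices.
function split_codons(seq::AbstractString)
    # step through the string three characters at a time;
    # 1 or 2 leftover characters at the end are dropped
    return [seq[i:i+2] for i in 1:3:length(seq)-2]
end

codons = split_codons("AUGGCCAUGG")
println(codons)  # the final "G" does not complete a codon and is left out
```

Whether to drop or keep the leftover characters is a design choice; keeping them lets the lookup step flag them explicitly.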
| 55 | +Lastly, we'll need to deal with any three-character strings that don't match a codon. |
| 56 | +This means the chunk contains a character other than A, C, G, or U, or is an incomplete final chunk. |
| 57 | +If we get a codon that doesn't match, |
| 58 | +we can return "X" for that amino acid, |
| 59 | +and continue translating the rest of the string. |
| 60 | +If we get a string of X's, |
| 61 | +that should signal to the user that something is wrong with the input, such as a DNA string passed in place of RNA. |
| 62 | + |
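The fallback for unmatched codons can be illustrated with a minimal sketch. The names `toy_table` and `lookup` are hypothetical, and the table holds only two entries for brevity. The example input is a DNA-style codon (`T` instead of `U`), which is one way a lookup can miss.

```julia
# Toy table with only two entries, for illustration; the real table has 64.
toy_table = Dict("AUG" => 'M', "GCC" => 'A')

# get(dict, key, default) returns `default` when the key is absent
lookup(codon) = get(toy_table, codon, 'X')

# "ATG" is not an RNA codon, so it falls back to 'X'; "GCC" is found
protein = join(lookup(c) for c in ["ATG", "GCC"])
println(protein)  # prints "XA"
```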
| 63 | + |
| 64 | +Now that we have established an approach, |
| 65 | +let's turn this into code! |
| 66 | + |
| 67 | +```julia |
| 68 | +using Test |
| 69 | + |
| 70 | +rna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA" |
| 71 | + |
| 72 | +# note: this can be created by hand |
| 73 | +# or it can be accessed from the BioSequences package (see link above) |
| 74 | +codon_table = Dict( |
| 75 | + "AAA" => 'K', "AAC" => 'N', "AAG" => 'K', "AAU" => 'N', |
| 76 | + "ACA" => 'T', "ACC" => 'T', "ACG" => 'T', "ACU" => 'T', |
| 77 | + "AGA" => 'R', "AGC" => 'S', "AGG" => 'R', "AGU" => 'S', |
| 78 | + "AUA" => 'I', "AUC" => 'I', "AUG" => 'M', "AUU" => 'I', |
| 79 | + "CAA" => 'Q', "CAC" => 'H', "CAG" => 'Q', "CAU" => 'H', |
| 80 | + "CCA" => 'P', "CCC" => 'P', "CCG" => 'P', "CCU" => 'P', |
| 81 | + "CGA" => 'R', "CGC" => 'R', "CGG" => 'R', "CGU" => 'R', |
| 82 | + "CUA" => 'L', "CUC" => 'L', "CUG" => 'L', "CUU" => 'L', |
| 83 | + "GAA" => 'E', "GAC" => 'D', "GAG" => 'E', "GAU" => 'D', |
| 84 | + "GCA" => 'A', "GCC" => 'A', "GCG" => 'A', "GCU" => 'A', |
| 85 | + "GGA" => 'G', "GGC" => 'G', "GGG" => 'G', "GGU" => 'G', |
| 86 | + "GUA" => 'V', "GUC" => 'V', "GUG" => 'V', "GUU" => 'V', |
| 87 | + "UAA" => '*', "UAC" => 'Y', "UAG" => '*', "UAU" => 'Y', |
| 88 | + "UCA" => 'S', "UCC" => 'S', "UCG" => 'S', "UCU" => 'S', |
| 89 | + "UGA" => '*', "UGC" => 'C', "UGG" => 'W', "UGU" => 'C', |
| 90 | + "UUA" => 'L', "UUC" => 'F', "UUG" => 'L', "UUU" => 'F', |
| 91 | + ) |
| 92 | + |
| 93 | +function translate_mrna(seq, codon_table) |
| 94 | + |
| 95 | + # check if starts with start codon |
| 96 | + if ! startswith(seq, "AUG") |
| 97 | + @warn "this sequence does not start with AUG" |
| 98 | + end |
| 99 | + # check if string is divisible by three |
| 100 | + if rem(length(seq), 3) != 0 |
| 101 | + @warn "this sequence is not divisible by 3" |
| 102 | + end |
| 103 | + # separate string into codons |
| 104 | + # this makes a generator, which allocates less memory than a vector |
| 105 | + codons = (join(chunk) for chunk in Iterators.partition(seq, 3)) |
| 106 | + |
| 107 | + # look up each codon in the codon table, returning 'X' if it is not in codon_table |
| 108 | + aa_string = join(get(codon_table, c, 'X') for c in codons) |
| 109 | + |
| 110 | + # return amino acid string |
| 111 | + return aa_string |
| 112 | + |
| 113 | +end |
| 114 | + |
| 115 | +translate_mrna(rna, codon_table) # returns "MAMAPRTEINSTRING*"; the trailing '*' is the stop codon UGA |
| 116 | +``` |
| 117 | + |
| 118 | +Let's test that our function correctly deals with non-conventional mRNA strings. |
| 119 | + |
| 120 | +If we change the input string to include a codon that is not present in the codon table, |
| 121 | +the codon should be translated to the amino acid "X". |
| 122 | +No warning is raised here, since the string starts with `AUG` and its length is divisible by 3. |
| 123 | +```julia |
| 124 | +translate_mrna("AUGNCG", codon_table) |
| 125 | +``` |
| 126 | + |
| 127 | +Next, let's confirm that an input mRNA strand with a length that is not divisible by 3 produces the correct warning. |
| 128 | + |
| 129 | +```julia |
| 130 | +translate_mrna("AUGGC", codon_table) |
| 131 | +``` |
| 132 | + |
| 133 | + |
| 134 | + |
| 135 | + |
| 136 | + |
| 137 | +### BioSequences Solution |
| 138 | + |
| 139 | +An alternative way to approach this problem would be to leverage |
| 140 | +an established function from the BioSequences package in BioJulia. |
| 141 | + |
| 142 | +```julia |
| 143 | +using BioSequences |
| 144 | + |
| 145 | +translate(rna"AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA") |
| 146 | + |
| 147 | +``` |
| 148 | + |
| 149 | +This function is straightforward to use, |
| 150 | +especially in the case where the input mRNA has no ambiguous codons |
| 151 | +and has a length divisible by 3. |
| 152 | +However, there are also additional parameters available for handling other types of strings. |
| 153 | + |
| 154 | +For instance, the function defaults to using the standard genetic code. |
| 155 | +However, if a user wishes to use another genetic code |
| 156 | +(for example, yeast mitochondrial or invertebrate mitochondrial), |
| 157 | +there are others available on [BioSequences.jl](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/geneticcode.jl#L130) to choose from. |
| 158 | + |
| 159 | + |
| 160 | +For example, we can translate the same input mRNA string using the vertebrate mitochondrial genetic code, |
| 161 | +in which `AGA` is a stop codon and `UGA` codes for tryptophan. |
| 162 | + |
| 163 | +```julia |
| 164 | +translate(rna"AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA", code=BioSequences.vertebrate_mitochondrial_genetic_code) |
| 165 | + |
| 166 | +``` |
| 167 | + |
| 168 | +By default, `allow_ambiguous_codons` is `true`. |
| 169 | +If a user gives the function an mRNA string with ambiguous codons, |
| 170 | +these codons will be translated to the narrowest amino acid |
| 171 | +which covers all non-ambiguous codons |
| 172 | +encompassed by the ambiguous codon. |
| 173 | +If this option is turned off, |
| 174 | +ambiguous codons will cause an error. |
| 175 | + |
| 176 | +For example, the input mRNA string below includes the codon `NCG`, |
| 177 | +which is ambiguous. |
| 178 | +It could stand for `ACG` (Threonine), |
| 179 | +`CCG` (Proline), `UCG` (Serine), or `GCG` (Alanine), |
| 180 | +which code for four different amino acids. |
| 181 | + |
| 182 | +`allow_ambiguous_codons` is `true` by default, |
| 183 | +so this mRNA strand is translated to `MX`. |
| 184 | + |
| 185 | +```julia |
| 186 | +translate(rna"AUGNCG") |
| 187 | +``` |
| 188 | + |
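The reasoning behind the `X` can be sketched in base Julia, without BioSequences. This is a simplified illustration, not the library's implementation: the helper `expand_n` and the table `ncg_table` are hypothetical, only `N` is handled, and any disagreement collapses straight to `'X'` (the library can also return narrower ambiguity symbols in some cases).

```julia
# Amino acids for the four codons that `NCG` can stand for (standard genetic code).
ncg_table = Dict("ACG" => 'T', "CCG" => 'P', "GCG" => 'A', "UCG" => 'S')

# Hypothetical helper: expand each `N` into A, C, G, U and list the concrete codons.
function expand_n(codon::AbstractString)
    options = [c == 'N' ? ['A', 'C', 'G', 'U'] : [c] for c in codon]
    return vec([join(combo) for combo in Iterators.product(options...)])
end

concrete = expand_n("NCG")
amino_acids = Set(ncg_table[c] for c in concrete)

# Simplification: a single shared amino acid is kept, anything else becomes 'X'.
result = length(amino_acids) == 1 ? first(amino_acids) : 'X'
println(result)  # the four codons disagree, so this prints X
```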
| 189 | +However, if `allow_ambiguous_codons` is changed to `false`, |
| 190 | +an error is thrown, as no ambiguous codons are allowed in the input. |
| 191 | + |
| 192 | +```julia |
| 193 | +translate(rna"AUGNCG", allow_ambiguous_codons=false) |
| 194 | +``` |
| 195 | + |
| 196 | +Additionally, `alternative_start` is `false` by default. |
| 197 | +If set to `true`, the first codon is always translated to Methionine, regardless of what that codon is. |
| 198 | + |
| 199 | +```julia |
| 200 | +translate(rna"AUCGAC", alternative_start = true) |
| 201 | +``` |
| 202 | + |
| 203 | +Unlike our function, which only emits a warning, the BioSequences function throws an error if the length of the input mRNA string is not divisible by 3. |
| 204 | + |
| 205 | +```julia |
| 206 | +translate(rna"AUGGA") |
| 207 | +``` |
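If translating as much of the sequence as possible is preferred over an error, one option is to trim the input to a multiple of three first. The helper `trim_to_codons` below is hypothetical and written in base Julia; it assumes an ASCII-only string, and the trimmed string would still need to be converted to a BioSequences sequence before calling `translate`.

```julia
# Hypothetical helper: drop trailing characters that do not complete a codon.
# Assumes ASCII input so that character positions equal byte indices.
trim_to_codons(seq::AbstractString) = seq[1:(3 * div(length(seq), 3))]

println(trim_to_codons("AUGGA"))  # prints "AUG"
```

Trimming silently discards bases, so it can hide an insertion or deletion upstream; warning the user, as the DIY function does, may be the safer choice.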