|
| 1 | +# Counting Point Mutations |
| 2 | + |
| 3 | +🤔 [Problem link](https://rosalind.info/problems/hamm/) |
| 4 | + |
| 5 | +!!! warning "The Problem" |
| 6 | + |
| 7 | + Given two strings s and t of equal length, |
| 8 | + the Hamming distance between s and t, denoted dH(s,t), |
| 9 | + is the number of corresponding symbols that differ in s and t. |
| 10 | + |
| 11 | + |
| 12 | + Given: Two DNA strings s and t of equal length (not exceeding 1 kbp). |
| 13 | + |
| 14 | + Return: The Hamming distance dH(s,t). |
| 15 | + |
| 16 | + ***Sample Dataset*** |
| 17 | + |
| 18 | + ``` |
| 19 | + GAGCCTACTAACGGGAT |
| 20 | + CATCGTAATGACGGCCT |
| 21 | + ``` |
| 22 | + |
| 23 | + ***Sample Output*** |
| 24 | + |
| 25 | + ``` |
| 26 | + 7 |
| 27 | + ``` |
| 28 | + |
| 29 | + |
| 30 | +To calculate the Hamming Distance between two strings/sequences, |
| 31 | +the two strings/DNA sequences must be the same length. |
| 32 | + |
| 33 | +The simplest way to solve this problem is to compare the corresponding values in each string for each index and then sum the mismatches. |
| 34 | +This is the fastest and most idiomatic Julia solution, as it leverages vector math. |
| 35 | + |
| 36 | +Let's give this a try! |
| 37 | + |
| 38 | +```julia |
| 39 | +ex_seq_a = "GAGCCTACTAACGGGAT" |
| 40 | +ex_seq_b = "CATCGTAATGACGGCCT" |
| 41 | + |
| 42 | +count(i-> ex_seq_a[i] != ex_seq_b[i], eachindex(ex_seq_a)) |
| 43 | +``` |
| 44 | + |
| 45 | +### For Loop |
| 46 | + |
| 47 | +Another way we can approach this would be to use the for-loop. |
| 48 | +For loops are traditionally slower and clunkier (especially in Python). |
| 49 | +However, Julia can often optimize for-loops like this, |
| 50 | +which is one of the things that makes it so powerful. |
| 51 | +It has multiple processing units that can run the same task in parallel. |
| 52 | + |
| 53 | +We can calculate the Hamming Distance by looping over the characters in one of the strings |
| 54 | +and checking if the corresponding character at the same index in the other string matches. |
| 55 | + |
| 56 | +Each mismatch will cause 1 to be added to a `counter` variable. |
| 57 | +At the end of the loop, we can return the total value of the `counter` variable. |
| 58 | + |
| 59 | + |
| 60 | + |
| 61 | +```julia |
| 62 | +ex_seq_a = "GAGCCTACTAACGGGAT" |
| 63 | +ex_seq_b = "CATCGTAATGACGGCCT" |
| 64 | + |
| 65 | +function hamming(seq_a, seq_b) |
| 66 | + # check if the strings are empty |
| 67 | + if isempty(seq_a) |
| 68 | + throw(ErrorException("empty sequences")) |
| 69 | + end |
| 70 | + |
| 71 | + # check if the strings are different lengths |
| 72 | + if length(seq_a) != length(seq_b) |
| 73 | + throw(ErrorException(" sequences have different lengths")) |
| 74 | + end |
| 75 | + |
| 76 | + mismatches = 0 |
| 77 | + for i in 1:length(seq_a) |
| 78 | + if seq_a[i] != seq_b[i] |
| 79 | + mismatches += 1 |
| 80 | + end |
| 81 | + end |
| 82 | + return mismatches |
| 83 | +end |
| 84 | + |
| 85 | +hamming(ex_seq_a, ex_seq_b) |
| 86 | + |
| 87 | +``` |
| 88 | + |
| 89 | + |
| 90 | +## BioAlignments method |
| 91 | + |
| 92 | +Instead of writing your own function, an alternative would be to use the readily-available Hamming Distance [function](https://github.com/BioJulia/BioAlignments.jl/blob/0f3cc5e1ac8b34fdde23cb3dca7afb9eb480322f/src/pairwise/algorithms/hamming_distance.jl#L4) in the `BioAlignments.jl` package. |
| 93 | + |
| 94 | +```julia |
| 95 | +using BioAlignments |
| 96 | + |
| 97 | +ex_seq_a = "GAGCCTACTAACGGGAT" |
| 98 | +ex_seq_b = "CATCGTAATGACGGCCT" |
| 99 | + |
| 100 | +bio_hamming = BioAlignments.hamming_distance(Int64, ex_seq_a, ex_seq_b) |
| 101 | + |
| 102 | +bio_hamming[1] |
| 103 | + |
| 104 | +``` |
| 105 | + |
| 106 | +```julia |
| 107 | +# Double check that we got the same values from both ouputs |
| 108 | +@assert hamming(ex_seq_a, ex_seq_b) == bio_hamming[1] |
| 109 | +``` |
| 110 | + |
| 111 | + |
| 112 | + The BioAlignments `hamming_distance` function requires three input variables -- |
| 113 | + the first of which allows the user to control the `type` of the returned hamming distance value. |
| 114 | + |
| 115 | + In the above example, `Int64` is provided as the first input variable, |
| 116 | + but `Float64` or `Int8` are also acceptable inputs. |
| 117 | + The second two input variables are the two sequences that are being compared. |
| 118 | + |
| 119 | + There are two outputs of this function: |
| 120 | + the actual Hamming Distance value and the Alignment Anchor. |
| 121 | + The Alignment Anchor is a one-dimensional array (vector) that is the same length as the length of the input strings. |
| 122 | + |
| 123 | + Each value in the vector is also an AlignmentAnchor with three fields: |
| 124 | + sequence position, reference position, and an operation code |
| 125 | + ('0' for start, '=' for match, 'X' for mismatch). |
| 126 | + |
| 127 | + The Alignment Anchor for the above example is: |
| 128 | + ``` |
| 129 | + AlignmentAnchor[ |
| 130 | + AlignmentAnchor(0, 0, '0'), |
| 131 | + AlignmentAnchor(1, 1, 'X'), |
| 132 | + AlignmentAnchor(2, 2, '='), |
| 133 | + AlignmentAnchor(3, 3, 'X'), |
| 134 | + AlignmentAnchor(4, 4, '='), |
| 135 | + AlignmentAnchor(5, 5, 'X'), |
| 136 | + AlignmentAnchor(7, 7, '='), |
| 137 | + AlignmentAnchor(8, 8, 'X'), |
| 138 | + AlignmentAnchor(9, 9, '='), |
| 139 | + AlignmentAnchor(10, 10, 'X'), |
| 140 | + AlignmentAnchor(14, 14, '='), |
| 141 | + AlignmentAnchor(16, 16, 'X'), |
| 142 | + AlignmentAnchor(17, 17, '=')] |
| 143 | + ``` |
| 144 | + |
| 145 | + ### Distances.jl method |
| 146 | + |
| 147 | + Another package that calculates the Hamming distance is the [Distances package](https://github.com/JuliaStats/Distances.jl). |
| 148 | + We can call its `hamming` function on our two test sequences: |
| 149 | + |
| 150 | + |
| 151 | + |
| 152 | +```julia |
| 153 | +using Distances |
| 154 | + |
| 155 | +ex_seq_a = "GAGCCTACTAACGGGAT" |
| 156 | +ex_seq_b = "CATCGTAATGACGGCCT" |
| 157 | + |
| 158 | +Distances.hamming(ex_seq_a, ex_seq_b) |
| 159 | +``` |
| 160 | + |
| 161 | +## Benchmarking |
| 162 | + |
| 163 | +Let's test to see which method is the most efficient! |
| 164 | +Did the for-loop slow us down? |
| 165 | + |
| 166 | +```julia |
| 167 | +using BenchmarkTools |
| 168 | + |
| 169 | +testseq1 = string(randdnaseq(100_000)) # this is defined in BioSequences |
| 170 | + |
| 171 | +testseq2 = string(randdnaseq(100_000)) |
| 172 | + |
| 173 | + |
| 174 | +@benchmark hamming(\$testseq1, \$testseq2) |
| 175 | + |
| 176 | +@benchmark BioAlignments.hamming_distance(Int64, \$testseq1, \$testseq2) |
| 177 | + |
| 178 | +@benchmark Distances.hamming(\$testseq1, \$testseq2) |
| 179 | +``` |
| 180 | + |
| 181 | +The BioAlignments method takes up a much larger amount of memory, |
| 182 | +and nearly three times as long to run. |
| 183 | +However, it also generates an `AlignmentAnchor` data structure each time the function is called, |
| 184 | +so this is not a fair comparison. |
| 185 | +The `Distances` package is the winner here, which makes sense, |
| 186 | +as it uses a vectorized approach. |
| 187 | + |
| 188 | + |
| 189 | + |
0 commit comments