Commit 82d71c8

add semantic line breaks and benchmarking
1 parent eb58985 commit 82d71c8

1 file changed

Lines changed: 69 additions & 14 deletions

docs/src/rosalind/06-hamm.md

@@ -4,7 +4,9 @@
44

55
!!! warning "The Problem"
66

7-
Given two strings s and t of equal length, the Hamming distance between s and t, denoted dH(s,t), is the number of corresponding symbols that differ in s and t.
7+
Given two strings s and t of equal length,
8+
the Hamming distance between s and t, denoted dH(s,t),
9+
is the number of corresponding symbols that differ in s and t.
810

911

1012
Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).
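
A minimal sketch of the definition in this hunk, assuming two short made-up strings (`s` and `t` below are illustrative and do not come from the docs): pair up corresponding symbols and count the pairs that differ.

```julia
# Illustrative inputs, not taken from the documentation.
s = "GATTACA"
t = "GACTATA"

# Pair corresponding symbols and count the pairs that differ.
d = count(((a, b),) -> a != b, zip(s, t))

println(d)  # prints 2 (positions 3 and 6 differ)
```

Using `zip` avoids indexing into the strings, so the sketch also behaves sensibly for non-ASCII input.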
@@ -25,9 +27,11 @@
2527
```
2628

2729

28-
To calculate the Hamming Distance between two strings/sequences, the two strings/DNA sequences must be the same length.
30+
To calculate the Hamming Distance between two strings/sequences,
31+
the two strings/DNA sequences must be the same length.
2932

30-
The simplest way to solve this problem is to compare the corresponding values in each string for each index and then sum the mismatches. This is the fastest and most idiomatic Julia solution, as it leverages vector math.
33+
The simplest way to solve this problem is to compare the corresponding values in each string for each index and then sum the mismatches.
34+
This is a fast and idiomatic Julia solution, as `count` tallies the mismatches without building an intermediate array.
3135

3236
Let's give this a try!
3337

@@ -38,15 +42,19 @@ ex_seq_b = "CATCGTAATGACGGCCT"
3842
count(i-> ex_seq_a[i] != ex_seq_b[i], eachindex(ex_seq_a))
3943
```
4044

41-
42-
4345
### For Loop
4446

45-
Another way we can approach this would be to use the for-loop. This method will be a bit slower.
47+
Another way we can approach this would be to use a for-loop.
48+
In interpreted languages such as Python, for-loops are often slower and clunkier than vectorized code.
49+
However, Julia can often optimize for-loops like this,
50+
which is one of the things that makes it so powerful.
51+
Julia compiles such loops to native code, and the compiler can sometimes vectorize them with SIMD instructions.
4652

47-
We can calculate the Hamming Distance by looping over the characters in one of the strings and checking if the corresponding character at the same index in the other string matches.
53+
We can calculate the Hamming Distance by looping over the characters in one of the strings
54+
and checking if the corresponding character at the same index in the other string matches.
4855

49-
Each mismatch will cause 1 to be added to a `counter` variable. At the end of the loop, we can return the total value of the `counter` variable.
56+
Each mismatch will cause 1 to be added to a `counter` variable.
57+
At the end of the loop, we can return the total value of the `counter` variable.
5058

5159

5260

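
The for-loop approach described in this hunk can be sketched as follows. This is an illustration only: the name `hamming_loop` is hypothetical, the length check is an added assumption based on the equal-length requirement stated earlier, and the first test string is assumed to be the Rosalind sample sequence that pairs with `ex_seq_b`.

```julia
# Hypothetical helper; not necessarily the function defined in the docs.
function hamming_loop(a::AbstractString, b::AbstractString)
    # The Hamming distance is only defined for equal-length inputs.
    length(a) == length(b) ||
        throw(ArgumentError("sequences must be the same length"))

    counter = 0
    for (x, y) in zip(a, b)
        # Each mismatch adds 1 to the counter.
        if x != y
            counter += 1
        end
    end
    return counter
end

# First string is assumed (Rosalind sample); second is ex_seq_b from the docs.
result = hamming_loop("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT")
println(result)  # prints 7
```

Throwing an `ArgumentError` on unequal lengths is one possible design choice; returning `nothing` or truncating to the shorter input would hide likely mistakes in the data.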
@@ -101,22 +109,43 @@ bio_hamming[1]
101109
```
102110

103111

104-
The BioAlignments `hamming_distance` function requires three input variables -- the first of which allows the user to control the `type` of the returned hamming distance value.
112+
The BioAlignments `hamming_distance` function takes three arguments --
113+
the first of which allows the user to control the `type` of the returned hamming distance value.
105114

106-
In the above example, `Int64` is provided as the first input variable, but `Float64` or `Int8` are also acceptable inputs. The second two input variables are the two sequences that are being compared.
115+
In the above example, `Int64` is provided as the first argument,
116+
but `Float64` or `Int8` are also acceptable inputs.
117+
The remaining two arguments are the two sequences being compared.
107118

108-
There are two outputs of this function: the actual Hamming Distance value and the Alignment Anchor. The Alignment Anchor is a one-dimensional array (vector) that is the same length as the length of the input strings.
119+
There are two outputs of this function:
120+
the actual Hamming Distance value and the Alignment Anchor.
121+
The Alignment Anchor is a one-dimensional array (vector) holding a start entry followed by one entry for each run of consecutive matches or mismatches, so it can be shorter than the input strings.
109122

110-
Each value in the vector is also an AlignmentAnchor with three fields: sequence position, reference position, and an operation code ('0' for start, '=' for match, 'X' for mismatch).
123+
Each element of the vector is an `AlignmentAnchor` with three fields:
124+
sequence position, reference position, and an operation code
125+
('0' for start, '=' for match, 'X' for mismatch).
111126

112127
The Alignment Anchor for the above example is:
113128
```
114-
AlignmentAnchor[AlignmentAnchor(0, 0, '0'), AlignmentAnchor(1, 1, 'X'), AlignmentAnchor(2, 2, '='), AlignmentAnchor(3, 3, 'X'), AlignmentAnchor(4, 4, '='), AlignmentAnchor(5, 5, 'X'), AlignmentAnchor(7, 7, '='), AlignmentAnchor(8, 8, 'X'), AlignmentAnchor(9, 9, '='), AlignmentAnchor(10, 10, 'X'), AlignmentAnchor(14, 14, '='), AlignmentAnchor(16, 16, 'X'), AlignmentAnchor(17, 17, '=')]
129+
AlignmentAnchor[
130+
AlignmentAnchor(0, 0, '0'),
131+
AlignmentAnchor(1, 1, 'X'),
132+
AlignmentAnchor(2, 2, '='),
133+
AlignmentAnchor(3, 3, 'X'),
134+
AlignmentAnchor(4, 4, '='),
135+
AlignmentAnchor(5, 5, 'X'),
136+
AlignmentAnchor(7, 7, '='),
137+
AlignmentAnchor(8, 8, 'X'),
138+
AlignmentAnchor(9, 9, '='),
139+
AlignmentAnchor(10, 10, 'X'),
140+
AlignmentAnchor(14, 14, '='),
141+
AlignmentAnchor(16, 16, 'X'),
142+
AlignmentAnchor(17, 17, '=')]
115143
```
116144

117145
### Distances.jl method
118146

119-
Another package that calculates the Hamming distance is the [Distances package](https://github.com/JuliaStats/Distances.jl). We can call its `hamming` function on our two test sequences:
147+
Another package that calculates the Hamming distance is the [Distances package](https://github.com/JuliaStats/Distances.jl).
148+
We can call its `hamming` function on our two test sequences:
120149

121150

122151

@@ -129,6 +158,32 @@ ex_seq_b = "CATCGTAATGACGGCCT"
129158
Distances.hamming(ex_seq_a, ex_seq_b)
130159
```
131160

161+
## Benchmarking
162+
163+
Let's test to see which method is the most efficient!
164+
Did the for-loop slow us down?
165+
166+
```julia
167+
using BenchmarkTools
168+
169+
testseq1 = string(randdnaseq(100_000)) # randdnaseq is defined in BioSequences
170+
171+
testseq2 = string(randdnaseq(100_000))
172+
173+
174+
@benchmark hamming($testseq1, $testseq2)
175+
176+
@benchmark BioAlignments.hamming_distance(Int64, $testseq1, $testseq2)
177+
178+
@benchmark Distances.hamming($testseq1, $testseq2)
179+
```
180+
181+
The BioAlignments method uses much more memory
182+
and takes nearly three times as long to run.
183+
However, it also generates an `AlignmentAnchor` data structure each time the function is called,
184+
so this is not a fair comparison.
185+
The `Distances` package is the winner here, which makes sense,
186+
as it uses a vectorized approach.
132187

133188

134189

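
For a like-for-like check that needs no extra packages, a rough timing sketch using only Base is shown here. The helper names `count_hamming` and `loop_hamming` are hypothetical, the random sequences are built with `rand` rather than `randdnaseq`, and `@elapsed` gives a single noisy measurement, so `@benchmark` remains the better tool for real comparisons.

```julia
# Hypothetical helpers that do the same work, so the comparison is fair.
count_hamming(a, b) = count(((x, y),) -> x != y, zip(a, b))

function loop_hamming(a, b)
    counter = 0
    for (x, y) in zip(a, b)
        counter += (x != y)  # a Bool adds as 0 or 1
    end
    return counter
end

# Random DNA-like strings built with Base only.
bases = ['A', 'C', 'G', 'T']
seq1 = join(rand(bases, 100_000))
seq2 = join(rand(bases, 100_000))

# Run each function once first so compilation time is not measured.
count_hamming(seq1, seq2)
loop_hamming(seq1, seq2)

t_count = @elapsed count_hamming(seq1, seq2)
t_loop = @elapsed loop_hamming(seq1, seq2)

println("count: ", t_count, " s, loop: ", t_loop, " s")
```

Both helpers return only the distance, which avoids the extra allocation that makes the BioAlignments timing hard to compare.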