Skip to content

Commit 50c1d06

Browse files
Resolve conflict in src/rosalind/06-hamm.md
2 parents 79236d3 + bfc070e commit 50c1d06

2 files changed

Lines changed: 193 additions & 0 deletions

File tree

docs/Project.toml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,3 +10,7 @@ FormatSpecimens = "3372ea36-2a1a-11e9-3eb7-996970b6ffbd"
1010
JuliaFormatter = "98e50ef6-434e-11e9-1051-2b60c6c9e899"
1111
LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"
1212
XAM = "d759349c-bcba-11e9-07c2-5b90f8f05f7c"
13+
14+
[compat]
15+
DocumenterVitepress = "=0.2.7"
16+

docs/src/rosalind/06-hamm.md

Lines changed: 189 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,189 @@
1+
# Counting Point Mutations
2+
3+
🤔 [Problem link](https://rosalind.info/problems/hamm/)
4+
5+
!!! warning "The Problem"
6+
7+
Given two strings s and t of equal length,
8+
the Hamming distance between s and t, denoted dH(s,t),
9+
is the number of corresponding symbols that differ in s and t.
10+
11+
12+
Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).
13+
14+
Return: The Hamming distance dH(s,t).
15+
16+
***Sample Dataset***
17+
18+
```
19+
GAGCCTACTAACGGGAT
20+
CATCGTAATGACGGCCT
21+
```
22+
23+
***Sample Output***
24+
25+
```
26+
7
27+
```
28+
29+
30+
To calculate the Hamming Distance between two strings/sequences,
31+
the two strings/DNA sequences must be the same length.
32+
33+
The simplest way to solve this problem is to compare the corresponding values in each string for each index and then sum the mismatches.
34+
This is the fastest and most idiomatic Julia solution, as it leverages vector math.
35+
36+
Let's give this a try!
37+
38+
```julia
39+
ex_seq_a = "GAGCCTACTAACGGGAT"
40+
ex_seq_b = "CATCGTAATGACGGCCT"
41+
42+
count(i-> ex_seq_a[i] != ex_seq_b[i], eachindex(ex_seq_a))
43+
```
44+
45+
### For Loop
46+
47+
Another way we can approach this would be to use the for-loop.
48+
For loops are traditionally slower and clunkier (especially in Python).
49+
However, Julia can often optimize for-loops like this,
50+
which is one of the things that makes it so powerful.
51+
It has multiple processing units that can run the same task in parallel.
52+
53+
We can calculate the Hamming Distance by looping over the characters in one of the strings
54+
and checking if the corresponding character at the same index in the other string matches.
55+
56+
Each mismatch will cause 1 to be added to a `counter` variable.
57+
At the end of the loop, we can return the total value of the `counter` variable.
58+
59+
60+
61+
```julia
62+
ex_seq_a = "GAGCCTACTAACGGGAT"
63+
ex_seq_b = "CATCGTAATGACGGCCT"
64+
65+
function hamming(seq_a, seq_b)
66+
# check if the strings are empty
67+
if isempty(seq_a)
68+
throw(ErrorException("empty sequences"))
69+
end
70+
71+
# check if the strings are different lengths
72+
if length(seq_a) != length(seq_b)
73+
throw(ErrorException(" sequences have different lengths"))
74+
end
75+
76+
mismatches = 0
77+
for i in 1:length(seq_a)
78+
if seq_a[i] != seq_b[i]
79+
mismatches += 1
80+
end
81+
end
82+
return mismatches
83+
end
84+
85+
hamming(ex_seq_a, ex_seq_b)
86+
87+
```
88+
89+
90+
## BioAlignments method
91+
92+
Instead of writing your own function, an alternative would be to use the readily-available Hamming Distance [function](https://github.com/BioJulia/BioAlignments.jl/blob/0f3cc5e1ac8b34fdde23cb3dca7afb9eb480322f/src/pairwise/algorithms/hamming_distance.jl#L4) in the `BioAlignments.jl` package.
93+
94+
```julia
95+
using BioAlignments
96+
97+
ex_seq_a = "GAGCCTACTAACGGGAT"
98+
ex_seq_b = "CATCGTAATGACGGCCT"
99+
100+
bio_hamming = BioAlignments.hamming_distance(Int64, ex_seq_a, ex_seq_b)
101+
102+
bio_hamming[1]
103+
104+
```
105+
106+
```julia
107+
# Double check that we got the same values from both ouputs
108+
@assert hamming(ex_seq_a, ex_seq_b) == bio_hamming[1]
109+
```
110+
111+
112+
The BioAlignments `hamming_distance` function requires three input variables --
113+
the first of which allows the user to control the `type` of the returned hamming distance value.
114+
115+
In the above example, `Int64` is provided as the first input variable,
116+
but `Float64` or `Int8` are also acceptable inputs.
117+
The second two input variables are the two sequences that are being compared.
118+
119+
There are two outputs of this function:
120+
the actual Hamming Distance value and the Alignment Anchor.
121+
The Alignment Anchor is a one-dimensional array (vector) that is the same length as the length of the input strings.
122+
123+
Each value in the vector is also an AlignmentAnchor with three fields:
124+
sequence position, reference position, and an operation code
125+
('0' for start, '=' for match, 'X' for mismatch).
126+
127+
The Alignment Anchor for the above example is:
128+
```
129+
AlignmentAnchor[
130+
AlignmentAnchor(0, 0, '0'),
131+
AlignmentAnchor(1, 1, 'X'),
132+
AlignmentAnchor(2, 2, '='),
133+
AlignmentAnchor(3, 3, 'X'),
134+
AlignmentAnchor(4, 4, '='),
135+
AlignmentAnchor(5, 5, 'X'),
136+
AlignmentAnchor(7, 7, '='),
137+
AlignmentAnchor(8, 8, 'X'),
138+
AlignmentAnchor(9, 9, '='),
139+
AlignmentAnchor(10, 10, 'X'),
140+
AlignmentAnchor(14, 14, '='),
141+
AlignmentAnchor(16, 16, 'X'),
142+
AlignmentAnchor(17, 17, '=')]
143+
```
144+
145+
### Distances.jl method
146+
147+
Another package that calculates the Hamming distance is the [Distances package](https://github.com/JuliaStats/Distances.jl).
148+
We can call its `hamming` function on our two test sequences:
149+
150+
151+
152+
```julia
153+
using Distances
154+
155+
ex_seq_a = "GAGCCTACTAACGGGAT"
156+
ex_seq_b = "CATCGTAATGACGGCCT"
157+
158+
Distances.hamming(ex_seq_a, ex_seq_b)
159+
```
160+
161+
## Benchmarking
162+
163+
Let's test to see which method is the most efficient!
164+
Did the for-loop slow us down?
165+
166+
```julia
167+
using BenchmarkTools
168+
169+
testseq1 = string(randdnaseq(100_000)) # this is defined in BioSequences
170+
171+
testseq2 = string(randdnaseq(100_000))
172+
173+
174+
@benchmark hamming(\$testseq1, \$testseq2)
175+
176+
@benchmark BioAlignments.hamming_distance(Int64, \$testseq1, \$testseq2)
177+
178+
@benchmark Distances.hamming(\$testseq1, \$testseq2)
179+
```
180+
181+
The BioAlignments method takes up a much larger amount of memory,
182+
and nearly three times as long to run.
183+
However, it also generates an `AlignmentAnchor` data structure each time the function is called,
184+
so this is not a fair comparison.
185+
The `Distances` package is the winner here, which makes sense,
186+
as it uses a vectorized approach.
187+
188+
189+

0 commit comments

Comments
 (0)