Commit 89a8779

Merge pull request #18 from BioJulia/2026-02-12-subs
Finding a Motif in DNA
2 parents 9ad11e7 + a0857a2 commit 89a8779

1 file changed

+132
-0
lines changed

docs/src/rosalind/09-subs.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
# Finding a Motif in DNA

🤔 [Problem link](https://rosalind.info/problems/subs/)

!!! warning "The Problem"

    Given two strings s and t,
    t is a substring of s if t is contained as a contiguous collection of symbols in s
    (as a result, t must be no longer than s).

    The position of a symbol in a string is the total number of symbols found to its left, including itself
    (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18).
    The symbol at position i of s is denoted by s[i].

    A substring of s can be represented as s[j:k],
    where j and k represent the starting and ending positions of the substring in s;
    for example, if s = "AUGCUUCAGAAAGGUCUUACG",
    then s[2:5] = "UGCU".

    The location of a substring s[j:k] is its beginning position j;
    note that t will have multiple locations in s
    if it occurs more than once as a substring of s
    (see the Sample below).

    Given:
    Two DNA strings s and t
    (each of length at most 1 kbp).

    Return:
    All locations of t as a substring of s.

    Sample Dataset
    `GATATATGCATATACTTATAC`
    `ATAT`

    Sample Output
    `2 4 10`

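Julia strings are 1-indexed and use `s[j:k]` for inclusive slicing, so the notation in the problem statement carries over directly. A minimal sketch using only Base functions and the RNA string from the problem statement (the variable names are illustrative, and plain ASCII input is assumed):

```julia
s = "AUGCUUCAGAAAGGUCUUACG"

# positions of every 'U', counted from 1 as in the problem statement
u_positions = findall(==('U'), s)

# s[j:k] is the substring from position j through position k, inclusive
sub = s[2:5]

println(u_positions)  # [2, 5, 6, 15, 17, 18]
println(sub)          # UGCU
```
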
### Handwritten solution

Let's start with the most verbose solution.
We can loop over every index of the input string and
check whether the substring begins at that index.

In other words,
we compare the substring against the slice of the input string that starts at each index and has the same length as the substring.

```julia
dataset = "GATATATGCATATACTTATAC"
search_string = "ATAT"

function haystack(substring, string)
    # check if the strings are empty
    if isempty(substring) || isempty(string)
        throw(ErrorException("empty sequences"))
    end

    # return early if the substring never occurs in the string
    if !occursin(substring, string)
        return Int[]
    end

    output = Int[]
    for i in eachindex(string)
        # check if first letter of substring matches character at the index
        if string[i] == substring[1]
            # check if full substring matches at index
            # make sure not to slice past the end of the string
            stop = i + length(substring) - 1
            if stop <= length(string) && string[i:stop] == substring
                push!(output, i)
            end
        end
    end
    return output
end

haystack(search_string, dataset)
```

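As a rough alternative sketch, Base Julia's `findfirst` and `findnext` can jump straight from one match to the next instead of testing every index. The helper name `motif_starts` is hypothetical, and the inputs are assumed to be the Rosalind sample strings (plain ASCII DNA):

```julia
# Hypothetical helper: start positions of every (possibly overlapping)
# occurrence of t in s, using only Base string search.
function motif_starts(t::AbstractString, s::AbstractString)
    starts = Int[]
    isempty(t) && return starts
    r = findfirst(t, s)
    while r !== nothing
        push!(starts, first(r))
        # resume one character after the previous start so that
        # overlapping occurrences are not skipped
        r = findnext(t, s, nextind(s, first(r)))
    end
    return starts
end

s = "GATATATGCATATACTTATAC"
t = "ATAT"

println(motif_starts(t, s))  # [2, 4, 10]

# Base findall also accepts a string pattern (Julia 1.3 and later)
via_findall = first.(findall(t, s; overlap = true))
```

Restarting the search one character after each match start, rather than after the match end, is what allows the overlapping hits at positions 2 and 4 to both be reported.
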
### BioJulia solution

The BioSequences package has a helpful function [`findall`](https://github.com/BioJulia/BioSequences.jl/blob/b626dbcaad76217b248449e6aa2cc1650e95660c/src/BioSequences.jl#L261-L316),
which returns the locations of all exact matches of a query in a sequence.

It isn't included in the documentation about exact string search [here](https://biojulia.dev/BioSequences.jl/v2.0/sequence_search/#Exact-search-1),
but the function exists!

BioSequences has other helpful exact string search functions like `findfirst`, `findnext`, and `findlast`.

```julia
using BioSequences

function haystack_findall(substring, string)
    # check if the strings are empty
    if isempty(substring) || isempty(string)
        throw(ErrorException("empty sequences"))
    end

    # return early if the substring never occurs in the string
    if !occursin(substring, string)
        return Int[]
    end

    # the dna"..." string macro does not interpolate variables,
    # so build the sequences with the LongDNA{4} constructor instead
    query = ExactSearchQuery(LongDNA{4}(substring))
    matches = findall(query, LongDNA{4}(string); overlap = true)
    # keep the start position of each match
    return first.(matches)
end

haystack_findall(search_string, dataset)
```

### Regex solution

Lastly, we can also use Julia's regular expression matching.
Here the "pattern" we are searching for is the exact string.
This is a great solution if we want to look for more complicated patterns,
but it works for exact matches as well!

```julia
function haystack_regex(substring, string)
    # check if the strings are empty
    if isempty(substring) || isempty(string)
        throw(ErrorException("empty sequences"))
    end

    # return early if the substring never occurs in the string
    if !occursin(substring, string)
        return Int[]
    end

    # overlap=true lets a match start inside the previous match
    return [m.offset for m in eachmatch(Regex(substring), string, overlap=true)]
end

haystack_regex(search_string, dataset)
```
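
To illustrate the "more complicated patterns" case, here is a small sketch that searches for a degenerate motif rather than an exact string. The motif `ATA[TC]` is a made-up example, not part of the Rosalind problem, and the input is assumed to be the Rosalind sample sequence:

```julia
s = "GATATATGCATATACTTATAC"

# hypothetical degenerate motif: "ATA" followed by either T or C
motif = r"ATA[TC]"

# m.offset is the 1-based start of each match;
# overlap = true allows a match to begin inside the previous one
starts = [m.offset for m in eachmatch(motif, s; overlap = true)]

println(starts)  # [2, 4, 10, 12, 18]
```

The same comprehension works for any `Regex`, which is why the exact-match case above only needs `Regex(substring)` as its pattern.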