Skip to content

Commit ae2cf2c

Browse files
mecA blast rough draft
1 parent 3eb1e42 commit ae2cf2c

File tree

3 files changed

+216
-0
lines changed

3 files changed

+216
-0
lines changed

cookbook/03-blast.md

Lines changed: 213 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,213 @@
1+
+++
2+
using Dates
3+
date = Date("2026-03-15")
4+
title = "BLAST"
5+
rss_descr = "Using NCBIBlast.jl to run BLAST searches"
6+
+++
7+
8+
# Introduction to BLAST
9+
A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.
10+
11+
More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/).
12+
13+
BLAST's can be run from the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).
14+
A user can simply copy in a nucleotide sequence and search for a best match against all of the databases in NCBI!
15+
16+
`NCBIBlast.jl` is a thin wrapper around the BLAST command line tool,
17+
allowing users to run the tool within Julia.
18+
The following BLAST tools are supported by `NCBIBlast`:
19+
- `blastn`
20+
- `blastp`
21+
- `tblastn`
22+
- `blastx`
23+
- `makeblastdb`
24+
25+
26+
27+
Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBLAST`.
28+
29+
30+
31+
32+
# How NCBIBlast.jl works
33+
34+
35+
The keywords used in the tool are sent to the shell for running BLAST.
36+
37+
As stated on the Github [docs](https://github.com/BioJulia/NCBIBlast.jl), the julia call
38+
39+
```
40+
blastn(; query = "a_file.txt", db="mydb", out="results.txt")
41+
```
42+
43+
is sent to the shell as
44+
45+
```
46+
$ blastn -query a_file.txt -db mydb -out results.txt
47+
```
48+
49+
# BLAST databases
50+
Having a BLAST database is necessary to run BLAST locally.
51+
A BLAST database is constructed from FASTA files that serve as reference sequences.
52+
A database can be built using the following command:
53+
```
54+
makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl")
55+
```
56+
57+
More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/).
58+
59+
## Example: Building a local BLAST database and running the BLAST search
60+
61+
For our first example, we will replicate the example on the NCBIBlast.jl Github.
62+
63+
First, we will build a local database using a fasta file found in the NCBIBlast github repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)).
64+
65+
```
66+
makeblastdb(; in="assets/dna2.fasta", dbtype="nucl")
67+
68+
Building a new DB, current time: 03/16/2026 21:04:36
69+
New DB name: /Users/pintorson/Documents/tufts_bonham/daniellepinto/Projects/BioJulia/BioTutorials/cookbook/assets/dna2.fasta
70+
New DB title: assets/dna2.fasta
71+
Sequence type: Nucleotide
72+
Keep MBits: T
73+
Maximum file size: 3000000000B
74+
Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds.
75+
76+
77+
Process(`/Users/pintorson/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))
78+
```
79+
A new database was built in `assets`.
80+
81+
Now, we can define our query sequence.
82+
We can save the query string in memory (using IOBuffer) rather than reading in a FASTA file.
83+
84+
```
85+
buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")
86+
```
87+
88+
Now, we can run the BLAST search.
89+
The BLAST output format "6" means that the output will be tab-delimited text with no column names.
90+
91+
The BLAST output will be written into IO.
92+
```
93+
io = IOBuffer();
94+
blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6");
95+
seek(io, 0);
96+
```
97+
The command `seek(io,0)` moves the cursor of to the start of the captured object (index 0) so it can be read into a dataframe.
98+
99+
100+
```
101+
using CSV, DataFrames
102+
CSV.read(io, DataFrame; header=false)
103+
104+
1×12 DataFrame
105+
Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12
106+
│ String7 String7 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Float64
107+
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
108+
1 │ Query_1 Test1 100.0 38 0 0 1 38 82 119 5.64e-18 71.3
109+
```
110+
111+
### Interpreting BLAST Output
112+
This output tells us that the query sequence (`Query_1` is the default name since we didn't specify a name) matches `Test1` in the reference database.
113+
There is 100% identity on a region that is 38 nucleotides long.
114+
There are 0 mismatches or gap openings.
115+
The match starts at index 1 on the query sequence, and ends at index 82.
116+
This region matches a region on the `Test1` that spans from index 82 to 119.
117+
The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occured simply due to chance.
118+
119+
Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html):
120+
> The Expect value (E) is a parameter that describes the number of
121+
> hits one can “expect” to see by chance when searching a database of
122+
> a particular size. It decreases exponentially as the Score (S) of
123+
> the match increases.
124+
> The lower the E-value the more “significant” the match is. However
125+
> keep in mind that virtually identical short alignments have
126+
> relatively high E values. This is because the calculation of the E
127+
> value takes into account the length of the query sequence. These
128+
> high E values make sense because shorter sequences have a higher
129+
> probability of occurring in the database purely by chance.
130+
131+
132+
133+
The bitscore is 71.3.
134+
135+
Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/):
136+
> The bit score, S', is derived from the raw alignment score, S,
137+
> taking the statistical properties of the scoring system into account.
138+
> Because bit scores are normalized with respect to the scoring system, they
139+
> can be used to compare alignment scores from different searches.
140+
141+
142+
## Example: BLASTing the _mecA1_ gene against all of NCBI
143+
Now that we've tried BLASTing against a local database,
144+
let's try BLASTing a piece of the _mecA_ gene against NCBI.
145+
To create the query file `mecA_BLAST.fasta`,
146+
I randomly selected 140 nucleotides from `mecA.fasta`.
147+
148+
We should see that the query fasta is a direct hit to the _mecA_ gene
149+
(specifically to the sequence upload that we pulled from NCBI).
150+
151+
For this BLAST search, I will search against the `core_nt` database,
152+
which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database.
153+
This newer database is the default as of August 2024.
154+
It seeks to reduce redundancy and reduce storage when downloading the database. More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/).
155+
156+
General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html).
157+
158+
```
159+
io = IOBuffer();
160+
blastn("assets/mecA_BLAST.fasta"; db="core_nt", stdout=io, remote=true, outfmt="6 query subject expect")
161+
seek(io, 0);
162+
CSV.read(io, DataFrame; header=false)
163+
```
164+
165+
Here is the output of the BLAST.
166+
167+
```
168+
julia> CSV.read(io, DataFrame; header=false)
169+
500×12 DataFrame
170+
Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12
171+
│ String15 String15 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Int64
172+
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
173+
1 │ mecA_BLAST CP026646.1 100.0 140 0 0 1 140 46719 46580 7.12e-65 259
174+
2 │ mecA_BLAST CP154497.1 100.0 140 0 0 1 140 45702 45563 7.12e-65 259
175+
3 │ mecA_BLAST CP049499.1 100.0 140 0 0 1 140 61014 60875 7.12e-65 259
176+
4 │ mecA_BLAST CP030403.1 100.0 140 0 0 1 140 48633 48494 7.12e-65 259
177+
5 │ mecA_BLAST CP132494.1 100.0 140 0 0 1 140 1867742 1867603 7.12e-65 259
178+
6 │ mecA_BLAST MH798864.1 100.0 140 0 0 1 140 281 420 7.12e-65 259
179+
7 │ mecA_BLAST CP162442.1 100.0 140 0 0 1 140 503005 503144 7.12e-65 259
180+
8 │ mecA_BLAST OR082611.1 100.0 140 0 0 1 140 6415 6276 7.12e-65 259
181+
9 │ mecA_BLAST CP053183.1 100.0 140 0 0 1 140 41607 41468 7.12e-65 259
182+
10 │ mecA_BLAST CP085310.1 100.0 140 0 0 1 140 1610215 1610076 7.12e-65 259
183+
11 │ mecA_BLAST CP162465.1 100.0 140 0 0 1 140 1140196 1140057 7.12e-65 259
184+
12 │ mecA_BLAST CP133660.1 100.0 140 0 0 1 140 2821019 2821158 7.12e-65 259
185+
13 │ mecA_BLAST CP049476.1 100.0 140 0 0 1 140 40314 40175 7.12e-65 259
186+
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
187+
489 │ mecA_BLAST CP093527.1 100.0 140 0 0 1 140 42814 42675 7.12e-65 259
188+
490 │ mecA_BLAST CP035541.1 100.0 140 0 0 1 140 72431 72292 7.12e-65 259
189+
491 │ mecA_BLAST CP145216.2 100.0 140 0 0 1 140 45830 45691 7.12e-65 259
190+
492 │ mecA_BLAST CP193734.1 100.0 140 0 0 1 140 64595 64456 7.12e-65 259
191+
493 │ mecA_BLAST CP083210.2 100.0 140 0 0 1 140 43331 43192 7.12e-65 259
192+
494 │ mecA_BLAST MW052033.1 100.0 140 0 0 1 140 245 384 7.12e-65 259
193+
495 │ mecA_BLAST CP168087.1 100.0 140 0 0 1 140 40806 40667 7.12e-65 259
194+
496 │ mecA_BLAST CP150769.1 100.0 140 0 0 1 140 2583517 2583378 7.12e-65 259
195+
497 │ mecA_BLAST CP030596.1 100.0 140 0 0 1 140 40848 40709 7.12e-65 259
196+
498 │ mecA_BLAST MZ398128.1 100.0 140 0 0 1 140 16977 17116 7.12e-65 259
197+
499 │ mecA_BLAST CP083199.2 100.0 140 0 0 1 140 40078 39939 7.12e-65 259
198+
500 │ mecA_BLAST CP162663.1 100.0 140 0 0 1 140 938360 938499 7.12e-65 259
199+
475 rows omitted
200+
```
201+
There are 500 hits with the same E-value and Bitscore.
202+
This likely means that this sequence is an exact match to these 500 sequences in NCBI.
203+
Because of this, the first row in the results is not necessarily a better match than the 500th,
204+
even though it appears first.
205+
206+
To verify the first hit, we can look up the GenBankID of the first hit: `CP026646.1`.
207+
The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) listing this sample confirms that this sample was phenotyped as _S. aureus_.
208+
Our query matches from indices 46719 to 46580.
209+
When we use the Graphics feature to visualize gene annotations, we see that there is a clear match to _mecA_.
210+
211+
![BLAST Graphics](assets/mecA_BLAST.png)
212+
213+
Overall, this confirms that our BLAST worked correctly!

cookbook/assets/mecA_BLAST.fasta

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
>mecA_BLAST file that is used for BLAST tutorial
2+
GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT
3+
TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA

cookbook/assets/mecA_BLAST.png

384 KB
Loading

0 commit comments

Comments
 (0)