Skip to content

Commit fff4769

Browse files
Merge pull request #31 from BioJulia/2026-03-15-blast
mecA BLAST example
2 parents 7825363 + b0bde72 commit fff4769

File tree

3 files changed

+231
-0
lines changed

3 files changed

+231
-0
lines changed

cookbook/03-blast.md

Lines changed: 228 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,228 @@
1+
+++
2+
using Dates
3+
date = Date("2026-03-15")
4+
title = "BLAST"
5+
rss_descr = "Using NCBIBlast.jl to run BLAST searches"
6+
+++
7+
8+
# Introduction to BLAST
9+
A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.
10+
It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function.
11+
12+
More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/).
13+
14+
BLAST searches can be run from the command line interface (CLI) or through the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).
15+
A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match!
16+
While searching from the website is fast and straightforward,
17+
running searches from the website only performs searches against the NCBI databases.
18+
The CLI allows users to query both NCBI databases and custom databases.
19+
20+
`NCBIBlast.jl` is a thin wrapper around the BLAST command line tool,
21+
allowing users to run the tool within Julia.
22+
The following BLAST tools are supported by `NCBIBlast`:
23+
- `blastn`
24+
- `blastp`
25+
- `tblastn`
26+
- `blastx`
27+
- `makeblastdb`
28+
29+
A benefit of using the package is that it uses Julia's powerful `BinaryBuilder.jl` architechture to bundle the `BLAST+` executables
30+
so that you don't have to deal with that.
31+
In other words, if you do add `NCBIBLAST.jl`,
32+
you don't need to have the command line tools available.
33+
34+
Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBLAST`.
35+
36+
37+
38+
39+
# How `NCBIBlast.jl` works
40+
41+
42+
The keywords used in the tool are sent to the shell for running BLAST.
43+
44+
As stated on the GitHub [docs](https://github.com/BioJulia/NCBIBlast.jl), the Julia call
45+
46+
```
47+
blastn(; query = "a_file.txt", db="mydb", out="results.txt")
48+
```
49+
50+
is sent to the shell as
51+
52+
```
53+
$ blastn -query a_file.txt -db mydb -out results.txt
54+
```
55+
56+
# BLAST databases
57+
Just like running a BLAST search from the CLI, `NCBIBlast.jl` requires a BLAST database to search against.
58+
The user can build a local, custom database,
59+
or search against a specific NCBI database.
60+
A custom BLAST database is constructed from FASTA files that serve as reference sequences.
61+
A database can be built using the following command in `NCBIBlast.jl`:
62+
```
63+
makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl")
64+
```
65+
66+
More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/).
67+
68+
69+
## Example: Building a local BLAST database and running the BLAST search
70+
71+
For our first example, we will replicate the example on the `NCBIBlast.jl` GitHub repository.
72+
73+
First, we will build a local database using a FASTA file found in the NCBIBlast github repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)).
74+
This file has been downloaded into `assets` as `dna2.fasta`.
75+
76+
```
77+
makeblastdb(; in="assets/dna2.fasta", dbtype="nucl")
78+
79+
Building a new DB, current time: 03/16/2026 21:04:36
80+
New DB name: /LOCAL/PATH/BioTutorials/cookbook/assets/dna2.fasta
81+
New DB title: assets/dna2.fasta
82+
Sequence type: Nucleotide
83+
Keep MBits: T
84+
Maximum file size: 3000000000B
85+
Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds.
86+
87+
88+
Process(`/Users/USER/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))
89+
```
90+
A new database was built in `assets`.
91+
92+
Now, we can define our query sequence.
93+
We can save the query string in memory (using `IOBuffer`) rather than reading in a FASTA file.
94+
95+
```
96+
buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")
97+
```
98+
99+
Now, we can run the BLAST search.
100+
The BLAST output format "6" means that the output will be tab-delimited text with no column names.
101+
102+
The BLAST output will be written into I/O.
103+
```
104+
io = IOBuffer();
105+
blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6");
106+
seek(io, 0);
107+
```
108+
The command `seek(io,0)` moves the cursor to the start of the captured object (index 0)
109+
so it can be read into a dataframe via the [`DataFrames.jl`](https://dataframes.juliadata.org/stable/) package.
110+
111+
```
112+
using CSV, DataFrames
113+
CSV.read(io, DataFrame; header=false)
114+
115+
1×12 DataFrame
116+
Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12
117+
│ String7 String7 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Float64
118+
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
119+
1 │ Query_1 Test1 100.0 38 0 0 1 38 82 119 5.64e-18 71.3
120+
```
121+
122+
### Interpreting BLAST Output
123+
This output tells us that the query sequence
124+
(`Query_1` is the default name for the sequence because we didn't specify a name)
125+
matches `Test1` in the reference database.
126+
There is 100% identity between the query and a region on `Test1` that is 38 nucleotides long.
127+
There are 0 mismatches or gap openings.
128+
The match starts at index 1 on the query sequence, and ends at index 82.
129+
This region matches a region in the `Test1` sequence spanning from index 82 to 119.
130+
The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occurred simply due to chance.
131+
132+
Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html):
133+
> The Expect value (E) is a parameter that describes the number of hits one can “expect” to see
134+
> by chance when searching a database of a particular size.
135+
> It decreases exponentially as the Score (S) of the match increases.
136+
> The lower the E-value the more “significant” the match is.
137+
> However keep in mind that virtually identical short alignments have relatively high E values.
138+
> This is because the calculation of the E value takes into account the length of the query sequence.
139+
> These high E values make sense because shorter sequences
140+
> have a higher probability of occurring in the database purely by chance.
141+
142+
143+
144+
The bitscore is 71.3.
145+
146+
Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/):
147+
> The bit score, S',
148+
> is derived from the raw alignment score, S,
149+
> taking the statistical properties of the scoring system into account.
150+
> Because bit scores are normalized with respect to the scoring system,
151+
> they can be used to compare alignment scores from different searches.
152+
153+
154+
## Example: BLASTing the _mecA1_ gene against all of NCBI
155+
Now that we've tried BLAST'ing against a local, custom database,
156+
let's try BLAST'ing a piece of the _mecA_ gene against NCBI.
157+
To create the query file `mecA_BLAST.fasta`,
158+
I randomly selected 140 nucleotides from `mecA.fasta`.
159+
160+
We should see that the query FASTA is a direct hit to the _mecA_ gene
161+
(one of the NCBI hits should definitely be the NCBI sample `NG_047945.1`,
162+
which is the sample the gene fragment was extracted from).
163+
164+
For this BLAST search, I will search against the `core_nt` database,
165+
which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database.
166+
This newer database is the default as of August 2024.
167+
It seeks to reduce redundancy and storage requirements when downloading.
168+
More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/).
169+
170+
General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html).
171+
172+
```
173+
io = IOBuffer();
174+
blastn("assets/mecA_BLAST.fasta"; db="core_nt", stdout=io, remote=true, outfmt="6 query subject expect")
175+
seek(io, 0);
176+
CSV.read(io, DataFrame; header=false)
177+
```
178+
179+
Here is the output of the BLAST.
180+
181+
```
182+
julia> CSV.read(io, DataFrame; header=false)
183+
500×12 DataFrame
184+
Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12
185+
│ String15 String15 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Int64
186+
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
187+
1 │ mecA_BLAST CP026646.1 100.0 140 0 0 1 140 46719 46580 7.12e-65 259
188+
2 │ mecA_BLAST CP154497.1 100.0 140 0 0 1 140 45702 45563 7.12e-65 259
189+
3 │ mecA_BLAST CP049499.1 100.0 140 0 0 1 140 61014 60875 7.12e-65 259
190+
4 │ mecA_BLAST CP030403.1 100.0 140 0 0 1 140 48633 48494 7.12e-65 259
191+
5 │ mecA_BLAST CP132494.1 100.0 140 0 0 1 140 1867742 1867603 7.12e-65 259
192+
6 │ mecA_BLAST MH798864.1 100.0 140 0 0 1 140 281 420 7.12e-65 259
193+
7 │ mecA_BLAST CP162442.1 100.0 140 0 0 1 140 503005 503144 7.12e-65 259
194+
8 │ mecA_BLAST OR082611.1 100.0 140 0 0 1 140 6415 6276 7.12e-65 259
195+
9 │ mecA_BLAST CP053183.1 100.0 140 0 0 1 140 41607 41468 7.12e-65 259
196+
10 │ mecA_BLAST CP085310.1 100.0 140 0 0 1 140 1610215 1610076 7.12e-65 259
197+
11 │ mecA_BLAST CP162465.1 100.0 140 0 0 1 140 1140196 1140057 7.12e-65 259
198+
12 │ mecA_BLAST CP133660.1 100.0 140 0 0 1 140 2821019 2821158 7.12e-65 259
199+
13 │ mecA_BLAST CP049476.1 100.0 140 0 0 1 140 40314 40175 7.12e-65 259
200+
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
201+
489 │ mecA_BLAST CP093527.1 100.0 140 0 0 1 140 42814 42675 7.12e-65 259
202+
490 │ mecA_BLAST CP035541.1 100.0 140 0 0 1 140 72431 72292 7.12e-65 259
203+
491 │ mecA_BLAST CP145216.2 100.0 140 0 0 1 140 45830 45691 7.12e-65 259
204+
492 │ mecA_BLAST CP193734.1 100.0 140 0 0 1 140 64595 64456 7.12e-65 259
205+
493 │ mecA_BLAST CP083210.2 100.0 140 0 0 1 140 43331 43192 7.12e-65 259
206+
494 │ mecA_BLAST MW052033.1 100.0 140 0 0 1 140 245 384 7.12e-65 259
207+
495 │ mecA_BLAST CP168087.1 100.0 140 0 0 1 140 40806 40667 7.12e-65 259
208+
496 │ mecA_BLAST CP150769.1 100.0 140 0 0 1 140 2583517 2583378 7.12e-65 259
209+
497 │ mecA_BLAST CP030596.1 100.0 140 0 0 1 140 40848 40709 7.12e-65 259
210+
498 │ mecA_BLAST MZ398128.1 100.0 140 0 0 1 140 16977 17116 7.12e-65 259
211+
499 │ mecA_BLAST CP083199.2 100.0 140 0 0 1 140 40078 39939 7.12e-65 259
212+
500 │ mecA_BLAST CP162663.1 100.0 140 0 0 1 140 938360 938499 7.12e-65 259
213+
475 rows omitted
214+
```
215+
There are 500 hits with the same E-value and Bitscore.
216+
This likely means that this sequence is an exact match to these 500 sequences in NCBI.
217+
Because of this, the first row in the results is not necessarily a better match than the 500th,
218+
even though it appears first.
219+
220+
To verify the first hit, we can look up the GenBankID of the first row: `CP026646.1`.
221+
The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) for this sample confirms that this sample was phenotyped as _S. aureus_.
222+
Our query matches indices 46719 to 46580 on this reference genome.
223+
When we use the Graphics feature to visualize gene annotations on the reference genome,
224+
we see that there is a clear match to _mecA_ in the region that matches the query.
225+
226+
![BLAST Graphics](assets/mecA_BLAST.png)
227+
228+
Overall, this confirms that our BLAST worked as corrected!

cookbook/assets/mecA_BLAST.fasta

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
>mecA_BLAST file that is used for BLAST tutorial
2+
GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT
3+
TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA

cookbook/assets/mecA_BLAST.png

384 KB
Loading

0 commit comments

Comments
 (0)