Skip to content

Commit c78a6c2

Browse files
final language tweaks
1 parent 6a35d32 commit c78a6c2

File tree

1 file changed

+50
-45
lines changed

1 file changed

+50
-45
lines changed

cookbook/03-blast.md

Lines changed: 50 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -6,34 +6,32 @@ rss_descr = "Using NCBIBlast.jl to run BLAST searches"
66
+++
77

88
# Introduction to BLAST
9-
A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.
10-
It can be helpful for quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function.
9+
A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.
10+
It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function.
1111

1212
More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/).
1313

14-
BLAST searches can be run from the command line interface (CLI) or through BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).
15-
A user can simply copy in a nucleotide sequence and search for the best match in NCBI!
14+
BLAST searches can be run from the command line interface (CLI) or through the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).
15+
A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match!
1616
While searching from the website is fast and straightforward,
17-
it only searches against the NCBI databases.
18-
The CLI allows users to query against both NCBI databases and custom databases.
17+
running searches from the website only performs searches against the NCBI databases.
18+
The CLI allows users to query both NCBI databases and custom databases.
1919

2020
`NCBIBlast.jl` is a thin wrapper around the BLAST command line tool,
21-
allowing users to run the tool within Julia.
21+
allowing users to run the tool within Julia.
2222
The following BLAST tools are supported by `NCBIBlast`:
2323
- `blastn`
2424
- `blastp`
2525
- `tblastn`
2626
- `blastx`
2727
- `makeblastdb`
2828

29-
30-
3129
Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBLAST`.
3230

3331

3432

3533

36-
# How NCBIBlast.jl works
34+
# How `NCBIBlast.jl` works
3735

3836

3937
The keywords used in the tool are sent to the shell for running BLAST.
@@ -51,39 +49,43 @@ $ blastn -query a_file.txt -db mydb -out results.txt
5149
```
5250

5351
# BLAST databases
54-
Having a BLAST database is necessary to run BLAST locally.
55-
A BLAST database is constructed from FASTA files that serve as reference sequences.
56-
A database can be built using the following command:
52+
Just like running a BLAST search from the CLI, `NCBIBlast.jl` requires a BLAST database to search against.
53+
The user can build a local, custom database,
54+
or search against a specific NCBI database.
55+
A custom BLAST database is constructed from FASTA files that serve as reference sequences.
56+
A database can be built using the following command in `NCBIBlast.jl`:
5757
```
5858
makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl")
5959
```
6060

6161
More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/).
6262

63+
6364
## Example: Building a local BLAST database and running the BLAST search
6465

65-
For our first example, we will replicate the example on the `NCBIBlast.jl` Github.
66+
For our first example, we will replicate the example on the `NCBIBlast.jl` GitHub repository.
6667

6768
First, we will build a local database using a FASTA file found in the NCBIBlast github repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)).
69+
This file has been downloaded into `assets` as `dna2.fasta`.
6870

6971
```
7072
makeblastdb(; in="assets/dna2.fasta", dbtype="nucl")
7173
7274
Building a new DB, current time: 03/16/2026 21:04:36
73-
New DB name: /Users/pintorson/Documents/tufts_bonham/daniellepinto/Projects/BioJulia/BioTutorials/cookbook/assets/dna2.fasta
75+
New DB name: /LOCAL/PATH/BioTutorials/cookbook/assets/dna2.fasta
7476
New DB title: assets/dna2.fasta
7577
Sequence type: Nucleotide
7678
Keep MBits: T
7779
Maximum file size: 3000000000B
7880
Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds.
7981
8082
81-
Process(`/Users/pintorson/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))
83+
Process(`/Users/USER/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))
8284
```
8385
A new database was built in `assets`.
8486

8587
Now, we can define our query sequence.
86-
We can save the query string in memory (using IOBuffer) rather than reading in a FASTA file.
88+
We can save the query string in memory (using `IOBuffer`) rather than reading in a FASTA file.
8789

8890
```
8991
buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")
@@ -92,14 +94,14 @@ buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")
9294
Now, we can run the BLAST search.
9395
The BLAST output format "6" means that the output will be tab-delimited text with no column names.
9496

95-
The BLAST output will be written into IO.
97+
The BLAST output will be written into I/O.
9698
```
9799
io = IOBuffer();
98100
blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6");
99101
seek(io, 0);
100102
```
101-
The command `seek(io,0)` moves the cursor to the start of the captured object (index 0) so it can be read into a dataframe.
102-
103+
The command `seek(io,0)` moves the cursor to the start of the captured object (index 0)
104+
so it can be read into a dataframe via the [`DataFrames.jl`](https://dataframes.juliadata.org/stable/) package.
103105

104106
```
105107
using CSV, DataFrames
@@ -113,49 +115,51 @@ CSV.read(io, DataFrame; header=false)
113115
```
114116

115117
### Interpreting BLAST Output
116-
This output tells us that the query sequence (`Query_1` is the default name since we didn't specify a name) matches `Test1` in the reference database.
117-
There is 100% identity on a region that is 38 nucleotides long.
118+
This output tells us that the query sequence
119+
(`Query_1` is the default name for the sequence because we didn't specify a name)
120+
matches `Test1` in the reference database.
121+
There is 100% identity between the query and a region on `Test1` that is 38 nucleotides long.
118122
There are 0 mismatches or gap openings.
119123
The match starts at index 1 on the query sequence, and ends at index 82.
120124
This region matches a region in the `Test1` sequence spanning from index 82 to 119.
121125
The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occurred simply due to chance.
122126

123127
Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html):
124-
> The Expect value (E) is a parameter that describes the number of
125-
> hits one can “expect” to see by chance when searching a database of
126-
> a particular size. It decreases exponentially as the Score (S) of
127-
> the match increases.
128-
> The lower the E-value the more “significant” the match is. However
129-
> keep in mind that virtually identical short alignments have
130-
> relatively high E values. This is because the calculation of the E
131-
> value takes into account the length of the query sequence. These
132-
> high E values make sense because shorter sequences have a higher
133-
> probability of occurring in the database purely by chance.
128+
> The Expect value (E) is a parameter that describes the number of hits one can “expect” to see
129+
> by chance when searching a database of a particular size.
130+
> It decreases exponentially as the Score (S) of the match increases.
131+
> The lower the E-value the more “significant” the match is.
132+
> However keep in mind that virtually identical short alignments have relatively high E values.
133+
> This is because the calculation of the E value takes into account the length of the query sequence.
134+
> These high E values make sense because shorter sequences
135+
> have a higher probability of occurring in the database purely by chance.
134136
135137

136138

137139
The bitscore is 71.3.
138140

139141
Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/):
140-
> The bit score, S', is derived from the raw alignment score, S,
141-
> taking the statistical properties of the scoring system into account.
142-
> Because bit scores are normalized with respect to the scoring system, they
143-
> can be used to compare alignment scores from different searches.
142+
> The bit score, S',
143+
> is derived from the raw alignment score, S,
144+
> taking the statistical properties of the scoring system into account.
145+
> Because bit scores are normalized with respect to the scoring system,
146+
> they can be used to compare alignment scores from different searches.
144147
145148

146149
## Example: BLASTing the _mecA1_ gene against all of NCBI
147-
Now that we've tried BLASTing against a local database,
148-
let's try BLASTing a piece of the _mecA_ gene against NCBI.
150+
Now that we've tried BLAST'ing against a local, custom database,
151+
let's try BLAST'ing a piece of the _mecA_ gene against NCBI.
149152
To create the query file `mecA_BLAST.fasta`,
150153
I randomly selected 140 nucleotides from `mecA.fasta`.
151154

152-
We should see that the query fasta is a direct hit to the _mecA_ gene
153-
(specifically to the sequence upload that we pulled from NCBI).
155+
We should see that the query FASTA is a direct hit to the _mecA_ gene
156+
(one of the NCBI hits should definitely be the NCBI sample `NG_047945.1`,
157+
which is the sample the gene fragment was extracted from).
154158

155-
For this BLAST search, I will search against the `core_nt` database,
159+
For this BLAST search, I will search against the `core_nt` database,
156160
which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database.
157161
This newer database is the default as of August 2024.
158-
It seeks to reduce redundancy and storage requirements when downloading the database.
162+
It seeks to reduce redundancy and storage requirements when downloading.
159163
More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/).
160164

161165
General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html).
@@ -208,10 +212,11 @@ This likely means that this sequence is an exact match to these 500 sequences in
208212
Because of this, the first row in the results is not necessarily a better match than the 500th,
209213
even though it appears first.
210214

211-
To verify the first hit, we can look up the GenBankID of the first hit: `CP026646.1`.
215+
To verify the first hit, we can look up the GenBankID of the first row: `CP026646.1`.
212216
The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) for this sample confirms that this sample was phenotyped as _S. aureus_.
213-
Our query matches from indices 46719 to 46580.
214-
When we use the Graphics feature to visualize gene annotations, we see that there is a clear match to _mecA_.
217+
Our query matches indices 46719 to 46580 on this reference genome.
218+
When we use the Graphics feature to visualize gene annotations on the reference genome,
219+
we see that there is a clear match to _mecA_ in the region that matches the query.
215220

216221
![BLAST Graphics](assets/mecA_BLAST.png)
217222

0 commit comments

Comments
 (0)