cookbook/03-blast.md
Lines changed: 50 additions & 45 deletions
@@ -6,34 +6,32 @@ rss_descr = "Using NCBIBlast.jl to run BLAST searches"
 +++

 # Introduction to BLAST
-A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.
-It can be helpful for quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function.
+A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.
+It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identification or gene function assignment.

 More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/).

-BLAST searches can be run from the command line interface (CLI) or through BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).
-A user can simply copy in a nucleotide sequence and search for the best match in NCBI!
+BLAST searches can be run from the command line interface (CLI) or through the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).
+A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match!
 While searching from the website is fast and straightforward,
-it only searches against the NCBI databases.
-The CLI allows users to query against both NCBI databases and custom databases.
+running searches from the website only performs searches against the NCBI databases.
+The CLI allows users to query both NCBI databases and custom databases.

 `NCBIBlast.jl` is a thin wrapper around the BLAST command line tool,
-allowing users to run the tool within Julia.
+allowing users to run the tool within Julia.
 The following BLAST tools are supported by `NCBIBlast`:
 - `blastn`
 - `blastp`
 - `tblastn`
 - `blastx`
 - `makeblastdb`

-
-
 Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBlast`.




-# How NCBIBlast.jl works
+# How `NCBIBlast.jl` works


 The keywords used in the tool are sent to the shell for running BLAST.
 More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/).

+
 ## Example: Building a local BLAST database and running the BLAST search

-For our first example, we will replicate the example on the `NCBIBlast.jl`Github.
+For our first example, we will replicate the example on the `NCBIBlast.jl` GitHub repository.

 First, we will build a local database using a FASTA file found in the NCBIBlast GitHub repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)).
+This file has been downloaded into `assets` as `dna2.fasta`.
-This output tells us that the query sequence (`Query_1` is the default name since we didn't specify a name) matches `Test1` in the reference database.
-There is 100% identity on a region that is 38 nucleotides long.
+This output tells us that the query sequence
+(`Query_1` is the default name for the sequence because we didn't specify a name)
+matches `Test1` in the reference database.
+There is 100% identity between the query and a region on `Test1` that is 38 nucleotides long.
 There are 0 mismatches or gap openings.
 The match starts at index 1 on the query sequence, and ends at index 38.
 This region matches a region in the `Test1` sequence spanning from index 82 to 119.
 The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occurred simply due to chance.
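The fields walked through above follow the column order of BLAST's default tabular format (`-outfmt 6`): `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`. As a minimal sketch, assuming the output was produced in that format, the row below is reconstructed from the values quoted in the text. The query end of 38 is inferred from the 38-nucleotide alignment length with no gaps, and `parse_hit` is a hypothetical helper, not part of `NCBIBlast.jl`.

```julia
# Hypothetical helper: parse one tab-separated `-outfmt 6` row into a NamedTuple.
function parse_hit(row::AbstractString)
    cols = split(strip(row), '\t')
    length(cols) == 12 || error("expected 12 columns, got $(length(cols))")
    return (
        qseqid   = String(cols[1]),          # query name
        sseqid   = String(cols[2]),          # subject (database sequence) name
        pident   = parse(Float64, cols[3]),  # percent identity
        alnlen   = parse(Int, cols[4]),      # alignment length
        mismatch = parse(Int, cols[5]),
        gapopen  = parse(Int, cols[6]),
        qstart   = parse(Int, cols[7]),
        qend     = parse(Int, cols[8]),
        sstart   = parse(Int, cols[9]),
        s_end    = parse(Int, cols[10]),
        evalue   = parse(Float64, cols[11]),
        bitscore = parse(Float64, cols[12]),
    )
end

# Row reconstructed from the values in the text (qend = 38 is an assumption).
row = "Query_1\tTest1\t100.000\t38\t0\t0\t1\t38\t82\t119\t5.64e-18\t71.3"
hit = parse_hit(row)

# With no gaps, both coordinate ranges should span the alignment length.
@assert hit.qend - hit.qstart + 1 == hit.alnlen
@assert hit.s_end - hit.sstart + 1 == hit.alnlen
```

Checking that both coordinate ranges agree with the alignment length is a quick sanity test when reading tabular BLAST output by hand.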

 Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html):
-> The Expect value (E) is a parameter that describes the number of
-> hits one can “expect” to see by chance when searching a database of
-> a particular size. It decreases exponentially as the Score (S) of
-> the match increases.
-> The lower the E-value the more “significant” the match is. However
-> keep in mind that virtually identical short alignments have
-> relatively high E values. This is because the calculation of the E
-> value takes into account the length of the query sequence. These
-> high E values make sense because shorter sequences have a higher
-> probability of occurring in the database purely by chance.
+> The Expect value (E) is a parameter that describes the number of hits one can “expect” to see
+> by chance when searching a database of a particular size.
+> It decreases exponentially as the Score (S) of the match increases.
+> The lower the E-value the more “significant” the match is.
+> However keep in mind that virtually identical short alignments have relatively high E values.
+> This is because the calculation of the E value takes into account the length of the query sequence.
+> These high E values make sense because shorter sequences
+> have a higher probability of occurring in the database purely by chance.



 The bitscore is 71.3.

 Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/):
-> The bit score, S', is derived from the raw alignment score, S,
-> taking the statistical properties of the scoring system into account.
-> Because bit scores are normalized with respect to the scoring system, they
-> can be used to compare alignment scores from different searches.
+> The bit score, S',
+> is derived from the raw alignment score, S,
+> taking the statistical properties of the scoring system into account.
+> Because bit scores are normalized with respect to the scoring system,
+> they can be used to compare alignment scores from different searches.
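The two quoted definitions are linked by the Karlin-Altschul statistics that BLAST uses: the bit score is `S' = (λS − ln K) / ln 2`, and the E-value is `E = m · n · 2^(−S')`, where `m` and `n` are the effective query and database lengths. The sketch below illustrates that relationship only. The function names and the values of `lambda`, `K`, `m`, and `n` are hypothetical, and because BLAST applies length corrections internally, it is not expected to reproduce the exact numbers BLAST reports.

```julia
# Bit score S' from a raw score S, given the scoring system's
# Karlin-Altschul parameters lambda and K.
bit_score(S, lambda, K) = (lambda * S - log(K)) / log(2)

# Expected number of chance hits with at least this bit score in a search
# space of m (effective query length) by n (effective database length).
expect_value(bits, m, n) = m * n * 2.0^(-bits)

# Hypothetical parameters, chosen only for illustration.
lambda, K = 0.5, 0.1
m, n = 100, 10_000

bits = bit_score(40, lambda, K)

# Each additional bit halves the E-value in the same search space.
@assert expect_value(bits + 1, m, n) ≈ expect_value(bits, m, n) / 2

# A larger search space raises the E-value for the same bit score.
@assert expect_value(bits, m, 10 * n) ≈ 10 * expect_value(bits, m, n)
```

Because `lambda` and `K` are folded into the bit score, E-values computed from bit scores depend only on the search space, which is why bit scores can be compared across searches.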


 ## Example: BLASTing the _mecA1_ gene against all of NCBI
-Now that we've tried BLASTing against a local database,
-let's try BLASTing a piece of the _mecA_ gene against NCBI.
+Now that we've tried BLAST'ing against a local, custom database,
+let's try BLAST'ing a piece of the _mecA_ gene against NCBI.
 To create the query file `mecA_BLAST.fasta`,
 I randomly selected 140 nucleotides from `mecA.fasta`.

-We should see that the query fasta is a direct hit to the _mecA_ gene
-(specifically to the sequence upload that we pulled from NCBI).
+We should see that the query FASTA is a direct hit to the _mecA_ gene
+(one of the NCBI hits should definitely be the NCBI sample `NG_047945.1`,
+which is the sample the gene fragment was extracted from).

-For this BLAST search, I will search against the `core_nt` database,
+For this BLAST search, I will search against the `core_nt` database,
 which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database.
 This newer database is the default as of August 2024.
-It seeks to reduce redundancy and storage requirements when downloading the database.
+It seeks to reduce redundancy and storage requirements when downloading.
 More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/).

 General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html).
@@ -208,10 +212,11 @@ This likely means that this sequence is an exact match to these 500 sequences in
 Because of this, the first row in the results is not necessarily a better match than the 500th,
 even though it appears first.

-To verify the first hit, we can look up the GenBankID of the first hit: `CP026646.1`.
+To verify the first hit, we can look up the GenBankID of the first row: `CP026646.1`.
 The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) for this sample confirms that this sample was phenotyped as _S. aureus_.
-Our query matches from indices 46719 to 46580.
-When we use the Graphics feature to visualize gene annotations, we see that there is a clear match to _mecA_.
+Our query matches indices 46719 to 46580 on this reference genome.
+When we use the Graphics feature to visualize gene annotations on the reference genome,
+we see that there is a clear match to _mecA_ in the region that matches the query.
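The matched indices run from a larger position to a smaller one. In `blastn` tabular output, a subject start greater than the subject end means the hit lies on the reverse (minus) strand of the subject. As a small sketch, assuming 46719 and 46580 are the subject start and end reported by BLAST, the hypothetical helpers below recover the span and strand of the hit:

```julia
# Hypothetical helpers for interpreting the subject coordinates of a blastn hit.
hit_span(s_start, s_end) = abs(s_end - s_start) + 1
hit_strand(s_start, s_end) = s_start <= s_end ? :plus : :minus

# The span equals the 140 nucleotides selected for the query.
@assert hit_span(46719, 46580) == 140

# Start > end, so the query aligns to the reverse strand of the genome.
@assert hit_strand(46719, 46580) == :minus
```

A span equal to the full query length is consistent with the whole 140-nucleotide fragment aligning to this region of the genome.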