|
| 1 | ++++ |
| 2 | +using Dates |
| 3 | +date = Date("2026-03-15") |
| 4 | +title = "BLAST" |
| 5 | +rss_descr = "Using NCBIBlast.jl to run BLAST searches" |
| 6 | ++++ |
| 7 | + |
| 8 | +# Introduction to BLAST |
| 9 | +A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences. |
| 10 | +It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function. |
| 11 | + |
| 12 | +More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/). |
| 13 | + |
| 14 | +BLAST searches can be run from the command line interface (CLI) or through the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi). |
| 15 | +A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match! |
| 16 | +While searching from the website is fast and straightforward, |
| 17 | +running searches from the website only performs searches against the NCBI databases. |
| 18 | +The CLI allows users to query both NCBI databases and custom databases. |
| 19 | + |
| 20 | +`NCBIBlast.jl` is a thin wrapper around the BLAST command line tool, |
| 21 | +allowing users to run the tool within Julia. |
| 22 | +The following BLAST tools are supported by `NCBIBlast`: |
| 23 | +- `blastn` |
| 24 | +- `blastp` |
| 25 | +- `tblastn` |
| 26 | +- `blastx` |
| 27 | +- `makeblastdb` |
| 28 | + |
| 29 | +A benefit of using the package is that it uses Julia's powerful `BinaryBuilder.jl` architechture to bundle the `BLAST+` executables |
| 30 | +so that you don't have to deal with that. |
| 31 | +In other words, if you do add `NCBIBLAST.jl`, |
| 32 | +you don't need to have the command line tools available. |
| 33 | + |
| 34 | +Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBLAST`. |
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | + |
| 39 | +# How `NCBIBlast.jl` works |
| 40 | + |
| 41 | + |
| 42 | +The keywords used in the tool are sent to the shell for running BLAST. |
| 43 | + |
| 44 | +As stated on the GitHub [docs](https://github.com/BioJulia/NCBIBlast.jl), the Julia call |
| 45 | + |
| 46 | +``` |
| 47 | +blastn(; query = "a_file.txt", db="mydb", out="results.txt") |
| 48 | +``` |
| 49 | + |
| 50 | +is sent to the shell as |
| 51 | + |
| 52 | +``` |
| 53 | +$ blastn -query a_file.txt -db mydb -out results.txt |
| 54 | +``` |
| 55 | + |
| 56 | +# BLAST databases |
| 57 | +Just like running a BLAST search from the CLI, `NCBIBlast.jl` requires a BLAST database to search against. |
| 58 | +The user can build a local, custom database, |
| 59 | +or search against a specific NCBI database. |
| 60 | +A custom BLAST database is constructed from FASTA files that serve as reference sequences. |
| 61 | +A database can be built using the following command in `NCBIBlast.jl`: |
| 62 | +``` |
| 63 | +makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl") |
| 64 | +``` |
| 65 | + |
| 66 | +More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/). |
| 67 | + |
| 68 | + |
| 69 | +## Example: Building a local BLAST database and running the BLAST search |
| 70 | + |
| 71 | +For our first example, we will replicate the example on the `NCBIBlast.jl` GitHub repository. |
| 72 | + |
| 73 | +First, we will build a local database using a FASTA file found in the NCBIBlast github repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)). |
| 74 | +This file has been downloaded into `assets` as `dna2.fasta`. |
| 75 | + |
| 76 | +``` |
| 77 | +makeblastdb(; in="assets/dna2.fasta", dbtype="nucl") |
| 78 | +
|
| 79 | +Building a new DB, current time: 03/16/2026 21:04:36 |
| 80 | +New DB name: /LOCAL/PATH/BioTutorials/cookbook/assets/dna2.fasta |
| 81 | +New DB title: assets/dna2.fasta |
| 82 | +Sequence type: Nucleotide |
| 83 | +Keep MBits: T |
| 84 | +Maximum file size: 3000000000B |
| 85 | +Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds. |
| 86 | +
|
| 87 | +
|
| 88 | +Process(`/Users/USER/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0)) |
| 89 | +``` |
| 90 | +A new database was built in `assets`. |
| 91 | + |
| 92 | +Now, we can define our query sequence. |
| 93 | +We can save the query string in memory (using `IOBuffer`) rather than reading in a FASTA file. |
| 94 | + |
| 95 | +``` |
| 96 | +buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG") |
| 97 | +``` |
| 98 | + |
| 99 | +Now, we can run the BLAST search. |
| 100 | +The BLAST output format "6" means that the output will be tab-delimited text with no column names. |
| 101 | + |
| 102 | +The BLAST output will be written into I/O. |
| 103 | +``` |
| 104 | +io = IOBuffer(); |
| 105 | +blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6"); |
| 106 | +seek(io, 0); |
| 107 | +``` |
| 108 | +The command `seek(io,0)` moves the cursor to the start of the captured object (index 0) |
| 109 | +so it can be read into a dataframe via the [`DataFrames.jl`](https://dataframes.juliadata.org/stable/) package. |
| 110 | + |
| 111 | +``` |
| 112 | +using CSV, DataFrames |
| 113 | +CSV.read(io, DataFrame; header=false) |
| 114 | +
|
| 115 | +1×12 DataFrame |
| 116 | + Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12 |
| 117 | + │ String7 String7 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Float64 |
| 118 | +─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 119 | + 1 │ Query_1 Test1 100.0 38 0 0 1 38 82 119 5.64e-18 71.3 |
| 120 | +``` |
| 121 | + |
| 122 | +### Interpreting BLAST Output |
| 123 | +This output tells us that the query sequence |
| 124 | +(`Query_1` is the default name for the sequence because we didn't specify a name) |
| 125 | +matches `Test1` in the reference database. |
| 126 | +There is 100% identity between the query and a region on `Test1` that is 38 nucleotides long. |
| 127 | +There are 0 mismatches or gap openings. |
| 128 | +The match starts at index 1 on the query sequence, and ends at index 82. |
| 129 | +This region matches a region in the `Test1` sequence spanning from index 82 to 119. |
| 130 | +The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occurred simply due to chance. |
| 131 | + |
| 132 | +Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html): |
| 133 | +> The Expect value (E) is a parameter that describes the number of hits one can “expect” to see |
| 134 | +> by chance when searching a database of a particular size. |
| 135 | +> It decreases exponentially as the Score (S) of the match increases. |
| 136 | +> The lower the E-value the more “significant” the match is. |
| 137 | +> However keep in mind that virtually identical short alignments have relatively high E values. |
| 138 | +> This is because the calculation of the E value takes into account the length of the query sequence. |
| 139 | +> These high E values make sense because shorter sequences |
| 140 | +> have a higher probability of occurring in the database purely by chance. |
| 141 | +
|
| 142 | + |
| 143 | + |
| 144 | +The bitscore is 71.3. |
| 145 | + |
| 146 | +Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/): |
| 147 | +> The bit score, S', |
| 148 | +> is derived from the raw alignment score, S, |
| 149 | +> taking the statistical properties of the scoring system into account. |
| 150 | +> Because bit scores are normalized with respect to the scoring system, |
| 151 | +> they can be used to compare alignment scores from different searches. |
| 152 | +
|
| 153 | + |
| 154 | +## Example: BLASTing the _mecA1_ gene against all of NCBI |
| 155 | +Now that we've tried BLAST'ing against a local, custom database, |
| 156 | +let's try BLAST'ing a piece of the _mecA_ gene against NCBI. |
| 157 | +To create the query file `mecA_BLAST.fasta`, |
| 158 | +I randomly selected 140 nucleotides from `mecA.fasta`. |
| 159 | + |
| 160 | +We should see that the query FASTA is a direct hit to the _mecA_ gene |
| 161 | +(one of the NCBI hits should definitely be the NCBI sample `NG_047945.1`, |
| 162 | +which is the sample the gene fragment was extracted from). |
| 163 | + |
| 164 | +For this BLAST search, I will search against the `core_nt` database, |
| 165 | +which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database. |
| 166 | +This newer database is the default as of August 2024. |
| 167 | +It seeks to reduce redundancy and storage requirements when downloading. |
| 168 | +More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/). |
| 169 | + |
| 170 | +General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html). |
| 171 | + |
| 172 | +``` |
| 173 | +io = IOBuffer(); |
| 174 | +blastn("assets/mecA_BLAST.fasta"; db="core_nt", stdout=io, remote=true, outfmt="6 query subject expect") |
| 175 | +seek(io, 0); |
| 176 | +CSV.read(io, DataFrame; header=false) |
| 177 | +``` |
| 178 | + |
| 179 | +Here is the output of the BLAST. |
| 180 | + |
| 181 | +``` |
| 182 | +julia> CSV.read(io, DataFrame; header=false) |
| 183 | +500×12 DataFrame |
| 184 | + Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12 |
| 185 | + │ String15 String15 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Int64 |
| 186 | +─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 187 | + 1 │ mecA_BLAST CP026646.1 100.0 140 0 0 1 140 46719 46580 7.12e-65 259 |
| 188 | + 2 │ mecA_BLAST CP154497.1 100.0 140 0 0 1 140 45702 45563 7.12e-65 259 |
| 189 | + 3 │ mecA_BLAST CP049499.1 100.0 140 0 0 1 140 61014 60875 7.12e-65 259 |
| 190 | + 4 │ mecA_BLAST CP030403.1 100.0 140 0 0 1 140 48633 48494 7.12e-65 259 |
| 191 | + 5 │ mecA_BLAST CP132494.1 100.0 140 0 0 1 140 1867742 1867603 7.12e-65 259 |
| 192 | + 6 │ mecA_BLAST MH798864.1 100.0 140 0 0 1 140 281 420 7.12e-65 259 |
| 193 | + 7 │ mecA_BLAST CP162442.1 100.0 140 0 0 1 140 503005 503144 7.12e-65 259 |
| 194 | + 8 │ mecA_BLAST OR082611.1 100.0 140 0 0 1 140 6415 6276 7.12e-65 259 |
| 195 | + 9 │ mecA_BLAST CP053183.1 100.0 140 0 0 1 140 41607 41468 7.12e-65 259 |
| 196 | + 10 │ mecA_BLAST CP085310.1 100.0 140 0 0 1 140 1610215 1610076 7.12e-65 259 |
| 197 | + 11 │ mecA_BLAST CP162465.1 100.0 140 0 0 1 140 1140196 1140057 7.12e-65 259 |
| 198 | + 12 │ mecA_BLAST CP133660.1 100.0 140 0 0 1 140 2821019 2821158 7.12e-65 259 |
| 199 | + 13 │ mecA_BLAST CP049476.1 100.0 140 0 0 1 140 40314 40175 7.12e-65 259 |
| 200 | + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ |
| 201 | + 489 │ mecA_BLAST CP093527.1 100.0 140 0 0 1 140 42814 42675 7.12e-65 259 |
| 202 | + 490 │ mecA_BLAST CP035541.1 100.0 140 0 0 1 140 72431 72292 7.12e-65 259 |
| 203 | + 491 │ mecA_BLAST CP145216.2 100.0 140 0 0 1 140 45830 45691 7.12e-65 259 |
| 204 | + 492 │ mecA_BLAST CP193734.1 100.0 140 0 0 1 140 64595 64456 7.12e-65 259 |
| 205 | + 493 │ mecA_BLAST CP083210.2 100.0 140 0 0 1 140 43331 43192 7.12e-65 259 |
| 206 | + 494 │ mecA_BLAST MW052033.1 100.0 140 0 0 1 140 245 384 7.12e-65 259 |
| 207 | + 495 │ mecA_BLAST CP168087.1 100.0 140 0 0 1 140 40806 40667 7.12e-65 259 |
| 208 | + 496 │ mecA_BLAST CP150769.1 100.0 140 0 0 1 140 2583517 2583378 7.12e-65 259 |
| 209 | + 497 │ mecA_BLAST CP030596.1 100.0 140 0 0 1 140 40848 40709 7.12e-65 259 |
| 210 | + 498 │ mecA_BLAST MZ398128.1 100.0 140 0 0 1 140 16977 17116 7.12e-65 259 |
| 211 | + 499 │ mecA_BLAST CP083199.2 100.0 140 0 0 1 140 40078 39939 7.12e-65 259 |
| 212 | + 500 │ mecA_BLAST CP162663.1 100.0 140 0 0 1 140 938360 938499 7.12e-65 259 |
| 213 | + 475 rows omitted |
| 214 | +``` |
| 215 | +There are 500 hits with the same E-value and Bitscore. |
| 216 | +This likely means that this sequence is an exact match to these 500 sequences in NCBI. |
| 217 | +Because of this, the first row in the results is not necessarily a better match than the 500th, |
| 218 | +even though it appears first. |
| 219 | + |
| 220 | +To verify the first hit, we can look up the GenBankID of the first row: `CP026646.1`. |
| 221 | +The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) for this sample confirms that this sample was phenotyped as _S. aureus_. |
| 222 | +Our query matches indices 46719 to 46580 on this reference genome. |
| 223 | +When we use the Graphics feature to visualize gene annotations on the reference genome, |
| 224 | +we see that there is a clear match to _mecA_ in the region that matches the query. |
| 225 | + |
| 226 | + |
| 227 | + |
| 228 | +Overall, this confirms that our BLAST worked as corrected! |
0 commit comments