|
| 1 | ++++ |
| 2 | +using Dates |
| 3 | +date = Date("2026-03-15") |
| 4 | +title = "BLAST" |
| 5 | +rss_descr = "Using NCBIBlast.jl to run BLAST searches" |
| 6 | ++++ |
| 7 | + |
| 8 | +# Introduction to BLAST |
| 9 | +A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences. |
| 10 | + |
| 11 | +More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/). |
| 12 | + |
| 13 | +BLAST's can be run from the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi). |
| 14 | +A user can simply copy in a nucleotide sequence and search for a best match against all of the databases in NCBI! |
| 15 | + |
| 16 | +`NCBIBlast.jl` is a thin wrapper around the BLAST command line tool, |
| 17 | +allowing users to run the tool within Julia. |
| 18 | +The following BLAST tools are supported by `NCBIBlast`: |
| 19 | +- `blastn` |
| 20 | +- `blastp` |
| 21 | +- `tblastn` |
| 22 | +- `blastx` |
| 23 | +- `makeblastdb` |
| 24 | + |
| 25 | + |
| 26 | + |
| 27 | +Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBLAST`. |
| 28 | + |
| 29 | + |
| 30 | + |
| 31 | + |
| 32 | +# How NCBIBlast.jl works |
| 33 | + |
| 34 | + |
| 35 | +The keywords used in the tool are sent to the shell for running BLAST. |
| 36 | + |
| 37 | +As stated on the Github [docs](https://github.com/BioJulia/NCBIBlast.jl), the julia call |
| 38 | + |
| 39 | +``` |
| 40 | +blastn(; query = "a_file.txt", db="mydb", out="results.txt") |
| 41 | +``` |
| 42 | + |
| 43 | +is sent to the shell as |
| 44 | + |
| 45 | +``` |
| 46 | +$ blastn -query a_file.txt -db mydb -out results.txt |
| 47 | +``` |
| 48 | + |
| 49 | +# BLAST databases |
| 50 | +Having a BLAST database is necessary to run BLAST locally. |
| 51 | +A BLAST database is constructed from FASTA files that serve as reference sequences. |
| 52 | +A database can be built using the following command: |
| 53 | +``` |
| 54 | +makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl") |
| 55 | +``` |
| 56 | + |
| 57 | +More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/). |
| 58 | + |
| 59 | +## Example: Building a local BLAST database and running the BLAST search |
| 60 | + |
| 61 | +For our first example, we will replicate the example on the NCBIBlast.jl Github. |
| 62 | + |
| 63 | +First, we will build a local database using a fasta file found in the NCBIBlast github repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)). |
| 64 | + |
| 65 | +``` |
| 66 | +makeblastdb(; in="assets/dna2.fasta", dbtype="nucl") |
| 67 | +
|
| 68 | +Building a new DB, current time: 03/16/2026 21:04:36 |
| 69 | +New DB name: /Users/pintorson/Documents/tufts_bonham/daniellepinto/Projects/BioJulia/BioTutorials/cookbook/assets/dna2.fasta |
| 70 | +New DB title: assets/dna2.fasta |
| 71 | +Sequence type: Nucleotide |
| 72 | +Keep MBits: T |
| 73 | +Maximum file size: 3000000000B |
| 74 | +Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds. |
| 75 | +
|
| 76 | +
|
| 77 | +Process(`/Users/pintorson/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0)) |
| 78 | +``` |
| 79 | +A new database was built in `assets`. |
| 80 | + |
| 81 | +Now, we can define our query sequence. |
| 82 | +We can save the query string in memory (using IOBuffer) rather than reading in a FASTA file. |
| 83 | + |
| 84 | +``` |
| 85 | +buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG") |
| 86 | +``` |
| 87 | + |
| 88 | +Now, we can run the BLAST search. |
| 89 | +The BLAST output format "6" means that the output will be tab-delimited text with no column names. |
| 90 | + |
| 91 | +The BLAST output will be written into IO. |
| 92 | +``` |
| 93 | +io = IOBuffer(); |
| 94 | +blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6"); |
| 95 | +seek(io, 0); |
| 96 | +``` |
| 97 | +The command `seek(io,0)` moves the cursor of to the start of the captured object (index 0) so it can be read into a dataframe. |
| 98 | + |
| 99 | + |
| 100 | +``` |
| 101 | +using CSV, DataFrames |
| 102 | +CSV.read(io, DataFrame; header=false) |
| 103 | +
|
| 104 | +1×12 DataFrame |
| 105 | + Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12 |
| 106 | + │ String7 String7 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Float64 |
| 107 | +─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 108 | + 1 │ Query_1 Test1 100.0 38 0 0 1 38 82 119 5.64e-18 71.3 |
| 109 | +``` |
| 110 | + |
| 111 | +### Interpreting BLAST Output |
| 112 | +This output tells us that the query sequence (`Query_1` is the default name since we didn't specify a name) matches `Test1` in the reference database. |
| 113 | +There is 100% identity on a region that is 38 nucleotides long. |
| 114 | +There are 0 mismatches or gap openings. |
| 115 | +The match starts at index 1 on the query sequence, and ends at index 82. |
| 116 | +This region matches a region on the `Test1` that spans from index 82 to 119. |
| 117 | +The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occured simply due to chance. |
| 118 | + |
| 119 | +Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html): |
| 120 | +> The Expect value (E) is a parameter that describes the number of |
| 121 | +> hits one can “expect” to see by chance when searching a database of |
| 122 | +> a particular size. It decreases exponentially as the Score (S) of |
| 123 | +> the match increases. |
| 124 | +> The lower the E-value the more “significant” the match is. However |
| 125 | +> keep in mind that virtually identical short alignments have |
| 126 | +> relatively high E values. This is because the calculation of the E |
| 127 | +> value takes into account the length of the query sequence. These |
| 128 | +> high E values make sense because shorter sequences have a higher |
| 129 | +> probability of occurring in the database purely by chance. |
| 130 | +
|
| 131 | + |
| 132 | + |
| 133 | +The bitscore is 71.3. |
| 134 | + |
| 135 | +Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/): |
| 136 | +> The bit score, S', is derived from the raw alignment score, S, |
| 137 | +> taking the statistical properties of the scoring system into account. |
| 138 | +> Because bit scores are normalized with respect to the scoring system, they |
| 139 | +> can be used to compare alignment scores from different searches. |
| 140 | +
|
| 141 | + |
| 142 | +## Example: BLASTing the _mecA1_ gene against all of NCBI |
| 143 | +Now that we've tried BLASTing against a local database, |
| 144 | +let's try BLASTing a piece of the _mecA_ gene against NCBI. |
| 145 | +To create the query file `mecA_BLAST.fasta`, |
| 146 | +I randomly selected 140 nucleotides from `mecA.fasta`. |
| 147 | + |
| 148 | +We should see that the query fasta is a direct hit to the _mecA_ gene |
| 149 | +(specifically to the sequence upload that we pulled from NCBI). |
| 150 | + |
| 151 | +For this BLAST search, I will search against the `core_nt` database, |
| 152 | +which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database. |
| 153 | +This newer database is the default as of August 2024. |
| 154 | +It seeks to reduce redundancy and reduce storage when downloading the database. More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/). |
| 155 | + |
| 156 | +General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html). |
| 157 | + |
| 158 | +``` |
| 159 | +io = IOBuffer(); |
| 160 | +blastn("assets/mecA_BLAST.fasta"; db="core_nt", stdout=io, remote=true, outfmt="6 query subject expect") |
| 161 | +seek(io, 0); |
| 162 | +CSV.read(io, DataFrame; header=false) |
| 163 | +``` |
| 164 | + |
| 165 | +Here is the output of the BLAST. |
| 166 | + |
| 167 | +``` |
| 168 | +julia> CSV.read(io, DataFrame; header=false) |
| 169 | +500×12 DataFrame |
| 170 | + Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12 |
| 171 | + │ String15 String15 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Int64 |
| 172 | +─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 173 | + 1 │ mecA_BLAST CP026646.1 100.0 140 0 0 1 140 46719 46580 7.12e-65 259 |
| 174 | + 2 │ mecA_BLAST CP154497.1 100.0 140 0 0 1 140 45702 45563 7.12e-65 259 |
| 175 | + 3 │ mecA_BLAST CP049499.1 100.0 140 0 0 1 140 61014 60875 7.12e-65 259 |
| 176 | + 4 │ mecA_BLAST CP030403.1 100.0 140 0 0 1 140 48633 48494 7.12e-65 259 |
| 177 | + 5 │ mecA_BLAST CP132494.1 100.0 140 0 0 1 140 1867742 1867603 7.12e-65 259 |
| 178 | + 6 │ mecA_BLAST MH798864.1 100.0 140 0 0 1 140 281 420 7.12e-65 259 |
| 179 | + 7 │ mecA_BLAST CP162442.1 100.0 140 0 0 1 140 503005 503144 7.12e-65 259 |
| 180 | + 8 │ mecA_BLAST OR082611.1 100.0 140 0 0 1 140 6415 6276 7.12e-65 259 |
| 181 | + 9 │ mecA_BLAST CP053183.1 100.0 140 0 0 1 140 41607 41468 7.12e-65 259 |
| 182 | + 10 │ mecA_BLAST CP085310.1 100.0 140 0 0 1 140 1610215 1610076 7.12e-65 259 |
| 183 | + 11 │ mecA_BLAST CP162465.1 100.0 140 0 0 1 140 1140196 1140057 7.12e-65 259 |
| 184 | + 12 │ mecA_BLAST CP133660.1 100.0 140 0 0 1 140 2821019 2821158 7.12e-65 259 |
| 185 | + 13 │ mecA_BLAST CP049476.1 100.0 140 0 0 1 140 40314 40175 7.12e-65 259 |
| 186 | + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ |
| 187 | + 489 │ mecA_BLAST CP093527.1 100.0 140 0 0 1 140 42814 42675 7.12e-65 259 |
| 188 | + 490 │ mecA_BLAST CP035541.1 100.0 140 0 0 1 140 72431 72292 7.12e-65 259 |
| 189 | + 491 │ mecA_BLAST CP145216.2 100.0 140 0 0 1 140 45830 45691 7.12e-65 259 |
| 190 | + 492 │ mecA_BLAST CP193734.1 100.0 140 0 0 1 140 64595 64456 7.12e-65 259 |
| 191 | + 493 │ mecA_BLAST CP083210.2 100.0 140 0 0 1 140 43331 43192 7.12e-65 259 |
| 192 | + 494 │ mecA_BLAST MW052033.1 100.0 140 0 0 1 140 245 384 7.12e-65 259 |
| 193 | + 495 │ mecA_BLAST CP168087.1 100.0 140 0 0 1 140 40806 40667 7.12e-65 259 |
| 194 | + 496 │ mecA_BLAST CP150769.1 100.0 140 0 0 1 140 2583517 2583378 7.12e-65 259 |
| 195 | + 497 │ mecA_BLAST CP030596.1 100.0 140 0 0 1 140 40848 40709 7.12e-65 259 |
| 196 | + 498 │ mecA_BLAST MZ398128.1 100.0 140 0 0 1 140 16977 17116 7.12e-65 259 |
| 197 | + 499 │ mecA_BLAST CP083199.2 100.0 140 0 0 1 140 40078 39939 7.12e-65 259 |
| 198 | + 500 │ mecA_BLAST CP162663.1 100.0 140 0 0 1 140 938360 938499 7.12e-65 259 |
| 199 | + 475 rows omitted |
| 200 | +``` |
| 201 | +There are 500 hits with the same E-value and Bitscore. |
| 202 | +This likely means that this sequence is an exact match to these 500 sequences in NCBI. |
| 203 | +Because of this, the first row in the results is not necessarily a better match than the 500th, |
| 204 | +even though it appears first. |
| 205 | + |
| 206 | +To verify the first hit, we can look up the GenBankID of the first hit: `CP026646.1`. |
| 207 | +The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) listing this sample confirms that this sample was phenotyped as _S. aureus_. |
| 208 | +Our query matches from indices 46719 to 46580. |
| 209 | +When we use the Graphics feature to visualize gene annotations, we see that there is a clear match to _mecA_. |
| 210 | + |
| 211 | + |
| 212 | + |
| 213 | +Overall, this confirms that our BLAST worked correctly! |
0 commit comments