final language tweaks

danielle-pinto · danielle-pinto · commit c78a6c2f7cea · 2026-03-21T21:20:44.000-04:00
diff --git a/cookbook/03-blast.md b/cookbook/03-blast.md
@@ -6,34 +6,32 @@ rss_descr = "Using NCBIBlast.jl to run BLAST searches"
 +++
 
 # Introduction to BLAST
-A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences. 
-It can be helpful for quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function.   
+A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.   
+It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identification or gene function assignment.  
 
 More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/).  
 
-BLAST searches can be run from the command line interface (CLI) or through BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).  
-A user can simply copy in a nucleotide sequence and search for the best match in NCBI! 
+BLAST searches can be run from the command line interface (CLI) or through the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).  
+A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match!     
 While searching from the website is fast and straightforward,   
-it only searches against the NCBI databases.    
-The CLI allows users to query against both NCBI databases and custom databases.  
+the website only searches against the NCBI databases.  
+The CLI allows users to query both NCBI databases and custom databases.    
 
 `NCBIBlast.jl` is a thin wrapper around the BLAST command line tool, 
-allowing users to run the tool within Julia.  
+allowing users to run the tool within Julia.    
 The following BLAST tools are supported by `NCBIBlast`:
 - `blastn`
 - `blastp`
 - `tblastn`
 - `blastx`
 - `makeblastdb`
 
-
-
 Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBlast.jl`.  
 
 
 
 
-# How NCBIBlast.jl works
+# How `NCBIBlast.jl` works
 
 
 The keywords used in the tool are sent to the shell for running BLAST.  
@@ -51,39 +49,43 @@ $ blastn -query a_file.txt -db mydb -out results.txt
 ```
 
 # BLAST databases
-Having a BLAST database is necessary to run BLAST locally.    
-A BLAST database is constructed from FASTA files that serve as reference sequences.   
-A database can be built using the following command:  
+Just like a BLAST search run from the CLI, a search run through `NCBIBlast.jl` requires a BLAST database to search against.  
+The user can build a local, custom database,  
+or search against a specific NCBI database.       
+A custom BLAST database is constructed from FASTA files that serve as reference sequences.   
+A database can be built using the following command in `NCBIBlast.jl`:  
 ```
 makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl")
 ```
 
 More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/).  
 
+
 ## Example: Building a local BLAST database and running the BLAST search
 
-For our first example, we will replicate the example on the `NCBIBlast.jl` Github.  
+For our first example, we will replicate the example on the `NCBIBlast.jl` GitHub repository.    
 
 First, we will build a local database using a FASTA file found in the `NCBIBlast.jl` GitHub repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)).  
+This file has been downloaded into `assets` as `dna2.fasta`.    
 
 ```
 makeblastdb(; in="assets/dna2.fasta", dbtype="nucl")
 
 Building a new DB, current time: 03/16/2026 21:04:36
-New DB name:   /Users/pintorson/Documents/tufts_bonham/daniellepinto/Projects/BioJulia/BioTutorials/cookbook/assets/dna2.fasta
+New DB name:   /LOCAL/PATH/BioTutorials/cookbook/assets/dna2.fasta
 New DB title:  assets/dna2.fasta
 Sequence type: Nucleotide
 Keep MBits: T
 Maximum file size: 3000000000B
 Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds.
 
 
-Process(`/Users/pintorson/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))
+Process(`/Users/USER/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))
 ```
 A new database was built in `assets`.
 
 Now, we can define our query sequence.  
-We can save the query string in memory (using IOBuffer) rather than reading in a FASTA file. 
+We can save the query string in memory (using `IOBuffer`) rather than reading in a FASTA file. 
 
 ```
 buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")
@@ -92,14 +94,14 @@ buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")
 Now, we can run the BLAST search.  
 The BLAST output format "6" means that the output will be tab-delimited text with no column names.  
 
-The BLAST output will be written into IO.  
+The BLAST output will be written into an `IOBuffer` named `io`.  
 ```
 io = IOBuffer();
 blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6");
 seek(io, 0);
 ```
-The command `seek(io,0)` moves the cursor to the start of the captured object (index 0) so it can be read into a dataframe.  
-
+The command `seek(io,0)` moves the cursor to the start of the captured object (index 0)   
+so it can be read into a dataframe via the [`DataFrames.jl`](https://dataframes.juliadata.org/stable/) package.  
 
 ```
 using CSV, DataFrames
@@ -113,49 +115,51 @@ CSV.read(io, DataFrame; header=false)
 ```
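+
+Because output format "6" carries no header row, it can help to name the columns when reading the results.  
+The sketch below is an illustration rather than part of the original example: the column names are the twelve default fields of BLAST tabular output, and the single result row is re-typed from the values discussed in this tutorial (with the query end assumed to be 38, the length of the query) so that the snippet runs without BLAST installed.  
+
+```julia
+using CSV, DataFrames
+
+# Default column order of BLAST tabular output (outfmt 6)
+blast_columns = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
+                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
+
+# Illustrative stand-in for the `io` buffer filled by `blastn` (assumed values)
+example_row = join(["Query_1", "Test1", "100.000", "38", "0", "0",
+                    "1", "38", "82", "119", "5.64e-18", "71.3"], "\t")
+example_io = IOBuffer(example_row * "\n")
+
+# Passing a vector as `header` names the columns and treats the first line as data
+hits = CSV.read(example_io, DataFrame; header=blast_columns, delim='\t')
+```
+
+With real results, the same `header` and `delim` keywords can be passed when reading the `io` buffer captured from `blastn`.  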
 
 ### Interpreting BLAST Output
-This output tells us that the query sequence (`Query_1` is the default name since we didn't specify a name) matches `Test1` in the reference database. 
-There is 100% identity on a region that is 38 nucleotides long. 
+This output tells us that the query sequence      
+(`Query_1` is the default name for the sequence because we didn't specify a name)  
+matches `Test1` in the reference database.   
+There is 100% identity between the query and a region on `Test1` that is 38 nucleotides long.   
 There are 0 mismatches or gap openings.  
 The match starts at index 1 on the query sequence, and ends at index 38.
 This region matches a region in the `Test1` sequence spanning from index 82 to 119.  
 The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occurred simply due to chance.  
 
 Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html):
-> The Expect value (E) is a parameter that describes the number of 
-> hits one can “expect” to see by chance when searching a database of
-> a particular size. It decreases exponentially as the Score (S) of 
-> the match increases. 
-> The lower the E-value the more “significant” the match is. However
-> keep in mind that virtually identical short alignments have 
-> relatively high E values. This is because the calculation of the E
-> value takes into account the length of the query sequence. These 
-> high E values make sense because shorter sequences have a higher 
-> probability of occurring in the database purely by chance. 
+> The Expect value (E) is a parameter that describes the number of hits one can “expect” to see  
+> by chance when searching a database of a particular size.  
+> It decreases exponentially as the Score (S) of the match increases.   
+> The lower the E-value the more “significant” the match is.   
+> However keep in mind that virtually identical short alignments have relatively high E values.   
+> This is because the calculation of the E value takes into account the length of the query sequence.   
+> These high E values make sense because shorter sequences  
+> have a higher probability of occurring in the database purely by chance.   
 
 
 
 The bitscore is 71.3.  
 
 Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/):
-> The bit score, S', is derived from the raw alignment score, S, 
-> taking the statistical properties of the scoring system into account. 
-> Because bit scores are normalized with respect to the scoring system, they
-> can be used to compare alignment scores from different searches.
+> The bit score, S', 
+> is derived from the raw alignment score, S, 
+> taking the statistical properties of the scoring system into account.   
+> Because bit scores are normalized with respect to the scoring system,  
+> they can be used to compare alignment scores from different searches.  
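+
+The relationship between the two quantities can be made concrete.  
+In BLAST's statistics, the E-value is approximately `E = m * n * 2^(-S')`, where `S'` is the bit score and `m` and `n` are the effective lengths of the query and the database.  
+The sketch below uses made-up lengths (not those of the databases in this tutorial) and a hypothetical helper `expected_hits` to show that each additional bit of score halves the expected number of chance hits.  
+
+```julia
+# E ≈ m * n * 2^(-S'): expected number of chance hits for bit score S'
+# m = effective query length, n = effective database length
+expected_hits(bitscore, m, n) = m * n * 2.0^(-bitscore)
+
+m, n = 38, 1_000_000          # illustrative lengths only (assumptions)
+e_40 = expected_hits(40.0, m, n)
+e_41 = expected_hits(41.0, m, n)
+
+# One extra bit of score halves the E-value
+e_41 ≈ e_40 / 2               # true
+```
+
+The same formula shows why short queries score worse: a short query limits the achievable bit score, so `2^(-S')` stays comparatively large.  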
 
 
 ## Example: BLASTing the _mecA1_ gene against all of NCBI
-Now that we've tried BLASTing against a local database,   
-let's try BLASTing a piece of the _mecA_ gene against NCBI.  
+Now that we've tried BLASTing against a local, custom database,   
+let's try BLASTing a piece of the _mecA_ gene against NCBI.  
 To create the query file `mecA_BLAST.fasta`,   
 I randomly selected 140 nucleotides from `mecA.fasta`.  
 
-We should see that the query fasta is a direct hit to the _mecA_ gene  
-(specifically to the sequence upload that we pulled from NCBI).  
+We should see that the query sequence is a direct hit to the _mecA_ gene  
+(one of the hits should be the NCBI record `NG_047945.1`,   
+from which the query fragment was extracted).  
 
-For this BLAST search, I will search against the `core_nt` database, 
+For this BLAST search, I will search against the `core_nt` database,   
 which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database.  
 This newer database is the default as of August 2024.    
-It seeks to reduce redundancy and storage requirements when downloading the database.      
+It aims to reduce redundancy and the storage required to download the database.  
 More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/).    
 
 General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html).  
@@ -208,10 +212,11 @@ This likely means that this sequence is an exact match to these 500 sequences in
 Because of this, the first row in the results is not necessarily a better match than the 500th,   
 even though it appears first.  
 
-To verify the first hit, we can look up the GenBankID of the first hit: `CP026646.1`.    
+To verify the first hit, we can look up the GenBank ID of the first row: `CP026646.1`.    
 The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) for this sample confirms that this sample was phenotyped as _S. aureus_.  
-Our query matches from indices 46719 to 46580.  
-When we use the Graphics feature to visualize gene annotations, we see that there is a clear match to _mecA_.  
+Our query matches indices 46719 to 46580 on this reference genome.  
+When we use the Graphics feature to visualize gene annotations on the reference genome,  
+we see that there is a clear match to _mecA_ in the region that matches the query.  
 
 ![BLAST Graphics](assets/mecA_BLAST.png)
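+
+The subject coordinates for this hit run from 46719 down to 46580.  
+In BLAST tabular output, a subject start greater than the subject end indicates that the query aligned to the reverse (minus) strand of the subject.  
+A minimal sketch of that check, using a hypothetical helper `hit_orientation` that is not part of `NCBIBlast.jl`:  
+
+```julia
+# Hypothetical helper: infer strand and aligned length from subject coordinates
+function hit_orientation(s_start::Integer, s_end::Integer)
+    strand = s_start <= s_end ? :plus : :minus
+    aligned_length = abs(s_end - s_start) + 1   # coordinates are 1-based and inclusive
+    return (strand = strand, aligned_length = aligned_length)
+end
+
+hit_orientation(46719, 46580)   # (strand = :minus, aligned_length = 140)
+```
+
+The aligned length of 140 agrees with the 140-nucleotide query taken from `mecA.fasta`.  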