switch example fastq to illumina sample

danielle-pinto · danielle-pinto · commit 06552476e187 · 2026-03-10T22:13:07.000-04:00
diff --git a/cookbook/sequences.md b/cookbook/sequences.md
@@ -113,28 +113,42 @@ _mecA_ is a well-characterized gene, so there are no ambiguous regions, and ther
 
 Let's try reading in a larger FASTQ file. 
 
-The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).     
-The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
+The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).     
+The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
 
-The BioSample ID for this sample is `SAMN14830786`.
+The BioSample ID for this sample is `SAMN02360768`.
 This ID refers to the physical bacterial isolate.  
-
-The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.    
-
+The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`.    
 Both values correspond to one another and are helpful identifiers. 
 
 The SRR (sample run accession number) is the unique identifier within SRA   
-and corresponds to the specific sequencing run. 
-
-In a later tutorial, we will discuss how to download this file in Julia using the SRR.
-
-But for now, the file can be downloaded using curl 
+and corresponds to the specific sequencing run.   
+There are two sequencing runs and accession numbers for this sample,   
+but we'll select one to download: `SRX392511` (the `SRX` prefix denotes an SRA experiment accession).  
 
+This file can be downloaded on the command line using `curl`.  
+> One of the nice things about Julia is that it is 
+> easy to toggle between the Julia REPL and 
+> a system shell: typing `;` enters shell mode 
+> from Julia.  
+> Pressing the `backspace`/`delete` key on an empty 
+> prompt returns you to the Julia REPL.  
 
+To download using the command line, type:
 ```
 curl -L --retry 5 --retry-delay 2 \
-  "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR12147540" \
-  | gzip -c > SRR12147540.fastq.gz
+  "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \
+  | gzip -c > SRX392511.fastq.gz
+```
+Alternatively, the same command can be run from within Julia:
+
+```julia
+run(pipeline(
+    `curl -L --retry 5 --retry-delay 2 "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511"`,
+    `gzip -c`,
+    "SRX392511.fastq.gz"
+    )
+)
 ```
 This file is gzipped, so we'll need to account for that as we are reading it in.
 
@@ -145,7 +159,7 @@ using CodecZlib
 
 records = []
 
-FASTQReader(GzipDecompressorStream(open("assets/SRR12147540.fastq.gz"))) do reader
+FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader
            for record in reader
                push!(records, record)
            end
@@ -156,32 +170,67 @@ We can see how many reads there are by looking at the length of `records`.
 
 ```julia
 julia> length(records)
-163528
+9852716
 ```
 
 Let's take a look at what the first 10 reads look like:
 
 ```
 julia> records[1:10]
 10-element Vector{Any}:
- FASTX.FASTQ.Record("SRR12147540.1  length=43", "TTTTTTTTCCTTTCTTTCT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.2  length=40", "TTTGTTTTTTTTTGTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.3  length=32", "TTTTTTTTTGTTCTTTGGT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.4  length=40", "TTTTCTTTTCCTTCTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.5  length=55", "TGTTGTTTGTGTCTCGTTT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.6  length=166", "TTTCCTTTTTTTTCCTCTC…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.7  length=338", "TGACCACCTTAGAACTTGG…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.8  length=245", "ACGCCGCGGCCAAAGAACG…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.9  length=157", "TTTGTTTGCGCGGTCTCTT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.10  length=100", "TTCTTTCGCCTTTTTGCCT…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGGATTTTGTTATATTGTA…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGCATAAATTTAAAAAAAA…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAT…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAG…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.3 3 length=101", "GAGCAGTAGTATAAAATGA…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.3 3 length=101", "CCGACACAATTACAAGCCA…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.4 4 length=101", "GATTATCTAGTCATAATTC…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…")
  ```
 
- All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.  
- More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).  
- This would be quite poor if we were looking at Illumia data.  
- However, because of how PacBio chemistry works,  
- quality scores are often flattened and there is simply a placeholder value on this line.  
- This does not mean our reads are low quality!  
+ All of the nucleotides in the first two reads have a quality score of `$`, which corresponds to a probability of error of 0.50119.
+ This is not concerning, as it is typical for the first couple of reads to be low quality.    
+
+ Let's calculate the average quality across all reads. 
+ We'll do this by converting each PHRED score to its error probability,   
+ and summing these values across all bases in all reads.  
+ Then, we can divide this sum by the total number of bases  
+ to get the average error probability for the entire dataset. 
+ It's important to convert the PHRED scores to error probabilities before averaging,   
+because PHRED is a logarithmic transform of error probability.  
+Therefore, averaging PHRED scores and then converting the mean to an error probability does not give the same result as converting each score to an error probability and then averaging.  
+
+```julia
+function phred_to_prob(Q)
+    # The formula is Pe = 10^(-Q/10)
+    return 10^(-Q / 10.0)
+end
+
+# get sum of all error probabilities
+total_error_prob = sum(
+    sum(phred_to_prob(q) for q in FASTQ.quality_scores(record))
+    for record in records
+)
+# 4.5100356107423276e7
+# get total number of bases
+total_bases = sum(length(FASTQ.quality_scores(record)) for record in records)
+# 995124316
+
+# get average error probability
+total_error_prob/total_bases
+# 0.045321328584059316
+```
+
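+The average error probability is about 4.53%, equivalent to a PHRED score of roughly 13.   
+Because a few very low-quality bases can dominate this mean, it is still consistent with most reads being high quality.  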
+
+ More information about how to convert ASCII values to PHRED quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).  
+ 
+
+
+
  Now that we've learned how to read files in and manipulate them a bit,  
 let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.  
 This will tell us if this _S. aureus_ is MRSA.