Skip to content

Commit 0655247

Browse files
switch example fastq to illumina sample
1 parent bdd640c commit 0655247

File tree

1 file changed

+80
-31
lines changed

1 file changed

+80
-31
lines changed

cookbook/sequences.md

Lines changed: 80 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -113,28 +113,42 @@ _mecA_ is a well-characterized gene, so there are no ambiguous regions, and ther
113113

114114
Let's try reading in a larger FASTQ file.
115115

116-
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
117-
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
116+
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
117+
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
118118

119-
The BioSample ID for this sample is `SAMN14830786`.
119+
The BioSample ID for this sample is `SAMN02360768`.
120120
This ID refers to the physical bacterial isolate.
121-
122-
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.
123-
121+
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`.
124122
Both values correspond to one another and are helpful identifiers.
125123

126124
The SRR (sample run accession number) is the unique identifier within SRA
127-
and corresponds to the specific sequencing run.
128-
129-
In a later tutorial, we will discuss how to download this file in Julia using the SRR.
130-
131-
But for now, the file can be downloaded using curl
125+
and corresponds to the specific sequencing run.
126+
There are two sequencing runs and accession numbers for this sample,
127+
but we'll select one to download: `SRX392511`.
132128

129+
This file can be downloaded on the command line using `curl`.
130+
> One of the nice things about julia is that it is
131+
> super easy to toggle between the julia REPL and
132+
> bash shell by typing `;` to access shell mode
133+
> from julia.
134+
> You can use the `backspace`/`delete` key to
135+
> quickly toggle back to the julia REPL.
133136
137+
To download using the command line, type:
134138
```
135139
curl -L --retry 5 --retry-delay 2 \
136-
"https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR12147540" \
137-
| gzip -c > SRR12147540.fastq.gz
140+
"https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \
141+
| gzip -c > SRX392511.fastq.gz
142+
```
143+
Alternatively, command line code can be executed from within julia:
144+
145+
```julia
146+
run(pipeline(
147+
`curl -L --retry 5 --retry-delay 2 "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511"`,
148+
`gzip -c`,
149+
"SRX392511.fastq.gz"
150+
)
151+
)
138152
```
139153
This file is gzipped, so we'll need to account for that as we are reading it in.
140154

@@ -145,7 +159,7 @@ using CodecZlib
145159

146160
records = []
147161

148-
FASTQReader(GzipDecompressorStream(open("assets/SRR12147540.fastq.gz"))) do reader
162+
FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader
149163
for record in reader
150164
push!(records, record)
151165
end
@@ -156,32 +170,67 @@ We can see how many reads there are by looking at the length of `records`.
156170

157171
```julia
158172
julia> length(records)
159-
163528
173+
9852716
160174
```
161175

162176
Let's take a look at what the first 10 reads look like:
163177

164178
```
165179
julia> records[1:10]
166180
10-element Vector{Any}:
167-
FASTX.FASTQ.Record("SRR12147540.1 length=43", "TTTTTTTTCCTTTCTTTCT…", "$$$$$$$$$$$$$$$$$$$…")
168-
FASTX.FASTQ.Record("SRR12147540.2 length=40", "TTTGTTTTTTTTTGTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
169-
FASTX.FASTQ.Record("SRR12147540.3 length=32", "TTTTTTTTTGTTCTTTGGT…", "$$$$$$$$$$$$$$$$$$$…")
170-
FASTX.FASTQ.Record("SRR12147540.4 length=40", "TTTTCTTTTCCTTCTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
171-
FASTX.FASTQ.Record("SRR12147540.5 length=55", "TGTTGTTTGTGTCTCGTTT…", "$$$$$$$$$$$$$$$$$$$…")
172-
FASTX.FASTQ.Record("SRR12147540.6 length=166", "TTTCCTTTTTTTTCCTCTC…", "$$$$$$$$$$$$$$$$$$$…")
173-
FASTX.FASTQ.Record("SRR12147540.7 length=338", "TGACCACCTTAGAACTTGG…", "$$$$$$$$$$$$$$$$$$$…")
174-
FASTX.FASTQ.Record("SRR12147540.8 length=245", "ACGCCGCGGCCAAAGAACG…", "$$$$$$$$$$$$$$$$$$$…")
175-
FASTX.FASTQ.Record("SRR12147540.9 length=157", "TTTGTTTGCGCGGTCTCTT…", "$$$$$$$$$$$$$$$$$$$…")
176-
FASTX.FASTQ.Record("SRR12147540.10 length=100", "TTCTTTCGCCTTTTTGCCT…", "$$$$$$$$$$$$$$$$$$$…")
181+
FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGGATTTTGTTATATTGTA…", "$$$$$$$$$$$$$$$$$$$…")
182+
FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGCATAAATTTAAAAAAAA…", "$$$$$$$$$$$$$$$$$$$…")
183+
FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAT…", "???????????????????…")
184+
FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAG…", "???????????????????…")
185+
FASTX.FASTQ.Record("SRX392511.3 3 length=101", "GAGCAGTAGTATAAAATGA…", "???????????????????…")
186+
FASTX.FASTQ.Record("SRX392511.3 3 length=101", "CCGACACAATTACAAGCCA…", "???????????????????…")
187+
FASTX.FASTQ.Record("SRX392511.4 4 length=101", "GATTATCTAGTCATAATTC…", "???????????????????…")
188+
FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…")
189+
FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…")
190+
FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…")
177191
```
178192

179-
All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.
180-
More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
181-
This would be quite poor if we were looking at Illumia data.
182-
However, because of how PacBio chemistry works,
183-
quality scores are often flattened and there is simply a placeholder value on this line.
184-
This does not mean our reads are low quality!
193+
All of the nucleotides in these two reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.
194+
This is not concerning, as it is typical for the first couple of reads to be low quality.
195+
196+
Let's calculate the average quality across all reads.
197+
We'll do this by converting each PHRED score to its probability error,
198+
and summing these values across all sequences in all reads.
199+
Then, we can divide this sum by the total number of bases
200+
to get the average probability error for the entire sequence.
201+
It's important that we first convert the PHRED scores to the probability errors before averaging,
202+
because PHRED is a logarithmic transform of error probability.
203+
Therefore, simply averaging PHRED and then converting to error probability is not the same as converting PHRED to error probability and averaging that sum.
204+
205+
```julia
206+
function phred_to_prob(Q)
207+
# The formula is Pe = 10^(-Q/10)
208+
return 10^(-Q / 10.0)
209+
end
210+
211+
# get sum of all error probabilities
212+
total_error_prob = sum(
213+
sum(phred_to_prob(q) for q in FASTQ.quality_scores(record))
214+
for record in records
215+
)
216+
4.5100356107423276e7
217+
# get total number of base pairs
218+
total_bases = sum(length(FASTQ.quality_scores(record)) for record in records)
219+
995124316
220+
221+
# get average error probability
222+
total_error_prob/total_bases
223+
0.045321328584059316
224+
```
225+
226+
The average error probability is just 4.53%,
227+
meaning the majority of reads are high quality.
228+
229+
More information about how to convert ASCII values/PHRED scores to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
230+
231+
232+
233+
185234
Now that we've learned how to read files in and manipulate them a bit,
186235
let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.
187236
This will tell us if this _S. aureus_ is MRSA.

0 commit comments

Comments
 (0)