cookbook/sequences.md
Lines changed: 80 additions & 31 deletions
@@ -113,28 +113,42 @@ _mecA_ is a well-characterized gene, so there are no ambiguous regions, and ther
113
113
114
114
Let's try reading in a larger FASTQ file.
115
115
116
-
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
117
-
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
116
+
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
117
+
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
118
118
119
-
The BioSample ID for this sample is `SAMN14830786`.
119
+
The BioSample ID for this sample is `SAMN02360768`.
120
120
This ID refers to the physical bacterial isolate.
121
-
122
-
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.
123
-
121
+
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`.
124
122
Both values correspond to one another and are helpful identifiers.
125
123
126
124
The SRR (sample run accession number) is the unique identifier within SRA
127
-
and corresponds to the specific sequencing run.
128
-
129
-
In a later tutorial, we will discuss how to download this file in Julia using the SRR.
130
-
131
-
But for now, the file can be downloaded using curl
125
+
and corresponds to the specific sequencing run.
126
+
There are two sequencing runs and accession numbers for this sample,
127
+
but we'll select one to download: `SRX392511`.
132
128
129
+
This file can be downloaded on the command line using `curl`.
130
+
> One of the nice things about julia is that it is
All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probability of error of 0.50119.
180
-
More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
181
-
This would be quite poor if we were looking at Illumina data.
182
-
However, because of how PacBio chemistry works,
183
-
quality scores are often flattened and there is simply a placeholder value on this line.
184
-
This does not mean our reads are low quality!
193
+
All of the nucleotides in these two reads have a quality score of `$`, which corresponds to a probability of error of 0.50119.
194
+
This is not concerning, as it is typical for the first couple of reads to be low quality.
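
To see where the 0.50119 figure comes from, here is a minimal sketch of the conversion. It assumes the file uses the common Phred+33 encoding (an assumption, since the encoding is not stated here), and `quality_char_to_phred` is a hypothetical helper name used only for illustration.

```julia
# Hypothetical helper: decode one quality character into a PHRED score.
# Assumes Phred+33 encoding, i.e. score = ASCII code - 33.
quality_char_to_phred(c::Char; offset::Integer = 33) = Int(c) - offset

# Error probability from a PHRED score: Pe = 10^(-Q/10)
phred_to_prob(Q) = 10^(-Q / 10.0)

q = quality_char_to_phred('$')  # '$' is ASCII 36, so Q = 3
p = phred_to_prob(q)            # 10^(-0.3), approximately 0.50119

println("Q = ", q, ", error probability = ", p)
```

Under that assumption, a `$` means the base caller considers the base about as likely to be wrong as right.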
195
+
196
+
Let's calculate the average quality across all reads.
197
+
We'll do this by converting each PHRED score to its error probability,
198
+
and summing these values across all bases in all reads.
199
+
Then, we can divide this sum by the total number of bases
200
+
to get the average error probability for the entire dataset.
201
+
It's important that we convert the PHRED scores to error probabilities before averaging,
202
+
because PHRED is a logarithmic transform of error probability.
203
+
Therefore, averaging PHRED scores and then converting the result to an error probability does not give the same answer as converting each PHRED score to an error probability and averaging those values.
204
+
205
+
```julia
206
+
function phred_to_prob(Q)
207
+
    # The formula is Pe = 10^(-Q/10)
208
+
    return 10^(-Q / 10.0)
209
+
end
210
+
211
+
# get sum of all error probabilities
212
+
total_error_prob = sum(
213
+
    sum(phred_to_prob(q) for q in FASTQ.quality_scores(record))
214
+
    for record in records
215
+
)
216
+
# 4.5100356107423276e7
217
+
# get total number of bases
218
+
total_bases = sum(length(FASTQ.quality_scores(record)) for record in records)
219
+
# 995124316
220
+
221
+
# get average error probability
222
+
total_error_prob / total_bases
223
+
# 0.045321328584059316
224
+
```
225
+
226
+
The average error probability is just 4.53%,
227
+
which suggests the reads are of reasonable quality overall (roughly equivalent to an average PHRED score of 13).
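
As a small illustration of why the order of averaging matters, the sketch below compares the two approaches on a made-up set of scores. The `scores` vector and the `prob_to_phred` helper are hypothetical and are not part of the tutorial's data or code.

```julia
# Error probability from a PHRED score: Pe = 10^(-Q/10)
phred_to_prob(Q) = 10^(-Q / 10.0)

# Hypothetical inverse helper: PHRED score from an error probability
prob_to_phred(p) = -10 * log10(p)

# Made-up scores for illustration: one poor base among three good ones
scores = [3, 40, 40, 40]

# Convert each score to an error probability first, then average
mean_prob = sum(phred_to_prob.(scores)) / length(scores)

# Average the PHRED scores first, then convert (this understates the error)
prob_of_mean_q = phred_to_prob(sum(scores) / length(scores))

println("mean of error probabilities:  ", mean_prob)
println("error prob. of mean PHRED:    ", prob_of_mean_q)
println("PHRED equivalent of the mean: ", prob_to_phred(mean_prob))
```

With these toy values the first approach gives about 0.125, while the second gives about 0.0008, so a single poor base is almost invisible if the PHRED scores are averaged directly.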
228
+
229
+
More information about how to convert ASCII values to PHRED scores and error probabilities can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
230
+
231
+
232
+
233
+
185
234
Now that we've learned how to read files in and manipulate them a bit,
186
235
let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.