FASTQ files are also text-based files that contain sequences, along with a name and description.
48
+
FASTQ files are also text-based files that contain sequences, along with an identifier and description.
47
49
However, they also store sequence quality information (the Q is for quality!).
48
50
49
51
The template of a sequence record is:
@@ -64,7 +66,7 @@ tcagTTAAGATGGGAT
64
66
65
67
### FAI files
66
68
67
-
FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files.
69
+
FAI (FASTA index) files are used in conjunction with FASTA or FASTQ files.
68
70
They are text files with TAB-delimited columns.
69
71
They allow the user to access specific regions of the reference FASTA/FASTQ without reading the entire sequence into memory.
70
72
More information about FAI index files can be found [here](https://www.htslib.org/doc/faidx.html).
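
To illustrate how an index allows this kind of random access, here is a minimal sketch that uses only Julia's standard library. The toy file, the `byte_offset` and `fetch_region` helpers, and the hand-derived index values are hypothetical and for illustration only; this is not how FASTX.jl implements indexing. It assumes the usual FASTA index fields: the byte offset of the first base, the number of bases per line, and the number of bytes per line.

```julia
# Write a small hypothetical FASTA file with 4 bases per line.
header = ">seq1 toy record\n"
path = tempname()
write(path, header * "ACGT\nTTGA\nCC\n")

# Values an index line would hold for seq1 (derived by hand here):
offset = ncodeunits(header)  # bytes before the first base
linebases = 4                # bases per line
linewidth = 5                # bytes per line (4 bases plus the newline)

# 0-based byte position in the file of the 1-based base `pos`.
byte_offset(pos, offset, linebases, linewidth) =
    offset + ((pos - 1) ÷ linebases) * linewidth + ((pos - 1) % linebases)

# Read bases start:stop by seeking directly to each one,
# without scanning any of the sequence that comes before `start`.
function fetch_region(path, start, stop, offset, linebases, linewidth)
    open(path) do io
        bases = Char[]
        for pos in start:stop
            seek(io, byte_offset(pos, offset, linebases, linewidth))
            push!(bases, read(io, Char))
        end
        join(bases)
    end
end

region = fetch_region(path, 3, 6, offset, linebases, linewidth)
println(region)
```

The requested region spans a line break in the file, and the arithmetic on `linebases` and `linewidth` is what skips the newline byte.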
@@ -80,14 +82,15 @@ QUALOFFSET Offset of sequence's first quality within the FASTQ file
80
82
```
81
83
82
84
We will read in a FASTA file containing the _mecA_ gene.
83
-
_mecA_ is an antibiotic resistance gene commonly found in Methicillin-resistant _Staphylococcus aureus_ (MRSA).
84
-
It helps bacteria to break down beta-lactam antibiotics like methicillin.
85
+
_mecA_ is an antibiotic resistance gene commonly found in methicillin-resistant _Staphylococcus aureus_ (MRSA).
86
+
It encodes an alternative penicillin-binding protein that allows the bacteria to resist beta-lactam antibiotics like methicillin.
85
87
It is typically about 2.1 kb long.
86
-
This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).
88
+
This specific reference FASTA was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta).
89
+
More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).
87
90
88
91
First we'll open the file.
89
-
Then we'll iterate over every record in the file and
90
-
print out the sequence identifier, the sequence description and then the corresponding sequence.
92
+
Then we'll iterate over every record in the file and print the sequence identifier,
93
+
description and then the corresponding sequence.
91
94
92
95
```julia
93
96
julia> FASTAReader(open("assets/mecA.fasta")) do reader
We confirmed that the length of the gene matches what we'd expect for _mecA_.
110
+
We confirmed that the length of the gene matches what we'd expect for _mecA_.
108
111
In this case, there is only one sequence that spans the entire length of the gene.
109
112
After this gene was sequenced, all of the reads were assembled together into a single consensus sequence.
110
-
_mecA_ is a well-characterized gene, so there are no ambiguous regions, and there is no need for there to be multiple contigs
111
-
(AKA for the gene to be broken up into multiple pieces, as we know exactly how the reads should be assembled together.).
112
-
113
+
_mecA_ is a well-characterized gene, so there are no ambiguous regions,
114
+
and there is no need for there to be multiple contigs
115
+
(that is, for the gene to be broken into multiple pieces, since we know how the reads should be assembled).
113
116
114
117
Let's try reading in a larger FASTQ file.
115
118
116
119
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
117
120
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
118
121
119
-
The BioSample ID for this sample is `SAMN02360768`.
122
+
The BioSample ID for this sample is `SAMN02360768`.
120
123
This ID refers to the physical bacterial isolate.
121
124
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`.
122
125
Both values correspond to one another and are helpful identifiers.
123
126
124
-
The SRR (sample run accession number) is the unique identifier within SRA
125
-
and corresponds to the specific sequencing run.
126
-
There are two sequencing runs and accession numbers for this sample,
127
-
but we'll select one to download: `SRX392511`.
128
-
129
-
This file can be downloaded on the command line using `curl`.
130
-
> One of the nice things about julia is that it is
131
-
> super easy to toggle between the julia REPL and
132
-
> bash shell by typing `;` to access shell mode
133
-
> from julia.
134
-
> You can use the `backspace`/`delete` key to
135
-
> quickly toggle back to the julia REPL.
127
+
The SRA run accession (SRR) uniquely identifies a sequencing run.
128
+
This sample is associated with two runs, and for this example we will download data linked to experiment `SRX392511`.
All of the nucleotides in these two reads have a quality score of `$`, which corresponds to an error probability of 0.50119.
194
-
This is not concerning, as it is typical for the first couple of reads to be low quality.
194
+
```
195
195
196
196
Let's calculate the average quality across all reads.
197
-
We'll do this by converting each PHRED score to its probability error,
198
-
and summing these values across all sequences in all reads.
197
+
We'll do this by converting each PHRED score to its error probability,
198
+
and summing these values across all bases in all reads.
199
199
Then, we can divide this sum by the total number of bases
200
-
to get the average probability error for the entire sequence.
200
+
to get the average error probability across the entire dataset.
201
201
It's important that we first convert the PHRED scores to the probability errors before averaging,
202
-
because PHRED is a logarithmic transform of error probability.
202
+
because PHRED is a logarithmic transformation of error probability.
203
203
Therefore, averaging the PHRED scores and then converting the result to an error probability is not the same as converting each PHRED score to an error probability and then averaging those probabilities.
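
The difference can be seen in a small standalone sketch. It assumes the Phred+33 encoding (ASCII offset 33), and the helper names `phred` and `error_prob` as well as the two-base toy read are hypothetical, chosen only for illustration.

```julia
# Convert a Phred+33 quality character to its PHRED score.
# Assumes the Sanger / Illumina 1.8+ encoding (ASCII offset 33).
phred(c::Char) = Int(c) - 33

# Convert a PHRED score Q to an error probability: P = 10^(-Q / 10).
error_prob(q::Real) = 10.0^(-q / 10)

# `$` is ASCII 36, so Q = 3 and the error probability is about 0.50119.
p = error_prob(phred('$'))

# Hypothetical toy read with very uneven qualities: Q = 3 and Q = 40.
quals = [phred('$'), phred('I')]

# Recommended: convert each score first, then average the probabilities.
mean_of_probs = sum(error_prob, quals) / length(quals)

# Misleading: average the PHRED scores first, then convert.
prob_of_mean = error_prob(sum(quals) / length(quals))

println("P(error) for '\$': ", round(p; digits = 5))
println("mean of probabilities:  ", mean_of_probs)
println("probability of mean Q:  ", prob_of_mean)
```

Because the transform is logarithmic, the single low-quality base dominates the mean of probabilities, while averaging the scores first hides it and gives a much smaller error estimate.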
204
204
205
205
```julia
@@ -214,7 +214,7 @@ total_error_prob = sum(
214
214
for record in records
215
215
)
216
216
4.5100356107423276e7
217
-
# get total number of base pairs
217
+
# get total number of bases
218
218
total_bases = sum(length(FASTQ.quality_scores(record)) for record in records)
More information about how to convert ASCII values/PHRED scores to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
229
+
More information about converting ASCII characters and PHRED scores to quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
230
230
231
-
232
-
233
-
234
231
Now that we've learned how to read files in and manipulate them a bit,
235
232
let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.
236
-
This will tell us if this _S. aureus_ is MRSA.
233
+
This will tell us if this particular _S. aureus_ is MRSA.