FASTQ files are also text-based files that contain sequences, along with a name and description.
37
47
However, they also store sequence quality information (the Q is for quality!).
@@ -54,9 +64,9 @@ tcagTTAAGATGGGAT
54
64
55
65
### FAI files
56
66
57
-
FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files.
67
+
FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files.
58
68
They are text files with TAB-delimited columns.
59
-
Each line contains information about each region sequence within the FASTA/FASTQ.
69
+
They allow the user to access specific regions of the reference FASTA/FASTQ without reading the entire sequence into memory.
60
70
More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html).
61
71
62
72
```
@@ -69,15 +79,14 @@ QUALOFFSET Offset of sequence's first quality within the FASTQ file
69
79
70
80
```
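As a rough illustration of how an index enables this kind of random access, the sketch below parses a single FAI line and computes the byte offset of a given base. The `FaiEntry` type, the helper names, and the example index line are all hypothetical and written only for this illustration; they are not part of FASTX.jl. It assumes the standard faidx column order (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH).

```julia
# Hypothetical container for the first five columns of one .fai line.
struct FaiEntry
    name::String
    length::Int      # total number of bases in the sequence
    offset::Int      # byte offset of the first base in the file
    linebases::Int   # bases per line
    linewidth::Int   # bytes per line, including the newline
end

# Parse one TAB-delimited index line into a FaiEntry.
function parse_fai_line(line::AbstractString)
    fields = split(chomp(line), '\t')
    return FaiEntry(
        String(fields[1]),
        parse(Int, fields[2]),
        parse(Int, fields[3]),
        parse(Int, fields[4]),
        parse(Int, fields[5]),
    )
end

# Byte offset (0-based) of the base at 1-based position `pos`.
function byte_offset(entry::FaiEntry, pos::Integer)
    1 <= pos <= entry.length || throw(ArgumentError("position out of range"))
    zero_based = pos - 1
    full_lines = div(zero_based, entry.linebases)   # complete lines to skip
    within_line = rem(zero_based, entry.linebases)  # bases into the current line
    return entry.offset + full_lines * entry.linewidth + within_line
end

# Made-up index line: a 2000 bp sequence starting at byte 6, 60 bases per line.
entry = parse_fai_line("seq1\t2000\t6\t60\t61")
println(byte_offset(entry, 61))  # first base of the second line
```

With an offset like this, a reader can `seek` directly to the region of interest instead of scanning the file from the start.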
71
81
82
+
We will read in a FASTA file containing the _mecA_ gene.
83
+
_mecA_ is an antibiotic resistance gene commonly found in Methicillin-resistant _Staphylococcus aureus_ (MRSA).
84
+
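It encodes an alternative penicillin-binding protein (PBP2a) with low affinity for beta-lactam antibiotics like methicillin, which lets the bacteria keep building their cell wall in the presence of these drugs.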
85
+
It is typically 2.1 kb long.
86
+
This specific reference FASTA file was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).
72
87
73
-
74
-
75
-
76
-
We will read in a fasta file containing the _mecA_ gene.
77
-
This gene was taken from NCBI [here](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=3920764#).
78
-
79
-
First we'll open the file,
80
-
then we'll iterate over every record in the file and
88
+
First we'll open the file.
89
+
Then we'll iterate over every record in the file and
81
90
print out the sequence identifier, the sequence description and then the corresponding sequence.
82
91
83
92
```julia
@@ -86,32 +95,38 @@ julia> FASTAReader(open("assets/mecA.fasta")) do reader
We confirmed that the length of the gene matches what we'd expect for _mecA_.
108
+
In this case, there is only one sequence that spans the entire length of the gene.
109
+
After this gene was sequenced, all of the reads were assembled together into a single consensus sequence.
110
+
_mecA_ is a well-characterized gene, so there are no ambiguous regions, and there is no need for there to be multiple contigs
111
+
(i.e., for the gene to be broken up into multiple pieces), as we know exactly how the reads should be assembled together.
96
112
97
-
In this case, there is only one sequence.
98
113
99
-
Let's try reading in a larger fastq file.
114
+
Let's try reading in a larger FASTQ file.
100
115
101
-
The raw reads for a Staphylococcus aureus isolate was sequenced with Pac-Bio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
102
-
The link to download the raw fastq's can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
116
+
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
117
+
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
103
118
104
-
The biosample ID for this sample is `SAMN14830786`.
119
+
The BioSample ID for this sample is `SAMN14830786`.
105
120
This ID refers to the physical bacterial isolate.
106
121
107
-
The SRA sample accession number (an internal value used within Sequence Read Archive) is `SRS6947643`.
122
+
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.
108
123
109
124
Both values correspond to one another and are helpful identifiers.
110
125
111
126
The SRR (sample run accession number) is the unique identifier within SRA
112
-
and corresponds to the specific sequencing run within SRA.
127
+
and corresponds to the specific sequencing run.
113
128
114
-
In a later tutorial, we will discuss how we can download this file within julia with the SRR.
129
+
In a later tutorial, we will discuss how to download this file in Julia using the SRR.
115
130
116
131
But for now, the file can be downloaded using curl
117
132
@@ -164,10 +179,11 @@ julia> records[1:10]
164
179
All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probability of error of 0.50119.
165
180
More information about how to convert ASCII values to quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
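The arithmetic behind that number can be checked with a few lines of plain Julia. This is a minimal sketch assuming the common Phred+33 encoding; the helper names `phred_score` and `error_probability` are made up for this example and are not FASTX.jl functions.

```julia
# Phred+33: the quality score is the character's ASCII code minus 33.
phred_score(c::Char; offset::Integer = 33) = Int(c) - offset

# A Phred score Q corresponds to an error probability of 10^(-Q/10).
error_probability(q::Integer) = 10.0^(-q / 10)

q = phred_score('$')        # '$' is ASCII 36, so Q = 36 - 33 = 3
p = error_probability(q)    # 10^(-0.3)

println("Q = ", q, ", P(error) = ", round(p; digits = 5))
```

A score of 3 means roughly a one-in-two chance that the base call is wrong, which is why this value reads as a placeholder rather than a real measurement.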
166
181
This would be quite poor if we were looking at Illumina data.
167
-
However, because of how Pacbio chemistry works,
168
-
the quality scores are often flattened and there is simply a placeholder value on this line.
169
-
This does not mean our reads are trash!
182
+
However, because of how PacBio chemistry works,
183
+
quality scores are often flattened and there is simply a placeholder value on this line.
184
+
This does not mean our reads are low quality!
170
185
Now that we've learned how to read files in and manipulate them a bit,
171
-
let's see if we can align the mecA gene to the Staphylococcus aureus genome.
186
+
let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.