cookbook/sequences.md: 75 additions & 1 deletion
@@ -33,7 +33,8 @@ The template of a sequence record is:
 ```
 
 ### FASTQ files
-FASTQ files are also text-based files that contain sequences, along with a name and description. However, they also store sequence quality information (the Q is for quality!).
+FASTQ files are also text-based files that contain sequences, along with a name and description.
+However, they also store sequence quality information (the Q is for quality!).
 
 The template of a sequence record is:
 ```
@@ -43,6 +44,14 @@ The template of a sequence record is:
 {qualities}
 ```
 
+Here is an example record:
+```
+@FSRRS4401BE7HA
+tcagTTAAGATGGGAT
++
+###EEEEEEEEE##E#
+```
+
 ### FAI files
 
 FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files.
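
To make the four-line FASTQ layout concrete, here is a minimal sketch in plain Julia (Base only, no packages) that splits the example record added above into its fields. The variable names are illustrative assumptions, and this is not how a real parser such as the one in FASTX.jl works internally; it only mirrors the template.

```julia
# The example FASTQ record from the text, one field per line.
record = """
@FSRRS4401BE7HA
tcagTTAAGATGGGAT
+
###EEEEEEEEE##E#
"""

# Split the record into the four lines of the template:
# header, sequence, separator ('+'), and qualities.
header, sequence, separator, qualities = split(strip(record), '\n')

# The identifier is the header without its leading '@'.
identifier = header[2:end]

# Every base must have exactly one quality character.
@assert length(sequence) == length(qualities)

println(identifier)        # FSRRS4401BE7HA
println(length(sequence))  # 16
```

The length check is the useful part: a record whose sequence and quality lines differ in length is malformed.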
@@ -89,11 +98,76 @@ In this case, there is only one sequence.
 
 Let's try reading in a larger fastq file.
 
+A Staphylococcus aureus isolate was sequenced with PacBio, and the raw reads were uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
+The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
+
+The BioSample ID for this sample is `SAMN14830786`.
+This ID refers to the physical bacterial isolate.
+
+The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.
+
+Both values correspond to one another and are helpful identifiers.
+
+The SRR (sample run accession number) is the unique identifier within SRA
+and corresponds to a specific sequencing run.
+
+In a later tutorial, we will discuss how we can download this file within Julia with the SRR.
+
+But for now, the file can be downloaded using curl
+All of the nucleotides in all of the reads have the quality character `$`, which corresponds to a Phred score of 3 and a probability of error of 0.50119.
+More information about how to convert ASCII values to quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
+This would be quite poor if we were looking at Illumina data.
+However, because of how PacBio chemistry works,
+the quality scores are often flattened and there is simply a placeholder value on this line.
+This does not mean our reads are trash!
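
The conversion from a quality character to an error probability can be sketched in a few lines of plain Julia. This assumes the common Phred+33 encoding, which is consistent with the 0.50119 figure quoted above; the names `PHRED_OFFSET`, `phred_score`, and `error_probability` are hypothetical helpers for illustration, not part of any package.

```julia
# Assumption: qualities use the Phred+33 (Sanger) encoding,
# i.e. score = ASCII code of the character minus 33.
const PHRED_OFFSET = 33

# Phred quality score Q encoded by a single quality character.
phred_score(c::Char) = Int(c) - PHRED_OFFSET

# Probability that the base call is wrong: P = 10^(-Q / 10).
error_probability(c::Char) = 10.0^(-phred_score(c) / 10)

println(phred_score('$'))                            # 3
println(round(error_probability('$'); digits = 5))   # 0.50119
println(phred_score('E'))                            # 36
```

For comparison, the `E` characters in the earlier example record encode a score of 36, an error probability of roughly 0.00025, which shows how low the placeholder `$` value is.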
+Now that we've learned how to read files in and manipulate them a bit,
+let's see if we can align the mecA gene to the Staphylococcus aureus genome.