Skip to content

Commit 38c2d71

Browse files
add sequence tutorial
1 parent faf2890 commit 38c2d71

1 file changed

Lines changed: 75 additions & 1 deletion

File tree

cookbook/sequences.md

Lines changed: 75 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,8 @@ The template of a sequence record is:
3333
```
3434

3535
### FASTQ files
36-
FASTQ files are also text-based files that contain sequences, along with a name and description. However, they also store sequence quality information (the Q is for quality!).
36+
FASTQ files are also text-based files that contain sequences, along with a name and description.
37+
However, they also store sequence quality information (the Q is for quality!).
3738

3839
The template of a sequence record is:
3940
```
@@ -43,6 +44,14 @@ The template of a sequence record is:
4344
{qualities}
4445
```
4546

47+
Here is an example record:
48+
```
49+
@FSRRS4401BE7HA
50+
tcagTTAAGATGGGAT
51+
+
52+
###EEEEEEEEE##E#
53+
```
54+
4655
### FAI files
4756

4857
FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files.
@@ -89,11 +98,76 @@ In this case, there is only one sequence.
8998

9099
Let's try reading in a larger fastq file.
91100

101+
The raw reads for a Staphylococcus aureus isolate was sequenced with Pac-Bio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
102+
The link to download the raw fastq's can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
103+
104+
The biosample ID for this sample is `SAMN14830786`.
105+
This ID refers to the physical bacterial isolate.
106+
107+
The SRA sample accession number (an internal value used within Sequence Read Archive) is `SRS6947643`.
108+
109+
Both values correspond to one another and are helpful identifiers.
110+
111+
The SRR (sample run accession number) is the unique identifier within SRA
112+
and corresponds to the specific sequencing run within SRA.
113+
114+
In a later tutorial, we will discuss how we can download this file within julia with the SRR.
115+
116+
But for now, the file can be downloaded using curl
117+
118+
119+
```
120+
curl -L --retry 5 --retry-delay 2 \
121+
"https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR12147540" \
122+
| gzip -c > SRR12147540.fastq.gz
123+
```
124+
This file is gzipped, so we'll need to account for that as we are reading it in.
125+
126+
Instead of printing out every record (this isn't super practical because it's a big file), let's save all the records into a vector.
127+
92128
```julia
129+
using CodecZlib
93130

131+
records = []
94132

133+
FASTQReader(GzipDecompressorStream(open("assets/SRR12147540.fastq.gz"))) do reader
134+
for record in reader
135+
push!(records, record)
136+
end
137+
end
138+
```
139+
140+
We can see how many reads there are by looking at the length of `records`.
95141

142+
```julia
143+
julia> length(records)
144+
163528
96145
```
97146

147+
Let's take a look at what the first 10 reads look like:
148+
149+
```
150+
julia> records[1:10]
151+
10-element Vector{Any}:
152+
FASTX.FASTQ.Record("SRR12147540.1 length=43", "TTTTTTTTCCTTTCTTTCT…", "$$$$$$$$$$$$$$$$$$$…")
153+
FASTX.FASTQ.Record("SRR12147540.2 length=40", "TTTGTTTTTTTTTGTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
154+
FASTX.FASTQ.Record("SRR12147540.3 length=32", "TTTTTTTTTGTTCTTTGGT…", "$$$$$$$$$$$$$$$$$$$…")
155+
FASTX.FASTQ.Record("SRR12147540.4 length=40", "TTTTCTTTTCCTTCTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
156+
FASTX.FASTQ.Record("SRR12147540.5 length=55", "TGTTGTTTGTGTCTCGTTT…", "$$$$$$$$$$$$$$$$$$$…")
157+
FASTX.FASTQ.Record("SRR12147540.6 length=166", "TTTCCTTTTTTTTCCTCTC…", "$$$$$$$$$$$$$$$$$$$…")
158+
FASTX.FASTQ.Record("SRR12147540.7 length=338", "TGACCACCTTAGAACTTGG…", "$$$$$$$$$$$$$$$$$$$…")
159+
FASTX.FASTQ.Record("SRR12147540.8 length=245", "ACGCCGCGGCCAAAGAACG…", "$$$$$$$$$$$$$$$$$$$…")
160+
FASTX.FASTQ.Record("SRR12147540.9 length=157", "TTTGTTTGCGCGGTCTCTT…", "$$$$$$$$$$$$$$$$$$$…")
161+
FASTX.FASTQ.Record("SRR12147540.10 length=100", "TTCTTTCGCCTTTTTGCCT…", "$$$$$$$$$$$$$$$$$$$…")
162+
```
163+
164+
All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.
165+
More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
166+
This would be quite poor if we were looking at Illumia data.
167+
However, because of how Pacbio chemistry works,
168+
the quality scores are often flattened and there is simply a placeholder value on this line.
169+
This does not mean our reads are trash!
170+
Now that we've learned how to read files in and manipulate them a bit,
171+
let's see if we can align the mecA gene to the Staphylococcus aureus genome.
98172

99173

0 commit comments

Comments
 (0)