Skip to content

Commit 3d78237

Browse files
add dates to headers and fix grammar issues/typos
1 parent 0655247 commit 3d78237

File tree

2 files changed

+43
-46
lines changed

2 files changed

+43
-46
lines changed

cookbook/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ rss_descr = "Recipes for performing basic bioinformatics in julia"
99

1010
This cookbook will provide a series of "recipes" that will help get started quickly with BioJulia so you can doing some bioinformatics!
1111

12-
We have tutorials for reading in files, performing alignments, and using tools such as BLAST,
12+
We will have tutorials for reading in files, performing alignments, and using tools such as BLAST,
1313
as well as links to more documentation about specific BioJulia packages.
1414

1515
{{list_dir cookbook}}

cookbook/sequences.md

Lines changed: 42 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,13 @@
11
+++
2+
using Dates
3+
date = Date("2026-03-11")
24
title = "Sequence Input/Output"
35
rss_descr = "Reading in FASTA, FASTQ, and FAI files using FASTX.jl"
46
+++
57

68
# Sequence Input/Output
79

8-
In this chapter, we'll talk about how to read in sequence files using the `FASTX.jl` module.
10+
In this chapter, we'll discuss how to read in sequence files using the `FASTX.jl` package.
911
More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/
1012
and with the built-in documentation you can access directly within the Julia REPL.
1113

@@ -24,16 +26,16 @@ The `FASTX` package can read in three file types: `fasta`, `fastq`, and `fai`.
2426

2527
### FASTA files
2628
FASTA files are text files containing biological sequence data.
27-
They have three parts: name, description, and sequence.
29+
They contain three parts: `identifier`, `description`, and `sequence`.
2830

2931
The template of a sequence record is:
3032
```
3133
>{description}
3234
{sequence}
3335
```
3436

35-
The identifier is the first part of the description until the first whitespace.
36-
If there is no white space, the name and description are the same.
37+
The `identifier` is the first part of the `description` until the first whitespace.
38+
If there is no whitespace, the `identifier` and `description` are the same.
3739

3840
Here is an example fasta:
3941
```
@@ -43,7 +45,7 @@ CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
4345
```
4446

4547
### FASTQ files
46-
FASTQ files are also text-based files that contain sequences, along with a name and description.
48+
FASTQ files are also text-based files that contain sequences, along with an identifier and description.
4749
However, they also store sequence quality information (the Q is for quality!).
4850

4951
The template of a sequence record is:
@@ -64,7 +66,7 @@ tcagTTAAGATGGGAT
6466

6567
### FAI files
6668

67-
FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files.
69+
FAI (FASTA index) files are used in conjunction with FASTA or FASTQ files.
6870
They are text files with TAB-delimited columns.
6971
They allow the user to access specific regions of the reference FASTA/FASTQ without reading in the entire sequence into memory.
7072
More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html).
@@ -80,14 +82,15 @@ QUALOFFSET Offset of sequence's first quality within the FASTQ file
8082
```
8183

8284
We will read in a FASTA file containing the _mecA_ gene.
83-
_mecA_ is an antibiotic resistance gene commonly found in Methicillin-resistant _Staphylococcus aureus_ (MRSA).
84-
It helps bacteria to break down beta-lactam antibiotics like methicillin.
85+
_mecA_ is an antibiotic resistance gene commonly found in methicillin-resistant _Staphylococcus aureus_ (MRSA).
86+
It encodes an alternative penicillin-binding protein that allows the bacteria to resist beta-lactam antibiotics like methicillin.
8587
It is typically 2.1 kB long.
86-
This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).
88+
This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta).
89+
More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).
8790

8891
First we'll open the file.
89-
Then we'll iterate over every record in the file and
90-
print out the sequence identifier, the sequence description and then the corresponding sequence.
92+
Then we'll iterate over every record in the file and print the sequence identifier,
93+
description and then the corresponding sequence.
9194

9295
```julia
9396
julia> FASTAReader(open("assets/mecA.fasta")) do reader
@@ -104,43 +107,42 @@ NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant
104107
ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT
105108
2107
106109
```
107-
We confirmed that the length of the gene matches what we'd expect for _mecA_.
110+
We confirmed that the length of the gene matches what we'd expect for _mecA_.
108111
In this case, there is only one sequence that spans the entire length of the gene.
109112
After this gene was sequenced, all of the reads were assembled together into a single consensus sequence.
110-
_mecA_ is a well-characterized gene, so there are no ambiguous regions, and there is no need for there to be multiple contigs
111-
(AKA for the gene to be broken up into multiple pieces, as we know exactly how the reads should be assembled together.).
112-
113+
_mecA_ is a well-characterized gene, so there are no ambiguous regions,
114+
and there is no need for there to be multiple contigs
115+
(that is, for the gene to be broken into multiple pieces, since we know how the reads should be assembled).
113116

114117
Let's try reading in a larger FASTQ file.
115118

116119
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
117120
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
118121

119-
The BioSample ID for this sample is `SAMN02360768`.
122+
The BioSample ID for this sample is `SAMN02360768`.
120123
This ID refers to the physical bacterial isolate.
121124
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`.
122125
Both values correspond to one another and are helpful identifiers.
123126

124-
The SRR (sample run accession number) is the unique identifier within SRA
125-
and corresponds to the specific sequencing run.
126-
There are two sequencing runs and accession numbers for this sample,
127-
but we'll select one to download: `SRX392511`.
128-
129-
This file can be downloaded on the command line using `curl`.
130-
> One of the nice things about julia is that it is
131-
> super easy to toggle between the julia REPL and
132-
> bash shell by typing `;` to access shell mode
133-
> from julia.
134-
> You can use the `backspace`/`delete` key to
135-
> quickly toggle back to the julia REPL.
127+
The SRA run accession (SRR) uniquely identifies a sequencing run.
128+
This sample is associated with two runs, and for this example we will download data linked to experiment `SRX392511`.
136129

137130
To download using the command line, type:
138131
```
139132
curl -L --retry 5 --retry-delay 2 \
140133
"https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \
141134
| gzip -c > SRX392511.fastq.gz
142135
```
143-
Alternatively, command line code can be executed from within julia:
136+
Alternatively, command line code can be executed from within Julia:
137+
138+
139+
This file can be downloaded on the command line using `curl`.
140+
> One of the nice things about Julia is that it is
141+
> super easy to toggle between the Julia REPL and
142+
> bash shell by typing `;` to access shell mode
143+
> from Julia.
144+
> You can use the `backspace`/`delete` key to
145+
> quickly toggle back to the Julia REPL.
144146
145147
```julia
146148
run(pipeline(
@@ -150,7 +152,8 @@ run(pipeline(
150152
)
151153
)
152154
```
153-
This file is gzipped, so we'll need to account for that as we are reading it in.
155+
This file is gzipped, so we'll need to account for that as we are reading it in with `FASTX.jl`.
156+
In Julia, we can decompress gzipped files using `CodecZlib.jl`.
154157

155158
Instead of printing out every record (this isn't super practical because it's a big file), let's save all the records into a vector.
156159

@@ -166,7 +169,7 @@ FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader
166169
end
167170
```
168171

169-
We can see how many reads there are by looking at the length of `records`.
172+
We can see how many reads there are by looking at the length of the vector `records`.
170173

171174
```julia
172175
julia> length(records)
@@ -188,18 +191,15 @@ julia> records[1:10]
188191
FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…")
189192
FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…")
190193
FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…")
191-
```
192-
193-
All of the nucleotides in these two reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.
194-
This is not concerning, as it is typical for the first couple of reads to be low quality.
194+
```
195195

196196
Let's calculate the average quality across all reads.
197-
We'll do this by converting each PHRED score to its probability error,
198-
and summing these values across all sequences in all reads.
197+
We'll do this by converting each PHRED score to its error probability,
198+
and summing these values across all bases in all reads.
199199
Then, we can divide this sum by the total number of bases
200-
to get the average probability error for the entire sequence.
200+
to get the average probability across the entire dataset.
201201
It's important that we first convert the PHRED scores to the probability errors before averaging,
202-
because PHRED is a logarithmic transform of error probability.
202+
because PHRED is a logarithmic transformation of error probability.
203203
Therefore, simply averaging PHRED and then converting to error probability is not the same as converting PHRED to error probability and averaging that sum.
204204

205205
```julia
@@ -214,7 +214,7 @@ total_error_prob = sum(
214214
for record in records
215215
)
216216
4.5100356107423276e7
217-
# get total number of base pairs
217+
# get total number of bases
218218
total_bases = sum(length(FASTQ.quality_scores(record)) for record in records)
219219
995124316
220220

@@ -226,13 +226,10 @@ total_error_prob/total_bases
226226
The average error probability is just 4.53%,
227227
meaning the majority of reads are high quality.
228228

229-
More information about how to convert ASCII values/PHRED scores to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
229+
More information about converting ASCII characters and PHRED scores to quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
230230

231-
232-
233-
234231
Now that we've learned how to read files in and manipulate them a bit,
235232
let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.
236-
This will tell us if this _S. aureus_ is MRSA.
233+
This will tell us if this particular _S. aureus_ is MRSA.
237234

238235

0 commit comments

Comments
 (0)