Skip to content

Commit 3eb1e42

Browse files
Merge pull request #28 from BioJulia/2026-03-04-cookbook-sequences
2026 03 04 cookbook sequences
2 parents c0207c3 + 3d78237 commit 3eb1e42

File tree

3 files changed

+282
-0
lines changed

3 files changed

+282
-0
lines changed

cookbook/assets/mecA.fasta

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
>NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS
2+
ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATG
3+
CTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAA
4+
AGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATAT
5+
AATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAAC
6+
GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT
7+
TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA
8+
AGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCA
9+
ATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGC
10+
TAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACC
11+
TTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTA
12+
CAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGG
13+
TCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAA
14+
AAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACG
15+
ATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAAC
16+
TATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATC
17+
CACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATG
18+
GCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGAT
19+
TACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGAC
20+
GATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTA
21+
CAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTT
22+
TGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAA
23+
GATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTAT
24+
TAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGC
25+
ATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAA
26+
AATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATA
27+
AAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACA
28+
AGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCT
29+
ATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATG
30+
ATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAA
31+
CGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTC
32+
TTCATTT

cookbook/index.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
+++
2+
using Dates
3+
date = Date("2026-03-04")
4+
title = "BioJulia Cookbook"
5+
rss_descr = "Recipes for performing basic bioinformatics in julia"
6+
+++
7+
8+
# BioJulia Cookbook
9+
10+
This cookbook will provide a series of "recipes" that will help get started quickly with BioJulia so you can doing some bioinformatics!
11+
12+
We will have tutorials for reading in files, performing alignments, and using tools such as BLAST,
13+
as well as links to more documentation about specific BioJulia packages.
14+
15+
{{list_dir cookbook}}

cookbook/sequences.md

Lines changed: 235 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,235 @@
1+
+++
2+
using Dates
3+
date = Date("2026-03-11")
4+
title = "Sequence Input/Output"
5+
rss_descr = "Reading in FASTA, FASTQ, and FAI files using FASTX.jl"
6+
+++
7+
8+
# Sequence Input/Output
9+
10+
In this chapter, we'll discuss how to read in sequence files using the `FASTX.jl` package.
11+
More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/
12+
and with the built-in documentation you can access directly within the Julia REPL.
13+
14+
```julia
15+
julia> using FASTX
16+
julia> ?FASTX
17+
```
18+
19+
If `FASTX` is not already in your environment,
20+
it can be easily added from the Julia Registry.
21+
22+
To demonstrate how to use this package,
23+
let's try to read in some real-world data!
24+
25+
The `FASTX` package can read in three file types: `fasta`, `fastq`, and `fai`.
26+
27+
### FASTA files
28+
FASTA files are text files containing biological sequence data.
29+
They contain three parts: `identifier`, `description`, and `sequence`.
30+
31+
The template of a sequence record is:
32+
```
33+
>{description}
34+
{sequence}
35+
```
36+
37+
The `identifier` is the first part of the `description` until the first whitespace.
38+
If there is no whitespace, the `identifier` and `description` are the same.
39+
40+
Here is an example fasta:
41+
```
42+
>chrI chromosome 1
43+
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
44+
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
45+
```
46+
47+
### FASTQ files
48+
FASTQ files are also text-based files that contain sequences, along with an identifier and description.
49+
However, they also store sequence quality information (the Q is for quality!).
50+
51+
The template of a sequence record is:
52+
```
53+
@{description}
54+
{sequence}
55+
+{description}?
56+
{qualities}
57+
```
58+
59+
Here is an example record:
60+
```
61+
@FSRRS4401BE7HA
62+
tcagTTAAGATGGGAT
63+
+
64+
###EEEEEEEEE##E#
65+
```
66+
67+
### FAI files
68+
69+
FAI (FASTA index) files are used in conjunction with FASTA or FASTQ files.
70+
They are text files with TAB-delimited columns.
71+
They allow the user to access specific regions of the reference FASTA/FASTQ without reading in the entire sequence into memory.
72+
More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html).
73+
74+
```
75+
NAME Name of this reference sequence
76+
LENGTH Total length of this reference sequence, in bases
77+
OFFSET Offset in the FASTA/FASTQ file of this sequence's first base
78+
LINEBASES The number of bases on each line
79+
LINEWIDTH The number of bytes in each line, including the newline
80+
QUALOFFSET Offset of sequence's first quality within the FASTQ file
81+
82+
```
83+
84+
We will read in a FASTA file containing the _mecA_ gene.
85+
_mecA_ is an antibiotic resistance gene commonly found in methicillin-resistant _Staphylococcus aureus_ (MRSA).
86+
It encodes an alternative penicillin-binding protein that allows the bacteria to resist beta-lactam antibiotics like methicillin.
87+
It is typically 2.1 kB long.
88+
This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta).
89+
More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).
90+
91+
First we'll open the file.
92+
Then we'll iterate over every record in the file and print the sequence identifier,
93+
description and then the corresponding sequence.
94+
95+
```julia
96+
julia> FASTAReader(open("assets/mecA.fasta")) do reader
97+
for record in reader
98+
println(identifier(record))
99+
println(description(record))
100+
println(sequence(record))
101+
println(length(sequence(record)))
102+
end
103+
end
104+
105+
NG_047945.1
106+
NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS
107+
ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT
108+
2107
109+
```
110+
We confirmed that the length of the gene matches what we'd expect for _mecA_.
111+
In this case, there is only one sequence that spans the entire length of the gene.
112+
After this gene was sequenced, all of the reads were assembled together into a single consensus sequence.
113+
_mecA_ is a well-characterized gene, so there are no ambiguous regions,
114+
and there is no need for there to be multiple contigs
115+
(that is, for the gene to be broken into multiple pieces, since we know how the reads should be assembled).
116+
117+
Let's try reading in a larger FASTQ file.
118+
119+
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
120+
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
121+
122+
The BioSample ID for this sample is `SAMN02360768`.
123+
This ID refers to the physical bacterial isolate.
124+
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`.
125+
Both values correspond to one another and are helpful identifiers.
126+
127+
The SRA run accession (SRR) uniquely identifies a sequencing run.
128+
This sample is associated with two runs, and for this example we will download data linked to experiment `SRX392511`.
129+
130+
To download using the command line, type:
131+
```
132+
curl -L --retry 5 --retry-delay 2 \
133+
"https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \
134+
| gzip -c > SRX392511.fastq.gz
135+
```
136+
Alternatively, command line code can be executed from within Julia:
137+
138+
139+
This file can be downloaded on the command line using `curl`.
140+
> One of the nice things about Julia is that it is
141+
> super easy to toggle between the Julia REPL and
142+
> bash shell by typing `;` to access shell mode
143+
> from Julia.
144+
> You can use the `backspace`/`delete` key to
145+
> quickly toggle back to the Julia REPL.
146+
147+
```julia
148+
run(pipeline(
149+
`curl -L --retry 5 --retry-delay 2 "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511"`,
150+
`gzip -c`,
151+
"SRX392511.fastq.gz"
152+
)
153+
)
154+
```
155+
This file is gzipped, so we'll need to account for that as we are reading it in with `FASTX.jl`.
156+
In Julia, we can decompress gzipped files using `CodecZlib.jl`.
157+
158+
Instead of printing out every record (this isn't super practical because it's a big file), let's save all the records into a vector.
159+
160+
```julia
161+
using CodecZlib
162+
163+
records = []
164+
165+
FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader
166+
for record in reader
167+
push!(records, record)
168+
end
169+
end
170+
```
171+
172+
We can see how many reads there are by looking at the length of the vector `records`.
173+
174+
```julia
175+
julia> length(records)
176+
9852716
177+
```
178+
179+
Let's take a look at what the first 10 reads look like:
180+
181+
```
182+
julia> records[1:10]
183+
10-element Vector{Any}:
184+
FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGGATTTTGTTATATTGTA…", "$$$$$$$$$$$$$$$$$$$…")
185+
FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGCATAAATTTAAAAAAAA…", "$$$$$$$$$$$$$$$$$$$…")
186+
FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAT…", "???????????????????…")
187+
FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAG…", "???????????????????…")
188+
FASTX.FASTQ.Record("SRX392511.3 3 length=101", "GAGCAGTAGTATAAAATGA…", "???????????????????…")
189+
FASTX.FASTQ.Record("SRX392511.3 3 length=101", "CCGACACAATTACAAGCCA…", "???????????????????…")
190+
FASTX.FASTQ.Record("SRX392511.4 4 length=101", "GATTATCTAGTCATAATTC…", "???????????????????…")
191+
FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…")
192+
FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…")
193+
FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…")
194+
```
195+
196+
Let's calculate the average quality across all reads.
197+
We'll do this by converting each PHRED score to its error probability,
198+
and summing these values across all bases in all reads.
199+
Then, we can divide this sum by the total number of bases
200+
to get the average probability across the entire dataset.
201+
It's important that we first convert the PHRED scores to the probability errors before averaging,
202+
because PHRED is a logarithmic transformation of error probability.
203+
Therefore, simply averaging PHRED and then converting to error probability is not the same as converting PHRED to error probability and averaging that sum.
204+
205+
```julia
206+
function phred_to_prob(Q)
207+
# The formula is Pe = 10^(-Q/10)
208+
return 10^(-Q / 10.0)
209+
end
210+
211+
# get sum of all error probabilities
212+
total_error_prob = sum(
213+
sum(phred_to_prob(q) for q in FASTQ.quality_scores(record))
214+
for record in records
215+
)
216+
4.5100356107423276e7
217+
# get total number of bases
218+
total_bases = sum(length(FASTQ.quality_scores(record)) for record in records)
219+
995124316
220+
221+
# get average error probability
222+
total_error_prob/total_bases
223+
0.045321328584059316
224+
```
225+
226+
The average error probability is just 4.53%,
227+
meaning the majority of reads are high quality.
228+
229+
More information about converting ASCII characters and PHRED scores to quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
230+
231+
Now that we've learned how to read files in and manipulate them a bit,
232+
let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.
233+
This will tell us if this particular _S. aureus_ is MRSA.
234+
235+

0 commit comments

Comments
 (0)