Commit bdd640c

fix typos and update mecA.fasta
1 parent 38c2d71 commit bdd640c

2 files changed

+78
-42
lines changed

cookbook/assets/mecA.fasta

Lines changed: 32 additions & 12 deletions
@@ -1,12 +1,32 @@
1-
>NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=]
2-
ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTG
3-
GATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGA
4-
AATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGT
5-
GTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATC
6-
AATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATT
7-
CGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACA
8-
AGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAA
9-
CTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCA
10-
TGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACA
11-
GAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATT
12-
TTCCAGAGACAACTGAATAA
1+
>NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS
2+
ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATG
3+
CTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAA
4+
AGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATAT
5+
AATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAAC
6+
GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT
7+
TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA
8+
AGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCA
9+
ATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGC
10+
TAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACC
11+
TTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTA
12+
CAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGG
13+
TCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAA
14+
AAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACG
15+
ATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAAC
16+
TATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATC
17+
CACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATG
18+
GCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGAT
19+
TACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGAC
20+
GATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTA
21+
CAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTT
22+
TGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAA
23+
GATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTAT
24+
TAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGC
25+
ATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAA
26+
AATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATA
27+
AAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACA
28+
AGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCT
29+
ATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATG
30+
ATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAA
31+
CGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTC
32+
TTCATTT

cookbook/sequences.md

Lines changed: 46 additions & 30 deletions
@@ -1,26 +1,26 @@
11
+++
22
title = "Sequence Input/Output"
3-
rss_descr = "Reading in fasta files using FASTX.jl"
3+
rss_descr = "Reading in FASTA, FASTQ, and FAI files using FASTX.jl"
44
+++
55

66
# Sequence Input/Output
77

88
In this chapter, we'll talk about how to read in sequence files using the `FASTX.jl` module.
99
More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/
10-
and with the built-in documentation.
10+
and with the built-in documentation you can access directly within the Julia REPL.
1111

1212
```julia
1313
julia> using FASTX
1414
julia> ?FASTX
1515
```
1616

17-
If FASTX is not already in your environment,
17+
If `FASTX` is not already in your environment,
1818
it can be easily added from the Julia Registry.
1919

20-
To demonstrate how to this package,
20+
To demonstrate how to use this package,
2121
let's try to read in some real-world data!
2222

23-
The `FASTX` can read in 3 file types: fasta, fastq, and fai.
23+
The `FASTX` package can read in three file types: `fasta`, `fastq`, and `fai`.
2424

2525
### FASTA files
2626
FASTA files are text files containing biological sequence data.
@@ -32,6 +32,16 @@ The template of a sequence record is:
3232
{sequence}
3333
```
3434

35+
The identifier is the first part of the description until the first whitespace.
36+
If there is no whitespace, the identifier and description are the same.
37+
38+
Here is an example fasta:
39+
```
40+
>chrI chromosome 1
41+
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
42+
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
43+
```
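
The identifier rule above can be sketched in plain Julia. This is only an illustration of the rule, not how `FASTX.jl` implements it, and `split_header` is a hypothetical helper name that is not part of the package.

```julia
# Hypothetical helper (not part of FASTX.jl): split a FASTA header,
# given without the leading '>', into (identifier, description).
# Following the rule above, the description is the whole header and the
# identifier is everything before the first whitespace character.
function split_header(header::AbstractString)
    description = String(header)
    identifier = String(first(split(description, isspace; limit = 2)))
    return identifier, description
end

id, desc = split_header("chrI chromosome 1")
println(id)    # identifier of the example record
println(desc)  # full description of the example record
```

When the header contains no whitespace, both returned values are the same string.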
44+
3545
### FASTQ files
3646
FASTQ files are also text-based files that contain sequences, along with a name and description.
3747
However, they also store sequence quality information (the Q is for quality!).
@@ -54,9 +64,9 @@ tcagTTAAGATGGGAT
5464

5565
### FAI files
5666

57-
FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files.
67+
FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files.
5868
They are text files with TAB-delimited columns.
59-
Each line contains information about each region sequence within the FASTA/FASTQ.
69+
They allow the user to access specific regions of the reference FASTA/FASTQ without reading the entire file into memory.
6070
More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html).
6171

6272
```
@@ -69,15 +79,14 @@ QUALOFFSET Offset of sequence's first quality within the FASTQ file
6979
7080
```
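
To see why an index allows random access, here is a minimal sketch in plain Julia of the offset arithmetic described in the faidx documentation. The `FaiEntry` type and both functions are hypothetical names for illustration, not `FASTX.jl` API, and the index line was written by hand for the two-line `chrI` example above, assuming Unix (`\n`) line endings.

```julia
# One row of a FASTA index; fields mirror the first five faidx columns.
struct FaiEntry
    name::String
    seqlength::Int   # LENGTH: total number of bases
    offset::Int      # OFFSET: byte offset of the first base in the file
    linebases::Int   # LINEBASES: bases per line
    linewidth::Int   # LINEWIDTH: bytes per line, including the newline
end

# Parse one TAB-delimited index line (hypothetical helper).
function parse_fai_line(line::AbstractString)
    f = split(chomp(line), '\t')
    return FaiEntry(String(f[1]), parse(Int, f[2]), parse(Int, f[3]),
                    parse(Int, f[4]), parse(Int, f[5]))
end

# 0-based byte offset in the file of the base at 1-based position `pos`.
function byte_offset(entry::FaiEntry, pos::Integer)
    1 <= pos <= entry.seqlength || throw(ArgumentError("position out of range"))
    line_index, column = divrem(pos - 1, entry.linebases)
    return entry.offset + line_index * entry.linewidth + column
end

fasta = ">chrI chromosome 1\n" *
        "CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC\n" *
        "CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA\n"

# Hand-written index line for the text above (an assumption of this sketch).
entry = parse_fai_line("chrI\t94\t19\t50\t51")

io = IOBuffer(fasta)
seek(io, byte_offset(entry, 51))   # jump straight to the second sequence line
base = Char(read(io, UInt8))
println(base)
```

Only the bytes at the computed offset are read; nothing before them has to be parsed.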
7181

82+
We will read in a FASTA file containing the _mecA_ gene.
83+
_mecA_ is an antibiotic resistance gene commonly found in methicillin-resistant _Staphylococcus aureus_ (MRSA).
84+
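It encodes PBP2a, a penicillin-binding protein with low affinity for beta-lactam antibiotics like methicillin, which lets the bacteria keep building their cell wall in the presence of these drugs.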
85+
It is typically about 2.1 kb (kilobases) long.
86+
This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).
7287

73-
74-
75-
76-
We will read in a fasta file containing the _mecA_ gene.
77-
This gene was taken from NCBI [here](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=3920764#).
78-
79-
First we'll open the file,
80-
then we'll iterate over every record in the file and
88+
First we'll open the file.
89+
Then we'll iterate over every record in the file and
8190
print out the sequence identifier, the sequence description, the corresponding sequence, and its length.
8291

8392
```julia
@@ -86,32 +95,38 @@ julia> FASTAReader(open("assets/mecA.fasta")) do reader
8695
println(identifier(record))
8796
println(description(record))
8897
println(sequence(record))
98+
println(length(sequence(record)))
8999
end
90100
end
91101

92-
NC_007795.1:907598-908317
93-
NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=]
94-
ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTGGATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGAAATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGTGTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATCAATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATTCGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACAAGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAACTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCATGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACAGAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATTTTCCAGAGACAACTGAATAA
102+
NG_047945.1
103+
NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS
104+
ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGA
TGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT
105+
2107
95106
```
107+
The length of the record (2107 bp) matches what we'd expect for _mecA_.
108+
In this case, there is only one sequence that spans the entire length of the gene.
109+
After this gene was sequenced, all of the reads were assembled together into a single consensus sequence.
110+
_mecA_ is a well-characterized gene, so there are no ambiguous regions and no need for multiple contigs
111+
(that is, the gene does not need to be split into several pieces, because we know exactly how the reads should be assembled).
96112

97-
In this case, there is only one sequence.
98113

99-
Let's try reading in a larger fastq file.
114+
Let's try reading in a larger FASTQ file.
100115

101-
The raw reads for a Staphylococcus aureus isolate was sequenced with Pac-Bio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
102-
The link to download the raw fastq's can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
116+
The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
117+
The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
103118

104-
The biosample ID for this sample is `SAMN14830786`.
119+
The BioSample ID for this sample is `SAMN14830786`.
105120
This ID refers to the physical bacterial isolate.
106121

107-
The SRA sample accession number (an internal value used within Sequence Read Archive) is `SRS6947643`.
122+
The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.
108123

109124
Both values correspond to one another and are helpful identifiers.
110125

111126
The SRR (sample run accession number) is the unique identifier within SRA
112-
and corresponds to the specific sequencing run within SRA.
127+
and corresponds to the specific sequencing run.
113128

114-
In a later tutorial, we will discuss how we can download this file within julia with the SRR.
129+
In a later tutorial, we will discuss how to download this file in Julia using the SRR.
115130

116131
But for now, the file can be downloaded using curl
117132

@@ -164,10 +179,11 @@ julia> records[1:10]
164179
All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probability of error of 0.50119.
165180
More information about how to convert ASCII values to quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).
166181
This would be quite poor if we were looking at Illumina data.
167-
However, because of how Pacbio chemistry works,
168-
the quality scores are often flattened and there is simply a placeholder value on this line.
169-
This does not mean our reads are trash!
182+
However, because of how PacBio chemistry works,
183+
quality scores are often flattened and there is simply a placeholder value on this line.
184+
This does not mean our reads are low quality!
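
As a quick check of the arithmetic behind the `$` quality character, here is a small sketch in plain Julia. It assumes the common Phred+33 encoding; `phred_score` and `error_probability` are hypothetical helper names, not functions from `FASTX.jl`.

```julia
# Phred+33 is assumed: the quality score is the character's ASCII code minus 33.
phred_score(c::Char; offset::Integer = 33) = Int(c) - offset

# A Phred score Q corresponds to an error probability of 10^(-Q / 10).
error_probability(q::Real) = 10.0^(-q / 10)

q = phred_score('$')        # '$' is ASCII 36, so Q = 3
p = error_probability(q)
println(q)
println(round(p; digits = 5))
```

With this encoding, `$` decodes to a Phred score of 3, which gives the error probability of about 0.50119 quoted above.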
170185
Now that we've learned how to read files in and manipulate them a bit,
171-
let's see if we can align the mecA gene to the Staphylococcus aureus genome.
186+
let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.
187+
This will tell us whether this _S. aureus_ isolate is MRSA.
172188

173189
