|
| 1 | ++++ |
| 2 | +using Dates |
| 3 | +date = Date("2026-03-11") |
| 4 | +title = "Sequence Input/Output" |
| 5 | +rss_descr = "Reading in FASTA, FASTQ, and FAI files using FASTX.jl" |
| 6 | ++++ |
| 7 | + |
| 8 | +# Sequence Input/Output |
| 9 | + |
| 10 | +In this chapter, we'll discuss how to read in sequence files using the `FASTX.jl` package. |
| 11 | +More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/ |
| 12 | +and with the built-in documentation you can access directly within the Julia REPL. |
| 13 | + |
| 14 | +```julia |
| 15 | +julia> using FASTX |
| 16 | +julia> ?FASTX |
| 17 | +``` |
| 18 | + |
| 19 | +If `FASTX` is not already in your environment, |
| 20 | +it can be easily added from the Julia Registry. |
| 21 | + |
| 22 | +To demonstrate how to use this package, |
| 23 | +let's try to read in some real-world data! |
| 24 | + |
| 25 | +The `FASTX` package can read in three file types: `fasta`, `fastq`, and `fai`. |
| 26 | + |
| 27 | +### FASTA files |
| 28 | +FASTA files are text files containing biological sequence data. |
| 29 | +They contain three parts: `identifier`, `description`, and `sequence`. |
| 30 | + |
| 31 | +The template of a sequence record is: |
| 32 | +``` |
| 33 | +>{description} |
| 34 | +{sequence} |
| 35 | +``` |
| 36 | + |
| 37 | +The `identifier` is the first part of the `description` until the first whitespace. |
| 38 | +If there is no whitespace, the `identifier` and `description` are the same. |
| 39 | + |
| 40 | +Here is an example fasta: |
| 41 | +``` |
| 42 | +>chrI chromosome 1 |
| 43 | +CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC |
| 44 | +CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA |
| 45 | +``` |
| 46 | + |
| 47 | +### FASTQ files |
| 48 | +FASTQ files are also text-based files that contain sequences, along with an identifier and description. |
| 49 | +However, they also store sequence quality information (the Q is for quality!). |
| 50 | + |
| 51 | +The template of a sequence record is: |
| 52 | +``` |
| 53 | +@{description} |
| 54 | +{sequence} |
| 55 | ++{description}? |
| 56 | +{qualities} |
| 57 | +``` |
| 58 | + |
| 59 | +Here is an example record: |
| 60 | +``` |
| 61 | +@FSRRS4401BE7HA |
| 62 | +tcagTTAAGATGGGAT |
| 63 | ++ |
| 64 | +###EEEEEEEEE##E# |
| 65 | +``` |
| 66 | + |
| 67 | +### FAI files |
| 68 | + |
| 69 | +FAI (FASTA index) files are used in conjunction with FASTA or FASTQ files. |
| 70 | +They are text files with TAB-delimited columns. |
| 71 | +They allow the user to access specific regions of the reference FASTA/FASTQ without reading in the entire sequence into memory. |
| 72 | +More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html). |
| 73 | + |
| 74 | +``` |
| 75 | +NAME Name of this reference sequence |
| 76 | +LENGTH Total length of this reference sequence, in bases |
| 77 | +OFFSET Offset in the FASTA/FASTQ file of this sequence's first base |
| 78 | +LINEBASES The number of bases on each line |
| 79 | +LINEWIDTH The number of bytes in each line, including the newline |
| 80 | +QUALOFFSET Offset of sequence's first quality within the FASTQ file |
| 81 | +
|
| 82 | +``` |
| 83 | + |
| 84 | +We will read in a FASTA file containing the _mecA_ gene. |
| 85 | +_mecA_ is an antibiotic resistance gene commonly found in methicillin-resistant _Staphylococcus aureus_ (MRSA). |
| 86 | +It encodes an alternative penicillin-binding protein that allows the bacteria to resist beta-lactam antibiotics like methicillin. |
| 87 | +It is typically 2.1 kB long. |
| 88 | +This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). |
| 89 | + More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1). |
| 90 | + |
| 91 | +First we'll open the file. |
| 92 | +Then we'll iterate over every record in the file and print the sequence identifier, |
| 93 | +description and then the corresponding sequence. |
| 94 | + |
| 95 | +```julia |
| 96 | +julia> FASTAReader(open("assets/mecA.fasta")) do reader |
| 97 | + for record in reader |
| 98 | + println(identifier(record)) |
| 99 | + println(description(record)) |
| 100 | + println(sequence(record)) |
| 101 | + println(length(sequence(record))) |
| 102 | + end |
| 103 | + end |
| 104 | + |
| 105 | +NG_047945.1 |
| 106 | +NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS |
| 107 | +ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT |
| 108 | +2107 |
| 109 | +``` |
| 110 | +We confirmed that the length of the gene matches what we'd expect for _mecA_. |
| 111 | +In this case, there is only one sequence that spans the entire length of the gene. |
| 112 | +After this gene was sequenced, all of the reads were assembled together into a single consensus sequence. |
| 113 | +_mecA_ is a well-characterized gene, so there are no ambiguous regions, |
| 114 | +and there is no need for there to be multiple contigs |
| 115 | +(that is, for the gene to be broken into multiple pieces, since we know how the reads should be assembled). |
| 116 | + |
| 117 | +Let's try reading in a larger FASTQ file. |
| 118 | + |
| 119 | +The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625). |
| 120 | +The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625). |
| 121 | + |
| 122 | +The BioSample ID for this sample is `SAMN02360768`. |
| 123 | +This ID refers to the physical bacterial isolate. |
| 124 | +The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`. |
| 125 | +Both values correspond to one another and are helpful identifiers. |
| 126 | + |
| 127 | +The SRA run accession (SRR) uniquely identifies a sequencing run. |
| 128 | +This sample is associated with two runs, and for this example we will download data linked to experiment `SRX392511`. |
| 129 | + |
| 130 | +To download using the command line, type: |
| 131 | +``` |
| 132 | +curl -L --retry 5 --retry-delay 2 \ |
| 133 | + "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \ |
| 134 | + | gzip -c > SRX392511.fastq.gz |
| 135 | +``` |
| 136 | +Alternatively, command line code can be executed from within Julia: |
| 137 | + |
| 138 | + |
| 139 | +This file can be downloaded on the command line using `curl`. |
| 140 | +> One of the nice things about Julia is that it is |
| 141 | +> super easy to toggle between the Julia REPL and |
| 142 | +> bash shell by typing `;` to access shell mode |
| 143 | +> from Julia. |
| 144 | +> You can use the `backspace`/`delete` key to |
| 145 | +> quickly toggle back to the Julia REPL. |
| 146 | +
|
| 147 | +```julia |
| 148 | +run(pipeline( |
| 149 | + `curl -L --retry 5 --retry-delay 2 "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511"`, |
| 150 | + `gzip -c`, |
| 151 | + "SRX392511.fastq.gz" |
| 152 | + ) |
| 153 | +) |
| 154 | +``` |
| 155 | +This file is gzipped, so we'll need to account for that as we are reading it in with `FASTX.jl`. |
| 156 | +In Julia, we can decompress gzipped files using `CodecZlib.jl`. |
| 157 | + |
| 158 | +Instead of printing out every record (this isn't super practical because it's a big file), let's save all the records into a vector. |
| 159 | + |
| 160 | +```julia |
| 161 | +using CodecZlib |
| 162 | + |
| 163 | +records = [] |
| 164 | + |
| 165 | +FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader |
| 166 | + for record in reader |
| 167 | + push!(records, record) |
| 168 | + end |
| 169 | + end |
| 170 | +``` |
| 171 | + |
| 172 | +We can see how many reads there are by looking at the length of the vector `records`. |
| 173 | + |
| 174 | +```julia |
| 175 | +julia> length(records) |
| 176 | +9852716 |
| 177 | +``` |
| 178 | + |
| 179 | +Let's take a look at what the first 10 reads look like: |
| 180 | + |
| 181 | +``` |
| 182 | +julia> records[1:10] |
| 183 | +10-element Vector{Any}: |
| 184 | + FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGGATTTTGTTATATTGTA…", "$$$$$$$$$$$$$$$$$$$…") |
| 185 | + FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGCATAAATTTAAAAAAAA…", "$$$$$$$$$$$$$$$$$$$…") |
| 186 | + FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAT…", "???????????????????…") |
| 187 | + FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAG…", "???????????????????…") |
| 188 | + FASTX.FASTQ.Record("SRX392511.3 3 length=101", "GAGCAGTAGTATAAAATGA…", "???????????????????…") |
| 189 | + FASTX.FASTQ.Record("SRX392511.3 3 length=101", "CCGACACAATTACAAGCCA…", "???????????????????…") |
| 190 | + FASTX.FASTQ.Record("SRX392511.4 4 length=101", "GATTATCTAGTCATAATTC…", "???????????????????…") |
| 191 | + FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…") |
| 192 | + FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…") |
| 193 | + FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…") |
| 194 | + ``` |
| 195 | + |
| 196 | + Let's calculate the average quality across all reads. |
| 197 | + We'll do this by converting each PHRED score to its error probability, |
| 198 | + and summing these values across all bases in all reads. |
| 199 | + Then, we can divide this sum by the total number of bases |
| 200 | + to get the average probability across the entire dataset. |
| 201 | + It's important that we first convert the PHRED scores to the probability errors before averaging, |
| 202 | +because PHRED is a logarithmic transformation of error probability. |
| 203 | +Therefore, simply averaging PHRED and then converting to error probability is not the same as converting PHRED to error probability and averaging that sum. |
| 204 | + |
| 205 | +```julia |
| 206 | +function phred_to_prob(Q) |
| 207 | + # The formula is Pe = 10^(-Q/10) |
| 208 | + return 10^(-Q / 10.0) |
| 209 | +end |
| 210 | + |
| 211 | +# get sum of all error probabilities |
| 212 | +total_error_prob = sum( |
| 213 | + sum(phred_to_prob(q) for q in FASTQ.quality_scores(record)) |
| 214 | + for record in records |
| 215 | +) |
| 216 | +4.5100356107423276e7 |
| 217 | +# get total number of bases |
| 218 | +total_bases = sum(length(FASTQ.quality_scores(record)) for record in records) |
| 219 | +995124316 |
| 220 | + |
| 221 | +# get average error probability |
| 222 | +total_error_prob/total_bases |
| 223 | +0.045321328584059316 |
| 224 | +``` |
| 225 | + |
| 226 | +The average error probability is just 4.53%, |
| 227 | +meaning the majority of reads are high quality. |
| 228 | + |
| 229 | +More information about converting ASCII characters and PHRED scores to quality scores can be found [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html). |
| 230 | + |
| 231 | + Now that we've learned how to read files in and manipulate them a bit, |
| 232 | +let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome. |
| 233 | +This will tell us if this particular _S. aureus_ is MRSA. |
| 234 | + |
| 235 | + |
0 commit comments