|
| 1 | ++++ |
| 2 | +title = "Sequence Input/Output" |
| 3 | +rss_descr = "Reading in fasta files using FASTX.jl" |
| 4 | ++++ |
| 5 | + |
| 6 | +# Sequence Input/Output |
| 7 | + |
| 8 | +In this chapter, we'll read sequence files using the `FASTX.jl` package. |
| 9 | +More information about `FASTX.jl` can be found at https://biojulia.dev/FASTX.jl/stable/ |
| 10 | +and in the package's built-in documentation. |
| 11 | + |
| 12 | +```julia |
| 13 | +julia> using FASTX |
| 14 | +julia> ?FASTX |
| 15 | +``` |
| 16 | + |
| 17 | +If FASTX is not already in your environment, |
| 18 | +it can be added from the Julia General registry using the package manager. |
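
For instance, assuming a standard Julia installation with network access, the package can be installed through the `Pkg` standard library:

```julia
using Pkg

# Add FASTX.jl to the currently active environment.
Pkg.add("FASTX")
```

The same can be done interactively by pressing `]` at the REPL to enter package mode and typing `add FASTX`.
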
| 19 | + |
| 20 | +To demonstrate how to use this package, |
| 21 | +let's try to read in some real-world data! |
| 22 | + |
| 23 | +`FASTX.jl` can read three file types: FASTA, FASTQ, and FAI. |
| 24 | + |
| 25 | +### FASTA files |
| 26 | +FASTA files are text files containing biological sequence data. |
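| 27 | +Each record has three parts: an identifier (the name), a description, and the sequence. The identifier is the header text up to the first whitespace. In the template below, `{description}` stands for the whole header line, which begins with the identifier. |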
| 28 | + |
| 29 | +The template of a sequence record is: |
| 30 | +``` |
| 31 | +>{description} |
| 32 | +{sequence} |
| 33 | +``` |
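
To make the header layout concrete, here is a minimal sketch in plain Julia (no FASTX.jl calls) that splits a header line into an identifier and a description. `split_header` is a hypothetical helper written for this illustration. It assumes the convention visible in the example output later in this chapter: the identifier is the text up to the first whitespace, and the description is the whole header line.

```julia
# Hypothetical helper for illustration; not part of FASTX.jl.
# Assumes the identifier ends at the first whitespace and the
# description is the full header text after the leading '>'.
function split_header(line::AbstractString)
    if !startswith(line, ">")
        throw(ArgumentError("a FASTA header must start with '>'"))
    end
    description = String(strip(line[2:end]))
    words = split(description; limit = 2)
    identifier = isempty(words) ? "" : String(first(words))
    return (identifier = identifier, description = description)
end

header = split_header(">seq1 example sequence")
println(header.identifier)   # seq1
println(header.description)  # seq1 example sequence
```
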
| 34 | + |
| 35 | +### FASTQ files |
| 36 | +FASTQ files are also text files that contain sequences, each with a name and description. In addition, they store quality information for the sequence (the Q is for quality!). |
| 37 | + |
| 38 | +The template of a sequence record is: |
| 39 | +``` |
| 40 | +@{description} |
| 41 | +{sequence} |
| 42 | ++{description}? |
| 43 | +{qualities} |
| 44 | +``` |
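
As a rough illustration of the four-line layout, the sketch below splits one made-up record by hand and converts its quality string to numeric scores. It assumes the common Phred+33 encoding (score = ASCII code minus 33), which the text above does not specify. `phred_scores` and the record itself are hypothetical, and no FASTX.jl functions are used.

```julia
# Hypothetical helper: decode a quality string, assuming Phred+33
# (each score is the character's ASCII code minus the offset).
phred_scores(qualities::AbstractString; offset::Integer = 33) =
    [Int(c) - offset for c in qualities]

# A made-up record following the template: header, sequence, '+', qualities.
record = "@read1 example read\nACGT\n+\n!5I?"
header, seq, separator, qualities = split(record, '\n')

println(header[2:end])            # header text without the leading '@'
println(seq)                      # ACGT
println(phred_scores(qualities))  # [0, 20, 40, 30]
```

The quality string always has the same length as the sequence, so each score lines up with one base.
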
| 45 | + |
| 46 | +### FAI files |
| 47 | + |
| 48 | +FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files. |
| 49 | +They are text files with TAB-delimited columns. |
| 50 | +Each line describes one sequence in the FASTA/FASTQ file. |
| 51 | +More information about FAI index files can be found in the [faidx documentation](https://www.htslib.org/doc/faidx.html). |
| 52 | + |
| 53 | +``` |
| 54 | +NAME Name of this reference sequence |
| 55 | +LENGTH Total length of this reference sequence, in bases |
| 56 | +OFFSET Offset in the FASTA/FASTQ file of this sequence's first base |
| 57 | +LINEBASES The number of bases on each line |
| 58 | +LINEWIDTH The number of bytes in each line, including the newline |
| 59 | +QUALOFFSET Offset of sequence's first quality within the FASTQ file |
| 60 | + |
| 61 | +``` |
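
To show how these columns fit together, here is a small sketch in plain Julia that parses one index line and computes where a given base sits in the file. `FaiEntry`, `parse_fai_line`, `byte_offset`, and the example line are hypothetical names and data made up for this illustration, not FASTX.jl API. The arithmetic assumes every sequence line except the last holds exactly LINEBASES bases.

```julia
# Hypothetical container for the five FASTA columns of one index line.
struct FaiEntry
    name::String
    seqlength::Int
    offset::Int
    linebases::Int
    linewidth::Int
end

# Parse one TAB-delimited line (a sixth QUALOFFSET column, if present, is ignored).
function parse_fai_line(line::AbstractString)
    fields = split(chomp(line), '\t')
    if length(fields) < 5
        throw(ArgumentError("expected at least 5 TAB-separated columns"))
    end
    numbers = parse.(Int, fields[2:5])
    return FaiEntry(String(fields[1]), numbers...)
end

# Byte offset in the file of the i-th base (1-based) of the sequence.
function byte_offset(entry::FaiEntry, i::Integer)
    if i < 1 || i > entry.seqlength
        throw(ArgumentError("base index $i is outside 1:$(entry.seqlength)"))
    end
    full_lines, column = divrem(i - 1, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + column
end

entry = parse_fai_line("chr1\t100\t6\t60\t61")
println(byte_offset(entry, 1))   # 6
println(byte_offset(entry, 61))  # 67
```

This lookup is what makes an index useful: a reader can seek straight to a region instead of scanning the whole file.
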
| 62 | + |
| 63 | + |
| 64 | + |
| 65 | + |
| 66 | + |
| 67 | +We will read a FASTA file containing the _mecA_ gene. |
| 68 | +The sequence was downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=3920764#). |
| 69 | + |
| 70 | +First we'll open the file, |
| 71 | +then we'll iterate over every record in the file and |
| 72 | +print out the sequence identifier, the sequence description and then the corresponding sequence. |
| 73 | + |
| 74 | +```julia |
| 75 | +julia> FASTAReader(open("assets/mecA.fasta")) do reader |
| 76 | +           for record in reader |
| 77 | +               println(identifier(record)) |
| 78 | +               println(description(record)) |
| 79 | +               println(sequence(record)) |
| 80 | +           end |
| 81 | +       end |
| 82 | + |
| 83 | +NC_007795.1:907598-908317 |
| 84 | +NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=] |
| 85 | +ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTGGATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGAAATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGTGTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATCAATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATTCGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACAAGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAACTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCATGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACAGAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATTTTCCAGAGACAACTGAATAA |
| 86 | +``` |
| 87 | + |
| 88 | +In this case, there is only one sequence. |
| 89 | + |
| 90 | +Let's try reading a larger FASTQ file. |
| 91 | + |
| 92 | +```julia |
| 93 | + |
| 94 | + |
| 95 | + |
| 96 | +``` |
| 97 | + |
| 98 | + |
| 99 | + |