Skip to content

Commit d20b9bd

Browse files
fix grammar and typos
1 parent adc2a1d commit d20b9bd

File tree

1 file changed

+49
-36
lines changed

1 file changed

+49
-36
lines changed

cookbook/02-alignments.md

Lines changed: 49 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -4,24 +4,29 @@ rss_descr = "Align a gene against a reference genome using BioAlignments.jl"
44
+++
55

66
As mentioned in the previous [tutorial]("../01-sequences.md"), in this chapter, we will learn about alignments.
7-
We will explore pair-wise alignment as a tool to compare two copies of the _mecA_ gene found on NCBI.
7+
We will explore pairwise alignment as a tool to compare two copies of the _mecA_ gene found on NCBI.
88

99
# Pairwise Alignment
1010

11-
On the most basic level, aligners take two sequences and use algorithms to try to "line them up"
11+
On the most basic level, aligners use algorithms to "line up" sequences
1212
and look for regions of similarity.
1313

14+
BioAlignments implements only pairwise alignment.
1415
Pairwise alignment differs from multiple sequence alignment (MSA) because
15-
it only aligns two sequences, while MSAs align three or more.
16+
it only aligns two sequences, while MSAs align any number of sequences.
1617

17-
### Running the Alignment
18-
There are two main parameters for determining how we want to perform our alignment:
19-
the alignment type and score/cost model.
18+
Pairwise alignment also assumes that the two sequences are roughly homologous.
19+
For example, you may use it to align two versions of the same gene.
20+
It is not used to map reads to a genome -- mapping would be a better solution for that.
21+
22+
# Running the Alignment
23+
There are two main parameters for determining how we want to perform our alignment:
24+
alignment type and score/cost model.
2025

2126
The alignment type specifies the alignment range (local vs global alignment)
22-
and the score/cost model explains how to score insertions and deletions.
27+
and the score/cost model explains how to score matches/mismatches in the sequences that are being compared.
2328

24-
#### Alignment Types
29+
### Alignment Types
2530
Currently, four types of alignments are supported:
2631
- `GlobalAlignment`: global-to-global alignment
2732
- Aligns sequences end-to-end
@@ -66,7 +71,7 @@ The cost model provides a way to calculate penalties for differences between the
6671
and then finds the alignment that minimizes the total penalty.
6772
`AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`.
6873
It imposes an affine gap penalty for insertions and deletions,
69-
which means that it penalizes the opening of a gap more than a gap extending.
74+
which means that it penalizes the opening of a gap more than a gap extension.
7075
Deletions are rare mutations, but if there's a deletion, the length of the deletion is variable.
7176
Longer deletions are less likely than short ones only because they change the structure of the encoded protein more.
7277

@@ -80,29 +85,29 @@ These distance metrics are currently supported:
8085
- `LevenshteinDistance`
8186
- `HammingDistance`
8287

83-
This is a complicated topic, and more information can be found in the BioAlignments documentation about the cost model [here](https://biojulia.dev/BioAlignments.jl/stable/pairalign/).
88+
More information can be found in the BioAlignments documentation about the cost model [here](https://biojulia.dev/BioAlignments.jl/stable/pairalign/).
8489

8590
Just like alignment type, the cost model should be selected based on what the user is optimizing for
8691
and what is known about the two sequences.
8792

8893

89-
### Calling BioAlignments to Run the Alignment
94+
## Calling BioAlignments to Run the Alignment
9095

9196
Now that we have a good understanding of how `pairalign` works, let's run an example!
9297

93-
In this example, we'll compare two similar genes: mecA found in _S. aureus_ ([link to gene on NCBI here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1)), and a homologue, mecA1, found on a _S. sciuri_ ([link to gene on NCBI here](https://www.ncbi.nlm.nih.gov/gene/?term=PBP2a+family+beta-lactam-resistant+peptidoglycan+transpeptidase+MecA1)).
94-
The two Staphs are closely related species in the same _Staphylococcaceae_ family.
98+
In this example, we'll compare two similar genes: _mecA_ found in _S. aureus_ ([link to gene on NCBI here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1)), and a homologue, _mecA1_, found in _S. sciuri_ ([link to gene on NCBI here](https://www.ncbi.nlm.nih.gov/gene/?term=PBP2a+family+beta-lactam-resistant+peptidoglycan+transpeptidase+MecA1)).
99+
These genes are found in two Staph species that are closely related and in the same family (_Staphylococcaceae_).
95100

96101
Because we are comparing homologous genes from two closely related species, we wouldn't expect too many differences.
97-
Although mecA1 doesn't confer resistance to beta-lactams in _S. sciuri_ like _mecA_ does to _S. aureus_,
102+
Although _mecA1_ doesn't confer resistance to beta-lactams in _S. sciuri_ like _mecA_ does to _S. aureus_,
98103
the gene should be mostly conserved.
99-
In fact, mecA1 is considered a pre-cursor to mecA.
100-
Research indicates that there is 80% nucleotide identity between the two genes.[1].
104+
In fact, _mecA1_ is considered a precursor to _mecA_ [1].
105+
Research indicates that there is 80% nucleotide identity between the two genes [1].
101106
Due to the similarity in the genes we are comparing, it makes the most sense to run a global alignment.
102107

103108
In this first example, we'll align two strings that contain the genes.
104109

105-
#### Running Alignment on BioSequences Object
110+
## Running Alignment on BioSequences Object
106111

107112
```julia
108113
using BioAlignments
@@ -112,36 +117,39 @@ mecA =
112117
mecA1 = "ATGAAAAAATTAATCATCGCCATCGTGATTGTAATCATCGCTGTTGGTTCAGGCGTATTCTTTTATGCATCTAAAGATAAGAAAATAAACGAAACAATTGATGCCATTGAAGATAAAAACGTTAAGCAAGTCTTTAAAAATAGTACTTACCAATCTAAAAACGATAATGGTGAAGTAGAAATGACAGACCGCCCTATTAAGATTTATGACAGTCTCGGCGTCAAAGATATCAACATTAAAGATCGTGATATCAAAAAGGTTTCGAAAAACAAAAAACAAGTCACAGCAAAGTATGAACTTCAAACGAATTACGGCAAAATTAATCGTGACGTTAAATTAAACTTTATTAAAGAAGATAAAGATTGGAAATTGGATTGGAATCAAAATGCCATTATTCCAGGCATGAAGAAAAATCAATCCATCAATATTGAACCATTGAAATCAGAACGAGGTAAGATTTTAGACAGGAACAATGTAGAGTTAGCCACTACAGGAACAACACATGAAGTTGGTATTGTTCCTAATAATGTTTCCACAAGTGATTACAAAGCAATCGCTGAAAAGTTAGACCTTTCAGAATCGTATATTAAACAGCAAACAGAACAGGATTGGGTTAAAGATGATACATTCGTCCCTCTCAAGACTGTTCAAGATATGAATCAAGATTTAAAGAATTTTGTTGAAAAGTATCATCTCACATCACAGGAAACAGAAAGTCGACAGTATCCGCTTGAAGAAGCAACAACGCACTTACTTGGATATGTTGGCCCTATTAATTCAGAAGAATTGAAGCAAAAAGCATTTAAAGGTTATAAAAAGGATGCCATCGTTGGTAAAAAAGGTATCGAAAAACTATACGATAAAGACCTTCAAAATAAAGACGGATACCGTGTCACAATAATTGATGATAATAATAAAGTTATTGATACATTAATAGAGAAAAAGAAAATAGACGGCAAAGATATTAAATTAACCATTGATGCTAGAGTCCAAAAAAGTATTTATAACAACATGAAAGATGACTACGGTTCGGGGACTGCTATTCATCCACAAACTGGTGAACTCTTAGCACTTGTCAGCACGCCATCTTATGATGTTTATCCATTTATGAATGGAATGAGCGATGAAGATTATAAGAAATTAACTGAAGATGATAAAGAGCCACTCCTTAATAAGTTCCAAATTACGACATCACCAGGTTCGACTCAAAAAATATTAACAGCCATGATTGGCTTAAACAATAAGACATTAGACGGCAAAACAAGTTATAAAATTAATGGAAAAGGTTGGCAAAAAGATAAATCTTGGGGTGACTACAACGTTACAAGATACGAAGTTGTGAATGCCGATATCGACTTAAAACAAGCTATTGAATCATCAGATAATATCTTCTTTGCGAGAGTTGCACTTGAATTAGGCAGCAAAAAATTCGAAGAAGGAATGAAACGCCTTGGTGTTGGTGAAGATATCCCGAGTGATTATCCATTCTACAATGCACAAATTTCAAATAAGAACTTAGATAATGAAATATTGTTAGCTGACTCAGGTTATGGCCAAGGTGAAATATTAATCAATCCTGTTCAAATTCTTTCAATATACAGCGCATTAGAGAACAAAGGTAATGTGAATGCACCACATGTACTCAAAGATACGAAAAATAAAGTCTGGAAGAAGAACATCATTTCCCAGGAAAATATTAAATTGTTAACAGACGGTATGCAACAAGTCGTGAACAAAACACATAGAGAAGATATTTATAGATCATATGCCAACTTAGTTGGTAAATCAGGTACAGCTGAACTCAAGATGAAACAAGGTGAGACAGGACAACAAATAGGTTGGTTCATTTCATATGATAAAGATAATCCAAATATAATGATGGCTATTAATGTGAAAGATGTACAAGATAAAGGCATGGCAAGTTACAATGCCAAAATATCTGGAAAAGTGTATGACGATTTATATGATAACGGTAAGAAAACGTATCGTATTGATAAATAA"
113118
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);
114119

120+
# run pairwise alignment
115121
res = pairalign(GlobalAlignment(), mecA, mecA1, scoremodel)
116-
# run pairwise alignment
117122
```
118123

119124

120-
#### Running Alignment on FASTX files
125+
## Running Alignment on FASTX files
121126
In this next example, we'll repeat the same alignment,
122127
but read in the files directly from the FASTA files containing the gene.
123-
Running the alignment on strings is straightforward with short sequences, but when comparing entire genes, simply reading in the file is easier.
128+
Running the alignment on strings is straightforward with short sequences,
129+
but when comparing entire genes, simply reading in the file is easier.
124130
```julia
125131
using BioSequences
126132
using FASTX
127133

128-
# Write a function to get sequence out of a fasta file with 1 record
134+
# Function to get a sequence from a FASTA file with one record
129135
function fasta_sequence(fasta_path)
130136
record = open(FASTA.Reader, fasta_path) do reader
131137
first(reader)
132138
end
139+
# extract sequence and convert to BioSequences DNA object
133140
seq = LongDNA{4}(String(FASTX.sequence(record)))
134141
return (seq)
135142
end
136143

137144
mecA_fasta = fasta_sequence("assets/mecA.fasta")
138145
mecA1_fasta = fasta_sequence("assets/mecA1.fasta")
139146

140-
res_fasta = pairalign(GlobalAlignment(), mecA_fasta, mecA1_fasta, scoremodel) # run pairwise alignment
147+
# run pairwise alignment
148+
res_fasta = pairalign(GlobalAlignment(), mecA_fasta, mecA1_fasta, scoremodel)
141149
```
142150

143151

144-
### Understanding how Alignments are Represented
152+
### Understanding How Alignments Are Represented
145153
The output of an alignment is a series of `AlignmentAnchor` objects.
146154
This data structure gives information on the position of the start of the alignment,
147155
sections where nucleotides match, as well as where there may be deletions or insertions.
@@ -157,17 +165,19 @@ julia> Alignment([
157165
In this example, the alignment starts at position 0 for the query sequence and position 4 for the reference sequence.
158166
Although the Julia programming language typically uses 1-based indexing,
159167
this package uses position 0 to refer to the first position.
160-
The next nucleotides are a match in the query and reference sequences.
168+
The next nucleotides match in the query and reference sequences.
161169
The last 8 nucleotides in the alignment are deleted in the query sequence.
162170

163171
To learn more about the output of the alignment created using BioAlignments.jl, see [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/).
164172

165173

166-
#### Interpreting the Example Output
174+
## Interpreting the Example Output
175+
176+
Here is the output from our example aligning the _mecA_ and _mecA1_ genes:
177+
167178
```
168-
# run pairwise alignment
179+
res
169180
170-
res
171181
PairwiseAlignmentResult{Int64, String, String}:
172182
score: 6375
173183
seq: 1 ATGAAAAAGATAAAA-ATTGTTCCA-CTT-ATTTTAAT-A-----GTTGTAGTTGTCGGG 51
@@ -319,19 +329,22 @@ PairwiseAlignmentResult{Int64, String, String}:
319329
ref: 2001 -- 2001
320330
```
321331

322-
The score returned is entirely dependent on the scoring scheme
323-
(how we penalized gaps, gap extensions and rewarded matches).
332+
The alignment score is entirely dependent on the scoring scheme
333+
(how we penalized gaps, gap extensions and rewarded matches).
324334
It is not an absolute number that we can compare from alignment to alignment.
325-
In our example, our score was influenced by -5 for the start of a gap, and -1 for a gap extension.
326-
If these values were to change, we would get a different score.
327-
However, generally, longer alignments produce larger scores (as there are more bases being compared).
335+
In our example, our score was influenced by a -5 penalty for the start of a gap, and a -1 penalty for a gap extension.
336+
If these values were to change in our cost model,
337+
this would affect the final score, even if the sequences were the same.
338+
However, longer alignments generally produce larger scores
339+
(as there are more bases being compared).
328340

329341
Overall, the two sequences are homologous over most of their length.
330-
There are many matches, but there are frequent small indels and substitutions.
331-
The biggest mismatch is in a section toward the end,
332-
where there are large stretches that are missing in the reference sequence (mecA1).
342+
There are many matches, but also frequent small indels and substitutions.
343+
The biggest mismatch is in a section toward the end
344+
(from base 1980 in the reference onwards),
345+
where there are large stretches that are missing in the reference sequence (_mecA1_).
333346

334347
### Citations
335348

336-
[1]: Rolo J, Worning P, Boye Nielsen J, Sobral R, Bowden R, Bouchami O, Damborg P, Guardabassi L, Perreten V, Westh H, Tomasz A, de Lencastre H, Miragaia M. Evidence for the evolutionary steps leading to mecA-mediated β-lactam resistance in staphylococci. PLoS Genet. 2017 Apr 10;13(4):e1006674. doi: 10.1371/journal.pgen.1006674. PMID: 28394942; PMCID: PMC5402963. [link to the source](10.1371/journal.pgen.1006674=).
349+
[1]: Rolo J, Worning P, Boye Nielsen J, Sobral R, Bowden R, Bouchami O, Damborg P, Guardabassi L, Perreten V, Westh H, Tomasz A, de Lencastre H, Miragaia M. Evidence for the evolutionary steps leading to mecA-mediated β-lactam resistance in staphylococci. PLoS Genet. 2017 Apr 10;13(4):e1006674. doi: 10.1371/journal.pgen.1006674. PMID: 28394942; PMCID: PMC5402963. [link to the source](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006674).
337350

0 commit comments

Comments
 (0)