Skip to content

Commit b030ede

Browse files
Mendel's Genetics (#16)
* initial commit * rough draft of first solution * fix problem name * adding semantic line breaks * add note about downsides of algorithm approach * add statistical approach * implement Kevin's minor changes * fix typos * make edits according to Kevin's suggestions * add Project.toml back
1 parent bfc070e commit b030ede

File tree

1 file changed

+215
-0
lines changed

1 file changed

+215
-0
lines changed

docs/src/rosalind/07-iprb.md

Lines changed: 215 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,215 @@
1+
# Mendel's First Law
2+
3+
🤔 [Problem link](https://rosalind.info/problems/iprb/)
4+
5+
!!! warning "The Problem"
6+
7+
Probability is the mathematical study of randomly occurring phenomena.
8+
We will model such a phenomenon with a random variable,
9+
which is simply a variable that can take a number of different distinct outcomes
10+
depending on the result of an underlying random process.
11+
12+
For example, say that we have a bag containing 3 red balls and 2 blue balls.
13+
If we let X represent the random variable corresponding to the color of a drawn ball,
14+
then the probability of each of the two outcomes is given by Pr(X=red)=35 and Pr(X=blue)=25.
15+
16+
Random variables can be combined to yield new random variables.
17+
Returning to the ball example, let Y model the color of a second ball drawn from the bag (without replacing the first ball).
18+
The probability of Y being red depends on whether the first ball was red or blue.
19+
20+
To represent all outcomes of X and Y, we therefore use a probability tree diagram.
21+
This branching diagram represents all possible individual probabilities for X and Y,
22+
with outcomes at the endpoints ("leaves") of the tree.
23+
The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree.
24+
25+
An event is simply a collection of outcomes.
26+
Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes.
27+
28+
For our colored ball example, let A be the event "Y is blue."
29+
Pr(A) is equal to the sum of the probabilities of two different outcomes:
30+
Pr(X=blue and Y=blue)+Pr(X=red and Y=blue), or 310+110=25.
31+
32+
33+
34+
Given:
35+
36+
Three positive integers k, m, and n,
37+
representing a population containing k+m+n organisms:
38+
k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.
39+
40+
Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype).
41+
42+
Assume that any two organisms can mate.
43+
44+
We will show two ways we can solve this problem: deriving an algorithm or using a statistical weighted probability approach.
45+
46+
### Deriving an Algorithm
47+
48+
Using the information above, we can derive an algorithm using the variables k, m, and n that will calculate the probability of a progeny possessing a dominant allele.
49+
50+
We could calculate the probability of a progeny having a dominant allele,
51+
but in this case, it is easier to calculate the likelihood of a progeny having the recessive phenotype.
52+
This is a relatively rarer event, and the calculation will be less complicated.
53+
We just have to subtract this probability from 1 to get the overall likelihood of having a progeny with a dominant trait.
54+
55+
To demonstrate how to derive this algorithm, we can use H and h to signify dominant and recessive alleles, respectively.
56+
Out of all the possible combinations, we will only get a progeny with a recessive trait in three situations: Hh x Hh, Hh x hh, and hh x hh.
57+
For all of these situations, we must calculate the probability of these mating combinations occurring (based on k, m, and n),
58+
as well as the probability of these events leading to a progeny with a recessive trait.
59+
60+
First, we must calculate the probability of picking the first and second mate.
61+
For the combination Hh x Hh, this is $\frac{m}{(k+m+n)}$ multiplied by $\frac{(m-1)}{(k+m+n-1)}$.
62+
63+
Selecting the second Hh individual is equal to the number of Hh individuals left after 1 was already picked (m-1),
64+
divided by the total individuals left in the population (k+m+n-1).
65+
A similar calculation is performed for the rest of the combinations.
66+
67+
It is important to note that the probability of selecting Hh x hh as a mating pair is $\frac{2*m*n}{(k+m+n)(k+m+n-1)}$,
68+
as there are two ways to choose this combination.
69+
Hh x hh can be selected (where Hh is picked first), as well as hh x Hh. Order matters!
70+
71+
| Probability of combination occurring | Hh x Hh | Hh x hh | hh x hh |
72+
| --- |---|---|---|
73+
| | $\frac{m(m-1)}{(k+m+n)(k+m+n-1)}$ | $\frac{2*m*n}{(k+m+n)(k+m+n-1)}$| $\frac{n(n-1)}{(k+m+n)(k+m+n-1)}$|
74+
75+
<br>
76+
<br>
77+
78+
The probability of these combinations leading to a recessive trait can be calculated using Punnet Squares.
79+
80+
| Probability of recessive trait | Hh x Hh | Hh x hh | hh x hh |
81+
| --- |---|---|---|
82+
| | 0.25 | 0.50 | 1 |
83+
84+
<br>
85+
<br>
86+
87+
88+
Now, we just have to sum the probability of each combination occurring by the probability of this combination leading to a recessive trait.
89+
90+
This leads to the following formula:
91+
92+
Pr(recessive trait) =
93+
$\frac{m(m-1)}{(k+m+n)(k+m+n-1)}$ x 0.25 + $\frac{m*n}{(k+m+n)(k+m+n-1)}$ + $\frac{n(n-1)}{(k+m+n)(k+m+n-1)}$
94+
95+
Therefore, the probability of selecting an individual with a *dominant* trait is 1 - Pr(recessive trait).
96+
97+
Now that we've derived this formula, let's turn this into code!
98+
99+
```julia
100+
function mendel(k,m,n)
101+
102+
# denominator of the above fractions describing probability of different matches
103+
total = (k+m+n)*(k+m+n-1)
104+
return 1-(
105+
(0.25*m*(m-1))/total +
106+
m*n/total +
107+
n*(n-1)/total)
108+
end
109+
110+
mendel(2,2,2)
111+
```
112+
113+
Deriving and using this algorithm works.
114+
115+
However, it is also narrowly tailored to a specific problem.
116+
117+
What happens if we want to solve a more complicated problem or if there are additional requirements tacked on?
118+
119+
For example, what if we wanted to solve a question like "What's the probability of a heterozygous offspring?"
120+
121+
We would need to derive another algorithm for this similar, yet slightly different problem.
122+
123+
Algorithms work in certain cases, but also don't scale up if we add more constraints.
124+
125+
Another approach would be to use a statistics-based solution.
126+
127+
For instance, we can use a simulation that can broadly calculate the likelihood of a given offspring based on a set of given probabilities.
128+
129+
This solution is generic and can be used to ask more types of questions.
130+
131+
132+
### Simulation Method
133+
134+
For this method, we will make a fake population that follows the given parameters k, m, and n.
135+
136+
Specifically, we can make a vector of 1's, 2's, and 3's, representing the HH, Hh, and hh genotypes, respectively.
137+
138+
In this vector, there will be k 1's, m 2's, and n 3's.
139+
140+
Next, we'll make another vector that stores the probabilities of there being a dominant phenotype given the parental genotypes.
141+
142+
This is calculated using Punnett Squares.
143+
144+
For example, if HH mates with either [HH, Hh, hh], the probability of a dominant phenotype is 100%, leading to a vector [1, 1, 1].
145+
146+
Now that these vectors have been created, we can begin the simulation.
147+
148+
First, we will sample from the population to approximate the ratio of dominant phenotypes.
149+
150+
For each iteration, we will randomly pick two mates from the population.
151+
152+
For example, 2 (Hh) and 3 (hh) is picked.
153+
154+
This will lead to a probability of a dominant allele = 0.5.
155+
156+
All of the probabilities will be accumulated throughout all of the simulations.
157+
158+
At the end of the simulation, we can divide the sum of the probabilities by the total number of simulations.
159+
160+
This will get us the approximated number of individuals with a dominant phenotype.
161+
162+
This method is unlikely to return exactly the same answer as the algorithm approach.
163+
164+
Sampling is random, so we will get slightly different results each time we run the simulation (unless we set a seed).
165+
166+
However, both methods will be very similar.
167+
168+
The standard error for the estimate decreases as the number of simulations gets very large.
169+
170+
The larger the number of iterations, the more likely that the final approximation will be similar both between simulations, as well as to the answer from the algorithm.
171+
172+
It is important to keep in mind that both the algorithm and statistical sampling approaches only provide approximations, as there will definitely be some unaccounted variation in a true biological population!
173+
174+
```julia
175+
using StatsBase
176+
177+
# Probability of dominant offspring given parent genotypes
178+
# Index: offspring_prob[parent1, parent2]
179+
# Genotypes: 1=HH, 2=Hh, 3=hh
180+
181+
ex_offspring_prob = [
182+
1.0 1.0 1.0; # HH × (HH, Hh, hh)
183+
1.0 0.75 0.5; # Hh × (HH, Hh, hh)
184+
1.0 0.5 0.0 # hh × (HH, Hh, hh)
185+
]
186+
187+
function mendel_sim(k, m, n, offspring_prob; iterations=100000)
188+
# Genotypes: 1=HH, 2=Hh, 3=hh
189+
population = [fill(1, k); fill(2, m); fill(3, n)]
190+
191+
total_pop = k+m+n
192+
wts = [k/total_pop, m/total_pop, n/total_pop]
193+
194+
# samples two mates from the vector [1,2,3] with probability weights given by wts
195+
196+
# then sum the probability of each offspring having a dominant phenotype
197+
# sum across all simulations
198+
sum(1:iterations) do _
199+
(i,j) = sample([1,2,3], weights(wts), 2)
200+
offspring_prob[i,j]
201+
end / iterations
202+
end
203+
204+
mendel_sim(2, 2, 2, ex_offspring_prob)
205+
```
206+
207+
In the function above, the user provides the parameter `offspring_prob`.
208+
If the user wanted to answer a slightly different question with different probability weights,
209+
all that would be needed is a different input vector.
210+
This allows the user to solve a wider variety of questions.
211+
212+
However, this function does assume that there are only 3 phenotypes, which limits the situations it can be applied towards.
213+
214+
This solution returns a value closer to 0.75,
215+
while the first one returns a value close to 0.783.

0 commit comments

Comments
 (0)