|
| 1 | +# Mendel's First Law |
| 2 | + |
| 3 | +🤔 [Problem link](https://rosalind.info/problems/iprb/) |
| 4 | + |
| 5 | +!!! warning "The Problem" |
| 6 | + |
| 7 | + Probability is the mathematical study of randomly occurring phenomena. |
| 8 | + We will model such a phenomenon with a random variable, |
| 9 | + which is simply a variable that can take a number of different distinct outcomes |
| 10 | + depending on the result of an underlying random process. |
| 11 | + |
| 12 | + For example, say that we have a bag containing 3 red balls and 2 blue balls. |
| 13 | + If we let X represent the random variable corresponding to the color of a drawn ball, |
| 14 | + then the probability of each of the two outcomes is given by Pr(X=red)=35 and Pr(X=blue)=25. |
| 15 | + |
| 16 | + Random variables can be combined to yield new random variables. |
| 17 | + Returning to the ball example, let Y model the color of a second ball drawn from the bag (without replacing the first ball). |
| 18 | + The probability of Y being red depends on whether the first ball was red or blue. |
| 19 | + |
| 20 | + To represent all outcomes of X and Y, we therefore use a probability tree diagram. |
| 21 | + This branching diagram represents all possible individual probabilities for X and Y, |
| 22 | + with outcomes at the endpoints ("leaves") of the tree. |
| 23 | + The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree. |
| 24 | + |
| 25 | + An event is simply a collection of outcomes. |
| 26 | + Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. |
| 27 | + |
| 28 | + For our colored ball example, let A be the event "Y is blue." |
| 29 | + Pr(A) is equal to the sum of the probabilities of two different outcomes: |
| 30 | + Pr(X=blue and Y=blue)+Pr(X=red and Y=blue), or 310+110=25. |
| 31 | + |
| 32 | + |
| 33 | + |
| 34 | + Given: |
| 35 | + |
| 36 | + Three positive integers k, m, and n, |
| 37 | + representing a population containing k+m+n organisms: |
| 38 | + k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive. |
| 39 | + |
| 40 | + Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). |
| 41 | + |
| 42 | + Assume that any two organisms can mate. |
| 43 | + |
| 44 | +We will show two ways we can solve this problem: deriving an algorithm or using a statistical weighted probability approach. |
| 45 | + |
| 46 | +### Deriving an Algorithm |
| 47 | + |
| 48 | +Using the information above, we can derive an algorithm using the variables k, m, and n that will calculate the probability of a progeny possessing a dominant allele. |
| 49 | + |
| 50 | +We could calculate the probability of a progeny having a dominant allele, |
| 51 | +but in this case, it is easier to calculate the likelihood of a progeny having the recessive phenotype. |
| 52 | +This is a relatively rarer event, and the calculation will be less complicated. |
| 53 | +We just have to subtract this probability from 1 to get the overall likelihood of having a progeny with a dominant trait. |
| 54 | + |
| 55 | +To demonstrate how to derive this algorithm, we can use H and h to signify dominant and recessive alleles, respectively. |
| 56 | +Out of all the possible combinations, we will only get a progeny with a recessive trait in three situations: Hh x Hh, Hh x hh, and hh x hh. |
| 57 | +For all of these situations, we must calculate the probability of these mating combinations occurring (based on k, m, and n), |
| 58 | +as well as the probability of these events leading to a progeny with a recessive trait. |
| 59 | + |
| 60 | +First, we must calculate the probability of picking the first and second mate. |
| 61 | +For the combination Hh x Hh, this is $\frac{m}{(k+m+n)}$ multiplied by $\frac{(m-1)}{(k+m+n-1)}$. |
| 62 | + |
| 63 | +Selecting the second Hh individual is equal to the number of Hh individuals left after 1 was already picked (m-1), |
| 64 | +divided by the total individuals left in the population (k+m+n-1). |
| 65 | +A similar calculation is performed for the rest of the combinations. |
| 66 | + |
| 67 | +It is important to note that the probability of selecting Hh x hh as a mating pair is $\frac{2*m*n}{(k+m+n)(k+m+n-1)}$, |
| 68 | +as there are two ways to choose this combination. |
| 69 | +Hh x hh can be selected (where Hh is picked first), as well as hh x Hh. Order matters! |
| 70 | + |
| 71 | +| Probability of combination occurring | Hh x Hh | Hh x hh | hh x hh | |
| 72 | +| --- |---|---|---| |
| 73 | +| | $\frac{m(m-1)}{(k+m+n)(k+m+n-1)}$ | $\frac{2*m*n}{(k+m+n)(k+m+n-1)}$| $\frac{n(n-1)}{(k+m+n)(k+m+n-1)}$| |
| 74 | + |
| 75 | +<br> |
| 76 | +<br> |
| 77 | + |
| 78 | +The probability of these combinations leading to a recessive trait can be calculated using Punnet Squares. |
| 79 | + |
| 80 | +| Probability of recessive trait | Hh x Hh | Hh x hh | hh x hh | |
| 81 | +| --- |---|---|---| |
| 82 | +| | 0.25 | 0.50 | 1 | |
| 83 | + |
| 84 | +<br> |
| 85 | +<br> |
| 86 | + |
| 87 | + |
| 88 | +Now, we just have to sum the probability of each combination occurring by the probability of this combination leading to a recessive trait. |
| 89 | + |
| 90 | +This leads to the following formula: |
| 91 | + |
| 92 | +Pr(recessive trait) = |
| 93 | +$\frac{m(m-1)}{(k+m+n)(k+m+n-1)}$ x 0.25 + $\frac{m*n}{(k+m+n)(k+m+n-1)}$ + $\frac{n(n-1)}{(k+m+n)(k+m+n-1)}$ |
| 94 | + |
| 95 | +Therefore, the probability of selecting an individual with a *dominant* trait is 1 - Pr(recessive trait). |
| 96 | + |
| 97 | +Now that we've derived this formula, let's turn this into code! |
| 98 | + |
| 99 | +```julia |
| 100 | +function mendel(k,m,n) |
| 101 | + |
| 102 | + # denominator of the above fractions describing probability of different matches |
| 103 | + total = (k+m+n)*(k+m+n-1) |
| 104 | + return 1-( |
| 105 | + (0.25*m*(m-1))/total + |
| 106 | + m*n/total + |
| 107 | + n*(n-1)/total) |
| 108 | +end |
| 109 | + |
| 110 | +mendel(2,2,2) |
| 111 | +``` |
| 112 | + |
| 113 | +Deriving and using this algorithm works. |
| 114 | + |
| 115 | +However, it is also narrowly tailored to a specific problem. |
| 116 | + |
| 117 | +What happens if we want to solve a more complicated problem or if there are additional requirements tacked on? |
| 118 | + |
| 119 | +For example, what if we wanted to solve a question like "What's the probability of a heterozygous offspring?" |
| 120 | + |
| 121 | +We would need to derive another algorithm for this similar, yet slightly different problem. |
| 122 | + |
| 123 | +Algorithms work in certain cases, but also don't scale up if we add more constraints. |
| 124 | + |
| 125 | +Another approach would be to use a statistics-based solution. |
| 126 | + |
| 127 | +For instance, we can use a simulation that can broadly calculate the likelihood of a given offspring based on a set of given probabilities. |
| 128 | + |
| 129 | +This solution is generic and can be used to ask more types of questions. |
| 130 | + |
| 131 | + |
| 132 | +### Simulation Method |
| 133 | + |
| 134 | +For this method, we will make a fake population that follows the given parameters k, m, and n. |
| 135 | + |
| 136 | +Specifically, we can make a vector of 1's, 2's, and 3's, representing the HH, Hh, and hh genotypes, respectively. |
| 137 | + |
| 138 | +In this vector, there will be k 1's, m 2's, and n 3's. |
| 139 | + |
| 140 | +Next, we'll make another vector that stores the probabilities of there being a dominant phenotype given the parental genotypes. |
| 141 | + |
| 142 | +This is calculated using Punnett Squares. |
| 143 | + |
| 144 | +For example, if HH mates with either [HH, Hh, hh], the probability of a dominant phenotype is 100%, leading to a vector [1, 1, 1]. |
| 145 | + |
| 146 | +Now that these vectors have been created, we can begin the simulation. |
| 147 | + |
| 148 | +First, we will sample from the population to approximate the ratio of dominant phenotypes. |
| 149 | + |
| 150 | +For each iteration, we will randomly pick two mates from the population. |
| 151 | + |
| 152 | +For example, 2 (Hh) and 3 (hh) is picked. |
| 153 | + |
| 154 | +This will lead to a probability of a dominant allele = 0.5. |
| 155 | + |
| 156 | +All of the probabilities will be accumulated throughout all of the simulations. |
| 157 | + |
| 158 | +At the end of the simulation, we can divide the sum of the probabilities by the total number of simulations. |
| 159 | + |
| 160 | +This will get us the approximated number of individuals with a dominant phenotype. |
| 161 | + |
| 162 | +This method is unlikely to return exactly the same answer as the algorithm approach. |
| 163 | + |
| 164 | +Sampling is random, so we will get slightly different results each time we run the simulation (unless we set a seed). |
| 165 | + |
| 166 | +However, both methods will be very similar. |
| 167 | + |
| 168 | +The standard error for the estimate decreases as the number of simulations gets very large. |
| 169 | + |
| 170 | +The larger the number of iterations, the more likely that the final approximation will be similar both between simulations, as well as to the answer from the algorithm. |
| 171 | + |
| 172 | +It is important to keep in mind that both the algorithm and statistical sampling approaches only provide approximations, as there will definitely be some unaccounted variation in a true biological population! |
| 173 | + |
| 174 | +```julia |
| 175 | +using StatsBase |
| 176 | + |
| 177 | +# Probability of dominant offspring given parent genotypes |
| 178 | +# Index: offspring_prob[parent1, parent2] |
| 179 | +# Genotypes: 1=HH, 2=Hh, 3=hh |
| 180 | + |
| 181 | +ex_offspring_prob = [ |
| 182 | + 1.0 1.0 1.0; # HH × (HH, Hh, hh) |
| 183 | + 1.0 0.75 0.5; # Hh × (HH, Hh, hh) |
| 184 | + 1.0 0.5 0.0 # hh × (HH, Hh, hh) |
| 185 | + ] |
| 186 | + |
| 187 | +function mendel_sim(k, m, n, offspring_prob; iterations=100000) |
| 188 | + # Genotypes: 1=HH, 2=Hh, 3=hh |
| 189 | + population = [fill(1, k); fill(2, m); fill(3, n)] |
| 190 | + |
| 191 | + total_pop = k+m+n |
| 192 | + wts = [k/total_pop, m/total_pop, n/total_pop] |
| 193 | + |
| 194 | + # samples two mates from the vector [1,2,3] with probability weights given by wts |
| 195 | + |
| 196 | + # then sum the probability of each offspring having a dominant phenotype |
| 197 | + # sum across all simulations |
| 198 | + sum(1:iterations) do _ |
| 199 | + (i,j) = sample([1,2,3], weights(wts), 2) |
| 200 | + offspring_prob[i,j] |
| 201 | + end / iterations |
| 202 | +end |
| 203 | + |
| 204 | +mendel_sim(2, 2, 2, ex_offspring_prob) |
| 205 | +``` |
| 206 | + |
| 207 | +In the function above, the user provides the parameter `offspring_prob`. |
| 208 | +If the user wanted to answer a slightly different question with different probability weights, |
| 209 | +all that would be needed is a different input vector. |
| 210 | +This allows the user to solve a wider variety of questions. |
| 211 | + |
| 212 | +However, this function does assume that there are only 3 phenotypes, which limits the situations it can be applied towards. |
| 213 | + |
| 214 | +This solution returns a value closer to 0.75, |
| 215 | +while the first one returns a value close to 0.783. |
0 commit comments