`1. Consider two possible alignments of two sequences below:`

`Substitutions are marked with *.`

`a)`
` #1 ATCGTTCAGG--TCTTGGACATTAAGACAAAAAACATGCATAGCAT`
` #2 ATG-ACAGGGGGTCATGGACAATAAGTCAA----CATCCACAGAAT`
` * **** * * * * * *`
`Gaps = 3`
`Substitutions = 11`

`b)`
` #1 ATCGTTCAGGTCTTGGACATTAAGACAAAAAACATGCATAGCAT----`
` #2 ATG-------ACAGGGGGTCATGGACAATAAGTCAACATCCACAGAAT`
` * * ** ******* * ***** *****`
`Gaps = 2`
`Substitutions = 22`

`If the gap:change cost is 2 and gaps of any length are counted
equally, what is the cost of each alignment? Which is preferable?`

`Cost of any alignment = (#subs) + (#gaps)(gap cost)`
`Cost of (a) = 11 + 3(2) = 17`
`Cost of (b) = 22 + 2(2) = 26`
`Alignment (a) is better.`

`What are the costs if the gap:change cost is 10? Which is preferable?`

`Cost of (a) = 11 + 3(10) = 41`
`Cost of (b) = 22 + 2(10) = 42`
`(a) is still better, but not by much.`

`What are the costs if gap:change=2 and the gap extension penalty
is 0.1 per site? Which is preferable?`

`Cost of any alignment = (#subs) + (#gaps)(gap cost) + (#extended
gap positions)(extension cost); where “#extended gap positions is the
number of sites that contain a gap - the total number of gaps.`

`Cost of (a) = 11 + 3(2) + 4(0.1) = 17.4`
`Cost of (b) = 22 + 2(2) + 9(0.1) = 26.9`
`Alignment (a) is better.`

`In alignment b, what evolutionary event does the gap in sequence
#2 represent?`

`Either a seven-nucleotide deletion in sequence 2 or a seven-nucleotide
insertion in sequence 1. From these two sequences, there is no way to
tell which is the case (but there would be if you had more sequences and
a phylogeny describing the evolutionary relationships among them).`

`2. For alignment a in problem 1 above, what is the nucleotide
diversity of these two sequences (leave gapped positions out of the calculation)?`

`For two sequences, the nucleotide diversity is the proportion of
sites that contain non-identical nucleotides. There are 46 positions of
which 7 are gapped, so the effective length of the sequences is 39 nucleotides.
Of these 39 sites, 10 are different, so the nucleotide diversity can be
calculated as:`

` 10/39 = 0.26`

`Imagine a population in which there are two alleles: sequences
1 and 2 in problem 1 above. If the frequency of allele #1 is 0.77 and
the frequency of allele #2 is 0.23, what is the nucletodie diversity in
the population?`` ``For a population with two sequences, the nucleotide diversity is
the nucleotide diversity of the pair times the frequence of each sequence
in the population. (If there were more than two, the divrsity would be
this product summed over all pairwise comparisons in the population).
In this case, the nucleotide diversity in the population is`

` (0.26)(0.23)(0.77) = 0.046`

`3. Imagine a population in which there are four alleles of a
short gene. Using the following alignment and frequencies, calculate the
nucleotide diversity in the population.`

`seq # alignment freq`
`1 ATCGTTCAGG--TCTTGGACATTAAGACAAAAAACATGCATAGCAT 0.40`
`2 ATG-ACAGGGGGTCATGGACAATAAGTCAA----CATCCACAGAAT 0.08`
`3 ATA-ACAGGCGGTCATGCACAATAAGTCTA----CATCGACAGAAT 0.02`
`4 ATG-ACAGGGGCTCTTGGGCAATTAGTCAG----GATCCACAGAAT 0.50`

`The nucleotide diversity is the sum of x _{i}x_{j}p_{ij}
over all pairwise comparisons, where x is the frequency of each allele
and p is the nucleotide diversity for any pair of sequences. (p is normally
written as the Greek letter pi, but I don’t know how to do that in HTML.)`

`In this case, p for each pair is equal to`

`12: 11/39 = 0.28`
`13: 14/39 = 0.36`
`14: 14/39 = 0.36`
`23: 4/41 = 0.10`
`24: 6/41 = 0.15`
`34: 11/41 = 0.27`

`And the nucleotide diversity is the sum of the following`

`(0.40)(0.08)(0.28) + (0.40)(0.02)(0.36) + (0.40)(0.50)(0.36) + (0.08)(0.02)(0.10)
+ (0.08)(0.50)(0.15) + (0.02)(0.50)(0.27) = 0.09`

`Express in a sentence the meaning of this result. If another
population were found to have a nucleotide diversity of 0.001 for the same
locus, what might be the cause of the difference between the two populations?`

`This result means that in any individual in the population, 9 percent
of the nucleotide sites (about four sites total) are expected to differ
between the individual’s two alleles. If another population had an almost
10-fold lower nucleotide diversity for the same gene, this could be because
there are more alleles (leading to fewer individuals homozygous for some
allele) or because the sequences are more diverged from each other (creating
more different sites per sequence). Or both. Either pattern could be
caused if the population with higher diversity were subject to weaker selection
for the genes in question or had an historically larger population size.`

`4. Suppose that 5 different allozymes are examined by electrophoresis
in two rodent populations, one of house mice (M) and of Norway rats (R).
Each gene is represented by a number (1 through 5), and each allele of
the gene is represented by a small letter. Suppose the alleles exist in
the populations in the following frequencies:`

`Locus frequency`
` M1a 0.96 *`
` M1b 0.03`
` M1c 0.01`
` R1a 0.92`
` R1b 0.04`
` R1c 0.04`
` M2a 0.64`
` M2b 0.31`
` M2c 0.05`
` R2a 0.50`
` R2b 0.31`
` R2c 0.19`
` M3a 0.15`
` M3b 0.82`
` M3c 0.03`
` R3a 0.03`
` R3b 0.96 *`
` R3c 0.01`
` M4a 0.01`
` M4b 0.01`
` M4c 0.98 *`
` R4a 0.03`
` R4b 0.03`
` R4c 0.94 *`
` M5a 0.34`
` M5b 0.22`
` M5c 0.34`
` R5a 0.68`
` R5b 0.01`
` R5c 0.31`
` `
`Genes that are monomorphic at the 95% level are starred.`

`What percent of the loci are polymorphic at the 95 percent level
in mice? In rats? For each species, what percent are polymorphic using
a 99 percent cutoff? Express in a sentence the meaning of these results.`

`In mice, 2 of genes are monomorphic at the 95 percent level (meaning
the most common allele has frequency > 0.95); the remaining 3 of the 5
genes are polymorphic, so P(95) = 0.60. In rats too, two genes are monomoprhic
and three polymorphic, so P(95) = .60.`

`None of the alleles of any gene in either species has frequency
> 0.99, so P(99) = 1.0 for both populations.`

`What is the average heterozygosity in mice? In rats? Express
in a sentence what this result means. What forces in evolution might have
caused this difference?`

`The heterozygosity for one locus is equal to 1 - the sum of the
squared frequencies of each allele. The average heterozygosity for multiple
loci is the mean of the heterozygosity for all loci. For mice, this equals
(0.08 + 0.49 + 0.30 + 0.04 + 0.72)/5 = 0.33. For rats, it equals (0.15
+ 0.62 + 0.08 + 0.11 + 0.44)/5 = 0.28.`

`This result means that if any one of these loci is examined in any
individual mouse, there is on average about a one-in-three probability
that the mouse will be a heterozygote for that gene. For the rats, the
heterozygosity is slightly lower, which means there are more homozygotes
in the population and less genetic diversity. This difference could be
caused if the rats were subject to stronger selection on some of the genes
in question (driving the population more strongly towards fixation for
these genes), or if the rats had an historically smaller population size
(magnifying the power of drift and inbreeding to reduce heterozygosity).`

`5. Consider the following two short sequences of the same protein
from two taxa (the spaces are inserted just to make counting easier):`

` #1 CVPCFFKRIS QGHQRNDCEG CKSAISYNGM`
` #2 CIPCFFKRIT QGHQRNECEG CKSALTYNGM`
` * * * **`

`What is the proportion of observed differences between the sequences?`

`D = #diffs/length = 5/30 = 0.166`

`What is the probability that a substitution occurred at any one
amino acid site since their divergence?`

`We calculate the actual frequency of substitution per site using
the equation`

`K = -ln (1-D) = 0.182`

`If the two taxa are two species of apes that diverged from each
other 5 million years ago, what is the rate of amino acid substitution?`

`The rate r (called lambda in class, but beyond my HTML capacity)
is calculated using K=r/2t, which rearranges to`

`r = K/2t = 0.182/10 million = 1.82 substitutions per site 100 million
years`

`6. For the following nucleotide sequences:`

`Synonmous changes are marked with S, nonsynonymous changes with
N. Lower case letters distinguish transitions (s) from transversions (v).
Degeneracy is marked with numbers.`

` 000 204 002 004 002 002 002 002 004 002 204`
` ATG CTA AAC GGA CAT TGT CAT GAT GGG CAC AGT`
` ATA CTA AAT GGT CTT TGC AAT CAT GGA CAT AGC`
` N S N S N N S S S`
` s s v s v v s s s`

`These sequences translate as`
` M L N G H C H D G H S`
` I L N G L C N H G H S`
` * * * *`

`What is the proportion of observed differences between the sequences?`

`D = # diffs/length = 9/33 = 0.27`

`What is the probability that a substitution occurred at any one
nucleotide site since their divergence, assuming a Jukes-Cantor model?`

`K = -3/4(ln(1-4D/3)= 0.34`

`What is the probability that a substitution occurred at any one
nucleotide amino acid site since their divergence, assuming Kimura’s 2-parameter
model?`

`The substitutions include 6 transitions and 3 transversions in 33
sites, so the frequency of transitions (P) and transversions (Q) (which
should always sum to 1) are:`

`P = 6/33 = 0.18`
`Q = 3/33 = 0.09`

`K = [-1/2*ln(1-2P-Q)] + [-1/4*ln(1-2Q] = 0.35.`

`Notice that the Kimura distance between the two sequences is slightly
greater than the J-C distance, because transitions in these two sequences
occur with greater frequency than expected if any one kind of transition
and transversion are equally likely. That means that the high frequency
of transitions suggests that there were even more double-hit transitions
(leading to convergence on the same nucleotide state) than calculated using
a model that doesn’t distinguish between these two kinds of changes. Figure
out for yourself what that “null” expectation is (the ratio of transitions
to transversions at which the J-C and Kimura corrections gives the same
answer).`

`What are Ks and Ka for these two sequences? What does this suggest
about their evolution?`

`Ds = # syn subs/# syn sites = # syn subs /[(#4-fold)+1/3(#2-fold)]`
` = 5/(4 + 8/3) = 5/6.67 = 0.75`

`Da = # ns subs/# ns sites = # ns subs /[(#0-fold)+2/3(#2-fold)]`
` = 4/(21+16/3) = 4/26.33 = 0.15`

`It turns out to be impossible to correct this Ds value using a J-C
model, which produces an incalculable expression (ln of a negative number)
when D < 0.75 (my apologies). So let’s correct them using the simple
infinite-alleles Poisson model: K= -ln(1-D). Note that this correction
will give an under-correction because it doesn’t address multiple hits
to give the same nucleotide state.`

`Ks = 1.39`
`Ka = 0.16`

`Nucleotide changes that cause amino acid substitutions occur with
more than 8-fold lower frequency than those that are “silent.” This suggests
that the protein is functionally constrained by natural selection.`

`If the two taxa are two species of apes that diverged from each
other 5 million years ago, what is the total rate of nucleotide substitution,
assuming a Jukes-Cantor model?`

`As above, r = K/2t. In this case,`

`r = 0.34/10 million = 3.4*10 ^{-8} substitutions per site
per year`

`Using the same model, what is the rate of synonymous substitution?
What is the rate of nonsynonmous substitution?`

`We can’t use the J-C model as discussed above, but we can use the
Poisson model to get K. The synonmous and nonsynonmous rates are calculated
the same way`

`r(s) = 1.39/10 million = 13.9*10 ^{-8} substitutions per
site per year`

`What is the rate of mutation? What has been the average length
of time between non-synonymous substitutions?`

`We don’t know the rate of mutation with absolute confidence, because
codon bias might constrain even synonymous changes. If there is no codon
bias, then the mutation rate should equal the rate of synonymous substitution,
13.9*10 ^{-8} substitutions per site per year.`

`What has been the average length of time between non-synonymous
substitutions?`

`The interval between substitutions is always 1/rate. If nonsynonymous
changes occur at the rate of 1.5*10 ^{-8} substitutions per site
per year, then an average of 67 million years passes between nonsynonmous
substitutions at any one site.`

`7. Consider a stretch of junk DNA (noncoding, nonfunctional)
that is 1,000 nucleotides long. Assume that the mutation rate is high
-- 2 changes per site per million generations. In a population of 10,000
individuals, how many new alleles will be created in the population each
generation?`

`The number of new alleles per generation in an individual equals
the mutation rate per site times the number of sites in the sequence.
The number of new alleles in the population equals this number times the
total number of alleles in the population (2 times the number of individuals).
So`

`# new alleles = (2*10-6) * 1000 * (2*10,000) = 40`

`What fraction of these will ultimately be fixed? What is the
rate of molecular evolution?`

`Because this is non-selected DNA, alleles will be fixed by drift
alone. The probability that any allele will be fixed by drift is equal
to its frequency. A new allele by definition has frequency 1/2N = 1/20,000
= 5 * 10-5. The fraction of new alleles that will be fixed equals this
probability, or about 50 per million.`

`The rate of molecular evolution is the rate at which substitutions
are fixed in the sequence. In an infinite allele model, this is equal
to the rate at which new alleles are created times the rate at which they
are fixed, which can be expressed as`

`r(t) = uL(2N)*1/2N=uL,`

`where u is the neutral mutation rate and L is the length of the
sequence. This quantity expresses the rate at which substitutions occur
anywhere in the sequence. Here,`

`r(t) = (2*10-6) * 1000 = 2*10-3 substitutions per generation in
the entire gene`

`The rate of substitution per site -- the more traditional measureof
the rate of molecular evolution is simply thenetural mutation rate or`

`r = 2*10-6 substitutions per site per generation`

`In a population of 50 individuals with the same mutation rate,
how many new alleles will be created per generation? What fraction of
these will ultimately be fixed? What is the rate of molecular evolution?`

`# new alleles = (2*10-6) * 1000 * (2*50) = 0.2. That is, one new
allele will be created in the population every five generations on average.`

`The probability of fixation is 1/2N = 1/100 = 0.01.`

`The rate of molecular evolution is the same as in a larger population,
because the smaller number of new alleles is exactly offset by the higher
probability of fixation by drift (by factors of 2N and 1/2N).`

`Which population will have greater heterozygosity at any one
time? Why?`` `
`The larger population will have greater heterozygosity, because
it takes longer for alleles to drift to fixation in a larger population.
The average time to fixation of any new allele is`
`t = 4N. This means it will take about 200 times (10,000/50) longer
for alleles to be fixed in the large population than in the small one.
At any one time, then, there will be more alleles on their way to fixation
-- at some intermediate frequency -- in the large population. A population
with more alleles at intermediate frequencies will have more heterozygotes
than a population dominated by one allele at high frequency.`

`8. In a population of 10,000 individuals with a neutral mutation
rate of 1 change per site per 10 million generations, what is the expected
heterozygosity per site? What is the expected heterozygosity for a protein
that is 500 amino acids long? Express the meaning of this result in a
sentence.`

`Drift and mutation will ultimately achieve an equilibrium, with
drift fixing alleles and mutations creating new ones. At this equilibrium,
there will be some steady-state level of heterozygosity. We have learned
that this value is given by`

`H = 4Nu/(1+4Nu),`

`where u is the mutation rate and N is the population size.`

`Here, 4Nu = 4(104)(10 ^{-7}) = 4*10^{-3}`

`H = 4*10-3/(1+4*10 ^{-3}) = .00398 or about 0.004.`

`In a protein 500 sites long, if the neutral mutation rate is 10 ^{-7}
replacements per site per generation, then the production of neutral alleles
takes place at the rate`

`500 sites * 10 ^{-7} mutations per site per generation =
5*10^{-5} mutant alleles per gene per generation. If we use this
as the mutation rate, then 4Nu = 4(104)(10^{-7})(500) = 2, so`

`H = 2/3 = 0.666. This means that 2/3 of the individuals in the
population will carry two different alleles at this locus. Only one-third
will have identical sequences in their two alleles.`

`This calculation shows how a moderately high mutation rate will
yield very low diversity at each site; however, because most genes are
long, the genetic diversity per gene can be very high.`

`What is the expected heterozygosity if there are 12 individuals
in the population?`

`If there are only twelve individuals, then the per site heterozygosity
is calculated as`

`H = (4*12*10 ^{-7})/[1+(4*12*10^{-7})] = 4.8 * 10^{-6}`

`And the per gene heterozygosity is`

`(4*12*500*10 ^{-7})/[1+(4*12*500*10^{-7})] = 0.002.`

`As expected, a smaller population has a lower equilibrium heterozygosity,
because drift drives alleles to fixation much more quickly, reducing the
number of alleles at intermediate frequency and increasing the frequency
of homozygotes in the population.`

`If a population is found to have high heterozygosity, is this
evidence of balancing selection?`

`It might be, but this is not the only possible explanation. As we
just saw, high heterozygosity could also reflect consistently large population
sizes through evolutionary history. It may also reflect a very high mutation
rate, as is the case with many viruses, for which H is often quite high.`
` `

`9. Consider a single amino acid substitution that confers a
selective advantage of 0.02 percent on the individual that carries it in
one copy and 0.04 percent on an individual in which it is homozygous.
In a population of 100, will this mutation be more strongly affected by
selection or drift?`

`By definition, this allele is codominant, because the effect of
having two copies has twice the effect on fitness as having one copy.
This allows us to use the equations and inequalities we learned in class.`

`The key quantity for determining the relative importance of drift
and selection is 4Ns, where N is population size and s is the selection
coefficient. Here 4Ns = 4 * 100* 0.0002 = 0.08. If 4Ns is less than 0.1,
we saw that drift is the primary force that changes the frequency of the
allele. That is, the frequency of the allele will behave as if it were
neutral, though it is not.`

`What about in a population of 1000? In a population of 10,000?
How do you know? Why does population size matter?`

`In a population of 1,000, 4Ns = 0.8, which is in the middle zone
in which both drift and selection play a role. Here, selection will drive
this fitter allele towards fixation, but at a slow rate, such that drift
will also drive other, less fit alleles up in frequency (and this one down
in frequency). The result will be some intermediate frequency less than
1.0 for the advantageous allele.`

`In a population of 10,000, 4Ns = 8, securely in the zone where selection
dominates. Here, we expect selection to drive the allele towards fixation
with rate and efficiency close to that expected in an infinite population
size (which we examined using the equations illustrated in problem set
#1).`

`These inequalities derive from the equation that the probability
Pr(f) that a new codominant allele will ultimately be fixed is`

`Pr(f) = 1-e-2s/1-e-4Ns`

`If you plug in different values of 4Ns, you will see that the inequalities
are approximately correct. At high 4Ns, the denominator becomes approximately
1, and the probability of fixation is determined almost entirely by the
strength of selection. At low 4Ns, the denominator can be approximated
as`
`-4Ns, and the numerator can be approximated as -2s; the resulting
fraction reduces, and the probability of fixation = 1/2N, the same as under
drift alone. In the middle, the approximations don’t allow easy simplification,
and both N and s affect the probability.`

` `
`10. Suppose that scientists at the University of Washington
obtain the sequence of a certain gene from 100 individuals in one population
of coho salmon and another population of chinook salmon. The gene is 1000
nucleotides long. 900 of the sites are invariant in all sequences examined.`

`At forty sites, every individual of both species has a G, except
for four coho and five chinook, all which have T at these sites. Of these
sites, 10 cause an amino acid change in the translated protein.`

`For sixty sites, all chinook sequences an A; all coho sequences
have a T. Of these sites, 5 cause an amino acid change.`

`Is this finding consistent with the neutral theory? Why or why
not? What does it suggest about the role of selection on this sequence?
If there are multiple interpretations, note them.`` `
`Knowledge of nonsynonymous and synonmous variation both within
and between populations can be used to test the neutral theory using the
McDonald-Kreitman test. We need to set up a 2-by-2 table that counts the
number of synonymous and nonsynonymous within-population substitutions(polymorphisms,
P) and between-population substitutions (fixed differences, D).`

`The data tell us that there are forty polymorphic sites, of which
10 are non-synonymous. There are 60 sites with fixed differences, of which
5 are non-synonymous. We can set up the following table and calculate
the NS/S ratio for each type of variation.`

` P D`
`NS 10 5`
`S 30 55`
`NS/S 0.33 0.09`

`If the mutation rate in the two populations is the same and the
sequences are neutral or affected only by functional constraints that are
common to both populations, we expect NS/S to be the same within and between
populations. If the sequences are affected by unique selection pressures
in each population, NS/S should be different within and between populations.
NS/S is useful because it is generally not affected by differences in the
mutation rate or in drift between populations, which affect both synonymous
and nonsynonymous sites equally.`

`The fact that S is greater than NS in both columns shows that the
sequences are not absolutely neutral. But this finding is consistent with
the neutral theory, which says that functional constraints will reduce
the frequency of many nonsynymous substitutions.`

`The key comparison is between the two entries in the table’s bottom
row: there is considerably more nonsynonymous variation within populations
than between them. This finding departs from the neutral expectation.
Nonsynonymous substitutions occurred faster within populations than between
them. This suggests that selection may be driving populations to develop
more internal diversity than would occur under neutrality. This pattern
is known to occur, for instance, with immune system genes that are involved
in recognizing diverse antigens. It also occurs for genes in viruses and
parasitic organisms, where generating more diversity within a population
helps the parasite to evade the host’s immune responses.`

`In general the McDonald-Kreitman test is not absolutely decisive
in rejecting neutrality or near-neutrality, because it is conceivable that
population sizes could have affected the behavior of weakly selected alleles
differently between and within populations. For instance, if there a bottleneck
in each species soon after they diverged from each other (as is thought
to often occur at speciation), then sites that were polymorphic in the
original population will be rapidly fixed for different alleles in each
population, leading to a greater between-population (D) than within-population
variability (P). In this case, the result is the opposite, and it is not
obvious how this pattern could be caused by variation in population sizes.
(If you can think of a way, please let me know.)`

`One other explanation would be if one population were subject to
codon bias and the other were not: the synonymous rate of substitution
in the latter would be higher than in the former, and the NS/S ratio would
be skewed by this selective difference. In this case, however, the almost
four-fold difference in the within- and between-population NS/S ratios
is much greater than the slight effect of any known codon bias could account
for.`