## Alignment, substitution, and the neutral theory

1.  Consider two possible alignments of two sequences below:

Substitutions are marked with *.

a)
#1 ATCGTTCAGG--TCTTGGACATTAAGACAAAAAACATGCATAGCAT
#2 ATG-ACAGGGGGTCATGGACAATAAGTCAA----CATCCACAGAAT
* ****      *      *    *          *  *  *
Gaps = 3
Substitutions = 11

b)
#1 ATCGTTCAGGTCTTGGACATTAAGACAAAAAACATGCATAGCAT----
#2 ATG-------ACAGGGGGTCATGGACAATAAGTCAACATCCACAGAAT
*       * **  *******     *  *****   *****
Gaps = 2
Substitutions = 22

If the gap:change cost is 2 and gaps of any length are counted equally, what is the cost of each alignment?  Which is preferable?

Cost of any alignment = (#subs) + (#gaps)(gap cost)
Cost of (a) = 11 + 3(2) = 17
Cost of (b) = 22 + 2(2) = 26
Alignment (a) is better.

What are the costs if the gap:change cost is 10? Which is preferable?

Cost of (a) = 11 + 3(10) = 41
Cost of (b) = 22 + 2(10) = 42
(a) is still better, but not by much.

What are the costs if gap:change=2 and the gap extension penalty is 0.1 per site? Which is preferable?

Cost of any alignment = (#subs) + (#gaps)(gap cost) + (#extended gap  positions)(extension cost); where “#extended gap positions is the number  of sites that contain a gap - the total number of gaps.

Cost of (a) = 11 + 3(2) + 4(0.1) = 17.4
Cost of (b) = 22 + 2(2) + 9(0.1) = 26.9
Alignment (a) is better.

In alignment b, what evolutionary event does the gap in sequence #2 represent?

Either a seven-nucleotide deletion in sequence 2 or a seven-nucleotide insertion in sequence 1.  From these two sequences, there is no way to tell which is the case (but there would be if you had more sequences and a phylogeny describing the evolutionary relationships among them).

2.  For alignment a in problem 1 above, what is the nucleotide diversity of these two sequences (leave gapped positions out of the calculation)?

For two sequences, the nucleotide diversity is the proportion of sites that contain non-identical nucleotides. There are 46 positions of which 7 are gapped, so the effective length of the sequences is 39 nucleotides.  Of these 39 sites, 10 are different, so the nucleotide diversity can be calculated as:

10/39 = 0.26

Imagine a population in which there are two alleles:  sequences 1 and 2 in problem 1 above.  If the frequency of allele #1 is 0.77 and the frequency of allele #2 is 0.23, what is the nucletodie diversity in the population?

For a population with two sequences, the nucleotide diversity is the nucleotide diversity of the pair times the frequence of each sequence in the population.  (If there were more than two, the divrsity would be this product summed over all pairwise comparisons in the population).  In this case, the nucleotide diversity in the population is

(0.26)(0.23)(0.77) = 0.046

3.  Imagine a population in which there are four alleles of a short gene. Using the following alignment and frequencies, calculate the nucleotide diversity in the population.

seq #               alignment                           freq
1     ATCGTTCAGG--TCTTGGACATTAAGACAAAAAACATGCATAGCAT    0.40
2     ATG-ACAGGGGGTCATGGACAATAAGTCAA----CATCCACAGAAT    0.08
3     ATA-ACAGGCGGTCATGCACAATAAGTCTA----CATCGACAGAAT    0.02
4     ATG-ACAGGGGCTCTTGGGCAATTAGTCAG----GATCCACAGAAT    0.50

The nucleotide diversity is the sum of xixjpij over all pairwise comparisons,  where x is the frequency of each allele and p is the nucleotide diversity for any pair of sequences. (p is normally written as the Greek letter pi, but I don’t know how to do that in HTML.)

In this case, p for each pair is equal to

12: 11/39 = 0.28
13: 14/39 = 0.36
14: 14/39 = 0.36
23: 4/41 =  0.10
24: 6/41 = 0.15
34: 11/41 = 0.27

And the nucleotide diversity is the sum of the following

(0.40)(0.08)(0.28) + (0.40)(0.02)(0.36) + (0.40)(0.50)(0.36) + (0.08)(0.02)(0.10) + (0.08)(0.50)(0.15) + (0.02)(0.50)(0.27) = 0.09

Express in a sentence the meaning of this result. If another population were found to have a nucleotide diversity of 0.001 for the same locus, what might be the cause of the difference between the two populations?

This result means that in any individual in the population, 9 percent of the nucleotide sites (about four sites total) are expected to differ between the individual’s two alleles.  If another population had an almost 10-fold lower nucleotide diversity for the same gene, this could be because there are more alleles (leading to fewer individuals homozygous for some allele) or because the sequences are more diverged from each other (creating more different sites per sequence).  Or both.  Either pattern could be caused if the population with higher diversity were subject to weaker selection for the genes in question or had an historically larger population size.

4. Suppose that 5 different allozymes are examined by electrophoresis in two rodent populations, one of house mice (M) and of Norway rats (R).  Each gene is represented by a number (1 through 5), and each allele of the gene is represented by a small letter.  Suppose the alleles exist in the populations in the following frequencies:

Locus  frequency
M1a  0.96 *
M1b  0.03
M1c  0.01
R1a  0.92
R1b  0.04
R1c  0.04
M2a  0.64
M2b  0.31
M2c  0.05
R2a  0.50
R2b  0.31
R2c  0.19
M3a  0.15
M3b  0.82
M3c  0.03
R3a  0.03
R3b  0.96 *
R3c  0.01
M4a  0.01
M4b  0.01
M4c  0.98 *
R4a  0.03
R4b  0.03
R4c  0.94 *
M5a  0.34
M5b  0.22
M5c  0.34
R5a  0.68
R5b  0.01
R5c  0.31

Genes that are monomorphic at the 95% level are starred.

What percent of the loci are polymorphic at the 95 percent level in mice? In rats?  For each species, what percent are polymorphic using a 99 percent cutoff? Express in a sentence the meaning of these results.

In mice, 2 of genes are monomorphic at the 95 percent level (meaning the most common allele has frequency > 0.95); the remaining 3 of the 5 genes are polymorphic, so P(95) = 0.60.  In rats too, two genes are monomoprhic and three polymorphic, so P(95) = .60.

None of the alleles of any gene in either species has frequency > 0.99, so P(99) = 1.0 for both populations.

What is the average heterozygosity in mice?  In rats? Express in a sentence what this result means. What forces in evolution might have caused this difference?

The heterozygosity for one locus is equal to 1 - the sum of the squared frequencies of each allele.  The average heterozygosity for multiple loci is the mean of the heterozygosity for all loci.  For mice, this equals (0.08 + 0.49 + 0.30 + 0.04 + 0.72)/5 = 0.33.  For rats, it equals (0.15 + 0.62 + 0.08 + 0.11 + 0.44)/5 = 0.28.

This result means that if any one of these loci is examined in any individual mouse, there is on average about a one-in-three probability that the mouse will be a heterozygote for that gene.  For the rats, the heterozygosity is slightly lower, which means there are more homozygotes in the population and less genetic diversity.  This difference could be caused if the rats were subject to stronger selection on some of the genes in question (driving the population more strongly towards fixation for these genes), or if the rats had an historically smaller population size (magnifying the power of drift and inbreeding to reduce heterozygosity).

5.  Consider the following two short sequences of the same protein from two taxa (the spaces are inserted just to make counting easier):

#1 CVPCFFKRIS QGHQRNDCEG CKSAISYNGM
#2 CIPCFFKRIT QGHQRNECEG CKSALTYNGM
*       *       *        **

What is the proportion of observed differences between the sequences?

D = #diffs/length = 5/30 = 0.166

What is the probability that a substitution occurred at any one amino acid site since their divergence?

We calculate the actual frequency of substitution per site using the equation

K = -ln (1-D) = 0.182

If the two taxa are two species of apes that diverged from each other 5 million years ago, what is the rate of amino acid substitution?

The rate r (called lambda in class, but beyond my HTML capacity) is calculated using K=r/2t, which rearranges to

r = K/2t = 0.182/10 million = 1.82 substitutions per site 100 million years

6.  For the following nucleotide sequences:

Synonmous changes are marked with S, nonsynonymous changes with N.  Lower case letters distinguish transitions (s) from transversions (v).  Degeneracy is marked with numbers.

000 204 002 004 002 002 002 002 004 002 204
ATG CTA AAC GGA CAT TGT CAT GAT GGG CAC AGT
ATA CTA AAT GGT CTT TGC AAT CAT GGA CAT AGC
N       S      N    S N   N     S   S   S
s       s      v    s v   v     s   s   s

These sequences translate as
M   L   N   G   H   C   H   D   G   H   S
I   L   N   G   L   C   N   H   G   H   S
*               *       *   *

What is the proportion of observed differences between the sequences?

D = # diffs/length = 9/33 = 0.27

What is the probability that a substitution occurred at any one nucleotide site since their divergence, assuming a Jukes-Cantor model?

K = -3/4(ln(1-4D/3)= 0.34

What is the probability that a substitution occurred at any one nucleotide amino acid site since their divergence, assuming Kimura’s 2-parameter model?

The substitutions include 6 transitions and 3 transversions in 33 sites, so the frequency of transitions (P) and transversions (Q) (which should always sum to 1) are:

P = 6/33 = 0.18
Q = 3/33 = 0.09

K = [-1/2*ln(1-2P-Q)] + [-1/4*ln(1-2Q] = 0.35.

Notice that the Kimura distance between the two sequences is slightly greater than the J-C distance, because transitions in these two sequences occur with greater frequency than expected if any one kind of transition and transversion are equally likely.  That means that the high frequency of transitions suggests that there were even more double-hit transitions (leading to convergence on the same nucleotide state) than calculated using a model that doesn’t distinguish between these two kinds of changes. Figure out for yourself what that “null” expectation is (the ratio of transitions to transversions at which the J-C and Kimura corrections gives the same answer).

What are Ks and Ka for these two sequences?  What does this suggest about their evolution?

Ds = # syn subs/# syn sites = # syn subs /[(#4-fold)+1/3(#2-fold)]
= 5/(4 + 8/3) = 5/6.67 = 0.75

Da = # ns subs/# ns sites = # ns subs /[(#0-fold)+2/3(#2-fold)]
=  4/(21+16/3) =  4/26.33 = 0.15

It turns out to be impossible to correct this Ds value using a J-C model, which produces an incalculable expression (ln of a negative number) when  D < 0.75 (my apologies).  So let’s correct them using the simple infinite-alleles Poisson model: K= -ln(1-D).  Note that this correction will give an under-correction because it doesn’t address multiple hits to give the same nucleotide state.

Ks = 1.39
Ka = 0.16

Nucleotide changes that cause amino acid substitutions occur with more than 8-fold lower frequency than those that are “silent.”  This suggests that the protein is functionally constrained by natural selection.

If the two taxa are two species of apes that diverged from each other 5 million years ago, what is the total rate of nucleotide substitution, assuming a Jukes-Cantor model?

As above, r = K/2t. In this case,

r = 0.34/10 million = 3.4*10-8 substitutions per site per year

Using the same model, what is the rate of synonymous substitution?  What is the rate of nonsynonmous substitution?

We can’t use the J-C model as discussed above, but we can use the Poisson model to get K. The synonmous and nonsynonmous rates are calculated the same way

r(s) = 1.39/10 million = 13.9*10-8 substitutions per site per year
r(a) = 0.16/10 million = 1.6*10-8 substitutions per site per year

What is the rate of mutation?  What has been the average length of time between non-synonymous substitutions?

We don’t know the rate of mutation with absolute confidence, because codon bias might constrain even synonymous changes.  If there is no codon bias, then the mutation rate should equal the rate of synonymous substitution, 13.9*10-8 substitutions per site per year.

What has been the average length of time between non-synonymous substitutions?

The interval between substitutions is always 1/rate. If nonsynonymous changes occur at the rate of 1.5*10-8 substitutions per site per year, then an average of 67 million years passes between nonsynonmous substitutions at any one site.

7.  Consider a stretch of junk DNA (noncoding, nonfunctional) that is 1,000 nucleotides long.  Assume that the mutation rate is high -- 2 changes per site per million generations. In a population of 10,000 individuals, how many new alleles will be created in the population each generation?

The number of new alleles per generation in an individual equals the mutation rate per site times the number of sites in the sequence.  The number of new alleles in the population equals this number times the total number of alleles in the population (2 times the number of individuals). So

# new alleles = (2*10-6) * 1000 * (2*10,000) = 40

What fraction of these will ultimately be fixed?  What is the rate of molecular evolution?

Because this is non-selected DNA, alleles will be fixed by drift alone.  The probability that any allele will be fixed by drift is equal to its frequency.  A new allele by definition has frequency 1/2N = 1/20,000 = 5 * 10-5.  The fraction of new alleles that will be fixed equals this probability, or about 50 per million.

The rate of molecular evolution is the rate at which substitutions are fixed in the sequence.  In an infinite allele model, this is equal to the rate at which new alleles are created times the rate at which they are fixed, which can be expressed as

r(t) = uL(2N)*1/2N=uL,

where u is the neutral mutation rate and L is the length of the sequence.  This quantity expresses the rate at which substitutions occur anywhere in the sequence. Here,

r(t) = (2*10-6) * 1000 = 2*10-3 substitutions per generation in the entire gene

The rate of substitution per site -- the more traditional measureof the rate of molecular evolution is simply thenetural mutation rate or

r = 2*10-6 substitutions per site per generation

In a population of 50 individuals with the same mutation rate, how many new alleles will be created per generation?  What fraction of these will ultimately be fixed?  What is the rate of molecular evolution?

# new alleles = (2*10-6) * 1000 * (2*50) = 0.2.  That is, one new allele will be created in the population every five generations on average.

The probability of fixation is 1/2N = 1/100 = 0.01.

The rate of molecular evolution is the same as in a larger population, because the smaller number of new alleles is exactly offset by the higher probability of fixation by drift (by factors of 2N and 1/2N).

Which population will have greater heterozygosity at any one time?  Why?

The larger population will have greater heterozygosity, because it takes longer for alleles to drift to fixation in a larger population.  The average time to fixation of any new allele is
t = 4N.  This means it will take about 200 times (10,000/50) longer for alleles to be fixed in the large population than in the small one.  At any one time, then, there will be more alleles on their way to fixation -- at some intermediate frequency -- in the large population.  A population with more alleles at intermediate frequencies will have more heterozygotes than a population dominated by one allele at high frequency.

8.  In a population of 10,000 individuals with a neutral mutation rate of 1 change per site per 10 million generations, what is the expected heterozygosity per site?  What is the expected heterozygosity for a protein that is 500 amino acids long?  Express the meaning of this result in a sentence.

Drift and mutation will ultimately achieve an equilibrium, with drift fixing alleles and mutations creating new ones.  At this equilibrium, there will be some steady-state level of heterozygosity.  We have learned that this value is given by

H = 4Nu/(1+4Nu),

where u is the mutation rate and N is the population size.

Here, 4Nu = 4(104)(10-7) = 4*10-3

H = 4*10-3/(1+4*10-3) = .00398 or about 0.004.

In a protein 500 sites long, if the neutral mutation rate is 10-7 replacements per site per generation, then the production of neutral alleles takes place at the rate

500 sites * 10-7 mutations per site per generation = 5*10-5 mutant alleles per gene per generation.  If we use this as the mutation rate, then 4Nu = 4(104)(10-7)(500) = 2, so

H = 2/3 = 0.666.  This means that 2/3 of the individuals in the population will carry two different alleles at this locus.  Only one-third will have identical sequences in their two alleles.

This calculation shows how a moderately high mutation rate will yield very low diversity at each site; however, because most genes are long, the genetic diversity per gene can be very high.

What is the expected heterozygosity if there are 12 individuals in the population?

If there are only twelve individuals, then the per site heterozygosity  is calculated as

H = (4*12*10-7)/[1+(4*12*10-7)] = 4.8 * 10-6

And the per gene heterozygosity is

(4*12*500*10-7)/[1+(4*12*500*10-7)] = 0.002.

As expected, a smaller population has a lower equilibrium heterozygosity, because drift drives alleles to fixation much more quickly, reducing the number of alleles at intermediate frequency and increasing the frequency of homozygotes in the population.

If a population is found to have high heterozygosity, is this evidence of balancing selection?

It might be, but this is not the only possible explanation. As we just saw, high heterozygosity could also reflect consistently large population sizes through evolutionary history.  It may also reflect a very high mutation rate, as is the case with many viruses, for which H is often quite high.

9.  Consider a single amino acid substitution that confers a selective advantage of 0.02 percent on the individual that carries it in one copy and 0.04 percent on an individual in which it is homozygous.  In a population of 100, will this mutation be more strongly affected by selection or drift?

By definition, this allele is codominant, because the effect of having two copies has twice the effect on fitness as having one copy.  This allows us to use the equations and inequalities we learned in class.

The key quantity for determining the relative importance of drift and selection is 4Ns, where N is population size and s is the selection coefficient.  Here 4Ns = 4 * 100* 0.0002 = 0.08.  If 4Ns is less than 0.1, we saw that drift is the primary force that changes the frequency of the allele.  That is, the frequency of the allele will behave as if it were neutral, though it is not.

What about in a population of 1000?  In a population of 10,000? How do you know?  Why does population size matter?

In a population of 1,000, 4Ns = 0.8, which is in the middle zone in which both drift and selection play a role.  Here, selection will drive this fitter allele towards fixation, but at a slow rate, such that drift will also drive other, less fit alleles up in frequency (and this one down in frequency).  The result will be some intermediate frequency less than 1.0 for the advantageous allele.

In a population of 10,000, 4Ns = 8, securely in the zone where selection dominates.  Here, we expect selection to drive the allele towards fixation with rate and efficiency close to that expected in an infinite population size (which we examined using the equations illustrated in problem set #1).

These inequalities derive from the equation that the probability Pr(f) that a new codominant allele will ultimately be fixed is

Pr(f) = 1-e-2s/1-e-4Ns

If you plug in different values of 4Ns, you will see that the inequalities are approximately correct.  At high 4Ns, the denominator becomes approximately 1, and the probability of fixation is determined almost entirely by the strength of selection.  At low 4Ns, the denominator can be approximated as
-4Ns, and the numerator can be approximated as -2s; the resulting fraction reduces, and the probability of fixation = 1/2N, the same as under drift alone.  In the middle, the approximations don’t allow easy simplification, and both N and s affect the probability.

10. Suppose that scientists at the University of Washington obtain the sequence of a certain gene from 100 individuals in one population of coho salmon and another population of chinook salmon.  The gene is 1000 nucleotides long. 900 of the sites are invariant in all sequences examined.

At forty sites, every individual of both species has a G, except for four coho and five chinook, all which have T at these sites.  Of these sites, 10 cause an amino acid change in the translated protein.

For sixty sites, all chinook sequences an A; all coho sequences have a T. Of these sites, 5 cause an amino acid change.

Is this finding consistent with the neutral theory?  Why or why not?  What does it suggest about the role of selection on this sequence?  If there are multiple interpretations, note them.

Knowledge of nonsynonymous and synonmous variation both within and between populations can be used to test the neutral theory using the McDonald-Kreitman test.  We need to set up a 2-by-2 table that counts the number of synonymous and nonsynonymous within-population substitutions(polymorphisms, P) and between-population substitutions (fixed differences, D).

The data tell us that there are forty polymorphic sites, of which 10 are non-synonymous.  There are 60 sites with fixed differences, of which 5 are non-synonymous.  We can set up the following table and calculate the NS/S ratio for each type of variation.

P       D
NS    10     5
S     30     55
NS/S  0.33   0.09

If the mutation rate in the two populations is the same and the sequences are neutral or affected only by functional constraints that are common to both populations, we expect NS/S to be the same within and between populations.  If the sequences are affected by unique selection pressures in each population, NS/S should be different within and between populations.  NS/S is useful because it is generally not affected by differences in the mutation rate or in drift between populations, which affect both synonymous and nonsynonymous sites equally.

The fact that S is greater than NS in both columns shows that the sequences are not absolutely neutral.  But this finding is consistent with the neutral theory, which says that functional constraints will reduce the frequency of many nonsynymous substitutions.

The key comparison is between the two entries in the table’s bottom row: there is considerably more nonsynonymous variation within populations than between them.  This finding departs from the neutral expectation.  Nonsynonymous substitutions occurred faster within populations than between them.  This suggests that selection may be driving populations to develop more internal diversity than would occur under neutrality.  This pattern is known to occur, for instance, with immune system genes that are involved in recognizing diverse antigens.  It also occurs for genes in viruses and parasitic organisms, where generating more diversity within a population helps the parasite to evade the host’s immune responses.

In general the McDonald-Kreitman test is not absolutely decisive in rejecting neutrality or near-neutrality, because it is conceivable that population sizes could have affected the behavior of weakly selected alleles differently between and within populations.  For instance, if there a bottleneck in each species soon after they diverged from each other (as is thought to often occur at speciation), then sites that were polymorphic in the original population will be rapidly fixed for different alleles in each population, leading to a greater between-population (D) than within-population variability (P).  In this case, the result is the opposite, and it is not obvious how this pattern could be caused by variation in population sizes.  (If you can think of a way, please let me know.)

One other explanation would be if one population were subject to codon bias and the other were not: the synonymous rate of substitution in the latter would be higher than in the former, and the NS/S ratio would be skewed by this selective difference.  In this case, however, the almost four-fold difference in the within- and between-population NS/S ratios is much greater than the slight effect of any known codon bias could account for.