Positive Darwinian Selection. Does the Comparative Method Rule?

DONALD R. FORSDYKE

Journal of Biological Systems (2007) 15, 95-108

Accepted 18th August 2006. Published March 2007

Copyright World Scientific Pub. Corp.

1. Introduction
2. Negative and Positive Selection
3. Conflict within a Single Information Channel
4. Analogy with Speech
5. Neutral Mutations and the Ratio Method
6. The Problem of Zero Rates of Synonymous Mutation
7. Non-Synonymous and Synonymous Rates Correlate
8. Genes as Independent Mutational Entities
9. X-axis as a Conceptual Time Axis
10. Conclusions

End Note 2007
End Note 2009
End Note Jan 2010
End Note March 2011
End Note Feb 2013
End Note Aug 2014

To detect positive Darwinian selection it is thought essential to compare two sequences. Despite its defects, "the comparative method rules." However, genes evolving rapidly under positive selection conflict more with internal forces (the genome phenotype) than genes evolving slowly under negative selection. In particular, there is conflict with stem-loop potential.

The conflict between protein-encoding potential (primary information) and stem-loop potential (secondary information) permits detection of positive selection in a single sequence. The degree to which secondary information is compromised provides a measure of the speed of transmission of primary information.

Thus, the sovereignty of the comparative method is challenged not only by its own defects, but also by the availability of a single-sequence method. However, while of limited utility for positive selection, the comparative method casts new light on Darwin's great question - the origin of species. Comparison of rates of synonymous and non-synonymous mutation suggests that branching into new species begins with synonymous mutations.

Keywords: Positive selection; Single sequence; Primary information; Secondary information; Conventional phenotype; Genome phenotype; Speciation

1. Introduction

Positive Darwinian selection is usually evaluated by comparing two nucleotide sequences that are presumed to have diverged from a common ancestor. Recently, Plotkin et al. suggested that the codon volatility displayed in a single genic sequence provides evidence for positive selection.¹ However, this approach was criticized.^2-9 Irrespective of the validity of the criticisms,¹⁰ there is general agreement that, while currently "the comparative method rules,"² detection in a single sequence would "confer greater power of analysis with less information,"⁵ and would be "revolutionary -- because it challenges the essentiality of the comparative method."² Furthermore, "because a closely related sequence may be unavailable for comparison -- a method to detect a signature of positive selection in a single genome has considerable appeal."³ Thus, "the idea of using just one DNA sequence to detect natural selection -- is novel and attractive, and it would be interesting to develop other measures that may accomplish this goal."⁷

However, when a novel method for detecting positive Darwinian selection from a single sequence was presented a decade ago,¹¹ there was little interest, perhaps because the method involved new concepts and a new technology - the detection of base order-dependent stem-loop potential.^12-14 The conceptual and technological aspects have since become better established.^15-21 For example, it is recognized that "the protein encoding region -- can -- comprise one or more overlapping layers of information."²² These layers can be considered as a "genome dialect."²³ Furthermore, there is increasing recognition of the possibility of non-neutral evolution in non-coding DNA and at synonymous sites in coding DNA.^24-26

In view of increased questioning of the comparative method,^26-30 I here review the conceptual bases of the stem-loop potential and comparative methods. Technical details may be found elsewhere.¹¹ I conclude that, while sometimes a misleading indicator of positive selection, the comparative method casts new light on the fundamental question Darwin posed - namely, that of the origin of species. The initiation of divergence between lineages can involve secondary information (the genome phenotype), not primary information (the conventional phenotype).

2. Negative and Positive Selection

If reproductive success is impeded by a mutation, then selection of organisms with the mutation is negative. If reproductive success is promoted then the selection is positive. These are two, usually mutually exclusive, consequences of a mutation in a nucleic acid sequence. The extreme imperative of negative selection is: if you mutate, you die. Thus, the broad population of non-mutators remains and the few mutators die. The extreme imperative of positive selection is: if you do not mutate, you die. Thus, the broad population of non-mutators dies, and the few mutators flourish (i.e. there is a population "bottleneck" from which only the mutators emerge). Occupying the middle-ground are "neutral" mutations, and mutations that may lead to either weak positive or weak negative selection. In the latter cases there will, by definition, be effects on the number of descendants, but only in the long-term.

Whether a genic base mutation will lead to negative or positive selection usually depends on the part of a gene-product that it affects. A mutation affecting the active site of an enzyme will usually disturb enzyme function and this may impair the function of an organism so its fitness to reproduce is impeded. In the extreme, this is ensured by the death of the organism. On the other hand, a mutation affecting an antigen at the surface of a pathogen may allow it to evade the immune defences of its host, so its fitness to reproduce is enhanced. In the extreme this is ensured by the death of pathogens that do not have the mutation.

Nucleic acid bases that are evolving very slowly (i.e. they are conserved among related organisms) are likely to affect functions subject to negative selection (i.e. organisms with mutations in the bases are functionally impaired). Nucleic acid bases that are evolving very rapidly (i.e. they are not conserved among related organisms) are likely to affect functions subject to some degree of positive selection (i.e. organisms with mutations in the bases are functionally improved). Thus, a determination of evolutionary rate has the potential to assist the distinction between bases under positive selection, and bases under negative selection. For this, base differences between sequences can be calibrated against some temporal scale (e.g. the period from the present to the time of divergence of sequences from a common ancestral sequence). Accumulation of a large number of differences in a short time would indicate positive selection. However, accurate temporal calibration is difficult. Accordingly, an alternative comparative approach, involving ratios of non-synonymous and synonymous base substitution mutations, has been widely adopted.³¹

Note that here we are concerned with genes under some degree of selection, not with rare genes whose evolution has been predominantly influenced by random drift. Such genes are assumed to be randomly distributed among those predominantly influenced by selection, and so should have little statistical impact on the data under discussion. We are also not concerned with comparative methods that use the extent of decrease in recombination rate as a measure of positive selection. These assume, for a nucleic acid segment containing a gene that is evolving rapidly (i.e. positive selection), that there may not be time for separation of the gene from neighboring segments by recombination. Thus neighboring genes will tend to remain linked. They will "hitchhike" through the generations with a positively selected gene. Variant forms of neighboring genes will be lost from the population in the course of this selective sweep. Consequently, polymorphism among members of a population in the region of a positively selected gene is decreased.^18,31

3. Conflict within a Single Information Channel

The best recognized form of genomic information is genic. The proteins and RNAs encoded by genes reflect their "primary information" and appear to have most influence on the conventional phenotype - an individual's somatic form and function. However, other types of information ("secondary information") exert pressures in genome space. These pressures have the potential to affect the conventional phenotype, often affect base composition, and are either local or general. Local pressures acting on genes include purine-loading pressure (AG-pressure), and RNY-pressure (the pressure for first and third codon bases to be purine and pyrimidine, respectively). General pressures affect entire genomes and include GC-pressure, fold pressure ("stem-loop potential") and pressures for genome compactness.^18,32-36 In other words, there is a genome phenotype, meaning that aspects of genome organization have the potential to influence reproductive success in the same sense that aspects of the conventional phenotype have the potential to influence reproductive success. Sometimes pressures appear to conflict. For example, seeming to accommodate AG-pressure, protein lengths can increase by inclusion of low-complexity inter-domain segments that do not appear important for protein function, yet contain "placeholder" amino acids encoded by AG-rich codons. Here compactness cedes to purine-loading pressure.^18,27,37-40

Thus, genomes can be seen as channels carrying multiple forms of information through the generations from the distant past to the present. As with information channels in general, carrying capacity is finite. When genomes cannot satisfy all informational demands a balance is established, with trade-offs between competing demands. This contrasts with the long-held view that there is an excess of carrying capacity in genome space, so that "neutral" mutations can endure, and sequences without obvious function can be considered as "junk."⁴¹

Compared with genes under negative selection, there is a greater onus on genes under positive selection to adapt to local genic pressure - the pressure to transmit a gene's primary information. To accommodate this increased pressure, the trade-offs in secondary information by non-genic pressures, be they local or general, must be greater. General pressures are of most investigative utility, since their diminution in protein-encoding regions that are under positive selection pressure can be evaluated relative to their levels either in local non-protein-encoding regions (i.e. in intronic and intergenic sequences that are assumed not to be under positive selection), or in local protein-encoding regions (likely to be under negative selection pressure).

Among general pressures, the pressure to order a sequence of bases to promote the potential for nucleic acid structure (base order-dependent stem-loop potential) has emerged as a sensitive index.¹¹ The principle of the method can be briefly summarized. Higher-order structures of single-stranded nucleic acids may be calculated from the base-pairing energies of overlapping dinucleotides, which are fundamental units of nucleic acid structure. Sequences are reiteratively folded until the energetically most favorable structures are arrived at.¹⁹ Contributions to the energetics of each structure decompose into base composition-dependent and base order-dependent components. The latter is determined by subtracting the base composition-dependent component from the total folding energy. The base composition-dependent component is itself determined by shuffling and refolding a sequence several times - thus destroying the base order-dependent component - and then taking the average folding energy of the resulting structures.^12-14
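The decomposition just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the published procedure: the function names (toy_fold_energy, fold_components), the pair scores and the minimum loop size are all hypothetical, and the toy scoring of nested base pairs merely stands in for the dinucleotide-based energy calculations of a real folding program.

```python
import random

# ASSUMPTION: toy per-pair scores standing in for real dinucleotide
# (nearest-neighbour) energy tables; more negative means more stable.
PAIR_SCORE = {("G", "C"): -3.0, ("C", "G"): -3.0,
              ("A", "T"): -2.0, ("T", "A"): -2.0,
              ("G", "T"): -1.0, ("T", "G"): -1.0}
MIN_LOOP = 3  # a pair must enclose at least this many bases


def toy_fold_energy(seq):
    """Lowest toy energy over nested pairings (Nussinov-style dynamic programme)."""
    n = len(seq)
    if n == 0:
        return 0.0
    best = [[0.0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            # Leave base i or base j unpaired.
            e = min(best[i + 1][j], best[i][j - 1])
            # Pair base i with base j, if they can pair.
            s = PAIR_SCORE.get((seq[i], seq[j]))
            if s is not None:
                e = min(e, best[i + 1][j - 1] + s)
            # Split the segment into two independently folded parts.
            for k in range(i + 1, j):
                e = min(e, best[i][k] + best[k + 1][j])
            best[i][j] = e
    return best[0][n - 1]


def fold_components(seq, n_shuffles=10, seed=0):
    """Return (total, composition-dependent, order-dependent) toy energies."""
    rng = random.Random(seed)
    total = toy_fold_energy(seq)
    shuffled = []
    for _ in range(n_shuffles):
        bases = list(seq)
        rng.shuffle(bases)  # keeps base composition, destroys base order
        shuffled.append(toy_fold_energy("".join(bases)))
    composition = sum(shuffled) / n_shuffles
    order = total - composition  # the base order-dependent component
    return total, composition, order


if __name__ == "__main__":
    window = "GGGGGGGGAAAACCCCCCCC"  # hypothetical window with an obvious hairpin
    print(fold_components(window))
```

In practice the calculation would be applied to successive windows along a sequence, so that the order-dependent component in protein-encoding regions can be compared with that in neighbouring regions.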

4. Analogy with Speech

Speech provides an analogy for the conflict between primary and secondary information. A public speaker conveys both a message (primary information) and an accent (secondary information). Normally these are not in conflict. Imagine requiring each member of a group of competing speakers, one at a time, to read a given text to a large audience. The speakers are informed that they will be timed to determine the slowest, and that the audience will be polled to determine the most incomprehensible. The slowest and most incomprehensible speakers will then be eliminated. Those who survive will repeat the performance after which the slowest and most incomprehensible speakers will again be eliminated. Eventually a winner will emerge.

Initially, each speaker relays both the text and an individual accent. However, under pressures both to speak rapidly and to be understood, speakers with more deviant accents are soon eliminated. Speakers are under strong pressure to eliminate personal idiosyncrasies of accent (i.e. to mutate their secondary information). The pressures for fast and coherent speech will progressively decrease the diversity of the secondary information among surviving group members. The final sound of the text will probably be the same for any large group of competing speakers exposed to the same large audience. Thus, the divergent accents of the initial multiplicity of speakers converge on a single accent to which the hearing of the average member of the audience is best attuned. In a competition where there is no pressure for speed, idiosyncrasies of accent are less likely to interfere with comprehension (i.e. the diversity of secondary information is tolerated).

Viewed from this perspective we see that a nucleic acid segment that is evolving rapidly with respect to its primary information (e.g. the sequence of a protein) may not be able to accommodate some of the other forms of information that it might otherwise carry. These other forms, assumed to be evolving leisurely under negative selection, include the ordering of bases to support stem-loop potential, and purine-loading.^16-18 Thus, sequences under positive selection are also likely to be sequences where one or more forms of secondary information are impaired. On this basis the type of selection can be evaluated in a DNA sequence without temporal calibration and with, at most, a need to compare only with neighboring sequences.

This method has confirmed as under positive selection numerous genes so designated by the non-synonymous/synonymous ratio method. These genes include those encoding major histocompatibility complex proteins,¹¹ snake venom proteins,¹³ and AIDS virus proteins.¹⁴ Their rapid evolution is predicted from the underlying biology. The need for confirmation of results of the ratio method is now pressing since the validity of the method itself, and the underlying concept of neutrality, are under increasing scrutiny.^26-30

5. Neutral Mutations and the Ratio Method

Neutral mutations seemed to offer an internal frame of reference for evaluating mutation rates and for determining whether a mutation would impede or promote reproductive success. Mutations in third positions of codons often do not change the nature of an encoded amino acid, and hence do not change the corresponding protein and any characters that depend on that protein. It was tempting to consider such synonymous mutations as neutral.³¹

An obvious advantage of the use of one particular codon, rather than a synonymous one, is that some codons can be translated more rapidly, or more accurately. This is indeed of evolutionary significance for certain unicellular organisms where the speed of protein synthesis is critical.^42-43 Hence, synonymous mutations may not be neutral. But in many organisms the rate of protein synthesis is not critical. Thus, to provide a relatively time-independent, internal, frame of reference for determining the form of selection (negative or positive), it was found convenient to compare the ratio of amino acid-changing (non-synonymous) base substitution mutations to non-amino acid-changing (synonymous) base substitution mutations in orthologous genes. This assumed that non-amino acid-changing base substitution mutations were adaptively neutral, and hence reflected a "background" rate of accepted mutation. The ratio within a nucleic acid segment seemed capable of providing an index of the rate at which that segment was evolving. A high ratio suggested the segment was under positive selection. A low ratio suggested the segment was under negative selection.³¹
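The classification on which the ratio method rests can be sketched in Python. This is a simplified illustration only: the function name count_differences is hypothetical, the sequences in the usage example are invented, and the code returns raw counts of differing codons rather than the per-site rates that published estimators derive after normalising for the numbers of synonymous and non-synonymous sites and correcting for multiple substitutions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_differences(seq_a, seq_b):
    """Count aligned codons that differ by exactly one base.

    Returns (synonymous, non_synonymous). Codons differing at two or three
    positions are skipped, because the order of the underlying substitutions
    is ambiguous; this is a simplification made for the sketch.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # the encoded amino acid is unchanged
        else:
            nonsyn += 1   # the encoded amino acid is changed
    return syn, nonsyn


if __name__ == "__main__":
    # Hypothetical aligned fragments of two orthologous reading frames.
    print(count_differences("ACAAGCGGT", "AAAAGCGGC"))
```

A high proportion of non-synonymous differences would then be read as suggesting positive selection, subject to the assumption, questioned in the following sections, that the synonymous differences are neutral.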

6. The Problem of Zero Rates of Synonymous Mutation

In many cases values for rates of synonymous base substitution mutations are significantly above zero, and determinations of ratios agree with biological expectations. This suggests that third codon position mutations can indeed be neutral. Yet it is not unusual to find values for synonymous base substitution mutations at or close to zero. This is particularly apparent with certain genes of the malaria parasite, Plasmodium falciparum. Some interpret this as revealing a recent population bottleneck - namely a shrinking of the population, the surviving members of which become founders for a subsequent population expansion, but there is a loss of population diversity.⁴⁴ Thus, at the extreme, existing species members are derived from one "Eve" of recent origin.

However, others propose that zero values for synonymous substitutions could result from high conservation of bases that, while not determining the nature of an encoded amino acid, do determine something else (unspecified secondary information). This violates the 'neutral' assumption, so that both the recent origin argument, and calculations based on ratios, could be invalid. Favouring this view, it has been shown that, more than most genomes, that of P. falciparum is sensitive to some of several non-classical selective factors, which affect third codon positions and collectively constitute the "genome phenotype."^27,39

Further evidence for conservation of bases at synonymous sites derives from the uniformity of codon-bias among orthologous genes in vertebrates ("coincident codons"). This is attributed to selection acting coincidentally not only on proteins but also on RNA structure.^26,33,36,45 Other evidence derives from vesicular stomatitis virus, a rapidly evolving RNA virus in which synonymous site conservation can be evaluated under defined conditions.²⁸ Various virus strains evolve in parallel as they adapt to new conditions, and this is presumed to be primarily through non-synonymous base substitution mutations. But concurrent synonymous substitutions do not occur randomly as classical neutral theory would predict. The same synonymous substitutions are accepted independently in different evolving strains. Thus, synonymous substitutions may independently contribute to strain adaptation (perhaps by affecting RNA structure), and/or they may be secondary to primary adaptations at non-synonymous sites (or vice-versa; i.e. non-synonymous and synonymous substitution mutations are correlated). Indirect evidence suggests correlation. Here "the comparative method" can be helpful.

7. Non-Synonymous and Synonymous Rates Correlate

For each individual gene, rates of non-synonymous and synonymous base substitution mutations can be highly correlated, so that the rate of synonymous substitution does not constitute a gene-independent frame of reference.⁴⁶ This is shown for orthologous genes of mouse and rat in Figure 1, which displays data of Wolfe and Sharp⁴⁷ that have been replotted to emphasize the divergence of the lineages from a common ancestor. Each point corresponds to a gene and, since divergence increases with time, the X-axis (percentage DNA divergence) can be conceived as a time axis. Plotted against the DNA divergence for the open reading frame of each gene are the corresponding protein divergences (Fig. 1a), and the corresponding non-synonymous (d_n) and synonymous (d_s) divergences (Fig. 1b). Genes differ dramatically in divergence. While some proteins (bottom left of Fig. 1a) have not diverged at all, the corresponding DNA sequences have diverged a little. Some proteins (top right of Fig. 1a) have diverged more than 20% and the corresponding DNA sequences are also highly diverged. The two divergences are linearly related with an intercept on the X-axis that is significantly different from zero.

Fig. 1. Divergence between 363 orthologous genes of mouse and rat. For each open reading frame is shown the relationship between DNA divergence (base substitutions %) and either (a) protein divergence (amino acids %), or (b) base substitutions per synonymous site (squares; d_s) and per non-synonymous site (triangles; d_n). Points are fitted to either (a) first order, or (b) third order, regression lines. This, with permission, is a replot of data of Wolfe and Sharp.⁴⁷ A direct plot of d_n against d_s (not shown) gave a correlation coefficient of 0.45 (with average values of 0.032 and 0.224 for d_n and d_s, respectively).

So it appears that, since the time of divergence of mice and rats from a common ancestor, some proteins (bottom left of Fig. 1a) have remained unchanged. Presumably organisms with mutations in these proteins have been negatively selected, so no mutations are found in modern organisms. On the other hand, organisms with synonymous mutations would not have been counter-selected so severely. Thus, for a gene whose protein is unchanged there is a DNA divergence of about 5.4% (intercept on X-axis), implying that some synonymous mutations have been accepted. This is shown in Figure 1b where, as expected, the plot for amino acid-changing mutations (d_n) resembles that in Figure 1a, and the plot for synonymous mutations (d_s) extrapolates back close to zero.
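How an intercept on the X-axis is read from a first order regression can be sketched in Python. The data points below are invented solely so that the fitted line crosses the X-axis at 5.4%; they are not the data of Wolfe and Sharp, and the function names (fit_line, x_intercept) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def x_intercept(slope, intercept):
    """DNA divergence at which the fitted protein divergence is zero."""
    return -intercept / slope


if __name__ == "__main__":
    # INVENTED illustrative values (percent), one pair per hypothetical gene.
    dna_divergence = [6.0, 8.0, 10.0, 12.0]
    protein_divergence = [1.2, 5.2, 9.2, 13.2]
    slope, intercept = fit_line(dna_divergence, protein_divergence)
    print(x_intercept(slope, intercept))
```

A positive intercept on the X-axis means that some DNA divergence has accumulated in genes whose proteins have not diverged at all, which is the observation interpreted above.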

The linear relationships show that non-synonymous (amino acid-changing) mutations, and synonymous (non-amino acid-changing) mutations are correlated. If a gene has zero or a few non-synonymous mutations (i.e. it is likely to have been under negative selection), then it will have few synonymous mutations, and will display a low overall DNA divergence. If a gene has many non-synonymous mutations (i.e. it is likely to have been under some degree of positive selection), then it will also have many synonymous mutations, and will display a high overall DNA divergence. This is a general observation and is not confined to the mouse-rat divergence.^30,46,48-50

Figure 2 shows replots of data from a recent study of the mouse-rat divergence by Bazykin et al.,⁵¹ which involves a much larger number of orthologues and a different method of calculating mutation rates (see also the results of Makalowski and Boguski, and of Friedman and Hughes).^52-53 While there is a greater scatter of points (partly due to the utilization of rat sequences that were incompletely curated at the time of the analysis), the results support those of Wolfe and Sharp.⁴⁷ It is likely that the well-documented tendency of d_s values to curve to the right when DNA divergences are high (e.g. Figs. 1b, 2b) reflects synonymous site saturation for forward mutations.⁵⁴ It follows from this that the d_n/d_s ratio is positively correlated to d_s. Remarkably, this relationship was recently described as "highly unexpected."³⁰

Fig. 2. Divergence between 9390 orthologous genes of mouse and rat. For each open reading frame is shown (a) a direct plot of d_n against d_s, and (b) independent plots of d_s (squares) and d_n (triangles) against their sum (as a measure of overall DNA divergence). In (a) the first order regression line intercepts the X-axis at a value of 0.044, which is significantly different from zero (P < 0.0001). The correlation coefficient was 0.46, and average values for d_n and d_s were 0.04 and 0.22, respectively. In another study of 470 orthologues, the correlation coefficient was 0.54, with respective average values of 0.035 and 0.164.⁵² In (b) the same data are plotted with second order regression lines. This, with permission, is a replot of data of Bazykin et al.⁵¹

8. Genes as Independent Mutational Entities

Why is the synonymous DNA divergence low in genes with low protein divergences and, despite probable site saturation, high in genes with high protein divergences? It seems likely that, within the group of codons that encode a particular protein, the demands of the conventional and genome phenotypes interrelate.⁵⁵ An accepted non-synonymous mutation that primarily changes an amino acid often cannot help but change one or more aspects of the genome phenotype. This invokes (makes more acceptable) secondary compensatory mutations, mainly synonymous, to correct this change. By the same token, an accepted mutation that primarily changes the genome phenotype may happen to be non-synonymous and so may also change the conventional phenotype. This invokes further compensatory mutations, mainly non-synonymous, to correct this change.⁵⁶

For example, a primary mutation from the codon ACA to AAA not only causes a lysine to be substituted for threonine, but also has the potential to marginally affect nucleic acid conformation (stem-loop potential), purine-loading, and GC%. A primary mutation from the codon AGC to AGG, while not affecting GC%, causes an arginine to be substituted for serine and has the potential to marginally affect purine-loading, RNY-pressure and nucleic acid conformation. On the other hand, a primary pressure to purine-load might change ACA to AAA, or AGC to AGG, thus secondarily causing proteins to acquire lysine or arginine (i.e. purine-loading "calls the tune").^35,38
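The two codon changes in this example can be checked with a short Python sketch. The function names (is_rny, describe_mutation) are hypothetical, and the tallies cover only the amino acid, purine content, GC content and RNY conformity of a single codon; effects on stem-loop potential depend on sequence context and are not modelled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

PURINES = set("AG")
GC_BASES = set("GC")


def is_rny(codon):
    """True if the first base is a purine (R) and the third a pyrimidine (Y)."""
    return codon[0] in PURINES and codon[2] not in PURINES


def describe_mutation(before, after):
    """Summarise what a codon change does to primary and secondary information."""
    return {
        "amino_acids": (CODON_TABLE[before], CODON_TABLE[after]),
        "purine_change": sum(b in PURINES for b in after)
                         - sum(b in PURINES for b in before),
        "gc_change": sum(b in GC_BASES for b in after)
                     - sum(b in GC_BASES for b in before),
        "rny": (is_rny(before), is_rny(after)),
    }


if __name__ == "__main__":
    for before, after in [("ACA", "AAA"), ("AGC", "AGG")]:
        print(before, "->", after, describe_mutation(before, after))
```

Both changes add one purine. Only ACA to AAA lowers the GC content, and only AGC to AGG breaks an RNY codon, in agreement with the text.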

A secondary increase in purine-loading following primary codon mutations from ACA to AAA, and from AGC to AGG, would favor (make propitious) the acceptance of local synonymous exchanges of pyrimidines for purines, thus restoring the original degree of purine-loading. Similarly, an increased content of basic amino acids secondary to mutations from ACA to AAA, and AGC to AGG, which had been primarily driven (made more acceptable) by a pressure to purine-load, would favour the acceptance of local non-synonymous compensatory mutations in the amino acid sequence.⁵¹

As overall DNA divergence increases (Figs. 1b, 2b), the plot for protein divergence (d_n) increases rectilinearly, or curves upwards, whereas the plot for synonymous divergence (d_s) curves to the right as points become more scattered. At high degrees of divergence (i.e. where genes, by definition, have evolved rapidly), in some genes the d_n/d_s ratio increases, but in others it does not. Furthermore, the ratio increase is due more to a decline in d_s than to an increase in d_n. At high degrees of divergence, synonymous mutations tend to be constrained in some genes but not in others. Whatever the cause of this, the designation by the ratio method of some genes as under positive selection would here seem to depend, not on a rapid change in the conventional phenotype, but on a decreased change in the genome phenotype.

Conversely, from the extrapolation back towards zero divergence, it would seem that synonymous mutations played a greater role in the early stages of the divergence, in contrast to their decreased role in many genes in later stages of the divergence. This is consistent with the initiation of the speciation process predominantly involving changes in the genome phenotype rather than in the conventional phenotype. Such changes precede species establishment, which is likely to predominantly involve changes in the conventional phenotype rather than the genome phenotype.^16-18,32,34,57 These considerations suggest that, while often correlating with d_n, d_s can have a life of its own. Its reliability as a frame-of-reference for d_n remains problematic.

9. X-axis as a Conceptual Time Axis

It may not be immediately apparent that the X-axes in Figures 1b and 2b (DNA divergence) can be viewed as conceptual time axes. The following metaphor may help. When at cruising speed, the efficiency of a vehicle's fuel usage (kilometres/litre) is constant. However, when initially accelerating to that cruising speed, efficiency is less (i.e. fuel usage is greater). Thus, a plot of distance travelled (kilometres) against fuel usage (litres) would appear like the plots of d_n against DNA divergence in Figures 1b and 2b. Around the time of the divergence from an ancestral species (i.e. "acceleration" of a new species "to cruising speed"), synonymous mutations would have been differentially accepted ("high fuel usage relative to distance travelled"). After the divergence ("attainment of cruising speed"), synonymous mutations ("fuel usage"), and amino acid-changing mutations ("distance travelled"), would have been accepted proportionately (Fig. 3).

Fig. 3. Were the first mutations (X) associated with the mouse-rat divergence synonymous (S) or amino acid-changing (A; non-synonymous)? Genes A, B and C have low, intermediate, and high mutation rates, respectively. This is inferred from the number of known substitutions that distinguish the modern genes (bottom three lines). It is assumed that, while species "cruising speed" is maintained, rates of acceptance of synonymous and amino acid-changing mutations are proportionate (i.e. positively correlated). Hypothetical intermediate substitution levels at progressive intervals following the divergence are shown. There are approximately proportionate numbers of synonymous (S) and amino acid-changing (A) mutations in each gene. By extrapolation it is determined from Figures 1 and 2 that synonymous mutations were accepted disproportionately (i.e. X = S) when the species were initially "accelerating" up to "cruising speed."

Synonymous mutations, being non-amino acid changing, are a useful indicator of mutations occurring concomitantly both extragenically, and within introns. These, in concert with intragenic synonymous mutations, would have initiated the divergence process.^17,18 Indeed, since the base compositions of synonymous sites are often closely correlated with those of neighbouring introns and extragenic DNA,⁵⁸ it is likely that they are under similar evolutionary constraints.

10. Conclusions

Numerous authors cited in the Introduction have proclaimed what would be the high virtues of a single-sequence method relative to the comparative method. To this extent, they have attacked the comparative method. This paper has drawn attention to the fact that a single-sequence method, which does not appear to have the defects of the method proposed by Plotkin et al.,¹ has been available for a decade. Thus, at least with respect to positive Darwinian selection, that "the comparative method rules"² is questionable. This should be further investigated by comparing methods that require more than one sequence, with methods that depend on conflict between different levels of information within a single sequence. Claims of sovereignty will be resolved by a balance-sheet of the advantages and disadvantages of the different methods. However, with respect to Darwin's great question, the comparative method provides evidence that the initial divergence between rat and mouse lineages was driven by synonymous base substitutions (i.e. differences in secondary genomic information). This is a key prediction of a non-genic "chromosomal" model for the origin of species,^17-18,23,57,59 and should be further investigated in other lineages.

Acknowledgements

Queen's University hosts my web-pages where full text versions of some of the cited references may be found.

References
1. Plotkin JB, Dushoff J, Fraser HB, Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428:942-945, 2004.
2. Dagan T, Graur D, The comparative method rules! Codon volatility cannot detect positive Darwinian selection using a single genome sequence. Mol Biol Evol 22:496-500, 2004.
3. Friedman R, Hughes AL, Codon volatility as an indicator of positive selection: data from eukaryotic genome comparisons. Mol Biol Evol 22:542-546, 2004.
4. Chen Y, Emerson JJ, Martin TM, Codon volatility does not detect selection. Nature 433:E6-E7, 2005.
5. Hahn MW, Mezey JG, Begun DJ, Gillespie JH, Kern AD, Langley CH, Moyle LC, Codon bias and selection on single genomes. Nature 433:E5-E6, 2005.
6. Nielsen R, Hubisz MJ, Detecting selection needs comparative data. Nature 433:E6, 2005.
7. Zhang J, On the evolution of codon volatility. Genetics 169:495-501, 2005.
8. Sharp PM, Gene "volatility" is most unlikely to reveal adaptation. Mol Biol Evol 22:807-809, 2005.
9. Pillai SK, Pond SLK, Woelk CH, Richman DD, Smith DM, Codon volatility does not reflect selection pressure on the HIV-1 genome. Virology 336:137-143, 2005.
10. Plotkin JB, Dushoff J, Fraser HB, Reply. Nature 433:E7-E8, 2005.
11. Forsdyke DR, Stem-loop potential in MHC genes: a new way of evaluating positive Darwinian selection? Immunogenetics 43:182-189, 1996.
12. Forsdyke DR, A stem-loop "kissing" model for the initiation of recombination and the origin of introns. Mol Biol Evol 12:949-958, 1995.
13. Forsdyke DR, Conservation of stem-loop potential in introns of snake venom phospholipase A₂ genes. An application of FORS-D analysis. Mol Biol Evol 12:1157-1165, 1995.
14. Forsdyke DR, Reciprocal relationship between stem-loop potential and substitution density in retroviral quasispecies under positive Darwinian selection. J Mol Evol 41:1022-1037, 1995.
15. Seffens W, Digby D, mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res 27:1578-1587, 1999.
16. Forsdyke DR, Mortimer JR, Chargaff's legacy. Gene 261:127-137, 2000.
17. Forsdyke DR, The Origin of Species, Revisited, McGill-Queen's University Press, Montreal, 2001.
18. Forsdyke DR, Evolutionary Bioinformatics. Springer, New York, 2006.
19. Zuker M, Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31:3406-3415, 2003.
20. Cohen B, Skiena S, Natural selection and algorithmic design of mRNA. J Comput Biol 10:419-432, 2003.
21. Zhang C-Y, Wei J-F, He S-H, Local base order influences the origin of ccr5 deletions mediated by DNA slip replication. Biochem Genet 43:229-237, 2005.
22. Meyer IM, Miklos I, Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs. Nucleic Acids Res 33:6338-6348, 2005.
23. Paz A, Kirzhner V, Nevo E, Korol A, Coevolution of DNA-interacting proteins and genome "dialect." Mol Biol Evol 23:56-64, 2005.
24. Andolfatto P, Adaptive evolution of non-coding DNA in Drosophila. Nature 437:1149-1152, 2005.
25. Xing Y, Lee C, Can RNA selection pressure distort the measurement of Ka/Ks? Gene 370:1-5, 2006.
26. Chamary JV, Parmley JL, Hurst LD, Hearing silence: non-neutral evolution at synonymous sites in mammals. Nature Rev Genet 7:98-108, 2006.
27. Forsdyke DR, Selective pressures that decrease synonymous mutations in Plasmodium falciparum. Trends Parasitol 18:411-418, 2002.
28. Novella IS, Zarate S, Metzger D, Ebendick-Corpus BE, Positive selection of synonymous mutations in vesicular stomatitis virus. J Mol Biol 342:1415-1421, 2004.
29. Hughes AL, Friedman R, Variation in the pattern of synonymous and nonsynonymous difference between two fungal genomes. Mol Biol Evol 22:1320-1324, 2005.
30. Wyckoff GJ, Malcom CM, Vallender EJ, Lahn BT, A highly unexpected strong correlation between fixation probability of nonsynonymous mutations and mutation rate. Trends Genet 21:381-385, 2005.
31. Hughes AL, Adaptive Evolution of Genes and Genomes. Oxford University Press, New York, 1999.
32. Schaap T, Dual information in DNA and the evolution of the genetic code. J Theor Biol 32:293-298, 1971.
33. Ball LA, Mutual influence of the secondary structure and information content of a messenger RNA. J Theor Biol 41:243-247, 1973.
34. Lee S-J, Mortimer JR, Forsdyke DR, Genomic conflict settled in favour of the species rather than of the gene at extreme GC% values. Appl Bioinf 3:219-228, 2004.
35. Paz A, Mester D, Baca I, Nevo E, Korol A, Adaptive role of increased frequency of polypurine tracts in mRNA sequences of thermophilic prokaryotes. Proc Natl Acad Sci USA 101:2951-2956, 2004.
36. Shabalina SA, Ogurtsov AY, Spiridonov NA, A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res 34:2428-2437, 2006.
37. Pizzi E, Frontali C, Low-complexity regions in Plasmodium falciparum proteins. Genome Res 11:218-229, 2001.
38. Cristillo AD, Mortimer JR, Barrette IH, Lillicrap TP, Forsdyke DR, Double-stranded RNA as a not-self alarm signal: to evade, most viruses purine-load their RNAs, but some (HTLV-1, Epstein-Barr) pyrimidine-load. J Theor Biol 208:475-489, 2001.
39. Xue HY, Forsdyke DR, Low complexity segments in Plasmodium falciparum proteins are primarily nucleic acid level adaptations. Mol Biochem Parasitol 128:21-32, 2003.
40. Rayment JH, Forsdyke DR, Amino acids as placeholders: base composition pressures on protein length in malaria parasites and prokaryotes. Appl Bioinf 4:117-130, 2005.
41. Forsdyke DR, Madill CA, Smith SD, Immunity as a function of the unicellular state: implications of emerging genomic data. Trends Immunol 23:575-579, 2002.
42. Kurland CG, Major codon preference: theme and variation. Biochem Soc Trans 21:841-846, 1993.
43. Akashi H, Gene expression and molecular evolution. Curr Opin Genet Devel 11:660-666, 2001.
44. Hartl DL, The origin of malaria: mixed messages from genetic diversity. Nat Rev Microbiol 2:15-22, 2004.
45. Bains W, Codon distribution in vertebrate genes may be used to predict gene length. J Mol Biol 197:379-388, 1987.
46. Graur D, Li W-H, Fundamentals of Molecular Evolution, Sinauer Associates, Sunderland, pp. 113-115, 2000.
47. Wolfe KH, Sharp PM, Mammalian gene evolution: nucleotide sequence divergence between mouse and rat. J Mol Evol 37:441-456, 1993.
48. Mouchiroud D, Gautier C, Bernardi G, Frequencies of synonymous substitutions in mammals are gene-specific and correlated with frequencies of nonsynonymous substitutions. J Mol Evol 40:107-113, 1995.
49. Comeron JM, Kreitman M, The correlation between synonymous and nonsynonymous substitutions: mutation, selection or relaxed constraints? Genetics 150:767-775, 1998.
50. Williams EJB, Hurst LD, Is the synonymous substitution rate in mammals gene specific? Mol Biol Evol 19:1395-1398, 2002.
51. Bazykin GA, Kondrashov FA, Ogurtsov AY, Sunyaev S, Kondrashov AS, Positive selection at sites of multiple amino acid replacements since rat-mouse divergence. Nature 429:558-562, 2004.
52. Makalowski W, Boguski MS, Evolutionary parameters of the transcribed mammalian genome. An analysis of 2820 orthologous rodent and human sequences. Proc Natl Acad Sci USA 95:9407-9412, 1998.
53. Friedman R, Hughes AL, The pattern of nucleotide difference at individual codons among mouse, rat and human. Mol Biol Evol 22:1285-1289, 2005.
54. Dagan T, Talmor Y, Graur D, Ratios of radical to conservative amino acid replacement are affected by mutational and compositional factors and may not be indicative of positive Darwinian selection. Mol Biol Evol 19:1022-1025, 2002.
55. Alvarez-Valin F, Jabbari K, Bernardi G, Synonymous and nonsynonymous substitutions in mammalian genes: intragenic correlations. J Mol Evol 46:37-44, 1998.
56. Poon A, Chao L, The rate of compensatory mutation in the DNA bacteriophage ΦX174. Genetics 170:989-999, 2005.
57. Ironside JE, Filatov DA, Extreme population structure and high interspecific divergence of the Silene Y chromosome. Genetics 171:705-713, 2005.
58. Aota S-I, Ikemura T, Diversity in G + C content at the third position of codons in vertebrate genes and its cause. Nucleic Acids Res 14:6345-6355, 1986.
59. Forsdyke DR, William Bateson, Richard Goldschmidt, and non-genic modes of speciation. J Biol Syst 11:341-350, 2003.

End Note 2007 (not in the published paper)

In April 2004 Nature, presumably after an exhaustive review process, published the paper of Plotkin and coworkers claiming a paradigm shift:

"Within the comparative paradigm, it would be impossible to measure selective pressures on the basis of a genome sequence from a single individual. Here we present a method for rapidly detecting differential selective pressures on genes by inspecting a single genome sequence."

Shortly thereafter papers flooded in, both to Nature and elsewhere, criticizing the Plotkin approach. Given the great interest in a single-sequence method, one might have thought that my respectful reminder (the above paper) that a single-sequence method had been "out there" for at least a decade would have been favourably received. Not so (see Table below). Had the above paper been published expeditiously, you, dear reader, would have been able to access it in the early summer of 2005. As it is, various members of the biomedical research establishment, acting as reviewers, had privileged access to it for two years ahead of you! In this period they have probably declined many papers that you may not need to read. Thus, through their sifting, they have done you a good turn. But, as the peer-review section of these web-pages shows, there is a flip-side to this. Some potentially important babies can be lost with the bathwater!

The reviewers' comments were barely cogent, and I will not burden you with them. Most annoying was the retrospective discovery that Trends in Genetics, while toying with my proposal (2 months) and then requesting that the paper be forced into a 2500-word straitjacket, was simultaneously reviewing and accepting a paper of Wyckoff, Malcom, Vallender and Lahn (University of Chicago) on positive Darwinian selection. These authors replotted the well-known tendency of d_s values to curve to the right when DNA divergences are high (e.g. Figs. 1b, 2b) in such a way that the d_a/d_s ratio was shown to be positively correlated with d_s. This was described as "a surprising finding" and "highly unexpected," for "neither classical theories nor previous studies predicted a strong positive correlation between -- [the K_a/K_s ratio] and K_s that is evident in our data." Of course, due to the decrease in relative numbers of synonymous mutations as divergence increases, it is obvious that the d_a/d_s ratio [the K_a/K_s ratio] increases and so is positively correlated with d_s [K_s]. The authors also confused the correlation coefficient (r) with the coefficient of determination (r²). Through 2005 and early 2006, as more and more commentaries on the Plotkin paper appeared in the literature, my paper expanded in length, but the essential message remained unchanged.
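
The reasoning in the preceding paragraph can be illustrated with a minimal sketch. It is not taken from the paper or from the Wyckoff data: the saturating form assumed for d_s, the linear form assumed for d_a, and the parameter values (`s_max`, `tau`, `slope`) are hypothetical, chosen only to mimic a d_s curve that flattens as divergence increases. Under those assumptions the d_a/d_s ratio rises together with d_s, and the sketch also shows that r and r² are distinct quantities.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def toy_rates(divergence, s_max=0.6, tau=0.15, slope=0.4):
    """Hypothetical toy model (an assumption, not the paper's data):
    d_s saturates towards s_max while d_a keeps rising linearly."""
    d_s = s_max * (1.0 - math.exp(-divergence / tau))
    d_a = slope * divergence
    return d_a, d_s


# Illustrative divergence values from 0.02 to 0.50 (arbitrary units).
divergences = [0.02 * i for i in range(1, 26)]
pairs = [toy_rates(t) for t in divergences]
d_s_values = [d_s for _, d_s in pairs]
ratios = [d_a / d_s for d_a, d_s in pairs]

r = pearson_r(d_s_values, ratios)  # correlation coefficient
r_squared = r ** 2                 # coefficient of determination

print(f"r   (correlation coefficient)       = {r:.3f}")
print(f"r^2 (coefficient of determination)  = {r_squared:.3f}")
```

Because d_s and the ratio both increase with divergence in this toy model, r is positive by construction; since r lies between 0 and 1 here, r² is smaller than r, so reporting one in place of the other changes the apparent strength of the relationship.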

Journal	First submitted proposal, or paper	Revised version, or paper	Date declined	External review	Manuscript number
Mol. Biol. Evol.	19 Feb 05		21 Feb 05	No	05-0109
Trends in Genetics	21 Feb 05 (proposal)	25 April 05 (2500 words)	15 June 05	Yes	0118
Nature Reviews Genetics	8 April 05 (proposal)		31 May 05	No	05-143 V1
Nature	16 June 05 (proposal)		17 June 05	No	06-06967
Science	17 June 05 (proposal)		28 June 05	No	63683
Proc. Natl. Acad. Sci.	1 July 05 (proposal)		8 July 05	No	2005-05558
Gene	13 July 05		16 Feb 06	Yes	D-05-00269
J. Theor. Biol.	23 Feb 06		23 May 06	Yes	D-06-00108
J. Biol. Systems	25 May 06		18 Aug 06 (Accepted!)	Yes	(Published March 07!)

Usually new results first appear in scientific papers and, only much later, in books. Due to the two-year delay, the results presented in the present paper were available in my text Evolutionary Bioinformatics (released by Springer in Nov 2006) some months before the formal publication of the paper in the Journal of Biological Systems! Readers who are inclined to be cynical (of course I don't mean you) should note that I have long been an "advisor" to this journal, but I have no reason to believe that its reviewing procedures are not adequate.

I have long passed being concerned with "publish or perish." So, given all this trouble, why bother? The first reason is that if one thinks one is saying something important one should try (indeed one has a responsibility) to get it heard above the babble. Publication in a high-profile journal achieves this. Second, the reviews (albeit usually unhelpful) allow one to "touch base." If there were some frightful error there is a chance a reviewer would spot it (although the Wyckoff case appears to demonstrate the opposite). Third, it provides you, dear reader, with some assurance that it has passed an author-independent quality check.

However, the stone-walling experienced with the above paper is more the rule than the exception. I suspect I, like many others, will be increasingly driven to choose to place my works in an institutional depository (e.g. these web-pages) and then move on. Life is just too short! So readers might anticipate that, failing drastic system reforms, the body of original work in institutional depositories is likely to grow. Fortunately, key-word searches, etc., now allow you to recover from such sources what is relevant to your needs. Those who omit such searches may find themselves reinventing wheels.

End Note 2009

Papers that appear supportive of the above thesis are becoming more evident in the mainstream literature. Thus, a paper from the Ellegren laboratory notes that genes that are candidates for being implicated in positive selection often "have an unexpectedly low number of synonymous substitutions compared with the genome background." So now it should be recognized "that inconsistencies in the behavior of dN/dS are to be expected" since "this behavior may be inherent to taking the ratio of two randomly distributed variables that are nonlinearly correlated."

Wolf, J. B. W. et al. (2009) Nonlinear dynamics of nonsynonymous (dN) and synonymous (dS) substitution rates affects inference of selection. Genome Biology & Evolution 1, 308-319.

Also, there is a convergence of the above bioinformatic analysis (see Section 9) with independent phylogenetic analysis. Venditti and Pagel (2008, 2009) postulate "accelerated rates of evolution following speciation" and hence deduce that an evolutionary line with more speciation events (nodes in the phylogenetic tree) might display an overall rate of evolution greater than an evolutionary line with fewer speciation events. Using data similar to the above, but plotting in terms of path-lengths and node number, they obtain evidence supporting this deduction. For this they assume a regular molecular "clock," so path lengths increase when the number of mutations separating two diverging lines increases. Thus: "Accelerated rates of evolution at the time of speciation are expected to leave a distinctive signature on a phylogenetic tree." They appear to be in agreement with the main theses of these web-pages:

(i) the primary nature of reproductive isolation in the speciation process,
(ii) the important role of synonymous mutations,
(iii) the relevance of Gould's punctuated evolution ideas, and
(iv) the parallels between the evolution of languages and of species.

However, they state that "previous studies have not made any serious attempt to determine whether the accelerated rates of change occur predominantly in neutral or coding sites or some combination of the two."

Venditti, C. & Pagel, M. (2008) Evolution by fits and starts. The Biologist 55, 140-146.

Venditti, C. & Pagel, M. (2009) Speciation as an active force in promoting genetic evolution. Trends in Ecology & Evolution 25, 14-20.

End Note Jan 2010

It should be noted that the data of Wolfe and Sharp (1993) showing that, early in the speciation process, there were changes in bases not critical for protein-encoding (here Figure 1), were previously plotted more schematically (Figure 4 of Forsdyke 1996). The present form of data presentation is perhaps more intuitive [indeed, such plots were later referred to by some as "Forsdyke plots." DRF 2020]. See also the end notes to my other papers on speciation.

End Note March 2011

The above-noted observation of Wyckoff and his coworkers (2005) continued to be regarded as "highly unexpected" (Vallender & Lahn 2007) and "unexpected" (Stoletzki & Eyre-Walker 2011).

Vallender EJ, Lahn BT (2007) Uncovering the mutation-fixation correlation in short lineages. BMC Evolutionary Biology 7, 168.

Stoletzki N, Eyre-Walker A (2011) The positive correlation between dN/dS and dS in mammals is due to runs of adjacent substitutions. Molecular Biology & Evolution 28, 1371-1380.

End Note Feb 2013

Much of the work on the relationship between rate of evolution and nucleic acid structure, as reported in these web-pages, has been confirmed by Park et al. (2013). For example, a major conclusion is that "amino acid substitution rate is negatively correlated with mRNA folding strength" or "amino acid substitutions are slower as the mRNA folding strength increases." The latter would be equivalent to "amino acid substitutions are faster as the mRNA folding strength decreases," which is basically the point I made in the 1990s (refs 11-14 above). However, while their data look good, their interpretations differ from mine. The following abstract on "Significance" will give the flavour of their study:

"The expression level of a gene is a leading determinant of its rate of protein sequence evolution, but the underlying mechanisms are unclear. We show that as the mRNA concentration increases, natural selection for mRNA folding intensifies, resulting in a larger fraction of mutations deleterious to mRNA folding and lower rates of protein evolution. Counterintuitively, selection for mRNA folding also impacts the nonsynonymous-to-synonymous nucleotide substitution rate ratio, requiring a revision of the current interpretation of this ratio as a measure of protein-level selection. These findings demonstrate a prominent role of selection at the mRNA level in molecular evolution."

The possibility that mRNA has structure by default, because the encoding DNA needs to have structure, is not considered. Nevertheless it is a bold attempt at bringing to order numerous disparate observations. The subject also touches our work on X-chromosome dosage compensation, which is dealt with elsewhere on these pages.

Park C, Chen X, Yang J-R, Zhang J (2013) Differential requirements for mRNA folding partially explain why highly expressed proteins evolve slowly. Proceedings of the National Academy of Sciences USA, in press. doi:10.1073/pnas.1218066110

End Note Aug 2014

There was a follow-up paper from the Zhang laboratory (Yang et al. 2014). As it appeared in PLOS Biology, I was able to comment directly:

DNA structure trumps RNA structure
Posted by forsdyke on 31 Jul 2014 at 22:37 GMT

Yang et al. (2014) suggest that their results 'explain why highly expressed genes tend to have strong mRNA folding, slow translation elongation, and conserved protein sequences.' However, mRNA structure largely reflects the potential of the corresponding DNA to extrude similar structure, so departing from the classic duplex mode. Such stem-loop extrusion potential is widespread in genomes, is usually impaired in genic regions compared with non-genic regions, and is less impaired in conserved genes that are not evolving rapidly under positive Darwinian selection (Forsdyke, 1995a, b, c). Thus, 'strong mRNA folding' can be by default, mainly because the corresponding gene retains a potential for strong DNA folding. In protein-coding regions there is a conflict between the needs of nucleic acid structure and coding, which can often only be partially resolved by choice of alternative codons.

Within these constraints, there may indeed be sufficient flexibility to modulate translational elongation as the fine paper of Yang et al. so nicely shows, but it would seem misleading to imply that mRNA folding primarily serves this role. Furthermore, the higher conservation of the proteins of highly expressed genes could reflect their dual specific and collective roles, whereas the proteins of lowly expressed genes may mainly play specific roles (Forsdyke, 2012). Thus, there are two sources of negative selection pressure on genes with highly expressed products, and only one source of negative selection pressure on genes with lowly expressed products. Hence the greater conservation of the former.

The paper of Yang et al. (2014) concludes by noting that 'The enigmatic positive correlation between gene expression level and mRNA folding strength, at least partially results from selection for slower translational elongation of more abundant mRNAs to minimize mistranslation.' This conclusion is less dogmatic than that of an earlier paper (Park et al. 2013), which notes 'a major role of natural selection at the mRNA level in constraining protein evolution.'

Forsdyke DR (1995a) A stem-loop 'kissing' model for the initiation of recombination and the origin of introns. Mol Biol Evol 12: 949-958.
Forsdyke DR (1995b) Conservation of stem-loop potential in introns of snake venom phospholipase A2 genes: an application of FORS-D analysis. Mol Biol Evol 12: 1157-1165.
Forsdyke DR (1995c) Relative roles of primary sequence and (G+C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species. J Mol Evol 41: 573-581.
Forsdyke DR (2012) Functional constraint and molecular evolution. In: Encyclopedia of Life Sciences. Chichester: John Wiley.
Park C, Chen X, Yang J-R, Zhang J (2013) Differential requirements for mRNA folding partially explain why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA 110: E678-686.
Yang J-R, Chen Z, Zhang J (2014) Codon-by-codon modulation of translational speed and accuracy via mRNA folding. PLOS Biol 12: e1001910.

RE: DNA structure trumps RNA structure
forsdyke replied to forsdyke on 09 Aug 2014 at 13:48 GMT

To an email request for clarification of the above comment, the following reply was sent:

Benefit of strong DNA folding. That there is strong DNA folding, which can be presumed to have been sustained by selection, is a fact. My bioinformatics studies seemed most in keeping with the hypothesis of Kleckner and Weiner (1993. Potential advantages of unstable interactions for pairing of chromosomes in meiotic, somatic and premeiotic cells. Cold Spring Harbor Symp. Quant. Biol. 58:553-565). This is that 'kissing' interactions between loops favor recombination and, since recombination is usually advantageous, there has been selection for this structure. Importantly, this involves changes in base order to support structure. This creates problems in protein-coding regions and 'the hand of evolution' has to arrive at a compromise that is most appropriate for an organism. As a consequence, there may be less structure-potential in coding regions than in non-coding regions, especially in genes that are rapidly evolving, where there is more pressure to keep the protein optimized, and structure-potential must either be sacrificed, or directed to neighboring introns.

Why highly expressed genes need strong DNA folding. As a secondary consequence of this, another benefit appears. If a DNA sequence has a high stem-loop potential, then it is energetically easier for the two strands of the duplex to separate. The stem-loop conformation decreases the strain imposed by negative superhelical winding, and hence should facilitate the helix-opening needed for transcription. Highly expressed genes benefit from this more than lowly expressed genes, so there is more pressure for the former to retain structure in coding regions, despite the pressure to encode a unique amino acid sequence. Another consequence of the structure-potential is that the 'top' non-transcribed strand is free to form duplex stems, and hence to partially protect itself from mutagenic attack. The 'bottom' transcribed strand is less protected in this respect. A highly expressed gene is more vulnerable, so is more in need of such protection.

This page was released in February 2007 and was last edited 13 Nov 2020 by Donald Forsdyke