HIV Analysis

Reciprocal Relationship between Stem-Loop Potential and Substitution Density in Retroviral Quasispecies under Positive Darwinian Selection

Donald R. Forsdyke

Journal of Molecular Evolution (1995) 41, 1022-1037 [With copyright permission from Springer.]

Keywords: Stem-loop - Base substitution - Base indel - Recombination - Speciation

Diploid HIV-1 virus particles. Two RNA genomes come together for packaging. This process is analogous to meiosis in that there is a homology search and "kissing" interactions between the tips of stem-loops in the "encapsidation site". Unlike meiosis there may be no flexibility in choice of site for pairing interaction.

Unlike most viruses (haploid), retroviruses are diploid (two copies per virus particle).

Abstract

Introduction

Methods

FORS-D Analysis: Theory and Practice
Determination of Base Substitution and Base Indel Densities

Results

FORS-D Plots for Divergent HIV-1 Genomes
Substitution Patterns Vary with Difference from Prototype
Negative Correlation between FORS-D and Substitutions
No Relationship between FORS-D Value and (G+C)%

Discussion

Codons and Stem-Loops are Local "Strategies"
FORS-D Values are Small but Highly Informative
Paradox Explained by Positive Darwinian Selection?
Accelerated Speciation Model
Importance of (G+C)% Differences in Speciation

End Note (June 2007)

End Note (March 2011)

End Note (May 2011)

End Note (Aug 2013)

Abstract. Nucleic acids have the potential to form intrastrand stem-loops if complementary bases are suitably located. Computer analyses of poliovirus and retroviral RNAs have revealed a reciprocal relationship between "statistically significant" stem-loop potential and "sequence variability". The statistically significant stem-loop potential of a nucleic acid segment has been defined as a function of the difference between the folding energy of the natural segment (FONS) and the mean folding energy of a set of randomized (shuffled) versions of the natural segment (FORS-M). Since FONS is dependent on both base composition and base order, whereas FORS-M is solely dependent on base composition (a genomic characteristic), it follows that statistically significant stem-loop potential (FORS-D) is a function of base order (a local characteristic).

In retroviral genomes, as in all DNA genomes studied, positive FORS-D values are widely distributed. Thus there have been pressures on base order both to encode specific functions and to encode stem-loops. As in the case of DNA genomes under positive Darwinian selection pressure, in HIV-1 specific function appears to dominate in rapidly evolving regions. Here high sequence variability, expressed as substitution density (not indel density), is associated with negative FORS-D values (impaired base order-dependent stem-loop potential). This suggests that in these regions HIV-1 genomes are under positive selection pressure by host defences.

The general function of stem-loops is recombination. This is a vital process if, from among members of viral "quasispecies", functional genomes are to be salvaged. Thus, for rapidly evolving RNA genomes, it is as important to conserve base-order dependent stem-loop potential, as to conserve other functions.

Introduction

Nucleic acids have the potential to form intra-strand stem-loop structures if complementary nucleotides are suitably located (Murchie et al. 1992). Theoretical optimum secondary structures of minimum free energy may be derived using appropriate computer algorithms (Zuker, 1989). These have been adapted to determine the distribution of "statistically significant" stem-loop potential along RNA molecules (Le and Maizel, 1989). The distribution of accepted mutations in nucleic acids can be correlated with structural features (Min Jou and Fiers, 1976; Salser, 1977). Analyses of RNA virus genomes reveal a positive correlation between low stem-loop potential and high "sequence variability". This applies to poliovirus (Currey et al. 1986), visna virus (a sheep retrovirus; Braun et al. 1987), and the human retrovirus HIV-1 (Le et al. 1988, 1989).

A number of questions arise.

Since statistical significance and biological significance are not necessarily the same (Karlin and Brendel, 1992), what does "statistically significant" stem-loop potential mean in terms of nucleic acid function?
In scoring "sequence variability", base indels (insertions or deletions) were treated as a special kind of residue to be dealt with in the same way as normal base substitutions (Le et al. 1989). Is it valid to treat these two forms of variation collectively?
Finally, what is the biological significance of the reciprocal relationship?

Braun and coworkers (1987) suggested that:

"The apparent lack of base pairing may indicate that there are no secondary structural constraints in this [stem-loop] region, which might limit the rate of sequence divergence. Alternatively, this region might be more exposed to chemical mutagens than regions with more secondary structure".

Le and coworkers (1989) asked:

"Is there a structural or functional need to avoid secondary structure which coincidentally leads to regions that are less constrained in the rate of evolution?".

A more likely explanation derives from recent studies of the distribution of stem-loop potential in DNA genes under strong positive "Darwinian" selection. These also show the reciprocal relationship between stem-loop potential and sequence variability (Forsdyke, 1995c). Thus, the relationship may reflect the fact that the RNA genomes are also under strong positive selection pressure (Shpaer and Mullins, 1993).

In this paper an adaptation of the approach of Le and Maizel (1989) is used to study the potential secondary structures of retroviral sequences. In Methods the stability of stem-loops is shown to have base composition-determined and base order-determined components; only the latter corresponds to "statistically significant" stem-loop potential. The Results section is divided into three parts.

In the first part the distributions of stem-loop potential in three highly divergent HIV isolates are compared.
In the second the distributions of base substitutions and base indels are compared similarly.
In the third part correlations between the distributions of stem-loop potential and of substitutions and indels are analysed.

In the Discussion it is suggested that HIV-1 mutates rapidly in response to the positive selection pressure of host immune defences. Due to this extraordinarily high mutation rate, HIV-1 "quasispecies" are under strong pressure to conserve sequence features, such as stem-loops, which facilitate recombination. This allows the correction or repair of mutations. By the use of synonymous codons and by switching between codons for similar amino acids, the more conserved parts of the genome would have adapted to encode both stem-loops and specific functions. Thus base substitutions have been less readily accepted in regions rich in stem-loops than in other regions. The former conserved regions are the most likely regions to be susceptible to therapeutic attack at the nucleic acid level using antisense approaches, or at the protein level using immunological approaches. It is also suggested that the rapid rate of sequence divergence among HIV-1 isolates may assist the understanding of speciation among organisms whose sequences diverge on a geological time scale (Forsdyke, 1995a, 1996).

Methods

FORS-D Analysis: Theory and Practice

The secondary structure of a nucleic acid in single stranded form is very sensitive to small changes in sequence (Orita et al. 1989). Provided the length is not excessive, any such sequence can be analysed using computer programs such as FOLD (Zuker, 1989) to arrive at a theoretical optimum secondary structure of minimum free energy. Because of the greater strength of GC bonds relative to AT bonds, a GC-rich sequence tends to have a more stable structure than an AT-rich sequence of the same length. However, it does not follow that the most stable structures are likely to be of most local functional relevance. The bases in a segment might show poor complementarity, but if the few complementary pairs were GC pairs the stability of the folded molecule might be quite high. Base composition is a genome, or genome sector (isochore), "strategy", which has a major influence on codon choice, particularly of synonymous codons (Grantham et al. 1980). Base composition is not a local "strategy". Confidence that a given secondary structure is of local functional relevance is greater if it can be shown that the sequence has accepted mutations which enhance stem-loop formation (i.e. that the actual sequence of bases, in addition to base composition, has contributed to secondary structure stability).

In principle, this should be possible by comparing the folding of a natural sequence with that of randomized versions of the same sequence. A natural sequence is but one member of a large set of possible sequences with the same base composition. The average characteristics of this set can be arrived at by randomizing the natural sequence. Randomization (shuffling; Fitch, 1983) destroys information present in the primary sequence (base order), without changing base composition or sequence length. Thus, provided length is kept constant, average characteristics reflect base composition alone. The character under study can be:

secondary structure, as in the present work, or even
phylogenetic trees as recently described by Bronson and Anderson (1994).
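The property of shuffling relied on here (base order is destroyed while length and base composition are unchanged) can be illustrated with a minimal Python sketch. It is not the SHUFFLE program used in this work; the function name `shuffle_sequence`, the fixed seed and the short made-up segment are assumptions introduced purely for illustration.

```python
import random
from collections import Counter


def shuffle_sequence(seq, rng):
    """Return a randomized copy of seq with the same length and base composition."""
    bases = list(seq)
    rng.shuffle(bases)  # in-place permutation of the bases
    return "".join(bases)


rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
natural = "GGGAAACCCUUUGGGAAACCC"  # made-up segment, not a real HIV sequence
shuffled = shuffle_sequence(natural, rng)

# Length and base composition survive shuffling; only base order is lost.
same_length = len(shuffled) == len(natural)
same_composition = Counter(shuffled) == Counter(natural)
```

Because only base order differs between the natural and the shuffled segment, any average difference in a property between them can be attributed to base order.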

If base order has been adapted over evolutionary time to enhance the stem-loop potential of a sequence, then it may not so readily serve other functions, such as the encoding of a protein. Thus, there is the possibility of a conflict between different base order-determined functions. A sequence may not have been able to optimize simultaneously both its stem-loop potential and its protein-encoding potential. One way this conflict could have been resolved in organisms which were not under selection pressure to limit genome size, would have been to allow regions of protein-encoding potential to arise in small segments (exons), interrupted by sequences of high base order-determined stem-loop potential (introns; Forsdyke, 1995b).

Programs of the Genetics Computer Group Inc. (Gribskov and Devereux, 1991) were made available on line through the services of the Molecular Biology Data Service of the National Research Council, Ottawa. The program FOLD was used to find an RNA secondary structure of minimum free energy, using the energy values for base stacking and loop-destabilization assigned by Turner et al. (1988). The program SHUFFLE was used to generate random sequences. A Unix script program (SHUFFOLD) was written to apply these two programs to successive 200 nt windows, overlapping by 150 nt. Another Unix script program (STATS) subjected the output of SHUFFOLD to statistical analysis, using the Minitab system (Ryan and Joiner, 1994).
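The windowing scheme (200 nt windows overlapping by 150 nt, so that each window starts 50 nt after the preceding one) can be sketched as follows. This is an illustrative reconstruction, not the SHUFFOLD script itself; the function name `window_spans` is hypothetical, and the decision to drop any trailing partial window is an assumption.

```python
def window_spans(seq_len, window=200, overlap=150):
    """Return 1-based inclusive (start, end) spans of successive full-length windows.

    Assumption: a trailing partial window shorter than `window` is dropped.
    """
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the window size")
    return [(s + 1, s + window) for s in range(0, seq_len - window + 1, step)]


# Spans for a sequence of 9718 nt (the length of the GenBank entry HIVHXB2CG).
spans = window_spans(9718)
first_span = spans[0]   # (1, 200)
second_span = spans[1]  # (51, 250)
```

Under these assumptions the window boundaries fall on multiples of 50, which agrees with window coordinates such as nt 1901-2100 quoted in the Results.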

For each 200 nt window FOLD first determines the minimum free energy value for folding of the natural sequence (FONS value). This is a function of both base composition and base order and measures the "total stem loop potential" of a region. Then ten random sequences are generated from the same sequence and each randomised sequence is submitted to FOLD. The mean minimum free energy value for the 10 sequences (FORS-M value), provides a measure of the contribution of base composition alone to the stem-loop potential ("base composition-determined stem-loop potential"). Since the FONS value is usually more negative than the FORS-M value, the difference between the two values (FORS-M less FONS) is usually positive. This difference (the "FORS-D" value), provides a measure of the contribution of base order alone to the stem-loop potential.
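The calculation for a single window can be sketched in Python as below. The sketch is illustrative only. The folding step is passed in as a function, and `toy_fold_energy` is a hypothetical stand-in that scores a single mirror-image hairpin with made-up pair energies; it is not the FOLD program and does not use the Turner energy values. The names `fors_d_analysis` and `TOY_PAIR_ENERGY` are likewise assumptions.

```python
import random
from statistics import mean

# Made-up pair energies for the toy model only (NOT the Turner et al. values).
TOY_PAIR_ENERGY = {("G", "C"): -3.0, ("C", "G"): -3.0,
                   ("A", "U"): -2.0, ("U", "A"): -2.0}


def toy_fold_energy(seq):
    """Toy stand-in for a folding program: pair base i with its mirror position."""
    n = len(seq)
    return sum(TOY_PAIR_ENERGY.get((seq[i], seq[n - 1 - i]), 0.0)
               for i in range(n // 2))


def fors_d_analysis(seq, fold_energy, n_shuffles=10, rng=None):
    """Return (FONS, FORS-M, FORS-D) for one window."""
    if rng is None:
        rng = random.Random()
    fons = fold_energy(seq)  # folding energy of the natural sequence
    shuffled_energies = []
    for _ in range(n_shuffles):
        bases = list(seq)
        rng.shuffle(bases)  # same composition, randomized order
        shuffled_energies.append(fold_energy("".join(bases)))
    fors_m = mean(shuffled_energies)  # base composition-determined component
    return fons, fors_m, fors_m - fons  # FORS-D = FORS-M less FONS


segment = "GGGGAAAAUUUUCCCC"  # made-up perfect hairpin under the toy model
fons, fors_m, fors_d = fors_d_analysis(segment, toy_fold_energy,
                                       rng=random.Random(1))
```

In this toy case the natural segment already has the lowest energy attainable for its base composition, so the shuffled versions cannot be more stable and FORS-D cannot be negative. With a real folding program, negative FORS-D values are possible, as discussed in the text.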

Thus, a positive FORS-D value defines and quantitates the "base order-determined stem-loop potential". This closely corresponds to the "segment score" of Le and Maizel (1989), which is used to assess "statistical significance" (segment score is FONS less FORS-M, divided by the standard deviation of FORS-M). A negative FORS-D value in a region may mean that base-order has been adapted to serve some other potential, rather than stem-loop potential.
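The relationship between the two measures can be written out explicitly. In the notation below, which is introduced here for illustration, $\sigma$ denotes the standard deviation of the folding energies of the randomized sequences, and the numerical values are hypothetical, chosen only to show the signs involved.

```latex
\begin{align*}
\text{FORS-D} &= \text{FORS-M} - \text{FONS} \\
\text{segment score} &= \frac{\text{FONS} - \text{FORS-M}}{\sigma}
                      = -\,\frac{\text{FORS-D}}{\sigma} \\[4pt]
\text{e.g. } \text{FONS} = -50,\ \text{FORS-M} = -46,\ \sigma = 2:\quad
\text{FORS-D} &= -46 - (-50) = +4\ \text{kcal/mol}, \qquad
\text{segment score} = \frac{-50 - (-46)}{2} = -2
\end{align*}
```

Thus a positive FORS-D value corresponds to a negative segment score, and the two differ only in sign and in scaling by $\sigma$.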

It is probable that stem-loop structures are extruded from supercoiled DNA molecules as part of the homology search preceding meiotic recombination (Crick, 1971; Sobell, 1972; Wagner and Radman, 1975; Doyle, 1977; Hawley and Arbel, 1993). Stem-loops may also be involved in recombination between RNA molecules (Romanova et al. 1986; Tolskaya et al. 1987). From studies of the pairing of complementary sense and antisense RNA molecules (Tomizawa, 1984), it is proposed that homologous chromosomes recognize each other as the result of "kissing" interactions between the tips of stem-loops (Kleckner et al. 1991; Kleckner and Weiner, 1993; Klein, 1994).

The size of the extruded segment should approximate or exceed the size of the "minimum efficient processing segment" involved in recombination, which for mammalian cells approaches 250 nt (Radman et al. 1993). This consideration, and the exponential increase in computation costs as window size increases, prompted the use of window sizes of 200 nt in the present work. Preliminary comparisons of known sequence features with FORS-D values indicate that 100 nt windows are less informative than 200 nt windows. Further details are given in Forsdyke, 1995b-d, which survey the distribution of FORS-D values in a variety of genes.

Determination of Base Substitution and Base Indel Densities

Sequences of different HIV-1 isolates were aligned with the program GENALIGN by the "regions" method using a matching weight of +1.0/base, and a deletion weight (gap penalty) of -0.5/base (Martinez, 1988). The program was accessed through the Bionet on-line computing service (IntelliGenetics Inc., Mountain View, California). Substitutions and indels were counted in successive 200 nt windows, each of which overlapped the preceding window by 150 nt. To determine the effect of deletion weight on substitution densities, the matching weight was kept constant and deletion weights were varied. No correction was made for the possibility of multiple substitutions over evolutionary time.
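The counting step can be sketched in Python, given two sequences that have already been aligned. This is not the GENALIGN program and performs no alignment. It assumes that windows are laid out over alignment columns, that gaps are written as "-", and that each gapped column counts as one indel position; the function name `variation_per_window` and the short example sequences are made up for illustration.

```python
def variation_per_window(ref_aln, other_aln, window=200, step=50, gap="-"):
    """Count substitutions and indel positions in successive alignment windows.

    Returns a list of (1-based window start, substitutions, indels).
    Assumptions: windows run over alignment columns; every column with a gap
    in exactly one sequence counts as one indel position.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(ref_aln) - window + 1, step):
        subs = indels = 0
        pairs = zip(ref_aln[start:start + window], other_aln[start:start + window])
        for a, b in pairs:
            if a == gap and b == gap:
                continue          # uninformative column
            if a == gap or b == gap:
                indels += 1       # insertion or deletion
            elif a != b:
                subs += 1         # base substitution
        results.append((start + 1, subs, indels))
    return results


# Tiny made-up alignment, with non-overlapping 5-column windows for brevity.
ref_aln = "ACGTACGTAC"
other_aln = "ACCTA-GTAA"
counts = variation_per_window(ref_aln, other_aln, window=5, step=5)
```

As in the text, no correction for multiple substitutions at the same site is attempted in this sketch.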

Results

FORS-D Plots for Divergent HIV-1 Genomes

FORS-D analysis is the nearest we have to a "Rosetta Stone" permitting the deciphering of the information in nucleic acid sequences. According to the theory of stem-loop stability energetics (see Methods), the distribution of FORS-D values in HIV genomes should point to regions of importance with respect to protein function or regulation. In view of the postulated role of stem-loops in recombination, it was of interest to compare the distribution of stem-loop potentials between divergent HIV-1 genomes. These have some of the characteristics expected of incipient species, so that, depending on the degree of sequence divergence, recombination between them should show varying degrees of impairment.

HIVHXB2 was one of the first HIV-1 genomes to be completely sequenced and is used as the reference sequence in the present work (Ratner et al. 1987). On the basis of sequence-based evolutionary trees, different HIV-1 isolates have been classified into subtypes, with HIVHXB2 being classified as subtype B (Myers et al. 1993). Members of other subtypes usually diverge more from HIVHXB2 than do other members of subtype B. Within a subtype, members usually show similar degrees of divergence from HIVHXB2. In the present work certain regions, marked in figures by fine vertical lines, are chosen to illustrate general points.

The 9718 nt GenBank sequence HIVHXB2CG was submitted to the program SHUFFOLD (see Methods). 200 nt sequence windows, overlapping by 150 nt, were successively examined with respect to the minimum free energy of folding of the natural sequence (FONS value), the mean minimum free energy of folding of 10 randomised versions of the window (FORS-M value), and the difference between these two values (FORS-D value). The latter provides a measure of the extent to which base order contributes to the stability of the optimally folded structure. By subtracting negative FONS values from negative FORS-M values, it is arranged that FORS-D values are usually positive, indicating the presence of base order-dependent stem-loop potential. Error bars provide a guide to the significance of the departure from zero.

Fig. 1. FORS-D analysis of HIV-1 subtype B member HIVHXB2CG (GenBank name of prototype sequence used in this work). Values are for successive 200 nt sequence windows, which begin at nt 1 of the GenBank entry and overlap by 150 nt. Locations of LTRs and genes (shown as boxes) are taken from the GenBank entry. In this sequence the 5' end of the pol gene is uncertain and the vpu gene (dotted box) is believed to be non-functional. The vertical placements of the boxes do not indicate correspondences of reading frames. FORS-D values (top) are shown as filled triangles (± standard errors of means). FONS values are shown as filled circles. FORS-M values are shown as open triangles. LTR boxes are divided into the U3, R and U5 regions by vertical dotted lines. The fine vertical dashed lines in the upper figure indicate, from left to right, the middle of (i) the TARCE element (Blum et al. 1990; Freter et al. 1992), (ii) a sequence 77% conserved between HIV-1 and HTLV-1 (CTATCAAAGCAGTAAGTAGTACATGTAATG; Ratner et al. 1985), (iii) a sequence encoding part of the CD4 recognition sequence in gp120 (marked by a small box in the env box; Lasky et al. 1987), and (iv) the rev response element (RRE; marked by a small box in the env box; Mann et al. 1994). For further details see Methods.

FONS and FORS-M values are plotted together in the lower part of Figure 1. As reported with many other genes (Forsdyke, 1995b), the two values follow each other closely, with FONS values usually being more negative than FORS-M values (indicating greater stability of the stem-loops of the natural sequence). The profiles tend to fluctuate in a regular manner with a periodicity of approximately 0.6 kb. In keeping with the results of Le and coworkers (Le et al. 1988, 1989, 1990), the most negative FONS values are found in the R regions of the long terminal repeats (LTRs), and around nt 7900 in the env gene. It is in these regions that the difference between FORS-M and FONS values is greatest, so that FORS-D values are extremely positive (upper part of Figure 1). It is inferred that in these regions the order of bases has responded to evolutionary forces favouring stem-loop formation. This is consistent with the R region containing a structure which is recognized by the Tat protein (Cullen, 1991), and the env gene containing a structure which is recognized by the Rev protein (Rev-responsive element; RRE; Mann et al. 1994).

Some other positive FORS-D regions could correspond to sites of recognition by proteins concerned with regulating the expression or structure of HIV-1 (Le et al. 1989, 1991). However, positive FORS-D regions are so frequent that the possibility should be considered that most of them have been generated by a genome-wide evolutionary pressure for the selection of mutations which enhance some genome-wide stem-loop-dependent function, such as recombination (Forsdyke, 1995b-d, 1996). The regions with which Tat and Rev associate would serve both the specific functions related to the association, and the recombination function.

Proceeding along the gag gene, positive FORS-D values increase in a step-wise fashion. Conversely, negative FORS-D values become progressively more negative proceeding along the pol gene. Thus, there is a high positive FORS-D value at the 3' end of the gag gene (nt 1901-2100), and a high negative FORS-D value at the 3' end of the pol gene (nt 4801-4950). A slightly higher negative FORS-D value occurs in the env gene (nt 6901-7400). There is a region of sustained negative FORS-D values at the 3' end of the env gene (nt 7901-8700). The sequence here has to encode products of the env, tat and rev genes, and thus may have had less flexibility in responding to pressures for the evolution of stem-loop potential.

The average FORS-D value for all segments in HIVHXB2 is 2.08 kcal/mol (standard error ± 0.38), which is significantly different from zero (P<0.01). This FORS-D value is less than the values found for human genomic regions consisting largely of intergenic DNA (around 4.0 kcal/mol), which are also significantly different from zero (Forsdyke, 1995b). This indicates that in the compact HIV-1 genome other functional constraints have limited stem-loop formation.

Regions with negative FORS-D values are postulated to result from a conflict between evolutionary pressures on base order. A sequence may not have been able to adapt for the simultaneous encoding both of stem-loops and of other functions (see Methods). In HIVHXB2, regions with extremely negative FORS-D values include the 3' end of the U3 regions upstream of the TATA box, which contains various promoter elements (Cullen, 1991). Similar negative FORS-D values are found in the 5' promoter regions of some mammalian genes (Forsdyke, 1995b).

The FORS-D negative pol 4800-4950 region contains the "TA-rich conserved element" (TARCE; marked by first fine vertical line); this has been found both in HIV-1 and in various "immediate-early" cytokine genes (Blum et al. 1990). Part of the element has been shown to play a regulatory role in the expression of immediate-early genes (Freter et al. 1992).
The second vertical line marks the position of a sequence which is conserved between HIV-1 and HTLV-1 (nts 6034-6061; Ratner et al. 1985). This corresponds to a region of slightly negative FORS-D.
The third vertical line marks the env sequence encoding part of the CD4-binding site of the gp120 protein (shown as a box within the env box in the lower part of the figure; Lasky et al. 1987). FORS-D values in this region are slightly positive. This suggests that by the appropriate choice of synonymous codons and of codons for conservative amino acids, the region has been able to retain both protein-encoding potential and a small degree of base order-dependent stem-loop potential (Forsdyke, 1995b).

A highly cytopathic strain of HIV-1 (HIVNDK) isolated from a Zairian patient, is classified as subtype D. Although the sequence differs considerably from HIVHXB2 (i.e. 808 substitutions), Spire and coworkers (1989) concluded that only minor sequence differences were responsible for the increase in virulence.

Fig. 2. FORS-D analysis of HIV-1 subtype group D member HIVNDK. Only parts of the LTRs are present in the GenBank sequence. The vpu gene is considered operational. Other details are as in Figure 1.

Figure 2 shows that, although FONS and FORS-M plots are quite similar to those for HIVHXB2, the FORS-D plots differ in several respects. The high negative FORS-D regions at the end of the pol gene and in the middle of the env gene are more spread out in HIVNDK. The region of negative FORS-D at the end of the env gene is less spread out in HIVNDK.

Fig. 3. FORS-D analysis of HIV-1 subtype group O member HIVMVP5180. The vpu gene is considered operational. Other details are as in Figure 1.

Figure 3 shows similar plots for a member of subtype O (HIVMVP5180), which has 2972 substitutions relative to HIVHXB2 (Gurtler et al. 1994). FONS and FORS-M plots again do not differ much from those shown in Figures 1 and 2. The stepwise ascending positive FORS-D values found over the gag gene in HIVHXB2, and to some extent in HIVNDK, are not found in HIVMVP. However, the stepwise increases in negative FORS-D values over the pol gene are still present, and more clearly divide the gene into three central high FORS-D domains. Proceeding from the 3' end of pol to the vif gene, FORS-D values progressively increase in step-wise fashion, culminating in a high value in vif which is not seen in the vif of HIVHXB2, and is less evident in the vif of HIVNDK.

The sequence conserved between HIV-1 and HTLV-1 (marked by the second vertical line) corresponds to near zero FORS-D in HIVMVP, and negative FORS-D in HIVNDK and HIVHXB2. A part of the gp120 protein which interacts with the CD4 cell receptor corresponds to a region of positive FORS-D in HIVMVP (Fig. 3), negative FORS-D in HIVNDK (Fig. 2) and slightly positive FORS-D in HIVHXB2 (Fig. 1). Extremely negative FORS-D values are not seen in the middle of the env gene of HIVMVP. These different FORS-D profiles reflect differences in stem-loop potential, which might account for restrictions in recombination between different HIV-1 isolates (see Discussion).

Fig. 4. Density of base substitutions relative to HIVHXB2 for various HIV-1 isolates. Substitutions are for the same 200 nt overlapping windows as in Figure 1. Members of subtype B are HIVBH102 and HIVPV22 (upper B set), and HIVSF2, HIVMN and HIVJRCSF (lower B set). Members of subtype D are HIVNDK, HIVELI and HIVZ2Z6. The subtype A member is HIVU55A. Members of subtype O are HIVANT70C and HIVMVP5180. The boxes and fine vertical lines are as in Figure 1. For further details see Methods and Table 1.

Substitution Patterns Vary with Difference from Prototype

The distribution of accepted mutations in retroviral genomes is not random (Huet et al. 1990). Figure 4 shows substitutions in the same 200 nt windows as were used in Figure 1, for members of HIV-1 subtypes B, D, A and O. As indicated by differences in the ordinate scales, the total number of substitutions relative to the subtype B member HIVHXB2 increases progressively from the upper figure downwards. The overall patterns show similarities, but there are considerable differences in detail.

Although the pol gene is usually highly conserved (Doolittle et al. 1989), in subtype B members which diverge by less than 100 substitutions from HIVHXB2 (top figure), the highest substitution density is found in the middle of the pol gene (nt 3800-4100). This peak is less dominant in the case of HIV-1 genomes which differ from HIVHXB2 by more than 100 substitutions (three lower figures). The TA-rich conserved element at the 3' end of the pol gene (TARCE; first vertical line), is at or near a peak of substitution density in the case of subtype B members, but is on a slope in the case of subtype D members, and at one of the most conserved parts of the genome in the case of subtype A and O members.

On the other hand, the sequence which is conserved between HTLV-1 and HIV-1 (second vertical line), tends to locate either at or very close to a peak. The sequence corresponding to part of the CD4 recognition region of the Env protein gp120 (marked by a box in the env box and the third vertical line) corresponds to peaks of substitution density in most members of subtype B and the one subtype A member studied, but to a slope in most members of subtypes D and O. The Rev response element (RRE; marked by a box in the env box and the fourth vertical line) is relatively conserved, but the substitution density tends to increase at its 3' end, particularly in members of subtype B with less than 100 substitutions (top figure) and of subtype O (bottom figure).

Substitution patterns can be manipulated simply by varying the relative weightings given to base matches and indels ("deletion weight" or "gap cost"). Substitutions are one component of the "sequence variability" studied by Le et al. (1989), who did not provide information on the weighting employed. Gap cost is unlikely to affect computer-generated alignments of sequences which show little divergence (i.e. matches dominate the alignment), but in the case of subtype O members which differ by nearly 3000 substitutions from HIVHXB2, patterns of substitution density might change dramatically with gap cost.


Fig. 5. Linear regression analysis of base substitutions and base indels in HIVHXB2 relative to HIVMVP5180, at gap cost (deletion weight) values of A -0.25/base, B -0.5/base, and C -1.0/base. Substitutions and indels were counted in successive 200 nt windows, each of which overlapped the preceding window by 150 nt. The number of indels in each window is plotted against the number of substitutions in the same window. Correlation coefficients (r) and probabilities (P) that the slopes of the regression lines are not significantly different from zero are, respectively: A 0.447, <0.001; B 0.094, 0.196; C 0.110, 0.129.

Fig. 6. Base substitutions (filled symbols) and base indels (open symbols) in 200 nt overlapping windows in HIVHXB2 relative to HIVMVP5180, at various gap cost (deletion weight) values (A -0.25/base, B -0.5/base, C -1.0/base). Each data point corresponds to the middle of its window.

To examine this, base indel densities at three different gap costs were determined for each of the overlapping 200 nt windows used for the determination of substitution densities in Figure 4. Plots of indel densities against substitution densities in each window are shown in Figure 5. At gap costs of -0.5/base and -1.0/base, indels have approximately equal probabilities of appearing in windows of low and high substitution density. The slopes of the least squares plots are not significantly different from zero (Figs. 5b,c). However, when the gap cost is lowered to -0.25/base the computer program replaces substitutions with indels and more indels are seen in windows with low substitution densities (Fig. 5a).
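The kind of per-window regression described here can be sketched in plain Python. This is an illustration, not the Minitab procedure used in this work. The function name `linear_regression` is hypothetical, the five data points are made up and are not values from Figure 5, and the sketch computes only the slope, intercept and correlation coefficient (no P value).

```python
from math import sqrt


def linear_regression(xs, ys):
    """Least-squares slope, intercept and Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up per-window counts (NOT data from this paper).
substitutions = [10, 20, 30, 40, 50]
indels = [3, 1, 4, 1, 5]
slope, intercept, r = linear_regression(substitutions, indels)
```

A slope close to zero with a low r, as at the two higher gap costs, indicates that indels are about as likely in windows of low substitution density as in windows of high substitution density.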

This shift between substitutions and indels is of an all-or-none character and affects certain substitution peaks selectively. This is shown in Figure 6 where substitution and indel density profiles are compared at different deletion weights. Substitution patterns change very little when the gap cost is changed from -1.0/base (Fig. 6c) to -0.5/base (Fig. 6b). However, a gap cost of -0.25/base demolishes some substitution peaks, leaving others unscathed (Fig. 6a).

Fig. 7. Base substitutions and base indel densities in HIVHXB2 relative to HIVMVP5180, at the gap costs indicated in Figures 5 and 6. In A substitutions obtained in each 200 nt window calculated using a gap cost of -0.25/base are plotted against substitutions in the same window using a gap cost of -0.5/base. In B substitutions obtained in each window using a gap cost of -1.0/base are plotted against substitutions in the same window using a gap cost of -0.5/base. C and D show similar plots for indels.

The all-or-none nature of the shift between substitutions and indels is clearly shown in Figure 7. When substitution and indel densities obtained with a deletion weight of -1.0/base are plotted against substitution and indel densities obtained with a deletion weight of -0.5/base, most points correlate well with the least squares linear regression line, especially in the case of substitutions (Figs. 7b,d). However, when substitution and indel densities obtained with a gap cost of -0.25/base are plotted against substitution and indel densities obtained with a gap cost of -0.5/base, the points for substitutions fall into two groups (Fig. 7a), and points for indels become more scattered (Fig. 7c).

Thus, there are two categories of regions rich in substitutions, one which resists conversion to indels when the gap cost is lowered and one which readily converts when the gap cost is lowered. A gap cost of -0.5/base was chosen for routine studies of substitution density.

Fig. 8. Comparison of distribution of substitutions for HIVSF2 relative to HIVHXB2 (coloured triangles), with FORS-D values for HIVHXB2 (filled triangles). FORS-D values and substitution densities are for the same 200 nt overlapping windows referred to in Figure 1. Other details are as in Figure 1.

Negative Correlation between FORS-D and Substitutions

To investigate the reported negative correlation between "statistically significant" stem-loop potential and "sequence variability" (Braun et al. 1987; Le et al. 1988, 1989), FORS-D profiles were compared with substitution and indel profiles. Figure 8 shows substitutions relative to HIVHXB2 in the subtype B member HIVSF2, which differs from HIVHXB2 in 455 substitutions. The substitution profile is shown together with the FORS-D profile of HIVHXB2 taken from Figure 1. When FORS-D values are high, indicating an evolutionary pressure on base order to enhance stem-loop potential, substitutions tend to be low. On the other hand, when base order-determined stem-loop potential is decreased (low FORS-D values) the number of substitutions is high. These relationships are most apparent in the regions of the LTRs and the envelope protein.

The TA-rich conserved element (TARCE) at the 3' end of the pol gene (marked by the first vertical line) is in a region of high negative FORS-D, consistent with its conservation at the 3' end of various genes (Blum et al. 1990; Freter et al. 1992). However, for HIVSF2 the substitution density is quite high in this region. Similarly, the sequence which is conserved between HIV-1 and HTLV-1 (marked by the second vertical line) is in a region of slightly negative FORS-D, but high substitution density. The sequence corresponding to part of the CD4-binding site (marked by a box within the env box and the third vertical line) is in a region of high substitution density and a relatively low positive FORS-D value. The RRE corresponds to a relatively low substitution density and a high positive FORS-D value.

Fig. 9. Linear regression analysis of (A, D) FONS, (B, E) FORS-M, and (C, F) FORS-D values for HIVHXB2 in 200 nt sequence windows, against substitution or indel densities relative to HIVSF2 in the same windows. Data for A, B, and C are from Figure 8. Within figures are shown the correlation coefficients (r) and the probabilities (P) that the slopes of the regression lines are not significantly different from zero. These values also form part of Tables 1 and 2.

That the relationship to substitution density is specific for FORS-D values, rather than for FONS and FORS-M values, is shown in Figure 9. Here, for each 200 nt window, the three values are plotted separately against the substitutions in that window. Plots for FONS and FORS-M values (Figs. 9a,b) have slopes which are not significantly different from zero (P>0.05), and have low correlation coefficients (r). In Figure 9c the regression line slopes downwards, as would be expected from the negative relationship between FORS-D value and the number of substitutions. The slope of the line is significantly different from zero (P<0.001), and the correlation coefficient is higher than those of Figures 9a and 9b. Thus, with respect to base substitutions, these results support the observation of a negative correlation between "sequence variability" and "statistically significant" stem-loop potential (Le et al. 1989). On the other hand, base indel densities show no such correlation (Figs. 9d,e,f).
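The regression quantities quoted above (slope and r) can be obtained from per-window values with ordinary least squares. The following is a minimal sketch with made-up window data. The function name `linear_regression` and the numbers are assumptions for illustration. The t statistic is the usual one for testing whether the slope differs from zero, with n-2 degrees of freedom; converting it to a probability (P) requires a t distribution, which is not included here.

```python
import math


def linear_regression(x, y):
    """Least squares fit of y on x.

    Returns (slope, intercept, r, t), where t tests the null hypothesis
    that the slope is zero (n - 2 degrees of freedom).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length series with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # A perfect fit leaves no residual variance, so t is unbounded.
    if abs(r) >= 1.0:
        t = math.copysign(math.inf, r)
    else:
        t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return slope, intercept, r, t


# Hypothetical data: substitutions per window (x) and FORS-D value (y).
substitutions = [1, 2, 3, 4, 5]
fors_d_values = [5, 4, 4, 2, 1]
slope, intercept, r, t = linear_regression(substitutions, fors_d_values)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f} t={t:.3f}")
```

Note that r carries the sign of the slope in this sketch, whereas Tables 1 and 2 list r as a magnitude.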

Fig. 10. Linear regression analysis of FONS, FORS-M and FORS-D values for HIVHXB2, against substitution or indel densities relative to HIVMVP5180. Other details are as in the legend to Figure 9.

Because indels are infrequent in HIVSF2, the above analysis was repeated with HIVMVP5180, which diverges more dramatically from HIVHXB2. Plots corresponding to a gap cost of -0.5/base are shown in Fig. 10. The relationships of the various fold parameters to substitutions were similar to those obtained with HIVSF2 (Figs. 10a-c). Positive correlations with indel densities are noted; these are best with FORS-M values (the base composition-dependent component of the stem-loop potential), and least with the FORS-D values (the base order-dependent component of the stem-loop potential). It was found that correlation coefficients and probabilities are "best" (i.e. positive correlation with indels and negative correlation with substitutions), when using a gap cost of -0.5/base, rather than -0.25/base, or -1.0/base (data not shown).

Table 1 summarizes the results of studies on the negative correlation with substitution density in the case of other HIV-1 genomes. These again emphasize the significance of the relationship of FORS-D (rather than FONS and FORS-M) to substitution density. Thus substitution density is related to an evolutionary force ("fold pressure") which acts on the primary sequence (base order), rather than on base composition.

**Table 1**. Linear Regression Relationships between Base Substitution Density and FORS-D, FONS and FORS-M

| Subtype | GenBank name | Substitutions | (C+G)% | FORS-D slope | FORS-D P | FORS-D r | FONS slope | FONS P | FONS r | FORS-M slope | FORS-M P | FORS-M r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B | HIVBH102 | 75 | 41.7 | +0.142 | 0.446 | 0.058 | -0.638 | 0.115 | 0.119 | -0.495 | 0.116 | 0.119 |
| B | HIVPV22 | 93 | 42.6 | -0.022 | 0.904 | 0.009 | -0.645 | 0.134 | 0.109 | -0.667 | 0.049 | 0.143 |
| B | HIVSF2CG | 455 | 42.3 | -0.342 | <0.001 | 0.314 | 0.333 | 0.073 | 0.130 | -0.009 | 0.950 | 0.005 |
| B | HIVMNCG | 460 | 42.2 | -0.159 | 0.018 | 0.170 | 0.035 | 0.824 | 0.016 | -0.123 | 0.325 | 0.072 |
| B | HIVJRCSF | 461 | 41.9 | -0.253 | <0.001 | 0.261 | 0.156 | 0.339 | 0.070 | -0.098 | 0.456 | 0.055 |
| D | HIVZ2Z6 | 782 | 41.4 | -0.114 | 0.007 | 0.202 | 0.182 | 0.060 | 0.142 | 0.068 | 0.371 | 0.067 |
| D | HIVNDK | 808 | 41.7 | -0.095 | 0.039 | 0.154 | 0.026 | 0.800 | 0.019 | -0.069 | 0.387 | 0.065 |
| D | HIVELI | 811 | 41.5 | -0.071 | 0.083 | 0.129 | 0.053 | 0.547 | 0.043 | 0.008 | 0.800 | 0.008 |
| A | HIVU55A | 1202 | 41.9 | -0.091 | 0.019 | 0.174 | -0.032 | 0.716 | 0.027 | -0.122 | 0.068 | 0.136 |
| O | HIVANT70C | 2820 | 42.9 | -0.033 | 0.031 | 0.157 | 0.062 | 0.084 | 0.126 | 0.029 | 0.317 | 0.073 |
| O | HIVMVP5180 | 2972 | 42.6 | -0.063 | <0.001 | 0.314 | 0.102 | 0.003 | 0.217 | 0.039 | 0.149 | 0.105 |

The linear regression parameters of plots, similar to those shown in Figs. 9 and 10, are summarized here for other HIV-1 isolates. Substitutions are relative to the prototype HIVHXB2. Base compositions, (C+G)%, were determined from the GenBank files.

The FORS-D-related data from Table 1 show that the negative slopes correlate best with the data points (highest value of r), and achieve their greatest and most significant displacement from zero (P<0.001 that they are not significantly different from zero), when total substitutions relative to HIVHXB2 are around 450. This number of substitutions is characteristic of members which are still within group B. Members of group B which differ by fewer than around 100 substitutions from HIVHXB2 have low slopes, high P values (indicating low significance), and low correlation coefficients (indicating poor correlations). Above 450 substitutions, slopes and correlation coefficients tend to decline, and P values tend to increase, but usually remain <0.05 (indicating significance).

**Table 2**. Linear Regression Relationships between Base Indel Density and FORS-D, FONS and FORS-M

| Subtype | GenBank name | Indels | FORS-D slope | FORS-D P | FORS-D r | FONS slope | FONS P | FONS r | FORS-M slope | FORS-M P | FORS-M r |
|---|---|---|---|---|---|---|---|---|---|---|---|
| B | HIVBH102 | 40 | 0.243 | <0.001 | 0.267 | -0.230 | 0.124 | 0.116 | 0.013 | 0.914 | 0.008 |
| B | HIVPV22 | 44 | 0.240 | 0.001 | 0.240 | -0.255 | 0.135 | 0.108 | -0.014 | 0.915 | 0.008 |
| B | HIVSF2CG | 119 | -0.059 | 0.402 | 0.061 | 0.230 | 0.163 | 0.101 | 0.171 | 0.189 | 0.096 |
| B | HIVMNCG | 112 | -0.040 | 0.601 | 0.037 | 0.207 | 0.256 | 0.083 | 0.167 | 0.244 | 0.085 |
| B | HIVJRCSF | 106 | 0.032 | 0.617 | 0.037 | -0.016 | 0.915 | 0.008 | 0.016 | 0.890 | 0.010 |
| D | HIVZ2Z6 | 103 | -0.114 | 0.105 | 0.122 | 0.329 | 0.038 | 0.156 | 0.215 | 0.082 | 0.131 |
| D | HIVNDK | 110 | -0.131 | 0.036 | 0.157 | 0.358 | 0.009 | 0.154 | 0.228 | 0.034 | 0.158 |
| D | HIVELI | 90 | -0.211 | 0.034 | 0.178 | 0.549 | 0.005 | 0.207 | 0.338 | 0.027 | 0.165 |
| A | HIVU55A | 197 | -0.127 | 0.016 | 0.180 | 0.098 | 0.408 | 0.062 | -0.029 | 0.748 | 0.024 |
| O | HIVANT70C | 583 | 0.007 | 0.744 | 0.021 | -0.303 | <0.001 | 0.396 | -0.297 | <0.001 | 0.482 |
| O | HIVMVP5180 | 512 | 0.051 | 0.074 | 0.130 | -0.361 | <0.001 | 0.394 | -0.310 | <0.001 | 0.430 |

The linear regression parameters of plots, similar to those shown in Figs. 9 and 10, are summarized here for other HIV-1 isolates. Indels are relative to the prototype HIVHXB2.

Table 2 summarizes similar data for indels. The most significant correlation with FORS-D values occurs in the case of genomes which diverge least from the prototype, and this correlation is positive. The strong positive correlation with FORS-M values seen in the case of HIVMVP5180 (Fig. 10e) also applies to another strongly diverging genome (HIVANT70C). Thus indels correlate best with a base order-dependent function when at low density and with a base composition-dependent function when at high density.

Because the FORS-D value for each sequence window is the difference between the FORS-M and FONS values, the corresponding slopes in Tables 1 and 2 show the same relationship: the FORS-D slope equals the FORS-M slope minus the FONS slope, within rounding.
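This follows from the linearity of least squares: if each y value is a difference of two series regressed on the same x values, the fitted slope is the difference of the two fitted slopes. The sketch below illustrates the identity on made-up window values, and then checks the slopes listed in Table 2 against it. The function names and the synthetic numbers are assumptions for illustration; the tolerance of 0.002 allows for each tabulated slope being rounded to three decimal places.

```python
def ols_slope(x, y):
    """Least squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Synthetic (made-up) per-window values, in kcal/mol, against indel counts.
indels = [0, 1, 2, 3, 5, 8]
fons = [-40.1, -38.5, -41.0, -36.2, -35.0, -33.9]
fors_m = [-37.0, -36.8, -38.2, -35.5, -34.1, -32.0]
fors_d = [m - n for m, n in zip(fors_m, fons)]
slope_gap = ols_slope(indels, fors_d) - (
    ols_slope(indels, fors_m) - ols_slope(indels, fons)
)

# (name, FORS-D slope, FONS slope, FORS-M slope), copied from Table 2.
TABLE2_SLOPES = [
    ("HIVBH102", 0.243, -0.230, 0.013),
    ("HIVPV22", 0.240, -0.255, -0.014),
    ("HIVSF2CG", -0.059, 0.230, 0.171),
    ("HIVMNCG", -0.040, 0.207, 0.167),
    ("HIVJRCSF", 0.032, -0.016, 0.016),
    ("HIVZ2Z6", -0.114, 0.329, 0.215),
    ("HIVNDK", -0.131, 0.358, 0.228),
    ("HIVELI", -0.211, 0.549, 0.338),
    ("HIVU55A", -0.127, 0.098, -0.029),
    ("HIVANT70C", 0.007, -0.303, -0.297),
    ("HIVMVP5180", 0.051, -0.361, -0.310),
]


def inconsistent_rows(rows, tol=0.002):
    """Names of rows where FORS-D slope != FORS-M slope - FONS slope."""
    return [
        name
        for name, d_slope, fons_slope, m_slope in rows
        if abs((m_slope - fons_slope) - d_slope) > tol
    ]


print("synthetic slope gap:", slope_gap)
print("Table 2 rows outside tolerance:", inconsistent_rows(TABLE2_SLOPES))
```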

No Relationship between FORS-D Value and (G+C)%

Fig. 11. Linear regression analysis of average values for (A) FONS, (B) FORS-M and (C) FORS-D for various retroviruses, as a function of their base composition, (G+C)%. For each retrovirus plots were prepared similar to those shown in Figures 1-3. Average values were determined for all the 200 nt overlapping windows. G+C percentages were determined from the GenBank entries. Probabilities (P) that the slopes are not significantly different from zero are A, 0.014, B, 0.016, and C, 0.812. Correlation coefficients (r) are A, 0.68, B, 0.68, and C, 0.08. Names of viruses are abbreviated in A and C. The arrangement of points in B is similar to that in A.

The full names and GenBank designations of the viral isolates referred to are: SFV (simian foamy virus; SFVGENOME), VISNA (visna virus; VLVCGA), HIVNDK (human immunodeficiency virus type 1; HIVNDK), HIVHXB2 (human immunodeficiency virus type 1; HIVHXB2CG), HIVMVP (human immunodeficiency virus type 1; HIVMVP5180), SIVMP (simian Mason-Pfizer virus; SIVMPCG), MMTV (murine mammary tumour virus; MMTPROCG), SIVAGM (simian immunodeficiency virus; SIVAGM155), HIV-2 (human immunodeficiency virus type 2; HIV2ISY), FCVF (feline leukaemia virus; FCVF6A), HTLV-1 (human T cell leukaemia virus type 1; HTVPRCAR), and ROUS (Rous sarcoma virus; ALRCR).

The (G+C)% of the HIV-1 genomes shown in Table 1 have similar values, so that one cannot assess the effect of changes in this parameter on base composition-dependent components of the stem-loop potential. Some retroviruses have quite different G+C percentages and, as might be expected from the greater interaction energy of GC base pairs relative to AT base pairs, average FONS and FORS-M values tend to increase with increasing (G+C)% (Figs. 11a,b). However, the relationship is not uniform (correlation coefficients of 0.68). For example, the G-rich Rous sarcoma virus and the very C-rich human T cell leukaemia virus (HTLV-1), although having approximately the same (G+C)%, differ considerably from each other in values for FONS (Fig. 11a) and FORS-M (Fig. 11b). The great excess of C over G in HTLV-1 means that G might limit the number of GC bonds that could form; thus FONS and FORS-M values would be less than expected based on total G+C content.
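The point about G limiting GC pairing can be made concrete with a simple count. The sketch below uses two made-up 12-base sequences with the same (G+C)% but different G:C balance; the function names and sequences are assumptions for illustration, and the count ignores positional constraints on folding as well as G-U pairs.

```python
def gc_percent(seq):
    """Percentage of bases in seq that are G or C."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def max_gc_pairs(seq):
    """Upper bound on GC pairs: every pair needs one G and one C."""
    seq = seq.upper()
    return min(seq.count("G"), seq.count("C"))


balanced = "GGGGCCCCAAUU"   # 4 G, 4 C (hypothetical)
c_rich = "GCCCCCCCAAUU"     # 1 G, 7 C (hypothetical)

for name, seq in (("balanced", balanced), ("C-rich", c_rich)):
    print(name, round(gc_percent(seq), 1), max_gc_pairs(seq))
```

Both toy sequences have the same (G+C)%, yet the C-rich one can form at most one GC pair where the balanced one can form up to four.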

Also, as might be expected, no significant relationship is evident between base composition (G+C percentage) and FORS-D values. By subtracting the FONS value (a function both of base order and of base composition) from the FORS-M value (a function of base composition), to arrive at the FORS-D value, the contribution of base composition to stem-loop potential is removed. All FORS-D values are positive, although most values are below the values determined for some long human genomic segments which are largely composed of intergenic DNA (approx. 4 kcal/mol; Forsdyke, 1995b). Whereas points for the three HIV-1 subtypes shown (HIVHXB2, HIVNDK and HIVMVP) are close together in Figures 11a and 11b, the points are more widely dispersed among the points for other retroviral types in Figure 11c. HTLV-1 has the highest FORS-D value. Thus, in spite of the imbalance in relative levels of G and C, the primary sequence of the C-rich HTLV-1 has adapted to generate FORS-D values of the same order as long genomic sequences of its host.
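The subtraction described above can be sketched in a few lines. This is an illustration only, under stated assumptions: the folding program used in this study is not reproduced, and a crude stand-in "energy" (minus one unit per base pair in a maximal set of nested Watson-Crick pairs, found by a Nussinov-type recursion) takes its place. The function names, the number of shuffles, the random seed and the test sequence are all hypothetical.

```python
import random

# Complementary pairs accepted by the toy model (RNA or DNA alphabets).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("A", "T"), ("T", "A")}


def toy_energy(seq, min_loop=3, per_pair=-1.0):
    """Stand-in folding energy: per_pair times the maximum number of
    nested complementary pairs (Nussinov-type dynamic programme).
    This is NOT a thermodynamic energy model."""
    n = len(seq)
    if n == 0:
        return 0.0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # base i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in PAIRS:
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return per_pair * dp[0][n - 1]


def fors_d(seq, energy=toy_energy, shuffles=10, seed=0):
    """Return (FONS, FORS-M, FORS-D) for one sequence window.

    FONS   = energy of the natural sequence
    FORS-M = mean energy of shuffled versions (same base composition)
    FORS-D = FORS-M - FONS (the base order-dependent component)
    """
    rng = random.Random(seed)
    fons = energy(seq)
    total = 0.0
    for _ in range(shuffles):
        bases = list(seq)
        rng.shuffle(bases)
        total += energy("".join(bases))
    fors_m = total / shuffles
    return fons, fors_m, fors_m - fons


# A made-up window whose base order forms a perfect five-pair hairpin.
print(fors_d("GGGGGAAAACCCCC"))
```

Because shuffling preserves base composition, a natural order that already achieves the maximum pairing available to its composition cannot be bettered by any shuffle, so FORS-D for such a window is zero or positive in this toy model.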

FORS-D plots for the retroviruses mentioned in Figure 11 differ in numerous respects from those shown for the HIV subtypes (Figs. 1-3). This will be the subject of a future communication, which will also contain a more detailed consideration of the relationship between FORS-D values and sequence features (e.g. proportions of synonymous and non-synonymous mutations; Shpaer and Mullins, 1993).

Discussion

Codons and Stem-Loops are Local "Strategies"

Base composition may be expressed in a variety of ways. (G+C)% is particularly informative, being a dominant characteristic of a genome or large genome segment (Bernardi, 1989). Thus codons adapt to the (G+C)% of the segment in which they are located (Grantham et al. 1980). This indicates that (G+C)% is a genomic "strategy", not a local "strategy". Codons are a local "strategy". For this reason, the contribution of base composition to the local stability of stem-loops is seen as an externally imposed (non-local) characteristic (see Methods).

To a small extent, base order is also a genomic "strategy". Thus the low frequency of the dinucleotide TpA seems to be universal. The dinucleotide CpG is in low frequency in many taxa, but in high frequency in most bacteria (Nussinov, 1994; Forsdyke, 1995d). However, base order is predominantly a local "strategy", influencing both codons and stem-loop stability. It has been shown (see Methods) that base order-determined stem-loop potential closely corresponds to the "statistically significant" stem-loop potential of Le and Maizel (1989). In the present work, their approach to quantitating the latter as the "segment score" was adapted as "FORS-D analysis".

FORS-D Values are Small but Highly Informative

It may seem unlikely that FORS-D values, which are so small relative to the FONS and FORS-M values from which they are derived, could be informative about underlying evolutionary processes affecting nucleic acids. However, the values do not fluctuate randomly about zero; positive values are found consistently, and are widely distributed in long genomic segments from a variety of species (Forsdyke, 1995b-d, 1996). For example, the average value for 69 windows at 1 kb intervals in a 68 kb segment from human chromosome 19 (HUMMMDBC) is 4.4 ± 0.9 kcal/mol. This reveals the existence of a genome-wide evolutionary pressure ("fold pressure") favouring positive FORS-D values. Negative values have been interpreted as revealing some local functional constraint on base order (such as the encoding of a protein), which prevents a region achieving the most stable stem-loop conformation of which it is capable. Indeed, negative values are found in promoter regions, and tend to occur more frequently in exons than in introns. The present work shows that the density of substitution mutations in HIV-1 is reciprocally related to FORS-D values (reflecting base order-determined stem-loop potential), but not to FORS-M values (reflecting base composition-determined stem-loop potential; Figs. 9, 10; Table 1). This confirms the observations of Le et al. (1988, 1989, 1990), and emphasises the power of an analytical approach which relies on differences of a few kcal/mol.

Le et al. (1991) state that "The significance score [e.g. FORS-D value] characterizes the specific arrangement of the nucleotides in the segment that could imply a structural role for the sequence information". This is the nearest the authors get to discriminating between the contributions of base order and base composition, which is an important aspect of the present work. Also, in the present work base substitutions and indels are considered separately, whereas Le et al. (1989) assess base substitutions and indels collectively as "sequence variability". The failure of indels to contribute to the reciprocal relationship between FORS-D and sequence variability (Figs. 9, 10; Table 2) may not have been critical to their results; in HIV-1 substitutions exceed indels by a factor of between 2 (HIVBH102) and 9 (HIVELI), when using a gap penalty of -0.5/base.

Paradox Explained by Positive Darwinian Selection?

The observation that negative FORS-D values (implying functional importance) are correlated with sequence variability (substitutions) appears paradoxical, in that functional importance is usually correlated with sequence conservation in the case of species in which evolutionary changes occur on a geological time scale. Individuals with mutations in functionally important regions are usually negatively selected. An important clue leading to an apparent resolution of the paradox is provided by studies of the distribution of FORS-D values in genes which, like retroviruses (Shpaer and Mullins, 1993), are under positive "Darwinian" selection.

Snake venom protein genes appear to be engaged in an "arms race" with the genes of their prey which confer resistance. Snake venom phospholipase A₂ genes have very high substitution rates in exons and very low substitution rates in introns. As in the case of the RNA genomes considered here, there is a reciprocal relationship between base order-dependent stem-loop potential and substitution density (Forsdyke, 1995c). Similarly, the reciprocal relationship holds for the polymorphic regions of MHC genes which bind peptide fragments from pathogens for recognition by the host immune system (D. R. Forsdyke, unpublished work). These regions are also under positive Darwinian selection (Hughes et al. 1994), probably in response to coevolving parasites (Klein and O'Huigin, 1994). Thus, it is possible that the reciprocal relationship holds for other genomes or genome segments which are rapidly evolving in this way. This implies that it may be easier to adapt the sequence of a conserved region for encoding both specific function and stem-loop potential (by synonymous substitutions and conservative amino acid exchanges), than to adapt the sequence of a region rapidly accepting mutations under positive selection pressure.

Thus, the proposed resolution of the paradox requires consideration of the particularly high rates of accepted mutations in RNA genomes, which result, not in species with members of relatively uniform sequence, but in "quasispecies" with members of varying sequence (Holland et al. 1992). Within one HIV-1-infected individual, the sequence of the predominant viral population varies with the stage of the disease. This is of adaptive value to the virus which appears more able to escape the cellular and humoral immune defences of its host. However, at some level, the adaptive value of a high mutation rate should be limited by the maladaptive effect of loss of information required for normal function.

An important way this loss of information could be countered, in the case of genomes which at some stage in the life history of the organism might share a common intracellular environment, would be by recombination (Bernstein and Bernstein, 1991). Retroviruses transmit two copies of their genomes to the cells they infect, and have high rates of recombination (Coffin, 1992). Thus, if segments of the primary sequence are functionally important for recombination (i.e. have stem-loops), keeping these segments operational might be as important as keeping operational protein-encoding segments and segments concerned with regulation. The latter two groups of segments, when in regions under positive selection pressure (Shpaer and Mullins, 1993), may be associated with negative FORS-D values since the sequences may not have been able to optimize simultaneously both their function-encoding abilities and their folding propensities. As these segments become less important for successful recombination (loss of base order-dependent stem-loop potential), they are able to accept more substitutions than segments better adapted for recombination (associated with high FORS-D values). The proposed relationship between the conflicting evolutionary forces acting on the HIV-1 genome is summarized in Figure 12.

Fig. 12. Summary of the proposed evolutionary factors which result in a negative correlation between fold pressure, as measured by FORS-D values, and substitution density. A high mutation rate results in sequence variation (substitutions), which has three functional consequences. Positive "Darwinian" adaptations (e.g. changes in a sequence encoding a protein recognized by the host immune system) are favourable, and create a selection pressure for further increasing the mutation rate (e.g. decreasing the fidelity of reverse transcriptase).

Negative adaptations (e.g. impairment of a vital function) are detrimental, and create a selection pressure for lowering the mutation rate (e.g. increasing the fidelity of reverse transcriptase). These negative adaptations may be countered by repair or correction through recombination, which itself may be impaired by sequence variations which decrease base order-determined stem-loop potential (since stem-loops are important for recombination).

Of the three consequences of variation, the decrease of stem-loop potential is likely to be of particular importance, since it impairs the ability to correct other mutations by recombination. Thus, mutations which decrease this potential (and hence the recombination function) are as likely to be selected against as are negatively adaptive mutations which decrease conventional functions (e.g. the encoding of a specific protein or the regulation of its expression). On the other hand, recombination must not be so efficient that it repairs or corrects mutations which are positively adaptive.

Accelerated Speciation Model

In the words of Holland and coworkers (1982), "DNA chromosomes of eukaryotic host organisms generally require geologic time spans to evolve to a degree that their RNA viruses can achieve in a single human generation". Members of a particular HIV-1 subtype might be able to recombine successfully with each other if coinfection occurred (Robertson et al. 1995). However, because of variations in the distribution of base order-determined stem-loop potential (Figs. 1-3), intersubtype recombination might be more difficult.

The changes in the various linear regression parameters associated with FORS-D values plotted as a function of substitution density (Table 1), may reflect this. When substitutions are less than 100/genome, the slope of the plot of FORS-D versus substitutions is low, and not significantly different from zero. At this low substitution density high FORS-D regions are spared no more than other regions, implying that at this density recombination tends not to be impaired. In this respect the two genomes are behaving like alleles which usually do not change in heterozygotes undergoing recombination.

When substitutions are around 460/genome, the slope is maximum (Fig. 9; Table 1). At this substitution density, organisms are still subtype B (the same as the prototype), and high FORS-D regions are spared; this implies that this density impairs the organism's recombination function, so that genomes with mutations in potential stem-loop regions are selected against.

At substitution densities above 460/genome, the slope decreases, implying that high FORS-D regions are being spared less. These high substitution densities can be regarded as approaching an "escape" value beyond which recombination with the prototype (to correct sequence errors) is less likely to be beneficial, and thus less evolutionarily advantageous. Under this condition, the value of matching the FORS-D profile of a genome (Figs. 2, 3) with that of the prototype genome (Fig. 1), in order to optimize recombination, decreases; thus, the ratio of prototype FORS-D values to substitution density (slope), declines (Table 1).

In this circumstance, there would be less selective advantage in retaining the similarities in (G+C)% which, at present, are very close (Table 1). The ability to resist intrinsic mutational biases affecting overall base composition (Filipski, 1990; Vartanian et al. 1991), would be decreased. Thus, it would be easier for G+C percentages to diverge. As a consequence of this, recombination with the prototype would be even more difficult. Thus, the generation of distinct retroviral species (speciation) would have begun.

In this respect retroviral genomes seem to offer a tractable model system for addressing general aspects of the speciation problem. Mechanisms by which percentage G+C differences might affect stem-loops and inhibit recombination are discussed elsewhere (Forsdyke, 1996). The importance of (G+C)% compatibility for recombination is emphasized by the observation that the high (G+C)% HTLV-1 genome is excluded from low (G+C)% sectors of the host genome (Zoubak et al. 1994).

Importance of (G+C)% Differences in Speciation

Although a subject of some controversy, species can simply be defined as consisting of organisms which are successful at reproducing sexually with each other. Sexual species are reproductively isolated from other sexual species. Sueoka in 1961 noted that freely crossing (i.e. recombining) strains of Tetrahymena have similar G+C percentages.

The importance of (G+C)% to speciation has recently been emphasized by comparisons of the retroviral phylogenetic trees generated from natural and randomised sequences (Bronson and Anderson 1994). A natural sequence is but one member of a large set of possible sequences with the same base composition. The average characteristic of this set can be arrived at by randomizing the natural sequence (see Methods). If the phylogenetic tree which can be generated from natural sequences closely resembles the average tree which can be generated from their randomized equivalents (as was found by Bronson and Anderson who randomized and aligned pairs of sequences a thousand times), it follows that speciation is more likely to be a function of base composition than of base order.

That base composition is a major force driving the evolution of retroviruses is apparent from the major influence of base composition on codon choice, which extends to non-synonymous codons and greatly biases amino acid composition (Berkhout and Hemert, 1994; Bronson and Anderson, 1994). The latter authors note that viruses with radically different base frequencies have the potential to coinfect cells, and postulate that the different base frequencies would dictate different precursor requirements (nucleotides, amino acids), so that the viruses would occupy different "ecological niches" within the host cell. This avoidance of competition with other viruses for common nutrients would have driven the differentiation in base composition (i.e. the viruses would have resisted intrinsic mutational biases to a lesser extent).

However, while there is some evidence for intracellular compartmentation at the nucleotide level (Leeds et al. 1985), there is none for compartmentation at the amino acid level. A more attractive explanation, which has guided the present work, is advanced elsewhere (Forsdyke, 1995b, 1996). The key postulates are that:

(i) recombination between genomes facilitates the repair or correction of mutations;
(ii) accurate repair requires an accurate template, which must not have diverged much from the sequence which is being repaired; thus intra-species recombination is favoured and inter-species recombination is disfavoured (i.e. organisms which recombine with other species are negatively selected);
(iii) for both DNA and RNA genomes, differences in base composition inhibit recombination based on the "kissing" between the loops of stem-loop structures;
(iv) a key stage in speciation is arrived at when there is a critical degree of sequence divergence from the prototype such that it is more important not to recombine, than to recombine, with members of the prototype species. Thus, it is the need to occupy different "recombinational niches" (i.e. to occupy particular species niches), rather than "ecological niches", that drives differentiation of base composition and speciation.

Acknowledgements. I thank J. Gerlach for assistance with computer configurations, J. Mau (National Research Council) and the Queen's University Statlab for advice, L. Russell for technical help, and the Medical Research Council of Canada for support.

References

Berkhout B, Hemert FJ van (1994) The unusual nucleotide content of the HIV RNA genome results in a biased amino acid composition of HIV proteins. Nucleic Acids Res 22:1705-1711

Bernardi G (1989) The isochore organization of the human genome. Annu Rev Genet 23:637-661

Bernstein C, Bernstein H (1991) Aging, Sex and DNA Repair. Academic Press, San Diego

Blum S, Forsdyke RE, Forsdyke DR (1990) Three human homologs of a murine gene encoding an inhibitor of stem cell proliferation. DNA Cell Biol 9:589-602

Braun MJ, Clements JE, Gonda MA (1987) The visna virus genome: evidence for a hypervariable site in the env gene and sequence homology among lentivirus envelope proteins. J Virol 61:4046-4054

Bronson EC, Anderson JN (1994) Nucleotide composition as a driving force in the evolution of retroviruses. J Mol Evol 38:506-532

Coffin JM (1992) Genetic diversity and evolution of retroviruses. Curr Topics Microbiol Immunol 176:143-163

Crick F (1971) General model for the chromosomes of higher organisms. Nature 234:25-27

Cullen BR (1991) Regulation of HIV-1 gene expression. FASEB J 5:2361-2368

Currey KM, Peterlin BM, Maizel JV (1986) Secondary structure prediction of poliovirus RNA: correlation of computer-predicted with electron microscopically observed structure. Virology 148:33-46

Doolittle RF, Feng D-F, Johnson MS, McClure MA (1989) Origins and evolutionary relationships of retroviruses. Quart Rev Biol 64:1-30

Doyle GG (1978) A general theory of chromosome pairing based on the palindromic DNA model of Sobell with modifications and amplification. J Theor Biol 70:171-184

Filipski J (1990) Evolution of DNA sequence. Contributions of mutational bias and selection to the origin of chromosomal compartments. Adv Mutagenesis Res 2:1-54

Fitch WM (1983) Random sequences. J Mol Biol 163:171-176

Forsdyke DR (1995a) Fine-tuning of intracellular protein concentrations, a collective protein function involved in aneuploid lethality, sex-determination and speciation? J Theor Biol 172:335-345

Forsdyke DR (1995b) A stem-loop "kissing" model for the initiation of recombination and the origin of introns. Mol Biol Evol 12:949-958

Forsdyke DR (1995c) Conservation of stem-loop potential in introns of snake venom phospholipase A₂ genes. An application of FORS-D analysis. Mol Biol Evol 12:1157-1165

Forsdyke DR (1995d) Relative roles of primary sequence and (G+C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species. J Mol Evol 41:573-581

Forsdyke DR (1996) Different biological species "broadcast" their DNAs at different (G+C)% "wavelengths". J Theor Biol 178:405-417

Freter RR, Irminger JC, Porter JA, Jones SD, Stiles CD (1992) A novel 7-nucleotide motif located in 3' untranslated sequences of the immediate-early gene set mediates platelet-derived growth factor induction of the JE gene. Mol Cell Biol 12:5288-5300

Grantham R, Gautier G, Gouy M, Mercier R, Pare A (1980) Codon catalog usage and the genome hypothesis. Nucleic Acids Res 8:r49-r62

Gribskov M, Devereux J (1991) Sequence Analysis Primer. Stockton Press, New York

Gurtler LG, Hauser PH, Eberle J, Brunn A von, Knapp S, Zekeng L, Tsague JM, Kaptue L (1994) A new type of human immunodeficiency virus type 1 from Cameroon. J Virol 68:1581-1585

Hawley RS, Arbel T (1993) Yeast genetics and the fall of the classical view of meiosis. Cell 72:301-303

Holland J, Spindler K, Horodyski F, Grabau E, Nichol S, VandePol S (1982) Rapid evolution of RNA genomes. Science 215:1577-1585

Holland JJ, Torre JC de la, Steinhauer DA (1992) RNA virus populations as quasispecies. Curr Topics Microbiol Immunol 176:1-20

Huet T, Cheynier R, Meyerhans AR, Roelants G, Wain-Hobson S (1990) Genetic organization of a chimpanzee lentivirus related to HIV-1. Nature 345:356-358

Hughes AL, Hughes MK, Howell CY, Nei M (1994) Natural selection at the class II major histocompatibility complex loci of mammals. Phil Trans R Soc Lond B 345:359-367

Karlin S, Brendel V (1992) Chance and statistical significance in protein and DNA sequence analysis. Science 257:39-49

Kleckner N, Padmore R, Bishop DK (1991) Meiotic chromosome metabolism: one view. Cold Spring Harbor Symp Quant Biol 56:729-743

Kleckner N, Weiner BM (1993) Potential advantages of unstable interactions for pairing of chromosomes in meiotic, somatic and premeiotic cells. Cold Spring Harbor Symp Quant Biol 58:553-565

Klein S (1994) Choose your partner: chromosome pairing in yeast meiosis. Bioessays 16:869-871

Klein J, O'Huigin C (1994) MHC polymorphism and parasites. Phil Trans R Soc Lond B 346:351-358

Lasky LA, Nakamura G, Smith DH, Fennie C, Shimasaki C, Patzer E, Berman P, Gregory T, Capon DJ (1987) Delineation of a region of the human immunodeficiency virus type 1 gp120 glycoprotein critical for interaction with the CD4 receptor. Cell 50:975-985

Le S-Y, Chen J-H, Braun MJ, Gonda MA, Maizel JV (1988) Stability of RNA stem-loop structure and distribution of non-random structure in the human immunodeficiency virus (HIV-I). Nucleic Acids Res 16:5153-5168

Le S-Y, Chen J-H, Chatterjee D, Maizel JV (1989) Sequence divergence and open regions of RNA secondary structures in the envelope regions of 17 human immunodeficiency virus isolates. Nucleic Acids Res 17:3275-3288

Le S-Y, Chen J-H, Maizel JV (1989) Thermodynamic stability and statistical significance of potential stem-loop structures situated at the frameshift sites of retroviruses. Nucleic Acids Res 17:6143-6152

Le S-Y, Chen J-H, Maizel JV (1991) Detection of unusual RNA folding regions in HIV and SIV sequences. CABIOS 7:51-55

Le S-Y, Maizel JV (1989) A method for assessing the statistical significance of RNA folding. J Theor Biol 138:495-510

Le S-Y, Malim MH, Cullen BR, Maizel JV (1990) A highly conserved RNA folding region coincident with the Rev response element of primate immunodeficiency viruses. Nucleic Acids Res 18:1613-1623

Leeds JM, Slabaugh MB, Mathews CK (1985) DNA precursor pools and ribonucleotide reductase activity: distribution between the nucleus and cytoplasm of mammalian cells. Mol Cell Biol 5:3443-3450

Mann DA, Mikaelian I, Zemmel RW, Green SM, Lowe AD, Kimura T, Singh M, Butler JG, Gait MJ, Karn J (1994) A molecular rheostat. Cooperative Rev binding to stem I of the Rev-response element modulates human immunodeficiency virus type-1 late gene expression. J Mol Biol 241:193-207

Martinez HM (1988) A flexible multiple sequence alignment program. Nucleic Acids Res 16:1683-1691

Min Jou W, Fiers W (1976) Studies on the bacteriophage MS2. XXXIII. Comparison of the nucleotide sequences in related bacteriophage RNAs. J Mol Biol 106:1047-1060

Murchie AIH, Bowater R, Aboul-Ela F, Lilley DMJ (1992) Helix opening transitions in supercoiled DNA. Biochim Biophys Acta 1131:1-15

Myers G, Korber B, Wain-Hobson S, Smith RF, Pavlakis GN (1993) Human Retroviruses and AIDS 1993. A Compilation and Analysis of Nucleic Acid and Amino Acid Sequences. Los Alamos National Laboratory, New Mexico

Nussinov R (1984) Strong doublet preferences in nucleotide sequences and DNA geometry. J Mol Evol 20:111-119

Orita M, Iwahana H, Kanazawa H, Hayashi K, Sekiya T (1989) Detection of polymorphisms of human DNA by gel electrophoresis as single strand conformation polymorphisms. Proc Natl Acad Sci USA 86:2766-2770

Radman M, Wagner R, Kricker MC (1993) Homologous DNA interactions in the evolution of gene and chromosome structure. Genome Anal 7:139-154

Ratner L, Haseltine W, Patarca R, Livak KJ, Starcich B, Josephs SF, Doran ER, Rafalski JA, Whitehorn EA, Baumeister K, Ivanoff L, Petteway SR, Pearson ML, Lautenberger JA, Papas TS, Ghrayeb J, Chang NT, Gallo RC, Wong-Staal F (1985) Complete nucleotide sequence of the AIDS virus, HTLV-III. Nature 313:277-284

Ratner L, Fisher A, Jagodzinski LL, Mitsuya H, Liou R-S, Gallo RC, Wong-Staal F (1987) Complete nucleotide sequences of functional clones of the AIDS virus. AIDS Res Hum Retrovir 3:57-69

Robertson DL, Sharp PM, McCutchan FE, Hahn BH (1995) Recombination in HIV-1. Nature 374:124-126

Romanova LI, Blinov VM, Tolskaya EA, Viktorova EG, Kolesnikova MS, Guseva EA, Agol VI (1986) The primary structure of crossover regions of intertypic poliovirus recombinants: a model of recombination between RNA genomes. Virology 155:202-213

Ryan BF, Joiner BL (1994) Minitab Handbook. 3rd edition. Wadsworth Publishing, Belmont, California

Salser W (1977) Globin mRNA sequences: analysis of base pairing and evolutionary implications. Cold Spring Harbor Symp Quant Biol 42:985-1002

Shpaer EG, Mullins JI (1993) Rates of amino acid change in the envelope protein correlate with pathogenicity of primate lentiviruses. J Mol Evol 37:57-65

Sobell HM (1972) Molecular mechanism for genetic recombination. Proc Natl Acad Sci USA 69:2483-2487

Spire B, Sire J, Zachar V, Rey F, Barre-Sinoussi F, Galibert F, Hampe A, Chermann J-C (1989) Nucleotide sequence of HIV1-NDK: a highly pathogenic strain of the human immunodeficiency virus. Gene 81:275-284

Sueoka N (1961) Compositional correlation between deoxyribonucleic acid and protein. Cold Spring Harbor Symp Quant Biol 26:35-43

Tolskaya EA, Romanova LI, Blinov VM, Viktorova EG, Sinyakov AN, Kolesnikova MS, Agol VI (1987) Studies on the recombination between RNA genomes of poliovirus: the primary structure and nonrandom distribution of crossover regions in the genomes of intertypic poliovirus recombinants. Virology 161:54-61

Tomizawa J (1984) Control of ColE1 plasmid replication: the process of binding of RNA I to the primer transcript. Cell 38:861-870

Turner DH, Sugimoto N, Freier SM (1988) RNA structure prediction. Annu Rev Biophys Chem 17:167-192

Vartanian JP, Meyerhans A, Asjo B, Wain-Hobson S (1991) Selection, recombination and G-A hypermutation of human immunodeficiency virus type 1 genomes. J Virol 65:1779-1788

Wagner RE, Radman M (1975) A mechanism for initiation of genetic recombination. Proc Natl Acad Sci USA 72:3619-3622

Zoubak S, Richardson JH, Rynditch A, Hollsberg P, Hafler DA, Boeri E, Lever AML, Bernardi G (1994) Regional specificity of HTLV-1 proviral integration in the human genome. Gene 143:155-163

Zuker M (1989) Computer prediction of RNA secondary structure. Meth Enzym 180:262-289

End Note (June 2007)

Supporting the observations of Bronson and Anderson (1994) in retroviruses, Sampath et al. (2007) have found, for influenza virus, that different "evolving virus species" can be differentiated on the basis of base composition ("Base composition derived clusters inferred from this analysis showed 100% concordance to previously established clades."). The authors did not consider whether the differences in base composition followed, or were concomitant with, phenotypic changes (novel antigens), or preceded those changes (thus being implicated in the speciation process).

Sampath R, et al. (2007) PLOS One 2(5), e489. Global surveillance of emerging influenza virus genotypes by mass spectrometry.

End Note (March 2011)

Supporting our above observations, Sanjuan and Borderia (2011) conclude for HIV-1 that "RNA structure and proteins do not evolve independently. A negative correlation exists between the extent of base pairing in the genomic RNA and amino acid variability."

Sanjuan R, Borderia AV (2011) Molecular Biology & Evolution 28 (4), 1333-1338. Interplay between RNA structure and protein evolution in HIV-1.

End Note (May 2011)

Further analyses of HIV-1 and influenza virus RNA structures again support our case that there is potential conflict between the needs of genome structure and other functions. For example, for influenza virus, Moss et al. (2011) note that "RNA structural constraints lead to suppression of variation in the third (wobble) position of amino acid codons." Instead of FORS-D they use the original Le-Maizel Z-score terminology. Viral plus and minus strands display asymmetrical folding energies, a feature discussed elsewhere on these pages.
Moss WN, Priore SF, Turner DH (2011) RNA 17, 991-1011. Identification of potential conserved RNA secondary structure throughout influenza A coding regions.
Watts JM, Dang KK, Gorelick RJ, Leonard CW, et al. (2009) Nature 460, 711-716. Architecture and secondary structure of an entire HIV-1 RNA genome.
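
The two terminologies mentioned in this note can be related by a small sketch. It assumes, as in the Abstract, that FORS-D is the difference between the folding energy of the natural segment (FONS) and the mean folding energy of its shuffled versions (FORS-M), and that the Le-Maizel Z-score divides that same difference by the standard deviation of the shuffled energies. The sign convention, the use of the sample standard deviation, the function names and the energy values (kcal/mol) are illustrative assumptions, not taken from either set of authors.

```python
from statistics import mean, stdev

def fors_d(fons, shuffled_energies):
    """FONS minus FORS-M (the mean energy of the shuffled versions)."""
    fors_m = mean(shuffled_energies)
    return fons - fors_m

def z_score(fons, shuffled_energies):
    """The same difference, scaled by the spread of the shuffled energies.

    Assumption: sample standard deviation; needs at least two shuffled values
    that are not all identical.
    """
    return fors_d(fons, shuffled_energies) / stdev(shuffled_energies)

# Hypothetical energies for one sequence window and three shuffled versions.
natural = -50.0
shuffled = [-40.0, -42.0, -38.0]

print(fors_d(natural, shuffled))   # difference in kcal/mol
print(z_score(natural, shuffled))  # dimensionless
```

Under these assumptions the two measures always share a sign; the Z-score simply expresses the base-order contribution in units of the variation among shuffled sequences.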

End Note (Aug 2013)

Selection acting at synonymous coding positions is becoming more widely recognized, even for the fruit fly genome (Lawrie et al. 2013). A preprint on arXiv by Zanini and Neher (2013) is of high relevance to the above, which it cites. I recently placed a historical review of the field at arXiv (Forsdyke 2013). The title of a paper by Mayrose et al. (2013) puts the case quite succinctly!
Forsdyke DR (2013) Role of HIV RNA structure in recombination and speciation. http://arxiv.org/abs/1305.2132
Lawrie DS, Messer PW, Hershberg R, Petrov DA (2013) Strong purifying selection at synonymous sites in D. melanogaster. PLOS Genetics 9, e1003527.
Mayrose I, et al. (2013) Synonymous site conservation in the HIV-1 genome. BMC Evolutionary Biology 13:164.
Zanini F, Neher RA (2013) Deleterious synonymous mutations hitchhike to high frequency in HIV env evolution. http://arxiv.org/pdf/1303.0805.pdf

This page was established circa 1998 and last edited on 01 September 2014 by Donald Forsdyke