Amino acids as placeholders

Most organisms have many small proteins and few very large ones. For an individual species, in plots of the base compositions of each protein-encoding region (ORF) against the corresponding lengths (kilobases), the multiple data points are distributed as rightward-pointing arrowheads.^[12] Figure 1 shows this for 3772 genes of P. falciparum, each point corresponding to an individual gene. Many features of the distributions can be captured by first order linear regression analysis (Figures 1a, b). Although small proteins dominate the statistics, points at the tips of arrowheads that correspond to long proteins usually fit close to regression lines. Second order regressions (not shown) offer little improvement.
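As a minimal sketch of how the plotted quantities might be obtained, the Python below derives GC% and AG% at each codon position from a table of codon counts. The function names (codon_counts_from_orf, position_composition) and the toy ORF are illustrative assumptions; they are not the programs used in this study.

```python
from collections import Counter


def codon_counts_from_orf(seq):
    """Count in-frame codons of an ORF (hypothetical helper; U is read as T)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def position_composition(codon_counts):
    """Return GC% and AG% for codon positions 1-3 and for all positions.

    codon_counts maps each codon to its count, as in a codon usage table.
    """
    total = sum(codon_counts.values())
    result = {}
    for label in ("GC", "AG"):
        per_position = []
        for pos in range(3):
            hits = sum(n for codon, n in codon_counts.items()
                       if codon[pos] in label)
            per_position.append(100.0 * hits / total)
        for pos, value in enumerate(per_position, start=1):
            result[f"{label}{pos}"] = value
        # Each position holds the same number of bases, so the overall
        # value is the plain mean of the three positional values.
        result[label] = sum(per_position) / 3.0
    return result


toy_orf = "ATGGCCAAATAA"  # illustrative sequence only
composition = position_composition(codon_counts_from_orf(toy_orf))
length_kb = len(toy_orf) / 1000.0  # abscissa value for the plot
```

Each gene then contributes one point (length_kb, composition value) to the relevant panel.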

Fig. 1. Variation of the base composition at different codon positions with ORF length in 3772 genes of the malaria parasite P. falciparum. Points were fitted to first order linear regression lines (r² = adjusted square of the correlation coefficient; Y₀ = intercept at the ordinate; P = probability that the slope is not significantly different from zero; seE = standard error of the estimate, which provides an index of the dispersion of points about the regression line and equals the standard error of the mean when slopes are zero). GC1, GC2, GC3, AG1, AG2 and AG3 refer to the base compositions (GC% or AG%) at different codon positions.
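The regression statistics named in the legend can be sketched as follows. This is an ordinary least-squares fit written out by hand, under the assumption that "first order linear regression" means a straight-line fit of composition (%) on ORF length (kb); the function name and the four data points are hypothetical. The probability P is omitted because it requires the t distribution.

```python
import math


def first_order_regression(x, y):
    """Least-squares line y = Y0 + slope * x with summary statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx                  # % per kb
    y0 = mean_y - slope * mean_x       # intercept at the ordinate
    ss_res = sum((yi - (y0 + slope * xi)) ** 2 for xi, yi in zip(x, y))

    r2 = 1.0 - ss_res / syy
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)  # adjusted r squared
    see = math.sqrt(ss_res / (n - 2))  # standard error of the estimate
    return {"slope": slope, "Y0": y0, "r2": r2, "adj_r2": adj_r2, "seE": see}


# Hypothetical points: ORF length (kb) against GC%.
lengths_kb = [1.0, 2.0, 3.0, 4.0]
gc_percent = [29.7, 29.1, 28.7, 28.5]
fit = first_order_regression(lengths_kb, gc_percent)
```

With these made-up points the fitted slope is -0.4 %/kb and the intercept is 30%.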

As protein lengths increase there is a significant decline in GC% (a slope value of -0.4%/kb), which accounts for 6% of the variation between genes (r² = 0.057), and a significant increase in AG% (a slope value of 0.28%/kb), which accounts for 3% of the variation (r² = 0.029). These values indicate that the P. falciparum genome has been under downward GC-pressure (i.e. away from base equifrequency and towards low GC% values), and/or upward AG-pressure (i.e. away from base equifrequency and towards high AG% values). The "and/or" indicates that both pressures may be operating independently. However, there tends to be a reciprocal relationship between AG% and GC%, largely due to interchanges between A and C.^[11,16] Thus, an increase in AG% ("purine-loading") can be at the expense of C, which is traded for A. So, if G% is constant, GC% will tend to decrease.
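The arithmetic behind the reciprocal relationship can be illustrated with a small sketch. The base counts are hypothetical and the helper names are assumptions; the point is only that exchanging C for A, with G and T held constant, raises AG% and lowers GC% by the same amount.

```python
def gc_and_ag_percent(counts):
    """Return (GC%, AG%) from a dict of base counts."""
    total = sum(counts.values())
    gc = 100.0 * (counts["G"] + counts["C"]) / total
    ag = 100.0 * (counts["A"] + counts["G"]) / total
    return gc, ag


def trade_c_for_a(counts, k):
    """Replace k C bases with A bases, leaving G and T unchanged."""
    traded = dict(counts)
    traded["C"] -= k
    traded["A"] += k
    return traded


before = {"A": 30, "C": 20, "G": 20, "T": 30}   # hypothetical sequence
after = trade_c_for_a(before, 5)

gc_before, ag_before = gc_and_ag_percent(before)  # 40.0, 50.0
gc_after, ag_after = gc_and_ag_percent(after)     # 35.0, 55.0
```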

Following, or as part of, an indel event, codons for putative placeholder amino acids would have appeared in a sequence.^[23] A codon might have satisfactorily contributed to a base composition pressure immediately, or further mutation might have been required. Were base composition pressures accommodated without affecting the nature of encoded putative placeholder amino acids (changes primarily involving third codon positions), or were distinctive amino acids involved (changes primarily involving first and second codon positions)? For example, the amino acids whose single-letter designations spell "GARP" have codons whose first and second codon positions maximally contribute to upward GC-pressure. Similarly, the "GREK" amino acids can contribute to upward AG-pressure. Whereas Figures 1a,b include all codon positions, Figures 1c-h relate to individual codon positions. Since the slope values (Figs. 1a,b) are largely contributed by first and second codon positions (Figs. 1c-h), the changes in base composition with length are likely to have involved distinctive amino acids, the codons of which were either directly inserted into sequences, or were derived by mutation from the originally inserted codons.
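The "GARP" and "GREK" groupings can be checked against the standard genetic code with a short sketch. Two operational definitions are assumed here for illustration: a "GARP" amino acid has at least one codon with G or C at both the first and second positions, and a "GREK" amino acid has at least one codon composed entirely of purines. The names below are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (DNA alphabet).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def amino_acids_where(predicate):
    """Amino acids having at least one codon that satisfies predicate."""
    return {aa for codon, aa in GENETIC_CODE.items()
            if aa != "*" and predicate(codon)}


# G or C at both amino acid-determining positions.
gc_doublet = amino_acids_where(lambda c: c[0] in "GC" and c[1] in "GC")

# Purines (A or G) at all three positions.
all_purine = amino_acids_where(lambda c: all(base in "AG" for base in c))

print(sorted(gc_doublet))   # ['A', 'G', 'P', 'R']
print(sorted(all_purine))   # ['E', 'G', 'K', 'R']
```

Note that arginine (R) appears in both sets because it is encoded by both CGN and AGR codons.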

In the case of GC%, all three positions contribute to the decline in GC% with ORF length, but the slope is greatest for second codon positions and 6% of the variation (r² = 0.061) can be accounted for in terms of length differences (Fig. 1e). In the case of AG%, the contributions of first and second codon positions to the increase in AG% with ORF length are partly countermanded by the contributions of third codon positions (Fig. 1h). Nevertheless, the major contribution is from second codon positions where 11% of the variation (r² = 0.108) can be accounted for in terms of length differences (Fig. 1f).

RNY-Pressure

Why were third codon position changes either meagre (GC%; Fig. 1g) or of a type that would countermand the non-synonymous mutation trend (AG%; Fig. 1h)? Other, even stronger, evolutionary pressures may have been operative. The inflexibility of third codon positions with respect to GC% differences is considered below. In the case of AG% differences, the negative slope in the case of third codon position bases suggests the operation of RNY-pressure - namely the translational pressure for third bases of codons to be pyrimidines (Y) rather than purines (R).^[16] Delays in elongation or termination of proteins tend to stall ribosomes on mRNAs. Since stalling of protein synthesis on long polysomes (i.e. long mRNAs) would sequester more ribosomes than stalling on short polysomes (i.e. short mRNAs), thus potentially delaying overall protein synthesis, then genes corresponding to long ORFs might be particularly susceptible to RNY-pressure in organisms where the rate of protein synthesis was limiting growth rate.^[24]

Given the many data points corresponding to small ORFs and the generally modest slope values, intercepts at the ordinate (Y₀ values) can provide an indication of overall base composition at each codon position. Thus, whereas the first and second codon positions contribute greatly to downward GC-pressure (Y₀ = 33.6% and 25.5% respectively), the third codon positions make an even greater contribution (Y₀ = 18.4%); all three negative slopes (Figs. 1c, e, g) show that this trend increases with gene length. Thus, it appears that, while small genes respond to downward GC-pressure, large genes have the capacity to respond more. Most organisms have average GC% values within the range 20% to 80%, so the extremely low Y₀ value and low slope value for third codon positions (Fig. 1g) suggest that the sequences are approaching functional AT-saturation and gene length cannot facilitate much of a further response to downward GC-pressure.

Consistent with RNY-pressure being operative (first codon position is a purine, second codon position is any base, third codon position is a pyrimidine), first and second codon positions make the greatest contributions to upward AG-pressure (Y₀ = 65.7% and 56.3% respectively; Figs. 1d, f); again, this is a trend that increases with gene length, so that large genes have the capacity to respond more to this pressure. The third codon positions of many genes are in slight purine excess (Y₀ = 51.3%), but long genes tend to pyrimidine-load (negative slope; Fig. 1h) so that they would be in slight pyrimidine excess (in keeping with a special susceptibility of codons in long ORFs to RNY-pressure).
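One simple way to quantify conformity to the RNY pattern is sketched below: the fraction of in-frame codons whose first base is a purine and whose third base is a pyrimidine. This measure and the function name are assumptions made for illustration; the analysis in the text rests on AG% at individual codon positions, not on this statistic.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def rny_fraction(orf):
    """Fraction of in-frame codons matching purine-any-pyrimidine (RNY)."""
    seq = orf.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    matches = sum(1 for codon in codons
                  if codon[0] in PURINES and codon[2] in PYRIMIDINES)
    return matches / len(codons)


# Illustrative ORF fragment: GCC and AAT match RNY; ATG and TTT do not.
print(rny_fraction("GCCAATATGTTT"))  # 0.5
```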

Fig. 2. Variation of the base composition at different codon positions with ORF length in 179 genes of the malaria parasite P. vivax. For details please see the legend to Figure 1.

Plasmodium vivax

Another malaria parasite, P. vivax, appears to have been under less extreme GC-pressure, but under upward AG-pressure of the same order as P. falciparum (see Y₀ values in Figs. 2a,b). Although relatively few genes were available for P. vivax in the CUTG, significant increases were observed in both GC% and AG% with increasing length of sequence (slope values of 0.7%/kilobase for GC% and 0.5%/kilobase for AG%; Figs. 2a,b). In contrast to P. falciparum, mainly the first and third codon positions are involved, and the primary amino acid-determining second codon positions play a minor role (Figs. 2c-h). For third codon positions there is a slight purine excess (Y₀ = 50.7%), which long genes support (positive slope; Fig. 2h); thus in P. vivax there is not a general tendency to pyrimidine-load third codon positions. In this organism, which tends toward base compositional equifrequency, purine-loading trumps RNY-pressure.

In P. falciparum the base of the arrowhead is very narrow in the case of the GC% of third codon positions (Fig. 1g), and this corresponds to a small standard error of the estimate (seE). Individual genes in P. falciparum vary little from each other in third codon position GC% values. In contrast, in P. vivax the base of the arrowhead for the GC% of third codon positions is very broad, corresponding to a much larger seE (Fig. 2g). Thus, individual genes in this organism can vary greatly from each other in third codon position GC% values, and it would be expected that an extreme response to base composition pressures, namely insertion of codons for placeholder amino acids, would be less necessary than in P. falciparum.

The great differences between the two Plasmodium species in third codon position base compositions can be displayed by plotting the base compositions of individual codon positions against overall base compositions.^[16,25] In general, the genes of P. falciparum have a very low GC% (Fig. 3a), and the genes of P. vivax have an intermediate GC% (Fig. 3b), but both species show the same order of purine-loading, and ORFs of AG% less than 50 (i.e. pyrimidine-loading) achieve this largely by virtue of third codon position values (see black triangles in Figs. 3c,d). As a function of genic GC%, third codon position GC% values of P. falciparum change little (low slope value; Fig. 3a), whereas third codon position GC% values of P. vivax change greatly (high slope value; Fig. 3b).

Fig. 3. Variation of the base compositions at different codon positions with the overall base compositions of the corresponding genes. The data are from P. falciparum as in Figure 1 (Figs. 3a, c), and from P. vivax as in Figure 2 (Figs. 3b, d). First order regression lines are shown with numbers indicating the corresponding codon positions. Symbols for different codon positions are as in Figures 1 and 2: open circles, first codon positions; grey filled squares, second codon positions; black-filled triangles, third codon positions.

These data affirm that, in species under extreme GC-pressure (which may be upwards or downwards), third codon position GC% values tend to remain constant, serving the needs of the species rather than of individual genes, whereas in species with intermediate GC% values third codon position GC% values are more at liberty to vary, thus serving the needs of individual genes.^[21] Accordingly, in P. falciparum the burden of responding to GC-pressure to serve the needs of individual genes rests primarily on first and second codon positions (Fig. 3a), so leading to changes in the nature of encoded amino acids. In P. vivax, third codon positions appear free to adopt the burden, and there is less pressure for amino acid change (low slopes for first and second codon positions; Fig. 3b).

Values for AG% show similar, but not identical trends to those for GC% (Figs. 3c,d). In P. falciparum the slope value for third codon positions is low, perhaps reflecting a role of RNY pressure as suggested above. Thus, again, the burden of responding to AG-pressure rests primarily on first and second codon positions, so leading to changes in the nature of encoded amino acids. In P. vivax, the burden is largely carried by first and third codon positions. So there is less pressure for amino acid change. Thus P. vivax can respond to pressures for changes in base composition without necessarily changing the nature of its encoded amino acids, whereas for P. falciparum, this is mandatory.

As will be shown below to usually apply to prokaryotes, there is a positive correlation of ORF length with GC% among genes of the eukaryote P. vivax (Fig. 2a), largely due to third codon position values (Fig. 2g). However, for many eukaryotic sequences the correlation is negative,^[26-28] as is the case with P. falciparum (Fig. 1a).

Fig. 4. Variation of the base composition at different codon positions with ORF length in 1710 genes of the eubacterium Haemophilus influenzae. For details please see the legend to Figure 1.

Haemophilus influenzae

Plots similar to Figures 1 and 2 were constructed from the completed genome sequences of 144 prokaryotic species that covered a wide range of genomic GC% values and optimum growth temperatures. For example, Figure 4 shows the plot for the eubacterium Haemophilus influenzae, which has an optimum growth temperature of 37°C and appears to have been under weak downward GC-pressure (Y₀ = 37.3%; Fig. 4a). However, long genes countermand this. GC% values increase with ORF length and this is due to the values of first and second codon positions, implying changes in encoded amino acids. Approximately 3% of the variation in GC% at these codon positions can be explained on the basis of gene length (r² values of 0.027 and 0.028; Figs. 4c, e). Although third codon positions strongly serve downward GC-pressure (Y₀ = 28.7%; Fig. 4g), third codon positions in long proteins do not significantly exceed those in short proteins in this respect (slopes not significantly different from zero). Thus, it appears that short genes suffice to satisfy downward GC-pressure, so long genes are free to follow other imperatives. In contrast to GC% values, AG% values decrease with ORF length (Fig. 4b) and this is largely due to the AG% values of third codon positions (Fig. 4h), implying no changes in encoded amino acids, and an increasing influence of the need for efficient translation (RNY-pressure) as protein length increases.

Fig. 5. Variation of base compositions with ORF lengths (slopes of plots as shown in Fig. 4a,b), as a function of genome base composition (a, GC%; b, AG%). Data are from 144 complete prokaryotic genomic sequences (125 eubacteria and 19 archaebacteria). Black-filled circles represent values of GC% / ORF length (slopes as in Fig. 4a). Grey-filled circles represent values of AG% / ORF length (slopes as in Fig. 4b). All species are included in the displayed statistics. Two-letter abbreviations refer to points corresponding to species which deviate widely from the trends: AP, Aeropyrum pernix; CT, Chlorobium tepidum; PH, Pyrococcus horikoshii; TP, Treponema pallidum; TW, Tropheryma whipplei.

Length Effects Depend on Species Base Compositions

In Figures 4a and 4b there are two slope values (GC%/kb = 1.1 and AG%/kb = -0.4), which summarize all codon positions of an individual prokaryote. In Figure 5 the two corresponding slope values for the complete genomes of each of 144 prokaryotic species are summarized in terms of either the GC% (Fig. 5a) or the AG% (Fig. 5b) of each species. Thus, whereas each point in Figure 4 corresponds to a gene, each point in Figure 5 corresponds to a species. With five exceptions (Aeropyrum pernix, Legionella pneumophila, Treponema pallidum, Tropheryma whipplei, Ureaplasma parvum), GC% slope values (%/kb) are positive, with an average of 1.03 %/kb (standard error 0.06; P < 0.0001 for the hypothesis that the average is not significantly different from zero). With two major exceptions (Aeropyrum pernix, Pyrococcus horikoshii), AG% slope values are close to zero or negative, with an average of -0.26 %/kb (standard error 0.05; P < 0.0001). Thus, relative to short ORFs, long ORFs tend to contribute positively to GC% and negatively to AG%. Do the extents of these contributions vary with species base composition?
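An average slope with its standard error, of the kind quoted above, might be computed as in the following sketch. The four slope values are hypothetical and the function name is an assumption; the t statistic is included only to show how a departure of the average from zero could be assessed, and converting it to a probability would require the t distribution with n - 1 degrees of freedom.

```python
import math
import statistics


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean, t statistic versus zero)."""
    n = len(values)
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    t_statistic = mean / standard_error
    return mean, standard_error, t_statistic


# Hypothetical per-species GC% slopes (%/kb), one value per genome.
species_slopes = [0.8, 1.0, 1.2, 1.0]
mean, se, t = mean_with_standard_error(species_slopes)
```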

Slope values for GC% (GC%/kb) increase as a function of GC% (Fig. 5a), and decrease as a function of AG% (Fig. 5b). Thus, the contribution to GC% of long ORFs increases as overall GC% increases. The higher the GC% the more long ORFs contribute to the GC% value. On the other hand, the negative slope values for AG% (AG%/kb) do not vary significantly with either overall GC% or overall AG%. In other words, relative to ORFs for short proteins, ORFs for long proteins selectively pyrimidine-load irrespective of the base composition of the organism.

Fig. 6. Variation of the base composition at different codon positions with ORF lengths (slopes of plots as shown in Figures 4c-h), as a function of genome base composition (a, c, e, GC%; b, d, f, AG%). Other details are as in Figure 5.

Length Effects Depend on Codon Positions

In Figure 4 there are six slope values for different codon positions (Figs. 4c-h). In similar fashion to Figure 5, these are displayed for the 144 prokaryotic species in Figure 6. Taking all codon positions into account, slope values for AG% (AG%/kb) are generally negative (Fig. 5). However, indicating the importance of RNY-pressure, values at first codon positions are generally positive (Figs. 6a, b; average = 0.55, standard error 0.07; P < 0.0001), values at second codon positions are generally slightly negative (Figs. 6c, d; average = -0.33, standard error 0.05; P < 0.0001), and values at third codon positions are generally strongly negative (Figs. 6e, f; average = -1.01, standard error 0.06; P < 0.0001). Again, whatever the codon position, these values do not vary significantly with species base compositions.

Although, taking all codon positions into account, slope values for GC% (GC%/kb) are positive (Fig. 5), in the case of second codon positions, values become zero or negative in organisms with a high overall GC% (Fig. 6c) and a low overall AG% (Fig. 6d). The converse applies, and more dramatically so, in the case of third codon positions (Figs. 6e, f). Thus, the increasing role of gene length in contributing positively to the overall GC% of a prokaryote as species GC% increases (Fig. 5a), is dealt with in different ways by different prokaryotic species. In low and medium GC% species, increases in open reading frame GC% as the corresponding encoded proteins increase in length are largely accounted for by the base compositions at first and second codon positions. In high GC% species first, and especially third, codon positions play this role. So in low and medium GC% prokaryotes the increase in GC% in long open reading frames is contributed by distinct placeholder amino acids with GC-rich first and second codon positions (amino acid-determining). In high GC% prokaryotes the further increase in GC% in long open reading frames allows flexibility in the nature of placeholders, which mainly contribute by virtue of the GC-richness of their first and third codon positions (non-amino acid-determining).

In Figure 6c the downward slope crosses the abscissa at a GC% value of 64, whereas in Figure 6e the upward slope crosses the abscissa at a GC% value of 35. These values approximate to those previously identified (68 and 38, respectively) as the GC% values at which there are transitions from primarily serving genic demands to primarily serving species demands.^[21]
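The point at which a regression line crosses the abscissa follows directly from its slope and intercept, as in this sketch. The numerical values are hypothetical, chosen only so that the crossing falls at a genome GC% of 64; they are not the fitted parameters of Figure 6c.

```python
def abscissa_crossing(slope, intercept):
    """x value at which the line y = intercept + slope * x equals zero."""
    if slope == 0:
        raise ValueError("a horizontal line does not cross the abscissa")
    return -intercept / slope


# Hypothetical downward-sloping line: y is a GC% slope value (%/kb)
# for second codon positions and x is genome GC%.
crossing = abscissa_crossing(slope=-0.05, intercept=3.2)
```

Species to the left of the crossing would have positive values and species to the right negative values.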

Incomplete Genomes Confirm Complete

For reasons given above (see Data), a further 377 sequences from incompletely sequenced prokaryotic genomes were added to the above sequences from 144 completely sequenced genomes. The various linear regression parameters, as in Figures 5 and 6, are shown in Table 1. While there are some differences from the results obtained when only completely sequenced genomes were studied, the essential observations are confirmed, suggesting that they apply to prokaryotes in general. However, in the case of GC% slopes the slope values for first and second codon positions tend to cancel out the slope values for third codon positions, so that slope values for all codon positions are not significantly different from zero (cf. Fig. 5a). In these studies no major differences were detected between archaebacteria and eubacteria.

Table 1. Variation of base composition at different codon positions. Statistical parameters for different codon positions of plots of slopes of ORF base compositions versus ORF lengths against genome GC% and AG% for 521 prokaryotes.
Codon position	Genome GC%						Genome AG%
	GC% Slopes			AG% Slopes			GC% Slopes			AG% Slopes
	r²	Slope	P	r²	Slope	P	r²	Slope	P	r²	Slope	P
All	<0.0001	0.005	0.38	<0.0001	-0.003	0.65	<0.0001	-0.013	0.65	<0.0001	0.01	0.74
First	0.021	-0.04	0.0004	<0.0001	-0.012	0.46	0.015	0.168	0.003	<0.0001	0.025	0.75
Second	0.048	-0.039	<0.0001	<0.0001	-0.007	0.65	0.045	0.189	<0.0001	<0.0001	0.052	0.5
Third	0.144	0.094	<0.0001	0.01	0.027	0.013	0.102	-0.397	<0.0001	<0.0001	-0.045	0.41

Protein Length Constrained in Thermophiles

AG% (purine-loading) increases with optimum growth temperature, whereas GC% tends to decrease.^[10-13,17] Since AG-pressure can increase protein lengths by virtue of interdomain insertions of purine-rich simple sequence elements,^[3,9] which should be extruded as loops from protein structures, it would be predicted that homologous proteins would be longer in thermophiles than in mesophiles. However, it is probable that thermophilic proteins achieve a greater compactness than mesophilic proteins by decreasing the length and number of external loops.^[29] Consistent with this, ORFs tend to be shorter in thermophiles than in mesophiles.^[30]

Of the above 144 prokaryotic species for which complete genomic sequences are available, the optimum growth temperatures of 108 were available in the PGTdb. Of the above 521 prokaryotic species for which either complete or incomplete sequences (containing 20 or more ORFs) were available, the optimum growth temperatures of 293 were available in the PGTdb. The slope values (%/kb) as recorded above, were plotted as a function of optimum growth temperatures, which ranged from 0°C to 101.5°C. Because there was an overrepresentation of species with optimum growth temperatures of 37°C or less, slope values were also plotted for a narrower temperature range (38°C to 101.5°C). Some statistical parameters from the first order linear regression plots (not shown) are given in Table 2.

Table 2. Statistical parameters for different codon positions of plots of slopes of ORF base compositions versus ORF lengths against optimum growth temperature of prokaryotes.
		Genomes (Complete)^a						Genomes (Complete + Incomplete)^b
		GC% Slopes			AG% Slopes			GC% Slopes			AG% Slopes
Temperature range (°C)	Codon position	r²	Slope	P	r²	Slope	P	r²	Slope	P	r²	Slope	P
0 - 101	All	0.034	-0.007	0.03	<0.0001	0.001	0.62	0.007	-0.009	0.09	<0.0001	0.004	0.49
	First	<0.0001	-0.004	0.34	<0.0001	0.0005	0.9	<0.0001	-0.011	0.32	<0.0001	-0.009	0.44
	Second	<0.0001	-0.004	0.35	<0.0001	0.002	0.46	<0.0001	-0.005	0.44	0.003	0.012	0.17
	Third	0.015	-0.012	0.11	0.0001	0.002	0.62	0.0009	-0.011	0.26	0.003	0.01	0.18
38 - 101	All	0.436	-0.038	0.0008	0.019	0.021	0.25	0.135	-0.027	0.002	<0.0001	<0.001	0.99
	First	0.07	-0.017	0.13	<0.0001	0.019	0.43	0.035	-0.02	0.08	0.001	-0.015	0.3
	Second	0.02	-0.025	0.25	0.076	0.025	0.12	0.04	-0.023	0.07	<0.0001	<0.001	0.97
	Third	0.236	-0.073	0.01	0.004	0.018	0.31	0.09	-0.04	0.01	0.004	0.015	0.27
^a 108 species, of which 21 have optimum growth temperatures 37 degrees Celsius.
^b 293 species, of which 59 have optimum growth temperatures 37 degrees Celsius.

Whatever the codon position, whatever the temperature range, and whether or not incomplete sequences are included, slopes of plots of AG% against ORF lengths (kb) do not change significantly as optimum growth temperatures increase. This implies that prokaryote thermophiles, while they may purine-load, do not concomitantly increase protein length to accommodate this. Thermophiles purine-load within the constraints of ORF length and would not seem, in general, to employ interdomain regions for this purpose.

For complete genomes, over a broad temperature range slopes of plots of GC% for all codon positions against ORF lengths decreased slightly as optimum growth temperatures increased (r² = 0.034; P = 0.031), but this was not significant when incomplete genomes were included (r² = 0.007; P = 0.09), nor when broken down to individual codon positions. However, over a narrower temperature range (38°C to 101.5°C) the decrease was highly significant, both with complete genomes alone (r² = 0.436; P = 0.0008), and when incomplete genomes were added (r² = 0.135; P = 0.002). All three codon positions contribute to the decrease, with third codon positions usually making the greatest and most significant contribution (r² = 0.236; P = 0.01 for complete genomes; r² = 0.09; P = 0.01 when incomplete sequences were included). The regression lines sloped downwards to cross the abscissa at a point corresponding to a growth temperature of approximately 100°C (data not shown). Thus, species with low growth temperatures can respond to GC-pressure by increasing ORF lengths, possibly employing interdomain regions for "GC-loading" (e.g. Figs. 4a,c,e), but this is countermanded in species with high growth temperatures. In other words, thermophiles tend not to load GC, perhaps again reflecting constraints on loop formation and ORF lengths at high temperatures, but also perhaps reflecting the reciprocal relationship between AG-pressure (high in thermophiles) and GC-pressure.

Base Composition and Chain-Terminating Codons

Since chain-terminating codons are GC-poor, in a random GC-rich sequence there should be a deficit of chain-terminating codons (UAA, UAG, or UGA) and hence longer potential ORFs.^[31] Indeed, the positive correlation between GC% and ORF length noted for many prokaryotes (Fig. 4a)^[27,32] would seem consistent with this. However, although the extent of the positive correlation increases with species GC%, the correlation is general, occurring both in low GC% and high GC% species (Fig. 5a). Thus, deficit of a small subset of GC-poor chain-terminating codons is unlikely to be a major factor in the correlation.
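The expectation for a random sequence can be made concrete with a short sketch. It assumes independent bases with A = T and G = C (a simplification introduced here, not a claim about real genomes) and uses the DNA forms TAA, TAG and TGA of the three chain-terminating codons; the function names are hypothetical.

```python
import math


def stop_codon_probability(gc_fraction):
    """Probability that a random in-frame triplet is TAA, TAG or TGA."""
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must lie between 0 and 1")
    a = t = (1.0 - gc_fraction) / 2.0   # frequency of A and of T
    g = gc_fraction / 2.0               # frequency of G (and of C)
    return t * a * a + t * a * g + t * g * a


def mean_stop_spacing_codons(gc_fraction):
    """Mean number of codons up to and including the first stop codon."""
    p = stop_codon_probability(gc_fraction)
    return math.inf if p == 0 else 1.0 / p


for gc in (0.3, 0.5, 0.7):
    print(gc, round(mean_stop_spacing_codons(gc), 1))
```

Under these assumptions the mean spacing between in-frame stop codons rises from about 13 codons at 30% GC to about 52 codons at 70% GC, which is the direction of the effect described above.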

Placeholder Hypothesis

Long genes, in general, may have such different functions from short genes, in general, that different amino acids are required. For example, in eukaryotes housekeeping genes tend to be shorter than tissue-specific genes; this is attributed to the "functional architectures" of their proteins, which require fewer functional domains.^[33] In prokaryotes, membrane proteins rich in hydrophobic amino acids tend to be shorter.^[32] So, by default, long prokaryotic genes might happen to need, for the optimal functioning of their encoded proteins, more GC-rich codons. Indeed, this would appear consistent with long genes from low or intermediate GC% prokaryotes contributing to increased GC-pressure by virtue of second (amino acid determining) codon positions. However, it does not explain why long genes from GC-rich prokaryotes do not also contribute to increased GC-pressure by virtue of second codon positions (Fig. 6c). On balance, the data support the placeholder hypothesis, rather than the function hypothesis.

But why would distinct amino acids be involved in the case of low and intermediate GC% prokaryotes? Placeholder amino acids with GC-rich first and second codon positions ("GARP") might interfere less with the functions of their proteins than others. For example, beta turns, usually occurring at protein surfaces, often contain proline (P) and glycine (G), which may be accompanied by, or form part of, simple sequence repeats.^[23]

If this argument applies to prokaryotic proteins it might also apply to eukaryotic proteins, and there should again be a positive relationship between ORF length and GC%. However, the relationship is usually negative (Fig. 1).^[26-28] Here purine-loading may be of more importance. In the extreme case of P. falciparum, increased purine-loading with ORF length accounts for 11% of the variation in second codon position purine levels (Fig. 1f) and purine-loading is seen in many simple sequence elements.^[1-3] Thus, purine-loading appears to dominate and, in view of the reciprocal relationship between AG% and GC%, the negative relationship between ORF length and GC% may be secondary to this. Eukaryotic proteins are generally longer than prokaryotic proteins,^[34] their ORFs are preferentially loaded with long purine tracts,^[35] and their simple sequence elements can confer intragenic codon bias.^[9,36] Again, in view of the reciprocal relationship (increased AG% implies decreased GC%), purine-loading might be driving the generally negative relationship between ORF length and GC% in eukaryotes.

Conclusions

Most results of this study are explained in terms of two major factors, base composition pressures and RNY-pressure. While the causes of base composition pressures remain contentious, the pressures themselves can be dealt with as abstract entities. Insertion of placeholder amino acids would be an extreme response to base composition pressures. The best evidence supporting this has derived from organisms that deviate strongly from base equifrequency. In this study less strongly deviant species have also been considered. Despite the distinct possibility that the putative placeholders do not contribute positively to the specific functions of the proteins containing them, in prokaryotes of low and intermediate GC% placeholder-choice is not random. Placeholder amino acids with GC-rich first and second codon positions might interfere less with the functions of their proteins than others. For example, beta turns, usually occurring at protein surfaces, often contain proline and glycine, which may be accompanied by, or form part of, simple sequence repeats.^[23] However, certain prokaryotes (e.g. Aeropyrum pernix) deviate dramatically from the general trend (Fig. 5).^[12] The possibility of annotation errors has been referred to.^[15] Nevertheless, following Bateson's admonition to "treasure your exceptions,"^[37] it is possible that focused studies of these organisms will be highly informative.

Acknowledgements. We thank J. R. Mortimer for programs that extract base compositions from codon usage tables. Queen's University hosts Forsdyke's webpages where partial or full text versions of some of the cited references may be found.

Placed on the Internet in July 2005 and last edited 08 Nov 2020 by Donald Forsdyke