Grantham's Genome Hypothesis

Patterns in codon usage of different kinds of species

Richard Grantham kindly donated this photograph

The genetic code is used differently by different kinds of species. Each type of genome has a particular coding strategy, that is, choices among degenerate bases are consistently similar for all genes therein. This uniformity in the selection between degenerate bases within each taxonomic group has been discovered by applying new methods to the study of coding variability. It is now possible to calculate relative distances between genomes, or genome types, based on use of the codon catalog by the mRNAs therein.

Richard Grantham. Bomber pilot who, after WW2, left the USA (California) and settled in France (Lyons).

Workings of the genetic code

Richard Grantham (1980) Trends in Biochemical Sciences 5, 327-331. (With permission of the author, and copyright permission from Elsevier Science)

The Genome Hypothesis

Correspondence analysis

Variations in Coding Strategy

This is the age of sequences. A few years ago protein sequencing was in vogue, now nucleic acid determinations have moved to the fore. We have 160 messenger sequences in our Nucleic Acid Sequence Bank. Why are all these sequences being determined? What information is in them?

Current evolutionary debates involve sociobiology, neutralism and origins of different kinds of genomes. Sociobiology and neutralism can be seen as opposing themes. The first proclaims that the phenotype and its comportment are the products of gene structure (1,2). But neutralism assigns a minor evolutionary role to molecular changes in the gene (3). As for genome origins, the monophyletic substructure of life has been upset in the last few years by observations on mycoplasmas, bacteria, mitochondria, plastids and viruses (4-6). I believe investigations into the way that the code is exploited in various species can throw light on all these questions. Consequently, my justification for all this sequencing is that nucleic acid sequences reveal how the code is working, or has been worked.

There is of course interaction in each of the above debates between the research methods used and the results found. For example, neutralism has been based on studies of amino acid substitutions and the results have been extrapolated to molecular evolution as a whole. Kimura says that:

".... at the molecular level most evolutionary change and most of the variability within a species are caused not by selection but by random drift of mutant genes that are selectively equivalent" (3).

An independent view of evolution will be exposed here. My evolutionary outlook derives from work with a new kind of methodology, based on nucleic acid sequences, that my colleagues and I have developed in recent years.

The genome hypothesis

We state our main result as a hypothesis because further testing is required to establish its general validity: all genes in a genome, or more loosely genome type, tend to have the same coding strategy. By this we mean they employ the codon catalog similarly; that is, they show similar choices between synonymous codons, or between degenerate bases (those in codon position III). Hence a systematic exploitation of the code's degeneracy, particular to the genome type, is portrayed in each gene sequence. Unlike the picture emerging from studies on proteins with the same method (see below and Refs 7-9), our results with nucleic acids resemble classical systematics by distinguishing groups of like species. For example, the most gross observation is that viruses and mammals have widely separate coding strategies. This is evident by simple comparison of codon frequencies in the two kinds of genes.

Different levels of codon degeneracy for different amino acids

Fig. 1. Degeneracy of the genetic code. Codons are read vertically. Each of the four rows represents a different level of degeneracy (number of codons per amino acid). The 61 amino acid codons are grouped in 20 sets of 1-6 synonymous members. Each six-membered set (sextet) is composed of a quartet and a duet. Thus the code includes 8 quartets and 12 duets, the isoleucine trio and the single codons of methionine and tryptophan, plus the three terminators. With quartet codons, changing the third base cannot affect the amino acid coded.

To eliminate the influence of amino acid frequency on codon frequency, consider only the eight sets of codons called "quartets" (see Fig. 1). Each of these 32 codons belongs to a set of four synonymous triplets in which only the third base varies. Thus a complete choice of bases exists for filling codon position III without changing the resultant amino acid. This simplified approach gives only a partial view of the functioning of the code since there are 29 other amino acid codons, but we have found that the pattern is quite similar to that obtained with all 61 codons (7-9).
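The grouping described above can be checked with a short sketch. It assumes the standard genetic code written in RNA letters; the names `CODE`, `third_base_sets` and `quartet_codons` are illustrative and do not come from the original paper. Sense codons are grouped by their first two bases and the amino acid coded, so a sextet splits into a quartet and a duet, as in Fig. 1.

```python
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks terminators.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def third_base_sets(code):
    """Group sense codons that share their first two bases and their amino acid."""
    sets = defaultdict(list)
    for codon, aa in code.items():
        if aa != "*":
            sets[(codon[:2], aa)].append(codon)
    return sets

sets = third_base_sets(CODE)

# Number of sets of each size: quartets (4), trio (3), duets (2), singles (1).
size_counts = defaultdict(int)
for codons in sets.values():
    size_counts[len(codons)] += 1

# The 32 codons in which any third base gives the same amino acid.
quartet_codons = sorted(c for cs in sets.values() if len(cs) == 4 for c in cs)

print(dict(size_counts))
print(len(quartet_codons), "quartet codons,", 61 - len(quartet_codons), "other sense codons")
```

Grouping by the first two bases, rather than by amino acid alone, is what separates the six-codon amino acids (Leu, Ser, Arg) into the quartet and duet components that the text relies on.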

Frequencies of bases at third positions of quartet codons

Fig. 2. Frequencies of third bases of the 32 quartet codons obtained from all 119 mRNAs combined (see text). Here the same weight is assigned to each codon; previously (see Fig. 1 of Ref. 9) each messenger was weighted equally. The two methods yield similar results; no effect of mRNA length on the choice of degenerate bases has been detected. For identification, reference and codon frequency in each gene see Ref. 7.

Fig. 2 shows the composition of the third bases of these quartets for 119 mRNAs taken together. We see that pyrimidines are generally preferred to purines as degenerate bases. Fig. 3 portrays systematic differences between genome types in filling codon position III. Thus quartet third bases in mammalian messengers contain less A and less U but more C and more G than in mRNAs of any other genome type. Little overlap in coding strategy occurs among individual genes of different genome types (7). The degenerate base choices in each mRNA consistently characterize the genome type of the relevant gene.
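A minimal sketch of how such third-base compositions might be tallied, under the assumption that each messenger is supplied as an in-frame coding sequence. The function names and the two toy sequences are invented for illustration and are not data from the paper. Both weightings mentioned in the caption of Fig. 2 are shown: one pools all quartet codons, the other averages per-messenger frequencies.

```python
from collections import Counter

# First two bases of the eight quartets (the third base is fully degenerate).
QUARTET_PREFIXES = {"CU", "GU", "UC", "CC", "AC", "GC", "CG", "GG"}

def quartet_third_bases(cds):
    """Count third bases of quartet codons in one in-frame coding sequence."""
    cds = cds.upper().replace("T", "U")
    counts = Counter()
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon[:2] in QUARTET_PREFIXES and codon[2] in "ACGU":
            counts[codon[2]] += 1
    return counts

def codon_weighted(seqs):
    """Pool all quartet codons, so longer messengers contribute more."""
    total = Counter()
    for s in seqs:
        total += quartet_third_bases(s)
    n = sum(total.values())
    return {b: total[b] / n for b in "ACGU"}

def messenger_weighted(seqs):
    """Average the per-messenger frequencies, so each messenger counts once."""
    profiles = []
    for s in seqs:
        c = quartet_third_bases(s)
        n = sum(c.values())
        if n:
            profiles.append({b: c[b] / n for b in "ACGU"})
    return {b: sum(p[b] for p in profiles) / len(profiles) for b in "ACGU"}

# Toy sequences invented for illustration only.
toy = ["AUGGCUCCUGGUUAA", "AUGGCCGUGCUGACCCGCUAA"]
print(codon_weighted(toy))
print(messenger_weighted(toy))
```

With real data the two weightings would be expected to agree closely if, as the caption reports, messenger length has no effect on the choice of degenerate bases.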

Different choices of third bases in different types of genome

Fig. 3. The composition of quartet third bases according to genome type. Six examples of genome type among the 119 mRNAs are shown. PAB, papova- + adeno- + hepatitis B viruses; mt, mitochondria. The mRNA in each genome type are described in Ref. 7. ds = double-stranded; ss = single-stranded.

In the above comparison, mRNAs for mouse immunoglobulins (Ig) were excluded from the data for other mammalian mRNAs. Ig mRNAs use a sub-strategy in which an average of only 47.3% C+G is found in quartet position III while the other mammalian messengers show 70.9%. Also mouse Ig mRNAs use three times as much A as other mammalian messengers. The frequencies of C and U are close for general mammalian mRNAs and Ig mRNAs; the difference mainly lies in the use of purines. Thus in quartet position III, Ig mRNAs have a G/A ratio of only 0.6 while other mammalian messengers have a ratio of 4.0. The Ig coding strategy, unlike that of other mouse mRNAs, curiously resembles that of papova viruses (6-9). Of all the sequences so far obtained, mammalian messengers (excluding Ig messengers) repeatedly exhibit the highest C+G content and the lowest A content in degenerate bases (7).
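The two summary figures used in this comparison, the C+G percentage and the G/A ratio of quartet third bases, are simple to derive from base counts. The sketch below uses invented counts chosen only to make the arithmetic easy to follow; they are not the paper's data, and `summarize` is a hypothetical helper name.

```python
def summarize(third_base_counts):
    """Return (C+G percentage, G/A ratio) for counts of quartet third bases."""
    total = sum(third_base_counts.values())
    cg_percent = 100.0 * (third_base_counts["C"] + third_base_counts["G"]) / total
    g_over_a = third_base_counts["G"] / third_base_counts["A"]
    return cg_percent, g_over_a

# Invented counts per 100 quartet codons (illustration only).
# Both sets have similar C and U; they differ mainly in the purines.
toy_general = {"A": 10, "C": 30, "G": 40, "U": 20}
toy_ig_like = {"A": 30, "C": 30, "G": 18, "U": 22}

for label, counts in (("general", toy_general), ("Ig-like", toy_ig_like)):
    cg, ga = summarize(counts)
    print(f"{label}: C+G = {cg:.1f}%, G/A = {ga:.1f}")
```

In this toy case a threefold rise in A with a matching fall in G is enough to move the G/A ratio from 4.0 to 0.6 while C and U stay nearly constant, which is the kind of shift the paragraph describes.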

Another aspect visible in Fig. 3 is the variation in use of A versus U, and C versus G. Five times more U than A appears in quartet third bases of mRNAs of single-stranded DNA phages (this of course increases the contrast between U and A in Fig. 2, since 35 of the 119 total mRNA sequences come from ssDNA phages). Conversely, all groups show fairly even use of C and G except Ig, whose mRNAs have over twice as much degenerate C as G.

Correspondence analysis provides gene distances

A better image of the genome hypothesis is to be had by the simultaneous consideration of all 61 codons in the total sample of mRNA sequences. The best tool we have found for demonstrating this is correspondence analysis, which is a multivariate method adaptable to assessing biological variability and allowing graphical representation of the quantitative results (10,11). The analysis identifies and measures the importance of the various factors in codon usage that separate mRNAs. Variation of the frequency, among all mRNAs, of each of the 61 codons is simultaneously calculated; the results position each messenger as a point in a multidimensional space. Then the data are projected on to a plane whose horizontal and vertical axes correspond to the first and second most important factors, respectively, in creating distance between the mRNAs. Grouping is achieved by the automatic classification method of Fages (12), which is equivalent to minimizing the variance in each class of a chosen number of classes. Some distortion in the projection is inevitable but this does not affect the classification. Two neighboring mRNAs in the plane can belong to different classes if the perpendicular distance between them is great. This means that factors other than the first two are important in distinguishing their coding strategies.
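Correspondence analysis places rows of a frequency table in a space where distances between them are chi-square distances between their profiles. The sketch below computes only that underlying distance between messengers; it does not extract the factor axes, and it does not reproduce the classification method of Fages. The function name and the toy table are assumptions made for illustration.

```python
from math import sqrt

def chi_square_distances(table):
    """table: dict name -> list of codon counts (same codon order in every row).
    Returns dict (name_a, name_b) -> chi-square distance between row profiles."""
    names = list(table)
    ncol = len(next(iter(table.values())))
    grand = sum(sum(row) for row in table.values())
    # Column masses: overall relative frequency of each codon in the sample.
    col_mass = [sum(table[n][j] for n in names) / grand for j in range(ncol)]
    # Row profiles: codon frequencies within each messenger.
    profile = {n: [x / sum(table[n]) for x in table[n]] for n in names}
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d2 = sum((profile[a][j] - profile[b][j]) ** 2 / col_mass[j]
                     for j in range(ncol) if col_mass[j] > 0)
            dist[(a, b)] = sqrt(d2)
    return dist

# Invented counts for four codons (GCA, GCC, GCG, GCU) in three toy messengers.
toy = {
    "gene1": [2, 10, 6, 2],
    "gene2": [1, 5, 3, 1],   # same profile as gene1, half the length
    "gene3": [8, 2, 1, 9],
}
d = chi_square_distances(toy)
for pair, value in d.items():
    print(pair, round(value, 3))
```

Because the distance is computed on profiles, two messengers of different length but identical relative codon usage coincide, while dividing by the column mass keeps rare codons from being swamped by common ones.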

Correspondence analysis of codon choices by different groups of species

Fig. 4. Correspondence analysis on codon frequencies in 119 genes. This figure results from simultaneous analysis of the frequency of each of the 61 codons in each messenger (Ref. 7). Grouping is by automatic classification (Ref. 12). Of the eight total classes only the seven closed ones appear here. The eighth class (in the space between these seven classes) is a heterogeneous group including some Ig, sea urchin histone, single-stranded (ss) RNA virus and other genes, totalling 30 mRNAs. Not every mRNA corresponding to a given label is found in the class bearing that label. Each label reflects the taxonomic origin of the majority of the sequences in that class. The most 'contaminated' group is that labelled PAB (papova- + adeno- + hepatitis B viruses). For details see Ref. 7. The horizontal axis has been found to correspond to the C+G content of the degenerate bases (see text).

Results of correspondence analysis on 119 mRNAs appear in Fig. 4, in which separation of classes (delimited by automatic classification) is highly correlated with genome types. Two new groups, having too few total codons for inclusion in Fig. 3, are yeast mitochondrial, and yeast and slime mold genes. The seven Ig messengers lie between the upper right tip of the mammalian group and the top of the PAB group (papova, adeno and hepatitis B viruses). The double-stranded DNA bacteriophages occur mainly between bacteria and the large single-stranded DNA class. However, neither the Ig nor double-stranded DNA phage mRNAs constitute a separate class in this analysis. Messengers furthest to the left contain 88-90% C+G in quartet position III while those furthest to the right have only 3-10%. There is little contamination of classes by genes of a different genome type (see Ref. 7 for identification and placement of each of the 119 mRNAs). This approach does not simply reproduce classical systematics; the figure contains new information on evolutionary mechanisms and paths. Nevertheless, it does sort genes according to genomic origin; therefore, it demonstrates that evolutionary change in genes is related to the differentiation of taxa.

We wondered, of course, how much the mRNA correspondence analysis pattern of Fig. 4 depended on the proteins coded. A correspondence analysis coupled with automatic classification was therefore done on the frequencies of the 20 amino acids in the 119 proteins; this analysis is shown in Fig. 5. No correlation between Figs 4 and 5 has been found. Indeed, we have not been able to account for placement of the proteins in Fig. 5. Viral, bacterial, mammalian and other proteins often lie in the same class. Every one of the seven classes of Fig. 5 includes proteins of viruses and at least one other genome type. We conclude that mRNA sequences contain other information than that necessary for coding proteins. This other "genome-type" information is mainly in the degenerate bases of the sequence. Consequently, it is largely independent of the amino acids coded (see Refs 7-9).
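The independence of degenerate-base information from the protein coded can be illustrated with a toy case. The two messengers below are invented; they encode the same short peptide with different synonymous codons, so their amino acid counts are identical while their third-base counts differ. The helper names are assumptions, and the standard genetic code is assumed.

```python
from collections import Counter

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks terminators.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def codons(cds):
    """Split an in-frame coding sequence into triplets."""
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]

def amino_acid_counts(cds):
    """Amino acid composition of the protein coded (terminator ignored)."""
    return Counter(CODE[c] for c in codons(cds) if CODE[c] != "*")

def third_base_counts(cds):
    """Composition of codon position III over all sense codons."""
    return Counter(c[2] for c in codons(cds) if CODE[c] != "*")

# Two invented messengers for the same peptide, Met-Ala-Leu-Gly-Thr-Pro.
gc_rich = "AUG GCC CUG GGC ACC CCC UAA".replace(" ", "")
au_rich = "AUG GCU CUA GGU ACA CCA UAA".replace(" ", "")

print(amino_acid_counts(gc_rich) == amino_acid_counts(au_rich))
print(third_base_counts(gc_rich), third_base_counts(au_rich))
```

An analysis based on amino acid frequencies cannot tell these two messengers apart, whereas one based on codon frequencies separates them, which is the contrast drawn between Figs 4 and 5.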

Correspondence analysis of amino acid frequencies

Fig. 5. Correspondence analysis on amino acid frequencies in the 119 proteins. Simultaneous analysis of frequencies of the 20 amino acids followed by automatic classification gave these seven closed classes of proteins (see Ref 7). Classes here cannot be characterized by genome type. The group furthest to the right contains viral, bacterial and mammalian proteins. The group furthest to the left is the most homogeneous; it represents four viral and seven slime mold genes. The top central class with diagonal lines carries viral, bacterial, yeast and mammalian proteins. The bottom group with vertical lines has viral and Ig proteins. Of the three remaining smaller classes, the bottom-most includes viral, bacterial and yeast gene products; the dotted one includes viral, yeast, chicken and mammalian proteins, and the third group includes products of viral, bacterial and mammalian genes. The mRNA classes in Fig. 4 are much 'purer' in genomic origin, and relative distances between them in the plane are much greater (see Ref. 7).

Explaining variations in coding strategy

Why should individual genes segregate according to genome or genome type as in Fig. 4? One possible reason is metabolic discrimination between nucleotide bases. The basis for the mechanism would be an evolutionary interaction between concentrations of mononucleotide pools and replication errors. Thus different species, or kinds of species, would have arrived at different optimizations of the tolerated error level and amount of each base in the pool. Theoretical and experimental work supporting this approach has been done by Ninio and colleagues (for example see Ref. 13). An error with a given base relates not only to its concentration in the pool but also to that of the adjacent base. The error depends both on the time available for incorporation and for proof-reading. Incorporation time is a function of the concentration of the base being incorporated, while time for correction depends on concentration of the next base in the sequence. If the pool contains an abundance of the next base it will be incorporated rapidly, leaving little time for proof-reading of the first base. The mononucleotide pools have not been measured for all tissues and cells, hence correlation with the gene pattern in Fig. 4 has not been tried.

A second possibility is regulation of replication or transcription through the choice of degenerate base. The speed and accuracy of copying could be influenced both by the nature of the base and its relative concentration in the pool, without invoking a proof-reading mechanism. Taxonomic groups could have exploited this double lever in varying manners, leading to different degenerate base compositions in the genes. Of course, this notion has implications for untranslated regions also, but lack of data precludes one from deciding on its applicability.

The optimization of secondary structure by choices between possible third bases might also affect coding strategy. The optimal secondary structure for a messenger could depend on cell size, nuclease content, salt concentrations, temperature range and other factors. In addition, the form of the messenger could be a brake to control its translation rate. Unfortunately, progress has been slow in determining mRNA conformation in the cell experimentally.

Another explanation for the genome type distances of Fig. 4 might be that the codon and anticodon populations are harmonized. Here we encounter a problem with regard to parasites. E. coli is a human symbiont and phages are E. coli parasites. If nucleotide pool concentrations are the determining factor in the separation of mRNAs revealed by correspondence analysis, parasites and hosts should have similar placements in the figure. The two examples are not analogous, however: E. coli cells establish their own pools. Coliphages do not, and hence they might be expected to have a coding strategy closer to E. coli than E. coli has to man. Curiously, bacteria fall about halfway between human and single-stranded DNA genes, although highly expressed bacterial mRNAs are nearer the large single-stranded DNA class. The double-stranded DNA phage messengers are closer to bacterial mRNAs (7).

Why should single-stranded DNA phages (φX174, G4, M13 and fd) fabricate messenger sequences that use the translation apparatus and tRNA of their bacterial hosts, yet make different choices from the host among synonymous codons? The host has had a long time to harmonize codon and anticodon populations. This may indicate that single-stranded DNA phages are relatively recent invaders of bacteria and have not yet evolved codon frequencies perfectly adapted to the bacterial anticodon distribution. Of course, a too-perfect adaptation could mean extinction through killing too many bacteria. However, the mRNAs of double-stranded DNA lambdoid phages are near those of bacteria; this could mean they have been bacterial parasites for a longer time.

Another problem is mitochondria. Yeast mitochondria genes fall about as far from yeast genes as papova virus genes do from human genes (we shall soon begin work with human mt sequences). The coordinated use of codons and anticodons is discussed further in Ref. 6 where it is shown that the mammalian cell must be deficient in tRNA for translating the frequent A-ending codons of SV40 mRNAs. It is easy to imagine that this is a reflection of the relatively slow growth of papova viruses in primates, but the subject needs further analysis and experimentation.

Indeed, the overall strategy of papova viruses is obscure. SV40 is found in all tissues of monkeys. Although these viruses are considered neurotropic, they can transform lymphocytes (the site of production of Ig mRNAs). As seen in Fig. 2 of Ref. 7, mRNAs of papova viruses have coding strategies closest to those of three Ig among all mRNAs sequenced in mammals. Hence it would be interesting to know the tRNA distribution in lymphocytes.

Another curious aspect of papova viruses is their 'poly A tendency'. Of the above 119 messengers, 19 exhibit frequent runs of four or more adenines ( 4.0% of total bases). Of these 19, five are SV40 or BKV mRNAs (14). Thus their elevated content of degenerate A is at least partly a reflection of poly A tendency. These five papova genes use much more A and U in codon position III than do those of mammals (see Table 5 of Ref. 6), except for mRNAs of these Igs and three hormones, which also fall in the same class with papova viruses (7). None of the six Ig or hormone messenger sequences shows poly A tendency however. Poly A tendency determinations should help to understand differences in coding strategy in these and other genes. Nonetheless, we have not yet been able to 'rationalize' the vertical axis of Fig. 4.

Finally, third base choice could regulate the expression of mRNA at the translation level (7-9). The mRNAs of abundant proteins lie at the bottom of Fig. 4, whose vertical axis is therefore linked to mRNA expressivity. Such a regulation might be realized by controlling the secondary structure of the messenger. However, the explanation appears less simple. Codons in the class of highly expressed bacterial genes have less C and G in position III than do those of other bacterial genes (note that as well as being lower in the figure, the highly expressed mRNAs are to the right of other bacterial mRNAs). But the axis representing degenerate C + G content, which should be closely related to variation in secondary structure, is horizontal not vertical. Hence we must consider other possibilities of mRNA regulation.

It is conceivable that third base choice is constrained by the relative concentrations in the pool of the four monoribonucleotides and that there is an optimum choice of bases for maximizing the rate of mRNA transcription (or avoiding errors). Thus the number of copies transcribed of each messenger may be influenced by the third base composition relative to these concentrations. However, the existence of such a mechanism would not prevent another control at the translation level. A possibility for translation regulation exists in codon context effects. It has been experimentally demonstrated that the interaction of tRNA with mRNA is not independent of mRNA sequences outside the codon. Recent results suggest that any given codon may be read preferentially by one or another member of an isocoding tRNA family, depending on the context (neighboring codons). The efficiency of reading a particular codon can vary over a ten-fold range (15). Consequently, 'internal' regulation of translation of a messenger would be possible through degenerate base choices (7-9). Evolutionary interaction with the monoribonucleotide pool concentrations could exist to optimize the overall cell economy.

As already shown, substitutions in protein are highly correlated with physicochemical properties of the exchanging residues (16). These exchanges, however, are not all there is to evolution or even molecular evolution. The nature of the protein coded has little to do with the position of its messenger in Fig. 4 (compare Figs 2 and 3 of Ref. 7). The different coding strategies can be viewed simply as distinct ways of coding a given protein. For example, the average protein of Dayhoff (17) could be coded by an mRNA falling in any one of the classes of Fig. 4. But if that protein, or any other, is to be produced by a species belonging to a genome type represented by one of these classes, I predict that its mRNA will make choices among synonymous codons such that the position of the messenger given by correspondence analysis will be inside the class of its genome type. As seen above, such predictions pertain to most genes in a genome or genome type, but a few exceptions do exist. These results also imply that we now have a means of estimating, before sequencing either the mRNA or the protein, the degenerate base composition for mRNAs of proteins of known origin and amino acid composition. Consequently, the total base composition of the messenger can be predicted since the non-degenerate bases are decoded without ambiguity.
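The prediction outlined in this paragraph can be sketched as an expected-value calculation. It assumes that a table of relative synonymous codon frequencies is available for the genome type. The two usage tables and the toy protein below are hypothetical, invented for illustration, and `expected_base_counts` is not a function from the original work.

```python
from collections import Counter

def expected_base_counts(aa_counts, usage):
    """aa_counts: amino acid -> number of residues.
    usage: amino acid -> {codon: relative frequency}, where the frequencies
    of the synonymous codons of each amino acid sum to 1.
    Returns the expected count of each base in the coding sequence."""
    expected = Counter()
    for aa, n in aa_counts.items():
        for codon, freq in usage[aa].items():
            for base in codon:
                expected[base] += n * freq
    return expected

# Hypothetical protein with 10 alanines and 5 phenylalanines.
toy_protein = {"A": 10, "F": 5}

# Two invented coding strategies for the same amino acids.
c_preferring = {"A": {"GCC": 0.7, "GCU": 0.1, "GCA": 0.1, "GCG": 0.1},
                "F": {"UUC": 0.8, "UUU": 0.2}}
a_u_preferring = {"A": {"GCA": 0.5, "GCU": 0.4, "GCC": 0.05, "GCG": 0.05},
                  "F": {"UUU": 0.9, "UUC": 0.1}}

for label, usage in (("C-preferring", c_preferring), ("A/U-preferring", a_u_preferring)):
    counts = expected_base_counts(toy_protein, usage)
    print(label, {b: round(counts[b], 2) for b in "ACGU"})
```

The non-degenerate positions contribute the same bases under either strategy, so the difference between the two predicted compositions comes entirely from the third-position choices.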

Messenger RNA is an evolutionary structure in its own right. For a long time it was not suspected that such strong constraints could exist, independently of protein coding, on nucleic acids. The picture is increasingly one of manifold constraints and adaptations, of both structural and functional natures.

The systematics of viruses, bacteria, mitochondria and of small species and genomes in general is difficult, partly because there is less phenotype to work with and systematists have often worked exclusively with phenotypes. Our ideas about the origins of these genomes, and whether they are autogenous or endosymbiotic, are being revised (4). The genome-distance-by-coding-strategy approach can aid in resolving such questions. As the sample of sequenced genes and genomes grows our analyses can be refined and the number of classes in the correspondence analysis increased.

The genome hypothesis resulted from studying codon usage in the mRNA in our sequence bank. Additional analyses on the same sequences have been done or are in progress. We are finding further examples of differences and similarities between genome types, genomes and genes. This work continues to indicate protein-independent molecular evolution of a non-neutral character, and may aid in understanding and extending the genome hypothesis.

References

1 Wilson, E. O. (1978) On Human Nature, Harvard University Press, Cambridge MA
2 Dawkins, R. (1976) The Selfish Gene, Oxford University Press, Oxford
3 Kimura, M. (1979) Sci. Am. 241, 94-104
4 Doolittle, W. F. (1980) Trends Biochem. Sci. 5, 146-149
5 Woese, C. R., Maniloff, J. and Zablen, L. B. (1980) Proc. Natl. Acad. Sci. U.S.A. 77, 494-498
6 Grantham, R. (1978) FEBS Lett. 95, 1-11
7 Grantham, R., Gautier, C. and Gouy, M. (1980) Nucleic Acids Res. 8, 1893-1912
8 Grantham, R. and Gautier, C. (1980) Naturwissenschaften 67, 93-94
9 Grantham, R., Gautier, C., Gouy, M., Mercier, R. and Pave, A. (1980) Nucleic Acids Res. 8, r49-r62
10 Benzecri, J. P. (1973) L'Analyse des Données 2. L'Analyse des Correspondances, Dunod, Paris
11 Hill, M. O. (1974) Appl. Statist. 23, 340-354
12 Fages, R. (1978) Journées Soc. Franc. Classific., p. 99
13 Bernardi, F. and Ninio, J. (1978) Biochimie 60, 1083-1095
14 Grantham, R. FEBS Lett. (in press)
15 Bossi, L. and Roth, J. R. (1980) Nature (London) 286, 123-127
16 Grantham, R. (1974) Science 185, 862-864
17 Dayhoff, M. O. (1972) Atlas of Protein Sequence and Structure, p. D-355, National Biomedical Research Foundation, Silver Spring, Maryland

Patterns in codon usage of different kinds of species

RICHARD GRANTHAM, PASCALE PERRIN, AND DOMINIQUE MOUCHIROUD

(1986) Oxford Surveys of Evolutionary Biology 3, 48-81 (With permission of the first author, and of Elizabeth Mann for Oxford University Press)

1. Introduction
2. Interspecific patterns of codon choices
3. Explaining codon use
4. Intraspecific variation and expressivity in the immune system
5. Particularities in human viruses
6. A previous RNY code?
7. Concluding remarks

End Note (July 2011)

End Note (Sept 2016)

1. Introduction

When Miescher discovered nucleic acids in hospital pus in 1869, a decade after publication of Darwin's Origin and just following Mendel's experiments, the development of molecular evolution became possible. Recognition that DNA was the genetic substance, however, had to wait another 75 years. Of course the biochemical and statistical methodologies were lacking, but around 1872 Galton began introducing statistical methods into biology. Such methods are necessary for arriving at reliable generalizations. Galton, and in the next decade Weismann and others, also did experiments that contributed greatly to the evolutionary synthesis a half-century later. Although Darwin was aided by personal contacts with Lyell, Huxley, and Galton, isolated and abbreviated careers were the lot of Mendel and Miescher, and their work was not followed up for many years after their deaths. Partly as a result of this, perhaps, molecular evolution as a discipline has not fully established itself. We do not yet have a theory of molecular evolution and remain largely at the stage of data gathering. Articulation between biochemical phenomena and genetic expression in populations is poorly understood and hypotheses, when they can be formulated, are often difficult to test.

The genotype and the phenotype evolve together. Direct, but unidirectional information flow between them is assured by the genetic code. The genome phrases its messages under the surveillance of natural selection, which eventually chooses among genotypic variants. The genome ordinarily has an immense number of formal choices in composing a messenger RNA sequence to be translated into a given protein. These options derive from the correspondence of 61 triplet codons, made up of four different kinds of nucleotide bases, to the 20 amino acids of protein. This degeneracy or synonymy structure is nearly invariant. Thus, for example, in mRNA of all known species, each residue of phenylalanine can be designated by either the codon UUU or UUC, and each residue of alanine by GCA, GCC, GCG, or GCU.
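The size of this space of formal choices can be made concrete: the number of distinct coding sequences for a protein is the product of the degeneracies of its residues. The sketch below assumes the standard genetic code; the names `DEGENERACY` and `synonymous_messages` are illustrative.

```python
from collections import Counter
from math import prod

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks terminators.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Number of synonymous codons available for each amino acid (one-letter code).
DEGENERACY = Counter(aa for aa in CODE.values() if aa != "*")

def synonymous_messages(protein):
    """Number of distinct coding sequences for a protein in one-letter code."""
    return prod(DEGENERACY[aa] for aa in protein)

print(synonymous_messages("FA"))       # Phe (2 codons) x Ala (4 codons) = 8
print(synonymous_messages("MW"))       # Met and Trp have one codon each = 1
print(synonymous_messages("L" * 10))   # ten leucines: 6**10 = 60466176
```

Even a short peptide therefore admits a very large number of equivalent messengers, which is why a consistent pattern of choices within a genome is informative.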

Choices in biology in general are many, but those implicated in the coding of proteins are particular: they are directly documented in the genome. The code's degeneracy formalizes and obliges choices of the genome; it must decide which codon to use for each amino acid. Although invisible in the proteins, these choices between synonymous triplets are inscribed in that great document the genome, where they remain for at least the life of the individual. Thus, a genetic companion to the fossil record exists, or existed, in DNA sequences.

Weismann would rejoice. For example, for coding leucine, which has six codons, mRNA of human mitochondria prefers CUA, that of nuclear genes of certain plant species CUC, of human nuclear genes CUG, of ssDNA bacteriophages G4 and φX174 CUU, of AIDS virus UUA, and of yeast nuclear genes UUG (Anderson et al. 1981; Ikemura 1985; Ratner et al. 1985; Roe et al. 1985; deBoer and Kastelein 1986; Li et al. 1985; GenBank release 35).

According to the genome hypothesis each kind of species has a 'system' or coding strategy for choosing among synonymous codons (Grantham et al. 1980a,b). This system or dialect (Ikemura 1985; Ikemura and Ozeki 1983) is repeated in each gene of a genome and hence is characteristic of the genome or type of genome (Grantham 1980; Grantham et al. 1980a,b, 1981, 1983). The dialect is not inflexible; as seen below, intraspecific variation in employment of the codon catalogue does occur. Some genes in a genome, particularly a large genome such as ours, may use the catalogue somewhat differently than others (Grantham and Perrin 1986). It is the overall use of the code, obtained by summing codon frequencies of all sequenced genes in the genome, that characterizes the species.

Analysis of overall codon usage by different taxonomic groups has remained a marginal activity for two reasons.

First, the methodology, frequently demanding multivariate and non-parametric statistics, is out of reach for most biologists (and many journal editors).
Secondly, although codon use is a characteristic of the genotype, most evolutionary analyses have been based on the phenotype.

How much independence exists between the two levels of evolution has not been determined, although neutralists and selectionists are converging, which should help to find a solution. Possibly, future data on the relative rates of silent and non-silent mutations will help to clarify this situation.

This review seeks to summarize and interpret the main features of variation in selection among synonymous codons. Why codon usage in each species is biased is not known. Nor are we sure that a general bias for the whole biosphere exists, because the sample of all sequenced genes is still too small. Some hypotheses have been announced, but it is often not clear whether one should expect the bias in each species to be determined by phenotypic or genotypic considerations. A tangle of proximate and ultimate causes, and of cause and effect ambiguities, is encountered.

For example, what is the influence on a species' use of the codon catalogue of food, population size, niche, individual lifespan, or size of the phenotype? The response seems to be 'none' each time. In fact, we are simply not ready to answer these questions.

In this paper we take the view that coding strategy is a fundamental evolutionary structure and that species or kinds of species can be characterized by variation in this structure. Indeed, certain distinctive patterns have been reported. Three recent reviews have aided in preparing this synthesis (deBoer and Kastelein 1986; Ikemura 1985; Li et al. 1985). We have selected 10 species groups for special study; these are the groups with the greatest number of published gene sequences.

2. Interspecific patterns of codon choices

Only part of the [conventional protein-encoding] information contained in the genotype is expressed in the phenotype as protein. This part varies from over 90 per cent for small viruses to only about 2 per cent in humans and other mammals. Another quantitative genetic difference between species is in degenerate base use.

It was formerly often thought that variation in degenerate base frequencies would be a neutral phenomenon since no direct phenotypic expression results. But, this has turned out not to be so. Systematic exploitation of the codon catalogue creates genetic distances between species (Grantham and Gautier 1980). It has been shown that the greatest determinant in creating these distances is not the protein composition; instead it is the pattern of choices among degenerate bases (Grantham 1980). Thus, in an early analysis, mammalian, bacterial, virus, mitochondrial, and fungus genes fell in different codon use classes defined by minimizing the variance in codon frequency in each of a given number of classes (Grantham et al. 1980a). In the same study it was demonstrated that no such separation of the proteins coded was obtained on the basis of amino acid frequency variation. Therefore, an mRNA sequence provides a better indication of the evolutionary position of a gene than does the protein sequence it codes (Grantham and Gautier 1980; Blake and Hinds 1984). This does not mean, of course, that evolutionary trends cannot be described for individual proteins; an example with cytochrome c is given in Grantham (1974a). Nonetheless, in general, protein evolution is extremely conservative; most amino acid substitutions are between chemically similar residues (Grantham 1974b).

Analysis of all sequenced genes for overall use of the 61 codons separates them into groups of similar species. For example, a correspondence analysis on the first 54 mRNA sequences published for eukaryotes showed separation between yeast mitochondrial and yeast nuclear genes, and between fungus and animal genes (see Fig. 16 of Grantham et al. 1981). In another correspondence analysis, human and yeast nuclear genes and their mitochondria were seen to have distinct coding strategies (Grantham et al. 1983). That is, there are patterns of usage of the codon catalogue. These graphic patterns have been accompanied by identification and quantification of the importance of the principal factors responsible for the separations between messengers observed.
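As a minimal sketch of what "overall use of the 61 codons" means in practice, the fragment below tallies the sense codons of one in-frame coding sequence and converts the tallies to relative frequencies. The function names and the toy sequence are hypothetical illustrations, not part of the original analyses; a correspondence analysis would take one such 61-value profile per gene as its input.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"UAA", "UAG", "UGA"}
# The 61 sense codons of the standard genetic code.
SENSE_CODONS = [
    "".join(triplet)
    for triplet in product("UCAG", repeat=3)
    if "".join(triplet) not in STOP_CODONS
]


def codon_counts(coding_seq):
    """Count sense codons in an in-frame coding sequence (RNA or DNA letters)."""
    seq = coding_seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    tally = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    # Stop codons and triplets with ambiguous letters are simply not reported.
    return {codon: tally.get(codon, 0) for codon in SENSE_CODONS}


def codon_profile(coding_seq):
    """Relative frequency of each of the 61 sense codons (values sum to 1)."""
    counts = codon_counts(coding_seq)
    total = sum(counts.values())
    if total == 0:
        return counts
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy messenger: AUG CUG CUG AAA GCC UAA
toy_mrna = "AUGCUGCUGAAAGCCUAA"
toy_profile = codon_profile(toy_mrna)
print(toy_profile["CUG"])
```

Keeping all 61 codons in the profile, including those with a zero count, makes profiles from different genes directly comparable position by position.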

In general, the most important factor in producing the separation is the G + C content of the degenerate bases, which is the most variable parameter of codon usage identified between taxa.
The second most important factor, at least in human nuclear and human viral genes, is differential use of bases A and U. The analytical expression most exactly representing this factor is 1.5 (per cent A) - 0.5 (per cent U); thus a weighting of 3 occurs between relative frequencies of A and U (see Fig. 7 of Grantham et al. 1985). This kind of reduction of coding strategy to a hierarchy of importances in creating the differences helps interpret the phenomena in terms of molecular evolution (see below).

2.1. HUMANS AND OTHER VERTEBRATES

Although total human nuclear DNA, like that of other mammals, contains about 41 per cent (G + C), all major families of protein coding sequences have over 50 per cent (G + C) in degenerate bases (see Figs 5 and 7 of Grantham et al. 1985). In fact C-ending codons are favoured in 14 of the 16 possibilities for choice between such codons and those ending in other bases, while keeping the same amino acid each time. The two exceptions are CUG and GUG, the codons of highest frequency for leucine and valine, respectively (Grantham et al. 1981; Li et al. 1985). G-ending codons of Thr, Pro, Ala, and Ser are rare because they have C in position II, forming the dinucleotide CG, which is strongly avoided in man and most eukaryotes (see below).

Why C-ending codons generally predominate in vertebrate sequences over those ending in A or U is not clear. To appreciate this, note that the complementary triplet for AAA is UUU and that for GGG is CCC; since G-C pair formation liberates much more free energy than A-U pairing, the pairing of the last two triplets is called 'strong' binding and that of the first two triplets 'weak' binding. Consequently, as seen below, UUC, AUC, UAC, and AAC are expected to be used more frequently than their U-ending cognates from codon-anticodon binding energy considerations. These four codons form pairs with their specific anticodons characterized by intermediate energies while their cognates UUU, AUU, UAU, and AAU form pairs of weak interaction energy. That is, in each of the four cases the anticodon is the same for both the C- and U-ending codon and it contains G in the degenerate (wobble) position; G forms a much stronger bond to C than to U.

On the other hand, elevated frequencies would not be expected, given the overall genome composition, for triplets CCC, GCC, CGC, or GGC, which form extreme energy pairs with their anticodons (Grantham et al. 1981). But, the four latter codons, like the four former C-ending ones, are each of highest frequency within their specific set (Grantham et al. 1981; Ikemura 1985; Li et al. 1985). When a (methylated or otherwise) modified base occurs in the anticodon wobble position, as happens frequently in these eight cases (Sprinzl et al. 1985), we do not understand why C is favoured over U as third base.

This field of research has been neglected for several years and no good explanation has been found for the tendency to high G + C content in codon position III of most genes. Pairing energies involving modified bases have not been quantified. Adams and Eason (1984), and Perrin (1984) have proposed that mutation rate decreases with increasing G + C content, which would tend to stabilize coding strategy. However, confirmation of this notion has not appeared in the case of CpG mutating to UpG (Cooper and Gerber-Huber 1985; Grantham 1985).

2.2. INVERTEBRATES

Invertebrates will be exemplified by two species, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster. Codon choices differ strikingly between the two species. For example, in the nine highly expressed genes of C. elegans sequenced (Kramer et al. 1982; Files et al. 1983; Karn et al. 1983; Klass et al. 1984; Spieth et al. 1985), CUU is the Leu codon of highest frequency while CUG is favoured in the 46 Drosophila sequences. Furthermore, avoidance of doublets CG and UA is much more severe in the worm. As seen below, a rather strong case for energy optimization in codon-anticodon pairing can be made with C. elegans.

2.3. YEAST

Since deBoer and Kastelein (1986) have just summarized codon frequencies in 34 yeast genes, we take their data for comparison with other species. As appears in Section 3, CpG avoidance in Saccharomyces cerevisiae and Homo sapiens is similar; however, UpA avoidance in yeast is stricter than in man, being surpassed only by that in C. elegans among species studied here. No good explanation for avoidance of the UA doublet has appeared. Codon-anticodon pairing energy optimization in yeast has been discussed by several authors, who have found a strong preference for middle-level energies in highly expressed genes (Bennetzen and Hall 1982; Ikemura 1985; deBoer and Kastelein 1986; Li et al. 1985). In summary, overall usage of the codon catalogue in genes for abundant proteins is such as to assure intermediate levels of codon-anticodon interaction energy, in yeast as well as in E. coli.

2.4. PLANTS AND CHLOROPLASTS

Of the amino acids having codon choices, only Gln favours the same codon, CAA, in chloroplasts and plant nuclear genes sequenced, as can be seen below. This suggests different origins for these two plant genomes. Chloroplasts appear to have more genetic freedom at the molecular level than do mitochondria. Eleven of the 18 amino acids show highest frequency for the same codon in nuclear and mitochondria genes of man (insufficient plant mitochondria have been sequenced for a good comparison). It is also intriguing that 10 preferences are the same between plant nuclear and E. coli genes, making it difficult to believe that chloroplasts descended from Eubacteria since preferences coincide in only five cases between chloroplasts and E. coli. Of these 18 amino acids, 11 show the highest frequency for the same codon in man and E. coli (see below). We are therefore a long way from understanding what conserves and what changes codon preferences.

As will be revealed in Section 3, chloroplast genes so far sequenced favour UUA for coding Leu. They weakly avoid CpG and even more weakly, UpA. They have much higher frequencies for A and U than C and G as degenerate bases and show no evidence of pairing energy optimization by C/U or A/G choices (Boudraa 1987). CUC is slightly favoured over UUG as preferred Leu codon in plants, as seen below.

2.5. MITOCHONDRIA

The complete genomes of Xenopus laevis, mouse, rat, bovine, and human mitochondria have been sequenced; each contains 13 long, open reading frames, that is, potential coding sequences for proteins free of terminator triplets (Anderson et al. 1981, 1982; Bibb et al. 1981; Saccone et al. 1981; Roe et al. 1985). In some cases the protein has not been identified, hence these open reading frames in the genome sequence are potential genes, most of which, however, have been found to correspond to functional proteins.

Overall exploitation of the codon catalogue by vertebrate mitochondrial genes is extremely economical. These genomes, although they use all codons, contain genes for only 22 tRNAs; Leu and Ser each have two tRNAs, the other amino acids only one each. Hence, bias in synonymous codon frequencies cannot be due to availability of several tRNAs for each amino acid with different concentrations. Bias exists, however.

For example, human mitochondria generally favour codons ending in C while Xenopus mitochondria have higher frequencies for those with U as third base. Hence, the amphibian mitochondrial system prefers G-U wobble to the standard G-C reading of codon position III found in mammalian mitochondria (Roe et al. 1985).

Mitochondria thus present a curious evolutionary history. From Drosophila to man their genome size seems minimized and varies little. Gene order is different between Drosophila and vertebrates, but practically identical from X. laevis to man (Roe et al. 1985). Also, codon use differs greatly between mitochondria of X. laevis and man; 13 amino acids have different preferred codons between the two species. Between mitochondria of Drosophila and Xenopus, nine amino acids differ in codon preferences while between those of yeast and Xenopus 10 such differences exist. The only such difference between mitochondria of yeast (10 sequenced genes) and Aspergillus nidulans (12 sequenced genes) is with the amino acid Met, the former favouring AUG, the latter AUA (GenBank release 35). This suggests strong conservation of coding strategy in the two species over long times, although no date for their common ancestor has been proposed.

Indeed, we do not know how human mitochondria evolve - that is, how they are and have been selected. Do they have to be evaluated at the level of the host phenotype? This seems unlikely in view of the values for certain indexes presented below, for to maintain these values would seem to mean the elimination of many host individuals at each generation.

3. Explaining codon use

What is the fundamental explanation for interspecific variation in coding strategy? Are we faced with a situation of continuous variation within and between species, thus embracing a Darwinian perspective of gradual separation of populations to form new species, of species to form new genera, etc.? This is the heart of the problem of molecular evolution, its articulation with the rest of evolution, its importance in speciation and systematics in general. So, where do the codon dialects come from? One possible source might be mutational bias. But, Li et al. (1985) conclude that non-random mutations cannot explain non-random codon frequencies since the pattern of mutations seen in pseudogenes would predict accumulation of A and U in codon position III, instead of C and G as observed in animal genes. Therefore, some other factor must exert stronger selection pressure than the mutational trend. We envisage three potential origins of codon bias.

3.1. SEQUENCE PHYSICOCHEMICAL CHARACTERISTICS

The protein coded, of course, conditions properties of the nucleotide sequence, but much freedom for varying properties through degenerate base choice remains. Consider a few structural aspects and sequence properties:

Structure	Property
B- or Z-DNA, (RY)_n	Conformation
Polypurines, polypyrimidines	General physicochemical stability
Runs (homonucleotides)	Half-life of mRNA in the cell
Varying base composition	Resistance to nucleases
Sequence element organization	Mutation rate

All these structures probably interact with each of the properties; consideration of the evolutionary importance of these features has begun, notably with the work of Rich and colleagues (Johnston and Rich 1985; see also Temin 1985; Grantham et al. 1985).

3.2. TRANSLATION OPTIMIZATION

Do codon frequencies adapt to tRNA concentrations or the converse (Garel 1982)? Both are adapting to something, that is, they are being selected. Changes among synonymous codons do not change protein structure, but they may influence the amount of protein made and the efficiency of its synthesis. That is, rate of translation and quality of the product can both be controlled by codon choice because some triplets translate more rapidly and accurately than their cognates (a protein containing translation errors may have a different half-life and biological activity from a more faithful copy).

Yet there is a mystery in all this. For example, proteins of chloroplasts and man, or of E. coli and man, do not differ greatly in amino acid composition, as several studies have indicated (Grantham 1980; Grantham et al. 1983; Blake and Hinds 1984). But, base composition of the coding sequences does differ enormously; chloroplast degenerate bases average only about 30 per cent (G + C) while the mean value for the human genes of Table 1 is 61.1 per cent.

**Table 1 Codon frequencies in 10 kinds of species**
Excepting bacteriophages, these 10 columns correspond to the 10 largest sequence files in our ACNUC bank (Grantham et al. 1985). Mixed species occur in the Plt (plants) and Chl (chloroplasts) files, where many fewer sequences per species are available. Number of gene sequences is: Man 195, Rat 95, Mus 77, Chic (chicken) 67, Dros (Drosophila melanogaster) 46, Plt 53, Chl 33, Eco (Escherichia coli) 149, EBV (Epstein-Barr virus) 59 and Ad2 (adenovirus type 2) 28. Absolute frequencies of each codon are calculated from GenBank release 35. Immune system genes have been excluded.
Amino acid	Codon	Man	Rat	Mus	Chic	Dros	Plt	Chl	Eco	EBV	Ad2
Arg	AGA	517	219	224	129	49	66	117	44	214	80
	AGG	552	219	239	146	31	80	29	34	449	100
	CGA	246	88	116	39	42	17	80	125	162	69
	CGC	520	246	206	164	249	86	49	1247	457	333
	CGG	346	194	144	97	46	41	19	141	433	112
	CGU	222	112	86	108	89	61	173	1581	166	81
Leu	CUA	292	130	131	45	42	159	137	130	273	110
	CUC	1052	454	368	226	115	253	28	493	763	129
	CUG	1981	973	718	561	446	184	39	3359	1305	351
	CUU	506	215	206	106	43	227	157	432	300	173
	UUA	310	74	79	33	17	87	242	455	119	24
	UUG	532	237	213	104	132	249	174	552	287	153
Ser	AGC	1028	364	423	351	109	165	49	855	567	219
	AGU	430	207	165	95	35	63	111	295	202	68
	UCA	432	168	169	80	29	133	73	282	270	61
	UCC	901	379	365	242	191	153	90	658	616	171
	UCG	202	68	60	74	121	67	34	389	266	84
	UCU	670	283	281	155	45	142	159	658	348	107
Thr	ACA	727	302	267	133	89	96	114	236	354	105
	ACC	1173	540	346	306	308	190	104	1428	916	311
	ACG	235	125	86	73	111	65	36	583	423	85
	ACU	717	292	238	144	83	118	247	613	281	106
Pro	CCA	616	324	231	208	175	300	102	415	518	134
	CCC	998	407	291	288	237	124	74	166	939	284
	CCG	260	76	57	67	134	84	43	1488	413	184
	CCU	823	319	250	248	58	182	211	286	375	151
Ala	GCA	649	247	215	198	131	258	191	1172	475	143
	GCC	1445	672	474	458	443	349	93	1289	1517	338
	GCG	264	117	71	91	102	147	73	1915	509	228
	GCU	1004	439	357	366	219	298	388	1163	380	143
Gly	GGA	852	296	349	239	180	145	243	246	544	154
	GGC	1326	513	457	411	301	250	90	1785	733	234
	GGG	605	265	248	183	26	104	107	416	696	134
	GGU	655	234	227	333	169	157	352	1887	331	121
Val	GUA	321	104	95	62	33	73	271	751	159	93
	GUC	754	382	332	234	187	158	40	734	609	117
	GUG	1351	611	514	385	302	245	69	1340	925	319
	GUU	561	211	194	136	87	174	229	1332	259	113
Lys	AAA	1115	385	378	281	119	133	229	2221	252	163
	AAG	1842	913	671	642	538	328	92	696	705	216
Asn	AAC	1120	549	467	340	252	311	168	1589	604	335
	AAU	846	326	340	152	92	98	241	707	347	123
Gln	CAA	609	248	203	101	70	603	183	701	276	136
	CAG	1583	874	548	360	287	363	59	1804	839	305
His	CAC	642	320	291	218	128	117	75	664	503	169
	CAU	485	184	170	93	53	84	130	538	240	60
Glu	GAA	1384	532	502	334	149	175	414	2699	464	257
	GAG	1910	999	674	664	642	269	118	1074	1137	436
Asp	GAC	1459	681	532	352	315	248	90	1441	893	364
	GAU	1110	421	413	263	217	172	254	1704	464	189
Tyr	UAC	926	402	336	229	205	247	95	869	599	223
	UAU	618	263	209	77	76	101	167	730	278	82
Cys	UGC	757	364	286	130	173	123	22	307	401	117
	UGU	497	209	204	64	30	43	54	231	182	52
Phe	UUC	1176	559	456	281	235	305	198	1198	520	148
	UUU	814	309	295	167	76	154	222	855	627	213
Ile	AUA	257	78	110	51	28	77	119	97	189	85
	AUC	1118	534	432	450	368	256	195	1797	584	143
	AUU	706	358	265	165	138	179	345	1347	330	148
Met	AUG	1182	488	438	355	252	225	218	1534	593	249
Trp	UGG	644	237	252	95	96	146	138	599	337	123
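To make the comparison of degenerate-base composition concrete, the sketch below pools the chloroplast (Chl) column of Table 1 and computes the fraction of degenerate codons ending in G or C. Two points are assumptions rather than statements from the text: Met (AUG) and Trp (UGG) are excluded because they offer no codon choice, and the counts are pooled over all genes rather than averaged gene by gene. The function name is hypothetical. Under these assumptions the pooled chloroplast value comes out near 28 per cent, in line with the "about 30 per cent" quoted above.

```python
# Codons in the row order of Table 1, and the matching chloroplast (Chl) column.
CODONS = (
    "AGA AGG CGA CGC CGG CGU CUA CUC CUG CUU UUA UUG AGC AGU UCA UCC UCG UCU "
    "ACA ACC ACG ACU CCA CCC CCG CCU GCA GCC GCG GCU GGA GGC GGG GGU "
    "GUA GUC GUG GUU AAA AAG AAC AAU CAA CAG CAC CAU GAA GAG GAC GAU "
    "UAC UAU UGC UGU UUC UUU AUA AUC AUU AUG UGG"
).split()
CHL = [int(n) for n in (
    "117 29 80 49 19 173 137 28 39 157 242 174 49 111 73 90 34 159 "
    "114 104 36 247 102 74 43 211 191 93 73 388 243 90 107 352 "
    "271 40 69 229 229 92 168 241 183 59 75 130 414 118 90 254 "
    "95 167 22 54 198 222 119 195 345 218 138"
).split()]

# Assumption: single-codon amino acids carry no degenerate base.
NON_DEGENERATE = {"AUG", "UGG"}


def gc3_degenerate(counts):
    """Fraction of degenerate codons (pooled counts) whose third base is G or C."""
    degenerate = {c: n for c, n in counts.items() if c not in NON_DEGENERATE}
    strong = sum(n for c, n in degenerate.items() if c[2] in "GC")
    return strong / sum(degenerate.values())


chl_counts = dict(zip(CODONS, CHL))
print(f"Chl degenerate G+C: {100 * gc3_degenerate(chl_counts):.1f} per cent")
```

The same function can be applied to any other column of Table 1 once its counts are entered in the same codon order.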

What such differences mean in evolution is still obscure. Clearly, there is harmonization between codon and anticodon intracellular populations in yeast and E. coli (Ikemura 1981, 1982, 1985; Bennetzen and Hall 1982; Gouy and Gautier 1982; Grosjean and Fiers 1982) and there can be little doubt that this facilitates translation. Codons of high frequency in mRNA are in general decoded by anticodons of high frequency in the cell's tRNA. This harmonization of the two intracellular populations optimizes translation by increasing its speed (since a high frequency codon is decoded faster, due to more specific anticodons being present in the cytoplasm) and decreasing mismatching errors (Gouy and Grantham 1980).

The extent of selection on codon-anticodon pairing energies has not been generally studied; analyses have been confined to E. coli and yeast (Bennetzen and Hall 1982; Grosjean and Fiers 1982; Ikemura and Ozeki 1983; Ikemura 1985). We attempt an extension of this phenomenon to Metazoa, as shown below.

3.3. ANCESTOR SEQUENCE BIAS

The life system started with certain sequences, possibly with one or a few particular sequences. It is sometimes thought that, because of the mutation process, all trace of the original sequences has been lost. But, as seen in this review, coding strategy appears to be conserved over long evolutionary time. In addition, even though the mutation rate is sufficient to wipe out the original condition, natural selection has probably been interacting all the time and perhaps re-selecting certain features of the starting sequence, although the function and environment of the sequence have changed. Many things have changed in the biosphere in the last 3.5 thousand million years, but many have remained rather constant (temperature, pressure, inorganic composition of the earth . . . ). It is reasonable to suppose that these relatively constant factors may be reflected in the conservation of certain sequence characteristics. We also believe that each lineage has developed its own strategy for codon choices and has had to contend with whatever bias existed in the ancestor sequences. In some cases the lineage may have adjusted to and conserved the ancestral bias instead of letting it mutate away (this is probably one function of repair enzymes). The above is only logical, of course, and we want to test this logic when possible.

3.4. CODON CHOICE INDEXES

Several indexes requiring only codon frequencies and simple arithmetic aid us in assessing the importance of the three above influences, especially the second one. Absolute frequencies of the 61 codons in the 10 kinds of species appear in Table 1. Tables 2 and 3 then show values for the indexes in each kind of species and some mitochondria. The first two indexes, NCG/NCC and NUA/NUU, concern CG and UA doublet avoidance, and are explained in the legend of Table 2. The third kind of index relates to energy optimization in codon-anticodon pairing during translation; the explanation follows.

Table 2. *Avoidance of CG and UA doublets in codon position II-III*
NCG/NCC is the frequency ratio, for codons having C as middle base, of G-ending to C-ending triplets. For codons having U as middle base, NUA/NUU is the ratio of A-ending to U-ending triplets. Both indexes conserve G+C contents. Values (calculated for GenBank release 35) are multiplied by 100.
Species	Man	Rat	Mus	Chic	Dros	Plt	Chl	Eco	EBV	Ad2	Yeast¹	Xen	C. el.²
NCG/NCC	29	19	19	24	40	44	52	124	40	53	30	13	6
NUA/NUU	49	40	51	39	38	53	72	31	70	66	27	48	3
Mitoch. of	Man	4Mam	Xen³	Dros	Yeast	A.nid
NCG/NCC	6	7	17	88	55	160
NUA/NUU	364	323	98	178	144	392
1. deBoer and Kastelein 1986. 2. Files, Carr and Hirsh 1983; Karn, Brenner and Barnett 1983; Klass, Kinsley and Lopez 1984; Kramer, Cox and Hirsh 1982; Spieth et al. 1985. 3. Roe et al. 1985.
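The two avoidance indexes can be sketched directly from the E. coli (Eco) column of Table 1. One detail is an inference and not stated in the legend: for NUA/NUU only the synonymous pairs (Ile, Leu, Val) are summed, leaving out UUA (Leu) against UUU (Phe), because that pair does not code the same amino acid. With this reading the E. coli counts give 124 and 31, the values listed in Table 2. The function names are hypothetical.

```python
# E. coli counts from Table 1 for codons with C or U as middle base.
ECO = {
    "ACC": 1428, "ACG": 583, "CCC": 166, "CCG": 1488,
    "GCC": 1289, "GCG": 1915, "UCC": 658, "UCG": 389,
    "AUA": 97, "AUU": 1347, "CUA": 130, "CUU": 432,
    "GUA": 751, "GUU": 1332,
}


def ncg_ncc(counts):
    """100 x (G-ending / C-ending) for codons with C as middle base."""
    g_ending = sum(counts[first + "CG"] for first in "ACGU")
    c_ending = sum(counts[first + "CC"] for first in "ACGU")
    return 100 * g_ending / c_ending


def nua_nuu(counts):
    """100 x (A-ending / U-ending) for codons with U as middle base.

    Assumption: the UUA/UUU pair is excluded because it is not synonymous.
    """
    a_ending = sum(counts[first + "UA"] for first in "ACG")
    u_ending = sum(counts[first + "UU"] for first in "ACG")
    return 100 * a_ending / u_ending


print(round(ncg_ncc(ECO)), round(nua_nuu(ECO)))
```

Because each ratio compares two endings of the same strength class (G against C, A against U), it is insensitive to the overall G + C content of the degenerate bases, which is what the legend means by conserving G + C contents.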

Degenerate C/U choice is interesting because ordinarily the same anticodon responds to synonymous codons ending in C or U. Interchanging A and G in codon position III, however, often implies changing the anticodon. C/U choices clearly relate to energy optimization in codon-anticodon pairing. Thus, the pattern of the choices should be indicative of the importance this parameter has had in evolution.

These choices have often been neglected in studying coding strategy and the general impression is that "tRNA concentrations explain codon usage". Apart from leaving unexplained the origin of differential tRNA concentrations, which in fact poses the same problem as does codon usage, we have just seen that this cannot be the case with biased C/U choice.

If energy optimization exists, codons having G or C in the first two positions should prefer A or U in position III. Likewise, codons with A or U in position I and II should tend to increase codon-anticodon interaction energy by choosing G or C as third base (Grosjean et al. 1978; Grantham et al. 1981; Gouy and Gautier 1982; Grosjean and Fiers 1982). Schematically, WWX codons, where W (weak binding) is A or U and X is any base, would prefer S (strong binding, that is, G or C) degenerate bases. Similarly, SSX codons would tend to use A and U as third bases. Middle energies provided by mixed doublets (MM: one W and one S base) serve as controls: MMX codons should show no systematic bias under this hypothesis. We must recognize at the outset that the eukaryote system has many more anticodons than do prokaryotes and that modified bases in the anticodon, which may sometimes change considerably the pairing energy, occur frequently. These changes have not been quantified, however, and consequently our analyses have been done without taking them into account.

We compare C/U degenerate choice in codons of weak binding energy in the first two positions to those of strong binding energy in these positions. Frequency ratios of C- and U-ending codons are, respectively, represented by WWC/WWU and SSC/SSU. As explained above, these ratios are contrasted to that for codons having one W and one S base in positions I and II, MMC/MMU. Table 3 summarizes results on a few species and gene families.

Table 3 *C/U choice in codon position III*
Species (nb. genes)	WWC/WWU	SSC/SSU	MMC/MMU
C. elegans HE (9)	3.59	0.91	1.31
D. melanogaster (46)	2.77	2.30	2.58
E. coli (149)	1.50	0.91	1.13
E. coli HE (Gouy and Gautier 1982)	5.04	0.37	0.92
H. sapiens (195)	1.45	1.59	1.56
Human β-globin (6)	1.52	1.43	1.06
Human α-globin (3)	14.3	4.57	8.79
Human hormones (28)	1.97	2.32	2.62
Human enzymes (15)	1.30	1.46	1.71
Human Ig segment C (8)	8.65	3.85	6.12
Human Ig segment V (9)	1.57	0.79	1.63
Mus Ig segment C (11)	2.93	1.15	1.99
Mus Ig segment V (59)	1.44	0.69	1.18
Mus Tcr segment C (8)	1.53	1.70	1.50
Mus Tcr segment V (11)	0.88	1.13	1.10
See text for explanation of column headings. HE, highly expressed mRNA.
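The three ratios of Table 3 can be illustrated with the E. coli column of Table 1. The sketch below classifies each of the 16 doublets that allow a C/U choice by the number of strong bases (G or C) in positions I and II, then pools the C-ending and U-ending counts per class. The data layout and function names are hypothetical; the counts are those of Table 1, and they give 1.50, 0.91 and 1.13, the values of the "E. coli (149)" row.

```python
# E. coli counts from Table 1: doublet (positions I-II) -> (C-ending, U-ending).
ECO_CU = {
    "AA": (1589, 707), "AU": (1797, 1347), "UA": (869, 730), "UU": (1198, 855),
    "CC": (166, 286), "CG": (1247, 1581), "GC": (1289, 1163), "GG": (1785, 1887),
    "AC": (1428, 613), "AG": (855, 295), "CA": (664, 538), "CU": (493, 432),
    "GA": (1441, 1704), "GU": (734, 1332), "UC": (658, 658), "UG": (307, 231),
}


def doublet_class(doublet):
    """Return 'WW', 'MM' or 'SS' according to the number of G/C bases."""
    strong = sum(base in "GC" for base in doublet)
    return ("WW", "MM", "SS")[strong]


def cu_ratios(cu_counts):
    """Pooled C-ending / U-ending frequency ratio for each doublet class."""
    totals = {"WW": [0, 0], "MM": [0, 0], "SS": [0, 0]}
    for doublet, (c_count, u_count) in cu_counts.items():
        pooled = totals[doublet_class(doublet)]
        pooled[0] += c_count
        pooled[1] += u_count
    return {cls: c / u for cls, (c, u) in totals.items()}


eco_ratios = cu_ratios(ECO_CU)
for cls in ("WW", "SS", "MM"):
    print(f"{cls}C/{cls}U = {eco_ratios[cls]:.2f}")
```

Under the energy-optimization hypothesis the expected ordering is WWC/WWU above MMC/MMU above SSC/SSU, which is the ordering these E. coli values show.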

The value in the last column of Table 3 reflects overall G + C content of degenerate bases; values in the first two columns are meaningful by comparison with that for MMC/MMU. We observe that the whole E. coli sample of 149 sequenced genes indicates translation pairing energy optimization but that the highly expressed (HE) sample of Gouy and Gautier (1982) shows much wider variation between the values for the first two columns.

Because of anticodon base modification we cannot be sure that there is not general pairing energy optimization in man, but the data in Table 3 definitely imply its existence in the nine highly expressed genes of C. elegans and probably in Drosophila (where HE sequences are not separated).

Two cases among human genes are particularly interesting.

The first is α-globin mRNA, for which the optimization is indicated by these ratios, while for β-globin mRNA it is not. It has never been settled whether translation efficiency differs between α- and β-messengers; this result suggests that it does differ.

The second case is Ig (immunoglobulin) C (constant) segments, which favour C as third base when the first two bases are A or U. The same phenomenon is seen with mouse Ig C segments (relative differences between columns are similar for mouse and human Ig C segments). But, mouse Ig V (variable) segments also show evidence, although less strong, for the optimization whereas human Ig V segments do not. The two kinds of segments in both man and mouse avoid C as third base when the first two bases are C or G, but a preference for C with A or U in the first two positions is not observed with human Ig V segments.

We conclude that, in the absence of other explanation for Table 3, there is some codon-anticodon pairing energy optimization in Metazoa, at least in certain gene families, all the way up to and possibly including humans. These results, which are new for Metazoa, indicate that this phenomenon is linked to expressivity level, as in lower organisms (Gouy and Gautier 1982; Grosjean and Fiers 1982; Ikemura 1985).

4. Intraspecific variation and expressivity in the immune system

4.1. DESCRIPTION OF THE IMMUNE SYSTEM

The immune system of vertebrates is a complex organization involving several cell types and many protein molecules. Many of these molecules show a considerable degree of polymorphism which may be of two distinct types.

(i) A classical multiple-allele polymorphism where the population as a whole shows a very wide range of phenotypes, but each individual expresses a defined, simple type inherited in normal Mendelian fashion by offspring. This is the case for the antigens of the major histocompatibility complex (MHC).

The class 1 antigens are expressed on the majority of cell types; they are believed to be involved in the determination of self-recognition by the organism, and are major targets for the graft rejection reaction.
The class 2 antigens are chiefly expressed on cell types involved in the mounting of the immune response (lymphocytes, macrophages, ...), and are implicated in the cell-to-cell co-operation within the immune system.

The MHC antigens provide a cellular context for foreign antigen recognition. A foreign antigen, e.g. a virus, presented on a cell is only capable of inducing an immune response under normal conditions if the responding cell shares MHC antigens with the presenting cell. The MHC antigens are the most polymorphic genetic marker known.

(ii) A second and unique type of polymorphism is seen in the effector molecules of the B lymphocyte - the immunoglobulins (Ig) - and in the T cell receptors (Tcr). Every individual of a species expresses a vast number of chemically distinct molecules of Ig and Tcr. The molecular events which generate this variability are now moderately well understood. During lymphocyte differentiation, a rearrangement of the cell genome apposes a segment coding for the N-terminal portion of the final protein, via one or two junctional segments, to a position upstream of the region coding for an invariant C-terminal portion. The N-terminal (variable) region genes and the junctional segments are present in multiple copies, and the joining process has some positional flexibility; this leads to a combinatorial generation of many variant sequences. In addition, somatic mutations appear to increase the diversity of these segments during the life of the cell.

Stimulation of a particular lymphocyte by its specific antigen leads to proliferation, producing daughter cells with the same genetic rearrangement, and hence to increased production of the relevant immune response. Both immunoglobulins and T-cell receptors are made up of two different polypeptide chains, coded at separate genetic loci, which both consist of variable (V) and constant (C) regions, leading to additional combinatorial variability. For immunoglobulins, the two polypeptides are called light chains (L) and heavy chains (H). Two different classes of L chains (Kappa and Lambda) are coded on separate chromosomes and possess distinct libraries of V regions. Either class may interact with any H chain to produce an immunoglobulin molecule. The several classes of H chain are coded at a single complex locus on another chromosome and share a common V region library. The different C region genes are arranged as a closely grouped series and each consists of several exons.

The C region gene proximal to the rearranged active V region gene corresponds to IgM. The immature B lymphocyte expresses IgM from a mRNA generated by splicing from the V region segment (V-DJ) and the C region exons. Occasionally, a few molecules of other immunoglobulin classes may be made by the immature B lymphocyte by an alternative splicing event which removes the whole of the C coding segment together with the first intron. At a later stage in cellular differentiation, a further genome rearrangement may occur, leading to the elimination of the DNA coding for an arbitrary number of Ig C region genes and thereby bringing the V region coding segment with the first intron into apposition with a downstream C region segment. The B lymphocyte (and its progeny) will then produce a new class of immunoglobulin, but will conserve the L chain and H chain V regions, and thus the antibody specificity of the resulting molecule. This is called 'class switching'. These rearrangements (V-J joining, class switching) employ recognizable signal sequences in the genomic DNA as positional markers.

In the following section we wish to examine whether the coding strategies within the different regions of these molecules may be involved in:

(i) the extreme allelic polymorphism of the MHC system; and

(ii) the unique mechanism for the generation of molecular variability in the immunoglobulins and T cell receptors.

Similar studies on the less polymorphic molecules of the complement system, the Ig receptors with their nucleic acid and protein homologies to the immunoglobulins, and to the interleukins, etc., have been deferred for the present, due to the paucity of published sequence data.

4.2. DIFFERENTIAL MUTATION ALONG THE SEQUENCES

In Ig sequences we observe differences in coding strategy between V regions and C regions (Perrin 1984). The most striking in terms of the 'genome hypothesis' (Grantham et al. 1980), is the variation according to segment type of percentage (G + C) in the third position of quartet codons (see legend Table 4). C regions use more C- or G-ending codons than V regions (Miyata et al. 1979; Perrin 1984). This appears to be a general tendency in vertebrates since A- and U-ending codons are rare in C regions of rabbit, rat, chicken, and caiman Ig genes (Perrin 1984). It is difficult to understand this phenomenon in terms of expressivity because C and V regions are transcribed on the same messenger.

Table 4 *Percentage (G+C) in total sequence and in third position of quartet codons of human and mouse Ig V and C regions*
	Man		Mouse
	C (8)	V (9)	C (11)	V (59)
%(G+C) total	60.6	53.8	52.6	50.3
%(G+C) QIII	76.0	57.1	55.7	46.4
Numbers of sequences studied appear in parentheses. 'Quartet' codons are the four-fold degenerate sets of Arg, Leu, Ser, Thr, Pro, Ala, Gly, and Val (Grantham 1980). QIII indicates the third position of such codons. C, constant region; V, variable region.
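The QIII measure of Table 4 can be sketched for a single coding sequence as follows. The quartet sets are identified here by their first two bases (CG, CU, UC, AC, CC, GC, GG, GU), which is our reading of the legend; the function name and the toy sequence are hypothetical and do not come from the Ig data.

```python
# First two bases of the eight four-fold degenerate (quartet) codon sets:
# Arg (CGN), Leu (CUN), Ser (UCN), Thr (ACN), Pro (CCN), Ala (GCN),
# Gly (GGN), Val (GUN).
QUARTET_PREFIXES = {"CG", "CU", "UC", "AC", "CC", "GC", "GG", "GU"}


def gc_qiii(coding_seq):
    """Percentage (G+C) in the third position of quartet codons.

    Returns None when the sequence contains no quartet codon.
    """
    seq = coding_seq.upper().replace("T", "U")
    third_bases = [
        seq[i + 2]
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 2] in QUARTET_PREFIXES
    ]
    if not third_bases:
        return None
    return 100 * sum(base in "GC" for base in third_bases) / len(third_bases)


# Hypothetical toy sequence: GCC CUG AAA GGU ACG
toy_segment = "GCCCUGAAAGGUACG"
print(gc_qiii(toy_segment))
```

Applying the function separately to the V and C portions of one messenger gives the kind of within-gene contrast reported in Table 4.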

The different specificities of antibodies are generated, in part, by recombinations between V and J (joining) segments of L chains, or V, D (diversity), and J segments for V regions of H chains, present in the germinal library (Tonegawa 1983). Somatic mutations, involving only V regions, help to increase the range of specificities (Bothwell et al. 1981; Gershenfeld et al. 1981; Perlmutter et al. 1984; Jerne 1985; Sablitzky et al. 1985). X-ray diffraction studies have shown that three zones of V regions are directly involved in antigen recognition. These are HV (hypervariable) zones. The rest of the V region constitutes the framework (FR). Gojobori and Nei (1984) revealed that HV domains have a nucleotide substitution rate three times greater than that in the FR. Are A- and U-ending codons used more in HV zones (Perrin 1984)? This appears to be the case.

.		% (G+C)
.		HV	FR
Mouse	I and II	42.15	50.79
	Q3	26.87	45.86
	QID3	42.06	49.90
Man	I and II	47.42	50.80
	Q3	43.34	60.20
	QID3	53.15	62.89
See Kabat et al. (1983) for HV limits. I and II, first two codon positions combined; Q3, third position in quartet codons; QID3, third position in all degenerate codons.

The local (A + T) content seems to correlate with local nucleotide substitution rate. The lower (G + C) content of HV domains may lead to a less tight binding between DNA strands and thus increase the basic mutation rate (Adams and Eason 1984; Perrin 1984). It is known that replication accuracy changes along the genome (Bernardi and Ninio 1978).

Preliminary analysis of mouse Tcr coding sequences also indicates differential usage of synonymous codons for V and C regions. The difference, however, is smaller than in Ig segments and depends on the peptide chain. For example, for six β-chain sequences of murine Tcr (Chien et al. 1984; Hedrick et al. 1984; Patten et al. 1984; Saito et al. 1984) the values of (G + C)Q3 are 61.6 per cent for C regions and 46.3 per cent for V regions.

We do not find differential codon usage between different domains of MHC sequences, which exhibit multi-allelic polymorphism, and not somatic mutation and segment recombination (Benacerraf 1981; Steinmetz 1984).

4.3. NUMBER OF DIFFERENT CODONS USED IN Ig GENES

Harmonization between codon usage and tRNA availability probably occurs at the messenger level, as seen above (selection of tRNA genes may also take place, of course). The range of codons used in V and C regions is quite similar, although relative frequencies of the different codons vary considerably between the two kinds of regions (Perrin 1984; unpublished observations). Analysis of codon choices in Cγ genes and Cε genes has revealed no great variation (Grantham and Perrin 1985) in spite of their different contents in plasma. IgG represents 75 per cent of plasma Ig whereas IgE content is less than 0.1 per cent (Nisonoff et al. 1975), yet their codon usage appears similar. However, IgE may be highly produced locally, hence we cannot be sure its gene has not been selectively optimized for coding strategy. So far, therefore, no differential range in number of codons used has been found among the various Ig genes. The few qualitative data available on the lymphocyte tRNA population (Marini and Mushinski 1979) are too imprecise for a related study.

4.4. DISTRIBUTION OF CG, UA, UG, AND CA DOUBLETS, AND VARYING G + C CONTENT

The 16 dinucleotides (doublets) differ in frequency in natural nucleic acids; this variation may be linked to regulation involving base modification (methylation). It happens that, in most eukaryote sequences studied, C followed by G is much rarer than C followed by any other base (Grantham et al. 1985). Vertebrate genomes are strongly methylated and C is the only base so modified. Cytosine is methylated only in CpG (Felsenfeld and McGhee 1982). The mC tends to mutate to thymine, raising (in RNA) the frequency of UG (and CA on the complementary strand in DNA) (Barker et al. 1984). CpG frequency is interesting for three reasons.

(i) Is avoidance of CG doublets strictly correlated to high frequency of UG (or CA)?
(ii) Are regions rich in (G + C) characterized by non-avoidance of CpG, as suggested by Adams and Eason (1984)?
(iii) Is local non-avoidance of CpG linked to gene expressivity (Cooper and Gerber-Huber 1985; Wolf and Migeon 1985)? That is, do genes containing larger relative amounts of the CG doublet tend to code for abundant proteins?

4.4.1. CG, UA, UG, and CA doublet frequencies in Ig coding sequences

For this study we used a statistical test to compare observed and expected frequencies. The expected frequency is calculated by base permutation (Grantham et al. 1985; Gautier et al. 1985). Results are given in Table 5. They lead to three conclusions.

Table 5 *CG, UA, UG and CA doublet normalized frequencies in V and C regions of human and murine Ig mRNA*
		Man		Mouse
Codon position		C (8)	V (9)	C (11)	V (59)
CpG	I-II	-10.77 (-3.81)	-8.02 (-2.67)	-15.00 (-4.52)	-16.54 (-2.15)
	II-III	-12.22 (-4.32)	-9.27 (-3.09)	-17.50 (-5.28)	-24.71 (-3.22)
	III-I	-8.06 (-2.85)	-9.83 (-3.28)	-10.42 (-3.14)	-29.28 (-3.81)
.
UpA	I-II	-5.61 (-1.98)	ns	-7.97 (-2.40)	-3.30 (-0.43)
	II-III	-5.34 (-1.89)	-5.36 (-1.79)	-8.07 (-2.43)	-8.60 (-1.12)
	III-I	-6.50 (-2.30)	-3.79 (-1.26)	-7.75 (-2.34)	-15.91 (-2.07)
.
UpG	I-II	8.43 (2.98)	ns	11.48	ns
	II-III	9.87 (3.49)	4.73 (1.58)	9.03 (2.72)	12.35 (1.61)
	III-I	5.82 (2.06)	6.60 (2.20)	7.60 (2.29)	20.08 (2.61)
.
CpA	I-II	7.15 (2.53)	4.67 (1.56)	8.61 (2.59)	10.56 (1.37)
	II-III	6.60 (2.33)	5.97 (1.99)	10.87 (3.27)	13.94 (1.81)
	III-I	ns	6.34 (2.11)	ns	19.26 (2.51)
.
Absolute values > 1.96 are statistically significant at 5%; absolute values > 2.57 are significant at 1%. The value in parentheses is the mean and the value preceding it is the accumulated measure for the sequences in that column. Positive values indicate doublets of higher than expected frequency (from permutations conserving base composition and codon position); negative values reveal avoided doublets. ns, not significant. C, constant. V, variable. The number of sequences studied appears in parentheses at the head of the column.
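
One way to picture the normalization described in this legend is the Python sketch below. It is an illustration under stated assumptions, not the procedure of Gautier et al. (1985): bases are shuffled separately within each codon position (so composition per position is conserved), the observed doublet count is compared with the permutation mean in units of the permutation standard deviation, and per-sequence values are combined as sum/√n. The last step is inferred only from the tabulated numbers, where the accumulated value is close to the mean multiplied by √n (for human C regions, 3.81 × √8 ≈ 10.78). All function names are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def count_doublet(seq, doublet, offset):
    """Count `doublet` starting at codon offset 0 (I-II), 1 (II-III) or 2 (III-I)."""
    return sum(1 for i in range(offset, len(seq) - 1, 3) if seq[i:i + 2] == doublet)


def shuffle_within_positions(seq, rng):
    """Permute bases separately within codon positions I, II and III."""
    columns = [list(seq[p::3]) for p in range(3)]
    for col in columns:
        rng.shuffle(col)
    # Re-interleave the three shuffled columns into one sequence.
    return "".join(columns[i % 3][i // 3] for i in range(len(seq)))


def normalized_frequency(seq, doublet, offset, n_perm=500, seed=0):
    """Z-like score: (observed - permutation mean) / permutation standard deviation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = count_doublet(seq, doublet, offset)
    null = [
        count_doublet(shuffle_within_positions(seq, rng), doublet, offset)
        for _ in range(n_perm)
    ]
    sd = pstdev(null)
    return 0.0 if sd == 0 else (observed - mean(null)) / sd


def combine(z_values):
    """Return (accumulated, mean) for per-sequence scores; sum/sqrt(n) is an assumption."""
    n = len(z_values)
    total = sum(z_values)
    return total / math.sqrt(n), total / n
```

Under this reading a column with few sequences can show a large mean but a modest accumulated value, and the reverse for a large sample such as the 59 mouse V regions.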

(i) CG doublets are avoided in human and mouse V and C regions. This avoidance appears also in V and C introns (unpublished results).

(ii) The C regions (human and mouse) tend to avoid UpA, as V regions do, especially in position III-I (between codons).

(iii) The C regions have more CA (except in III-I) and UpG (in all positions) than expected. The V regions also show this tendency, especially in position III-I.

Since UpA frequency is lower than expected (either in all positions or in III-I), its avoidance cannot be explained exclusively by terminators being UA-beginning codons, as has often been suggested. In Ig coding V sequences, the avoidance of CpG increases from position I-II to III-I. C regions affected by this phenomenon contain high (G + C) content (>60 per cent in human C regions). Murine Tcr sequences also avoid CG and UA doublets and have elevated UG and CA frequencies in positions II-III and III-I (Table 6).

Table 6 *CG, UA, UG, and CA doublet normalized frequencies in V and C regions of mouse T-cell receptor mRNA*
Position		C (8)	V (11)
CpG	I-II	-5.62 (-1.99)	-5.68 (-1.71)
	II-III	-9.69 (-3.43)	-7.63 (-4.28)
	III-I	-12.10 (-4.28)	-9.49 (-2.86)

UpA	I-II	-5.90 (-2.09)	-3.26 (-0.98)
	II-III	-6.71 (-2.37)	-5.01 (-1.51)
	III-I	-5.30 (-1.87)	-6.45 (-1.94)

UpG	I-II	2.97 (1.05)	ns
	II-III	5.99 (2.12)	3.87 (1.17)
	III-I	7.92 (2.80)	7.01 (2.11)

CpA	I-II	ns	ns
	II-III	4.83 (1.71)	4.30 (1.30)
	III-I	2.86 (1.01)	3.55 (1.07)
See legend Table 5.

4.4.2. CpG frequency along the MHC sequences

Studies on genes like HPRT (hypoxanthine phosphoribosyltransferase) and G6PD (glucose-6-phosphate dehydrogenase) reveal CpG clusters at their 5' extremity (Wolf and Migeon 1985). CpG frequency varies along the MHC genes too (Tykocinski and Max 1984). Exons of each MHC sequence have been separated for two classes of histocompatibility antigen (MHC-I and MHC-II). Each exon codes for a determined structural domain of the protein chain (three domains in heavy MHC-I chains, two in α and β MHC-II chains). Table 7 gives results on CpG, UpA, UpG, and CpA usage, revealing the following.

Table 7. *Normalized frequencies of doublets CG, UA, UG, and CA in combined human and mouse MHC sequences according to codon position*
.		MHC-I			MHC-II
.		Heavy chains			Alpha chains		Beta chains
Position		Exon 2 (5)	Exon 3 (5)	Exon 4 (4)	Exon 2 (8)	Exon 3 (8)	Exon 2 (7)	Exon 3 (7)
CpG	I-II	ns	ns	-5.01 (-2.50)	ns	-3.28 (-1.16)	3.99 (1.51)	-3.40 (-1.29)
	II-III	ns	ns	-8.26 (-4.13)	-5.38 (-1.90)	-6.69 (-2.36)	ns	-7.21 (-2.72)
	III-I	ns	ns	-8.02 (-4.02)	-7.56 (-2.71)	-11.00 (-3.89)	ns	-10.49 (-3.97)

UpA	I-II	ns	ns	-2.61 (-1.30)	-6.08 (-2.15)	-5.15 (-1.82)	ns	-3.25 (-1.23)
	II-III	ns	ns	-3.30 (-1.65)	-4.01 (-1.42)	-5.26 (-1.86)	-2.84 (-1.07)	-5.43 (-2.05)
	III-I	-2.41 (-1.08)	-2.96 (-1.33)	-3.57 (-1.78)	-3.48 (-1.23)	-7.50 (-2.65)	ns	-3.75 (-1.42)

UpG	I-II	ns	2.36 (1.05)	5.20 (2.60)	ns	4.44 (1.57)	ns	4.86 (1.84)
	II-III	ns	2.75 (1.23)	4.91 (2.46)	2.60 (0.92)	2.80 (0.99)	3.00 (1.13)	4.24 (1.60)
	III-I	ns	2.31 (1.03)	6.79 (3.40)	7.88 (2.79)	5.22 (1.85)	ns	4.81 (1.82)

CpA	I-II	ns	ns	ns	ns	-4.30 (-1.52)	-3.97 (-1.50)	-2.39 (-0.90)
	II-III	2.58 (1.16)	ns	2.48 (1.24)	ns	6.12 (2.16)	2.66 (1.00)	4.73 (1.79)
	III-I	ns	ns	2.04 (1.02)	5.75 (2.03)	7.94 (2.81)	3.53 (1.33)	ns
See legend Table 5.

(i) Exons (E2 and E3) for the first two domains of heavy MHC-I chains show no avoidance of CpG, but do avoid UpA in position III-I; exon 2 (E2) of MHC-II β-chains avoids UpA only in position II-III and does not avoid CpG in any position.

(ii) Avoidance of both CpG and UpA in all positions occurs in MHC-I exon 4 and MHC-II α-exon 3 and β-exon 3.

(iii) Exon 2 for MHC-II α-chains avoids UpA and CpG in all positions except I-II for the latter doublet.

CG doublet avoidance is similar in positions II-III and III-I of MHC genes. Translation constraints explain the variation in I-II. For example, exons coding for β-1 domains use slightly more quartet codons (70 per cent) than expected (4/6 = 67 per cent) to code arginine. Some exons that do not avoid CpG are rich in (G + C), but exons for the third domain of HLA-A3 and HLA-CW3 transplantation antigens (Sodoyer et al. 1984; Strachan et al. 1984) have high (G + C) content (>60 per cent) while avoiding CG doublets. HLA-I 5' untranslated regions and the first two introns have expected CpG frequencies (Table 8), as does the HLA-AW24 5' extremity (N'Guyen et al. 1985).

Table 8. *CG doublet normalized frequency in 5' and 3' untranslated regions and introns of MHC sequences*
.	Number of bases	CpG normalized frequency
5' HLAI	426	ns
3' HLAI	1174	-9.6 (-6.8)
Introns 1 & 2 (HLAI)	743	-4.0 (-2.0)
Intron 3 (HLAI)	2898	-14.8 (-4.7)

5' HLAII	1637	-7.2 (-5.1)
3' HLAII	2569	-10.3 (-3.7)

5' H2II	1916	-8.5 (-3.8)
3' H2II	7251	-21.0 (-6.1)
Absolute values > 1.96 are statistically significant at 5%; absolute values > 2.57 are significant at 1%. The value in parentheses is the mean and that preceding is the accumulated measure for the sequences in that row. Since these are untranslated sequences no account is taken of triplet position. See legend Table 5 for other information.

4.4.3. Discussion

The 5' regions of HLA-I heavy chains, from the 5' end of the untranslated zone to the 3' end of exon 3 (5'UT + E1 + I1 + E2 + I2 + E3), do not avoid CpG. This may relate to the housekeeping status of classic transplantation antigens (Robertson 1985). These clusters in conjunction with hypomethylation may maintain gene activity (Wolf and Migeon 1985). However, this is not specific to HLA-I genes since we find CpG clusters in the 5' region of the β-chain sequence, too (HLA-II and H2-II).

Adams and Eason (1984) proposed that stability of regions with high (G + C) content protects CG doublets against mutations via deamination, thus explaining non-avoidance of CpG. However, we have shown that some exons with high (G + C) content do avoid CG dinucleotides, for example, human Ig C regions. C regions seem to be highly methylated in non-mature B lymphocytes (Storb and Arp 1983). Other gene families show similar behaviour (Grantham 1985).

Non-avoidance of CG (and UA) doublets occurs in the most polymorphic domains (Choi et al. 1983; Sodoyer et al. 1984). Exons for MHC-II α-1 domains (moderately polymorphic) avoid CpG less strongly than those coding α-2 domains (less polymorphic) (Benoist et al. 1983). Hence, a correlation between the degree of polymorphism and CpG frequency can be demonstrated. CpG clusters may assume a specific function. We know that, according to physiological conditions, nucleic acids may change in local conformation and that these changes are sequence dependent. A region rich in (G + C) under different conditions may assume B- or Z-DNA conformation (Hamada et al. 1982; Johnston and Rich 1985; Nordheim and Rich 1983). Z-DNA conformation may be a hot spot for rearrangement and gene conversion (Hamada et al. 1982; Nordheim and Rich 1983; Rogers 1983; Perrin and Grantham 1986). This scenario is compatible with conserving polymorphism. Gene conversion is a major mechanism for the generation of polymorphism in MHC genes (Weiss et al. 1983). Synonymous codon choices allow organisms or cells to vary doublet frequencies along the gene sequences. In turn the varying doublet frequencies could be linked to conformation changes between B- and Z-DNA, which could induce genetic variability and differential expression. Data are, however, still inadequate for definitely resolving the question of the relation between CpG frequency and expressivity.

5. Particularities in human viruses

Human viruses in general have less G and C in codon position III than does the host genome, 47.5 versus 66.1 per cent, respectively, having been found in large samples (Grantham et al. 1985). The viral genes also showed a larger variation than the host gene families in degenerate G + C content (see Fig. 5 of Grantham et al. 1985). In addition, the study revealed that DNA viruses vary more in coding strategy than do RNA viruses.

Table 9 *Coding sequences in human virus families*
Sequence origin (GenBank release 35)		Number of sequences
I	Herpesviridae ds DNA
	Epstein-Barr virus (EBV)	57
	Herpes simplex virus (HSV)	11
	Varicella-Zoster virus (VZV)	5
	Cytomegalovirus (CYV)	1
II	Poxviridae ds DNA
	Variola virus (VAR)	1
III	Adenoviridae ds DNA
	Adenovirus type 2 (AD2)	26
	Adenovirus type 5 (AD5)	15
	Adenovirus type 7 (AD7)	7
	Adenovirus type 12 (AD12)	5
IV	Papovaviridae ds DNA
	Papovavirus (BKV)	11
	Papilloma virus (HPV)	4
V	Hepadnaviridae ds DNA enveloped
	Hepatitis B virus (HBV)	9
VI	Reoviridae ds RNA
	Rotavirus (WAR)	3
VII	Orthomyxoviridae ss RNA
	Influenza A (FLNT)	6
	Influenza A (FLP)	10
	Influenza A (FLU)	10
	Influenza A (FL)	20
	Influenza B (FLB)	15
VIII	Picornaviridae ss RNA
	Poliovirus (POLIO)	4
	Rhinovirus (HRV)	3
IX	Paramyxoviridae ss RNA
	Respiratory syncytial virus (HRSV)	2
X	Retroviridae ss RNA
	Human T-cell leukaemia type I (HTLV-I)	2
	Human T-cell leukaemia type II (HTLV-II)	2
	Lymphadenopathy-associated virus or Human T-cell leukaemia virus type III (HTLV-III/LAV) (AIDS) [Now HIV-1]	10

We now analyse 186 human and 243 virus gene sequences, each of at least 300 nucleotides. Table 9 groups the viral genes according to family, while Table 10 and Fig. 1 give percentage (G + C) of third bases in the sequences.

Table 10. *Base composition of human and human virus coding sequences*
.			%
.			A	C	G	U	G+C
Human
	186 (1)	T	24.5	27.4	26.8	21.3	54.2
	48875 (2)	I	27.0	23.6	32.3	17.1	55.9
	23452 (3)	II	31.1	23.4	19.1	26.4	42.5
		III	15.5	35.1	29.0	20.4	64.1
		Q3	16.7	36.6	25.4	21.3	62.0
Virus (excepting herpes)
	169 (1)	T	29.9	22.6	23.9	23.6	46.5
	76631 (2)	I	31.1	20.4	30.2	18.3	50.6
	33967 (3)	II	31.0	23.2	19.0	26.8	42.2
		III	27.6	24.2	22.6	25.6	46.8
		Q3	30.6	24.9	19.5	25.0	44.4
Herpes virus
	74 (1)	T	21.1	31.0	28.1	19.8	59.1
	36001 (2)	I	23.2	26.8	33.8	16.2	60.6
	20067 (3)	II	25.8	29.0	19.4	25.8	48.4
		III	14.5	37.2	31.0	17.3	68.2
		Q3	16.2	38.8	30.4	14.6	69.2
(1) Number of genes. (2) Number of codons. (3) Number of quartet codons. T, total; I, II, III and Q3 are codon position, Q3 being confined to degenerate bases in quartet (fully degenerate) codon sets.
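
How the T, I, II and III rows of such a table are obtained from a coding sequence can be sketched in Python as follows. This is a minimal illustration, assuming an in-frame sequence given as a string; the function name `composition_by_position` and the nine-base example are hypothetical, and the Q3 row (restricted to quartet codons) is not computed here.

```python
from collections import Counter


def composition_by_position(seq):
    """Percent A, C, G, U and G+C for the total sequence (T) and codon positions I, II, III."""
    rna = seq.upper().replace("T", "U")
    rna = rna[: len(rna) - len(rna) % 3]  # keep whole codons only
    if not rna:
        raise ValueError("sequence must contain at least one complete codon")
    slices = {"T": rna, "I": rna[0::3], "II": rna[1::3], "III": rna[2::3]}
    table = {}
    for label, bases in slices.items():
        counts = Counter(bases)
        row = {b: 100.0 * counts[b] / len(bases) for b in "ACGU"}
        row["G+C"] = row["G"] + row["C"]  # the G+C column is the sum of the C and G columns
        table[label] = row
    return table


# Hypothetical three-codon fragment: AUG GCC GAA
rows = composition_by_position("AUGGCCGAA")
```

The same column relation can be checked by hand in Table 10: for human position III, 35.1 + 29.0 = 64.1.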

This larger sample confirms our previous findings: the 10 types of host genes in Fig. 1 all have a mean of over 50 per cent (G + C) in degenerate bases or in total composition (excepting interferons).

Base composition [(C+G)%] of human and viral genes

Fig. 1. Percentage (G+C) for total sequence (continuous lines) and codon position (dashed lines) for human nuclear and human virus genes. See Table 9 for virus identification.

In all cases of host genes the degenerate percentage (G + C) is greater than that of total composition. Most viral genes have less than 50 per cent (G + C) in codon position III, although herpes EBV, HSV, and cytomegalovirus exceed this value, as do Ad 2 and Ad 5. Again we see that RNA viruses vary less in synonymous codon choices than do DNA viruses. The fast evolving influenza viruses reveal a surprisingly uniform percentage (G + C) in third bases. Overall, the new data confirm the previous conclusion that viruses do not closely imitate the use of the codon catalogue by the host. This is clearly portrayed in Fig. 2 (see Fig. 7 of Grantham et al. 1985), where the high variation of viral coding strategy compared to that of the human genome is also evident.

Base composition of human and viral genes

Fig. 2. Position of human (small letters) and viral genes as a function of percentage (G + C) and (1.5 per cent A-0.5 per cent U) in codon position III. DNA viruses are underlined. The abscissa and ordinate represent, respectively, the first and second most important factors in distinguishing coding strategies of gene sequences. See Table 9 for virus identification. ant, antigen; enz, enzyme; horm, hormone; igc, Ig constant (segment); igv, Ig variable; intf, interferon; onc, oncogene; oth, others; α-gl, α-globin; β-gl, β-globin.

The contrast between the AIDS virus (Ratner et al. 1985) and other retroviruses extends to codon choices. Five other retroviruses (BLV, bovine leukaemia virus; MoMuLV, Moloney murine leukaemia virus; AKV, strain AKR ecotropic endogenous murine leukaemia virus; RSV, Rous sarcoma virus; HTLV-1, human T cell leukaemia virus type 1) have been compared to AIDS. In summary (data not shown), for the three amino acids with six codons each and the five with four codons each, the preferred codon is nearly always different in all three viral genes (gag-pol-env) between AIDS and any of these five oncoviruses (Shinnick et al. 1981; Schwartz et al. 1983; Seiki et al. 1983; Herr 1984; Sagata et al. 1985). AIDS generally favours A-ending codons while these five viruses favour C- or G-ending and, less often, U-ending triplets.

Codons of highest frequency in AIDS for the eight amino acids are: Arg AGA, Leu UUA, Ser AGU, Thr ACA, Pro CCA, Ala GCA, Gly GGA, and Val GUA. These choices are consistently repeated in all three AIDS genes with only two exceptions. In env, UUG is slightly favoured over UUA for coding Leu and in gag, AGC and UCA are tied as highest frequency Ser codons. With any of the above five viruses, at most two of the eight amino acids show the same preferred codon as in AIDS for all three genes, and this occurs only with Arg and Gly in AKV, and MoMuLV.

Much closer agreement in coding strategy is seen between AIDS and Visna lentivirus (VLV) (Sonigo et al. 1985). The preferred codon is identical for five of the eight amino acids in gag (VLV favours AGU for Ser, CCC for Pro and GUG for Val). With both pol and env genes all eight choices coincide between VLV and AIDS. Thus, the five other viruses appear evolutionarily distant from AIDS, as judged by favoured triplet for amino acids having full degeneracy in their codon sets. AIDS and VLV by this criterion are rather similar; this conclusion is compatible with other findings in suggesting that AIDS/LAV is more closely related to lentiviruses than to oncoviruses (Chiu et al. 1985; Sonigo et al. 1985). Table 11 summarizes codon use for the eight amino acids in the six viruses compared to AIDS. On the basis of absolute frequencies of preferred codons for these amino acids in the combined gag-pol-env genes of each genome, HTLV-I appears as most distant of any of the viruses from AIDS.

Table 11. *Triplet frequencies in AIDS and other retroviruses for the eight amino acids having complete degeneracy in their codon sets*
.		Absolute frequency in gag + pol + env of
.		AIDS	VLV	BLV	MoMuLV	AKV	RSV	HTLVI
Arg	AGA	80	83	17	49	48	27	11
	AGG	30	52	13	22	21	26	4
	CGA	4	9	21	21	22	26	18
	CGC	3	4	19	22	20	13	22
	CGG	5	4	24	23	23	19	16
	CGU	1	2	7	12	8	15	13

Leu	UUA	65	74	25	33	34	40	31
	UUG	31	45	29	33	37	45	15
	CUA	27	28	39	60	57	22	53
	CUC	23	17	56	58	66	41	63
	CUG	31	19	41	67	57	48	34
	CUU	20	6	42	30	28	36	50

Ser	AGC	32	22	14	17	17	15	26
	AGU	42	39	8	15	16	26	11
	UCA	30	31	19	22	23	24	28
	UCC	11	11	48	40	52	34	56
	UCG	4	6	11	8	8	15	6
	UCU	12	11	40	37	34	34	23

Pro	CCA	61	45	34	48	50	36	44
	CCC	22	26	79	82	84	43	86
	CCG	3	16	20	23	26	36	18
	CCU	29	32	48	63	53	43	40

Ala	GCA	75	70	29	32	36	37	15
	GCC	31	25	69	81	84	60	77
	GCG	6	16	8	14	11	56	9
	GCU	32	30	31	37	33	43	34

Gly	GGA	92	116	31	65	70	58	29
	GGC	18	10	27	32	27	34	31
	GGG	33	56	23	55	50	69	21
	GGU	18	17	15	19	27	31	10

Val	GUA	84	83	14	27	28	25	11
	GUC	16	8	30	43	41	35	31
	GUG	29	55	13	29	41	43	15
	GUU	21	10	18	31	21	43	14
See text for virus identity.
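
A simple tally of preferred codons can be run directly on the counts in Table 11. The Python sketch below is one possible reading, not the authors' distance measure: for each amino acid it finds the most frequent codon(s) in each virus and counts how often that choice includes the AIDS preference. The names `TABLE_11`, `preferred_codons` and `agreement_with_aids` are illustrative. Only the seven amino acids listed in the table as reproduced here are used, so the maximum score is 7.

```python
VIRUSES = ["AIDS", "VLV", "BLV", "MoMuLV", "AKV", "RSV", "HTLVI"]

# Absolute codon frequencies in gag + pol + env, transcribed from Table 11.
TABLE_11 = {
    "Arg": {"AGA": [80, 83, 17, 49, 48, 27, 11], "AGG": [30, 52, 13, 22, 21, 26, 4],
            "CGA": [4, 9, 21, 21, 22, 26, 18], "CGC": [3, 4, 19, 22, 20, 13, 22],
            "CGG": [5, 4, 24, 23, 23, 19, 16], "CGU": [1, 2, 7, 12, 8, 15, 13]},
    "Leu": {"UUA": [65, 74, 25, 33, 34, 40, 31], "UUG": [31, 45, 29, 33, 37, 45, 15],
            "CUA": [27, 28, 39, 60, 57, 22, 53], "CUC": [23, 17, 56, 58, 66, 41, 63],
            "CUG": [31, 19, 41, 67, 57, 48, 34], "CUU": [20, 6, 42, 30, 28, 36, 50]},
    "Ser": {"AGC": [32, 22, 14, 17, 17, 15, 26], "AGU": [42, 39, 8, 15, 16, 26, 11],
            "UCA": [30, 31, 19, 22, 23, 24, 28], "UCC": [11, 11, 48, 40, 52, 34, 56],
            "UCG": [4, 6, 11, 8, 8, 15, 6], "UCU": [12, 11, 40, 37, 34, 34, 23]},
    "Pro": {"CCA": [61, 45, 34, 48, 50, 36, 44], "CCC": [22, 26, 79, 82, 84, 43, 86],
            "CCG": [3, 16, 20, 23, 26, 36, 18], "CCU": [29, 32, 48, 63, 53, 43, 40]},
    "Ala": {"GCA": [75, 70, 29, 32, 36, 37, 15], "GCC": [31, 25, 69, 81, 84, 60, 77],
            "GCG": [6, 16, 8, 14, 11, 56, 9], "GCU": [32, 30, 31, 37, 33, 43, 34]},
    "Gly": {"GGA": [92, 116, 31, 65, 70, 58, 29], "GGC": [18, 10, 27, 32, 27, 34, 31],
            "GGG": [33, 56, 23, 55, 50, 69, 21], "GGU": [18, 17, 15, 19, 27, 31, 10]},
    "Val": {"GUA": [84, 83, 14, 27, 28, 25, 11], "GUC": [16, 8, 30, 43, 41, 35, 31],
            "GUG": [29, 55, 13, 29, 41, 43, 15], "GUU": [21, 10, 18, 31, 21, 43, 14]},
}


def preferred_codons(virus):
    """Map each amino acid to the set of its most frequent codon(s) in `virus` (ties kept)."""
    col = VIRUSES.index(virus)
    result = {}
    for aa, codons in TABLE_11.items():
        top = max(counts[col] for counts in codons.values())
        result[aa] = {codon for codon, counts in codons.items() if counts[col] == top}
    return result


def agreement_with_aids(virus):
    """Number of amino acids whose top codon(s) in `virus` include the AIDS preference."""
    aids = preferred_codons("AIDS")
    other = preferred_codons(virus)
    return sum(1 for aa in TABLE_11 if aids[aa] <= other[aa])


agreement = {v: agreement_with_aids(v) for v in VIRUSES[1:]}
```

On these counts VLV agrees with AIDS for all seven listed amino acids and HTLV-I for none, which is in line with the ranking described in the text.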

6. A previous RNY code?

Shepherd (1982) has proposed that the present code derives from a prototype code in which purines predominated in codon position I and pyrimidines in position III, hence his 'RNY code' (R purine, N any base, Y pyrimidine). Indeed, for some reason the biological system prefers pyrimidines as degenerate bases (Grantham et al. 1983). Thus, with man, C + U in position III of the 195 genes of Table 1 is 55.4 per cent (52.3 is expected from the code structure). In fact, C is preferred over U as third base in human mRNA, as implied by the three columns of Table 3. This fact, unaccounted for by RNY theory (Shepherd 1982), apparently extends to most eukaryote organisms (excepting fungi), but not viruses (Grantham et al. 1983). It is not merely a consequence of CG doublet avoidance (avoidance of G as third base could tend to favour C) since Table 2 shows that CpG is favoured in codon position II-III of E. coli genes.

From Table 1 we calculate that C represents 29.3 per cent of E. coli third bases while U only accounts for 25.5 per cent (human values are 33.5 per cent and 21.8 per cent). Since G is favoured (28.2 per cent) and A is avoided (17.0 per cent) as third base (human values are rather similar), a better primitive code model would be NNS (N, any base and S = G or C) for both humans and E. coli. In sum, the large gene samples we work with do not support the RNY hypothesis because it does not account for the asymmetry between C and U (or G and A) frequencies as degenerate bases.
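
The arithmetic behind this comparison can be laid out in a few lines of Python. The percentages are the E. coli third-base values quoted in the paragraph above; the variable names and the helper `share` are illustrative only.

```python
# E. coli third-base composition (per cent), as quoted in the text.
ecoli_third = {"A": 17.0, "C": 29.3, "G": 28.2, "U": 25.5}


def share(composition, bases):
    """Summed percentage of the given bases, rounded to one decimal."""
    return round(sum(composition[b] for b in bases), 1)


pyrimidine_share = share(ecoli_third, "CU")  # Y, what an RNY code would favour
strong_share = share(ecoli_third, "GC")      # S, what an NNS code would favour
c_minus_u = round(ecoli_third["C"] - ecoli_third["U"], 1)
g_minus_a = round(ecoli_third["G"] - ecoli_third["A"], 1)
```

With these figures S (57.5 per cent) exceeds Y (54.8 per cent), and the G over A excess (11.2 points) is larger than the C over U excess (3.8 points); a purine versus pyrimidine rule alone does not capture either asymmetry.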

In addition, the apparent RNY working of the code in some species may relate to UpA and CpG rarity in codon position I-II. Both doublets are strongly avoided in yeast genes [see Table 2 above and entry 'Fun' (fungus) in Tables 13 and 14 of Grantham et al. 1985], on which Shepherd's model (1982 and 1984) was based. Their avoidance in position I-II, combined with the above general preference for pyrimidine third bases predicts the RNY (or RYY) schema. This is because CG and UA are both YR type doublets and the above avoidance necessarily favours A and G in position I. Note that UG and CA frequencies increase due to methylation of C in CG and mutation of mCG to UG (Bird 1980) and can compensate for CG avoidance, but not for UA avoidance. No molecular mechanism for explaining UA rarity has been advanced and no other YR type doublet has been proposed to be favoured by UA elimination. UpA is avoided in practically all kinds of sequences, both translated and untranslated, except mitochondria (Grantham et al. 1985).

7. Concluding remarks

What could be done to further the understanding of bias in use of synonymous codons? We offer some speculative suggestions.

One set of urgently needed data is concentrations of the different tRNAs that carry the same amino acid, the 'iso-acceptor-tRNAs'. Such data have been published only for bacteria and yeast (Bennetzen and Hall 1982; Ikemura 1985; de Boer and Kastelein 1986; Li et al. 1985), but their determination in various tissues of higher organisms and especially of man, for whom we now have many gene sequences for several protein families, would be most useful. This would allow assessment of the degree of harmonization between codon and anticodon distributions in different cells, both for nuclear genes and those of virus parasites. Thus, a better view of the evolutionary importance of this aspect of coding strategy would become possible. This appears especially cogent in understanding lymphotropic viruses, notably the AIDS virus (Grantham and Perrin 1986).

But on a longer term basis we need also to ask, so what? What if the two distributions do match rather well in each type of organism and cell (as most likely will be found), but each type of organism and cell has its own kind of distribution, its own coding strategy, which may be greatly different from that in other types of organism? We already know that both codon and tRNA distributions vary enormously between species. For example, the two distributions are known to be rather well harmonized for yeast and E. coli highly expressed genes, but these two organisms have different patterns of codon preferences and distinct iso-tRNA concentrations. That is, they have different biases. Therefore, why does the bias exist? This question is so difficult to treat scientifically that in effect it remains philosophical.

It will only become accessible as more data are accumulated on overall nucleotide metabolism, that is, the half-life and concentration in the cell of each kind of nucleotide, and perhaps that will only be a step in the right direction. It is already known that these factors vary widely in different cells, but no overall picture has been forthcoming. Perhaps a cell's overall nucleotide metabolism correlates with its degenerate base preferences; we can only speculate on this for the time being. We can, however, recognize a few related questions whose consideration may help in the general comprehension of the existence of this bias.

(i) Why don't degenerate bases have the same composition as introns or other untranslated sequences? The provisional answer here is:

(a) that the third bases are harmonized with the tRNA distribution and
(b) that codon-anticodon pairing energies are optimized for translation efficiency by third base choice.

(ii) Why does each kind of transcription product (mRNA, rRNA and tRNA) have a rather limited range of G + C content that is most often different (and in animals, at least, generally higher) than that of the whole genome? The simplistic answer is that this is the way the biological system happened to develop, but there are probably other, functional and historical, reasons to be found.

(iii) Why, for example, do α- and β-globin mRNAs make such different third base choices when they are translated at the same time and at similar abundances in the same cell?

(iv) The same question can be asked regarding C and V segments of immunoglobulin mRNA. Here the situation is even worse since the two kinds of segments are incorporated into the same messenger.

(v) Why is degenerate G + C content so high on the average and yet so variable in animal genes? Especially difficult to understand is the large variation in individual human genes, in which percentage (G + C) in codon position III runs from around 40 to over 90 per cent. These intraspecific codon biases must be maintained at great selective cost, most likely at the prenatal stage in our species, to eliminate mutants. Otherwise repair enzymes, for some unknown reason, would have to assure degenerate base use in each gene. As mentioned above, the selection of human mitochondria constitutes a similar problem. It is too easy just to say most mutations are neutral.

The genome hypothesis has posed a chicken and egg dilemma whose resolution remains distant.

Acknowledgements

We thank M. Gouy, T. Greenland, J. L. Prato and D. Quilichini for unpublished data and help during preparation of the manuscript.

References

Adams, R. L. P. and Eason, R. (1984). Increased G + C content of DNA stabilises methyl CpG dinucleotides. Nucleic Acids Res. 12, 5869-77.

Anderson, S., Bankier, A. T., Barrell, B. G., DeBruijn, M. H. L., Coulson, A. R., Drouin, J., Eperon, I. C., Nierlich, D. P., Roe, B. A., Sanger, F., Schreier, P. H., Smith, A. J. H., Staden, R., and Young, I. G. (1981). Sequence and organization of the human mitochondrial genome. Nature, 290, 457-65.

Anderson, S., DeBruijn, M. H. L., Coulson, A. R., Eperon, I. C., Sanger, F., and Young, I. G. (1982). Complete sequence of bovine mitochondrial DNA: conserved features of the mammalian mitochondrial genome. J. Mol. Biol. 156, 683-717.

Barker, D., Schaffer, M., and White, R. (1984). Restriction sites containing CpG show a higher frequency of polymorphism in human DNA. Cell, 36, 131-8.

Benacerraf, B. (1981). Role of MHC products in immune regulation. Science, 212, 1229-38.

Bennetzen, J. L. and Hall, B. D. (1982). Codon selection in yeast. J. Biol. Chem. 257, 3026-31.

Benoist, C. O., Mathis, D. J., Kanter, M. R., Williams, V. E., II, and McDevitt, H. O. (1983). Regions of allelic hypervariability in the murine Aα immune response gene. Cell, 34, 169-77.

Bernardi, F. and Ninio, J. (1978). The accuracy of DNA replication. Biochimie, 60, 1083-95.

Bibb, M. J., Van Etten, R. A., Wright, C. T., Walberg, M. W., and Clayton, D. A. (1981). Sequence and gene organization of mouse mitochondrial DNA. Cell, 26, 167-180.

Bird, A. P. (1980). DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res. 8, 1499-504.

Blake, R. D. and Hinds, P. W. (1984). Analysis of codon bias in E. coli sequences. J. Biomol. Struct. Dyn. 2, 593-606.

Boer, H. A., de and Kastelein, R. A. (1986). Biased codon usage: an exploration of its role in optimization of translation. In From Gene to Protein: Steps Dictating the Maximal Level of Gene Expression (eds J. Davis, B. Reznikoff, and L. Gold). Butterworths, New York. (In press.)

Bothwell, A. L. M., Paskind, M., Reth, M., Imanishi-Kari, T., Rajewsky, K., and Baltimore, D. (1981). Heavy chain variable region contribution to the NPb family of antibodies: somatic mutations evident in a γ2a variable region. Cell, 24, 625-637.

Boudraa, M. (1987). Variation de la stratégie de codage dans le système végétal. Genet. Sel. Evol. (in press).

Chien, Y. H., Gascoigne, N. R. J., Kavaler, J., Lee, N. E., and Davis, M. M. (1984). Somatic recombination in a murine T-cell receptor gene. Nature, 309, 322-6.

Chiu, I. M., Yaniv, A., Dahlberg, J. E., Gazit, A., Skuntz, S. F., Tronick, S. R., and Aaronson, S. A. (1985). Nucleotide sequence evidence for relationship of AIDS retrovirus to lentiviruses. Nature, 317, 366-8.

Choi, E., McIntyre, K., Germain, R. N., and Seidman, J. G. (1983). Murine I-Aβ chain polymorphism: nucleotide sequences of three allelic I-Aβ genes. Science, 221, 283-286.

Cooper, D. N. and Gerber-Huber, S. (1985). DNA methylation and CpG suppression. Cell Different. 17, 199-205.

Felsenfeld, G. and McGhee, J. (1982). Methylation and gene control. Nature, 296, 602-603.

Files, J. G., Carr, S. and Hirsh, D. (1983). Actin gene family of Caenorhabditis elegans. J. Mol. Biol. 164, 355-375.

Garel, J. P. (1982). The silkworm, a model for molecular and cellular biologists. Trends Biochem. Sci. 7, 105-8.

Gautier, C., Gouy, M., and Louail, S. (1985). Non-parametric statistics for nucleic acid sequence study. Biochimie, 67, 449-53.

Gershenfeld, H. K., Tsukamoto, A., Weissman, I. L., and Joho, R. (1981). Somatic diversification is required to generate the V genes of MOPC511 and MOPC167 myeloma proteins. Proc. Nat. Acad. Sci. USA, 78, 7674-7678.

Gojobori, T. and Nei, M. (1984). Concerted evolution of the immunoglobulin VH gene family. Mol. Biol. Evol. 1, 195-212.

Gouy, M. and Gautier, C. (1982). Codon usage in Bacteria: correlation with gene expressivity. Nucleic Acids Res. 10, 7055-7074.

Gouy, M. and Grantham, R. (1980). Polypeptide elongation and tRNA cycling in Escherichia coli: a dynamic approach. FEBS Lett. 115, 151-155.

Grantham, R. (1974a). Composition drift in the cytochrome c cistron. Nature, 248, 791-793.

Grantham, R. (1974b). Amino acid difference formula to help explain protein evolution. Science, 185, 862-864.

Grantham, R. (1980). Workings of the genetic code. Trends Biochem. Sci. 5, 327-31.

Grantham, R. (1985). CG doublet difficulties in Vertebrate DNA. Nature, 313, 437.

Grantham, R. and Gautier, C. (1980). Genetic distances from mRNA sequences. Naturwissenschaften, 67, 93-4.

Grantham, R., Gautier, C. and Gouy, M. (1980a). Codon frequencies in 119 individual genes confirm consistent choices of degenerate bases according to genome type. Nucleic Acids Res. 8, 1893-1912.

Grantham, R., Gautier, C. and Gouy, M. (1983). The genome as unit of selection: evidence from molecular biology. In Darwin Today (eds E. Geissler and W. Scheler), pp. 95-110. Akademie-Verlag, Berlin.

Grantham, R., Gautier, C., Gouy, M., Jacobzone, M., and Mercier, R. (1981). Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res. 9, r43-74.

Grantham, R., Gautier, C., Gouy, M., Mercier, R., and Pavé, A. (1980b). Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 8, r49-62.

Grantham, R., Greenland, T., Louail, S., Mouchiroud, D., Prato, J. L., Gouy, M., and Gautier, C. (1985). Molecular evolution of viruses as seen by nucleic acid sequence study. Bull. Inst. Pasteur, 83, 95-148.

Grantham, R. and Perrin, P. (1985). Tentative de modelisation des sequences de genes hautement exprimes: rapport sur l'avancement des travaux. Rapport C.N.R.S. du 1 Novembre 1985.

Grantham, R. and Perrin, P. (1986). AIDS virus and HTLV-1 differ greatly in codon choices. Nature, 319, 727-8.

Grosjean, H., Sankoff, D., Min Jou, W., Fiers, W., and Cedergren, R. (1978). Bacteriophage MS2 RNA: a correlation between the stability of the codon: anticodon interaction and the choice of codewords. J. Mol. Evol. 12, 113-9.

Grosjean, H. and Fiers, W. (1982). Preferential codon usage in prokaryotic genes: the optimal codon-anticodon interaction energy and the selective codon usage in efficiently expressed genes. Gene, 18, 199-209.

Hamada, H., Petrino, M. G., and Kakunaga, T. (1982). A novel repeated element with Z-DNA-forming potential is widely found in evolutionary diverse eukaryotic genomes. Proc. Nat. Acad. Sci. USA, 79, 6465-6469.

Hedrick, S. M., Nielsen, E. A., Kavaler, J., Cohen, D. I., and Davis, M. M. (1984). Sequence relationships between putative T-cell receptor polypeptides and immunoglobulins. Nature, 308, 153-8.

Herr, W. (1984). Nucleotide sequence of AKV murine leukaemia virus. J. Virol. 49, 471-478.

Ikemura, T. (1981). Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389-409.

Ikemura, T. (1982). Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. J. Mol. Biol. 158, 573-97.

Ikemura, T. (1985). Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13-34.

Ikemura, T. and Ozeki, H. (1983). Codon usage and transfer RNA contents: organism-specific codon choice patterns in reference to the isoacceptor contents. Cold Spring Harbor Symp. Quant. Biol. 47, 1087-1097.

Jerne, N. K. (1985). The generative grammar of the immune system. Science, 229, 1057-1059.

Johnston, B. H. and Rich, A. (1985). Chemical probes of DNA conformation: detection of Z-DNA at nucleotide resolution. Cell, 42, 713-724.

Kabat, E. A., Wu, T. T., Bilofsky, H., Reid-Miller, M., and Perry, H. (1983). In Sequences of Proteins of Immunological Interest. U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health.

Karn, J., Brenner, S., and Barnett, L. (1983). Protein structural domains in the Caenorhabditis elegans unc-54 myosin heavy chain gene are not separated by introns. Proc. Nat. Acad. Sci. USA, 80, 4253-4257.

Klass, M. R., Kinsley, S., and Lopez, L. C. (1984). Isolation and characterization of a sperm-specific gene family in the nematode Caenorhabditis elegans. Mol. Cell. Biol. 4, 529-537.

Kramer, J. M., Cox, G. N., and Hirsh, D. (1982). Comparisons of the complete sequences of two collagen genes from Caenorhabditis elegans. Cell, 30, 599-606.

Li, W. H., Luo, C. C., and Wu, C. I. (1985). Evolution of DNA sequences. In Molecular evolutionary genetics (ed. R. J. MacIntyre), pp. 51-65. New York: Plenum Press.

Marini, M. and Mushinski, J. F. (1979). Transfer ribonucleic acids from eleven immunoglobulin secreting mouse plasmacytomas. Constant and variable chromatographic profiles compared with the myeloma protein sequences. Biochim. Biophys. Acta, 562, 252-270.

Miyata, T., Hayashida, H., Yasunaga, T., and Hasegawa, M. (1979). The preferential codon usages in variable and constant region of immunoglobulin genes are quite distinct from each other. Nucleic Acids Res. 7, 2431-2437.

N'Guyen, C., Sodoyer, R., Trucy, J., Strachan, T., and Jordan, B. R. (1985). The HLA-AW24 gene: sequence, surroundings and comparison with the HLA-A2 and HLA-A3 genes. Immunogenetics, 21, 479-89.

Nisonoff, A., Hopper, J. E., and Spring, S. B. (1975). Human immunoglobulins. In The Antibody Molecule, pp. 86-137. Academic Press, London.

Nordheim, A. and Rich, A. (1983). The sequence (dC-dA)n-(dG-dT)n forms left-handed Z-DNA in negatively supercoiled plasmids. Proc. Nat. Acad. Sci. USA, 80, 1821-1825.

Patten, P., Yokota, T., Rothbard, J., Chien, Y. H., Arai, K. I., and Davis, M. M. (1984). Structure, expression and divergence of T-cell receptor β-chain variable regions. Nature, 312, 40-6.

Perlmutter, R. M., Crews, S. T., Douglas, R., Sorensen, G., Johnson, N., Nivera, N., Gearhart, P. J., and Hood, L. (1984). The generation of diversity in phosphorylcholine-binding antibodies. Adv. Immunol. 35, 1-37.

Perrin, P. (1984). Coding strategy differences between constant and variable segments of immunoglobulin genes. Nucleic Acids Res. 12, 5515-37.

Perrin, P. and Grantham, R. (1986). Avoidance of base runs in switch regions of immune system genes. (Submitted.)

Ratner, L., Haseltine, W., Patarca, R., Livak, K. J., Starcich, B., Josephs, S. F., Doran, E. R., Rafalsky, J. A., Whitehorn, E. A., Baumeister, K., Ivanoff, L., Petterway, S. R., Jr, Pearson, M. L., Lautenberger, J. A., Papas, T. S., Ghrayeb, J., Chang, N. T., Gallo, R. C., and Wong-Staal, F. (1985). Complete nucleotide sequence of the AIDS virus, HTLV-III. Nature, 313, 277-84.

Robertson, M. (1985). The present state of recognition. Nature, 317, 768-771.

Roe, B. A., Ma, D. P., Wilson, R. K., and Wong, J. F. H. (1985). The complete nucleotide sequence of the Xenopus laevis mitochondrial genome. J. Biol. Chem. 260, 9759-74.

Rogers, J. (1983). CACA sequences - the ends and the means? Nature, 305, 101-2.

Sablitzky, F., Wildner, G., and Rajewsky, K. (1985). Somatic mutation and clonal expansion of B cells in an antigen-driven immune response. Embo J. 4, 345-50.

Saccone, C., Cantatore, P., Gadaleta, G., Gallerani, R., Lanave, C., Pepe, G., and Kroon, A. M. (1981). The nucleotide sequence of the large ribosomal RNA gene and the adjacent tRNA genes from rat mitochondria. Nucleic Acids Res. 9, 4139-48.

Sagata, N., Yasunaga, T., Tsuzuku-Kawamura, J., Ohishi, K., Ogawa, Y., and Ikawa, Y. (1985). Complete nucleotide sequence of the genome of bovine leukemia virus: its evolutionary relationship to other retroviruses. Proc. Nat. Acad. Sci. USA, 82, 677-81.

Saito, H., Kranz, D. M., Takagaki, Y., Hayday, A. C., Eisen, H. N., and Tonegawa, S. (1984). Complete primary structure of a heterodimeric T-cell receptor deduced from cDNA sequences. Nature, 309, 757-762.

Schwartz, D. E., Tizard, R., and Gilbert, W. (1983). Nucleotide sequence of Rous sarcoma virus. Cell, 32, 853-69.

Seiki, M., Hattori, S., Hirayama, Y., and Yoshida, M. (1983). Human adult T-cell leukaemia virus: complete nucleotide sequence of the provirus genome integrated in leukaemia cell DNA. Proc. Nat. Acad. Sci. USA, 80, 3618-3622.

Shepherd, J. C. W. (1982). From primeval message to present-day gene. Cold Spring Harbor Symp. Quant. Biol. 46, 1099-1108.

Shepherd, J. C. W. (1984). Fossil remnants of a primeval genetic code in all forms of life? Trends Biochem. Sci. 9, 8-10.

Shinnick, T. M., Lerner, R. A., and Sutcliffe, J. G. (1981). Nucleotide sequence of Moloney murine leukaemia virus. Nature, 293, 543-548.

Sodoyer, R., Damotte, M., Delovitch, T. L., Trucy, J., Jordan, B. R., and Strachan, T. (1984). Complete nucleotide sequence of a gene encoding a functional human class I histocompatibility antigen (HLA-CW3). Embo J. 3, 879-85.

Sonigo, P., Alizon, M., Staskus, K., Klatzmann, D., Cole, S., Danos, O., Retzel, E., Triollais, P., Haase, A., and Wain-Hobson, S. (1985). Nucleotide sequence of the Visna lentivirus: relationship to the AIDS virus. Cell, 43, 369-382.

Spieth, J., Denison, K., Zucker, E. and Blumenthal, T. (1985). The nucleotide sequence of a nematode vitellogenin gene. Nucleic Acids Res. 13, 7129-38.

Sprinzl, M., Moll, J., Meissner, F., and Hartmann, T. (1985). Compilation of tRNA sequences. Nucleic Acids Res. 13, r1-49.

Steinmetz, M. (1984). Structure, function and evolution of the major histocompatibility complex of the mouse. Trends Biochem. Sci. 9, 224-6.

Storb, U. and Arp, B. (1983). Methylation patterns of immunoglobulin genes in lymphoid cells: correlation of expression and differentiation with undermethylation. Proc. Nat. Acad. Sci. USA, 80, 6642-6646.

Strachan, T., Sodoyer, R., Damotte, M., and Jordan, B. R. (1984). Complete nucleotide sequence of a functional class I gene, HLA-A3: implications for the evolution of HLA genes. Embo J. 3, 887-894.

Temin, H. M. (1985). Reverse transcription in the eukaryotic genome: retroviruses, pararetroviruses, retrotransposons and retrotranscripts. Mol. Biol. Evol. 2, 455-68.

Tonegawa, S. (1983). Somatic generation of antibody diversity. Nature, 302, 575-81.

Tykocinski, M. L. and Max, E. E. (1984). CG dinucleotide clusters in MHC genes and in 5' demethylated genes. Nucleic Acids Res. 12, 4385-4396.

Weiss, E., Golden, L., Zakut, R., Mellor, M., Fahrner, K., Kvist, S., and Flavell, R. A. (1983). The DNA sequence of the H-2Kb gene: evidence for gene conversion as a mechanism for the generation of polymorphism in histocompatibility antigens. Embo J. 2, 453-62.

Wolf, S. F. and Migeon, B. R. (1985). Clusters of CpG dinucleotides implicated by nuclease hypersensitivity as control elements of housekeeping genes. Nature, 314, 467-9.

End Note (July 2011)

It would be nice to know more about Richard Grantham's life. His friend, Timothy Greenland, tells me RG died in 2009. I can find no obituary notices. The US Social Security Death Index lists a Richard L. Grantham as having been born April 9 1922 and as having died July 28 2009. This seems about right, but there are many RG's out there. If anyone has information on RG that they would be willing to share, please contact me. It would be nice to know more about the founder of Evolutionary Bioinformatics (EB).

Donald Forsdyke

End Note (Sept 2016)

After some correspondence, I finally met Richard at the 2000 Ischia workshop on "Neutralism and Selectionism". At that time he appeared well, and we had a splendid discussion, followed up by even more correspondence.

For many years Richard was deeply concerned to find some way to remedy the environmental degradation of our planet and sought out Thomas Goreau, with whom there was a long correspondence. He and Timothy Greenland have published their reminiscences of Richard in the foreword to a book, where he is hailed as "the father of geotherapy":

Geotherapy. Innovative Methods of Soil Fertility Restoration, Carbon Sequestration, and Reversing CO2 Increase. (Goreau TJ, Larson RW, Campe J, editors) CRC Press, 2015 (Taylor & Francis Group) pp. ix-xi.

Richard Grantham: Father of Geotherapy, 1922–2009

It must have been in the late 1970s when I first met Richard Grantham. He came to give a lecture at the research laboratory where I worked and described how he had used a computer analysis of the genetic sequences present in a virus capable of causing cancer to identify a gene that had its origins in the virus's host, not in the ancestors of the virus. I was fascinated. Some years before, I had attempted to develop computer programs for biological problems, although I am a dreadful mathematician, and I asked him if, with the consent of my director, I could come and collaborate with his group one day a week. He agreed, and this arrangement lasted throughout our working careers. His group developed many of the basic computer techniques employed today for the accession and analysis of genetic sequences - it was an exciting time and I learned much. A lot of the time was spent discussing projects and future research directions together in English, because although Richard, like me, was very fluent in French (we both had French wives), we sometimes found it useful to revert to our native language for the more abstruse concepts. As time passed, these discussions took a more philosophical turn and Richard became progressively more concerned with the degradation of the biosphere and the impact of humanity on the natural world.

The contributions that Richard and his colleagues made to the dawning science of the molecular biology of genetic sequences, and their relevance to evolutionary processes, were many and varied. He is most often remembered in the context of his genome hypothesis, which allowed that the DNA sequence in the gene itself carried to some degree a signature characteristic of the species (or species group) from which that gene derived. Classically at that time, it was considered that Darwinian selection could only operate at the level of proteins because the genes themselves were only visible to selection pressures through their expressed products. However, most of the amino acids that compose proteins can be specified by more than one sequence of three nucleotides (codon) in the genetic DNA. Richard realized that the different alternative codons were not used randomly in the genes from different species and that there was a pattern of use that appeared to be common to different genes from the same species. It must be realized that, for a paper published in 1980 (1), the total number of sequences available in his pioneering database was 160. As of April 2011, GenBank contained 135,440,924 sequences! It is not surprising that his initial insight and analysis have now been greatly developed by later work, but it remains amazingly relevant today. For a deeper view of his contributions, I recommend a visit to Grantham's Genome Hypothesis (queensu.ca) where Donald Forsdyke presents some of his works. Professor Forsdyke's own contributions to this important field may be found on the same website.

During our discussions, we found that we had both read and been affected by Rachel Carson's The Silent Spring and we quickly discovered Aldo Leopold's A Sand County Almanac: And Sketches Here and There, then the influential books by Van Rensselaer Potter (see References). These convinced Richard that he should concern himself actively with the problems of human impact on the natural world, particularly the concept of global bioethics. Incidentally, we both deplored the hijacking of the simple term bioethics by the medical community - despite clear priority for the term for ecological concerns. Among other important inputs, I would mention especially John Rawls' A Theory of Justice and Julia Annas' The Morality of Happiness, which helped us consider the knotty problems of the impact of bioremediation and the instauration of a sustainable lifestyle on the real human population.

Despite the hundreds of hours of discussion and exchange with Richard, I realized that I would never get more than a glimpse of the complex and profound person I was privileged to encounter. We became friends, but I never knew much of his background or personal life. He had served as a bomber pilot during World War II in Italy and returned to Europe to study evolutionary biology at the University of Montpellier in the 1960s. I learned little about his youth in central California or of his family. Any diversion in our conversations in those directions was quickly brought back to the central theme of his projects and philosophical analyses. On occasion we would find time to walk through the woods together - sometimes hunting mushrooms, and always alert for whatever wildlife we should chance to meet. His knowledge was wide and eclectic and it was a joy to accompany him on these diversions.

Richard corresponded quite extensively with Van Rensselaer Potter on themes of mutual interest, and I had the pleasure of participating in their exchanges. I think that both of them appreciated finding someone of similar stature to converse with, and Richard met Van on a rare visit to the United States. He brought back some very helpful insights and several pleasant reminiscences. His mind turned progressively to a very practical problem: The damage is done - how can we mitigate its effects? To this end, he imagined grandiose schemes like the irrigation of North African desert areas with desalinated water to reconstitute the forests that existed there not so long ago (at least in geological timescales). Geotherapy became his watchword. The earth is already very sick, and heroic remedies are necessary to give even a chance of the survival of our patient. As an evolutionary biologist, he was only too aware of the perils of an overreaching species. It was clear to both of us that the carrying capacity of the planet for Homo sapiens had been exceeded for a long while. Richard did not despair of a downturn to more supportable human numbers and balanced survival, given that we could buy time for the message to be heard and understood by our species as a whole. I confess to being less optimistic. We both gave much time and thought to possible remedial approaches - which are, indeed, both marginally possible and vitally urgent - but the scale of the problem and the intransigence of human politics remained discouraging.

We both feared that the natural solution would be the elimination, probably through disease, of much of the human population - and that our experience of studies of such events in other species suggested that the most frequent outcome is extinction and not simple population reduction! Humanity is perhaps the first species to be able to see the consequences of its proliferation, even if dimly, but will it be able to act on those warnings? We both agonized on the unfolding of the AIDS epidemic and contributed to studies of the virus and its consequences. We have perhaps postponed that threat. Then came other novel pathogens, and more will certainly come. Each may be the last. The question is simply one of survival, not of preservation of comforts, but life or death. Will we measure up? He and Professor Potter jointly published a call to action that can be seen at http://sunsite.utk.edu/FINS/Sustainable_Development/Fins-SD-03.htm.

The term geotherapy was formally introduced at a conference entitled "The Colloquium on Modelling and Geotherapy for Global Changes" held at the Universite Claude Bernard in Lyon in May 1991, and sponsored by the CNRS. There Richard and other colleagues sought to define the needs and responsibilities encompassing human responses to the environmental degradation that was so obvious to those who would look around them. The final declaration may be seen at http://www.angelfire.com/mac/egmatthews/geotherapy/foundation.html and other references at http://www.angelfire.com/mac/egmatthews/geotherapy/geotherapy.html or http://jebin08.blogspot.com/2007/06/thinkable-why-geotherapy-should-not-be.html. A very interesting artistic reaction to the notion of geotherapy can be found at http://www.thewip.net/contributors/2010/05/geotherapy_artist_mara_haselti.html - I am not sure what Richard's reaction would have been, but knowing his deep interest in, and appreciation for, matters artistic, I am sure he would have been intrigued!

Over the last years of his life, Richard often received me at his apartment to continue our discussions. His interest in the problems of the planet remained present to the last, and he retained his knowledge of evolution and of humanity, although much else faded. I have been privileged to know and to spend time with an exceptional man. I hope that his message will be considered by others and that H. sapiens may manage to stop before the crumbling edge of the precipice. It is very close to our feet right now.

Tim Greenland
Universite Claude Bernard Lyon I
Villeurbanne, France

References
Annas, J., The Morality of Happiness, Oxford University Press, Oxford, U.K., 1993.
Carson, R., The Silent Spring, Houghton Mifflin, Boston, MA, 1962.
Leopold, A., A Sand County Almanac, Oxford University Press, Oxford, U.K., 1949.
Potter, V.R., Bioethics: Bridge to the Future, Prentice-Hall, Englewood Cliffs, NJ, 1971.
Potter, V.R., Global Bioethics: Building on the Leopold Legacy, Michigan State University Press, East Lansing, MI, 1988.
Rawls, J., A Theory of Justice, Harvard University Press, Cambridge, MA, 1971.

Donald Forsdyke

This page was established circa 2000 and last edited on 11 March 2021 by D. R. Forsdyke