What is a Gene?

Scherrer and Jost's symposium.

The gene concept in 2008

Theory in Biosciences (2009) 128, 157-161. DOI: 10.1007/s12064-009-0071-2
Submitted 11 December 2008, received 16 December 2008, and accepted for publication 3 February 2009.
Copyright held by Springer Publishing. The authorized version of this paper is at the publisher's website (www.springerlink.com).

 Donald R. Forsdyke


Conventional phenotype and genome phenotype

Mutations, placeholders and EBV

Definition through mutation affecting specific function

Hierarchical levels of information


End Note August 2009

Abstract  Reconsideration of the term "gene" should take into account (a) the potential clash between hierarchical levels of information discussed in the 1970s by Gregory Bateson, (b) the contrast between conventional and genome phenotypes discussed in the 1980s by Richard Grantham, and (c) the emergence in the 1990s of a new science - Evolutionary Bioinformatics - that views genomes as channels conveying multiple forms of information through the generations. From this perspective, there is conceptual continuity between the functional "gene" of Mendel and today's GenBank sequences. 

    If the function attributed to a gene can change specifically as the result of a DNA mutation, then the mutated part of DNA can be considered as part of the gene. Conversely, even if appearing to locate within a gene, a mutation that does not change the specific function is not part of the gene, although it may change some other function to which the DNA sequence contributes. This strict definition is impractical, but serves as a guide to more workable, context-dependent, definitions. 

    The gene is either: 1. The DNA sequence that is transcribed. 2. The latter plus the immediate 5' and 3' sequences that, when mutated, specifically affect the function. 3. The latter two, plus any remote sequences that, when mutated, specifically affect the function. Attempts, such as that of Scherrer and Jost, to redefine Mendel's "gene," may be too narrowly focused on regulation to the exclusion of other important themes.


In 1900 Henry Bernard of the Natural History section of the British Museum, who regarded Ernst Haeckel of Jena as his "friend and teacher," devised a numerical taxonomic system to replace the classical Linnaean system. Describing the latter as "philosophically absurd and practically disastrous," he provoked several leading biologists to contribute their thoughts on "the species concept" to a letter that was passed from correspondent to correspondent, finally ending up in an envelope labeled "Bernard's Symposium." The year was ominous. Mendelism was about to burst on the scene and, after a bitter battle with the biometricians, a new science - Genetics - was to emerge triumphant. Decades later Bernard's envelope was discovered in the archives of the geneticist William Bateson (Cock 1977). Despite its imperfections, the Linnaean system of species nomenclature had survived because it worked. It remains unchallenged today.

Many of the same points can be made concerning the present "symposium." Describing the discovery that eukaryotic genes are fragmented as "devastating for the original gene concept," a molecular biologist and a mathematician have teamed up to amend, and provoke our thoughts on, the meaning of the word "gene" (Scherrer and Jost 2007). Again, the preferred solution is numerical. It is hoped that a refined definition will allow better application of "mathematical algorithms that can analyse gene storage and expression in terms of information processing." Again the year is ominous. From the torrent of sequence information that began in the 1980s (Benson et al. 2009), a new science - Evolutionary Bioinformatics - is emerging (Forsdyke 2006). Whether it will emerge triumphant remains to be seen. But from its basic tenets this correspondent is led to believe that the Mendelian gene concept is safe. Despite its imperfections, it will be the concept preferred by most of those attempting to identify and tackle biological problems in the twenty-first century.


Conventional phenotype and genome phenotype

Although the word had not then been coined, Gregor Mendel's "gene" of 1865 was something segregating intact among offspring that determined a character. The latter was some morphological or physiological feature that we now refer to as being part of the "conventional phenotype" - the phenotype that is most obviously responsible for interactions of an organism with its environment. More than a century later, in his "genome hypothesis" Richard Grantham (1980) referred to an apparently more inward-looking, genome-based, phenotype, for which the term "genome phenotype" was suggested (Bernardi and Bernardi 1986). The genome phenotype concept allowed a better understanding of genomes (i.e. the multiple forms of information, including genic information, that pass through the generations in the form of nucleic acid) and appeared to explain the occurrence of "placeholder" bases and amino acids in nucleic acids and proteins - a subject relevant to our present task.

Rather than "characters," the genome phenotype deals with "pressures" that relate, in ways that remain to be fully explored, to fundamental biological themes such as self/not-self discrimination, the preservation of genome integrity, and the abeyance of that integrity needed for speciation. Among the pressures are the genome-wide pressures exerted by pairs of bases (e.g. GC-pressure, which can be regarded as the "accent" of DNA) and the potential to extrude stem-loop structures from duplex DNA (fold pressure). Some pressures are local, being confined to specific regions. AG-pressure and RNY pressure apply to exons, which are the DNA sequences corresponding to what is left as messenger RNA (mRNA) after introns have been removed from a primary transcript.
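As a rough illustration of how such base-composition pressures might be inspected, the sketch below computes the fraction of G+C bases (a crude proxy for GC-pressure) and of purines A+G (a crude proxy for AG-pressure), both for a whole sequence and in consecutive windows. The function names, the window size and the example sequence are hypothetical, and these simple fractions are not the measures used in the cited studies.

```python
def base_fraction(seq, bases):
    """Fraction of positions in seq occupied by any of the given bases."""
    seq = seq.upper()
    return sum(b in bases for b in seq) / len(seq)


def windowed_fraction(seq, bases, window, step):
    """Base fraction in successive windows, to expose local differences."""
    return [
        base_fraction(seq[i:i + window], bases)
        for i in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical sequence: a purine-rich half followed by a pyrimidine-rich half.
example = "GGAGGAAGGCCTTCCT"

gc_overall = base_fraction(example, "GC")            # genome-wide style measure
ag_local = windowed_fraction(example, "AG", 8, 8)    # local, window-by-window

print(gc_overall)
print(ag_local)
```

A genome-wide pressure would show up as a fairly uniform value across windows, whereas a local pressure, such as purine-loading confined to exons, would show up as a difference between windows.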

Molecules of RNA and protein must assume higher-order structures in order to perform various structural and/or catalytic roles. Although sometimes requiring the assistance of molecular chaperones (Cristofari and Darlix 2002), the information for such structures is mainly encoded in their primary sequences. However, there can be conflicts. When incorporated into a stem-loop structure, purines tend to occupy the less stable loops. Thus, fold pressure (quantified as the stability of stem-loop structures) tends to be countermanded by AG-pressure. It is important to distinguish general, genome-wide, fold pressure, which appears primarily to relate to function at the DNA level in the nucleus, from local fold pressure, which appears primarily to relate to function at the RNA level in the cytoplasm. Thus, the DNA from which a ribosomal RNA (rRNA) is transcribed is under two, potentially conflicting, fold pressures - general and local. The potential to fold that satisfies the needs of a segment of DNA may not be the same as that which satisfies the needs of the RNA transcribed from that segment. There must be some compromise - perhaps post-transcriptional RNA editing by removal of segments and/or base modifications (Scherrer and Darnell 1962; Greenberg and Penman 1966; Bass 2002).

Similarly, a protein-encoding exon can be considered under a local "protein pressure," which must contend for genome space with other pressures. Many features of a genome (e.g. introns) can be understood in terms of the way the "hand of nature" has resolved contending pressures over evolutionary time to arrive at a form that best satisfies the needs of members of the species (Forsdyke and Mortimer 2000; Forsdyke 2001, 2002).  


Mutations, placeholders and EBV

Through classical recombination mapping, Mendel's "gene" was localized, first to a linkage group (chromosome) and then to a distinctive chromosomal region (Cock and Forsdyke 2008). When a break happened to move that region from one chromosome to another, the gene moved with the region. Mendel's "gene" was further localized through mutation. In general, mutations elsewhere in the genome did not disturb the function attributed to a gene. Mutations in the region of the gene often, but not always, disturbed the function attributed to it.

When DNA sequencing techniques emerged in the 1970s, the types and fine-resolution locations of mutations affecting a function could be determined. This suggested a strict, but largely unworkable, gene definition, which nevertheless could serve as a frame-of-reference for more practical definitions. Pertinent to this are "placeholder" bases or amino acids, which may occur both in genes that have RNA transcripts but no protein product, and in protein-encoding genes (Xue and Forsdyke 2003; Rayment and Forsdyke 2005). For example, genome compactness being a virtue in viruses, it was surprising that a glycine-alanine repeat region in the EBNA1 protein of Epstein-Barr virus could be removed without interfering with the major functions assigned to the protein (Yates and Camiolo 1988; Wu et al. 2002). The paradox appeared to be resolved when it was reported that the repeat region could inhibit antigen processing, so permitting the virus to evade host immune defences (Levitskaya et al. 1995; Levitsky and Masucci 2002).

However, the supporting evidence was derived from an expression construct that was transferred to target cells. Here the gene was transcribed into mRNA, which was then translated into the protein. Was it the transcribed mRNA, or its translation product, that was responsible for the effects observed (Cristillo et al. 2001)? The two amino acids in the repeat could each be encoded by any of four codons. It has recently been shown that changing from one synonymous codon to another can prevent the inhibition of antigen processing, even though the encoded amino acids have not changed. While it is possible that the rate of translation, and hence protein folding, might have been affected by different codon usage, it is most likely that, at the protein level, the glycines and alanines in the repeat region are mere placeholders, perhaps with no functional role (Starck et al. 2008; Tellam et al. 2008).
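The point about synonymous codons can be made concrete with a minimal sketch. Glycine (GGN) and alanine (GCN) each have four codons in the standard genetic code, so two quite different mRNA sequences can encode the same glycine-alanine repeat. The two short sequences below are invented for illustration and are not taken from EBNA1; the sketch shows only that the protein is unchanged while the base composition of the mRNA differs, and says nothing about antigen processing itself.

```python
# Standard-code codons for glycine (G) and alanine (A), in the RNA alphabet.
GLY_ALA_CODONS = {
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
}


def translate_repeat(mrna):
    """Translate an mRNA made only of glycine and alanine codons."""
    return "".join(GLY_ALA_CODONS[mrna[i:i + 3]] for i in range(0, len(mrna), 3))


def purine_fraction(mrna):
    """Fraction of bases that are purines (A or G)."""
    return sum(b in "AG" for b in mrna) / len(mrna)


# Hypothetical synonymous versions of the same four-residue repeat.
purine_rich = "GGAGCAGGGGCA"   # GGA GCA GGG GCA
purine_poor = "GGUGCUGGCGCC"   # GGU GCU GGC GCC

print(translate_repeat(purine_rich), purine_fraction(purine_rich))
print(translate_repeat(purine_poor), purine_fraction(purine_poor))
```

Both sequences translate to the same peptide, so any difference in their effects would have to be attributed to the mRNA rather than to the amino acid sequence.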

Hence, from the perspective of protein function, the repeat-encoding region in viral DNA would not be part of the gene, because a mutation that affected the region would be presumed not to impact the function of the gene. If needed at the DNA level for some other purpose (Schaap 1971), the region encoding the repeat could have evolved as an intron and then have been spliced out during mRNA processing. That the region was not spliced out and remained intact in mRNA, despite not appearing to function at the protein level, suggests function at the level of mRNA itself. Thus, from the perspective of mRNA function, the repeat-encoding region in viral DNA is part of the gene, since a mutation would affect function at the RNA level. In this case RNA trumps protein when it comes to gene definition.

While the possibility that the glycine-alanine repeat has some protein-level function is not excluded (Daskalogianni et al. 2008), for present purposes we can regard the amino acids as mere placeholders - a secondary consequence of the sequence requirements of the corresponding mRNA. Also for our purposes, we consider nucleic acids to have four bases (disregarding modified bases), and proteins to have twenty amino acids. We do not consider epigenetic effects or modified amino acids.


Definition through mutation affecting specific function  

For a gene with an RNA end-product (e.g. the intron-containing Xist gene; Pfeifer and Tilghman 1994), a base that could be substituted with any other base without affecting RNA function would have satisfied one criterion for exclusion of that position from the gene. But the base could still be a "placeholder" regarding the gene's assigned function. This role would be tested by deleting or adding a base at that position. If function were still not impaired, then the base would not be deemed a placeholder and the position would not be considered part of the gene, although the position might serve some other genomic pressure. Even if the position were deemed a placeholder (and hence was part of the gene), it could still concomitantly serve another genomic role. This argument would apply to both exons and introns in a gene with an RNA end-product. A mutation in an intronic splice site might result in mis-splicing and so affect the function of the final RNA product. By this criterion some parts of introns would be considered to contribute to the gene.
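The two-step test described above can be summarized as a small decision function. This is only a sketch of the reasoning: the two boolean inputs stand for the outcomes of hypothetical experiments (a substitution test and a deletion/insertion test), and the function name and labels are introduced here for illustration.

```python
def classify_position(substitution_alters_function, indel_alters_function):
    """Classify a base position relative to a gene's assigned function.

    substitution_alters_function: True if replacing the base with another
        base changes the assigned function.
    indel_alters_function: True if deleting the base, or adding one at that
        position, changes the assigned function.
    """
    if substitution_alters_function:
        # The identity of the base matters for the assigned function.
        return "part of gene"
    if indel_alters_function:
        # Any base will do, but something must occupy the position.
        return "part of gene (placeholder)"
    # Neither test affects the function; the position may still serve
    # some other genomic pressure.
    return "not part of gene"


print(classify_position(True, True))
print(classify_position(False, True))
print(classify_position(False, False))
```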

To the extent that mutations in them can affect a protein product quantitatively, the same criteria would apply to the exons and introns corresponding to the 5' and 3' untranslated segments of mRNAs. However, we should note that, perhaps influenced by Scherrer and Jost, many consider that such segments should not be considered as having derived from exons (Griffiths and Stotz 2006; Catania and Lynch 2008). This has long been a convenient assumption since 5' and 3' introns could then be ignored and "the intron problem" could be approached by comparing sites of intron insertion with features of protein structure (Stoltzfus et al. 1994).

Protein-encoding exons have many base positions where function of the protein would be affected by mutation. The first and second positions of amino acid-encoding triplets would most likely affect the nature of an amino acid, and hence protein function. Third codon positions are often redundant so, in simple form, the protein-encoding parts of a gene would consist of sets of two bases followed, in many cases, by an irrelevant third base. However, since deleting, or adding to, this placeholder base would upset the reading frame, the entire coding region can be considered part of the gene. Similarly, many intron mutations would affect the protein (e.g. omission of a protein-encoding exon through mis-splicing) so, to that extent, its introns would also be part of the gene.
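The contrast between substituting and deleting a third-position base can be shown with a short sketch using the standard genetic code. The coding sequence is invented for illustration. A synonymous change at a third codon position leaves the translation unchanged, whereas deleting that same base shifts the reading frame and alters every downstream amino acid.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3   # ignore any incomplete final codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical coding sequence: ATG GGT GCT GGA GCA TAA
original = "ATGGGTGCTGGAGCATAA"

# Substitute the third base of the second codon (GGT -> GGC, both glycine).
substituted = original[:5] + "C" + original[6:]

# Delete that same base instead, shifting the downstream reading frame.
deleted = original[:5] + original[6:]

print(translate(original))
print(translate(substituted))
print(translate(deleted))
```

Here the third base is irrelevant to the identity of the amino acid, yet its presence is essential, which is the sense in which it acts as a placeholder.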

Since it is normally not feasible to put the region assigned to a gene through all these mutational tests, a practical compromise is to define a gene at one of three levels depending on the context of the particular discourse. 1. The most fundamental definition of a gene is that region of the genome that is transcribed. Its boundaries would be the first and last bases of the transcript, prior to any 5' capping or 3' trimming. Since the latter may be hard to define, the 3' end of the trimmed transcript prior to polyadenylation would be an acceptable compromise. 2. Beyond this, there are usually related regions on either side of the transcribed region, mutations of which would affect production of the gene product. So in some contexts these neighboring regions would be included. 3. Finally, if a distant region could be shown to affect the quality or quantity of the gene product, with a high degree of specificity for that gene, this could also be considered part of the gene. Since molecular chaperones are not specific, usually having a variety of "client" proteins, their genes would not be considered as part of the genes of their clients. Many other processing factors (e.g. RNA splicing proteins) are likewise non-specific. This three-level definition may not meet all of Scherrer and Jost's concerns, but it is close to that of many textbooks (e.g. Lewin 2006).
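One way to picture the three levels is as a simple annotation record from which the extent of the gene is read according to context. The sketch below is only illustrative: the class name, field names and coordinates are hypothetical, and in practice deciding which flanking or remote regions qualify would rest on the mutational evidence discussed above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) genome coordinates, inclusive


@dataclass
class GeneAnnotation:
    transcribed: Interval                                  # level 1
    flanking: List[Interval] = field(default_factory=list)  # added at level 2
    remote: List[Interval] = field(default_factory=list)    # added at level 3

    def extent(self, level: int) -> List[Interval]:
        """Regions counted as part of the gene at the chosen level."""
        if level not in (1, 2, 3):
            raise ValueError("level must be 1, 2 or 3")
        regions = [self.transcribed]
        if level >= 2:
            regions += self.flanking   # immediate 5' and 3' sequences
        if level == 3:
            regions += self.remote     # distant, gene-specific sequences
        return sorted(regions)


# Hypothetical coordinates for illustration only.
example_gene = GeneAnnotation(
    transcribed=(1000, 5000),
    flanking=[(800, 999), (5001, 5200)],
    remote=[(20000, 20300)],
)

for level in (1, 2, 3):
    print(level, example_gene.extent(level))
```

Each level contains the one before it, so the choice of level widens the boundaries of the gene without changing its core.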

Hierarchical levels of information

The notion of the genome phenotype has been long on the table. So also has the notion of competition between hierarchical levels of information. Acknowledging possible conflicts between different types (layers) of information, Gregory Bateson (1979) observed that in biological systems:

"Two or more information sources come together to give information of a sort different from what was in either source separately. At present there is no existing science whose special interest is the combining of pieces of information. But I shall argue that the evolutionary process must depend upon such double increments of information. Every evolutionary step is an addition of information to an already existing system. Because this is so, the combinations, harmonies, and discords between successive pieces and layers of information will present many problems of survival and determine many directions of change."

Scherrer and Jost (2007) acknowledge "the nuclear DNA not only carries several types of information, but is at the same time the mechanistic carrier of the information contained," and they recognize "the existence of an independent mechanism which lays down signals for meiotic alignment which seems to be largely independent of all other genomic information." However, for their purposes these considerations are given little weight, even though "in the perspective of evolutionary biology, different conceptual emphasis has lead to utilizations of the term 'gene' that are different from ours." They appear most concerned with regulatory aspects of the transfer of information from nucleus to cytoplasm and do not appear to have explored the ideas of Richard Grantham and Gregory Bateson that, with the subsequent appearance of vast quantities of DNA sequence information in the 1980s, led to deeper analyses in the 1990s. A new science - for which the name "Evolutionary Bioinformatics" was suggested - emerged (Forsdyke 2006). Just as Bernard's symposium now serves to illustrate the controversies of 1900, so the present symposium on a proposed "new information theoretic scheme" will perhaps best serve as an indicator to future historians of the disparate lines of thought prevailing in the first decade of the 21st century.



I thank Klaus Scherrer for suggesting that I be invited to contribute to this debate. Queen's University hosts my webpages where some of the references may be found.



Bass BL (2002) RNA editing by adenosine deaminases that act on RNA. Annu Rev Biochem 71:817-846

Bateson G (1979) Mind and Nature. A Necessary Unity. Dutton, New York, p 21

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2009) GenBank. Nucleic Acids Res 37:D26-D31

Bernardi G, Bernardi G (1986) Compositional constraints and genome evolution. J Mol Evol 24:1-11

Catania F, Lynch M (2008) Where do introns come from? PLoS Biology 6:11, e283

Cock AG (1977) Bernard's symposium. The species concept in 1900. Biol J Linn Soc 9:1-30

Cock AG, Forsdyke DR (2008) "Treasure Your Exceptions." The Science and Life of William Bateson. Springer, New York

Cristillo AD, Mortimer JR, Barrette IH, Lillicrap TP, Forsdyke DR (2001) Double-stranded RNA as a not-self alarm signal: to evade, most viruses purine-load their RNAs, but some (HTLV-1, Epstein-Barr) pyrimidine-load. J Theor Biol 208:475-491

Cristofari G, Darlix J-L (2002) The ubiquitous nature of RNA chaperone proteins. Prog Nuc Acid Res Mol Biol 72:223-268

Daskalogianni C, Apcher S, Candeias MM, Naski N, Calvo F, Fahraeus R (2008) Gly-Ala repeats induce position- and substrate-specific regulation of 26S proteasome-dependent partial processing. J Biol Chem 283:30090-30100

Forsdyke DR (2001) The Origin of Species, Revisited. McGill-Queen's University Press, Montreal

Forsdyke DR (2002) Selective pressures that decrease synonymous mutations in Plasmodium falciparum. Trends Parasitol 18:411-418

Forsdyke DR (2006) Evolutionary Bioinformatics. Springer, New York

Forsdyke DR, Mortimer JR (2000) Chargaff's legacy. Gene 261:127-137

Grantham R (1980) Workings of the genetic code. Trends Biochem Sci 5:327-331

Greenberg H, Penman S (1966) Methylation and processing of ribosomal RNA in HeLa cells. J Mol Biol 21:527-535

Griffiths PE, Stotz K (2006) Genes in the postgenomic era. Theor Med Bioethics 27:499-521

Lewin B (2006) Genes IX. Jones and Bartlett, Sudbury MA

Levitskaya J, Coram M, Levitsky V, Imreh S, Steigerwald-Mullen PM, Klein G, Kurilla MG, Masucci MG (1995) Inhibition of antigen processing by the internal repeat region of the Epstein-Barr virus nuclear antigen-1. Nature 375:685-688

Levitsky V, Masucci MG (2002) Manipulation of immune responses by Epstein-Barr virus. Virus Res 88:71-86

Pfeifer K, Tilghman SM (1994) Allele-specific gene expression in mammals: the curious case of imprinted RNAs. Genes Devel 8:1867-1874

Rayment JH, Forsdyke DR (2005) Amino acids as placeholders: Base composition pressures on protein length in malaria parasites and prokaryotes. Applied Bioinformatics 4:117-130

Schaap T (1971) Dual information in DNA and the evolution of the genetic code. J Theor Biol 32:293-298

Scherrer K, Darnell JE (1962) Sedimentation characteristics of rapidly labeled RNA from Hela cells. Biochem Biophys Res Comm 7:486-490

Scherrer K, Jost J (2007) Gene and genon concept: coding versus regulation. A conceptual and information-theoretic analysis of genetic storage and expression in the light of modern molecular biology. Theory Biosci 126:65-113

Starck SR, Cardinaud S, Shastri N (2008) Immune surveillance obstructed by viral mRNA. Proc Natl Acad Sci USA 105:9135-9136

Stoltzfus A, Spencer DF, Zuker M, Logsdon JM, Doolittle WF (1994) Testing the exon theory of genes: the evidence from protein structure. Science 265:202-207

Tellam J, Smith C, Rist M, Webb N, Cooper L, Vuocolo T, Connolly G, Tscharke DC, Devoy MP, Khanna R (2008) Regulation of protein translation through mRNA structure influences MHC class 1 loading and T cell recognition. Proc Natl Acad Sci USA 105:9319-9324

Wu H, Kapoor P, Frappier L (2002) Separation of the DNA replication, segregation, and transcriptional activation functions of Epstein-Barr nuclear antigen 1. J Virol 76:2480-2490

Xue HY, Forsdyke DR (2003) Low complexity segments in Plasmodium falciparum are primarily nucleic acid level adaptations. Mol Biochem Parasitol 128:21-32

Yates JL, Camiolo SM (1988) Dissection of DNA replication and enhancer activation functions of Epstein-Barr virus nuclear antigen 1. Cancer Cells 6:197-205

End Note August 2009

This paper was part of a special issue of Theory in Biosciences for which several authors with different backgrounds had been invited to consider the problem of gene definition. Hence the description "symposium" in the title.

This page was begun in December 2008 and last edited on 08 Nov 2020 by Donald Forsdyke