Other taxonomic groups, such as dicot plants, animals, fungi, bacteria, and archaea, do not have bimodal GC content distributions, although vertebrate animals come close. Arabidopsis, with its unimodal distribution centered at 44% ORFGC, is therefore not a representative model system for this chiefly monocot (cereal) phenomenon. [Abbreviations: GC, G+C% nucleotide content; ORF, open reading frame from ATG to stop; 3GC, third codon position GC%.]
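As a minimal sketch of how the abbreviations above translate into a calculation, the following Python computes ORFGC and ORF3GC for a single ORF. The function names are illustrative, and counting the stop codon as part of the ORF is an assumption; the text does not say how the stop codon was treated.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def orf_gc(orf: str) -> float:
    """ORFGC: GC% over the whole ORF, ATG through stop (assumed inclusive)."""
    return gc_percent(orf)


def orf_3gc(orf: str) -> float:
    """ORF3GC: GC% over third codon positions only."""
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    # Every third base, starting at index 2, is a third codon position.
    return gc_percent(orf[2::3])
```

For the toy ORF "ATGGCCTAA", orf_gc counts 4 of 9 bases as G or C, and orf_3gc counts 2 of the 3 third-position bases.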
The low and high ORFGC maize gene modes peak at about 51% and 67% GC, with roughly two-thirds of genes falling in the lower GC mode. This GC bimodality is a feature of nuclear-encoded genes: 111 chloroplast-encoded ORFs have a unimodal distribution with a low 39.1% average ORFGC content, and 16 mitochondrion-encoded gene ORFs averaged 43.4% GC.
Most bimodal GC content variation is in the coding region, especially the third codon position, where GC ranges up to 100%, with obvious effects on codon usage. Third codon position C predominates over G by a ratio of 1.3 among high GC mode genes. In the ORF overall, however, G% is nonetheless slightly higher than C% among high GC genes, primarily due to the high frequency of G in the non-synonymous 1st codon position, evidently related to amino acid frequencies. The 5'-UTRs and 3'-UTRs of GC-rich genes are more GC-rich than their counterparts from low GC mode genes, but only slightly so: by 1.8% and 2.6%, respectively.
Maize genes generally have a negative GC gradient along the transcript, from the 5'-UTR, through the coding region, to the 3'-UTR (see also Wong, GK-S et al., Genome Res. 12:851, 2002). Nonetheless, most high GC mode genes, and a subset of the low GC mode genes, have only slightly negative ORFGC gradients. The remaining genes from the low GC mode have marked negative GC gradients, but this negative gradient tends to rebound to a positive gradient before the end of the ORF. The gradient and its reversal are more pronounced for the ORF3GC than the ORFGC. Interestingly, because of these differences, a plot of ORF3GC gradient tendencies versus GC content (as ORF3GC) reveals a tri-modal maize gene distribution: 1) high GC mode, little gradient; 2) low GC mode, little gradient; and 3) low GC mode, negative gradient and rebound.
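The text does not state how the gradients were quantified. One possible approach, sketched here purely as an assumption, is to divide the ORF into consecutive bins, compute 3GC in each bin, and fit a least-squares slope across the bins. The function names and the default of five bins are hypothetical.

```python
def gc3_profile(orf: str, n_bins: int = 5) -> list[float]:
    """3GC% in n_bins consecutive, roughly equal segments of the ORF (5' to 3')."""
    third = orf.upper()[2::3]  # third codon positions only
    if n_bins < 1 or len(third) < n_bins:
        raise ValueError("ORF too short for the requested number of bins")
    # Integer bin boundaries; each bin is non-empty because len(third) >= n_bins.
    bounds = [k * len(third) // n_bins for k in range(n_bins + 1)]
    return [
        100.0 * sum(b in "GC" for b in third[lo:hi]) / (hi - lo)
        for lo, hi in zip(bounds, bounds[1:])
    ]


def slope(values: list[float]) -> float:
    """Least-squares slope of values against their bin index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two bins to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

Under this sketch, a negative slope corresponds to a declining 5'-to-3' gradient, and a rebound would appear as a profile whose lowest bin is an interior bin rather than the last one.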
Like GC content, the CpG methylation site frequencies are also bimodal, as might be expected given that this site is composed of G and C. Importantly, however, if the observed (direct count) and expected (calculated from GC content) CpG site frequencies are determined for each gene in a codon-position-specific manner (i.e., separately for codon positions 1,2; 2,3; and 3,1), a bimodal pattern still emerges. High GC genes have a generally balanced CpG ratio of about 0.9, but low GC genes tend to have a deficit CpG ratio of about 0.6. It should be emphasized that this represents a two-order separation in CpG site frequencies between low and high GC genes, because for the low GC mode genes their expected CpG frequency is already low due to their own intrinsically lower GC content.
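The codon-position-specific obs/exp CpG ratio can be sketched as follows. The exact formula for the expected count is not given in the text; this sketch assumes it is the product of the C frequency at the first position of the pair and the G frequency at the second position, scaled by the number of such pairs. The function name is illustrative.

```python
def cpg_obs_exp(orf: str, first_pos: int) -> float:
    """Obs/exp CpG ratio for dinucleotides starting at codon position first_pos.

    first_pos = 1 covers codon positions (1,2), 2 covers (2,3), and 3 covers
    (3,1), where the second base is position 1 of the following codon.
    Returns NaN when the expected count is zero.
    """
    orf = orf.upper()
    if first_pos not in (1, 2, 3):
        raise ValueError("first_pos must be 1, 2, or 3")
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    # Start indices of every dinucleotide in the chosen codon frame.
    starts = range(first_pos - 1, len(orf) - 1, 3)
    firsts = [orf[i] for i in starts]
    seconds = [orf[i + 1] for i in starts]
    n = len(firsts)
    if n == 0:
        raise ValueError("ORF too short")
    observed = sum(a == "C" and b == "G" for a, b in zip(firsts, seconds))
    # Assumed expectation: independent C and G at the two positions.
    expected = firsts.count("C") * seconds.count("G") / n
    return observed / expected if expected else float("nan")
```

A ratio near 1.0 means CpG sites occur about as often as base composition alone would predict; a ratio well below 1.0 indicates a CpG deficit.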
A plot of the obs/exp CpG ratios is thus bimodal, and a plot of these CpG ratios against the bimodal ORFGC content yields a diagonal distribution anchored by two clusters, one at each end. The correlation of the CpG ratios versus GC content is fairly high (r² = 0.72). Nonetheless, the plot is slightly curved, with the slope declining at higher GC content. Like the GC content, the CpG ratios also show a generally negative gradient along the ORF length, with a rebound from the negative gradient towards the C-terminus, especially among low GC mode genes. The more balanced CpG ratios of high GC genes also extend to the 5'- and 3'-UTRs, indicating that this balance is not limited to the coding region. The frequencies of the alternative methylation site CpNpG do not vary as much as CpG sites between high and low GC mode genes, and the obs/exp CpNpG ratios are balanced (1.0) for both high and low GC mode genes in all three codon positions (1,3; 2,1; and 3,2), and in both the 5'- and 3'-UTRs.
While most GC content variation is manifest in the third codon position, the first two non-synonymous codon positions are nonetheless also higher in GC content among high GC mode genes. This is reflected in shifts in amino acid composition. High GC mode genes are richer in amino acids whose codons are necessarily GC-rich, such as alanine (GCX), glycine (GGX), and proline (CCX). Low GC mode genes, conversely, are richer in amino acids whose codons are necessarily GC-poor, while amino acids with codons of moderate GC content are found at similar frequencies in high and low GC mode genes. Like GC content and the obs/exp CpG ratios, the frequencies of GC-rich amino acids generally decline 5'-3' (N-terminal to C-terminal) along the ORF, but rebound toward the C-terminus among low GC genes. This is especially true for alanine, which, being found at higher levels near the N-terminus, contributes to higher signal peptide prediction rates for high GC genes, because alanine is often scored as a signal peptide cleavage site. These compositional biases thus have implications for automated gene functional annotation, and might also complicate protein evolutionary comparisons.
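One way to make "necessarily GC-rich" and "necessarily GC-poor" concrete is to ask, for each amino acid in the standard genetic code, whether codon positions 1 and 2 are G or C in every one of its codons, or in none. This is an assumed operational definition for illustration, not necessarily the grouping used in this study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_amino_acids() -> dict[str, list[str]]:
    """Group amino acids by the GC count of codon positions 1 and 2."""
    gc_counts: dict[str, set[int]] = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons
            continue
        gc_counts.setdefault(aa, set()).add(sum(b in "GC" for b in codon[:2]))
    classes: dict[str, list[str]] = {"GC-rich": [], "GC-poor": [], "moderate": []}
    for aa, counts in sorted(gc_counts.items()):
        if counts == {2}:      # both positions G/C in every codon
            classes["GC-rich"].append(aa)
        elif counts == {0}:    # both positions A/T in every codon
            classes["GC-poor"].append(aa)
        else:
            classes["moderate"].append(aa)
    return classes
```

Under this definition the GC-rich class is exactly alanine, glycine, and proline (A, G, P), matching the examples named in the text; arginine falls in the moderate class because of its AGA and AGG codons.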
Maize GC-rich genes tend to be more compact, having shorter coding regions (by 18%), smaller predicted protein MWs (by 20%), and shorter 5'-UTRs (by 10%) and 3'-UTRs (by 14%), relative to their counterparts from low GC genes. The 5'-UTRs of high GC genes also contain nearly four-fold fewer non-coding ATG sites. Using rice as a surrogate cereal to investigate introns (because there are as yet few maize full-length genes with annotated introns), we found that the total intron length of GC-rich genes is markedly shorter, primarily because high GC mode genes have only 40% as many introns as low GC mode genes. High GC mode rice genes are also 18.2% intronless, compared to 7.9% for GC-poor mode rice genes. Similar trends will presumably hold when comparable numbers of maize introns are available for study.
The compact and simpler structure of GC-rich gene transcripts, the extreme codon bias, and the fewer ATG sites in the 5'-UTR (which the ribosomal apparatus could confuse with bona fide start sites) together suggest that GC-rich genes may be adapted to more efficient gene expression, both in terms of transcript production and processing, and in protein translation. The Kozak translational initiation site frequencies, however, do not differ greatly between high and low GC mode genes: both favor "GCCATGGC". Kozak context is therefore not a good marker for bimodal gene GC content. In humans, on the other hand, Kozak context does vary between high and low GC genes (Pesole, G et al., FEBS Lett. 464:60, 1999).
We investigated mRNA expression of high and low GC mode genes using both EST distribution analysis (over 400,000 ESTs) and Lynx MPSS technology (63.4 million 17-mer tags) (Brenner, S et al., Nat. Biotech. 18:630, 2000). We found that while gene expression varied widely within high and low GC modes, considering the average expression levels among 12 key distinct tissue categories, the overall average tissue expression level of high and low GC genes was similar; only 1.1- and 1.2-fold higher for high GC mode genes, in EST and MPSS analyses respectively. We observed, however, a tendency for higher magnitude tissue-preferred expression of GC-rich genes, especially in vegetative tissues (root, mesocotyl, stalk, leaf; averaging 1.6- and 3.0-fold higher, EST and MPSS respectively) and in non-kernel reproductive tissues (silks, tassel, and pollen; averaging 2.5- and 4.3-fold higher, EST and MPSS respectively). In contrast, in endosperm, pericarp, and R1 kernel tissues, expression of low GC genes was higher than the high GC mode genes by an average of 1.9- and 2.0-fold, EST and MPSS respectively.
High and low GC mode genes were expressed constitutively in similar frequencies: 4.8% and 9.1%, respectively, by EST analysis; and 7.8% and 8.9%, respectively, by MPSS analysis. For this study constitutive expression was defined for the EST analysis as expression in at least 10 of the 12 tissue categories at a level of at least 25 PPM in each of the 10 tissues, with an overall average expression among all 12 tissues of 100 PPM. For the Lynx MPSS technology the definition was similar, but it was set to a 12.5 PPM minimum and 50 PPM average. However, the constitutive expression levels (i.e., magnitude in PPM) of the GC-rich genes averaged higher by 1.7- and 2.1-fold, in EST and MPSS analyses respectively.
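The constitutive-expression definition above can be expressed as a small classifier. This is a sketch under stated assumptions: the function name and the dictionary input are hypothetical, and the overall average is treated as a minimum ("at least 100 PPM"), which the text implies but does not state outright.

```python
def is_constitutive(
    ppm_by_tissue: dict[str, float],
    min_ppm: float = 25.0,        # per-tissue minimum (EST definition)
    min_mean_ppm: float = 100.0,  # assumed minimum for the all-tissue average
    min_tissues: int = 10,        # tissues that must reach min_ppm
) -> bool:
    """Return True if a gene meets the constitutive-expression thresholds."""
    values = list(ppm_by_tissue.values())
    if not values:
        return False
    tissues_expressed = sum(v >= min_ppm for v in values)
    mean_ppm = sum(values) / len(values)
    return tissues_expressed >= min_tissues and mean_ppm >= min_mean_ppm
```

The defaults follow the EST definition; for the MPSS definition the same function would be called with min_ppm=12.5 and min_mean_ppm=50.0.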
Inspection of gene identities for high and low GC mode genes revealed diverse biological and biochemical functions of the gene products, both within and between the high and low GC modes. Protein relationships were higher within genes of each GC mode, no doubt owing in part to close gene family relationships; however, substantial protein relationships existed for predicted proteins between high and low GC mode genes as well. Highly conserved proteins (that is, conserved in other distant taxonomic groups such as bacteria, fungi and animals) were readily found among both high and low GC mode genes. It had been argued that since methylated CpG sites mutate at a high rate, then high GC genes would tend to be relatively new (e.g., see Moore, G et al., Genomics 15:472, 1993). There was little indication of bimodality arising from pre-existing GC differences in parental genomes of polyploids. For example, among 11 loci pairs attributed to maize tetraploid ancestry (Gaut, BS and Doebley, JF, PNAS 94:6809, 1997), the GC content differed by only 0-3% (average 1.2%). Of course, all cereals investigated to date have the bimodality, indicating its origin precedes the maize lineage.
The cellular locations of high and low GC gene products appeared to be similarly diverse, and gene plots of presumed cellular location (nuclear, cytoplasm, chloroplast, mitochondria, plasma membrane, and extracellular) versus ORFGC content were all generally bimodal. The extracellular proteins investigated, however, had higher GC contents on average. Looking in more detail at one cellular compartment, the nucleus, we investigated the 45 transcription factor families of maize using 2384 members from among 84,085 "UniGene" EST assemblies. The transcription factor genes as a whole differed little from the overall gene GC content or bimodal tendency. Yet, while most families showed a broad or somewhat bimodal GC content distribution, some families displayed an upper or lower GC bias. For example, the Wrky transcription factor family (IPR003657) averaged 59% GC, with the mode peak at 67% GC (N=67).
While the cause of the bimodal gene distribution is unknown, these investigations have directed our attention to CpG sites and methylation. Of all the gene characteristics we investigated, the obs/exp CpG ratios had the best correlation to the GC content variation, and these ratios differed not only within the ORF, but also along the ORF length, and outside the ORFs in the non-coding regions. The gradients in GC content suggest an organizing 'force' emanating from the 5' end of the transcript region. The rebound at the 3' end among some genes may be a compensational recovery from this 5'-end pressure. After all, the amino acid bias rebounds sooner (i.e., closer to the N-terminus) than does the base composition, which might be expected with a force declining with distance from the 5'-end. Yet, other features associated with high GC genes, such as their compact structure and fewer 5'-UTR ATG sites, are not obviously related to a 5'-end emanating force or methylation, and instead, at least superficially, suggest expression efficiency could somehow relate to GC content as well. Whether these varied characteristics are the product of one or multiple evolutionary trends is unknown, but these data suggest there is an important organizing principle at play in cereal genomes. Perhaps it relates to how genes are registered in chromosomes and disposed towards developmental or temporal expression. Analysis of 129 of the 1831 genes that were genetically mapped indicated that they were distributed throughout the genome on all 20 chromosome arms; however, more extensive physical maps and genomic sequencing will be needed to affirm or refute any relationship of maize bimodal gene GC content to chromosomal position.
Codon bias is still assumed to be a key condition for optimizing (translational)
expression in eukaryotes such as maize - presumably because some microbes
have correlations between codon usage, iso-accepting tRNA pools, and expression
levels. Codon biases have accordingly figured into gene re-engineering
methods for transgene expression (e.g., Koziel, M et al., US Patent
6121014, 2000). The findings here cast further doubt upon this assumption
and the need for this practice. First, maize does not have one codon table,
but in effect two. Second, high and low GC genes have generally similar
levels of mRNA expression. Third, high and low GC mode genes are both expressed
within the same tissues, and presumably the same cells, even while there
are some differences in tissue preference. Fourth, the GC content variation
is not limited to the ORF, and within the ORF it presses beyond merely
codon usage to affect amino acid content itself. Fifth, the correlation
of obs/exp CpG sites to GC content is unlikely related to translation.
And sixth, the codon usage varies along the ORF length with GC content.
It is not apparent how the codon-anticodon coadaptation hypothesis can
account for all these varied observations. We have, however, developed
and applied computerized methods for re-engineering ORFs from any source
into configurations at least compatible with the natural structures of
maize genes revealed by this study, for subsequent reintroduction into
transgenic crops, by drawing upon a more elaborate combination of attributes
such as GC content, ORFGC gradients, obs/exp CpG ratios, and codon biases.