TIGR Maize Gene Index

--Lee, D, Quackenbush, J

Overview. The most comprehensive resource for cataloging transcribed genes is the vast body of data generated by the sequencing of Expressed Sequence Tags (ESTs). ESTs are single-pass, partial sequences of cDNA clones, and they have been used extensively for gene discovery and genomic mapping in various species. Like the other TIGR Gene Index databases, the Maize Gene Index first clusters and then assembles EST and gene sequences to produce tentative consensus (TC) sequences that represent the underlying mRNA transcripts. The resulting TCs can be used for eukaryotic genome sequence annotation, EST sequence annotation, integration of complex mapping data, and identification of orthologous genes in other crop plants.

Building the Maize Gene Index. Maize TCs are assembled by treating ESTs and Expressed Transcript (ET) sequences as elements of a transcriptome shotgun sequencing project. Maize ESTs are downloaded daily from dbEST and cleaned to remove untrimmed vector, linker, ribosomal, mitochondrial, low-quality, poly(A/T) and contaminating bacterial sequences. Maize ETs are extracted from annotated mRNA and CDS features in GenBank records. Cleaned ESTs and ETs are compared pairwise using megablast to identify overlaps. Sequences sharing a minimum of 95% identity over a region of 40 nt or longer, with 20 bases or fewer of mismatched sequence at either end, are grouped into a cluster. Each cluster is then assembled separately using CAP3. The resulting TC sequences are annotated to provide a provisional functional annotation, and the assemblies and their annotations are stored in a Sybase relational database that allows versioning and heritability to be maintained. All nonclustered, non-overlapping sequences remain as singletons. The resulting Gene Index is released through the TIGR maize gene index web site. Each TC report (for example, TC160405) includes: the assembled TC sequence; the predicted open reading frames (ORFs) from ESTscan, DIANA and framefinder; the coordinates of each EST and ET in the assembly; information about each EST and ET, with links to a variety of databases; alternative splicing clusters; an expression summary based on the number of ESTs from different libraries; SNP detection; orientation determination; functional annotations, including matches to known proteins and GO annotations; tentative orthologous gene identifications in other species from the EGO database; and maps to the rice and Arabidopsis genomes.
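The clustering rule described above can be illustrated with a small sketch. This is not the TIGR pipeline itself: it assumes the pairwise megablast hits have already been reduced to simple records, and the `Overlap` fields, the `passes` and `cluster` functions, and the single-linkage (union-find) grouping are all hypothetical choices made for illustration. Only the three thresholds come from the text, and reading "mismatched sequence at either end" as the longest unmatched overhang is an interpretation.

```python
from dataclasses import dataclass

# Thresholds taken from the text.
MIN_IDENTITY = 95.0   # minimum percent identity
MIN_OVERLAP = 40      # minimum overlap length, nt
MAX_OVERHANG = 20     # maximum unmatched bases at either end


@dataclass
class Overlap:
    """One pairwise hit (hypothetical record layout, not megablast output)."""
    query: str
    subject: str
    identity: float   # percent identity over the aligned region
    length: int       # aligned length in nt
    overhang: int     # longest unmatched end, in bases


def passes(hit: Overlap) -> bool:
    """Return True if the hit meets all three clustering criteria."""
    return (hit.identity >= MIN_IDENTITY
            and hit.length >= MIN_OVERLAP
            and hit.overhang <= MAX_OVERHANG)


def cluster(seq_ids, hits):
    """Group sequences linked by passing hits (single linkage, union-find)."""
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for hit in hits:
        if passes(hit):
            a, b = find(hit.query), find(hit.subject)
            if a != b:
                parent[b] = a

    groups = {}
    for s in seq_ids:
        groups.setdefault(find(s), []).append(s)
    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons
```

In this sketch each multi-member group would then be passed to an assembler such as CAP3, while sequences with no passing overlap are reported as singletons.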

Using the TIGR Maize Gene Index. Users can enter the Maize Gene Index database in several ways. They can search by sequence identifier, such as GenBank accession or TC number, by gene name, or for TCs that are preferentially expressed in specific tissues.

However, the most common entry point is the sequence search page. Both the BLASTN and TBLASTN programs of the WU-BLAST package have been implemented, allowing DNA and protein queries to be used. Alignments to high-scoring TCs and singleton ESTs are returned, and users can view the individual target sequences by clicking on the TC number or EST_id. From the TC reports (see Figure 2), users can view the annotation provided for the sequence and its evidence, link to orthologues in EGO, or view genomic sequence alignments with rice and Arabidopsis.

In addition to the Web interface, the TIGR Maize Gene Index is available as a set of flat files. The TC consensus sequences are provided in a FASTA format file; the ESTs comprising each TC are specified in a separate file. Many users involved in the annotation of genomic sequence and in the analysis of cDNA microarray data have found these files particularly useful. In addition, we provide a putative annotation of all the assembled and singleton ESTs in the database through the EST Annotator feature on the main maize gene index page.
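For readers working with the flat files, a minimal sketch of reading TC consensus sequences from a FASTA file is given here. The `read_fasta` function is hypothetical, and the layout of the header line is an assumption (the identifier is taken to be the first whitespace-delimited token after `>`); the sequence in the usage example is invented for illustration and is not the real TC160405 consensus.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-format text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    # Emit the final record.
    if name is not None:
        yield name, "".join(chunks)


# Usage with an in-memory example; the sequence is made up.
example = io.StringIO(">TC160405 hypothetical header text\nACGTAC\nGGTTAA\n")
for tc_id, seq in read_fasta(example):
    print(tc_id, len(seq))
```

The same generator works on a file opened with `open(path)`; the file that lists the ESTs in each TC is not parsed here because its layout is not described in this note.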

TIGR Maize Gene Index release data. The current release of the Maize Gene Index, ZmGI 11.0, was released on February 1, 2003. It contains 188,973 ESTs (173,826 in TCs and 15,147 as singletons), as well as 3,463 expressed transcripts. New releases will be made available every 120 days, provided that the number of available ESTs has increased by at least 10%.
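The release figures and the release rule can be checked with a short sketch. The counts are those given above; the `release_due` function and its parameter names are hypothetical, and treating the rule as "at least 120 days elapsed and at least 10% more ESTs than at the previous release" is an interpretation of the text.

```python
# Counts reported for ZmGI 11.0.
ESTS_IN_TCS = 173_826
SINGLETON_ESTS = 15_147
TOTAL_ESTS = 188_973

# ESTs in TCs plus singleton ESTs should account for every EST.
assert ESTS_IN_TCS + SINGLETON_ESTS == TOTAL_ESTS


def release_due(days_since_release, ests_now, ests_at_release,
                min_days=120, min_growth=0.10):
    """Return True if both the time and the EST-growth conditions are met."""
    growth = (ests_now - ests_at_release) / ests_at_release
    return days_since_release >= min_days and growth >= min_growth
```

Under this reading, a new release after ZmGI 11.0 would require at least 207,871 available ESTs, since 10% of 188,973 is 18,897.3.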

Acknowledgements. We would like to thank the members of the maize EST cloning and sequencing community whose data made this project possible. This work was supported by the National Science Foundation, grant DBI-9983070.

