PISCATAWAY, NEW JERSEY
University of Arizona
Institut für Bioinformatik
High-resolution physical mapping of the maize genome and sequencing a part thereof
--Bharti, AK, Wei, F, Butler, E, Yu, Y, Goicoechea, JL, Kim, H, Fuks, G, Nelson, W, Hatfield, J, Gundlach, H, Karlowski, WM, Raymond, C, Towey, S, Jaffe, D, Nusbaum, C, Birren, B, Mayer, K, Soderlund, C, Wing, RA, Messing, J
Because of its economic significance, maize (>2 Gb) is likely to be the next cereal to be sequenced after rice (0.4 Gb). Sequencing the maize genome presents a new challenge, not only because it is five times larger but also because it contains many gene families and tandemly arrayed and nested repeat sequences. The advanced state of maize genetics will aid in developing a map-based sequencing approach, which in turn will provide an arsenal for functional genomics. As a first step towards sequencing the maize genome, we have sequenced 48 random clones out of a chosen set of 156 BACs. In addition, we have sequenced 355,294 BAC ends, yielding random small fragments with an average size of 627 bp (Table 1), which together represent about 11% of the maize genome (223 Mb).
To establish a link between sequences and the genetic map, the same clones that were sequenced at their ends have also been fingerprinted. DNA fingerprinting of highly redundant BAC libraries yields bins of overlapping BAC clones in a high-throughput fashion. Large bins of BACs can then be physically linked to the genetic map by detecting genetic markers contained within individual BAC clones, either by filter hybridization or by PCR-related methods. In the previous NSF-funded Maize Mapping Project, 10,913 such BAC clones (anchored by 1,937 markers) were identified. Since these anchored BACs are positioned within contigs, the entire contig can be linked to the map (Fig. 1).
The limiting factors in the anchoring process are marker density and the size of BAC contigs. The maize genetic map is about 2,000 cM, which translates into roughly one cM per Mb. Given the current marker density and the preference for at least two markers per contig, very large contigs are needed. This requires a high-resolution fingerprinting method capable of identifying even small overlaps between neighboring BACs, thereby producing large contigs.
Table 1. Current status of fingerprinting and sequencing of Zea mays ssp. mays cv. B73 BAC Libraries.
|High Information Content Fingerprinting (HICF)|
|Passed QC (presence of internal standards and expected vector bands)||382,696|
|Genome coverage (based on 2,036 Mb coverage by agarose fpc)||28.2x|
|BAC End Sequencing (BES)|
|Submitted to GenBank||355,294|
|Average read length submitted to GenBank||627 bp|
|Genome coverage (based on 2,036 Mb coverage by agarose fpc)||11% (223 Mb)|
|Sequenced BACs submitted to GenBank (28 Phase 1 and 20 Phase 2)||48|
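The BES coverage figure in Table 1 follows from simple arithmetic, which can be sketched as below. The variable and function names are illustrative only, and the calculation assumes that reads do not overlap one another, so it gives an upper bound on the unique sequence sampled.

```python
# Values taken from Table 1; names are illustrative, not from any pipeline.
GENOME_SIZE_BP = 2_036_000_000   # 2,036 Mb covered by the agarose fpc map
BES_READS = 355_294              # BAC end sequences submitted to GenBank
MEAN_READ_LENGTH_BP = 627        # average submitted read length


def bases_sampled(reads: int, mean_length_bp: int) -> int:
    """Total bases in all reads, ignoring any overlap between reads."""
    return reads * mean_length_bp


def fraction_of_genome(bases: int, genome_size_bp: int) -> float:
    """Share of the genome represented by the sampled bases."""
    return bases / genome_size_bp


bes_bases = bases_sampled(BES_READS, MEAN_READ_LENGTH_BP)
bes_fraction = fraction_of_genome(bes_bases, GENOME_SIZE_BP)
print(f"{bes_bases / 1e6:.0f} Mb sampled, {bes_fraction:.1%} of the genome")
```

Under these assumptions the sampled sequence comes to roughly 223 Mb, or about 11% of the 2,036 Mb genome estimate.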
Earlier, the Maize Mapping Project (MMP) generated a genetically anchored physical map from HindIII-digested, agarose-based fingerprints of BACs offering 21.5x physical coverage of the genome. Manual editing of the agarose fpc assembly increased the number of contigs with >200 overlapping BAC clones from 272 to 390 and decreased the total number of contigs from 4,518 to 3,488 (1,446 of them anchored). To further reduce the number of contigs and enlarge them, a high-resolution fluorescent fingerprinting method known as HICF (High Information Content Fingerprinting) has been applied to the same BAC libraries (NSF-B73 and CHORI-201). A total of 305,849 BAC clones have been fingerprinted and assembled into contigs by the HICF method; the results (Table 2) can be accessed via an interactive website. Using the BAC nomenclature and address, all agarose-based contigs are being aligned to the HICF contigs (Fig. 1).
Table 2. Current status of the agarose and the HICF fpc builds.
|Particulars||Agarose fpc build (June 06, 2003)||HICF fpc build (December 15, 2003)|
|Number of successful fingerprints||292,039||305,849|
|Genome coverage (based on 2,036 Mb coverage by agarose fpc and av. insert size of BAC libraries as 150 kb)||21.5x||22.5x|
|Number of contigs||3,488*||4,681|
|Number of markers||15,422||15,403|
|Number of anchored contigs||1,446||2,010|
|Number of singletons||14,482||33,566|
|Contigs with >200 BACs||390||188|
|Contigs with 101 to 200 BACs||528||829|
|Contigs with 51 to 100 BACs||570||979|
|Contigs with 26 to 50 BACs||461||694|
|Contigs with 10 to 25 BACs||400||514|
|Cutoff||e-12 (∼70% overlap)||e-48 (∼58% overlap)|
|Tolerance (resolution)||7 (=7 bp)||7 (=0.35 bp)|
*After manual editing
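The genome coverage rows can be reproduced from the clone counts, the 150 kb average insert size, and the 2,036 Mb genome estimate given in the table. The following is a minimal sketch; the function name `clone_coverage` is hypothetical.

```python
# Constants taken from Table 2; names are illustrative.
GENOME_SIZE_MB = 2_036    # coverage by the agarose fpc map
MEAN_INSERT_KB = 150      # average insert size of the BAC libraries


def clone_coverage(n_clones: int,
                   insert_kb: float = MEAN_INSERT_KB,
                   genome_mb: float = GENOME_SIZE_MB) -> float:
    """Physical coverage (in genome equivalents) of a set of fingerprinted clones."""
    return n_clones * insert_kb / (genome_mb * 1000)


fingerprints = {
    "agarose fpc build": 292_039,
    "HICF fpc build": 305_849,
    "HICF passed QC": 382_696,   # from Table 1
}
coverage = {name: round(clone_coverage(n), 1) for name, n in fingerprints.items()}
for name, depth in coverage.items():
    print(f"{name}: {depth}x")
```

With these inputs the sketch returns 21.5x and 22.5x for the two builds and 28.2x for all HICF fingerprints that passed QC, matching the tabulated values.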
Figure 1. The HICF Contig# 2 (Chr 1) consists of two agarose contigs Ctg#1 and Ctg#1798. The clear-cut boundary between the two agarose contigs (both from maize chromosome 1) indicates a clean merge. The BAC clone b0229C02 (in green) has been anchored to chromosome 1 by the marker umc1354 (in blue).
HICF is based on simultaneous digestion of the DNA with a type IIS restriction enzyme (EarI) and a 4-base cutter (TaqI), followed by labeling of the ends with base-specific fluorescent dyes and resolution of the fragments within a 35–500 bp range alongside internal size standards. Simulations show that agarose fingerprinting (∼40 bands/BAC) requires a clone overlap of ∼70% to achieve a medium cutoff (e-12). With HICF (∼120 bands/BAC), assembly can be achieved at a much lower (more stringent) cutoff (e-48), which requires a smaller overlap of ∼58%. Moreover, a tolerance as fine as 0.35 bp is being used for the HICF fpc build, as opposed to 7 bp for the agarose fpc build. Because HICF relies on different restriction enzymes, it is also more likely to cover those regions of the maize genome that have few HindIII sites. The HICF method, originally designed for gel-based sequencers (ABI377) at DuPont, has been adapted successfully for use with capillary sequencers (ABI3700). The fluorescent trace output is extracted using ABI GeneScan Analysis software. To call peaks and produce fpc-compatible files, the perl script provided by DuPont is used together with their peak-scoring scheme (with modifications). Typically, the observed band sizes differ by 2 to 6 bp from those expected from simulated digests. However, band sizes are highly consistent between duplicate runs in HICF data generated on different machines. As for reproducibility, the vector band sizes observed in >72,000 HICF fingerprints show a maximum average standard deviation of 0.11 bp.
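The role of the tolerance in deciding whether two fingerprints share a band can be illustrated with a minimal sketch. It assumes a simple one-to-one pairing of sorted band sizes within a fixed tolerance; the function name `shared_bands` and the band sizes are hypothetical, and this is not the fpc scoring routine, which goes on to evaluate the shared-band count against the probability cutoff.

```python
from typing import Sequence


def shared_bands(a: Sequence[float], b: Sequence[float], tolerance: float) -> int:
    """Count bands that pair up one-to-one within `tolerance` (same units as sizes)."""
    a, b = sorted(a), sorted(b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tolerance:
            # Bands agree within tolerance: count the pair and consume both.
            matches += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1   # a[i] is too small to match any remaining band in b
        else:
            j += 1   # b[j] is too small to match any remaining band in a
    return matches


# Hypothetical band sizes (bp) for two clones.
clone_x = [51.2, 98.7, 140.0, 233.4, 410.9]
clone_y = [51.0, 99.5, 140.3, 300.0, 411.0]

fine = shared_bands(clone_x, clone_y, tolerance=0.35)   # HICF-like tolerance
coarse = shared_bands(clone_x, clone_y, tolerance=7.0)  # agarose-like tolerance
print(fine, coarse)
```

In this toy case the coarse tolerance accepts one extra pair (98.7 and 99.5 bp) that the fine tolerance rejects, which is the kind of chance match a finer resolution helps to exclude.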
The presence of the expected BAC vector bands and of the internal size standard peaks serve as the two quality control (QC) criteria for the entire HICF data set. At present, the HICF success rate is ∼86%, which has yielded 382,696 successful fingerprints that pass both criteria (Table 1). The current HICF fpc build was made with data from 305,849 clones after screening for fingerprints with a band count <175, and has resulted in 4,681 contigs (Table 2). There are numerous examples indicating that HICF can detect overlaps where the agarose method cannot (Fig. 1). Thus, the HICF method is likely to reduce the number of contigs in the physical map significantly by joining distinct agarose-based contigs and incorporating singletons. Furthermore, because a fraction of BAC end sequences is conserved in orthologous positions in the rice and maize genomes, additional contigs are expected to be anchored to the maize map through rice-maize synteny. The resolution of the HICF-generated contigs is also expected to be high enough to select the entire Minimum Tiling Path (MTP) at once, thereby providing a minimal set of <20,000 BAC clones representing the entire maize genome. Available as a single filter for hybridization, this clone set would be a powerful tool for researchers interested in identifying a desired locus within the maize genome, and would greatly facilitate marker-assisted cloning of desired traits.
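Selecting a minimum tiling path from an ordered contig can be viewed as an interval-cover problem. The sketch below uses the standard greedy approach on hypothetical clone coordinates; the function name `minimal_tiling_path`, the clone names, and the coordinate units are assumptions for illustration and do not describe the procedure actually used for the maize map.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) within one contig


def minimal_tiling_path(clones: List[Clone]) -> List[str]:
    """Greedy interval cover: fewest clones spanning the contig end to end.

    Raises ValueError if the clones leave a gap.
    """
    if not clones:
        return []
    ordered = sorted(clones, key=lambda c: c[1])
    target_end = max(c[2] for c in ordered)
    covered = ordered[0][1]          # leftmost position of the contig
    path: List[str] = []
    i = 0
    while covered < target_end:
        best = None
        # Among clones starting inside the covered region, keep the one
        # that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError("gap in contig; cannot tile")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical contig of five overlapping clones (coordinates in kb).
contig = [("a", 0, 150), ("b", 40, 190), ("c", 120, 270),
          ("d", 140, 260), ("e", 250, 400)]
mtp = minimal_tiling_path(contig)
print(mtp)
```

For scale, tiling 2,036 Mb with 150 kb inserts and no overlap at all would already take about 13,600 clones, so a target of <20,000 clones leaves only modest room for overlap between neighbors.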
Table 3. Repeat content of 43 BACs analyzed.
|Number of repeat hits||4,364|
|Average number of repeat hits per BAC||101|
|Average length of repeat hit||1,186 bp|
|Total number of nucleotides||7,086,942 bp|
|Number of nucleotides as repeats||5,178,209 bp|
|Percentage of nucleotides as repeats||73%|
Table 4. Distribution of the 73% repeat content by number of hits and total hit length in each repeat category.
|Repeat category||No. of hits||% of hit no.||Length of hit (bp)||% of hit length|
|Class I elements (retroelements)||3,660||83.87||4,969,641||95.97|
|Class II elements (DNA transposons)||550||12.60||160,301||3.11|
Besides developing a BAC-based genetic map of maize, we are also assembling information about the repeat sequences of the genome by de novo repeat detection in the sequenced BACs. The updated repeat database should be very useful for future comparative genomics work. Although it has been suggested that the maize genome is composed of small gene islands interspersed between large oceans of repetitive DNA, a more detailed picture of this organization has yet to emerge. Annotation of an initial set of randomly chosen sequenced BACs indicates that the extent of large repeat-free open spaces varies from region to region in the genome. Another interesting observation from the analysis is that most repeat elements are shorter than those in the repeat collection, which points towards a high incidence of repeat fragmentation within the maize genome. Comparison of 43 sequenced BACs to the updated repeat database reveals a total repeat content of 73% (Table 3), of which nearly 96% of the sequence length consists of retroelements and only about 3% of DNA transposons (Table 4). In addition to the 48 BACs already sequenced, sequencing of 108 more clones will enable us to assemble a more complete set of repeat elements and assess their abundance.
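The summary percentages in Tables 3 and 4 can be recomputed from the raw counts, as in the minimal sketch below. All names are illustrative; the two classes shown do not add up to 100% because the remaining repeat categories are not listed here.

```python
# Counts taken from Tables 3 and 4; names are illustrative.
BACS_ANALYZED = 43
TOTAL_BP = 7_086_942
REPEAT_BP = 5_178_209
REPEAT_HITS = 4_364

# category -> (number of hits, total hit length in bp)
CATEGORIES = {
    "Class I (retroelements)": (3_660, 4_969_641),
    "Class II (DNA transposons)": (550, 160_301),
}


def percent(part: float, whole: float) -> float:
    """Express `part` as a percentage of `whole`."""
    return 100 * part / whole


repeat_pct = percent(REPEAT_BP, TOTAL_BP)
hits_per_bac = REPEAT_HITS / BACS_ANALYZED
mean_hit_bp = REPEAT_BP / REPEAT_HITS

shares = {
    name: (percent(hits, REPEAT_HITS), percent(length, REPEAT_BP))
    for name, (hits, length) in CATEGORIES.items()
}

print(f"repeat content: {repeat_pct:.0f}%")
print(f"hits per BAC: {hits_per_bac:.0f}, mean hit length: {mean_hit_bp:.0f} bp")
for name, (hit_share, length_share) in shares.items():
    print(f"{name}: {hit_share:.2f}% of hits, {length_share:.2f}% of length")
```

The recomputed values agree with the tables to within rounding in the last digit.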