1Rosario, Argentina

Area Comunicaciones (FCIA-UNR)

2Cordoba, Argentina

Estadística y Biometría (FCA-UNC)

 

A machine learning approach for heterotic performance prediction of maize (Zea mays L.) based on molecular marker data

 

--Ornella, L1; Balzarini, M2; Tapia, E1

 

 

 

A number of statistical methods based on molecular data are currently available for assigning new inbreds to heterotic groups in maize (Zea mays L.), with variable results (Reif et al., 2005; dos Santos Dias et al., 2004). On the other hand, novel multiclassification methods have arisen from the machine learning domain. By machine learning we mean a research domain closely related to statistical inference, artificial intelligence, and optimization. Its aim is to construct systems able to learn to solve tasks given a set of examples drawn from an unknown probability distribution, together with some a priori knowledge of the task (Witten and Eibe, 2000). We conjecture that the main flaw of traditional statistical models is that they do not capture the non-linear relation between parental data and progeny performance (Tollenaar et al., 2004); in contrast, experimental results show that this type of non-linearity can be readily captured by supervised machine learning models, i.e., by multiclassifiers (Witten and Eibe, 2000).

The field data analyzed in this study were taken from experiments described in detail by Nestares et al. (1999). Briefly, our investigation involved 26 inbred lines out of a total of 48 evaluated for their combining ability with four testers: sB73 and sMo17 from the Reid x Lancaster pattern, and HP3 and P5L2 from the local orange flint pattern. All of the 26 lines but one, B73, were orange flint germplasm developed by INTA from twelve different sources (synthetics, composites, landraces, etc.). The 48 lines were grouped according to their combining ability with the tester populations into four heterotic groups (H1-H4) using the SAS FASTCLUS procedure (Nestares et al., 1999). The 26 lines were characterized using 21 SSR (simple sequence repeat) markers evenly distributed across the genome (Morales Yokobori et al., 2005).

A dataset comprising 42 attributes, corresponding to the 21 SSR loci (2 alleles per locus per line), was generated. This dataset contains 26 instances (the 26 lines) and 4 classes defined by the four heterotic groups (H1 = 4 instances, H2 = 8 instances, H3 = 6 instances and H4 = 8 instances). Finally, we considered six standard multiclassifiers provided by the Java WEKA library (Witten and Eibe, 2000): Naïve Bayes, Support Vector Machines with a radial basis function kernel in a one-against-all scheme (SVM-RBF), Decision Trees (J48 and random forest), AdaBoost with Decision Stumps, and Multilayer Perceptron. Classifier performance was evaluated by 3-, 5- and 10-fold cross-validation (3-CV, 5-CV and 10-CV); all classifiers were run with WEKA's default values. Results are presented in Table 1.
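The evaluation protocol described above can be illustrated with a minimal sketch. It is written in plain Python rather than WEKA, it uses only the class sizes reported in the text (the genotype attributes are ignored), and it scores a trivial majority-class baseline instead of any of the six multiclassifiers. The helper names `stratified_folds` and `cv_error_majority` are hypothetical and are not part of the original study.

```python
import random
from collections import Counter

# Class sizes reported in the text: H1 = 4, H2 = 8, H3 = 6, H4 = 8 (26 lines).
CLASS_SIZES = {"H1": 4, "H2": 8, "H3": 6, "H4": 8}
labels = [g for g, n in CLASS_SIZES.items() for _ in range(n)]

def stratified_folds(labels, k, seed=0):
    """Assign each instance index to one of k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    slot = 0
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for i in idx:
            # Round-robin assignment keeps fold sizes within one of each other.
            folds[slot % k].append(i)
            slot += 1
    return folds

def cv_error_majority(labels, k, seed=0):
    """k-fold CV error of a baseline that always predicts the training majority class."""
    folds = stratified_folds(labels, k, seed)
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        counts = Counter(train)
        # Ties are broken alphabetically so the result is deterministic.
        majority = min(counts, key=lambda c: (-counts[c], c))
        errors += sum(labels[i] != majority for i in test)
    return errors / len(labels)

for k in (3, 5, 10):
    print(f"{k}-fold CV error (majority baseline): {cv_error_majority(labels, k):.3f}")
```

With only 26 instances, each misclassified line changes the error by 1/26 (about 0.038), so small differences between classifiers or between fold counts should be read with caution.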

Although our classification results are preliminary, they suggest the usefulness of a molecular-based machine learning approach for solving general heterotic group assignment problems. However, we must consider the effect of population structure (highly divergent parents), which affects linkage disequilibrium between DNA markers and genes involved in the expression of target traits (Charcosset and Essioux, 1994). In addition, and based on previous work, we hypothesize that further application of feature selection methods, i.e., the selection of highly discriminant molecular markers, might improve heterotic group assignment. This hypothesis is supported by the observed similarity between classification problems involving microsatellite markers and those involving microarray data. In both cases, missing and noisy features may be present in scarce data samples. This type of classification noise can be limited by feature selection methods, so that the resulting data sets can be safely handled by binary-based, Coding Theory-inspired multiclassifiers (Ornella and Tapia, 2006).
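One common way to select discriminant markers is to rank attributes by information gain with respect to the class. The text does not say which feature selection method is intended, so the sketch below is only one possible instance. The marker names, allele codes and group labels are invented toy data, not the genotypes from the study, and `rank_markers` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy after splitting on a nominal attribute."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_markers(table, labels):
    """Return (marker, gain) pairs sorted from most to least informative."""
    gains = {m: information_gain(col, labels) for m, col in table.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: allele calls at three invented SSR attributes for
# eight lines. These are NOT the markers or genotypes from the study.
labels = ["H1", "H1", "H2", "H2", "H3", "H3", "H4", "H4"]
table = {
    "ssr_a": ["a1", "a1", "a2", "a2", "a3", "a3", "a4", "a4"],  # separates all groups
    "ssr_b": ["b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"],  # separates pairs of groups
    "ssr_c": ["c1", "c2", "c1", "c2", "c1", "c2", "c1", "c2"],  # uninformative
}

for marker, gain in rank_markers(table, labels):
    print(f"{marker}: {gain:.3f} bits")
```

Information gain tends to favour attributes with many distinct values, which matters for highly polymorphic SSR loci; on a real data set the ranking would also need to be computed inside each cross-validation fold to avoid optimistic error estimates.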

 

 

 

Bibliography

dos Santos Dias, LA; de Toledo Picoli, EA; Barros Rocha, R and Couto Alfenas, A (2004) A priori choice of hybrid parents in plants. Genet. Mol. Res. 3:356-368.

Charcosset, A and Essioux, L (1994) The effect of population structure on the relationship between heterosis and heterozygosity at marker loci. Theor. Appl. Genet. 89:336-343.

Morales Yokobori, M; Decker, V and Ornella, LA (2005) Analysis of heterotic maize (Zea mays L.) populations using molecular markers. MNL 79:36.

Nestares, G; Frutos, E and Eyherabide, GH (1999) Evaluación de líneas de maíz flint colorado por aptitud combinatoria. Pesq. agropec. bras. 34:1399-1406.

Ornella, L and Tapia, E (2006) A classification approach for heterotic performance prediction based on molecular marker data. VIII Argentine Symposium on Artificial Intelligence (ASAI 2006), Mendoza, Argentina.

Reif, JC; Melchinger, AE and Frisch, M (2005) Genetical and Mathematical Properties of Similarity and Dissimilarity Coefficients Applied in Plant Breeding and Seed Bank Management. Crop Sci. 45:1-7.

Tollenaar, M; Ahmadzadeh, A and Lee, EA (2004) Physiological Basis of Heterosis for Grain Yield in Maize. Crop Sci. 44:2086-2094.

Witten, IH and Eibe, F (2000) "Data Mining: Practical machine learning tools with Java implementations". Morgan Kaufmann, San Francisco.

 

 

 

Table 1: 3-, 5- and 10-fold CV error on the Heterosis dataset using multiclass classifiers

Multiclassifier                  3-CV error   5-CV error   10-CV error
Naive Bayes                      0.654        0.692        0.769
SVM-RBF                          0.654        0.769        0.769
Decision Tree (J48)              0.808        0.769        0.769
Decision Tree (random forest)    0.731        0.846        0.769
AdaBoost-Decision Stump          0.731        0.610        0.770
Multilayer Perceptron            0.770        0.770        0.692
Error Expected by Chance         0.774        0.774        0.774