1Rosario, Argentina
Area Comunicaciones (FCIA-UNR)
2Córdoba, Argentina
Estadística y Biometría (FCA-UNC)
A machine learning approach for heterotic performance prediction of maize (Zea mays L.) based on molecular marker data
--Ornella1, L; Balzarini2, M; Tapia1, E
A number of statistical methods based on molecular data are currently
available for assigning new inbreds to heterotic groups in maize (Zea mays L.), with
variable results (Reif et al., 2005; dos Santos Dias et al., 2004). On the other hand,
novel multiclassification methods have arisen from the machine learning domain. By
machine learning we mean a research domain closely related to statistical inference,
artificial intelligence, and optimization. Its aim is to construct systems able
to learn to solve tasks given a set of examples drawn from an unknown
probability distribution, and given some a priori knowledge of the task (Witten and
Eibe, 2000). We conjecture that the main flaw of traditional statistical models is
that they do not capture the non-linear relation between parental data and progeny
performance (Tollenaar et al., 2004); in contrast, experimental results show that
this type of non-linearity can be readily captured by supervised machine learning
models, i.e., by multiclassifiers (Witten and Eibe, 2000).
The field data analyzed in this study were taken from experiments described in
detail by Nestares et al. (1999). Briefly, our investigation involved 26 inbred
lines (all lines but one, B73, were orange flint germplasm developed by INTA from
twelve different sources: synthetics, composites, landraces, etc.) from a total of
48 evaluated for their combining ability with four testers: sB73 and sMo17 from the
Reid x Lancaster pattern, and HP3 and P5L2 from the local orange flint pattern. The
48 lines were grouped according to their combining ability with the tester
populations into four heterotic groups (H1-H4) using the SAS FASTCLUS procedure
(Nestares et al., 1999). The 26 lines were characterized using 21 SSR (simple
sequence repeat) markers evenly distributed across the genome (Morales Yokobori et
al., 2005).
A dataset comprising 42 attributes, corresponding to the 21 SSR loci (2 alleles
per locus per line), was generated. This dataset contains 26 instances (26 lines)
and 4 classes defined by the four heterotic groups (H1 = 4 instances, H2 = 8
instances, H3 = 6 instances and H4 = 8 instances). Finally, we considered six
standard multiclassifiers provided by the Java WEKA library (Witten and Eibe,
2000): Naïve Bayes, Support Vector Machines with a radial basis function kernel in
a one-against-all scheme (SVM-RBF), Decision Trees (J48 and random forest),
AdaBoost Decision Stumps and Multilayer Perceptron. Classifier performance was
evaluated by 3-, 5- and 10-fold cross-validation (3-CV, 5-CV and 10-CV); all
classifiers were run with WEKA's default parameter values. Results are presented
in Table 1.
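The data layout and evaluation protocol described above can be sketched as follows. This is a minimal, standard-library Python illustration and not the WEKA code used in the study: the helper names (`encode_line`, `stratified_folds`, `cv_error`), the stratified assignment of instances to folds, and the pooled definition of the error are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Class sizes as reported in the text (26 lines in four heterotic groups).
CLASS_SIZES = {"H1": 4, "H2": 8, "H3": 6, "H4": 8}


def encode_line(genotype):
    """Flatten {locus: (allele_a, allele_b)} into two attributes per locus."""
    row = []
    for locus in sorted(genotype):
        allele_a, allele_b = genotype[locus]
        row.extend([allele_a, allele_b])
    return row


def stratified_folds(labels, k, seed=0):
    """Split instance indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members of each class round-robin across the folds.
        for idx in members:
            folds[position % k].append(idx)
            position += 1
    return folds


def cv_error(rows, labels, k, train_and_predict, seed=0):
    """Pooled k-fold error: misclassified test instances / all instances.

    train_and_predict(train, test_rows) is a hypothetical callback that
    receives a list of (row, label) pairs and a list of unlabeled rows,
    and returns one predicted label per test row.
    """
    wrong = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train = [(rows[i], labels[i]) for i in range(len(rows)) if i not in held_out]
        predictions = train_and_predict(train, [rows[i] for i in test_idx])
        wrong += sum(p != labels[i] for p, i in zip(predictions, test_idx))
    return wrong / len(rows)
```

With 21 loci per line, `encode_line` yields the 42 attributes mentioned in the text. Any classifier can be plugged in through `train_and_predict`; a trivial one that always answers "H2" misclassifies 18 of the 26 lines for every choice of k.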
Although our classification results are preliminary, they suggest the usefulness
of a molecular-based machine learning approach for solving general heterotic group
assignment problems. However, we must consider the effect of population structure
(highly divergent parents), which affects linkage disequilibrium between DNA
markers and genes involved in the expression of target traits (Charcosset and
Essioux, 1994). In addition, and based on previous work, we hypothesize that
further application of feature selection methods, i.e., the selection of highly
discriminant molecular markers, might improve heterotic group assignment. This
hypothesis is supported by the observed similarity between classification
problems involving microsatellite markers and those involving microarray data.
In both cases, missing and noisy features might be present in scarce data samples.
This type of classification noise can be limited by feature selection methods so
that the resulting data sets can be safely handled by binary-based, Coding Theory
inspired multiclassifiers (Ornella and Tapia, 2006).
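As an illustration of what selecting highly discriminant markers could look like, the sketch below ranks allele attributes by information gain with respect to the heterotic group. The text does not name a specific feature selection method, so information gain is an assumed, illustrative choice, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in class entropy obtained by splitting on one attribute.

    Each distinct value, including a missing-value marker such as None,
    is treated as its own category.
    """
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, labels):
    """Return (gain, attribute_index) pairs, most informative first."""
    n_attributes = len(rows[0])
    scores = [
        (information_gain([row[j] for row in rows], labels), j)
        for j in range(n_attributes)
    ]
    return sorted(scores, key=lambda score: (-score[0], score[1]))
```

With only 26 instances, such a ranking should be computed on the training portion of each cross-validation fold; ranking on the full dataset before splitting would make the resulting error estimates optimistic.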
Bibliography
dos Santos Dias, LA; de Toledo Picoli, EA; Barros Rocha, R and Couto Alfenas, A (2004) A priori choice of hybrid parents in plants. Genet. Mol. Res. 3:356-368.
Charcosset, A and Essioux, L (1994) The effect of population structure on the relationship between heterosis and heterozygosity at marker loci. Theor. Appl. Genet. 89:336-343.
Morales Yokobori, M; Decker, V and Ornella, LA (2005) Analysis of heterotic maize (Zea mays L.) populations using molecular markers. MNL 79:36.
Nestares, G; Frutos, E and Eyherabide, GH (1999) Evaluación de líneas de maíz flint colorado por aptitud combinatoria. Pesq. agropec. bras. 34:1399-1406.
Ornella, L and Tapia, E (2006) A classification approach for heterotic performance prediction based on molecular marker data. VIII Argentine Symposium on Artificial Intelligence (ASAI 2006), Mendoza, Argentina.
Reif, JC; Melchinger, AE and Frisch, M (2005) Genetical and Mathematical Properties of Similarity and Dissimilarity Coefficients Applied in Plant Breeding and Seed Bank Management. Crop Sci. 45:1-7.
Tollenaar, M; Ahmadzadeh, A and Lee, EA (2004) Physiological Basis of Heterosis for Grain Yield in Maize. Crop Sci. 44:2086-2094.
Witten, IH and Eibe, F (2000) "Data Mining: Practical machine learning tools with Java implementations". Morgan Kaufmann, San Francisco.
Table 1: 3-, 5- and 10-fold CV error on the Heterosis dataset using multiclass classifiers

| Multiclassifier | 3 CV error | 5 CV error | 10 CV error |
|---|---|---|---|
| Naive Bayes | 0.654 | 0.692 | 0.769 |
| SVM-RBF | 0.654 | 0.769 | 0.769 |
| Decision Tree (J48) | 0.808 | 0.769 | 0.769 |
| Decision Tree (random forest) | 0.731 | 0.846 | 0.769 |
| Adaboost-Decision Stump | 0.731 | 0.610 | 0.770 |
| Multilayer Perceptron | 0.770 | 0.770 | 0.692 |
| Error Expected by Chance | 0.774 | 0.774 | 0.774 |
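
The text does not state how the "Error Expected by Chance" row was derived. For comparison only, the sketch below computes three common reference error rates from the class sizes reported in the text (4, 8, 6 and 8 lines). The function name and the choice of baselines are illustrative assumptions, and the resulting values need not coincide with the figure in the table.

```python
from fractions import Fraction

# Class sizes as reported in the text.
CLASS_SIZES = {"H1": 4, "H2": 8, "H3": 6, "H4": 8}


def baseline_errors(class_sizes):
    """Expected error of three naive guessing strategies, as exact fractions."""
    total = sum(class_sizes.values())
    priors = {group: Fraction(n, total) for group, n in class_sizes.items()}
    return {
        # Pick one of the groups uniformly at random.
        "uniform_guess": 1 - Fraction(1, len(class_sizes)),
        # Pick a group at random with probability equal to its frequency.
        "prior_proportional_guess": 1 - sum(p * p for p in priors.values()),
        # Always answer with the largest group.
        "majority_class": 1 - max(priors.values()),
    }


for name, error in baseline_errors(CLASS_SIZES).items():
    print(f"{name}: {float(error):.3f}")
```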