CN103020864A - Corn fine breed breeding method - Google Patents


Info

Publication number
CN103020864A
Authority
CN
China
Prior art keywords: attribute, algorithm, sample, corn, cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105212286A
Other languages
Chinese (zh)
Other versions
CN103020864B (en)
Inventor
邱建林
顾翔
陈建平
季丹
陈燕云
卞彩峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Guangda Seed Industry Co., Ltd.
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN201210521228.6A priority Critical patent/CN103020864B/en
Publication of CN103020864A publication Critical patent/CN103020864A/en
Application granted granted Critical
Publication of CN103020864B publication Critical patent/CN103020864B/en
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for breeding fine corn varieties, comprising the steps of sample set selection, horizontal dimensionality reduction, vertical reduction, and classification prediction. The method is simple and convenient, greatly reduces the labor intensity of manual seed selection, and improves the decision efficiency and accuracy of breeding fine corn varieties.

Description

Selection method of fine corn seeds
Technical field
The present invention relates to a method for selecting fine corn seeds.
Background technology
Raw data are rarely suitable for direct use in data mining; before analysis they must be transformed into features useful to the algorithms. This step, the preparation and transformation of the raw data set, is used throughout data mining and is of considerable importance. Traditional approaches are numerous: simple mapping operations such as variable standardization and discretization; dimensionality-reduction-based feature extraction, selection, and construction methods such as principal component analysis (PCA), nonlinear discriminant analysis, Kohonen mapping, and Sammon projection; and methods drawing on knowledge from other fields, such as fractal techniques, clustering, and support vector machines. The preprocessing methods available are thus numerous and widely applicable.
Clustering is a common data mining analysis tool. Based on the idea that "like attracts like", it divides a mass of data points into classes, or clusters, so that the data within a class are as similar as possible while the data in different classes are as different as possible. Cluster analysis is an unsupervised learning method; its outstanding features are the ability to handle large, complex data sets and to serve as a preprocessing step for other algorithms.
Traditional clustering methods fall into four families: partition-based, hierarchical, density-based, and grid-based. Common classic algorithms include the K-MEANS partitioning algorithm; the CURE [1], BIRCH, and CHAMELEON [2] hierarchical algorithms; the DBSCAN density algorithm; and the STING, WaveCluster, and CLIQUE grid algorithms. Among these, K-MEANS is easy to understand, needs no complicated prior knowledge, and clusters small-scale data well. CURE uses a fixed number of representative points per cluster and can capture clusters of arbitrary shape. BIRCH is fairly effective for convex, spherical data sets of uniform size but is sensitive to some parameters. DBSCAN is flexible, does not require the number of clusters in advance, and handles noise and high-dimensional data well, though it is rather sensitive to its density parameters. STING is often used as a parallel preprocessing step for other algorithms and can improve their efficiency.
This shows that traditional clustering algorithms all have defects, to a greater or lesser degree, in scalability, the data types they handle, parameter sensitivity, and the cluster shapes they can discover, and they increasingly hit bottlenecks when processing high-dimensional data. Improving traditional clustering algorithms by injecting fresh domain knowledge to form modern clustering methods is therefore essential for processing large-scale high-dimensional data. Examples include the model-based COBWEB statistical model, neural network models, and hypergraph models; spectral clustering based on spectral graphs; clustering methods for stream data; and methods combining knowledge from other fields (ant colony algorithms based on genetic algorithms and artificial fish-swarm algorithms, fuzzy clustering algorithms based on fuzzy theory, and so on). Each clustering algorithm has its own merits, drawbacks, and suitable environments, so when selecting one we must match the concrete goal and the data's own characteristics to choose the most suitable algorithm and mine potentially useful rules.
Decision tree algorithms, a branch of classification methods, are among the most widely used logical methods. Their great advantage is that the learning process requires little background knowledge: labeled samples suffice to guide it, and the result is expressed in attribute-conclusion form. This flowchart-like representation reflects the characteristic relations in the data intuitively, so for data sets that do not demand much expert knowledge, decision tree analysis works well. The better-known algorithms at present include ID3, C4.5, CART, SLIQ, SPRINT, and CHAID. All of them have problems to some degree: the information gain measure biases attribute selection; optimal thresholds must be determined when attributes are split; the tree-growing process cannot backtrack and finds only locally optimal results; and different pruning strategies yield different decision trees.
Summary of the invention
The object of the present invention is to provide a simple and effective method for selecting fine corn seeds.
The technical solution of the present invention is as follows:
A method for selecting fine corn seeds, characterized in that it comprises the following steps:
(1) Choosing the sample set
Choose a number of corn sub-varieties and a number of important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied;
(2) Horizontal dimensionality reduction
1) standardize the selected sample set;
2) compute the correlation matrix of the selected attributes, analyze it, and mark the related attribute set M1;
3) compute the coefficient factors to obtain the coefficient matrix of the first several principal components, and mark the related attribute set M2;
4) merge the related sets in M1 and M2 to obtain several strongly correlated attribute groups, the attributes within each group being highly related;
5) according to the resulting related attribute groups, select the corresponding principal component formulas and, from the coefficient factor of each attribute in its formula, determine the attribute's proportion in the component; using these proportions as weights gives the new feature value set used in the subsequent processing;
(3) Vertical reduction
1) outlier detection: detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze the anomalies;
2) grid division: using grid techniques and chosen division parameters, divide the newly formed feature set into grid cells, replacing the original equal-portion division;
3) improved k-means method;
Cluster the data points inside each grid cell with an improved k-means algorithm, which works as follows: first select the initial cluster centers by maximum distance, then run traditional k-means clustering on the data set. This clustering method needs little computation, converges in few iterations, effectively reduces the blindness of cluster center selection, and improves the clustering precision of the algorithm. Using this k-means for small-range local clustering exploits its strength on small-scale data, reduces computational cost, and achieves good local clustering;
4) merge the local clusters to form the final clustering result.
For the many small clusters produced by clustering inside the grid cells, adopt the merging method of the original CURE algorithm: represent each cluster by a fixed number of representative points instead of single data points, find the nearest cluster via a heap data structure, and merge the small clusters obtained in the previous step to obtain the final clustering result;
(4) Classification prediction
1) input the attributes to be judged;
Input the attributes of a corn sample to be judged, apply horizontal dimensionality reduction to the sample point to obtain its new feature value group, and subject the sample point to the subsequent decision tree analysis;
2) determine the categorical attribute;
3) set the default size threshold;
4) discretize the continuous attributes;
Sort the feature values in ascending order; wherever the categorical attribute of the corresponding samples changes, take the midpoint of the two neighboring samples as a candidate split point. Compute the expected information of each split point; the split point with the minimum value can be determined as the optimal split threshold;
5) determine the decision tree root node;
Determine the root node by the C4.5 classification rule;
6) establish the final decision tree;
Following the methods of steps 4) and 5), continue to build the lower subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree;
7) determine the optimum corn parents
According to the final decision tree model and Euclidean distance, select the corn variety most similar to the sample point, and take that variety's parents as the optimum parents for breeding fine seed strains.
Based on the basic characteristics of the original corn sample set and the environments suited to different data mining algorithms, a fused data mining algorithm is proposed. The algorithm consists of three methods: dimensionality reduction, clustering, and decision tree, realizing respectively attribute dimensionality reduction, attribute reduction, and sample classification prediction. Its structure is shown in Fig. 1.
To avoid the influence of different dimensions and research objects on the fused method, the blending algorithm selects the PCA, CURE, and C4.5 algorithms respectively and improves each accordingly.
1. The main ideas of the improved PCA algorithm cover the following two aspects:
(1) Determining the related attributes
In PCA, the contribution rate of an eigenvalue indicates how much of the raw information an attribute carries, embodying the relation between attribute and target, while the correlation coefficients show the degree of correlation between attributes. Combining eigenvalues, eigenvectors, and correlation coefficients makes it possible to select the attributes important to the target, reduce feature redundancy, and realize horizontal dimensionality reduction of the data set.
(2) Describing the feature set
The related attribute groups obtained from the eigenvectors and the correlation coefficients are considered together here to determine suitable related attributes, and each such attribute's proportion in the principal components is introduced as weight information. This simplifies the principal component expressions, forms a new feature set associated with the target, and reduces attribute redundancy.
2. The main ideas of the improved CURE algorithm cover the following two aspects:
(1) Outlier detection
After the data have been standardized and weighted, they basically fall within the range [-1.5, 1.5]. If an attribute value exceeds this range, the sample point is judged abnormal.
(2) Improving the locally divided clustering algorithm
The original local clustering algorithm selects a fixed number of representative points within each equal division and runs CURE clustering in each. Although this guarantees the efficiency of the local divisions, it also increases the overall computation. Introducing an improved k-means algorithm within each local range alleviates this problem: small-range local clustering with this k-means exploits its strength on small-scale data, reduces computational cost, and achieves good local clustering, as the sketch below illustrates.
3. The main ideas of the improved C4.5 algorithm cover the following two aspects:
(1) Selecting the important attributes
When the sample set is too large, the samples are divided evenly among three classifiers so that the important attributes can be selected simultaneously; the measures used are, in turn, the C4.5 method, the Gini index, and the χ² statistic.
(2) Dividing the continuous attributes at the optimal threshold
For the optimal threshold division of continuous attributes, traditional algorithms mostly adopt a self-defined dynamic division, or sort the original attribute values, determine all possible thresholds, and select the maximum-gain division to discretize the corresponding attribute. But the former is not accurate and the latter is computationally expensive, which makes the data harder to process.
Accordingly, the threshold division method mentioned in the literature can be used to simplify the determination of thresholds: the information gain of each separation point is computed, and the optimum threshold found is used to discretize the corresponding continuous attribute.
The method of the invention is simple, greatly reduces the labor intensity of manual fine-variety breeding, and improves the decision efficiency and accuracy of fine corn seed selection.
Description of drawings
The invention is further described below in conjunction with the drawings and embodiments.
Fig. 1 is the structural diagram of the fused data mining algorithm.
Fig. 2 is a schematic diagram of the final decision tree.
Embodiment
A method for selecting fine corn seeds comprises the following steps:
(1) Choosing the sample set
Choose a number of corn sub-varieties and a number of important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied;
(2) Horizontal dimensionality reduction
1) standardize the selected sample set;
2) compute the correlation matrix of the selected attributes, analyze it, and mark the related attribute set M1;
3) compute the coefficient factors to obtain the coefficient matrix of the first several principal components, and mark the related attribute set M2;
4) merge the related sets in M1 and M2 to obtain several strongly correlated attribute groups, the attributes within each group being highly related;
5) according to the resulting related attribute groups, select the corresponding principal component formulas and, from the coefficient factor of each attribute in its formula, determine the attribute's proportion in the component; using these proportions as weights gives the new feature value set used in the subsequent processing;
(3) Vertical reduction
1) outlier detection: detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze the anomalies;
2) grid division: using grid techniques and chosen division parameters, divide the newly formed feature set into grid cells, replacing the original equal-portion division;
3) improved k-means method;
Cluster the data points inside each grid cell with an improved k-means algorithm, which works as follows: first select the initial cluster centers by maximum distance, then run traditional k-means clustering on the data set. This clustering method needs little computation, converges in few iterations, effectively reduces the blindness of cluster center selection, and improves the clustering precision of the algorithm. Using this k-means for small-range local clustering exploits its strength on small-scale data, reduces computational cost, and achieves good local clustering;
4) merge the local clusters to form the final clustering result.
For the many small clusters produced by clustering inside the grid cells, adopt the merging method of the original CURE algorithm: represent each cluster by a fixed number of representative points instead of single data points, find the nearest cluster via a heap data structure, and merge the small clusters obtained in the previous step to obtain the final clustering result;
(4) Classification prediction
1) input the attributes to be judged;
Input the attributes of a corn sample to be judged, apply horizontal dimensionality reduction to the sample point to obtain its new feature value group, and subject the sample point to the subsequent decision tree analysis;
2) determine the categorical attribute;
3) set the default size threshold;
4) discretize the continuous attributes;
Sort the feature values in ascending order; wherever the categorical attribute of the corresponding samples changes, take the midpoint of the two neighboring samples as a candidate split point. Compute the expected information of each split point; the split point with the minimum value can be determined as the optimal split threshold;
5) determine the decision tree root node;
Determine the root node by the C4.5 classification rule;
6) establish the final decision tree;
Following the methods of steps 4) and 5), continue to build the lower subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree;
7) determine the optimum corn parents
According to the final decision tree model and Euclidean distance, select the corn variety most similar to the sample point, and take that variety's parents as the optimum parents for breeding fine seed strains.
Detailed description of each algorithm
(1) Detailed steps of the PCA algorithm
1) standardize the raw data set;
2) compute the correlation matrix of the data set, find the attribute sets with larger correlation, and mark them M1; the correlation coefficient of attributes x and y is computed as follows:
r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² )    (4-1)

where x and y are the attribute values and x̄, ȳ are their respective means.
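For illustration, formula (4-1) and the correlation matrix of step 2) can be computed directly (a minimal Python sketch; the function names are ours):

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient of attribute columns x and y, per formula (4-1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

def correlation_matrix(X):
    """Correlation matrix R of a standardized sample set X
    (rows = samples, columns = attributes)."""
    n_attr = X.shape[1]
    R = np.eye(n_attr)
    for i in range(n_attr):
        for j in range(i):
            R[i, j] = R[j, i] = pearson_r(X[:, i], X[:, j])
    return R
```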
3) compute the eigenvalues of the raw data set and their contribution rates, determine the number of principal components, and at the same time compute the coefficient factor p_ij of each component in the principal components, i.e. the eigenvector entries corresponding to the principal eigenvalues. By analyzing the sizes of the coefficient factors in each principal component, the important related attributes in it can be found; they are grouped into one class and denoted M2;
4) consider the marked main attributes together with the strongly related attributes to determine which main attributes need consideration. The principle is as follows: if M1 and M2 share two or more identical attributes, merge that group of related attributes; if they share fewer than two, treat the related group separately as one group; attributes that appear in neither are not considered.
5) introduce the proportions of the component coefficient factors of the obtained related attributes as weights to form the new feature set. The formulas are as follows:
w_ij = p_ij / p_i,  (i = 1, …, c; j = 1, …, d)    // compute the weights of the d important attributes in each of the c related attribute groups    (4-2)

W_i = (w_ij)_{d×1},  (i = 1, …, c; j = 1, …, d)    // assemble each group's weights into a d×1 matrix    (4-3)

F_i = V × W_i,  (i = 1, …, c)    // compute the c groups of new feature values    (4-4)
Here w_ij is the weight of the j-th of the d related attributes, p_ij is the coefficient of that attribute in the corresponding principal component formula, p_i is the sum of all the group's coefficients in that principal component, W_i is the assembled weight matrix, V is the feature matrix formed by the d attribute values, and F_i is the resulting new feature value matrix of group i.
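Formulas (4-2)-(4-4) can be sketched in Python as follows (this assumes the reading that p_i is the sum of the group's own coefficients in its principal component; the function name and argument layout are ours):

```python
import numpy as np

def group_features(V, p, groups):
    """Build the new feature values F_i per formulas (4-2)-(4-4).

    V      : n x d matrix of (standardized) attribute values
    p      : principal-component coefficient matrix (attributes x components)
    groups : list of (attribute index list, component index) pairs,
             one per related attribute group
    """
    feats = []
    for attrs, comp in groups:
        p_ij = p[attrs, comp]                    # the group's coefficients
        w = (p_ij / p_ij.sum()).reshape(-1, 1)   # (4-2)/(4-3): d x 1 weights
                                                 # (assumes a nonzero coefficient sum)
        feats.append(V[:, attrs] @ w)            # (4-4): n x 1 new feature column
    return np.hstack(feats)
```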
(2) Detailed steps of the CURE algorithm
1) detect exceptional sample points;
Detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze the anomalies.
2) divide the grid;
Introduce grid techniques, select suitable division parameters, and divide the newly formed feature set into grid cells, replacing the original equal-portion division. This technique divides about as well as the original equal-portion division, but for the subsequent improved local k-means clustering it realizes the fusion between the algorithms more effectively.
3) run improved k-means clustering in each local division (see reference [23] for details);
4) run CURE clustering on the local clusters.
For the many small clusters produced by clustering inside the grid cells, adopt the merging idea of the original CURE algorithm (see reference [1] for details): represent each cluster by a fixed number of representative points instead of single data points, find the nearest cluster via a heap data structure, and merge the small clusters obtained in the previous step to obtain the final result. A sketch of this merging step follows.
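The following is a simplified Python sketch of this CURE-style merge (the representative-point count, shrink factor, and full pairwise heap rebuild are our simplifications; the text's heap lookup is more incremental):

```python
import heapq
import numpy as np

def cure_merge(clusters, n_final, n_rep=4, shrink=0.5):
    """Merge small local clusters in the spirit of CURE.

    clusters: list of point arrays from the local k-means step. Each
    cluster is summarized by up to n_rep well-scattered representative
    points shrunk toward its centroid; the two clusters with the closest
    representatives are merged until n_final clusters remain.
    """
    def reps(pts):
        c = pts.mean(axis=0)
        chosen = [pts[np.argmax(np.linalg.norm(pts - c, axis=1))]]
        while len(chosen) < min(n_rep, len(pts)):
            d = np.min([np.linalg.norm(pts - r, axis=1) for r in chosen], axis=0)
            chosen.append(pts[np.argmax(d)])
        return c + (np.array(chosen) - c) * shrink   # shrink toward centroid

    clusters = [np.asarray(c, float) for c in clusters]
    while len(clusters) > n_final:
        R = [reps(c) for c in clusters]
        heap = []
        for i in range(len(clusters)):
            for j in range(i):
                d = min(np.linalg.norm(a - b) for a in R[i] for b in R[j])
                heapq.heappush(heap, (d, i, j))
        _, i, j = heapq.heappop(heap)                # closest pair of clusters
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```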
(3) Detailed steps of the C4.5 algorithm
1) arrange the scale-reduced data set;
This part includes the handling of default values and the determination of the categorical attribute, and is rather important.
2) determine the size threshold;
If the reduced data set is smaller than the size threshold, apply the three attribute selection measures to it directly; otherwise, divide the attribute set randomly among three different classifiers and select the important attributes there.
3) discretize the continuous attributes;
Some simplification mechanisms are adopted here to reduce computational complexity; see reference [24] for details.
4) establish the root node;
According to the classifier in which the data set is placed, select the corresponding attribute selection mechanism, use it to determine the measure of each test attribute, and find the best important attribute as the root node of the decision tree. If two or more of the three strategies yield the same attribute, divide the data set by that attribute; if the resulting attributes all differ, follow the selection strategy of the C4.5 decision tree.
5) establish the lower subtrees;
For the remaining sample set and remaining test attributes, redo the optimal attribute threshold division and the important attribute selection, repeating steps 2) and 3), to build each branch of the lower subtrees. Cycling in this way realizes the construction of the decision tree.
6) form the decision rules.
Verify the generated decision tree with part of the test data in the test set, and average the results of all the trees; this predicts the final result and forms the corresponding classification rules.
An instance analysis of fine corn seed selection
(1) Choosing the sample set
Choose 51 corn sub-varieties and 9 important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied. The attributes are growth period, plant height, ear height, ear length, ear diameter, 1000-kernel mass, kernel row number, kernels per row, and plot yield, marked as attributes 1-9. See Table 4-1.
Table 4-1 The chosen sample set
Variety | Growth period | Plant height | Ear height | Ear length | Ear diameter | 1000-kernel mass | Kernel row number | Kernels per row | Plot yield
Y1 | 100 | 194.8 | 78.9 | 15.63 | 4.19 | 200.8 | 15.8 | 38.1 | 6.73
Y2 | 101 | 229.5 | 93.9 | 18.28 | 4.53 | 269.8 | 15.2 | 40.5 | 7.83
Y3 | 99 | 270.1 | 114.7 | 16.34 | 4.65 | 287.3 | 14.4 | 35.4 | 6.70
Y4 | 101 | 249.0 | 109.7 | 21.46 | 4.04 | 303.5 | 12.8 | 42.8 | 7.54
Y5 | 100 | 229.2 | 96.6 | 16.83 | 4.25 | 255.5 | 15.0 | 35.6 | 7.56
Y6 | 100 | 245.7 | 108.0 | 19.48 | 4.30 | 210.8 | 15.6 | 43.9 | 7.13
Y7 | 100 | 252.5 | 114.2 | 18.92 | 4.29 | 227.0 | 15.0 | 38.7 | 6.38
Y8 | 101 | 249.7 | 109.3 | 18.42 | 4.46 | 324.5 | 13.6 | 39.4 | 9.28
… | | | | | | | | |
Y51 | 102 | 245.4 | 103.3 | 18.00 | 4.23 | 250.3 | 13.4 | 41.0 | 7.49
(2) Horizontal dimensionality reduction
1) standardize the selected sample set;
2) compute the correlation matrix of the 9 selected attributes, see formula (4-5);
R =
[  1.0000
   0.0502   1.0000
   0.2038   0.7924   1.0000
  -0.1602   0.0794  -0.0197   1.0000
   0.1722   0.2256   0.2695  -0.3415   1.0000
   0.2250   0.2629   0.1296   0.1356   0.2498   1.0000
   0.0213  -0.0683  -0.0741  -0.3034   0.4356  -0.3998   1.0000
  -0.0562   0.1924   0.2802   0.5785  -0.2782  -0.0238  -0.4638   1.0000
   0.1442   0.3648   0.2634   0.1046   0.3884   0.5488   0.0395   0.1583   1.0000 ]    (4-5)

(the lower triangle of the symmetric correlation matrix)
Analysis of the correlation matrix gives R[2][1] = 0.7924, showing that attribute group (2, 3) is highly correlated; R[7][3] = 0.5785 shows that group (4, 8) is fairly correlated; and R[8][5] = 0.5488 shows that group (6, 9) is correlated. The related attribute set is then marked M1: {(2,3), (4,8), (6,9)}. A sketch of this marking step follows.
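The marking of M1 can be reproduced mechanically, as in the following Python sketch (the 0.5 cutoff is an assumption inferred from the pairs marked in the text, not a value stated there):

```python
import numpy as np

def mark_m1(R, cutoff=0.5):
    """Group attribute pairs whose |r| exceeds the cutoff, as 1-based numbers."""
    pairs = []
    n = R.shape[0]
    for i in range(n):
        for j in range(i):
            if abs(R[i, j]) >= cutoff:
                pairs.append((j + 1, i + 1))
    return pairs

# With the matrix of (4-5) this yields [(2, 3), (4, 8), (6, 9)],
# matching the M1 marked in the text.
```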
3) The computed eigenvalues are 2.5482, 2.2439, 1.2782, 0.9895, 0.7965, 0.4771, 0.4066, 0.1447, 0.1153, with cumulative contribution rates of 28.314%, 53.235%, 67.448%, 78.442%, 87.292%, 92.593%, ... Since the cumulative contribution rate of the 5th eigenvalue reaches 87.292% > 85%, it can be determined that there are 5 principal components. Computing the coefficient factors gives the coefficient matrix of the first 5 principal components, shown in (4-6):
p =
[  0.1769   0.1897   0.2317  -0.6983   0.5867
   0.4953   0.0428  -0.3890  -0.0136  -0.2644
   0.4728   0.0623  -0.4780  -0.2316  -0.0842
   0.1331  -0.4887   0.0442   0.3369   0.3673
   0.2312   0.4956   0.0346   0.2485   0.0899
   0.3917   0.0125   0.6120   0.0176  -0.2841
  -0.1626   0.4713  -0.2690   0.3470   0.4210
   0.2356  -0.4869  -0.1889   0.0468   0.3683
   0.4396   0.1195   0.2841   0.4004   0.2053 ]    (4-6)
Hence the values 0.4953, 0.4728, and 0.4396 are the largest in the first principal component, from which we learn indirectly that attributes (2, 3, 9) are strongly correlated; reasoning by analogy for the other components, the related attribute set M2 can be marked: {(2,3,9), (4,5,7,8), (3,6), (1), (1)};
4) Considering the M1 and M2 obtained above and merging the related sets yields 3 strongly correlated attribute groups: (2,3,6,9), (4,5,7,8), and (1), i.e. (plant height, ear height, 1000-kernel mass, plot yield), (ear length, ear diameter, kernel row number, kernels per row), and (growth period), the attributes within each group being highly related. At the same time we can see that the main attributes of the data set include plant height, plot yield, ear length, growth period, and so on;
5) According to the resulting related attribute groups, select the corresponding principal component formulas and, from the coefficient factor of each attribute in its formula, determine the attribute's proportion in the component as its weight; this gives the new feature value set for subsequent processing, see formulas (4-7)-(4-9).
F1(2,3,6,9) = (V_plant height, V_ear height, V_1000-kernel mass, V_plot yield)_{n×4} × (w_1i)_{4×1},  (i = 1, …, 4)    (4-7)

F2(4,5,7,8) = (V_ear length, V_ear diameter, V_kernel rows, V_kernels per row)_{n×4} × (w_2i)_{4×1},  (i = 1, …, 4)    (4-8)

F3(1) = (V_growth period)    (4-9)

Here V denotes the feature value column of the given attribute and w denotes the weight of the corresponding attribute in the principal component formula.
(3) Vertical reduction
1) outlier detection;
With the detection range fixed at [-1.5, 1.5], 5 abnormal sample points are detected: Y1, Y30, Y33, Y35, Y45.
2) grid division;
Because the data volume is small, the parameter g = 2 is selected here, i.e. each dimension is divided into two parts; and there are only three new feature values, i.e. three dimensions. The data set can thus be divided into 8 equal grid cells, as the sketch below shows.
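A minimal Python sketch of this grid division (the cell-indexing scheme and function name are our assumptions):

```python
import numpy as np

def grid_cells(F, g=2):
    """Divide the feature set F (n samples x 3 new features) into g^3 cells.

    Each dimension's range is split into g equal parts; a sample's cell is
    the tuple of its part numbers, so g = 2 over three features gives the
    8 cells of the example.
    """
    lo, hi = F.min(axis=0), F.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # guard against a flat dimension
    idx = np.clip(((F - lo) / span * g).astype(int), 0, g - 1)
    cells = {}
    for row, key in zip(F, map(tuple, idx)):
        cells.setdefault(key, []).append(row)
    return cells
```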
3) improved k-means method;
Cluster the data points inside each divided grid cell with the improved k-means. Here the tuning parameter is selected as 3, so the samples in each grid cell are clustered into 3 classes.
4) merge the local clusters to form the final clustering result.
Suppose the number of clusters finally required is 2. The local sub-clusters obtained above can then be merged according to the merging idea of the traditional CURE algorithm, removing the slower-aggregating clusters at the same time. Two final clusters are obtained:
C1{Y2, Y3, Y4, Y6, Y7, Y8, Y9, Y12, Y13, Y14, Y16, Y18, Y21, Y24, Y25, Y26, Y27, Y28, Y36, Y39, Y41, Y43, Y50};
C2{Y5, Y10, Y11, Y15, Y17, Y18, Y19, Y20, Y22, Y28, Y29, Y31, Y32, Y34, Y37, Y40, Y42, Y44, Y46, Y47, Y48, Y49, Y51}.
(4) Classification prediction
1) input the attributes to be judged;
Suppose the judged corn attributes P1(100, 250, 106, 17, 4.3, 250, 15, 36, 8) are input. Applying horizontal dimensionality reduction to this sample point gives the new feature value group (0.3246, -0.0044, -0.4993). Judging the distances from the preprocessed point P1 to the centroids of clusters C1 and C2 shows that the distance from P1 to C1 is smaller than that to C2, so P1 can be assigned to cluster C1. The subsequent decision tree analysis can therefore be carried out on the 23 sample points in cluster C1, as the sketch below illustrates.
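A small Python sketch of this assignment (the centroid coordinates come from the clustering step and are not given in the text):

```python
import numpy as np

def nearer_cluster(p, centroid_c1, centroid_c2):
    """Assign the reduced point p to the cluster with the nearer centroid."""
    d1 = np.linalg.norm(p - centroid_c1)
    d2 = np.linalg.norm(p - centroid_c2)
    return "C1" if d1 <= d2 else "C2"

p1 = np.array([0.3246, -0.0044, -0.4993])   # reduced P1 from the text
```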
2) determine the categorical attribute, as shown in Table 4-2;
Table 4-2 Determination of the categorical attribute
Growth period | Plot yield | Categorical attribute
Short | High | High yield
Short | Low | Medium yield
Long | High | Medium yield
Long | Low | Low yield
The categorical attribute is determined according to our professional knowledge (namely, that the growth period of the corn should be as short as possible while the plot yield is as high as possible); the corn varieties in class 2 are then known to be optimum.
3) set the default size threshold α = 50;
Because the size of the data set is less than the threshold, the whole cluster C1 can be put into the 3 classifiers for computation to determine the optimum selection attributes.
4) discretize the continuous attributes;
Sort the F1 feature values of the 23 samples in C1 in ascending order, as in Table 4-3.
Table 4-3 Ordering of the F1 feature values

No. | Sample | Class | No. | Sample | Class | No. | Sample | Class
1 | Y21 | I | 9 | Y13 | I | 17 | Y26 | I
2 | Y9 | | 10 | Y27 | I | 18 | Y4 |
3 | Y12 | | 11 | Y18 | | 19 | Y3 |
4 | Y41 | | 12 | Y43 | I | 20 | Y16 | I
5 | Y6 | | 13 | Y9 | | 21 | Y28 | I
6 | Y2 | I | 14 | Y14 | I | 22 | Y36 | I
7 | Y25 | | 15 | Y39 | I | 23 | Y8 | I
8 | Y7 | | 16 | Y24 | I | | |
Wherever the categorical attribute of the corresponding samples changes, take the midpoint of the two neighboring samples as a candidate split point. By computing the expected information of each split point, the split point with the minimum value can be determined as the optimal split threshold.
…

I(5,18) = −(5/23)[(4/5)log₂(4/5) + (1/5)log₂(1/5)] − (18/23)[(12/18)log₂(12/18) + (6/18)log₂(6/18)] ≈ 0.8756    (4-10)

I(19,4) = −(19/23)[(9/19)log₂(9/19) + (10/19)log₂(10/19)] − (4/23)[(4/4)log₂(4/4)] ≈ 0.8244    (4-11)
Calculation shows that the optimal split point of attribute F1 is at sample Y3, with split threshold -1.2755. Similarly, the optimal split threshold of continuous attribute F2 is at the 7th sample point, and that of F3 is at the 4th sample point. The split search can be sketched as follows.
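A Python sketch of this split search per formulas (4-10)-(4-11) (function names are ours; it assumes the sorted labels contain at least one class change):

```python
import numpy as np

def expected_info(labels, split):
    """Expected information of splitting sorted labels before position `split`."""
    def entropy(part):
        _, counts = np.unique(part, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    n = len(labels)
    left, right = labels[:split], labels[split:]
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_split(values, labels):
    """Scan candidate splits where the class label changes and return the
    midpoint threshold with minimum expected information."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    info, s = min((expected_info(y, s), s)
                  for s in range(1, len(y)) if y[s] != y[s - 1])
    return (v[s - 1] + v[s]) / 2, info
```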
5) determine the decision tree root node;
The root node is determined by the C4.5 classification rule; the concrete computation process is as follows:
A. compute the expected information of the categorical attribute:
I(13,10) = −(13/23)log₂(13/23) − (10/23)log₂(10/23) ≈ 0.9878    (4-12)
B. compute the split information of attribute F1 based on the classifying division:
Split(F1) = −(19/23)log₂(19/23) − (4/23)log₂(4/23) ≈ 0.6666    (4-13)
C. compute the classification expected information of attribute F1 based on the optimal split threshold:
E(F1) = (19/23)·I(9,10) + (4/23)·I(4,0) ≈ 0.8244    (4-14)
D. compute the information gain ratio of attribute F1:
GainRatio(F1) = [I(13,10) − E(F1)] / Split(F1) ≈ 0.2451    (4-15)
E. Repeating steps B-D, compute the information gain ratios of attributes F2 and F3, which are 0.1148 and 0.3669 respectively. The attribute F3, which has the maximum information gain ratio, can thus be selected. These computations can be reproduced as follows.
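For verification, the numbers in formulas (4-12)-(4-15) can be recomputed (a Python sketch; the helper name I is ours):

```python
import numpy as np

def I(a, b):
    """Expected information of a two-class split with a and b samples."""
    n, out = a + b, 0.0
    for m in (a, b):
        if m:
            out -= m / n * np.log2(m / n)
    return out

info = I(13, 10)                               # ~0.9878, formula (4-12)
split_f1 = I(19, 4)                            # ~0.6666, formula (4-13)
e_f1 = 19 / 23 * I(9, 10) + 4 / 23 * I(4, 0)   # ~0.8244, formula (4-14)
gain_ratio_f1 = (info - e_f1) / split_f1       # ~0.2451, formula (4-15)
```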
6) establish the final decision tree;
Following the computation methods of steps 4) and 5), continue to build the lower subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree.
7) determine the optimum corn parents.
According to the final decision tree model, the input sample point P1(0.3246, -0.0044, -0.4993) can be judged to belong to the medium-yield "II" class of the second layer, which contains four similar samples: Y3, Y6, Y7, Y25.
To select the parental varieties best suited to breeding the judged corn, we can select the corn variety most similar to this sample point according to Euclidean distance, and take that variety's parents as the optimum parents for breeding fine seed strains. Calculation shows that the judged sample is nearest to Y7, so the male parent YCK2 and female parent YCK3 of Y7 can be taken as the optimum parents for breeding the P1 sample, realizing the purpose of fine corn seed selection. The selection step can be sketched as follows.
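A small Python sketch of this final selection (the reduced vectors of the candidate varieties would come from the earlier steps; they are not listed in the text):

```python
import numpy as np

def nearest_variety(p, candidates):
    """Return the candidate variety (name -> reduced feature vector)
    closest to sample p by Euclidean distance."""
    return min(candidates, key=lambda name: np.linalg.norm(p - candidates[name]))

# e.g. nearest_variety(p1, {"Y3": v3, "Y6": v6, "Y7": v7, "Y25": v25}) -> "Y7"
```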

Claims (1)

1. A method for selecting fine corn seeds, characterized in that it comprises the following steps:
(1) Choosing the sample set
Choose a number of corn sub-varieties and a number of important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied;
(2) Horizontal dimensionality reduction
1) standardize the selected sample set;
2) compute the correlation matrix of the selected attributes, analyze it, and mark the related attribute set M1;
3) compute the coefficient factors to obtain the coefficient matrix of the first several principal components, and mark the related attribute set M2;
4) merge the related sets in M1 and M2 to obtain several strongly correlated attribute groups, the attributes within each group being highly related;
5) according to the resulting related attribute groups, select the corresponding principal component formulas and, from the coefficient factor of each attribute in its formula, determine the attribute's proportion in the component; using these proportions as weights gives the new feature value set used in the subsequent processing;
(3) Vertical reduction
1) outlier detection: detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze the anomalies;
2) grid division: using grid techniques and chosen division parameters, divide the newly formed feature set into grid cells, replacing the original equal-portion division;
3) improved k-means method: cluster the data points inside each grid cell with an improved k-means algorithm;
the improved k-means algorithm is as follows: first select the initial cluster centers by maximum distance, then run traditional k-means clustering on the data set;
4) merge the local clusters to form the final clustering result.
For the many small clusters produced by clustering inside the grid cells, adopt the merging method of the original CURE algorithm: represent each cluster by a fixed number of representative points instead of single data points, find the nearest cluster via a heap data structure, and merge the small clusters obtained in the previous step to obtain the final clustering result;
(4) Classification prediction
1) input the attributes to be judged;
Input the attributes of a corn sample to be judged, apply horizontal dimensionality reduction to the sample point to obtain its new feature value group, and subject the sample point to the subsequent decision tree analysis;
2) determine the categorical attribute;
3) set the default size threshold;
4) discretize the continuous attributes;
Sort the feature values in ascending order; wherever the categorical attribute of the corresponding samples changes, take the midpoint of the two neighboring samples as a candidate split point. Compute the expected information of each split point; the split point with the minimum value can be determined as the optimal split threshold;
5) determine the decision tree root node;
Determine the root node by the C4.5 classification rule;
6) establish the final decision tree;
Following the methods of steps 4) and 5), continue to build the lower subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree;
7) determine the optimum corn parents
According to the final decision tree model and Euclidean distance, select the corn variety most similar to the sample point, and take that variety's parents as the optimum parents for breeding fine seed strains.
CN201210521228.6A 2012-12-07 2012-12-07 Corn fine breed breeding method Expired - Fee Related CN103020864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210521228.6A CN103020864B (en) 2012-12-07 2012-12-07 Corn fine breed breeding method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210521228.6A CN103020864B (en) 2012-12-07 2012-12-07 Corn fine breed breeding method

Publications (2)

Publication Number Publication Date
CN103020864A true CN103020864A (en) 2013-04-03
CN103020864B CN103020864B (en) 2014-03-12

Family

ID=47969441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210521228.6A Expired - Fee Related CN103020864B (en) 2012-12-07 2012-12-07 Corn fine breed breeding method

Country Status (1)

Country Link
CN (1) CN103020864B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106613913A (en) * 2016-12-23 2017-05-10 天津农学院 Near infrared-intermediate infrared rapid selection method for corn inbred line combination selection
CN107579866A (en) * 2017-10-25 2018-01-12 重庆电子工程职业学院 A kind of business and Virtual Service intelligent Matching method of wireless dummyization access autonomous management network
CN106577267B (en) * 2016-12-23 2018-07-20 天津农学院 The gas chromatography mass spectrometry rapid screening method of corn inbred line combination selection
CN117933580A (en) * 2024-03-25 2024-04-26 河北省农林科学院农业信息与经济研究所 Breeding material optimization evaluation method for wheat breeding management system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697167A (en) * 2009-10-30 2010-04-21 邱建林 Clustering-decision tree based selection method of fine corn seeds

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697167A (en) * 2009-10-30 2010-04-21 邱建林 Clustering-decision tree based selection method of fine corn seeds

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JI DAN, QIU JIANLIN, DAI XIAOYU, GU XIANG, CHEN LI: "The Application of Information Fusion and Extraction in Maize Seed Breeding", International Conference on ICCE2011, 19 November 2011, pages 477-485 *
JI DAN, QIU JIANLIN, CHEN JIANPING, CHEN LI, HE PENG: "An Improved Decision Tree Algorithm and Its Application in Maize Seed Breeding", 2010 Sixth International Conference on Natural Computation, 10 August 2010, pages 117-121 *
JI DAN, QIU JIANLIN, GU XIANG, CHEN LI, HE PENG: "A Synthesized Data Mining Algorithm Based on Clustering and Decision Tree", 10th IEEE International Conference on Computer and Information Technology, 29 June 2010, pages 2722-2728 *
ZHAO JINGXIAN, NI CHUNPENG, ZHAN YUANRUI, DU ZIPING: "A combined optimization decision tree algorithm for large-scale databases", Systems Engineering and Electronics, vol. 31, no. 3, 31 March 2009, pages 583-587 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106613913A (en) * 2016-12-23 2017-05-10 天津农学院 Near infrared-intermediate infrared rapid selection method for corn inbred line combination selection
CN106577267B (en) * 2016-12-23 2018-07-20 天津农学院 The gas chromatography mass spectrometry rapid screening method of corn inbred line combination selection
CN106613913B (en) * 2016-12-23 2018-07-20 天津农学院 Infrared rapid screening method in the near-infrared-of corn inbred line combination selection
CN107579866A (en) * 2017-10-25 2018-01-12 重庆电子工程职业学院 A kind of business and Virtual Service intelligent Matching method of wireless dummyization access autonomous management network
CN107579866B (en) * 2017-10-25 2019-05-10 重庆电子工程职业学院 A kind of business and Virtual Service intelligent Matching method of wireless dummyization access autonomous management network
CN117933580A (en) * 2024-03-25 2024-04-26 河北省农林科学院农业信息与经济研究所 Breeding material optimization evaluation method for wheat breeding management system
CN117933580B (en) * 2024-03-25 2024-05-31 河北省农林科学院农业信息与经济研究所 Breeding material optimization evaluation method for wheat breeding management system

Also Published As

Publication number Publication date
CN103020864B (en) 2014-03-12

Similar Documents

Publication Publication Date Title
Keerthana et al. An ensemble algorithm for crop yield prediction
CN102750286B (en) A kind of Novel decision tree classifier method processing missing data
CN109886349B (en) A kind of user classification method based on multi-model fusion
CN108776820A (en) It is a kind of to utilize the improved random forest integrated approach of width neural network
CN106503867A (en) A kind of genetic algorithm least square wind power forecasting method
CN103500344A (en) Method and module for extracting and interpreting information of remote-sensing image
CN104809230A (en) Cigarette sensory quality evaluation method based on multi-classifier integration
CN109409647A (en) A kind of analysis method of the salary level influence factor based on random forests algorithm
CN102024179A (en) Genetic algorithm-self-organization map (GA-SOM) clustering method based on semi-supervised learning
Khoshnevisan et al. A clustering model based on an evolutionary algorithm for better energy use in crop production
CN103020864B (en) Corn fine breed breeding method
CN104765839A (en) Data classifying method based on correlation coefficients between attributes
CN110569605A (en) Non-glutinous rice leaf nitrogen content inversion model method based on NSGA2-ELM
Nhita A rainfall forecasting using fuzzy system based on genetic algorithm
CN105930531A (en) Method for optimizing cloud dimensions of agricultural domain ontological knowledge on basis of hybrid models
CN106682915A (en) User cluster analysis method in customer care system
CN107194468A (en) Towards the decision tree Increment Learning Algorithm of information big data
CN105550711A (en) Firefly algorithm based selective ensemble learning method
Abuzir Predict the Main Factors that Affect the Vegetable Production in Palestine Using WEKA Data Mining Tool
Chandana et al. A comprehensive survey of classification algorithms for formulating crop yield prediction using data mining techniques
CN111488520A (en) Crop planting species recommendation information processing device and method and storage medium
Maurya et al. Estimation of major agricultural crop with effective yield prediction using data mining
CN108717551A (en) A kind of fuzzy hierarchy clustering method based on maximum membership degree
Yazdi et al. Hierarchical tree clustering of fuzzy number
CN113222288B (en) Classified evolution and prediction method of village and town community space development map

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: YUNNAN GUANGDA SEED INDUSTRY CO., LTD.

Free format text: FORMER OWNER: NANTONG UNIVERSITY

Effective date: 20150408

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 226019 NANTONG, JIANGSU PROVINCE TO: 674299 LIJIANG, YUNNAN PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20150408

Address after: 674299 Lijiang City, Yunnan province Yongsheng County Beizhen Yong Hua Nan Jie Lane No. 2

Patentee after: Yunnan Guangda Seed Industry Co., Ltd.

Address before: 226019, No. 9 Seyuan Road, Nantong, Jiangsu Province

Patentee before: Nantong University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140312

Termination date: 20201207

CF01 Termination of patent right due to non-payment of annual fee