Summary of the invention
The object of the present invention is to provide a simple and effective method for selecting fine corn seeds.
The technical solution of the present invention is as follows:
A method for selecting fine corn seeds, characterized in that it comprises the following steps:
(1) Choosing the sample set
Choose a plurality of corn sub-varieties and a plurality of important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied;
(2) Horizontal dimensionality reduction
1) Standardize the selected sample set;
2) Compute the correlation matrix of the selected attributes, analyze it, and mark the correlated attribute set M1;
3) Compute the coefficient factors to obtain the coefficient matrix of the first several principal components, and mark the correlated attribute set M2;
4) Merge the related sets in M1 and M2 to obtain several groups of strongly correlated attributes; the attributes within each group are highly correlated with one another;
5) According to the correlated attribute groups obtained, select the corresponding principal-component formulas; from the coefficient factor of each attribute in its formula, determine the proportion that the attribute occupies in the corresponding component and use it as a weight. This yields a new feature value set for subsequent processing;
(3) Vertical reduction
1) Outlier detection: detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze them as anomalies;
2) Grid division: using a grid technique, select the division parameters and divide the newly formed feature set into grid cells, replacing the original equal-portion division;
3) Improved k-means method:
Cluster the data points within each grid cell with an improved k-means algorithm. The improved algorithm first selects the initial cluster centers by maximum distance and then clusters the data set with the traditional k-means algorithm. Such clustering requires little computation and few iterations, effectively reduces the blindness of cluster-center selection, and improves the clustering precision of the algorithm. Introducing this k-means algorithm for small-range local clustering not only gives full play to the strength of k-means on small-scale data, but also reduces computation and achieves good local clustering.
4) Merge the local clusters to form the final clustering result.
For the many small clusters obtained by clustering within the grid cells, adopt the merging method of the original CURE clustering algorithm: replace the original single data points with a fixed number of representative points, find the nearest cluster by means of a heap data structure, and then merge the small clusters obtained in the previous step to obtain the final clustering result;
(4) Classification prediction
1) Input the attributes to be judged:
Input the attributes of a corn sample to be judged, apply horizontal dimensionality reduction to this sample point to obtain its new feature value group, and subject the sample point to the subsequent decision tree analysis;
2) Determine the categorical attribute;
3) Set the default size threshold;
4) Discretize the continuous attributes:
Sort the feature values in ascending order; wherever the categorical attribute of adjacent samples changes, take the point between the two samples as a candidate division point. Compute the expected information of each division point; the division point with the minimum value can then be determined as the optimal division threshold;
5) Determine the decision tree root node:
Determine the root node by the C4.5 classification rule;
6) Build the final decision tree:
Following the methods of steps 4) and 5), continue to build the lower-level subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree;
7) Determine the optimal corn parents:
According to the final decision-tree model and the Euclidean distance, select the corn variety most similar to the sample point, and take the parents of this variety as the optimal parents for breeding a fine seed strain.
Based on the basic characteristics of the original corn sample set and the environments suited to different data mining algorithms, a fused data mining algorithm is proposed. The algorithm consists of three methods: dimensionality reduction, clustering, and decision trees, which respectively realize attribute dimensionality reduction, attribute reduction, and sample classification prediction. Its structure is shown in Fig. 1.
To avoid the effects of differing dimensions and of merging research objects, the fusion algorithm selects the PCA, CURE, and C4.5 algorithms respectively and improves each of them accordingly.
1. The main ideas of the improved PCA algorithm comprise the following two aspects:
(1) Determination of correlated attributes
In PCA, the contribution rate of an eigenvalue represents how much of the raw information an attribute carries, reflecting the relation between attribute and target, while the correlation coefficient shows the degree of correlation between attributes. By combining eigenvalues, eigenvectors, and correlation coefficients, the important attributes useful to the target can be selected effectively, the redundancy of the features reduced, and horizontal dimensionality reduction of the data set realized.
(2) Description of the feature set
Here the correlated attribute groups obtained from the eigenvectors and correlation coefficients are considered in order to determine suitable correlated attributes, and the proportions these attributes occupy in the principal components are introduced as weight information. This simplifies the principal-component expressions, forms a new feature set associated with the target, and reduces attribute redundancy.
2. The main ideas of the improved CURE algorithm comprise the following two aspects:
(1) Outlier detection
After standardization and weighting, the data are basically concentrated in the range [-1.5, 1.5]. If an attribute value exceeds this range, the sample point is judged abnormal.
(2) Improvement of the locally divided clustering algorithm
The original local clustering algorithm selects a fixed number of representative points within each equally divided range and runs CURE clustering on each. Although this guarantees the efficiency of local clustering, it also increases the overall computation. Introducing an improved k-means algorithm within each local range alleviates this problem: performing small-range local clustering with this k-means algorithm not only gives full play to the strength of k-means on small-scale data, but also reduces computation and achieves good local clustering.
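A minimal sketch of the maximum-distance initialization described above, in Python (the function names and the toy data are our own illustration, not part of the invention):

```python
import math

def max_distance_centers(points, k):
    """Pick k initial centers: start with the two farthest points,
    then repeatedly add the point farthest from its nearest center."""
    best, best_d = (0, 1), -1.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d > best_d:
                best_d, best = d, (i, j)
    centers = [points[best[0]], points[best[1]]]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    return centers

def kmeans(points, k, iters=20):
    """Traditional k-means, but seeded by maximum-distance centers."""
    centers = max_distance_centers(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # recompute each center as the mean of its cluster (keep old if empty)
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

Because the two seeds are maximally separated, well-separated groups are recovered with few iterations on small cell-local data sets.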
3. The main ideas of the improved C4.5 algorithm comprise the following two aspects:
(1) Selection of important attributes
When the sample set is too large, the samples are divided evenly among three classifiers for simultaneous selection of important attributes, using as measures the C4.5 criterion, the Gini index, and the χ2 statistic respectively.
(2) Division of the optimal threshold of continuous attributes
On the problem of dividing the optimal threshold of a continuous attribute, traditional algorithms mostly adopt self-defined dynamic division, or sort the original attribute values, determine all possible thresholds, and select the division with maximum gain to discretize the corresponding attribute. But the former is not accurate, and the latter is computationally complex, which makes data processing considerably more difficult.
Therefore, the threshold division method mentioned in the literature is followed to simplify the determination of thresholds: the information gain of each candidate separation is computed and the optimal threshold found, discretizing the corresponding continuous attribute.
The method of the invention is simple, greatly reduces the labor of manual fine-variety breeding, and improves the efficiency of decision-making and the accuracy of fine corn seed selection.
Embodiment
A method for selecting fine corn seeds comprises the following steps:
(1) Choosing the sample set
Choose a plurality of corn sub-varieties and a plurality of important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied;
(2) Horizontal dimensionality reduction
1) Standardize the selected sample set;
2) Compute the correlation matrix of the selected attributes, analyze it, and mark the correlated attribute set M1;
3) Compute the coefficient factors to obtain the coefficient matrix of the first several principal components, and mark the correlated attribute set M2;
4) Merge the related sets in M1 and M2 to obtain several groups of strongly correlated attributes; the attributes within each group are highly correlated with one another;
5) According to the correlated attribute groups obtained, select the corresponding principal-component formulas; from the coefficient factor of each attribute in its formula, determine the proportion that the attribute occupies in the corresponding component and use it as a weight. This yields a new feature value set for subsequent processing;
(3) Vertical reduction
1) Outlier detection: detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze them as anomalies;
2) Grid division: using a grid technique, select the division parameters and divide the newly formed feature set into grid cells, replacing the original equal-portion division;
3) Improved k-means method:
Cluster the data points within each grid cell with an improved k-means algorithm. The improved algorithm first selects the initial cluster centers by maximum distance and then clusters the data set with the traditional k-means algorithm. Such clustering requires little computation and few iterations, effectively reduces the blindness of cluster-center selection, and improves the clustering precision of the algorithm. Introducing this k-means algorithm for small-range local clustering not only gives full play to the strength of k-means on small-scale data, but also reduces computation and achieves good local clustering.
4) Merge the local clusters to form the final clustering result. For the many small clusters obtained by clustering within the grid cells, adopt the merging method of the original CURE clustering algorithm: replace the original single data points with a fixed number of representative points, find the nearest cluster by means of a heap data structure, and then merge the small clusters obtained in the previous step to obtain the final clustering result;
(4) Classification prediction
1) Input the attributes to be judged:
Input the attributes of a corn sample to be judged, apply horizontal dimensionality reduction to this sample point to obtain its new feature value group, and subject the sample point to the subsequent decision tree analysis;
2) Determine the categorical attribute;
3) Set the default size threshold;
4) Discretize the continuous attributes:
Sort the feature values in ascending order; wherever the categorical attribute of adjacent samples changes, take the point between the two samples as a candidate division point. Compute the expected information of each division point; the division point with the minimum value can then be determined as the optimal division threshold;
5) Determine the decision tree root node:
Determine the root node by the C4.5 classification rule;
6) Build the final decision tree:
Following the methods of steps 4) and 5), continue to build the lower-level subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree;
7) Determine the optimal corn parents:
According to the final decision-tree model and the Euclidean distance, select the corn variety most similar to the sample point, and take the parents of this variety as the optimal parents for breeding a fine seed strain.
Description of each part of the algorithm
(1) Detailed description of the PCA algorithm steps
1) Standardize the raw data set;
2) Compute the correlation matrix of the data set and find the attribute sets with larger correlation, marked M1. The correlation coefficient of attributes x and y is computed as
r(x, y) = Σi (xi − x̄)(yi − ȳ) / sqrt( Σi (xi − x̄)² · Σi (yi − ȳ)² ) (4-1)
where xi and yi are the attribute values, and x̄ and ȳ are the means of attributes x and y respectively.
3) Compute the eigenvalues of the raw data set and their contribution rates to determine the number of principal components, and at the same time compute the coefficient factor pij of each component in the principal components, i.e. the eigenvector entries corresponding to the principal eigenvalues. By analyzing the magnitudes of the coefficient factors in each principal component, the important correlated attributes in each principal component can be found; they are classed together and denoted M2;
4) Considering the marked main attributes and the more strongly correlated attributes, the main attributes that require consideration can be determined intelligently, on the following principle: if two or more identical attributes appear in both M1 and M2, the corresponding groups of correlated attributes are merged; if fewer than two, each group of correlated attributes is considered separately as its own group. Attributes that appear in neither set are not considered.
5) Introduce the proportions of the coefficient factors of the obtained correlated attributes as weights, to form the new feature set. The formulas are as follows:
wij = pij / pi, (i = 1, …, c; j = 1, …, d); // weights of the d important attributes in the i-th of the c correlated attribute groups (4-2)
Wi = (wij)d×1, (i = 1, …, c; j = 1, …, d); // assemble the weights of group i, obtaining a d × 1 matrix for each group (4-3)
Fi = V · Wi, (i = 1, …, c); // compute the c groups of new feature values (4-4)
where wij is the weight of the j-th correlated attribute, pij is the coefficient of the j-th correlated attribute in the corresponding principal-component formula, pi is the sum of all coefficients in the i-th principal-component group, Wi is the assembled weight matrix, V is the feature matrix formed by the d attribute values, and Fi is the i-th new feature value matrix.
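As an illustration, the weighting scheme of step 5) (formulas (4-2)-(4-4)) can be sketched in Python with NumPy; the function names, the toy data, and the loading values below are our own assumptions, not taken from the invention:

```python
import numpy as np

def correlation_matrix(X):
    """Correlation matrix R of the columns of X (shape n x d),
    computed from the standardized data as in step 2)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return (Xs.T @ Xs) / len(X)

def weighted_features(X, groups, loadings):
    """One new feature column per correlated attribute group:
    w_ij = p_ij / sum_j p_ij  (proportion of the group's coefficients),
    F_i  = V . W_i            (weighted combination of the group)."""
    feats = []
    for idx, p in zip(groups, loadings):
        p = np.abs(np.asarray(p, dtype=float))
        w = p / p.sum()                # proportions as weights
        feats.append(X[:, idx] @ w)    # d attribute columns -> 1 feature
    return np.column_stack(feats)
```

For example, a group of two attributes with loadings (0.6, 0.4) yields the single feature 0.6·x1 + 0.4·x2 for each sample.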
(2) Detailed description of the CURE algorithm steps
1) Detection of abnormal sample points:
Detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze them as anomalies.
2) Grid division:
Introduce a grid technique, select suitable division parameters, and divide the newly formed feature set into grid cells, replacing the original equal-portion division. This technique is comparable in effect to the original equal-portion division, but for the subsequent improved local k-means clustering it realizes the fusion between the algorithms more effectively.
3) Apply improved k-means clustering to each local division; see document [23] for details;
4) Apply CURE clustering to the local clusters.
For the many small clusters obtained by clustering within the grid cells, adopt the merging idea of the original CURE clustering algorithm (see document [1] for details): replace the original single data points with a fixed number of representative points, find the nearest cluster by means of a heap data structure, and then merge the small clusters obtained in the previous step to obtain the final result.
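A simplified sketch of the merging step follows. For brevity it measures cluster distance over all member points instead of a fixed number of representative points, and scans pairs instead of using a heap, so it illustrates the merging idea rather than the full CURE procedure:

```python
import math

def cluster_distance(c1, c2):
    """Distance between two clusters = closest pair of their points
    (a simplification of CURE's fixed set of representative points)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def merge_clusters(clusters, target):
    """Greedily merge the two closest clusters until `target` remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:
        best, best_d = None, float('inf')
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i].extend(clusters.pop(j))  # absorb the nearer cluster
    return clusters
```

In the full algorithm the pairwise scan is replaced by a heap keyed on each cluster's distance to its nearest neighbor, which is what makes the merging phase efficient.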
(3) Detailed description of the C4.5 algorithm steps
1) Arrange the data set after scale reduction;
This part includes the handling of default values and the determination of the categorical attribute, which are rather important.
2) Determine the size threshold:
If the reduced data set is smaller than the size threshold, apply the metric analysis of the three attribute-selection measures to the data set directly; otherwise, divide the attribute set randomly among three different classifiers to carry out the selection of important attributes.
3) Discretize the continuous attributes:
Some simplification mechanisms are adopted here to reduce the computational complexity; see document [24] for details.
4) Establishment of the root node:
According to the classifier in which a data subset is placed, select the corresponding attribute-selection mechanism, determine the metric of each test attribute with it, and find the best important attribute as the root node of the decision tree. If two or more of the attributes obtained by the three strategies are identical, divide the data set according to that attribute; if the attributes obtained all differ, proceed according to the selection strategy of the C4.5 decision tree.
5) Establishment of the lower-level subtrees:
For the remaining sample set and the remaining test attributes, redo the division of the optimal attribute thresholds and the selection of important attributes, similarly to repeating steps 2) and 3), to build the branches of each lower-level subtree. Cycling in this way realizes the building of the decision tree.
6) Formation of the decision rules.
Use part of the test data in the test set to verify the generated decision tree, and average the results of all the trees; the final result can thus be predicted, forming the corresponding classification rules.
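The C4.5 gain-ratio measure referred to above can be sketched as follows (the function names and toy values are our own; in the other two classifiers the Gini index or the χ2 statistic would play the role of `gain_ratio`):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Information gain ratio of splitting a continuous attribute
    at `threshold` (<= goes left), as in C4.5."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    expected = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - expected          # information gain
    split_info = entropy(['L'] * len(left) + ['R'] * len(right))
    return gain / split_info if split_info else 0.0
```

A perfectly class-separating threshold gives expected entropy 0, so the gain equals the entropy of the whole label set.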
An example analysis of fine corn seed selection
(1) Choosing the sample set
Choose 51 corn sub-varieties and 9 important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied. The attributes are growth period, plant height, ear height, ear length, ear diameter, 1000-kernel weight, ear row number, kernels per row, and plot yield, labeled attributes 1-9. See Table 4-1.
Table 4-1 The chosen sample set

| Variety | Growth period | Plant height | Ear height | Ear length | Ear diameter | 1000-kernel weight | Ear row number | Kernels per row | Plot yield |
|---|---|---|---|---|---|---|---|---|---|
| Y1 | 100 | 194.8 | 78.9 | 15.63 | 4.19 | 200.8 | 15.8 | 38.1 | 6.73 |
| Y2 | 101 | 229.5 | 93.9 | 18.28 | 4.53 | 269.8 | 15.2 | 40.5 | 7.83 |
| Y3 | 99 | 270.1 | 114.7 | 16.34 | 4.65 | 287.3 | 14.4 | 35.4 | 6.70 |
| Y4 | 101 | 249.0 | 109.7 | 21.46 | 4.04 | 303.5 | 12.8 | 42.8 | 7.54 |
| Y5 | 100 | 229.2 | 96.6 | 16.83 | 4.25 | 255.5 | 15.0 | 35.6 | 7.56 |
| Y6 | 100 | 245.7 | 108.0 | 19.48 | 4.30 | 210.8 | 15.6 | 43.9 | 7.13 |
| Y7 | 100 | 252.5 | 114.2 | 18.92 | 4.29 | 227.0 | 15.0 | 38.7 | 6.38 |
| Y8 | 101 | 249.7 | 109.3 | 18.42 | 4.46 | 324.5 | 13.6 | 39.4 | 9.28 |
| … | … | … | … | … | … | … | … | … | … |
| Y51 | 102 | 245.4 | 103.3 | 18.00 | 4.23 | 250.3 | 13.4 | 41.0 | 7.49 |
(2) Horizontal dimensionality reduction
1) Standardize the selected sample set;
2) Compute the correlation matrix of the 9 selected attributes; see formula (4-5);
Analysis of the correlation matrix gives R[2][1] = 0.7924, showing that attribute group (2, 3) is highly correlated; R[7][3] = 0.5785 shows that attribute group (4, 8) is fairly highly correlated; and R[8][5] = 0.5488 shows that attribute group (6, 9) is fairly correlated. The correlated attribute set is then marked M1: {(2,3), (4,8), (6,9)}.
3) The eigenvalues are computed as 2.5482, 2.2439, 1.2782, 0.9895, 0.7965, 0.4771, 0.4066, 0.1447, 0.1153, with cumulative contribution rates of 28.314%, 53.235%, 67.448%, 78.442%, 87.292%, 92.593%, …. Since the cumulative contribution rate at the 5th eigenvalue reaches 87.292% > 85%, it can be determined that there are 5 principal components. By computing the coefficient factors, the coefficient matrix of the first 5 principal components is obtained, as shown in (4-6):
From this, the values 0.4953, 0.4728, and 0.4396 are larger in the first principal component, implying that attributes (2, 3, 9) are more strongly correlated; by analogy, the correlated attribute set can be marked M2: {(2,3,9), (4,5,7,8), (3,6), (1), (1)};
4) Considering M1 and M2 obtained above and merging the related sets, 3 groups of strongly correlated attributes are obtained: (2,3,6,9), (4,5,7,8), and (1), i.e. (plant height, ear height, 1000-kernel weight, plot yield), (ear length, ear diameter, ear row number, kernels per row), and (growth period); the attributes within each group are highly correlated with one another. At the same time, it can also be seen that the main attributes of the data set include plant height, plot yield, ear length, growth period, etc.;
5) According to the correlated attribute groups obtained, select the corresponding principal-component formulas; from the coefficient factors of the attributes in the formulas, determine the proportions the attributes occupy in the corresponding components and use them as weights. This yields the new feature value set for subsequent processing; see formulas (4-7)-(4-9).
F1(2,3,6,9) = (V(plant height), V(ear height), V(1000-kernel weight), V(plot yield))n×4 · (w1i)4×1, (i = 1, …, 4) (4-7)
F2(4,5,7,8) = (V(ear length), V(ear diameter), V(ear row number), V(kernels per row))n×4 · (w2i)4×1, (i = 1, …, 4) (4-8)
F3(1) = (V(growth period)) (4-9)
where V denotes the feature values of the corresponding attribute column and w denotes the weight of the corresponding attribute in the principal-component formula.
(3) Vertical reduction
1) Outlier detection:
With the detection range fixed at [-1.5, 1.5], 5 abnormal sample points are detected: Y1, Y30, Y33, Y35, Y45.
2) Grid division:
Because the data volume is small, the parameter g = 2 is selected here, i.e. each dimension is divided into two parts; and there are only three new feature values, i.e. three dimensions. The data set can thus be divided into 8 equal grid cells.
3) Improved k-means method:
Cluster the data points within each grid cell with the improved k-means. Here the tuning parameter is chosen as 3, and the samples in each cell are clustered into 3 classes.
4) Merge the local clusters to form the final clustering result.
Supposing the number of clusters finally required is 2, the local sub-clusters obtained above can be merged according to the merging idea of the traditional CURE algorithm, removing at the same time the clusters that aggregate more slowly. Two final clusters are obtained:
C1 {Y2, Y3, Y4, Y6, Y7, Y8, Y9, Y12, Y13, Y14, Y16, Y18, Y21, Y24, Y25, Y26, Y27, Y28, Y36, Y39, Y41, Y43, Y50};
C2 {Y5, Y10, Y11, Y15, Y17, Y18, Y19, Y20, Y22, Y28, Y29, Y31, Y32, Y34, Y37, Y40, Y42, Y44, Y46, Y47, Y48, Y49, Y51}.
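The grid division of step 2) can be sketched as follows (with g = 2 per dimension and three feature dimensions, giving up to 2³ = 8 cells; the function name and the toy points are our own illustration):

```python
def grid_cells(points, g=2):
    """Partition points into up to g^dim axis-aligned grid cells.
    Each point maps to a tuple of per-dimension cell indices."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    cells = {}
    for p in points:
        key = tuple(
            # cell index = offset / cell width, clamped so the maximum
            # point falls in the last cell; width 1 if the dim is constant
            min(int((p[d] - lo[d]) / ((hi[d] - lo[d]) / g or 1)), g - 1)
            for d in range(dims)
        )
        cells.setdefault(key, []).append(p)
    return cells
```

Each non-empty cell is then clustered locally before the CURE-style merge.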
(4) Classification prediction
1) Input the attributes to be judged:
Suppose the input attributes of a corn sample to be judged are P1 (100, 250, 106, 17, 4.3, 250, 15, 36, 8). Horizontal dimensionality reduction of this sample point gives the new feature value group (0.3246, -0.0044, -0.4993). By judging the distances from the pretreated point P1 to the centroids of clusters C1 and C2, the distance from P1 to C1 is less than that from P1 to C2, so P1 can be assigned to cluster C1. The subsequent decision tree analysis is therefore applied to the 23 sample points in cluster C1.
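The centroid-distance assignment just described can be sketched as follows (hypothetical helper names and toy clusters, not the actual C1/C2 data):

```python
import math

def centroid(cluster):
    """Component-wise mean of a cluster's points."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def assign(point, clusters):
    """Index of the cluster whose centroid is nearest (Euclidean)."""
    return min(range(len(clusters)),
               key=lambda i: math.dist(point, centroid(clusters[i])))
```

The new sample then inherits the cluster of the nearer centroid before the decision-tree step.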
2) Determine the categorical attribute, as shown in Table 4-2;
Table 4-2 Determination of the categorical attribute

| Growth period | Plot yield | Categorical attribute | Meaning |
|---|---|---|---|
| Short | High | I | High yield |
| Short | Low | II | Medium yield |
| Long | High | II | Medium yield |
| Long | Low | III | Low yield |
The categorical attribute is determined according to our professional knowledge (namely, that the growth period of the corn should be as short as possible while the plot yield is as high as possible); it then follows that the corn varieties in class 2 are optimal.
3) Set the default size threshold α = 50;
Because the number of samples in the data set is less than the threshold, the whole cluster C1 can be put into the 3 classifiers for computation, to determine the optimal selection attributes.
4) Discretize the continuous attributes:
Sort the F1 feature values of the 23 samples in C1 in ascending order, as in Table 4-3.
Table 4-3 Ordering of the F1 feature values

| No. | Sample | Class | No. | Sample | Class | No. | Sample | Class |
|---|---|---|---|---|---|---|---|---|
| 1 | Y21 | I | 9 | Y13 | I | 17 | Y26 | I |
| 2 | Y9 | II | 10 | Y27 | I | 18 | Y4 | II |
| 3 | Y12 | II | 11 | Y18 | II | 19 | Y3 | II |
| 4 | Y41 | II | 12 | Y43 | I | 20 | Y16 | I |
| 5 | Y6 | II | 13 | Y9 | II | 21 | Y28 | I |
| 6 | Y2 | I | 14 | Y14 | I | 22 | Y36 | I |
| 7 | Y25 | II | 15 | Y39 | I | 23 | Y8 | I |
| 8 | Y7 | II | 16 | Y24 | I |  |  |  |
Wherever the categorical attribute of adjacent samples changes, take the point between the two samples as a candidate division point. Compute the expected information of each division point; the division point with the minimum value can then be determined as the optimal division threshold.
…
The computation finds that the optimal division point of attribute F1 is at sample Y3, with division threshold -1.2755. Similarly, the optimal division threshold of continuous attribute F2 is at the 7th sample point, and that of F3 is at the 4th sample point.
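A sketch of the division-point scan described in step 4) follows: candidate thresholds lie at class changes in the sorted order, and each is scored by its expected (weighted) entropy. The names and toy data are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Sort by value; candidate thresholds lie between adjacent samples
    whose class labels differ; pick the minimum expected entropy."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float('inf')
    for i in range(len(pairs) - 1):
        if pairs[i][1] != pairs[i + 1][1]:           # class change
            t = (pairs[i][0] + pairs[i + 1][0]) / 2  # point between the two
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if e < best_e:
                best_e, best_t = e, t
    return best_t, best_e
```

Restricting candidates to class changes is what simplifies the threshold determination relative to trying every adjacent pair.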
5) Determine the decision tree root node:
The root node is determined by the C4.5 classification rule; the concrete computation proceeds as follows:
A. Compute the expected information of the categorical attribute.
B. Compute the information entropy of attribute F1 based on the class division.
C. Compute the classification expected information of attribute F1 based on the optimal division threshold.
D. Compute the information gain ratio of attribute F1.
E. Repeating steps B-D, compute the information gain ratios of attributes F2 and F3, which are 0.1148 and 0.3669 respectively. Thus the attribute F3, whose information gain ratio value is the maximum, can be selected.
6) Build the final decision tree:
Following the computation methods of steps 4) and 5), continue to build the lower-level subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree.
7) Determine the optimal corn parents.
According to the final decision-tree model, the input sample point P1 (0.3246, -0.0044, -0.4993) is judged to belong to the medium-yield class "II" of the second layer, and there are four similar samples: Y3, Y6, Y7, Y25.
To select the parent varieties most suitable for cultivating this judged corn, the corn variety most similar to this sample point can be selected according to the Euclidean distance, and the parents of this variety taken as the optimal parents for breeding a fine seed strain. Computation shows that this judged sample is nearest to Y7, so the male parent YCK2 and female parent YCK3 of Y7 can be taken as the optimal parents for cultivating the P1 sample, realizing the purpose of fine corn seed selection.