Summary of the invention
The object of the present invention is to provide a simple and effective method for selecting fine corn seeds.
The technical solution of the present invention is as follows:
A method for selecting fine corn seeds, characterized in that it comprises the following steps:
(1) Choosing the sample set
Choose a plurality of corn sub-varieties and a plurality of important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied;
(2) Horizontal dimensionality reduction
1) Standardize the selected sample set;
2) Compute the correlation matrix of the selected attributes, analyze it, and mark the correlated attribute set M1;
3) Compute the coefficient factors to obtain the coefficient matrix of the first several principal components, and mark the correlated attribute set M2;
4) Merge the related sets in M1 and M2 to obtain several groups of strongly correlated attributes; the attributes within each group are highly correlated with one another;
5) According to the correlated attribute groups obtained, select the corresponding principal-component formulas; from the coefficient factor of each attribute in its formula, determine the proportion that the attribute occupies in the corresponding component and use it as a weight. This yields a new feature value set for subsequent processing;
(3) Vertical reduction
1) Outlier detection: detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze them as anomalies;
2) Grid division: using a grid technique, select the division parameters and divide the newly formed feature set into grid cells, replacing the original equal-portion division;
3) Improved k-means method:
Cluster the data points within each grid cell with an improved k-means algorithm. The improved algorithm first selects the initial cluster centers by maximum distance and then clusters the data set with the traditional k-means algorithm. Such clustering requires little computation and few iterations, effectively reduces the blindness of cluster-center selection, and improves the clustering precision of the algorithm. Introducing this k-means algorithm for small-range local clustering not only gives full play to the strength of k-means on small-scale data, but also reduces computation and achieves good local clustering.
4) Merge the local clusters to form the final clustering result.
For the many small clusters obtained by clustering within the grid cells, adopt the merging method of the original CURE clustering algorithm: replace the original single data points with a fixed number of representative points, find the nearest cluster by means of a heap data structure, and then merge the small clusters obtained in the previous step to obtain the final clustering result;
(4) Classification prediction
1) Input the attributes to be judged:
Input the attributes of a corn sample to be judged, apply horizontal dimensionality reduction to this sample point to obtain its new feature value group, and subject the sample point to the subsequent decision tree analysis;
2) Determine the categorical attribute;
3) Set the default size threshold;
4) Discretize the continuous attributes:
Sort the feature values in ascending order; wherever the categorical attribute of adjacent samples changes, take the point between the two samples as a candidate division point. Compute the expected information of each division point; the division point with the minimum value can then be determined as the optimal division threshold;
5) Determine the decision tree root node:
Determine the root node by the C4.5 classification rule;
6) Build the final decision tree:
Following the methods of steps 4) and 5), continue to build the lower-level subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree;
7) Determine the optimal corn parents:
According to the final decision-tree model and the Euclidean distance, select the corn variety most similar to the sample point, and take the parents of this variety as the optimal parents for breeding a fine seed strain.
Based on the basic characteristics of the original corn sample set and the environments suited to different data mining algorithms, a fused data mining algorithm is proposed. The algorithm consists of three methods: dimensionality reduction, clustering, and decision trees, which respectively realize attribute dimensionality reduction, attribute reduction, and sample classification prediction. Its structure is shown in Fig. 1.
To avoid the effects of differing dimensions and of merging research objects, the fusion algorithm selects the PCA, CURE, and C4.5 algorithms respectively and improves each of them accordingly.
1. The main ideas of the improved PCA algorithm comprise the following two aspects:
(1) Determination of correlated attributes
In PCA, the contribution rate of an eigenvalue represents how much of the raw information an attribute carries, reflecting the relation between attribute and target, while the correlation coefficient shows the degree of correlation between attributes. By combining eigenvalues, eigenvectors, and correlation coefficients, the important attributes useful to the target can be selected effectively, the redundancy of the features reduced, and horizontal dimensionality reduction of the data set realized.
(2) Description of the feature set
Here the correlated attribute groups obtained from the eigenvectors and correlation coefficients are considered in order to determine suitable correlated attributes, and the proportions these attributes occupy in the principal components are introduced as weight information. This simplifies the principal-component expressions, forms a new feature set associated with the target, and reduces attribute redundancy.
2. The main ideas of the improved CURE algorithm comprise the following two aspects:
(1) Outlier detection
After standardization and weighting, the data are basically concentrated in the range [-1.5, 1.5]. If an attribute value exceeds this range, the sample point is judged abnormal.
(2) Improvement of the locally divided clustering algorithm
The original local clustering algorithm selects a fixed number of representative points within each equally divided range and runs CURE clustering on each. Although this guarantees the efficiency of local clustering, it also increases the overall computation. Introducing an improved k-means algorithm within each local range alleviates this problem: performing small-range local clustering with this k-means algorithm not only gives full play to the strength of k-means on small-scale data, but also reduces computation and achieves good local clustering.
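A minimal sketch of the maximum-distance initialization described above, in Python (the function names and the toy data are our own illustration, not part of the invention):

```python
import math

def max_distance_centers(points, k):
    """Pick k initial centers: start with the two farthest points,
    then repeatedly add the point farthest from its nearest center."""
    best, best_d = (0, 1), -1.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d > best_d:
                best_d, best = d, (i, j)
    centers = [points[best[0]], points[best[1]]]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    return centers

def kmeans(points, k, iters=20):
    """Traditional k-means, but seeded by maximum-distance centers."""
    centers = max_distance_centers(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # recompute each center as the mean of its cluster (keep old if empty)
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

Because the two seeds are maximally separated, well-separated groups are recovered with few iterations on small cell-local data sets.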
3. The main ideas of the improved C4.5 algorithm comprise the following two aspects:
(1) Selection of important attributes
When the sample set is too large, the samples are divided evenly among three classifiers for simultaneous selection of important attributes, using as measures the C4.5 criterion, the Gini index, and the χ2 statistic respectively.
(2) Division of the optimal threshold of continuous attributes
On the problem of dividing the optimal threshold of a continuous attribute, traditional algorithms mostly adopt self-defined dynamic division, or sort the original attribute values, determine all possible thresholds, and select the division with maximum gain to discretize the corresponding attribute. But the former is not accurate, and the latter is computationally complex, which makes data processing considerably more difficult.
Therefore, the threshold division method mentioned in the literature is followed to simplify the determination of thresholds: the information gain of each candidate separation is computed and the optimal threshold found, discretizing the corresponding continuous attribute.
The method of the invention is simple, greatly reduces the labor of manual fine-variety breeding, and improves the efficiency of decision-making and the accuracy of fine corn seed selection.
Embodiment
A method for selecting fine corn seeds comprises the following steps:
(1) Choosing the sample set
Choose a plurality of corn sub-varieties and a plurality of important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied;
(2) Horizontal dimensionality reduction
1) Standardize the selected sample set;
2) Compute the correlation matrix of the selected attributes, analyze it, and mark the correlated attribute set M1;
3) Compute the coefficient factors to obtain the coefficient matrix of the first several principal components, and mark the correlated attribute set M2;
4) Merge the related sets in M1 and M2 to obtain several groups of strongly correlated attributes; the attributes within each group are highly correlated with one another;
5) According to the correlated attribute groups obtained, select the corresponding principal-component formulas; from the coefficient factor of each attribute in its formula, determine the proportion that the attribute occupies in the corresponding component and use it as a weight. This yields a new feature value set for subsequent processing;
(3) Vertical reduction
1) Outlier detection: detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze them as anomalies;
2) Grid division: using a grid technique, select the division parameters and divide the newly formed feature set into grid cells, replacing the original equal-portion division;
3) Improved k-means method:
Cluster the data points within each grid cell with an improved k-means algorithm. The improved algorithm first selects the initial cluster centers by maximum distance and then clusters the data set with the traditional k-means algorithm. Such clustering requires little computation and few iterations, effectively reduces the blindness of cluster-center selection, and improves the clustering precision of the algorithm. Introducing this k-means algorithm for small-range local clustering not only gives full play to the strength of k-means on small-scale data, but also reduces computation and achieves good local clustering.
4) Merge the local clusters to form the final clustering result. For the many small clusters obtained by clustering within the grid cells, adopt the merging method of the original CURE clustering algorithm: replace the original single data points with a fixed number of representative points, find the nearest cluster by means of a heap data structure, and then merge the small clusters obtained in the previous step to obtain the final clustering result;
(4) Classification prediction
1) Input the attributes to be judged:
Input the attributes of a corn sample to be judged, apply horizontal dimensionality reduction to this sample point to obtain its new feature value group, and subject the sample point to the subsequent decision tree analysis;
2) Determine the categorical attribute;
3) Set the default size threshold;
4) Discretize the continuous attributes:
Sort the feature values in ascending order; wherever the categorical attribute of adjacent samples changes, take the point between the two samples as a candidate division point. Compute the expected information of each division point; the division point with the minimum value can then be determined as the optimal division threshold;
5) Determine the decision tree root node:
Determine the root node by the C4.5 classification rule;
6) Build the final decision tree:
Following the methods of steps 4) and 5), continue to build the lower-level subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree;
7) Determine the optimal corn parents:
According to the final decision-tree model and the Euclidean distance, select the corn variety most similar to the sample point, and take the parents of this variety as the optimal parents for breeding a fine seed strain.
Description of each part of the algorithm
(1) Detailed description of the PCA algorithm steps
1) Standardize the raw data set;
2) Compute the correlation matrix of the data set and find the attribute sets with larger correlation, marked M1. The correlation coefficient of attributes x and y is computed as
r(x, y) = Σi (xi − x̄)(yi − ȳ) / sqrt( Σi (xi − x̄)² · Σi (yi − ȳ)² ) (4-1)
where xi and yi are the attribute values, and x̄ and ȳ are the means of attributes x and y respectively.
3) Compute the eigenvalues of the raw data set and their contribution rates to determine the number of principal components, and at the same time compute the coefficient factor pij of each component in the principal components, i.e. the eigenvector entries corresponding to the principal eigenvalues. By analyzing the magnitudes of the coefficient factors in each principal component, the important correlated attributes in each principal component can be found; they are classed together and denoted M2;
4) Considering the marked main attributes and the more strongly correlated attributes, the main attributes that require consideration can be determined intelligently, on the following principle: if two or more identical attributes appear in both M1 and M2, the corresponding groups of correlated attributes are merged; if fewer than two, each group of correlated attributes is considered separately as its own group. Attributes that appear in neither set are not considered.
5) Introduce the proportions of the coefficient factors of the obtained correlated attributes as weights, to form the new feature set. The formulas are as follows:
wij = pij / pi, (i = 1, …, c; j = 1, …, d); // weights of the d important attributes in the i-th of the c correlated attribute groups (4-2)
Wi = (wij)d×1, (i = 1, …, c; j = 1, …, d); // assemble the weights of group i, obtaining a d × 1 matrix for each group (4-3)
Fi = V · Wi, (i = 1, …, c); // compute the c groups of new feature values (4-4)
where wij is the weight of the j-th correlated attribute, pij is the coefficient of the j-th correlated attribute in the corresponding principal-component formula, pi is the sum of all coefficients in the i-th principal-component group, Wi is the assembled weight matrix, V is the feature matrix formed by the d attribute values, and Fi is the i-th new feature value matrix.
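As an illustration, the weighting scheme of step 5) (formulas (4-2)-(4-4)) can be sketched in Python with NumPy; the function names, the toy data, and the loading values below are our own assumptions, not taken from the invention:

```python
import numpy as np

def correlation_matrix(X):
    """Correlation matrix R of the columns of X (shape n x d),
    computed from the standardized data as in step 2)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return (Xs.T @ Xs) / len(X)

def weighted_features(X, groups, loadings):
    """One new feature column per correlated attribute group:
    w_ij = p_ij / sum_j p_ij  (proportion of the group's coefficients),
    F_i  = V . W_i            (weighted combination of the group)."""
    feats = []
    for idx, p in zip(groups, loadings):
        p = np.abs(np.asarray(p, dtype=float))
        w = p / p.sum()                # proportions as weights
        feats.append(X[:, idx] @ w)    # d attribute columns -> 1 feature
    return np.column_stack(feats)
```

For example, a group of two attributes with loadings (0.6, 0.4) yields the single feature 0.6·x1 + 0.4·x2 for each sample.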
(2) Detailed description of the CURE algorithm steps
1) Detection of abnormal sample points:
Detect the special sample points in the new feature set that fall outside the range [-1.5, 1.5] and analyze them as anomalies.
2) Grid division:
Introduce a grid technique, select suitable division parameters, and divide the newly formed feature set into grid cells, replacing the original equal-portion division. This technique is comparable in effect to the original equal-portion division, but for the subsequent improved local k-means clustering it realizes the fusion between the algorithms more effectively.
3) Apply improved k-means clustering to each local division; see document [23] for details;
4) Apply CURE clustering to the local clusters.
For the many small clusters obtained by clustering within the grid cells, adopt the merging idea of the original CURE clustering algorithm (see document [1] for details): replace the original single data points with a fixed number of representative points, find the nearest cluster by means of a heap data structure, and then merge the small clusters obtained in the previous step to obtain the final result.
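A simplified sketch of the merging step follows. For brevity it measures cluster distance over all member points instead of a fixed number of representative points, and scans pairs instead of using a heap, so it illustrates the merging idea rather than the full CURE procedure:

```python
import math

def cluster_distance(c1, c2):
    """Distance between two clusters = closest pair of their points
    (a simplification of CURE's fixed set of representative points)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def merge_clusters(clusters, target):
    """Greedily merge the two closest clusters until `target` remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:
        best, best_d = None, float('inf')
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i].extend(clusters.pop(j))  # absorb the nearer cluster
    return clusters
```

In the full algorithm the pairwise scan is replaced by a heap keyed on each cluster's distance to its nearest neighbor, which is what makes the merging phase efficient.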
(3) Detailed description of the C4.5 algorithm steps
1) Arrange the data set after scale reduction;
This part includes the handling of default values and the determination of the categorical attribute, which are rather important.
2) Determine the size threshold:
If the reduced data set is smaller than the size threshold, apply the metric analysis of the three attribute-selection measures to the data set directly; otherwise, divide the attribute set randomly among three different classifiers to carry out the selection of important attributes.
3) Discretize the continuous attributes:
Some simplification mechanisms are adopted here to reduce the computational complexity; see document [24] for details.
4) Establishment of the root node:
According to the classifier in which a data subset is placed, select the corresponding attribute-selection mechanism, determine the metric of each test attribute with it, and find the best important attribute as the root node of the decision tree. If two or more of the attributes obtained by the three strategies are identical, divide the data set according to that attribute; if the attributes obtained all differ, proceed according to the selection strategy of the C4.5 decision tree.
5) Establishment of the lower-level subtrees:
For the remaining sample set and the remaining test attributes, redo the division of the optimal attribute thresholds and the selection of important attributes, similarly to repeating steps 2) and 3), to build the branches of each lower-level subtree. Cycling in this way realizes the building of the decision tree.
6) Formation of the decision rules.
Use part of the test data in the test set to verify the generated decision tree, and average the results of all the trees; the final result can thus be predicted, forming the corresponding classification rules.
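The C4.5 gain-ratio measure referred to above can be sketched as follows (the function names and toy values are our own; in the other two classifiers the Gini index or the χ2 statistic would play the role of `gain_ratio`):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Information gain ratio of splitting a continuous attribute
    at `threshold` (<= goes left), as in C4.5."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    expected = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - expected          # information gain
    split_info = entropy(['L'] * len(left) + ['R'] * len(right))
    return gain / split_info if split_info else 0.0
```

A perfectly class-separating threshold gives expected entropy 0, so the gain equals the entropy of the whole label set.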
An example analysis of fine corn seed selection
(1) Choosing the sample set
Choose 51 corn sub-varieties and 9 important attributes from the original corn sample set as the objects of analysis, to which the fused data mining algorithm is applied. The attributes are growth period, plant height, ear height, ear length, ear diameter, 1000-kernel weight, ear row number, kernels per row, and plot yield, labeled attributes 1-9. See Table 4-1.
Table 4-1 The chosen sample set

| Variety | Growth period | Plant height | Ear height | Ear length | Ear diameter | 1000-kernel weight | Ear row number | Kernels per row | Plot yield |
|---|---|---|---|---|---|---|---|---|---|
| Y1 | 100 | 194.8 | 78.9 | 15.63 | 4.19 | 200.8 | 15.8 | 38.1 | 6.73 |
| Y2 | 101 | 229.5 | 93.9 | 18.28 | 4.53 | 269.8 | 15.2 | 40.5 | 7.83 |
| Y3 | 99 | 270.1 | 114.7 | 16.34 | 4.65 | 287.3 | 14.4 | 35.4 | 6.70 |
| Y4 | 101 | 249.0 | 109.7 | 21.46 | 4.04 | 303.5 | 12.8 | 42.8 | 7.54 |
| Y5 | 100 | 229.2 | 96.6 | 16.83 | 4.25 | 255.5 | 15.0 | 35.6 | 7.56 |
| Y6 | 100 | 245.7 | 108.0 | 19.48 | 4.30 | 210.8 | 15.6 | 43.9 | 7.13 |
| Y7 | 100 | 252.5 | 114.2 | 18.92 | 4.29 | 227.0 | 15.0 | 38.7 | 6.38 |
| Y8 | 101 | 249.7 | 109.3 | 18.42 | 4.46 | 324.5 | 13.6 | 39.4 | 9.28 |
| … | … | … | … | … | … | … | … | … | … |
| Y51 | 102 | 245.4 | 103.3 | 18.00 | 4.23 | 250.3 | 13.4 | 41.0 | 7.49 |
(2) Horizontal dimensionality reduction
1) Standardize the selected sample set;
2) Compute the correlation matrix of the 9 selected attributes; see formula (4-5);
Analysis of the correlation matrix gives R[2][1] = 0.7924, showing that attribute group (2, 3) is highly correlated; R[7][3] = 0.5785 shows that attribute group (4, 8) is fairly highly correlated; and R[8][5] = 0.5488 shows that attribute group (6, 9) is fairly correlated. The correlated attribute set is then marked M1: {(2,3), (4,8), (6,9)}.
3) The eigenvalues are computed as 2.5482, 2.2439, 1.2782, 0.9895, 0.7965, 0.4771, 0.4066, 0.1447, 0.1153, with cumulative contribution rates of 28.314%, 53.235%, 67.448%, 78.442%, 87.292%, 92.593%, …. Since the cumulative contribution rate at the 5th eigenvalue reaches 87.292% > 85%, it can be determined that there are 5 principal components. By computing the coefficient factors, the coefficient matrix of the first 5 principal components is obtained, as shown in (4-6):
From this, the values 0.4953, 0.4728, and 0.4396 are larger in the first principal component, implying that attributes (2, 3, 9) are more strongly correlated; by analogy, the correlated attribute set can be marked M2: {(2,3,9), (4,5,7,8), (3,6), (1), (1)};
4) Considering M1 and M2 obtained above and merging the related sets, 3 groups of strongly correlated attributes are obtained: (2,3,6,9), (4,5,7,8), and (1), i.e. (plant height, ear height, 1000-kernel weight, plot yield), (ear length, ear diameter, ear row number, kernels per row), and (growth period); the attributes within each group are highly correlated with one another. At the same time, it can also be seen that the main attributes of the data set include plant height, plot yield, ear length, growth period, etc.;
5) According to the correlated attribute groups obtained, select the corresponding principal-component formulas; from the coefficient factors of the attributes in the formulas, determine the proportions the attributes occupy in the corresponding components and use them as weights. This yields the new feature value set for subsequent processing; see formulas (4-7)-(4-9).
F1(2,3,6,9) = (V(plant height), V(ear height), V(1000-kernel weight), V(plot yield))n×4 · (w1i)4×1, (i = 1, …, 4) (4-7)
F2(4,5,7,8) = (V(ear length), V(ear diameter), V(ear row number), V(kernels per row))n×4 · (w2i)4×1, (i = 1, …, 4) (4-8)
F3(1) = (V(growth period)) (4-9)
where V denotes the feature values of the corresponding attribute column and w denotes the weight of the corresponding attribute in the principal-component formula.
(3) Vertical reduction
1) Outlier detection:
With the detection range fixed at [-1.5, 1.5], 5 abnormal sample points are detected: Y1, Y30, Y33, Y35, Y45.
2) Grid division:
Because the data volume is small, the parameter g = 2 is selected here, i.e. each dimension is divided into two parts; and there are only three new feature values, i.e. three dimensions. The data set can thus be divided into 8 equal grid cells.
3) Improved k-means method:
Cluster the data points within each grid cell with the improved k-means. Here the tuning parameter is chosen as 3, and the samples in each cell are clustered into 3 classes.
4) Merge the local clusters to form the final clustering result.
Supposing the number of clusters finally required is 2, the local sub-clusters obtained above can be merged according to the merging idea of the traditional CURE algorithm, removing at the same time the clusters that aggregate more slowly. Two final clusters are obtained:
C1 {Y2, Y3, Y4, Y6, Y7, Y8, Y9, Y12, Y13, Y14, Y16, Y18, Y21, Y24, Y25, Y26, Y27, Y28, Y36, Y39, Y41, Y43, Y50};
C2 {Y5, Y10, Y11, Y15, Y17, Y18, Y19, Y20, Y22, Y28, Y29, Y31, Y32, Y34, Y37, Y40, Y42, Y44, Y46, Y47, Y48, Y49, Y51}.
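The grid division of step 2) can be sketched as follows (with g = 2 per dimension and three feature dimensions, giving up to 2³ = 8 cells; the function name and the toy points are our own illustration):

```python
def grid_cells(points, g=2):
    """Partition points into up to g^dim axis-aligned grid cells.
    Each point maps to a tuple of per-dimension cell indices."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    cells = {}
    for p in points:
        key = tuple(
            # cell index = offset / cell width, clamped so the maximum
            # point falls in the last cell; width 1 if the dim is constant
            min(int((p[d] - lo[d]) / ((hi[d] - lo[d]) / g or 1)), g - 1)
            for d in range(dims)
        )
        cells.setdefault(key, []).append(p)
    return cells
```

Each non-empty cell is then clustered locally before the CURE-style merge.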
(4) Classification prediction
1) Input the attributes to be judged:
Suppose the input attributes of a corn sample to be judged are P1 (100, 250, 106, 17, 4.3, 250, 15, 36, 8). Horizontal dimensionality reduction of this sample point gives the new feature value group (0.3246, -0.0044, -0.4993). By judging the distances from the pretreated point P1 to the centroids of clusters C1 and C2, the distance from P1 to C1 is less than that from P1 to C2, so P1 can be assigned to cluster C1. The subsequent decision tree analysis is therefore applied to the 23 sample points in cluster C1.
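The centroid-distance assignment just described can be sketched as follows (hypothetical helper names and toy clusters, not the actual C1/C2 data):

```python
import math

def centroid(cluster):
    """Component-wise mean of a cluster's points."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def assign(point, clusters):
    """Index of the cluster whose centroid is nearest (Euclidean)."""
    return min(range(len(clusters)),
               key=lambda i: math.dist(point, centroid(clusters[i])))
```

The new sample then inherits the cluster of the nearer centroid before the decision-tree step.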
2) Determine the categorical attribute, as shown in Table 4-2;
Table 4-2 Determination of the categorical attribute

| Growth period | Plot yield | Categorical attribute | Meaning |
|---|---|---|---|
| Short | High | I | High yield |
| Short | Low | II | Medium yield |
| Long | High | II | Medium yield |
| Long | Low | III | Low yield |
The categorical attribute is determined according to our professional knowledge (namely, that the growth period of the corn should be as short as possible while the plot yield is as high as possible); it then follows that the corn varieties in class 2 are optimal.
3) Set the default size threshold α = 50;
Because the number of samples in the data set is less than the threshold, the whole cluster C1 can be put into the 3 classifiers for computation, to determine the optimal selection attributes.
4) Discretize the continuous attributes:
Sort the F1 feature values of the 23 samples in C1 in ascending order, as in Table 4-3.
Table 4-3 Ordering of the F1 feature values

| No. | Sample | Class | No. | Sample | Class | No. | Sample | Class |
|---|---|---|---|---|---|---|---|---|
| 1 | Y21 | I | 9 | Y13 | I | 17 | Y26 | I |
| 2 | Y9 | II | 10 | Y27 | I | 18 | Y4 | II |
| 3 | Y12 | II | 11 | Y18 | II | 19 | Y3 | II |
| 4 | Y41 | II | 12 | Y43 | I | 20 | Y16 | I |
| 5 | Y6 | II | 13 | Y9 | II | 21 | Y28 | I |
| 6 | Y2 | I | 14 | Y14 | I | 22 | Y36 | I |
| 7 | Y25 | II | 15 | Y39 | I | 23 | Y8 | I |
| 8 | Y7 | II | 16 | Y24 | I |  |  |  |
Wherever the categorical attribute of adjacent samples changes, take the point between the two samples as a candidate division point. Compute the expected information of each division point; the division point with the minimum value can then be determined as the optimal division threshold.
…
The computation finds that the optimal division point of attribute F1 is at sample Y3, with division threshold -1.2755. Similarly, the optimal division threshold of continuous attribute F2 is at the 7th sample point, and that of F3 is at the 4th sample point.
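A sketch of the division-point scan described in step 4) follows: candidate thresholds lie at class changes in the sorted order, and each is scored by its expected (weighted) entropy. The names and toy data are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Sort by value; candidate thresholds lie between adjacent samples
    whose class labels differ; pick the minimum expected entropy."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float('inf')
    for i in range(len(pairs) - 1):
        if pairs[i][1] != pairs[i + 1][1]:           # class change
            t = (pairs[i][0] + pairs[i + 1][0]) / 2  # point between the two
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if e < best_e:
                best_e, best_t = e, t
    return best_t, best_e
```

Restricting candidates to class changes is what simplifies the threshold determination relative to trying every adjacent pair.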
5) Determine the decision tree root node:
The root node is determined by the C4.5 classification rule; the concrete computation proceeds as follows:
A. Compute the expected information of the categorical attribute.
B. Compute the information entropy of attribute F1 based on the class division.
C. Compute the classification expected information of attribute F1 based on the optimal division threshold.
D. Compute the information gain ratio of attribute F1.
E. Repeating steps B-D, compute the information gain ratios of attributes F2 and F3, which are 0.1148 and 0.3669 respectively. Thus the attribute F3, whose information gain ratio value is the maximum, can be selected.
6) Build the final decision tree:
Following the computation methods of steps 4) and 5), continue to build the lower-level subtrees of the decision tree until all sample points are classified, thereby obtaining the final classification decision tree.
7) Determine the optimal corn parents.
According to the final decision-tree model, the input sample point P1 (0.3246, -0.0044, -0.4993) is judged to belong to the medium-yield class "II" of the second layer, and there are four similar samples: Y3, Y6, Y7, Y25.
To select the parent varieties most suitable for cultivating this judged corn, the corn variety most similar to this sample point can be selected according to the Euclidean distance, and the parents of this variety taken as the optimal parents for breeding a fine seed strain. Computation shows that this judged sample is nearest to Y7, so the male parent YCK2 and female parent YCK3 of Y7 can be taken as the optimal parents for cultivating the P1 sample, realizing the purpose of fine corn seed selection.