CN108877947A

CN108877947A - Deep sample learning method based on iterative mean clustering

Info

Publication number: CN108877947A
Application number: CN201810558766.XA
Authority: CN
Inventors: 李勇明; 郑源林; 王品; 颜芳; 张�成; 李新科
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2018-06-01
Filing date: 2018-06-01
Publication date: 2018-11-23
Anticipated expiration: 2038-06-01
Also published as: CN108877947B

Abstract

The invention discloses a deep sample learning method based on iterative mean clustering, comprising the following steps. S1: select training data and process it with N iterations of a mean clustering algorithm to obtain N+1 layers of training sample subsets, N ≥ 1. S2: perform regression training on each layer's training sample subset independently to obtain N+1 regressors. S3: select validation data and feed it to each of the N+1 regressors to obtain N+1 validation results. S4: determine the optimal weights (w₀, w₁, …, w_N) of the regressors based on a weighted fusion mechanism. S5: obtain test data and compute the final prediction result using the N+1 regressors and their optimal weights. The effect is that the learning samples are turned into different training sample sets through multiple iterations of mean clustering, each of which is then trained and learned separately; for the same number of samples, this effectively improves the learning ability of the model and the accuracy of classification or prediction.

Description

Deep sample learning method based on iterative mean clustering

Technical field

The present invention relates to artificial intelligence technology, and in particular to a deep sample learning method based on iterative mean clustering.

Background art

With the development of artificial intelligence technology, sample learning approaches have become increasingly varied, and the quality of the sample learning method strongly affects the accuracy of subsequent classification and regression.

Most artificial intelligence algorithms in the prior art learn and train on a single sample data set. On the one hand, because the number of learning samples that can be obtained directly is limited, the performance of the classifier or regressor can only be enhanced by increasing the number of iterations, and the effect is limited. On the other hand, the authenticity of the existing learning samples can also seriously affect the performance of the trained model; if all learning samples are treated identically, it is difficult to prevent pseudo samples from degrading model performance.

To avoid the influence of pseudo samples, online learning mechanisms have also been proposed. For example, Chinese patent 201010166225.6 discloses an adaptive cascade classifier training method based on online learning. An initial cascade classifier is first trained with a small number of samples and then used for target detection in images; because there are few training samples, the initial detection performance of the classifier is poor. Online learning samples are then extracted automatically by tracking, and an adaptive cascade classifier algorithm performs online learning on the initial cascade classifier, gradually improving the precision with which the classifier detects targets in images. Because the new samples for online learning are obtained and labeled automatically through tracking, the classifier training process becomes more intelligent and the workload of manually labeling sample classes is significantly reduced.

However, the online learning mechanism must extract new learning samples step by step, which increases algorithm complexity; moreover, improving algorithm performance takes a relatively long time, and the initial performance is relatively poor.

Summary of the invention

To solve the above problems, the present invention provides a deep sample learning method based on iterative mean clustering. During the learning of a classifier or regressor, the original samples are divided into multiple levels by iterative mean clustering, and a separate classifier or regressor is trained for each layer. Each is then verified on a validation data set to obtain the weight of each regressor, which ensures that the characteristics of the sample data are learned and exploited to the greatest extent and improves the accuracy of the model's recognition or classification.

To achieve the above object, the specific technical solution of the present invention is as follows:

A deep sample learning method based on iterative mean clustering, characterized by the following steps:

S1: Select training data and process it with N iterations of the mean clustering algorithm to obtain N+1 layers of training sample subsets, N ≥ 1;

S2: Perform regression training on each layer's training sample subset independently to obtain N+1 regressors;

S3: Select validation data; first compute the Euclidean-distance similarity between each validation sample and the sample space of each layer, so as to convert the validation sample into the most similar sample in that layer's sample space, and then feed these samples to the N+1 regressors respectively to obtain N+1 validation results;

S4: Determine the optimal weight of each regressor, (w₀, w₁, …, w_N), based on a weighted fusion mechanism;

S5: Obtain test data; first compute the Euclidean-distance similarity between each test sample and the sample space of each layer, so as to convert the test sample into the most similar sample in that layer's sample space; then feed these samples to the N+1 regressors obtained in step S2 and combine their outputs using the optimal weights obtained in step S4 to obtain the final prediction result.

Further, the constraint condition for determining the optimal weights (w₀, w₁, …, w_N) is: [formula missing in source]

Optionally, the iterative mean clustering algorithm uses K-means clustering.

Optionally, the regressor uses a support vector regression model whose kernel function is a linear kernel or a radial basis function kernel.

Optionally, the test data are medical data of the subject under test, the training data and validation data are selected from public databases such as UCI, each sample includes multiple features, and the prediction result is the label value (an integer or a floating-point number) of the subject under test.

Optionally, the test data are medical data of the subject under test, the training data and validation data are selected from the diabetes data or heart disease data in public databases such as UCI, each sample includes multiple features, and the prediction result is the age of the subject under test.

Optionally, the performance of the prediction algorithm is evaluated with the mean absolute error (MAE), defined as

$$\mathrm{MAE} = \frac{1}{M} \sum_{j=1}^{M} \left| a_j - a'_j \right|$$

where M is the number of samples in the test data, a_j is the actual value of the j-th test sample, and a'_j is the predicted value of the j-th test sample.

The notable effect of the invention is as follows:

The method turns the learning samples into different training sample sets through multiple iterations of mean clustering and then trains and learns on each set separately. For the same number of samples, this layer-by-layer training and learning effectively improves the learning ability of the model and the accuracy of classification or regression.

Brief description of the drawings

Fig. 1 shows the deep sample learning model proposed by the present invention;

Fig. 2 shows the iterative mean clustering model in Fig. 1;

Fig. 3 shows the age prediction results of the specific embodiment.

Detailed description of the embodiments

The technical solution of the present invention is described in detail below with reference to the drawings and an embodiment. The following embodiment is intended only to illustrate the technical solution more clearly; it is therefore only an example and must not be used to limit the scope of protection of the invention.

It should be noted that, unless otherwise indicated, technical or scientific terms used in this application have the ordinary meaning understood by those of ordinary skill in the art to which the invention belongs.

This embodiment is described in detail using age prediction as the goal. Samples are selected from two data sets in the UCI database (http://archive.ics.uci.edu/): the diabetes data set, abbreviated DM (Diabetes Mellitus Data Set), and the heart disease data set, abbreviated HD (Heart Disease Data Set). The heart disease data set contains 137 normal samples with 14 features each; the diabetes data set contains 268 normal samples with 8 features each. Details of the two data sets are given in Table 1.

Table 1: Basic information of the data sets

Data set	Number of samples	Age range (years)	Mean age (years)	Age standard deviation (years)
HD	137	34~71	52.71	9.14
DM	268	21~66	29.94	10.51

For each data set, the samples are randomly divided into a training set, a validation set, and a test set 100 times, yielding 100 groups of samples. In this experiment, the computer runs 64-bit Windows 10 with 8 GB of memory, and the experimental platform is MATLAB 2016a. For ease of subsequent analysis and explanation, the algorithm proposed in this embodiment is referred to as PAEM and the traditional algorithm as TAEM. The proposed method can be combined with different regression models, feature selection algorithms, sample optimization algorithms, and evaluation criteria, and can thus be turned into various other specific algorithms. This embodiment uses a support vector regression model with a linear kernel and default parameters as the regressor.
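The repeated random splitting described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used in the experiment; the function name `random_splits` and the 60/20/20 split fractions are assumptions, since the text does not state the proportions.

```python
import numpy as np

def random_splits(n_samples, n_repeats=100, fractions=(0.6, 0.2, 0.2), seed=0):
    """Yield (train_idx, val_idx, test_idx) for repeated random splits.

    The 60/20/20 fractions are an assumption; the source gives no proportions.
    """
    rng = np.random.default_rng(seed)
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        # The test set takes whatever remains after training and validation.
        yield perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

# 137 is the number of HD samples listed in Table 1.
splits = list(random_splits(137))
```

Each element of `splits` holds three disjoint index arrays that together cover every sample exactly once, so each of the 100 groups can be evaluated independently.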

As shown in Fig. 1, the specific steps are as follows (note: the validation set and test set in the figure are the results after combination with the deep sample space):

S1: Select training data and process it with 2 iterations of the mean clustering algorithm to obtain 3 layers of training sample subsets;

S2: Perform regression training on each layer's training sample subset independently to obtain 3 regressors;

S3: Select validation data; first compute the Euclidean-distance similarity between each validation sample and the sample space of each layer, so as to convert the validation sample into the most similar sample in that layer's sample space, and then feed these samples to the 3 regressors respectively to obtain 3 validation results;

S4: Determine the optimal weight of each regressor, (w₀, w₁, w₂), based on a weighted fusion mechanism;

S5: Obtain test data; first compute the Euclidean-distance similarity between each test sample and the sample space of each layer, so as to convert the test sample into the most similar sample in that layer's sample space; then feed these samples to the 3 regressors obtained in step S2 and combine their outputs using the optimal weights obtained in step S4 to obtain the final prediction result.
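The Euclidean-distance mapping used in steps S3 and S5 can be sketched as a nearest-neighbor lookup. This is a minimal sketch under the assumption that "most similar" means the smallest Euclidean distance; the function name `map_to_layer` is hypothetical.

```python
import numpy as np

def map_to_layer(samples, layer_space):
    """Replace each row of `samples` by its nearest row (Euclidean) in `layer_space`."""
    samples = np.asarray(samples, dtype=float)
    layer_space = np.asarray(layer_space, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_samples, n_layer_points).
    diff = samples[:, None, :] - layer_space[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    nearest = np.argmin(dist2, axis=1)
    return layer_space[nearest], nearest

# Toy layer with two points; each query is mapped to the closer one.
layer = np.array([[0.0, 0.0], [10.0, 10.0]])
mapped, idx = map_to_layer([[1.0, 2.0], [9.0, 8.0]], layer)
```

The mapped samples, rather than the raw validation or test samples, would then be passed to the regressor trained on that layer.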

Specifically, the clustering process of the iterative mean clustering algorithm in step S1 is similar to K-means clustering. As shown in Fig. 2, the center of each class is found by minimizing the distance between the data points and their nearest center.

The core idea of iterative mean clustering is to minimize the sum of the Euclidean distances from all samples to the centers of their classes, converging iteratively.
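In standard K-means notation, this idea corresponds to the objective and the two alternating updates below. This is the textbook formulation, given here as an assumption; the text does not show the patent's own formulas, and the standard form uses the squared distance.

```latex
J(c, \mu) = \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}

c^{(i)} = \arg\min_{1 \le j \le k} \left\lVert x^{(i)} - \mu_j \right\rVert^{2}

\mu'_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}}
```

Here c^{(i)} is the index of the class assigned to sample x^{(i)}, and μ'_j is the updated center of class j. Neither update can increase J, which is why the iteration converges.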

Given the training samples {x⁽¹⁾, x⁽²⁾, ..., x⁽ᵐ⁾}, the specific steps of the K-means clustering algorithm are as follows:

1: Choose K cluster centers, μ₁, μ₂, ..., μ_k;

2: Compute the class c^j (1 ≤ j ≤ k) of each sample x according to the following formula: [formula missing in source]

3: Update the center of each class according to the following formula, replacing μ_j with μ'_j: [formula missing in source]

4: Repeat steps 2 and 3 until μ_j no longer changes (convergence);

5: Fine-tune the result of each clustering by adding zero-mean normally distributed random noise to obtain the next sample set (sample space).
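The five steps above can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the names `kmeans` and `build_layers`, the number of centers per layer (`shrink`), and the noise scale (`noise_std`) are all assumptions, and the text does not say how labels are carried into the new layers (one option is to cluster rows that include the label column).

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    """Plain K-means: returns the k cluster centers of X."""
    # Step 1: pick k distinct samples as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest center.
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Step 3: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

def build_layers(X, n_layers=3, shrink=0.5, noise_std=0.01, seed=0):
    """Return [Y0, Y1, ...]; each new layer is the noisy centers of the previous one."""
    rng = np.random.default_rng(seed)
    layers = [np.asarray(X, dtype=float)]
    for _ in range(n_layers - 1):
        prev = layers[-1]
        k = max(1, int(shrink * len(prev)))  # assumed: halve the sample count per layer
        centers = kmeans(prev, k, rng)
        # Step 5: fine-tune with zero-mean normal noise.
        layers.append(centers + rng.normal(0.0, noise_std, size=centers.shape))
    return layers

X = np.random.default_rng(1).normal(size=(40, 3))
layers = build_layers(X)
```

With `n_layers=3` this performs two clustering iterations and yields three sample sets, matching the three-layer setup of this embodiment.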

In the figure, Y₀ is the original training set, and the other two layers of samples, Y₁ and Y₂, are obtained from it by the iterative mean clustering algorithm. Three regressors are trained on the sample sets of the three layers; on the validation set they yield the corresponding results (r₀, r₁, r₂), and the optimal weights w_op = (w₀, w₁, w₂) can be obtained by formula (3).

The constraint condition for determining the optimal weights (w₀, w₁, w₂) is: [formula missing in source]

After the regressors have been trained, the predicted ages of the regressors of each layer, a = (a₀, a₁, a₂), are obtained on the test set, and the final age is obtained by fusing them with the weights (w₀, w₁, w₂): a_f = w_op^T a.
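The weighted fusion a_f = w_op^T a can be sketched as follows. The weight constraint is not reproduced in this text, so the sketch assumes non-negative weights that sum to 1 and finds them by a coarse grid search that minimizes the validation error; both the constraint and the search procedure (`search_weights`) are hypothetical.

```python
import itertools
import numpy as np

def fuse(preds, weights):
    """preds: (n_regressors, n_samples); weights: (n_regressors,). Returns w^T a."""
    return np.asarray(weights) @ np.asarray(preds)

def search_weights(val_preds, val_true, step=0.1):
    """Grid-search weights (assumed >= 0, summing to 1) that minimize validation MAE."""
    val_preds = np.asarray(val_preds, dtype=float)
    val_true = np.asarray(val_true, dtype=float)
    n = val_preds.shape[0]
    ticks = int(round(1.0 / step))
    best_w, best_mae = None, np.inf
    # Enumerate integer tuples summing to `ticks`, i.e. weights summing to 1.
    for combo in itertools.product(range(ticks + 1), repeat=n - 1):
        if sum(combo) > ticks:
            continue
        w = np.array(combo + (ticks - sum(combo),)) / ticks
        mae = np.mean(np.abs(fuse(val_preds, w) - val_true))
        if mae < best_mae:
            best_w, best_mae = w, mae
    return best_w, best_mae

# Toy validation results (r0, r1, r2); the values are illustrative only.
r = np.array([[50.0, 40.0, 30.0],
              [52.0, 44.0, 33.0],
              [60.0, 20.0, 10.0]])
truth = np.array([52.0, 44.0, 33.0])
w_op, val_mae = search_weights(r, truth)
a_f = fuse(r, w_op)
```

In this toy case the second regressor matches the true values exactly, so the search puts all weight on it. On real data the weights would typically be spread across the layers.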

To measure the performance of the prediction algorithm, the mean absolute error (MAE) is used, defined as

$$\mathrm{MAE} = \frac{1}{M} \sum_{j=1}^{M} \left| a_j - a'_j \right|$$

where M is the number of samples in the test data, a_j is the actual value of the j-th test sample, and a'_j is the predicted value of the j-th test sample. In addition, the number of times the age prediction mechanism of the present invention outperforms the traditional age prediction mechanism is recorded as Score.
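Both metrics are straightforward to compute. The sketch below assumes the usual definition of MAE as the average absolute difference between actual and predicted values, and assumes that Score counts only runs where the proposed method has a strictly smaller MAE (the handling of ties is not stated in the text). The function names and numbers are illustrative.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def score(mae_proposed, mae_traditional):
    """Number of runs in which the proposed method has the strictly smaller MAE."""
    return int(np.sum(np.asarray(mae_proposed) < np.asarray(mae_traditional)))

mae = mean_absolute_error([30, 45, 60], [32, 44, 57])  # (2 + 1 + 3) / 3
wins = score([5.1, 6.0, 4.8], [5.5, 5.9, 5.0])          # first and third runs win
```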

The details are shown in Table 2, where mean denotes the average and std the standard deviation.

As Table 2 shows, for both data sets the mean and standard deviation of the MAE obtained with the proposed method are smaller than those of the traditional method, indicating that the proposed age prediction mechanism predicts age more accurately than the traditional one. In addition, the larger Score value demonstrates the superiority of the method from another angle.

Table 2: Age prediction results for the two data sets

Fig. 3 shows the results of Table 2 as a bar chart. It mainly shows the differences in the predicted ages obtained by this method and the P values.

As Fig. 3 shows, for both data sets the MAE of the ages predicted by the proposed mechanism is smaller, and the P values obtained from hypothesis testing are all below 0.05, indicating that the improvement in the MAE of the ages predicted by PAEM is statistically significant.
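The text does not say which hypothesis test produced the P values. As one possible illustration, a paired sign-flip permutation test on the per-run MAE differences can be written with NumPy alone. The choice of test, the function name, and the synthetic MAE values below are all assumptions, not results from the patent.

```python
import numpy as np

def paired_permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on the paired differences x - y."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(d.mean())
    # Under the null hypothesis, the sign of each paired difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    permuted = np.abs((signs * d).mean(axis=1))
    # Add-one correction keeps the estimate strictly positive.
    return float((np.sum(permuted >= observed) + 1) / (n_perm + 1))

rng = np.random.default_rng(42)
mae_taem = rng.normal(6.0, 0.5, size=100)             # made-up values, not patent results
mae_paem = mae_taem - rng.normal(0.3, 0.1, size=100)  # made-up values, not patent results
p_value = paired_permutation_pvalue(mae_paem, mae_taem)
```

A paired test fits this setting because both algorithms are evaluated on the same 100 random splits, so their MAE values can be compared run by run.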

Finally, it should be noted that the above describes preferred embodiments of the present invention. Under the teaching of the invention and without departing from its purpose and claims, those of ordinary skill in the art may make various similar variations, all of which fall within the scope of protection of the present invention.

Claims

1. A deep sample learning method based on iterative mean clustering, characterized by the following steps:

S1: Select training data and process it with N iterations of the mean clustering algorithm to obtain N+1 layers of training sample subsets, N ≥ 1;

S2: Perform regression training on each layer's training sample subset independently to obtain N+1 regressors;

S3: Select validation data; first compute the Euclidean-distance similarity between each validation sample and the sample space of each layer, so as to convert the validation sample into the most similar sample in that layer's sample space, and then feed these samples to the N+1 regressors respectively to obtain N+1 validation results;

S4: Determine the optimal weight of each regressor, (w₀, w₁, …, w_N), based on a weighted fusion mechanism;

S5: Obtain test data; first compute the Euclidean-distance similarity between each test sample and the sample space of each layer, so as to convert the test sample into the most similar sample in that layer's sample space; then feed these samples to the N+1 regressors obtained in step S2 and combine their outputs using the optimal weights obtained in step S4 to obtain the final prediction result.

2. The deep sample learning method based on iterative mean clustering according to claim 1, characterized in that the constraint condition for determining the optimal weights (w₀, w₁, …, w_N) is: [formula missing in source]

3. The deep sample learning method based on iterative mean clustering according to claim 1, characterized in that the cluster-center search principle of the iterative mean clustering algorithm is the same as that of K-means clustering, except that at each iteration the original samples are the cluster centers obtained from the previous clustering.

4. The deep sample learning method based on iterative mean clustering according to claim 1, characterized in that the regressor uses a support vector regression model whose kernel function is a linear kernel or a radial basis function kernel.

5. The deep sample learning method based on iterative mean clustering according to claim 1, characterized in that the test data are medical data of the subject under test, the training data and validation data are selected from public databases such as UCI, each sample includes multiple features, and the prediction result is the label of the subject under test.

6. The deep sample learning method based on iterative mean clustering according to claim 1, characterized in that the test data are medical data of the subject under test, the training data and validation data are selected from the diabetes data or heart disease data in public databases such as UCI, each sample includes multiple features, and the prediction result is the age of the subject under test.

7. The deep sample learning method based on iterative mean clustering according to any one of claims 1-6, characterized in that the performance of the prediction algorithm is evaluated with the mean absolute error (MAE), defined as

$$\mathrm{MAE} = \frac{1}{M} \sum_{j=1}^{M} \left| a_j - a'_j \right|$$

where M is the number of samples in the test data, a_j is the actual value of the j-th test sample, and a'_j is the predicted value of the j-th test sample.