CN110059752A - A statistical learning query method based on information entropy sampling estimation - Google Patents
A statistical learning query method based on information entropy sampling estimation
- Publication number: CN110059752A
- Application number: CN201910319193.XA
- Authority: CN (China)
- Prior art keywords: sample, information entropy, label, probability distribution, statistical learning
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F18/00—Pattern recognition > G06F18/20—Analysing > G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation > G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F18/00—Pattern recognition > G06F18/20—Analysing > G06F18/24—Classification techniques > G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches > G06F18/2415—Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The present invention relates to a statistical learning query method based on information entropy sampling estimation. The method uses a model trained on the labeled samples to compute the information entropy of each instance in the unlabeled pool, selects the several samples with the highest uncertainty, computes the expected empirical risk of the corresponding data distribution for each of them, and selects the sample that minimizes the expected empirical risk for labeling. The advantages of the present invention are that it selects samples from the microscopic perspective of the individual sample and makes full use of the information content of the sample itself; fully combining the two criteria helps select samples that both carry high information content and minimize the expected loss. At the same time, the selection strategy effectively reduces the computational complexity of selection strategies based on statistical learning.
Description
Technical field
The present invention relates to statistical learning query methods, and in particular to a statistical learning query method based on information entropy sampling estimation.
Background art
Traditional supervised learning trains a model on a labeled data set, but building a labeled data set can require a large amount of time and cost. The active learning framework instead selects a small number of instances from the unlabeled data set to be labeled and thereby achieves good classification performance with far fewer labels. Common pool-based active learning query strategies fall into several families: Uncertainty Sampling, Query-By-Committee, Expected Model Change, Expected Error Reduction, Variance Reduction, and Density-Weighted Methods; the classification models used with them include naive Bayes, random forests, support vector machines, and others. Uncertainty Sampling selects unlabeled samples from the perspective of uncertainty; in practice the strategy is quite robust, but it tends to select outliers. Query-By-Committee maintains a set of classification models and uses the disagreement among the different classifiers as the criterion for selecting unlabeled samples; common disagreement measures include vote entropy and Kullback-Leibler divergence, and the strategy essentially realizes sample selection by shrinking the hypothesis space. The Expected Model Change strategy uses decision-theoretic methods to select the unlabeled instance that would change the model the most. Expected Error Reduction, based directly on statistical learning theory, computes the expected risk that each possible labeling of an unlabeled sample would bring, and selects unlabeled samples by the expected-risk-minimization criterion; this selection strategy directly optimizes the expected risk, but suffers from high computational complexity. The Variance Reduction strategy does not optimize the expected risk directly; instead it selects unlabeled samples indirectly by reducing the output variance. Density-Weighted Methods consider the representativeness of an unlabeled sample alongside its information content, impose different weights on information content and representativeness, and select samples according to the weighted value. The QUIRE algorithm proposed by Huang Sheng-Jun, Jin Rong, and Zhou Zhi-Hua, "Active learning by querying informative and representative examples", Advances in Neural Information Processing Systems, 2010, in combination with support vector machines, also belongs to this family; it achieves good classification performance in multiple fields, but likewise suffers from high computational complexity.
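By way of a concrete illustration of the Query-By-Committee criterion above, the vote-entropy disagreement measure can be sketched as follows. This is a minimal sketch assuming a list of fitted scikit-learn-style classifiers and integer class labels 0, ..., n_classes-1; the helper name vote_entropy is illustrative rather than taken from the cited prior art.

```python
import numpy as np

def vote_entropy(committee, X_pool, n_classes):
    """Query-By-Committee disagreement: entropy of the committee's vote distribution."""
    votes = np.stack([clf.predict(X_pool) for clf in committee])  # (n_members, n_samples)
    disagreement = np.zeros(X_pool.shape[0])
    for y in range(n_classes):
        frac = np.mean(votes == y, axis=0)              # fraction of members voting class y
        nz = frac > 0
        disagreement[nz] -= frac[nz] * np.log(frac[nz]) # -sum_y (V(y)/C) log(V(y)/C)
    return disagreement                                  # higher value = more disagreement
```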
Selecting unlabeled samples by statistical learning methods has been studied in depth. D. MacKay (1992), "Information-based objective functions for active data selection", Neural Computation 4(4): 590-604, and Cohn, D.A., Ghahramani, Z., & Jordan, M.I. (1996), "Active learning with statistical models", Journal of Artificial Intelligence Research, 4, 129-145, proposed optimizing the objective function with statistical learning methods, building models with classifiers such as feedforward neural networks. N. Roy and A. McCallum, "Toward optimal active learning through sampling estimation of error reduction", Proc. 18th Int. Conf. Mach. Learn., pp. 441-448, 2001, proposed selecting unlabeled samples directly with a statistical learning method that minimizes the expected risk function; however, this method still suffers from a heavy computational burden. Z. Wang and J. Ye, "Querying discriminative and representative samples for batch mode active learning", Proc. ACM SIGKDD, pp. 158-166, 2013, select informative and representative unlabeled samples by minimizing the empirical risk. Tang Ying-Peng and Huang Sheng-Jun, "Self-paced active learning: query the right thing at the right time", Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, introduce self-paced learning to simultaneously select unlabeled instances that are easy to classify and unlabeled instances with potential value such as high information content, achieving good classification performance.
Therefore, it is necessary to research and develop a statistical learning query method based on information entropy sampling estimation that has lower time complexity and higher effectiveness.
Summary of the invention
The technical problem to be solved by the present invention is to provide a statistical learning query method based on information entropy sampling estimation with lower time complexity and higher effectiveness.
In order to solve the above technical problem, the technical solution of the present invention is as follows: a statistical learning query method based on information entropy sampling estimation, whose innovation is that the statistical learning query method comprises the following steps:
Step 1: Let a training example be x ∈ R^n and the label of an example be y ∈ Y = {y_1, y_2, ..., y_k}; the conditional probability distribution of a training example x is P(y|x), the labeled sample set D is drawn by independent and identically distributed sampling, and the joint probability distribution is P(x, y) = P(y|x)P(x). The model then produces a posterior probability estimate $\hat{P}(y|x)$ for an input sample x, so the expected risk $\hat{R}$ based on statistical learning is:
$\hat{R} = \int_x L\big(P(y \mid x), \hat{P}(y \mid x)\big)\, P(x)\, dx$  (1)
Step 2: The loss function L measures the difference between the true probability distribution P(x, y) of a sample (x, y) and the estimated posterior distribution $\hat{P}(x, y)$; the loss function L is:
$L = -\sum_{y \in Y} P(y \mid x) \log \hat{P}(y \mid x)$  (2)
Step 3: The optimization target of the expected risk $\hat{R}$ is to select the optimal unlabeled sample sequence k = {x_1, x_2, x_3, ..., x_k}, where k denotes the number sampled from the unlabeled samples; for each unlabeled sample (x*, y*) in the sample sequence k:
$(x^*, y^*) = \arg\min_{(x^*, y^*)} \hat{R}_{D + (x^*, y^*)}$  (3)
Step 4: The range over which learning takes place is based on the unlabeled sample pool M, so there is a fixed estimate P(x) for the unlabeled samples. Define the new labeled set obtained by adding the unlabeled sample (x*, y*) to the labeled sample set D as D* = D + (x*, y*). The distribution function of the new labeled sample set D* is unknown; in order to evaluate formula (2) effectively, the probability distribution of the labeled sample set is used to estimate the current unlabeled sample (x*, y*), and the empirical risk $\hat{R}_{D^*}$ of the current classifier is:
$\hat{R}_{D^*} = -\frac{1}{|M|} \sum_{x \in M} \sum_{y \in Y} \hat{P}_{D^*}(y \mid x) \log \hat{P}_{D^*}(y \mid x)$  (4)
Step 5: From $\hat{R}_{D^*}$, compute the expected risk value of the unlabeled sample x* over the cases y* ∈ Y. The true value of y* is unknown, so the known probability distribution P(x, y) is used to compute the estimated probability distribution values, and with the different class probabilities as weights the final expected value of $\hat{R}_{D^*}$ is:
$E\big[\hat{R}_{D^*}\big] = \sum_{y^* \in Y} \hat{P}_D(y^* \mid x^*)\, \hat{R}_{D + (x^*, y^*)}$  (5)
Step 6: Select the sample x_{U,max} with the highest information entropy from the unlabeled instance set M:
$x_{U,\max} = \arg\max_{x \in M} \big(-\sum_i P_D(y_i \mid x) \log P_D(y_i \mid x)\big)$  (6)
Step 7: Compute the information entropy of the samples against the labeled sample set D; according to the information entropy, select the Q samples with the highest uncertainty, compute the corresponding expected risk for each of the Q samples, and select the sample with the smallest expected risk value for manual labeling.
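As a worked illustration of formula (6) in the binary case: a pool sample whose posterior is (0.5, 0.5) has entropy $-(0.5\ln 0.5 + 0.5\ln 0.5) = \ln 2 \approx 0.693$, while one whose posterior is (0.9, 0.1) has entropy $-(0.9\ln 0.9 + 0.1\ln 0.1) \approx 0.325$; step 6 therefore prefers the former, more uncertain sample.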
Further, the specific steps of step 7 are as follows:
Step 1: Input the initial labeled data set D = {x_1, ..., x_l}, the unlabeled data set M = {x_{l+1}, ..., x_{l+u}}, the data labels y_1, ..., y_l, and the maximum number of cycles U_max;
Step 2: Output the conditional probability distribution $\hat{P}(y|x)$;
Step 3: Initialize the training model P(x, y) using the labeled data;
Step 4: While the maximum number of cycles has not been reached, compute the corresponding information entropy on the unlabeled training set M according to formula (6);
Step 5: According to the information entropy, select the Q samples with the largest information entropy;
Step 6: Label a selected sample with each class in the set Y in turn, add each labeled copy to the training set, retrain the model, and compute the corresponding loss function according to formula (2);
Step 7: Compute the corresponding empirical risk function according to formula (4);
Step 8: According to the different classes, compute the expected value of the expected risk function according to formula (5); if all Q samples have been computed, go to step 9, otherwise return to step 6;
Step 9: Select the sample that minimizes the expected value for manual labeling; if there is no sample minimizing the expected value, return to step 2.
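The selection procedure above can be expressed compactly. The following is a minimal Python sketch of steps 4 through 9, assuming scikit-learn-style classifiers, with the empirical risk of formula (4) estimated by the retrained model's mean self-entropy over the unlabeled pool, in the manner of the Roy and McCallum method cited in the background; all function and variable names are illustrative:

```python
import numpy as np
from sklearn.base import clone

def query_by_entropy_sampling(model, X_lab, y_lab, X_pool, Q=20):
    """Pick the pool index whose labeling minimizes the expected empirical risk (formulas 4-6)."""
    proba = np.clip(model.predict_proba(X_pool), 1e-12, 1.0)
    entropy = -np.sum(proba * np.log(proba), axis=1)           # formula (6)
    candidates = np.argsort(-entropy)[:Q]                      # Q most uncertain samples

    def pool_risk(clf, X):                                     # formula (4): mean self-entropy
        p = np.clip(clf.predict_proba(X), 1e-12, 1.0)
        return -np.mean(np.sum(p * np.log(p), axis=1))

    best_idx, best_expected = None, np.inf
    for i in candidates:
        expected = 0.0
        for j, y_star in enumerate(model.classes_):            # try every possible label y*
            retrained = clone(model).fit(
                np.vstack([X_lab, X_pool[i]]), np.append(y_lab, y_star))
            expected += proba[i, j] * pool_risk(retrained, X_pool)  # formula (5)
        if expected < best_expected:
            best_idx, best_expected = i, expected
    return best_idx                                            # sample to label manually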
The advantages of the present invention are as follows. In the statistical learning query method based on information entropy sampling estimation, the method uses a model trained on the labeled samples to compute the information entropy of each instance in the unlabeled pool, selects the several samples with the highest uncertainty, computes the expected empirical risk of the corresponding data distribution, and selects the sample that minimizes the expected empirical risk for labeling. It selects samples from the microscopic perspective of the individual sample and makes full use of the information content of the sample itself; fully combining the two criteria helps select samples that both carry high information content and minimize the expected loss. At the same time, the selection strategy effectively reduces the computational complexity of selection strategies based on statistical learning.
Description of the drawings
The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the flow chart of the specific selection procedure of step 7 in the statistical learning query method based on information entropy sampling estimation of the present invention.
Fig. 2 is the accuracy performance curve on the data set tic-tac-toe.
Fig. 3 is the accuracy performance curve on the data set transfusion.
Fig. 4 is the accuracy performance curve on the data set kr-vs-kp.
Fig. 5 is the accuracy performance curve on the data set diagnosis.
Fig. 6 is the accuracy performance curve on the data set breast-cancer.
Specific embodiments
The following embodiments help those skilled in the art understand the present invention more fully, but do not limit the present invention to the scope of the described embodiments.
Embodiment
The statistical learning query method based on information entropy sampling estimation of this embodiment comprises the following steps:
Step 1: Let a training example be x ∈ R^n and the label of an example be y ∈ Y = {y_1, y_2, ..., y_k}; the conditional probability distribution of a training example x is P(y|x), the labeled sample set D is drawn by independent and identically distributed sampling, and the joint probability distribution is P(x, y) = P(y|x)P(x). The model then produces a posterior probability estimate $\hat{P}(y|x)$ for an input sample x, so the expected risk $\hat{R}$ based on statistical learning is:
$\hat{R} = \int_x L\big(P(y \mid x), \hat{P}(y \mid x)\big)\, P(x)\, dx$  (1)
Step 2: The loss function L measures the difference between the true probability distribution P(x, y) of a sample (x, y) and the estimated posterior distribution $\hat{P}(x, y)$; the loss function L is:
$L = -\sum_{y \in Y} P(y \mid x) \log \hat{P}(y \mid x)$  (2)
Step 3: The optimization target of the expected risk $\hat{R}$ is to select the optimal unlabeled sample sequence k = {x_1, x_2, x_3, ..., x_k}, where k denotes the number sampled from the unlabeled samples; for each unlabeled sample (x*, y*) in the sample sequence k:
$(x^*, y^*) = \arg\min_{(x^*, y^*)} \hat{R}_{D + (x^*, y^*)}$  (3)
Step 4: The range over which learning takes place is based on the unlabeled sample pool M, so there is a fixed estimate P(x) for the unlabeled samples. Define the new labeled set obtained by adding the unlabeled sample (x*, y*) to the labeled sample set D as D* = D + (x*, y*). The distribution function of the new labeled sample set D* is unknown; in order to evaluate formula (2) effectively, the probability distribution of the labeled sample set is used to estimate the current unlabeled sample (x*, y*), and the empirical risk $\hat{R}_{D^*}$ of the current classifier is:
$\hat{R}_{D^*} = -\frac{1}{|M|} \sum_{x \in M} \sum_{y \in Y} \hat{P}_{D^*}(y \mid x) \log \hat{P}_{D^*}(y \mid x)$  (4)
Step 5: From $\hat{R}_{D^*}$, compute the expected risk value of the unlabeled sample x* over the cases y* ∈ Y. The true value of y* is unknown, so the known probability distribution P(x, y) is used to compute the estimated probability distribution values, and with the different class probabilities as weights the final expected value of $\hat{R}_{D^*}$ is:
$E\big[\hat{R}_{D^*}\big] = \sum_{y^* \in Y} \hat{P}_D(y^* \mid x^*)\, \hat{R}_{D + (x^*, y^*)}$  (5)
Step 6: Select the sample x_{U,max} with the highest information entropy from the unlabeled instance set M:
$x_{U,\max} = \arg\max_{x \in M} \big(-\sum_i P_D(y_i \mid x) \log P_D(y_i \mid x)\big)$  (6)
Step 7: Compute the information entropy of the samples against the labeled sample set D; according to the information entropy, select the Q samples with the highest uncertainty, compute the corresponding expected risk for each of the Q samples, and select the sample with the smallest expected risk value for manual labeling.
As an embodiment, the specific implementation of step 7, as shown in Fig. 1, proceeds as follows:
Step 1: Input the initial labeled data set D = {x_1, ..., x_l}, the unlabeled data set M = {x_{l+1}, ..., x_{l+u}}, the data labels y_1, ..., y_l, and the maximum number of cycles U_max;
Step 2: Output the conditional probability distribution $\hat{P}(y|x)$;
Step 3: Initialize the training model P(x, y) using the labeled data;
Step 4: While the maximum number of cycles has not been reached, compute the corresponding information entropy on the unlabeled training set M according to formula (6);
Step 5: According to the information entropy, select the Q samples with the largest information entropy;
Step 6: Label a selected sample with each class in the set Y in turn, add each labeled copy to the training set, retrain the model, and compute the corresponding loss function according to formula (2);
Step 7: Compute the corresponding empirical risk function according to formula (4);
Step 8: According to the different classes, compute the expected value of the expected risk function according to formula (5); if all Q samples have been computed, go to step 9, otherwise return to step 6;
Step 9: Select the sample that minimizes the expected value for manual labeling; if there is no sample minimizing the expected value, return to step 2.
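Under the same assumptions as the sketch given after the specific steps above, the embodiment's outer query loop could be driven as follows; X_init, y_init, X_pool, U_max, and oracle_label are illustrative stand-ins for the experimental setup and the human annotator:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_lab, y_lab = X_init.copy(), y_init.copy()
model = RandomForestClassifier().fit(X_lab, y_lab)        # step 3: initialize on labeled data
for _ in range(U_max):                                    # step 4: up to U_max query cycles
    idx = query_by_entropy_sampling(model, X_lab, y_lab, X_pool, Q=20)
    y_new = oracle_label(X_pool[idx])                     # step 9: manual labeling
    X_lab = np.vstack([X_lab, X_pool[idx]])
    y_lab = np.append(y_lab, y_new)
    X_pool = np.delete(X_pool, idx, axis=0)               # remove the queried sample from M
    model = RandomForestClassifier().fit(X_lab, y_lab)    # retrain on the enlarged labeled set
```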
In order to verify the effectiveness of the statistical active learning strategy based on information entropy sampling estimation, it is compared with a random sampling strategy on multiple data sets. The random sampling strategy randomly chooses several samples from the unlabeled instances and selects the sample with the smallest expected risk for manual labeling.
The experimental data come from the machine learning data set repository of the University of California, Irvine. The binary classification data sets tic-tac-toe, transfusion, kr-vs-kp, diagnosis, and breast-cancer are selected; these data sets are frequently used in research on active learning query strategies (Tang Ying-Peng and Huang Sheng-Jun, "Self-paced active learning: query the right thing at the right time", Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019). The specific data set descriptions are given in Table 1.
Table 1  The data sets used in the experiments
The experimental data are divided using stratified sampling: 50% of each data set is used for training data and 50% for test data, and 10% of the training data is taken out as the initial labeled data set used to establish the model. Each experiment is repeated 5 times at random with cross-validation, so each data set generates 5×2 groups of data, and the average over all runs is taken as the prediction result for that labeled-data point. The experiments use the sklearn toolkit; the classifiers are a random forest classifier and a logistic regression classifier, with the system default parameters.
The categorical attributes in the UCI data sets are encoded through the LabelEncoder class of sklearn, which converts each category attribute into a corresponding integer value. Selecting a subset from the unlabeled instances requires setting a hyperparameter C, which denotes the number of samples drawn from the unlabeled instance set; in this strategy C is set to 20. The evaluation index of the algorithms is ACCURACY, the ratio of true positive examples to the sum of true positive examples and false positive examples.
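A minimal sketch of this experimental protocol, assuming the raw attribute matrix and labels have already been loaded (the helper name prepare_experiment is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def prepare_experiment(X_raw, y, seed=0):
    """Stratified 50/50 train/test split with a 10% initial labeled subset."""
    # Encode each categorical attribute column as integers, as described above.
    X = np.column_stack([LabelEncoder().fit_transform(col) for col in X_raw.T])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    X_init, X_pool, y_init, y_pool = train_test_split(
        X_train, y_train, train_size=0.1, stratify=y_train, random_state=seed)
    return X_init, y_init, X_pool, y_pool, X_test, y_test

classifiers = {"random_forest": RandomForestClassifier(),
               "logistic_regression": LogisticRegression()}  # system default parameters
```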
Using the random forest as the classification model for the data sets, the performance of the two compared algorithms on the data sets tic-tac-toe, transfusion, kr-vs-kp, diagnosis, and breast-cancer as the number of labeled samples increases is shown in Fig. 2, Fig. 3, Fig. 4, Fig. 5, and Fig. 6.
In order to further investigate the effectiveness of the proposed strategy, a win/draw/loss analysis is done on the results of the two algorithms with labeled-data ratios of 20%, 40%, 60%, 80%, and 100%. The win/draw/loss analysis describes the differences between the algorithms on the same data set at different ratios. For example, when the labeled-data ratio is 20%, the mean classifier accuracy of the information entropy sampling strategy on the data set tic-tac-toe is denoted A_ie and the mean classifier accuracy of the random sampling strategy on tic-tac-toe is denoted A_r; if A_ie > A_r, then win = 1, if A_ie = A_r, then draw = 1, and if A_ie < A_r, then loss = 1. Table 2 shows the win/draw/loss comparison of the information entropy sampling strategy against the random strategy.
Table 2  Win/draw/loss analysis of the information entropy sampling strategy against the random strategy with labeled-data ratios of 20%, 40%, 60%, 80%, and 100%
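The win/draw/loss tallying described above can be sketched as follows; the accuracies in the usage example are illustrative values, not results from Table 2:

```python
def win_draw_loss(acc_entropy, acc_random):
    """Tally wins/draws/losses of the entropy strategy over paired mean accuracies."""
    win = sum(a > b for a, b in zip(acc_entropy, acc_random))
    draw = sum(a == b for a, b in zip(acc_entropy, acc_random))
    loss = sum(a < b for a, b in zip(acc_entropy, acc_random))
    return win, draw, loss

# Illustrative mean accuracies at labeled ratios 20%, 40%, 60%, 80%, 100%:
print(win_draw_loss([0.81, 0.85, 0.88, 0.90, 0.91],
                    [0.79, 0.85, 0.86, 0.89, 0.92]))  # -> (3, 1, 1)
```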
From Fig. 2 to Fig. 6 and the results shown in Table 2, the sampling strategy based on information entropy achieves better classification performance than the random strategy in most cases, which fully demonstrates the effectiveness of the information entropy based strategy. It also shows that, when subsampling based on statistical learning, selection oriented toward the information content of individual samples is better than random selection. As the number of labeled samples increases, the two active learning sampling strategies, sampling from their different perspectives, both improve the precision of the classification algorithm, which also illustrates the effectiveness of the active learning framework. There are also differences in behavior: on the data set diagnosis, the strategy based on information entropy quickly reaches good classification performance as labeled samples are added, whereas the random sampling strategy not only fails to reach good classification performance but even shows large fluctuations in classification performance, which shows that the strategy based on information entropy is more conducive to lifting the precision of the model's classifier. From the performance trend on the data set transfusion, as labeled samples are added the algorithm's performance grows quickly and converges to a stable classification level, while the random sampling strategy even shows performance decline after reaching good classification performance.
In order to fully study the performance changes of the different sampling strategies, other classifiers are also used to model the data sets, and the performance of the different sampling strategies is compared on the different classifiers; to save space, only the performance comparison with labeled-data ratios of 20%, 40%, 60%, 80%, and 100% is reported. Table 3 shows the performance comparison of the different sampling strategies on the different classifiers.
Table 3  Performance comparison of different sampling strategies on different classifiers; the comparison test is based on a paired t-test, and the better performance is shown in bold
The results shown in Table 3 indicate that, whether the random forest classifier or the logistic regression classifier is used, the proposed sampling estimation strategy based on information entropy achieves the best performance in most cases, and even where it is not optimal it is rarely much weaker than the best performance. This shows that, for different classifiers and different labeled-instance ratios, the proposed sampling strategy based on information entropy provides a stable performance improvement.
In addition, comparing the performance across the two different classifiers, we can also see that the strategy based on information entropy sampling estimation has more stable performance: as labeled samples increase, the classification performance rises steadily until it converges to the best level, whereas the random sampling strategy clearly exhibits strong randomness and instability of performance.
Experiments on common machine learning data sets show that this method can effectively select the instances that need manual labeling from the unlabeled instances.
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should appreciate that the present invention is not limited to the above embodiments; the above embodiments and the description only illustrate the principle of the present invention, and various changes and improvements may be made to the invention without departing from its spirit and scope, all of which fall within the claimed scope of the invention. The claimed scope of the invention is defined by the appended claims and their equivalents.
Claims (2)
1. A statistical learning query method based on information entropy sampling estimation, characterized in that the statistical learning query method comprises the following steps:
Step 1: let a training example be x ∈ R^n and the label of an example be y ∈ Y = {y_1, y_2, ..., y_k}; the conditional probability distribution of a training example x is P(y|x), the labeled sample set D is drawn by independent and identically distributed sampling, and the joint probability distribution is P(x, y) = P(y|x)P(x); the model then produces a posterior probability estimate $\hat{P}(y|x)$ for an input sample x, so the expected risk $\hat{R}$ based on statistical learning is:
$\hat{R} = \int_x L\big(P(y \mid x), \hat{P}(y \mid x)\big)\, P(x)\, dx$  (1)
Step 2: the loss function L measures the difference between the true probability distribution P(x, y) of a sample (x, y) and the estimated posterior distribution $\hat{P}(x, y)$; the loss function L is:
$L = -\sum_{y \in Y} P(y \mid x) \log \hat{P}(y \mid x)$  (2)
Step 3: the optimization target of the expected risk $\hat{R}$ is to select the optimal unlabeled sample sequence k = {x_1, x_2, x_3, ..., x_k}, where k denotes the number sampled from the unlabeled samples; for each unlabeled sample (x*, y*) in the sample sequence k:
$(x^*, y^*) = \arg\min_{(x^*, y^*)} \hat{R}_{D + (x^*, y^*)}$  (3)
Step 4: the range over which learning takes place is based on the unlabeled sample pool M, so there is a fixed estimate P(x) for the unlabeled samples; define the new labeled set obtained by adding the unlabeled sample (x*, y*) to the labeled sample set D as D* = D + (x*, y*); the distribution function of the new labeled sample set D* is unknown, and in order to evaluate formula (2) effectively, the probability distribution of the labeled sample set is used to estimate the current unlabeled sample (x*, y*); the empirical risk $\hat{R}_{D^*}$ of the current classifier is then:
$\hat{R}_{D^*} = -\frac{1}{|M|} \sum_{x \in M} \sum_{y \in Y} \hat{P}_{D^*}(y \mid x) \log \hat{P}_{D^*}(y \mid x)$  (4)
Step 5: from $\hat{R}_{D^*}$, compute the expected risk value of the unlabeled sample x* over the cases y* ∈ Y; the true value of y* is unknown, so the known probability distribution P(x, y) is used to compute the estimated probability distribution values, and with the different class probabilities as weights the final expected value of $\hat{R}_{D^*}$ is:
$E\big[\hat{R}_{D^*}\big] = \sum_{y^* \in Y} \hat{P}_D(y^* \mid x^*)\, \hat{R}_{D + (x^*, y^*)}$  (5)
Step 6: select the sample x_{U,max} with the highest information entropy from the unlabeled instance set M:
$x_{U,\max} = \arg\max_{x \in M} \big(-\sum_i P_D(y_i \mid x) \log P_D(y_i \mid x)\big)$  (6)
Step 7: compute the information entropy of the samples against the labeled sample set D; according to the information entropy, select the Q samples with the highest uncertainty, compute the corresponding expected risk for each of the Q samples, and select the sample with the smallest expected risk value for manual labeling.
2. The statistical learning query method based on information entropy sampling estimation according to claim 1, characterized in that the specific steps of step 7 are as follows:
Step 1: input the initial labeled data set D = {x_1, ..., x_l}, the unlabeled data set M = {x_{l+1}, ..., x_{l+u}}, the data labels y_1, ..., y_l, and the maximum number of cycles U_max;
Step 2: output the conditional probability distribution $\hat{P}(y|x)$;
Step 3: initialize the training model P(x, y) using the labeled data;
Step 4: while the maximum number of cycles has not been reached, compute the corresponding information entropy on the unlabeled training set M according to formula (6);
Step 5: according to the information entropy, select the Q samples with the largest information entropy;
Step 6: label a selected sample with each class in the set Y in turn, add each labeled copy to the training set, retrain the model, and compute the corresponding loss function according to formula (2);
Step 7: compute the corresponding empirical risk function according to formula (4);
Step 8: according to the different classes, compute the expected value of the expected risk function according to formula (5); if all Q samples have been computed, go to step 9, otherwise return to step 6;
Step 9: select the sample that minimizes the expected value for manual labeling; if there is no sample minimizing the expected value, return to step 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910319193.XA | 2019-04-19 | 2019-04-19 | A statistical learning query method based on information entropy sampling estimation
Publications (1)
Publication Number | Publication Date
---|---
CN110059752A | 2019-07-26
Family
ID=67319780
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910319193.XA (pending) | A statistical learning query method based on information entropy sampling estimation | 2019-04-19 | 2019-04-19
Country Status (1)
Country | Link |
---|---
CN | CN110059752A
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914061A (en) * | 2020-07-13 | 2020-11-10 | 上海乐言信息科技有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN111914061B (en) * | 2020-07-13 | 2021-04-16 | 上海乐言科技股份有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN114169470A (en) * | 2022-02-15 | 2022-03-11 | 南京航空航天大学 | Artificial intelligence learning method based on target model and sample double sampling |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190726 |