CN110059752A - A statistical learning query method based on information entropy sampling estimation - Google Patents
A statistical learning query method based on information entropy sampling estimation
- Publication number: CN110059752A
- Application number: CN201910319193.XA
- Authority: CN (China)
- Prior art keywords: sample, information entropy, label, probability distribution, statistical learning
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F18/00—Pattern recognition > G06F18/20—Analysing > G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation > G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F18/00—Pattern recognition > G06F18/20—Analysing > G06F18/24—Classification techniques > G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches > G06F18/2415—Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The present invention relates to a statistical learning query method based on information entropy sampling estimation. The method uses a model trained on the labeled samples to compute the information entropy of each instance in the unlabeled pool, selects the several samples with the highest uncertainty, computes the expected empirical risk of the corresponding data distribution for each of them, and selects the sample that minimizes the expected empirical risk for labeling. The advantages of the present invention are that it selects samples from the microscopic perspective of the individual sample and makes full use of the information content of the sample itself; fully combining the two criteria helps select samples that both carry high information content and minimize the expected loss. At the same time, the selection strategy effectively reduces the computational complexity of selection strategies based on statistical learning.
Description
Technical field
The present invention relates to statistical learning query methods, and in particular to a statistical learning query method based on information entropy sampling estimation.
Background art
Traditional supervised learning trains a model on a labeled data set, but building a labeled data set can require a large amount of time and cost. The active learning framework instead selects a small number of instances from the unlabeled data set to be labeled and thereby achieves good classification performance with far fewer labels. Common pool-based active learning query strategies fall into several families: Uncertainty Sampling, Query-By-Committee, Expected Model Change, Expected Error Reduction, Variance Reduction, and Density-Weighted Methods; the classification models used with them include naive Bayes, random forests, support vector machines, and others. Uncertainty Sampling selects unlabeled samples from the perspective of uncertainty; in practice the strategy is quite robust, but it tends to select outliers. Query-By-Committee maintains a set of classification models and uses the disagreement among the different classifiers as the criterion for selecting unlabeled samples; common disagreement measures include vote entropy and Kullback-Leibler divergence, and the strategy essentially realizes sample selection by shrinking the hypothesis space. The Expected Model Change strategy uses decision-theoretic methods to select the unlabeled instance that would change the model the most. Expected Error Reduction, based directly on statistical learning theory, computes the expected risk that each possible labeling of an unlabeled sample would bring, and selects unlabeled samples by the expected-risk-minimization criterion; this selection strategy directly optimizes the expected risk, but suffers from high computational complexity. The Variance Reduction strategy does not optimize the expected risk directly; instead it selects unlabeled samples indirectly by reducing the output variance. Density-Weighted Methods consider the representativeness of an unlabeled sample alongside its information content, impose different weights on information content and representativeness, and select samples according to the weighted value. The QUIRE algorithm proposed by Huang Sheng-Jun, Jin Rong, and Zhou Zhi-Hua, "Active learning by querying informative and representative examples", Advances in Neural Information Processing Systems, 2010, in combination with support vector machines, also belongs to this family; it achieves good classification performance in multiple fields, but likewise suffers from high computational complexity.
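By way of a concrete illustration of the Query-By-Committee criterion above, the vote-entropy disagreement measure can be sketched as follows. This is a minimal sketch assuming a list of fitted scikit-learn-style classifiers and integer class labels 0, ..., n_classes-1; the helper name vote_entropy is illustrative rather than taken from the cited prior art.

```python
import numpy as np

def vote_entropy(committee, X_pool, n_classes):
    """Query-By-Committee disagreement: entropy of the committee's vote distribution."""
    votes = np.stack([clf.predict(X_pool) for clf in committee])  # (n_members, n_samples)
    disagreement = np.zeros(X_pool.shape[0])
    for y in range(n_classes):
        frac = np.mean(votes == y, axis=0)              # fraction of members voting class y
        nz = frac > 0
        disagreement[nz] -= frac[nz] * np.log(frac[nz]) # -sum_y (V(y)/C) log(V(y)/C)
    return disagreement                                  # higher value = more disagreement
```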
Selecting unlabeled samples by statistical learning methods has been studied in depth. D. MacKay (1992), "Information-based objective functions for active data selection", Neural Computation 4(4): 590-604, and Cohn, D.A., Ghahramani, Z., & Jordan, M.I. (1996), "Active learning with statistical models", Journal of Artificial Intelligence Research, 4, 129-145, proposed optimizing the objective function with statistical learning methods, building models with classifiers such as feedforward neural networks. N. Roy and A. McCallum, "Toward optimal active learning through sampling estimation of error reduction", Proc. 18th Int. Conf. Mach. Learn., pp. 441-448, 2001, proposed selecting unlabeled samples directly with a statistical learning method that minimizes the expected risk function; however, this method still suffers from a heavy computational burden. Z. Wang and J. Ye, "Querying discriminative and representative samples for batch mode active learning", Proc. ACM SIGKDD, pp. 158-166, 2013, select informative and representative unlabeled samples by minimizing the empirical risk. Tang Ying-Peng and Huang Sheng-Jun, "Self-paced active learning: query the right thing at the right time", Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, introduce self-paced learning to simultaneously select unlabeled instances that are easy to classify and unlabeled instances with potential value such as high information content, achieving good classification performance.
Therefore, it is necessary to research and develop a statistical learning query method based on information entropy sampling estimation that has lower time complexity and higher effectiveness.
Summary of the invention
The technical problem to be solved by the present invention is to provide a statistical learning query method based on information entropy sampling estimation with lower time complexity and higher effectiveness.
In order to solve the above technical problem, the technical solution of the present invention is as follows: a statistical learning query method based on information entropy sampling estimation, whose innovation is that the statistical learning query method comprises the following steps:
Step 1: Let a training example be x ∈ R^n and the label of an example be y ∈ Y = {y_1, y_2, ..., y_k}; the conditional probability distribution of a training example x is P(y|x), the labeled sample set D is drawn by independent and identically distributed sampling, and the joint probability distribution is P(x, y) = P(y|x)P(x). The model then produces a posterior probability estimate $\hat{P}(y|x)$ for an input sample x, so the expected risk $\hat{R}$ based on statistical learning is:
$\hat{R} = \int_x L\big(P(y \mid x), \hat{P}(y \mid x)\big)\, P(x)\, dx$  (1)
Step 2: The loss function L measures the difference between the true probability distribution P(x, y) of a sample (x, y) and the estimated posterior distribution $\hat{P}(x, y)$; the loss function L is:
$L = -\sum_{y \in Y} P(y \mid x) \log \hat{P}(y \mid x)$  (2)
Step 3: The optimization target of the expected risk $\hat{R}$ is to select the optimal unlabeled sample sequence k = {x_1, x_2, x_3, ..., x_k}, where k denotes the number sampled from the unlabeled samples; for each unlabeled sample (x*, y*) in the sample sequence k:
$(x^*, y^*) = \arg\min_{(x^*, y^*)} \hat{R}_{D + (x^*, y^*)}$  (3)
Step 4: The range over which learning takes place is based on the unlabeled sample pool M, so there is a fixed estimate P(x) for the unlabeled samples. Define the new labeled set obtained by adding the unlabeled sample (x*, y*) to the labeled sample set D as D* = D + (x*, y*). The distribution function of the new labeled sample set D* is unknown; in order to evaluate formula (2) effectively, the probability distribution of the labeled sample set is used to estimate the current unlabeled sample (x*, y*), and the empirical risk $\hat{R}_{D^*}$ of the current classifier is:
$\hat{R}_{D^*} = -\frac{1}{|M|} \sum_{x \in M} \sum_{y \in Y} \hat{P}_{D^*}(y \mid x) \log \hat{P}_{D^*}(y \mid x)$  (4)
Step 5: From $\hat{R}_{D^*}$, compute the expected risk value of the unlabeled sample x* over the cases y* ∈ Y. The true value of y* is unknown, so the known probability distribution P(x, y) is used to compute the estimated probability distribution values, and with the different class probabilities as weights the final expected value of $\hat{R}_{D^*}$ is:
$E\big[\hat{R}_{D^*}\big] = \sum_{y^* \in Y} \hat{P}_D(y^* \mid x^*)\, \hat{R}_{D + (x^*, y^*)}$  (5)
Step 6: Select the sample x_{U,max} with the highest information entropy from the unlabeled instance set M:
$x_{U,\max} = \arg\max_{x \in M} \big(-\sum_i P_D(y_i \mid x) \log P_D(y_i \mid x)\big)$  (6)
Step 7: Compute the information entropy of the samples against the labeled sample set D; according to the information entropy, select the Q samples with the highest uncertainty, compute the corresponding expected risk for each of the Q samples, and select the sample with the smallest expected risk value for manual labeling.
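As a worked illustration of formula (6) in the binary case: a pool sample whose posterior is (0.5, 0.5) has entropy $-(0.5\ln 0.5 + 0.5\ln 0.5) = \ln 2 \approx 0.693$, while one whose posterior is (0.9, 0.1) has entropy $-(0.9\ln 0.9 + 0.1\ln 0.1) \approx 0.325$; step 6 therefore prefers the former, more uncertain sample.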
Further, the specific steps of step 7 are as follows:
Step 1: Input the initial labeled data set D = {x_1, ..., x_l}, the unlabeled data set M = {x_{l+1}, ..., x_{l+u}}, the data labels y_1, ..., y_l, and the maximum number of cycles U_max;
Step 2: Output the conditional probability distribution $\hat{P}(y|x)$;
Step 3: Initialize the training model P(x, y) using the labeled data;
Step 4: While the maximum number of cycles has not been reached, compute the corresponding information entropy on the unlabeled training set M according to formula (6);
Step 5: According to the information entropy, select the Q samples with the largest information entropy;
Step 6: Label a selected sample with each class in the set Y in turn, add each labeled copy to the training set, retrain the model, and compute the corresponding loss function according to formula (2);
Step 7: Compute the corresponding empirical risk function according to formula (4);
Step 8: According to the different classes, compute the expected value of the expected risk function according to formula (5); if all Q samples have been computed, go to step 9, otherwise return to step 6;
Step 9: Select the sample that minimizes the expected value for manual labeling; if there is no sample minimizing the expected value, return to step 2.
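The selection procedure above can be expressed compactly. The following is a minimal Python sketch of steps 4 through 9, assuming scikit-learn-style classifiers, with the empirical risk of formula (4) estimated by the retrained model's mean self-entropy over the unlabeled pool, in the manner of the Roy and McCallum method cited in the background; all function and variable names are illustrative:

```python
import numpy as np
from sklearn.base import clone

def query_by_entropy_sampling(model, X_lab, y_lab, X_pool, Q=20):
    """Pick the pool index whose labeling minimizes the expected empirical risk (formulas 4-6)."""
    proba = np.clip(model.predict_proba(X_pool), 1e-12, 1.0)
    entropy = -np.sum(proba * np.log(proba), axis=1)           # formula (6)
    candidates = np.argsort(-entropy)[:Q]                      # Q most uncertain samples

    def pool_risk(clf, X):                                     # formula (4): mean self-entropy
        p = np.clip(clf.predict_proba(X), 1e-12, 1.0)
        return -np.mean(np.sum(p * np.log(p), axis=1))

    best_idx, best_expected = None, np.inf
    for i in candidates:
        expected = 0.0
        for j, y_star in enumerate(model.classes_):            # try every possible label y*
            retrained = clone(model).fit(
                np.vstack([X_lab, X_pool[i]]), np.append(y_lab, y_star))
            expected += proba[i, j] * pool_risk(retrained, X_pool)  # formula (5)
        if expected < best_expected:
            best_idx, best_expected = i, expected
    return best_idx                                            # sample to label manually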
The advantages of the present invention are as follows. In the statistical learning query method based on information entropy sampling estimation, the method uses a model trained on the labeled samples to compute the information entropy of each instance in the unlabeled pool, selects the several samples with the highest uncertainty, computes the expected empirical risk of the corresponding data distribution, and selects the sample that minimizes the expected empirical risk for labeling. It selects samples from the microscopic perspective of the individual sample and makes full use of the information content of the sample itself; fully combining the two criteria helps select samples that both carry high information content and minimize the expected loss. At the same time, the selection strategy effectively reduces the computational complexity of selection strategies based on statistical learning.
Description of the drawings
The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the flow chart of the specific selection procedure of step 7 in the statistical learning query method based on information entropy sampling estimation of the present invention.
Fig. 2 is the accuracy performance curve on the data set tic-tac-toe.
Fig. 3 is the accuracy performance curve on the data set transfusion.
Fig. 4 is the accuracy performance curve on the data set kr-vs-kp.
Fig. 5 is the accuracy performance curve on the data set diagnosis.
Fig. 6 is the accuracy performance curve on the data set breast-cancer.
Specific embodiments
The following embodiments help those skilled in the art understand the present invention more fully, but do not limit the present invention to the scope of the described embodiments.
Embodiment
The statistical learning query method based on information entropy sampling estimation of this embodiment comprises the following steps:
Step 1: Let a training example be x ∈ R^n and the label of an example be y ∈ Y = {y_1, y_2, ..., y_k}; the conditional probability distribution of a training example x is P(y|x), the labeled sample set D is drawn by independent and identically distributed sampling, and the joint probability distribution is P(x, y) = P(y|x)P(x). The model then produces a posterior probability estimate $\hat{P}(y|x)$ for an input sample x, so the expected risk $\hat{R}$ based on statistical learning is:
$\hat{R} = \int_x L\big(P(y \mid x), \hat{P}(y \mid x)\big)\, P(x)\, dx$  (1)
Step 2: The loss function L measures the difference between the true probability distribution P(x, y) of a sample (x, y) and the estimated posterior distribution $\hat{P}(x, y)$; the loss function L is:
$L = -\sum_{y \in Y} P(y \mid x) \log \hat{P}(y \mid x)$  (2)
Step 3: The optimization target of the expected risk $\hat{R}$ is to select the optimal unlabeled sample sequence k = {x_1, x_2, x_3, ..., x_k}, where k denotes the number sampled from the unlabeled samples; for each unlabeled sample (x*, y*) in the sample sequence k:
$(x^*, y^*) = \arg\min_{(x^*, y^*)} \hat{R}_{D + (x^*, y^*)}$  (3)
Step 4: The range over which learning takes place is based on the unlabeled sample pool M, so there is a fixed estimate P(x) for the unlabeled samples. Define the new labeled set obtained by adding the unlabeled sample (x*, y*) to the labeled sample set D as D* = D + (x*, y*). The distribution function of the new labeled sample set D* is unknown; in order to evaluate formula (2) effectively, the probability distribution of the labeled sample set is used to estimate the current unlabeled sample (x*, y*), and the empirical risk $\hat{R}_{D^*}$ of the current classifier is:
$\hat{R}_{D^*} = -\frac{1}{|M|} \sum_{x \in M} \sum_{y \in Y} \hat{P}_{D^*}(y \mid x) \log \hat{P}_{D^*}(y \mid x)$  (4)
Step 5: From $\hat{R}_{D^*}$, compute the expected risk value of the unlabeled sample x* over the cases y* ∈ Y. The true value of y* is unknown, so the known probability distribution P(x, y) is used to compute the estimated probability distribution values, and with the different class probabilities as weights the final expected value of $\hat{R}_{D^*}$ is:
$E\big[\hat{R}_{D^*}\big] = \sum_{y^* \in Y} \hat{P}_D(y^* \mid x^*)\, \hat{R}_{D + (x^*, y^*)}$  (5)
Step 6: Select the sample x_{U,max} with the highest information entropy from the unlabeled instance set M:
$x_{U,\max} = \arg\max_{x \in M} \big(-\sum_i P_D(y_i \mid x) \log P_D(y_i \mid x)\big)$  (6)
Step 7: Compute the information entropy of the samples against the labeled sample set D; according to the information entropy, select the Q samples with the highest uncertainty, compute the corresponding expected risk for each of the Q samples, and select the sample with the smallest expected risk value for manual labeling.
As an embodiment, the specific implementation of step 7, as shown in Fig. 1, proceeds as follows:
Step 1: Input the initial labeled data set D = {x_1, ..., x_l}, the unlabeled data set M = {x_{l+1}, ..., x_{l+u}}, the data labels y_1, ..., y_l, and the maximum number of cycles U_max;
Step 2: Output the conditional probability distribution $\hat{P}(y|x)$;
Step 3: Initialize the training model P(x, y) using the labeled data;
Step 4: While the maximum number of cycles has not been reached, compute the corresponding information entropy on the unlabeled training set M according to formula (6);
Step 5: According to the information entropy, select the Q samples with the largest information entropy;
Step 6: Label a selected sample with each class in the set Y in turn, add each labeled copy to the training set, retrain the model, and compute the corresponding loss function according to formula (2);
Step 7: Compute the corresponding empirical risk function according to formula (4);
Step 8: According to the different classes, compute the expected value of the expected risk function according to formula (5); if all Q samples have been computed, go to step 9, otherwise return to step 6;
Step 9: Select the sample that minimizes the expected value for manual labeling; if there is no sample minimizing the expected value, return to step 2.
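Under the same assumptions as the sketch given after the specific steps above, the embodiment's outer query loop could be driven as follows; X_init, y_init, X_pool, U_max, and oracle_label are illustrative stand-ins for the experimental setup and the human annotator:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_lab, y_lab = X_init.copy(), y_init.copy()
model = RandomForestClassifier().fit(X_lab, y_lab)        # step 3: initialize on labeled data
for _ in range(U_max):                                    # step 4: up to U_max query cycles
    idx = query_by_entropy_sampling(model, X_lab, y_lab, X_pool, Q=20)
    y_new = oracle_label(X_pool[idx])                     # step 9: manual labeling
    X_lab = np.vstack([X_lab, X_pool[idx]])
    y_lab = np.append(y_lab, y_new)
    X_pool = np.delete(X_pool, idx, axis=0)               # remove the queried sample from M
    model = RandomForestClassifier().fit(X_lab, y_lab)    # retrain on the enlarged labeled set
```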
In order to verify the effectiveness of the statistical active learning strategy based on information entropy sampling estimation, it is compared with a random sampling strategy on multiple data sets. The random sampling strategy randomly chooses several samples from the unlabeled instances and selects the sample with the smallest expected risk for manual labeling.
The experimental data come from the machine learning data set repository of the University of California, Irvine. The binary classification data sets tic-tac-toe, transfusion, kr-vs-kp, diagnosis, and breast-cancer are selected; these data sets are frequently used in research on active learning query strategies (Tang Ying-Peng and Huang Sheng-Jun, "Self-paced active learning: query the right thing at the right time", Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019). The specific data set descriptions are given in Table 1.
Table 1  The data sets used in the experiments
The experimental data are divided using stratified sampling: 50% of each data set is used for training data and 50% for test data, and 10% of the training data is taken out as the initial labeled data set used to establish the model. Each experiment is repeated 5 times at random with cross-validation, so each data set generates 5×2 groups of data, and the average over all runs is taken as the prediction result for that labeled-data point. The experiments use the sklearn toolkit; the classifiers are a random forest classifier and a logistic regression classifier, with the system default parameters.
The categorical attributes in the UCI data sets are encoded through the LabelEncoder class of sklearn, which converts each category attribute into a corresponding integer value. Selecting a subset from the unlabeled instances requires setting a hyperparameter C, which denotes the number of samples drawn from the unlabeled instance set; in this strategy C is set to 20. The evaluation index of the algorithms is ACCURACY, the ratio of true positive examples to the sum of true positive examples and false positive examples.
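A minimal sketch of this experimental protocol, assuming the raw attribute matrix and labels have already been loaded (the helper name prepare_experiment is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def prepare_experiment(X_raw, y, seed=0):
    """Stratified 50/50 train/test split with a 10% initial labeled subset."""
    # Encode each categorical attribute column as integers, as described above.
    X = np.column_stack([LabelEncoder().fit_transform(col) for col in X_raw.T])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    X_init, X_pool, y_init, y_pool = train_test_split(
        X_train, y_train, train_size=0.1, stratify=y_train, random_state=seed)
    return X_init, y_init, X_pool, y_pool, X_test, y_test

classifiers = {"random_forest": RandomForestClassifier(),
               "logistic_regression": LogisticRegression()}  # system default parameters
```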
Using the random forest as the classification model for the data sets, the performance of the two compared algorithms on the data sets tic-tac-toe, transfusion, kr-vs-kp, diagnosis, and breast-cancer as the number of labeled samples increases is shown in Fig. 2, Fig. 3, Fig. 4, Fig. 5, and Fig. 6.
In order to further investigate the effectiveness of the proposed strategy, a win/draw/loss analysis is done on the results of the two algorithms with labeled-data ratios of 20%, 40%, 60%, 80%, and 100%. The win/draw/loss analysis describes the differences between the algorithms on the same data set at different ratios. For example, when the labeled-data ratio is 20%, the mean classifier accuracy of the information entropy sampling strategy on the data set tic-tac-toe is denoted A_ie and the mean classifier accuracy of the random sampling strategy on tic-tac-toe is denoted A_r; if A_ie > A_r, then win = 1, if A_ie = A_r, then draw = 1, and if A_ie < A_r, then loss = 1. Table 2 shows the win/draw/loss comparison of the information entropy sampling strategy against the random strategy.
Table 2  Win/draw/loss analysis of the information entropy sampling strategy against the random strategy with labeled-data ratios of 20%, 40%, 60%, 80%, and 100%
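The win/draw/loss tallying described above can be sketched as follows; the accuracies in the usage example are illustrative values, not results from Table 2:

```python
def win_draw_loss(acc_entropy, acc_random):
    """Tally wins/draws/losses of the entropy strategy over paired mean accuracies."""
    win = sum(a > b for a, b in zip(acc_entropy, acc_random))
    draw = sum(a == b for a, b in zip(acc_entropy, acc_random))
    loss = sum(a < b for a, b in zip(acc_entropy, acc_random))
    return win, draw, loss

# Illustrative mean accuracies at labeled ratios 20%, 40%, 60%, 80%, 100%:
print(win_draw_loss([0.81, 0.85, 0.88, 0.90, 0.91],
                    [0.79, 0.85, 0.86, 0.89, 0.92]))  # -> (3, 1, 1)
```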
From Fig. 2 to Fig. 6 and the results shown in Table 2, the sampling strategy based on information entropy achieves better classification performance than the random strategy in most cases, which fully demonstrates the effectiveness of the information entropy based strategy. It also shows that, when subsampling based on statistical learning, selection oriented toward the information content of individual samples is better than random selection. As the number of labeled samples increases, the two active learning sampling strategies, sampling from their different perspectives, both improve the precision of the classification algorithm, which also illustrates the effectiveness of the active learning framework. There are also differences in behavior: on the data set diagnosis, the strategy based on information entropy quickly reaches good classification performance as labeled samples are added, whereas the random sampling strategy not only fails to reach good classification performance but even shows large fluctuations in classification performance, which shows that the strategy based on information entropy is more conducive to lifting the precision of the model's classifier. From the performance trend on the data set transfusion, as labeled samples are added the algorithm's performance grows quickly and converges to a stable classification level, while the random sampling strategy even shows performance decline after reaching good classification performance.
In order to fully study the performance changes of the different sampling strategies, other classifiers are also used to model the data sets, and the performance of the different sampling strategies is compared on the different classifiers; to save space, only the performance comparison with labeled-data ratios of 20%, 40%, 60%, 80%, and 100% is reported. Table 3 shows the performance comparison of the different sampling strategies on the different classifiers.
Table 3  Performance comparison of different sampling strategies on different classifiers; the comparison test is based on a paired t-test, and the better performance is shown in bold
The results shown in Table 3 indicate that, whether the random forest classifier or the logistic regression classifier is used, the proposed sampling estimation strategy based on information entropy achieves the best performance in most cases, and even where it is not optimal it is rarely much weaker than the best performance. This shows that, for different classifiers and different labeled-instance ratios, the proposed sampling strategy based on information entropy provides a stable performance improvement.
In addition, comparing the performance across the two different classifiers, we can also see that the strategy based on information entropy sampling estimation has more stable performance: as labeled samples increase, the classification performance rises steadily until it converges to the best level, whereas the random sampling strategy clearly exhibits strong randomness and instability of performance.
Experiments on common machine learning data sets show that this method can effectively select the instances that need manual labeling from the unlabeled instances.
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should appreciate that the present invention is not limited to the above embodiments; the above embodiments and the description only illustrate the principle of the present invention, and various changes and improvements may be made to the invention without departing from its spirit and scope, all of which fall within the claimed scope of the invention. The claimed scope of the invention is defined by the appended claims and their equivalents.
Claims (2)
1. A statistical learning query method based on information entropy sampling estimation, characterized in that the statistical learning query method comprises the following steps:
Step 1: let a training example be x ∈ R^n and the label of an example be y ∈ Y = {y_1, y_2, ..., y_k}; the conditional probability distribution of a training example x is P(y|x), the labeled sample set D is drawn by independent and identically distributed sampling, and the joint probability distribution is P(x, y) = P(y|x)P(x); the model then produces a posterior probability estimate $\hat{P}(y|x)$ for an input sample x, so the expected risk $\hat{R}$ based on statistical learning is:
$\hat{R} = \int_x L\big(P(y \mid x), \hat{P}(y \mid x)\big)\, P(x)\, dx$  (1)
Step 2: the loss function L measures the difference between the true probability distribution P(x, y) of a sample (x, y) and the estimated posterior distribution $\hat{P}(x, y)$; the loss function L is:
$L = -\sum_{y \in Y} P(y \mid x) \log \hat{P}(y \mid x)$  (2)
Step 3: the optimization target of the expected risk $\hat{R}$ is to select the optimal unlabeled sample sequence k = {x_1, x_2, x_3, ..., x_k}, where k denotes the number sampled from the unlabeled samples; for each unlabeled sample (x*, y*) in the sample sequence k:
$(x^*, y^*) = \arg\min_{(x^*, y^*)} \hat{R}_{D + (x^*, y^*)}$  (3)
Step 4: the range over which learning takes place is based on the unlabeled sample pool M, so there is a fixed estimate P(x) for the unlabeled samples; define the new labeled set obtained by adding the unlabeled sample (x*, y*) to the labeled sample set D as D* = D + (x*, y*); the distribution function of the new labeled sample set D* is unknown, and in order to evaluate formula (2) effectively, the probability distribution of the labeled sample set is used to estimate the current unlabeled sample (x*, y*); the empirical risk $\hat{R}_{D^*}$ of the current classifier is then:
$\hat{R}_{D^*} = -\frac{1}{|M|} \sum_{x \in M} \sum_{y \in Y} \hat{P}_{D^*}(y \mid x) \log \hat{P}_{D^*}(y \mid x)$  (4)
Step 5: from $\hat{R}_{D^*}$, compute the expected risk value of the unlabeled sample x* over the cases y* ∈ Y; the true value of y* is unknown, so the known probability distribution P(x, y) is used to compute the estimated probability distribution values, and with the different class probabilities as weights the final expected value of $\hat{R}_{D^*}$ is:
$E\big[\hat{R}_{D^*}\big] = \sum_{y^* \in Y} \hat{P}_D(y^* \mid x^*)\, \hat{R}_{D + (x^*, y^*)}$  (5)
Step 6: select the sample x_{U,max} with the highest information entropy from the unlabeled instance set M:
$x_{U,\max} = \arg\max_{x \in M} \big(-\sum_i P_D(y_i \mid x) \log P_D(y_i \mid x)\big)$  (6)
Step 7: compute the information entropy of the samples against the labeled sample set D; according to the information entropy, select the Q samples with the highest uncertainty, compute the corresponding expected risk for each of the Q samples, and select the sample with the smallest expected risk value for manual labeling.
2. The statistical learning query method based on information entropy sampling estimation according to claim 1, characterized in that the specific steps of step 7 are as follows:
Step 1: input the initial labeled data set D = {x_1, ..., x_l}, the unlabeled data set M = {x_{l+1}, ..., x_{l+u}}, the data labels y_1, ..., y_l, and the maximum number of cycles U_max;
Step 2: output the conditional probability distribution $\hat{P}(y|x)$;
Step 3: initialize the training model P(x, y) using the labeled data;
Step 4: while the maximum number of cycles has not been reached, compute the corresponding information entropy on the unlabeled training set M according to formula (6);
Step 5: according to the information entropy, select the Q samples with the largest information entropy;
Step 6: label a selected sample with each class in the set Y in turn, add each labeled copy to the training set, retrain the model, and compute the corresponding loss function according to formula (2);
Step 7: compute the corresponding empirical risk function according to formula (4);
Step 8: according to the different classes, compute the expected value of the expected risk function according to formula (5); if all Q samples have been computed, go to step 9, otherwise return to step 6;
Step 9: select the sample that minimizes the expected value for manual labeling; if there is no sample minimizing the expected value, return to step 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910319193.XA | 2019-04-19 | 2019-04-19 | A statistical learning query method based on information entropy sampling estimation
Publications (1)
Publication Number | Publication Date
---|---
CN110059752A | 2019-07-26
Family
ID=67319780
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910319193.XA (pending) | A statistical learning query method based on information entropy sampling estimation | 2019-04-19 | 2019-04-19
Country Status (1)
Country | Link |
---|---
CN | CN110059752A
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914061A (en) * | 2020-07-13 | 2020-11-10 | 上海乐言信息科技有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN111914061B (en) * | 2020-07-13 | 2021-04-16 | 上海乐言科技股份有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN114169470A (en) * | 2022-02-15 | 2022-03-11 | 南京航空航天大学 | Artificial intelligence learning method based on target model and sample double sampling |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190726 |