CN106874944A - A method for measuring the confidence of classification results based on Bagging and outliers - Google Patents
- Publication number
- CN106874944A CN106874944A CN201710054802.4A CN201710054802A CN106874944A CN 106874944 A CN106874944 A CN 106874944A CN 201710054802 A CN201710054802 A CN 201710054802A CN 106874944 A CN106874944 A CN 106874944A
- Authority
- CN
- China
- Prior art keywords
- data
- confidence
- classification results
- collection
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for measuring the confidence of classification results based on Bagging and outliers. First, one of logistic regression, support vector machine, and naive Bayes is used as the base classifier to classify the data whose confidence is to be measured; the classification probabilities for the different classes are computed to obtain the classification result set and the classification probability set of the data, and the classification result is obtained from the classification result set. Then, within the classification probability set, each class is treated as a point in space: the point corresponding to the predicted class serves as the outlier, while the points of the remaining classes form a cluster. Finally, using the Euclidean distance, the distance from each point in the cluster to the cluster centroid is compared with its distance to the outlier; if every point in the cluster is closer to the cluster centroid than to the outlier, the classification result is credible, otherwise it is not. The invention thereby avoids the influence of untrustworthy classification results on the training model when the model relearns.
Description
Technical field
The invention belongs to the technical field of classification-result confidence measurement, and more particularly relates to a method for measuring the confidence of classification results based on Bagging and outliers.
Background technology
Improving the accuracy of a model through the data whose confidence is to be measured is an essential part of online learning, and maintaining the accuracy of the learned data is therefore particularly important. A classification-result confidence measurement method is used after each classification to judge whether the result is credible, which is significant for maintaining the training set and for model retraining. Traditionally, the classification results of models such as logistic regression, SVM, and naive Bayes are not subjected to confidence measurement, so when the model relearns, the influence of learning untrustworthy classification results cannot be avoided.
The existing research foundation of Yan Yunyang, Zhu Quanyin, et al. includes: Yan Yunyang, Wu Qianyin, Du Jing, Zhou Jingbo, Liu Yi'an. Video flame detection based on color and flicker frequency features. Journal of Frontiers of Computer Science and Technology, 2014, 8(10):1271-1279; S Gao, J Yang, Y Yan. A novel multiphase active contour model for inhomogeneous image segmentation. Multimedia Tools and Applications, 2014, 72(3):2321-2337; S Gao, J Yang, Y Yan. A local modified Chan-Vese model for segmenting inhomogeneous multiphase images. International Journal of Imaging Systems and Technology, 2012, 22(2):103-113; Liu Jinling, Yan Yunyang. Context-based short-message text classification. Computer Engineering, 2011, 37(10):41-43; Yan Yunyang, Gao Shangbing, Guo Zhibo, et al. Automatic fire detection based on video images. Application Research of Computers, 2008, 25(4):1075-1078; Y Yan, Z Guo, J Yang. Fast Feature Value Searching for Face Detection. Computer and Information Science, 2008, 1(2):120-128; Zhu Quanyin, Pan Lu, Liu Wenru, et al. Web science and technology news classification and extraction algorithm. Journal of Huaiyin Institute of Technology, 2015, 24(5):18-24; Li Xiang, Zhu Quanyin. Collaborative filtering recommendation combining joint clustering and a shared rating matrix. Journal of Frontiers of Computer Science and Technology, 2014, 8(6):751-759; Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p:77-82; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p:282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6):1089-1093.
Patents applied for, published, or granted by Zhu Quanyin et al.: Zhu Quanyin, Hu Rongjing, Cao Suqun, Zhou Pei, et al. A commodity price classification method based on linear interpolation and adaptive windows, Chinese patent ZL 201110423015.5, 2015.07.01; Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rongjing, et al. A commodity price classification method based on binary data patching and disturbance factors, Chinese patent ZL 201110422274.6, 2013.01.02; Zhu Quanyin, Yin Yonghua, Yan Yunyang, Cao Suqun, et al. A data preprocessing method for multi-variety commodity price classification based on neural networks, Chinese patent ZL 201210325368.6; Li Xiang, Zhu Quanyin, Hu Ronglin, et al. An intelligent cold-chain logistics stowage recommendation method based on spectral clustering, Chinese patent publication No. CN105654267A, 2016.06.08; Cao Suqun, Zhu Quanyin, Zuo Xiaoming, Gao Shangbing, et al. A feature selection method for pattern classification, Chinese patent publication No. CN 103425994 A, 2013.12.04; Zhu Quanyin, Yan Yunyang, Li Xiang, Zhang Yongjun, et al. A science and technology information acquisition and push method based on text classification and image deep mining, Chinese patent publication No. CN 104035997 A, 2014.09.10; Zhu Quanyin, Xin Cheng, Li Xiang, Xu Kang, et al. A network behavior habit clustering method based on bidirectional verification of K-means and LDA, Chinese patent publication No. CN 106202480 A, 2016.12.07.
Bagging:
Bagging is a method for improving the accuracy of a learning algorithm. It constructs a series of classification functions and then combines them into a single classifier in some way. The main idea of Bagging is resampling: data are drawn independently and at random from the initial data set, and this process is repeated independently until many independent data sets have been produced. Given a weak learning algorithm, it is trained on each of the generated sample sets to obtain a sequence of classification functions; their results are then put to a vote, and the label receiving the most votes is the final result.
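The resample-train-vote procedure described above can be sketched in a few lines (an illustrative toy, not the patent's implementation: the `train_threshold` weak learner and the sample data are invented for the example):

```python
import random
from collections import Counter

def train_threshold(sample):
    """Toy weak learner: split at the midpoint of the two class means."""
    c0 = [x for x, y in sample if y == 0]
    c1 = [x for x, y in sample if y == 1]
    return (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2

def bagging_predict(data, x, n_models=25, seed=7):
    """Bagging: draw bootstrap replicates of the data, train one weak
    learner per replicate, and let the learners vote on the label of x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # resample with replacement
        while len({y for _, y in sample}) < 2:      # toy learner needs both classes
            sample = [rng.choice(data) for _ in data]
        votes.append(0 if x < train_threshold(sample) else 1)
    return Counter(votes).most_common(1)[0][0]      # majority vote

data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
print(bagging_predict(data, 0.2))   # 0
print(bagging_predict(data, 0.85))  # 1
```

Because each weak learner sees a different bootstrap replicate, the vote averages out the variance of any single learner.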
Outlier:
Outlier detection is a branch of data mining whose task is to identify observations whose characteristics differ markedly from those of the other data objects. Outlier detection is important in data mining because, when anomalies are caused by inherent variation in the data, analyzing them can reveal deeper, latent, and valuable information. Outlier detection is therefore a meaningful research direction.
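To make the notion concrete, a minimal distance-based outlier rule can be written as follows (the centroid-distance threshold and the sample points are invented for illustration, not taken from the patent):

```python
import math

def find_outliers(points, k=2.0):
    """Flag points whose distance to the centroid exceeds k times
    the mean distance to the centroid."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    dists = [math.dist(p, centroid) for p in points]
    mean_d = sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d > k * mean_d]

pts = [(1, 1), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8, 9)]
print(find_outliers(pts))  # [(8, 9)]
```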
Logistic regression:
Logistic regression, also known as logistic regression analysis, is a generalized linear regression analysis model. Unlike ordinary linear regression, logistic regression is a nonlinear model, and the parameter-estimation method usually used is maximum-likelihood estimation. It is commonly applied in data mining, automatic disease diagnosis, economic classification, and other fields. Logistic regression can build regression models on a categorical dependent variable together with categorical, continuous, or mixed independent variables; there is a complete set of mature standards for testing the regression model and its coefficients, and results are given in the form of event occurrence probabilities.
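A minimal sketch of maximum-likelihood fitting of a one-variable logistic regression by gradient ascent (the data and learning-rate settings are invented for the example; a real implementation would use an established library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient ascent on the
    log-likelihood, i.e., maximum-likelihood estimation."""
    w = b = 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w * x + b)   # gradient of the log-likelihood
            gw += err * x
            gb += err
        w += lr * gw / len(xs)
        b += lr * gb / len(xs)
    return w, b

xs = [0.2, 0.5, 1.0, 1.8, 2.5, 3.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(round(sigmoid(w * 0.3 + b), 2))  # low probability of class 1
print(round(sigmoid(w * 2.8 + b), 2))  # high probability of class 1
```

The fitted probabilities are exactly the event occurrence probabilities mentioned above.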
Support vector machine:
The support vector machine is an optimality criterion for designing linear classifiers, proposed by Vapnik et al. on the basis of many years of research in statistical learning theory. Its principle starts from the linearly separable case and is then extended to the linearly inseparable case, and even to the use of nonlinear functions; this classifier is called the support vector machine (SVM).
Naive Bayes classifier:
The naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem under an independence assumption; for an independent-feature model it describes the underlying probabilistic model quite accurately. The basis of Bayesian classification is probabilistic inference: completing reasoning and decision tasks when the presence of various conditions is uncertain and only their occurrence probabilities are known. Probabilistic inference is the counterpart of deterministic inference, and the naive Bayes classifier rests on the independence assumption, i.e., each feature of a sample is assumed to be uncorrelated with all other features.
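A minimal counting-based naive Bayes classifier illustrating the independence assumption (no smoothing, and the toy weather data are invented for the example):

```python
from collections import Counter, defaultdict

def nb_fit(rows, labels):
    """Estimate p(y) and p(feature_i = v | y) by counting; the product of
    the per-feature terms is the naive independence assumption."""
    prior = Counter(labels)
    cond = defaultdict(Counter)          # (feature index, label) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return prior, cond

def nb_predict(model, row):
    prior, cond = model
    total = sum(prior.values())
    best, best_p = None, -1.0
    for y, ny in prior.items():
        p = ny / total                   # prior p(y)
        for i, v in enumerate(row):
            p *= cond[(i, y)][v] / ny    # p(x_i = v | y), assumed independent
        if p > best_p:
            best, best_p = y, p
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = nb_fit(rows, labels)
print(nb_predict(model, ("rain", "mild")))  # yes
```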
Euclidean distance:
The Euclidean metric, also called the Euclidean distance, is a commonly used distance definition. It is the true distance between two points in m-dimensional space, or the natural length of a vector (the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is simply the actual distance between two points.
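In code, the m-dimensional Euclidean distance is simply:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points in m-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```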
When logistic regression, SVM, and naive Bayes classify data whose confidence is to be measured, the data and classification results are added directly to the training set. This approach cannot prevent untrustworthy data and classification results from entering the trusted data set, which reduces the accuracy and stability of the model. To make better use of the above algorithms and to avoid the influence on the model of classified data being added to the trusted data set, a method for measuring the confidence of classification results is needed, so that models such as logistic regression, SVM, and naive Bayes can avoid the influence of learning untrustworthy classification results on the classification model.
Summary of the invention
Purpose of the invention: To address the problems in the prior art, the present invention combines Bagging with outlier analysis to measure the confidence of the classification results of models such as logistic regression, support vector machine, and naive Bayes, thereby preventing these models, when the training data are expanded, from being influenced in the training model by the adoption of untrustworthy classification data. To this end, the present invention proposes a method for measuring the confidence of classification results based on Bagging and outliers.
Technical scheme: To solve the above technical problems, the method provided by the present invention for measuring the confidence of classification results based on Bagging and outliers comprises the following steps:
Step 1: Apply the Bagging ensemble learning method to the existing trusted data set, using one of logistic regression, support vector machine, and naive Bayes as the base classifier, to obtain the classification model set of the base classifier.
Step 2: Use the classification model set of the base classifier obtained in step 1 to classify the data whose confidence is to be measured, and compute the classification probability for each class, obtaining the classification result set and the classification probability set of the data; then tally the classification result set to obtain the classification result of the data.
Step 3: Use outlier analysis to measure the confidence of the classification result, obtaining the trusted and untrusted data among the data whose confidence is to be measured, and add the data satisfying the confidence condition to the existing trusted data set.
Further, the classification model set of the base classifier in step 1 is obtained as follows:
Step 1.1: Define the features and class attributes of the existing trusted data set.
Step 1.2: Select one of logistic regression, support vector machine, and naive Bayes as the base classifier Function.
Step 1.3: Apply the Bagging ensemble learning method to the trusted data set defined in step 1.1, with the Function selected in step 1.2 as the base classifier, to obtain the classification model set of Function.
Further, the classification result of the data whose confidence is to be measured in step 2 is obtained as follows:
Step 2.1: Classify the data whose confidence is to be measured and compute the classification probability for each class, obtaining the classification result set Y and the classification probability set Cf of the data.
Step 2.2: Count the occurrences of each class in the classification result set Y of step 2.1 to obtain the classification result py of the data.
Further, the confidence of the classification result is measured in step 3 using outlier analysis as follows:
Step 3.1: The point satisfying Point = Cf_py is the outlier; take Cf_py out of the classification probability set Cf of the data whose confidence is to be measured and delete it from Cf, obtaining the matrix P.
Step 3.2: Traverse each class in the matrix P and compute the centroid of P, excluding the current class:
M_Num = (1 / (X - 2)) * Σ_{Loop ≠ Num} P_Loop
where P_Loop is the Loop-th class of the classification probability set, Num is the class currently being computed, and X is the number of classes.
Step 3.3: Traverse each class in the matrix P and compute its distance to the centroid and its distance to the outlier. The distance to the centroid is
d_{Num,1} = ||P_Num − M_Num||
and the distance to the outlier is
d_{Num,2} = α · ||P_Num − Point||
where P_Num is the Num-th class of the classification probability set, M_Num is the centroid corresponding to class Num, ||·|| denotes the Euclidean distance, and α is a user-defined value.
Step 3.4: After step 3.3, if d_{Num,2} > d_{Num,1} holds for every class, the data whose confidence is being measured are trusted and are added to the existing trusted data set Train; otherwise the data are untrusted and are not added to the existing trusted data set Train.
Compared with the prior art, the advantage of the invention is:
By combining Bagging with outlier analysis, the method can effectively measure the confidence of the classification results of models such as logistic regression, support vector machine, and naive Bayes, thereby preventing untrustworthy classification results from influencing the training model when the model relearns. In addition, the present invention creatively proposes a method for measuring the confidence of classification results that expands the trusted data in the existing trusted data set and thus improves the effectiveness of the learning model.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is the flow chart of Bagging model training in Fig. 1;
Fig. 3 is the flow chart of classifying the data whose confidence is to be measured in Fig. 1;
Fig. 4 is the flow chart of classification-result confidence measurement in Fig. 1.
Specific embodiments
The present invention is further elucidated below with reference to the accompanying drawings and specific embodiments.
The technical scheme of the invention measures the confidence of the classification results of models such as logistic regression, support vector machine, and naive Bayes. First, the Bagging ensemble learning method is applied, with one of logistic regression, support vector machine, and naive Bayes as the base classifier, to classify the data whose confidence is to be measured; the classification probabilities for the different classes are computed, yielding the classification result set and the classification probability set of the data, and the classification result is obtained from the classification result set. Next, within the classification probability set, each class is treated as a point in space: the point of the predicted class serves as the outlier, and the points of the remaining classes form a cluster. Finally, using the Euclidean distance, the distance from each point in the cluster to the cluster centroid is compared with its distance to the outlier; if every point in the cluster is closer to the centroid than to the outlier, the classification result is credible, otherwise it is not. The confidence of the classification result is thus measured.
Specifically, the present invention comprises the following steps:
Step 1: Apply the Bagging ensemble learning method to the existing trusted data set, using one of logistic regression, support vector machine, and naive Bayes as the base classifier, to obtain the classification model set of the base classifier, as shown in Fig. 2.
Step 1.1: Let the existing trusted data set with X classes be Train = {T1, T2, T3, ..., Tn}, where n is the number of records; the feature set of a record in Train is Ti = {a1, a2, a3, ..., afd}, where aj is the j-th feature of Ti and fd is the number of features, with i ∈ [1, n] and j ∈ [1, fd].
Step 1.2: Select one of logistic regression, support vector machine, and naive Bayes as the base classifier Function, and let the number of Function models be N.
Step 1.3: Let Models be the classification model set of Function, initialized to the empty set.
Step 1.4: Define the loop variable q with initial value 1.
Step 1.5: While q <= N, perform step 1.6; otherwise perform step 1.10.
Step 1.6: Draw E records at random with replacement from the trusted data set Train of step 1.1 as a sample, i.e., Sub = {T1, T2, T3, ..., TE}.
Step 1.7: Train Function on Sub to obtain the trained classification model Lq.
Step 1.8: Models = Models ∪ {Lq}.
Step 1.9: q = q + 1; return to step 1.5.
Step 1.10: Obtain the classification model set of Function, Models = {L1, L2, L3, ..., LN}.
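Steps 1.3 to 1.10 above can be sketched as the following loop (a hedged illustration: `function_fit` stands in for whichever base learner is selected, and the `toy_fit` learner and data are invented for the example):

```python
import random

def train_model_set(train, function_fit, n_models, sample_size, seed=1):
    """Draw `sample_size` records with replacement (the set Sub), train the
    base classifier on each replicate, and collect the resulting models
    L_1..L_N into Models."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sub = [rng.choice(train) for _ in range(sample_size)]  # bootstrap sample
        models.append(function_fit(sub))
    return models

# Toy base learner: remember the majority label of its bootstrap sample.
def toy_fit(sub):
    labels = [y for _, y in sub]
    return max(set(labels), key=labels.count)

train = [(0.1, "a"), (0.2, "a"), (0.8, "b"), (0.9, "a")]
models = train_model_set(train, toy_fit, n_models=5, sample_size=4)
print(len(models))  # 5
```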
Step 2: Use the classification model set of the base classifier to classify the data whose confidence is to be measured, compute the classification probability for each class, obtain the classification result set and the classification probability set of the data, then tally the classification result set to obtain the classification result, as shown in Fig. 3.
Step 2.1: Let the feature set of the data whose confidence is to be measured be Test = {b1, b2, b3, ..., bgd}, where bk is the k-th feature of Test and gd is the number of features of Test.
Step 2.2: Classify Test with Models, obtaining the classification result set Y = {y1, y2, y3, ..., yN} and the classification probability set Cf = {C1, C2, C3, ..., CX}, where ys is the classification result of Test under the s-th base classifier Function model; Cu is the set of probabilities that the base classifier Function models assign to the u-th class, Cu = {pr1, pr2, pr3, ..., prN}, where prh is the classification probability from the h-th base classifier Function model, with s, h ∈ [1, N] and u ∈ [1, X].
Step 2.3: Tally the classification result set Y of step 2.2: let M record the count of each class in Y, and select the class with the largest count in M as the classification result py of the data.
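The tally in step 2.3 is an ordinary majority vote over the result set Y; for example:

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label in the result set Y."""
    return Counter(labels).most_common(1)[0][0]

y = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(y))  # cat
```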
Step 3: Use outlier analysis to measure the confidence of the classification result, obtaining the trusted and untrusted data among the data whose confidence is to be measured, and add the data satisfying the confidence condition to the existing trusted data set, as shown in Fig. 4.
Step 3.1: The point satisfying Point = Cf_py is the outlier; take Cf_py out of the classification probability set Cf and remove it from Cf, obtaining P = {C1, C2, C3, ..., CX-1}.
Step 3.2: Define the loop variable Num with initial value 1, used to traverse the rows of the matrix P.
Step 3.3: While Num <= X - 1, perform step 3.4; otherwise perform step 3.10.
Step 3.4: Compute the centroid M of the classification probability set P, excluding P_Num: M = (1 / (X - 2)) * Σ_{Loop ≠ Num} P_Loop.
Step 3.5: Compute the Euclidean distance between P_Num and M as d1 = ||P_Num − M||, and the Euclidean distance between P_Num and Point as d2 = α · ||P_Num − Point||, where α is assigned the value 0.5.
Step 3.6: If d1 < d2, perform step 3.7; otherwise perform step 3.8.
Step 3.7: Num = Num + 1; return to step 3.3.
Step 3.8: The data whose confidence is being measured are untrusted, and Train = Train.
Step 3.10: The data whose confidence is being measured are trusted and are added to the existing trusted data set Train, i.e., Train = Train ∪ {Test, py}.
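Steps 3.1 to 3.10 can be sketched as follows (a hedged illustration: `cf` holds one probability vector per class across the N base-classifier models, at least three classes are assumed so the remaining cluster is non-empty, and the example matrices are invented):

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def is_trusted(cf, py, alpha=0.5):
    """Outlier-based confidence check: the predicted class's probability
    vector plays the outlier role; every remaining class must lie closer
    to the cluster centroid than (alpha times) its distance to the outlier."""
    point = cf[py]                                      # candidate outlier
    p = [row for i, row in enumerate(cf) if i != py]    # remaining cluster
    for num, row in enumerate(p):
        rest = [r for j, r in enumerate(p) if j != num]
        centroid = [sum(col) / len(rest) for col in zip(*rest)]
        d1 = euclid(row, centroid)       # distance inside the cluster
        d2 = alpha * euclid(row, point)  # scaled distance to the outlier
        if d1 >= d2:
            return False                 # result is not credible
    return True

# Confident prediction: class 0 probabilities far from the other classes.
cf_good = [[0.9, 0.85, 0.9], [0.05, 0.08, 0.05], [0.05, 0.07, 0.05]]
print(is_trusted(cf_good, 0))  # True
# Ambiguous prediction: all classes receive similar probabilities.
cf_bad = [[0.36, 0.34, 0.35], [0.33, 0.33, 0.33], [0.31, 0.33, 0.32]]
print(is_trusted(cf_bad, 0))   # False
```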
In this method, the Bagging ensemble learning method is used, with one of logistic regression, support vector machine, and naive Bayes as the base classifier, to train on the trusted data and obtain the classification probability set of the data whose confidence is to be measured. Within the classification probability set, each class is a point in space: the point of the predicted class serves as the outlier, the points of the remaining classes form a cluster, and the confidence of the classification result is judged by the Euclidean distance.
Step 1.1 provides the initial data required for model training; steps 1.2 to 1.10 train on the data with the Bagging ensemble learning method, with logistic regression, support vector machine, or naive Bayes as the base classifier; steps 2.1 to 2.3 classify the data whose confidence is to be measured and compute the probability of each class, obtaining the classification result set and the classification probability set of the data; steps 3.1 to 3.10 compute the confidence measure of the classification result.
To better illustrate the effectiveness of the method, classification experiments were run on an existing web-page classification data set and on the public Car Evaluation and Letter Recognition data sets from the UCI repository, classifying with a logistic regression model, an SVM model, and a naive Bayes model and measuring the confidence of the classification results.
The web-page classification data contain 4553 records, with the title, keywords, and description fields of the web pages as features; 70% of the samples were used as the training set and 30% as the test set. Classification with the logistic regression model achieved 90.64% accuracy with 128 misclassified records; with the confidence measurement of the classification results, 1092 records (80% of the original test set) could be selected, and the accuracy on the selected subset was 98.07%. Classification with the naive Bayes model achieved 88.1% accuracy with 162 misclassified records; with the confidence measurement, 1012 records (74.1% of the original test set) could be selected, with a subset accuracy of 96.93%. Classification with the SVM model achieved 88.64% accuracy with 155 misclassified records; with the confidence measurement, 1004 records (73.5% of the original test set) could be selected, with a subset accuracy of 94.5%.
Among the data published by UCI, the Car Evaluation data set was used; it contains 1728 records with 6 features. 70% of the samples served as the training set and 30% as the test set. Classification with the logistic regression model achieved 81.3% accuracy with 96 misclassified records; with the confidence measurement, 407 records (78.6% of the original test set) could be selected, with a subset accuracy of 98.07%. Classification with the naive Bayes model achieved 70% accuracy with 155 misclassified records; with the confidence measurement, 429 records (82.8% of the original test set) could be selected, with a subset accuracy of 78.3%. Classification with the SVM model achieved 94.8% accuracy with 27 misclassified records; with the confidence measurement, 496 records (95.8% of the original test set) could be selected, with a subset accuracy of 97.8%.
From the Letter Recognition data set published by UCI, which contains 20000 records with 16 features, 70% of the samples served as the training set and 30% as the test set. Classification with the logistic regression model achieved 71.3% accuracy with 1722 misclassified records; with the confidence measurement, 2902 records (48.37% of the original test set) could be selected, with a subset accuracy of 91.42%. Classification with the naive Bayes model achieved 54.78% accuracy with 2713 misclassified records; with the confidence measurement, 2362 records (39.37% of the original test set) could be selected, with a subset accuracy of 79.17%. Classification with the SVM model achieved 96.87% accuracy with 187 misclassified records; with the confidence measurement, 5821 records (97% of the original test set) could be selected, with a subset accuracy of 98.2%.
Besides logistic regression, support vector machine, and naive Bayes, confidence measurement can also be applied to the classification results of other models that support class-probability output, such as gradient-boosted decision trees and KNN. On the Car Evaluation data set, confidence measurement of the classification results of the gradient-boosted decision tree and KNN models, whose accuracies were 98.5% and 91.71% respectively, selected 499 records (96.3% of the original test set) and 415 records (80% of the original test set), with subset accuracies of 99.8% and 99%.
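The subset accuracies reported above are computed by keeping only the records whose classification passed the confidence measurement; for instance (the example arrays are invented for illustration):

```python
def subset_accuracy(preds, truths, trusted):
    """Size and accuracy of the selected subset: the records whose
    classification result passed the confidence measurement."""
    kept = [(p, t) for p, t, ok in zip(preds, truths, trusted) if ok]
    correct = sum(p == t for p, t in kept)
    return len(kept), correct / len(kept)

preds   = ["a", "b", "a", "b", "a"]
truths  = ["a", "b", "b", "b", "a"]
trusted = [True, True, False, True, True]
print(subset_accuracy(preds, truths, trusted))  # (4, 1.0)
```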
The present invention can be combined with a computer system to perform the measurement of classification-result confidence automatically.
The above describes only embodiments of the method for measuring the confidence of classification results based on Bagging and outliers proposed by the present invention, and is not intended to limit the invention. Besides measuring the confidence of the classification results of models such as logistic regression, SVM, and naive Bayes, the method can also be applied to models that support class-probability output, such as gradient-boosted decision trees (GBDT), KNN, and BP neural networks. All equivalent substitutions made within the principle of the invention shall fall within the scope of protection of the invention. Matters not elaborated in the present invention belong to the prior art known to those skilled in the art.
Claims (4)
1. A method for measuring the confidence of classification results based on Bagging and outliers, characterized by comprising the following steps:
Step 1: apply the Bagging ensemble-learning method to an existing trusted data set, using one of Logistic regression, support vector machine, or naive Bayes as the base classifier, to obtain the classification-model set of the base classifier;
Step 2: use the classification-model set obtained in Step 1 to classify the data whose confidence is to be measured and to compute its class probability for each class, obtaining the classification-result set and the class-probability set of the data to be measured; then count the classification-result set to obtain the classification result of the data to be measured;
Step 3: use an outlier-analysis method to measure the confidence of the classification result of the data to be measured, dividing the data into trusted data and untrusted data, and add the data that satisfy the confidence condition to the existing trusted data set.
2. The method for measuring the confidence of classification results based on Bagging and outliers according to claim 1, characterized in that the specific method for obtaining the classification-model set of the base classifier in Step 1 is:
Step 1.1: define the features and the class attribute of the existing trusted data set;
Step 1.2: select one of Logistic regression, support vector machine, or naive Bayes as the base-classifier function;
Step 1.3: apply the Bagging ensemble-learning method to the existing trusted data set defined in Step 1.1, with the function selected in Step 1.2 as the base classifier, obtaining the classification-model set of that function.
3. The method for measuring the confidence of classification results based on Bagging and outliers according to claim 1, characterized in that the specific method for obtaining the classification result of the data to be measured in Step 2 is:
Step 2.1: classify the data whose confidence is to be measured and compute its class probability for each class, obtaining the classification-result set Y and the class-probability set Cf of the data to be measured;
Step 2.2: count the occurrences of each class in the classification-result set Y of Step 2.1, obtaining the classification result py of the data to be measured.
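A minimal sketch of Steps 2.1–2.2, assuming a trained model set is available; each "model" here is a stand-in function returning hypothetical class probabilities rather than a real fitted classifier:

```python
from collections import Counter

def classify_with_model_set(model_set, x):
    """Step 2.1: collect each model's prediction and class probabilities.
    Step 2.2: count the votes in Y to get the classification result py."""
    Y = []   # classification-result set
    Cf = []  # class-probability set (one mapping per model)
    for model in model_set:
        probs = model(x)                     # e.g. {'A': 0.7, 'B': 0.3}
        Y.append(max(probs, key=probs.get))  # this model's predicted class
        Cf.append(probs)
    py = Counter(Y).most_common(1)[0][0]     # majority class
    return Y, Cf, py

# Hypothetical model set: three already-trained base classifiers.
model_set = [
    lambda x: {'A': 0.8, 'B': 0.2},
    lambda x: {'A': 0.6, 'B': 0.4},
    lambda x: {'A': 0.3, 'B': 0.7},
]
Y, Cf, py = classify_with_model_set(model_set, x=0.4)
print(py)  # → A
```

Two of the three models vote for class 'A', so the majority count yields py = 'A' even though one model disagrees; the per-model probability set Cf is kept for the outlier analysis of Step 3.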
4. The method for measuring the confidence of classification results based on Bagging and outliers according to claim 1, characterized in that the specific method by which Step 3 measures the confidence of the classification result of the data to be measured using outlier analysis is:
Step 3.1: the point satisfying Point = Cf_py is taken as the outlier; take Cf_py out of the class-probability set Cf of the data to be measured and delete Cf_py from Cf, obtaining the matrix P;
Step 3.2: traverse each class in the matrix P and compute the centroid of P, with the formula:
M_Num = (1 / (X - 1)) · Σ_{Loop=1, Loop≠Num}^{X} P_Loop
where P_Loop is the Loop-th class in the class-probability set, Num is the class currently being computed, and X is the number of classes;
Step 3.3: traverse each class in the matrix P and compute its distance to the centroid and its distance to the outlier; the centroid-distance formula is:
d_{Num,1} = ‖P_Num − M_Num‖
and the outlier-distance formula is:
d_{Num,2} = α · ‖P_Num − Point‖
where P_Num is the Num-th class in the class-probability set, M_Num is the centroid corresponding to class Num, and α is a user-defined value;
Step 3.4: after Step 3.3, if d_{Num,2} > d_{Num,1}, the data to be measured is trusted data and is added to the existing trusted data set Train; otherwise the data to be measured is untrusted data and is not added to the trusted data set Train.
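One plausible reading of Steps 3.1–3.4 in code. The formula images did not survive in the published text, so the leave-one-out centroid and Euclidean distances below are reconstructions from the variable definitions, and the three-class probability values are invented for illustration:

```python
def confidence_check(Cf, py, alpha=1.0):
    """Steps 3.1-3.4 (sketch): treat the predicted class py's probabilities
    as the outlier Point, then require every remaining class to lie closer
    to the centroid of the others than (alpha times) to the outlier."""
    # Step 3.1: take Cf_py out of Cf; matrix P keeps the other classes.
    point = [row[py] for row in Cf]
    classes = [c for c in Cf[0] if c != py]
    P = {c: [row[c] for row in Cf] for c in classes}

    trusted = True
    for num in classes:
        # Step 3.2: leave-one-out centroid over the other columns of P.
        others = [P[c] for c in classes if c != num]
        M = [sum(col) / len(others) for col in zip(*others)]
        # Step 3.3: Euclidean distance to the centroid and to the outlier.
        d1 = sum((p - m) ** 2 for p, m in zip(P[num], M)) ** 0.5
        d2 = alpha * sum((p - o) ** 2 for p, o in zip(P[num], point)) ** 0.5
        # Step 3.4: trusted only if py's probabilities stand apart (d2 > d1).
        trusted = trusted and d2 > d1
    return trusted

# Confident ensemble: class 'A' dominates every model's probabilities,
# so the 'A' column is far from the clustered remaining classes.
Cf = [{'A': 0.8, 'B': 0.10, 'C': 0.10},
      {'A': 0.7, 'B': 0.20, 'C': 0.10},
      {'A': 0.9, 'B': 0.05, 'C': 0.05}]
print(confidence_check(Cf, 'A'))  # → True
```

The intuition: when the ensemble is confident, the predicted class's probabilities (the "outlier") sit far from the remaining classes, which cluster near their common centroid, so d2 exceeds d1; when the models disagree, the predicted class blends into the rest and the check fails.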
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710054802.4A CN106874944A (en) | 2017-01-24 | 2017-01-24 | A kind of measure of the classification results confidence level based on Bagging and outlier |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106874944A true CN106874944A (en) | 2017-06-20 |
Family
ID=59159071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710054802.4A Pending CN106874944A (en) | 2017-01-24 | 2017-01-24 | A kind of measure of the classification results confidence level based on Bagging and outlier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106874944A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110619206A (en) * | 2019-08-15 | 2019-12-27 | 中国平安财产保险股份有限公司 | Operation and maintenance risk control method, system, equipment and computer readable storage medium |
CN110619206B (en) * | 2019-08-15 | 2024-04-02 | 中国平安财产保险股份有限公司 | Operation and maintenance risk control method, system, equipment and computer readable storage medium |
CN110990455A (en) * | 2019-11-29 | 2020-04-10 | 杭州数梦工场科技有限公司 | Method and system for identifying house properties by big data |
CN110990455B (en) * | 2019-11-29 | 2023-10-17 | 杭州数梦工场科技有限公司 | Method and system for recognizing house property by big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Guo et al. | Supplier selection based on hierarchical potential support vector machine | |
CN103632168B (en) | Classifier integration method for machine learning | |
CN103617235B (en) | Method and system for network navy account number identification based on particle swarm optimization | |
CN106709800A (en) | Community partitioning method and device based on characteristic matching network | |
CN105447505B (en) | A kind of multi-level important email detection method | |
Okori et al. | Machine learning classification technique for famine prediction | |
CN108009690B (en) | Ground bus stealing group automatic detection method based on modularity optimization | |
CN103116588A (en) | Method and system for personalized recommendation | |
CN108960833A (en) | A kind of abnormal transaction identification method based on isomery finance feature, equipment and storage medium | |
CN109754258B (en) | Online transaction fraud detection method based on individual behavior modeling | |
Chen et al. | Research on location fusion of spatial geological disaster based on fuzzy SVM | |
Savage et al. | Detection of money laundering groups: Supervised learning on small networks | |
CN105022754A (en) | Social network based object classification method and apparatus | |
CN108021651A (en) | Network public opinion risk assessment method and device | |
CN109408641A (en) | It is a kind of based on have supervision topic model file classification method and system | |
CN108647691A (en) | A kind of image classification method based on click feature prediction | |
CN110532429B (en) | Online user group classification method and device based on clustering and association rules | |
CN104850868A (en) | Customer segmentation method based on k-means and neural network cluster | |
CN109739844A (en) | Data classification method based on decaying weight | |
CN106250909A (en) | A kind of based on the image classification method improving visual word bag model | |
CN109359551A (en) | A kind of nude picture detection method and system based on machine learning | |
CN105574213A (en) | Microblog recommendation method and device based on data mining technology | |
CN116109898A (en) | Generalized zero sample learning method based on bidirectional countermeasure training and relation measurement constraint | |
Zhu et al. | NUS: Noisy-sample-removed undersampling scheme for imbalanced classification and application to credit card fraud detection | |
CN108491719A (en) | A kind of Android malware detection methods improving NB Algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170620 |