CN106874944A - Method for measuring the confidence of classification results based on Bagging and outliers - Google Patents

Method for measuring the confidence of classification results based on Bagging and outliers

Info

Publication number
CN106874944A
CN106874944A (application CN201710054802.4A)
Authority
CN
China
Prior art keywords
data
confidence
classification results
collection
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710054802.4A
Other languages
Chinese (zh)
Inventor
严云洋
瞿学新
朱全银
于柿民
赵阳
唐海波
潘舒新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201710054802.4A priority Critical patent/CN106874944A/en
Publication of CN106874944A publication Critical patent/CN106874944A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 — Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for measuring the confidence of classification results based on Bagging and outliers. First, one of Logistic regression, support vector machine and naive Bayes is used as the base classifier in a Bagging ensemble to classify the data whose confidence is to be measured; the class probabilities for the different classes are computed to obtain the classification result set and the class probability set of the data, and the classification result is obtained from the result set. Next, in the class probability set, each class is treated as a point in space: the point corresponding to the predicted class is taken as an outlier, and the points of the remaining classes form a cluster. Finally, using the Euclidean distance, each point's distance to the cluster centroid is compared with its distance to the outlier; if every point in the cluster is closer to the cluster centroid than to the outlier, the classification result is trustworthy, otherwise it is untrustworthy. The invention thereby avoids the influence of untrustworthy classification results on the training model when the model is retrained.

Description

Method for measuring the confidence of classification results based on Bagging and outliers
Technical field
The invention belongs to the technical field of classification-result confidence measurement, and more particularly relates to a method for measuring the confidence of classification results based on Bagging and outliers.
Background technology
Improving model accuracy on the data to be measured is an important part of online learning, and maintaining the accuracy of the learned data is therefore particularly important. A method for measuring the confidence of classification results judges, after each classification, whether the classification result is trustworthy or untrustworthy; this is of great significance for maintaining the training set and for model retraining. Traditionally, the classification results of models such as Logistic regression, SVM and naive Bayes are not subjected to confidence measurement, so when such a model is retrained it cannot avoid learning from untrustworthy classification results.
The existing research foundation of Yan Yunyang, Zhu Quanyin et al. includes: Yan Yunyang, Wu Qianyin, Du Jing, Zhou Jingbo, et al. Video flame detection based on color and flicker frequency features. Journal of Frontiers of Computer Science and Technology, 2014, 8(10): 1271-1279; S Gao, J Yang, Y Yan. A novel multiphase active contour model for inhomogeneous image segmentation. Multimedia Tools and Applications, 2014, 72(3): 2321-2337; S Gao, J Yang, Y Yan. A local modified Chan-Vese model for segmenting inhomogeneous multiphase images. International Journal of Imaging Systems and Technology, 2012, 22(2): 103-113; Liu Jinling, Yan Yunyang. Context-based short message text classification. Computer Engineering, 2011, 37(10): 41-43; Yan Yunyang, Gao Shangbing, Guo Zhibo, et al. Automatic fire detection based on video images. Application Research of Computers, 2008, 25(4): 1075-1078; Y Yan, Z Guo, J Yang. Fast Feature Value Searching for Face Detection. Computer and Information Science, 2008, 1(2): 120-128; Zhu Quanyin, Pan Lu, Liu Wenru, et al. Classification and extraction algorithm for Web science and technology news [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24; Li Xiang, Zhu Quanyin. Collaborative filtering recommendation combining joint clustering and rating-matrix sharing [J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(6): 751-759; Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, pp. 77-82; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, pp. 282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, 6(6): 1089-1093. Related patents filed, published or granted by Zhu Quanyin et al. include: Zhu Quanyin, Hu Rongjing, Cao Suqun, Zhou Pei, et al. A commodity price classification method based on linear interpolation and adaptive windows. Chinese patent ZL 201110423015.5, 2015.07.01; Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rongjing, et al. A commodity price classification method based on binary data backfilling and disturbance factors. Chinese patent ZL 201110422274.6, 2013.01.02; Zhu Quanyin, Yin Yonghua, Yan Yunyang, Cao Suqun, et al. A data preprocessing method for multi-category commodity price classification based on neural networks. Chinese patent ZL 201210325368.6; Li Xiang, Zhu Quanyin, Hu Ronglin, et al. An intelligent stowage recommendation method for cold-chain logistics based on spectral clustering. Chinese Patent Publication No. CN105654267A, 2016.06.08; Cao Suqun, Zhu Quanyin, Zuo Xiaoming, Gao Shangbing, et al. A feature selection method for pattern classification. Chinese Patent Publication No. CN103425994A, 2013.12.04; Zhu Quanyin, Yan Yunyang, Li Xiang, Zhang Yongjun, et al. A science and technology information acquisition and push method based on text classification and image deep mining. Chinese Patent Publication No. CN104035997A, 2014.09.10; Zhu Quanyin, Xin Cheng, Li Xiang, Xu Kang, et al. A network behavior habit clustering method based on K-means and LDA bi-directional verification. Chinese Patent Publication No. CN106202480A, 2016.12.07.
Bagging (bootstrap aggregating):
Bagging is a method for improving the accuracy of a learning algorithm. It constructs a sequence of classification functions and then combines them into a single classifier in some way. The main idea of Bagging is resampling: data are independently and randomly drawn, with replacement, from the initial data set, and this process is repeated independently until many independent data sets have been produced. Given a weak learning algorithm, a classifier is trained on each of the generated sample sets, yielding a sequence of classification functions; their results are combined by voting, and the class with the most votes is taken as the final result.
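For illustration only (no code forms part of the original disclosure), the following is a minimal sketch of bootstrap resampling and majority voting, assuming scikit-learn-style estimators; all function and variable names are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def bagging_fit(base_estimator, X, y, n_models=10, seed=0):
    """Train one copy of base_estimator per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # resample with replacement
        models.append(clone(base_estimator).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Majority vote of the ensemble for a single sample x."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(100, 4))  # synthetic demo data
    y = (X[:, 0] > 0).astype(int)
    models = bagging_fit(LogisticRegression(max_iter=1000), X, y)
    print(bagging_predict(models, X[0]))
```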
Outlier:
Outlier detection is a branch of data mining whose task is to identify observations whose data characteristics differ markedly from those of the other data objects. Outlier detection is very important in data mining: if the anomalies are caused by inherent variation in the data, analysing them can reveal deeper, latent and valuable information. Outlier detection is therefore a meaningful research direction.
Logistic regression:
Logistic regression, also known as logit analysis, is a generalized linear regression model. Unlike linear regression, logistic regression is a nonlinear model, and its parameters are usually estimated by maximum likelihood. It is widely used in data mining, automatic disease diagnosis, economic classification and other fields. Logistic regression can build regression models on categorical dependent variables with categorical, continuous or mixed independent variables; a mature set of standards exists for testing the regression model and its coefficients, and results are given in the form of event occurrence probabilities.
Support vector machines:
The support vector machine is another optimality criterion for the design of linear classifiers, proposed by Vapnik et al. on the basis of years of research in statistical learning theory. Its principle starts from the linearly separable case and is then extended to the linearly non-separable case, and even to the use of nonlinear functions; a classifier of this kind is called a support vector machine (SVM).
Naive Bayes Classifier:
The naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with an independence assumption; the term describes the underlying probabilistic model more precisely as an independent feature model. The basis of Bayesian classification is probabilistic inference: how to complete reasoning and decision tasks when the presence of various conditions is uncertain and only their occurrence probabilities are known. Probabilistic inference is the counterpart of deterministic inference. The naive Bayes classifier rests on the independence assumption, i.e. each feature of a sample is assumed to be uncorrelated with all other features.
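All three candidate base classifiers provide class probability outputs, which is what the confidence measurement below consumes. A quick illustration (assuming scikit-learn and its iris sample data, neither of which is named in the disclosure):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for clf in (LogisticRegression(max_iter=1000),
            SVC(probability=True),  # probability=True enables predict_proba for SVC
            GaussianNB()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict_proba(X[:1]).round(3))  # one probability per class
```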
Euclidean distance:
The Euclidean metric, also called Euclidean distance, is a commonly used distance definition. It refers to the true distance between two points in m-dimensional space, or the natural length of a vector (i.e. the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is simply the actual distance between two points.
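For example (an illustrative numpy snippet, not part of the disclosure), the distance between two three-dimensional points:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.8, 0.1])
d = np.sqrt(((p - q) ** 2).sum())  # same as np.linalg.norm(p - q)
print(d)  # ~0.8485
```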
When Logistic regression, support vector machines and naive Bayes are used for classification, the data whose confidence should be measured and their classification results are added directly to the training set; this cannot prevent untrustworthy data and classification results from entering the trusted data set, which reduces the accuracy and stability of the model. To make better use of the above algorithms and avoid the influence of newly classified data on the model when it is added to the trusted data set, a method is needed that can measure the confidence of classification results, so that models such as Logistic regression, support vector machines and naive Bayes avoid learning from untrustworthy classification results.
The content of the invention
Object of the invention: In view of the problems in the prior art, the present invention combines Bagging with outlier analysis to measure the confidence of the classification results of models such as Logistic regression, support vector machines and naive Bayes, thereby preventing such models from being affected by untrustworthy classification data when the training data are expanded. To this end, the present invention proposes a method for measuring the confidence of classification results based on Bagging and outliers.
Technical solution: To solve the above technical problems, the method for measuring the confidence of classification results based on Bagging and outliers provided by the present invention comprises the following steps:
Step 1: Apply the Bagging ensemble learning method to the existing trusted data set, using one of Logistic regression, support vector machine and naive Bayes as the base classifier, to obtain a classification model set of the base classifier;
Step 2: Use the classification model set of the base classifier obtained in Step 1 to classify the data whose confidence is to be measured and compute the class probabilities for the different classes, obtaining the classification result set and the class probability set of the data; then count the classification result set to obtain the classification result of the data;
Step 3: Use outlier analysis to measure the confidence of the classification result of the data to be measured, obtaining the trustworthy and untrustworthy items among the data, and add the data satisfying the confidence condition to the existing trusted data set.
Further, the specific method for obtaining the classification model set of the base classifier in Step 1 is:
Step 1.1: Define the features and class attribute of the existing trusted data set;
Step 1.2: Select one of Logistic regression, support vector machine and naive Bayes as the base classifier Function;
Step 1.3: Apply the Bagging ensemble learning method to the trusted data set defined in step 1.1, with the Function selected in step 1.2 as the base classifier, to obtain the classification model set of Function.
Further, the specific method for obtaining the classification result of the data to be measured in Step 2 is:
Step 2.1: Classify the data whose confidence is to be measured and compute the class probabilities for the different classes, obtaining the classification result set Y and the class probability set Cf of the data;
Step 2.2: Count the number of occurrences of each class in the classification result set Y from step 2.1 to obtain the classification result py of the data to be measured.
Further, the specific method in Step 3 for measuring the confidence of the classification result of the data to be measured by outlier analysis is:
Step 3.1: Take the point satisfying Point = Cfpy as the outlier: extract Cfpy from the class probability set Cf of the data to be measured and delete Cfpy from Cf, obtaining the matrix P;
Step 3.2: Traverse each class in the matrix P and compute the corresponding centroid of P:
$$M_{Num} = \frac{1}{X-2}\sum_{Loop=1,\,Loop\neq Num}^{X-1} P_{Loop}$$
where PLoop is the Loop-th class in the class probability set, Num is the class currently being processed, and X is the number of classes;
Step 3.3: Traverse each class in the matrix P and compute its distance to the centroid and its distance to the outlier. The distance to the centroid is
$$d_{Num,1} = \sqrt{\sum_{w=1}^{N}\left(P_{Num,w}-M_{Num,w}\right)^{2}}$$
and the distance to the outlier is
$$d_{Num,2} = \sqrt{\sum_{g=1}^{N}\left(P_{Num,g}-Point_{g}\right)^{2}}-\alpha$$
where PNum is the Num-th class in the class probability set, MNum is the centroid corresponding to the Num-th class, and α is a user-defined value;
Step 3.4: After step 3.3, if d_{Num,2} > d_{Num,1} is satisfied, the data to be measured is trustworthy and is added to the existing trusted data set Train; otherwise the data is untrustworthy and is not added to the trusted data set Train.
Compared with the prior art, the advantages of the invention are:
By combining Bagging with outlier analysis, the method of the invention can effectively measure the confidence of the classification results of models such as Logistic regression, support vector machines and naive Bayes, thereby preventing untrustworthy classification results from affecting the training model when the model is retrained. In addition, the invention creatively proposes a method for measuring the confidence of classification results, used to expand the existing trusted data set with trustworthy data and thus improve the effectiveness of the learned model.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is the flow chart of the Bagging model training in Fig. 1;
Fig. 3 is the flow chart of classifying the data to be measured in Fig. 1;
Fig. 4 is the flow chart of the classification result confidence measurement in Fig. 1.
Specific embodiment
The present invention is further elucidated below with reference to the accompanying drawings and a specific embodiment.
The technical solution of the invention measures the confidence of the classification results of models such as Logistic regression, support vector machines and naive Bayes. First, the Bagging ensemble learning method is applied, with one of Logistic regression, support vector machine and naive Bayes as the base classifier, to classify the data whose confidence is to be measured and to compute the class probabilities for the different classes, giving the classification result set and the class probability set of the data; the classification result is obtained from the result set. Next, in the class probability set, each class is treated as a point in space: the point corresponding to the predicted class is taken as an outlier, and the points of the remaining classes form a cluster. Finally, using the Euclidean distance, each point's distance to the cluster centroid is compared with its distance to the outlier; if every point in the cluster is closer to the cluster centroid than to the outlier, the classification result is trustworthy, otherwise it is untrustworthy. The confidence of the classification result is thus measured.
Specifically, the present invention comprises the following steps:
Step 1: Apply the Bagging ensemble learning method to the existing trusted data set, using one of Logistic regression, support vector machine and naive Bayes as the base classifier, to obtain the classification model set of the base classifier, as shown in Fig. 2:
Step 1.1: Let the existing trusted data set with X classes be Train = {T1, T2, T3, ……, Tn}, where n is the number of items; the feature set of each item in Train is Ti = {a1, a2, a3, ……, afd}, where aj is the j-th feature of Ti and fd is the number of features, with i ∈ [1, n] and j ∈ [1, fd];
Step 1.2: Select one of Logistic regression, support vector machine and naive Bayes as the base classifier Function, and let the number of Function models be N;
Step 1.3: Let Models be the classification model set of Function, initialized to the empty set;
Step 1.4: Define the loop variable q with initial value 1;
Step 1.5: If q <= N, perform step 1.6; otherwise perform step 1.10;
Step 1.6: Randomly sample E items from the trusted data set Train of step 1.1 to form the sample Sub = {T1, T2, T3, ……, TE};
Step 1.7: Train Function on Sub to obtain the trained classification model Lq;
Step 1.8: Models = Models ∪ Lq;
Step 1.9: q = q + 1; return to step 1.5;
Step 1.10: Obtain the classification model set of Function, Models = {L1, L2, L3, ……, LN}.
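A minimal sketch of steps 1.3 to 1.10 follows (illustrative only; scikit-learn and all names are assumptions, with Logistic regression as the chosen Function):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def train_model_collection(X_train, y_train, base=None, N=10, E=None, seed=0):
    """Steps 1.3-1.10: build Models = {L1, ..., LN} from N random samples Sub of size E."""
    base = base or LogisticRegression(max_iter=1000)  # Function: the chosen base classifier
    rng = np.random.default_rng(seed)
    E = E if E is not None else len(X_train)
    models = []                                       # Models, initialized to the empty set
    for q in range(N):                                # loop variable q = 1..N
        idx = rng.integers(0, len(X_train), size=E)   # Sub: random sample of size E from Train
        Lq = clone(base).fit(X_train[idx], y_train[idx])  # trained model Lq
        models.append(Lq)                             # Models = Models ∪ {Lq}
    return models
```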
Step 2: Using the classification model set of the base classifier, classify the data whose confidence is to be measured and compute the class probabilities for the different classes, obtaining the classification result set and the class probability set of the data; then count the classification result set to obtain the classification result, as shown in Fig. 3:
Step 2.1: Let the feature set of the data to be measured be Test = {b1, b2, b3, ……, bgd}, where bk is the k-th feature of Test and gd is the number of features of Test;
Step 2.2: Classify Test with Models, obtaining the classification result set Y = {y1, y2, y3, ……, yN} and the class probability set Cf = {C1, C2, C3, ……, CX}, where ys is the classification result of Test under the s-th base classifier Function model; Cu is the set of probabilities assigned to the u-th class by each base classifier Function model, Cu = {pr1, pr2, pr3, ……, prN}, where prh is the class probability value given by the h-th base classifier Function model, with s, h ∈ [1, N] and u ∈ [1, X];
Step 2.3: Count the classification result set Y of step 2.2: let M be the number of occurrences of each class in Y, and select the class with the largest count in M as the classification result py of the data to be measured.
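A minimal sketch of steps 2.1 to 2.3 under the same assumptions (illustrative names; it further assumes every model observed all X classes during training, so the probability columns align with `classes`):

```python
import numpy as np

def classify_and_score(models, test_x, classes):
    """Steps 2.1-2.3: result set Y, class probability set Cf (one row per class),
    and the voted label py."""
    test_x = np.asarray(test_x).reshape(1, -1)
    Y = np.array([m.predict(test_x)[0] for m in models])            # y1..yN
    Cf = np.stack([m.predict_proba(test_x)[0] for m in models]).T   # shape (X, N)
    votes = np.array([(Y == c).sum() for c in classes])             # M: count per class
    py = classes[int(votes.argmax())]                               # majority class
    return Y, Cf, py
```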
Step 3: Use outlier analysis to measure the confidence of the classification result of the data to be measured, obtaining the trustworthy and untrustworthy items among the data, and add the data satisfying the confidence condition to the existing trusted data set, as shown in Fig. 4:
Step 3.1: Take the point satisfying Point = Cfpy as the outlier: extract Cfpy from the class probability set Cf of the data to be measured and remove it from Cf, obtaining P = {C1, C2, C3, ……, CX-1};
Step 3.2: Initialize the loop variable Num to 1, used to traverse the rows of the matrix P;
Step 3.3: If Num <= X-1, perform step 3.4; otherwise perform step 3.10;
Step 3.4: Compute the centroid of the class probability set P of the data to be measured, excluding PNum:
$$M_{Num} = \frac{1}{X-2}\sum_{Loop=1,\,Loop\neq Num}^{X-1} P_{Loop}$$
Step 3.5: Compute the Euclidean distance of PNum to the centroid, $$d_{Num,1} = \sqrt{\sum_{w=1}^{N}\left(P_{Num,w}-M_{Num,w}\right)^{2}}$$, and the Euclidean distance of PNum to Point, $$d_{Num,2} = \sqrt{\sum_{g=1}^{N}\left(P_{Num,g}-Point_{g}\right)^{2}}-\alpha$$, where α is set to 0.5;
Step 3.6: If d_{Num,1} < d_{Num,2}, perform step 3.7; otherwise perform step 3.8;
Step 3.7: Num = Num + 1; return to step 3.3;
Step 3.8: The data to be measured is untrustworthy, and Train = Train (the trusted data set is unchanged);
Step 3.10: The data to be measured is trustworthy and is added to the existing trusted data set Train, i.e. Train = Train ∪ {Test, py}.
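A minimal sketch of steps 3.1 to 3.10 under the same assumptions (illustrative names; α = 0.5 as in step 3.5):

```python
import numpy as np

def is_trustworthy(Cf, py_index, alpha=0.5):
    """Steps 3.1-3.10: row py_index of Cf is the outlier Point, the remaining rows
    form the cluster P; the result is trusted only if every row of P is closer to
    the cluster centroid than to the outlier.  Needs X >= 3 classes so that a
    centroid over the X-2 remaining rows exists."""
    Point = Cf[py_index]                     # step 3.1: outlier
    P = np.delete(Cf, py_index, axis=0)      # matrix P, shape (X-1, N)
    for num in range(len(P)):                # steps 3.2-3.7: traverse rows of P
        M = np.delete(P, num, axis=0).mean(axis=0)           # centroid without P_num
        d1 = np.sqrt(((P[num] - M) ** 2).sum())              # distance to centroid
        d2 = np.sqrt(((P[num] - Point) ** 2).sum()) - alpha  # distance to outlier
        if not d1 < d2:
            return False                     # step 3.8: untrustworthy
    return True                              # step 3.10: trustworthy
```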
In summary, with the Bagging ensemble learning method, one of Logistic regression, support vector machine and naive Bayes is used as the base classifier and trained on the trusted data, and the class probability set of the data to be measured is obtained. In the class probability set, each class is a point in space: the point corresponding to the predicted class is the outlier, the points of the remaining classes form a cluster, and the confidence of the classification result is judged by Euclidean distance.
Step 1.1 provides the initial data required for model training; steps 1.2 to 1.10 train the models on the data with the Bagging ensemble learning method, using Logistic regression, support vector machine or naive Bayes as the base classifier; steps 2.1 to 2.3 classify the data whose confidence is to be measured and compute the probabilities for the different classes, obtaining the classification result set and the class probability set of the data; steps 3.1 to 3.10 constitute the method for computing the confidence measurement of the classification result of the data to be measured.
To better illustrate the effectiveness of the method, existing Web page classification data and the publicly available Car Evaluation and Letter Recognition data sets from the UCI repository were used as raw data sets. Each was classified by a Logistic regression model, an SVM model and a naive Bayes model, and the confidence of the classification results was measured.
In the experiment on 4553 Web page classification records, the features were the keywords in the title and description fields of the Web pages; 70% of the sample served as the training set and 30% as the test set. Classification with the Logistic regression model achieved an accuracy of 90.64% with 128 misclassified items; with the confidence measurement of the classification results, 1092 items (80% of the original test set) could be selected from the results, and the accuracy on this filtered subset was 98.07%. Classification with the naive Bayes model achieved an accuracy of 88.1% with 162 misclassified items; with the confidence measurement, 1012 items (74.1% of the original test set) could be selected, with a subset accuracy of 96.93%. Classification with the SVM model achieved an accuracy of 88.64% with 155 misclassified items; with the confidence measurement, 1004 items (73.5% of the original test set) could be selected, with a subset accuracy of 94.5%.
Among the public UCI data, the Car Evaluation data set was selected, with 1728 items and 6 features. With 70% of the sample as the training set and 30% as the test set, classification with the Logistic regression model achieved an accuracy of 81.3% with 96 misclassified items; with the confidence measurement, 407 items (78.6% of the original test set) could be selected, with a subset accuracy of 98.07%. Classification with the naive Bayes model achieved an accuracy of 70% with 155 misclassified items; with the confidence measurement, 429 items (82.8% of the original test set) could be selected, with a subset accuracy of 78.3%. Classification with the SVM model achieved an accuracy of 94.8% with 27 misclassified items; with the confidence measurement, 496 items (95.8% of the original test set) could be selected, with a subset accuracy of 97.8%.
The Letter Recognition data set from UCI has 20000 items and 16 features. With 70% of the sample as the training set and 30% as the test set, classification with the Logistic regression model achieved an accuracy of 71.3% with 1722 misclassified items; with the confidence measurement, 2902 items (48.37% of the original test set) could be selected, with a subset accuracy of 91.42%. Classification with the naive Bayes model achieved an accuracy of 54.78% with 2713 misclassified items; with the confidence measurement, 2362 items (39.37% of the original test set) could be selected, with a subset accuracy of 79.17%. Classification with the SVM model achieved an accuracy of 96.87% with 187 misclassified items; with the confidence measurement, 5821 items (97% of the original test set) could be selected, with a subset accuracy of 98.2%.
Besides Logistic regression, support vector machines and naive Bayes, the confidence measurement can also be applied to the classification results of models that support class probability output, such as iterative decision trees (GBDT) and KNN. On the Car Evaluation data set, confidence measurement of the classification results of an iterative decision tree model and a KNN model, whose accuracies were 98.5% and 91.71% respectively, selected 499 items (96.3% of the original test set) and 415 items (80% of the original test set), with subset accuracies of 99.8% and 99%.
The present invention can be combined with a computer system to automatically perform the measurement of classification result confidence.
The above describes only an embodiment of the method for measuring the confidence of classification results based on Bagging and outliers proposed by the present invention and is not intended to limit the invention. Besides measuring the confidence of the classification results of models such as Logistic regression, SVM and naive Bayes, the method can also be used with models that support class probability output, such as iterative decision trees (GBDT), KNN and BP neural networks. All equivalent transformations made within the principle of the invention shall fall within the scope of protection of the invention. Matters not elaborated in the present description belong to the prior art known to those skilled in the art.

Claims (4)

1. A method for measuring the confidence of classification results based on Bagging and outliers, characterised in that it comprises the following steps:
Step 1: Apply the Bagging ensemble learning method to the existing trusted data set, using one of Logistic regression, support vector machine and naive Bayes as the base classifier, to obtain a classification model set of the base classifier;
Step 2: Use the classification model set of the base classifier obtained in Step 1 to classify the data whose confidence is to be measured and compute the class probabilities for the different classes, obtaining the classification result set and the class probability set of the data; then count the classification result set to obtain the classification result of the data;
Step 3: Use outlier analysis to measure the confidence of the classification result of the data to be measured, obtaining the trustworthy and untrustworthy items among the data, and add the data satisfying the confidence condition to the existing trusted data set.
2. The method for measuring the confidence of classification results based on Bagging and outliers according to claim 1, characterised in that the specific method for obtaining the classification model set of the base classifier in Step 1 is:
Step 1.1: Define the features and class attribute of the existing trusted data set;
Step 1.2: Select one of Logistic regression, support vector machine and naive Bayes as the base classifier Function;
Step 1.3: Apply the Bagging ensemble learning method to the trusted data set defined in step 1.1, with the Function selected in step 1.2 as the base classifier, to obtain the classification model set of Function.
3. The method for measuring the confidence of classification results based on Bagging and outliers according to claim 1, characterised in that the specific method for obtaining the classification result of the data to be measured in Step 2 is:
Step 2.1: Classify the data whose confidence is to be measured and compute the class probabilities for the different classes, obtaining the classification result set Y and the class probability set Cf of the data;
Step 2.2: Count the number of occurrences of each class in the classification result set Y from step 2.1 to obtain the classification result py of the data to be measured.
4. The method for measuring the confidence of classification results based on Bagging and outliers according to claim 1, characterised in that the specific method in Step 3 for measuring the confidence of the classification result of the data to be measured by outlier analysis is:
Step 3.1: Take the point satisfying Point = Cfpy as the outlier: extract Cfpy from the class probability set Cf of the data to be measured and delete Cfpy from Cf, obtaining the matrix P;
Step 3.2: Traverse each class in the matrix P and compute the corresponding centroid of P:
$$M_{Num} = \frac{1}{X-2}\sum_{Loop=1,\,Loop\neq Num}^{X-1} P_{Loop}$$
where PLoop is the Loop-th class in the class probability set, Num is the class currently being processed, and X is the number of classes;
Step 3.3: Traverse each class in the matrix P and compute its distance to the centroid and its distance to the outlier. The distance to the centroid is
$$d_{Num,1} = \sqrt{\sum_{w=1}^{N}\left(P_{Num,w}-M_{Num,w}\right)^{2}}$$
and the distance to the outlier is
$$d_{Num,2} = \sqrt{\sum_{g=1}^{N}\left(P_{Num,g}-Point_{g}\right)^{2}}-\alpha$$
where PNum is the Num-th class in the class probability set, MNum is the centroid corresponding to the Num-th class, and α is a user-defined value;
Step 3.4: After step 3.3, if d_{Num,2} > d_{Num,1} is satisfied, the data to be measured is trustworthy and is added to the existing trusted data set Train; otherwise the data is untrustworthy and is not added to the trusted data set Train.
CN201710054802.4A 2017-01-24 2017-01-24 Method for measuring the confidence of classification results based on Bagging and outliers Pending CN106874944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710054802.4A CN106874944A (en) 2017-01-24 2017-01-24 Method for measuring the confidence of classification results based on Bagging and outliers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710054802.4A CN106874944A (en) 2017-01-24 2017-01-24 Method for measuring the confidence of classification results based on Bagging and outliers

Publications (1)

Publication Number Publication Date
CN106874944A true CN106874944A (en) 2017-06-20

Family

ID=59159071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710054802.4A Pending CN106874944A (en) 2017-01-24 2017-01-24 Method for measuring the confidence of classification results based on Bagging and outliers

Country Status (1)

Country Link
CN (1) CN106874944A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619206A (en) * 2019-08-15 2019-12-27 中国平安财产保险股份有限公司 Operation and maintenance risk control method, system, equipment and computer readable storage medium
CN110619206B (en) * 2019-08-15 2024-04-02 中国平安财产保险股份有限公司 Operation and maintenance risk control method, system, equipment and computer readable storage medium
CN110990455A (en) * 2019-11-29 2020-04-10 杭州数梦工场科技有限公司 Method and system for identifying house properties by big data
CN110990455B (en) * 2019-11-29 2023-10-17 杭州数梦工场科技有限公司 Method and system for recognizing house property by big data

Similar Documents

Publication Publication Date Title
Guo et al. Supplier selection based on hierarchical potential support vector machine
CN103632168B (en) Classifier integration method for machine learning
CN103617235B (en) Method and system for network navy account number identification based on particle swarm optimization
CN106709800A (en) Community partitioning method and device based on characteristic matching network
CN105447505B (en) A kind of multi-level important email detection method
Okori et al. Machine learning classification technique for famine prediction
CN108009690B (en) Ground bus stealing group automatic detection method based on modularity optimization
CN103116588A (en) Method and system for personalized recommendation
CN108960833A (en) A kind of abnormal transaction identification method based on isomery finance feature, equipment and storage medium
CN109754258B (en) Online transaction fraud detection method based on individual behavior modeling
Chen et al. Research on location fusion of spatial geological disaster based on fuzzy SVM
Savage et al. Detection of money laundering groups: Supervised learning on small networks
CN105022754A (en) Social network based object classification method and apparatus
CN108021651A (en) Network public opinion risk assessment method and device
CN109408641A (en) It is a kind of based on have supervision topic model file classification method and system
CN108647691A (en) A kind of image classification method based on click feature prediction
CN110532429B (en) Online user group classification method and device based on clustering and association rules
CN104850868A (en) Customer segmentation method based on k-means and neural network cluster
CN109739844A (en) Data classification method based on decaying weight
CN106250909A (en) A kind of based on the image classification method improving visual word bag model
CN109359551A (en) A kind of nude picture detection method and system based on machine learning
CN105574213A (en) Microblog recommendation method and device based on data mining technology
CN116109898A (en) Generalized zero sample learning method based on bidirectional countermeasure training and relation measurement constraint
Zhu et al. NUS: Noisy-sample-removed undersampling scheme for imbalanced classification and application to credit card fraud detection
CN108491719A (en) A kind of Android malware detection methods improving NB Algorithm

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication (application publication date: 20170620)