CN108805156A - An improved selective naive Bayesian method - Google Patents

An improved selective naive Bayesian method

Info

Publication number
CN108805156A
Authority
CN
China
Prior art keywords
attribute
classification
accuracy
formula
values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810291375.6A
Other languages
Chinese (zh)
Inventor
姚全珠
李莎莎
费蓉
范慧敏
白赞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an University of Technology
Original Assignee
Xi'an University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an University of Technology
Priority to CN201810291375.6A priority Critical patent/CN108805156A/en
Publication of CN108805156A publication Critical patent/CN108805156A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an improved selective naive Bayesian method comprising the following steps: WoE values and IV values are introduced into attribute selection to obtain an attribute subset highly correlated with the class, and a naive Bayes classifier is constructed; redundant attributes are then further deleted on this basis to obtain the optimal attribute subset. On the basis of the existing Bayesian algorithm, the improved selective naive Bayesian method of the present invention introduces the WoE and IV indexes into attribute selection, improving the classification performance of naive Bayes when attributes are redundant while keeping its classification performance when attributes are not redundant; a first-round attribute subset is obtained by screening against a threshold, which reduces the traversal space and solves the problem of improving classification accuracy while reducing attribute dimensionality.

Description

An improved selective naive Bayesian method
Technical field
The invention belongs to the technical field of attribute selection methods, and in particular relates to an improved selective naive Bayesian method.
Background technology
For a learning task with a given attribute set, some attributes may be very critical and useful; these are called "relevant attributes" (relevant features), while useless attributes are called "irrelevant attributes" (irrelevant features). The process of selecting a subset of relevant attributes from a given attribute set is called "attribute selection" (feature selection).
In real tasks, an excess of attributes often causes the curse of dimensionality; if the important attributes can be screened out by attribute selection, the efficiency of processing high-dimensional data can be greatly improved. Besides removing "irrelevant attributes", "redundant attributes" should also be removed, i.e., those attributes whose own information can be deduced from other attributes. Note that the attribute selection process must ensure that no important attribute is lost; otherwise, the subsequent learning process cannot achieve good performance because of the missing important information. Selected attributes can improve the performance of a model and help us better understand the characteristics and underlying structure of the data, which plays an important role in further improving models and algorithms.
Common attribute selection methods can be roughly divided into three classes: filter, wrapper, and embedded. Filter methods compute the degree of correlation between each conditional attribute in the original attribute set and the class attribute by methods such as the correlation coefficient, information gain (InfoGain), gain ratio (GainRatio), and OneR, then perform "filtering" and train the model with the filtered attributes. The filtering criterion is a priority ranking of the original attribute set by degree of correlation.
Unlike filter attribute selection, which need not consider the subsequent learner, wrapper attribute selection directly uses the performance of the final learner to be used as the evaluation criterion of an attribute subset. In general, because wrapper attribute selection is optimized directly for the given learner, wrapper selection is better than filter selection from the viewpoint of final learner performance. On the other hand, because the learner must be trained repeatedly during attribute selection, the computational cost of wrapper attribute selection is usually much larger than that of filter attribute selection.
In filter and wrapper attribute selection methods, the attribute selection process differs markedly from the learner training process. In contrast, embedded attribute selection fuses the attribute selection process and the learner training process: both are completed within the same optimization process, i.e., attribute selection is performed automatically during learner training.
Summary of the invention
The object of the present invention is to provide an improved selective naive Bayesian method, which solves the problem of improving classification accuracy while reducing attribute dimensionality.
The technical solution adopted by the present invention is an improved selective naive Bayesian method, including the following steps:
Step 1, a data set T containing n attributes is given; let S = {A_1, A_2, ..., A_n} be the finite set of conditional attribute variables and C = {C_1, C_2, ..., C_m} the class variable, where m is the number of values of the class variable and C_j is the j-th value of the class variable;
When discussing the two-class problem, i.e., assuming m = 2 and C = {C_1, C_2}, suppose an arbitrary conditional attribute variable A_i has S_i different values {a_i1, a_i2, ..., a_iS_i}; that is, the k-th value of attribute A_i is denoted a_ik;
Step 2, the WoE index is defined
The WoE index is a coding form of the original independent variable; to perform WoE coding on a variable, the variable must first be grouped, as in formulas (2) and (3):
WoE(a_ik, C) = ln( P(A=a_ik | C=C_1) / P(A=a_ik | C=C_2) )    (2)
P(A=a_ik | C) = N(A=a_ik | C) / N(C)    (3)
In formulas (2)-(3): C_1 denotes the class label of the first class and C_2 the class label of the second class; P(A=a_ik | C=C_1) denotes the conditional probability that the attribute takes the value a_ik given class C_1, and P(A=a_ik | C=C_2) the conditional probability that the attribute takes the value a_ik given class C_2; N(C) denotes the number of samples of class C, N the total number of data samples, and N(A=a_ik | C) the number of samples of class C whose attribute value is a_ik;
Step 3, the IV index is defined
The IV index measures the information content of a variable, i.e., the degree of influence of an independent variable on the target variable, as shown in formula (4):
IV(a_ik, C) = ( P(A=a_ik | C=C_1) - P(A=a_ik | C=C_2) ) * WoE(a_ik, C)    (4)
Then the IV value of attribute A_i is the sum of the IV values of its groupings, i.e.:
IV(A_i) = Σ_{k=1..S_i} IV(a_ik, C)    (5)
Step 4, combining step 1, the WoE index of step 2 and the IV index of step 3 are introduced into attribute selection, and a naive Bayes classifier is constructed;
Step 5, on the basis of step 4, the IV index is first used to filter the original finite set S of conditional attribute variables from step 1 to obtain the attribute subset S' that meets the threshold requirement; the attributes in S' are sorted by IV value from high to low, and finally the attribute subset that makes the classifier's performance optimal is searched for on the sorted attribute subset S'.
The present invention is further characterized in that:
The specific operations of step 4 are:
Step 4.1, the attribute subset highly correlated with the class is screened out from the original attribute set by calculating IV values:
According to the naive Bayes weighting formula, classifying a sample X requires formulas (6) and (7):
P(C_1 | X) = P(C_1) * Π_{i=1..n} P(a_ik | C_1)    (6)
P(C_2 | X) = P(C_2) * Π_{i=1..n} P(a_ik | C_2)    (7)
In formulas (6)-(7): P(a_ik | C_1), identical to P(A=a_ik | C=C_1), denotes the conditional probability that the attribute takes the value a_ik given class C_1; P(a_ik | C_2), identical to P(A=a_ik | C=C_2), denotes the conditional probability that the attribute takes the value a_ik given class C_2; P(C_1) denotes the prior probability of class C_1 and P(C_2) the prior probability of class C_2; P(C_1 | X) denotes the conditional probability of class C_1 given the sample X, and P(C_2 | X) the conditional probability of class C_2 given the sample X; X denotes a database sample without a class label, represented as an n-dimensional feature vector;
Step 4.2, a threshold is selected to filter attributes
Normalizing formula (6) gives formula (8):
P(C_1 | X)' = P(C_1 | X) / ( P(C_1 | X) + P(C_2 | X) ) = 1 / ( 1 + a * Π_{i=1..n} 1/β(a_ik) )    (8)
where β(a_ik) = P(a_ik | C_1) / P(a_ik | C_2) and a = P(C_2) / P(C_1); under a given data set, a is a constant. Similarly, normalizing formula (7) gives formula (9):
P(C_2 | X)' = P(C_2 | X) / ( P(C_1 | X) + P(C_2 | X) ) = 1 - P(C_1 | X)'    (9)
In formulas (8)-(9): P(C_1 | X)' denotes the normalized conditional probability of class C_1 given the sample X, and P(C_2 | X)' the normalized conditional probability of class C_2 given the sample X;
Step 4.3, a naive Bayes classifier is constructed on the attribute subset with good classification ability obtained in step 4.2.
The threshold division by which IV values measure the degree of correlation between an attribute and the class attribute in step 4.2 is as follows:
Degree of correlation    IV value
No correlation           IV < 0.02
Weak correlation         0.02 ≤ IV < 0.1
Medium correlation       0.1 ≤ IV < 0.3
Strong correlation       IV ≥ 0.3
The specific operations of step 5 are:
Step 5.1, the sample data set T to be classified in step 1 is input, with the conditional attribute set, i.e., the finite set of conditional attribute variables S = {A_1, A_2, ..., A_n}, and the decision attribute set, i.e., the class variable C = {C_1, C_2, ..., C_m}; data preprocessing is performed on the sample data set T to be classified;
Step 5.2, the candidate conditional attribute set S is initialized: the attribute set selected by attribute selection is S', the unselected attribute set is S'', and the attribute set sorted from high to low by attribute IV index is S'''; S', S'', and S''' are all set to empty, the maximum accuracy Accuracy_max = 0, and the current accuracy Accuracy_cur = 0;
Step 5.3, the information value (IV) of every attribute in the conditional attribute set S is calculated, and a first round of screening is performed against the threshold: attributes whose IV values are greater than or equal to the threshold are added to the set S''' of step 5.2, attributes whose IV values are less than the threshold are added to the set S'', and the attributes in S''' are sorted by IV value from high to low;
Step 5.4, if S''' is empty, the calculation terminates, and the current S' and Accuracy_max are saved;
Step 5.5, if S''' is not empty, the calculation continues: the first attribute A_i in the attribute set S''' is selected, deleted from S''', and added to S'; a naive Bayes classifier is constructed on the updated S', and Accuracy_cur is calculated;
Step 5.6, if Accuracy_cur > Accuracy_max in step 5.5, Accuracy_max = Accuracy_cur is updated and the process returns to step 5.4; if Accuracy_cur ≤ Accuracy_max in step 5.5, A_i is removed from S' and added to S'', the current S' and Accuracy_max are saved, and the calculation terminates.
The beneficial effects of the invention are as follows: on the basis of the existing Bayesian algorithm, the improved selective naive Bayesian method of the present invention introduces the WoE and IV indexes into attribute selection, which improves the classification performance of naive Bayes when attributes are redundant while keeping its classification performance when attributes are not redundant; a first-round attribute subset is obtained by screening against a threshold, which reduces the traversal space and solves the problem of improving classification accuracy while reducing attribute dimensionality.
Specific embodiments
The present invention is described in detail below with reference to specific embodiments.
The most common conventional method for measuring the correlation between attributes is the linear correlation method, but as an attribute ranking mechanism its major defect is that it is sensitive only to linear relationships: if a nonlinear relationship exists between attributes, the correlation coefficient may be close to 0 even when they stand in one-to-one correspondence, and the method requires all attributes to contain numeric attribute values. To overcome these disadvantages, a series of methods based on information theory have been introduced, such as information gain, gain ratio, and the symmetric uncertainty coefficient; however, when attribute selection is performed by computing the correlation between attributes and the class, these methods have no specific threshold setting and can only sort attributes by degree of correlation, so their time and space efficiency still needs improvement.
Let X and Y be two discrete random variables. Relative entropy, also known as KL divergence, characterizes the distance between two discrete random variables, denoted D_KL(Y‖X), and is defined in formula (1):
D_KL(Y‖X) = Σ_i P_Y(i) * log( P_Y(i) / P_X(i) )    (1)
In formula (1), the meaning of each variable is: P_Y(i) is the probability that the random variable Y = i; P_X(i) is the probability that the random variable X = i;
Relative entropy is a way of describing the difference between two probability distributions; its value is always greater than or equal to 0, and it equals 0 if and only if the two distributions are identical. WoE values and IV values are both developed from relative entropy. WoE (Weight of Evidence) is a coding form of the original independent variable; IV (Information Value) measures the information content of a variable, i.e., the degree of influence of an independent variable on the target variable.
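As a minimal illustration of formula (1), the following Python sketch (function and variable names are ours, not from the patent) computes the relative entropy of two discrete distributions given as probability lists over the same support:

```python
import math

def kl_divergence(p_y, p_x):
    """Relative entropy D_KL(Y || X) of two discrete distributions given as
    probability lists over the same support (formula (1)).
    Assumes P_X(i) > 0 wherever P_Y(i) > 0."""
    return sum(py * math.log(py / px)
               for py, px in zip(p_y, p_x) if py > 0)

# D_KL equals 0 only for identical distributions and grows with their difference.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.368
```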
The improved selective naive Bayesian method of the present invention includes the following steps:
Step 1, a data set T containing n attributes is given; let S = {A_1, A_2, ..., A_n} be the finite set of conditional attribute variables and C = {C_1, C_2, ..., C_m} the class variable, where m is the number of values of the class variable and C_j is the j-th value of the class variable;
When discussing the two-class problem, i.e., assuming m = 2 and C = {C_1, C_2}, suppose an arbitrary conditional attribute variable A_i has S_i different values {a_i1, a_i2, ..., a_iS_i}; that is, the k-th value of attribute A_i is denoted a_ik;
Step 2, the WoE index is defined
WoE is a quantitative analysis method that combines instances on the basis of particular categories. It was first proposed as a probability-statistics theory in the 1950s and later applied to medical diagnosis systems; in the 1980s, WoE was widely used in geographic information systems. In recent years, with the expansion of banking and other financial business and the growing attention to personal credit records, WoE has gradually become a research hotspot as a method of measuring customer credit quality.
The WoE index is a coding form of the original independent variable; to perform WoE coding on a variable, the variable must first be grouped, that is, discretized, binned, and the like, as in formulas (2) and (3):
WoE(a_ik, C) = ln( P(A=a_ik | C=C_1) / P(A=a_ik | C=C_2) )    (2)
P(A=a_ik | C) = N(A=a_ik | C) / N(C)    (3)
In formulas (2)-(3): C_1 denotes the class label of the first class and C_2 the class label of the second class; P(A=a_ik | C=C_1) denotes the conditional probability that the attribute takes the value a_ik given class C_1, and P(A=a_ik | C=C_2) the conditional probability that the attribute takes the value a_ik given class C_2; N(C) denotes the number of samples of class C, N the total number of data samples, and N(A=a_ik | C) the number of samples of class C whose attribute value is a_ik.
The influence that a change in the number of value occurrences brings to WoE is a proportional relationship, and the logarithm calculation dampens the effect of individual positive and negative examples; for instance, under a certain value of an attribute, one more positive example will not make the WoE value change dramatically, so that the influence of problems such as sampling error and noise on WoE is kept under a degree of control. WoE is mainly used to measure the tendency of each value under the same attribute toward the classification result; if a certain value of an attribute appears only once and its classification result is a positive example, this alone cannot show that the value has absolute guiding power toward a positive classification result. Therefore, when the contribution of the attribute to the classification result within the entire attribute set needs to be measured, the frequency of occurrence of each value must be taken into account, and this is what motivated the IV index.
Step 3, the IV index is defined
The IV index measures the information content of a variable, i.e., the degree of influence of an independent variable on the target variable, as shown in formula (4):
IV(a_ik, C) = ( P(A=a_ik | C=C_1) - P(A=a_ik | C=C_2) ) * WoE(a_ik, C)    (4)
Then the IV value of attribute A_i is the sum of the IV values of its groupings, i.e.:
IV(A_i) = Σ_{k=1..S_i} IV(a_ik, C)    (5)
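A minimal Python sketch of formulas (2)-(5) under the assumptions of this description (two classes; a zero count under an attribute value is treated as 1, as prescribed later in this description; all function names are illustrative):

```python
import math
from collections import Counter

def woe_iv(values, labels, c1, c2):
    """Per-value WoE(a_ik, C) (formulas (2)-(3)) and the attribute's IV as the
    sum of the IV values of its groupings (formulas (4)-(5))."""
    n1 = sum(1 for y in labels if y == c1)          # N(C = C_1)
    n2 = sum(1 for y in labels if y == c2)          # N(C = C_2)
    cnt1 = Counter(v for v, y in zip(values, labels) if y == c1)
    cnt2 = Counter(v for v, y in zip(values, labels) if y == c2)
    woe, iv = {}, 0.0
    for a in set(values):
        # a zero count is treated as 1 so that WoE stays finite,
        # as prescribed later in this description
        p1 = max(cnt1.get(a, 0), 1) / n1            # P(A = a | C = C_1)
        p2 = max(cnt2.get(a, 0), 1) / n2            # P(A = a | C = C_2)
        woe[a] = math.log(p1 / p2)                  # formula (2)
        iv += (p1 - p2) * woe[a]                    # formulas (4)-(5)
    return woe, iv

woe, iv = woe_iv(["x", "x", "y", "y", "y"], [1, 1, 1, 0, 0], c1=1, c2=0)
```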
Step 4, combining step 1, the WoE index of step 2 and the IV index of step 3 are introduced into attribute selection, and a naive Bayes classifier is constructed.
IV values are introduced into the naive Bayes classification algorithm as part of attribute selection, so as to fully consider the influence of IV values on the naive Bayes classification algorithm. From the mathematical formulas (4)-(5) of IV values it can be seen that:
When P(A_i=a_ik | C=C_1) > P(A_i=a_ik | C=C_2) > 0: WoE(a_ik, C) > 0 and IV(a_ik, C) > 0;
When 0 < P(A_i=a_ik | C=C_1) < P(A_i=a_ik | C=C_2): WoE(a_ik, C) < 0 and IV(a_ik, C) > 0;
When P(A_i=a_ik | C=C_1) = P(A_i=a_ik | C=C_2): WoE(a_ik, C) = 0 and IV(a_ik, C) = 0.
Step 4.1, the attribute subset highly correlated with the class is screened out from the original attribute set by calculating IV values:
According to the naive Bayes weighting formula, classifying a sample X requires formulas (6) and (7):
P(C_1 | X) = P(C_1) * Π_{i=1..n} P(a_ik | C_1)    (6)
P(C_2 | X) = P(C_2) * Π_{i=1..n} P(a_ik | C_2)    (7)
In formulas (6)-(7): P(a_ik | C_1), identical to P(A=a_ik | C=C_1), denotes the conditional probability that the attribute takes the value a_ik given class C_1; P(a_ik | C_2), identical to P(A=a_ik | C=C_2), denotes the conditional probability that the attribute takes the value a_ik given class C_2; P(C_1) denotes the prior probability of class C_1 and P(C_2) the prior probability of class C_2; P(C_1 | X) denotes the conditional probability of class C_1 given the sample X, and P(C_2 | X) the conditional probability of class C_2 given the sample X; X denotes a database sample without a class label, represented as an n-dimensional feature vector;
Step 4.2, a threshold is selected to filter attributes
Normalizing formula (6) gives formula (8):
P(C_1 | X)' = P(C_1 | X) / ( P(C_1 | X) + P(C_2 | X) ) = 1 / ( 1 + a * Π_{i=1..n} 1/β(a_ik) )    (8)
where β(a_ik) = P(a_ik | C_1) / P(a_ik | C_2) = e^{WoE(a_ik, C)} and a = P(C_2) / P(C_1); under a given data set, a is a constant. Similarly, normalizing formula (7) gives formula (9):
P(C_2 | X)' = P(C_2 | X) / ( P(C_1 | X) + P(C_2 | X) ) = 1 - P(C_1 | X)'    (9)
In formulas (8)-(9): P(C_1 | X)' denotes the normalized conditional probability of class C_1 given the sample X, and P(C_2 | X)' the normalized conditional probability of class C_2 given the sample X;
Analysis of the above formulas shows:
When β(a_ik) > 1, i.e., WoE(a_ik, C) > 0 and IV(a_ik, C) > 0: the larger β(a_ik) is, the larger the value of P(C_1 | X)' and the smaller the value of P(C_2 | X)';
When 0 < β(a_ik) < 1, i.e., WoE(a_ik, C) < 0 and IV(a_ik, C) > 0: the smaller β(a_ik) is, the smaller the value of P(C_1 | X)' and the larger the value of P(C_2 | X)'.
In summary: the farther β(a_ik) deviates from 1, the larger the difference between the posterior probabilities P(C_1 | X)' and P(C_2 | X)'; when β(a_ik) = 1, β(a_ik) has no influence on the values of P(C_1 | X)' or P(C_2 | X)'.
Since β(a_ik) and IV(a_ik, C) are directly linked (both are determined by WoE(a_ik, C)), when IV(a_ik, C) > 0, the larger its value, the larger the difference between the class labels on this variable, that is, the better the variable's classification ability; when IV(a_ik, C) = 0, the variable has no influence on classification.
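A small numeric sketch of formulas (8)-(9) as reconstructed above, assuming β(a_ik) = P(a_ik | C_1) / P(a_ik | C_2) and a = P(C_2) / P(C_1); the helper name is illustrative:

```python
import math

def normalized_posteriors(a, betas):
    """P(C_1 | X)' and P(C_2 | X)' from formulas (8)-(9), where
    a = P(C_2) / P(C_1) and betas = [β(a_ik)] for the sample's values."""
    prod = math.prod(betas)              # Π β(a_ik) = e^{Σ WoE(a_ik, C)}
    p1 = 1.0 / (1.0 + a / prod)          # formula (8)
    return p1, 1.0 - p1                  # formula (9): the two sum to 1

p1, p2 = normalized_posteriors(1.0, [2.0, 0.5, 3.0])  # β far from 1 widens the gap
```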
Table 1 gives the threshold division by which IV values measure the degree of correlation between an attribute and the class attribute.
Table 1. Threshold division by which IV values measure the degree of correlation between an attribute and the class attribute
Degree of correlation    IV value
No correlation           IV < 0.02
Weak correlation         0.02 ≤ IV < 0.1
Medium correlation       0.1 ≤ IV < 0.3
Strong correlation       IV ≥ 0.3
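The first-round screening against the Table 1 thresholds can be sketched as follows (the threshold 0.02 and the helper names are illustrative assumptions; the patent leaves the exact threshold to the practitioner):

```python
def first_round_filter(iv_by_attr, threshold=0.02):
    """First-round screening (step 5.3 below): keep attributes whose IV is at
    or above the threshold, sorted by IV from high to low (the set S'''),
    and set the rest aside (the set S'')."""
    kept = sorted((a for a, iv in iv_by_attr.items() if iv >= threshold),
                  key=lambda a: iv_by_attr[a], reverse=True)
    dropped = [a for a, iv in iv_by_attr.items() if iv < threshold]
    return kept, dropped

kept, dropped = first_round_filter({"age": 0.35, "zip": 0.01, "income": 0.12})
# kept == ["age", "income"], dropped == ["zip"]
```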
Step 4.3, a naive Bayes classifier is constructed on the attribute subset with good classification ability obtained in step 4.2.
From the foregoing, introducing IV values into attribute selection has the following advantages: (1) IV can measure the size of each attribute's influence on classification; (2) compared with most other feature selection algorithms, IV has a fairly universal threshold for determining the degree of correlation between an attribute and the class attribute, whereas most other attribute selection indexes can only sort attributes by influence and make it hard to decide how many attributes must be selected to achieve a relatively satisfactory effect; (3) IV is fast to compute, its space-time overhead is small, and it is well suited to data preprocessing.
However, using the IV index directly for attribute selection still requires the following problems to be considered. In actual calculation, the probability values required by the IV of a discrete attribute can be obtained by counting, but for continuous attributes the probability distribution of the data is difficult to obtain, and computing the integrals is difficult and carries a large space-time overhead; therefore, before the IV calculation, continuous attributes are discretized to ensure the efficiency of the IV index. In addition, in real data it often happens that a certain value under a certain attribute has frequency 0, i.e., when attribute A_i takes the value a_ik there is no instance of class C; this may result in a denominator of 0 so that the WoE value tends to infinity, which is obviously unreasonable. For this case, the preceding analysis indicated that slight variations in a value's count will not cause large fluctuations of the WoE value; therefore, when the number of instances of a certain attribute value is 0, it is treated as 1.
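As the paragraph above prescribes, continuous attributes are discretized before the IV pass; a minimal equal-width binning sketch follows (the bin count is an assumption, not fixed by the patent):

```python
def equal_width_bins(column, n_bins=5):
    """Discretize a continuous attribute into n_bins equal-width intervals,
    returning one bin index per sample, usable as the values in the WoE/IV pass."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0    # guard against a constant column
    return [min(int((x - lo) / width), n_bins - 1) for x in column]

bins = equal_width_bins([0.1, 0.4, 0.35, 0.8, 0.9])  # [0, 1, 1, 4, 4]
```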
The purpose of attribute subset selection is to find a succinct subset of the original attribute set such that, when run on a data set containing only the attributes in this subset, the learning algorithm produces a classifier of the highest possible accuracy. The original attribute sets are mostly drawn from data sets such as the UCI Repository of machine learning databases and, as shown in Table 2, contain continuous-valued attributes, missing values, and the like. Therefore, the key to attribute subset selection is to find a simplified and excellent attribute subset. An excellent attribute subset not only exhibits a strong overall correlation with the class attribute but also exhibits very little overall redundancy among the attributes within the subset. In order to select an optimal subset, both the correlation between the attributes and the class attribute and the redundancy among the attributes must be considered when performing attribute subset selection.
Table 2. Data set description
Through IV index screening, the selective naive Bayesian method of the present invention can obtain an attribute subset strongly correlated with the class attribute, but screening alone ignores the redundancy among attributes; conversely, although a selective naive Bayesian method can solve the attribute redundancy problem, it has no definite criterion when screening attribute subsets, which may lead to exhaustive search and increase computational complexity and reduce operational efficiency.
Step 5, the IV index is first used to filter the original finite set S of conditional attribute variables to obtain the attribute subset S' that meets the threshold requirement; the attributes in S' are sorted by IV value from high to low, and finally the attribute subset that makes the classifier's performance optimal is searched for on the sorted attribute subset S'.
The specific operations are as follows:
Step 5.1, the sample data set T to be classified in step 1 is input, with the conditional attribute set, i.e., the finite set of conditional attribute variables S = {A_1, A_2, ..., A_n}, and the decision attribute set, i.e., the class variable C = {C_1, C_2, ..., C_m}; data preprocessing is performed on the sample data set T to be classified;
Step 5.2, the candidate conditional attribute set S is initialized: the attribute set selected by attribute selection is S', the unselected attribute set is S'', and the attribute set sorted from high to low by attribute IV index is S'''; S', S'', and S''' are all set to empty, the maximum accuracy Accuracy_max = 0, and the current accuracy Accuracy_cur = 0;
Step 5.3, the information value (IV) of every attribute in the conditional attribute set S is calculated, and a first round of screening is performed against the threshold: attributes whose IV values are greater than or equal to the threshold are added to the set S''' of step 5.2, attributes whose IV values are less than the threshold are added to the set S'', and the attributes in S''' are sorted by IV value from high to low;
Step 5.4, if S''' is empty, the calculation terminates, and the current S' and Accuracy_max are saved;
Step 5.5, if S''' is not empty, the calculation continues: the first attribute A_i in the attribute set S''' is selected, deleted from S''', and added to S'; a naive Bayes classifier is constructed on the updated S', and Accuracy_cur is calculated;
Step 5.6, if Accuracy_cur > Accuracy_max in step 5.5, Accuracy_max = Accuracy_cur is updated and the process returns to step 5.4; if Accuracy_cur ≤ Accuracy_max in step 5.5, A_i is removed from S' and added to S'', the current S' and Accuracy_max are saved, and the calculation terminates.
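The following Python sketch outlines steps 5.1-5.6 under stated assumptions: the generic train_and_score callback stands in for constructing a naive Bayes classifier on S' and computing Accuracy_cur (the patent does not fix an implementation), iv_by_attr comes from the IV pass sketched earlier, and the loop back to step 5.4 after a successful addition is implied by the stopping tests:

```python
def selective_nb(iv_by_attr, train_and_score, threshold=0.02):
    """Greedy forward selection over IV-ranked attributes (steps 5.2-5.6).

    iv_by_attr: {attribute name: IV value} from the first-round IV pass.
    train_and_score: callable taking the list of selected attributes and
    returning the accuracy of a naive Bayes classifier built on them.
    """
    # Step 5.2: initialize S', S'' and the accuracy records.
    s_sel, s_rej = [], []                        # S' (selected), S'' (rejected)
    # Step 5.3: first-round screening and IV-descending sort into S'''.
    s_cand = sorted((a for a, iv in iv_by_attr.items() if iv >= threshold),
                    key=lambda a: iv_by_attr[a], reverse=True)
    s_rej += [a for a, iv in iv_by_attr.items() if iv < threshold]
    acc_max = 0.0                                # Accuracy_max

    while s_cand:                                # step 5.4: stop when S''' is empty
        a_i = s_cand.pop(0)                      # step 5.5: head of S''' into S'
        s_sel.append(a_i)
        acc_cur = train_and_score(s_sel)         # Accuracy_cur on the updated S'
        if acc_cur > acc_max:
            acc_max = acc_cur                    # step 5.6: keep A_i, loop back
        else:
            s_sel.remove(a_i)                    # step 5.6: discard A_i and stop
            s_rej.append(a_i)
            break
    return s_sel, acc_max
```

Any hold-out or cross-validation accuracy estimate can serve as train_and_score, consistent with the description's choice of classification accuracy as the sole evaluation criterion.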
The attributes selected by the present invention are a subset of the naive Bayes attribute set; they improve the classification performance of naive Bayes when attributes are redundant, while keeping its classification performance when attributes are not redundant. By introducing the IV concept, a first-round attribute subset is obtained through threshold screening, which reduces the traversal space, and a second round of selection is performed on the attribute subset by forward search; following the principle of the greedy algorithm, it is assumed that at each step of the search all local changes to the current attribute subset are optimal choices.
Whether the optimal subset selected by attribute selection is good or bad is always relative to a specific evaluation criterion, and different criteria often recognize different subsets as "optimal": the optimal subset obtained under criterion A may not be the best under criterion B. For supervised learning, the ultimate purpose and meaning of attribute selection is to make the accuracy of the classifier better. Therefore, the present invention adopts the most intuitive approach and uses classification accuracy (or error rate) as the evaluation criterion, taking the accuracy of the naive Bayes classifier as the measure of the quality of attribute selection.
To verify the validity of the method, first, databases from the UCI Repository of machine learning databases are chosen and, with reference to Table 1, the number of attributes in the subset and the classification accuracy under different thresholds are compared, to prove the correctness of introducing the IV index. Then, the IV index is compared with the common attribute selection methods Cor (Correlation), GR (GainRatio), IG (InfoGain), and OneR, and the correlation between conditional attributes and the class attribute and the classification accuracy rate under each method are calculated, to establish the validity of the method of the present invention. Finally, the performance of the naive Bayes classifier is compared with that of the naive Bayes classifier after attribute selection, proving that the improved method of the present invention can significantly reduce attribute dimensionality while ensuring classification accuracy.

Claims (4)

1. An improved selective naive Bayesian method, characterized by comprising the following steps:
Step 1, a data set T containing n attributes is given; let S = {A_1, A_2, ..., A_n} be the finite set of conditional attribute variables and C = {C_1, C_2, ..., C_m} the class variable, where m is the number of values of the class variable and C_j is the j-th value of the class variable;
when discussing the two-class problem, i.e., assuming m = 2 and C = {C_1, C_2}, suppose an arbitrary conditional attribute variable A_i has S_i different values {a_i1, a_i2, ..., a_iS_i}; that is, the k-th value of attribute A_i is denoted a_ik;
Step 2, the WoE index is defined:
the WoE index is a coding form of the original independent variable; to perform WoE coding on a variable, the variable must first be grouped, as in formulas (2) and (3):
WoE(a_ik, C) = ln( P(A=a_ik | C=C_1) / P(A=a_ik | C=C_2) )    (2)
P(A=a_ik | C) = N(A=a_ik | C) / N(C)    (3)
in formulas (2)-(3): C_1 denotes the class label of the first class and C_2 the class label of the second class; P(A=a_ik | C=C_1) denotes the conditional probability that the attribute takes the value a_ik given class C_1, and P(A=a_ik | C=C_2) the conditional probability that the attribute takes the value a_ik given class C_2; N(C) denotes the number of samples of class C, N the total number of data samples, and N(A=a_ik | C) the number of samples of class C whose attribute value is a_ik;
Step 3, the IV index is defined:
the IV index measures the information content of a variable, i.e., the degree of influence of an independent variable on the target variable, as shown in formula (4):
IV(a_ik, C) = ( P(A=a_ik | C=C_1) - P(A=a_ik | C=C_2) ) * WoE(a_ik, C)    (4)
then the IV value of attribute A_i is the sum of the IV values of its groupings, i.e.:
IV(A_i) = Σ_{k=1..S_i} IV(a_ik, C)    (5)
Step 4, combining step 1, the WoE index of step 2 and the IV index of step 3 are introduced into attribute selection, and a naive Bayes classifier is constructed;
Step 5, on the basis of step 4, the IV index is first used to filter the original finite set S of conditional attribute variables from step 1 to obtain the attribute subset S' that meets the threshold requirement; the attributes in S' are sorted by IV value from high to low, and finally the attribute subset that makes the classifier's performance optimal is searched for on the sorted attribute subset S'.
2. The improved selective naive Bayesian method according to claim 1, characterized in that the specific operations of step 4 are:
Step 4.1, the attribute subset highly correlated with the class is screened out from the original attribute set by calculating IV values:
according to the naive Bayes weighting formula, classifying a sample X requires formulas (6) and (7):
P(C_1 | X) = P(C_1) * Π_{i=1..n} P(a_ik | C_1)    (6)
P(C_2 | X) = P(C_2) * Π_{i=1..n} P(a_ik | C_2)    (7)
in formulas (6)-(7): P(a_ik | C_1), identical to P(A=a_ik | C=C_1), denotes the conditional probability that the attribute takes the value a_ik given class C_1; P(a_ik | C_2), identical to P(A=a_ik | C=C_2), denotes the conditional probability that the attribute takes the value a_ik given class C_2; P(C_1) denotes the prior probability of class C_1 and P(C_2) the prior probability of class C_2; P(C_1 | X) denotes the conditional probability of class C_1 given the sample X, and P(C_2 | X) the conditional probability of class C_2 given the sample X; X denotes a database sample without a class label, represented as an n-dimensional feature vector;
Step 4.2, a threshold is selected to filter attributes:
normalizing formula (6) gives formula (8):
P(C_1 | X)' = P(C_1 | X) / ( P(C_1 | X) + P(C_2 | X) ) = 1 / ( 1 + a * Π_{i=1..n} 1/β(a_ik) )    (8)
where β(a_ik) = P(a_ik | C_1) / P(a_ik | C_2) and a = P(C_2) / P(C_1); under a given data set, a is a constant; similarly, normalizing formula (7) gives formula (9):
P(C_2 | X)' = P(C_2 | X) / ( P(C_1 | X) + P(C_2 | X) ) = 1 - P(C_1 | X)'    (9)
in formulas (8)-(9): P(C_1 | X)' denotes the normalized conditional probability of class C_1 given the sample X, and P(C_2 | X)' the normalized conditional probability of class C_2 given the sample X;
Step 4.3, a naive Bayes classifier is constructed on the attribute subset with good classification ability obtained in step 4.2.
3. The improved selective naive Bayesian method according to claim 2, characterized in that the threshold division by which IV values measure the degree of correlation between an attribute and the class attribute in step 4.2 is as follows:
Degree of correlation    IV value
No correlation           IV < 0.02
Weak correlation         0.02 ≤ IV < 0.1
Medium correlation       0.1 ≤ IV < 0.3
Strong correlation       IV ≥ 0.3
4. The improved selective naive Bayesian method according to claim 1, characterized in that the specific operations of step 5 are:
Step 5.1, the sample data set T to be classified in step 1 is input, with the conditional attribute set, i.e., the finite set of conditional attribute variables S = {A_1, A_2, ..., A_n}, and the decision attribute set, i.e., the class variable C = {C_1, C_2, ..., C_m}; data preprocessing is performed on the sample data set T to be classified;
Step 5.2, the candidate conditional attribute set S is initialized: the attribute set selected by attribute selection is S', the unselected attribute set is S'', and the attribute set sorted from high to low by attribute IV index is S'''; S', S'', and S''' are all set to empty, the maximum accuracy Accuracy_max = 0, and the current accuracy Accuracy_cur = 0;
Step 5.3, the information value (IV) of every attribute in the conditional attribute set S is calculated, and a first round of screening is performed against the threshold: attributes whose IV values are greater than or equal to the threshold are added to the set S''' of step 5.2, attributes whose IV values are less than the threshold are added to the set S'', and the attributes in S''' are sorted by IV value from high to low;
Step 5.4, if S''' is empty, the calculation terminates, and the current S' and Accuracy_max are saved;
Step 5.5, if S''' is not empty, the calculation continues: the first attribute A_i in the attribute set S''' is selected, deleted from S''', and added to S'; a naive Bayes classifier is constructed on the updated S', and Accuracy_cur is calculated;
Step 5.6, if Accuracy_cur > Accuracy_max in step 5.5, Accuracy_max = Accuracy_cur is updated and the process returns to step 5.4; if Accuracy_cur ≤ Accuracy_max in step 5.5, A_i is removed from S' and added to S'', the current S' and Accuracy_max are saved, and the calculation terminates.
CN201810291375.6A 2018-04-03 2018-04-03 An improved selective naive Bayesian method Pending CN108805156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810291375.6A CN108805156A (en) 2018-04-03 2018-04-03 An improved selective naive Bayesian method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810291375.6A CN108805156A (en) 2018-04-03 2018-04-03 An improved selective naive Bayesian method

Publications (1)

Publication Number Publication Date
CN108805156A (en) 2018-11-13

Family

ID=64094674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810291375.6A Pending CN108805156A (en) An improved selective naive Bayesian method

Country Status (1)

Country Link
CN (1) CN108805156A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241079A (en) * 2020-01-08 2020-06-05 哈尔滨工业大学 Data cleaning method and device and computer readable storage medium


Similar Documents

Publication Publication Date Title
Xiao et al. Cost-sensitive semi-supervised selective ensemble model for customer credit scoring
CN108898479B (en) Credit evaluation model construction method and device
Sabau Survey of clustering based financial fraud detection research
CN108596199A (en) Unbalanced data classification method based on EasyEnsemble algorithms and SMOTE algorithms
Kim et al. Ordinal classification of imbalanced data with application in emergency and disaster information services
CN109739844B (en) Data classification method based on attenuation weight
CN109615014A (en) A kind of data sorting system and method based on the optimization of KL divergence
CN107273387A (en) Towards higher-dimension and unbalanced data classify it is integrated
CN107391772A (en) A kind of file classification method based on naive Bayesian
CN112001788B (en) Credit card illegal fraud identification method based on RF-DBSCAN algorithm
Safitri et al. Improved accuracy of naive bayes classifier for determination of customer churn uses smote and genetic algorithms
Manziuk et al. Definition of information core for documents classification
CN109726918A (en) The personal credit for fighting network and semi-supervised learning based on production determines method
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
Chern et al. A decision tree classifier for credit assessment problems in big data environments
CN104615789A (en) Data classifying method and device
Pristyanto et al. The effect of feature selection on classification algorithms in credit approval
Tsai Two‐stage hybrid learning techniques for bankruptcy prediction
Chiang et al. The Chinese text categorization system with association rule and category priority
CN114064459A (en) Software defect prediction method based on generation countermeasure network and ensemble learning
CN109716660A (en) Data compression device and method
CN114037001A (en) Mechanical pump small sample fault diagnosis method based on WGAN-GP-C and metric learning
CN108805156A (en) An improved selective naive Bayesian method
CN115271442A (en) Modeling method and system for evaluating enterprise growth based on natural language
CN115618297A (en) Method and device for identifying abnormal enterprise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181113