CN108805156A - An improved selective naive Bayes method - Google Patents
An improved selective naive Bayes method Download PDF Info
- Publication number
- CN108805156A CN108805156A CN201810291375.6A CN201810291375A CN108805156A CN 108805156 A CN108805156 A CN 108805156A CN 201810291375 A CN201810291375 A CN 201810291375A CN 108805156 A CN108805156 A CN 108805156A
- Authority
- CN
- China
- Prior art keywords
- attribute
- classification
- accuracy
- formula
- values
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an improved selective naive Bayes method comprising the following steps: WoE values and IV values are introduced into attribute selection to obtain an attribute subset highly correlated with the class, on which a naive Bayes classifier is constructed; redundant attributes are then further deleted from this subset to obtain an optimal attribute subset. On the basis of the existing Bayesian algorithm, the improved selective naive Bayes method of the invention introduces the WoE and IV indexes into attribute selection, improving the classification performance of naive Bayes when attributes are redundant while keeping the classification performance of naive Bayes when attributes are not redundant; a first-round attribute subset is obtained by threshold screening, which reduces the traversal space and solves the problem of improving classification accuracy while reducing attribute dimensionality.
Description
Technical field
The invention belongs to the technical field of attribute selection methods, and in particular relates to an improved selective naive Bayes method.
Background technology
For a learning task with a given attribute set, some attributes may be crucial and very useful; we call these "relevant features". Useless attributes are called "irrelevant features". The process of selecting a subset of relevant attributes from a given attribute set is called "feature selection".
In real tasks, an excess of attributes often causes the curse of dimensionality. If the important attributes can be screened out by attribute selection, the efficiency of processing high-dimensional data can be greatly improved. Besides removing "irrelevant attributes", "redundant attributes" should also be removed, i.e., attributes whose information can be deduced from other attributes. Note that the attribute selection process must ensure that no important attribute is lost; otherwise the subsequent learning process cannot achieve good performance because important information is missing. Well-selected attributes can improve the performance of the model and help us better understand the characteristics and underlying structure of the data, which plays an important role in further improving models and algorithms.
Common attribute selection methods can be roughly divided into three classes: filter, wrapper, and embedded. Filter methods compute the degree of correlation between each conditional attribute in the original attribute set and the class attribute using measures such as the correlation coefficient, information gain (InfoGain), gain ratio, and OneR, "filter" the attributes accordingly, and then train the model on the filtered attributes. The filtering criterion is a priority ranking of the original attribute set by degree of correlation.
Unlike filter attribute selection, which need not consider the subsequent learner, wrapper attribute selection directly uses the performance of the final learner as the evaluation criterion for attribute subsets. In general, because wrapper attribute selection is optimized directly for the given learner, it tends to outperform filter attribute selection in terms of final learner performance. On the other hand, because the learner must be trained repeatedly during attribute selection, the computational cost of wrapper selection is usually much larger than that of filter selection.
In filter and wrapper attribute selection methods, the attribute selection process is clearly separated from the learner training process. In contrast, embedded attribute selection fuses the attribute selection process and the learner training process together: both are completed in the same optimization process, i.e., attribute selection is performed automatically during learner training.
Invention content
The object of the present invention is to provide an improved selective naive Bayes method that solves the problem of improving classification accuracy while reducing attribute dimensionality.
The technical solution adopted by the present invention is an improved selective naive Bayes method comprising the following steps:
Step 1: Given a data set T containing n attributes, let S = {A1, A2, ..., An} be the finite set of conditional attribute variables and C = {C1, C2, ..., Cm} the class variable, where m is the number of class values and Cj is the j-th class value.
When a two-class problem is discussed, i.e., m = 2 and C = {C1, C2}, any conditional attribute variable Ai has Si distinct values {ai1, ai2, ..., aiSi}; that is, the k-th value of attribute Ai is denoted aik.
Step 2: Define the WoE index.
The WoE index is an encoding of the original variable; to WoE-encode a variable, the variable must first be grouped, as in formulas (2) and (3):
WoE(aik, C) = ln(P(A=aik|C=C1) / P(A=aik|C=C2))   (2)
P(A=aik|C) = N(A=aik|C) / N(C)   (3)
In formulas (2)-(3): C1 denotes the first class label and C2 the second class label; P(A=aik|C=C1) is the conditional probability that the attribute takes value aik given class C1, and P(A=aik|C=C2) the corresponding probability given class C2; N(C) is the number of samples of class C; N is the total number of data samples; N(A=aik|C) is the number of samples of class C whose attribute value is aik.
Step 3: Define the IV index.
The IV index measures the information value of a variable, i.e., the degree of influence of the independent variable on the target variable, as shown in formula (4):
IV(aik, C) = (P(A=aik|C=C1) - P(A=aik|C=C2)) * WoE(aik, C)   (4)
The IV value of attribute Ai is then the sum of the IV values of its groups, i.e.:
IV(Ai) = Σk IV(aik, C)   (5)
Step 4: Combining step 1, introduce the WoE index of step 2 and the IV index of step 3 into attribute selection and construct a naive Bayes classifier.
Step 5: On the basis of step 4, first filter the original finite set S of conditional attribute variables from step 1 by the IV index to obtain the attribute subset S' meeting the threshold requirement; sort the attributes in S' by IV value from high to low; finally, search the sorted attribute set S' for the attribute subset that optimizes the performance of the classifier.
The invention is further characterized in that the specific operations of step 4 are as follows:
Step 4.1: Filter the attribute subset highly correlated with the class out of the original attribute set by computing IV values.
From the naive Bayes weighting formula, classifying a sample X requires formulas (6) and (7):
P(C1|X) = P(C1) * Πi P(aik|C1)   (6)
P(C2|X) = P(C2) * Πi P(aik|C2)   (7)
In formulas (6)-(7): P(aik|C1), identical to P(A=aik|C=C1), is the conditional probability that the attribute takes value aik given class C1; P(aik|C2), identical to P(A=aik|C=C2), is the corresponding probability given class C2; P(C1) is the prior probability of class C1; P(C2) is the prior probability of class C2; P(C1|X) is the conditional probability of class C1 given attributes X; P(C2|X) is the conditional probability of class C2 given attributes X; X denotes an unlabeled database sample, represented as an n-dimensional feature vector.
Step 4.2: Select a threshold and carry out attribute filtering.
Normalizing formula (6) gives formula (8):
P(C1|X)' = a * Πk β(aik) / (1 + a * Πk β(aik))   (8)
where a = P(C1)/P(C2) is constant for a given data set and β(aik) = P(aik|C1)/P(aik|C2). Similarly, normalizing formula (7) gives formula (9):
P(C2|X)' = 1 / (1 + a * Πk β(aik))   (9)
In formulas (8)-(9): P(C1|X)' denotes the normalized conditional probability of class C1 given attributes X; P(C2|X)' denotes the normalized conditional probability of class C2 given attributes X.
Step 4.3: Construct the naive Bayes classifier on the attribute subset of good classification ability obtained in step 4.2.
The threshold division by which the IV values of step 4.2 measure the degree of correlation between an attribute and the class attribute is as follows:
Degree of correlation | IV value |
No correlation | IV < 0.02 |
Weak correlation | 0.02 ≤ IV < 0.1 |
Medium correlation | 0.1 ≤ IV < 0.3 |
Strong correlation | IV ≥ 0.3 |
The specific operations of step 5 are as follows:
Step 5.1: Input the sample data set T to be classified from step 1, the conditional attribute set, i.e., the finite set of conditional attribute variables S = {A1, A2, ..., An}, and the decision attribute set, i.e., the class variable C = {C1, C2, ..., Cm}; preprocess the data of the sample data set T to be classified.
Step 5.2: Initialize the candidate conditional attribute set S; let S' be the attribute set chosen by attribute selection, S'' the set of attributes not selected, and S''' the attribute set sorted by the IV index of the attributes from high to low; set S', S'', and S''' to empty, the maximum accuracy Accuracymax = 0, and the current accuracy Accuracycur = 0.
Step 5.3: Compute the information value (IV) of all attributes in the conditional attribute set S and perform the first round of screening by threshold: add the attributes whose IV value is greater than or equal to the threshold to the set S''' of step 5.2, add the attributes whose IV value is below the threshold to the set S'', and sort the attributes in S''' by IV value from high to low.
Step 5.4: If S''' is empty, terminate the computation and save the current S' and Accuracymax.
Step 5.5: If S''' is not empty, continue: select the first attribute Ai in the attribute set S''', delete it from S''' and add it to S'; construct a naive Bayes classifier on the updated S' and compute Accuracycur.
Step 5.6: If Accuracycur > Accuracymax in step 5.5, update Accuracymax = Accuracycur and return to step 5.4; if Accuracycur ≤ Accuracymax in step 5.5, remove Ai from S', add it to S'', save the current S' and Accuracymax, and terminate the computation.
The beneficial effects of the invention are as follows: on the basis of the existing Bayesian algorithm, the improved selective naive Bayes method of the invention introduces the WoE and IV indexes into attribute selection, improving the classification performance of naive Bayes when attributes are redundant while keeping the classification performance of naive Bayes when attributes are not redundant; a first round of attribute screening by threshold yields an attribute subset that reduces the traversal space, solving the problem of improving classification accuracy while reducing attribute dimensionality.
Specific implementation mode
The present invention is described in detail below with reference to specific embodiments.
The most common traditional method for computing the correlation between attributes is the linear correlation method, but as an attribute ranking mechanism its major drawback is that it is sensitive only to linear relationships: if a nonlinear relationship exists between attributes, the correlation coefficient may be close to 0 even when they stand in a one-to-one relationship; moreover, the method requires all attributes to have numeric values. To overcome these drawbacks, a series of methods based on information theory has been introduced, such as information gain, gain ratio, and the symmetric uncertainty coefficient. However, when attribute selection is performed by computing the correlation between attributes and the class, these methods have no specific threshold setting and can only rank the attributes by degree of correlation, so their time and space efficiency still needs improvement.
Given two discrete random variables X and Y, the relative entropy, also known as the KL divergence, characterizes the distance between the two random variables, DKL(Y‖X); its definition is given in formula (1):
DKL(Y‖X) = Σi PY(i) log(PY(i) / PX(i))   (1)
In formula (1): PY(i) is the probability that the random variable Y = i; PX(i) is the probability that the random variable X = i.
Relative entropy is a way of describing the difference between two probability distributions; its value is always greater than or equal to 0, and it equals 0 if and only if the two distributions are identical. WoE values and IV values both develop from relative entropy. WoE (Weight of Evidence) is an encoding of the original variable; IV (Information Value) measures the information value of a variable, i.e., the degree of influence of the independent variable on the target variable.
An improved selective naive Bayes method of the present invention comprises the following steps:
Step 1: Given a data set T containing n attributes, let S = {A1, A2, ..., An} be the finite set of conditional attribute variables and C = {C1, C2, ..., Cm} the class variable, where m is the number of class values and Cj is the j-th class value.
When a two-class problem is discussed, i.e., m = 2 and C = {C1, C2}, any conditional attribute variable Ai has Si distinct values {ai1, ai2, ..., aiSi}; that is, the k-th value of attribute Ai is denoted aik.
Step 2: Define the WoE index.
WoE is a quantitative analysis method based on combining instances of particular classes. It was first proposed as a probabilistic-statistical theory in the 1950s and was later applied to medical diagnosis systems; in the 1980s, WoE came to be widely used in geographic information systems. In recent years, with the expansion of banking and other financial business and growing attention to personal credit records, WoE has increasingly become a research hot spot as a method for measuring customer asset quality.
The WoE index is an encoding of the original variable; to WoE-encode a variable, the variable must first be grouped, i.e., discretized or binned, as in formulas (2) and (3):
WoE(aik, C) = ln(P(A=aik|C=C1) / P(A=aik|C=C2))   (2)
P(A=aik|C) = N(A=aik|C) / N(C)   (3)
In formulas (2)-(3): C1 denotes the first class label and C2 the second class label; P(A=aik|C=C1) is the conditional probability that the attribute takes value aik given class C1, and P(A=aik|C=C2) the corresponding probability given class C2; N(C) is the number of samples of class C; N is the total number of data samples; N(A=aik|C) is the number of samples of class C whose attribute value is aik.
The influence that changes in value counts have on WoE is a proportional relationship, and the logarithm damps the effect of individual positive and negative examples: under a given value of an attribute, one more positive example will not change the WoE value dramatically, so the influence on WoE of problems such as sampling error and noise is kept under a certain control. WoE is mainly used to measure the tendency of each value of an attribute toward the classification result under the same attribute. If a value of an attribute occurs only once and points to a positive classification result, this cannot show that the value has an absolutely positive guiding effect on the result. Therefore, when measuring the contribution of the attribute to the classification result within the whole attribute set, the frequency of occurrence of each value must also be taken into account, which gives rise to the IV index.
Step 3: Define the IV index.
The IV index measures the information value of a variable, i.e., the degree of influence of the independent variable on the target variable, as shown in formula (4):
IV(aik, C) = (P(A=aik|C=C1) - P(A=aik|C=C2)) * WoE(aik, C)   (4)
The IV value of attribute Ai is then the sum of the IV values of its groups, i.e.:
IV(Ai) = Σk IV(aik, C)   (5)
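The WoE and IV definitions of formulas (2)-(5) reduce to a few lines of code. This is a minimal sketch that assumes the class-conditional probabilities have already been estimated from counts; the function names are illustrative, not from the patent.

```python
import math

def woe(p_a_c1, p_a_c2):
    """Formula (2): WoE(aik, C) = ln(P(A=aik|C=C1) / P(A=aik|C=C2))."""
    return math.log(p_a_c1 / p_a_c2)

def iv_term(p_a_c1, p_a_c2):
    """Formula (4): IV(aik, C) = (P(.|C1) - P(.|C2)) * WoE(aik, C)."""
    return (p_a_c1 - p_a_c2) * woe(p_a_c1, p_a_c2)

def iv_attribute(value_probs):
    """Formula (5): IV of attribute Ai is the sum of its per-value IV terms.

    value_probs: one (P(aik|C1), P(aik|C2)) pair per grouped value of Ai.
    """
    return sum(iv_term(p1, p2) for p1, p2 in value_probs)

# A value distributed identically in both classes carries no information.
print(iv_term(0.5, 0.5))  # 0.0

# Both directions of imbalance contribute positively, as the sign analysis below notes.
print(round(iv_attribute([(0.7, 0.3), (0.3, 0.7)]), 4))  # 0.6778
```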
Step 4: Combining step 1, introduce the WoE index of step 2 and the IV index of step 3 into attribute selection and construct a naive Bayes classifier.
The IV value is introduced into the naive Bayes classification algorithm as part of attribute selection. To fully consider the influence of the IV value on the naive Bayes classification algorithm, observe from the mathematical formulas (4)-(5) of the IV value that:
When P(Ai=aik|C=C1) > P(Ai=aik|C=C2) > 0, WoE(aik, C) > 0 and IV(aik, C) > 0;
when 0 < P(Ai=aik|C=C1) < P(Ai=aik|C=C2), WoE(aik, C) < 0 and IV(aik, C) > 0;
when P(Ai=aik|C=C1) = P(Ai=aik|C=C2), WoE(aik, C) = 0 and IV(aik, C) = 0.
Step 4.1: Filter the attribute subset highly correlated with the class out of the original attribute set by computing IV values.
From the naive Bayes weighting formula, classifying a sample X requires formulas (6) and (7):
P(C1|X) = P(C1) * Πi P(aik|C1)   (6)
P(C2|X) = P(C2) * Πi P(aik|C2)   (7)
In formulas (6)-(7): P(aik|C1), identical to P(A=aik|C=C1), is the conditional probability that the attribute takes value aik given class C1; P(aik|C2), identical to P(A=aik|C=C2), is the corresponding probability given class C2; P(C1) is the prior probability of class C1; P(C2) is the prior probability of class C2; P(C1|X) is the conditional probability of class C1 given attributes X; P(C2|X) is the conditional probability of class C2 given attributes X; X denotes an unlabeled database sample, represented as an n-dimensional feature vector.
Step 4.2: Select a threshold and carry out attribute filtering.
Normalizing formula (6) gives formula (8):
P(C1|X)' = a * Πk β(aik) / (1 + a * Πk β(aik))   (8)
where a = P(C1)/P(C2) is constant for a given data set and β(aik) = P(aik|C1)/P(aik|C2). Similarly, normalizing formula (7) gives formula (9):
P(C2|X)' = 1 / (1 + a * Πk β(aik))   (9)
In formulas (8)-(9): P(C1|X)' denotes the normalized conditional probability of class C1 given attributes X; P(C2|X)' denotes the normalized conditional probability of class C2 given attributes X.
Analyzing the above formulas:
When β(aik) > 1, i.e., WoE(aik, C) > 0 and IV(aik, C) > 0, the larger β(aik) is, the larger the value of P(C1|X)' and the smaller the value of P(C2|X)';
when 0 < β(aik) < 1, i.e., WoE(aik, C) < 0 and IV(aik, C) > 0, the smaller β(aik) is, the smaller the value of P(C1|X)' and the larger the value of P(C2|X)'.
In summary: the further β(aik) deviates from 1, the larger the difference between the posterior probabilities P(C1|X)' and P(C2|X)'; when β(aik) = 1, β(aik) has no influence on the values of P(C1|X)' or P(C2|X)'.
Since IV(aik, C) grows with the deviation of β(aik) from 1, a larger IV(aik, C) > 0 indicates a larger difference between the class labels on this variable value, that is, a better classification ability of the variable; when IV(aik, C) = 0, the variable has no influence on classification.
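The effect of β(aik) on the normalized posteriors can be checked numerically. The sketch below assumes two classes with pre-estimated priors and likelihoods; it is an illustration of formulas (6)-(9), not the patent's implementation.

```python
def nb_posteriors(prior_c1, prior_c2, likelihoods):
    """Normalized two-class naive Bayes posteriors (formulas (6)-(9)).

    likelihoods: list of (P(aik|C1), P(aik|C2)) pairs, one per attribute of X.
    """
    s1, s2 = prior_c1, prior_c2
    for p1, p2 in likelihoods:
        s1 *= p1  # formula (6), unnormalized
        s2 *= p2  # formula (7), unnormalized
    total = s1 + s2
    return s1 / total, s2 / total  # formulas (8)-(9)

# beta = 0.8 / 0.2 = 4 > 1 pulls the posterior toward C1.
print(nb_posteriors(0.5, 0.5, [(0.8, 0.2)]))  # approximately (0.8, 0.2)

# beta = 1 leaves the priors untouched: the attribute value is uninformative.
print(nb_posteriors(0.5, 0.5, [(0.4, 0.4)]))  # approximately (0.5, 0.5)
```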
Table 1 gives the threshold division by which IV values measure the degree of correlation between an attribute and the class attribute.
Table 1. Threshold division of the degree of correlation between attribute and class attribute measured by IV values
Degree of correlation | IV value |
No correlation | IV < 0.02 |
Weak correlation | 0.02 ≤ IV < 0.1 |
Medium correlation | 0.1 ≤ IV < 0.3 |
Strong correlation | IV ≥ 0.3 |
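The bands of Table 1 translate into a simple screening predicate; the band names and function names below are illustrative, assuming the IV value of each attribute has already been computed.

```python
def iv_band(iv):
    """Correlation band of Table 1 for an attribute's IV value."""
    if iv < 0.02:
        return "no correlation"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    return "strong"

def first_round_screen(attrs_with_iv, threshold=0.02):
    """Keep attributes whose IV meets the threshold, sorted high to low."""
    return sorted((t for t in attrs_with_iv if t[1] >= threshold),
                  key=lambda t: t[1], reverse=True)

print(first_round_screen([("A1", 0.25), ("A2", 0.01), ("A3", 0.4)]))
# [('A3', 0.4), ('A1', 0.25)]
```

The default threshold of 0.02 corresponds to Table 1's "no correlation" cutoff; a stricter threshold would simply keep fewer attributes.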
Step 4.3, in step 4.2 Naive Bayes Classifier is constructed on the preferable attribute set of classification capacity.
From the above, introducing IV values into attribute selection has the following advantages: (1) IV can measure the size of each attribute's influence on classification; (2) compared with most other feature selection algorithms, IV has a more universal threshold for determining the degree of correlation between an attribute and the class attribute, whereas most other attribute selection indexes can only rank attributes by influence and make it hard to decide how many attributes to select for a relatively satisfactory result; (3) IV is fast to compute, has small space-time overhead, and is well suited to data preprocessing.
However, using the IV index directly for attribute selection raises the following issues. In actual computation, the probabilities required for the IV values of discrete attributes can be obtained by counting, but for continuous attributes the probability distribution of the data is hard to obtain, and the computation of integrals is difficult and incurs a large space-time overhead; therefore, before computing IV, continuous attributes are discretized to preserve the efficiency of the IV index. In addition, in real data it often happens that the frequency of some value under some attribute is 0, i.e., when attribute Ai takes value aik there is no instance of class C. This can make a denominator 0, so that the WoE value tends to infinity, which is clearly unreasonable. A normalization treatment is adopted for this case: as analyzed earlier, slight changes in value counts do not cause large fluctuations of the WoE value, so when the instance count of some attribute value is 0, we treat it as 1.
The purpose of attribute subset selection is to find a concise subset of the original attribute set such that, run on the data restricted to the attributes in this subset, the learning algorithm produces a classifier of the highest possible accuracy. Original attribute sets, such as the data sets from the UCI Repository of machine learning databases shown in Table 2, contain continuous-valued attributes, missing values, and the like. Therefore, the key of attribute subset selection is to find a reduced yet excellent attribute set. An excellent attribute subset exhibits both a strong overall correlation with the class attribute and very little overall redundancy among the attributes within it. To select an optimal subset, both the correlation between attributes and the class attribute and the redundancy between attributes must be considered during attribute subset selection.
Table 2. Data set description
IV-index screening alone can obtain an attribute subset strongly correlated with the class attribute, but it ignores the redundancy between attributes; conversely, although the selective naive Bayes method can solve the attribute redundancy problem, it has no definite criterion when screening attribute subsets, which may lead to exhaustive search, increasing computational complexity and reducing operational efficiency.
Step 5: First filter the original finite set S of conditional attribute variables by the IV index to obtain the attribute subset S' meeting the threshold requirement; sort the attributes in S' by IV value from high to low; finally, search the sorted attribute set S' for the attribute subset that optimizes the performance of the classifier.
The specific operations are as follows:
Step 5.1: Input the sample data set T to be classified from step 1, the conditional attribute set, i.e., the finite set of conditional attribute variables S = {A1, A2, ..., An}, and the decision attribute set, i.e., the class variable C = {C1, C2, ..., Cm}; preprocess the data of the sample data set T to be classified.
Step 5.2: Initialize the candidate conditional attribute set S; let S' be the attribute set chosen by attribute selection, S'' the set of attributes not selected, and S''' the attribute set sorted by the IV index of the attributes from high to low; set S', S'', and S''' to empty, the maximum accuracy Accuracymax = 0, and the current accuracy Accuracycur = 0.
Step 5.3: Compute the information value (IV) of all attributes in the conditional attribute set S and perform the first round of screening by threshold: add the attributes whose IV value is greater than or equal to the threshold to the set S''' of step 5.2, add the attributes whose IV value is below the threshold to the set S'', and sort the attributes in S''' by IV value from high to low.
Step 5.4: If S''' is empty, terminate the computation and save the current S' and Accuracymax.
Step 5.5: If S''' is not empty, continue: select the first attribute Ai in the attribute set S''', delete it from S''' and add it to S'; construct a naive Bayes classifier on the updated S' and compute Accuracycur.
Step 5.6: If Accuracycur > Accuracymax in step 5.5, update Accuracymax = Accuracycur and return to step 5.4; if Accuracycur ≤ Accuracymax in step 5.5, remove Ai from S', add it to S'', save the current S' and Accuracymax, and terminate the computation.
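Steps 5.2-5.6 above can be sketched as a greedy forward sweep. In this sketch, `accuracy_of` is a caller-supplied stand-in for training and scoring the naive Bayes classifier on a candidate subset, and the search stops at the first attribute that fails to improve accuracy, which is one reading of the termination rule in step 5.6; all names are illustrative.

```python
def select_attributes(attrs_with_iv, accuracy_of, threshold=0.02):
    """Two-round selection of step 5: IV screening, then forward search.

    attrs_with_iv: list of (attribute, IV) pairs (the set S with IV values);
    accuracy_of:   callable scoring a classifier built on a list of attributes.
    """
    # Step 5.3: first-round screen by threshold, sorted by IV high to low (S''').
    shortlist = [a for a, iv in sorted(attrs_with_iv, key=lambda t: -t[1])
                 if iv >= threshold]
    selected, best = [], 0.0          # S' and Accuracymax (step 5.2)
    for attr in shortlist:            # steps 5.4-5.5
        acc = accuracy_of(selected + [attr])
        if acc > best:                # step 5.6
            best = acc
            selected = selected + [attr]
        else:
            break                     # first non-improving attribute ends the search
    return selected, best

# Toy scorer: adding "A2" after "A1" hurts accuracy, so the search stops there;
# "A3" never enters the shortlist because its IV is below the threshold.
scores = {("A1",): 0.80, ("A1", "A2"): 0.75}
print(select_attributes([("A1", 0.4), ("A2", 0.2), ("A3", 0.01)],
                        lambda s: scores.get(tuple(s), 0.0)))
# (['A1'], 0.8)
```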
The attributes selected by the present invention are a subset of the naive Bayes attribute set; they improve the classification performance of naive Bayes when attributes are redundant, while keeping the classification performance of naive Bayes when attributes are not redundant. By introducing the IV concept, a first-round attribute subset is obtained by threshold screening to reduce the traversal space, and a second round of selection over the attribute subset is performed by forward search: following the principle of the greedy algorithm, at each step of the search the algorithm assumes that all local changes to the current attribute subset are optimal choices.
The quality of the optimal subset selected by attribute selection is always relative to a specific evaluation criterion; different systems often recognize different "optimal subsets", and the optimal subset obtained under criterion A may not be the best under criterion B. For supervised learning, the ultimate purpose and meaning of attribute selection are to improve the accuracy of the classifier. Therefore, the present invention adopts the most intuitive approach and uses classification accuracy (or error rate) as the evaluation criterion, taking the accuracy of the naive Bayes classifier as the measure of attribute selection quality.
To verify the validity of the method, first, databases from the UCI Repository of machine learning databases are chosen and, with reference to Table 1, the attribute subset sizes and classification accuracies under different thresholds are compared, demonstrating the correctness of introducing the IV index. Then, the IV index is compared with the common attribute selection methods Cor (Correlation), GR (Gain Ratio), IG (InfoGain), and OneR, computing under each method the degree of correlation between conditional attributes and the class attribute as well as the classification accuracy, to establish the validity of the method of the present invention. Finally, the performance of the naive Bayes classifier is compared with that of the naive Bayes classifier after attribute selection, proving that the improved method of the present invention can significantly reduce attribute dimensionality while ensuring classification accuracy.
Claims (4)
1. An improved selective naive Bayes method, characterized by comprising the following steps:
Step 1, a data set T containing n attributes is given; let S = {A1, A2, ..., An} be the finite set of conditional attribute variables and C = {C1, C2, ..., Cm} the class variable, where m is the number of class values and Cj is the j-th class value;
when a two-class problem is discussed, i.e., m = 2 and C = {C1, C2}, any conditional attribute variable Ai has Si distinct values {ai1, ai2, ..., aiSi}; that is, the k-th value of attribute Ai is denoted aik;
Step 2, the WoE index is defined;
the WoE index is an encoding of the original variable; to WoE-encode a variable, the variable must first be grouped, as in formulas (2) and (3):
WoE(aik, C) = ln(P(A=aik|C=C1) / P(A=aik|C=C2))   (2)
P(A=aik|C) = N(A=aik|C) / N(C)   (3)
in formulas (2)-(3): C1 denotes the first class label and C2 the second class label; P(A=aik|C=C1) is the conditional probability that the attribute takes value aik given class C1, and P(A=aik|C=C2) the corresponding probability given class C2; N(C) is the number of samples of class C; N is the total number of data samples; N(A=aik|C) is the number of samples of class C whose attribute value is aik;
Step 3, the IV index is defined;
the IV index measures the information value of a variable, i.e., the degree of influence of the independent variable on the target variable, as shown in formula (4):
IV(aik, C) = (P(A=aik|C=C1) - P(A=aik|C=C2)) * WoE(aik, C)   (4)
the IV value of attribute Ai is then the sum of the IV values of its groups, i.e.:
IV(Ai) = Σk IV(aik, C)   (5)
Step 4, combining step 1, the WoE index of step 2 and the IV index of step 3 are introduced into attribute selection and a naive Bayes classifier is constructed;
Step 5, on the basis of step 4, the original finite set S of conditional attribute variables of step 1 is first filtered by the IV index to obtain the attribute subset S' meeting the threshold requirement; the attributes in S' are sorted by IV value from high to low; finally, the attribute subset that optimizes the performance of the classifier is searched for on the sorted attribute set S'.
2. The improved selective Naive Bayes method according to claim 1, wherein the concrete operations of step 4 are:
Step 4.1: filter out the attribute set highly correlated with the class from the original attribute set by computing IV values;
according to the naive Bayes formula, classifying a sample X requires formulas (6) and (7):
P(C1|X) = P(C1) * ∏i=1..n P(aik|C1)  (6)
P(C2|X) = P(C2) * ∏i=1..n P(aik|C2)  (7)
In formulas (6)-(7): P(aik|C1) is identical to P(A=aik|C=C1), the conditional probability that the attribute takes value aik given class C1; P(aik|C2) is identical to P(A=aik|C=C2), the corresponding probability given class C2; P(C1) and P(C2) are the prior probabilities of classes C1 and C2; P(C1|X) and P(C2|X) are the posterior probabilities of classes C1 and C2 given sample X; X denotes an unlabeled database sample represented as an n-dimensional feature vector;
Step 4.2: select a threshold and filter the attributes;
normalizing formula (6) yields formula (8):
P(C1|X)' = P(C1|X) / (P(C1|X) + P(C2|X))  (8)
where a = P(C1|X) + P(C2|X) is a constant for a given data set; similarly, normalizing formula (7) yields formula (9):
P(C2|X)' = P(C2|X) / (P(C1|X) + P(C2|X))  (9)
In formulas (8)-(9): P(C1|X)' is the normalized posterior probability of class C1 given sample X, and P(C2|X)' the normalized posterior probability of class C2;
Step 4.3: construct the Naive Bayes classifier on the attribute set with good classification ability obtained in step 4.2.
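A minimal sketch of formulas (6)-(9): the two class scores are each a prior multiplied by per-attribute conditional probabilities, and normalizing them to sum to one leaves their ordering unchanged. The function names and dictionary layout are illustrative assumptions:

```python
def nb_scores(x, priors, cond_probs):
    """Un-normalized naive Bayes scores per formulas (6)-(7).

    x:          feature vector (a1k, ..., ank) of the sample to classify
    priors:     {class: P(C)}
    cond_probs: {class: [per-attribute dict {value: P(Ai=value|C)}]}
    """
    scores = {}
    for c, p in priors.items():
        s = p
        for i, a in enumerate(x):
            s *= cond_probs[c][i].get(a, 0.0)   # P(aik|C) for attribute i
        scores[c] = s
    return scores

def normalize(scores):
    """Formulas (8)-(9): divide each score by the sum of both scores."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores
```

Because both scores are divided by the same constant, the predicted class (the argmax) is unchanged; normalization only turns the scores into probabilities that sum to one.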
3. The improved selective Naive Bayes method according to claim 2, wherein the threshold division and the degree to which the IV value measures the correlation between an attribute and the class attribute are as follows:
4. The improved selective Naive Bayes method according to claim 1, wherein the concrete operations of step 5 are:
Step 5.1: input the sample data set T to be classified from step 1, the conditional attribute set, i.e. the finite set of categorical attribute variables S = {A1, A2, ..., An}, and the decision attribute set, i.e. the class variable C = {C1, C2, ..., Cm}; then preprocess the sample data set T;
Step 5.2: initialize the candidate conditional attribute set S; let S' be the set of attributes chosen by attribute selection, S'' the set of attributes not chosen, and S''' the attribute set sorted by descending IV index; set S', S'', and S''' to be empty, the maximum accuracy Accuracymax = 0, and the current accuracy Accuracycur = 0;
Step 5.3: compute the information value (IV) of every attribute in the conditional attribute set S and perform a first round of screening by threshold: attributes whose IV value is greater than or equal to the threshold are added to the set S''' of step 5.2, attributes whose IV value is below the threshold are added to the set S'', and the attributes in S''' are sorted in descending order of IV value;
Step 5.4: if S''' is empty, terminate the computation and save the current S' and Accuracymax;
Step 5.5: if S''' is not empty, continue: select the first attribute Ai in the attribute set S''', delete it from S''', and add it to S'; construct the Naive Bayes classifier on the updated S' and compute Accuracycur;
Step 5.6: if Accuracycur > Accuracymax in step 5.5, update Accuracymax = Accuracycur and return to step 5.4; if Accuracycur ≤ Accuracymax, remove Ai from S', add it to S'', save the current S' and Accuracymax, and terminate the computation.
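Steps 5.1-5.6 amount to an IV-ordered greedy forward search: screen attributes by an IV threshold, sort the survivors by descending IV, then add them one at a time, keeping each attribute only if it improves classifier accuracy. A sketch under stated assumptions: `iv_of`, `accuracy_of`, and the threshold are placeholders for the patent's own IV computation (formula (5)) and classifier evaluation, and the loop continues through all screened attributes rather than stopping at the first non-improving one (a common selective-NB variant; a literal reading of step 5.6 terminates there):

```python
def select_attributes(attributes, iv_of, accuracy_of, threshold):
    """Greedy forward selection following the spirit of steps 5.1-5.6.

    attributes:  candidate conditional attribute set S
    iv_of:       maps attribute -> IV value (formula (5))
    accuracy_of: maps a selected subset S' -> classifier accuracy
    threshold:   IV cutoff for the first-round screening (step 5.3)
    """
    # Step 5.3: first-round screening by threshold, then descending IV sort -> S'''
    s3 = sorted((a for a in attributes if iv_of(a) >= threshold),
                key=iv_of, reverse=True)
    selected, rejected = [], []                     # S', S''
    acc_max = 0.0                                   # Accuracy_max
    while s3:                                       # Step 5.5: S''' not empty
        a = s3.pop(0)                               # first (highest-IV) attribute
        acc_cur = accuracy_of(selected + [a])       # Accuracy_cur on updated S'
        if acc_cur > acc_max:                       # Step 5.6: keep the attribute
            selected.append(a)
            acc_max = acc_cur
        else:                                       # otherwise move it to S''
            rejected.append(a)
    return selected, acc_max                        # Step 5.4: S''' empty -> save S'
```

Because candidates are tried in descending IV order, the attributes most informative about the class are given the first chance to enter S'.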
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810291375.6A CN108805156A (en) | 2018-04-03 | 2018-04-03 | An improved selective Naive Bayes method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810291375.6A CN108805156A (en) | 2018-04-03 | 2018-04-03 | An improved selective Naive Bayes method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108805156A true CN108805156A (en) | 2018-11-13 |
Family
ID=64094674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810291375.6A Pending CN108805156A (en) | 2018-04-03 | 2018-04-03 | An improved selective Naive Bayes method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108805156A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241079A (en) * | 2020-01-08 | 2020-06-05 | 哈尔滨工业大学 | Data cleaning method and device and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xiao et al. | Cost-sensitive semi-supervised selective ensemble model for customer credit scoring | |
CN108898479B (en) | Credit evaluation model construction method and device | |
Sabau | Survey of clustering based financial fraud detection research | |
CN108596199A (en) | Unbalanced data classification method based on EasyEnsemble algorithms and SMOTE algorithms | |
Kim et al. | Ordinal classification of imbalanced data with application in emergency and disaster information services | |
CN109739844B (en) | Data classification method based on attenuation weight | |
CN109615014A (en) | A kind of data sorting system and method based on the optimization of KL divergence | |
CN107273387A (en) | Towards higher-dimension and unbalanced data classify it is integrated | |
CN107391772A (en) | A kind of file classification method based on naive Bayesian | |
CN112001788B (en) | Credit card illegal fraud identification method based on RF-DBSCAN algorithm | |
Safitri et al. | Improved accuracy of naive bayes classifier for determination of customer churn uses smote and genetic algorithms | |
Manziuk et al. | Definition of information core for documents classification | |
CN109726918A (en) | The personal credit for fighting network and semi-supervised learning based on production determines method | |
CN112700324A (en) | User loan default prediction method based on combination of Catboost and restricted Boltzmann machine | |
Chern et al. | A decision tree classifier for credit assessment problems in big data environments | |
CN104615789A (en) | Data classifying method and device | |
Pristyanto et al. | The effect of feature selection on classification algorithms in credit approval | |
Tsai | Two‐stage hybrid learning techniques for bankruptcy prediction | |
Chiang et al. | The Chinese text categorization system with association rule and category priority | |
CN114064459A (en) | Software defect prediction method based on generation countermeasure network and ensemble learning | |
CN109716660A (en) | Data compression device and method | |
CN114037001A (en) | Mechanical pump small sample fault diagnosis method based on WGAN-GP-C and metric learning | |
CN108805156A (en) | An improved selective Naive Bayes method | |
CN115271442A (en) | Modeling method and system for evaluating enterprise growth based on natural language | |
CN115618297A (en) | Method and device for identifying abnormal enterprise |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20181113 |