CN107169515B - Personal income classification method based on improved naive Bayes - Google Patents


Publication number: CN107169515B
Authority: CN (China)
Application number: CN201710323947.XA
Other languages: Chinese (zh)
Other versions: CN107169515A
Inventors: 宁可, 孙同晶, 曹红
Current and original assignee: Hangzhou Dianzi University
Application filed by Hangzhou Dianzi University; priority to CN201710323947.XA
Publication of application CN107169515A; granted and published as CN107169515B
Legal status: Active (the legal status listed is an assumption, not a legal conclusion)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 — Bayesian classification
    • G — PHYSICS
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 — Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a personal income classification method based on improved naive Bayes. First, the method provides a class-conditional probability estimation method for continuous data, used to determine the prior probability and the class-conditional probability of each class. Second, Laplace calibration is introduced on the segmented data set, avoiding the zero-probability problem in which a single zero count dominates the result. Third, an improved association rule algorithm based on the Apriori algorithm is provided, which measures the degree of association between two attributes from the classification results on the training set. Finally, the concept of attribute weighting is introduced, solving the problem that all attributes of the naive Bayes algorithm influence the result equally.

Description

Personal income classification method based on improved naive Bayes
Technical Field
The invention belongs to the field of classification methods, relates to an improved naive Bayes classification method for processing continuous data, and particularly relates to a personal income classification method based on improved naive Bayes, which is used for judging the level of personal income.
Background
With the continuous development of society, people's purchasing power keeps improving, and online shopping has become a main purchasing channel for young people. A large number of shopping websites and applications therefore have a strong demand for commodity recommendation systems.
When recommending commodities, the income level of a user is first confirmed, the users are then classified according to income, and different commodities are recommended to users of different classes. Most current user classification systems, however, classify according to the types and quantities of goods purchased. This approach has certain advantages: an established user can be classified accurately from his or her shopping records, and the income level to which the user belongs can be determined. For a newly registered user, however, there is no purchase record and therefore no basis for judgment.
Based on the current situation, the invention provides a personal income judgment and classification method based on improved naive Bayes.
Classification is a key step in data mining, machine learning, and pattern recognition. It belongs to supervised learning: the main process is to analyse a data set to obtain the classification principles it contains, build a classifier from them, and then use the classifier to determine the category of unclassified data.
The naive Bayes classification algorithm is based on Bayes' theorem, which allows the two conditional probabilities P(A|B) and P(B|A) to be exchanged when one of them is known. Building on this principle, for a given item to be classified, the algorithm computes the probability that the item belongs to each category given the item's attribute values, and assigns the item to the category with the largest probability. However, because of the conditional-independence assumption of the theorem, and because all attributes influence the result equally, the accuracy of the naive Bayes algorithm in practice is low, and many researchers have therefore improved it.
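As a minimal illustration of this exchange of conditional probabilities, a Python sketch (the numbers in the example are hypothetical, not from the patent):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.6, P(A) = 0.5, P(B) = 0.4
p_a_given_b = bayes(0.6, 0.5, 0.4)  # ~0.75
```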
Existing improvements to the naive Bayes algorithm mainly include the following. First, for text data, a naive Bayes classifier improved with frequent item sets in combination with the Apriori algorithm: the classification effect is excellent, but the range of use is narrow, being suitable only for text data. Second, classifiers that weight the attributes according to their influence on the result: these can improve the classification accuracy to some extent, but used alone still do not achieve an excellent effect. Third, improvements to the basic formula of the algorithm: these raise the accuracy without other consequences, but the scope of the improvement is narrow and not conducive to further research.
Disclosure of Invention
The invention aims to solve the above problems by providing an improved Bayesian classification method based on association rules and applying it to a personal income judgment and classification method. The method uses association rules to improve accuracy, is applicable to data sets other than text data, and thus solves the problem that association-rule-improved Bayes algorithms are suitable only for text classification. In the method, first, a class-conditional probability estimation method for continuous data is provided for determining the class-conditional probability of each class. Second, Laplace calibration is introduced on the segmented data set, avoiding the zero-probability problem. Third, an improved association rule algorithm is provided, which measures the degree of association between two attributes from the classification results on the training set. Finally, the concept of attribute weighting is introduced, solving the problem that all attributes of the naive Bayes algorithm influence the result equally.
To solve the above technical problem, the technical solution adopted by the invention is an improved naive Bayes algorithm based on a continuous data set, comprising the following steps:
Step (1): acquire a data set for discriminating residents' income level, whose attribute variables comprise age, work category, education level, gender, work place, and other information for classifying residents' income level;
the collected attribute variables comprise continuous attributes and discrete attributes;
Step (2): quantize the discrete character-type (text) attributes in the data set acquired in step (1):
select an attribute from the discrete character-type attributes; elements with the same text are represented by the same number, and elements with different text contents must not be represented by the same number.
For example, select the work-category attribute, which contains text contents such as doctor, teacher, and lawyer. The quantization replaces every element of the work-category attribute whose content is doctor with 1, teacher with 2, and lawyer with 3; a situation in which both teacher and lawyer are replaced with 1 must not occur.
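The quantization of step (2) can be sketched as follows (the function name and the order of code assignment are my own; the patent requires only that identical text receives the same number and different text receives different numbers):

```python
def quantize_categorical(values):
    """Map each distinct string in a categorical column to a distinct integer.

    Codes start at 1 and are assigned in order of first appearance, so equal
    strings always share a code and different strings never do.
    """
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1  # next unused integer
        encoded.append(codes[v])
    return encoded, codes

encoded, codes = quantize_categorical(["doctor", "teacher", "doctor", "lawyer"])
# doctor -> 1, teacher -> 2, lawyer -> 3
```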
Step (3), discretizing the continuous attribute in the data set acquired in the step (1):
3.1 selecting a continuous type attribute A from the data set;
3.2 According to the classification results C_1, C_2, …, C_n present in the data set, record the sets of elements in attribute A belonging to the different categories as A_c1, A_c2, …, A_cn;
3.3 Calculate the means μ_1, μ_2, …, μ_n and variances σ_1², σ_2², …, σ_n² of A_c1, A_c2, …, A_cn;
3.4 Using the Gaussian density formula, calculate the crossing point x_i of each pair of adjacent class element sets A_ci and A_c(i+1), denoted q_1, q_2, …, q_(n-1). The crossing point is the solution of:

(1/(√(2π)·σ_i))·exp(−(x−μ_i)²/(2σ_i²)) = (1/(√(2π)·σ_(i+1)))·exp(−(x−μ_(i+1))²/(2σ_(i+1)²))

where 1 ≤ i ≤ n−1;
3.5 Arrange the crossing points q_1, q_2, …, q_(n-1) in ascending order and, taking them as division points, partition all elements of attribute A into the element sets A_1, A_2, …, A_n.
3.6 Replace all elements of the same category from 3.5 with one constant, while elements belonging to different categories must be replaced with different constants (e.g. elements in A_1 are replaced by the number 1, elements in A_2 by 2, …; A_1 and A_2 cannot both use the number 1).
3.7 discretizing other continuous attributes by adopting the steps 3.1-3.6, and sorting and merging until all the continuous attributes are processed;
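Steps 3.3–3.4 reduce to finding where two Gaussian densities are equal. A sketch under the assumption that equating the log-densities and solving the resulting quadratic is an acceptable route (the function name is my own):

```python
import math

def gaussian_crossing(mu1, s1, mu2, s2):
    """Crossing point(s) x where the N(mu1, s1^2) and N(mu2, s2^2) densities meet.

    s1, s2 are standard deviations. Equating the two log-densities gives a
    quadratic a*x^2 + b*x + c = 0 in x.
    """
    if math.isclose(s1, s2):
        # Equal variances: a single crossing at the midpoint of the means.
        return [(mu1 + mu2) / 2.0]
    a = 1.0 / (2 * s2 ** 2) - 1.0 / (2 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2) + math.log(s2 / s1)
    disc = b * b - 4 * a * c  # always positive for two distinct Gaussians
    r1 = (-b + math.sqrt(disc)) / (2 * a)
    r2 = (-b - math.sqrt(disc)) / (2 * a)
    return sorted([r1, r2])
```

With equal class variances the single crossing point lies midway between the two class means, matching the intuition of placing the division point q_i between adjacent classes.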
Step (4): handle the cases in which a class-conditional probability in the data set is 0 after the preliminary processing of steps (2)–(3):
A class-conditional probability is a probability conditioned on a class. In step 4 it refers specifically to the probability that an overall classification result occurs within an attribute category: for example, within the attribute category doctor, the proportion of doctors with annual income above 500,000 among all doctors. The reverse conditionals are also class-conditional probabilities, but are less intuitive.
Laplace calibration is used to solve the problem that a zero count dominates the result: 1 is added to the attribute count behind each class-conditional probability of every attribute, so that a probability of 0 never occurs;
Step (5): obtain the prior probability of each class and the class-conditional probability P(A_i|C_j) of each attribute, where A_i denotes the i-th attribute category of attribute A and C_j denotes the j-th category of the classification result C.
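Steps (4)–(5) can be sketched together — add-one (Laplace) calibration combined with the prior and class-conditional estimates (function names and data layout are my own):

```python
from collections import Counter

def priors(labels):
    """P(C_j): relative frequency of each classification result in the training set."""
    n = len(labels)
    counts = Counter(labels)
    return {c: counts[c] / n for c in counts}

def class_conditional(column, labels, value, label, n_values):
    """P(value | label) for one attribute, with add-one Laplace calibration.

    n_values: number of distinct categories of this attribute. Adding 1 to every
    count (and n_values to the denominator) keeps the probabilities summing to 1
    while avoiding exact zeros.
    """
    pair = sum(1 for v, c in zip(column, labels) if v == value and c == label)
    in_class = sum(1 for c in labels if c == label)
    return (pair + 1) / (in_class + n_values)
```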
Step (6): judge the correlation among the attributes with an improved association rule algorithm, and judge the attributes with a higher degree of association:
6.1 Select attributes with the same number of attribute categories, and under the same overall classification result C_k judge the degree of association of any two attribute categories:

P(A_i|C_k) − P(B_i|C_k),  i ≤ n, k ≤ n;
the attribute category refers to the category within the attribute itself, such as the categories of doctors, lawyers, teachers, etc. in the work category attribute we refer to above.
If all the absolute values of these differences are less than 0.2, the degree of association of attributes A and B under the overall classification result C_k is high, so their association under the other overall classification results must be judged next; if any absolute value is greater than 0.2, the degree of association of the two attributes is not high, and no further judgment is needed.
6.2 If the degree of association of attributes A and B is still high under all overall classification results, randomly select one of the two attributes to retain and delete the other; if under some overall classification result the association of the two attributes is low, both attributes are retained.
The overall classification result mentioned above refers to the categories of the final result in the data set itself. In this text it is personal income, specifically: annual income above 50w, 30–50w, 10–30w, and below 10w (where 1w = 10,000). This classification already exists in the data set.
Briefly, the data set is a table in which each row is a resident, the columns are the attribute variables above, and the last column records the annual income class. [The original shows an example data-set table here as an image.] This is a data set for classification, and the annual income information in the last column is essential.
6.3, judging the relevance of the attributes with the same number of other attribute types according to the operation of the steps 6.1-6.2, deleting and reserving the attributes in the data set according to the result until all the attributes with the same number of attribute types are judged, and updating the data set.
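Steps 6.1–6.3 can be sketched as follows. One simplification to flag: the probabilities are paired index-by-index here, whereas the patent only requires that each class-conditional probability be used at most once; the names are my own:

```python
def highly_associated(cond_a, cond_b, threshold=0.2):
    """Decide whether two attributes with equally many categories are redundant.

    cond_a[k][i] = P(A_i | C_k) and cond_b[k][i] = P(B_i | C_k).
    The pair counts as highly associated (so one attribute may be deleted) only
    if, under every classification result C_k, every difference
    |P(A_i|C_k) - P(B_i|C_k)| is below the threshold.
    """
    for row_a, row_b in zip(cond_a, cond_b):
        if any(abs(pa - pb) >= threshold for pa, pb in zip(row_a, row_b)):
            return False  # low association somewhere: keep both attributes
    return True
```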
Step (7): change the weight of each attribute by attribute weighting, thereby improving accuracy.
7.1 Find the maximum class-conditional probability of attribute A under each overall classification result, denoted P(A_i|C_1), P(A_j|C_2), …, P(A_k|C_n). If the same attribute category of A appears repeatedly among these maxima, the degree of association between the attribute categories of A and the overall classification results is low, A is not considered a good attribute, and A is deleted; if the attribute categories are all different, the degree of association is high, A is considered a good attribute and retained, and step 7.2 follows;
7.2 From the maximum class-conditional probabilities obtained in step 7.1, calculate the average confidence of attribute A, i.e. its degree of association with the overall classification results:

T = ( P(A_i|C_1) + P(A_j|C_2) + … + P(A_k|C_n) ) / n
wherein a larger value of T indicates a higher degree of association.
7.3 Obtain the confidence of the other attributes in the same way as the average confidence of attribute A in step 7.2, and compute the power coefficient α = 1 − T of each attribute. The formula after attribute weighting is

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)

where α_k is the power coefficient of the k-th attribute; the result is the C_i corresponding to the maximum value of the numerator of the weighted Bayes formula.
7.4 according to the steps 7.1-7.3, the relevance judgment of other attributes and the total classification result is carried out, and deletion or weighting operation is carried out according to the relevance judgment.
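Steps 7.1–7.3 in sketch form (function name and data layout are my own):

```python
def attribute_weight(max_cond_probs):
    """Return (keep, alpha) for one attribute.

    max_cond_probs: for each classification result, the (attribute_category,
    probability) pair attaining the maximum class-conditional probability.
    The attribute is deleted (keep=False) when the same attribute category
    attains the maximum under more than one classification result; otherwise
    the average confidence T is computed and the power coefficient is
    alpha = 1 - T.
    """
    categories = [cat for cat, _ in max_cond_probs]
    if len(set(categories)) < len(categories):
        return False, None  # repeated category: not a good attribute, delete it
    t = sum(p for _, p in max_cond_probs) / len(max_cond_probs)
    return True, 1.0 - t
```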
Step (8): the classification judgment process.
The multi-attribute Bayes basic formula is:

P(C_i|X) = ( P(X|C_i) · P(C_i) ) / P(X),  where P(X|C_i) = P(A_x|C_i) × P(B_x|C_i) × … × P(X_x|C_i)
As shown in step (7),

P(X|C_i) = ∏_k P(X_k|C_i)^(α_k)

then there is

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)
The C_i attaining the maximum calculated value is the class of the element.
All elements are judged by the above steps; the classification result corresponding to the maximum C_i is the required classification result.
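The decision rule of step (8) can be sketched as follows — choose the C_i that maximises the weighted numerator P(C_i)·∏_k P(x_k|C_i)^(α_k) (the function and data layout are my own):

```python
def classify(x, prior, cond, alphas):
    """Return the class C maximising P(C) * prod_k P(x_k | C) ** alpha_k.

    x:      discretised attribute values of one element
    prior:  {class: P(C)}
    cond:   {class: [ {attribute_value: P(value | C)} for each attribute ]}
    alphas: power coefficient of each attribute from the weighting step
    """
    best_class, best_score = None, float("-inf")
    for c, p_c in prior.items():
        score = p_c
        for k, v in enumerate(x):
            score *= cond[c][k].get(v, 0.0) ** alphas[k]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

P(X) is omitted because it is constant across classes, exactly as the description argues below.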
The invention has the beneficial effects that:
Compared with the prior art, the method can be applied to data sets other than text, including discrete and continuous data sets, which greatly enlarges its range of application; it also improves accuracy to a certain extent, and the improvement is especially obvious when there are many attribute categories.
Drawings
FIG. 1 is a block diagram of an improved naive Bayes algorithm based on a continuum data set;
fig. 2 is a flow chart of data segmentation using gaussian distributions.
Detailed Description
The present invention is further analyzed with reference to the following specific examples.
Example 1: as shown in fig. 1:
1) Analyse the data set. The data set is extracted from a census database and can be used to discriminate residents' income level. The attribute variables include important information such as age, work category, education level, occupation, and ethnicity; they comprise continuous attributes and discrete attributes, making this a very representative comprehensive data set.
2) For convenience of processing, quantize the discrete character-type attributes in the data set, such as occupation, ethnicity, and work category. The specific method is to convert all non-numeric information in the data set into numbers, with the same non-numeric value always converted to the same number; for example, in the work attribute, all attribute values of doctor are converted to the number 1, all values of programmer to the number 2, all values of teacher to the number 3, and so on. The same numbers may be reused across different attributes — both the work-category and occupation attributes, for instance, may use the numbers 1, 2, 3, … in place of their original values. Within one attribute, however, different attribute values must receive different numbers: in a work attribute, doctor and programmer cannot both be replaced by the number 1, or serious classification errors will occur.
The purpose of the above operation is to facilitate the calculation of class-conditional probabilities in the next step.
3) Discretize the continuous attribute data, as shown in fig. 2.
The specific process is as follows:
1. A continuous-type attribute A is selected from the data set.
2. The classification results of the data set are C_1, C_2, …, C_n. Collect the elements of attribute A belonging to the different categories and name the sets A_c1, A_c2, …, A_cn respectively.
3. Calculate the means μ_1, μ_2, …, μ_n and variances σ_1², σ_2², …, σ_n² of A_c1, A_c2, …, A_cn.
4. Using the Gaussian density formula, calculate the crossing points of each pair of adjacent categories, i.e. of A_c1 and A_c2, A_c2 and A_c3, …, A_c(n-1) and A_cn, and name them q_1, q_2, …, q_(n-1) respectively. Each crossing point is the solution x of:

(1/(√(2π)·σ_i))·exp(−(x−μ_i)²/(2σ_i²)) = (1/(√(2π)·σ_(i+1)))·exp(−(x−μ_(i+1))²/(2σ_(i+1)²))

The value of x thus obtained is the position of the crossing point.
5. Arrange the crossing points q_1, q_2, …, q_(n-1) in ascending order and use them as division points to classify all values of attribute A; the classes are A_1, A_2, …, A_n respectively.
6. Replace the attribute values of the same class with one constant (e.g. values in A_1 are replaced by the number 1, values in A_2 by 2, …).
7. Process the other continuous attributes by the same method until all continuous attributes are handled.
8. Sort and merge to obtain the preliminarily processed data set.
4) In the data set obtained by the preliminary processing, some attribute categories (such as A_i, B_n, etc.) contain no element belonging to some classification result, which greatly affects the accuracy of the classification result. For example, if in A_i the number of elements of class C_k is 0, i.e. P(A_i|C_k) = 0, then in the multi-attribute Bayes formula

P(C_i|X) = ( P(X|C_i) · P(C_i) ) / P(X)

where

P(X|C_i) = P(A_x|C_i) × P(B_x|C_i) × … × P(X_x|C_i)

the factor P(A_x|C_k) = 0 forces P(C_k|X) to equal 0 as well, so that P(B_x|C_k) and the classification effect of the other attributes is masked, and some elements are misclassified to a certain extent.
To solve this problem, Laplace calibration is introduced: 1 is added to the number of elements of each classification result class in every attribute category, whether the count is 0 or not. For example, in the attribute category A_i, 1 is added to the number of elements belonging to each of the classification categories C_1, C_2, …, C_n; 0 is thereby avoided without distorting the data.
5) Solve for the prior probability and class-conditional probability of each class.
6) Judge the correlation between attributes with the improved association rule algorithm, and judge the attributes with a higher degree of association. The specific process is shown in the following example:
the data set herein is in attribute A, with A1,A2,A3,A4,A5Five categories, among attributes B, there is B1,B2,B3Three categories, among the attributes D, there is D1,D2,D3Three categories, and the overall classification result is classified as C1,C2,C3Three categories.
Attributes A and B have different numbers of categories, so their correlation is necessarily relatively low and they do not need to participate in the judgment.
Attributes B and D have the same number of categories and thus meet the basic condition for participating in the judgment. The judgment process is as follows:
1. Calculate the 9 class-conditional probabilities P(B_1|C_1), P(B_2|C_1), P(B_3|C_1), P(B_1|C_2), …, P(B_3|C_3).
2. Calculate the 9 class-conditional probabilities P(D_1|C_1), P(D_2|C_1), P(D_3|C_1), P(D_1|C_2), …, P(D_3|C_3).
3. First judge the degree of association under category C_1: subtract P(B_1|C_1), P(B_2|C_1), P(B_3|C_1) from P(D_1|C_1), P(D_2|C_1), P(D_3|C_1) respectively, without repetition (i.e. every class-conditional probability belonging to class C_1, whether from attribute B or attribute D, appears in only one subtraction), and check whether the absolute values of all three differences are less than 0.2. If not, the two attributes are directly shown to have a low degree of association, and no further judgment is needed; if all three are less than 0.2, the association between attributes B and D under classification result C_1 is still relatively high, and the judgment must continue for C_2 and C_3.
4. Using the method of step 3, judge the association between attributes B and D under categories C_2 and C_3 respectively. If under C_2 and C_3 all three differences are also less than 0.2 in absolute value, the degree of association between attributes B and D is high under every overall classification result, and one of the two attributes can be selected for retention while the other is deleted. If this does not occur, there is some part where the association between B and D is low, and both attributes are retained.
5. Perform the above relevance judgment on the other attributes with the same number of categories, delete and retain attributes in the data set according to the result, and update the data set until all attributes with the same number of categories have been judged.
7) Change the weight of each attribute by attribute weighting to improve accuracy. The above-mentioned attribute A and classification result C are used for the description; the specific process is as follows:
1. Attribute A is known to have the five categories A_1, A_2, A_3, A_4, A_5; the overall classification result C has the three categories C_1, C_2, C_3.
2. For the three classes, find the maximum class-conditional probability of each, i.e. P(A_i|C_1), P(A_j|C_2), P(A_k|C_3), the maxima of the three groups of class-conditional probabilities. The values of i, j, and k should all be different; if a repetition occurs, the connection between the attribute categories of A and the classification effect is low, the attribute is not a good one, and it is deleted.
3. Let

T = ( P(A_i|C_1) + P(A_j|C_2) + P(A_k|C_3) ) / 3

where T represents the average confidence of attribute A, i.e. its degree of association with the classification result. The larger the value of T, the higher the degree of association.
4. Let α = 1 − T; the formula after attribute weighting is then

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)
5. Judge the relevance of the other attributes to the classification result, and delete or weight them accordingly.
8) The classification judgment process.
The Bayesian basic formula of the multi-attribute is as follows:
P(C_i|X) = ( P(X|C_i) · P(C_i) ) / P(X)

Its main function is to judge, once the specific values of all the attributes of an element (i.e. X) are determined, the probability that the element belongs to the classification result C_i.
Its specific function here is to judge the probabilities that an element belongs to each of the three classes C_1, C_2, C_3; the class with the largest value is the class to which the element most probably belongs.
In the above formula, since X is fixed for a given element, P(X) is a constant, so only the value of P(C_i) × P(X|C_i) needs to be compared.
In step 7) above, attribute weighting was adopted, so in this step it can be added to the Bayes basic formula, as shown in step 7):

P(X|C_i) = ∏_k P(X_k|C_i)^(α_k)

then there is

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)
The C_i attaining the maximum calculated value is the class of the element.
All elements are judged by the above steps to obtain the final classification result.

Claims (1)

1. A commodity recommendation method that classifies the income level of users, classifies the users according to income level, and recommends different commodities to users of different classes, the method for classifying the income level of the users being characterized by comprising the following steps:
Step (1): acquire a data set for discriminating residents' income level, the attribute variables of which are information for classifying residents' income level, the attribute variables comprising age, work category, education level, gender, and work place;
the attribute variables comprise continuous attributes and discrete character-type attributes;
Step (2): quantize the discrete character-type attributes acquired in step (1):
select an attribute from the discrete character-type attributes; elements with the same text are represented by the same number, and elements with different text contents are represented by different numbers;
step (3), discretizing the continuous attribute in the data set acquired in the step (1):
3.1 selecting all elements of the continuous type attribute A from the data set;
3.2 According to the classification results C_1, C_2, …, C_n existing in the data set, record the sets of elements belonging to the different categories among all elements of the continuous attribute A as A_c1, A_c2, …, A_cn;
3.3 Calculate the means μ_1, μ_2, …, μ_n and variances σ_1², σ_2², …, σ_n² of A_c1, A_c2, …, A_cn;
3.4 Using the Gaussian density formula, calculate the crossing point x_i of each pair of adjacent classified element sets A_ci and A_c(i+1); the crossing point is the solution of:

(1/(√(2π)·σ_i))·exp(−(x−μ_i)²/(2σ_i²)) = (1/(√(2π)·σ_(i+1)))·exp(−(x−μ_(i+1))²/(2σ_(i+1)²))

where 1 ≤ i ≤ n−1;
3.5 Arrange the crossing points in ascending order, classify all the elements of attribute A by taking the ordered crossing points as division points, and reset the element sets A_c1, A_c2, …, A_cn;
3.6 replacing all the elements of the same category processed by the step 3.5 with one constant, wherein different constants are required to replace the elements of different categories;
3.7 discretizing other continuous attributes by adopting the steps 3.1-3.6, and sorting and merging the processed continuous attributes until all the continuous attributes are processed;
step (4), processing the condition that the class condition probability in the data set after the preliminary processing in the steps (2) to (3) is 0;
adding 1 to the attribute quantity value corresponding to the class conditional probability of each attribute by using Laplacian calibration, thereby avoiding the occurrence of 0;
Step (5): obtain the prior probability and the class-conditional probability P(M_i|C_j) of each class of the attributes processed in step (4), where M_i denotes the i-th attribute category of attribute M and C_j denotes the j-th category of the classification result C;
and (6) judging the correlation among the attributes by adopting an improved association rule algorithm, and judging the attribute with higher association degree:
6.1 Select attributes with the same number of attribute categories, and under the same classification result C_k judge the degree of association of any two attribute categories:

P(M_i|C_k) − P(N_i|C_k),  i ≤ n, k ≤ n;
if all the absolute values of these differences are less than 0.2, the degree of association of attributes M and N under the classification result C_k is high, so their association under the other classification results must be judged next; if any absolute value exceeds 0.2, the degree of association of the two attributes is considered not high, and no further judgment is needed;
6.2 if the degree of association of attributes M and N is still high under all the classification results, randomly select one of the two attributes to retain and delete the other; if under some classification result the degree of association of the two attributes is low, both attributes are retained;
6.3, judging the relevance of the attributes with the same number of other attribute types according to the operation of the steps 6.1-6.2, deleting and reserving the attributes in the data set according to the result until all the attributes with the same number of attribute types are judged, and updating the data set;
step (7), changing the weight of each attribute by attribute weighting, thereby improving the classification accuracy;
7.1 find the maximum class conditional probability of attribute M under each classification result, denoted P(M_i|C_1), P(M_j|C_2), …, P(M_k|C_n); if an attribute category of M appears repeatedly among these maxima, the association between the attribute categories of M and the classification results is low, M is not considered a good attribute, and it is deleted; if the attribute categories are all different, the association between the attribute categories of M and the classification results is high, M is considered a good attribute and is retained, and step 7.2 is entered;
7.2 calculate the average confidence T of attribute M, i.e. its degree of association with the classification result, from the maximum class conditional probabilities obtained in step 7.1:

T = [P(M_i|C_1) + P(M_j|C_2) + … + P(M_k|C_n)] / n

where the larger the value of T, the higher the degree of association;
7.3 obtain the confidence of every other retained attribute in the same way as in step 7.2, and compute its power coefficient α = 1 - T; the attribute-weighted formula is then

P(C_i) ∏_j P(x_j|C_i)^(α_j)

where x_j is the value of the j-th attribute and α_j its power coefficient, i.e. after attribute weighting the classification result is the C_i corresponding to the maximum value of the numerator of the Bayesian formula;
7.4 according to steps 7.1-7.3, judge the association of the other attributes with the classification result, and delete or weight them according to the result;
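Steps 7.1-7.3 can be sketched as follows. The published averaging formula for T is only available as an image, so the mean of the per-class maxima is a reconstruction from the surrounding text and should be treated as an assumption:

```python
def confidence_and_weight(cond):
    """Average the per-class maxima of one attribute's class conditional
    probabilities to get its confidence T (step 7.2), then derive the
    power coefficient alpha = 1 - T (step 7.3)."""
    maxima = [max(p.values()) for p in cond.values()]
    t = sum(maxima) / len(maxima)
    return t, 1.0 - t

# toy table: the maxima fall on different categories ("x" under class "A",
# "y" under class "B"), so the attribute passes the step-7.1 check
cond = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.3, "y": 0.7}}
t, alpha = confidence_and_weight(cond)
```

For this table T = (0.8 + 0.7) / 2 = 0.75 and α = 0.25.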
step (8), a classification judgment process;
the multi-attribute Bayesian basic formula is:

P(C_i | x_1, …, x_m) = P(C_i) ∏_(j=1)^m P(x_j|C_i) / P(x_1, …, x_m)

in step 7) the class conditional probabilities were weighted as P(x_j|C_i)^(α_j), and the denominator P(x_1, …, x_m) is the same for every class, so the decision rule becomes

C_result = argmax over C_i of P(C_i) ∏_(j=1)^m P(x_j|C_i)^(α_j)
The C_i achieving this maximum is the desired classification result.
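Putting step (8) together, a minimal decision sketch over the weighted numerator; all tables, class names, and values are toy assumptions:

```python
def classify(sample, prior, conds, alphas):
    """Return the class C_i maximising P(C_i) * prod_j P(x_j|C_i)**alpha_j,
    the numerator of the attribute-weighted Bayesian formula; the evidence
    P(X) is identical for all classes and is therefore dropped."""
    best_cls, best_score = None, -1.0
    for cls, p in prior.items():
        score = p
        for j, xj in enumerate(sample):
            score *= conds[j][cls][xj] ** alphas[j]
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# one attribute, two income classes, power coefficient alpha = 0.5
prior = {"<=50K": 0.5, ">50K": 0.5}
conds = [{"<=50K": {"hs": 0.8, "uni": 0.2}, ">50K": {"hs": 0.3, "uni": 0.7}}]
result = classify(("uni",), prior, conds, alphas=[0.5])
```

A "uni" sample scores 0.5 · 0.7^0.5 ≈ 0.418 for ">50K" against 0.5 · 0.2^0.5 ≈ 0.224 for "<=50K", so ">50K" is returned.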
CN201710323947.XA 2017-05-10 2017-05-10 Personal income classification method based on improved naive Bayes Active CN107169515B (en)


Publications (2)

Publication Number Publication Date
CN107169515A (en) 2017-09-15
CN107169515B (en) 2020-12-15

Family

ID=59813803






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant