CN107169515B - Personal income classification method based on improved naive Bayes - Google Patents


Publication number: CN107169515B
Authority: CN (China)
Application number: CN201710323947.XA
Other languages: Chinese (zh)
Other versions: CN107169515A
Inventors: 宁可, 孙同晶, 曹红
Current and original assignee: Hangzhou Dianzi University
Application filed by Hangzhou Dianzi University; priority to CN201710323947.XA
Publication of application CN107169515A; granted and published as CN107169515B
Legal status: Active (the legal status listed is an assumption, not a legal conclusion)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 — Bayesian classification
    • G — PHYSICS
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 — Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a personal income classification method based on improved naive Bayes. First, the method provides a class-conditional probability estimation method for continuous data, used to determine the prior probability and the class-conditional probability of each class. Second, Laplace calibration is introduced on the segmented data set, avoiding the zero-probability problem in which a single zero count dominates the result. Third, an improved association rule algorithm based on the Apriori algorithm is provided, which measures the degree of association between two attributes from the classification results on the training set. Finally, the concept of attribute weighting is introduced, solving the problem that all attributes of the naive Bayes algorithm influence the result equally.

Description

Personal income classification method based on improved naive Bayes
Technical Field
The invention belongs to the field of classification methods, relates to an improved naive Bayes classification method for processing continuous data, and particularly relates to a personal income classification method based on improved naive Bayes, which is used for judging the level of personal income.
Background
With the continuous development of society, people's purchasing power keeps improving, and online shopping has become a main purchasing channel for young people. A large number of shopping websites and applications therefore have a strong demand for commodity recommendation systems.
When recommending commodities, the income level of a user is first confirmed, the users are then classified according to income, and different commodities are recommended to users of different classes. Most current user classification systems, however, classify according to the types and quantities of goods purchased. This approach has certain advantages: an established user can be classified accurately from his or her shopping records, and the income level to which the user belongs can be determined. For a newly registered user, however, there is no purchase record and therefore no basis for judgment.
Based on the current situation, the invention provides a personal income judgment and classification method based on improved naive Bayes.
Classification is a key step in data mining, machine learning, and pattern recognition. It belongs to supervised learning: the main process is to analyse a data set to obtain the classification principles it contains, build a classifier from them, and then use the classifier to determine the category of unclassified data.
The naive Bayes classification algorithm is based on Bayes' theorem, which allows the two conditional probabilities P(A|B) and P(B|A) to be exchanged when one of them is known. Building on this principle, for a given item to be classified, the algorithm computes the probability that the item belongs to each category given the item's attribute values, and assigns the item to the category with the largest probability. However, because of the conditional-independence assumption of the theorem, and because all attributes influence the result equally, the accuracy of the naive Bayes algorithm in practice is low, and many researchers have therefore improved it.
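As a minimal illustration of this exchange of conditional probabilities, a Python sketch (the numbers in the example are hypothetical, not from the patent):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.6, P(A) = 0.5, P(B) = 0.4
p_a_given_b = bayes(0.6, 0.5, 0.4)  # ~0.75
```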
Existing improvements to the naive Bayes algorithm mainly include the following. First, for text data, a naive Bayes classifier improved with frequent item sets in combination with the Apriori algorithm: the classification effect is excellent, but the range of use is narrow, being suitable only for text data. Second, classifiers that weight the attributes according to their influence on the result: these can improve the classification accuracy to some extent, but used alone still do not achieve an excellent effect. Third, improvements to the basic formula of the algorithm: these raise the accuracy without other consequences, but the scope of the improvement is narrow and not conducive to further research.
Disclosure of Invention
The invention aims to solve the above problems by providing an improved Bayesian classification method based on association rules and applying it to a personal income judgment and classification method. The method uses association rules to improve accuracy, is applicable to data sets other than text data, and thus solves the problem that association-rule-improved Bayes algorithms are suitable only for text classification. In the method, first, a class-conditional probability estimation method for continuous data is provided for determining the class-conditional probability of each class. Second, Laplace calibration is introduced on the segmented data set, avoiding the zero-probability problem. Third, an improved association rule algorithm is provided, which measures the degree of association between two attributes from the classification results on the training set. Finally, the concept of attribute weighting is introduced, solving the problem that all attributes of the naive Bayes algorithm influence the result equally.
To solve the above technical problem, the technical solution adopted by the invention is an improved naive Bayes algorithm based on a continuous data set, comprising the following steps:
Step (1): acquire a data set for discriminating residents' income level, whose attribute variables comprise age, work category, education level, gender, work place, and other information for classifying residents' income level;
the collected attribute variables comprise continuous attributes and discrete attributes;
Step (2): quantize the discrete character-type (text) attributes in the data set acquired in step (1):
select an attribute from the discrete character-type attributes; elements with the same text are represented by the same number, and elements with different text contents must not be represented by the same number.
For example, select the work-category attribute, which contains text contents such as doctor, teacher, and lawyer. The quantization replaces every element of the work-category attribute whose content is doctor with 1, teacher with 2, and lawyer with 3; a situation in which both teacher and lawyer are replaced with 1 must not occur.
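The quantization of step (2) can be sketched as follows (the function name and the order of code assignment are my own; the patent requires only that identical text receives the same number and different text receives different numbers):

```python
def quantize_categorical(values):
    """Map each distinct string in a categorical column to a distinct integer.

    Codes start at 1 and are assigned in order of first appearance, so equal
    strings always share a code and different strings never do.
    """
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1  # next unused integer
        encoded.append(codes[v])
    return encoded, codes

encoded, codes = quantize_categorical(["doctor", "teacher", "doctor", "lawyer"])
# doctor -> 1, teacher -> 2, lawyer -> 3
```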
Step (3), discretizing the continuous attribute in the data set acquired in the step (1):
3.1 selecting a continuous type attribute A from the data set;
3.2 According to the classification results C_1, C_2, …, C_n present in the data set, record the sets of elements in attribute A belonging to the different categories as A_c1, A_c2, …, A_cn;
3.3 Calculate the means μ_1, μ_2, …, μ_n and variances σ_1², σ_2², …, σ_n² of A_c1, A_c2, …, A_cn;
3.4 Using the Gaussian density formula, calculate the crossing point x_i of each pair of adjacent class element sets A_ci and A_c(i+1), denoted q_1, q_2, …, q_(n-1). The crossing point is the solution of:

(1/(√(2π)·σ_i))·exp(−(x−μ_i)²/(2σ_i²)) = (1/(√(2π)·σ_(i+1)))·exp(−(x−μ_(i+1))²/(2σ_(i+1)²))

where 1 ≤ i ≤ n−1;
3.5 Arrange the crossing points q_1, q_2, …, q_(n-1) in ascending order and, taking them as division points, partition all elements of attribute A into the element sets A_1, A_2, …, A_n.
3.6 Replace all elements of the same category from 3.5 with one constant, while elements belonging to different categories must be replaced with different constants (e.g. elements in A_1 are replaced by the number 1, elements in A_2 by 2, …; A_1 and A_2 cannot both use the number 1).
3.7 discretizing other continuous attributes by adopting the steps 3.1-3.6, and sorting and merging until all the continuous attributes are processed;
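Steps 3.3–3.4 reduce to finding where two Gaussian densities are equal. A sketch under the assumption that equating the log-densities and solving the resulting quadratic is an acceptable route (the function name is my own):

```python
import math

def gaussian_crossing(mu1, s1, mu2, s2):
    """Crossing point(s) x where the N(mu1, s1^2) and N(mu2, s2^2) densities meet.

    s1, s2 are standard deviations. Equating the two log-densities gives a
    quadratic a*x^2 + b*x + c = 0 in x.
    """
    if math.isclose(s1, s2):
        # Equal variances: a single crossing at the midpoint of the means.
        return [(mu1 + mu2) / 2.0]
    a = 1.0 / (2 * s2 ** 2) - 1.0 / (2 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2) + math.log(s2 / s1)
    disc = b * b - 4 * a * c  # always positive for two distinct Gaussians
    r1 = (-b + math.sqrt(disc)) / (2 * a)
    r2 = (-b - math.sqrt(disc)) / (2 * a)
    return sorted([r1, r2])
```

With equal class variances the single crossing point lies midway between the two class means, matching the intuition of placing the division point q_i between adjacent classes.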
Step (4): handle the cases in which a class-conditional probability in the data set is 0 after the preliminary processing of steps (2)–(3):
A class-conditional probability is a probability conditioned on a class. In step 4 it refers specifically to the probability that an overall classification result occurs within an attribute category: for example, within the attribute category doctor, the proportion of doctors with annual income above 500,000 among all doctors. The reverse conditionals are also class-conditional probabilities, but are less intuitive.
Laplace calibration is used to solve the problem that a zero count dominates the result: 1 is added to the attribute count behind each class-conditional probability of every attribute, so that a probability of 0 never occurs;
Step (5): obtain the prior probability of each class and the class-conditional probability P(A_i|C_j) of each attribute, where A_i denotes the i-th attribute category of attribute A and C_j denotes the j-th category of the classification result C.
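Steps (4)–(5) can be sketched together — add-one (Laplace) calibration combined with the prior and class-conditional estimates (function names and data layout are my own):

```python
from collections import Counter

def priors(labels):
    """P(C_j): relative frequency of each classification result in the training set."""
    n = len(labels)
    counts = Counter(labels)
    return {c: counts[c] / n for c in counts}

def class_conditional(column, labels, value, label, n_values):
    """P(value | label) for one attribute, with add-one Laplace calibration.

    n_values: number of distinct categories of this attribute. Adding 1 to every
    count (and n_values to the denominator) keeps the probabilities summing to 1
    while avoiding exact zeros.
    """
    pair = sum(1 for v, c in zip(column, labels) if v == value and c == label)
    in_class = sum(1 for c in labels if c == label)
    return (pair + 1) / (in_class + n_values)
```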
Step (6): judge the correlation among the attributes with an improved association rule algorithm, and judge the attributes with a higher degree of association:
6.1 Select attributes with the same number of attribute categories, and under the same overall classification result C_k judge the degree of association of any two attribute categories:

P(A_i|C_k) − P(B_i|C_k),  i ≤ n, k ≤ n;
the attribute category refers to the category within the attribute itself, such as the categories of doctors, lawyers, teachers, etc. in the work category attribute we refer to above.
If all the absolute values of these differences are less than 0.2, the degree of association of attributes A and B under the overall classification result C_k is high, so their association under the other overall classification results must be judged next; if any absolute value is greater than 0.2, the degree of association of the two attributes is not high, and no further judgment is needed.
6.2 If the degree of association of attributes A and B is still high under all overall classification results, randomly select one of the two attributes to retain and delete the other; if under some overall classification result the association of the two attributes is low, both attributes are retained.
The overall classification result mentioned above refers to the categories of the final result in the data set itself. In this text it is personal income, specifically: annual income above 50w, 30–50w, 10–30w, and below 10w (where 1w = 10,000). This classification already exists in the data set.
Briefly, the data set is a table in which each row is a resident, the columns are the attribute variables above, and the last column records the annual income class. [The original shows an example data-set table here as an image.] This is a data set for classification, and the annual income information in the last column is essential.
6.3, judging the relevance of the attributes with the same number of other attribute types according to the operation of the steps 6.1-6.2, deleting and reserving the attributes in the data set according to the result until all the attributes with the same number of attribute types are judged, and updating the data set.
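Steps 6.1–6.3 can be sketched as follows. One simplification to flag: the probabilities are paired index-by-index here, whereas the patent only requires that each class-conditional probability be used at most once; the names are my own:

```python
def highly_associated(cond_a, cond_b, threshold=0.2):
    """Decide whether two attributes with equally many categories are redundant.

    cond_a[k][i] = P(A_i | C_k) and cond_b[k][i] = P(B_i | C_k).
    The pair counts as highly associated (so one attribute may be deleted) only
    if, under every classification result C_k, every difference
    |P(A_i|C_k) - P(B_i|C_k)| is below the threshold.
    """
    for row_a, row_b in zip(cond_a, cond_b):
        if any(abs(pa - pb) >= threshold for pa, pb in zip(row_a, row_b)):
            return False  # low association somewhere: keep both attributes
    return True
```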
Step (7): change the weight of each attribute by attribute weighting, thereby improving accuracy.
7.1 Find the maximum class-conditional probability of attribute A under each overall classification result, denoted P(A_i|C_1), P(A_j|C_2), …, P(A_k|C_n). If the same attribute category of A appears repeatedly among these maxima, the degree of association between the attribute categories of A and the overall classification results is low, A is not considered a good attribute, and A is deleted; if the attribute categories are all different, the degree of association is high, A is considered a good attribute and retained, and step 7.2 follows;
7.2 From the maximum class-conditional probabilities obtained in step 7.1, calculate the average confidence of attribute A, i.e. its degree of association with the overall classification results:

T = ( P(A_i|C_1) + P(A_j|C_2) + … + P(A_k|C_n) ) / n
wherein a larger value of T indicates a higher degree of association.
7.3 Obtain the confidence of the other attributes in the same way as the average confidence of attribute A in step 7.2, and compute the power coefficient α = 1 − T of each attribute. The formula after attribute weighting is

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)

where α_k is the power coefficient of the k-th attribute; the result is the C_i corresponding to the maximum value of the numerator of the weighted Bayes formula.
7.4 according to the steps 7.1-7.3, the relevance judgment of other attributes and the total classification result is carried out, and deletion or weighting operation is carried out according to the relevance judgment.
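Steps 7.1–7.3 in sketch form (function name and data layout are my own):

```python
def attribute_weight(max_cond_probs):
    """Return (keep, alpha) for one attribute.

    max_cond_probs: for each classification result, the (attribute_category,
    probability) pair attaining the maximum class-conditional probability.
    The attribute is deleted (keep=False) when the same attribute category
    attains the maximum under more than one classification result; otherwise
    the average confidence T is computed and the power coefficient is
    alpha = 1 - T.
    """
    categories = [cat for cat, _ in max_cond_probs]
    if len(set(categories)) < len(categories):
        return False, None  # repeated category: not a good attribute, delete it
    t = sum(p for _, p in max_cond_probs) / len(max_cond_probs)
    return True, 1.0 - t
```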
Step (8): the classification judgment process.
The multi-attribute Bayes basic formula is:

P(C_i|X) = ( P(X|C_i) · P(C_i) ) / P(X),  where P(X|C_i) = P(A_x|C_i) × P(B_x|C_i) × … × P(X_x|C_i)
As shown in step (7),

P(X|C_i) = ∏_k P(X_k|C_i)^(α_k)

then there is

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)
The C_i attaining the maximum calculated value is the class of the element.
All elements are judged by the above steps; the classification result corresponding to the maximum C_i is the required classification result.
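The decision rule of step (8) can be sketched as follows — choose the C_i that maximises the weighted numerator P(C_i)·∏_k P(x_k|C_i)^(α_k) (the function and data layout are my own):

```python
def classify(x, prior, cond, alphas):
    """Return the class C maximising P(C) * prod_k P(x_k | C) ** alpha_k.

    x:      discretised attribute values of one element
    prior:  {class: P(C)}
    cond:   {class: [ {attribute_value: P(value | C)} for each attribute ]}
    alphas: power coefficient of each attribute from the weighting step
    """
    best_class, best_score = None, float("-inf")
    for c, p_c in prior.items():
        score = p_c
        for k, v in enumerate(x):
            score *= cond[c][k].get(v, 0.0) ** alphas[k]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

P(X) is omitted because it is constant across classes, exactly as the description argues below.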
The invention has the beneficial effects that:
Compared with the prior art, the method can be applied to data sets other than text, including discrete and continuous data sets, which greatly enlarges its range of application; it also improves accuracy to a certain extent, and the improvement is especially obvious when there are many attribute categories.
Drawings
FIG. 1 is a block diagram of an improved naive Bayes algorithm based on a continuum data set;
fig. 2 is a flow chart of data segmentation using gaussian distributions.
Detailed Description
The present invention is further analyzed with reference to the following specific examples.
Example 1: as shown in fig. 1:
1) Analyse the data set. The data set is extracted from a census database and can be used to discriminate residents' income level. The attribute variables include important information such as age, work category, education level, occupation, and ethnicity; they comprise continuous attributes and discrete attributes, making this a very representative comprehensive data set.
2) For convenience of processing, quantize the discrete character-type attributes in the data set, such as occupation, ethnicity, and work category. The specific method is to convert all non-numeric information in the data set into numbers, with the same non-numeric value always converted to the same number; for example, in the work attribute, all attribute values of doctor are converted to the number 1, all values of programmer to the number 2, all values of teacher to the number 3, and so on. The same numbers may be reused across different attributes — both the work-category and occupation attributes, for instance, may use the numbers 1, 2, 3, … in place of their original values. Within one attribute, however, different attribute values must receive different numbers: in a work attribute, doctor and programmer cannot both be replaced by the number 1, or serious classification errors will occur.
The purpose of the above operation is to facilitate the calculation of class-conditional probabilities in the next step.
3) Discretize the continuous attribute data, as shown in fig. 2.
The specific process is as follows:
1. A continuous-type attribute A is selected from the data set.
2. The classification results of the data set are C_1, C_2, …, C_n. Collect the elements of attribute A belonging to the different categories and name the sets A_c1, A_c2, …, A_cn respectively.
3. Calculate the means μ_1, μ_2, …, μ_n and variances σ_1², σ_2², …, σ_n² of A_c1, A_c2, …, A_cn.
4. Using the Gaussian density formula, calculate the crossing points of each pair of adjacent categories, i.e. of A_c1 and A_c2, A_c2 and A_c3, …, A_c(n-1) and A_cn, and name them q_1, q_2, …, q_(n-1) respectively. Each crossing point is the solution x of:

(1/(√(2π)·σ_i))·exp(−(x−μ_i)²/(2σ_i²)) = (1/(√(2π)·σ_(i+1)))·exp(−(x−μ_(i+1))²/(2σ_(i+1)²))

The value of x thus obtained is the position of the crossing point.
5. Arrange the crossing points q_1, q_2, …, q_(n-1) in ascending order and use them as division points to classify all values of attribute A; the classes are A_1, A_2, …, A_n respectively.
6. Replace the attribute values of the same class with one constant (e.g. values in A_1 are replaced by the number 1, values in A_2 by 2, …).
7. Process the other continuous attributes by the same method until all continuous attributes are handled.
8. Sort and merge to obtain the preliminarily processed data set.
4) In the data set obtained by the preliminary processing, some attribute categories (such as A_i, B_n, etc.) contain no element belonging to some classification result, which greatly affects the accuracy of the classification result. For example, if in A_i the number of elements of class C_k is 0, i.e. P(A_i|C_k) = 0, then in the multi-attribute Bayes formula

P(C_i|X) = ( P(X|C_i) · P(C_i) ) / P(X)

where

P(X|C_i) = P(A_x|C_i) × P(B_x|C_i) × … × P(X_x|C_i)

the factor P(A_x|C_k) = 0 forces P(C_k|X) to equal 0 as well, so that P(B_x|C_k) and the classification effect of the other attributes is masked, and some elements are misclassified to a certain extent.
To solve this problem, Laplace calibration is introduced: 1 is added to the number of elements of each classification result class in every attribute category, whether the count is 0 or not. For example, in the attribute category A_i, 1 is added to the number of elements belonging to each of the classification categories C_1, C_2, …, C_n; 0 is thereby avoided without distorting the data.
5) Solve for the prior probability and class-conditional probability of each class.
6) Judge the correlation between attributes with the improved association rule algorithm, and judge the attributes with a higher degree of association. The specific process is shown in the following example:
the data set herein is in attribute A, with A1,A2,A3,A4,A5Five categories, among attributes B, there is B1,B2,B3Three categories, among the attributes D, there is D1,D2,D3Three categories, and the overall classification result is classified as C1,C2,C3Three categories.
Attributes A and B have different numbers of categories, so their correlation is necessarily relatively low and they do not need to participate in the judgment.
Attributes B and D have the same number of categories and thus meet the basic condition for participating in the judgment. The judgment process is as follows:
1. Calculate the 9 class-conditional probabilities P(B_1|C_1), P(B_2|C_1), P(B_3|C_1), P(B_1|C_2), …, P(B_3|C_3).
2. Calculate the 9 class-conditional probabilities P(D_1|C_1), P(D_2|C_1), P(D_3|C_1), P(D_1|C_2), …, P(D_3|C_3).
3. First judge the degree of association under category C_1: subtract P(B_1|C_1), P(B_2|C_1), P(B_3|C_1) from P(D_1|C_1), P(D_2|C_1), P(D_3|C_1) respectively, without repetition (i.e. every class-conditional probability belonging to class C_1, whether from attribute B or attribute D, appears in only one subtraction), and check whether the absolute values of all three differences are less than 0.2. If not, the two attributes are directly shown to have a low degree of association, and no further judgment is needed; if all three are less than 0.2, the association between attributes B and D under classification result C_1 is still relatively high, and the judgment must continue for C_2 and C_3.
4. Using the method of step 3, judge the association between attributes B and D under categories C_2 and C_3 respectively. If under C_2 and C_3 all three differences are also less than 0.2 in absolute value, the degree of association between attributes B and D is high under every overall classification result, and one of the two attributes can be selected for retention while the other is deleted. If this does not occur, there is some part where the association between B and D is low, and both attributes are retained.
5. Perform the above relevance judgment on the other attributes with the same number of categories, delete and retain attributes in the data set according to the result, and update the data set until all attributes with the same number of categories have been judged.
7) Change the weight of each attribute by attribute weighting to improve accuracy. The above-mentioned attribute A and classification result C are used for the description; the specific process is as follows:
1. Attribute A is known to have the five categories A_1, A_2, A_3, A_4, A_5; the overall classification result C has the three categories C_1, C_2, C_3.
2. For the three classes, find the maximum class-conditional probability of each, i.e. P(A_i|C_1), P(A_j|C_2), P(A_k|C_3), the maxima of the three groups of class-conditional probabilities. The values of i, j, and k should all be different; if a repetition occurs, the connection between the attribute categories of A and the classification effect is low, the attribute is not a good one, and it is deleted.
3. Let

T = ( P(A_i|C_1) + P(A_j|C_2) + P(A_k|C_3) ) / 3

where T represents the average confidence of attribute A, i.e. its degree of association with the classification result. The larger the value of T, the higher the degree of association.
4. Let α = 1 − T; the formula after attribute weighting is then

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)
5. Judge the relevance of the other attributes to the classification result, and delete or weight them accordingly.
8) The classification judgment process.
The Bayesian basic formula of the multi-attribute is as follows:
P(C_i|X) = ( P(X|C_i) · P(C_i) ) / P(X)

Its main function is to judge, once the specific values of all the attributes of an element (i.e. X) are determined, the probability that the element belongs to the classification result C_i.
Its specific function here is to judge the probabilities that an element belongs to each of the three classes C_1, C_2, C_3; the class with the largest value is the class to which the element most probably belongs.
In the above formula, since X is fixed for a given element, P(X) is a constant, so only the value of P(C_i) × P(X|C_i) needs to be compared.
In step 7) above, attribute weighting was adopted, so in this step it can be added to the Bayes basic formula, as shown in step 7):

P(X|C_i) = ∏_k P(X_k|C_i)^(α_k)

then there is

P(C_i|X) = ( P(C_i) · ∏_k P(X_k|C_i)^(α_k) ) / P(X)
The C_i attaining the maximum calculated value is the class of the element.
All elements are judged by the above steps to obtain the final classification result.

Claims (1)

1. A commodity recommendation method that classifies the income level of users, classifies the users according to income level, and recommends different commodities to users of different classes, the method for classifying the income level of the users being characterized by comprising the following steps:
Step (1): acquire a data set for discriminating residents' income level, the attribute variables of which are information for classifying residents' income level, the attribute variables comprising age, work category, education level, gender, and work place;
the attribute variables comprise continuous attributes and discrete character-type attributes;
Step (2): quantize the discrete character-type attributes acquired in step (1):
select an attribute from the discrete character-type attributes; elements with the same text are represented by the same number, and elements with different text contents are represented by different numbers;
step (3), discretizing the continuous attribute in the data set acquired in the step (1):
3.1 selecting all elements of the continuous type attribute A from the data set;
3.2 According to the classification results C_1, C_2, …, C_n existing in the data set, record the sets of elements belonging to the different categories among all elements of the continuous attribute A as A_c1, A_c2, …, A_cn;
3.3 Calculate the means μ_1, μ_2, …, μ_n and variances σ_1², σ_2², …, σ_n² of A_c1, A_c2, …, A_cn;
3.4 Using the Gaussian density formula, calculate the crossing point x_i of each pair of adjacent classified element sets A_ci and A_c(i+1); the crossing point is the solution of:

(1/(√(2π)·σ_i))·exp(−(x−μ_i)²/(2σ_i²)) = (1/(√(2π)·σ_(i+1)))·exp(−(x−μ_(i+1))²/(2σ_(i+1)²))

where 1 ≤ i ≤ n−1;
3.5 Arrange the crossing points in ascending order, classify all the elements of attribute A by taking the ordered crossing points as division points, and reset the element sets A_c1, A_c2, …, A_cn;
3.6 replacing all the elements of the same category processed by the step 3.5 with one constant, wherein different constants are required to replace the elements of different categories;
3.7 discretizing other continuous attributes by adopting the steps 3.1-3.6, and sorting and merging the processed continuous attributes until all the continuous attributes are processed;
step (4), processing the condition that the class condition probability in the data set after the preliminary processing in the steps (2) to (3) is 0;
adding 1 to the attribute quantity value corresponding to the class conditional probability of each attribute by using Laplacian calibration, thereby avoiding the occurrence of 0;
Step (5): obtain the prior probability and the class-conditional probability P(M_i|C_j) of each class of the attributes processed in step (4), where M_i denotes the i-th attribute category of attribute M and C_j denotes the j-th category of the classification result C;
and (6) judging the correlation among the attributes by adopting an improved association rule algorithm, and judging the attribute with higher association degree:
6.1 Select attributes with the same number of attribute categories, and under the same classification result C_k judge the degree of association of any two attribute categories:

P(M_i|C_k) − P(N_i|C_k),  i ≤ n, k ≤ n;
if all the absolute values of these differences are less than 0.2, the degree of association of attributes M and N under the classification result C_k is high, so their association under the other classification results must be judged next; if any absolute value exceeds 0.2, the degree of association of the two attributes is considered not high, and no further judgment is needed;
6.2 if the degree of association of attributes M and N is still high under all the classification results, randomly select one of the two attributes to retain and delete the other; if under some classification result the degree of association of the two attributes is low, both attributes are retained;
6.3, judging the relevance of the attributes with the same number of other attribute types according to the operation of the steps 6.1-6.2, deleting and reserving the attributes in the data set according to the result until all the attributes with the same number of attribute types are judged, and updating the data set;
step (7), changing the weight of each attribute by attribute weighting, thereby improving the classification accuracy;
7.1 find the maximum class conditional probability of attribute M under each classification result, denoted P(M_i|C_1), P(M_j|C_2), …, P(M_k|C_n); if an attribute category of M appears repeatedly among these maxima, the association between the attribute categories of M and the classification results is low, M is not considered a good attribute, and it is deleted; if the attribute categories are all different, the association between the attribute categories of M and the classification results is high, M is considered a good attribute and is retained, and step 7.2 is entered;
7.2 calculate the average confidence T of attribute M, i.e. its degree of association with the classification result, from the maximum class conditional probabilities obtained in step 7.1:

T = [P(M_i|C_1) + P(M_j|C_2) + … + P(M_k|C_n)] / n

where the larger the value of T, the higher the degree of association;
7.3 obtain the confidence of every other retained attribute in the same way as in step 7.2, and compute its power coefficient α = 1 - T; the attribute-weighted formula is then

P(C_i) ∏_j P(x_j|C_i)^(α_j)

where x_j is the value of the j-th attribute and α_j its power coefficient, i.e. after attribute weighting the classification result is the C_i corresponding to the maximum value of the numerator of the Bayesian formula;
7.4 according to steps 7.1-7.3, judge the association of the other attributes with the classification result, and delete or weight them according to the result;
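Steps 7.1-7.3 can be sketched as follows. The published averaging formula for T is only available as an image, so the mean of the per-class maxima is a reconstruction from the surrounding text and should be treated as an assumption:

```python
def confidence_and_weight(cond):
    """Average the per-class maxima of one attribute's class conditional
    probabilities to get its confidence T (step 7.2), then derive the
    power coefficient alpha = 1 - T (step 7.3)."""
    maxima = [max(p.values()) for p in cond.values()]
    t = sum(maxima) / len(maxima)
    return t, 1.0 - t

# toy table: the maxima fall on different categories ("x" under class "A",
# "y" under class "B"), so the attribute passes the step-7.1 check
cond = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.3, "y": 0.7}}
t, alpha = confidence_and_weight(cond)
```

For this table T = (0.8 + 0.7) / 2 = 0.75 and α = 0.25.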
step (8), a classification judgment process;
the multi-attribute Bayesian basic formula is:

P(C_i | x_1, …, x_m) = P(C_i) ∏_(j=1)^m P(x_j|C_i) / P(x_1, …, x_m)

in step 7) the class conditional probabilities were weighted as P(x_j|C_i)^(α_j), and the denominator P(x_1, …, x_m) is the same for every class, so the decision rule becomes

C_result = argmax over C_i of P(C_i) ∏_(j=1)^m P(x_j|C_i)^(α_j)
The C_i achieving this maximum is the desired classification result.
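Putting step (8) together, a minimal decision sketch over the weighted numerator; all tables, class names, and values are toy assumptions:

```python
def classify(sample, prior, conds, alphas):
    """Return the class C_i maximising P(C_i) * prod_j P(x_j|C_i)**alpha_j,
    the numerator of the attribute-weighted Bayesian formula; the evidence
    P(X) is identical for all classes and is therefore dropped."""
    best_cls, best_score = None, -1.0
    for cls, p in prior.items():
        score = p
        for j, xj in enumerate(sample):
            score *= conds[j][cls][xj] ** alphas[j]
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# one attribute, two income classes, power coefficient alpha = 0.5
prior = {"<=50K": 0.5, ">50K": 0.5}
conds = [{"<=50K": {"hs": 0.8, "uni": 0.2}, ">50K": {"hs": 0.3, "uni": 0.7}}]
result = classify(("uni",), prior, conds, alphas=[0.5])
```

A "uni" sample scores 0.5 · 0.7^0.5 ≈ 0.418 for ">50K" against 0.5 · 0.2^0.5 ≈ 0.224 for "<=50K", so ">50K" is returned.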
CN201710323947.XA 2017-05-10 2017-05-10 Personal income classification method based on improved naive Bayes Active CN107169515B (en)


Publications (2)

Publication Number Publication Date
CN107169515A (en) 2017-09-15
CN107169515B (en) 2020-12-15

Family

ID=59813803






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant