CN113657441A - Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening - Google Patents


Info

Publication number
CN113657441A
CN113657441A (application CN202110774460.XA)
Authority
CN
China
Prior art keywords
feature
decision tree
list
pearson correlation
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110774460.XA
Other languages
Chinese (zh)
Inventor
周红芳
安蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110774460.XA priority Critical patent/CN113657441A/en
Publication of CN113657441A publication Critical patent/CN113657441A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 - Selection of the most significant subset of features
    • G06F18/2113 - Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classification algorithm based on the weighted Pearson correlation coefficient combined with feature screening. The original data are first preprocessed, and the IMPROVE_FCBF algorithm is used to screen the features of the preprocessed data set; the feature-screened data are then divided into a training set and a test set by ten-fold cross validation, and a decision tree is constructed on the training set with a decision tree algorithm based on the weighted Pearson correlation coefficient; finally, the constructed decision tree model classifies the test data, and the classification model is evaluated with accuracy, recall, macro F1 value and decision tree construction time as evaluation indexes. On these indexes, the method achieves improvements of varying degrees over other decision tree classification algorithms.

Description

Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening
Technical Field
The invention belongs to the technical field of data mining, and relates to a classification algorithm based on weighted Pearson correlation coefficients and combined with feature screening.
Background
In the mobile internet era, traditional data analysis cannot cope with massive data; new methods are required, and data mining is one of the best tools for the task. Within data mining, the classification problem is particularly important and is widely applied in financial and commercial activities such as telecommunications, banking and retail. Classification proceeds in two steps: first, known sample data are analyzed to obtain a function/model; second, the obtained function/model is used to predict the class of unknown data. Many classification algorithms exist, such as decision tree, genetic, clustering and neural network algorithms. Among them, decision tree classification is one of the most widely used because of its strong interpretability, high speed and high accuracy. Common decision tree classification algorithms include ID3, C4.5, CART and PCC-Tree.
Traditional decision tree classification algorithms work well on small-scale data sets, but because of memory limits, time complexity and data complexity, their time cost on large-scale data sets is high. Increasing the speed of decision tree construction is therefore very important.
Disclosure of Invention
The invention aims to provide a classification algorithm based on the weighted Pearson correlation coefficient combined with feature screening, which effectively improves the classification accuracy of the decision tree model.
The technical scheme adopted by the invention is a classification algorithm based on the weighted Pearson correlation coefficient combined with feature screening, implemented according to the following steps:
step 1, for a data set with a category set C = {c_1, c_2, ..., c_m} containing m categories and a feature set F = {f_1, f_2, ..., f_n} containing n features, preprocess the data set;
step 2, perform feature screening on the preprocessed data set with the IMPROVE_FCBF algorithm;
step 3, divide the feature-screened data set into training data and test data;
step 4, construct a decision tree model on the training set with the decision tree classification method based on the weighted Pearson correlation coefficient;
step 5, test the test data with the established decision tree model, and evaluate the experimental results with accuracy, recall, macro F1 and decision tree construction time as evaluation indexes.
The invention is also characterized in that:
the preprocessing in the step 1 is specifically that firstly, discretization is carried out on continuous characteristic values in a data set by using an equal-width method; then converting the character string type characteristic value into a nominal numerical value type; then, complementing the missing characteristic value by using a mode; and finally converting the character string class values in the data set into a nominal numerical type.
The step 2 is implemented according to the following steps:
step 2.1, initialize S_list as an empty set;
step 2.2, calculate the symmetric uncertainty SU(f_i, C) between each feature f_i (i = 1, ..., n) and class C, and the symmetric uncertainty SU(f_i, f_j) between every two features (i, j = 1, ..., n and i ≠ j); the SU value of two variables X and Y is calculated as:
SU(X, Y) = 2 · I(X, Y) / (H(X) + H(Y))
step 2.3, form the features satisfying SU(f_i, C) > 0 into the S_list subset and sort it from large to small;
step 2.4, cyclically judge whether each feature f_j in the S_list subset is a strongly redundant feature of the dominant feature f_i, and if so, remove it from the S_list subset;
step 2.5, for each feature F_k (k = 1, ..., n) of S_list, cyclically judge whether the Merits value decreases, and reject the feature if it does; stop searching once all feature elements in S_list have been judged or the early-stop criterion is met;
step 2.6, return the final feature subset S_list.
In step 2.5, if the feature elements in S_list have not all been judged and the early-stop criterion has not been met, repeat the following steps:
step 2.5.1, for each feature F_k (k = 1, ..., n), let S_list[k] = F_k and calculate Merits according to the formula below, where k is the number of features, r_cf is the SU(f_i, C) value between feature f_i and class C, and r_ff is the average of the pairwise SU(f_i, f_j) values between features:
Merits_k = (k · r_cf) / √(k + k(k−1) · r_ff)
step 2.5.2, if k > 1 and
Merits_k < Merits_{k−1},
the k-th feature is deleted; otherwise it is added to the final feature subset S_list.
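For illustration, the Merits measure of step 2.5.1 can be coded directly from the r_cf and r_ff terms defined above; a minimal sketch (the helper name is ours, not the patent's):

```python
import math

def merits(k, r_cf, r_ff):
    """Merits of a k-feature subset: average feature-class SU (r_cf)
    traded off against average feature-feature SU (r_ff)."""
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

Adding a feature that is redundant with those already selected raises r_ff and therefore lowers the score, which is exactly the decrease tested in step 2.5.2.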
Step 3 divides the data set with a ten-fold cross-validation method.
Step 4 is specifically implemented according to the following steps:
step 4.1, traverse each feature in the feature-screened subset; assuming S_list now contains n features, calculate the weighted Pearson correlation coefficient between each feature f_i ∈ S_list (i = 1, 2, ..., n) and class C; the weighted Pearson correlation coefficient WPCC of two variables X and Y is calculated as in equation (6) of the definitions below [the formula is an image in the original and is not reproduced];
step 4.2, sort the features from large to small by the calculated WPCC values;
step 4.3, when constructing each layer of the decision tree, select the feature with the largest WPCC value as the splitting node;
step 4.4, iteratively construct the decision tree until the termination condition is reached, completing the decision tree model.
The invention has the beneficial effects that:
1. Compared with four classical decision tree algorithms (ID3, CART, C4.5 and PCC-Tree), the FS-WPCCT algorithm is superior to the comparison algorithms on evaluation indexes such as accuracy, recall and macro F1 value;
2. Compared with the PCC-Tree algorithm and the WPCCT algorithm, the FS-WPCCT algorithm has an obvious advantage in decision tree construction time.
Drawings
FIG. 1 is a flow chart of the classification algorithm based on the weighted Pearson correlation coefficient combined with feature screening according to the present invention;
FIG. 2 compares the accuracy of the FS-WPCCT algorithm, the WPCCT algorithm and the classical PCC-Tree algorithm on 25 data sets;
FIG. 3 is a histogram comparing the average accuracy, recall and macro F1 values of the FS-WPCCT algorithm, the WPCCT algorithm and the classical PCC-Tree algorithm on 25 data sets;
FIG. 4 is a line graph comparing the decision tree construction time of the FS-WPCCT algorithm, the WPCCT algorithm and the classical PCC-Tree algorithm on 25 data sets;
FIG. 5 is a line graph comparing the accuracy of the FS-WPCCT algorithm with other classical decision tree algorithms (ID3, CART, C4.5, PCC-Tree) on 25 data sets;
FIG. 6 is a line graph comparing the recall of the FS-WPCCT algorithm with other classical decision tree algorithms (ID3, CART, C4.5, PCC-Tree) on 25 data sets;
FIG. 7 is a line graph comparing the macro F1 value of the FS-WPCCT algorithm with other classical decision tree algorithms (ID3, CART, C4.5, PCC-Tree) on 25 data sets;
FIG. 8 is a histogram comparing the average accuracy, average recall and average macro F1 value of the FS-WPCCT algorithm with other classical decision tree algorithms (ID3, CART, C4.5, PCC-Tree) on 25 data sets.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The relevant definitions in the present invention are as follows:
definition 1 (mutual information): mutual information describes how much information is contained in one random variable about another random variable. For two random variables X, Y, their corresponding mutual information is defined as formula (1), where H (X) is the entropy of X and H (X | Y) is the conditional entropy.
I(X, Y) = H(X) − H(X|Y)  (1)
Definition 2 (symmetry uncertainty): for two variables X and Y, the symmetry uncertainty formula between them is shown in formula (2), where H (X) is the entropy of X, H (X | Y) is the conditional entropy for X given variable Y, and I (X, Y) is the mutual information of the two variables.
SU(X, Y) = 2 · I(X, Y) / (H(X) + H(Y))  (2)
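For discrete features, the entropy, mutual information and symmetric uncertainty of formulas (1) and (2) can be computed directly from value counts; a small sketch (function names are ours):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) of a discrete sequence, in bits."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X, Y) = H(X) + H(Y) - H(X, Y), equivalent to H(X) - H(X|Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X, Y) / (H(X) + H(Y)); taken as 0 when both entropies vanish."""
    denom = entropy(xs) + entropy(ys)
    return 2 * mutual_information(xs, ys) / denom if denom else 0.0
```

SU is 1 for identical variables and 0 for independent ones, which is what makes the SU(f_i, C) > 0 filter of step 2.3 meaningful.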
Definition 3 (conditional mutual information): the conditional mutual information indicates how much information is between the variable X and the variable Y when the variable Z is introduced. Given the variable Z, the mutual conditional information of the two random variables X and Y can be defined as equation (3). Where p (x, y, z) is the joint distribution probability and p (x | z), p (y | z) and p (x, y | z) are the conditional distribution probabilities.
I(X; Y | Z) = Σ_{x,y,z} p(x, y, z) · log [ p(x, y | z) / ( p(x | z) · p(y | z) ) ]  (3)
Definition 4 (normalized interaction score NPIS): for two features F_i and F_j (i ≠ j), given class C, the normalized interaction score NPIS of F_i and F_j is defined as equation (4).
[Equation (4) is an image in the original and is not reproduced.]
Definition 5 (pearson correlation coefficient): the pearson correlation coefficient between two variables X and Y is calculated as in equation (5), where cov (X, Y) is the covariance between X and Y, var (X) is the variance of X, and var (Y) is the variance of Y.
PCC(X, Y) = cov(X, Y) / √(var(X) · var(Y))  (5)
Definition 6 (weighted pearson correlation coefficient): based on the pearson correlation coefficient, a weighted pearson correlation coefficient between two variables X and Y is calculated as in equation (6), where h (X) is the entropy of X and PCC (X, Y) is the pearson correlation coefficient for the two variables.
[Equation (6) is an image in the original and is not reproduced.]
Definition 7 (accuracy): the proportion of measured values satisfying a given condition among a set of measurements under given experimental conditions; it reflects both the systematic and the random error in the measurement result, i.e. how closely the mean of repeated measurements approaches the true value. The accuracy is calculated as:
Accuracy = number of correctly classified samples / total number of samples  (7)
Definition 8 (precision): in a data set with multiple categories, precision is calculated by treating the samples of one category as positive and the samples of all other categories as negative each time. It is defined as:
Precision = TP / (TP + FP)  (8)
TP: true positive sample number; FP: the number of samples tested as positive, and actually negative.
Definition 9 (recall): in a data set with multiple categories, recall is calculated by treating the samples of one category as positive and the samples of all other categories as negative each time. It is defined as:
Recall = TP / (TP + FN)  (9)
TP: true positive sample number; FN: the number of samples tested as negative, in fact positive.
Definition 10 (F1 value): the F1 value is the harmonic mean of recall and precision, where P denotes precision and R denotes recall. It is defined as:
F1 = 2 · P · R / (P + R)  (10)
Definition 11 (macro F1 value): the F1 value of equation (10) measures binary problems; when the number of classes n is greater than 2, the macro-average F1 is used instead: the n-class problem is treated as n one-vs-rest binary problems, and macro F1 is the average of their n F1 values. It is defined as:
MacroF1 = (1/n) · Σ_{i=1}^{n} F1_i  (11)
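Definitions 8-11 combine into a direct macro-F1 computation; a sketch of the one-vs-rest averaging (the function name is illustrative):

```python
def macro_f1(y_true, y_pred):
    """Average of one-vs-rest F1 values over all true classes (Definitions 8-11)."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        # Per-class counts, treating class c as positive and the rest as negative.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because every class contributes equally regardless of its sample count, macro F1 penalizes poor performance on rare classes, which matters on the imbalanced data sets typical of classification benchmarks.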
The classification algorithm based on the weighted Pearson correlation coefficient and combined with feature screening is specifically implemented according to the following steps as shown in FIG. 1:
step 1, a data set is given, comprising a category set C = {c_1, c_2, ..., c_m} with m categories and a feature set F = {f_1, f_2, ..., f_n} with n features. First, discretize the continuous feature values in the data set with the equal-width method; then convert string-type feature values to the nominal numerical type; then complete missing feature values with the mode; and finally convert the string class values in the data set to the nominal numerical type.
Step 2, perform feature screening on the preprocessed data set with the IMPROVE_FCBF algorithm to obtain the feature set used for constructing the decision tree. The IMPROVE_FCBF algorithm comprises the following specific steps:
step 2.1, initialize SlistIs an empty set.
Step 2.2, calculate each feature fiSymmetry uncertainty SU (f) between (i ═ 1, …, n) and class CiC) value, and between each two featuresSymmetry uncertainty measure SU (f)i,fj) (i, j ≠ j) 1, …, n, and i ≠ j); the formula for calculating the SU values of the two variables X and Y is as follows:
Figure BDA0003154090000000081
step 2.3, will satisfy SU (fi, C)>Feature formation S of 0listSubsets and sorting from large to small;
step 2.4, judging S circularlylistEach feature f in the subsetjWhether or not it is the main feature fiIf the strong redundancy feature is the strong redundancy feature, the strong redundancy feature is selected from SlistRemoving from the subset;
step 2.5, for SlistEach feature F ofkAnd (k is 1, …, n) circularly judging whether the Merits value is reduced or not, and if the Merits value is reduced, rejecting the Merits value. If SlistAnd stopping searching when all the characteristic elements are judged to be finished or meet the early stop criterion. Otherwise, repeating the following steps. The specific steps are as follows:
step 2.5.1, for each feature F_k (k = 1, ..., n), let S_list[k] = F_k and calculate Merits according to the formula below, where k is the number of features, r_cf is the SU(f_i, C) value between feature f_i and class C, and r_ff is the average of the pairwise SU(f_i, f_j) values between features:
Merits_k = (k · r_cf) / √(k + k(k−1) · r_ff)
step 2.5.2, if k > 1 and
Merits_k < Merits_{k−1},
the k-th feature is deleted; otherwise it is added to the final feature subset S_list.
step 2.6, return the final feature subset S_list.
Step 3, divide the feature-screened data set into a training set and a test set with ten-fold cross validation.
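The ten-fold split of step 3 can be sketched as simple index partitioning; this interleaved assignment is one common choice (stratification, which the patent does not specify, is omitted):

```python
def ten_fold_splits(n_samples, n_folds=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation,
    assigning sample i to fold i % n_folds."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        # Every fold serves as the test set exactly once.
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]
```

Each of the ten rounds trains a decision tree on nine folds and evaluates it on the held-out fold; the reported metrics are then averaged over the rounds.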
Step 4, construct a decision tree model on the training set with the decision tree classification method based on the weighted Pearson correlation coefficient. The specific steps are as follows:
step 4.1, traverse each feature in the feature-screened subset; assuming S_list now contains n features, calculate the weighted Pearson correlation coefficient between each feature f_i ∈ S_list (i = 1, 2, ..., n) and class C. The weighted Pearson correlation coefficient WPCC of two variables X and Y is calculated as in equation (6) of the definitions above [the formula is an image in the original and is not reproduced].
and 4.2, sequencing the features from large to small according to the WPCC values obtained by calculation.
And 4.3, when each layer of the decision tree is constructed, selecting the characteristic with the maximum WPCC value as a split node to construct the decision tree each time.
And 4.4, iteratively constructing the decision tree until a decision tree termination condition is reached, and completing construction of the decision tree model.
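Steps 4.1-4.3 reduce to scoring every candidate feature against the class and splitting on the top scorer. Because equation (6) is only available as an image, the sketch below assumes WPCC to be the entropy-weighted Pearson coefficient, H(X) · PCC(X, C), which matches the ingredients named in Definition 6; treat that weighting, and the function names, as our assumptions:

```python
import math
from collections import Counter

def pcc(xs, ys):
    """Pearson correlation coefficient of two numeric sequences (Definition 5)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def entropy(xs):
    """Shannon entropy in bits."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def best_split_feature(features, labels):
    """Score each feature column by an assumed WPCC = H(X) * |PCC(X, C)|
    and return the name of the highest-scoring feature (step 4.3)."""
    scores = {name: entropy(col) * abs(pcc(col, labels))
              for name, col in features.items()}
    return max(scores, key=scores.get)
```

At each tree layer the data would be partitioned on the returned feature's values and the scoring repeated on each partition until the termination condition of step 4.4 is met.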
Step 5, test the test data with the established decision tree model, and evaluate the experimental results with accuracy, recall, macro F1 and decision tree construction time as evaluation indexes.
The pseudo code of the IMPROVE_FCBF algorithm involved in the present invention is shown in Table 1:
Table 1 IMPROVE_FCBF algorithm pseudo code
[The Table 1 pseudo code is an image in the original and is not reproduced.]
The pseudo code of the decision tree classification algorithm based on the weighted Pearson correlation coefficient is shown in Table 2:
TABLE 2 WPCCT Algorithm pseudocode
[The Table 2 pseudo code is an image in the original and is not reproduced.]
Evaluation of the Performance of the present invention:
To verify the effectiveness of the FS-WPCCT decision tree classification algorithm of the invention, it is compared with a decision tree algorithm using only the weighted Pearson correlation coefficient (WPCCT) and with classical decision tree algorithms (ID3, CART, C4.5 and PCC-Tree).
Comparative experiments were run on 25 data sets. As FIG. 2 shows, the average accuracy of the FS-WPCCT algorithm on each data set is superior to the WPCCT and PCC-Tree algorithms in most cases. As FIG. 3 shows, the average accuracy, average recall and average macro F1 value of the FS-WPCCT algorithm are superior to the WPCCT and PCC-Tree algorithms over the 25 data sets. As FIG. 4 shows, FS-WPCCT builds decision trees faster on average than WPCCT and PCC-Tree. As FIGS. 5, 6 and 7 show, FS-WPCCT performs best on most data sets compared with the other classical algorithms in accuracy, recall and macro F1 value. As FIG. 8 shows, FS-WPCCT is superior to the other classical decision tree classification algorithms (ID3, CART, C4.5, PCC-Tree) in average accuracy, average recall and average macro F1 value over the 25 data sets.
Table 4 data set details
[Table 4 is an image in the original and is not reproduced.]

Claims (6)

1. The classification algorithm based on the weighted Pearson correlation coefficient and combined with feature screening is characterized by being implemented according to the following steps:
step 1, for a data set with a category set C = {c_1, c_2, ..., c_m} containing m categories and a feature set F = {f_1, f_2, ..., f_n} containing n features, preprocess the data set;
step 2, perform feature screening on the preprocessed data set with the IMPROVE_FCBF algorithm;
step 3, divide the feature-screened data set into training data and test data;
step 4, construct a decision tree model on the training set with the decision tree classification method based on the weighted Pearson correlation coefficient;
step 5, test the test data with the established decision tree model, and evaluate the experimental results with accuracy, recall, macro F1 and decision tree construction time as evaluation indexes.
2. The classification algorithm based on weighted pearson correlation coefficients and combined with feature screening as claimed in claim 1, wherein the preprocessing in step 1 is specifically to firstly discretize the continuous feature values in the data set by using an equal width method; then converting the character string type characteristic value into a nominal numerical value type; then, complementing the missing characteristic value by using a mode; and finally converting the character string class values in the data set into a nominal numerical type.
3. The classification algorithm based on weighted pearson correlation coefficients combined with feature screening according to claim 1, wherein the step 2 is implemented specifically according to the following steps:
step 2.1, initialize S_list as an empty set;
step 2.2, calculate the symmetric uncertainty SU(f_i, C) between each feature f_i (i = 1, ..., n) and class C, and the symmetric uncertainty SU(f_i, f_j) between every two features (i, j = 1, ..., n and i ≠ j); the SU value of two variables X and Y is calculated as:
SU(X, Y) = 2 · I(X, Y) / (H(X) + H(Y))
step 2.3, form the features satisfying SU(f_i, C) > 0 into the S_list subset and sort it from large to small;
step 2.4, cyclically judge whether each feature f_j in the S_list subset is a strongly redundant feature of the dominant feature f_i, and if so, remove it from the S_list subset;
step 2.5, for each feature F_k (k = 1, ..., n) of S_list, cyclically judge whether the Merits value decreases, and reject the feature if it does; stop searching once all feature elements in S_list have been judged or the early-stop criterion is met;
step 2.6, return the final feature subset S_list.
4. The classification algorithm based on weighted Pearson correlation coefficients combined with feature screening as claimed in claim 3, wherein in step 2.5, if the feature elements in S_list have not all been judged and the early-stop criterion has not been met, the following steps are repeated:
step 2.5.1, for each feature F_k (k = 1, ..., n), let S_list[k] = F_k and calculate Merits according to the formula below, where k is the number of features, r_cf is the SU(f_i, C) value between feature f_i and class C, and r_ff is the average of the pairwise SU(f_i, f_j) values between features:
Merits_k = (k · r_cf) / √(k + k(k−1) · r_ff)
step 2.5.2, if k > 1 and
Merits_k < Merits_{k−1},
the k-th feature is deleted; otherwise it is added to the final feature subset S_list.
5. The classification algorithm based on weighted Pearson correlation coefficients combined with feature screening according to claim 1, wherein in step 3 the data set is divided by a ten-fold cross-validation method.
6. The classification algorithm based on weighted pearson correlation coefficients combined with feature screening according to claim 1, wherein the step 4 is implemented specifically according to the following steps:
step 4.1, traverse each feature in the feature-screened subset; assuming S_list now contains n features, calculate the weighted Pearson correlation coefficient between each feature f_i ∈ S_list (i = 1, 2, ..., n) and class C; the weighted Pearson correlation coefficient WPCC of two variables X and Y is calculated as in equation (6) of the description [the formula is an image in the original and is not reproduced];
step 4.2, sort the features from large to small by the calculated WPCC values;
step 4.3, when constructing each layer of the decision tree, select the feature with the largest WPCC value as the splitting node;
step 4.4, iteratively construct the decision tree until the termination condition is reached, completing the decision tree model.
CN202110774460.XA 2021-07-08 2021-07-08 Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening Pending CN113657441A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110774460.XA CN113657441A (en) 2021-07-08 2021-07-08 Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening


Publications (1)

Publication Number Publication Date
CN113657441A true CN113657441A (en) 2021-11-16

Family

ID=78489271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110774460.XA Pending CN113657441A (en) 2021-07-08 2021-07-08 Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening

Country Status (1)

Country Link
CN (1) CN113657441A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115343676A (en) * 2022-08-19 2022-11-15 黑龙江大学 Feature optimization method for technology for positioning excess inside sealed electronic equipment


Similar Documents

Publication Publication Date Title
CN109657947B (en) Enterprise industry classification-oriented anomaly detection method
CN111882446B (en) Abnormal account detection method based on graph convolution network
CN107391772B (en) Text classification method based on naive Bayes
CN111914090B (en) Method and device for enterprise industry classification identification and characteristic pollutant identification
WO2023279696A1 (en) Service risk customer group identification method, apparatus and device, and storage medium
CN109739844B (en) Data classification method based on attenuation weight
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN107633444A (en) Commending system noise filtering methods based on comentropy and fuzzy C-means clustering
CN100557616C (en) Protein complex recognizing method based on range estimation
CN117478390A (en) Network intrusion detection method based on improved density peak clustering algorithm
Kumar et al. Comparative analysis of SOM neural network with K-means clustering algorithm
Shu et al. Performance assessment of kernel density clustering for gene expression profile data
CN113516189B (en) Website malicious user prediction method based on two-stage random forest algorithm
CN113657441A (en) Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening
CN115481841A (en) Material demand prediction method based on feature extraction and improved random forest
CN117349786A (en) Evidence fusion transformer fault diagnosis method based on data equalization
CN112258235A (en) Method and system for discovering new service of electric power marketing audit
CN115018007A (en) Sensitive data classification method based on improved ID3 decision tree
CN113792141B (en) Feature selection method based on covariance measurement factor
CN114722920A (en) Deep map convolution model phishing account identification method based on map classification
CN115186138A (en) Comparison method and terminal for power distribution network data
CN114519605A (en) Advertisement click fraud detection method, system, server and storage medium
CN113657106A (en) Feature selection method based on normalized word frequency weight
CN112733903A (en) Air quality monitoring and alarming method, system, device and medium based on SVM-RF-DT combination
CN113010673A (en) Vulnerability automatic classification method based on entropy optimization support vector machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination