CN110659682A - Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm - Google Patents
- Publication number: CN110659682A (application number CN201910895521.0A)
- Authority: CN (China)
- Prior art keywords: samples, data, cluster, weight, instances
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/2148 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
- G06F18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention relates to a data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet in the field of data processing, characterized by comprising the following steps: (1) determine a tag correlation matrix; (2) assign sub-cluster weights to adjust the imbalance of samples within each class; (3) predict the label information of all instances in the training set one by one; (4) normalize the data; (5) update the network weights with an Adam optimizer while training a convolutional neural network, using cross entropy as the objective loss function.
Description
Technical Field
The invention relates to the field of data processing and machine learning, in particular to a method for accurately classifying unbalanced data.
Background
Data that computers can process in batches must first be collected and organized by people through acquisition equipment such as sensors. The deep knowledge and rules hidden behind the data must then be extracted by methods such as learning-based analysis and data mining, improving people's ability to perceive and connect with external things. However, data generated in real life are often unbalanced, while accurate data classification has penetrated every aspect of daily life. The urgent problem is that most existing classifier models are designed for balanced data, whereas the data produced in practice usually contain classes in unbalanced proportions; this imbalance typically causes the classification accuracy to drop sharply, degrades the classification effect in severe cases, and may even fail to meet practical or scientific requirements. Existing undersampling methods discard a large number of negative-sample features, so the model cannot fully learn the characteristics of the negative samples and the classification accuracy on them decreases. Oversampling, on the other hand, generates positive samples that are not genuinely collected ones; while it increases the number and diversity of samples, it also introduces sample noise.
Algorithms for the data classification problem also face incomplete data labels, labels that are difficult to obtain, and large data volumes. Classical algorithms such as LP, BR, ECC and ML-KNN all require the labels in the training data to be complete, but as data volumes grow explosively, fully labeled instances are not easy to acquire. Under strong noise and poor sample anti-interference capability, AdaBoost struggles to effectively identify and filter noisy samples. Moreover, when class imbalance exists between minority and majority classes, the uneven sample distribution is not fully considered and the "marginalization" problem easily becomes prominent. At the data level, oversampling or undersampling is currently the main approach, but because of the unbalanced distribution of an unbalanced data set, a single classification algorithm performs poorly and the resulting model has low accuracy. Therefore, to realize accurate data classification, an efficient and accurate classification method must be established that effectively solves the data imbalance problem and improves the classification accuracy, so that problems can be found by staff in time and a better basis is provided for subsequent operations such as prediction.
Disclosure of Invention
In view of the problems in the prior art, the technical problem to be solved by the present invention is to provide a data classification method based on MCWD, KSMOTE-AdaBoost and DenseNet; the specific flow is shown in fig. 1.
The technical scheme comprises the following implementation steps:
(1) determining a tag correlation matrix
where I_c1 denotes the set of instances annotated with label c1, |I_c1| the number of instances annotated with c1, and |I_c1 ∩ I_c2| the number of instances annotated with both c1 and c2. s is a parameter set to avoid, to some extent, extreme cases caused by the label imbalance problem.
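The description names the quantities that enter the correlation matrix but the formula image itself is not reproduced. A minimal sketch of one plausible form — a label co-occurrence count smoothed by the parameter s; the exact expression in the patent may differ:

```python
import numpy as np

def label_correlation(Y, s=1.0):
    """Smoothed label correlation from a binary label matrix.

    Y: (n_instances, n_labels) with Y[i, j] = 1 if instance i carries label j.
    s: smoothing parameter guarding against extreme values under label imbalance.
    Entry (c1, c2) relates |I_c1 ∩ I_c2| to |I_c1| (a hypothetical form,
    not the patent's verbatim formula).
    """
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                       # co[c1, c2] = number of instances carrying both labels
    counts = Y.sum(axis=0)             # counts[c] = |I_c|
    return co / (counts[:, None] + s)  # smoothed per-label ratio
```

For three instances labeled {c1, c2}, {c1} and {c2}, the smoothed correlation between c1 and c2 is 1 / (2 + s).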
(2) Sub-cluster weights w (i) are assigned, adjusting the imbalance of samples within a class:
where c represents the total number of class clusters into which the sample set is partitioned and num(i) indicates the number of samples in the i-th class cluster; that is, the more samples a cluster contains, the lower its weight. A balanced distribution within each class is thus finally realized.
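The inverse relation between cluster size and weight can be sketched as follows; the normalization to a unit sum is an assumption, since the patent's exact formula for w(i) is not reproduced in the text:

```python
import numpy as np

def sub_cluster_weights(cluster_sizes):
    """Assign each cluster a weight inversely related to its size, so that
    larger clusters receive lower weight (step (2); the normalization scheme
    is an illustrative assumption)."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    inv = 1.0 / sizes        # bigger cluster -> smaller raw weight
    return inv / inv.sum()   # weights sum to 1
```

For cluster sizes [10, 40] this yields weights [0.8, 0.2], so the smaller cluster dominates the synthesis budget.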
(3) Predict the label information of all instances in the training set one by one:
In the formula, KNN(I_test) denotes the k nearest neighbors of the test instance I_test, w_ij^(t-1) is the weight of the j-th label of the i-th instance at the previous iteration, t is the number of iterations, Y represents the label matrix, and ŷ represents the predicted label.
In addition, the weight is updated as follows:
where sgn(·) is the sign function, e is a high-confidence threshold in (0.5, 1) and c is a low-confidence threshold in (0, 0.5); in this way the weight is reassigned a value in [-1, 1].
Repeat the above steps: divide the sample set into a certain number of clusters with a clustering algorithm, and from the number of samples to be synthesized and the number of samples in each cluster obtain the weight occupied by each cluster and the number of samples that need to be synthesized in it. The weights obtained in each iteration are reset. When 80% of the label information in the data has been recovered, i.e., the instance labels no longer contain the missing value 0, the loop ends and the next step continues. The algorithm flow is shown in fig. 2.
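The per-cluster synthesis budget of the KSMOTE step can be illustrated as below; `allocate_synthetic` and `smote_interpolate` are hypothetical helper names, and classic SMOTE interpolation between a sample and one of its neighbors is assumed for the synthesis itself:

```python
import numpy as np

def allocate_synthetic(total_new, weights):
    """Split the synthetic-sample budget across clusters in proportion to each
    cluster's weight (rounding down; the remainder goes to the last cluster)."""
    alloc = [int(total_new * w) for w in weights]
    alloc[-1] += total_new - sum(alloc)
    return alloc

def smote_interpolate(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment between a minority
    sample and one of its nearest neighbors."""
    gap = rng.random()  # gap in [0, 1)
    return x + gap * (neighbor - x)
```

With a budget of 10 samples and cluster weights [0.8, 0.2], the sparse cluster receives 8 synthetic samples and the dense one 2.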
(4) Data normalization q_new:
In the formula, q_max and q_min are the maximum and minimum values of the raw data, i.e., q_new = (q − q_min) / (q_max − q_min).
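Assuming the standard min-max rule implied by the description (a uniform mapping onto [0, 1]), the normalization is:

```python
import numpy as np

def min_max_normalize(q):
    """Map raw data uniformly onto [0, 1] via q_new = (q - q_min) / (q_max - q_min)."""
    q = np.asarray(q, dtype=float)
    q_min, q_max = q.min(), q.max()
    return (q - q_min) / (q_max - q_min)
```

For raw values [2, 4, 6] this produces [0, 0.5, 1].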
(5) Network weights are updated using Adam optimizers in training convolutional neural networks, using cross entropy as the objective loss function (loss):
In the formula, g_t represents the gradient at step t and θ_(t-1) the parameters being updated; α defaults to 0.001. The specific network structure is shown in fig. 4.
In the formula, y represents the desired output and ŷ the actual output. The training error fitting process is shown in fig. 3.
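A sketch of the loss and optimizer of step (5) in plain NumPy: mean cross entropy over one-hot targets, and one standard Adam update with the stated default α = 0.001 (β1, β2 and ε are the usual Adam defaults, assumed here):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

def adam_step(theta, g, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: first/second moment estimates, bias correction, step."""
    m = b1 * m + (1 - b1) * g            # biased first moment
    v = b2 * v + (1 - b2) * g ** 2       # biased second moment
    m_hat = m / (1 - b1 ** t)            # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step with gradient 1, the bias-corrected moments are both 1, so the parameter moves by almost exactly α.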
Compared with the prior art, the invention has the advantages that:
(1) The invention overcomes the problem that, when the total classification accuracy is maximized, the classification model is biased toward the majority classes and ignores the minority classes, leaving the minority-class accuracy low; the classification accuracy on unbalanced data can thus be effectively improved.
(2) The MCWD and KSMOTE-AdaBoost methods are applied to data classification and combined with DenseNet, obtaining higher classification accuracy. This shows that the invention achieves a better classification effect when classifying unbalanced data.
Drawings
For a better understanding of the present invention, reference is made to the following further description taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the steps of establishing an unbalanced data classification algorithm based on MCWD-KSMOTE-AdaBoost-DenseNet;
FIG. 2 is a flow chart for establishing an unbalanced data classification algorithm based on MCWD-KSMOTE-AdaBoost-DenseNet;
FIG. 3 is a training error fit graph;
fig. 4 is a diagram of a network update architecture.
Detailed description of the preferred embodiments
The present invention will be described in further detail below with reference to examples.
The data set selected for this embodiment contains four classes and 800 groups of samples in total: 200 groups each of stars, galaxies, quasars and unknown celestial bodies. From each of the four classes, 160 groups are drawn by random sampling as the training set and the remaining 40 groups serve as the test set. Finally, the total number of samples used for training is 640 and the total number used for testing is 160.
The overall flow of the unbalanced data classification algorithm provided by the invention is shown in fig. 1, and the specific steps are as follows:
Take the existing 200 groups of data, of which 50 each are labeled star, galaxy and quasar, and the rest are labeled unknown. Randomly choose 10 groups; if these 10 groups are marked only as stars or only as galaxies, never both, the association between stars and galaxies is estimated to be 0.
where I_c1 denotes the set of instances annotated with label c1, |I_c1| the number of instances annotated with c1, and |I_c1 ∩ I_c2| the number of instances annotated with both c1 and c2. The correlation matrix is obtained using the fully recovered label information of 80% of the sampled data.
(2) Sub-cluster weights w (i) are assigned, adjusting the imbalance of samples within a class:
where c represents the total number of class clusters into which the sample set is partitioned, e.g. 4 classes, and num(i) indicates the number of samples in the i-th class cluster, here 200 groups; that is, the more samples a cluster contains, the lower its weight. A balanced distribution within each class is finally realized.
Divide the sample set into a certain number of clusters with a clustering algorithm, and from the number of samples to be synthesized and the number of samples contained in each cluster obtain the weight occupied by each cluster and the number of samples to synthesize.
(3) Predict the label information of all instances in the training set one by one:
In the formula, KNN(I_test) denotes the k nearest neighbors of the test instance I_test, w_ij^(t-1) is the weight of the j-th label of the i-th instance at the previous iteration, t is the number of iterations, Y represents the label matrix, and ŷ represents the predicted label.
In addition, the weight is updated as follows:
where sgn(·) is the sign function, e is a high-confidence threshold in (0.5, 1) and c is a low-confidence threshold in (0, 0.5); in this way the weight is reassigned a value in [-1, 1].
Repeat the steps above: divide the sample set into a specific number of clusters with a clustering algorithm, and from the number of synthesized samples and the number of samples in each cluster obtain the weight occupied by each cluster and the number of samples to be synthesized. The weights obtained in each iteration are reset. When 80% of the label information in the data has been recovered, i.e., the instance labels no longer contain the missing value 0, the loop ends and the next step continues. The algorithm flow is shown in fig. 2.
(4) Data normalization q_new:
In the formula, q_max and q_min are the maximum and minimum values of the raw data; the data are thus uniformly mapped onto the interval [0, 1], which improves the convergence speed and the accuracy of the model.
(5) Network weights are updated using Adam optimizers in training convolutional neural networks, using cross entropy as the objective loss function (loss):
the neural network is trained in 60 times in total, the initial learning rate is set to be 0.01, and the initial learning rate is reduced by 10 times at 10 th, 30 th and 50 th times respectively. The training process is as shown in figure x, and the loss on the verification set is unstable due to the large learning rate of the first 10 times of training. As the learning rate decreases and the training increases, the loss on both the training set and the test set tends to stabilize and slowly decrease. The verification set loss hardly decreased after 30 training runs. To prevent overfitting, we finally retained the weights trained 35 times as the final model. As shown in fig. 3
In the formula, g_t represents the gradient at step t and θ_(t-1) the parameters being updated; α defaults to 0.001. The specific network structure is shown in fig. 4.
In the formula, y represents the desired output and ŷ the actual output. The training error fitting process is shown in fig. 3.
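The learning-rate schedule described in this embodiment (start at 0.01, divide by 10 at epochs 10, 30 and 50) can be sketched as:

```python
def step_lr(epoch, base_lr=0.01, milestones=(10, 30, 50), factor=0.1):
    """Learning rate at a given epoch under the embodiment's step schedule:
    base_lr, multiplied by `factor` at each milestone epoch already reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```

So epochs 0-9 train at 0.01, epochs 10-29 at 0.001, and from epoch 50 on at 1e-5.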
To verify the accuracy of the present invention in classifying unbalanced data, four classification experiments were performed; the experimental results are shown in table 3. The accuracy of the established method combining MCWD, KSMOTE-AdaBoost and DenseNet in classifying the unbalanced data stays above 92%; the method achieves a high accuracy rate on the basis of guaranteed stability and shows a good classification effect. The established MCWD, KSMOTE-AdaBoost and DenseNet classification method is therefore effective, provides a better way to build an accurate data classification model, and has practical value.
Claims (1)
1. A data classification method based on the MCWD-KSMOTE-AdaBoost-DenseNet algorithm, characterized by comprising the following steps: (1) determining a tag correlation matrix; (2) assigning sub-cluster weights to adjust the imbalance of samples within each class; (3) predicting the label information of all instances in the training set one by one; (4) normalizing the data; (5) updating the network weights with an Adam optimizer when training the convolutional neural network, using cross entropy as the objective loss function; specifically comprising the following five steps:
Step one: determine the tag correlation matrix:
where I_c1 denotes the set of instances annotated with label c1, |I_c1| the number of instances annotated with c1, and |I_c1 ∩ I_c2| the number of instances annotated with both c1 and c2; s is a parameter set to avoid, to some extent, extreme cases caused by the label imbalance problem;
step two: sub-cluster weights w (i) are assigned, adjusting the imbalance of samples within a class:
where c represents the total number of class clusters into which the sample set is partitioned and num(i) represents the number of samples in the i-th class cluster; that is, the more samples a cluster contains, the smaller its weight, finally realizing a balanced distribution within each class;
Step three: predict the label information of all instances in the training set one by one:
In the formula, KNN(I_test) denotes the k nearest neighbors of the test instance I_test, w_ij^(t-1) is the weight of the j-th label of the i-th instance at the previous iteration, t is the number of iterations, Y represents the label matrix, and ŷ represents the predicted label;
in addition, the weight is updated as follows:
where sgn(·) is the sign function, e is a high-confidence threshold in (0.5, 1) and c is a low-confidence threshold in (0, 0.5); in this way the weight is reassigned a value in [-1, 1];
repeating the above steps: dividing the sample set into a specific number of clusters with a clustering algorithm, obtaining from the number of synthesized samples and the number of samples in each cluster the weight occupied by each cluster and the number of samples to be synthesized, and resetting the weights obtained in each iteration; when 80% of the label information in the data has been recovered, i.e., the instance labels no longer contain the missing value '0', the loop ends and the next step continues;
step four: data normalization q_new:
in the formula, q_max and q_min are the maximum and minimum values of the raw data;
step five: updating network weights by using an Adam optimizer when the convolutional neural network is trained, and using cross entropy as a target loss function (loss);
in the formula, g_t represents the gradient at step t and θ_(t-1) the parameters being updated; α defaults to 0.001.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910895521.0A CN110659682A (en) | 2019-09-21 | 2019-09-21 | Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110659682A (en) | 2020-01-07 |
Family
ID=69037566
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910895521.0A Pending CN110659682A (en) | 2019-09-21 | 2019-09-21 | Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110659682A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111666872A (en) * | 2020-06-04 | 2020-09-15 | 电子科技大学 | Efficient behavior identification method under data imbalance |
CN111666872B (en) * | 2020-06-04 | 2022-08-05 | 电子科技大学 | Efficient behavior identification method under data imbalance |
CN112115638A (en) * | 2020-08-28 | 2020-12-22 | 合肥工业大学 | Transformer fault diagnosis method based on improved Adam algorithm optimization neural network |
CN112115638B (en) * | 2020-08-28 | 2023-09-26 | 合肥工业大学 | Transformer fault diagnosis method based on improved Adam algorithm optimization neural network |
CN113030197A (en) * | 2021-03-26 | 2021-06-25 | 哈尔滨工业大学 | Gas sensor drift compensation method |
CN113030197B (en) * | 2021-03-26 | 2022-11-04 | 哈尔滨工业大学 | Gas sensor drift compensation method |
CN113361624A (en) * | 2021-06-22 | 2021-09-07 | 北京邮电大学 | Machine learning-based sensing data quality evaluation method |
CN113408707A (en) * | 2021-07-05 | 2021-09-17 | 哈尔滨理工大学 | Network encryption traffic identification method based on deep learning |
Legal Events
| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | WD01 | Invention patent application deemed withdrawn after publication |

Application publication date: 20200107