CN105787046A - Imbalanced data sorting system based on unilateral dynamic downsampling - Google Patents
Imbalanced data sorting system based on unilateral dynamic downsampling
- Publication number
- CN105787046A CN105787046A CN201610108097.7A CN201610108097A CN105787046A CN 105787046 A CN105787046 A CN 105787046A CN 201610108097 A CN201610108097 A CN 201610108097A CN 105787046 A CN105787046 A CN 105787046A
- Authority
- CN
- China
- Prior art keywords
- samples
- network
- iteration
- gradient
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/086—Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
Abstract
The invention provides an imbalanced data classification system based on unilateral (one-sided) dynamic downsampling. First, the system determines the structure of the network to be used according to the scale of the imbalanced data and randomly initializes the neuron weights of each network layer. Second, it optimizes the network model with a gradient descent method, setting the method's learning rate, momentum ("charge") factor and maximum number of iterations. In the first iteration, all samples are used to compute the total gradient, the total gradient is used for updating the layer weights, and the training samples for the next iteration are selected according to the samples' discrimination distances. The samples selected in the previous iteration are then reused to compute the total gradient, update the layer weights and select the samples for the next iteration, until the maximum number of iterations is reached. Finally, the resulting classification model is used to classify unknown samples. Compared with traditional classification techniques, the system achieves dynamic downsampling of the training samples, avoids loss of data-set information by combining the downsampling process with the classifier's training process, and can effectively handle the classification of imbalanced data.
Description
Technical Field
The invention relates to the field of pattern recognition, in particular to an unbalanced data classification method and system based on unilateral dynamic downsampling.
Background
At present, in the era of data explosion, data volumes have grown from the TB level to the PB or even EB level, and how to mine useful information from such massive data has become very important. Data mining has many research directions, of which classification is one of the important branches. Classification refers to selecting an already-labeled training set from the data, analyzing and learning it with a classification technique to find the rules hidden in the data, and building a classification model that can then predict the classes of unknown test samples. Many mature algorithms exist for the traditional classification problem, such as K-nearest neighbors, decision trees, artificial neural networks, Bayesian classifiers and support vector machines; these algorithms are applied in many fields of data mining and achieve good classification results.
Although these traditional classification algorithms achieve good results, they are mostly built on the premise that the data set is balanced, that is, the numbers of samples of the various classes are roughly equal. In practical applications, however, imbalanced data sets are more common. In a two-class problem, the number of samples in one class is often far larger than in the other; the class with fewer samples is called the positive class (Positive) and the class with more samples the negative class (Negative). For example, in financial fraud detection, most customers' transactions are normal and only very few customers exhibit potentially fraudulent behavior; there may be one fraudulent transaction in 100,000. Imbalanced data sets likewise arise in medical diagnosis, network intrusion detection, anti-spam filtering, oil exploration and other fields. In some of these areas the imbalance is inherent, because the probability of a positive sample occurring is itself low. In others, positive samples require experimental verification while negative samples do not, so negative samples are cheap to obtain and positive samples expensive, and the negative class ends up far outnumbering the positive class in the data set.
Because traditional classification algorithms take maximizing the overall average classification accuracy of the model as the training objective and do not consider the relative distribution of the classes, a traditional classifier applied to imbalanced data often degrades sharply: the learned classifier is biased toward the negative class, and samples that belong to the positive class are frequently misclassified as negative. Such classifiers perform poorly on the positive class, yet practical problems usually require a sufficiently high detection rate on the positive class, because it is generally far more important than the negative class. Taking financial fraud detection again, a conventional classifier easily labels fraudulent behavior as normal, but the loss to the bank when fraud is treated as normal is usually much higher than when normal behavior is mistaken for fraud. In medical diagnosis, misdiagnosing a patient as healthy delays the optimal treatment window, and the resulting loss is hard to estimate. Correctly classifying imbalanced data is therefore an urgent problem, and constructing a classification system that handles the imbalance effectively will bring great economic benefit to industrial production and the economy.
Currently, imbalanced data is often handled at the data level, with methods such as random undersampling, one-sided sample selection and random oversampling. These processing methods, however, are independent of the training algorithm itself: the data set they produce can be fed to many different training algorithms, but the processed sample set then stays fixed throughout training. For a downsampling method this means the removed samples are never used again during the classifier training phase, causing a loss of sample information that degrades classifier performance. To overcome this shortcoming of downsampling, an imbalanced data classification system based on one-sided dynamic downsampling is proposed: during training the system can take all samples into account, and in each iteration it dynamically downsamples the negative class to obtain a balanced training set.
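For contrast with the dynamic scheme described below, static random undersampling of the negative class can be sketched as follows. This is an illustrative sketch, not the patent's code; the function name, the toy labels and the 1/0 label convention are our assumptions.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly drop negative (majority) samples until both classes have
    the same count. The dropped samples are never seen again during
    training, which is exactly the information loss the patent targets."""
    rng = random.Random(seed)
    pos = [s for s, c in zip(samples, labels) if c == 1]
    neg = [s for s, c in zip(samples, labels) if c == 0]
    kept_neg = rng.sample(neg, k=len(pos))  # one-shot, static selection
    balanced = pos + kept_neg
    new_labels = [1] * len(pos) + [0] * len(kept_neg)
    return balanced, new_labels

# 2 positive samples vs. 10 negative samples
X = list(range(12))
y = [1, 1] + [0] * 10
Xb, yb = random_undersample(X, y)
# the balanced set contains all positives and an equal number of negatives
```

Because the selection happens once, before training, every later training iteration sees the same reduced set; the dynamic method below instead reselects negatives after each iteration.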
Disclosure of Invention
Aiming at the problems that existing downsampling-based classification techniques cannot combine downsampling with classifier training when processing imbalanced data and cannot avoid the loss of sample information after downsampling, the invention provides a one-sided dynamic downsampling method that uses the discrimination distance of each sample to downsample the negative class, trains the classification model with a feedforward (no-feedback) neural network, and optimizes the model with a gradient descent method. Combining one-sided dynamic downsampling with the feedforward neural network yields an imbalanced data classification system based on one-sided dynamic downsampling. The system can effectively handle the classification of imbalanced data.
The technical scheme adopted by the invention to solve this problem is as follows: first, the system determines the structure of the network according to the scale of the imbalanced data and randomly initializes the neuron weights of each layer; second, it optimizes the network model with a gradient descent method, setting the method's learning rate, momentum ("charge") factor and maximum number of iterations; in the first iteration all samples are used to compute the total gradient, the total gradient is used to update the layer weights, and the training samples of the next iteration are selected according to the samples' discrimination distances; the samples selected in the previous round are then reused to compute the total gradient, update the layer weights and select the next round's samples, until the maximum number of iterations is reached; finally, the resulting classification model classifies the unknown samples.
The technical scheme can be refined further. The neural network structure is determined manually from prior information about the specific data; an empirical method can be used to choose a suitable structure, such as the number of network layers, the number of neurons in each hidden layer, and the type of node activation function. The gradient descent method follows the gradient direction of the neural network's objective function and uses the negative gradient to minimize that objective, giving the network better classification performance. The one-sided downsampling method uses the discrimination distance to dynamically select negative samples and can effectively balance the numbers of positive and negative samples.
The beneficial effects of the invention are: sampling the negative class by the samples' discrimination distances balances the training set in sample count; combining the one-sided downsampling method with a feedforward neural network yields an imbalanced data classification system in which the negative samples are downsampled dynamically and the downsampling process is merged with classifier training; training the feedforward network model by gradient descent and resampling after every iteration step realizes dynamic downsampling of the training samples; and by combining sample downsampling with model training, the algorithm effectively solves the classification problem of imbalanced data.
Drawings
FIG. 1 is a system framework for an unbalanced data classification system based on dynamic downsampling of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples. The method of the invention is divided into three steps.
The first step is as follows: the network structure and network parameters are initialized.
The system determines the structure of the network according to the scale of the imbalanced data and randomly initializes the neuron weights of each layer. Initialization of the network structure covers the number of nodes in each layer and the type of activation function used at the network nodes; initialization of the network parameters covers the weight of each neuron and the training target of each training sample. Initialization of the network structure and parameters comprises the following steps.
1) Initialize the neural network structure: determine the structure of the neural network (the number of layers and the number of neurons per layer) from the scale of the imbalanced data, namely the sample dimension, the number of samples and the imbalance ratio. The imbalance ratio reflects the degree of imbalance of the data set and is computed as

IR = N⁻ / N⁺,

where N⁻ and N⁺ are the numbers of negative and positive samples. When the imbalance ratio of a data set exceeds 1.5, the data set is called an imbalanced data set. The number of hidden nodes is set by hand based on experience, and each weight of the neural network is randomly initialized between -1 and 1. For a given problem, the network structure may be determined by manual, empirical methods. The activation function at the network nodes is the sigmoid function:

f(x) = 1 / (1 + e^(-x)).
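The two quantities defined in step 1) can be computed directly. A minimal sketch (function names and the toy labels are ours; the IR definition of negatives over positives follows claim 2):

```python
import math

def imbalance_ratio(labels):
    """IR = (# negative samples) / (# positive samples); a data set with
    IR > 1.5 is treated as imbalanced by the method."""
    pos = sum(1 for c in labels if c == 1)
    neg = sum(1 for c in labels if c == 0)
    return neg / pos

def sigmoid(x):
    """Sigmoid activation used at the network nodes: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

labels = [1] * 4 + [0] * 20
ir = imbalance_ratio(labels)   # 20 / 4 = 5.0, well above 1.5
```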
2) Set the network model training parameters: the learning rate η of the gradient descent method is set to 0.1, together with the momentum ("charge") factor α and the maximum number of iterations L. The iteration index l is initialized to 1, and the training sample set S is initialized to all training samples.
The second step: optimize the network model.
The system optimizes the network model with a gradient descent method, setting the method's learning rate, momentum factor and maximum number of iterations. In the first iteration, all samples are used to compute the total gradient, the total gradient is used to update the layer weights, and the training samples of the next iteration are selected according to the samples' discrimination distances; the samples selected in the previous round are then reused to compute the total gradient, update the layer weights and select the next round's samples, until the maximum number of iterations is reached. The network model optimization comprises the following steps.
1) Compute the sum of squared errors of the network:

E = (1/2) Σᵢ (tᵢ − yᵢ)²,

where the sum runs over all N samples and tᵢ and yᵢ are, respectively, the training target of sample xᵢ and the actual output value of the network. The discrimination distance dᵢ of sample xᵢ is the distance between its network output and the theoretical discrimination value (0.5 for a sigmoid output node):

dᵢ = |yᵢ − 0.5|.
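As a hedged sketch of these two quantities (the 0.5 discrimination value, the names and the toy numbers are our assumptions, not the patent's):

```python
def sum_squared_error(targets, outputs):
    """E = 1/2 * sum_i (t_i - y_i)^2 over the training set."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))

def discrimination_distance(output, threshold=0.5):
    """Distance between a sample's network output and the theoretical
    discrimination value (assumed 0.5 for a sigmoid output node)."""
    return abs(output - threshold)

t = [1.0, 0.0, 0.0]          # training targets
y = [0.9, 0.2, 0.6]          # actual network outputs
E = sum_squared_error(t, y)  # 0.5 * (0.01 + 0.04 + 0.36) = 0.205
d = [discrimination_distance(v) for v in y]
```

A small discrimination distance means the output lies near the decision boundary, i.e. the sample is still hard for the current network.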
2) Compute the gradient of the network over the sample set S. Let w_jk denote the weight connecting the jth hidden neuron to the kth output neuron. The required partial derivatives are obtained by the chain rule; for the weights from the hidden layer to the output layer,

∂E/∂w_jk = −Σ_{xᵢ∈S} (tᵢ − yᵢ) · yᵢ (1 − yᵢ) · h_j(xᵢ),

where h_j(xᵢ) is the output value of the jth neuron of the preceding (hidden) layer for sample xᵢ. Since a feedforward (no-feedback) neural network is used and only the weights between the hidden layer and the output layer are updated by gradient descent, the weights between the input layer and the hidden layer need not be updated. The total gradient of the network over the sample set S is therefore

∇E(S) = Σ_{xᵢ∈S} gᵢ,

where gᵢ is the gradient value corresponding to sample xᵢ. After the lth iteration, the momentum ("charge") m(l) of the network is computed as

m(l) = w(l) − w(l−1),

where w(l) and w(l−1) are the corresponding network weights after the lth and (l−1)th iterations.
3) Update the network weights: according to the gradient ∇E(S) and the momentum m(l) obtained above, the network weights w are updated after the lth iteration to

w(l+1) = w(l) − η · ∇E(S) + α · m(l),

where η is the learning rate and α is the momentum factor.
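A minimal sketch of one gradient-descent step with momentum (the patent's "charge") on a plain weight vector; η, α and the toy numbers below are our illustrative choices:

```python
def momentum_step(w, grad, prev_w, eta=0.1, alpha=0.9):
    """One gradient-descent step with momentum:
        charge  = w(l) - w(l-1)
        w(l+1)  = w(l) - eta * grad + alpha * charge
    Returns the new weights; the caller keeps w as the next prev_w."""
    charge = [wl - wp for wl, wp in zip(w, prev_w)]
    return [wl - eta * g + alpha * c for wl, g, c in zip(w, grad, charge)]

w_prev = [0.0, 0.0]          # weights after iteration l-1
w_curr = [0.1, -0.2]         # weights after iteration l
grad = [1.0, -1.0]           # total gradient over the sample set S
w_next = momentum_step(w_curr, grad, w_prev)
# w_next = [0.1 - 0.1 + 0.9*0.1, -0.2 + 0.1 + 0.9*(-0.2)] = [0.09, -0.28]
```

The momentum term reuses the previous weight change, which damps oscillation of plain gradient descent along steep directions of the error surface.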
4) Reselect the training sample set S: for the whole sample set D, cᵢ is the class label of sample xᵢ, where cᵢ = 1 denotes a positive sample and cᵢ = 0 a negative sample, and dᵢ is the discrimination distance of xᵢ. The training samples are reselected according to the following steps:

For each sample xᵢ in D
    If cᵢ = 1
        add xᵢ to the training sample set S
    Else
        If dᵢ satisfies the discrimination-distance selection criterion (e.g. dᵢ ≤ θ for a selection threshold θ)
            add xᵢ to the training sample set S
        End
    End
End
Here the discrimination distance of each sample is the one calculated in step 4.
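The reselection loop can be rendered in Python as follows. This is a hedged sketch: the threshold theta is our placeholder for the patent's selection criterion on the discrimination distance, and all names are ours.

```python
def reselect_training_set(samples, labels, distances, theta):
    """One-sided dynamic reselection: keep every positive sample, and keep
    a negative sample only when its discrimination distance is within
    theta, i.e. it lies near the decision boundary and stays informative."""
    S = []
    for x, c, d in zip(samples, labels, distances):
        if c == 1:
            S.append((x, c))        # all positives survive every round
        elif d <= theta:
            S.append((x, c))        # boundary negatives are re-selected
    return S

samples = ["a", "b", "c", "d"]
labels = [1, 0, 0, 0]               # one positive, three negatives
dists = [0.40, 0.05, 0.30, 0.45]    # discrimination distances
S = reselect_training_set(samples, labels, dists, theta=0.25)
# keeps the positive "a" and the near-boundary negative "b"
```

Because the distances are recomputed from the current network outputs, a negative sample dropped in one iteration can re-enter S later, which is how the method avoids the permanent information loss of static undersampling.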
5) If the number of iterations has not reached the maximum, jump back to step 3 to continue training the network model; otherwise, execute step 8.
The third step: classification prediction of unknown samples.
After the network model has been optimized in the second step, the system can classify unknown samples. The network weights are {W1, b1, W2, b2}, where W1 denotes the weights between the input layer and the hidden layer, W2 the weights between the hidden layer and the output layer, and b1 and b2 the offsets of the hidden-layer neurons and the output-layer neuron, respectively. For an input sample x, the hidden-layer output H of the network is

H = f(W1 · x + b1),

and the output-layer output y is

y = f(W2 · H + b2),

where f is the sigmoid activation function. The prediction category c of the input sample x is then given by the following rule: c = 1 (positive) if y is at least the theoretical discrimination value 0.5, and c = 0 (negative) otherwise.
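The prediction of the third step amounts to one feedforward pass. A self-contained sketch (the tiny 2-2-1 network, its hand-picked weights and the 0.5 threshold are our illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(x, W1, b1, W2, b2):
    """Forward pass of the trained network:
        hidden  H = sigmoid(W1 @ x + b1)
        output  y = sigmoid(W2 @ H + b2)
        class   1 if y >= 0.5 else 0 (theoretical discrimination value)."""
    H = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * h for w, h in zip(W2, H)) + b2)
    return (1 if y >= 0.5 else 0), y

# a tiny 2-input, 2-hidden, 1-output network with hand-picked weights
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [2.0, -2.0]
b2 = 0.0
label, score = predict([1.0, 0.0], W1, b1, W2, b2)
```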
hereinbefore, specific embodiments of the present invention are described with reference to the drawings. It will be understood by those skilled in the art that various changes and substitutions may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention. Such modifications and substitutions are intended to be included within the scope of the present invention as defined by the appended claims.
Results of the experiment
To verify the effectiveness of the proposed method, we compared it experimentally with three algorithms: the original feedforward neural network (comparison method one), a feedforward neural network with random undersampling (comparison method two), and a feedforward neural network with one-sided sample selection (comparison method three). We selected four imbalanced data sets from the KEEL imbalanced-data repository [http://sci2s.ugr.es/keel/imbalanced.php]: Pima, Ecoli3, Pageblocks13Vs4 and Yeast2Vs8. Information on these data sets is shown in Table 1. For each data set, the parameters of the compared algorithms were set as follows.
1) Data set Pima: the compared algorithms use the same network structure and parameters. The network structure has 8 input nodes, 20 hidden nodes and 1 output node, denoted [8-20-1]. The network parameters (the gradient-descent step size, the momentum factor and the maximum number of iterations) are shared by all compared algorithms. All experimental results are the means of 5 rounds of cross-validation.
2) Data set Ecoli3: the compared algorithms use the same network structure and parameters. The network structure is [7-40-1]. The network parameters (the gradient-descent step size, the momentum factor and the maximum number of iterations) are shared by all compared algorithms. All experimental results are the means of 5 rounds of cross-validation.
3) Data set Pageblocks13Vs4: the compared algorithms use the same network structure and parameters. The network structure is [10-35-1]. The network parameters (the gradient-descent step size, the momentum factor and the maximum number of iterations) are shared by all compared algorithms. All experimental results are the means of 5 rounds of cross-validation.
4) Data set Yeast2Vs8: the compared algorithms use the same network structure and parameters. The network structure is [8-40-1]. The network parameters (the gradient-descent step size, the momentum factor and the maximum number of iterations) are shared by all compared algorithms. All experimental results are the means of 5 rounds of cross-validation.
we used AUC to evaluate the performance of the algorithm in unbalanced datasets. The AUC is calculated as follows:
wherein,indicating the proportion of pairs in the positive type samples,indicating the proportion of errors in the negative class samples.Andthe calculation formula is as follows:
wherein,representing the number of paired samples in the positive class;representing the number of backup error samples in the negative class;andrespectively representing the number of positive and negative class samples.
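This single-operating-point AUC can be computed directly from the two confusion-matrix counts. A sketch (variable names and the example counts are ours):

```python
def auc(tp, fp, n_pos, n_neg):
    """AUC = (1 + TPrate - FPrate) / 2, with
       TPrate = TP / N+  (positives classified correctly)
       FPrate = FP / N-  (negatives classified incorrectly)."""
    tp_rate = tp / n_pos
    fp_rate = fp / n_neg
    return (1.0 + tp_rate - fp_rate) / 2.0

# 45 of 50 positives caught, 30 of 300 negatives falsely flagged
score = auc(tp=45, fp=30, n_pos=50, n_neg=300)   # (1 + 0.9 - 0.1) / 2
```

Unlike overall accuracy, this score rewards the positive-class detection rate and penalizes false alarms symmetrically, so a classifier that labels everything negative scores only 0.5.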
The experimental results are shown in Table 2. The proposed method performs best among the compared algorithms on all data sets, which verifies its advantage in handling the imbalance problem and demonstrates its effectiveness.
Table 1: unbalanced data set information
Table 2: AUC value (%) -of comparison algorithm in unbalanced data set
Claims (6)
1. An unbalanced data classification system based on unilateral dynamic downsampling, characterized by the following specific steps:
1) determining the structure of the adopted network by the system according to the scale of the unbalanced data, and randomly initializing the weight of each layer of network neurons;
2) the system optimizes a network model with a gradient descent method, setting the method's learning rate, momentum factor and maximum number of iterations; in the first iteration, all samples are used to compute the total gradient, the total gradient is used to update the layer weights, and the training samples of the next iteration are selected according to the samples' discrimination distances; the samples selected in the previous round are reused to compute the total gradient, update the layer weights and select the next round's samples, until the maximum number of iterations is reached;
3) the obtained classification model is used to classify unknown samples.
2. The system according to claim 1, characterized in that: the scale of the unbalanced data set comprises the number of samples, the imbalance ratio and the sample dimension of the data set; an unbalanced data set is a data set whose imbalance ratio exceeds 1.5, the imbalance ratio being the number of negative samples divided by the number of positive samples; the network structure comprises the number of network layers and the number of neuron nodes per layer; the neuron weights are the weights of the interconnections between the neuron nodes of the layers.
3. The system according to claim 1, characterized in that: the iterative optimization of the network model by the gradient descent method takes the negative gradient of the network objective function and then updates the neuron node weights of each layer according to the obtained negative gradient.
4. The system according to claim 1, characterized in that: "unilateral" means that the negative samples in the training set are downsampled according to the samples' discrimination distances; the discrimination distance is the distance between a sample's network output value and the theoretical discrimination value.
5. The system according to claim 1, characterized in that: the dynamic sample selection comprises screening the negative samples by their discrimination distances after each iteration step and adding all positive samples to the new training set.
6. The system according to claim 1, characterized in that: classifying and predicting unknown samples comprises computing the network output of each unknown sample from the obtained network weights and comparing it with the theoretical discrimination value to determine its class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610108097.7A CN105787046A (en) | 2016-02-28 | 2016-02-28 | Imbalanced data sorting system based on unilateral dynamic downsampling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610108097.7A CN105787046A (en) | 2016-02-28 | 2016-02-28 | Imbalanced data sorting system based on unilateral dynamic downsampling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105787046A true CN105787046A (en) | 2016-07-20 |
Family
ID=56403025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610108097.7A Pending CN105787046A (en) | 2016-02-28 | 2016-02-28 | Imbalanced data sorting system based on unilateral dynamic downsampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105787046A (en) |
-
2016
- 2016-02-28 CN CN201610108097.7A patent/CN105787046A/en active Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11386353B2 (en) | 2016-12-12 | 2022-07-12 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for training classification model, and method and apparatus for classifying data |
WO2018107906A1 (en) * | 2016-12-12 | 2018-06-21 | 腾讯科技(深圳)有限公司 | Classification model training method, and data classification method and device |
WO2019033636A1 (en) * | 2017-08-16 | 2019-02-21 | 哈尔滨工业大学深圳研究生院 | Method of using minimized-loss learning to classify imbalanced samples |
CN107578061A (en) * | 2017-08-16 | 2018-01-12 | 哈尔滨工业大学深圳研究生院 | Based on the imbalanced data classification issue method for minimizing loss study |
CN108460029A (en) * | 2018-04-12 | 2018-08-28 | 苏州大学 | Data reduction method towards neural machine translation |
CN110210570A (en) * | 2019-06-10 | 2019-09-06 | 上海延华大数据科技有限公司 | The more classification methods of diabetic retinopathy image based on deep learning |
CN110717529A (en) * | 2019-09-25 | 2020-01-21 | 南京旷云科技有限公司 | Data sampling method and device |
CN110717529B (en) * | 2019-09-25 | 2022-09-30 | 南京旷云科技有限公司 | Data sampling method and device |
CN110807515A (en) * | 2019-10-30 | 2020-02-18 | 北京百度网讯科技有限公司 | Model generation method and device |
CN110807515B (en) * | 2019-10-30 | 2023-04-28 | 北京百度网讯科技有限公司 | Model generation method and device |
CN111666997A (en) * | 2020-06-01 | 2020-09-15 | 安徽紫薇帝星数字科技有限公司 | Sample balancing method and target organ segmentation model construction method |
CN111666997B (en) * | 2020-06-01 | 2023-10-27 | 安徽紫薇帝星数字科技有限公司 | Sample balancing method and target organ segmentation model construction method |
CN113537511A (en) * | 2021-07-14 | 2021-10-22 | 中国科学技术大学 | Automatic gradient quantization federal learning framework and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20160720 |