CN110569919A

CN110569919A - Skewed data training method based on a generative adversarial network

Info

Publication number: CN110569919A
Application number: CN201910876333.3A
Authority: CN
Inventors: 张吉昕; 秦拯; 黄小凤; 彭鹏; 廖鑫; 翟亚静
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2019-12-13

Abstract

The invention relates to a skewed data training method based on a generative adversarial network. It mainly comprises (1) a sparse data clustering method based on a mixed prototype density clustering algorithm; (2) a sparse data maximum likelihood estimation method based on a generative adversarial network; and (3) a skewed data filling and training method based on maximum likelihood estimation samples. The method solves the data skew of skewed data sets and the insufficient generalization of sparse data features, solves the under-fitting/over-fitting of the model, and improves the accuracy of model fitting.

Description

Skewed data training method based on a generative adversarial network

Technical Field

The invention relates to the field of data mining and machine learning, and in particular to a skewed data training method based on a generative adversarial network.

Background

With the rapid development of artificial intelligence, the technology has been widely applied across industries. Machine learning, one of its typical representatives, has received considerable attention. Machine learning mainly comprises supervised learning, unsupervised learning, reinforcement learning, ensemble learning and the like. Supervised learning trains on labeled data and fits the mapping between the data and the labels; unsupervised learning trains on unlabeled data only and fits the data features by itself; reinforcement learning searches for a decision path based on a Markov decision process; ensemble learning further improves the accuracy of supervised learning by combining several supervised learning methods.

Supervised learning has become one of the most popular machine learning approaches because of its higher accuracy. It mainly comprises traditional classification methods such as support vector machines, decision trees and Bayesian networks, and deep learning methods such as convolutional neural networks, recurrent neural networks and deep belief networks. These methods are widely applied in fields such as security, commerce, finance and intelligent driving.

However, when a supervised learning method is used to train on skewed data, some classes of the skewed data contain only sparse data. The features of this sparse data generalize poorly, so the trained model under-fits or over-fits and its fitting accuracy is insufficient. Conventional methods tend to fill in the sparse data by copying the samples of the sparse class many times. Although this effectively avoids under-fitting, it does not solve the insufficient generalization of the sparse data features, so the trained model still over-fits and its accuracy remains insufficient.

Disclosure of Invention

The invention aims to solve the problems that arise when skewed data is trained with a supervised learning method.

Therefore, the invention provides a skewed data training method based on a generative adversarial network, which mainly comprises three parts:

(1) A sparse data clustering method based on a mixed prototype density clustering algorithm;

(2) A sparse data maximum likelihood estimation method based on a generative adversarial network;

(3) A skewed data filling and training method based on maximum likelihood estimation samples.

The specific contents are as follows:

As shown in the general technical roadmap of FIG. 1, method (1) is used to cluster the sparse data in the skewed data; method (2) is used to generate maximum likelihood estimation samples for each cluster of sparse data; and method (3) is used to fill the skewed data set and to train on the filled data set. The method effectively solves the data skew of skewed data sets and the insufficient generalization of sparse data features, solves the under-fitting/over-fitting of the model, and improves the accuracy of model fitting.

(1) A sparse data clustering method based on a mixed prototype density clustering algorithm.

The N-dimensional vector of each data sample is extracted and represented in two ways, by normalization and by a bag-of-words model, giving a normalized N-dimensional numerical vector and a bag-of-words N-dimensional classification vector. The N-dimensional numerical vector is Min-Max normalized according to the formula $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$, where $x$ is the value of a data sample in a given dimension, $x_{min}$ is the minimum over all data samples in that dimension, and $x_{max}$ is the maximum over all data samples in that dimension. The elements of the N-dimensional classification vector are represented by 0/1: an element is 1 if its value in the vector is not 0, and 0 otherwise.
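The two representations described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `min_max_normalize` and `binarize` are hypothetical, and mapping a constant dimension (where the maximum equals the minimum) to 0.0 is an assumption the text does not address.

```python
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every dimension to [0, 1] with (x - x_min) / (x_max - x_min)."""
    n_dims = len(rows[0])
    mins = [min(r[d] for r in rows) for d in range(n_dims)]
    maxs = [max(r[d] for r in rows) for d in range(n_dims)]
    normalized = []
    for r in rows:
        normalized.append([
            # Assumption: a constant dimension carries no information, so map it to 0.0.
            (r[d] - mins[d]) / (maxs[d] - mins[d]) if maxs[d] > mins[d] else 0.0
            for d in range(n_dims)
        ])
    return normalized


def binarize(rows: Sequence[Sequence[float]]) -> List[List[int]]:
    """Build the 0/1 classification vector: 1 where the element is nonzero."""
    return [[1 if value != 0 else 0 for value in r] for r in rows]


numeric = min_max_normalize([[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]])
categorical = binarize([[0, 3], [2, 0]])
```

Both functions take the whole data set at once because the minimum and maximum are computed per dimension over all samples.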

When the distance between two data samples is calculated, the cosine distance between their N-dimensional numerical vectors and the Jaccard distance between their N-dimensional classification vectors are calculated separately. The cosine distance between the N-dimensional numerical vectors is calculated according to the formula $d_{cos}(X_0, X_1) = 1 - \frac{X_0 \cdot X_1}{\|X_0\| \, \|X_1\|}$, where $X_0$ and $X_1$ are the N-dimensional numerical vectors of the two data samples. The Jaccard distance between the N-dimensional classification vectors is calculated according to the formula $d_{J}(X_0, X_1) = 1 - \frac{|X_0 \cap X_1|}{|X_0 \cup X_1|}$, where $X_0$ and $X_1$ are the N-dimensional classification vectors of the two data samples.
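A minimal sketch of the pairwise distance follows, assuming the standard definitions of cosine distance (one minus cosine similarity) and Jaccard distance (one minus intersection over union of the positions set to 1). The function names and the handling of all-zero vectors are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - (a . b) / (|a| |b|) for two numerical vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # Assumption: a zero vector is treated as maximally dissimilar.
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """1 - |intersection| / |union| over the positions set to 1."""
    intersection = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if union == 0:
        return 0.0  # Assumption: two all-zero vectors are treated as identical.
    return 1.0 - intersection / union


def mixed_distance(num_a: Sequence[float], cat_a: Sequence[int],
                   num_b: Sequence[float], cat_b: Sequence[int]) -> float:
    """Sample distance as the sum of the cosine and Jaccard distances."""
    return cosine_distance(num_a, num_b) + jaccard_distance(cat_a, cat_b)


d = mixed_distance([1.0, 0.0], [1, 1, 0], [0.0, 1.0], [1, 0, 0])
```

Because the two terms are summed without weights, the combined distance ranges from 0 to 2 for non-negative numerical vectors, which matters when choosing the threshold R.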

The clustering method is density based. A data sample is randomly selected as the initial sample to form an initial cluster. Each remaining data sample is compared pairwise with all sample points in all clusters. The distance between the N-dimensional numerical vectors of two samples is the cosine distance, the distance between their N-dimensional classification vectors is the Jaccard distance, and the distance between the two data samples is the sum of the cosine distance and the Jaccard distance. If, within a cluster, the number of samples whose distance to the new sample is less than the threshold R is greater than the count threshold D, the new sample is assigned to that cluster. If that number is less than D, the comparison continues with the samples of the next cluster. If, for every cluster, the number of samples within distance R of the new sample is less than D, the new sample forms a cluster of its own. This process is executed iteratively until every data sample has been assigned to a cluster.
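The assignment rule above can be sketched as a single pass over the samples. This is an illustrative reading of the text, not a definitive implementation: `density_cluster`, `radius` (the threshold R) and `min_neighbors` (the count threshold, D in the detailed description) are hypothetical names, the distance function is passed in by the caller, and a count exactly equal to the threshold is treated as "not enough", a case the text leaves open. The samples are processed in the order given, so the caller is assumed to shuffle them beforehand to obtain the random initial sample.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def density_cluster(samples: Sequence[T],
                    dist: Callable[[T, T], float],
                    radius: float,
                    min_neighbors: int) -> List[List[T]]:
    """Assign each sample to the first cluster that has more than
    `min_neighbors` members closer than `radius`; otherwise open a new cluster."""
    clusters: List[List[T]] = []
    for sample in samples:
        placed = False
        for cluster in clusters:
            close = sum(1 for member in cluster if dist(sample, member) < radius)
            if close > min_neighbors:
                cluster.append(sample)
                placed = True
                break  # Stop at the first cluster that accepts the sample.
        if not placed:
            clusters.append([sample])  # The first sample always lands here.
    return clusters


clusters = density_cluster([0.0, 0.1, 0.2, 5.0, 5.1],
                           lambda a, b: abs(a - b),
                           radius=0.5, min_neighbors=0)
```

Under this literal reading a cluster with at most `min_neighbors` members can never accept a new sample, so the sketch is only useful with a small count threshold; the example uses 0.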

(2) A sparse data maximum likelihood estimation method based on a generative adversarial network.

A generative adversarial network is trained on the sparse data of each cluster, and a plurality of maximum likelihood estimation samples are generated for each cluster. The generative adversarial network consists of a discrimination network and a generation network. The generation network generates maximum likelihood estimation samples from random samples, and the discrimination network distinguishes real samples from generated samples. Through adversarial training, in which the two networks play against each other, a generation model and a discrimination model are obtained at the same time; the fitted samples produced by the generation model are the maximum likelihood estimates of all data samples in the same cluster. Through repeated iterative training, a plurality of maximum likelihood estimation samples can be generated for each cluster.

Both the discrimination network and the generation network are built from the Logistic function $h(W \cdot X) = \frac{1}{1 + e^{-W \cdot X}}$, where $X$ is the N-dimensional numerical vector of a data sample and $W$ is the connection weight. The discrimination network weights are updated by gradient descent and the generation network weights by gradient ascent. The loss function is the cross entropy, $Loss = -y \cdot \log(h(W \cdot X)) - (1 - y) \cdot \log(1 - h(W \cdot X))$, where $y$ is the data sample label. The discrimination network weights are updated according to the formula $W \leftarrow W + \alpha \cdot X \cdot (y - h(W \cdot X))$, where $\alpha$ is the training step size, taken as $\alpha = 1$, and the discrimination network is trained until convergence. The input vector of the generation network is randomly initialized. Based on the chain rule and the residual back-propagated from the discrimination network, the generation network weights are updated according to the formula $W \leftarrow W - \alpha \cdot X \cdot (y - h(W \cdot X))$, and the generation network is trained until convergence. The output of the converged generation network is the maximum likelihood estimation sample.
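The training loop described above might look like the following single-layer sketch. Several points are assumptions rather than statements from the text: all function names, the labels (1 for real samples, 0 for generated ones), the weight initialization range, a fixed number of rounds in place of a convergence test, and the generator step, which is written as the common "ascend log D(G(z))" update derived by the chain rule and is not claimed to match the literal generator formula given above. The discriminator step does follow the stated update W + alpha * X * (y - h(W . X)).

```python
import math
import random
from typing import List, Sequence


def sigmoid(t: float) -> float:
    """Numerically stable Logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def discriminator_step(w: List[float], x: Sequence[float], y: float,
                       alpha: float = 1.0) -> List[float]:
    """W <- W + alpha * X * (y - h(W . X)): gradient descent on the cross entropy."""
    residual = y - sigmoid(dot(w, x))
    return [wi + alpha * xi * residual for wi, xi in zip(w, x)]


def generate(w_g: List[List[float]], z: Sequence[float]) -> List[float]:
    """Fully connected Logistic layer from the M-dim input z to an N-dim sample."""
    return [sigmoid(dot(row, z)) for row in w_g]


def generator_step(w_g: List[List[float]], w_d: Sequence[float],
                   z: Sequence[float], alpha: float = 1.0) -> List[List[float]]:
    """Assumed update: ascend log D(G(z)) using the discriminator's residual."""
    g = generate(w_g, z)
    residual = 1.0 - sigmoid(dot(w_d, g))  # Target label 1 for the generated sample.
    updated = []
    for j, row in enumerate(w_g):
        grad_out = residual * w_d[j] * g[j] * (1.0 - g[j])  # Chain rule through output j.
        updated.append([w + alpha * grad_out * zm for w, zm in zip(row, z)])
    return updated


def train_gan(cluster: Sequence[Sequence[float]], m_dim: int = 4,
              rounds: int = 200, alpha: float = 1.0, seed: int = 0) -> List[float]:
    """Return one generated sample for the given cluster of N-dim vectors."""
    rng = random.Random(seed)
    n_dim = len(cluster[0])
    w_d = [0.0] * n_dim
    w_g = [[rng.uniform(-0.1, 0.1) for _ in range(m_dim)] for _ in range(n_dim)]
    z = [rng.uniform(-1.0, 1.0) for _ in range(m_dim)]  # Random generator input.
    for _ in range(rounds):
        fake = generate(w_g, z)
        for x in cluster:
            w_d = discriminator_step(w_d, x, 1.0, alpha)  # Real samples: label 1.
        w_d = discriminator_step(w_d, fake, 0.0, alpha)   # Generated sample: label 0.
        w_g = generator_step(w_g, w_d, z, alpha)
    return generate(w_g, z)


sample = train_gan([[0.9, 0.1], [0.8, 0.2]], rounds=50)
```

Calling `train_gan` with different seeds gives different random inputs and therefore different generated samples, which is one way to obtain several samples per cluster.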

(3) A skewed data filling and training method based on maximum likelihood estimation samples.

The maximum likelihood estimation samples generated for the different clusters by the generative adversarial network (its generation network part) are filled into the skewed data. Because the generated samples are a generalization of the sparse data and share similar features with it, they expand the sparse data in the skewed data, reduce the data skew, and improve the accuracy of the trained model.

Let $X = \{\{x_{1,1}, x_{1,2}, \ldots, x_{1,l}\}, \{x_{2,1}, x_{2,2}, \ldots, x_{2,m}\}, \ldots, \{x_{k,1}, x_{k,2}, \ldots, x_{k,n}\}\}$ denote the data in class 1, class 2, through class k, where class k is a sparse data class. The samples $\{x_{k,1}, x_{k,2}, \ldots, x_{k,n}\}$ are clustered and the maximum likelihood estimation samples $x_{k,i}, x_{k,i+1}, \ldots, x_{k,i+t}$ are generated. Filling these samples into class k gives the filled data $\{x_{k,1}, x_{k,2}, \ldots, x_{k,n}, x_{k,i}, x_{k,i+1}, \ldots, x_{k,i+t}\}$, which finally replaces the original data $\{x_{k,1}, x_{k,2}, \ldots, x_{k,n}\}$ of class k for training.
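The filling step can be sketched as follows. The dictionary layout (class label mapped to a list of samples) and the names `fill_sparse_class` and `to_training_pairs` are assumptions for illustration; the generated samples are taken as given, and the choice of supervised learner is left to the caller.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Sequence[float]


def fill_sparse_class(dataset: Dict[str, List[Sample]],
                      sparse_label: str,
                      generated: Sequence[Sample]) -> Dict[str, List[Sample]]:
    """Return a copy of the data set in which the sparse class is extended
    with the generated samples; the input data set is left unchanged."""
    filled = {label: list(samples) for label, samples in dataset.items()}
    filled[sparse_label] = filled[sparse_label] + list(generated)
    return filled


def to_training_pairs(filled: Dict[str, List[Sample]]) -> List[Tuple[Sample, str]]:
    """Flatten the filled data set into (sample, label) pairs for a supervised learner."""
    return [(x, label) for label, samples in filled.items() for x in samples]


dataset = {"a": [[0.0], [1.0], [2.0]], "k": [[5.0]]}
filled = fill_sparse_class(dataset, "k", [[5.1], [4.9]])
pairs = to_training_pairs(filled)
```

Here class "k" grows from one sample to three, which narrows the gap to class "a" before training.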

Drawings

FIG. 1 is the general technical roadmap of the present invention.

Detailed Description

The implementation path of the invention is as follows:

Step 1: Randomly select a data sample in the sparse data class as the initial sample to form an initial cluster.

Step 2: Compare each data sample in the sparse data class pairwise with all sample points in all clusters. The distance between the N-dimensional numerical vectors of two samples is the cosine distance, the distance between their N-dimensional classification vectors is the Jaccard distance, and the distance between the two data samples is the sum of the cosine distance and the Jaccard distance.

Step 3: If, within a cluster, the number of samples whose distance to the new sample is less than the threshold R is greater than the threshold D, assign the new sample to that cluster.

Step 4: If, within a cluster, the number of samples whose distance to the new sample is less than the threshold R is less than the threshold D, continue the comparison with the samples of the next cluster.

Step 5: If, for every cluster, the number of samples whose distance to the new sample is less than the threshold R is less than D, take the new sample as a separate cluster. Execute this process iteratively until every data sample has been assigned to a cluster.

Step 6: Initialize a generative adversarial network model for each cluster in the sparse data class.

Step 7: For each generative adversarial network model, randomly initialize an M-dimensional vector as the input of the generation network; each dimension of the M-dimensional vector is fully connected to the N-dimensional output vector of the generation network.

Step 8: For each generative adversarial network model, input the samples of the corresponding cluster into the discrimination network.

Step 9: Iteratively train the discrimination network by gradient descent until convergence. Then, by gradient ascent and using the residual between the computed value and the label value of the discrimination network, correct the generation network weights through chained back propagation until the generation network converges.

Step 10: The vector output by the generation network is the maximum likelihood estimate of the samples in the corresponding cluster. Steps 7-10 are iterated i times, giving i maximum likelihood estimation samples for each cluster.

Step 11: Fill the maximum likelihood estimation samples into the original sparse data class to reduce the data skew caused by the small size of that class, then train with a supervised learning method to obtain a "skew-free" model.

Claims

1. A skewed data training method based on a generative adversarial network, characterized by comprising the following steps:

2. The sparse data clustering method based on the mixed prototype density clustering algorithm according to claim 1, wherein, to address the irregular and discrete projection of the sparse data of the skewed data in N-dimensional space, the mixed prototype density clustering algorithm is used to perform density clustering on the sparse samples in the skewed data, forming a plurality of irregularly shaped clusters.

3. The sparse data maximum likelihood estimation method based on the generative adversarial network according to claim 1, wherein, to address the insufficient feature generalization caused by the small number of training samples of sparse data in the skewed data, a generative adversarial network is trained on the sparse data of each cluster separately, and a plurality of maximum likelihood estimation samples are generated for each cluster.

4. The skewed data filling and training method based on maximum likelihood estimation samples according to claim 1, wherein, to address the under-fitting/over-fitting caused by insufficient sparse data feature generalization when a supervised learning method is used to train on skewed data, the maximum likelihood estimation samples of the sparse data are filled into the skewed data, reducing the data skew and improving the accuracy of training and fitting.