CN110569919A - Skew data training method based on generation countermeasure network - Google Patents
- Publication number
- CN110569919A
- Authority
- CN
- China
- Prior art keywords
- data
- samples
- method based
- sparse
- maximum likelihood
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a skewed data training method based on a generative adversarial network (GAN). It mainly comprises: (1) a sparse data clustering method based on a mixed prototype density clustering algorithm; (2) a sparse data maximum likelihood estimation method based on a generative adversarial network; and (3) a skewed data filling and training method based on maximum likelihood estimation samples. The method addresses the data skew of a skewed data set and the insufficient feature generalization of its sparse data, mitigates under-fitting/over-fitting of the model, and improves the accuracy of model fitting.
Description
Technical Field
The invention relates to the field of data mining and machine learning, and in particular to a skewed data training method based on a generative adversarial network.
Background
With its rapid development, artificial intelligence technology has been widely applied across industries. Machine learning, one of its typical representatives, has received considerable attention. Machine learning mainly comprises supervised learning, unsupervised learning, reinforcement learning, ensemble learning and the like. Supervised learning trains on labeled data and fits the mapping between the data and the labels; unsupervised learning trains on unlabeled data and fits the characteristics of the data itself; reinforcement learning searches for a decision path based on a Markov decision process; ensemble learning further improves the accuracy of supervised learning by combining a plurality of supervised learning methods.
Because of its higher accuracy, supervised learning has become one of the most widely used machine learning approaches. It mainly includes traditional classification methods such as the support vector machine, the decision tree and the Bayesian network, and deep learning methods such as the convolutional neural network, the recurrent neural network and the deep belief network. These methods are widely applied in fields such as security, commerce, finance and intelligent driving.
However, when a supervised learning method is trained on skewed data, some classes of the skewed data are sparse, so the features of the sparse data generalize poorly. The trained model then under-fits or over-fits, and its fitting accuracy is insufficient. Conventional methods tend to fill in the sparse data by copying the data of the sparse class many times. Although this effectively avoids under-fitting, it does not solve the insufficient feature generalization of the sparse data, so the trained model still over-fits and its accuracy remains insufficient.
Disclosure of Invention
The invention aims to solve the problem of training the skew data by a supervised learning method.
Therefore, the invention provides a skewed data training method based on a generative adversarial network, which mainly comprises three parts:
(1) A sparse data clustering method based on a mixed prototype density clustering algorithm;
(2) A sparse data maximum likelihood estimation method based on a generative adversarial network;
(3) A skewed data filling and training method based on maximum likelihood estimation samples.
The specific contents are as follows:
As shown in the general technical roadmap of FIG. 1, the sparse data in the skewed data is clustered using method (1); maximum likelihood estimation samples are generated for each cluster of sparse data using method (2); and the skewed data set is filled and the filled data set is trained using method (3). In this way, the data skew of the skewed data set and the insufficient feature generalization of the sparse data are addressed, under-fitting/over-fitting of the model is mitigated, and the accuracy of model fitting is improved.
(1) A sparse data clustering method based on a mixed prototype density clustering algorithm.
An N-dimensional vector is extracted from each data sample and represented in two ways, by normalization and by a bag-of-words model, yielding a normalized N-dimensional numerical vector and an N-dimensional classification vector, respectively. The numerical vector is normalized by Min-Max scaling according to the formula $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$, where $x$ is the value of a data sample in a given dimension, $x_{min}$ is the minimum over all data samples in that dimension, and $x_{max}$ is the maximum over all data samples in that dimension. The elements of the classification vector are encoded as 0/1: an element is 1 if its value is non-zero and 0 otherwise.
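The two representations described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `min_max_normalize` and `binarize` are hypothetical, the scaling uses the standard Min-Max form (x - x_min) / (x_max - x_min), and the handling of a constant column is an assumption the source does not cover.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column (dimension) to [0, 1] with Min-Max normalization."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)          # per-dimension minimum over all samples
    x_max = X.max(axis=0)          # per-dimension maximum over all samples
    span = x_max - x_min
    # Assumption: a constant column (span == 0) is mapped to 0
    # so that no division by zero occurs.
    safe_span = np.where(span == 0, 1.0, span)
    return (X - x_min) / safe_span

def binarize(C):
    """Encode a classification vector as 0/1: 1 where non-zero, else 0."""
    return (np.asarray(C) != 0).astype(int)
```

Each row is one data sample and each column one dimension, so the minimum and maximum are taken over all samples in a dimension, as the text requires.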
When the distance between two data samples is calculated, the cosine distance between their N-dimensional numerical vectors and the Jaccard distance between their N-dimensional classification vectors are calculated separately. The cosine distance follows the standard formula $d_{cos}(X_0, X_1) = 1 - \frac{X_0 \cdot X_1}{\|X_0\| \, \|X_1\|}$, where $X_0$ and $X_1$ denote the N-dimensional numerical vectors of the two data samples. The Jaccard distance follows the standard formula $d_{J}(X_0, X_1) = 1 - \frac{|X_0 \cap X_1|}{|X_0 \cup X_1|}$, where $X_0$ and $X_1$ denote the N-dimensional classification vectors of the two data samples.
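A short sketch of the two distances may help. It assumes the standard definitions (cosine distance as one minus cosine similarity, Jaccard distance as one minus intersection over union of the non-zero positions); the function names and the conventions for all-zero vectors are illustrative assumptions, not taken from the source. The combined distance is the sum of the two, as the clustering description states.

```python
import numpy as np

def cosine_distance(x0, x1):
    """1 - cosine similarity between two numerical vectors."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    denom = np.linalg.norm(x0) * np.linalg.norm(x1)
    if denom == 0:
        return 1.0  # assumption: a zero vector is maximally distant
    return 1.0 - float(np.dot(x0, x1) / denom)

def jaccard_distance(c0, c1):
    """1 - |intersection| / |union| over the non-zero positions of 0/1 vectors."""
    a = np.asarray(c0) != 0
    b = np.asarray(c1) != 0
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return 1.0 - float(np.logical_and(a, b).sum() / union)

def mixed_distance(num0, cat0, num1, cat1):
    """Sum of the cosine distance and the Jaccard distance of two samples."""
    return cosine_distance(num0, num1) + jaccard_distance(cat0, cat1)
```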
Clustering is density-based. A data sample is selected at random as the initial sample and forms the initial cluster. Each further data sample is compared pairwise with all sample points in all existing clusters. The distance between the N-dimensional numerical vectors of two samples is the cosine distance, the distance between their N-dimensional classification vectors is the Jaccard distance, and the distance between the two samples is the sum of the cosine distance and the Jaccard distance. If, within one cluster, the number of samples whose distance to the new sample is less than the threshold R is greater than the count threshold D, the new sample is added to that cluster. If that number is less than D, the comparison continues with the samples in the next cluster. If, for every cluster, the number of samples whose distance to the new sample is less than R is less than D, the new sample forms a cluster of its own. The process is executed iteratively until every data sample has been assigned to a cluster.
(2) A sparse data maximum likelihood estimation method based on a generative adversarial network.
The sparse data in each cluster is trained with a generative adversarial network, and several maximum likelihood estimation samples are generated for each cluster. The generative adversarial network consists of a discrimination network and a generation network: the generation network generates maximum likelihood estimation samples from random samples, and the discrimination network discriminates between real samples and generated samples. Through adversarial training, in which the two networks play against each other, a generation model and a discrimination model are obtained simultaneously, and the fitted samples produced by the generation model are maximum likelihood estimates of all data samples in the same cluster. Through repeated iterative training, several maximum likelihood estimation samples can be generated for each cluster.
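The assignment rule can be sketched as below. This is a hedged illustration under stated assumptions: `density_cluster`, `radius` (the distance threshold R) and `min_count` (the count threshold) are hypothetical names, the distance function is passed in by the caller, and samples are processed in list order, so the caller would shuffle the list beforehand to mimic the random choice of the initial sample.

```python
def density_cluster(samples, dist, radius, min_count):
    """Assign each sample to the first cluster that holds more than
    `min_count` members closer than `radius`; otherwise open a new cluster.
    Returns a list of clusters, each a list of sample indices."""
    clusters = []
    for idx, sample in enumerate(samples):
        placed = False
        for cluster in clusters:
            # Count members of this cluster within the distance threshold.
            near = sum(1 for j in cluster if dist(sample, samples[j]) < radius)
            if near > min_count:
                cluster.append(idx)
                placed = True
                break
        if not placed:
            clusters.append([idx])  # the sample becomes its own cluster
    return clusters

# Usage with one-dimensional points and absolute difference as the distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
groups = density_cluster(points, lambda a, b: abs(a - b), radius=0.5, min_count=0)
```

One consequence of the rule as written is worth noting: because the initial cluster holds a single sample, a count threshold of 1 or more can never be exceeded and every sample would end up in its own cluster. The usage example therefore sets `min_count=0`; how larger thresholds are meant to bootstrap is not stated in the source.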
Both the discrimination network and the generation network are built from Logistic functions, given by the formula $h(W \cdot X) = \frac{1}{1 + e^{-W \cdot X}}$, where $X$ denotes the N-dimensional numerical vector of a data sample and $W$ denotes the connection weights. The discrimination network weights are updated by gradient descent and the generation network weights by gradient ascent. The loss function is the cross entropy, $Loss = -y \log(h(W \cdot X)) - (1 - y) \log(1 - h(W \cdot X))$, where $y$ denotes the data sample label. The discrimination network weights are updated according to $W \leftarrow W + \alpha X (y - h(W \cdot X))$, where $\alpha$ is the training step size, taken as $\alpha = 1$, and the discrimination network is trained until convergence. The input vector of the generation network is randomly initialized. Based on the chain rule and the residual back-propagated from the discrimination network, the generation network weights are updated according to $W \leftarrow W - \alpha X (y - h(W \cdot X))$, and the generation network is trained until convergence. The output of the converged generation network is the maximum likelihood estimation sample.
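The update rules in this paragraph can be illustrated for a single Logistic unit. The sketch below covers only the Logistic function, the cross-entropy loss and the two weight updates (plus for descent, minus for ascent); it is not a full adversarial training loop, and the function names are hypothetical. Treating the opposite-signed update as gradient ascent on the same loss is this sketch's reading of the text.

```python
import numpy as np

def h(W, X):
    """Logistic function applied to the weighted input W . X."""
    return 1.0 / (1.0 + np.exp(-np.dot(W, X)))

def cross_entropy(W, X, y):
    """Loss = -y*log(h) - (1 - y)*log(1 - h) for label y in {0, 1}."""
    p = h(W, X)
    return float(-y * np.log(p) - (1 - y) * np.log(1 - p))

def descent_step(W, X, y, alpha=1.0):
    """W + alpha * X * (y - h(W . X)): one gradient-descent step on the loss."""
    return W + alpha * X * (y - h(W, X))

def ascent_step(W, X, y, alpha=1.0):
    """W - alpha * X * (y - h(W . X)): one gradient-ascent step on the loss."""
    return W - alpha * X * (y - h(W, X))

# Usage: one step from zero weights on a single sample labelled 1.
W0 = np.zeros(2)
X = np.array([1.0, 2.0])
W_down = descent_step(W0, X, y=1)   # loss decreases
W_up = ascent_step(W0, X, y=1)      # loss increases
```

The gradient of the cross-entropy loss with respect to W is -(y - h(W . X)) X, which is why adding alpha * X * (y - h) moves down the loss surface and subtracting it moves up.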
(3) A skewed data filling and training method based on maximum likelihood estimation samples.
The maximum likelihood estimation samples that the generation network of each generative adversarial network produces for the different clusters are filled into the skewed data. Because these samples generalize the sparse data and share its characteristics, they expand the sparse data in the skewed data, reduce the data skew, and improve the accuracy of the trained model.
Let $X = \{\{x_{1,1}, x_{1,2}, \ldots, x_{1,l}\}, \{x_{2,1}, x_{2,2}, \ldots, x_{2,m}\}, \ldots, \{x_{k,1}, x_{k,2}, \ldots, x_{k,n}\}\}$ denote the data in class 1, class 2, through class k, where class k is a class of sparse data. The samples $\{x_{k,1}, x_{k,2}, \ldots, x_{k,n}\}$ are clustered and the maximum likelihood estimation samples $x_{k,i}, x_{k,i+1}, \ldots, x_{k,i+t}$ are generated. Filling these samples into class k gives the filled data $\{x_{k,1}, x_{k,2}, \ldots, x_{k,n}, x_{k,i}, x_{k,i+1}, \ldots, x_{k,i+t}\}$, which finally replaces the original data $\{x_{k,1}, x_{k,2}, \ldots, x_{k,n}\}$ of class k for training.
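The filling step amounts to appending the generated samples to the sparse class and training on the result. A minimal sketch, assuming the data set is held as a dictionary from class label to a list of samples (a representation chosen here for illustration; `fill_sparse_class` is a hypothetical name):

```python
def fill_sparse_class(dataset, sparse_label, generated):
    """Return a copy of `dataset` in which the class `sparse_label`
    is extended with the generated samples; other classes are unchanged."""
    filled = {label: list(samples) for label, samples in dataset.items()}
    filled[sparse_label] = filled[sparse_label] + list(generated)
    return filled

# Usage: class 2 is sparse and receives two generated samples.
data = {1: [[0.1], [0.2], [0.3]], 2: [[0.9]]}
filled = fill_sparse_class(data, sparse_label=2, generated=[[0.8], [1.0]])
```

The original dictionary is left untouched, so the filled copy can replace it for training while the unfilled data remains available for comparison.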
Drawings
FIG. 1 is the general technical roadmap of the invention.
Detailed Description
The implementation path of the invention is as follows:
Step 1: A data sample in the sparse data class is selected at random as the initial sample and forms the initial cluster.
Step 2: Each data sample in the sparse data class is compared pairwise with all sample points in all clusters. The distance between the N-dimensional numerical vectors of two samples is the cosine distance, the distance between their N-dimensional classification vectors is the Jaccard distance, and the distance between the two samples is the sum of the cosine distance and the Jaccard distance.
Step 3: If, within one cluster, the number of samples whose distance to the new sample is less than the threshold R is greater than the threshold D, the new sample is added to that cluster.
Step 4: If, within one cluster, the number of samples whose distance to the new sample is less than the threshold R is less than the threshold D, the comparison continues with the samples in the next cluster.
Step 5: If, for every cluster, the number of samples whose distance to the new sample is less than the threshold R is less than D, the new sample forms a cluster of its own. The process is executed iteratively until every data sample has been assigned to a cluster.
Step 6: A generative adversarial network model is initialized for each cluster in the sparse data class.
Step 7: For each generative adversarial network model, an M-dimensional vector is randomly initialized as the input of the generation network; each dimension of the M-dimensional vector is fully connected to the N-dimensional output vector of the generation network.
Step 8: For each generative adversarial network model, the samples in the corresponding cluster are input into the discrimination network.
Step 9: The discrimination network is trained iteratively by gradient descent until convergence. Then, using gradient ascent and the residual between the computed value and the label value of the discrimination network, the generation network weights are corrected by chained back-propagation until the generation network converges.
Step 10: The vector output by the generation network is the maximum likelihood estimate of the samples in the corresponding cluster. Steps 7 to 10 are iterated i times, yielding i maximum likelihood estimation samples for each cluster.
Step 11: The maximum likelihood estimation samples are filled into the original sparse data class, which reduces the data skew, and the filled data is trained with a supervised learning method to obtain a "skew-free" model.
Claims (4)
1. A skewed data training method based on a generative adversarial network, characterized by comprising:
(1) A sparse data clustering method based on a mixed prototype density clustering algorithm;
(2) A sparse data maximum likelihood estimation method based on a generative adversarial network;
(3) A skewed data filling and training method based on maximum likelihood estimation samples.
2. The sparse data clustering method based on the mixed prototype density clustering algorithm according to claim 1, wherein, to address the problem that the projection of the sparse data in the skewed data into N-dimensional space is irregular and discrete, the mixed prototype density clustering algorithm is used to perform density clustering on the sparse samples in the skewed data, forming a plurality of irregularly shaped clusters.
3. The sparse data maximum likelihood estimation method based on the generative adversarial network according to claim 1, wherein, to address the insufficient feature generalization caused by the small number of training samples of the sparse data in the skewed data, a generative adversarial network is used to train the sparse data in each cluster separately, and a plurality of maximum likelihood estimation samples are generated for each cluster.
4. The skewed data filling and training method based on maximum likelihood estimation samples according to claim 1, wherein, to address the under-fitting/over-fitting caused by insufficient feature generalization of the sparse data when a supervised learning method is trained on skewed data, the maximum likelihood estimation samples of the sparse data are filled into the skewed data, which reduces the data skew and improves the accuracy of training and fitting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910876333.3A CN110569919A (en) | 2019-09-17 | 2019-09-17 | Skew data training method based on generation countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910876333.3A CN110569919A (en) | 2019-09-17 | 2019-09-17 | Skew data training method based on generation countermeasure network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110569919A (en) | 2019-12-13 |
Family
ID=68780487
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910876333.3A Pending CN110569919A (en) | 2019-09-17 | 2019-09-17 | Skew data training method based on generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110569919A (en) |
-
2019
- 2019-09-17 CN CN201910876333.3A patent/CN110569919A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191213 |