CN111414973A - Classification framework based on generating extremely unbalanced data for a countermeasure network - Google Patents
- Publication number
- CN111414973A
- Authority
- CN
- China
- Prior art keywords
- data
- model
- classification framework
- generation
- framework based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a classification framework for extremely imbalanced data based on a generative adversarial network (GAN). Inspired by GANs and by transfer learning, the framework synthesizes data with a GAN to address data imbalance, pre-trains a classification model on the synthesized data, and then fine-tunes the model on real data through transfer learning, thereby solving the classification problem under data imbalance. The method also introduces no new variables, which avoids the search for suitable parameters that weighting methods require.
Description
Technical Field
The invention belongs to the field of machine learning, and in particular relates to a classification framework for extremely imbalanced data based on a generative adversarial network.
Background
Data imbalance means that the training data for a machine learning task contains too few labeled examples of some class, or that the ratio of positive to negative samples is too skewed, so that a conventional machine learning method cannot build a usable classification model. Researchers have proposed many remedies, which fall mainly into three categories: (1) oversampling, which copies minority-class samples into the training set to compensate for the shortage of minority labels; (2) undersampling, which reduces the number of majority-class samples in the training set to rebalance the ratio of positive to negative samples; (3) weighting, which redefines the loss function with weights to counter the gradient bias that arises during training. Each of these has drawbacks. Oversampling duplicates minority samples, which easily causes overfitting and degrades the trained model. Undersampling discards majority-class samples and therefore loses information. Weighting introduces a new weight variable whose choice affects the quality of the final model, and finding a suitable value is difficult.
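The three conventional remedies can be made concrete with a short numpy sketch. It is an illustration only: the function names, the 5-versus-95 toy data, and the `pos_weight` parameter are assumptions introduced here, not part of the patent.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority-class rows until both classes have equal counts."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, rng):
    """Drop majority-class rows until both classes have equal counts."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]

def weighted_log_loss(y, p, pos_weight):
    """Binary cross-entropy in which positive samples count pos_weight times."""
    eps = 1e-12
    w = np.where(y == 1, pos_weight, 1.0)
    return float(np.mean(-w * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 5 + [0] * 95)           # toy data: 5 positives, 95 negatives
X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
```

The oversampled set contains many exact copies of the five positives (the overfitting risk), the undersampled set throws away 90 negatives (the information loss), and `pos_weight` is precisely the extra variable that has to be tuned.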
Disclosure of Invention
The invention provides a classification framework for extremely imbalanced data based on a generative adversarial network. It overcomes the drawbacks of conventional approaches to data imbalance: it effectively prevents the model from overfitting, loses no information from the training data, introduces no new variables, and improves model performance.
The object of the invention is achieved as follows: a classification framework for extremely imbalanced data based on a generative adversarial network, comprising at least:
a generative model, which generates synthetic positive and negative samples;
two discriminative models, the first used to determine whether data comes from the real data or was generated by the generative model, the second used to determine whether data is a positive sample or a negative sample;
a pre-training module, which pre-trains a learning model on the positive and negative samples generated by the generative model and retains the knowledge learned from the synthetic data;
a transfer learning module, which preserves that knowledge and retrains the output layer of the learning model on real data, thereby fine-tuning the learning model;
wherein the generative model and the discriminative models together form a generative adversarial network, and once the network has converged the generative model yields the synthetic positive and negative samples.
Further, the framework also comprises a preprocessing module for converting raw data into numerical data that can be used in computation.
Further, the preprocessing module preprocesses the raw data by data cleaning, data integration, and data transformation.
Further, data cleaning cleans the data by filling in missing values, smoothing noisy data, and identifying or resolving inconsistencies.
Further, data cleaning achieves the following goals: standardizing data formats, removing abnormal data, correcting errors, and removing duplicate data.
Further, data integration merges data from multiple data sources and stores it in a unified way to build a data warehouse.
Further, data transformation converts the data into the form required by the learning model.
The invention has the following beneficial effects:
Addressing data imbalance with the method in this framework avoids the overfitting that conventional methods cause when they copy samples or otherwise introduce new data into the real data. At the same time, no data is discarded and no new parameters are introduced, so the model is relatively easy to optimize and is more reproducible and extensible.
Drawings
FIG. 1 shows a conventional generative adversarial network model;
FIG. 2 shows the new generative adversarial network model proposed by the invention to address data imbalance;
FIG. 3 shows the model used for pre-training in the invention;
FIG. 4 shows the transfer learning model in the invention.
Detailed Description
The invention is described in more detail below with reference to FIGS. 1-4.
This embodiment provides a general framework for handling data imbalance in machine learning. Inspired by generative adversarial networks, it proposes a GAN that differs from the conventional architecture: it comprises one generative model and two discriminative models. The generative model generates synthetic positive and negative samples; the first discriminative model determines whether data comes from the real data or was generated by the generative model; the second discriminative model determines whether data is a positive sample or a negative sample. Once the model has converged, the generative model yields synthetic positive and negative samples. Pre-training the model on these generated samples addresses the data imbalance. Transfer learning is then introduced: the pre-trained model retains the knowledge learned from the synthetic data, the parameters of the feature extraction layers of the neural network are kept unchanged, and the final model is fine-tuned on real data so that it also learns the knowledge contained in the real data. Addressing data imbalance with the method in this framework avoids the overfitting that conventional methods cause when they copy samples or otherwise introduce new data into the real data. At the same time, no data is discarded and no new parameters are introduced, so the model is relatively easy to optimize and is more reproducible and extensible.
The main modules of the method are implemented as follows:
Preprocessing: convert the raw data into numerical data that the model can compute on, and fill in missing values. The original fields of the data are shown in Table 1:
Table 1. Original fields and processed fields
As Table 1 shows, most of the available original fields are of string type, whereas a generative adversarial network can only process numerical variables. Preprocessing therefore includes not only the data cleaning and data integration described above but also, in the data transformation step, the conversion of character data into numerical data that the model can process.
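A minimal pandas sketch of the three preprocessing stages might look as follows. The fields of Table 1 are not reproduced in this text, so the column names `gender` and `age`, the median fill, and the integer category codes are all assumptions made for illustration.

```python
import pandas as pd

def preprocess(frames):
    """Integrate, clean, and transform raw tables into a numeric DataFrame."""
    # Data integration: stack records from several sources into one table.
    df = pd.concat(frames, ignore_index=True)
    # Data cleaning: remove exact duplicate records.
    df = df.drop_duplicates().reset_index(drop=True)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Data cleaning: fill missing numeric values with the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Data transformation: fill missing strings, then map each
            # distinct string to an integer code in order of appearance.
            df[col] = df[col].fillna("unknown")
            df[col] = pd.factorize(df[col])[0]
    return df.astype(float)

# Hypothetical field names and values, used only to exercise the function.
source_a = pd.DataFrame({"gender": ["F", "M", None], "age": [30.0, None, 41.0]})
source_b = pd.DataFrame({"gender": ["M", "F"], "age": [25.0, 30.0]})
numeric = preprocess([source_a, source_b])
```

Integer codes impose an arbitrary order on categories; one-hot encoding is a common alternative when that order would mislead the model.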
A conventional GAN cannot generate discrete data, because its generative model is trained by backpropagating the loss of the discriminative model D. We therefore use a feedback neural network as an autoencoder, trained in an unsupervised manner, which comprises an encoder Enc and a decoder Dec. Both the encoder and the decoder are multi-layer neural networks.
The process is as follows:
Algorithm environment:
Python, numpy
Input:
1. Sample attributes $X \in \mathbb{R}^n$
2. Labels $y$
3. Number of iterations $n$
Output:
1. Generative model $G_{Dec}$
(1) While the number of iterations is less than $n$:
(2) Sample $m$ examples $(x, y) \sim P_{data}(x, y)$ from the data
(3) Update the autoencoder by gradient descent, where $x' = \mathrm{Dec}(\mathrm{Enc}(x))$.
Once the number of iterations reaches $n$, the algorithm is considered to have converged and training of the autoencoder is complete.
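The loop above can be sketched in numpy. The update formula itself is not reproduced in this text, so the sketch assumes a squared reconstruction error between x and x' as the loss; the single tanh hidden layer, the learning rate, and all function and variable names are likewise illustrative assumptions.

```python
import numpy as np

def train_autoencoder(X, hidden, n_iters, m, lr, seed=0):
    """Train a one-hidden-layer autoencoder; return (params, loss history)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(d, hidden))
    b_enc = np.zeros(hidden)
    W_dec = rng.normal(scale=0.1, size=(hidden, d))
    b_dec = np.zeros(d)
    history = []
    for _ in range(n_iters):                               # step (1)
        x = X[rng.choice(len(X), size=m, replace=False)]   # step (2)
        h = np.tanh(x @ W_enc + b_enc)                     # Enc(x)
        x_rec = h @ W_dec + b_dec                          # x' = Dec(Enc(x))
        err = x_rec - x
        history.append(float(np.mean(np.sum(err ** 2, axis=1))))
        # Step (3): gradient descent on the ASSUMED squared reconstruction error.
        g_rec = 2.0 * err / m
        g_W_dec = h.T @ g_rec
        g_b_dec = g_rec.sum(axis=0)
        g_pre = (g_rec @ W_dec.T) * (1.0 - h ** 2)         # back through tanh
        g_W_enc = x.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
    return (W_enc, b_enc, W_dec, b_dec), history

# Toy data with two underlying factors, so a 2-unit code can reconstruct it.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ rng.normal(scale=0.5, size=(2, 4))
params, history = train_autoencoder(X, hidden=2, n_iters=500, m=32, lr=0.05)
```

After training, the decoder half (`W_dec`, `b_dec`) plays the role of the generative model $G_{Dec}$ named in the output.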
The pre-training module proceeds as follows:
Model environment:
Python, Keras, Pandas
Input:
1. Real data $P_{data}(x, y)$
2. Noise data $P_z(x, y)$
3. Number of model iterations $n$
Output:
All parameters contained in the model
(1) While the number of iterations is less than $n$:
(3) Update the discriminative model $D_1$ by gradient ascent:
(4) Update the generative model $G_{Dec}$ by gradient descent:
(6) Update the discriminative model $D_2$ by gradient ascent:
(7) Update the generative model $G_{Dec}$ by gradient descent:
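The update formulas for these steps are not reproduced in this text. The numpy sketch below therefore only illustrates the alternating schedule (D1, generator, D2, generator) under assumed losses: the standard GAN log-loss for D1 and binary cross-entropy for D2. It uses logistic-regression discriminators and a linear, label-conditioned generator in place of the decoder-based generator, purely to keep the gradients short; every name in it is hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def generator_step(Wg, Cg, z, labels, g_x, lr):
    """In-place gradient-descent step given dLoss/dx_fake (g_x)."""
    Wg -= lr * (z.T @ g_x)
    for k in (0, 1):
        Cg[k] -= lr * g_x[labels == k].sum(axis=0)

def train_two_discriminator_gan(X, y, n_iters, m, lr, noise_dim=4, seed=0):
    """Alternate D1, G, D2, G updates; return a label-conditioned sampler."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Wg = rng.normal(scale=0.1, size=(noise_dim, d))   # generator weights
    Cg = np.zeros((2, d))                             # one offset per label
    w1, b1 = np.zeros(d), 0.0                         # D1: real vs generated
    w2, b2 = np.zeros(d), 0.0                         # D2: positive vs negative

    def generate(z, labels):
        return z @ Wg + Cg[labels]

    for _ in range(n_iters):
        idx = rng.choice(len(X), size=m, replace=False)
        x_real, y_real = X[idx], y[idx]
        z = rng.normal(size=(m, noise_dim))
        y_fake = rng.integers(0, 2, size=m)           # balanced synthetic labels
        x_fake = generate(z, y_fake)
        # D1: gradient ascent on log D1(x_real) + log(1 - D1(x_fake)).
        p_real = sigmoid(x_real @ w1 + b1)
        p_fake = sigmoid(x_fake @ w1 + b1)
        w1 = w1 + lr * ((1 - p_real) @ x_real - p_fake @ x_fake) / m
        b1 = b1 + lr * (np.sum(1 - p_real) - np.sum(p_fake)) / m
        # G: gradient descent on log(1 - D1(G(z))).
        p_fake = sigmoid(x_fake @ w1 + b1)
        generator_step(Wg, Cg, z, y_fake, -(p_fake[:, None] * w1) / m, lr)
        # D2: gradient ascent on the label log-likelihood of real data.
        q_real = sigmoid(x_real @ w2 + b2)
        w2 = w2 + lr * ((y_real - q_real) @ x_real) / m
        b2 = b2 + lr * np.mean(y_real - q_real)
        # G: gradient descent on D2's cross-entropy for the intended labels.
        x_fake = generate(z, y_fake)
        q_fake = sigmoid(x_fake @ w2 + b2)
        generator_step(Wg, Cg, z, y_fake, ((q_fake - y_fake)[:, None] * w2) / m, lr)

    def sample(labels, rng):
        labels = np.asarray(labels)
        return generate(rng.normal(size=(len(labels), noise_dim)), labels)

    return sample

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.array([1] * 10 + [0] * 190)                    # 10 positives, 190 negatives
X[y == 1] += 2.0
sample = train_two_discriminator_gan(X, y, n_iters=200, m=32, lr=0.05)
synthetic = sample(np.array([1] * 50 + [0] * 50), rng)
```

Because the labels passed to `sample` are chosen by the caller, the synthetic set can be made exactly balanced regardless of how skewed the real data is.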
Transfer learning module: the GAN in this embodiment is built from deep learning models. After pre-training is complete, the feature parameters of the model are kept unchanged and the output layer of the model is retrained on real data; the detailed process is shown in the drawings. Pre-training trains the model on generated data to address the data imbalance, while transfer learning preserves that knowledge and retrains the model on real data, so that the knowledge in the real data is also captured by the model.
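The freeze-then-retrain idea can be sketched with a small two-layer classifier in numpy. This is an assumption-laden illustration, not the patent's Keras model: the network size, the `freeze_features` flag, and the toy data standing in for generator output and for real data are all hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_classifier(params, X, y, steps, lr, freeze_features=False):
    """Full-batch gradient descent on binary cross-entropy for a 2-layer net."""
    W1, b1, w2, b2 = (np.array(p, dtype=float) for p in params)  # work on copies
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)              # feature-extraction layer
        p = sigmoid(h @ w2 + b2)              # output layer
        g_out = (p - y) / len(X)              # dLoss/dlogit
        if not freeze_features:
            g_h = np.outer(g_out, w2) * (1.0 - h ** 2)
            W1 -= lr * (X.T @ g_h)
            b1 -= lr * g_h.sum(axis=0)
        w2 -= lr * (h.T @ g_out)              # the output layer is always trained
        b2 -= lr * g_out.sum()
    return W1, b1, w2, b2

rng = np.random.default_rng(0)
init = (rng.normal(scale=0.1, size=(3, 8)), np.zeros(8),
        rng.normal(scale=0.1, size=8), 0.0)

# Balanced synthetic data, standing in for samples from the generative model.
X_syn = np.vstack([rng.normal(1.0, 1.0, size=(100, 3)),
                   rng.normal(-1.0, 1.0, size=(100, 3))])
y_syn = np.array([1] * 100 + [0] * 100)
# Imbalanced "real" data: 5 positives, 95 negatives.
X_real = np.vstack([rng.normal(1.0, 1.0, size=(5, 3)),
                    rng.normal(-1.0, 1.0, size=(95, 3))])
y_real = np.array([1] * 5 + [0] * 95)

pretrained = train_classifier(init, X_syn, y_syn, steps=200, lr=0.5)
finetuned = train_classifier(pretrained, X_real, y_real, steps=100, lr=0.5,
                             freeze_features=True)
```

In Keras, the same effect is usually obtained by setting `trainable = False` on the feature layers and recompiling the model before fitting on the real data.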
It should be noted that although the spirit and principles of the invention have been described with reference to several specific embodiments, the invention is not limited to the disclosed embodiments. The division into aspects is for convenience of description only and does not imply that features in those aspects cannot be combined. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (7)
1. A classification framework for extremely imbalanced data based on a generative adversarial network, characterized in that it comprises at least:
a generative model, which generates synthetic positive and negative samples;
two discriminative models, the first used to determine whether data comes from the real data or was generated by the generative model, the second used to determine whether data is a positive sample or a negative sample;
a pre-training module, which pre-trains a learning model on the positive and negative samples generated by the generative model and retains the knowledge learned from the synthetic data;
a transfer learning module, which preserves that knowledge and retrains the output layer of the learning model on real data, thereby fine-tuning the learning model;
wherein the generative model and the discriminative models together form a generative adversarial network, and once the network has converged the generative model yields the synthetic positive and negative samples.
2. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 1, characterized in that it further comprises a preprocessing module for converting raw data into numerical data that can be used in computation.
3. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 2, characterized in that the preprocessing module preprocesses the raw data by data cleaning, data integration, and data transformation.
4. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, characterized in that data cleaning cleans the data by filling in missing values, smoothing noisy data, and identifying or resolving inconsistencies.
5. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 4, characterized in that data cleaning achieves the following goals: standardizing data formats, removing abnormal data, correcting errors, and removing duplicate data.
6. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, characterized in that data integration merges data from multiple data sources and stores it in a unified way to build a data warehouse.
7. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, characterized in that data transformation converts the data into the form required by the learning model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010235521.0A CN111414973A (en) | 2020-03-30 | 2020-03-30 | Classification framework based on generating extremely unbalanced data for a countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010235521.0A CN111414973A (en) | 2020-03-30 | 2020-03-30 | Classification framework based on generating extremely unbalanced data for a countermeasure network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111414973A true CN111414973A (en) | 2020-07-14 |
Family
ID=71491606
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010235521.0A Pending CN111414973A (en) | 2020-03-30 | 2020-03-30 | Classification framework based on generating extremely unbalanced data for a countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111414973A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112215268A (en) * | 2020-09-27 | 2021-01-12 | 浙江工业大学 | Method and device for classifying disaster weather satellite cloud pictures |
CN112529114A (en) * | 2021-01-13 | 2021-03-19 | 北京云真信科技有限公司 | Target information identification method based on GAN, electronic device and medium |
CN113159947A (en) * | 2021-03-17 | 2021-07-23 | 同济大学 | Difficult anomaly sample detection framework based on generation of countermeasure network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109580215B (en) | Wind power transmission system fault diagnosis method based on deep generation countermeasure network | |
CN111414973A (en) | Classification framework based on generating extremely unbalanced data for a countermeasure network | |
CN108399428B (en) | Triple loss function design method based on trace ratio criterion | |
CN106934042B (en) | Knowledge graph representation system and implementation method thereof | |
CN105975573B (en) | A kind of file classification method based on KNN | |
Setnes et al. | GA-fuzzy modeling and classification: complexity and performance | |
CN110212528B (en) | Power distribution network measurement data missing reconstruction method | |
CN112766386B (en) | Generalized zero sample learning method based on multi-input multi-output fusion network | |
CN114332568B (en) | Training method, system, equipment and storage medium of domain adaptive image classification network | |
CN111753207B (en) | Collaborative filtering method for neural map based on comments | |
CN113688869B (en) | Photovoltaic data missing reconstruction method based on generation countermeasure network | |
CN111314353A (en) | Network intrusion detection method and system based on hybrid sampling | |
CN112115967B (en) | Image increment learning method based on data protection | |
Khoshgoftaar et al. | Enhancing software quality estimation using ensemble-classifier based noise filtering | |
CN112307130B (en) | Document-level remote supervision relation extraction method and system | |
CN110956277A (en) | Interactive iterative modeling system and method | |
CN114006870A (en) | Network flow identification method based on self-supervision convolution subspace clustering network | |
CN112926627A (en) | Equipment defect time prediction method based on capacitive equipment defect data | |
CN116166650A (en) | Multisource heterogeneous data cleaning method based on generation countermeasure network | |
CN112199637B (en) | Regression modeling method for generating contrast network data enhancement based on regression attention | |
CN115906959A (en) | Parameter training method of neural network model based on DE-BP algorithm | |
CN109492746A (en) | Deepness belief network parameter optimization method based on GA-PSO Hybrid Algorithm | |
CN115116616A (en) | Intra-group optimization based multiple interpolation breast cancer deletion data interpolation model | |
CN111402205B (en) | Mammary tumor data cleaning method based on multilayer perceptron | |
CN113158555A (en) | Heavy gas turbine control system analog input module BIT design method based on expert system and random forest classifier |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200714 |