CN111414973A

CN111414973A - Classification framework for extremely imbalanced data based on a generative adversarial network

Info

Publication number: CN111414973A
Application number: CN202010235521.0A
Authority: CN
Inventors: 王成; 胡腾
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-14

Abstract

The invention provides a classification framework for extremely imbalanced data based on a generative adversarial network (GAN). Inspired by GANs and incorporating transfer learning, the framework synthesizes data with a GAN to address the data imbalance, pre-trains a classification model on the synthesized data, and then fine-tunes the model on real data by transfer learning, thereby solving the classification problem for imbalanced data. The method introduces no new variables, so the search for suitable parameters required by weighting methods is avoided.

Description

Classification framework for extremely imbalanced data based on a generative adversarial network

Technical Field

The invention belongs to the field of machine learning, and in particular relates to a classification framework for extremely imbalanced data based on a generative adversarial network.

Background

The data imbalance problem arises when the amount of labeled data for a certain class in the training data is too small, or when the ratio of positive to negative samples is too skewed, so that a conventional machine learning method cannot build a usable classification model. Researchers have proposed many methods to address the problem, which fall mainly into three categories: (1) oversampling, which duplicates minority-class samples and adds the copies to the training set to compensate for the shortage of minority-class labels; (2) undersampling, which reduces the amount of majority-class data in the training set to rebalance the positive-to-negative ratio; and (3) weighting, which redefines the loss function by weighting its terms to counter the gradient bias that arises during training. Each of these methods has drawbacks. Oversampling easily causes the model to overfit, because duplicated minority-class samples are added to the training set, so the trained model performs poorly. Undersampling discards majority-class data and therefore loses information. Weighting introduces a new weight variable whose value affects the quality of the final model, and finding a suitable value is very difficult.
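For illustration only, the three conventional approaches described above can be sketched in a few lines of Python. The function names `oversample`, `undersample`, and `class_weights`, the binary-label assumption, and the inverse-frequency weighting rule are assumptions made for this sketch and are not taken from the invention.

```python
import random
from collections import Counter

def oversample(samples, labels, minority, seed=0):
    """Random oversampling: duplicate minority samples until both classes match."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    extra = [rng.choice(minority_idx)
             for _ in range(majority_count - len(minority_idx))]
    idx = list(range(len(labels))) + extra
    return [samples[i] for i in idx], [labels[i] for i in idx]

def undersample(samples, labels, minority, seed=0):
    """Random undersampling: keep only as many majority samples as minority ones."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = sorted(minority_idx + kept)
    return [samples[i] for i in idx], [labels[i] for i in idx]

def class_weights(labels):
    """Inverse-frequency loss weights (one common, assumed choice): n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical toy data: 95 negative samples and 5 positive samples.
samples = list(range(100))
labels = [0] * 95 + [1] * 5
over_x, over_y = oversample(samples, labels, minority=1)
under_x, under_y = undersample(samples, labels, minority=1)
weights = class_weights(labels)
```

In this toy setting the oversampled set contains many exact copies of the 5 positive samples, the undersampled set discards 90 negative samples, and the weighting scheme depends on a weight rule that must be chosen by hand, which are the three drawbacks named above.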

Disclosure of Invention

The invention provides a classification framework for extremely imbalanced data based on a generative adversarial network. It overcomes the shortcomings of conventional methods for handling data imbalance: it effectively prevents overfitting, loses no information from the training data, introduces no new variables, and improves model performance.

The object of the invention is achieved as follows: a classification framework for extremely imbalanced data based on a generative adversarial network, comprising at least:

a generative model for generating synthetic positive and negative samples;

two discriminative models, the first being used to discriminate whether data comes from the real data or was generated by the generative model, and the second being used to discriminate whether data is a positive sample or a negative sample;

a pre-training module for pre-training a learning model on the positive and negative samples generated by the generative model, so that the learning model retains the knowledge learned from the synthetic data;

a transfer learning module for retraining the output layer of the learning model on real data while preserving that knowledge, thereby fine-tuning the learning model;

wherein the generative model and the discriminative models form a generative adversarial network, and once the network has converged, the generative model yields the synthetic positive and negative samples.

Further, the framework also comprises a preprocessing module for converting raw data into numerical data that can be used for computation.

Further, the preprocessing module preprocesses the raw data by means including data cleaning, data integration, and data transformation.

Further, during data cleaning, the data is cleaned by filling in missing values, smoothing noisy data, and identifying or resolving inconsistencies.

Further, the data cleaning achieves the following goals: standardizing data formats, removing abnormal data, correcting errors, and removing duplicate data.

Further, the data integration combines data from multiple data sources and stores it uniformly to build a data warehouse.

Further, the data transformation is used to convert the data into the form required by the learning model.

The invention has the beneficial effects that:

By addressing data imbalance with the method in this framework, the overfitting caused in conventional methods by duplicating samples and adding the copies to the real data is avoided. At the same time, no information in the data is lost and no new parameters are introduced, so the model is relatively easy to optimize and has better reproducibility and extensibility.

Drawings

FIG. 1 is a diagram of a conventional generative adversarial network model;

FIG. 2 is a diagram of the new generative adversarial network model proposed by the present invention to address data imbalance;

FIG. 3 is a diagram of a model for model pre-training in the present invention;

FIG. 4 is a model diagram of transfer learning in the present invention.

Detailed Description

The invention is described in more detail below with reference to FIGS. 1-4.

This embodiment provides a general framework for handling data imbalance in machine learning. Inspired by generative adversarial networks, it proposes a generative adversarial network that differs from the conventional architecture: it comprises one generative model and two discriminative models. The generative model generates synthetic positive and negative samples, the first discriminative model discriminates whether data comes from the real data or was generated by the generative model, and the second discriminative model discriminates whether data is a positive sample or a negative sample. Once the network has converged, the generative model yields synthetic positive and negative samples.

Pre-training the model on the samples produced by the generative model addresses the data imbalance. Transfer learning is introduced at the same time: the pre-trained model retains the knowledge learned from the synthetic data, the parameters of the feature extraction layers of the neural network are kept unchanged, and the final model is fine-tuned on real data so that it also learns the knowledge contained in the real data. By addressing data imbalance in this way, the overfitting caused in conventional methods by duplicating samples and adding the copies to the real data is avoided. At the same time, no information in the data is lost and no new parameters are introduced, so the model is relatively easy to optimize and has better reproducibility and extensibility.

The main modules in the method are implemented as follows:

Preprocessing: the raw data is converted into numerical data that the model can use for computation, and missing values are filled in. The original fields of the data are shown in Table 1:

TABLE 1 Original fields and processed fields

As Table 1 shows, most of the available original fields are strings, whereas the generative adversarial network itself can process only numerical variables. Preprocessing therefore includes not only the data cleaning and data integration described above, but also, in the data transformation step, the conversion of string data into numerical data that the model can process.
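The preprocessing steps described above (removing duplicates, filling missing values, and converting string fields to numbers) might look roughly like the following sketch. The field names `amount` and `channel`, the mean-fill rule, and the integer coding of categories are hypothetical choices for illustration; the actual fields of Table 1 are not reproduced here.

```python
from statistics import mean

def preprocess(records, numeric_fields, categorical_fields):
    """Clean raw records and transform them into numeric feature rows."""
    # Data cleaning: drop exact duplicate records while preserving order.
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Data cleaning: missing numeric values are filled with the field mean.
    fill = {}
    for f in numeric_fields:
        present = [r[f] for r in unique if r.get(f) is not None]
        fill[f] = mean(present) if present else 0.0

    # Data transformation: each string category is mapped to an integer code;
    # missing categories are treated as the category "unknown".
    codes = {}
    for f in categorical_fields:
        values = sorted({r.get(f) or "unknown" for r in unique})
        codes[f] = {v: i for i, v in enumerate(values)}

    rows = []
    for r in unique:
        row = [float(r[f]) if r.get(f) is not None else float(fill[f])
               for f in numeric_fields]
        row += [float(codes[f][r.get(f) or "unknown"]) for f in categorical_fields]
        rows.append(row)
    return rows, codes

# Hypothetical raw records with a missing value in each field and one duplicate.
records = [
    {"amount": 10.0, "channel": "web"},
    {"amount": None, "channel": "app"},
    {"amount": 30.0, "channel": None},
    {"amount": 10.0, "channel": "web"},
]
rows, codes = preprocess(records, ["amount"], ["channel"])
```

Integer codes impose an artificial ordering on the categories; a one-hot encoding is a common alternative when that ordering is undesirable.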

A conventional GAN cannot generate discrete data, because its generative model is trained by back-propagating the loss of the discriminative model D. This embodiment therefore uses a feedback neural network as an autoencoder, trained by unsupervised learning and comprising an encoder Enc and a decoder Dec. Both the encoder and the decoder are multi-layer neural networks.

The process is as follows:

Algorithm environment:

Python, NumPy

Input:

1. Sample attributes X ∈ Rⁿ

2. Labels y

3. Number of iterations n

Output:

1. Generative model G_Dec

(1) While the number of iterations is less than n:

(2) Sample m samples (x, y) ~ P_data(x, y) from the data

(3) Update the autoencoder by gradient descent:

where x' = Dec(Enc(x)). Once the number of iterations reaches n, the algorithm is considered to have converged and training of the autoencoder is complete.
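The autoencoder training loop above can be sketched in NumPy as follows. The update formula is not reproduced in this text, so the sketch assumes a mean-squared reconstruction loss between x and x' = Dec(Enc(x)); the single tanh encoder layer, the linear decoder, and all hyperparameters are likewise simplifying assumptions (the embodiment describes multi-layer networks), and the labels y are not used by this simplified reconstruction objective.

```python
import numpy as np

def train_autoencoder(X, hidden=2, n_iters=1000, m=16, lr=0.05, seed=0):
    """Minibatch gradient descent on an assumed MSE reconstruction loss."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W_enc = rng.normal(0, 0.1, (d, hidden))
    b_enc = np.zeros(hidden)
    W_dec = rng.normal(0, 0.1, (hidden, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(n_iters):                                    # step (1)
        batch = X[rng.choice(len(X), size=m, replace=False)]    # step (2)
        h = np.tanh(batch @ W_enc + b_enc)                      # Enc(x)
        x_rec = h @ W_dec + b_dec                               # x' = Dec(Enc(x))
        err = x_rec - batch
        losses.append(float(np.mean(err ** 2)))
        # Step (3): back-propagate the MSE loss and take one descent step.
        g_out = 2.0 * err / err.size
        g_W_dec = h.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_pre = (g_out @ W_dec.T) * (1.0 - h ** 2)   # derivative of tanh
        g_W_enc = batch.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
    return (W_enc, b_enc, W_dec, b_dec), losses

# Hypothetical data: 200 samples with 6 standardized attributes of rank 2.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2)) @ data_rng.normal(size=(2, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
params, losses = train_autoencoder(X)
```

After training, the decoder parameters `W_dec` and `b_dec` play the role of G_Dec in this sketch: they map a low-dimensional code back to the attribute space.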

The pre-training module proceeds as follows:

Model environment:

Python, Keras, Pandas

Input:

1. Real data P_data(x, y)

2. Noise data P_z(x, y)

3. Number of model iterations n

Output:

All parameters of the model

(1) While the number of iterations is less than n:

(2) Sample m samples from the real data and m samples from the noise data

(3) Update discriminative model D₁ by gradient ascent:

(4) Update generative model G_Dec by gradient descent:

(5) Sample a further m samples from the real data

(6) Update discriminative model D₂ by gradient ascent:

(7) Update generative model G_Dec by gradient descent:
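The alternating schedule of steps (1) to (7) can be illustrated with a toy NumPy sketch. The update formulas are not reproduced in this text, so the sketch assumes the standard GAN log-loss for D₁ and a binary cross-entropy for D₂; the linear label-conditioned generator, the logistic discriminators, and all hyperparameters are hypothetical stand-ins for the deep models of the embodiment.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def pretrain_gan(X, y, n_iters=200, m=32, z_dim=4, lr=0.05, seed=0):
    """Toy alternating updates for one generator and two discriminators."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Assumed generator: G(z, y) = z @ Wz + y * wy + b (linear, label-conditioned).
    G = {"Wz": rng.normal(0, 0.1, (z_dim, d)), "wy": np.zeros(d), "b": np.zeros(d)}
    a1, c1 = np.zeros(d), 0.0   # D1: logistic model, real vs. generated
    a2, c2 = np.zeros(d), 0.0   # D2: logistic model, positive vs. negative

    def generate(k):
        z = rng.normal(size=(k, z_dim))
        labels = rng.integers(0, 2, size=k).astype(float)
        return z, labels, z @ G["Wz"] + labels[:, None] * G["wy"] + G["b"]

    def generator_step(z, labels, grad_x):
        # Back-propagate dLoss/dx through the linear generator, then descend.
        G["Wz"] -= lr * (z.T @ grad_x)
        G["wy"] -= lr * (labels[:, None] * grad_x).sum(axis=0)
        G["b"] -= lr * grad_x.sum(axis=0)

    for _ in range(n_iters):                                    # step (1)
        x_r = X[rng.choice(len(X), size=m, replace=False)]      # step (2)
        z, labels, x_f = generate(m)
        # Step (3): ascend mean log D1(x_r) + mean log(1 - D1(x_f)).
        p_r, p_f = sigmoid(x_r @ a1 + c1), sigmoid(x_f @ a1 + c1)
        a1 += lr * ((1 - p_r) @ x_r - p_f @ x_f) / m
        c1 += lr * ((1 - p_r).sum() - p_f.sum()) / m
        # Step (4): descend mean log(1 - D1(G(z))) with respect to G.
        p_f = sigmoid(x_f @ a1 + c1)
        generator_step(z, labels, (-p_f[:, None] * a1) / m)
        # Step (5): a further real batch, this time with its labels.
        idx = rng.choice(len(X), size=m, replace=False)
        x_r, y_r = X[idx], y[idx]
        # Step (6): ascend the log-likelihood of the real labels under D2.
        q_r = sigmoid(x_r @ a2 + c2)
        a2 += lr * ((y_r - q_r) @ x_r) / m
        c2 += lr * (y_r - q_r).sum() / m
        # Step (7): descend D2's cross-entropy on generated samples,
        # so that each sample matches the label it was generated for.
        z, labels, x_f = generate(m)
        q_f = sigmoid(x_f @ a2 + c2)
        generator_step(z, labels, ((q_f - labels)[:, None] * a2) / m)
    return G, (a1, c1), (a2, c2)

# Hypothetical imbalanced data: 300 negative and 20 positive samples.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-1, 0.5, (300, 3)), data_rng.normal(1, 0.5, (20, 3))])
y = np.concatenate([np.zeros(300), np.ones(20)])
G, (a1, c1), (a2, c2) = pretrain_gan(X, y)

# Draw three synthetic samples with requested labels 0, 1, 1.
want = np.array([0.0, 1.0, 1.0])
z = data_rng.normal(size=(3, 4))
synthetic = z @ G["Wz"] + want[:, None] * G["wy"] + G["b"]
```

The point of the sketch is the ordering of the updates: each iteration pairs one D₁ update with a generator update, then one D₂ update with a second generator update, so the generator is pushed both toward realistic samples and toward samples of the requested class.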

Transfer learning module: the GAN in this embodiment is built from deep learning models. After pre-training is complete, the feature parameters of the model are kept unchanged and the output layer of the model is retrained on real data; the detailed process is shown in the drawings. Pre-training on the generated data addresses the data imbalance, while transfer learning preserves the knowledge obtained in pre-training and retrains the model on real data, so that knowledge from the real data is also captured in the model.
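The idea of keeping the feature parameters fixed while retraining only the output layer can be sketched in NumPy as follows. The function name `fine_tune_output_layer`, the single tanh feature layer, the logistic output layer, and the randomly initialized "pre-trained" weights are all assumptions made for illustration; in the embodiment the frozen layers would come from the pre-training stage.

```python
import numpy as np

def fine_tune_output_layer(W_feat, b_feat, w_out, b_out, X_real, y_real,
                           n_iters=300, lr=0.1):
    """Retrain only the output layer on real data; feature parameters stay frozen."""
    w_out, b_out = w_out.copy(), float(b_out)
    # The feature extractor is frozen, so its activations are computed once.
    H = np.tanh(X_real @ W_feat + b_feat)
    losses = []
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(H @ w_out + b_out)))
        p_c = np.clip(p, 1e-12, 1 - 1e-12)
        losses.append(float(-np.mean(y_real * np.log(p_c)
                                     + (1 - y_real) * np.log(1 - p_c))))
        # Gradient of the binary cross-entropy with respect to the output layer only.
        g = (p - y_real) / len(y_real)
        w_out -= lr * (H.T @ g)
        b_out -= lr * g.sum()
    return w_out, b_out, losses

# Hypothetical stand-ins for the pre-trained parameters and the real data.
rng = np.random.default_rng(0)
W_feat, b_feat = rng.normal(size=(3, 8)), np.zeros(8)
w_out, b_out = rng.normal(size=8), 0.0
X_real = np.vstack([rng.normal(-1, 0.5, (300, 3)), rng.normal(1, 0.5, (20, 3))])
y_real = np.concatenate([np.zeros(300), np.ones(20)])

W_before = W_feat.copy()
w_new, b_new, losses = fine_tune_output_layer(W_feat, b_feat, w_out, b_out,
                                              X_real, y_real)
```

In Keras, which the embodiment lists as its model environment, the same effect is usually obtained by setting `trainable = False` on the feature layers before compiling and fitting the model on real data.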

It should be noted that although the spirit and principles of the invention have been described with reference to several specific embodiments, the invention is not limited to the disclosed embodiments. The division into aspects is made only for convenience of description and does not imply that features in those aspects cannot be combined. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A classification framework for extremely imbalanced data based on a generative adversarial network, characterized in that it comprises at least:

a generative model for generating synthetic positive and negative samples;

2. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 1, further comprising a preprocessing module for converting raw data into numerical data that can be used for computation.

3. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 2, wherein the preprocessing module preprocesses the raw data by means including data cleaning, data integration, and data transformation.

4. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, characterized in that during data cleaning, the data is cleaned by filling in missing values, smoothing noisy data, and identifying or resolving inconsistencies.

5. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 4, characterized in that the data cleaning achieves the following goals: standardizing data formats, removing abnormal data, correcting errors, and removing duplicate data.

6. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, wherein the data integration combines data from multiple data sources and stores it uniformly to build a data warehouse.

7. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, characterized in that the data transformation is used to convert the data into the form required by the learning model.