CN111414973A - Classification framework for extremely imbalanced data based on a generative adversarial network - Google Patents

Classification framework for extremely imbalanced data based on a generative adversarial network

Info

Publication number
CN111414973A
Authority
CN
China
Prior art keywords
data
model
classification framework
generation
framework based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010235521.0A
Other languages
Chinese (zh)
Inventor
王成
胡腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-03-30
Filing date
2020-03-30
Publication date
2020-07-14
Application filed by Tongji University
Priority to CN202010235521.0A
Publication of CN111414973A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a classification framework for extremely imbalanced data based on a generative adversarial network (GAN). Inspired by GANs and incorporating transfer learning, the framework synthesizes data with a GAN to address data imbalance, pre-trains a classification model on the synthesized data, and then fine-tunes the model on real data through transfer learning, thereby solving the classification problem for imbalanced data. In addition, the method introduces no new variables, which avoids the search for suitable parameters that weighting methods require.

Description

Classification framework for extremely imbalanced data based on a generative adversarial network
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a classification framework for extremely imbalanced data based on a generative adversarial network.
Background
Data imbalance refers to the situation in which the amount of labeled data for a certain class in the training data is too small, or the ratio of positive to negative samples is too skewed, so that conventional machine learning methods cannot build a usable classification model. Researchers have proposed many methods to address data imbalance, which fall mainly into three categories: (1) oversampling, which replicates minority-class samples and adds the copies to the training set to compensate for the shortage of labels in the minority class; (2) undersampling, which reduces the amount of majority-class data in the training set to correct the imbalance between positive and negative samples; and (3) weighting, which redefines the loss function by weighting its terms to counter the gradient shift that arises during training. Each of these methods has drawbacks. Oversampling adds replicated minority-class samples to the training set, which easily causes the model to overfit and degrades its performance. Undersampling discards majority-class data and therefore loses information. Weighting introduces a new weight variable whose choice affects the quality of the final model, and finding a suitable value is difficult.
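The three conventional approaches described above can be illustrated with a minimal numpy sketch. The toy data set, the 95:5 class ratio, and the inverse-frequency weighting rule are assumptions chosen for illustration; they do not come from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data set (hypothetical): 95 negative and 5 positive samples.
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

pos_idx = np.flatnonzero(y == 1)  # minority class
neg_idx = np.flatnonzero(y == 0)  # majority class

# (1) Oversampling: replicate minority samples until both classes have equal size.
extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
over_idx = np.concatenate([neg_idx, pos_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# (2) Undersampling: keep only as many majority samples as there are minority samples.
kept = rng.choice(neg_idx, size=len(pos_idx), replace=False)
under_idx = np.concatenate([kept, pos_idx])
X_under, y_under = X[under_idx], y[under_idx]

# (3) Weighting: scale each sample's loss inversely to its class frequency.
# The weighting rule itself is the extra variable that has to be chosen.
class_weight = len(y) / (2 * np.bincount(y))
sample_weight = class_weight[y]
```

In this sketch the oversampled set contains many exact copies of the same five positives (the overfitting risk), the undersampled set throws away 90 of the 95 negatives (the information loss), and the weighting rule is only one of many possible choices (the parameter search).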
Disclosure of Invention
The invention provides a classification framework for extremely imbalanced data based on a generative adversarial network. It overcomes the drawbacks of traditional methods for handling data imbalance: it effectively prevents overfitting of the model, loses no information from the training data, introduces no new variables, and improves the performance of the model.
The purpose of the invention is achieved as follows: a classification framework for extremely imbalanced data based on a generative adversarial network, comprising at least:
a generative model for generating synthetic positive and negative samples;
two discrimination models, the first discrimination model being used to discriminate whether data comes from the real data or was generated by the generative model, and the second discrimination model being used to discriminate whether data is a positive sample or a negative sample;
a pre-training module for pre-training a learning model using the positive and negative samples generated by the generative model, so that the learning model retains the knowledge learned from the synthetic data;
a transfer learning module for retraining the output layer of the learning model using real data while retaining that knowledge, so as to fine-tune the learning model;
wherein the generative model and the discrimination models form a generative adversarial network, and when the generative model finally converges, it yields the synthetic positive and negative samples.
Further, the framework also comprises a preprocessing module for converting raw data into numerical data that can be used for computation.
Further, the preprocessing module preprocesses the raw data by data cleaning, data integration, and data transformation.
Further, during data cleaning, the data is cleaned by filling in missing values, smoothing noisy data, and identifying or resolving inconsistencies.
Further, data cleaning achieves the following goals: standardizing data formats, removing abnormal data, correcting errors, and removing duplicate data.
Further, data integration combines data from multiple data sources and stores it uniformly to build a data warehouse.
Further, data transformation converts the data into the form required by the learning model.
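The cleaning and integration steps listed above might look as follows in pandas. This is only a sketch under stated assumptions: the two data sources, the column names (`user_id`, `amount`, `channel`), the median fill, and the clipping bound of 500 are hypothetical and are not specified in the patent.

```python
import pandas as pd

# Two hypothetical data sources that share a schema.
source_a = pd.DataFrame({
    "user_id": [1, 2, 2],
    "amount": [10.0, None, None],
    "channel": ["web", "app", "app"],
})
source_b = pd.DataFrame({
    "user_id": [3, 4],
    "amount": [25.0, 1000.0],
    "channel": ["WEB", "app"],
})

# Data integration: combine the sources into one uniformly stored table.
df = pd.concat([source_a, source_b], ignore_index=True)

# Data cleaning.
df["channel"] = df["channel"].str.lower()                  # standardize the format
df = df.drop_duplicates().reset_index(drop=True)           # remove repeated records
df["amount"] = df["amount"].fillna(df["amount"].median())  # fill in missing values
df["amount"] = df["amount"].clip(upper=500.0)              # damp abnormal values (assumed bound)
```

The median fill and the fixed clipping bound are simple stand-ins; a real pipeline would choose the fill and outlier rules per field.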
The beneficial effects of the invention are as follows:
by addressing data imbalance with the method in this framework, the overfitting that conventional methods cause by copying samples and introducing new data into the real data is avoided; at the same time, no information in the data is lost and no new parameters are introduced, so the model is relatively easy to optimize and has better reproducibility and extensibility.
Drawings
FIG. 1 is a conventional generative adversarial network model;
FIG. 2 is a diagram of the new generative adversarial network model proposed by the present invention to address data imbalance;
FIG. 3 is a diagram of a model for model pre-training in the present invention;
FIG. 4 is a model diagram of transfer learning in the present invention.
Detailed Description
The invention is described in more detail below with reference to FIGS. 1-4.
This embodiment provides a general framework for addressing data imbalance in machine learning. Inspired by generative adversarial networks, it proposes a generative adversarial network that differs from the conventional architecture: it comprises one generative model and two discrimination models. The generative model generates synthetic positive and negative samples; the first discrimination model discriminates whether data comes from the real data or was generated by the generative model; and the second discrimination model discriminates whether data is a positive sample or a negative sample. When the model finally converges, the generative model yields synthetic positive and negative samples. Pre-training the model on the samples produced by the generative model addresses the data imbalance. A transfer learning method is also introduced: the pre-trained model retains the knowledge learned from the synthetic data, transfer learning keeps the parameters of the feature extraction layers of the neural network unchanged, and the final model is fine-tuned with real data so that it also learns the knowledge in the real data. By addressing data imbalance with the method in this framework, the overfitting that conventional methods cause by copying samples and introducing new data into the real data is avoided; at the same time, no information in the data is lost and no new parameters are introduced, so the model is relatively easy to optimize and has better reproducibility and extensibility.
The main modules in the method are implemented as follows:
Preprocessing: the original data is converted into numerical data that the model can use for computation, and missing values are filled in. The original fields of the data are shown in Table 1:
TABLE 1 Original fields and processed fields
[Table 1 appears only as images in the original publication (BDA0002430831280000041, BDA0002430831280000051); its contents are not reproduced here.]
It can be seen from Table 1 that most of the available original fields are of string type, whereas the generative adversarial network itself can only process numerical variables. Preprocessing therefore includes not only the data cleaning and data integration described above, but also, in the data transformation step, the conversion of character data into numerical data that the model can process.
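As a rough illustration of this transformation step, the sketch below converts string fields into numerical columns with pandas. Because Table 1 is not available in the text, the field names (`device_type`, `city`, `amount`) and the encoding choices (integer codes, mean fill) are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical raw records in which every field arrives as a string.
raw = pd.DataFrame({
    "device_type": ["ios", "android", None, "ios"],
    "city": ["Shanghai", "Beijing", "Shanghai", "Hangzhou"],
    "amount": ["12.5", "7", "n/a", "3.2"],
})

processed = pd.DataFrame(index=raw.index)

# Numbers stored as text: parse them, turn unparseable entries into NaN,
# then fill the gaps with the column mean.
amount = pd.to_numeric(raw["amount"], errors="coerce")
processed["amount"] = amount.fillna(amount.mean())

# Categorical strings: map each distinct value to an integer code,
# treating missing values as their own "unknown" category.
for col in ["device_type", "city"]:
    codes, categories = pd.factorize(raw[col].fillna("unknown"))
    processed[col] = codes
```

Integer codes impose an arbitrary ordering on the categories; one-hot encoding (for example with `pd.get_dummies`) is a common alternative when that ordering would mislead the model.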
A conventional GAN cannot generate discrete data, because the generative model in a GAN is trained by back-propagating the loss of the discrimination model D. We therefore use a feedback neural network as an autoencoder, trained by unsupervised learning, which comprises an encoder Enc and a decoder Dec. Both the encoder and the decoder are multi-layer neural networks.
The process is as follows:
Algorithm environment:
Python, numpy
Input:
1. Sample attributes X ∈ R^n
2. Labels y
3. Number of iterations n
Output:
1. Generative model G_Dec
(1) While the number of iterations is less than n:
(2) Select m samples (x, y) ~ P_data(x, y) from the data
(3) Update the autoencoder by gradient descent:
[Equation image BDA0002430831280000061 in the original publication; not reproduced here]
where x' = Dec(Enc(x)). Once the number of iterations reaches n, the algorithm is considered to have converged and the training of the autoencoder is complete.
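The loop above can be sketched in numpy as follows. Since the update formula survives only as an image, this sketch assumes a mean squared reconstruction error between x and x' = Dec(Enc(x)); the single-layer encoder and decoder, the sigmoid activations, and all hyperparameters (`hidden_dim`, `n_iters`, `m`, `lr`) are likewise illustrative assumptions rather than the patent's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, hidden_dim=2, n_iters=2000, m=16, lr=0.5):
    """Train Enc and Dec by mini-batch gradient descent on an assumed squared error."""
    d = X.shape[1]
    W_enc = rng.normal(scale=0.5, size=(d, hidden_dim))
    b_enc = np.zeros(hidden_dim)
    W_dec = rng.normal(scale=0.5, size=(hidden_dim, d))
    b_dec = np.zeros(d)
    for _ in range(n_iters):                    # (1) while iterations < n
        x = X[rng.choice(len(X), size=m)]       # (2) draw m samples from the data
        h = sigmoid(x @ W_enc + b_enc)          # Enc(x)
        x_rec = sigmoid(h @ W_dec + b_dec)      # x' = Dec(Enc(x))
        # (3) gradient of the batch-averaged squared error, back-propagated
        g_out = 2.0 * (x_rec - x) * x_rec * (1.0 - x_rec) / m
        g_hid = (g_out @ W_dec.T) * h * (1.0 - h)
        W_dec -= lr * (h.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (x.T @ g_hid)
        b_enc -= lr * g_hid.sum(axis=0)

    def enc(x):
        return sigmoid(x @ W_enc + b_enc)

    def dec(h):
        return sigmoid(h @ W_dec + b_dec)

    return enc, dec

# Usage on toy discrete (one-hot) data: 32 rows over 4 categories.
X = np.tile(np.eye(4), (8, 1))
enc, dec = train_autoencoder(X)
x_rec = dec(enc(X))
```

The returned `dec` plays the role of the decoder that maps a continuous code back to the (near-)discrete data space, which is the property the text relies on when it names the output G_Dec.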
The pre-training module proceeds as follows:
Model environment:
Python, Keras, Pandas
Input:
1. Real data P_data(x, y)
2. Noise data P_z(x, y)
3. Number of iterations n of the model
Output:
All parameters contained in the model
(1) While the number of iterations is less than n:
(2) Draw m samples each from the real data and the noise data:
[Equation images BDA0002430831280000071 and BDA0002430831280000072 in the original publication; not reproduced here]
(3) Update the discrimination model D_1 by gradient ascent:
[Equation image BDA0002430831280000073 in the original publication; not reproduced here]
(4) Update the generative model G_Dec by gradient descent:
[Equation image BDA0002430831280000074 in the original publication; not reproduced here]
(5) Draw the following m samples each from the real data:
[Equation images BDA0002430831280000075 and BDA0002430831280000076 in the original publication; not reproduced here]
(6) Update the discrimination model D_2 by gradient ascent:
[Equation image BDA0002430831280000077 in the original publication; not reproduced here]
(7) Update the generative model G_Dec by gradient descent:
[Equation image BDA0002430831280000078 in the original publication; not reproduced here]
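The update formulas in steps (3) to (7) are available only as images, so they cannot be quoted here. As a point of orientation, the sketch below computes the objectives of the conventional GAN formulation in numpy: the value that a real-versus-generated discriminator ascends, the loss that the generator descends against it, and a log-likelihood for a positive-versus-negative discriminator. The function names are hypothetical, and the patent's exact objectives for D_1, D_2, and G_Dec may differ from these conventional forms.

```python
import numpy as np

EPS = 1e-12  # keeps the logarithms finite when a probability is exactly 0 or 1

def d1_objective(d1_real, d1_fake):
    """Conventional discriminator value: mean log D1(x) + mean log(1 - D1(G(z)))."""
    return np.mean(np.log(d1_real + EPS)) + np.mean(np.log(1.0 - d1_fake + EPS))

def g_loss_vs_d1(d1_fake):
    """Conventional generator loss against D1: mean log(1 - D1(G(z)))."""
    return np.mean(np.log(1.0 - d1_fake + EPS))

def d2_objective(d2_prob_pos, labels):
    """Log-likelihood for a discriminator that separates positive from negative samples."""
    return np.mean(
        labels * np.log(d2_prob_pos + EPS)
        + (1.0 - labels) * np.log(1.0 - d2_prob_pos + EPS)
    )

# d1_real / d1_fake are D1's predicted probabilities of "real" on real and generated batches.
confident = d1_objective(np.full(4, 0.9), np.full(4, 0.1))
undecided = d1_objective(np.full(4, 0.5), np.full(4, 0.5))
```

Gradient ascent on `d1_objective` pushes D_1 toward the `confident` case, while gradient descent on `g_loss_vs_d1` pushes the generator toward samples that D_1 scores as real. How the generator's loss is defined with respect to D_2 in step (7) is not recoverable from the text and is therefore left out of this sketch.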
Transfer learning module: the GAN in this embodiment is built from deep learning models. After pre-training is complete, the feature parameters in the model are kept unchanged and the output layer of the model is retrained with real data; the detailed process is shown in the accompanying drawings. Pre-training trains the model on the generated data to address data imbalance; transfer learning retains the knowledge so acquired and retrains the model on real data, so that the model also captures the knowledge in the real data.
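The idea of keeping the feature parameters fixed while retraining only the output layer can be sketched in numpy as follows. Everything concrete here is an assumption for illustration: the random matrix `W_feat` merely stands in for a pre-trained feature-extraction layer, and the toy "real" data, the logistic output layer, the learning rate, and the iteration count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce(p, y):
    """Binary cross-entropy between predicted probabilities p and labels y."""
    return -np.mean(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))

# Stand-in for the feature-extraction layer learned during pre-training (frozen below).
W_feat = rng.normal(size=(4, 8))
W_feat_before = W_feat.copy()

# Output layer that is retrained on real data.
w_out = np.zeros(8)
b_out = 0.0

# Hypothetical real data with binary labels.
X_real = rng.normal(size=(200, 4))
y_real = (X_real[:, 0] + X_real[:, 1] > 0).astype(float)

# The frozen layer is only evaluated, never updated.
features = np.tanh(X_real @ W_feat)
loss_before = bce(sigmoid(features @ w_out + b_out), y_real)

for _ in range(500):
    p = sigmoid(features @ w_out + b_out)
    grad = p - y_real                               # d(BCE)/d(logit) per sample
    w_out -= 0.1 * (features.T @ grad) / len(y_real)
    b_out -= 0.1 * grad.mean()

loss_after = bce(sigmoid(features @ w_out + b_out), y_real)
```

In Keras, which the text names as the model environment, the same effect is usually obtained by setting `trainable = False` on the feature layers before compiling and fitting on the real data.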
It should be noted that, while the spirit and principles of the invention have been described with reference to several specific embodiments, the invention is not limited to the disclosed embodiments, and the division into aspects is for convenience of description only and does not imply that the features in these aspects cannot be combined. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (7)

1. A classification framework for extremely imbalanced data based on a generative adversarial network, characterized in that it comprises at least:
a generative model for generating synthetic positive and negative samples;
two discrimination models, the first discrimination model being used to discriminate whether data comes from the real data or was generated by the generative model, and the second discrimination model being used to discriminate whether data is a positive sample or a negative sample;
a pre-training module for pre-training a learning model using the positive and negative samples generated by the generative model, so that the learning model retains the knowledge learned from the synthetic data;
a transfer learning module for retraining the output layer of the learning model using real data while retaining that knowledge, so as to fine-tune the learning model;
wherein the generative model and the discrimination models form a generative adversarial network, and when the generative model finally converges, it yields the synthetic positive and negative samples.
2. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 1, further comprising a preprocessing module for converting raw data into numerical data that can be used for computation.
3. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 2, wherein the preprocessing module preprocesses the raw data by data cleaning, data integration, and data transformation.
4. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, characterized in that, during data cleaning, the data is cleaned by filling in missing values, smoothing noisy data, and identifying or resolving inconsistencies.
5. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 4, characterized in that data cleaning achieves the following goals: standardizing data formats, removing abnormal data, correcting errors, and removing duplicate data.
6. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, wherein data integration combines data from multiple data sources and stores it uniformly to build a data warehouse.
7. The classification framework for extremely imbalanced data based on a generative adversarial network according to claim 3, characterized in that data transformation converts the data into the form required by the learning model.
CN202010235521.0A 2020-03-30 2020-03-30 Classification framework based on generating extremely unbalanced data for a countermeasure network Pending CN111414973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010235521.0A CN111414973A (en) 2020-03-30 2020-03-30 Classification framework based on generating extremely unbalanced data for a countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010235521.0A CN111414973A (en) 2020-03-30 2020-03-30 Classification framework based on generating extremely unbalanced data for a countermeasure network

Publications (1)

Publication Number Publication Date
CN111414973A true CN111414973A (en) 2020-07-14

Family

ID=71491606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010235521.0A Pending CN111414973A (en) 2020-03-30 2020-03-30 Classification framework based on generating extremely unbalanced data for a countermeasure network

Country Status (1)

Country Link
CN (1) CN111414973A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215268A (en) * 2020-09-27 2021-01-12 浙江工业大学 Method and device for classifying disaster weather satellite cloud pictures
CN112529114A (en) * 2021-01-13 2021-03-19 北京云真信科技有限公司 Target information identification method based on GAN, electronic device and medium
CN113159947A (en) * 2021-03-17 2021-07-23 同济大学 Difficult anomaly sample detection framework based on generation of countermeasure network

Similar Documents

Publication Publication Date Title
CN109580215B (en) Wind power transmission system fault diagnosis method based on deep generation countermeasure network
CN111414973A (en) Classification framework based on generating extremely unbalanced data for a countermeasure network
CN108399428B (en) Triple loss function design method based on trace ratio criterion
CN106934042B (en) Knowledge graph representation system and implementation method thereof
CN105975573B (en) A kind of file classification method based on KNN
Setnes et al. GA-fuzzy modeling and classification: complexity and performance
CN110212528B (en) Power distribution network measurement data missing reconstruction method
CN112766386B (en) Generalized zero sample learning method based on multi-input multi-output fusion network
CN114332568B (en) Training method, system, equipment and storage medium of domain adaptive image classification network
CN111753207B (en) Collaborative filtering method for neural map based on comments
CN113688869B (en) Photovoltaic data missing reconstruction method based on generation countermeasure network
CN111314353A (en) Network intrusion detection method and system based on hybrid sampling
CN112115967B (en) Image increment learning method based on data protection
Khoshgoftaar et al. Enhancing software quality estimation using ensemble-classifier based noise filtering
CN112307130B (en) Document-level remote supervision relation extraction method and system
CN110956277A (en) Interactive iterative modeling system and method
CN114006870A (en) Network flow identification method based on self-supervision convolution subspace clustering network
CN112926627A (en) Equipment defect time prediction method based on capacitive equipment defect data
CN116166650A (en) Multisource heterogeneous data cleaning method based on generation countermeasure network
CN112199637B (en) Regression modeling method for generating contrast network data enhancement based on regression attention
CN115906959A (en) Parameter training method of neural network model based on DE-BP algorithm
CN109492746A (en) Deepness belief network parameter optimization method based on GA-PSO Hybrid Algorithm
CN115116616A (en) Intra-group optimization based multiple interpolation breast cancer deletion data interpolation model
CN111402205B (en) Mammary tumor data cleaning method based on multilayer perceptron
CN113158555A (en) Heavy gas turbine control system analog input module BIT design method based on expert system and random forest classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200714)