CN114757335A - Missing-data filling generation method based on a dual-condition generative adversarial network - Google Patents

Missing-data filling generation method based on a dual-condition generative adversarial network

Info

Publication number: CN114757335A
Authority: CN (China)
Prior art keywords: data, sample, condition, model, generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210347936.6A
Other languages
Chinese (zh)
Inventor
钱鹰
戴思聪
刘歆
万邦睿
黄江平
王毅峰
韦庆杰
王奕琀
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis): Chongqing University of Posts and Telecommunications
Original Assignee: Chongqing University of Posts and Telecommunications
Application filed by Chongqing University of Posts and Telecommunications
Priority application: CN202210347936.6A
Publication: CN114757335A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a missing-data filling generation method based on a dual-condition generative adversarial network, belonging to the field of data perception and reconstruction in computing, and comprising the following steps: S1: encode the sample data, and design the sample class as a data generation condition together with a representation of the sample data that already exist in the generation process; S2: construct the structure of the dual-condition generative adversarial network, comprising a generative model and a discriminative model; S3: express the objective optimization function of the dual-condition generative adversarial network structure; S4: establish a training dataset for the data generation model and train the dual-condition generative adversarial network; S5: analyze the different data-missing situations and use the trained dual-condition generative adversarial network to generate and fill the missing data. The invention provides a method for constructing a high-quality training dataset for tabular data, supporting machine learning model training in big-data application scenarios.

Description

Missing-data filling generation method based on a dual-condition generative adversarial network
Technical Field
The invention relates to a missing-data filling generation method based on a dual-condition generative adversarial network and belongs to the field of data perception and reconstruction in computing.
Background
In recent years, generative adversarial network (GAN) models have become increasingly important and popular in machine learning owing to their applicability in different fields. Their ability to represent complex, high-dimensional data makes them useful across domains for processing images, video, tabular data, and the like. In fields such as digital finance, the generation of tabular data is a core issue of interest to researchers: given a set of random noise, a generative adversarial network model can generate the corresponding tabular data.
In real application scenarios, tabular data under a given condition often need to be produced, and the tabular data usually contain missing values. Given the specificity of tabular data, the design of the encoding process and its conditions presents certain challenges. It is therefore necessary to design a generative adversarial network model that can effectively encode tabular data and solve the missing-data problem.
Conditional Generative Adversarial Networks (CGANs) are the most representative conditional generative network models. CTGAN, an improvement on CGAN, encodes the tabular data, introduces a condition vector, concatenates the random noise with the condition vector, and feeds the result into the generative model to finally obtain tabular data under the specified condition.
However, the above methods have the following problems: (1) sample data in real scenarios often suffer from missing values, and minority-class samples also suffer from data imbalance; (2) these designs take only the class condition as input, so the quality of a training dataset constructed from the generated samples cannot be well guaranteed, and the demand of machine learning model training for high-quality training datasets in big-data application scenarios cannot be met.
Disclosure of Invention
In view of this, the present invention provides a missing-data filling generation method based on a dual-condition generative adversarial network (Double Conditional Generative Adversarial Networks), which generates tabular data to construct a high-quality training dataset for supporting machine learning model training in big-data application scenarios.
To achieve the above object, the invention provides the following technical solution:
A missing-data filling generation method based on a dual-condition generative adversarial network comprises the following steps:
S1: encode the sample data in the tabular dataset, and design the sample class as a data generation condition together with a representation of the sample data that already exist in the generation process;
S2: construct the structure of the dual-condition generative adversarial network, comprising a generative model and a discriminative model;
S3: express the objective optimization function of the dual-condition generative adversarial network structure;
S4: establish a training dataset for the data generation model and train the dual-condition generative adversarial network;
S5: analyze the different data-missing situations and use the trained dual-condition generative adversarial network to generate and fill the missing data, so as to construct a high-quality tabular training dataset for supporting machine learning model training in big-data application scenarios.
Further, in step S1, the real sample data are encoded as follows:
S11: To eliminate bias caused by the different dimensional units of different data types, data standardization and data encoding are required. Categorical data are encoded as one-hot vectors; numerical or mixed-type data are encoded as follows:
Suppose a real sample X_i in the tabular dataset has n numerical and mixed-type variables and e categorical variables, N = n + e variables in total. The sample is encoded by concatenating the scalars α_{i,j}, the one-hot vectors β_{i,j}, and the one-hot vectors d_{i,e} of the categorical data:

X_i = α_{i,1} ⊕ β_{i,1} ⊕ ... ⊕ α_{i,n} ⊕ β_{i,n} ⊕ d_{i,1} ⊕ ... ⊕ d_{i,e}

where ⊕ denotes vector concatenation, and the dimension of the encoded X_i is u. α_{i,j} is the VGM-encoded value: one VGM mode is sampled from a given probability density function, and that mode is used to standardize and normalize the j-th column C_{i,j} of the i-th sample. If the sampled mode is ρ_q, then C_{i,j} under the q-th mode is expressed as a scalar α_{i,j} and a one-hot vector β_{i,j} = [h_1, ..., h_q, ..., h_k], where h_1, ..., h_k are the 1st to k-th elements of the one-hot vector, h_q = 1, and all other elements are 0.
The standardization and normalization of the scalar α_{i,j} proceed as follows:
VGM encoding of the i-th sample yields k modes ρ_1, ρ_2, ..., ρ_k; the learned Gaussian mixture model is

P(C_{i,j}) = Σ_{q=1}^{k} ω_q · N(C_{i,j}; η_q, σ_q)

where N(·; η_q, σ_q) is the probability density function of a Gaussian mode, and ω_q, η_q, and σ_q are the weight, mean, and standard deviation of one mode, q = 1, ..., k.
For the j-th column C_{i,j} of the i-th sample, the probability of its value under each mode is computed; the probability densities are

ρ_q = ω_q · N(C_{i,j}; η_q, σ_q),  q = 1, ..., k

One mode is sampled from this probability density function and used to standardize C_{i,j}; if the sampled mode is ρ_q, then C_{i,j} under the q-th mode is expressed as the scalar

α_{i,j} = (C_{i,j} - η_q) / (4σ_q)
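The mode-specific normalization above can be sketched with scikit-learn's variational Gaussian mixture. A minimal illustration, assuming k = 4 modes and synthetic column values; all names here are chosen for the example, not taken from the patent:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Minimal sketch of VGM (mode-specific) encoding for one numerical column.
k = 4
values = np.random.lognormal(mean=10.0, sigma=1.0, size=(3000, 1))  # e.g. loan amounts

vgm = BayesianGaussianMixture(n_components=k, weight_concentration_prior=0.001,
                              max_iter=200, random_state=0)
vgm.fit(values)

means = vgm.means_.reshape(-1)                 # eta_q
stds = np.sqrt(vgm.covariances_).reshape(-1)   # sigma_q

def encode_value(c):
    # rho_q: responsibility of each mode for c, then sample one mode from it
    probs = vgm.predict_proba(np.array([[c]])).reshape(-1)
    q = np.random.choice(k, p=probs / probs.sum())
    alpha = (c - means[q]) / (4 * stds[q])     # scalar alpha_{i,j}
    beta = np.eye(k)[q]                        # one-hot beta_{i,j}
    return np.concatenate(([alpha], beta))

print(encode_value(values[0, 0]))              # (1 + k)-dimensional code
```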
S12: Condition vector and mask vector
The sample class label is taken as one of the modeling conditions. The condition vector cond is a bit vector formed from the one-hot code generated by the sample class label: the selected label value label = 1, so the condition vector is written cond = [0, 0, ..., label, ..., 0, 0].
The mask vector is M = [M_1, ..., M_d, ..., M_{N*}], d = 1...N*. When M_d = 0, the data at that position are missing; when M_d = 1, the data are complete. The number of elements of M equal to 1 is ||M||_1, and the number equal to 0 is ||1 - M||_1. When all elements of M are 0, only the sample class is used as a condition.
Given the encoded vector X of a real sample, missing-sample vectors X_miss for different situations are simulated as

X_miss = M ⊙ X

where ⊙ denotes element-wise multiplication between vectors.
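A small sketch of these vectors; the class count, dimension, and missing positions below are arbitrary choices for illustration:

```python
import numpy as np

# Illustrative construction of cond, M, and X_miss; all sizes are assumptions.
n_classes = 3                          # e.g. risk levels
label = 1                              # selected class index
cond = np.eye(n_classes)[label]        # cond = [0, 1, 0]

X = np.array([0.73, 0, 0, 1, 0, 1, 0]) # an encoded real sample
M = np.array([1, 1, 1, 1, 1, 0, 0])    # last two positions missing
X_miss = M * X                         # element-wise: [0.73, 0, 0, 1, 0, 0, 0]
print(cond, X_miss)
```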
Further, the generative model in step S2 consists of two residual networks (ResNet blocks) and one fully connected layer. Its input comprises the input data and the condition vector, with noise filling the missing positions of the input data. Its workflow and structure are as follows:
S21: Encode a noise sample Z and fill the missing sample data with the encoded result, obtaining Z'; this process is expressed as Z' = M ⊙ X_miss + (1 - M) ⊙ Z.
S22: Let H_0 denote cond ⊕ Z' and take it as the initial input, i.e. H_0 = {cond ⊕ Z'_1, ..., cond ⊕ Z'_m}, where each cond ⊕ Z'_i has dimension |cond| + |Z'_i|, i = 1...m.
S23: H_0 first passes through the first residual network, which expands its dimension from |cond| + |Z'_i| to |cond| + |Z'_i| + 256; the output is denoted H_1.
S24: H_1 then passes through the second residual network, which expands the dimension from |cond| + |Z'_i| + 256 to |cond| + |Z'_i| + 512; the output is denoted H_2.
S25: H_2 is fed into the final fully connected layer. First, a tanh activation converts the input vector into the scalar α_{i,j}; then two Gumbel-Softmax activations produce the encoding vector β_{i,j} of the continuous data and the encoding vector d_{i,e} of the discrete data, where the Gumbel-Softmax activation converts its input into a one-hot vector.
S26: α_{i,j}, β_{i,j}, and d_{i,e} are concatenated to obtain the final output of the generative model, denoted \hat{X}_i, i.e.

\hat{X}_i = α_{i,1} ⊕ β_{i,1} ⊕ ... ⊕ α_{i,n} ⊕ β_{i,n} ⊕ d_{i,1} ⊕ ... ⊕ d_{i,e}

S27: The generated data \hat{X} and the missing data X_miss are combined through element-wise multiplication with the mask vector M to obtain the input data of the discriminative model, denoted X_imp, i.e.

X_imp = M ⊙ X_miss + (1 - M) ⊙ \hat{X}

with |X_imp| = u.
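A compact PyTorch sketch of this generator, under stated assumptions: the layer widths follow the text, while the `spans` interface (listing which output blocks get tanh and which get Gumbel-Softmax) is an illustrative design choice, not the patent's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    """Residual block that widens its input by `out` channels."""
    def __init__(self, dim_in, out):
        super().__init__()
        self.fc = nn.Linear(dim_in, out)
        self.bn = nn.BatchNorm1d(out)
    def forward(self, x):
        h = F.relu(self.bn(self.fc(x)))
        return torch.cat([x, h], dim=1)           # dimension grows by `out`

class Generator(nn.Module):
    # spans: list of (kind, width) output blocks; 'alpha' scalars get tanh,
    # 'beta' and 'd' one-hot blocks get Gumbel-Softmax.
    def __init__(self, cond_dim, data_dim, spans):
        super().__init__()
        d0 = cond_dim + data_dim
        self.res1 = Residual(d0, 256)              # S23: +256
        self.res2 = Residual(d0 + 256, 256)        # S24: +512 total
        self.fc = nn.Linear(d0 + 512, data_dim)    # final fully connected layer
        self.spans = spans
    def forward(self, z_prime, cond):
        h = torch.cat([cond, z_prime], dim=1)      # H0 = cond ⊕ Z'
        h = self.res2(self.res1(h))                # H2
        raw = self.fc(h)
        outs, i = [], 0
        for kind, w in self.spans:
            chunk = raw[:, i:i + w]
            outs.append(torch.tanh(chunk) if kind == 'alpha'
                        else F.gumbel_softmax(chunk, tau=0.2, hard=True))
            i += w
        return torch.cat(outs, dim=1)              # X_hat
```

Z' is built beforehand as Z' = M ⊙ X_miss + (1 - M) ⊙ Z and passed in together with cond.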
Further, the discriminative model in step S2 consists of three fully connected layers; its workflow and structure are as follows:
The generative model not only fills in the missing part but also generates new data for the non-missing part. The discriminative model, however, only needs to judge whether the data filled into the missing part meet the given condition and fit the distribution of the real samples, so the output of the generative model must be processed (step S27) to match the discriminative model's input.
S28: The encoded real sample X and X_imp are each concatenated with the condition vector cond; the results are denoted K_0 and \hat{K}_0, i.e. K_0 = X ⊕ cond and \hat{K}_0 = X_imp ⊕ cond, with dimensions |X| + |cond| and |X_imp| + |cond| respectively.
S29: K_0 and \hat{K}_0 are each fed into the first fully connected layer, which contains a Leaky ReLU activation to prevent vanishing gradients during back-propagation; a dropout regularization operation is applied to this layer to keep the trained discriminative model from overfitting. After the first layer, two 256-dimensional vectors are obtained, denoted K_1 and \hat{K}_1.
S210: K_1 and \hat{K}_1 serve as input to the second fully connected layer of dimension 256, which computes the output vectors K_2 and \hat{K}_2; finally, K_2 and \hat{K}_2 pass through the third fully connected layer of dimension 256, which outputs the respective discrimination results.
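A matching PyTorch sketch of the three-layer discriminator; the 256-unit widths, Leaky ReLU, and dropout follow the text, while the scalar output head and dropout rate are assumptions:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Three fully connected layers with Leaky ReLU and dropout; scalar score out."""
    def __init__(self, data_dim, cond_dim, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + cond_dim, 256), nn.LeakyReLU(0.2), nn.Dropout(p_drop),
            nn.Linear(256, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),                     # discrimination result
        )
    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=1))   # K0 = x ⊕ cond
```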
Further, the generative model is used to generate samples. Its loss has two parts: the generated samples are fed into the discriminative model so that they can deceive it, and the mean square error between the generated element values and the real element values under the action of M is computed. For the discriminative model, it first identifies whether a generated sample meets the given sample class condition, and second computes whether the element values generated for the missing part under the action of (1 - M) fit the distribution of the real samples.
The objective optimization function of the generative model is designed as follows:
S31: The generated samples are fed into the discriminative model, which identifies them; the loss is expressed as

L_G^{adv} = -(1/m) Σ_{i=1}^{m} D(X_i^{imp} | cond)

This function feeds the i-th sample X_i^{imp} of the sample set into the discriminative model for discrimination; the larger D(X_i^{imp} | cond), the better X_i^{imp} satisfies the given condition, i = 1...m, where X_i^{miss} denotes the i-th sample in the missing sample set.
S32: Compute the mean square error of each element of each sample between M ⊙ X_miss and M ⊙ \hat{X}; the loss function L_MSE is expressed as:

L_MSE = (1/m) Σ_{i=1}^{m} Σ_{j=1}^{u} ( M_{i,j} X_{i,j}^{miss} - M_{i,j} \hat{X}_{i,j} )²

The smaller the result, the closer the generated data are to the real data; j denotes the j-th element of each vector in the sample set, j = 1...u, α denotes a hyper-parameter, X_{i,j}^{miss} denotes the j-th element of the i-th sample in the missing sample set, and \hat{X}_{i,j} denotes the j-th element of the i-th sample in the generated sample set.
S33: Compute the total loss function of the generative model, namely:

L_G = L_G^{adv} + α · L_MSE

The objective optimization function of the discriminative model is designed as follows:
S34: For each sample, judge whether X_imp matches the real sample X under the given condition cond by computing their cross-entropy loss, namely:

L_CE = CrossEntropy( D(X^{imp} | cond), D(X | cond) )

where D(· | cond) represents the probability that a sample satisfies the given condition.
S35: Use the Wasserstein distance from the Wasserstein GAN (Wasserstein Generative Adversarial Networks) to compute the similarity of the missing part between (1 - M) ⊙ X_imp and (1 - M) ⊙ X. W(P_g, P_r) denotes the Wasserstein distance, where the X_imp set and the X set obey the probability distributions P_g and P_r respectively, and γ and ε are hyper-parameters of L_W. The loss L_W is expressed as:

L_W = E_{X_imp ~ P_g}[ D(X_imp | cond) ] - E_{X ~ P_r}[ D(X | cond) ] + γ · GP

GP = E_{\tilde{X}}[ ( ||∇_{\tilde{X}} D(\tilde{X} | cond)||_2 - 1 )² ],  \tilde{X} = ε X + (1 - ε) X_imp

where GP represents the "gradient penalty" mechanism of the Wasserstein distance.
S36: Compute the total loss function value of the discriminative model:

L_D = L_CE + δ · L_W

where δ is a hyper-parameter of L_W. D(X | cond) denotes inputting sample X and the condition vector cond into the discriminative model; m denotes the number of samples and u the dimension of the encoded samples.
Further, the procedure for training the dual-condition generative adversarial network in step S4 is as follows:
S41: Set the number of training iterations epoch.
S42: Create the condition vector cond.
S43: Create the mask vector M.
S44: Randomly sample m real samples X = {X_1, X_2, ..., X_m} and encode them.
S45: Randomly sample m noise samples Z = {Z_1, Z_2, ..., Z_m}, encode the noise, and compute the missing samples X_miss and the input data Z' = M ⊙ X_miss + (1 - M) ⊙ Z.
S46: Input the processed noise and the dual conditions into the generative model and output its generated samples, i.e. \hat{X} = G(cond ⊕ Z').
S47: Compute the input of the discriminative model: X_imp = M ⊙ X_miss + (1 - M) ⊙ \hat{X}.
S48: Randomly sample ε real samples X_i and (1 - ε) generated samples X_i^{imp} to compose \tilde{X}, i.e. \tilde{X} = ε X + (1 - ε) X_imp.
S49: Introduce the gradient penalty mechanism:

GP = ( ||∇_{\tilde{X}} D(\tilde{X} | cond)||_2 - 1 )²

where ∇ represents taking the derivative of the discriminative model loss function.
S410: Compute the total loss function of the discriminative model: L_D = L_CE + δ · L_W.
S411: Optimize the discriminative model and update its parameters with the Adam optimizer. Suppose the back-propagation passes through L layers in total; the gradient of the l-th layer is updated by the chain rule:

∂L_D/∂Φ_D^{(l)} = ∂L_D/∂O^{(L)} · ( Π_{k=l+1}^{L} ∂O^{(k)}/∂O^{(k-1)} ) · σ'(I^{(l)}) · ∂I^{(l)}/∂Φ_D^{(l)}

where Φ_D denotes a weight matrix, B_D a bias matrix, I^{(l)} and O^{(l)} the input and output of the l-th layer, and σ' the derivative of the Leaky ReLU activation function.
S412: The parameter update of the l-th layer is:

Φ_D^{(l)} ← Φ_D^{(l)} - lr_D · ∂L_D/∂Φ_D^{(l)},  B_D^{(l)} ← B_D^{(l)} - lr_D · ∂L_D/∂B_D^{(l)}

where lr_D is the learning rate of the discriminative model and Φ_D^{(l)}, B_D^{(l)} are the weights and biases of the l-th layer of the discriminative model.
S413: Judge whether the generated data meet the given conditions and fit the distribution of the real samples.
S414: Compute the mean square error of each element of each sample between M ⊙ X_miss and M ⊙ \hat{X}, i.e. L_MSE as defined in step S32.
S415: Compute the total loss of the generative model: L_G = L_G^{adv} + α · L_MSE.
S416: Optimize the generative model and update its parameters with the Adam optimizer. The back-propagation passes through L_G layers in total; the gradient of the l-th layer is updated with the same chain rule as in step S411, with Φ_G the weight matrix and B_G the bias matrix of the generator.
S417: The parameter update of the l-th layer is:

Φ_G^{(l)} ← Φ_G^{(l)} - lr_G · ∂L_G/∂Φ_G^{(l)},  B_G^{(l)} ← B_G^{(l)} - lr_G · ∂L_G/∂B_G^{(l)}

where lr_G is the learning rate of the generative model and Φ_G^{(l)}, B_G^{(l)} are the weights and biases of the l-th layer of the generator network.
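Putting the pieces together, a condensed training loop under the assumptions of the sketches above; the epoch count, batch size, and Adam learning rate follow the optional values given later in the embodiment, while the beta settings are an assumption:

```python
import torch

# Condensed training loop; Generator/Discriminator/d_loss/g_loss are the
# sketches above, and X, M, cond are assumed pre-encoded tensors.
def train(G, D, X, M, cond, epochs=300, m=10, lr=2e-4):
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.9))
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.9))
    for _ in range(epochs):
        idx = torch.randperm(X.size(0))[:m]
        x, mask, c = X[idx], M[idx], cond[idx]
        z = torch.randn_like(x)                      # noise samples (S45)
        x_miss = mask * x
        z_prime = mask * x_miss + (1 - mask) * z     # Z' (S45)
        x_hat = G(z_prime, c)                        # S46
        x_imp = mask * x_miss + (1 - mask) * x_hat   # S47

        opt_d.zero_grad()
        d_loss(D, x, x_imp.detach(), c).backward()   # S410 to S412
        opt_d.step()

        opt_g.zero_grad()
        x_hat = G(z_prime, c)
        x_imp = mask * x_miss + (1 - mask) * x_hat
        g_loss(D, x_hat, x_miss, x_imp, c, mask).backward()  # S414 to S417
        opt_g.step()
```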
Further, in step S5, considering that the training dataset may both contain missing data and require data generation, the data generation operations are performed separately for each case:
(1) Partial data loss
S51: Encode the missing samples in the tabular dataset and create the mask vector and condition vector;
S52: Fill the missing part of each sample with encoded random noise to obtain the input data of the generative model;
S53: Concatenate the condition vector with the input data and feed them into the trained dual-condition generative adversarial network to obtain the generated, fully filled sample data.
Data generation for the partial-loss case can be used to generate and fill sample data in the tabular dataset, so as to obtain high-quality sample information based on the existing partial data.
(2) Complete data loss
There are two ways to handle the case where the data are completely missing:
Method one: set the entire mask vector to 0.
S54: According to the condition vector of the samples to be generated, create a mask vector filled entirely with "0";
S55: Represent the samples to be generated with random noise to obtain the input data of the generative model;
S56: Concatenate the condition vector with the input data and feed them into the trained dual-condition generative adversarial network to obtain the generated sample data.
Method two: use an existing data generation model to generate part of the data as the known-data condition, then use the dual-condition generative adversarial network to fill in the missing data.
S57: Use an existing data generation model to generate part of the data as the known-data condition;
S58: Input the known-data condition together with the missing samples into the dual-condition generative adversarial network of the invention to fill in the missing data.
Data generation for the complete-loss case can be used to generate minority-class samples in the tabular dataset, so as to address its data imbalance problem.
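As an illustration of both cases, a small usage sketch of a trained generator; all tensor names and shapes are assumptions carried over from the sketches above:

```python
import torch

def fill_missing(G, x_miss, mask, cond):
    """Case (1): partially missing sample -> generated, fully filled sample."""
    z = torch.randn_like(x_miss)
    z_prime = mask * x_miss + (1 - mask) * z
    x_hat = G(z_prime, cond)
    return mask * x_miss + (1 - mask) * x_hat       # keep observed entries

def generate_from_scratch(G, u, cond):
    """Case (2), method one: all-zero mask -> brand-new conditional sample."""
    mask = torch.zeros(cond.size(0), u)
    return fill_missing(G, torch.zeros(cond.size(0), u), mask, cond)
```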
The beneficial effects of the invention are:
(1) The invention provides a missing-data filling method based on a dual-condition generative adversarial network. For a tabular dataset, a condition vector and a mask vector are introduced from the class condition and the partially existing data of the sample to be generated; a generative adversarial network model is built, its generative and discriminative models and their objective optimization functions are designed, and the trained model finally realizes the generation and filling of the missing data in a sample.
(2) For the different missing-data situations of sample data in a tabular dataset, the invention provides a data generation and filling scheme based on the dual-condition generative adversarial network. It effectively solves problems such as partially missing data and data imbalance between sample classes, improves the quality of the training data in the tabular dataset, and reduces the influence of the data imbalance problem. The invention thus provides a method for constructing a high-quality tabular training dataset that effectively supports machine learning model training in big-data application scenarios.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the missing-data filling generation method of the dual-condition generative adversarial network according to the present invention;
FIG. 2 is a schematic diagram of the generative model structure of the dual-condition generative adversarial network;
FIG. 3 is a structural diagram of the discriminative model of the dual-condition generative adversarial network.
Detailed Description
The embodiments of the present invention are described below through specific examples; those skilled in the art can easily understand other advantages and effects of the invention from the disclosure of this specification. The invention can also be implemented or applied through other, different embodiments, and the details of this specification can be modified or changed in various ways without departing from the spirit and scope of the invention. The drawings provided in the following embodiments only illustrate the basic idea of the invention schematically, and the features of the following embodiments and examples may be combined with one another in the absence of conflict.
The drawings are for illustration only and are not drawn to actual scale; to better illustrate the embodiments, some parts of the drawings may be omitted, enlarged, or reduced and do not represent the size of the actual product; those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments correspond to the same or similar components. In the description of the invention, terms indicating an orientation or positional relationship such as "upper", "lower", "left", "right", "front", and "rear" are based on the orientation shown in the drawings, are used only for convenience and simplification of description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; such terms are illustrative only and are not to be construed as limiting the invention, and their specific meaning can be understood by those skilled in the art according to the specific situation.
Referring to FIG. 1 to FIG. 3, the present invention provides a missing-data filling generation method for a dual-condition generative adversarial network, comprising the following steps:
S1: First, encode the sample data in the tabular dataset, and design the sample class as a data generation condition together with a representation of the sample data that already exist in the generation process.
S1-1: To eliminate the bias caused by the different dimensional units of different data types, data standardization and data encoding are required. Categorical data can be encoded as one-hot vectors; numerical or mixed-type data are encoded with a variational Gaussian mixture model (VGM).
For example, consider encoding a user's loan-risk-assessment information in the real sample set, taking the two variables "user gender" and "loan amount" from the user's basic information. User gender is a categorical variable and can be represented with a one-hot vector; loan amount is a numerical variable and can be represented with a VGM code. A one-hot vector such as "male" in user gender may be mapped to [1,0]; the loan amount value "100000" may be mapped to the VGM code [0.73,0,0,1,0], where "0.73" and "[0,0,1,0]" are the VGM-encoded scalar and one-hot vector of the value "100000". The user's information is finally encoded as [0.73,0,0,1,0,1,0]. Numerical or mixed-type variables like "loan amount" take very many distinct values and cannot be represented effectively by one-hot vectors alone. VGM encoding solves this well: it encodes a numerical or mixed value into a vector consisting of a scalar and a one-hot code, fits the data with multiple modes, and represents the value through a randomly sampled mode expressed as a one-hot code. At the same time, a normalization operation scales the data values into a fixed range, avoiding the trouble that widely differing value ranges would bring to the model.
S1-2: Condition vector and mask vector
The sample class label is used as one of the modeling conditions. The condition vector cond is a bit vector formed from the one-hot code generated by the sample class label: the selected label value label = 1, and the condition vector is written cond = [0, 0, ..., label, ..., 0, 0].
In the loan-risk-assessment application scenario, a user's loan information is distributed across different data sources. For example, one of the data-source participants, the "loan platform", holds the label information of the existing data, say three class levels: "high risk", "medium risk", and "low risk". Suppose the loan information of some user is missing and the missing data need to be generated and filled, and the user's risk-assessment level on the loan-platform side is known to be "medium risk". When the condition vector is constructed, the class label is set to "medium risk", and the condition vector can be represented as [0,1,0].
The mask vector is M = [M_1, ..., M_d, ..., M_{N*}], d = 1...N*. When M_d = 0, the data at that position are missing; when M_d = 1, the data are complete. The number of elements of M equal to 1 is ||M||_1, and the number equal to 0 is ||1 - M||_1. In particular, when all elements of M are 0, only the sample class is used as a condition.
Given the encoded vector X of a real sample, missing-sample vectors X_miss for different situations are simulated as X_miss = M ⊙ X, where ⊙ denotes element-wise multiplication between vectors.
Suppose the encoded loan information of some user in the real sample set is [0.73,0,0,1,0,1,0], where the vector [0.73,0,0,1,0] is the VGM-encoding result of the "loan amount" variable in the basic information and [1,0] is the one-hot encoding of the user gender. If the user gender is null, the mask vector can be represented as [1,1,1,1,1,0,0], and the final missing-sample vector is [0.73,0,0,1,0,0,0].
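This worked example can be checked mechanically; a tiny sketch reproducing the numbers above:

```python
import numpy as np

# Reproduces the loan example: VGM code of "100000" plus one-hot gender "male".
x = np.array([0.73, 0, 0, 1, 0, 1, 0])   # encoded sample
M = np.array([1, 1, 1, 1, 1, 0, 0])      # gender (last two dims) is null
cond = np.array([0, 1, 0])               # class label "medium risk"
x_miss = M * x
assert np.allclose(x_miss, [0.73, 0, 0, 1, 0, 0, 0])
```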
The generative model consists of two ResNet residual networks and one fully connected layer; its input comprises the input data and the condition vector, with noise filling the missing positions of the input data. Its workflow and structure are as follows:
S2-1: Encode the noise sample Z and fill the missing sample data with the encoded result, obtaining Z'; this process is expressed as Z' = M ⊙ X_miss + (1 - M) ⊙ Z.
S2-2: Let H_0 denote cond ⊕ Z' and take it as the initial input, i.e. H_0 = {cond ⊕ Z'_1, ..., cond ⊕ Z'_m}, where each cond ⊕ Z'_i has dimension |cond| + |Z'_i|, i = 1...m.
S2-3: H_0 first passes through the first residual network, which expands its dimension from |cond| + |Z'_i| to |cond| + |Z'_i| + 256; the output is denoted H_1.
S2-4: H_1 then passes through the second residual network, which expands the dimension from |cond| + |Z'_i| + 256 to |cond| + |Z'_i| + 512; the output is denoted H_2.
S2-5: H_2 is fed into the final fully connected layer. First, a tanh activation converts the input vector into the scalar α_{i,j}; then two Gumbel-Softmax activations produce the encoding vector β_{i,j} of the continuous data and the encoding vector d_{i,e} of the discrete data, the Gumbel-Softmax activation converting its input into a one-hot vector.
S2-6: Concatenate α_{i,j}, β_{i,j}, and d_{i,e} to obtain the final output of the generative model, denoted \hat{X}_i = α_{i,1} ⊕ β_{i,1} ⊕ ... ⊕ α_{i,n} ⊕ β_{i,n} ⊕ d_{i,1} ⊕ ... ⊕ d_{i,e}.
S2-7: The generated data \hat{X} and the missing data X_miss are combined through element-wise multiplication with the mask vector M to obtain the input data of the discriminative model, denoted X_imp, i.e. X_imp = M ⊙ X_miss + (1 - M) ⊙ \hat{X}.
Illustratively, the dimension of each real sample X_i in the real sample set X is 128, the dimension of each noise Z_i in the noise sample set Z may be set to 128, and the vector dimension of the encoded real samples is set to 157.
The discriminative model consists of three fully connected layers; its workflow and structure are as follows:
The generative model not only fills in the missing part but also generates new data for the non-missing part. The discriminative model, however, only needs to judge whether the data filled into the missing part meet the given condition and fit the distribution of the real samples, so the output of the generative model must be processed to match the discriminative model's input.
S2-8: The encoded real sample X and X_imp are each concatenated with the condition vector cond; the results are denoted K_0 and \hat{K}_0, i.e. K_0 = X ⊕ cond and \hat{K}_0 = X_imp ⊕ cond, with dimensions |X| + |cond| and |X_imp| + |cond| respectively.
S2-9: K_0 and \hat{K}_0 are each fed into the first fully connected layer, which contains a Leaky ReLU activation whose role is to prevent vanishing gradients during back-propagation; a dropout regularization operation is applied to this layer to keep the trained discriminative model from overfitting. After the first layer, two 256-dimensional vectors are obtained, denoted K_1 and \hat{K}_1.
S2-10: K_1 and \hat{K}_1 serve as input to the second fully connected layer of dimension 256, which computes the output vectors K_2 and \hat{K}_2; finally, K_2 and \hat{K}_2 pass through the third fully connected layer of dimension 256, which outputs the respective discrimination results.
S3: Express the objective optimization function of the dual-condition generative adversarial network structure.
The generative model is used to generate samples; its loss representation feeds the generated samples into the discriminative model so that they can deceive it, and computes the mean square error between the generated element values and the real element values under the action of M. The objective optimization function of the generative model is designed as follows:
S3-1: Judge whether the generated data meet the given condition and fit the distribution of the real samples; the loss is expressed as

L_G^{adv} = -(1/m) Σ_{i=1}^{m} D(X_i^{imp} | cond)

This function feeds X_i^{imp} into the discriminative model for discrimination; the larger D(X_i^{imp} | cond), the better X_imp satisfies the given condition, where i denotes the i-th sample in the sample set, i = 1...50, and X_i^{imp} = M ⊙ X_i^{miss} + (1 - M) ⊙ \hat{X}_i.
S3-2: Compute the mean square error of each element of each sample between M ⊙ X_miss and M ⊙ \hat{X}; the loss function L_MSE is expressed as:

L_MSE = (1/m) Σ_{i=1}^{m} Σ_{j=1}^{u} ( M_{i,j} X_{i,j}^{miss} - M_{i,j} \hat{X}_{i,j} )²

The smaller the result, the closer the generated data are to the real data; j denotes the j-th element of each vector in the sample set, and α denotes a hyper-parameter.
Optionally, the hyper-parameter α may be set to 0.6.
S3-3: Compute the total loss function of the generative model, namely:

L_G = L_G^{adv} + α · L_MSE

D(X | cond) denotes inputting sample X and the condition vector cond into the discriminative model.
For the discriminative model, first identify whether the generated sample meets the given sample class condition, and second compute whether the element values generated for the missing part under the action of (1 - M) fit the distribution of the real samples. The objective optimization function of the discriminative model is therefore designed as follows:
S3-4: For each sample, judge whether X_i^{imp} matches the real sample X_i under the given condition cond by computing their cross-entropy loss, namely:

L_CE = CrossEntropy( D(X^{imp} | cond), D(X | cond) )

where D(· | cond) represents the probability that a sample satisfies the given condition.
S3-5: Use the Wasserstein distance from the Wasserstein GAN (Wasserstein Generative Adversarial Networks) to compute the similarity of the missing part between (1 - M) ⊙ X_i^{imp} and (1 - M) ⊙ X_i. W(P_g, P_r) denotes the Wasserstein distance, where the X_imp set and the X_i set obey the probability distributions P_g and P_r respectively, and γ and ε are hyper-parameters of L_W. The loss L_W is expressed as:

L_W = E_{X_imp ~ P_g}[ D(X_imp | cond) ] - E_{X ~ P_r}[ D(X | cond) ] + γ · GP

GP = E_{\tilde{X}}[ ( ||∇_{\tilde{X}} D(\tilde{X} | cond)||_2 - 1 )² ],  \tilde{X} = ε X + (1 - ε) X_imp

where GP represents the "gradient penalty" mechanism of the Wasserstein distance.
Optionally, γ = 0.4 and ε = 0.5.
S3-6: Compute the total loss function value of the discriminative model, namely:

L_D = L_CE + δ · L_W

where δ is a hyper-parameter of L_W.
Optionally, the hyper-parameter δ may be set to 0.6.
S4: Establish the training dataset of the data generation model and train the dual-condition generative adversarial network.
The training process of the dual-condition generative adversarial network is as follows:
S4-1: Set the number of training iterations epoch; optionally, epoch = 300.
S4-2: Create the condition vector cond.
S4-3: Create the mask vector M.
S4-4: Randomly sample m real samples X = {X_1, X_2, ..., X_m} and encode them.
S4-5: Randomly sample m noise samples Z = {Z_1, Z_2, ..., Z_m}, encode the noise, and compute the missing samples X_miss and the input data Z' = M ⊙ X_miss + (1 - M) ⊙ Z.
Illustratively, the total number of samples in the real sample set and the noise sample set is 3000, and the encoded dimension of the samples is 157.
Optionally, the batch size m may be set to 10.
S4-6: Input the processed noise and the dual conditions into the generative model and output its generated samples, i.e. \hat{X} = G(cond ⊕ Z').
S4-7: Compute the input of the discriminative model: X_imp = M ⊙ X_miss + (1 - M) ⊙ \hat{X}.
S4-8: Randomly sample ε real samples X_i and (1 - ε) generated samples X_i^{imp} to compose \tilde{X}, i.e. \tilde{X} = ε X + (1 - ε) X_imp.
S4-9: Introduce the gradient penalty mechanism:

GP = ( ||∇_{\tilde{X}} D(\tilde{X} | cond)||_2 - 1 )²

where ∇ represents taking the derivative of the discriminative model loss function.
S4-10: Compute the total loss function of the discriminative model: L_D = L_CE + δ · L_W.
S4-11: Optimize the discriminative model and update its parameters with the Adam optimizer. If the back-propagation passes through L layers in total, the gradient of the l-th layer is updated by the chain rule through the Leaky ReLU activations, as in step S411.
S4-12: Update the parameters of the l-th layer:

Φ_D^{(l)} ← Φ_D^{(l)} - lr_D · ∂L_D/∂Φ_D^{(l)},  B_D^{(l)} ← B_D^{(l)} - lr_D · ∂L_D/∂B_D^{(l)}

where lr_D is the learning rate of the discriminative model and Φ_D^{(l)}, B_D^{(l)} are the weights and biases of the l-th layer of the discriminative model.
Optionally, the back-propagation learning rate lr_D of the discriminative model may be set to 0.0002.
S4-13: Judge whether the generated data meet the given conditions and fit the distribution of the real samples.
S4-14: Compute the mean square error of each element of each sample between M ⊙ X_miss and M ⊙ \hat{X}, i.e. L_MSE as defined in step S3-2.
S4-15: Compute the total loss of the generative model: L_G = L_G^{adv} + α · L_MSE.
S4-16: Optimize the generative model and update its parameters with the Adam optimizer. The back-propagation passes through L_G layers in total; the gradient of the l-th layer is updated with the same chain rule, with Φ_G the weight matrix and B_G the bias matrix of the generator.
S4-17: Update the parameters of the l-th layer:

Φ_G^{(l)} ← Φ_G^{(l)} - lr_G · ∂L_G/∂Φ_G^{(l)},  B_G^{(l)} ← B_G^{(l)} - lr_G · ∂L_G/∂B_G^{(l)}

where lr_G is the learning rate of the generative model and Φ_G^{(l)}, B_G^{(l)} are the weights and biases of the l-th layer of the generator network.
Optionally, the back-propagation learning rate lr_G of the generative model may be set to 0.0002.
S5: analyzing different data missing conditions, generating a countermeasure network by adopting a trained dual condition, and generating and filling missing data to construct a high-quality table training data set for machine learning model training supporting a big data application scene.
After the model is trained successfully, the confrontation network is generated by adopting the trained dual conditions, and missing data is generated and filled. Considering different situations of data missing and the problem of data imbalance in the training set, the optional data generation scheme is as follows:
(1) Partial data loss
In the loan-risk-assessment field, user information may be missing because some data were not provided or collected, or because of operational errors. In this situation, the invention can be used to generate and fill the missing part, mainly through the following steps:
S51: Encode the missing samples in the tabular dataset and create the mask vector and condition vector;
S52: Fill the missing part of each sample with encoded random noise to obtain the input data of the generative model;
S53: Create the condition vector, concatenate it with the input data, and feed them into the trained dual-condition generative adversarial network to obtain the generated, fully filled sample data.
For example, in the loan-risk assessment above, a sample has 128-dimensional data before encoding and 157-dimensional data after encoding, of which 50 dimensions are missing. The sample's class condition, e.g. high risk [1,0,0], and the existing 107-dimensional data are used to establish the corresponding M vector, with all 50 missing dimensions of M set to 0; the trained dual-condition generative adversarial network then generates and fills the 50 missing dimensions of this high-risk sample.
(2) Complete loss of data
There are two ways to deal with this case of a complete loss of data:
Method one: set the entire mask vector to 0.
S54: According to the condition vector of the samples to be generated, create a mask vector filled entirely with "0";
S55: Represent the samples to be generated with random noise to obtain the input data of the generation network;
S56: Concatenate the condition vector with the input data and feed them into the trained dual-condition generative adversarial network to obtain the generated sample data.
For example, in the loan-risk assessment above, all 157 dimensions of the data are replaced with noise samples and all 157 dimensions of M are set to 0. To generate a high-risk [1,0,0] sample, the noise sample and the condition vector are input into the trained dual-condition generative adversarial network, and M is used to generate brand-new high-risk sample data.
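A sketch of method one with the example numbers above; G is assumed to be a trained generator with the interface of the earlier sketches:

```python
import torch

def generate_high_risk(G, u=157):
    # Method one: complete loss, all 157 encoded dimensions replaced by noise.
    cond = torch.tensor([[1.0, 0.0, 0.0]])   # class "high risk"
    mask = torch.zeros(1, u)                 # all positions missing
    z = torch.randn(1, u)                    # noise replaces all 157 dims
    z_prime = mask * 0 + (1 - mask) * z      # Z' reduces to the noise here
    return G(z_prime, cond)                  # brand-new high-risk sample
```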
Method two: use an existing data generation model to generate part of the data as the known-data condition, then use the dual-condition generative adversarial network to fill in the missing data.
S57: Use an existing data generation model to generate part of the data as the known-data condition;
S58: Input the known-data condition together with the missing samples into the dual-condition generative adversarial network of the invention to fill in the missing data.
Optionally, an existing data generation model is used, for example a conditional GAN for tabular data (CTGAN) or a tabular-data GAN (TGAN), to generate a 50-dimensional part of the sample as the existing partial data serving as one of the dual conditions. Samples satisfying a given class condition are then generated according to the sample generation requirements of the different conditions, following the process of (1) partial data loss.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (7)

1. A missing-data filling generation method based on a dual-condition generative adversarial network, characterized by comprising the following steps:
S1: encoding the tabular sample data, and designing the sample class as a data generation condition together with a representation of the sample's existing data in the generation process;
S2: constructing the structure of the dual-condition generative adversarial network, comprising a generative model and a discriminative model;
S3: expressing the objective optimization function of the dual-condition generative adversarial network structure;
S4: establishing a training dataset for the data generation model, and training the dual-condition generative adversarial network;
S5: analyzing the different data-missing situations, and using the trained dual-condition generative adversarial network to generate and fill the missing data, so as to construct a high-quality tabular training dataset usable for training other machine learning models.
2. The missing-data filling generation method based on a dual-condition generative adversarial network according to claim 1, characterized in that encoding the real sample data in step S1 specifically comprises the following steps:
S11: performing data standardization and data encoding; categorical data are encoded as one-hot vectors; numerical or mixed-type data are encoded as follows:
suppose a real sample X_i in the tabular dataset has n numerical and mixed-type variables and e categorical variables, N = n + e variables in total; the sample is encoded by concatenating the scalars α_{i,j}, the one-hot vectors β_{i,j}, and the one-hot vectors d_{i,e} of the categorical data:

X_i = α_{i,1} ⊕ β_{i,1} ⊕ ... ⊕ α_{i,n} ⊕ β_{i,n} ⊕ d_{i,1} ⊕ ... ⊕ d_{i,e}

where ⊕ denotes vector concatenation, and the dimension of the encoded X_i is u; α_{i,j} is the VGM-encoded value: one VGM mode is sampled from a given probability density function, and that mode is used to standardize and normalize the j-th column C_{i,j} of the i-th sample; if the sampled mode is ρ_q, then C_{i,j} under the q-th mode is expressed as a scalar α_{i,j} and a one-hot vector β_{i,j} = [h_1, ..., h_q, ..., h_k], where h_1, ..., h_k are the 1st to k-th elements of the one-hot vector, h_q = 1, and all other elements are 0;
the standardization and normalization of the scalar α_{i,j} proceed as follows:
VGM encoding of the i-th sample yields k modes ρ_1, ρ_2, ..., ρ_k; the learned Gaussian mixture model is

P(C_{i,j}) = Σ_{q=1}^{k} ω_q · N(C_{i,j}; η_q, σ_q)

where N(·; η_q, σ_q) is the probability density function of a Gaussian mode, and ω_q, η_q, σ_q are the weight, mean, and standard deviation of one mode, q = 1, ..., k;
for the j-th column C_{i,j} of the i-th sample, the probability of its value under each mode is computed, the probability densities being

ρ_q = ω_q · N(C_{i,j}; η_q, σ_q),  q = 1, ..., k

one mode is sampled from the given probability density function and used to standardize C_{i,j}; if the sampled mode is ρ_q, then C_{i,j} under the q-th mode is expressed as the scalar

α_{i,j} = (C_{i,j} - η_q) / (4σ_q);

S12: condition vector and mask vector
the sample class label is taken as one of the modeling conditions; the condition vector cond is a bit vector formed from the one-hot code generated by the sample class label, where the selected label value label = 1, and the condition vector is written cond = [0, 0, ..., label, ..., 0, 0];
the mask vector is M = [M_1, ..., M_d, ..., M_{N*}], d = 1...N*; when M_d = 0, the data at that position are missing; when M_d = 1, the data are complete; the number of elements of M equal to 1 is ||M||_1 and the number equal to 0 is ||1 - M||_1; when all elements of M are 0, only the sample class is used as a condition;
given the encoded vector X of a real sample, missing-sample vectors X_miss for different situations are simulated as

X_miss = M ⊙ X

where ⊙ denotes element-wise multiplication between vectors.
3. The missing-data filling generation method based on a dual-condition generative adversarial network according to claim 1, characterized in that the generative model in step S2 consists of two residual networks and one fully connected layer, its input comprising the input data and the condition vector, with noise filling the missing positions of the input data; its workflow and structure are as follows:
S21: encoding a noise sample Z and filling the missing sample data with the encoded result to obtain Z', the process being expressed as Z' = M ⊙ X_miss + (1 - M) ⊙ Z;
S22: letting H_0 denote cond ⊕ Z' and taking it as the initial input, i.e. H_0 = {cond ⊕ Z'_1, ..., cond ⊕ Z'_m}, where each cond ⊕ Z'_i has dimension |cond| + |Z'_i|, i = 1...m;
S23: passing H_0 through the first residual network, which expands its dimension from |cond| + |Z'_i| to |cond| + |Z'_i| + 256, the output being denoted H_1;
S24: passing H_1 through the second residual network, which expands the dimension from |cond| + |Z'_i| + 256 to |cond| + |Z'_i| + 512, the output being denoted H_2;
S25: feeding H_2 into the final fully connected layer, where first a tanh activation converts the input vector into the scalar α_{i,j}, and then two Gumbel-Softmax activations produce the encoding vector β_{i,j} of the continuous data and the encoding vector d_{i,e} of the discrete data, the Gumbel-Softmax activation converting its input into a one-hot vector;
S26: concatenating α_{i,j}, β_{i,j}, and d_{i,e} to obtain the final output of the generative model, denoted \hat{X}_i, i.e.

\hat{X}_i = α_{i,1} ⊕ β_{i,1} ⊕ ... ⊕ α_{i,n} ⊕ β_{i,n} ⊕ d_{i,1} ⊕ ... ⊕ d_{i,e};

S27: combining the generated data \hat{X} and the missing data X_miss through element-wise multiplication with the mask vector M to obtain the input data of the discriminative model, denoted X_imp, i.e.

X_imp = M ⊙ X_miss + (1 - M) ⊙ \hat{X}

with |X_imp| = u.
4. The dual-condition-based missing data padding generation method for generation of countermeasure networks according to claim 1, wherein: in step S2, the discriminant model is composed of three fully connected networks, and the work flow and detailed structure thereof are as follows:
S28: The encoded real sample X and X_imp are each concatenated with the condition vector cond; the results are denoted K_0 and
K̂_0 respectively, i.e. K_0 = X ⊕ cond and K̂_0 = X_imp ⊕ cond, with dimensions |X| + |cond| and |X_imp| + |cond| respectively;
S29: Input K_0 and
K̂_0 separately into the first fully connected layer, which contains a LeakyReLU activation function and applies dropout regularization; this layer outputs two 256-dimensional vectors, denoted K_1 and K̂_1;
S210: K_1 and K̂_1 are input into the second fully connected layer of dimension 256, which computes the output vectors K_2 and K̂_2; finally, K_2 and K̂_2 pass through the third fully connected layer of dimension 256, which outputs the respective discrimination results.
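A matching PyTorch sketch of the three-layer discrimination model of S28–S210, under the same assumptions as the generator sketch above; the LeakyReLU slope and dropout rate are illustrative.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, input_dim, cond_dim, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + cond_dim, 256),  # first layer  (S29)
            nn.LeakyReLU(0.2),
            nn.Dropout(dropout),                   # dropout regularization
            nn.Linear(256, 256),                   # second layer (S210)
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),                     # third layer: score output
        )
    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=1))  # K0 = X (+) cond (S28)

d_model = Discriminator(input_dim=9, cond_dim=4)
x_imp = torch.randn(32, 9)
cond = torch.zeros(32, 4); cond[:, 2] = 1.0
score = d_model(x_imp, cond)                       # discrimination result
```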
5. The method for missing-data filling based on a dual-condition generative adversarial network according to claim 1, wherein: the generation model is used to generate samples; its loss function feeds the generated samples into the discrimination model so that they can deceive it, and also computes the mean squared error between the generated and real element values under the action of M; the discrimination model first identifies whether a generated sample satisfies the given sample-class condition, and second measures whether the generated element values of the missing part, under the action of (1 − M), fit the distribution of the real samples;

The objective optimization function of the generation model is designed as follows:
s31: putting the generated sample into a discriminant model, and identifying the generated sample by using the discriminant model, wherein the loss is expressed as
L_adv = −(1/m) Σ_{i=1}^{m} D(X̂^i | cond)

i.e. the i-th generated sample X̂^i is input into the discrimination model for discrimination; the larger the value of D(X̂^i | cond), the better X̂^i satisfies the given condition, i = 1...m, where X̂^i denotes the sample generated from the i-th sample X_miss^i of the missing sample set;
S32: Compute the mean squared error between each element of M ⊙ X_miss and
M ⊙ X̂ in each sample, with the loss function L_mse expressed as:

L_mse = (α/m) Σ_{i=1}^{m} Σ_{j=1}^{u} M_j (X_miss^{i,j} − X̂^{i,j})²

The smaller the result, the closer the generated data is to the real data; j = 1...u indexes the j-th element of each vector in the sample set, α denotes a hyper-parameter, X_miss^{i,j} denotes the j-th element of the i-th sample in the missing sample set, and X̂^{i,j} denotes the j-th element of the i-th sample in the generated sample set;
s33: calculating the total loss function of the generative model, namely:
L_G = L_adv + L_mse
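Continuing the sketches above, the generator objective of S31–S33 can be written as a single function; the hyper-parameter alpha corresponds to α in S32, and the sign convention of the adversarial term assumes a Wasserstein-style critic.

```python
import torch

def generator_loss(d_model, x_imp, cond, x_miss, x_hat, mask, alpha=10.0):
    # S31: adversarial term -- generated samples should score high / fool D
    l_adv = -d_model(x_imp, cond).mean()
    # S32: masked MSE between M (x) X_miss and M (x) X_hat (observed entries)
    l_mse = alpha * (mask * (x_miss - x_hat) ** 2).mean()
    # S33: total generator loss L_G = L_adv + L_mse
    return l_adv + l_mse
```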
the design process of the objective optimization function of the discriminant model is as follows:
S34: For each sample, judge whether X_imp matches the real sample X under the given condition cond by computing a cross-entropy loss, namely:
L_ce = −(1/m) Σ_{i=1}^{m} [ log D(X^i | cond) + log(1 − D(X_imp^i | cond)) ]

where D(· | cond) represents the probability that a sample satisfies the given condition;
S35: Use the Wasserstein distance of the Wasserstein generative adversarial network to compute the similarity of the missing part between (1 − M) ⊙ X_imp and (1 − M) ⊙ X;
W(P_r, P_g) denotes the Wasserstein distance, where the X_imp set and the X set obey the probability distributions P_g and P_r respectively, and γ, ε are hyper-parameters of L_w. The loss L_w is expressed as:

L_w = E_{X_imp ~ P_g}[ D((1 − M) ⊙ X_imp) ] − E_{X ~ P_r}[ D((1 − M) ⊙ X) ] + γ · E_{X̃}[ (‖∇_{X̃} D(X̃)‖₂ − 1)² ]

where the last term represents the "gradient penalty" mechanism in the Wasserstein distance and X̃ = εX + (1 − ε)X_imp;
s36: calculating the total loss function value of the discriminant model:
L_D = L_ce + δ · L_w

where δ is a weighting hyper-parameter of L_w, D(X | cond) denotes inputting the sample X and the condition vector cond into the discrimination model, m denotes the number of samples, and u denotes the dimension of the encoded sample.
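A hedged sketch of the discriminator objective of S34–S36, continuing the sketches above. The sigmoid-based cross-entropy and the gradient-penalty details (interpolation coefficient eps, penalty weight gamma, mixing weight delta) are assumptions consistent with the claim's description, not a verbatim transcription of it.

```python
import torch

def discriminator_loss(d_model, x_real, x_imp, cond, mask,
                       gamma=10.0, delta=1.0):
    # S34: cross-entropy -- does each sample match the class condition cond?
    p_real = torch.sigmoid(d_model(x_real, cond))
    p_fake = torch.sigmoid(d_model(x_imp, cond))
    l_ce = -(torch.log(p_real + 1e-8) + torch.log(1 - p_fake + 1e-8)).mean()

    # S35: Wasserstein term on the missing part under (1 - M)
    w = d_model((1 - mask) * x_imp, cond).mean() \
        - d_model((1 - mask) * x_real, cond).mean()

    # gradient penalty on interpolated samples (the mechanism of S48-S49)
    eps = torch.rand(x_real.size(0), 1)
    x_tilde = (eps * x_real.detach()
               + (1 - eps) * x_imp.detach()).requires_grad_(True)
    grads = torch.autograd.grad(d_model(x_tilde, cond).sum(),
                                x_tilde, create_graph=True)[0]
    gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()

    # S36: total discriminator loss L_D = L_ce + delta * L_w
    return l_ce + delta * (w + gamma * gp)
```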
6. The method for missing-data filling based on a dual-condition generative adversarial network according to claim 1, wherein: the procedure for training the dual-condition generative adversarial network in step S4 is as follows:
S41: Set the number of training iterations epoch;
S42: Create the condition vector cond;
S43: Create the mask vector M;
S44: Randomly sample m real samples {X_1, X_2, ..., X_m} and encode them;
S45: Randomly sample m noise samples {Z_1, Z_2, ..., Z_m}, encode the noise, and compute the missing samples X_miss and the input data Z' = M ⊙ X_miss + (1 − M) ⊙ Z;
S46: Input the processed noise and the dual conditions into the generation model and output its generated samples, i.e.
X̂^i = G(Z'^i, cond), i = 1...m;

S47: Compute the imputed samples: X_imp^i = M ⊙ X_miss^i + (1 − M) ⊙ X̂^i;
S48: Randomly combine the real sample X^i weighted by ε and
X_imp^i weighted by (1 − ε) to compose the interpolated sample

X̃^i = εX^i + (1 − ε)X_imp^i, with ε sampled uniformly from [0, 1];
S49: a gradient penalty mechanism is introduced:
GP = γ · E_{X̃}[ (‖∇_{X̃} D(X̃)‖₂ − 1)² ]

where ∇_{X̃} D(X̃) represents differentiating the discrimination model's loss with respect to the interpolated sample;
s410: calculating a total loss function of the discriminant model:
L_D = L_ce + δ · L_w;
S411: Optimize the discrimination model with the Adam optimizer and update its parameters; back-propagation passes through L_D layers in total, and the
ℓ-th layer's gradient update process is as follows:

∂L_D/∂W_D^ℓ = ∂L_D/∂z^{L_D+1} · (∏_{k=ℓ+1}^{L_D} ∂z^{k+1}/∂z^k) · ∂z^{ℓ+1}/∂W_D^ℓ

where W_D denotes the weight matrix and B_D the bias matrix of the discrimination model, z^k denotes the input of the k-th layer (z^{L_D+1} being the input of the (L_D+1)-th layer, i.e. the network output), ∂L_D/∂W_D^ℓ denotes differentiating with respect to the ℓ-th layer's weights, and each factor ∂z^{k+1}/∂z^k involves the derivative of the LeakyReLU activation function;
S412: The parameters of the
ℓ-th layer are then updated as follows:

W_D^ℓ ← W_D^ℓ − lr_D · ∂L_D/∂W_D^ℓ,  B_D^ℓ ← B_D^ℓ − lr_D · ∂L_D/∂B_D^ℓ

where lr_D is the learning rate of the discrimination model, and W_D^ℓ and B_D^ℓ respectively denote the weights and biases of the ℓ-th layer of the discrimination model;
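In practice, the per-layer chain-rule and update equations of S411–S412 are exactly what automatic differentiation computes; continuing the sketches above (with a batch x_real, x_imp, cond, mask prepared as in S44–S47), a minimal PyTorch equivalent of the discriminator step would be:

```python
import torch

# lr_D and the Adam betas are illustrative choices
d_optim = torch.optim.Adam(d_model.parameters(), lr=2e-4, betas=(0.5, 0.9))

loss_d = discriminator_loss(d_model, x_real, x_imp.detach(), cond, mask)
d_optim.zero_grad()
loss_d.backward()   # back-propagates through all L_D layers (S411)
d_optim.step()      # per-layer parameter update with learning rate lr_D (S412)
```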
S413: Judge whether the generated data satisfies the given conditions and whether it fits the distribution of the real samples;
S414: Compute the mean squared error between each element of M ⊙ X_miss and
M ⊙ X̂ in each sample:

L_mse = (α/m) Σ_{i=1}^{m} Σ_{j=1}^{u} M_j (X_miss^{i,j} − X̂^{i,j})²;
s415: the total loss of the generative model is calculated,
L_G = L_adv + L_mse;
S416: Use the Adam optimizer to optimize the generation model and update its parameters; back-propagation passes through L_G layers in total, and the
ℓ-th layer's gradient update process is as follows:

∂L_G/∂W_G^ℓ = ∂L_G/∂z^{L_G+1} · (∏_{k=ℓ+1}^{L_G} ∂z^{k+1}/∂z^k) · ∂z^{ℓ+1}/∂W_G^ℓ

where W_G denotes the weight matrix and B_G the bias matrix of the generation model, z^k denotes the input of the k-th layer, ∂L_G/∂W_G^ℓ denotes differentiating with respect to the ℓ-th layer's weights, and each factor ∂z^{k+1}/∂z^k involves the derivative of the activation function;
S417: The parameters of the
ℓ-th layer are then updated as follows:

W_G^ℓ ← W_G^ℓ − lr_G · ∂L_G/∂W_G^ℓ,  B_G^ℓ ← B_G^ℓ − lr_G · ∂L_G/∂B_G^ℓ

where lr_G is the learning rate of the generation model, and W_G^ℓ and B_G^ℓ respectively denote the weights and biases of the ℓ-th layer of the generator network.
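Putting claim 6 together, the whole training procedure of S41–S417 can be sketched end to end, continuing the Generator/Discriminator and loss sketches above. The stand-in random tensors play the role of the encoded table (in practice they come from the encoding of S11–S12), and batch size, learning rates and epoch count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

g = Generator(data_dim=9, cond_dim=4, n_cont_modes=5, n_disc_cats=3)
d_model = Discriminator(input_dim=9, cond_dim=4)
g_optim = torch.optim.Adam(g.parameters(), lr=2e-4, betas=(0.5, 0.9))
d_optim = torch.optim.Adam(d_model.parameters(), lr=2e-4, betas=(0.5, 0.9))

for epoch in range(300):                                      # S41
    cond = F.one_hot(torch.randint(0, 4, (64,)), 4).float()   # S42
    mask = (torch.rand(64, 9) > 0.3).float()                  # S43
    x_real = torch.randn(64, 9)                               # S44 (stand-in)
    z = torch.randn(64, 9)                                    # S45
    x_miss = mask * x_real
    z_prime = mask * x_miss + (1 - mask) * z

    x_hat = g(z_prime, cond)                                  # S46
    x_imp = mask * x_miss + (1 - mask) * x_hat                # S47

    # S48-S412: discriminator step (interpolation + penalty live in the loss)
    loss_d = discriminator_loss(d_model, x_real, x_imp.detach(), cond, mask)
    d_optim.zero_grad(); loss_d.backward(); d_optim.step()

    # S413-S417: generator step
    loss_g = generator_loss(d_model, x_imp, cond, x_miss, x_hat, mask)
    g_optim.zero_grad(); loss_g.backward(); g_optim.step()
```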
7. The method for missing-data filling based on a dual-condition generative adversarial network according to claim 1, wherein: in step S5, considering the data-missing situations present in the training data set and the data generation requirements, the data generation operations are performed as follows:
(1) Partial data missing

S51: Encode the missing samples in the tabular data set, and create the mask vector and the condition vector;
S52: Fill the missing part of each sample with encoded random noise to obtain the input data of the generation model;
S53: Concatenate the condition vector with the input data and feed them into the trained dual-condition generative adversarial network to obtain the generated and filled sample data, as shown in the sketch below;

Data generation under partial data missing can be used to generate and fill sample data in tabular data sets, so as to obtain high-quality sample information based on the existing partial data;
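A hedged sketch of this partial-missing inference path (S51–S53), continuing the earlier sketches; the function name impute is hypothetical.

```python
import torch

@torch.no_grad()
def impute(g, x_miss, mask, cond):
    z = torch.randn_like(x_miss)               # S52: encoded random noise
    z_prime = mask * x_miss + (1 - mask) * z   # noise only in the missing slots
    x_hat = g(z_prime, cond)                   # S53: trained generator
    return mask * x_miss + (1 - mask) * x_hat  # keep observed values, fill gaps

x_filled = impute(g, x_miss, mask, cond)       # tensors as in the training sketch
```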
(2) Complete data missing

There are two ways to handle the case where the data is completely missing:

Method one: set the mask vector to all zeros;
S54: Create a mask vector filled entirely with '0' according to the condition vector of the samples to be generated;
S55: Represent the samples to be generated with random noise to obtain the input data of the generation model;
S56: Concatenate the condition vector with the input data and feed them into the trained dual-condition generative adversarial network to obtain the generated sample data;

Method two: use an existing data generation model to generate part of the data as a known-data condition, then fill in the missing data with the dual-condition generative adversarial network;
S57: Generate part of the data with an existing data generation model as the known-data condition;
S58: Input the known-data condition and the missing samples together into the dual-condition generative adversarial network to fill in the missing data;

Data generation under complete data missing can be used to generate minority-class samples in tabular data sets to address their data-imbalance problems; a sketch follows below.
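A hedged sketch of method one for the complete-missing case (S54–S56), continuing the earlier sketches: with an all-zero mask, only the class condition constrains generation, which is how minority-class rows can be synthesized for imbalanced tabular data. The function name generate is hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(g, n, dim, label, n_classes):
    mask = torch.zeros(n, dim)                 # S54: mask filled entirely with 0
    cond = F.one_hot(torch.full((n,), label, dtype=torch.long),
                     n_classes).float()        # class condition for new samples
    z = torch.randn(n, dim)                    # S55: noise represents the sample
    z_prime = (1 - mask) * z                   # with M = 0 the input is pure noise
    return g(z_prime, cond)                    # S56: generated sample data

minority = generate(g, n=128, dim=9, label=3, n_classes=4)  # minority-class rows
```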
CN202210347936.6A 2022-04-01 2022-04-01 Dual-condition-based method for generating confrontation network and filling missing data Pending CN114757335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210347936.6A CN114757335A (en) 2022-04-01 2022-04-01 Dual-condition-based method for generating confrontation network and filling missing data


Publications (1)

Publication Number Publication Date
CN114757335A true CN114757335A (en) 2022-07-15

Family

ID=82329568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210347936.6A Pending CN114757335A (en) 2022-04-01 2022-04-01 Dual-condition-based method for generating confrontation network and filling missing data

Country Status (1)

Country Link
CN (1) CN114757335A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115659797A (en) * 2022-10-24 2023-01-31 大连理工大学 Self-learning method for generating anti-multi-head attention neural network aiming at aeroengine data reconstruction
CN115659797B (en) * 2022-10-24 2023-03-28 大连理工大学 Self-learning method for generating anti-multi-head attention neural network aiming at aeroengine data reconstruction
CN115829009A (en) * 2022-11-08 2023-03-21 重庆邮电大学 Data enhancement method based on semi-supervised federal learning under privacy protection
CN116913445A (en) * 2023-06-05 2023-10-20 重庆邮电大学 Medical missing data interpolation method based on form learning
CN116913445B (en) * 2023-06-05 2024-05-07 重庆邮电大学 Medical missing data interpolation method based on form learning
CN117854716A (en) * 2024-03-08 2024-04-09 长春师凯科技产业有限责任公司 Method and system for filling heart disease diagnosis missing data based on HF-GAN


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination