CN114757335A

CN114757335A - Dual-condition-based method for generating confrontation network and filling missing data

Info

Publication number: CN114757335A
Application number: CN202210347936.6A
Authority: CN
Inventors: 钱鹰; 戴思聪; 刘歆; 万邦睿; 黄江平; 王毅峰; 韦庆杰; 王奕琀
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-04-01
Filing date: 2022-04-01
Publication date: 2022-07-15

Abstract

The invention relates to a missing data filling generation method for generating a countermeasure network based on dual conditions, which belongs to the field of data perception and reconstruction in a computer and comprises the following steps: s1: encoding sample data, and designing a sample type as a data generation condition and a representation mode of the existing sample data in the generation process; s2: constructing a structure of a dual-condition generation countermeasure network, wherein the structure comprises a generation model and a discrimination model; s3: generating an objective optimization function of the antagonistic network structure under the dual conditions for representation; s4: establishing a training data set of a data generation model, and training a dual-condition generation countermeasure network; s5: analyzing different data missing conditions, and generating a countermeasure network by adopting a trained dual condition to generate and fill missing data. The invention provides a method for constructing a high-quality training data set aiming at table data, which is used for supporting machine learning model training in a big data application scene.

Description

Missing data filling generation method for generating countermeasure network based on dual conditions

Technical Field

The invention relates to a missing data filling generation method for generating a countermeasure network based on dual conditions, and belongs to the field of data perception and reconstruction in computers.

Background

In recent years, in machine learning, the generation of antagonistic network models has become increasingly important and popular due to their applicability in different fields. Their ability to represent complex, high-dimensional data can be used in other areas of academia, processing images, video, tabular data, and the like. In the field of digital finance and the like, the generation of form data is a core issue of interest to researchers. Given a set of random noises, the generation of the countermeasure network model can generate corresponding table data.

In a real application scenario, table data under a given condition is often required to be output, and the table data usually has a missing situation. In view of the specificity of the table data, the design of the encoding process and conditions thereof presents certain challenges. Therefore, it is necessary to design a generation countermeasure network model that can effectively encode the table data and solve the data missing problem.

Conditional Generative adaptive Networks (Conditional generic adaptive Networks) are certainly the most representative of Conditional Generative network models. Coding the table data based on the improved CTGAN of the CGAN, introducing a condition vector, splicing the random noise and the condition vector, and inputting the spliced random noise and the condition vector into a generation model to finally obtain the table data under the specified condition.

However, the above method has the following problems: (1) data loss of sample data in an actual scene often exists, and the problem of data imbalance of a few types of samples also exists; (2) the design only takes the category condition as input, and the quality of the training data set constructed by the generated sample cannot be well guaranteed, so that the requirement of machine learning model training in a big data application scene on the high-quality training data set is met.

Disclosure of Invention

In view of this, the present invention provides a missing data padding generation method based on a dual Conditional generated countermeasure network (Double Conditional robust generated adaptive Networks), which is used for generating table-type data to construct a high-quality training data set for supporting machine learning model training in a big data application scenario.

In order to achieve the purpose, the invention provides the following technical scheme:

a method for generating missing data filling of a countermeasure network based on dual conditions comprises the following steps:

s1: coding sample data in the table class data set, and designing a sample class serving as a data generation condition and a representation mode of existing sample data in a generation process;

s2: constructing a structure of a dual-condition generation countermeasure network, wherein the structure comprises a generation model and a discrimination model;

s3: generating an objective optimization function of the antagonistic network structure under the dual conditions for representation;

s4: establishing a training data set of a data generation model, and training a dual-condition generation countermeasure network;

s5: analyzing different data missing conditions, and generating a countermeasure network by adopting a trained dual condition to generate missing data and fill up the missing data so as to construct a high-quality table training data set for machine learning model training supporting a big data application scene.

Further, in step S1, encoding the real sample data specifically includes the following steps:

s11: in order to eliminate the deviation caused by different dimension units of different types of data, data standardization and data coding are required; the category data is coded in a one-hot coding mode; the logarithmic or hybrid data is encoded as follows:

suppose real sample data X in a table class dataset_iIf there are N numeric type and mixed type variables and e type variables, and N + e is N variables in total, the sample data is encoded by scalar α_i,j、β_i,jOne-hot vector d for other categorical data_i,eSplicing to form the following parts:

wherein,

representing vector concatenations, finally encoded X_iThe dimension of (a) is u; alpha is alpha_i,jFor VGM encoded values, it is indicated that one VGM mode is sampled from a given probability density function, and the sampling mode is used to sample the jth column C of the ith sample_i,jCarrying out standardization and normalization; if the sampling mode is p_qThen C in the q-th mode_i,jExpressed as a scalar α_i,jAnd one-hot vector beta_i,j，β_i,j＝[h₁,...,h_q,...,h_k]Wherein h is₁,...,h_q,...,h_kRepresent the 1 st to k th, h, of the one-hot vector respectively_q1, other values take 0;

scalar α_i,jThe process of normalization and normalization of (a) is as follows:

the ith sample is subjected to VGM coding to obtain k modes, namely rho₁,ρ₂,...,ρ_kThe learned Gaussian mixture model is

Wherein

Probability density function, ω, representing a Gaussian model_q、η_qAnd σ_qEach represents oneWeight, mean and standard deviation of the modality, q 1.., k;

j column C for the ith sample_i,jCalculating the probability of the value in each mode, wherein the probability density is rho₁,ρ₂,...,ρ_k，

Sampling a mode from a given probability density function and using the mode pair C_i,jCarrying out standardization; if the sampling result is rho_qThen C in the q-th mode_i,jExpressed as a scalar α_i,j：

S12: conditional vector and mask vector

Taking the sample class label as one of the conditions for modeling, wherein a condition vector cond is a bit vector and is formed by one-hot codes generated by the sample class label, wherein the selected label value label ═ 1, and the condition vector is expressed as cond [0, 0., label., 0,0 ];

mask vector M ═ M₁,...,M_d,...,M_N]，d＝1...N^*When M is_dWhen 0, data representing the position is missing; when M is_dWhen the value is 1, the data is complete; the dimension of the element value 1 in M is | | | M | | non-woven phosphor₁And the dimension of the element value 0 is | |1-M | | non-conducting phosphor₁；M_dWhen all the elements in the list are 0, the sample category is only taken as a condition;

given a coding vector X of real sample data, simulating missing sample data vectors X of different situations_miss：

That is, X_missAn l ═ M ═ X |, M, in which, ", indicates element-level multiplication between vectors.

Further, the generated model in step S2 is composed of two Residual Networks (Residual Networks) and 1 Fully Connected Layer (full Connected Layer), the input portion of the generated model includes input data and condition vectors, and the input data is filled with noise, and the workflow and detailed structure thereof are as follows:

s21: coding a noise sample Z and filling missing sample data with the coded result to finally obtain Z ', wherein the process is represented as Z' ═ M ^ X_miss+(1-M)⊙Z；

S22: by H₀To represent

And as an initial input, i.e.

Each of which

Of | cond | + | Z'_i|，i＝1...m；

s25: will H₂Inputting into the last layer of fully-connected network, firstly, the input vector is converted into a scalar alpha by tanh activation function_i,j(ii) a Then, two pocket softmax activation functions are respectively used for obtaining the coding vector beta of the continuous data_i,jAnd a code vector d of discrete data_i,eWherein the gumbel softmax activation function is used to convert the input into a one-hot vector;

s26: will be alpha_i,j、β_i,j、d_i,eSplicing is carried out and the final generated result of the generated model is obtained,by using

Is shown, i.e.

S27: will generate data

Missing data X_missPerforming element-level multiplication with mask vector M to obtain input data of discrimination model, and using X to make_impIs expressed, i.e.

And is

Further, the discriminant model in step S2 is formed by three fully-connected networks, and its workflow and detailed structure are as follows:

the generation model not only fills in the missing part, but also generates new data for the non-missing part. However, it is only necessary for the discriminant model to identify whether the data filled in the missing part meets the given condition and fits the distribution of the real sample, so the generated data of the generative model needs to be processed to meet the input of the discriminant model.

S28: real sample X, X after being coded_impSplicing with the condition vectors cond respectively, and using K as the result₀And

is shown, i.e.

K₀And

respectively is | X_imp| X | + | cond | and | X | + | cond |;

s29: then K is put₀And

respectively inputting the data into a first layer of fully-connected network, wherein the first layer of fully-connected network comprises a Leaky ReLU activation function used for preventing the situation of gradient disappearance in the back propagation process; performing dropout regularization operation on the first layer of fully connected network to avoid the phenomenon of overfitting of the discriminant model after successful training; respectively obtaining two 256-dimensional vectors by K after passing through a first layer of fully-connected network₁And

carrying out representation;

S210：K₁and

computing an output vector K as an input into a second tier fully-connected network with dimension 256₂And

finally K is added₂And

and respectively outputting the identification results through a third layer of full-connection network with the dimensionality of 256.

Furthermore, the generating model is used for generating samples, the loss function representation of the generating model is to put the generated samples into the discriminant model, so that the generated samples can cheat the discriminant model, and the mean square error between the generated element value and the real element value under the action of M is calculated; for the discrimination model, firstly, whether the generated sample meets the given sample class condition is identified, and secondly, whether the generated element value of the missing part under the action of (1-M) is fitted with the distribution of the real sample is calculated.

The design steps of the objective optimization function of the generated model are as follows:

s31: putting the generated sample into a discriminant model, and identifying the generated sample by using the discriminant model, wherein the loss is expressed as

The function represents the ith sample in the sample set

Inputting the data into a discriminant model for discrimination, wherein the larger the value of the data is, the larger the data is

The more satisfied a given condition is, i is 1

Representing the ith sample in the missing sample set;

s32: calculate M [ < X >_missAnd

mean square error of each element in each sample, its loss function

Represents:

the smaller the result is, the closer the generated data is to the real data, j represents the jth element of each vector in the sample set, j is 1.. u, alpha represents the hyper-parameter,

representing the jth element of the ith sample in the missing sample set,

representing the jth element of the ith sample in the generated sample set;

s33: calculating the total loss function of the generative model, namely:

the design process of the objective optimization function of the discriminant model is as follows:

s34: for each sample, by calculating X_impAnd judging whether the sample X is matched with the cross entropy loss of the real sample X under the given condition cond, namely:

wherein

Representing a sample

Satisfying a given probability;

s35: calculating X using Wasserstein distance in Wasserstein generated countermeasure network (Wasserstein generated adaptive Networks)_impAs a result, the similarity of the missing portion between (1-M) and X [ ((1-M)),

representing the distance of the Wasserstein and,

and

respectively represent X_impSet and X set obeys P_g、P_rProbability distribution of gamma and epsilon

Over-parameters of, for loss

Represents:

wherein

Represents the "gradient penalty" mechanism in Wasserstein distance;

s36: calculating the total loss function value of the discriminant model:

wherein δ is

Is determined. D (X | cond) represents input of the sample X and the condition vector cond to the discriminant model. Where m represents the number of samples and u represents the dimension of the encoded samples.

Further, the procedure of training the dual-condition generation countermeasure network in step S4 is as follows:

s41: setting training iteration times epoch;

s42: creating a condition vector cond;

s43: creating a mask vector M;

s44: randomly sampling m real samples X to X₁,X₂,...,X_mH, encoding real samples；

S45: randomly sampling m noise samples Z to Z₁,Z₂,...,Z_mH, encode the noise and compute the missing sample X_missAnd input data Z ', Z' ═ M | _ X_miss+(1-M)⊙Z；

S46: inputting the processed noise and the dual condition into the generative model and outputting the generated samples of the generative model, i.e.

S47：

S48: randomly sampling the E-bar real sample X_iAnd (1-e) bar

Composition of

S49: a gradient penalty mechanism is introduced:

representing derivation of a discriminant model loss function;

s410: calculating a total loss function of the discriminant model:

s411: optimization of discriminant models and updating of parameters using Adam optimizer, back propagationIn the process, L is common_DLayer, then

The gradient update process of the layer is as follows:

wherein phi_DRepresenting a weight matrix, B_DA bias matrix is represented that is,

is to show to

The weight of (b) is differentiated,

denotes the L th_DThe input of the +1 layer is,

is shown to the L < th > of_DThe output of the layer is differentiated,

representing differentiation of the LeakyReLu activation function,

an input of the L-th layer is represented,

is to the L < th > of_D-the output of layer 1 is differentiated,

denotes the first

The input of the layer(s) is (are),

is to show to

The output of the layer is differentiated,

is shown as

Inputting a layer;

s412: first, the

The parameter updating process of the layer is as follows:

wherein lr is_DIn order to discriminate the learning rate of the model,

and

respectively represent the discriminant model

Weights and biases of layers;

s413: judging whether the generated data meet given conditions or not and whether the generated data are fit with the distribution of a real sample or not;

s414: calculate M [ < X >_missAnd

the mean square error of each element in each sample,

s415: the total loss of the generative model is calculated,

s416: the Adam optimizer is used for optimizing the generation model and updating parameters, and the back propagation process has L_GLayer, then

The gradient update process of the layer is as follows:

wherein phi_GRepresenting a weight matrix, B_GA bias matrix is represented that is,

is shown to the first

The weight of (b) is differentiated,

represents L_GThe input of the +1 layer is,

is to the L < th > of_GThe output of the layer is differentiated,

an input of the L-th layer is represented,

is to the L < th > of_G-differentiation of the output of layer 1,

is shown as

The input of the layer(s) is (are),

is to show to

The output of the layer is differentiated,

is shown as

Inputting a layer;

s417: first, the

The parameter updating process of the layer is as follows:

wherein lr is_GIn order to generate the learning rate of the model,

and

respectively represent the birthday device network

Weights and biases of layers.

Further, in step S5, in consideration of the presence of data loss and the need for data generation in the training data set, the data generation operation is performed separately:

(1) partial data loss

S51: coding missing samples in the table type data set, and creating a mask vector and a condition vector;

s52: filling the missing part of the sample by using the coded random noise to finally obtain input data of a generation model;

s53: and creating a condition vector, splicing the condition vector with input data, and inputting the condition vector and the input data into a trained dual-condition generation countermeasure network to obtain generated and filled sum sample data.

Data generation for the case of partial data loss can be used for generating and filling sample data in the table type data set so as to obtain high-quality sample information based on the existing partial data.

(2) Complete loss of data

There are two ways to deal with this case of a complete loss of data:

the method comprises the following steps: setting all mask vectors to 0;

s54: creating a mask vector completely filled with '0' according to a condition vector of samples to be generated;

s55: using random noise pair to represent samples to be generated to obtain input data of a generation model;

s56: and splicing the condition vector and the input data, and inputting the spliced condition vector and the input data into a trained dual-condition generation countermeasure network to obtain generated sample data.

The second method comprises the following steps: the existing data generation model is adopted to generate a part of data as a known data condition, and then the dual condition generation countermeasure network is adopted to fill up missing data.

S57: generating a part of data as a known data condition by adopting an existing data generation model;

s58: known data conditions are input into the dual-condition generation countermeasure network of the invention together with missing samples to fill in the missing data.

Data generation for the case of a complete loss of data can be used for generation of a few class samples in a tabular class dataset to address data imbalance issues therefor.

The invention has the beneficial effects that:

(1) the invention provides a method for filling missing data of a dual-condition generation countermeasure network. In the table type data set, a condition vector and a mask vector are introduced based on the category condition and part of existing data of a sample to be generated, a generation confrontation network model is established, a generation model and a discrimination model in the generation confrontation network model and a target optimization function of the model are established, and finally generation and filling of missing data in the sample are realized through a trained model.

(2) Aiming at different missing conditions of sample data in the tabular data set, the invention provides a data generation and filling scheme for generating the countermeasure network based on dual conditions, effectively solves the problems of partial data missing, data imbalance among sample classes and the like, improves the quality of a training data set in the tabular data set, and reduces the influence of the data imbalance problem. Therefore, the invention provides a method for constructing a high-quality table training data set, which is used for effectively supporting machine learning model training in a big data application scene.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.

Drawings

For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a schematic flow chart of a method for generating missing data padding for a dual-conditional generation countermeasure network according to the present invention;

FIG. 2 is a schematic diagram of a generative model structure of a dual-conditional generative countermeasure network;

FIG. 3 is a structural diagram of a discriminant model of a dual-conditional generative countermeasure network.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

Wherein the showings are for the purpose of illustrating the invention only and not for the purpose of limiting the same, and in which there is shown by way of illustration only and not in the drawings in which there is no intention to limit the invention thereto; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.

Referring to fig. 1 to 3, the present invention provides a method for filling and generating missing data of a dual conditional generation countermeasure network, comprising the following steps:

s1: firstly, encoding sample data in a table type data set, and designing a sample type as a data generation condition and a representation mode of existing sample data in a generation process;

s1-1, in order to eliminate the deviation caused by different dimension units of different types of data, data standardization and data coding are required. The category data can be coded in a one-hot coding mode; the numerical or hybrid data is encoded by a variational gaussian mixture model (VGM).

For example, the information related to loan risk assessment of a user in the real sample set is encoded, and two variables of "user gender" and "loan amount" in the basic information of the user are taken as examples, wherein the user gender belongs to a category variable and can be represented by using a one-hot vector, and the loan amount belongs to a numerical variable and can be represented by VGM encoding. A one-hot vector such as "male" in user gender may be mapped to [1,0 ]; the loan amount value of "100000" may be mapped to a VGM code of [0.73,0,0,1,0], where "0.73" and "[ 0,0,1,0 ]" respectively represent a VGM-encoded scalar and a one-hot vector of the value "100000", and finally the user information is encoded to [0.73,0,0,1,0,1,0 ]. The number of numerical or hybrid variables like "loan amount" are very large and cannot be represented effectively by one-hot vectors, which can be well solved by VGM coding, which encodes numerical or hybrid data into a vector consisting of a scalar and one-hot codes, fits the numerical or hybrid data using multiple modalities, and represents it as one-hot codes by means of a random sampling modality. Meanwhile, the method uses normalization operation to scale the data item values to a certain area range, thereby avoiding the trouble brought to the model by the range difference of the data item values.

S1-2, condition vector and mask vector

The sample class label is used as one of the conditions for modeling. The condition vector cond is a bit vector and is formed by one-hot codes generated by sample class labels, where the selected label value label is 1, and the condition vector is denoted as cond [0, 0.

In the loan risk assessment application scenario, the user's loan information is distributed among different data sources, and for example, one of the data source participants, "loan platform," is label information that holds existing data. For example, three category levels, namely "high risk", "medium risk", and "low risk". Now, the loan information of a certain user is missing, the missing data needs to be generated and filled, and the risk evaluation level of the user on the loan platform side is known to belong to 'middle risk', so that when a condition vector is constructed, a category label is set to 'middle risk', and the condition vector can be represented as [0,1,0 ].

Mask vector M ═ M₁,...,M_d,...,M_N]，d＝1...N^*When M is_dWhen 0, the data representing the position is missing; when M is_dWhen 1, the data is complete. The dimension of the element value 1 in M is | | | M | | non-woven phosphor₁And the dimension of the element value 0 is | |1-M | | non-conducting phosphor₁. In particular, M_dWhen all the elements in (a) are 0(s), it means that only the sample class is used as a condition.

Given a code vector X of real sample data, simulating a missing sample data vector X of different situations_miss：

That is, X_missAs one, where "" indicates element-level multiplication between vectors.

If the result of encoding the loan information of a certain user in the real sample set is [0.73,0,0,1,0,1,0], wherein the encoding vector [0.73,0,0,1,0] represents the result of VGM encoding the variable of the loan amount in the employee basic information, and [1,0] represents the result of one-hot encoding the user gender in the user basic information. If the user gender is null, the mask vector may be represented as [1,1,1,1,1,0,0], and the final missing sample data vector is [0.73,0,0,1,0,0,0 ].

The generative model consists of two ResNet residual networks and 1 fully connected layer, the input part of which comprises input data and condition vectors, and the input data is filled with noise. The working process and the detailed structure are as follows:

s2-1, encoding the noise sample Z and filling the missing sample data with the encoded result to finally obtain Z ', a process of which indicates Z' ═ M | _ X |_miss+(1-M)⊙Z。

S2-2, using H₀To represent

And as an initial input, i.e.

Each of which

Of | cond | + | Z'_i|，i＝1...m；

s2-5, mixing H₂Inputting into the last layer of fully-connected network, firstly, the input vector is converted into a scalar alpha by tanh activation function_i,j(ii) a Then, two pocket softmax activation functions are respectively used for obtaining the coding vector beta of the continuous data_j,jAnd a code vector d of discrete data_i,eWherein the gumbel softmax activation function can convert the input into a one-hot vector;

s2-6, mixing_i,j”、“β_i,j”、“d_i,e"splice and obtain the final generated result of the generated model, use

Is expressed, i.e.

S2-7, generating data

Missing data X_missPerforming element-level multiplication with mask vector M to obtain input data of discrimination model, and using X_impIs expressed, i.e.

And is

Illustratively, each true sample X in the set of true samples X_iHas a dimension of 128, each noise Z in the set of noise samples Z_iThe dimension of (c) may be set to 128 and the vector dimension of the samples encoded for the real samples is set to 157.

The discrimination model is composed of three fully-connected networks, and the working process and the detailed structure are as follows:

S2-8, true sample X, X after being coded_impSplicing with the condition vectors cond respectively, and using K as the result₀And

is shown, i.e.

K₀And

respectively is | X_impAnd | X | + | cond |.

S2-9Will K₀And

respectively inputting the data into a first full-connection network, wherein the layer comprises a Leaky ReLU activation function, and the function of the layer is to prevent the situation that the gradient disappears in the back propagation process; secondly, the dropout regularization operation is carried out on the layer, so that the phenomenon that the discriminant model after the training is successful is over-fitted is avoided. Respectively obtaining two 256-dimensional vectors after passing through a first layer network, and using K₁And

carrying out representation;

S2-10:K₁and

finally K is added₂And

and respectively outputting the identification results through a third layer of fully-connected network with the dimension of 256.

S3: expressing an objective optimization function of the dual-condition generation countermeasure network structure;

and for the generating model, the generating model is used for generating samples, and the loss function representation is to put the generated samples into the discriminant model, so that the generating samples can cheat the discriminant model, and calculate the mean square error between the generating element value and the real element value under the action of M. The design steps of the objective optimization function of the generated model are as follows:

s3-1, judging whether the generated data meets the given conditions and is fitted with the distribution of the real samples, and using the loss

Is expressed, the function expression will

Inputting the data into a discriminant model for discrimination, wherein the larger the value of the data, the larger the value of the data indicates X_impThe more a given condition is met, where i denotes the ith sample in the sample set, i 1.. 50, and

s3-2 calculation

And with

Mean square error of each element in each sample, its loss function

Represents:

the smaller the result is, the closer the generated data is to the real data, j represents the jth element of each vector in the sample set, and 'alpha' represents the hyper-parameter.

Alternatively, the over parameter α may be set to 0.6.

S3-3, calculating the total loss function of the generative model, namely:

d (X | cond) represents that the sample X and the condition vector cond are input to the discriminant model.

For the discrimination model, firstly, whether the generated sample meets the given sample category condition is identified, and secondly, whether the generated element value of the missing part under the action of (1-M) is fitted with the distribution of the real sample is calculated, so that the design process of the target optimization function of the discrimination model is as follows:

s3-4, for each sample, calculating

With true sample X_iThe cross entropy loss given condition cond is used to determine whether there is a match, i.e.:

s3-5, calculating Wasserstein distance in Wasserstein generated countermeasure network (Wasserstein general adaptive network)

And X_iThe similarity of the missing portion between (1-M),

representing the distance of the Wasserstein,

and

respectively represent

Set and X_iSet obeys P_g、P_rThe probability distribution, "gamma",e "is

Is determined. For loss of

Represents, i.e.:

wherein

Representing the "gradient penalty" mechanism in the Wasserstein distance.

Optionally, γ is 0.4, and e is 0.5;

s3-6, calculating the total loss function value of the discriminant model, namely:

wherein "δ" is

Is determined.

Optionally, the over-parameter δ may be set to 0.6;

s4, establishing a training data set of the data generation model, and training the dual-condition generation countermeasure network;

the training process of the dual-condition generation countermeasure network is as follows:

s4-1, setting the training iteration time epoch, wherein the optional epoch is 300;

s4-2, creating a condition vector cond;

s4-3, creating a mask vector M;

s4-4, randomly sampling m real samples X to X₁,X₂,...,X_mEncoding real samples;

s4-5 randomly sampling m noise samples Z to Z₁,Z₂,...,Z_mTo noiseThe sound is encoded and the missing sample X is calculated_missAnd input data Z', Z ═ M-_miss+(1-M)⊙Z；

Illustratively, the total number of samples in the real sample set and the noise sample set is 3000, and the encoded dimension of the samples is 157.

Alternatively, the number m of batches may be set to 10.

S4-6, inputting the processed noise and the dual condition into the generative model and outputting the generative sample of the generative model, i.e.

S4-7, calculating input X of discriminant model_imp：

S4-8, randomly sampling a real sample X_iAnd

composition of

Namely, it is

S4-9, introducing a gradient penalty mechanism:

wherein

Representing derivation of a discriminant model loss function;

s4-10, calculating a discriminant model total loss function:

s4-11, optimizing the discriminant model by using an Adam optimizer and updating parameters, wherein in the back propagation process, if the LD layer is shared, the first step is

The gradient update process of the layer is as follows:

s4-12, update the first

The parametric process of the layers is:

lr_Dto discriminate the learning rate of the model, wherein

And

respectively represent the discriminant model

Weights and biases of layers.

Optionally, the learning rate lr of discriminant model backpropagation_DMay be set to 0.0002;

s4-13, judging whether the generated data meet given conditions and are fitted with the distribution of real samples;

s4-14 calculating M [ < X > ]_missAnd

the mean square error of each element in each sample,

s4-15, calculating the total loss of the generative model,

s4-16, optimizing the generation model by using an Adam optimizer and updating parameters, wherein the generation model has L in total in the back propagation process_GLayer, then

The gradient update process of the layer is as follows:

s4-17 th

The parameter updating process of the layer is as follows:

lr_Ga learning rate to generate a model, wherein

And

respectively represent the birthday device network

Weights and biases of layers.

Optionally, the learning rate lr of model back propagation is generated_GMay be set to 0.0002.

S5: analyzing different data missing conditions, generating a countermeasure network by adopting a trained dual condition, and generating and filling missing data to construct a high-quality table training data set for machine learning model training supporting a big data application scene.

After the model is trained successfully, the confrontation network is generated by adopting the trained dual conditions, and missing data is generated and filled. Considering different situations of data missing and the problem of data imbalance in the training set, the optional data generation scheme is as follows:

(1) partial data loss

In the field of loan risk assessment, user information is lost because some data information is not provided or acquired or the user information is not operated by mistake, and the invention can be used for generating and filling the lost part under the condition, and mainly comprises the following steps:

s51, encoding the missing samples in the table data set, and creating a mask vector and a condition vector;

s52, filling the missing part of the sample with the encoded random noise to finally obtain the input data of the generation model;

and S53, creating a condition vector, splicing the condition vector with input data, and inputting the condition vector into the trained dual-condition generation confrontation network to obtain generated and filled sum sample data.

For example, in the loan risk assessment above, the pre-encoding 128-dimensional data and its encoded 157-dimensional data, of which 50-dimensional data is missing. Then the sample may be classified by its class conditions, such as: high risk [1,0,0] and existing 107-dimensional data, establishing a corresponding M vector, setting 0 in all missing 50 dimensions in the M vector, generating a countermeasure network through a trained dual condition, and generating and filling up missing 50-dimensional missing data of the high risk sample.

(2) Complete loss of data

There are two ways to deal with this case of a complete loss of data:

the method comprises the following steps: setting all mask vectors to 0;

s54: creating a mask vector completely filled with '0' according to a condition vector for which a sample is to be generated;

s55: using random noise pair to represent a sample to be generated to obtain generated network input data;

For example, in the loan risk assessment above, the 157 dimensional data is all replaced with noise samples, and the 157 dimensions of M are all set to 0. If a high-risk [1,0,0] sample needs to be generated, a countermeasure network is generated through a trained dual condition, a noise sample and a condition vector are input, and M is used for generating a brand-new high-risk sample data.

Optionally, an existing data generation model is used, for example: a conditional generation countermeasure network (CTGAN) or a tabular data generation countermeasure network (TGAN) based on tabular data is generated with a 50-dimensional sample therein as existing partial data as one condition of a dual condition. And generating a sample meeting the condition of a certain category according to the sample generation requirements of different conditions and the process of (1) partial data missing.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A method for generating missing data filling of a countermeasure network based on dual conditions is characterized in that: the method comprises the following steps:

s1: coding the sample data of the table type, and designing the sample type as a data generation condition and a representation mode of the existing data of the sample in the generation process;

s5: analyzing different data missing conditions, and generating a countermeasure network by adopting a trained dual condition to generate missing data and fill up the missing data so as to construct a high-quality table data training set which can be used for training other machine learning models.

2. The dual-condition-based missing data padding generation method for generation of countermeasure networks according to claim 1, wherein: in step S1, encoding the real sample data specifically includes the following steps:

s11: carrying out data standardization and data coding; the category data is coded in a one-hot coding mode; the numerical or hybrid data is encoded as follows:

suppose true sample data X in a tabular class dataset_iThere are n numerical type and mixed type variables, and eIf the type variable is N + e in total, the sample data is encoded by the scalar α_i,j、β_i，jOne-hot vector d for other categorical data_i，eSplicing to form the following parts:

wherein,

representing vector concatenations, finally encoded X_iThe dimension of (a) is u; alpha is alpha_i,jFor VGM encoded values, it is indicated that one VGM mode is sampled from a given probability density function, and the sampling mode is used to sample the jth column C of the ith sample_i，jCarrying out standardization and normalization; if the sampling mode is p_qThen C in the q-th mode_i,jExpressed as a scalar α_i,jAnd one-hot vector beta_i，j，β_i，j＝[h₁，...，h_q，...，h_k]Wherein h is₁，...，h_q，...，h_kRespectively represent the 1 st to k th elements, h, of a one-hot vector_q1, other values take 0;

the ith sample is subjected to VGM coding to obtain k modes, namely rho₁，ρ₂，...，ρ_kThe learned Gaussian mixture model is

Wherein

Probability density function, ω, representing a Gaussian model_q、η_qAnd σ_qWeight, mean and standard deviation of one modality, q ═ 1., k, respectively;

j column C for the ith sample_i，jCalculating the probability of the value in each mode, wherein the probability density is rho₁，ρ₂，...，ρ_k，

Sampling a mode from a given probability density function and using the mode pair C_i,jCarrying out standardization; if the sampling result is rho_qThen C in the q-th mode_i，jExpressed as a scalar α_i,j：

S12: conditional vector and mask vector

Taking a sample class label as one of the conditions for modeling, wherein a condition vector cond is a bit vector and is formed by one-hot codes generated by the sample class label, wherein a selected label value label is 1, and the condition vector is represented as cond [0, 0., label,.., 0,0 ];

mask vector M ═ M₁，...，M_d，...，M_N]，d＝1...N^*When M is_dWhen 0, the data representing the position is missing; when M is_dWhen the value is 1, the data is complete; the dimension of the element value 1 in M is | | M | non-woven₁And the dimension of the element value 0 is | |1-M | | non-conducting phosphor₁；M_dWhen all elements in the list are 0, the sample category is only taken as a condition;

That is, X_missAs one example, where M indicates an element-level multiplication between vectors.

3. The dual-condition-based missing data padding generation method for generation of countermeasure networks according to claim 1, wherein: the generated model in step S2 is composed of two residual error networks and a full link layer, the input part of the generated model includes input data and condition vectors, and the input data is filled with noise, and the work flow and detailed structure are as follows:

S22: by H₀To represent

And as an initial input, i.e.

Each of which

Of | cond | + | Z'_i|，i＝1...m；

s25: h is to be₂Inputting the vector into the last layer of fully-connected network, firstly converting the input vector into a scalar alpha through a tanh activation function_i,j(ii) a Then, two pocket softmax activation functions are respectively used for obtaining the coding vector beta of the continuous data_i，jAnd a code vector d of discrete data_i，eWherein a gum softmax activation function is used to convert the input into one-hot vectors;

s26: will be alpha_i,j、β_i，j、d_i，eSplicing and obtaining the final generated result of the generated model

Is expressed, i.e.

S27: will generate data

And is provided with

4. The dual-condition-based missing data padding generation method for generation of countermeasure networks according to claim 1, wherein: in step S2, the discriminant model is composed of three fully connected networks, and the work flow and detailed structure thereof are as follows:

is shown, i.e.

K₀And

respectively is | X_imp| X | + | cond | and | X | + | cond |;

s29: then K is put₀And

respectively inputting the data into a first layer of fully connected network, wherein the first layer of fully connected network comprises an Leaky ReLU activation function, and performing dropout regularization operation on the first layer of fully connected network; respectively obtaining two 256-dimensional vectors by K after passing through a first layer of fully-connected network₁And

carrying out representation;

S210：K₁and

finally K is added₂And

5. The method for generating missing data padding for countering networks based on dual condition generation as claimed in claim 1, wherein: the generating model is used for generating samples, the loss function representation of the generating model is to put the generated samples into the discriminant model, so that the generated samples can cheat the discriminant model, and the mean square error between the generated element value and the real element value under the action of M is calculated; for the discrimination model, firstly, identifying whether the generated sample meets the given sample category condition, and secondly, calculating whether the generated element value of the missing part under the action of (1-M) is fitted with the distribution of the real sample;

The function represents the ith sample in the sample set

The more satisfied a given condition is, i 1.. m, and

representing the ith sample in the missing sample set;

s32: calculate M [ < X >_missAnd

mean square error of each element in each sample, its loss function

Represents:

indicating the jth element of the ith sample in the missing sample set,

representing the jth element of the ith sample in the generated sample set;

s33: calculating the total loss function of the generative model, namely:

wherein

Representing a sample

Satisfying a given probability;

s35: calculating X using Wasserstein to generate Wasserstein distances in a countermeasure network_impAs a result, the similarity of the missing portion between (1-M) and X [ ((1-M)),

representing the distance of the Wasserstein and,

and

respectively represent X_impSet and X set obeys P_g、P_rProbability distribution, gamma, ∈ is

Over-parameters of, for loss

Represents:

wherein

Represents the "gradient penalty" mechanism in Wasserstein distance;

s36: calculating the total loss function value of the discriminant model:

wherein δ is

D (X | cond) represents that the sample X and the condition vector cond are input to the discriminant model, m represents the number of samples, and u represents the dimension of the encoded sample.

6. The dual-condition-based missing data padding generation method for generation of countermeasure networks according to claim 1, wherein: the procedure for training the dual-conditional generation countermeasure network described in step S4 is as follows:

s41: setting training iteration times epoch;

s42: creating a condition vector cond;

s43: creating a mask vector M;

s44: randomly sampling m real samples X to X₁，X₂，...，X_m}, encoding the real samples;

s45: randomly sampling m noise samples Z to Z₁，Z₂，...，Z_mH, encode the noise and compute the missing samples X_missAnd input data Z ', Z' ═ M | _ X_miss+(1-M)⊙Z；

S46: inputting the processed noise and the dual condition into the generative model and outputting the generative sample of the generative model, i.e.

S47：

S48: randomly sampling the E-bar real sample X_iAnd (1-e) bar

Composition of

S49: a gradient penalty mechanism is introduced:

representing derivation of a discriminant model loss function;

s410: calculating a total loss function of the discriminant model:

s411: optimizing the discriminant model by using an Adam optimizer and updating parameters, wherein the total L is obtained in the back propagation process_DLayer, then

The gradient update process of the layer is as follows:

wherein in_DRepresenting a weight matrix, B_DA bias matrix is represented that is a matrix of biases,

is shown to the first

The weight of (a) is differentiated,

denotes the L < th > element_DThe input of the +1 layer is,

is to the L < th > of_DThe output of the layer is differentiated,

representing differentiation of the LeakyReLu activation function,

an input of the L-th layer is represented,

is to the L < th > of_D-differentiation of the output of layer 1,

is shown as

The input of the layer(s) is (are),

is shown to the first

The output of the layer is differentiated,

is shown as

Inputting a layer;

s412: first, the

The parameter updating process of the layer comprises the following steps:

wherein lr is_DIn order to discriminate the learning rate of the model,

and

respectively represent the discriminant model

Weights and biases of layers;

s413: judging whether the generated data meet given conditions and whether the generated data are fitted with the distribution of the real sample;

s414: calculate M [ < X >_missAnd with

The mean square error of each element in each sample,

s415: the total loss of the generative model is calculated,

s416: the Adam optimizer is used for optimizing the generation model and updating parameters, and the backward propagation process has L in total_GLayer, then

The gradient update process of the layer is as follows:

wherein phi_aRepresenting a weight matrix, B_GA bias matrix is represented that is,

is to show to

The weight of (a) is differentiated,

represents L_GThe input of the +1 layer is,

is to the L < th > of_GThe output of the layer is differentiated,

an input of the L-th layer is represented,

is to the L < th > of_G-differentiation of the output of layer 1,

is shown as

The input of the layer(s) is (are),

is to show to

The output of the layer is differentiated,

is shown as

Inputting a layer;

s417: first, the

The parameter updating process of the layer is as follows:

wherein lr is_GIn order to generate the learning rate of the model,

and

respectively represent the birthday device network

Weights and biases of layers.

7. The dual-condition-based missing data padding generation method for generation of countermeasure networks according to claim 1, wherein: in step S5, in consideration of the existence of data loss and the requirement for data generation in the training data set, data generation operations are performed:

(1) partial data loss

s53: creating a condition vector, splicing the condition vector with input data, and inputting the condition vector and the input data into a trained dual-condition generation countermeasure network to obtain generated and filled sum sample data;

the data generation under the condition of partial data missing can be used for generating and filling sample data in a table type data set so as to obtain high-quality sample information based on the existing partial data;

(2) complete loss of data

There are two ways to deal with this case of a complete loss of data:

the method comprises the following steps: setting all mask vectors to 0;

s55: representing a sample to be generated by using random noise pairs to obtain input data of a generation model;

s56: splicing the condition vector and the input data, and inputting the spliced condition vector and the input data into a trained dual-condition generation countermeasure network to obtain generated sample data;

the second method comprises the following steps: generating a part of data as a known data condition by adopting an existing data generation model, and generating a confrontation network by adopting the double conditions to fill up missing data;

s58: inputting known data conditions and missing samples into the dual condition generation countermeasure network together to fill up the missing data;