CN109165664B - Attribute-missing data set completion and prediction method based on a generative adversarial network
- Publication number
- CN109165664B (application number CN201810722774.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- network
- missing
- filling
- prediction
- Prior art date: 2018-07-04
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for completing and predicting attribute-missing data sets based on a generative adversarial network, comprising the steps of: 1) applying min-max normalization to the data, one-hot encoding discrete attributes, and marking missing values as 0; 2) constructing a missing-position encoding vector for each sample from the data set; 3) building a generative adversarial network together with an auxiliary prediction network to perform data filling and label prediction; 4) restoring the filled values to their original scale using the per-attribute maxima and minima recorded before min-max normalization; 5) selecting suitable hyperparameters through testing. The invention makes full use of the data distribution information and label information in the data set and can effectively fill high-dimensional data sets with missing values; moreover, after training is complete, the auxiliary prediction network included in the method can directly output label predictions for input samples with missing attributes, giving a simple workflow and higher prediction accuracy.
Description
Technical Field
The present invention relates to the technical field of data preprocessing, and in particular to a method for completing and predicting attribute-missing data sets based on a generative adversarial network.
Background Art
Missing attributes are widespread in all kinds of data sets and are usually caused by information loss during data collection or transmission. When samples in a data set lose one or more attributes, the prediction accuracy of subsequently built prediction and classification models decreases. How to complete such missing data and exploit the information contained in samples with missing attributes to build high-accuracy prediction models is a key problem in data preprocessing.
Most statistical tools handle missing attributes by deleting the rows or columns containing missing samples, or by filling the missing positions with the column median or mean. Although such methods are efficient and convenient, they fail to fully exploit the distribution information of the sample data, leading to inaccurate results. In multi-dimensional data processing there are often many correlations between different attributes, and these correlations can provide additional information for data filling; filling methods that take such correlations into account have smaller bias when estimating missing values and can therefore mine the information contained in the missing samples more deeply.
On this basis, more advanced methods fill missing values by modeling. For example, regression imputation treats the missing attribute as the dependent variable and builds a regression equation for prediction; the EM algorithm first initializes the missing values and obtains the final filling result through iterations of the E-step and M-step; the k-nearest-neighbor (KNN) method computes Euclidean distances on the non-missing attributes to match the k most similar samples in the sample set and obtains the filling result by weighted averaging. With sufficient data these algorithms usually achieve more accurate results than mean or median filling; however, some problems remain: regression imputation requires a significant linear relationship between attributes; EM-based filling has high computational complexity and easily falls into local optima; KNN-based filling is simple to implement, but for large data sets its computational cost becomes prohibitively high.
In addition, the main purpose of data filling is to provide more complete data for subsequent modeling and prediction. The methods above do not involve the modeling process, yet the filled data is often correlated with the label to be predicted; combining the prediction model with the filling method allows the filled data to achieve better prediction performance. Traditional filling methods suffer from two problems when processing high-dimensional data: high computational complexity, and failure to fully exploit label information to correct the filling results. The present invention therefore fills data by learning the data distribution with a generative adversarial network, and at the same time builds an auxiliary prediction network that fully mines the association between data and labels so that their mutual information is maximized.
Summary of the Invention
The purpose of the present invention is to overcome the deficiencies of the prior art by proposing a method for completing and predicting attribute-missing data sets based on a generative adversarial network. The method makes full use of the data distribution information and label information in the data set and can effectively fill high-dimensional data sets with missing values; moreover, after training is complete, the auxiliary prediction network included in the method can directly output label predictions for input samples with missing attributes, giving a simple workflow and higher prediction accuracy.
To achieve the above purpose, the technical solution provided by the present invention is a method for completing and predicting attribute-missing data sets based on a generative adversarial network. First, data preprocessing is performed on the data set with missing attributes, mainly min-max normalization and one-hot encoding of discrete numerical variables. Then, for samples with missing attributes, a missing-position encoding vector is constructed to express where the values are missing. Next, a filling network for the missing data and an auxiliary prediction network are built to perform missing-data filling and label prediction simultaneously. After network training is finished, the output of the generator in the filling network is taken as the filling result, and the scale is restored using the column maxima and minima recorded during min-max normalization. Finally, the hyperparameters are set by repeatedly modifying them and observing the prediction loss on the validation set. The method comprises the following steps:
1) Data preprocessing;
2) Constructing the missing-position encoding vector;
3) Constructing the missing-data filling network and the auxiliary prediction network;
4) Restoring the scale of the filled data;
5) Testing and hyperparameter setting.
In step 1), different data types are preprocessed differently. The main data types involved are continuous values and discrete values. Continuous values are directly normalized with min-max; discrete values are first converted to one-hot encoding and then normalized with min-max, and missing positions are uniformly filled with 0. In addition, the data set is divided into two parts: samples with missing attributes and samples without missing attributes.
In step 2), the missing-position encoding vector is constructed. When filling data, the positions of a sample's missing attributes are themselves important information; when a neural network is used for filling, only these missing positions need to be filled. To construct the missing-position encoding vector, every column of every sample is traversed: if the attribute is missing it is recorded as "1", otherwise as "0". Following this procedure, every sample has a corresponding missing-position encoding vector.
In step 3), the missing-data filling network and the auxiliary prediction network are constructed. The network makes the following improvements over the original generative adversarial network: (1) the randomly sampled noise is removed from the input of the generator; (2) the generated data is combined with the missing-position encoding vector to form the filled data. In addition, the introduction of the auxiliary prediction network more fully considers the relationship between attributes and labels: while the attribute-missing data is used for prediction, the loss between the labels predicted by the auxiliary prediction network and the true labels is back-propagated (BP algorithm) to update the generator, so that the generated filled data performs better when building a prediction model. The loss function of the generative adversarial network and the loss function of the auxiliary prediction network are combined, and a hyperparameter controls their weight ratio, determining whether the distribution of the generated filled data should be closer to the distribution of the complete data or whether the prediction model should be made more accurate. The data filling network and the auxiliary prediction network comprise a generator, a discriminator and an auxiliary prediction network; the structure of these three networks is described in detail below:
Generator: the input is the concatenation of the attribute-missing data and its corresponding missing-position encoding vector. Depending on the structure of the data, the hidden layers can be built from fully connected layers or deconvolution layers; in particular, when the input data are images, deconvolution is used to obtain the generated filling data. Assume here that the input data, denoted I, is a 100-dimensional vector; the corresponding missing-position encoding vector, denoted E, is therefore also 100-dimensional, and the concatenated input vector has dimension 200. The hidden layers are fully connected layers with ReLU activation. The final output layer has 100 output units, denoted O, with a sigmoid activation. The filled data is finally formed as I·(1−E)+O·E.
Discriminator: the input has two parts, the first being the filled data obtained from the generator's output and the second being sample data without missing attributes; the output is a decimal between 0 and 1 representing the probability the discriminator assigns to the input coming from the data without missing attributes. The network structure depends on the input data type; when the input is image data, it is built from a convolutional neural network. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit with a sigmoid activation representing the probability.
Auxiliary prediction network: its input is exactly the same as the discriminator's, and its output is the predicted label for the input sample. When the prediction task is classification, cross-entropy is used as the loss function; when the prediction task is regression, the L2 norm or the L1 norm is used as the loss function. The network structure is set up in the same way as the discriminator. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit whose activation function is set as described above.
In step 4), the generated filled data is restored to its original scale. Since min-max normalization was applied in the preprocessing stage, the final filling result can be recovered from the recorded maximum and minimum of each attribute.
In step 5), testing and hyperparameter setting are performed. During training, the loss comes from two parts: the loss of the generative adversarial network and the prediction loss of the auxiliary prediction network. These two losses are combined with a ratio λ to obtain the overall loss, and different values of λ affect the training of the model. In practice, the data set is split into a training set and a test set, λ is set to 0.1, 0.3, 0.5, 0.7 and 0.9 in turn for training on the training set, and the test set is used for evaluation; the value of λ that minimizes the loss of the auxiliary prediction network on the test set is selected.
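As an illustration of how the two losses can be combined, the objective can be written in the following GAN-style form; this is only a sketch under the assumption of a standard adversarial term, since the exact functional form is not stated in the text:

```latex
% Sketch of the combined objective (assumed form): G = generator, D = discriminator,
% P = auxiliary predictor, \ell_{\mathrm{pred}} = cross-entropy or L1/L2 loss, \lambda = weight ratio.
\min_{G,\,P}\;\max_{D}\;
  \mathbb{E}_{x \sim p_{\mathrm{complete}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{\tilde{x} \sim p_{\mathrm{filled}}}\!\left[\log\!\left(1 - D(\tilde{x})\right)\right]
  + \lambda\,\mathbb{E}_{(\tilde{x},\,y)}\!\left[\ell_{\mathrm{pred}}\!\left(P(\tilde{x}),\,y\right)\right]
```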
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. Traditional filling methods such as median and mean filling are simple but their filling quality is limited, while KNN- and EM-based methods often have high time complexity; when processing high-dimensional data sets the computation becomes extremely large or even intractable. Generative adversarial networks are excellent at learning the distribution of high-dimensional data and can therefore handle high-dimensional data sets. In addition, samples without missing attributes and samples with missing attributes usually follow the same distribution, so making the filled data approximate, in distribution, the data set without missing attributes ensures that the filling result does not deviate from the data distribution and does not negatively affect the prediction model.
2. Traditional filling methods do not consider the effect of the filled data on the prediction results of the model built afterwards: they usually first fill the missing data to obtain a complete data set and then build a prediction model on it, so the prediction performance cannot be used to guide the filling. The present invention introduces an auxiliary prediction network that computes the loss between the predicted value of each filled sample and the true label and back-propagates it to guide the generator's data filling, so the performance of the filled data on the prediction model can be observed; combined with the discriminator loss, which limits the difference between the filled data and the real data distribution, good filling quality and good prediction results are achieved at the same time. Furthermore, after training, an end-to-end network is obtained: once data is input, the prediction result of the auxiliary prediction network can be obtained directly.
Description of the Drawings
Figure 1 is a flowchart of missing-data filling and prediction.
Figure 2 is the data-flow diagram of the generative adversarial filling network and the prediction network.
Detailed Description
The present invention is further described below with reference to a specific embodiment.
As shown in Figure 1, the method for completing and predicting attribute-missing data sets based on a generative adversarial network provided in this example is as follows:
1) Data preprocessing: different attributes have different data types, and the corresponding processing differs. The main data types involved are continuous values and discrete values. Continuous values are directly normalized with min-max; discrete values are first converted to one-hot encoding and then normalized with min-max, and missing positions are uniformly filled with 0. In addition, the data set is divided into two parts: samples with missing attributes and samples without missing attributes.
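A minimal sketch of this preprocessing step is given below, assuming the raw samples are held in a NumPy array with np.nan marking missing entries and that discrete attributes have already been one-hot encoded (e.g. with pandas.get_dummies); the function name and signature are illustrative, not taken from the patent.

```python
import numpy as np

def minmax_normalize(data):
    """Min-max normalize each attribute and set missing positions to 0.

    `data` is a 2-D float array with np.nan at missing positions; the recorded
    per-attribute extrema are returned for the later scale restoration (step 4).
    """
    col_min = np.nanmin(data, axis=0)                 # per-attribute minimum over observed values
    col_max = np.nanmax(data, axis=0)                 # per-attribute maximum over observed values
    scale = np.where(col_max > col_min, col_max - col_min, 1.0)
    normalized = (data - col_min) / scale             # observed values now lie in [0, 1]
    normalized = np.nan_to_num(normalized, nan=0.0)   # missing positions uniformly set to 0
    return normalized, col_min, col_max
```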
2) Constructing the missing-position encoding vector: when filling data, the positions of a sample's missing attributes are themselves important information; when a neural network is used for filling, only these missing positions need to be filled. To construct the missing-position encoding vector, every column of every sample is traversed: if the attribute is missing it is recorded as "1", otherwise as "0". Following this procedure, every sample has a corresponding missing-position encoding vector.
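A sketch of the missing-position encoding, under the same assumption that missing entries are marked with np.nan before they are replaced by 0:

```python
import numpy as np

def missing_position_encoding(raw_data):
    """Return a 0/1 matrix with 1 where an attribute is missing and 0 where it is observed."""
    return np.isnan(raw_data).astype(np.float32)
```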
3) Constructing the missing-data filling network and the auxiliary prediction network: the present invention proposes a combined network, based on a generative adversarial network joined with an auxiliary prediction network, that fills the data and makes predictions at the same time. The network makes the following improvements over the original generative adversarial network: (1) the sampled noise is removed from the input of the generator; (2) the generated data is combined with the missing-position encoding vector to form the filled data. In addition, the introduction of the auxiliary prediction network more fully considers the relationship between attributes and labels: while the attribute-missing data is used for prediction, the loss between the labels predicted by the auxiliary prediction network and the true labels is back-propagated (BP algorithm) to update the generator, so that the generated filled data performs better when building a prediction model. The loss function of the generative adversarial network and the loss function of the auxiliary prediction network are combined, and a hyperparameter controls their weight ratio, determining whether the distribution of the generated filled data should be closer to the distribution of the complete data or whether the prediction model should be made more accurate. Figure 2 shows the structure of the data filling network and the auxiliary prediction network, the core of the present invention, comprising a generator, a discriminator and an auxiliary prediction network; the structure of these three networks is described in detail below:
Generator: the input is the concatenation of the attribute-missing data and its corresponding missing-position encoding vector. Depending on the structure of the data, the hidden layers can be built from fully connected layers or deconvolution layers; in particular, when the input data are images, deconvolution layers are used to obtain the generated filling data. Assume here that the input data (denoted I) is a 100-dimensional vector; the corresponding missing-position encoding vector (denoted E) is therefore also 100-dimensional, and the concatenated input vector has dimension 200. The hidden layers are fully connected layers with ReLU activation. The final output layer has 100 output units (denoted O) with a sigmoid activation. The filled data is finally formed as I·(1−E)+O·E.
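A minimal PyTorch sketch of the generator under the 100-dimensional example above; the hidden width and number of hidden layers are illustrative assumptions, and only the input/output sizes and the fill rule I·(1−E)+O·E follow the text.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, dim=100, hidden=128):
        super().__init__()
        # Input: attribute-missing sample I concatenated with its missing-position vector E.
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim), nn.Sigmoid(),   # 100 output units O in [0, 1]
        )

    def forward(self, i, e):
        o = self.net(torch.cat([i, e], dim=1))
        # Keep observed values; use generated values only at the missing positions.
        return i * (1 - e) + o * e
```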
Discriminator: the input has two parts, the first being the filled data obtained from the generator's output and the second being sample data without missing attributes; the output is a decimal between 0 and 1 representing the probability the discriminator assigns to the input coming from the data without missing attributes. The network structure depends on the input data type; when the input is image data, it can be built from a convolutional neural network. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit with a sigmoid activation representing the probability.
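A matching sketch of the fully connected discriminator, again with an assumed hidden width:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Outputs the probability that the input comes from the data without missing attributes."""
    def __init__(self, dim=100, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```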
Auxiliary prediction network: its input is exactly the same as the discriminator's, and its output is the predicted label for the input sample. When the prediction task is classification, cross-entropy is used as the loss function; when the prediction task is regression, the L2 norm or the L1 norm is used as the loss function. The network structure is set up in the same way as the discriminator. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit whose activation function is set as described above.
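A sketch of the auxiliary prediction network with the single sigmoid output unit described above (suitable for a binary classification label); the hidden width is an assumption, and the loss choices in the comments mirror the text.

```python
import torch.nn as nn

class AuxiliaryPredictor(nn.Module):
    """Same input as the discriminator; outputs a label prediction."""
    def __init__(self, dim=100, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # single output unit, as in the text
        )

    def forward(self, x):
        return self.net(x)

# Loss choice, following the text:
#   classification -> binary cross-entropy, e.g. nn.BCELoss()
#   regression     -> L2 loss nn.MSELoss() or L1 loss nn.L1Loss()
```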
4) Restoring the scale of the filled data: since min-max normalization was applied in the preprocessing stage, the final filling result can be recovered from the recorded maximum and minimum of each attribute.
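A one-line sketch of the inverse transformation, using the extrema recorded during preprocessing:

```python
def restore_scale(filled, col_min, col_max):
    """Invert the min-max normalization with the recorded per-attribute extrema."""
    return filled * (col_max - col_min) + col_min
```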
5) Testing and hyperparameter setting: during training, the loss comes from two parts, the loss of the generative adversarial network and the prediction loss of the auxiliary prediction network. These two losses are combined with a ratio λ to obtain the overall loss, and different values of λ affect the training of the model. In practice, the data set is split into a training set and a test set, λ is set to 0.1, 0.3, 0.5, 0.7 and 0.9 in turn for training on the training set, and the test set is used for evaluation; the value of λ that minimizes the loss of the auxiliary prediction network on the test set is selected.
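A sketch of this λ search; train_model and predictor_loss_on_test are hypothetical helpers standing in for one full training run with the weighted combined loss and for evaluating the auxiliary predictor on the test set.

```python
candidate_lambdas = [0.1, 0.3, 0.5, 0.7, 0.9]

best_lam, best_loss = None, float("inf")
for lam in candidate_lambdas:
    model = train_model(lam)                # hypothetical: trains G, D and P with L_GAN + lam * L_pred
    loss = predictor_loss_on_test(model)    # hypothetical: auxiliary-prediction loss on the test set
    if loss < best_loss:
        best_lam, best_loss = lam, loss
print(f"selected lambda = {best_lam}")
```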
The embodiment described above is only a preferred embodiment of the present invention and is not intended to limit the scope of implementation of the present invention; therefore any change made according to the shape and principle of the present invention shall fall within the protection scope of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810722774.3A CN109165664B (en) | 2018-07-04 | 2018-07-04 | Attribute-missing data set completion and prediction method based on generation of countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810722774.3A CN109165664B (en) | 2018-07-04 | 2018-07-04 | Attribute-missing data set completion and prediction method based on generation of countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109165664A CN109165664A (en) | 2019-01-08 |
CN109165664B true CN109165664B (en) | 2020-09-22 |
Family
ID=64897277
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810722774.3A Expired - Fee Related CN109165664B (en) | 2018-07-04 | 2018-07-04 | Attribute-missing data set completion and prediction method based on generation of countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165664B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20170137350A (en) * | 2016-06-03 | 2017-12-13 | (주)싸이언테크 | Apparatus and method for studying pattern of moving objects using adversarial deep generative model |
CN106952239A (en) * | 2017-03-28 | 2017-07-14 | 厦门幻世网络科技有限公司 | image generating method and device |
CN107133934A (en) * | 2017-05-18 | 2017-09-05 | 北京小米移动软件有限公司 | Image completion method and device |
AU2017101166A4 (en) * | 2017-08-25 | 2017-11-02 | Lai, Haodong MR | A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks |
CN107945118A (en) * | 2017-10-30 | 2018-04-20 | 南京邮电大学 | A kind of facial image restorative procedure based on production confrontation network |
CN107945140A (en) * | 2017-12-20 | 2018-04-20 | 中国科学院深圳先进技术研究院 | A kind of image repair method, device and equipment |
Non-Patent Citations (1)
Title |
---|
GAIN: Missing Data Imputation using Generative Adversarial Nets; Jinsung Yoon et al.; Proceedings of the 35th International Conference on Machine Learning; 2018-06-07; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN109165664A (en) | 2019-01-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200922 |