CN109165664B - Attribute-missing data set completion and prediction method based on a generative adversarial network
- Publication number
- CN109165664B (application number CN201810722774.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- network
- missing
- filling
- prediction
- Prior art date: 2018-07-04
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for completing and predicting attribute-missing data sets based on a generative adversarial network, comprising the steps of: 1) applying min-max normalization to the data, one-hot encoding discrete attributes, and marking missing values as 0; 2) constructing a missing-position encoding vector for each sample from the data set; 3) building a generative adversarial network together with an auxiliary prediction network to perform data filling and label prediction; 4) restoring the filled values to their original scale using the per-attribute maxima and minima recorded before min-max normalization; 5) selecting suitable hyperparameters through testing. The invention makes full use of the data distribution information and label information in the data set and can effectively fill high-dimensional data sets with missing values; moreover, after training is complete, the auxiliary prediction network included in the method can directly output label predictions for input samples with missing attributes, giving a simple workflow and higher prediction accuracy.
Description
Technical Field
The present invention relates to the technical field of data preprocessing, and in particular to a method for completing and predicting attribute-missing data sets based on a generative adversarial network.
Background Art
Missing attributes are widespread in all kinds of data sets and are usually caused by information loss during data collection or transmission. When samples in a data set lose one or more attributes, the prediction accuracy of subsequently built prediction and classification models decreases. How to complete such missing data and exploit the information contained in samples with missing attributes to build high-accuracy prediction models is a key problem in data preprocessing.
Most statistical tools handle missing attributes by deleting the rows or columns containing missing samples, or by filling the missing positions with the column median or mean. Although such methods are efficient and convenient, they fail to fully exploit the distribution information of the sample data, leading to inaccurate results. In multi-dimensional data processing there are often many correlations between different attributes, and these correlations can provide additional information for data filling; filling methods that take such correlations into account have smaller bias when estimating missing values and can therefore mine the information contained in the missing samples more deeply.
On this basis, more advanced methods fill missing values by modeling. For example, regression imputation treats the missing attribute as the dependent variable and builds a regression equation for prediction; the EM algorithm first initializes the missing values and obtains the final filling result through iterations of the E-step and M-step; the k-nearest-neighbor (KNN) method computes Euclidean distances on the non-missing attributes to match the k most similar samples in the sample set and obtains the filling result by weighted averaging. With sufficient data these algorithms usually achieve more accurate results than mean or median filling; however, some problems remain: regression imputation requires a significant linear relationship between attributes; EM-based filling has high computational complexity and easily falls into local optima; KNN-based filling is simple to implement, but for large data sets its computational cost becomes prohibitively high.
In addition, the main purpose of data filling is to provide more complete data for subsequent modeling and prediction. The methods above do not involve the modeling process, yet the filled data is often correlated with the label to be predicted; combining the prediction model with the filling method allows the filled data to achieve better prediction performance. Traditional filling methods suffer from two problems when processing high-dimensional data: high computational complexity, and failure to fully exploit label information to correct the filling results. The present invention therefore fills data by learning the data distribution with a generative adversarial network, and at the same time builds an auxiliary prediction network that fully mines the association between data and labels so that their mutual information is maximized.
Summary of the Invention
The purpose of the present invention is to overcome the deficiencies of the prior art by proposing a method for completing and predicting attribute-missing data sets based on a generative adversarial network. The method makes full use of the data distribution information and label information in the data set and can effectively fill high-dimensional data sets with missing values; moreover, after training is complete, the auxiliary prediction network included in the method can directly output label predictions for input samples with missing attributes, giving a simple workflow and higher prediction accuracy.
To achieve the above purpose, the technical solution provided by the present invention is a method for completing and predicting attribute-missing data sets based on a generative adversarial network. First, data preprocessing is performed on the data set with missing attributes, mainly min-max normalization and one-hot encoding of discrete numerical variables. Then, for samples with missing attributes, a missing-position encoding vector is constructed to express where the values are missing. Next, a filling network for the missing data and an auxiliary prediction network are built to perform missing-data filling and label prediction simultaneously. After network training is finished, the output of the generator in the filling network is taken as the filling result, and the scale is restored using the column maxima and minima recorded during min-max normalization. Finally, the hyperparameters are set by repeatedly modifying them and observing the prediction loss on the validation set. The method comprises the following steps:
1) Data preprocessing;
2) Constructing the missing-position encoding vector;
3) Constructing the missing-data filling network and the auxiliary prediction network;
4) Restoring the scale of the filled data;
5) Testing and hyperparameter setting.
In step 1), different data types are preprocessed differently. The main data types involved are continuous values and discrete values. Continuous values are directly normalized with min-max; discrete values are first converted to one-hot encoding and then normalized with min-max, and missing positions are uniformly filled with 0. In addition, the data set is divided into two parts: samples with missing attributes and samples without missing attributes.
In step 2), the missing-position encoding vector is constructed. When filling data, the positions of a sample's missing attributes are themselves important information; when a neural network is used for filling, only these missing positions need to be filled. To construct the missing-position encoding vector, every column of every sample is traversed: if the attribute is missing it is recorded as "1", otherwise as "0". Following this procedure, every sample has a corresponding missing-position encoding vector.
In step 3), the missing-data filling network and the auxiliary prediction network are constructed. The network makes the following improvements over the original generative adversarial network: (1) the randomly sampled noise is removed from the input of the generator; (2) the generated data is combined with the missing-position encoding vector to form the filled data. In addition, the introduction of the auxiliary prediction network more fully considers the relationship between attributes and labels: while the attribute-missing data is used for prediction, the loss between the labels predicted by the auxiliary prediction network and the true labels is back-propagated (BP algorithm) to update the generator, so that the generated filled data performs better when building a prediction model. The loss function of the generative adversarial network and the loss function of the auxiliary prediction network are combined, and a hyperparameter controls their weight ratio, determining whether the distribution of the generated filled data should be closer to the distribution of the complete data or whether the prediction model should be made more accurate. The data filling network and the auxiliary prediction network comprise a generator, a discriminator and an auxiliary prediction network; the structure of these three networks is described in detail below:
Generator: the input is the concatenation of the attribute-missing data and its corresponding missing-position encoding vector. Depending on the structure of the data, the hidden layers can be built from fully connected layers or deconvolution layers; in particular, when the input data are images, deconvolution is used to obtain the generated filling data. Assume here that the input data, denoted I, is a 100-dimensional vector; the corresponding missing-position encoding vector, denoted E, is therefore also 100-dimensional, and the concatenated input vector has dimension 200. The hidden layers are fully connected layers with ReLU activation. The final output layer has 100 output units, denoted O, with a sigmoid activation. The filled data is finally formed as I·(1−E)+O·E.
Discriminator: the input has two parts, the first being the filled data obtained from the generator's output and the second being sample data without missing attributes; the output is a decimal between 0 and 1 representing the probability the discriminator assigns to the input coming from the data without missing attributes. The network structure depends on the input data type; when the input is image data, it is built from a convolutional neural network. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit with a sigmoid activation representing the probability.
Auxiliary prediction network: its input is exactly the same as the discriminator's, and its output is the predicted label for the input sample. When the prediction task is classification, cross-entropy is used as the loss function; when the prediction task is regression, the L2 norm or the L1 norm is used as the loss function. The network structure is set up in the same way as the discriminator. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit whose activation function is set as described above.
In step 4), the generated filled data is restored to its original scale. Since min-max normalization was applied in the preprocessing stage, the final filling result can be recovered from the recorded maximum and minimum of each attribute.
In step 5), testing and hyperparameter setting are performed. During training, the loss comes from two parts: the loss of the generative adversarial network and the prediction loss of the auxiliary prediction network. These two losses are combined with a ratio λ to obtain the overall loss, and different values of λ affect the training of the model. In practice, the data set is split into a training set and a test set, λ is set to 0.1, 0.3, 0.5, 0.7 and 0.9 in turn for training on the training set, and the test set is used for evaluation; the value of λ that minimizes the loss of the auxiliary prediction network on the test set is selected.
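As an illustration of how the two losses can be combined, the objective can be written in the following GAN-style form; this is only a sketch under the assumption of a standard adversarial term, since the exact functional form is not stated in the text:

```latex
% Sketch of the combined objective (assumed form): G = generator, D = discriminator,
% P = auxiliary predictor, \ell_{\mathrm{pred}} = cross-entropy or L1/L2 loss, \lambda = weight ratio.
\min_{G,\,P}\;\max_{D}\;
  \mathbb{E}_{x \sim p_{\mathrm{complete}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{\tilde{x} \sim p_{\mathrm{filled}}}\!\left[\log\!\left(1 - D(\tilde{x})\right)\right]
  + \lambda\,\mathbb{E}_{(\tilde{x},\,y)}\!\left[\ell_{\mathrm{pred}}\!\left(P(\tilde{x}),\,y\right)\right]
```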
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. Traditional filling methods such as median and mean filling are simple but their filling quality is limited, while KNN- and EM-based methods often have high time complexity; when processing high-dimensional data sets the computation becomes extremely large or even intractable. Generative adversarial networks are excellent at learning the distribution of high-dimensional data and can therefore handle high-dimensional data sets. In addition, samples without missing attributes and samples with missing attributes usually follow the same distribution, so making the filled data approximate, in distribution, the data set without missing attributes ensures that the filling result does not deviate from the data distribution and does not negatively affect the prediction model.
2. Traditional filling methods do not consider the effect of the filled data on the prediction results of the model built afterwards: they usually first fill the missing data to obtain a complete data set and then build a prediction model on it, so the prediction performance cannot be used to guide the filling. The present invention introduces an auxiliary prediction network that computes the loss between the predicted value of each filled sample and the true label and back-propagates it to guide the generator's data filling, so the performance of the filled data on the prediction model can be observed; combined with the discriminator loss, which limits the difference between the filled data and the real data distribution, good filling quality and good prediction results are achieved at the same time. Furthermore, after training, an end-to-end network is obtained: once data is input, the prediction result of the auxiliary prediction network can be obtained directly.
Description of the Drawings
Figure 1 is a flowchart of missing-data filling and prediction.
Figure 2 is the data-flow diagram of the generative adversarial filling network and the prediction network.
Detailed Description
The present invention is further described below with reference to a specific embodiment.
As shown in Figure 1, the method for completing and predicting attribute-missing data sets based on a generative adversarial network provided in this example is as follows:
1) Data preprocessing: different attributes have different data types, and the corresponding processing differs. The main data types involved are continuous values and discrete values. Continuous values are directly normalized with min-max; discrete values are first converted to one-hot encoding and then normalized with min-max, and missing positions are uniformly filled with 0. In addition, the data set is divided into two parts: samples with missing attributes and samples without missing attributes.
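A minimal sketch of this preprocessing step is given below, assuming the raw samples are held in a NumPy array with np.nan marking missing entries and that discrete attributes have already been one-hot encoded (e.g. with pandas.get_dummies); the function name and signature are illustrative, not taken from the patent.

```python
import numpy as np

def minmax_normalize(data):
    """Min-max normalize each attribute and set missing positions to 0.

    `data` is a 2-D float array with np.nan at missing positions; the recorded
    per-attribute extrema are returned for the later scale restoration (step 4).
    """
    col_min = np.nanmin(data, axis=0)                 # per-attribute minimum over observed values
    col_max = np.nanmax(data, axis=0)                 # per-attribute maximum over observed values
    scale = np.where(col_max > col_min, col_max - col_min, 1.0)
    normalized = (data - col_min) / scale             # observed values now lie in [0, 1]
    normalized = np.nan_to_num(normalized, nan=0.0)   # missing positions uniformly set to 0
    return normalized, col_min, col_max
```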
2) Constructing the missing-position encoding vector: when filling data, the positions of a sample's missing attributes are themselves important information; when a neural network is used for filling, only these missing positions need to be filled. To construct the missing-position encoding vector, every column of every sample is traversed: if the attribute is missing it is recorded as "1", otherwise as "0". Following this procedure, every sample has a corresponding missing-position encoding vector.
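A sketch of the missing-position encoding, under the same assumption that missing entries are marked with np.nan before they are replaced by 0:

```python
import numpy as np

def missing_position_encoding(raw_data):
    """Return a 0/1 matrix with 1 where an attribute is missing and 0 where it is observed."""
    return np.isnan(raw_data).astype(np.float32)
```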
3) Constructing the missing-data filling network and the auxiliary prediction network: the present invention proposes a combined network, based on a generative adversarial network joined with an auxiliary prediction network, that fills the data and makes predictions at the same time. The network makes the following improvements over the original generative adversarial network: (1) the sampled noise is removed from the input of the generator; (2) the generated data is combined with the missing-position encoding vector to form the filled data. In addition, the introduction of the auxiliary prediction network more fully considers the relationship between attributes and labels: while the attribute-missing data is used for prediction, the loss between the labels predicted by the auxiliary prediction network and the true labels is back-propagated (BP algorithm) to update the generator, so that the generated filled data performs better when building a prediction model. The loss function of the generative adversarial network and the loss function of the auxiliary prediction network are combined, and a hyperparameter controls their weight ratio, determining whether the distribution of the generated filled data should be closer to the distribution of the complete data or whether the prediction model should be made more accurate. Figure 2 shows the structure of the data filling network and the auxiliary prediction network, the core of the present invention, comprising a generator, a discriminator and an auxiliary prediction network; the structure of these three networks is described in detail below:
Generator: the input is the concatenation of the attribute-missing data and its corresponding missing-position encoding vector. Depending on the structure of the data, the hidden layers can be built from fully connected layers or deconvolution layers; in particular, when the input data are images, deconvolution layers are used to obtain the generated filling data. Assume here that the input data (denoted I) is a 100-dimensional vector; the corresponding missing-position encoding vector (denoted E) is therefore also 100-dimensional, and the concatenated input vector has dimension 200. The hidden layers are fully connected layers with ReLU activation. The final output layer has 100 output units (denoted O) with a sigmoid activation. The filled data is finally formed as I·(1−E)+O·E.
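A minimal PyTorch sketch of the generator under the 100-dimensional example above; the hidden width and number of hidden layers are illustrative assumptions, and only the input/output sizes and the fill rule I·(1−E)+O·E follow the text.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, dim=100, hidden=128):
        super().__init__()
        # Input: attribute-missing sample I concatenated with its missing-position vector E.
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim), nn.Sigmoid(),   # 100 output units O in [0, 1]
        )

    def forward(self, i, e):
        o = self.net(torch.cat([i, e], dim=1))
        # Keep observed values; use generated values only at the missing positions.
        return i * (1 - e) + o * e
```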
Discriminator: the input has two parts, the first being the filled data obtained from the generator's output and the second being sample data without missing attributes; the output is a decimal between 0 and 1 representing the probability the discriminator assigns to the input coming from the data without missing attributes. The network structure depends on the input data type; when the input is image data, it can be built from a convolutional neural network. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit with a sigmoid activation representing the probability.
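A matching sketch of the fully connected discriminator, again with an assumed hidden width:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Outputs the probability that the input comes from the data without missing attributes."""
    def __init__(self, dim=100, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```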
Auxiliary prediction network: its input is exactly the same as the discriminator's, and its output is the predicted label for the input sample. When the prediction task is classification, cross-entropy is used as the loss function; when the prediction task is regression, the L2 norm or the L1 norm is used as the loss function. The network structure is set up in the same way as the discriminator. Assuming here that the input is a 100-dimensional vector, the hidden layers can be fully connected layers with ReLU activation; the output layer contains a single unit whose activation function is set as described above.
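A sketch of the auxiliary prediction network with the single sigmoid output unit described above (suitable for a binary classification label); the hidden width is an assumption, and the loss choices in the comments mirror the text.

```python
import torch.nn as nn

class AuxiliaryPredictor(nn.Module):
    """Same input as the discriminator; outputs a label prediction."""
    def __init__(self, dim=100, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # single output unit, as in the text
        )

    def forward(self, x):
        return self.net(x)

# Loss choice, following the text:
#   classification -> binary cross-entropy, e.g. nn.BCELoss()
#   regression     -> L2 loss nn.MSELoss() or L1 loss nn.L1Loss()
```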
4) Restoring the scale of the filled data: since min-max normalization was applied in the preprocessing stage, the final filling result can be recovered from the recorded maximum and minimum of each attribute.
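A one-line sketch of the inverse transformation, using the extrema recorded during preprocessing:

```python
def restore_scale(filled, col_min, col_max):
    """Invert the min-max normalization with the recorded per-attribute extrema."""
    return filled * (col_max - col_min) + col_min
```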
5) Testing and hyperparameter setting: during training, the loss comes from two parts, the loss of the generative adversarial network and the prediction loss of the auxiliary prediction network. These two losses are combined with a ratio λ to obtain the overall loss, and different values of λ affect the training of the model. In practice, the data set is split into a training set and a test set, λ is set to 0.1, 0.3, 0.5, 0.7 and 0.9 in turn for training on the training set, and the test set is used for evaluation; the value of λ that minimizes the loss of the auxiliary prediction network on the test set is selected.
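A sketch of this λ search; train_model and predictor_loss_on_test are hypothetical helpers standing in for one full training run with the weighted combined loss and for evaluating the auxiliary predictor on the test set.

```python
candidate_lambdas = [0.1, 0.3, 0.5, 0.7, 0.9]

best_lam, best_loss = None, float("inf")
for lam in candidate_lambdas:
    model = train_model(lam)                # hypothetical: trains G, D and P with L_GAN + lam * L_pred
    loss = predictor_loss_on_test(model)    # hypothetical: auxiliary-prediction loss on the test set
    if loss < best_loss:
        best_lam, best_loss = lam, loss
print(f"selected lambda = {best_lam}")
```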
The embodiment described above is only a preferred embodiment of the present invention and is not intended to limit the scope of implementation of the present invention; therefore any change made according to the shape and principle of the present invention shall fall within the protection scope of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810722774.3A CN109165664B (en) | 2018-07-04 | 2018-07-04 | Attribute-missing data set completion and prediction method based on generation of countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810722774.3A CN109165664B (en) | 2018-07-04 | 2018-07-04 | Attribute-missing data set completion and prediction method based on generation of countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109165664A CN109165664A (en) | 2019-01-08 |
CN109165664B true CN109165664B (en) | 2020-09-22 |
Family
ID=64897277
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810722774.3A Expired - Fee Related CN109165664B (en) | 2018-07-04 | 2018-07-04 | Attribute-missing data set completion and prediction method based on generation of countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165664B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20170137350A (en) * | 2016-06-03 | 2017-12-13 | (주)싸이언테크 | Apparatus and method for studying pattern of moving objects using adversarial deep generative model |
CN106952239A (en) * | 2017-03-28 | 2017-07-14 | 厦门幻世网络科技有限公司 | image generating method and device |
CN107133934A (en) * | 2017-05-18 | 2017-09-05 | 北京小米移动软件有限公司 | Image completion method and device |
AU2017101166A4 (en) * | 2017-08-25 | 2017-11-02 | Lai, Haodong MR | A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks |
CN107945118A (en) * | 2017-10-30 | 2018-04-20 | 南京邮电大学 | A kind of facial image restorative procedure based on production confrontation network |
CN107945140A (en) * | 2017-12-20 | 2018-04-20 | 中国科学院深圳先进技术研究院 | A kind of image repair method, device and equipment |
Non-Patent Citations (1)
Title |
---|
GAIN: Missing Data Imputation using Generative Adversarial Nets; Jinsung Yoon et al.; Proceedings of the 35th International Conference on Machine Learning; 2018-06-07; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN109165664A (en) | 2019-01-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200922 |