CN109165664A

CN109165664A - Method for completing and predicting on attribute-missing data sets based on a generative adversarial network

Info

Publication number: CN109165664A
Application number: CN201810722774.3A
Authority: CN
Inventors: 赵跃龙; 王禹
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2018-07-04
Filing date: 2018-07-04
Publication date: 2019-01-08
Anticipated expiration: 2038-07-04
Also published as: CN109165664B

Abstract

The invention discloses a method for completing and predicting on attribute-missing data sets based on a generative adversarial network, comprising the steps of: 1) applying min-max normalization to the data, encoding discrete attributes with one-hot encoding, and marking missing values as 0; 2) building a missing-position encoding vector for each sample from the data set; 3) constructing a generative adversarial network and an auxiliary prediction network to fill the data and predict labels; 4) restoring the filled values to their scale before min-max normalization using the maximum and minimum of each attribute; 5) choosing suitable hyperparameters by testing. The invention makes full use of the data distribution information and the label information in the data set and can effectively fill high-dimensional data sets with missing values. Once training is complete, the auxiliary prediction network included in the method can output a label prediction directly for input data with missing attributes; the process is simple and direct and achieves higher prediction accuracy.

Description

Method for completing and predicting on attribute-missing data sets based on a generative adversarial network

Technical field

The present invention relates to the technical field of data preprocessing, and in particular to a method for completing and predicting on attribute-missing data sets based on a generative adversarial network.

Background art

Missing attributes are widespread in all kinds of data sets and are usually caused by information loss during data acquisition or transmission. Samples that lack one or more attributes reduce the accuracy of prediction and classification models built on them afterwards. How to complete these missing data, and how to use the information contained in samples with missing attributes to build a high-accuracy prediction model, are key problems in data preprocessing.

Most statistical tools handle missing attributes by deleting the rows or columns that contain missing values, or by filling missing positions with the column median or mean. Such approaches are efficient and convenient, but they do not make full use of the distribution information in the sample data, which leads to inaccurate results. In multidimensional data, different attributes are often correlated, and these correlations can provide additional information for filling. Filling methods that take such correlations into account estimate missing values with smaller bias and can therefore mine the information contained in samples with missing values more deeply.

On this basis, more advanced filling methods fill missing values by modeling. For example, regression imputation treats the missing attribute as the dependent variable and fits a regression equation to predict it; the EM algorithm first initializes the missing values and then iterates the E step and the M step to obtain the final filling result; the k-nearest-neighbor (KNN) algorithm computes Euclidean distances over the attributes that are not missing, finds the k most similar samples in the sample set, and obtains the filling result by weighted averaging. Given enough data, these algorithms usually fill more accurately than the mean or median, but they also have problems: regression imputation requires a significant linear relationship between attributes; filling based on the EM algorithm has high computational complexity and easily falls into local optima; filling based on k nearest neighbors is simple to implement, but on large data volumes its computational cost makes it difficult to apply.

In addition, the main purpose of data filling is to provide more complete data for subsequent modeling and prediction. The methods above do not involve the modeling process, yet the filled data are often correlated with the label to be predicted; combining the prediction model with the filling method allows the filled data to give a better prediction result. Traditional data filling methods have two problems: high computational complexity when handling high-dimensional data, and failure to fully exploit label information to correct the filling result. To address them, the present invention fills data based on the data distribution learned by a generative adversarial network, and at the same time builds an auxiliary prediction network to fully mine the association between data and labels so that their mutual information is maximized.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by proposing a method for completing and predicting on attribute-missing data sets based on a generative adversarial network. The method makes full use of the data distribution information and label information in the data set and can effectively fill high-dimensional data sets with missing values. Once training is complete, the auxiliary prediction network included in the method can output a label prediction directly for input data with missing attributes; the process is simple and direct and achieves higher prediction accuracy.

To achieve the above object, the technical solution provided by the invention is a method for completing and predicting on attribute-missing data sets based on a generative adversarial network. First, the attribute-missing data set is preprocessed, mainly by min-max normalization and one-hot encoding of discrete variables. Next, for samples with missing attributes, a missing-position encoding vector is constructed to express where values are missing. Then a filling network for the missing data and an auxiliary prediction network are constructed, which fill the missing data and predict labels at the same time. After network training is complete, the output of the generator network in the filling network is taken as the filling result, and its scale is restored using the per-column maximum and minimum recorded during min-max normalization. Finally, the hyperparameter is set by repeatedly modifying it and observing the prediction loss on the validation set. The method comprises the following steps:

1) data preprocessing;

2) constructing the missing-position encoding vector;

3) constructing the missing-data filling network and the auxiliary prediction network;

4) restoring the scale of the filled data;

5) testing and hyperparameter setting.

In step 1), different types of data are preprocessed differently. The main data types involved are continuous values and discrete values. Continuous values are normalized directly with min-max normalization; discrete values are converted to one-hot encoding and then normalized with min-max normalization; missing positions are uniformly filled with 0. In addition, the data set is divided into two parts according to whether attributes are missing: data with missing attributes and data without missing attributes.

In step 2), the missing-position encoding vector is constructed as follows. In data filling, the positions of a sample's missing attributes are themselves important information: when filling with a neural network, only these missing positions need to be filled. To construct the vector, every column of every sample is traversed; a missing attribute is recorded as "1" and any other attribute as "0". After this procedure, each sample has a corresponding missing-position encoding vector.

In step 3), the missing-data filling network and the auxiliary prediction network are constructed as follows. The network makes the following improvements to the original generative adversarial network: (1) the randomly sampled noise is removed from the input of the generator network; (2) the filled data are composed from the generated data and the missing-position encoding vector. In addition, the auxiliary prediction network is introduced to take fuller account of the relationship between attributes and labels. It makes predictions from the attribute-missing data at the same time, and the loss between its predicted labels and the true labels is back-propagated with the BP algorithm to update the generator network, so that the generated filling data work better when a prediction model is built. The loss function of the generative adversarial network and the loss function of the auxiliary prediction network are combined, with a hyperparameter controlling their relative weight; this determines whether the generated filling data are pushed closer to the distribution of the complete data or toward making the prediction model more accurate. The structure of the data filling network and auxiliary prediction network comprises a generator network, a discriminator network, and an auxiliary prediction network; the structure of these three networks is described in detail below:

Generator network: the input is formed by concatenating the data with missing attributes and the corresponding missing-position encoding vector. Depending on the structure of the data, the hidden layers can be built from fully connected layers or deconvolution layers; in particular, when the input data are images, the filling data are generated with deconvolution operations. Assume here that the input data, denoted I, is a 100-dimensional vector, so that the corresponding missing-position encoding vector, denoted E, is also 100-dimensional and the concatenated input vector has dimension 200. The hidden layers consist of fully connected layers with the ReLU activation function. The final output layer has 100 output units, denoted O, with the sigmoid activation function. The filled data are finally composed as I(1-E)+OE, with the products taken element-wise;

Discriminator network: the input data come from two sources: the first is the filled data obtained from the output of the generator network, and the second is the sample data without missing attributes. The output is a value between 0 and 1 representing the probability, as judged by the discriminator network, that the received input comes from the data without missing attributes. The network structure depends on the type of input data; when the input data are images, the network is built from convolutional neural networks. Assume here that the input data is a 100-dimensional vector; the hidden layers can then consist of fully connected layers with the ReLU activation function. The output layer contains a single unit with the sigmoid activation function, representing a probability;

Auxiliary prediction network: its input is exactly the same as that of the discriminator network, and its output is the predicted label value for the input sample. When the prediction problem is a classification problem, cross-entropy is used as the loss function; when it is a regression problem, the L2 norm or L1 norm is used as the loss function. The network structure is set up in the same way as that of the discriminator network. Assume here that the input data is a 100-dimensional vector; the hidden layers can then consist of fully connected layers with the ReLU activation function. The output layer contains a single unit, whose activation function is set in the manner described above.

In step 4), the scale of the generated filling data is restored. Because min-max normalization was applied during preprocessing, the final filling result can be recovered from the recorded maximum and minimum of each attribute.

In step 5), testing and hyperparameter setting proceed as follows. During training, the loss of the network comes from two parts: the loss of the generative adversarial network and the prediction loss of the auxiliary prediction network. The two parts are combined in a proportion λ to obtain the overall loss, and different values of λ affect the training of the model. In practice, the data set is split into a training set and a test set; the model is trained on the training set with λ set to 0.1, 0.3, 0.5, 0.7, and 0.9 in turn and evaluated on the test set, and the λ that minimizes the loss of the auxiliary prediction network on the test set is selected.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. Traditional filling methods such as median and mean filling are simple but fill poorly, while methods based on KNN or EM often have high time complexity; on high-dimensional data sets the cost becomes so large that the data may not be processable at all. Generative adversarial networks are very effective at learning the distribution of high-dimensional data and can therefore address the difficulties posed by high-dimensional data sets. In addition, samples without missing attributes and samples with missing attributes usually follow the same distribution; making the filled data approach, in distribution, the data set without missing attributes keeps the filling result from deviating from the data distribution and thus from harming the prediction model.

2. Traditional filling methods do not consider how the filled data affect the results of a prediction model built afterwards. They usually first fill the missing data and then build a prediction model on the filled data, so the prediction performance cannot be used to guide the filling. The present invention introduces an auxiliary prediction network that computes the loss between the value predicted from each batch of filled data and the true label and back-propagates it to guide the generator network's filling, so that the quality of the filled data on the prediction model can be observed. Combined with the discriminator network's loss, which limits the difference between the distribution of the filled data and that of the real data, the method achieves both a good filling result and a good prediction result. Furthermore, after training, the result is an end-to-end network: once data are input, the prediction of the auxiliary prediction network is obtained directly.

Brief description of the drawings

Fig. 1 is a flow chart of missing-data filling and prediction.

Fig. 2 is a data-flow diagram of the generative adversarial network used for data filling and of the prediction network.

Specific embodiment

The present invention is further explained below with reference to a specific embodiment.

As shown in Fig. 1, the method for completing and predicting on attribute-missing data sets based on a generative adversarial network provided by this example proceeds as follows:

1) Data preprocessing: attributes have different data types, and each type is processed differently. The main data types involved are continuous values and discrete values. Continuous values are normalized directly with min-max normalization; discrete values are converted to one-hot encoding and then normalized with min-max normalization; missing positions are uniformly filled with 0. In addition, the data set is divided into two parts: data with missing attributes and data without missing attributes.
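The preprocessing in step 1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of `None` to mark a missing value, and the toy columns are all assumptions made for the example. Because a one-hot column already lies in {0, 1}, the subsequent min-max normalization leaves it unchanged, so it is omitted here.

```python
from typing import List, Optional, Sequence, Tuple


def minmax_normalize(column: Sequence[Optional[float]]) -> Tuple[List[float], float, float]:
    """Scale observed values to [0, 1]; missing entries (None) become 0.

    Assumes the column has at least one observed value. The minimum and
    maximum are returned so the scale can be restored later.
    """
    observed = [v for v in column if v is not None]
    lo, hi = min(observed), max(observed)
    span = (hi - lo) or 1.0  # guard against a constant column
    scaled = [0.0 if v is None else (v - lo) / span for v in column]
    return scaled, lo, hi


def one_hot(column: Sequence[Optional[str]], categories: Sequence[str]) -> List[List[float]]:
    """One-hot encode a discrete column; a missing entry becomes all zeros."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]


def split_by_missing(rows: Sequence[Sequence[object]]) -> Tuple[list, list]:
    """Split rows into (rows with a missing attribute, rows without)."""
    incomplete = [r for r in rows if any(v is None for v in r)]
    complete = [r for r in rows if all(v is not None for v in r)]
    return incomplete, complete


# Hypothetical toy data: one continuous and one discrete attribute.
ages = [20.0, None, 40.0, 30.0]
colors = ["red", None, "blue", "red"]

scaled_ages, age_min, age_max = minmax_normalize(ages)
encoded_colors = one_hot(colors, ["red", "blue"])
incomplete_rows, complete_rows = split_by_missing(list(zip(ages, colors)))
```

Keeping `age_min` and `age_max` is what makes the later scale restoration possible.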

2) Constructing the missing-position encoding vector: in data filling, the positions of a sample's missing attributes are themselves important information; when filling with a neural network, only these missing positions need to be filled. To construct the vector, every column of every sample is traversed; a missing attribute is recorded as "1" and any other attribute as "0". After this procedure, each sample has a corresponding missing-position encoding vector.
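A minimal sketch of the traversal in step 2), assuming missing values are represented as `None` (an assumption of this example; the text does not fix a representation). The function name `missing_mask` is hypothetical.

```python
from typing import List, Optional, Sequence


def missing_mask(rows: Sequence[Sequence[Optional[float]]]) -> List[List[int]]:
    """Return one 0/1 vector per sample: 1 where the attribute is missing."""
    return [[1 if value is None else 0 for value in row] for row in rows]


# Hypothetical samples with three attributes each.
samples = [
    [0.2, None, 0.7],
    [None, None, 0.1],
    [0.5, 0.9, 0.3],
]
masks = missing_mask(samples)
```

Each mask has the same length as its sample, which is what allows the two to be concatenated as generator input and combined element-wise afterwards.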

3) Constructing the missing-data filling network and the auxiliary prediction network: the invention proposes an integrated network, based on a generative adversarial network combined with an auxiliary prediction network, that fills data and makes predictions at the same time. The network makes the following improvements to the original generative adversarial network: (1) the sampled noise is removed from the input of the generator network; (2) the filled data are composed from the generated data and the missing-position encoding vector. In addition, the auxiliary prediction network is introduced to take fuller account of the relationship between attributes and labels. It makes predictions from the attribute-missing data at the same time, and the loss between its predicted labels and the true labels is back-propagated with the BP algorithm to update the generator network, so that the generated filling data work better when a prediction model is built. The loss function of the generative adversarial network and the loss function of the auxiliary prediction network are combined, with a hyperparameter controlling their relative weight; this determines whether the generated filling data are pushed closer to the distribution of the complete data or toward making the prediction model more accurate. Fig. 2 shows the structure of the core data filling network and auxiliary prediction network of the invention, comprising a generator network, a discriminator network, and an auxiliary prediction network. The structure of these three networks is described in detail below:

Generator network: the input is formed by concatenating the data with missing attributes and the corresponding missing-position encoding vector. Depending on the structure of the data, the hidden layers can be built from fully connected layers or deconvolution layers; in particular, when the input data are images, the filling data are generated with deconvolution layers. Assume here that the input data (denoted I) is a 100-dimensional vector, so that the corresponding missing-position encoding vector (denoted E) is also 100-dimensional and the concatenated input vector has dimension 200. The hidden layers consist of fully connected layers with the ReLU activation function. The final output layer has 100 output units (denoted O) with the sigmoid activation function. The filled data are finally composed as I(1-E)+OE, with the products taken element-wise.
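The input concatenation and the composition I(1-E)+OE can be illustrated with a small sketch. The generator itself is not implemented here: `gen_out` is a hand-written stand-in for its sigmoid outputs, and all names are hypothetical. Three dimensions are used instead of 100 to keep the example readable.

```python
from typing import List, Sequence


def generator_input(sample: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Concatenate the zero-filled sample I with its mask E (length doubles)."""
    return list(sample) + [float(m) for m in mask]


def compose_filled(sample: Sequence[float], mask: Sequence[int],
                   generated: Sequence[float]) -> List[float]:
    """Element-wise I * (1 - E) + O * E: keep observed values, fill the gaps."""
    return [i * (1 - e) + o * e for i, e, o in zip(sample, mask, generated)]


sample = [0.2, 0.0, 0.7]    # I: normalized sample, missing position set to 0
mask = [0, 1, 0]            # E: 1 marks the missing attribute
gen_out = [0.9, 0.4, 0.1]   # O: stand-in for the generator's outputs

x_in = generator_input(sample, mask)
filled = compose_filled(sample, mask, gen_out)
```

Only the second position takes the generated value; the observed positions pass through unchanged, so the generator is never allowed to overwrite known data.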

Discriminator network: the input data come from two sources: the first is the filled data obtained from the output of the generator network, and the second is the sample data without missing attributes. The output is a value between 0 and 1 representing the probability, as judged by the discriminator network, that the received input comes from the data without missing attributes. The network structure depends on the type of input data; when the input data are images, the network can be built from convolutional neural networks. Assume here that the input data is a 100-dimensional vector; the hidden layers can then consist of fully connected layers with the ReLU activation function. The output layer contains a single unit with the sigmoid activation function, representing a probability.

Auxiliary prediction network: its input is exactly the same as that of the discriminator network, and its output is the predicted label value for the input sample. When the prediction problem is a classification problem, cross-entropy is used as the loss function; when it is a regression problem, the L2 norm or L1 norm is used as the loss function. The network structure is set up in the same way as that of the discriminator network. Assume here that the input data is a 100-dimensional vector; the hidden layers can then consist of fully connected layers with the ReLU activation function. The output layer contains a single unit, whose activation function is set in the manner described above.
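The choice of loss for the auxiliary prediction network can be sketched as below. Two points are assumptions of this example rather than statements of the text: with a single output unit, the classification case is taken to be binary cross-entropy on a sigmoid probability, and the L2 and L1 losses are written as means over the outputs. The function names are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(prob: float, label: int) -> float:
    """Cross-entropy for one sample with a single sigmoid output unit."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Classification: predicted probability 0.9 for a sample whose label is 1.
clf_loss = binary_cross_entropy(0.9, 1)
# Regression: predicted values against true values.
reg_loss = l2_loss([1.0, 3.0], [0.0, 1.0])
```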

4) Restoring the scale of the filled data: because min-max normalization was applied during preprocessing, the final filling result can be recovered from the recorded maximum and minimum of each attribute.
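Restoring the scale is the inverse of min-max normalization, x = x_scaled * (max - min) + min. A minimal sketch, with a hypothetical function name and toy values:

```python
from typing import List, Sequence


def minmax_restore(scaled: Sequence[float], lo: float, hi: float) -> List[float]:
    """Map values in [0, 1] back to the original range [lo, hi]."""
    return [v * (hi - lo) + lo for v in scaled]


# Assume the attribute's recorded minimum and maximum were 20 and 40.
restored = minmax_restore([0.0, 0.25, 1.0], 20.0, 40.0)
```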

5) Testing and hyperparameter setting: during training, the loss of the network comes from two parts: the loss of the generative adversarial network and the prediction loss of the auxiliary prediction network. The two parts are combined in a proportion λ to obtain the overall loss, and different values of λ affect the training of the model. In practice, the data set is split into a training set and a test set; the model is trained on the training set with λ set to 0.1, 0.3, 0.5, 0.7, and 0.9 in turn and evaluated on the test set, and the λ that minimizes the loss of the auxiliary prediction network on the test set is selected.
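The selection of λ can be sketched as a simple search over the five candidate values. The text does not give the exact form of the combination, so the weighting λ·L_GAN + (1-λ)·L_pred below is an assumption. The `fake_train_and_eval` function is a placeholder that stands in for training the networks with a given λ and returning the auxiliary prediction loss on the test set; its values are invented purely so the example runs.

```python
from typing import Callable, Iterable, Optional, Tuple


def combined_loss(gan_loss: float, pred_loss: float, lam: float) -> float:
    """Assumed weighting: lam * adversarial loss + (1 - lam) * prediction loss."""
    return lam * gan_loss + (1.0 - lam) * pred_loss


def select_lambda(candidates: Iterable[float],
                  train_and_eval: Callable[[float], float]) -> Tuple[Optional[float], float]:
    """Return the candidate with the lowest test-set prediction loss."""
    best_lam, best_loss = None, float("inf")
    for lam in candidates:
        loss = train_and_eval(lam)  # train with this lam, evaluate on the test set
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss


def fake_train_and_eval(lam: float) -> float:
    """Placeholder for a real training run; returns an invented test loss."""
    return (lam - 0.3) ** 2 + 0.1


best_lam, best_loss = select_lambda([0.1, 0.3, 0.5, 0.7, 0.9], fake_train_and_eval)
```

In a real run each call to `train_and_eval` would retrain the generator, discriminator, and auxiliary prediction network from scratch, so the search costs five full training runs.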

The embodiment described above is only a preferred embodiment of the invention and does not limit its scope of implementation; any change made according to the shape and principle of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for completing and predicting on attribute-missing data sets based on a generative adversarial network, characterized in that: first, the attribute-missing data set is preprocessed, mainly by min-max normalization and one-hot encoding of discrete variables; next, for samples with missing attributes, a missing-position encoding vector is constructed to express where values are missing; then a filling network for the missing data and an auxiliary prediction network are constructed, which fill the missing data and predict labels at the same time; after network training is complete, the output of the generator network in the filling network is taken as the filling result, and its scale is restored using the per-column maximum and minimum recorded during min-max normalization; finally, the hyperparameter is set by repeatedly modifying it and observing the prediction loss on the validation set; the method comprises the following steps:

1) data preprocessing;

2) constructing the missing-position encoding vector;

3) constructing the missing-data filling network and the auxiliary prediction network;

4) restoring the scale of the filled data;

5) testing and hyperparameter setting.

2. The method for completing and predicting on attribute-missing data sets based on a generative adversarial network according to claim 1, characterized in that: in step 1), different types of data are preprocessed differently; the main data types involved are continuous values and discrete values; continuous values are normalized directly with min-max normalization; discrete values are converted to one-hot encoding and then normalized with min-max normalization; missing positions are uniformly filled with 0; in addition, the data set is divided into two parts according to whether attributes are missing: data with missing attributes and data without missing attributes.

3. The method for completing and predicting on attribute-missing data sets based on a generative adversarial network according to claim 1, characterized in that: in step 2), the missing-position encoding vector is constructed as follows: in data filling, the positions of a sample's missing attributes are themselves important information, and when filling with a neural network only these missing positions need to be filled; to construct the vector, every column of every sample is traversed, a missing attribute being recorded as "1" and any other attribute as "0"; after this procedure, each sample has a corresponding missing-position encoding vector.

4. The method for completing and predicting on attribute-missing data sets based on a generative adversarial network according to claim 1, characterized in that: in step 3), the missing-data filling network and the auxiliary prediction network are constructed as follows: the network makes the following improvements to the original generative adversarial network: (1) the noise is removed from the input of the generator network; (2) the filled data are composed from the generated data and the missing-position encoding vector; in addition, the auxiliary prediction network is introduced to take fuller account of the relationship between attributes and labels; it makes predictions from the attribute-missing data at the same time, and the loss between its predicted labels and the true labels is back-propagated with the BP algorithm to update the generator network, so that the generated filling data work better when a prediction model is built; the loss function of the generative adversarial network and the loss function of the auxiliary prediction network are combined, with a hyperparameter controlling their relative weight, which determines whether the generated filling data are pushed closer to the distribution of the complete data or toward making the prediction model more accurate; the structure of the data filling network and auxiliary prediction network comprises a generator network, a discriminator network, and an auxiliary prediction network; the structure of these three networks is described in detail below:

Generator network: the input is formed by concatenating the data with missing attributes and the corresponding missing-position encoding vector; depending on the structure of the data, the hidden layers can be built from fully connected layers or deconvolution layers, and in particular, when the input data are images, the filling data are generated with deconvolution operations; assume here that the input data, denoted I, is a 100-dimensional vector, so that the corresponding missing-position encoding vector, denoted E, is also 100-dimensional and the concatenated input vector has dimension 200; the hidden layers consist of fully connected layers with the ReLU activation function; the final output layer has 100 output units, denoted O, with the sigmoid activation function; the filled data are finally composed as I(1-E)+OE, with the products taken element-wise;

Discriminator network: the input data come from two sources, the first being the filled data obtained from the output of the generator network and the second being the sample data without missing attributes; the output is a value between 0 and 1 representing the probability, as judged by the discriminator network, that the received input comes from the data without missing attributes; the network structure depends on the type of input data, and when the input data are images, the network is built from convolutional neural networks; assume here that the input data is a 100-dimensional vector, in which case the hidden layers can consist of fully connected layers with the ReLU activation function; the output layer contains a single unit with the sigmoid activation function, representing a probability;

Auxiliary prediction network: its input is exactly the same as that of the discriminator network, and its output is the predicted label value for the input sample; when the prediction problem is a classification problem, cross-entropy is used as the loss function, and when it is a regression problem, the L2 norm or L1 norm is used as the loss function; the network structure is set up in the same way as that of the discriminator network; assume here that the input data is a 100-dimensional vector, in which case the hidden layers can consist of fully connected layers with the ReLU activation function; the output layer contains a single unit, whose activation function is set in the manner described above.

5. The method for completing and predicting on attribute-missing data sets based on a generative adversarial network according to claim 1, characterized in that: in step 4), the scale of the generated filling data is restored; because min-max normalization was applied during preprocessing, the final filling result can be recovered from the recorded maximum and minimum of each attribute.

6. The method for completing and predicting on attribute-missing data sets based on a generative adversarial network according to claim 1, characterized in that: in step 5), testing and hyperparameter setting proceed as follows: during training, the loss of the network comes from two parts, namely the loss of the generative adversarial network and the prediction loss of the auxiliary prediction network; the two parts are combined in a proportion λ to obtain the overall loss, and different values of λ affect the training of the model; in practice, the data set is split into a training set and a test set, the model is trained on the training set with λ set to 0.1, 0.3, 0.5, 0.7, and 0.9 in turn and evaluated on the test set, and the λ that minimizes the loss of the auxiliary prediction network on the test set is selected.