CN114065919A

CN114065919A - Missing value completion method and medium based on generative adversarial network

Info

Publication number: CN114065919A
Application number: CN202111360740.2A
Authority: CN
Inventors: 周轩; 刘成勇; 刘国辉; 王路; 林希佳; 陈明辉; 石冉; 秦磊; 李珍珍; 李宗雯
Original assignee: Jiangsu Jinling Institute Of Intelligent Manufacturing Co ltd; Nanjing Chenguang Group Co Ltd
Current assignee: Jiangsu Jinling Institute Of Intelligent Manufacturing Co ltd; Nanjing Chenguang Group Co Ltd
Priority date: 2021-11-17
Filing date: 2021-11-17
Publication date: 2022-02-18

Abstract

The invention discloses a missing value completion method and medium based on a generative adversarial network (GAN). The method comprises the following steps: collecting incomplete historical data to obtain a training set; iteratively training the generator and the discriminator in an off-line manner based on the training set; determining the adversarial network model through a cross validation method and obtaining the optimal generator with the minimum error; and performing missing value inference on the acquired data based on the optimal generator to reconstruct complete data. The invention alleviates the need of data-driven methods for a large amount of complete historical data and avoids the assumptions that traditional missing value inference methods place on the data structure. It is mainly used to address the degradation of data quality caused by the loss of collected data due to collection equipment faults, insufficient coverage of collection equipment, operator errors and the like.

Description

Missing value completion method and medium based on generative adversarial network

Technical Field

The invention belongs to the field of data cleaning, and particularly relates to a missing value completion method and medium based on a generative adversarial network.

Background

As an important factor of production, data is the basis of scientific analysis and numerous applications, and its quality directly affects model performance and final results. Only complete and accurate data yield reliable results; missing or abnormal data can even lead to wrong conclusions. Data loss is nevertheless inevitable for many reasons, such as collection equipment failure, insufficient coverage of collection equipment and operator error. Data cleaning is therefore an indispensable step in the whole data analysis process, and missing value completion is an important part of data cleaning.

The missing value completion methods commonly used at present mainly include mean value filling, maximum and minimum value filling and the like, but these methods are too simple and their results usually differ greatly from the true data values. In many applications the data appear in matrix form, so methods such as compressive sensing, matrix completion, Gaussian process regression and neural networks are also used to infer missing values. These methods, however, usually assume that the data have certain properties: compressive sensing assumes that the data are sparse, matrix completion assumes that the data are low rank, Gaussian process regression assumes that the data follow a Gaussian distribution, and data-driven methods such as neural networks usually require a large amount of complete training data. Such assumptions are often difficult to satisfy in an actual data set, where usually only a large amount of incomplete historical data is available.

Disclosure of Invention

The invention aims to provide a missing value completion method and medium based on a generative adversarial network, which address the degradation of data quality caused by the loss of collected data due to collection equipment faults, insufficient coverage of collection equipment, operator errors and the like, which do not need complete training data, and which complete missing values more accurately.

The technical solution for realizing the purpose of the invention is as follows: a missing value completion method based on a generative adversarial network, comprising the following steps:

collecting incomplete historical data to obtain a training set;

iteratively training the generator and the discriminator in an off-line manner based on the training set;

determining the adversarial network model through a cross validation method, and obtaining an optimal generator with the minimum error;

performing missing value inference on the acquired data based on the optimal generator, and reconstructing complete data.

Further, training the generator in an off-line manner based on the training set, comprising the steps of:

flattening the data in the training set at different moments into one-dimensional vectors to form a historical data matrix whose rows correspond to successive moments and whose columns correspond to the data items, and generating a corresponding indication matrix that identifies whether each entry is missing;

normalizing the historical data matrix, and replacing the missing values in it with random variables generated by a Gaussian distribution generator;

using the completed historical data matrix as the input of the generator to train the generator, the generator outputting a complete historical data matrix.
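The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `prepare_generator_input`, the use of `NaN` to mark missing entries, per-column min-max normalization, and the noise standard deviation are all assumptions, since the source does not reproduce its normalization formula.

```python
import numpy as np

def prepare_generator_input(maps, rng, noise_std=0.01):
    """Flatten, build the 0-1 indication matrix, normalize, and noise-fill.

    maps: array of shape (T, rows, cols); np.nan marks a missing entry
    (an assumed encoding of the uncollected default value).
    """
    n_moments = maps.shape[0]
    m_his = maps.reshape(n_moments, -1)           # one row per moment
    omega = (~np.isnan(m_his)).astype(float)      # 1 = known, 0 = missing

    # Assumed scheme: min-max normalization per column over observed entries.
    col_min = np.nanmin(m_his, axis=0)
    col_max = np.nanmax(m_his, axis=0)
    scale = np.where(col_max > col_min, col_max - col_min, 1.0)
    normed = (m_his - col_min) / scale

    # Replace missing entries with Gaussian random variables.
    z = rng.normal(0.0, noise_std, size=m_his.shape)
    filled = np.where(omega == 1, normed, z)
    return filled, omega, (col_min, scale)
```

The returned `col_min` and `scale` would be kept so that the generator output can later be inverse-normalized.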

Further, training the discriminator in an off-line manner based on the training set specifically includes: randomly generating a clue matrix from the indication matrix and the set clue rate, and using the indication matrix and the clue matrix as the inputs of the discriminator to train the discriminator, the discriminator generating a corresponding predicted history indication matrix.

Further, the clue matrix is:

H = B·Ω_his + 0.5(1 − B)

where B is a random 0-1 matrix generated according to the clue rate h, and Ω_his is the indication matrix.
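The clue matrix formula can be illustrated with a short NumPy sketch. It assumes that the products are element-wise and that each entry of B is drawn independently as 1 with probability h; the function name `make_clue_matrix` is hypothetical.

```python
import numpy as np

def make_clue_matrix(omega, clue_rate, rng):
    """Return H = B * omega + 0.5 * (1 - B) together with B.

    omega: 0-1 indication matrix (1 = known, 0 = missing).
    clue_rate: probability h that an entry of omega is revealed.
    """
    # Assumption: entries of B are independent Bernoulli(h) draws.
    b = (rng.random(omega.shape) < clue_rate).astype(float)
    h = b * omega + 0.5 * (1.0 - b)
    return h, b
```

Where B is 1, the clue matrix reveals the true indication value; where B is 0, the entry is 0.5 and carries no information about whether the value is missing.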

Further, determining the adversarial network model through the cross validation method specifically includes: using an 80-20 cross validation method, selecting as the model structure of the adversarial network the number of layers and the activation function of the generator and the discriminator for which the total error on the test data set is minimum.
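A minimal sketch of such an 80-20 selection is given below. It assumes a single random 80/20 split of the rows; `train_fn` and `error_fn` are hypothetical callables supplied by the caller (one trains a model for a candidate structure, the other returns its total test error), since the source does not specify them.

```python
import numpy as np

def split_80_20(n_rows, rng):
    """Randomly split row indices into 80% training and 20% test."""
    perm = rng.permutation(n_rows)
    cut = int(0.8 * n_rows)
    return perm[:cut], perm[cut:]

def select_structure(candidates, data, train_fn, error_fn, rng):
    """Return (config, test_error, model) with the lowest test error."""
    train_idx, test_idx = split_80_20(data.shape[0], rng)
    best = None
    for cfg in candidates:
        model = train_fn(cfg, data[train_idx])    # hypothetical trainer
        err = error_fn(model, data[test_idx])     # hypothetical error metric
        if best is None or err < best[1]:
            best = (cfg, err, model)
    return best
```

Each candidate in `candidates` would describe one structure, for example a number of layers and an activation function.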

Further, the adversarial network model is:

[formula not recoverable from the source text]

where the symbols of the formula denote, in order: the inverse indication matrix; the inverse historical data matrix; the clue matrix H; the discriminator; the generator; and the joint probability distribution.
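For orientation only, a minimax objective of the kind described here is commonly written as below. This is an assumed form modeled on the generative adversarial imputation (GAIN) formulation, not the formula of the source, which is not recoverable. The symbols $\hat{M}_{his}$ (generator-completed matrix) and $\bar{\Omega}_{his}$ (inverse indication matrix) are notation introduced for this sketch, and the expectation is taken over the joint distribution of $\hat{M}_{his}$, $\Omega_{his}$ and $H$.

```latex
\min_{G}\max_{D}\; V(D,G)
  = \mathbb{E}\Big[\,\Omega_{his}^{\top}\log D(\hat{M}_{his}, H)
  + \bar{\Omega}_{his}^{\top}\log\big(1 - D(\hat{M}_{his}, H)\big)\Big],
\qquad \bar{\Omega}_{his} = 1 - \Omega_{his}
```

Under this reading, the discriminator is rewarded for predicting which entries are known, and the generator is rewarded when its imputed entries are mistaken for known ones.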

Further, the adversarial network model is divided into 2 sub-models, V₁ and V₂:

[formulas not recoverable from the source text]

where two of the loss terms are cross-entropy loss functions and the third is a mean square error function; α_l is the loss hyperparameter; and the remaining symbols denote the history indication matrix predicted by the discriminator, the complete historical data matrix generated by the generator, and the historical data matrix completed by Gaussian variable interpolation;

during training, the generator is first fixed and the discriminator is trained according to V₁; the discriminator is then fixed and the generator is trained according to V₂; this is repeated until the iteration converges;

obtaining the optimal generator with the minimum error is specifically: training is carried out on a plurality of training data sets, and the generator with the minimum error is selected as the optimal generator.
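The two sub-models can be sketched as loss functions in NumPy. Because the formulas are not recoverable from the source, the forms below are assumptions modeled on the GAIN formulation: V₁ is taken as a cross-entropy between the true and predicted indication matrices, and V₂ as a cross-entropy term on the missing entries plus α_l times a mean square error on the known entries. All function and argument names are hypothetical.

```python
import numpy as np

EPS = 1e-8  # keeps the logarithms finite

def discriminator_loss(omega, omega_pred):
    """Assumed V1: cross-entropy between true and predicted indication matrices."""
    return -np.mean(omega * np.log(omega_pred + EPS)
                    + (1.0 - omega) * np.log(1.0 - omega_pred + EPS))

def generator_loss(omega, omega_pred, m_filled, m_generated, alpha):
    """Assumed V2: adversarial cross-entropy plus alpha times reconstruction MSE."""
    # Adversarial term: missing entries should look "known" to the discriminator.
    adversarial = -np.mean((1.0 - omega) * np.log(omega_pred + EPS))
    # Reconstruction term: known entries should be reproduced by the generator.
    n_known = max(np.sum(omega), 1.0)
    mse = np.sum(omega * (m_filled - m_generated) ** 2) / n_known
    return adversarial + alpha * mse
```

In an alternating scheme, `discriminator_loss` would be minimized with the generator fixed, and `generator_loss` with the discriminator fixed.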

Further, the indication matrix is a 0-1 indication matrix, in which 0 represents a missing value and 1 represents a known value; the generator and the discriminator are both two-layer fully connected neural networks.

Further, based on an optimal generator, missing value inference is carried out on the acquired data, and complete data are reconstructed, and the method specifically comprises the following steps:

carrying out normalization processing on the incomplete data matrix acquired this time, and replacing missing values in the matrix by random variables generated by Gaussian distribution to generate a corresponding temporary interpolation matrix;

taking the temporary interpolation matrix as the input of the optimal generator model and generating a temporary data matrix;

replacing the missing values at the corresponding positions of the temporary interpolation matrix with the data at the original missing-value positions in the temporary data matrix, and generating a final interpolation matrix;

and carrying out inverse normalization processing on the final interpolation matrix to generate a final complete data matrix.

A computer storage medium having stored thereon an executable program which, when executed by a processor, implements the steps of the missing value completion method.

Compared with the prior art, the invention has the following beneficial effects: the method makes full use of the large amount of incomplete historical data collected by the wireless sensor network and uses the generative adversarial network to extract the temporal features in the historical data, so as to guide the inference of missing values during data collection and effectively reconstruct the complete data. Compared with traditional missing value completion algorithms, the method has better reconstruction precision and requires no assumptions about the data; compared with traditional data-driven algorithms, the method does not need a large amount of complete historical data and completes missing values more accurately.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

FIG. 2 is a flow chart of the method of the present invention.

Fig. 3 is a diagram of the structure of the countermeasure network of the present invention.

FIG. 4 is a diagram of simulation results of the method of the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and the embodiments.

The invention provides a missing value completion method based on a generated countermeasure network, which is mainly used for solving the problem of data quality reduction caused by the loss of collected data due to the reasons of collection equipment faults, insufficient collection equipment coverage, operator errors and the like.

As shown in fig. 1, the method includes two stages, an offline training stage and an online reconstruction stage. This embodiment is illustrated with signal data collected by crowdsourcing. Simulation data are used: the simulation space is 24 × 20 × 4 m³ and is divided into cells of 1 × 1 m², forming a 24 × 20 grid, with the transmitter at position [15, 12, 1]. The simulated signals are assigned to grid cells according to their coordinate ranges, generating 16000 signal maps of size 24 × 20 in total. To simulate the incompleteness of the historical signal maps, 50% of the grid signals in each signal map are randomly selected and set to None (the default value of a signal that was not collected), which finally yields historical signal maps with dimensions 16000 × 24 × 20.
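The masking step of this simulation can be sketched as follows. The sketch assumes that `NaN` stands in for the None default value and that exactly half of the cells are drawn without replacement in each map; the function name `mask_cells` is hypothetical, and the signal model that fills the maps is not given in the source.

```python
import numpy as np

def mask_cells(signal_maps, rng, missing_fraction=0.5):
    """Return a copy in which a random subset of cells in each map is NaN."""
    n_maps, rows, cols = signal_maps.shape
    n_cells = rows * cols
    n_missing = int(round(missing_fraction * n_cells))
    # astype(float) copies, so the input maps are left untouched.
    masked = signal_maps.astype(float).reshape(n_maps, n_cells)
    for i in range(n_maps):
        idx = rng.choice(n_cells, size=n_missing, replace=False)
        masked[i, idx] = np.nan
    return masked.reshape(n_maps, rows, cols)
```

For the embodiment's dimensions, the input would have shape (16000, 24, 20) and each map would lose 240 of its 480 cells.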

With reference to fig. 2, in the off-line training stage, a large amount of incomplete historical data collected by the sensors is used to train the generative adversarial network model. First, the data collected by the sensing nodes at different times are flattened into one-dimensional vectors to form a historical data matrix whose rows correspond to times and whose columns correspond to sensing node numbers, and a corresponding 0-1 indication matrix is generated, in which 0 indicates a missing value and 1 indicates a known value. Secondly, the historical data matrix is normalized, its missing values are replaced with random variables generated by a Gaussian distribution generator, and the resulting matrix is used as the input of the generator to train it. Then, a clue matrix is randomly generated from the indication matrix and the clue rate, and the indication matrix and the clue matrix are used as the inputs of the discriminator to train it. The generator and the discriminator are trained iteratively until the model converges. At the end of off-line training, the cross validation method selects the hyperparameters of the generative adversarial network, such as the network structure (the number of hidden layers and the number of neurons in each layer) and the clue rate, that give the minimum test error. The specific steps are as follows:

firstly, the 16000 signal maps of size 24 × 20 are flattened into a 16000 × 480 historical data matrix M_his, and the corresponding indication matrix Ω_his is generated; it is a 0-1 matrix in which 0 denotes a missing value and 1 denotes a known value;

the historical data matrix M_his is normalized, the missing values in it are replaced by a random variable Z generated by a Gaussian distribution generator, and the resulting matrix is used as the input of the generator; the generator produces fake data resembling the real data, from which the final complete historical data matrix is calculated, and the generator is trained according to V₂;

a clue matrix H = B·Ω_his + 0.5(1 − B) is randomly generated from the indication matrix Ω_his and the preset hyperparameter clue rate h, where B is a random 0-1 matrix generated according to the clue rate h; the indication matrix and the clue matrix are used as the inputs of the discriminator; the discriminator judges whether the data are real data or fake data generated by the generator and produces a predicted history indication matrix, and the discriminator and the generator are trained according to V₁ and V₂;

the generator and the discriminator are trained through the game between them:

[formula not recoverable from the source text]

where the objective is calculated from the following formula:

[formula not recoverable from the source text]

in which the symbols denote, in order: the inverse history indication matrix, a 0-1 matrix in which 1 denotes a missing value and 0 denotes a known value; the inverse historical data matrix; the clue matrix H; the discriminator; the generator; and the joint probability distribution. For simplicity of calculation, the model is converted in practice into 2 sub-formulas, V₁ and V₂.

During training, the generator is first fixed and the discriminator is trained according to V₁; the discriminator is then fixed and the generator is trained according to V₂, until the iterations converge. In V₁ and V₂, two of the loss terms are cross-entropy loss functions and the third is a mean square error function; α_l is a loss hyperparameter (any positive number), which can be obtained by training and was set to 700 in the experiment; the remaining symbols denote the history indication matrix predicted by the discriminator, the final complete historical data matrix generated by the generator, and the historical temporary interpolation matrix interpolated with Gaussian variables.

Referring to fig. 3, an 80-20 cross validation method is used to select the generative adversarial network structure that minimizes the total test error, namely the number of layers and the activation function of the generator and the discriminator. Both the generator and the discriminator are two-layer fully connected neural networks, the activation function is chosen to output its input as is, and the initial weights are randomly initialized.
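A two-layer fully connected network with an as-is (identity) activation and randomly initialized weights can be sketched as follows. The hidden width, the weight scale, zero-initialized biases, and the names `init_two_layer` and `forward` are assumptions made for illustration; the source does not state them.

```python
import numpy as np

def init_two_layer(n_in, n_hidden, n_out, rng, scale=0.1):
    """Randomly initialize the weights of a two-layer fully connected network."""
    return {
        "W1": rng.normal(0.0, scale, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, scale, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, x, activation=lambda a: a):
    """Forward pass; the default activation returns its input unchanged."""
    hidden = activation(x @ params["W1"] + params["b1"])
    return activation(hidden @ params["W2"] + params["b2"])
```

For the 16000 × 480 historical data matrix, the generator would map each 480-dimensional row to a 480-dimensional output row.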

In the on-line reconstruction stage, the generator model obtained by off-line training is used to infer the missing values of the collected sparse, incomplete data so as to reconstruct the complete data. Firstly, the incomplete data matrix collected this time is normalized, and its missing values are replaced by random variables generated from a Gaussian distribution to produce a corresponding temporary interpolation matrix; then, the temporary interpolation matrix is used as the input of the trained generator model, which produces a temporary data matrix; next, the missing entries of the collected incomplete data matrix are replaced by the data at the same positions in the temporary data matrix to produce a final interpolation matrix; finally, the final interpolation matrix is inverse-normalized to produce the final complete data matrix. The specific steps are as follows:

the incomplete data matrix M collected this time is normalized, and the missing values in the matrix are replaced by random variables generated from a Gaussian distribution to generate the corresponding temporary interpolation matrix;

the temporary interpolation matrix is used as the input of the trained generator model, which generates a temporary data matrix;

the missing entries of the collected incomplete data matrix are replaced by the data at the original missing-value positions in the temporary data matrix to generate a final interpolation matrix;

the final interpolation matrix is inverse-normalized to generate the final complete data matrix.
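The four reconstruction steps can be sketched as one function. This is an illustration under stated assumptions: `generator` is any callable that maps a matrix to a matrix of the same shape, `NaN` marks missing entries, and the min-max statistics `col_min` and `scale` are reused from training; none of these details is specified in the source.

```python
import numpy as np

def reconstruct(m_incomplete, generator, col_min, scale, rng, noise_std=0.01):
    """Infer missing entries with a trained generator and undo the normalization."""
    omega = (~np.isnan(m_incomplete)).astype(float)    # 1 = known, 0 = missing

    # Step 1: normalize and fill the gaps with Gaussian noise.
    normed = (m_incomplete - col_min) / scale
    z = rng.normal(0.0, noise_std, size=m_incomplete.shape)
    tmp_interp = np.where(omega == 1, normed, z)        # temporary interpolation matrix

    # Step 2: run the generator to obtain the temporary data matrix.
    tmp_data = generator(tmp_interp)

    # Step 3: keep the known entries, take generator output at the gaps.
    final_interp = omega * tmp_interp + (1.0 - omega) * tmp_data

    # Step 4: inverse normalization.
    return final_interp * scale + col_min
```

Known entries pass through unchanged up to floating-point rounding; only the originally missing positions take values from the generator.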

In this embodiment, the method is compared with Bayesian compressive sensing (BCS), the matrix completion algorithm LMaFit, an autoencoder (AE), Gaussian process regression (GPR) and other algorithms; the curve labelled GAN corresponds to the present method, and the evaluation metric is the relative error. The specific experimental results are shown in FIG. 4, from which it can be seen that the relative error of the present method is the lowest, which verifies the accuracy of its missing value completion.
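The source does not reproduce its definition of the relative error. A common choice, assumed here purely for illustration, is the Frobenius norm of the reconstruction error divided by the Frobenius norm of the true matrix; the function name `relative_error` is hypothetical.

```python
import numpy as np

def relative_error(m_true, m_reconstructed):
    """Assumed metric: ||M_rec - M_true||_F / ||M_true||_F."""
    return np.linalg.norm(m_reconstructed - m_true) / np.linalg.norm(m_true)
```

A value of 0 means a perfect reconstruction, and a value of 1 means the error is as large as the data itself.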

Unlike existing missing value completion methods, the present method uses a generative adversarial network to learn the nonlinear features and distribution present in a large amount of incomplete historical data, and uses them to infer the missing values of the acquired incomplete data so as to obtain complete data. The invention therefore does not need complete training data and completes missing values more accurately.

Claims

1. A missing value completion method based on a generative adversarial network, comprising the steps of:

collecting incomplete historical data to obtain a training set;

2. The missing value completion method according to claim 1, wherein the generator is trained in an off-line manner based on the training set, comprising the steps of:

will complement the historical data matrix

3. The missing value completion method according to claim 2, wherein training the discriminator in an off-line manner based on the training set specifically comprises: randomly generating a clue matrix from the indication matrix and the set clue rate, and using the indication matrix and the clue matrix as the inputs of the discriminator to train the discriminator, the discriminator generating a corresponding predicted history indication matrix.

4. The missing value completion method according to claim 2, wherein the cue matrix is:

H＝B·Ω_his+0.5(1-B)

5. The missing value completion method according to claim 4, wherein determining the adversarial network model through the cross validation method specifically comprises: using an 80-20 cross validation method, selecting as the model structure of the adversarial network the number of layers and the activation function of the generator and the discriminator for which the total error on the test data set is minimum.

6. The missing value completion method according to claim 5, wherein the adversarial network model is:

[formula not recoverable from the source text]

where the symbols of the formula denote, in order: the inverse indication matrix; the inverse historical data matrix; the clue matrix H; the discriminator; the generator; and the joint probability distribution.

7. The missing value completion method according to claim 6, wherein the adversarial network model is divided into 2 sub-models:

[formulas not recoverable from the source text]

where two of the loss terms are cross-entropy loss functions and the third is a mean square error function; α_l is the loss hyperparameter; and the remaining symbols denote the history indication matrix predicted by the discriminator, the complete historical data matrix generated by the generator, and the historical data matrix completed by Gaussian variable interpolation;

8. The missing value completion method according to claim 2, wherein the indication matrix is a 0-1 indication matrix, 0 represents a missing value, and 1 represents a known value; the generator and the discriminator are two layers of fully connected neural networks.

9. The missing value completion method according to claim 2, wherein missing value inference is performed on the acquired data based on the optimal generator to reconstruct the complete data, specifically comprising the following steps:

10. A computer storage medium storing an executable program which, when executed by a processor, implements the steps of the missing value completion method of any one of claims 1-9.