CN112085125A - Missing value filling method based on linear self-learning network, storage medium and system - Google Patents

Missing value filling method based on linear self-learning network, storage medium and system

Info

Publication number
CN112085125A
CN112085125A (application CN202011052819.4A)
Authority
CN
China
Prior art keywords
missing
value
values
data
learning network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011052819.4A
Other languages
Chinese (zh)
Inventor
赵国帅 (Zhao Guoshuai)
白凌南 (Bai Lingnan)
李子烁 (Li Zishuo)
钱学明 (Qian Xueming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202011052819.4A
Publication of CN112085125A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413: Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F 18/24147: Distances to closest patterns, e.g. nearest neighbour classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00: Computing arrangements using knowledge-based models
    • G06N 5/02: Knowledge representation; Symbolic representation
    • G06N 5/022: Knowledge engineering; Knowledge acquisition
    • G06N 5/025: Extracting rules from data

Abstract

The invention discloses a missing value filling method, storage medium and system based on a linear self-learning network. Original time series data without missing values are acquired and preprocessed; a missing data set is constructed with random probability, and the newly generated missing data set together with the corresponding original data forms a new data set. A linear self-learning network model is constructed and trained on the generated new data set. Missing values are then filled using the trained model and a back propagation algorithm, and the time-continuous complete data set obtained after filling is used to train recurrent neural network models on the essential characteristics and missing patterns of the data, improving the performance of downstream classification and regression tasks. By exploiting the ability of a linear self-learning network to deeply mine the internal structure of the data and the interrelations within it, the invention improves both filling accuracy and filling efficiency.

Description

Missing value filling method based on linear self-learning network, storage medium and system
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a missing value filling method, a storage medium and a system based on a linear self-learning network.
Background
Missing values are widespread in real-world data sets and degrade their quality and reliability. In many practical situations missing values are unavoidable for a variety of reasons, such as hardware faults, emergencies and human error. One approach is to delete incomplete records directly; however, this discards much useful information. How to fill in missing values has therefore become an important problem. The task is crucial for machine learning, deep learning and data mining algorithms, whose performance can be severely affected by missing values in incomplete data sets.
In practical databases, missing data values are inevitable, and their causes are manifold. First, some information is simply absent: it may not have been considered important at input time, it may have been forgotten or misunderstood, or it may have been lost through acquisition equipment failure, storage medium failure, or human factors. Second, some information is temporarily unobtainable; in application form data, for example, the answers to some questions depend on other questions. Third, certain attributes simply do not exist for a given object. For data mining, the presence of missing values has the following effects: the system loses a large amount of useful information; the uncertainty in the system becomes more pronounced; and data containing missing values confuses the mining process, producing unreliable outputs. Moreover, data mining algorithms are largely designed to avoid overfitting the model to the data, which makes it hard for them to handle incomplete data well. The missing values therefore need to be derived and completed by dedicated methods to narrow the gap between data mining algorithms and practical applications.
In recent years many missing value filling algorithms have been proposed. Most of them fill a missing value using its complete neighbors: the more complete the neighboring data tuples are, the higher the final filling accuracy. When incomplete data tuples could serve as neighbors of a missing value, these methods ignore the information they contain. Missing values occur frequently in real-world data sets, especially in time-continuous ones. In a time-continuous data set, the neighbors of an incomplete data tuple are determined by temporal adjacency, so the neighbors of missing data inevitably contain other missing values. In addition, clustered missing values leave some incomplete data tuples with few or no complete neighbors. The prior art methods described above all fill missing values from complete neighbors that are similar to the missing data tuple; under clustered missingness they all suffer from a shortage of complete neighbors, because the data tuples most similar to the missing tuple themselves contain missing values and cannot serve as complete neighbors. Moreover, the existing methods search the data space only for complete neighbors of a missing data tuple and do not consider already-filled missing neighbors.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the deficiencies in the prior art, a missing value filling method, storage medium and system based on a linear self-learning network, which exploit the ability of a linear self-learning network to deeply mine the internal structure of the data and the interrelations within it, improving both filling accuracy and filling efficiency.
The invention adopts the following technical scheme:
the missing value filling method based on the linear self-learning network comprises the following steps:
S1, acquiring original time series data without missing values, preprocessing the data, constructing a missing data set with random probability, and taking the newly generated missing data set together with the corresponding original data as a new data set;
S2, constructing a model based on a linear self-learning network and training it with the new data set generated in step S1;
S3, filling the missing values using the model trained in step S2 together with a back propagation algorithm, and using the filled, time-continuous complete data set to train recurrent neural network models on the essential characteristics and missing patterns of the data.
Specifically, in step S1, suppose the missing value x_{i,j} is to be filled. Tuple x_{i-1} and tuple x_{i+1} form the current data tuple neighborhood; the attribute values of the current tuple x_i other than x_{i,j} are added to this neighborhood to form the missing value neighborhood of x_{i,j}, which is used to fill the missing value x_{i,j}.
Specifically, in step S2, each attribute value on the data tuple of one time step is computed in turn, after which the time window is shifted to the next time step; each attribute value in the current tuple has 3d-1 parameters, so the time window amounts to a linear network structure with a set of d × (3d-1) parameters.
Further, given the parameter set w_{i,j} and the missing value neighborhood set MVN_{i,j}, the filling value y_{i,j} of the missing value is computed by the linear network as follows:

y_{i,j} = Σ_{k=1}^{3d-1} w_{i,j}^{(k)} · MVN_{i,j}^{(k)}

where k denotes the kth value in the parameter set or in the missing value neighborhood.
Specifically, in step S3, the missing value neighborhood is divided into two disjoint subsets: an incomplete neighborhood set consisting of missing values, and a complete neighborhood set consisting of complete values. When the missing value neighborhood contains other missing values, the parameters and the missing values are optimized by minimizing a loss function, with the loss value L_{i,j} computed by the mean squared error function. During training of the missing values, after the forward output values have been computed from the complete data and the network and the loss has been obtained, the weight parameters and the missing values are differentiated so as to optimize both; all missing values are then assigned initial values and optimized iteratively with the back propagation algorithm.
Further, the loss value L_{i,j} is:

L_{i,j} = (x_{i,j} - y_{i,j})²

where x_{i,j} is the jth attribute value at the ith time step and y_{i,j} is the filling value computed by the linear network.
Further, the network weight parameters and the missing values are optimized as:

w_{i,j}^{(k)} ← w_{i,j}^{(k)} - η_w · ∂L_{i,j}/∂w_{i,j}^{(k)}

x_p ← x_p - η_x · ∂L_{i,j}/∂x_p

where k ∈ MVN, p ∈ IMVN, and η_w, η_x are the respective learning rates.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform any of the methods described above.
Another technical solution of the invention is a missing value filling system based on a linear self-learning network, comprising:
a processor, and a memory coupled to the processor, the memory storing a computer program which, when executed by the processor, implements the steps of the missing value filling method.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention discloses a missing value filling method based on a linear self-learning network. The missing values in the neighborhood of missing values are considered as learnable parameters of the model, which are optimized by simultaneous derivation with the optimization parameters, so that the weight parameters and the missing values can be mutually facilitated. The filling of missing values is therefore not only a calculated value but also a result of the network optimization. Experimental results on a wireless communication data set show that the linear self-learning network has better performance on missing value filling tasks.
Further, step S1 adds the attribute values of the current tuple x_i other than x_{i,j} to the current tuple neighborhood formed by tuple x_{i-1} and tuple x_{i+1}, forming the missing value neighborhood of x_{i,j}, which constitutes the input of the linear self-learning network model.
Further, step S2 slides a time window so that each data tuple corresponds to one time step; each attribute value in the missing value neighborhood of that time step corresponds to one linear model parameter, and a linear network structure is thereby constructed.
Further, computing the filling value with a linear network has the advantage that the network captures the direct associations between different attribute values across data tuples while describing the relationship between a missing value and its neighbors. Treating the missing values as learnable parameters of the network means they are not merely computed values but also results of the iterative optimization of the linear network; the method is therefore not constrained by the filling order.
Further, step S3 divides the missing value neighborhood into two disjoint subsets, the incomplete neighborhood set and the complete neighborhood set. During training of the missing values, after the forward output values have been computed from the complete data and the network and the loss has been obtained, the weight parameters and the missing values are differentiated, and the loss function is minimized with a back propagation algorithm to optimize both, improving the performance of the model.
In conclusion, the method of the invention improves the filling of missing values in data sets and effectively alleviates the difficulties in data analysis caused by high missing rates and inaccurate filling.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of determining the missing value neighborhood through the time window structure.
Detailed Description
The invention provides a missing value filling method, storage medium and system based on a linear self-learning network. A missing data set is constructed with random probability, and the newly generated missing data set together with the corresponding original data serves as a new data set; a linear self-learning network model is constructed; the filled values are optimized with a back propagation algorithm; the effectiveness of the linear self-learning network filling method is verified; and the influence of different time window sizes on missing value filling is explored. The method achieves higher accuracy and higher efficiency, restoring missing data more faithfully and more quickly. It is broadly applicable to fields such as stock price prediction, network traffic prediction, disease monitoring and weather forecasting, and improves the performance of deep learning models on these problems.
Referring to FIG. 1, the missing value filling method based on the linear self-learning network of the present invention comprises the following steps:
s1, preprocessing the acquired original time sequence data without missing values, constructing a missing data set according to random probability, and taking the newly generated missing data set and the corresponding original data as a new data set for training a model;
referring to FIG. 2, T represents the data tuples of 1-N time steps in the time domain consecutive data sets. Each tuple of data has a D-dimensional attribute value and the question mark indicates that the attribute value for that location is missing.
Suppose the missing value x_{i,j} is to be filled. Tuple x_{i-1} and tuple x_{i+1} form the current data tuple neighborhood; the attribute values of the current tuple x_i other than x_{i,j} are added to this neighborhood to form the missing value neighborhood of x_{i,j}, which is used to fill the missing value x_{i,j}.
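Under the same assumptions as the sketch above, the missing value neighborhood of a cell can be gathered as follows; the function name is illustrative.

```python
def missing_value_neighborhood(X, i, j):
    """Collect the 3d-1 values around X[i, j]: all values of tuples i-1 and
    i+1 plus the other d-1 values of tuple i (x_{i,j} itself is excluded)."""
    d = X.shape[1]
    window = np.concatenate([X[i - 1], X[i], X[i + 1]])  # the 3d window values
    keep = np.ones(3 * d, dtype=bool)
    keep[d + j] = False                                  # drop x_{i,j} itself
    return window[keep]                                  # length 3d - 1
```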
S2, constructing a linear self-learning network model;
each attribute value on the data tuples of one time step is computed in turn and then the time window is shifted to the next time step. Each attribute value in the current tuple has a 3d-1 parameter; as a result, the time window can be seen as a linear network structure of a set of d x 3d-1 parameters.
To compute the jth attribute value x_{i,j} at the ith time step there is a parameter set w_{i,j} and a missing value neighborhood set MVN_{i,j}, and the filling value y_{i,j} of the missing value is computed by the linear network:

w_{i,j} = {w_{i,j}^{(1)}, w_{i,j}^{(2)}, …, w_{i,j}^{(3d-1)}}

MVN_{i,j} = {x_{i-1,1}, …, x_{i-1,d}, x_{i,1}, …, x_{i,j-1}, x_{i,j+1}, …, x_{i,d}, x_{i+1,1}, …, x_{i+1,d}}

y_{i,j} = Σ_{k=1}^{3d-1} w_{i,j}^{(k)} · MVN_{i,j}^{(k)}

where k denotes the kth value in the parameter set or in the missing value neighborhood.
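The forward pass is therefore a single weighted sum. A sketch, continuing the NumPy conventions above:

```python
def fill_value(w_ij, mvn_ij):
    """Forward pass of the linear network: y_{i,j} = sum_k w^(k) * MVN^(k)."""
    return float(np.dot(w_ij, mvn_ij))
```

For example, fill_value(w, missing_value_neighborhood(X, i, j)) yields the candidate filling for X[i, j], given a weight vector w of length 3d-1 for attribute j.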
S3, optimizing the filling value by using a back propagation algorithm;
the deficiency value neighborhood is divided into two disjoint subsets: an incomplete neighborhood set, represented by IMVN, consisting of missing values; the complete neighborhood set, denoted CMVN, consists of complete values. When other missing values are contained in the missing value neighborhood, parameters and the missing values are optimized through a minimized loss function, and a loss value L is calculated through a mean square error functioni,j
Neighborhood of missing values
Figure BDA0002710064560000074
Li,j=(xi,j-yi,j)2
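A small sketch of this partition, assuming a boolean flag array marks which neighborhood entries are themselves missing (names illustrative):

```python
def split_neighborhood(mvn_values, mvn_is_missing):
    """Partition MVN into CMVN (observed entries) and IMVN (missing entries)."""
    cmvn = mvn_values[~mvn_is_missing]  # complete neighborhood set
    imvn = mvn_values[mvn_is_missing]   # incomplete neighborhood set
    return cmvn, imvn
```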
During training of the missing values, after the forward output values have been computed from the complete data and the network and the loss has been obtained, the weight parameters and the missing values are differentiated so as to optimize the network weight parameters w_{i,j} and the missing values:

w_{i,j}^{(k)} ← w_{i,j}^{(k)} - η_w · ∂L_{i,j}/∂w_{i,j}^{(k)}

x_p ← x_p - η_x · ∂L_{i,j}/∂x_p

where L_{i,j} is the loss function, MVN_{i,j} is the missing value neighborhood set, k ∈ MVN and p ∈ IMVN, and η_w, η_x are the respective learning rates.
All missing values are assigned initial values and then optimized iteratively with the back propagation algorithm. Since the weight parameters and the missing values are updated at different frequencies, different learning rates and learning rate decays are set for them during the training iterations.
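The whole optimization can be sketched compactly in PyTorch, which performs the differentiation above automatically. This is a minimal illustration under stated assumptions, not the patented implementation: the zero initialization of the missing values, the loop structure, the learning rates and the epoch count are illustrative, and the loss is accumulated only over observed cells, matching the description above.

```python
import torch

def train_fill(X_missing, mask, lr_w=1e-2, lr_x=1e-1, epochs=100):
    """X_missing: (N, d) float tensor with NaN at missing cells.
    mask: (N, d) bool tensor, True where a value is missing.
    Returns the data set with missing cells replaced by optimized values."""
    N, d = X_missing.shape
    # Missing values are learnable parameters (here initialized to zero).
    fill = torch.zeros(int(mask.sum()), requires_grad=True)
    # One weight row of length 3d-1 per attribute: d x (3d-1) parameters.
    W = (0.01 * torch.randn(d, 3 * d - 1)).requires_grad_()
    # Separate learning rates for the weights and the missing values.
    opt = torch.optim.SGD([{"params": [W], "lr": lr_w},
                           {"params": [fill], "lr": lr_x}])
    for _ in range(epochs):
        X = X_missing.clone()
        X[mask] = fill                        # splice in current estimates
        loss = torch.zeros(())
        for i in range(1, N - 1):
            window = torch.cat([X[i - 1], X[i], X[i + 1]])
            for j in range(d):
                keep = torch.ones(3 * d, dtype=torch.bool)
                keep[d + j] = False           # exclude x_{i,j} itself
                y = W[j] @ window[keep]       # linear forward pass y_{i,j}
                if not mask[i, j]:            # loss only on observed cells
                    loss = loss + (X_missing[i, j] - y) ** 2
        opt.zero_grad()
        loss.backward()                       # gradients w.r.t. W and fill
        opt.step()
    X = X_missing.clone()
    X[mask] = fill.detach()
    return X
```

Because the missing values sit inside the neighborhoods of observed cells, gradients flow to them through the same loss that trains the weights, which is the mutual reinforcement described above.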
The present invention also provides a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described.
The invention also provides a missing value filling system based on the linear self-learning network, comprising: a processor, and a memory coupled to the processor, the memory storing a computer program which, when executed by the processor, performs the steps of the above missing value filling method based on the linear self-learning network.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Checking the effectiveness of the linear self-learning network missing value filling method;
and calculating the error rate ER and the root mean square error RMSE between the original real data set and the filled data, and comparing the result with the traditional data filling method so as to check the effectiveness of the linear self-learning network.
Experiments were performed using polynomial regression filling as the baseline method. The method is also compared with KNN filling, testing two variants: weighted summation and direct averaging of the k nearest neighbors. Both of these methods use complete data tuples to determine the missing value neighborhood. For methods that form the neighborhood from incomplete data tuples, a comparison experiment is performed with the OSICM framework, which searches for the optimal filling order with a greedy algorithm. To show the effect of the other attribute values of the current tuple on filling, a comparison experiment without the current tuple data is added; the comparison results are shown in Table 1.
TABLE 1 Comparative experimental results of the methods
[Table 1 appears as an image in the original publication; its numerical results are not reproducible here.]
The influence of different time window sizes on missing value filling is explored;
the linear network is calculated in a time window, and a larger time window means more model parameters and larger neighborhood data, and the correlation between the neighboring data is worse in a time domain continuous data set.
The experiments form the missing value neighborhood from 3, 5 and 7 data tuples. As shown in Table 2, the lowest filling ER and RMSE are obtained when the time window size is 3. The results show that as the time window grows, the neighborhood admits more noise and the correlation of the neighborhood data decreases, which seriously degrades filling accuracy.
TABLE 2 Experimental results for different time window sizes
[Table 2 appears as an image in the original publication; its numerical results are not reproducible here.]
In summary, compared with traditional statistical methods, the missing value filling method, storage medium and system based on a linear self-learning network learn the relationships among the data through the linear self-learning network, implicitly extracting deep features, and are not limited by the completeness of the existing data.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (9)

1. A missing value filling method based on a linear self-learning network, characterized by comprising the following steps:
S1, acquiring original time series data without missing values, preprocessing the data, constructing a missing data set with random probability, and taking the newly generated missing data set together with the corresponding original data as a new data set;
S2, constructing a model based on a linear self-learning network and training it with the new data set generated in step S1;
S3, filling the missing values using the model trained in step S2 together with a back propagation algorithm, and using the filled, time-continuous complete data set to train recurrent neural network models on the essential characteristics and missing patterns of the data.
2. The missing value filling method based on a linear self-learning network according to claim 1, characterized in that in step S1, supposing the missing value x_{i,j} is to be filled, tuple x_{i-1} and tuple x_{i+1} form the current data tuple neighborhood; the attribute values of the current tuple x_i other than x_{i,j} are added to the current data tuple neighborhood to form the missing value neighborhood of x_{i,j}, which is used to fill the missing value x_{i,j}.
3. The missing value filling method based on a linear self-learning network according to claim 1, characterized in that in step S2, each attribute value on the data tuple of one time step is computed in turn, after which the time window is shifted to the next time step; each attribute value in the current tuple has 3d-1 parameters, so the time window is a linear network structure with a set of d × (3d-1) parameters.
4. The missing value filling method based on a linear self-learning network according to claim 3, characterized in that, given the parameter set w_{i,j} and the missing value neighborhood set MVN_{i,j}, the filling value y_{i,j} of the missing value is computed by the linear network as follows:

y_{i,j} = Σ_{k=1}^{3d-1} w_{i,j}^{(k)} · MVN_{i,j}^{(k)}

where k denotes the kth value in the parameter set or in the missing value neighborhood.
5. The missing value filling method based on a linear self-learning network according to claim 1, characterized in that in step S3, the missing value neighborhood is divided into two disjoint subsets: an incomplete neighborhood set consisting of missing values, and a complete neighborhood set consisting of complete values; when the missing value neighborhood contains other missing values, the parameters and the missing values are optimized by minimizing a loss function, with the loss value L_{i,j} computed by the mean squared error function; during training of the missing values, after the forward output values have been computed from the complete data and the network and the loss has been obtained, the weight parameters and the missing values are differentiated so as to optimize both; all missing values are then assigned initial values and optimized iteratively with the back propagation algorithm.
6. The missing value filling method based on a linear self-learning network according to claim 5, characterized in that the loss value L_{i,j} is:

L_{i,j} = (x_{i,j} - y_{i,j})²

where x_{i,j} is the jth attribute value at the ith time step and y_{i,j} is the filling value computed by the linear network.
7. The missing value filling method based on a linear self-learning network according to claim 5, characterized in that the network weight parameters and the missing values are optimized as:

w_{i,j}^{(k)} ← w_{i,j}^{(k)} - η_w · ∂L_{i,j}/∂w_{i,j}^{(k)}

x_p ← x_p - η_x · ∂L_{i,j}/∂x_p

where k ∈ MVN, p ∈ IMVN, and η_w, η_x are the respective learning rates.
8. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform the method of any one of claims 1 to 7.
9. A missing value filling system based on a linear self-learning network, characterized by comprising:
a processor, and a memory coupled to the processor, the memory storing a computer program which, when executed by the processor, implements the method of any one of claims 1 to 7.
CN202011052819.4A 2020-09-29 2020-09-29 Missing value filling method based on linear self-learning network, storage medium and system Pending CN112085125A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052819.4A CN112085125A (en) 2020-09-29 2020-09-29 Missing value filling method based on linear self-learning network, storage medium and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011052819.4A CN112085125A (en) 2020-09-29 2020-09-29 Missing value filling method based on linear self-learning network, storage medium and system

Publications (1)

Publication Number Publication Date
CN112085125A (en) 2020-12-15

Family

ID=73729846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011052819.4A Pending CN112085125A (en) 2020-09-29 2020-09-29 Missing value filling method based on linear self-learning network, storage medium and system

Country Status (1)

Country Link
CN (1) CN112085125A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486433A (en) * 2020-12-31 2021-10-08 上海东方低碳科技产业股份有限公司 Method for calculating energy consumption shortage number of net zero energy consumption building and filling system
CN112835884A (en) * 2021-02-19 2021-05-25 大连海事大学 Missing data filling method and system in marine fishing ground fishing situation forecasting system
CN112835884B (en) * 2021-02-19 2023-05-16 大连海事大学 Missing data filling method and system in ocean fishing ground fish condition forecasting system
CN112905716A (en) * 2021-02-24 2021-06-04 同济大学 Semiconductor production process data preprocessing method and device
CN113127469A (en) * 2021-04-27 2021-07-16 国网内蒙古东部电力有限公司信息通信分公司 Filling method and system for missing value of three-phase unbalanced data
CN113655457A (en) * 2021-08-24 2021-11-16 中国电子科技集团公司第十四研究所 Radar target detection capability self-evolution method and device based on sample mining
CN113655457B (en) * 2021-08-24 2023-11-24 中国电子科技集团公司第十四研究所 Self-evolution method and device for radar target detection capability based on sample mining

Similar Documents

Publication Publication Date Title
CN112085125A (en) Missing value filling method based on linear self-learning network, storage medium and system
CN112640380A (en) Apparatus and method for anomaly detection of an input stream of events
CN107315822B (en) Knowledge point association mining method
CN107070867B (en) Network flow abnormity rapid detection method based on multilayer locality sensitive hash table
CN112615888B (en) Threat assessment method and device for network attack behavior
CN110222029A (en) A kind of big data multidimensional analysis computational efficiency method for improving and system
US10452658B2 (en) Caching methods and a system for entropy-based cardinality estimation
CN109325062B (en) Data dependency mining method and system based on distributed computation
CN104937593A (en) System and method for database searching
CN110809066A (en) IPv6 address generation model creation method, device and address generation method
CN112087316B (en) Network anomaly root cause positioning method based on anomaly data analysis
CN115525038A (en) Equipment fault diagnosis method based on federal hierarchical optimization learning
CN110968564A (en) Data processing method and training method of data state prediction model
CN111122222B (en) Sample point position determining method and system
CN105447519A (en) Model detection method based on feature selection
CN105228185A (en) A kind of method for Fuzzy Redundancy node identities in identification communication network
CN107133335A (en) A kind of repetition record detection method based on participle and index technology
CN112257332B (en) Simulation model evaluation method and device
Xu et al. An improved LOF outlier detection algorithm
CN117093830A (en) User load data restoration method considering local and global
CN108717444A (en) A kind of big data clustering method and device based on distributed frame
CN106445960A (en) Data clustering method and device
CN109344163A (en) A kind of data verification method, device and computer-readable medium
CN110765130B (en) Ripley's K function-based spatio-temporal POI data point pattern analysis method in distributed environment
CN103338460B (en) For the computational methods of the node center degree of dynamic network environment

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination