CN116303386A

CN116303386A - Intelligent interpolation method and system for missing data based on relational graph

Info

Publication number: CN116303386A
Application number: CN202310146169.7A
Authority: CN
Inventors: 廖伟; 夏欢; 陈肇欣; 潘野; 张涛; 郑奕; 薛方冉; 陈哲; 晏楠欣
Original assignee: Second Research Institute of CAAC
Current assignee: Second Research Institute of CAAC
Priority date: 2023-02-21
Filing date: 2023-02-21
Publication date: 2023-06-23

Abstract

The invention relates to the technical field of information processing, and in particular to an intelligent interpolation method and system for missing data based on a relation map. The method follows the general idea of regression interpolation and introduces the relation map between data as an input control strategy for the missing value prediction model. An improved neural network model is adopted so that the missing values of multiple variables can be predicted with the same model. For scenarios in which data are missing on a large scale and in a large proportion, an interpolation sequence control strategy and a high-credibility secondary interpolation strategy are constructed. Overall, the invention reduces the complexity of the interpolation system and improves the computational efficiency of the interpolation process.

Description

Intelligent interpolation method and system for missing data based on relational graph

Technical Field

The invention relates to the technical field of information processing, in particular to a method and a system for intelligent interpolation of missing data based on a relational graph.

Background

With the wide application of machine learning and digital twin technology, software systems depend far more heavily on data, which places higher requirements on the integrity and credibility of data input. However, owing to defects in acquisition and storage, original data are frequently missing, and the interpolation of missing data is a problem that the engineering field has to face.

The prior art mainly comprises the following three types: the hot-deck interpolation method, the regression interpolation method, and the multiple interpolation method, wherein,

the hot-deck interpolation method finds, in the complete data, the object most similar to the object containing the missing value; when more than one similar object is found, one of the matching objects is selected at random to provide the filling value. The method is conceptually simple and uses the relationship between data to estimate null values, but its disadvantage is that the similarity standard is difficult to define accurately and is strongly influenced by subjective factors.

The multiple interpolation method assumes that the missing values are randomly distributed. A multiple interpolation algorithm such as MICE first estimates the values to be interpolated by regression interpolation, then simulates noise to form several groups of candidate interpolation values, and finally compares the generated data sets with the original data set and selects the one with the smallest distribution deviation from the original data set as the final result. Multiple interpolation can only handle random missingness, cannot handle non-random missingness, and also requires a large amount of computation.

The regression interpolation method is to use supervised machine learning methods, such as regression, nearest neighbor, random forest, support vector machine and other models, to establish a prediction model based on a complete data set, and to substitute known attributes into the model to predict missing attributes.

Specifically, the regression interpolation method establishes a missing value prediction model for each variable. In a big data scenario with many tables and many fields, modeling each variable consumes a great deal of resources and greatly increases the complexity of the system. In addition, during model training and actual prediction, the regression interpolation method takes all variables except the target variable as inputs, which consumes a great deal of computing power and time and also creates a dependence on those variables.

Disclosure of Invention

The invention aims to provide a one-stop data interpolation technology which, based on the idea of regression interpolation, introduces the relation map between data as input for training the missing value prediction model, so that the same model can predict the missing values of multiple variables. The constructed interpolation sequence control strategy, which considers the missing range and the missing correlation within the same row, is suitable for scenarios with large-scale, large-proportion missing data and achieves data interpolation with high reliability and high computational efficiency, thereby solving the problems pointed out in the background.

The embodiment of the invention is realized by the following technical scheme: a missing data intelligent interpolation method based on a relation map comprises the following steps:

generating a variable data set, and performing feature numeralization and numerical normalization pretreatment;

based on the correlation coefficient between the variables, establishing a variable relation graph;

training a neural network model by taking adjacent variables of all variables in the relation graph as input to obtain a missing value prediction model, wherein the adjacent variables are variables which are directly connected with a target variable in the relation graph;

based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row, realizing the intelligent interpolation of the missing data by using the missing value prediction model;

and decoding and restoring the variable.

According to a preferred embodiment, the establishing a variable relation map based on the correlation coefficient between the variables includes:

calculating a correlation matrix among all variables, and performing binarization processing on the correlation matrix;

and setting the diagonal elements of the correlation matrix subjected to binarization processing to 0 to obtain an adjacent matrix, and constructing a relationship map based on the adjacent matrix.

According to a preferred embodiment, the establishing a variable relation map based on the correlation coefficient between the variables further includes:

based on the obtained adjacency matrix, the adjacency matrix is optimally adjusted based on expert experience data.

According to a preferred embodiment, the training the neural network model by using the adjacent variable of each variable in the relation map as an input to obtain a missing value prediction model includes:

taking the adjacent vectors of the variables, carrying out hadamard product on the adjacent vectors and the input tensors row by row to generate N intermediate tensors with the same dimension, wherein N is the number of the variables in the input tensors;

performing N rounds of forward propagation by taking the intermediate tensor as model input to generate N output tensors, wherein parameter updating is not performed after each round of forward propagation;

performing one round of forward propagation by taking the input tensor as input, and updating a process parameter;

setting the other elements except for the j-th column in the output tensor to zero, and summing N output tensors to obtain a final output tensor, wherein j is the forward propagation round number of the output tensor;

and carrying out back propagation based on the deviation of the final output tensor and the input tensor, repeating until the network converges or the training times reach a set value, and completing the training of the missing value prediction model.

According to a preferred embodiment, the interpolation sequence control strategy based on considering the missing ranges in the same row is:

performing sufficiency verification on all null values of the current data line, and performing filling on the null values meeting the sufficiency verification requirement;

and iterating cyclically until no null value meeting the sufficiency verification requirement remains.

According to a preferred embodiment, the intelligent interpolation of missing data using the missing value prediction model includes:

computing the Hadamard product of the current data row and the adjacency vector of the target null value to obtain a masked vector;

taking the vector as input, and calculating a result through a missing value prediction model;

and extracting a column corresponding to the target null value in the calculation result as a predicted value to replace the target null value.

According to a preferred embodiment, the interpolation sequence control strategy considering the missing correlation in the same row is:

the null values are ordered according to their missing correlation, wherein the missing correlation is the share of correlation accounted for by missing values among all adjacent variables of the current null value, and the expression is as follows:

in the above, $r_{ij}$ denotes the element in the i-th row and j-th column of the correlation matrix, $l_j$ is the j-th element of the adjacency vector L, and $z_j$ is the j-th element of the missing state vector Z: if the j-th value of the data row is null, $z_j = 1$; otherwise $z_j = 0$;

filling is performed in order of missing correlation from low to high, with default values used in place of null values in the adjacent variables as input.

According to a preferred embodiment, after the filling, the method further comprises:

calculating the reliability of the interpolation data to form a reliability comparison table, wherein the interpolation data comprise original data and interpolated values, and the reliability of an interpolated value is calculated by the following expression:

in the above formula, $\varepsilon$ is a harmonic coefficient, $\eta$ is a model loss coefficient representing the reliability loss caused by model prediction, and $\lambda_j$ denotes the reliability of the j-th variable of the current data row.

According to a preferred embodiment, the training the neural network model by using the adjacent variable of each variable in the relation map as an input to obtain a missing value prediction model further includes:

taking the missing value prediction model as a pre-training model, and using the pre-training model to realize intelligent interpolation of missing data;

and calculating the average credibility of each row of interpolation data, and taking the data row with the average credibility higher than a preset threshold value as a new input to perform secondary training on the pre-training model to obtain a final missing value prediction model.

The invention also provides a missing data intelligent interpolation system based on the relation map, which is applied to the method, and comprises the following steps:

the processing module is used for generating a variable data set and carrying out feature numeralization and numerical normalization preprocessing;

the relation map construction module is used for building a variable relation map based on the correlation coefficient between the variables;

the training module is used for training the neural network model by taking adjacent variables of all variables in the relation graph as input so as to obtain a missing value prediction model, wherein the adjacent variables are variables which are directly connected with the target variables in the relation graph;

the interpolation module is used for realizing intelligent interpolation of the missing data by using the missing value prediction model based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row;

and the decoding module is used for decoding and restoring the variable.

The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects. The invention comprises a strategy for constructing a relation map of the variables and a control strategy that uses the relation map to adjust the input and output of the model; compared with traditional interpolation methods, the number of input variables and the dependence on other variables in the data set are greatly reduced, giving stronger compatibility with large-scale, large-proportion missing data. The invention comprises a unified missing value prediction model training strategy; in a big data scenario with many tables and many fields, the same model is used to predict all missing variables, which greatly reduces modeling time and system complexity. The invention comprises an interpolation sequence control strategy considering the missing range and the missing correlation in the same row; by adjusting the interpolation sequence, the authenticity of the data is preserved to the greatest extent, providing an important reliability reference for subsequent work.

Drawings

FIG. 1 is a flow chart of the relation-map-based intelligent interpolation method for missing data provided in embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of the relation map provided in embodiment 1 of the present invention;

FIG. 3 is a schematic diagram of the forward propagation and loss calculation process provided in embodiment 1 of the present invention;

FIG. 4 is a schematic flow chart of the intelligent interpolation provided in embodiment 1 of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Example 1

The invention discloses a missing data intelligent interpolation method based on a relation map, wherein a flow chart is shown in fig. 1, and the method is implemented according to the following steps:

1) Data preprocessing. Because the input variables may be in several formats, such as numerical values, characters and times, the variable data set needs to be preprocessed before modeling. The specific steps are as follows:

1.1 The character information is digitized by means including, but not limited to, label encoding, one-hot encoding, ordinal encoding, frequency encoding, relative time, and the like.

The following is a brief description taking one-hot encoding as an example:

For example, the character values include China Southern Airlines, Air China, China Eastern Airlines, Hainan Airlines, Xiamen Airlines, Sichuan Airlines, Shenzhen Airlines, Shandong Airlines, Juneyao Airlines, Spring Airlines, etc. These values are discrete and unordered rather than continuous.

The digitizing principle is that N states are encoded with an N-bit state register. For the above example, the encoding is as follows (only nine features are listed here, so N = 9):

China Southern Airlines → 100000000

Air China → 010000000

China Eastern Airlines → 001000000

Hainan Airlines → 000100000

Xiamen Airlines → 000010000

Sichuan Airlines → 000001000

Shenzhen Airlines → 000000100

Juneyao Airlines → 000000010

Spring Airlines → 000000001

1.2 The numerical information is normalized by means including, but not limited to, min-max (dispersion) normalization, logarithmic normalization, zero-mean (z-score) normalization, and the like, which are not described in detail herein.

After the preprocessing is completed, the processing methods and parameters are stored for subsequent decoding.
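As an illustration only, the preprocessing of step 1) might be sketched in Python as follows. The function names, the sample airline and passenger values, and the choice of min-max normalization are assumptions made for this sketch and are not prescribed by the embodiment.

```python
from typing import List, Tuple


def one_hot_fit(values: List[str]) -> List[str]:
    """Collect the distinct categories; their order fixes the bit positions."""
    categories: List[str] = []
    for value in values:
        if value not in categories:
            categories.append(value)
    return categories


def one_hot_encode(value: str, categories: List[str]) -> List[int]:
    """Encode one category as an N-bit vector containing a single 1."""
    return [1 if value == category else 0 for category in categories]


def one_hot_decode(bits: List[float], categories: List[str]) -> str:
    """Restore the category at the position of the largest element."""
    return categories[max(range(len(bits)), key=lambda k: bits[k])]


def min_max_fit(column: List[float]) -> Tuple[float, float]:
    """Return the (minimum, maximum) pair that must be stored for decoding."""
    return min(column), max(column)


def min_max_scale(x: float, low: float, high: float) -> float:
    """Map x into [0, 1]; a constant column is mapped to 0."""
    return 0.0 if high == low else (x - low) / (high - low)


def min_max_restore(y: float, low: float, high: float) -> float:
    """Invert min_max_scale using the stored parameters."""
    return y * (high - low) + low


# Made-up sample data (hypothetical values, for illustration only).
airlines = ["China Southern Airlines", "Air China", "China Eastern Airlines"]
passengers = [120.0, 180.0, 150.0]

# Processing parameters stored for the later decoding step.
params = {"airline": one_hot_fit(airlines), "passengers": min_max_fit(passengers)}

encoded = one_hot_encode("Air China", params["airline"])
scaled = [min_max_scale(x, *params["passengers"]) for x in passengers]
restored = [min_max_restore(y, *params["passengers"]) for y in scaled]
```

Storing `params` alongside the encoded data is what allows the decoding of step 6) to invert both transformations.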

2) According to the statistical correlation and/or the business correlation, a relation map among all variables in the data set is constructed, and the specific steps are as follows:

2.1 The specific steps of constructing a relationship map based on statistical correlation are as follows:

2.1.1 A correlation matrix between all variables is calculated. The means of calculating the correlation matrix include, but are not limited to, the Pearson correlation coefficient, the Spearman correlation coefficient, the Kendall correlation coefficient, and the like, wherein the Pearson correlation coefficient is calculated as follows:

$$r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2}\,\sqrt{\sum_{i}(Y_i - \bar{Y})^2}}$$

In the above formula, $\bar{X}$ denotes the mean of the variable $X_i$ and $\bar{Y}$ denotes the mean of the variable $Y_i$. It should be noted that the Pearson correlation coefficient ranges from -1 to 1: a coefficient value $r$ of 1 or -1 indicates an exact linear relationship between $X_i$ and $Y_i$, and a coefficient value of 0 indicates that there is no linear relationship between them. In particular, if $X_i$ and $Y_i$ tend to fall on the same side of their respective means, the correlation coefficient is positive; if they tend to fall on opposite sides of their respective means, the correlation coefficient is negative.

2.1.2 The correlation matrix is binarized. The binarization threshold may be preset or a default value may be used, and is not described further here.

2.1.3 The diagonal elements of the correlation matrix are set to 0 so that a variable is not treated as adjacent to itself. The resulting matrix is the adjacency matrix used as the relation map, which describes the association relationships between the variables, as shown in Table 1, an example of the adjacency matrix between variables provided by the embodiment of the invention:

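TABLE 1 Adjacency matrix between variables

| | Number of passengers | Luggage number | Door closing time | Task timeout rate | … |
| --- | --- | --- | --- | --- | --- |
| Number of passengers | 0 | 1 | 1 | 0 | … |
| Luggage number | 1 | 0 | 1 | 0 | … |
| Door closing time | 1 | 1 | 0 | 1 | … |
| Task timeout rate | 0 | 0 | 1 | 0 | … |
| … | … | … | … | … | … |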

The adjacency matrix is a symmetric matrix, and the adjacency vector of a target variable is the matrix row corresponding to that variable. An element of 1 indicates that the two variables corresponding to the row and the column are associated, and an element of 0 indicates that they are not.
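Steps 2.1.1 to 2.1.3 might be sketched as follows. This is a minimal illustration under stated assumptions: the threshold of 0.5, the use of the absolute value of the coefficient when binarizing, and the sample columns are all hypothetical choices, since the embodiment leaves the threshold and binarization rule open.

```python
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant column has no defined correlation; treat it as 0 here.
    return 0.0 if dev_x == 0 or dev_y == 0 else cov / (dev_x * dev_y)


def correlation_matrix(columns: List[List[float]]) -> List[List[float]]:
    """Step 2.1.1: pairwise correlation between all variables."""
    n = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(n)] for i in range(n)]


def adjacency_matrix(corr: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Steps 2.1.2 and 2.1.3: binarize (assumed rule: |r| >= threshold) and zero the diagonal."""
    n = len(corr)
    return [
        [0 if i == j else int(abs(corr[i][j]) >= threshold) for j in range(n)]
        for i in range(n)
    ]


# Made-up sample columns (hypothetical values, for illustration only).
passengers = [100.0, 150.0, 200.0, 250.0]
baggage = [80.0, 120.0, 170.0, 200.0]
unrelated = [5.0, 1.0, 4.0, 2.0]

corr = correlation_matrix([passengers, baggage, unrelated])
adjacency = adjacency_matrix(corr)
```

Each row of `adjacency` is then the adjacency vector of the corresponding variable.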

2.2 The relation map is optimized and adjusted based on business correlation. Specifically, because practitioners are more familiar with the business data, they can capture causal relationships beyond statistical correlation and can remove homogeneous associations from the relation map. Taking the civil aviation scenario data in fig. 2 as an example, the number of passengers is highly correlated with the number of pieces of baggage, so when predicting the door closing time, one of the two can be removed; this reduces the amount of calculation and the data dependence, and the resulting benefit outweighs the loss in precision. Therefore, in this embodiment, on the basis of the generated adjacency matrix and based on expert experience data, the elements corresponding to associations to be removed are set to 0 and the elements corresponding to associations to be added are set to 1. The relation map finally obtained is shown in fig. 2.
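The expert adjustment of step 2.2 might look like the sketch below, which starts from the matrix of Table 1. The function name, the edge-list representation, and the `symmetric` flag (whether an edit is mirrored across the diagonal) are assumptions for illustration; the embodiment only states that selected elements are set to 0 or 1.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]
Edge = Tuple[int, int]


def adjust_adjacency(
    adjacency: Matrix,
    remove: Sequence[Edge] = (),
    add: Sequence[Edge] = (),
    symmetric: bool = True,
) -> Matrix:
    """Return a copy of the adjacency matrix with expert edits applied."""
    adjusted = [row[:] for row in adjacency]
    for value, edges in ((0, remove), (1, add)):
        for i, j in edges:
            if i == j:
                continue  # the diagonal always stays 0
            adjusted[i][j] = value
            if symmetric:
                adjusted[j][i] = value
    return adjusted


# Variable order as in Table 1.
names = ["Number of passengers", "Luggage number", "Door closing time", "Task timeout rate"]
table_1 = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

# Hypothetical expert decision: drop the luggage/door-closing association.
adjusted = adjust_adjacency(table_1, remove=[(2, 1)])
```

Passing `symmetric=False` would change only the row of the target variable, which may be closer to the intent when an association is removed for the prediction of one specific variable.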

In summary, through step 2), the number of input variables and the dependence on other variables in the data set are greatly reduced compared with traditional interpolation methods, giving stronger compatibility with large-scale, large-proportion missing data.

3) Training the neural network model by taking adjacent variables of all variables in the relation map as input to obtain a missing value prediction model, wherein the method comprises the following specific steps of:

3.1 The model is initialized, and the general parameters for training a neural network model are set, such as the number of network layers, the number of neurons, the activation function, the learning rate, the loss function and the optimizer. In this embodiment, the model requires the input and output to have the same dimension, N.

In addition, before the forward propagation starts, the neuron weights need to be initialized, and the process is the same as that of the traditional feedforward neural network, and is not repeated here.

It should be noted that, the embodiment of the present invention adopts an improved neural network model, where the improved neural network model refers to the improvement of the timing and the number of forward propagation and backward propagation and the organization of input and output based on the deep feed forward network, and does not limit the network layer number, the neuron number, the activation function and other general parameters of the neural network.

3.2 Forward propagation. In this embodiment the input tensor is $P_{M \times N}$, a matrix of dimension M × N, where M is the batch size, i.e. the number of data rows in the batch input, and N is the number of variables in the data set. It should be noted that the input tensors used in the training process are complete data rows from the complete data set.

Further, the adjacency vector of each variable in $P_{M \times N}$ is taken and its Hadamard product with $P_{M \times N}$ is computed row by row, generating N intermediate tensors of dimension M × N. The purpose is to generate intermediate tensors that exclude the non-adjacent variables and the target variable, the adjacent variables being the variables directly connected to the target variable in the relation map.

The N intermediate tensors are used as model input for N rounds of forward propagation, generating N output tensors. The parameters are not updated by back-propagated gradients immediately after each round of forward propagation; only the output is recorded.

Finally, one round of forward propagation is performed with $P_{M \times N}$ as input, the purpose being to update the process parameters for the subsequent gradient calculation.
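The masking and the N forward rounds of step 3.2 might be sketched as follows. Plain nested lists stand in for tensors, and `toy_model` is a hypothetical placeholder for the neural network (it simply returns the row mean in every output position); both are assumptions made so that the sketch runs without a deep learning framework.

```python
from typing import Callable, List

Tensor = List[List[float]]


def hadamard_rows(batch: Tensor, adjacency_vector: List[int]) -> Tensor:
    """Multiply every row of the batch element-wise by one adjacency vector."""
    return [[value * flag for value, flag in zip(row, adjacency_vector)] for row in batch]


def intermediate_tensors(batch: Tensor, adjacency: List[List[int]]) -> List[Tensor]:
    """Build one masked copy of the batch per variable (N tensors of size M x N)."""
    return [hadamard_rows(batch, adjacency[j]) for j in range(len(adjacency))]


def forward_rounds(
    batch: Tensor, adjacency: List[List[int]], model: Callable[[Tensor], Tensor]
) -> List[Tensor]:
    """Run N forward passes and only record the outputs (no parameter update)."""
    return [model(tensor) for tensor in intermediate_tensors(batch, adjacency)]


def toy_model(batch: Tensor) -> Tensor:
    """Hypothetical stand-in for the network: output has the same M x N shape."""
    return [[sum(row) / len(row)] * len(row) for row in batch]


# Made-up batch with M = 2 rows and N = 3 variables.
P = [[0.25, 0.5, 0.75], [0.5, 0.25, 1.0]]
A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

masked = intermediate_tensors(P, A)
outputs = forward_rounds(P, A, toy_model)
```

Because the diagonal of the adjacency matrix is 0, the j-th masked tensor never contains the target variable itself, so the model cannot learn to copy its input.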

3.3 Calculating loss, wherein the specific steps are as follows:

3.3.1 In each of the N output tensors, all elements other than those in the j-th column are set to zero, where j is the forward propagation round that produced the output tensor, and the N output tensors are summed to obtain the final output tensor $O_{M \times N}$. The purpose is to extract the valid column from each of the N output matrices to form the final output.

3.3.2 The deviation between $O_{M \times N}$ and $P_{M \times N}$ is calculated based on a loss function. It should be noted that $P_{M \times N}$ is the correct value of the output, so the deviation between $O_{M \times N}$ and $P_{M \times N}$ is the loss of the current model. The loss functions used include, but are not limited to, general machine learning loss functions such as L1 norm loss, mean square error loss, cross entropy loss and KL divergence loss, which are not described in detail herein.
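The loss calculation of step 3.3 might be sketched as follows. The output values are made up for illustration (the 9.0 entries mark columns that are discarded), and mean square error is chosen here only as one of the loss functions the embodiment permits.

```python
from typing import List

Tensor = List[List[float]]


def combine_outputs(outputs: List[Tensor]) -> Tensor:
    """Zero every column except column j in the j-th output, then sum all outputs."""
    rounds = len(outputs)
    rows = len(outputs[0])
    final = [[0.0] * rounds for _ in range(rows)]
    for j, tensor in enumerate(outputs):
        for i in range(rows):
            for k in range(rounds):
                kept = tensor[i][k] if k == j else 0.0
                final[i][k] += kept
    return final


def mse_loss(output: Tensor, target: Tensor) -> float:
    """Mean square error over all M x N elements."""
    count = len(target) * len(target[0])
    return sum(
        (o - t) ** 2 for out_row, tgt_row in zip(output, target) for o, t in zip(out_row, tgt_row)
    ) / count


# Made-up outputs of N = 3 forward rounds for a batch of M = 2 rows.
outputs = [
    [[0.25, 9.0, 9.0], [0.0, 9.0, 9.0]],
    [[9.0, 0.5, 9.0], [9.0, 0.25, 9.0]],
    [[9.0, 9.0, 0.75], [9.0, 9.0, 1.0]],
]
P = [[0.25, 0.5, 0.75], [0.5, 0.25, 1.0]]

O = combine_outputs(outputs)
loss = mse_loss(O, P)
```

Only the first element of the second row differs from the target in this example, so it alone contributes to the loss.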

4) Back propagation is performed based on the deviation between $O_{M \times N}$ and $P_{M \times N}$: the contribution of each neuron to the loss is calculated, and the weights are updated according to the gradients computed by the back propagation algorithm. This is the same as in a conventional feedforward neural network and is not described herein. It should be noted that, throughout deep neural network training, the back propagation process takes much more time than the forward propagation process, so the multiple rounds of forward propagation do not significantly increase the training time. The specific forward propagation and loss calculation process is shown in fig. 3.

Steps 2) to 4) are repeated until the network converges or the number of training iterations reaches the set value, completing the training of the missing value prediction model.

In summary, with the missing value prediction model obtained through step 4), the same model is used to predict all missing variables in a big data scenario with many tables and many fields, which greatly reduces modeling time and system complexity.

5) Based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row, the intelligent interpolation of the missing data is realized by using the missing value prediction model. The specific steps are as follows:

5.1 The row number is initialized to m = 1 and null value screening is performed (see fig. 4), specifically as follows:

5.1.1 Sufficiency verification is performed on all null values of the current data row. The sufficiency verification checks whether all adjacent variables corresponding to the current null value are non-null; if they are all non-null, the sufficiency verification is satisfied.

5.1.2 Null values meeting the sufficiency verification requirement are filled, specifically as follows:

5.1.2.1 The Hadamard product of the current data row and the adjacency vector of the target null value is computed to obtain a masked vector;

5.1.2.2 The vector is used as input, and a result is calculated through the missing value prediction model;

5.1.2.3 The column corresponding to the target null value is extracted from the calculation result as the predicted value and replaces the target null value.

5.1.3 The above is iterated cyclically until no null value meeting the sufficiency verification requirement remains.
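Step 5.1 for a single data row might be sketched as follows. Null values are represented by `None`, `toy_model` is a hypothetical placeholder for the trained missing value prediction model, and the first adjacency matrix is deliberately asymmetric (as could arise after an expert adjustment); all of these are assumptions for illustration.

```python
from typing import Callable, List, Optional

Row = List[Optional[float]]
Model = Callable[[List[float]], List[float]]


def is_sufficient(row: Row, adjacency_vector: List[int]) -> bool:
    """Step 5.1.1: every adjacent variable of the target must be non-null."""
    return all(row[k] is not None for k, flag in enumerate(adjacency_vector) if flag == 1)


def predict_value(row: Row, j: int, adjacency: List[List[int]], model: Model) -> float:
    """Step 5.1.2: mask the row with the adjacency vector, predict, take column j."""
    masked = [(0.0 if value is None else value) * flag for value, flag in zip(row, adjacency[j])]
    return model(masked)[j]


def fill_sufficient(row: Row, adjacency: List[List[int]], model: Model) -> Row:
    """Step 5.1.3: repeat until no remaining null passes the sufficiency check."""
    row = list(row)
    progress = True
    while progress:
        progress = False
        for j, value in enumerate(row):
            if value is None and is_sufficient(row, adjacency[j]):
                row[j] = predict_value(row, j, adjacency, model)
                progress = True
    return row


def toy_model(vector: List[float]) -> List[float]:
    """Hypothetical stand-in for the trained model."""
    return [sum(vector)] * len(vector)


# Variable 0 depends on 1, variable 1 on 2, variable 2 on 0 and 1.
chain = [[0, 1, 0], [0, 0, 1], [1, 1, 0]]
filled = fill_sufficient([None, None, 0.6], chain, toy_model)

# Symmetric graph in which two nulls block each other and stay unfilled.
blocked = fill_sufficient([0.4, None, None], [[0, 1, 0], [1, 0, 1], [0, 1, 0]], toy_model)
```

In the first case variable 1 is filled in the first pass and variable 0 only in the second, which is why the loop is needed; the second case is the kind of row that is handed on to the sorting step that follows.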

5.2 When no remaining null value satisfies the sufficiency verification, this does not mean that all null values have been filled: several null values may depend on one another and therefore cannot be filled. A null value sorting step is then performed, specifically as follows:

5.2.1 The null values are sorted according to their missing correlation R, wherein the expression of the missing correlation is as follows:

In the above, $r_{ij}$ denotes the element in the i-th row and j-th column of the correlation matrix, $l_j$ denotes the j-th element of the adjacency vector, and $z_j$ denotes the j-th element of the missing state vector: if the j-th value of the data row is null, $z_j = 1$; otherwise $z_j = 0$.

5.2.2 Assignment and filling. Filling is performed according to the filling procedure of step 5.1.2), in order of missing correlation R from low to high. If the row number m is then greater than the total number of rows M, the process ends; otherwise m is increased by 1 and the process returns to step 5.1.1).

It should be noted that, before filling, a default value is used as model input in place of each null value in the adjacent variables, where the default value includes, but is not limited to, the median, mode or mean of the variable in the data set, and is not described further here.

In the early stage of assignment filling within a row, more of the input variables hold default values, but their influence is small because the missing correlation R is low; in the later stage, fewer input variables hold default values while the missing correlation is higher. Overall, this maximizes the reliability of the interpolated data.
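Step 5.2 might be sketched as follows. The expression for the missing correlation is not reproduced in the text above, so the form used here, the correlation summed over missing adjacent variables divided by the correlation summed over all adjacent variables, with absolute values, is an assumption inferred from the verbal definition and the symbols $r_{ij}$, $l_j$ and $z_j$. The model, the default values and the sample matrices are likewise hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Row = List[Optional[float]]
Model = Callable[[List[float]], List[float]]


def missing_correlation(corr_row: List[float], adjacency_vector: List[int], z: List[int]) -> float:
    """Assumed form: sum(|r_ij| l_j z_j) / sum(|r_ij| l_j)."""
    total = sum(abs(r) * l for r, l in zip(corr_row, adjacency_vector))
    missing = sum(abs(r) * l * flag for r, l, flag in zip(corr_row, adjacency_vector, z))
    return 0.0 if total == 0 else missing / total


def fill_by_missing_correlation(
    row: Row,
    corr: List[List[float]],
    adjacency: List[List[int]],
    defaults: List[float],
    model: Model,
) -> Tuple[List[Optional[float]], List[int]]:
    """Fill the remaining nulls from low to high missing correlation."""
    row = list(row)
    z = [1 if value is None else 0 for value in row]
    order = sorted(
        (j for j, flag in enumerate(z) if flag),
        key=lambda j: missing_correlation(corr[j], adjacency[j], z),
    )
    for j in order:
        # Nulls still present among the inputs are replaced by default values.
        inputs = [defaults[k] if value is None else value for k, value in enumerate(row)]
        masked = [value * flag for value, flag in zip(inputs, adjacency[j])]
        row[j] = model(masked)[j]
    return row, order


def toy_model(vector: List[float]) -> List[float]:
    """Hypothetical stand-in for the trained model."""
    return [sum(vector) / 2] * len(vector)


# Made-up example: variables 1 and 2 are null and adjacent to each other.
corr = [[1.0, 0.75, 0.125], [0.75, 1.0, 0.25], [0.125, 0.25, 1.0]]
adjacency = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
defaults = [0.5, 0.5, 0.25]

result, order = fill_by_missing_correlation([0.5, None, None], corr, adjacency, defaults, toy_model)
```

Variable 1 is filled first because most of its correlation comes from an observed neighbour; when variable 2 is filled afterwards, its input already contains the predicted value of variable 1 instead of a default.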

Further, each time a null value is interpolated, the reliability of the interpolation data is calculated to form a reliability comparison table, wherein the interpolation data comprise original data and interpolated values, and the reliability of an interpolated value is calculated by the following expression:

In the above formula, $\varepsilon$ is a harmonic coefficient, $\eta$ is a model loss coefficient representing the reliability loss caused by model prediction, and $\lambda_j$ denotes the reliability of the j-th variable of the current data row: if the variable is original data, $\lambda_j = 1$; if it is a default value, $\lambda_j = \mu$, where $\mu$ is a default loss coefficient representing the reliability loss caused by using a default value as input; if it is a value generated by interpolation in a preceding step, $\lambda_j$ is the value calculated by this formula in that step. $\varepsilon$, $\eta$ and $\mu$ are constants, which may be preset or left at their default values.

In summary, the invention can adjust the interpolation sequence to keep the data authenticity to the greatest extent through the step 5), and provides important reliability reference for subsequent work.

6) Since the interpolation data and the original data are in the encoded state, the present embodiment also needs to decode and restore the variable according to the processing method and parameters stored in step 1.2), and finally form a new data set after the interpolation is implemented.

Example 2

In order to further improve the prediction accuracy of the model, this embodiment differs from embodiment 1 in that the missing value prediction model obtained in step 3) is used as a pre-training model and undergoes secondary training. The pre-training model is first used to realize intelligent interpolation of the missing data;

the average reliability of each row of interpolation data is then calculated, the data rows whose average reliability is higher than a preset threshold are used as new input for secondary training of the pre-training model to obtain the final missing value prediction model, and prediction interpolation is performed again.
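The row selection for secondary training might be sketched as follows. The reliability values are taken as given inputs (for example from the reliability comparison table), and the threshold of 0.8, the function name and the sample numbers are assumptions for illustration.

```python
from typing import List

Table = List[List[float]]


def select_rows_for_retraining(rows: Table, reliability: Table, threshold: float) -> Table:
    """Keep the interpolated rows whose average reliability exceeds the threshold."""
    return [
        row
        for row, row_reliability in zip(rows, reliability)
        if sum(row_reliability) / len(row_reliability) > threshold
    ]


# Made-up interpolated rows and their per-element reliability (1.0 = original data).
rows = [[0.2, 0.4, 0.6], [0.1, 0.3, 0.5], [0.7, 0.8, 0.9]]
reliability = [[1.0, 1.0, 1.0], [1.0, 0.5, 0.5], [1.0, 1.0, 0.75]]

retraining_rows = select_rows_for_retraining(rows, reliability, threshold=0.8)
```

The selected rows would then be fed to the same training procedure as in step 3), starting from the pre-trained weights.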

According to the scheme provided by the embodiment, the data utilization rate is further improved through a mode of combining the pre-training and the secondary training, and the method has stronger adaptability under the conditions of large-scale missing of data and fewer complete data lines, so that the compatibility of the model to the large-scale missing condition is improved, and the prediction accuracy can be further improved compared with the scheme of the embodiment 1.

Example 3

The embodiment of the invention provides a missing data intelligent interpolation system based on a relation map, which is applied to the method as described in the embodiment 1 or the embodiment 2, and comprises the following steps:

and the decoding module is used for decoding and restoring the variable.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The intelligent interpolation method for missing data based on the relation map is characterized by comprising the following steps of:

and decoding and restoring the variable.

2. The intelligent interpolation method for missing data based on a relationship map according to claim 1, wherein the establishing a variable relationship map based on correlation coefficients between variables comprises:

3. The intelligent interpolation method of missing data based on a relationship map according to claim 2, wherein the building of a variable relationship map based on correlation coefficients between variables further comprises:

4. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein training the neural network model with the adjacent variable of each variable in the relational graph as an input to obtain a missing value prediction model comprises:

5. The intelligent interpolation method for missing data based on a relational graph as set forth in claim 1, wherein the interpolation sequence control strategy based on considering missing ranges in the same line is:

6. The intelligent interpolation method for missing data based on a relationship map according to claim 1, wherein the intelligent interpolation for missing data using the missing value prediction model comprises:

7. The intelligent interpolation method of missing data based on a relational graph as set forth in claim 1, wherein the interpolation sequence control strategy based on considering the degree of correlation of missing in the same line is:

in the above, $r_{ij}$ denotes the element in the i-th row and j-th column of the correlation matrix, $l_j$ is the j-th element of the adjacency vector L, and $z_j$ is the j-th element of the missing state vector Z: if the j-th value of the data row is null, $z_j = 1$; otherwise $z_j = 0$;

8. The intelligent interpolation method of missing data based on a relationship map of claim 7, further comprising, after the performing of the filling:

9. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein training the neural network model by taking the adjacent variable of each variable in the relational graph as an input to obtain a missing value prediction model, further comprises:

10. The intelligent interpolation system for missing data based on a relational graph, which is applied to the method as claimed in any one of claims 1 to 9, and is characterized by comprising:

the interpolation module is used for realizing intelligent interpolation of the missing data by using the missing value prediction model based on an interpolation sequence control strategy considering the missing range and the missing correlation in the same row; and the decoding module is used for decoding and restoring the variable.