CN116303386A - Intelligent interpolation method and system for missing data based on relational graph

Publication number
CN116303386A
CN116303386A CN202310146169.7A CN202310146169A CN116303386A CN 116303386 A CN116303386 A CN 116303386A CN 202310146169 A CN202310146169 A CN 202310146169A CN 116303386 A CN116303386 A CN 116303386A
Authority
CN
China
Prior art keywords
missing
variables
interpolation
data
variable
Legal status
Pending
Application number
CN202310146169.7A
Other languages
Chinese (zh)
Inventor
廖伟
夏欢
陈肇欣
潘野
张涛
郑奕
薛方冉
陈哲
晏楠欣
Current Assignee
Second Research Institute of CAAC
Original Assignee
Second Research Institute of CAAC
Application filed by Second Research Institute of CAAC
Priority to CN202310146169.7A
Publication of CN116303386A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Abstract

The invention relates to the technical field of information processing, and in particular to an intelligent interpolation method and system for missing data based on a relational graph. The method follows the regression interpolation approach and introduces a relational graph between the data as an input control strategy for the missing value prediction model. An improved neural network model is adopted so that the missing values of multiple variables are predicted with the same model. For scenarios in which data is missing on a large scale and in large proportion, an interpolation order control strategy and a high-credibility secondary interpolation strategy are constructed. Overall, the invention reduces the complexity of the interpolation system and improves the computational efficiency of the interpolation process.

Description

Intelligent interpolation method and system for missing data based on relational graph
Technical Field
The invention relates to the technical field of information processing, in particular to a method and a system for intelligent interpolation of missing data based on a relational graph.
Background
With the wide application of machine learning and digital twin technology, software systems have become far more dependent on data, which places higher requirements on the integrity and credibility of data input. However, owing to defects in acquisition and storage, original data is frequently missing, and the interpolation of missing data is a problem that engineering practice has to face.
The prior art mainly falls into three types: the hot-deck interpolation method, the regression interpolation method, and the multiple interpolation method.
The hot-deck interpolation method finds, in the complete data, the object most similar to the object containing the missing value and uses its value for filling; when more than one similar object is found, one of the matching objects is selected at random as the filling value. The method is conceptually simple and uses the relationships between data to estimate null values, but its disadvantage is that the similarity criterion is difficult to define accurately and is strongly influenced by subjective factors.
The multiple interpolation method assumes that the missing values are randomly distributed. A multiple interpolation algorithm such as MICE first estimates the values to be interpolated by regression interpolation, then adds simulated noise to form several sets of candidate interpolation values, and finally compares the generated data sets with the original data set and selects the one whose distribution deviates least from the original as the final result. Multiple interpolation can only handle random missingness, cannot handle non-random missingness, and requires a large amount of computation.
The regression interpolation method uses supervised machine learning models, such as regression, nearest neighbor, random forest and support vector machine models, to build a prediction model on the complete data set, and substitutes the known attributes into the model to predict the missing attributes.
Specifically, the regression interpolation method builds a missing value prediction model for each variable. In big data scenarios with many tables and many fields, modeling each variable separately consumes a great deal of resources and greatly increases system complexity. In addition, during model training and actual prediction, the regression interpolation method takes all variables other than the target variable as inputs, which consumes a great deal of computing power and time and creates a dependence on those variables.
Disclosure of Invention
The invention aims to provide a one-stop data interpolation technique. Based on the idea of regression interpolation, it introduces a relational graph between the data as input for training a missing value prediction model, so that a single model can predict the missing values of multiple variables. The interpolation order control strategy, which considers the missing range and the missing correlation within the same row, is suitable for scenarios in which data is missing on a large scale and in large proportion, and achieves data interpolation with high credibility and high computational efficiency, thereby solving the problems pointed out in the Background.
The embodiments of the invention are realized by the following technical scheme: an intelligent interpolation method for missing data based on a relational graph, comprising the following steps:
generating a variable data set, and preprocessing it by feature digitization and numerical normalization;
establishing a variable relational graph based on the correlation coefficients between the variables;
training a neural network model by taking the adjacent variables of each variable in the relational graph as input to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to the target variable in the relational graph;
realizing intelligent interpolation of the missing data using the missing value prediction model, based on an interpolation order control strategy that considers the missing range and the missing correlation within the same row;
and decoding and restoring the variables.
According to a preferred embodiment, establishing the variable relational graph based on the correlation coefficients between the variables comprises:
calculating the correlation matrix between all variables, and binarizing the correlation matrix;
and setting the diagonal elements of the binarized correlation matrix to 0 to obtain an adjacency matrix, and constructing the relational graph based on the adjacency matrix.
According to a preferred embodiment, establishing the variable relational graph based on the correlation coefficients between the variables further comprises:
optimizing and adjusting the obtained adjacency matrix based on expert experience data.
According to a preferred embodiment, training the neural network model by taking the adjacent variables of each variable in the relational graph as input to obtain the missing value prediction model comprises:
taking the adjacency vector of each variable and computing its Hadamard product with the input tensor row by row, to generate N intermediate tensors of the same dimension as the input tensor, where N is the number of variables in the input tensor;
performing N rounds of forward propagation with the intermediate tensors as model input to generate N output tensors, wherein no parameter update is performed after each round of forward propagation;
performing one round of forward propagation with the input tensor as input, to update the process parameters;
setting all elements of each output tensor except those in its j-th column to zero, and summing the N output tensors to obtain a final output tensor, where j is the forward propagation round in which that output tensor was generated;
and performing back propagation based on the deviation between the final output tensor and the input tensor, repeating until the network converges or the number of training iterations reaches a set value, thereby completing the training of the missing value prediction model.
According to a preferred embodiment, the interpolation order control strategy considering the missing range within the same row is:
performing sufficiency verification on all null values of the current data row, and filling the null values that satisfy the sufficiency verification;
and iterating until no null value satisfies the sufficiency verification.
According to a preferred embodiment, realizing intelligent interpolation of the missing data using the missing value prediction model comprises:
computing the Hadamard product of the current data row and the adjacency vector of the target null value to obtain a masked vector;
taking the masked vector as input, and computing a result with the missing value prediction model;
and extracting the column corresponding to the target null value from the computed result as the predicted value, which replaces the target null value.
According to a preferred embodiment, the interpolation order control strategy considering the missing correlation within the same row is:
sorting the null values by their missing correlation, where the missing correlation is the share of correlation contributed by missing values among all adjacent variables of the current null value, expressed as follows:
Figure BDA0004089268140000051
In the above formula, $r_{ij}$ is the element in the i-th row and j-th column of the correlation matrix, $l_j$ is the j-th element of the adjacency vector L, and $z_j$ is the j-th element of the missing state vector Z: if the j-th entry of the data row is null, then $z_j = 1$, otherwise $z_j = 0$;
and filling the null values in order of missing correlation from low to high, with default values used as input in place of null values in the adjacent variables.
According to a preferred embodiment, after the filling, the method further comprises:
calculating the credibility of the interpolation data to form a credibility comparison table, wherein the interpolation data consists of original data and interpolated values, and the credibility of an interpolated value is calculated as follows:
Figure BDA0004089268140000052
In the above formula, ε is a harmonic coefficient, η is the model loss coefficient, which represents the loss of credibility caused by model prediction, and $\lambda_j$ is the credibility of the j-th variable of the current data row.
According to a preferred embodiment, training the neural network model by taking the adjacent variables of each variable in the relational graph as input to obtain the missing value prediction model further comprises:
taking the missing value prediction model as a pre-trained model, and using the pre-trained model to realize intelligent interpolation of the missing data;
and calculating the average credibility of each row of interpolation data, and using the data rows whose average credibility is higher than a preset threshold as new input for secondary training of the pre-trained model, to obtain the final missing value prediction model.
The invention also provides an intelligent interpolation system for missing data based on a relational graph, which applies the above method and comprises:
a processing module, configured to generate a variable data set and preprocess it by feature digitization and numerical normalization;
a relational graph construction module, configured to establish a variable relational graph based on the correlation coefficients between the variables;
a training module, configured to train the neural network model by taking the adjacent variables of each variable in the relational graph as input to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to the target variable in the relational graph;
an interpolation module, configured to realize intelligent interpolation of the missing data using the missing value prediction model, based on an interpolation order control strategy that considers the missing range and the missing correlation within the same row;
and a decoding module, configured to decode and restore the variables.
The technical scheme of the embodiments of the invention has at least the following advantages and beneficial effects. The invention includes a strategy for constructing a relational graph of the variables and a control strategy that uses the relational graph to adjust the input and output of the model; compared with traditional interpolation methods, this greatly reduces the number of input variables and the dependence on other variables in the data set, and provides stronger compatibility with data missing on a large scale and in large proportion. The invention includes a unified training strategy for the missing value prediction model: in big data scenarios with many tables and many fields, the same model is used to predict all missing variables, which greatly reduces modeling time and system complexity. The invention includes an interpolation order control strategy that considers the missing range and the missing correlation within the same row; by adjusting the interpolation order it preserves the authenticity of the data as far as possible, and it provides an important credibility reference for subsequent work.
Drawings
Fig. 1 is a flow chart of the intelligent interpolation method for missing data based on a relational graph provided in embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the relational graph provided in embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the forward propagation and loss calculation process provided in embodiment 1 of the present invention;
Fig. 4 is a schematic flow chart of the intelligent interpolation provided in embodiment 1 of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Example 1
This embodiment discloses an intelligent interpolation method for missing data based on a relational graph. Its flow chart is shown in Fig. 1, and the method is carried out in the following steps:
1) Data preprocessing. Because the input variables may be in several formats, such as numerical values, characters and times, the variable data set needs to be preprocessed before modeling. The specific steps are as follows:
1.1) Digitizing the character information, by means including but not limited to label encoding, one-hot encoding, ordinal encoding, frequency encoding and relative time.
One-hot encoding is briefly described below as an example:
For example, the characters include China Southern Airlines, Air China, China Eastern Airlines, Hainan Airlines, Xiamen Airlines, Sichuan Airlines, Shenzhen Airlines, Shandong Airlines, Juneyao Airlines, Spring Airlines, etc. These values are not continuous but discrete and unordered.
The digitization principle is to encode N states with an N-bit state register. For the character example above, nine of the values are encoded here, so N = 9:
China Southern Airlines → 100000000
Air China → 010000000
China Eastern Airlines → 001000000
Hainan Airlines → 000100000
Xiamen Airlines → 000010000
Sichuan Airlines → 000001000
Shenzhen Airlines → 000000100
Juneyao Airlines → 000000010
Spring Airlines → 000000001
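The encoding listed above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `one_hot_codes` is hypothetical, and the bit order simply follows the order in which the categories are supplied.

```python
def one_hot_codes(categories):
    """Map each category to an N-bit string containing a single 1 (N = number of categories)."""
    n = len(categories)
    return {
        name: "".join("1" if bit == index else "0" for bit in range(n))
        for index, name in enumerate(categories)
    }


airlines = [
    "China Southern Airlines", "Air China", "China Eastern Airlines",
    "Hainan Airlines", "Xiamen Airlines", "Sichuan Airlines",
    "Shenzhen Airlines", "Juneyao Airlines", "Spring Airlines",
]
codes = one_hot_codes(airlines)
for name in airlines:
    print(name, "->", codes[name])
```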
1.2) Normalizing the numerical information, by means including but not limited to min-max (dispersion) normalization, logarithmic normalization and zero-mean normalization, which are not described in detail here.
After the preprocessing is completed, the processing methods and parameters are stored for subsequent decoding.
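As a minimal sketch of why the parameters are stored, the following Python example applies min-max normalization and later restores the original values from the saved parameters. The function names and the sample passenger counts are assumptions made for illustration; the source does not prescribe a particular normalization or storage format.

```python
def fit_min_max(values):
    """Record the parameters needed later to undo the normalization."""
    return {"min": min(values), "max": max(values)}


def encode(value, params):
    """Scale a value into [0, 1] using the stored parameters."""
    span = params["max"] - params["min"]
    return 0.0 if span == 0 else (value - params["min"]) / span


def decode(scaled, params):
    """Restore a normalized value to its original scale."""
    return scaled * (params["max"] - params["min"]) + params["min"]


passengers = [120, 180, 150, 90]          # illustrative values only
params = fit_min_max(passengers)          # stored for the decoding step
scaled = [encode(v, params) for v in passengers]
restored = [decode(s, params) for s in scaled]
```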
2) Constructing a relational graph among all variables in the data set according to statistical correlation and/or business correlation. The specific steps are as follows:
2.1) Constructing the relational graph based on statistical correlation, with the following specific steps:
2.1.1) Calculating the correlation matrix between all variables. The correlation measures include but are not limited to the Pearson correlation coefficient, the Spearman correlation coefficient and the Kendall correlation coefficient. The Pearson correlation coefficient is calculated as follows:
$$r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2}\,\sqrt{\sum_{i}(Y_i - \bar{Y})^2}}$$
In the above formula, $\bar{X}$ is the mean of the variable $X_i$ and $\bar{Y}$ is the mean of the variable $Y_i$. The Pearson correlation coefficient ranges from -1 to 1: a value of r = 1 or r = -1 means that $X_i$ and $Y_i$ have an exact linear relationship, and a value of r = 0 means that there is no linear relationship between them. In particular, if $X_i$ and $Y_i$ tend to fall on the same side of their respective means, the correlation coefficient is positive; if they tend to fall on opposite sides of their respective means, the correlation coefficient is negative.
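For reference, the standard Pearson correlation coefficient can be computed as in the sketch below. The function name is illustrative, and the sketch assumes that neither variable is constant (otherwise the denominator is zero).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes both sequences vary; a constant sequence would give a zero denominator.
    return covariance / math.sqrt(spread_x * spread_y)


r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # values move together
r_down = pearson([1, 2, 3], [3, 2, 1])        # values move in opposite directions
```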
2.1.2) Binarizing the correlation matrix. The binarization threshold can be preset or a default value can be used, and is not described further here.
2.1.3) Setting the diagonal elements of the correlation matrix to 0, so that a variable is not regarded as adjacent to itself. The resulting matrix is the adjacency matrix used as the relational graph, which describes the associations between the variables. Table 1 is an example of an adjacency matrix between variables provided by this embodiment:
Table 1. Adjacency matrix between variables
                      Number of passengers   Luggage number   Door closing time   Task timeout rate
Number of passengers  0                      1                1                   0
Luggage number        1                      0                1                   0
Door closing time     1                      1                0                   1
Task timeout rate     0                      0                1                   0
The adjacency matrix is symmetric. The adjacency vector of a target variable is the matrix row corresponding to that variable. An element of 1 indicates that the two variables corresponding to the row and column are associated, and an element of 0 indicates that they are not.
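Steps 2.1.2) and 2.1.3) can be sketched as follows. The correlation values, the threshold of 0.5 and the use of the absolute value of the correlation are all assumptions chosen for illustration (the source leaves the threshold open); the values were picked so that the result matches the adjacency matrix of Table 1.

```python
def adjacency_from_correlation(corr, threshold=0.5):
    """Binarize a correlation matrix and zero its diagonal.

    Assumption: a pair counts as associated when |r| >= threshold.
    """
    n = len(corr)
    return [
        [1 if i != j and abs(corr[i][j]) >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical correlations for: passengers, luggage, door closing time, task timeout rate.
corr = [
    [1.0, 0.9, 0.7, 0.1],
    [0.9, 1.0, 0.6, 0.2],
    [0.7, 0.6, 1.0, -0.8],
    [0.1, 0.2, -0.8, 1.0],
]
adjacency = adjacency_from_correlation(corr)
```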
2.2) Optimizing and adjusting the relational graph based on business correlation. Practitioners are more familiar with the business data, so they can capture causal relationships beyond statistical correlation and remove redundant (homogeneous) associations from the relational graph. Taking the civil aviation data in Fig. 2 as an example, the number of passengers is highly correlated with the number of baggage items; when predicting the door closing time, removing one of the two reduces both the amount of computation and the data dependence, and the resulting benefit outweighs the loss in precision. Therefore, in this embodiment, on the basis of the generated adjacency matrix and based on expert experience data, the elements corresponding to associations to be removed are set to 0 and the elements corresponding to associations to be added are set to 1. The resulting relational graph is shown in Fig. 2.
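A minimal sketch of such an expert adjustment is given below. The helper `adjust_edges` is hypothetical, and updating both (i, j) and (j, i) so that the matrix stays symmetric is an assumption of this sketch, not something the source specifies.

```python
def adjust_edges(adjacency, remove=(), add=()):
    """Return a copy of the adjacency matrix with expert-chosen edges removed or added."""
    adjusted = [row[:] for row in adjacency]
    for i, j in remove:
        adjusted[i][j] = adjusted[j][i] = 0   # assumption: keep the matrix symmetric
    for i, j in add:
        adjusted[i][j] = adjusted[j][i] = 1
    return adjusted


# Variable order: passengers (0), luggage (1), door closing time (2), task timeout rate (3).
statistical = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
# Drop the luggage / door-closing-time association, keeping passengers as the predictor.
expert = adjust_edges(statistical, remove=[(1, 2)])
```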
In summary, step 2) greatly reduces the number of input variables and the dependence on other variables in the data set compared with traditional interpolation methods, and provides stronger compatibility with data missing on a large scale and in large proportion.
3) Training the neural network model by taking the adjacent variables of each variable in the relational graph as input to obtain a missing value prediction model. The specific steps are as follows:
3.1) Initializing the model: setting the general parameters for training a neural network model, such as the number of layers, the number of neurons, the activation function, the learning rate, the loss function and the optimizer. In this embodiment, the model requires the input and output to have the same dimension, N.
In addition, the neuron weights need to be initialized before forward propagation starts. This process is the same as for a conventional feed-forward neural network and is not repeated here.
It should be noted that this embodiment adopts an improved neural network model. The improvement, made on the basis of a deep feed-forward network, concerns the timing and number of forward and backward propagation passes and the organization of the input and output; it places no limit on general parameters of the neural network such as the number of layers, the number of neurons or the activation function.
3.2) Forward propagation. In this embodiment the input tensor is $P_{M\times N}$, a matrix of dimension M × N, where M is the batch size, i.e. the number of data rows in the batch input, and N is the number of variables in the data set. It should be noted that the input tensors used during training are complete data rows from the complete data set.
Further, the adjacency vector of each variable in $P_{M\times N}$ is taken and its Hadamard product with $P_{M\times N}$ is computed row by row, generating N intermediate tensors of dimension M × N, one per target variable. The purpose is to generate intermediate tensors from which the non-adjacent variables and the target variable itself have been removed; the adjacent variables are the variables directly connected to the target variable in the relational graph.
The N intermediate tensors are used as model input for N rounds of forward propagation, generating N output tensors. No gradient back propagation or parameter update is performed immediately after each round of forward propagation; only the output is recorded.
Finally, one round of forward propagation is performed with $P_{M\times N}$ as input, in order to update the process parameters used for the subsequent gradient calculation.
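The masking step can be sketched in plain Python as follows. The function name and the sample batch are illustrative; the adjacency matrix is the one from Table 1.

```python
def masked_inputs(batch, adjacency):
    """Build N masked copies of an M x N batch, one per target variable.

    Copy j is the row-wise Hadamard product of the batch with adjacency vector j,
    so only the variables adjacent to target j keep their values.
    """
    return [
        [[value * flag for value, flag in zip(row, adjacency_vector)] for row in batch]
        for adjacency_vector in adjacency
    ]


adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
batch = [
    [0.2, 0.4, 0.6, 0.8],
    [0.1, 0.3, 0.5, 0.7],
]
intermediate = masked_inputs(batch, adjacency)
```

In this example `intermediate[0]` keeps only the luggage and door-closing-time columns, because those are the variables adjacent to the number of passengers.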
3.3) Calculating the loss, with the following specific steps:
3.3.1) In the j-th output tensor, all elements except those in the j-th column are set to zero, and the N output tensors are summed to obtain the final output tensor $O_{M\times N}$, where j is the forward propagation round in which the output tensor was generated. The purpose is to extract the valid column from each of the N output tensors to form the final output.
3.3.2) Calculating the deviation between $O_{M\times N}$ and $P_{M\times N}$ based on the loss function. It should be noted that $P_{M\times N}$ is the correct value of the output, so the deviation between $O_{M\times N}$ and $P_{M\times N}$ is the loss of the current model. The loss functions that can be used include but are not limited to general machine learning loss functions such as L1 norm loss, mean square error loss, cross entropy loss and KL divergence loss, which are not described in detail here.
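A small sketch of the column extraction and loss calculation follows. Mean square error is used only as one of the loss functions the text allows, and the function names and numbers are illustrative. Zeroing every column except column j of the j-th output and summing is equivalent to picking column j from output j, which is what the code does.

```python
def combine_outputs(outputs):
    """Form the final M x N output by taking column j from the j-th output tensor."""
    n = len(outputs)
    m = len(outputs[0])
    return [[outputs[j][row][j] for j in range(n)] for row in range(m)]


def mean_square_error(final, target):
    """Mean square error between two M x N matrices."""
    m, n = len(target), len(target[0])
    total = sum(
        (final[row][col] - target[row][col]) ** 2
        for row in range(m)
        for col in range(n)
    )
    return total / (m * n)


# Two output tensors (N = 2) for a batch of two rows (M = 2); 9 marks discarded columns.
outputs = [
    [[1, 9], [2, 9]],   # round 0: only column 0 is valid
    [[9, 3], [9, 4]],   # round 1: only column 1 is valid
]
final = combine_outputs(outputs)
loss = mean_square_error(final, [[1, 3], [2, 6]])
```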
4) Back propagation is performed based on the deviation between $O_{M\times N}$ and $P_{M\times N}$: the contribution of each neuron to the loss is calculated, and the weights are updated according to the gradients calculated by the back propagation algorithm. This process is the same as for a conventional feed-forward neural network and is not repeated here. It should be noted that, over the whole deep neural network training process, back propagation takes far more time than forward propagation, so the multiple rounds of forward propagation do not significantly increase the training time. The specific forward propagation and loss calculation process is shown in Fig. 3.
The forward propagation, loss calculation and back propagation of steps 3.2) to 4) are repeated until the network converges or the number of training iterations reaches the set value, completing the training of the missing value prediction model.
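The training loop can be illustrated with a deliberately simplified stand-in. The sketch below replaces the deep feed-forward network with a single linear layer, uses mean square error, and computes the gradients analytically from the masked passes; it omits the extra unmasked forward pass described in step 3.2). All names, data and hyperparameters are assumptions made for illustration, so this shows the masked-input, shared-model idea rather than the patented training procedure itself.

```python
def train_linear_imputer(batch, adjacency, epochs=200, lr=0.1):
    """Train one shared linear model so that column j is predicted from the
    variables adjacent to j (simplified stand-in for the neural network)."""
    n = len(adjacency)
    m = len(batch)
    weights = [[0.0] * n for _ in range(n)]   # weights[k][j]: input k -> output j
    bias = [0.0] * n
    history = []
    for _ in range(epochs):
        grad_w = [[0.0] * n for _ in range(n)]
        grad_b = [0.0] * n
        loss = 0.0
        for row in batch:
            for j in range(n):
                # Masked forward pass for target variable j.
                masked = [x * flag for x, flag in zip(row, adjacency[j])]
                out_j = sum(weights[k][j] * masked[k] for k in range(n)) + bias[j]
                err = out_j - row[j]          # the input row is also the target
                loss += err * err
                grad_b[j] += 2.0 * err
                for k in range(n):
                    grad_w[k][j] += 2.0 * err * masked[k]
        scale = 1.0 / (m * n)
        history.append(loss * scale)          # mean square error before this update
        for j in range(n):
            bias[j] -= lr * scale * grad_b[j]
            for k in range(n):
                weights[k][j] -= lr * scale * grad_w[k][j]
    return weights, bias, history


# Toy complete data set with two mutually adjacent variables (second = 2 x first).
batch = [[0.1, 0.2], [0.4, 0.8], [0.3, 0.6], [0.5, 1.0]]
adjacency = [[0, 1], [1, 0]]
weights, bias, history = train_linear_imputer(batch, adjacency)
```

Because only column j of the j-th masked pass enters the loss, each output column is trained solely from its adjacent variables, while all columns still share one set of parameters.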
In summary, with the model obtained through step 4), the same model is used to predict all missing variables in big data scenarios with many tables and many fields, which greatly reduces modeling time and system complexity.
5) Realizing intelligent interpolation of the missing data using the missing value prediction model, based on an interpolation order control strategy that considers the missing range and the missing correlation within the same row. The specific steps are as follows:
5.1) Initializing the row number m = 1 and screening the null values (see Fig. 4), with the following specific steps:
5.1.1) Performing sufficiency verification on all null values of the current data row. Sufficiency verification checks whether all adjacent variables corresponding to the current null value are non-null; if they are all non-null, the sufficiency verification is satisfied.
5.1.2) Filling the null values that satisfy the sufficiency verification, as follows:
5.1.2.1) Computing the Hadamard product of the current data row and the adjacency vector of the target null value to obtain a masked vector;
5.1.2.2) Taking the masked vector as input and computing a result with the missing value prediction model;
5.1.2.3) Extracting the column corresponding to the target null value from the computed result as the predicted value, which replaces the target null value.
5.1.3) Iterating until no null value satisfies the sufficiency verification.
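Step 5.1) can be sketched as the loop below. Null values are represented by `None`, and `stub_model` is a hypothetical stand-in for the trained missing value prediction model (it simply returns the sum of its inputs in every column) so that the example is self-contained.

```python
def fill_sufficient(row, adjacency, model):
    """Repeatedly fill the nulls (None) whose adjacent variables are all non-null."""
    row = list(row)
    filled_any = True
    while filled_any:
        filled_any = False
        for j in range(len(row)):
            if row[j] is not None:
                continue
            adjacency_vector = adjacency[j]
            # Sufficiency verification: every adjacent variable must be non-null.
            if any(flag == 1 and row[k] is None for k, flag in enumerate(adjacency_vector)):
                continue
            # Masking: keep adjacent variables, zero everything else.
            masked = [row[k] if flag == 1 else 0.0 for k, flag in enumerate(adjacency_vector)]
            row[j] = model(masked)[j]         # take the column of the target null value
            filled_any = True
    return row


def stub_model(masked):
    """Hypothetical placeholder for the trained model."""
    total = sum(masked)
    return [total] * len(masked)


adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
filled = fill_sufficient([None, 0.4, 0.6, None], adjacency, stub_model)
blocked = fill_sufficient([None, None, 0.6, 0.8], adjacency, stub_model)
```

In the second call the two nulls are adjacent to each other, so neither passes the sufficiency verification and both remain unfilled.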
5.2) Sorting the null values. When no null value satisfies the sufficiency verification, this does not mean that all null values have been filled: several null values may depend on one another and therefore cannot be filled by step 5.1). The null value sorting step is as follows:
5.2.1) Sorting the null values by their missing correlation R, which is expressed as follows:
Figure BDA0004089268140000141
In the above formula, $r_{ij}$ is the element in the i-th row and j-th column of the correlation matrix, $l_j$ is the j-th element of the adjacency vector, and $z_j$ is the j-th element of the missing state vector: if the j-th entry of the data row is null, then $z_j = 1$, otherwise $z_j = 0$.
5.2.2) Assignment and filling. The null values are filled in order of missing correlation R from low to high, following the filling procedure of step 5.1.2). The process ends when the row number m is greater than the total number of rows M; otherwise m is increased by 1 and the process returns to step 5.1.1).
It should be noted that, before filling, default values are used as model input in place of the null values in the adjacent variables. The default values include but are not limited to the median, mode or mean of the variable in the data set, and are not described further here.
In the early stage of assignment and filling within a row, the input variables contain more default values, but their influence is small because the missing correlation R is low. In the later stage, the input variables contain fewer default values while the missing correlation is higher, so that overall the credibility of the interpolation data is improved as far as possible.
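The ordering by missing correlation can be sketched as follows. The exact formula is given only as an image in the source, so the ratio used here, the sum of |r_ij|·l_j·z_j divided by the sum of |r_ij|·l_j, is an assumption inferred from the verbal definition (the share of correlation contributed by missing adjacent variables); the correlation values are likewise hypothetical.

```python
def missing_correlation(i, corr, adjacency, row):
    """Assumed form of R for null i: share of adjacent correlation that is missing."""
    total = 0.0
    missing = 0.0
    for j, value in enumerate(row):
        weight = abs(corr[i][j]) * adjacency[i][j]   # |r_ij| * l_j
        total += weight
        if value is None:                            # z_j = 1
            missing += weight
    return missing / total if total else 0.0


def fill_order(corr, adjacency, row):
    """Indices of the null values, sorted from low to high missing correlation."""
    nulls = [i for i, value in enumerate(row) if value is None]
    return sorted(nulls, key=lambda i: missing_correlation(i, corr, adjacency, row))


corr = [
    [1.0, 0.9, 0.7, 0.1],
    [0.9, 1.0, 0.6, 0.2],
    [0.7, 0.6, 1.0, -0.8],
    [0.1, 0.2, -0.8, 1.0],
]
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
row = [None, None, 0.6, 0.8]
order = fill_order(corr, adjacency, row)
r0 = missing_correlation(0, corr, adjacency, row)
r1 = missing_correlation(1, corr, adjacency, row)
```

Under these assumptions variable 0 is filled first, because a smaller share of its adjacent correlation comes from missing values.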
Further, each time a null value is interpolated, the credibility of the interpolation data is calculated to form a credibility comparison table. The interpolation data consists of original data and interpolated values, and the credibility of an interpolated value is calculated as follows:
Figure BDA0004089268140000142
In the above formula, ε is a harmonic coefficient, η is the model loss coefficient, which represents the loss of credibility caused by model prediction, and $\lambda_j$ is the credibility of the j-th variable of the current data row: if the value is original data, $\lambda_j = 1$; if it is a default value, $\lambda_j = \mu$, where μ is the default loss coefficient representing the loss of credibility caused by using a default value as input; if it is a value generated by interpolation in a preceding step, $\lambda_j$ is the value calculated by this formula in that step. ε, η and μ are constants, which can be preset or left at their default values.
In summary, through step 5) the invention adjusts the interpolation order to preserve the authenticity of the data as far as possible, and provides an important credibility reference for subsequent work.
6) Since the interpolation data and the original data are still in the encoded state, this embodiment also decodes and restores the variables according to the processing methods and parameters stored in step 1.2), finally forming the new data set after interpolation.
Example 2
To further improve the prediction accuracy of the model, this embodiment differs from Embodiment 1 in that the missing value prediction model obtained in step 3) is treated as a pre-training model and trained a second time. The pre-training model is first used to perform intelligent interpolation of the missing data;
the average reliability of the interpolated data in each row is then calculated, and the data rows whose average reliability exceeds a preset threshold are used as new input for secondary training of the pre-training model, yielding the final missing value prediction model, with which prediction and interpolation are performed again.
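The row selection for secondary training can be sketched as follows. The sketch assumes a reliability table with the same shape as the interpolated data set (one reliability value per cell); the function name, the sample values, and the threshold of 0.8 are illustrative assumptions only.

```python
import numpy as np

def select_rows_for_retraining(data, reliability, threshold):
    """Keep rows whose mean per-cell reliability exceeds `threshold`."""
    reliability = np.asarray(reliability, dtype=float)
    row_mean = reliability.mean(axis=1)
    keep = row_mean > threshold
    return np.asarray(data)[keep], keep

imputed = np.array([[0.2, 0.4, 0.6],
                    [0.1, 0.9, 0.5],
                    [0.7, 0.3, 0.8]])
# Original cells carry reliability 1; interpolated cells carry lower values.
reliability = np.array([[1.0, 1.0, 0.9],
                        [1.0, 0.5, 0.4],
                        [1.0, 1.0, 1.0]])
rows, keep = select_rows_for_retraining(imputed, reliability, threshold=0.8)
```

The selected `rows` would then be fed back to the pre-training model as additional training input.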
The scheme of this embodiment further improves data utilization by combining pre-training with secondary training. It adapts better when data is missing on a large scale and few complete data rows are available, which improves the model's tolerance of large-scale missingness and can further improve prediction accuracy compared with the scheme of Embodiment 1.
Example 3
This embodiment of the invention provides an intelligent interpolation system for missing data based on a relational graph, applied to the method described in Embodiment 1 or Embodiment 2, and comprising:
a processing module for generating a variable data set and performing feature numericalization and numerical normalization preprocessing;
a relational graph construction module for building a variable relational graph based on the correlation coefficients between variables;
a training module for training a neural network model with the adjacent variables of each variable in the relational graph as input to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to the target variable in the relational graph;
an interpolation module for performing intelligent interpolation of missing data with the missing value prediction model, based on an interpolation order control strategy that considers the missing range and the missing correlation within the same row;
and a decoding module for decoding and restoring the variables.
The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (10)

1. An intelligent interpolation method for missing data based on a relational graph, characterized by comprising the following steps:
generating a variable data set, and performing feature numericalization and numerical normalization preprocessing;
building a variable relational graph based on the correlation coefficients between variables;
training a neural network model with the adjacent variables of each variable in the relational graph as input to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to a target variable in the relational graph;
performing intelligent interpolation of missing data with the missing value prediction model, based on an interpolation order control strategy that considers the missing range and the missing correlation within the same row;
and decoding and restoring the variables.
2. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein building the variable relational graph based on the correlation coefficients between variables comprises:
calculating the correlation matrix among all variables, and binarizing the correlation matrix;
and setting the diagonal elements of the binarized correlation matrix to 0 to obtain an adjacency matrix, and constructing the relational graph based on the adjacency matrix.
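As an illustrative sketch only (not part of the claim), the construction of the adjacency matrix could be written as below. The claim does not state how the correlation matrix is computed or binarized, so the Pearson correlation, the absolute-value threshold of 0.5, and the name `build_adjacency` are all assumptions.

```python
import numpy as np

def build_adjacency(data, threshold=0.5):
    """Binarize a correlation matrix and zero its diagonal.

    Assumptions: Pearson correlation over columns, and an edge wherever
    |r| >= threshold. The threshold value is illustrative.
    """
    corr = np.corrcoef(data, rowvar=False)
    adjacency = (np.abs(corr) >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # a variable is not its own neighbor
    return corr, adjacency

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Variable 1 is nearly a multiple of variable 0; variable 2 is independent.
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200), rng.normal(size=200)])
corr, adjacency = build_adjacency(data, threshold=0.5)
```

In this toy data set only variables 0 and 1 end up connected, so each serves as the other's adjacent variable.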
3. The intelligent interpolation method for missing data based on a relational graph according to claim 2, wherein building the variable relational graph based on the correlation coefficients between variables further comprises:
after the adjacency matrix is obtained, optimizing and adjusting the adjacency matrix based on expert experience data.
4. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein training the neural network model with the adjacent variables of each variable in the relational graph as input to obtain the missing value prediction model comprises:
taking the adjacency vector of each variable and computing its Hadamard product with the input tensor row by row, generating N intermediate tensors of the same dimensions, where N is the number of variables in the input tensor;
performing N rounds of forward propagation with the intermediate tensors as model input to generate N output tensors, wherein no parameter update is performed after each round of forward propagation;
performing one round of forward propagation with the input tensor as input, and updating the process parameters;
setting all elements of each output tensor other than its j-th column to zero, and summing the N output tensors to obtain a final output tensor, where j is the forward propagation round in which that output tensor was produced;
and performing back propagation based on the deviation between the final output tensor and the input tensor, repeating until the network converges or the number of training iterations reaches a set value, thereby completing training of the missing value prediction model.
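As an illustrative sketch only (not part of the claim), the masked forward passes and the column-wise assembly of the final output tensor could look like the code below. A single linear map stands in for the neural network, mean squared error stands in for the "deviation", and the unmasked forward pass and the back propagation step are omitted; all of these are simplifying assumptions.

```python
import numpy as np

def masked_forward(model, x, adjacency):
    """Assemble the final output tensor from N masked forward passes.

    x: (batch, N) input tensor; adjacency[j] is the adjacency vector of variable j.
    """
    n = x.shape[1]
    final = np.zeros_like(x)
    for j in range(n):
        intermediate = x * adjacency[j]  # Hadamard product applied to every row
        out = model(intermediate)        # forward pass j, no parameter update
        final[:, j] = out[:, j]          # same as zeroing other columns, then summing
    return final

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))
model = lambda t: t @ w                  # stand-in for the neural network
x = rng.normal(size=(4, 3))
adjacency = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
final = masked_forward(model, x, adjacency)
loss = np.mean((final - x) ** 2)         # assumed deviation measure
```

Because the adjacency vector of variable j has a zero in position j, column j of the final tensor never sees the variable's own value, which is what lets the model learn to predict each variable from its neighbors.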
5. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein the interpolation order control strategy considering the missing range within the same row is:
performing sufficiency verification on all null values of the current data row, and filling the null values that meet the sufficiency verification requirement;
and iterating until no remaining null value meets the sufficiency verification requirement.
6. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein performing intelligent interpolation of missing data with the missing value prediction model comprises:
computing the Hadamard product of the current data row and the adjacency vector of the target null value to obtain a masked vector;
taking the masked vector as input, and computing a result with the missing value prediction model;
and extracting the column of the result corresponding to the target null value as the predicted value to replace the target null value.
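As an illustrative sketch only (not part of the claim), filling a single null value could be written as follows. It assumes nulls are stored as NaN and that any nulls among the adjacent variables have already been replaced with defaults; the linear stand-in model and the name `impute_one` are hypothetical.

```python
import numpy as np

def impute_one(model, row, target, adjacency):
    """Predict and fill the null at index `target` of one data row."""
    row = np.asarray(row, dtype=float)
    # Equivalent to the Hadamard product for finite entries; np.where also
    # keeps the NaN target from propagating (NaN * 0 would still be NaN).
    masked = np.where(adjacency[target] == 1, row, 0.0)
    prediction = model(masked[np.newaxis, :])[0]
    filled = row.copy()
    filled[target] = prediction[target]  # keep only the target column
    return filled

adjacency = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
w = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.25, 0.0, 0.0]])
model = lambda t: t @ w                  # stand-in for the trained model
filled = impute_one(model, [np.nan, 2.0, 4.0], target=0, adjacency=adjacency)
```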
7. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein the interpolation order control strategy considering the missing correlation within the same row is:
sorting the null values by their missing correlation, wherein the missing correlation is the share of correlation contributed by missing values among all adjacent variables of the current null value, expressed as:
Figure FDA0004089268130000031
in the above formula, r_ij is the element in row i, column j of the correlation matrix; l_j is the j-th element of the adjacency vector L; z_j is the j-th element of the missing state vector Z, with z_j = 1 if the j-th value of the data row is null and z_j = 0 otherwise;
and filling in order of missing correlation from low to high, with default values replacing null values among the adjacent variables as input.
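As an illustrative sketch only (not part of the claim), the ordering by missing correlation could be computed as below. The formula image is not reproduced in this text, so the sketch assumes one reading consistent with the symbol definitions: R_i = Σ_j |r_ij| l_j z_j / Σ_j |r_ij| l_j. The absolute value, the zero-denominator fallback, and the function names are assumptions.

```python
import numpy as np

def missing_correlation(corr, adjacency, row, i):
    """Assumed form: share of neighbor correlation that belongs to missing neighbors."""
    z = np.isnan(row).astype(float)           # missing state vector Z
    weights = np.abs(corr[i]) * adjacency[i]  # |r_ij| * l_j
    total = weights.sum()
    return float((weights * z).sum() / total) if total > 0 else 0.0

def fill_order(corr, adjacency, row):
    """Indices of null values, sorted from low to high missing correlation."""
    nulls = np.flatnonzero(np.isnan(row)).tolist()
    return sorted(nulls, key=lambda i: missing_correlation(corr, adjacency, row, i))

corr = np.array([[1.0, 0.8, 0.6],
                 [0.8, 1.0, 0.1],
                 [0.6, 0.1, 1.0]])
adjacency = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
row = np.array([np.nan, np.nan, 3.0])
order = fill_order(corr, adjacency, row)
r0 = missing_correlation(corr, adjacency, row, 0)
```

In this example variable 0 still has one observed neighbor, so it is filled before variable 1, whose only neighbor is itself missing.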
8. The intelligent interpolation method for missing data based on a relational graph according to claim 7, further comprising, after the filling is performed:
calculating the reliability of the interpolated data to build a reliability comparison table, wherein the interpolated data consists of original data and interpolated values, and the reliability of an interpolated value is calculated as:
Figure FDA0004089268130000041
in the above formula, ε is a harmonic coefficient, η is a model loss coefficient representing the reliability loss caused by model prediction, and λ_j is the reliability of the j-th variable of the current data row.
9. The intelligent interpolation method for missing data based on a relational graph according to claim 1, wherein training the neural network model with the adjacent variables of each variable in the relational graph as input to obtain the missing value prediction model further comprises:
taking the missing value prediction model as a pre-training model, and using the pre-training model to perform intelligent interpolation of missing data;
and calculating the average reliability of the interpolated data in each row, and using the data rows whose average reliability exceeds a preset threshold as new input for secondary training of the pre-training model to obtain a final missing value prediction model.
10. An intelligent interpolation system for missing data based on a relational graph, applied to the method according to any one of claims 1 to 9, characterized by comprising:
a processing module for generating a variable data set and performing feature numericalization and numerical normalization preprocessing;
a relational graph construction module for building a variable relational graph based on the correlation coefficients between variables;
a training module for training a neural network model with the adjacent variables of each variable in the relational graph as input to obtain a missing value prediction model, wherein the adjacent variables are the variables directly connected to the target variable in the relational graph;
an interpolation module for performing intelligent interpolation of missing data with the missing value prediction model, based on an interpolation order control strategy that considers the missing range and the missing correlation within the same row;
and a decoding module for decoding and restoring the variables.
CN202310146169.7A 2023-02-21 2023-02-21 Intelligent interpolation method and system for missing data based on relational graph Pending CN116303386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310146169.7A CN116303386A (en) 2023-02-21 2023-02-21 Intelligent interpolation method and system for missing data based on relational graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310146169.7A CN116303386A (en) 2023-02-21 2023-02-21 Intelligent interpolation method and system for missing data based on relational graph

Publications (1)

Publication Number Publication Date
CN116303386A true CN116303386A (en) 2023-06-23

Family

ID=86837083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310146169.7A Pending CN116303386A (en) 2023-02-21 2023-02-21 Intelligent interpolation method and system for missing data based on relational graph

Country Status (1)

Country Link
CN (1) CN116303386A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437086A (en) * 2023-12-20 2024-01-23 中国电建集团贵阳勘测设计研究院有限公司 Deep learning-based solar resource missing measurement data interpolation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination