CN115600105A - Water body missing data interpolation method and device based on MIC-LSTM - Google Patents

Info

Publication number
CN115600105A
Authority
CN
China
Prior art keywords
data
mic
lstm
water body
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211160686.1A
Other languages
Chinese (zh)
Inventor
董方敏
周家伟
轩小静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN202211160686.1A priority Critical patent/CN115600105A/en
Publication of CN115600105A publication Critical patent/CN115600105A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a MIC-LSTM-based method and device for interpolating missing water body data. The method comprises the following steps: constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body; performing correlation analysis on the data in each data set by using the MIC algorithm to obtain the data in each data set whose correlation is greater than a preset threshold; training, verifying and testing the LSTM model based on the related data set to obtain the LSTM model that has passed the test, the related data set being the set of data in all the data sets whose correlation is greater than the preset threshold; and predicting the missing watershed water body data with the LSTM model that has passed the test to obtain the interpolated values corresponding to the missing data. Because the LSTM model is built on the strongly correlated data screened by the MIC algorithm and the final LSTM model is used for data interpolation, the reliability of the interpolation result is improved.

Description

Water body missing data interpolation method and device based on MIC-LSTM
Technical Field
The invention relates to the technical field of data processing, in particular to a water body missing data interpolation method and device based on MIC-LSTM.
Background
Missing data is a fundamental problem in water quality prediction, because data collection usually leaves a large amount of data missing. A reasonable water body data model therefore needs to be established and used to interpolate the missing part of the water body data, so that the proportion of missing data is reduced.
Existing interpolation work on missing water body data mostly uses traditional data interpolation models or deep learning models. However, the original data are put directly into the model for training, and the predictions of the trained model are then used as interpolated values, so the reliability of the interpolation result is not strong.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a water body missing data interpolation method and device based on MIC-LSTM.
The invention provides a water missing data interpolation method based on MIC-LSTM, which comprises the following steps:
constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body;
performing correlation analysis on the data in each data set by using an MIC algorithm to obtain data of which the correlation is greater than a preset threshold value in each data set;
training, verifying and testing the LSTM model based on the relevant data set to obtain the LSTM model after the test is passed; the related data sets are sets of data with the correlation larger than a preset threshold value in all the data sets;
predicting watershed water body missing data based on the LSTM model after the test is passed, and obtaining an interpolation value corresponding to the watershed water body missing data.
Optionally, the performing, by using the MIC algorithm, correlation analysis on the data in each data set to obtain data with strong correlation in each data set includes:
and respectively carrying out correlation analysis on the same type of effect quantities in each data set and on the effect quantities and the environmental quantities by using the MIC (Maximum Information Coefficient) algorithm, and acquiring data of which the correlation between the same type of effect quantities in each data set is greater than a preset threshold value and data of which the correlation between the effect quantities and the environmental quantities in each data set is greater than a preset threshold value.
Optionally, the training, verifying, and testing the LSTM model based on the relevant data set to obtain the LSTM model after the test is passed includes:
determining a training set and a validation set based on a continuous portion of data in the correlated data set;
determining a test set based on the data missing part in the related data set;
and respectively and sequentially training, verifying and testing the LSTM model by utilizing the training set, the verifying set and the testing set to obtain the LSTM model after the test is passed.
Optionally, before performing correlation analysis on the data in each data set by using the MIC algorithm, the method further includes:
the data in each data set is normalized.
Optionally, there are 2 hidden layers between the input layer and the output layer of the LSTM model; wherein, the number of neurons in the first layer hidden layer is 64, and the number of neurons in the second layer hidden layer is 32.
The invention also provides a water body missing data interpolation device based on MIC-LSTM, which comprises:
the construction module is used for constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body;
the first acquisition module is used for carrying out correlation analysis on the data in each data set by utilizing an MIC algorithm and acquiring the data with the correlation larger than a preset threshold value in each data set;
the second acquisition module is used for training, verifying and testing the LSTM model based on the relevant data set to acquire the LSTM model after the test is passed; the related data sets are sets of data with the correlation larger than a preset threshold value in all the data sets;
and the third acquisition module is used for predicting the watershed water body missing data based on the LSTM model after the test is passed, and acquiring an interpolation value corresponding to the watershed water body missing data.
Optionally, the first obtaining module is specifically configured to:
and respectively carrying out correlation analysis on the same type of effect quantities in each data set and the effect quantities and the environment quantities by using an MIC algorithm to obtain data of which the correlation between the same type of effect quantities in each data set is greater than a preset threshold value and data of which the correlation between the effect quantities and the environment quantities in each data set is greater than a preset threshold value.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the MIC-LSTM-based water body missing data interpolation method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a MIC-LSTM based water missing data interpolation method as claimed in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a MIC-LSTM-based water missing data interpolation method as described in any one of the above.
According to the water body missing data interpolation method and device based on MIC-LSTM, data sets with different missing types and different missing degrees are constructed, the strongly correlated data in each data set are screened out by the MIC algorithm, the LSTM model is trained, verified and tested based on the strongly correlated data, and the LSTM model that has passed the test is used for data interpolation, so that the reliability of the interpolation results is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a MIC-LSTM-based water missing data interpolation method according to an embodiment of the present invention;
FIG. 2 is a second schematic flowchart of a MIC-LSTM-based water missing data interpolation method according to an embodiment of the present invention;
FIG. 3 is a diagram of the internal elements of an LSTM neural network provided by an embodiment of the present invention;
FIG. 4 is a diagram of an LSTM model structure provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a MIC-LSTM-based water missing data interpolation device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art based on the embodiments of the present invention without inventive step, are within the scope of the present invention.
Fig. 1 is a schematic flow chart of a MIC-LSTM-based water missing data interpolation method according to an embodiment of the present invention, and as shown in fig. 1, the present invention provides a MIC-LSTM-based water missing data interpolation method, which includes:
Step 101, constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body.
Specifically, fig. 2 is a second schematic flow chart of the MIC-LSTM-based water body missing data interpolation method provided in the embodiment of the present invention. As shown in fig. 2, an actual watershed water body is monitored to obtain the original watershed water body monitoring data. The original monitoring data are then divided according to different missing types and different missing degrees to construct a plurality of data sets.
The missing types are divided according to the way in which data are missing, into continuous missing and scattered missing. Continuous missing means that the monitored water body data are incomplete within a monitoring time period, with one or more kinds of data, or even all data, missing. Scattered missing means that, within the monitoring time period, the water body data monitored at some moments are complete while the water body data monitored at other moments are incomplete.
The missing degrees may be divided based on the number of kinds of missing data.
For example, the missing degrees may be: one kind of monitoring data missing within a preset time, several kinds of monitoring data missing within a preset time, and all monitoring data missing within a preset time.
The missing degrees may also be divided based on the amount of missing data, or based on the proportion of the missing data in the total amount of data.
For example, the missing degrees may be 10%, 30%, 50%, 70%, 90%, etc.
For example, table 1 lists the data sets constructed with different missing types and different missing degrees. In table 1, the missing types are continuous missing and scattered missing, and the missing degrees are one kind of data missing, m kinds of data missing (m is a positive integer greater than or equal to 2 and less than n, and n is the total number of data kinds in the water body data), and n-1 kinds of data missing, so that 6 different data sets are obtained.
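Table 1 Data sets constructed with different missing types and different missing degrees
[Table 1 appears only as an image in the original publication: 2 missing types (continuous, scattered) by 3 missing degrees (one kind, m kinds, n-1 kinds of data missing), giving 6 data sets.]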
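The construction in step 101 can be sketched as follows. This is a minimal illustration, not the patented procedure: the column names, the 30% ratio, the position of the continuous gap, the use of `None` for a missing value, and the helper names `mask_continuous`, `mask_scattered` and `build_data_sets` are all assumptions made for the example.

```python
import random

def mask_continuous(series, start, length):
    """Return a copy with one contiguous run of missing values (None)."""
    out = list(series)
    for i in range(start, min(start + length, len(out))):
        out[i] = None
    return out

def mask_scattered(series, ratio, seed=0):
    """Return a copy with round(ratio * len) missing values at random positions."""
    out = list(series)
    rng = random.Random(seed)
    for i in rng.sample(range(len(out)), round(ratio * len(out))):
        out[i] = None
    return out

def build_data_sets(table, degrees, ratio=0.3, seed=0):
    """table: dict column -> list of values (time order).
    degrees: dict label -> list of columns to mask (the missing degree).
    Returns a dict keyed by (missing_type, degree_label)."""
    n_rows = len(next(iter(table.values())))
    run = round(ratio * n_rows)
    data_sets = {}
    for label, cols in degrees.items():
        cont = {c: list(v) for c, v in table.items()}
        scat = {c: list(v) for c, v in table.items()}
        for j, c in enumerate(cols):
            # Assumed gap position: starts one quarter of the way in.
            cont[c] = mask_continuous(table[c], start=n_rows // 4, length=run)
            scat[c] = mask_scattered(table[c], ratio, seed=seed + j)
        data_sets[("continuous", label)] = cont
        data_sets[("scattered", label)] = scat
    return data_sets

# Hypothetical monitoring data with n = 4 kinds of data.
table = {
    "pH":   [7.1, 7.2, 7.0, 7.3, 7.4, 7.2, 7.1, 7.0, 7.2, 7.3],
    "DO":   [8.0, 8.1, 7.9, 7.8, 8.2, 8.3, 8.1, 8.0, 7.9, 8.0],
    "NH3N": [0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 0.3, 0.3, 0.2, 0.4],
    "temp": [20.1, 20.3, 20.2, 20.0, 19.8, 19.9, 20.4, 20.5, 20.2, 20.1],
}
degrees = {
    "one": ["pH"],
    "m": ["pH", "DO"],            # m = 2
    "n-1": ["pH", "DO", "NH3N"],  # n - 1 = 3
}
data_sets = build_data_sets(table, degrees, ratio=0.3)
```

Two missing types crossed with three missing degrees yield the 6 data sets described for table 1; keeping the unmasked `table` around means the true values behind each artificial gap remain available for later evaluation.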
Step 102, carrying out correlation analysis on the data in each data set by using the MIC algorithm to obtain the data with the correlation larger than a preset threshold value in each data set.
Specifically, each data set contains several kinds of data, and the amount of data for each kind is large. However, not every kind of data is helpful for model training, so the Maximum Information Coefficient (MIC) is used to perform correlation analysis on the data in each data set and to screen out the data in each data set whose correlation is greater than a preset threshold.
The preset threshold value can be set according to actual conditions and actual requirements. The multiple data sets may correspond to the same or different preset thresholds.
For example, data set 1 corresponds to preset threshold 1, data set 2 corresponds to preset threshold 2, and data set 3 corresponds to preset threshold 3.
For another example, data set 1, data set 2, and data set 3 all correspond to preset threshold 1.
One data set may also correspond to a plurality of different preset thresholds, and the correlations between different kinds of data in one data set are compared with the different preset thresholds.
For example, a type a data, a type B data, and a type C data are collected in one data set, the correlation between the type a data and the type B data is compared with a preset threshold a, the correlation between the type a data and the type C data is compared with a preset threshold B, and the correlation between the type B data and the type C data is compared with a preset threshold C.
Optionally, before performing correlation analysis on the data in each data set by using the MIC algorithm, the method further includes:
the data in each data set is normalized.
Specifically, the dimensions of different kinds of data are different, and the difference of data values is large, so that in order to eliminate adverse effects on model prediction effects caused by different dimensions of data, normalization processing needs to be performed on the data in each data set before correlation analysis is performed on the data.
The expression of the normalized data is as follows:
$$X_i = \frac{x_i - X_{\min}}{X_{\max} - X_{\min}}$$
in the formula, $X_i$ represents the normalized value of data $x_i$, $x_i$ represents the feature vector at time index $i$, $X_{\min}$ represents the minimum value in the data set, and $X_{\max}$ represents the maximum value in the data set.
According to the MIC-LSTM-based water body missing data interpolation method, the data in the data set are subjected to normalization processing, so that adverse effects on model prediction effects due to different data dimensions are avoided, and the reliability of interpolation results is improved.
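The min-max normalization described above can be sketched in a few lines. The treatment of missing entries (left as `None`) and of a constant column (mapped to 0.0) is an assumption of this example; the source does not specify either case, and `denormalize` is a hypothetical helper for mapping model outputs back to physical units.

```python
def min_max_normalize(values):
    """Scale observed values to [0, 1]; missing entries (None) stay None.
    Returns (scaled_values, x_min, x_max)."""
    observed = [v for v in values if v is not None]
    x_min, x_max = min(observed), max(observed)
    if x_max == x_min:
        # Assumption: a constant series carries no scale, so map it to 0.0.
        scaled = [None if v is None else 0.0 for v in values]
    else:
        scaled = [None if v is None else (v - x_min) / (x_max - x_min)
                  for v in values]
    return scaled, x_min, x_max

def denormalize(scaled_value, x_min, x_max):
    """Invert the scaling, e.g. to express a predicted value in original units."""
    return scaled_value * (x_max - x_min) + x_min

scaled, lo, hi = min_max_normalize([2.0, 4.0, None, 6.0])
```

Returning `x_min` and `x_max` alongside the scaled series is a design choice for this sketch: the same pair is needed later to convert an interpolated value back to its original dimension.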
Optionally, the performing, by using the MIC algorithm, correlation analysis on the data in each data set to obtain data with strong correlation in each data set includes:
and respectively carrying out correlation analysis on the same type of effect quantities in each data set and the effect quantities and the environment quantities by using an MIC algorithm to obtain data of which the correlation between the same type of effect quantities in each data set is greater than a preset threshold value and data of which the correlation between the effect quantities and the environment quantities in each data set is greater than a preset threshold value.
Specifically, the data in each data set may be divided into water body data and environment data. The data to be interpolated in the water body data is the effect quantity, the remaining data in the water body data are effect quantities of the same type, and the environment data are the environmental quantities.
For example, if the data to be interpolated is the pH value, the effect quantity is the pH value, the data other than the pH value in the water body data are effect quantities of the same type, and the temperature, humidity, and the like around the water body are environmental quantities.
After the normalization processing is finished, analyzing the correlation between the same type of effect quantities in each data set by using an MIC algorithm, and analyzing the correlation between the effect quantities in each data set and environmental quantities by using the MIC algorithm.
The MIC algorithm was proposed by Reshef et al. and is developed on the basis of mutual information, which can be defined as:
$$I(x, y) = E(y) - E(y \mid x)$$
in the formula, $I(x, y)$ represents the mutual information between x and y, $E(y)$ is the entropy of y and represents the complexity of variable y, and $E(y \mid x)$ is the conditional entropy of y given x.
Mutual information between time series X and Y can be expressed as:
$$I(X, Y) = \iint p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy$$
in the formula, $I(X, Y)$ represents the mutual information between time series X and Y, $p(x, y)$ is the joint probability density function of time series X and Y, $p(x)$ is the marginal probability density function of time series X, and $p(y)$ is the marginal probability density function of time series Y.
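The two definitions of mutual information given above agree, which can be checked numerically on a small discrete joint distribution. The sketch below replaces the integral with a sum over grid cells and uses base-2 logarithms; both choices, the joint table values, and the helper names are assumptions of this example.

```python
from math import log2

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution given as a list."""
    return -sum(q * log2(q) for q in p if q > 0)

def conditional_entropy(joint):
    """E(y|x) = sum over x of p(x) * entropy of y given that x."""
    total = 0.0
    for row in joint:
        px = sum(row)
        if px > 0:
            total += px * entropy([q / px for q in row])
    return total

def mutual_information(joint):
    """joint[i][j] = p(x=i, y=j); returns I(X, Y) in bits."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical joint distribution of two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_y = [sum(col) for col in zip(*joint)]
mi_direct = mutual_information(joint)
mi_entropy = entropy(p_y) - conditional_entropy(joint)
```

Here `mi_direct` follows the joint-density form and `mi_entropy` follows I(x, y) = E(y) - E(y|x); the two values coincide up to floating-point rounding, and both vanish when the joint table factorizes into its marginals.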
The MIC algorithm proceeds as follows. The two variables are discretized in a two-dimensional space and represented as a scatter plot. The x direction and the y direction of the two-dimensional space are partitioned into a grid of small cells, the probability of the points falling into each cell is calculated, and the joint probability density distribution is estimated from these probabilities. Changing the number and position of the grid divisions gives different mutual information values; the maximum of these values is selected by comparison and is then normalized.
The maximum information coefficient is calculated by the formula:
$$\mathrm{MIC}(a, b) = \max_{a \times b < B} \frac{I(a, b)}{\log_2 \min(a, b)}$$
in the formula, $\mathrm{MIC}(a, b)$ represents the maximum information coefficient between variable a and variable b, $I(a, b)$ represents the mutual information between a and b, B is a variable that depends on the sample size, N is the number of samples, and the effect is optimal when $B = N^{0.6}$.
The larger the maximum information coefficient of two variables, the stronger their correlation; MIC values lie in the range [0, 1].
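A simplified version of the grid search can be sketched as follows. It is not the full MIC algorithm: the published method also optimizes the positions of the grid lines, whereas this sketch only varies the number of divisions and always uses equal-frequency partitions, so it can underestimate the true MIC. The function names and the default exponent argument are assumptions of the example.

```python
from math import log2

def equal_frequency_bins(values, k):
    """Assign each value a bin index in 0..k-1 with roughly equal counts per bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def grid_mutual_information(bx, by, a, b):
    """Mutual information (bits) of the a-by-b grid induced by bin labels bx, by."""
    n = len(bx)
    counts = [[0] * b for _ in range(a)]
    for i, j in zip(bx, by):
        counts[i][j] += 1
    row = [sum(r) for r in counts]
    col = [sum(c) for c in zip(*counts)]
    mi = 0.0
    for i in range(a):
        for j in range(b):
            c = counts[i][j]
            if c:
                mi += c / n * log2(c * n / (row[i] * col[j]))
    return mi

def mic_approx(x, y, exponent=0.6):
    """Maximum normalized grid mutual information over grids with a * b < N**exponent."""
    budget = len(x) ** exponent
    best = 0.0
    a = 2
    while a * 2 < budget:
        bx = equal_frequency_bins(x, a)
        b = 2
        while a * b < budget:
            by = equal_frequency_bins(y, b)
            score = grid_mutual_information(bx, by, a, b) / log2(min(a, b))
            best = max(best, score)
            b += 1
        a += 1
    return min(best, 1.0)  # guard against rounding slightly above 1

x = list(range(100))
y = [v * v for v in x]  # nonlinear but deterministic relationship
score = mic_approx(x, y)
```

For the monotone example above, the 2-by-2 equal-frequency grid already separates the points perfectly, so the sketch reports the maximum value of 1 even though the relationship is not linear.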
And analyzing the correlation among the same type of effect quantities in each data set by using an MIC algorithm, acquiring the correlation (namely MIC value) among the same type of effect quantities, comparing the MIC value among the same type of effect quantities with a preset threshold value, and screening out the same type of effect quantities of which the MIC values are greater than the preset threshold value.
And analyzing the correlation between the effect quantity and the environmental quantity in each data set by using an MIC algorithm, acquiring the correlation (namely MIC value) between the effect quantity and the environmental quantity, comparing the MIC value between the effect quantity and the environmental quantity with a preset threshold value, and screening out the effect quantity and the environmental quantity of which the MIC value is greater than the preset threshold value.
According to the MIC-LSTM-based water body missing data interpolation method provided by the embodiment of the invention, correlation analysis is carried out between the same type of effect quantities in the watershed water body monitoring data and between the effect quantities and the environment quantity by using an MIC algorithm, so that data with strong correlation is screened out, the usability of the data is improved, and the utilization efficiency of the model is favorably improved.
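The screening step can be sketched as a comparison of each MIC value with its threshold. The MIC values, variable names, default threshold of 0.5 and the per-variable override below are invented for illustration; the source leaves the thresholds to be set according to actual requirements.

```python
def select_related(mic_to_target, thresholds=None, default_threshold=0.5):
    """Keep the variables whose MIC with the effect quantity exceeds their threshold.
    thresholds: optional dict of per-variable thresholds overriding the default."""
    thresholds = thresholds or {}
    return [
        name
        for name, score in mic_to_target.items()
        if score > thresholds.get(name, default_threshold)
    ]

# Hypothetical MIC values between the effect quantity (e.g. pH) and other series.
mic_to_ph = {
    "DO": 0.82,          # same-type effect quantity
    "turbidity": 0.35,   # same-type effect quantity
    "water_temp": 0.61,  # environmental quantity
    "humidity": 0.18,    # environmental quantity
}
selected_default = select_related(mic_to_ph)
selected_custom = select_related(mic_to_ph, thresholds={"humidity": 0.1})
```

With a single shared threshold only `DO` and `water_temp` are kept; giving `humidity` its own lower threshold also admits it, which mirrors the option of using different preset thresholds for different pairs of data.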
Step 103, training, verifying and testing the LSTM model based on the related data set to obtain the LSTM model after the test is passed; the related data set is the set of data with the correlation larger than a preset threshold value in all data sets.
Specifically, the watershed water body is a dynamic, nonlinear, unstable and noisy system, and its data share these characteristics. Predicting such nonlinear data with a linear model is not accurate enough, so the deep learning model Long Short-Term Memory (LSTM) is introduced to analyze the nonlinear water body data.
The data whose correlation is larger than the preset threshold value in all the data sets are combined to form the related data set.
A training set, a verification set and a test set are selected from the related data set, wherein the training set is used for parameter fitting training of the LSTM model, the verification set is used for optimizing the parameters of the LSTM model, and the test set is used for evaluating the generalization capability of the LSTM model after verification passes. After the LSTM model is trained, verified and tested, the LSTM model that has passed the test is obtained.
Fig. 3 is a structural diagram of an internal unit of the LSTM neural network according to the embodiment of the present invention, which is divided into three parts, namely a "forget gate", an "input gate", and an "output gate".
The "forget gate" uses the control coefficient $f_t$ to determine how much of the cell state information $C_{t-1}$ at time t-1 can be preserved. $f_t$ is a value between 0 and 1, calculated from the input value at time t and the hidden layer state information at time t-1. The closer $f_t$ is to 0, the more of the information in $C_{t-1}$ is discarded; the closer $f_t$ is to 1, the more of the information in $C_{t-1}$ is retained.
The expression of the "forget gate" control coefficient $f_t$ is as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
in the formula, $f_t$ represents the control coefficient of the "forget gate", $\sigma(\cdot)$ represents the Sigmoid function, $W_f$ represents the weight matrix of the "forget gate", $h_{t-1}$ represents the hidden layer state information at time t-1, $x_t$ represents the input value at time t, and $b_f$ represents the bias term of the "forget gate".
The expression of the Sigmoid function is as follows:
$$\sigma(p) = \frac{1}{1 + e^{-p}}$$
in the formula, $\sigma(p)$ is the Sigmoid function, and $p$ is the input parameter of the Sigmoid function.
The "input gate" uses the control coefficient $i_t$ to determine which information is added to the cell state information $C_t$ at time t, that is, how $C_t$ is updated.
The expression of the "input gate" control coefficient $i_t$ is as follows:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
in the formula, $i_t$ represents the control coefficient of the "input gate", $\sigma(\cdot)$ represents the Sigmoid function, $W_i$ represents the weight matrix of the "input gate", $h_{t-1}$ represents the hidden layer state information at time t-1, $x_t$ represents the input value at time t, and $b_i$ represents the bias term of the "input gate".
In the "input gate", the candidate update information $\tilde{C}_t$ at time t is determined from $h_{t-1}$ and $x_t$ through the tanh activation function.
The expression of the candidate update information $\tilde{C}_t$ at time t is as follows:
$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
in the formula, $\tilde{C}_t$ represents the candidate update information at time t, $\tanh(\cdot)$ represents the tanh activation function, $W_c$ represents the weight matrix of the candidate update in the "input gate", $h_{t-1}$ represents the hidden layer state information at time t-1, $x_t$ represents the input value at time t, and $b_c$ represents the bias term of the candidate update in the "input gate".
The expression of the tanh activation function is as follows:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
wherein $\tanh(\cdot)$ represents the tanh activation function, and $z$ represents the input parameter of the tanh activation function.
The "output gate" uses the control coefficient $o_t$ to control the output $h_t$ of the hidden layer state information at time t.
The expression of the "output gate" control coefficient $o_t$ is as follows:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
in the formula, $o_t$ represents the control coefficient of the "output gate", $\sigma(\cdot)$ represents the Sigmoid function, $W_o$ represents the weight matrix of the "output gate", $h_{t-1}$ represents the hidden layer state information at time t-1, $x_t$ represents the input value at time t, and $b_o$ represents the bias term of the "output gate".
By combining the "forget gate" and the "input gate", the cell state information $C_t$ at time t is calculated.
The expression of the cell state information $C_t$ at time t is as follows:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
in the formula, $C_t$ represents the cell state information at time t, $f_t$ represents the control coefficient of the "forget gate", $C_{t-1}$ represents the cell state information at time t-1, $i_t$ represents the control coefficient of the "input gate", and $\tilde{C}_t$ represents the candidate update information at time t.
On the basis of the cell state information $C_t$ at time t, the hidden layer state information $h_t$ at time t is calculated.
The expression of the hidden layer state information $h_t$ at time t is as follows:
$$h_t = o_t * \tanh(C_t)$$
in the formula, $h_t$ represents the hidden layer state information at time t, $o_t$ represents the control coefficient of the "output gate", $\tanh(\cdot)$ represents the tanh activation function, and $C_t$ represents the cell state information at time t.
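The gate equations above can be combined into a single time step. The sketch below uses plain Python lists so that each line maps onto one equation; the parameter dictionary layout, the helper names and the all-zero example weights are assumptions made for illustration, and a real model would use a numerical library with trained weights.

```python
from math import exp, tanh

def sigmoid(p):
    """Sigmoid function: 1 / (1 + e^(-p))."""
    return 1.0 / (1.0 + exp(-p))

def affine(W, b, v):
    """Return W @ v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns (h_t, C_t)."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], z)]  # forget gate
    i_t = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], z)]  # input gate
    o_t = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], z)]  # output gate
    c_hat = [tanh(v) for v in affine(params["W_c"], params["b_c"], z)]   # candidate
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def zero_params(hidden_size, input_size):
    """Hypothetical all-zero parameters, convenient for checking the arithmetic."""
    width = hidden_size + input_size
    params = {}
    for gate in ("f", "i", "o", "c"):
        params["W_" + gate] = [[0.0] * width for _ in range(hidden_size)]
        params["b_" + gate] = [0.0] * hidden_size
    return params

h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], zero_params(2, 1))
```

With zero weights every gate evaluates to 0.5 and the candidate to 0, so the new cell state is simply half of the previous one, which makes the roles of $f_t$, $i_t$ and $o_t$ easy to verify by hand.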
The MIC-LSTM-based water body missing data interpolation method provided by the embodiment of the invention utilizes the screened data with strong correlation to train, verify and test the LSTM model, is beneficial to improving the prediction of the LSTM model on watershed water body missing data, and improves the reliability of interpolation results.
Optionally, there are 2 hidden layers between the input layer and the output layer of the LSTM model; wherein, the number of neurons in the first hidden layer is 64, and the number of neurons in the second hidden layer is 32.
Specifically, fig. 4 is a diagram of an LSTM model structure provided in the embodiment of the present invention, and as shown in fig. 4, the LSTM model has one input layer, two hidden layers, and one output layer. The number of the neurons in the first hidden layer is 64, the number of the neurons in the second hidden layer is 32, and the number of the neurons in the second hidden layer is half of the number of the neurons in the first hidden layer, so that the structure of the LSTM model is simplified.
The number of iterations of the LSTM model may be set to 240, the batch size of the LSTM model may be set to 36, and the Adam optimizer may be employed to optimize the internal parameters of the LSTM neural network.
According to the MIC-LSTM-based water body missing data interpolation method provided by the embodiment of the invention, 2 layers of hidden layers are arranged in the LSTM model, the number of neurons in the first layer of hidden layer is 64, and the number of neurons in the second layer of hidden layer is 32, so that the structure of the LSTM model is simplified.
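The size of the described network can be worked out from the gate equations: each LSTM layer has four weight matrices of shape hidden x (hidden + input) and four bias vectors. The sketch below records the stated hyperparameters and counts trainable parameters; the number of input features (5), the single output value and the fully connected output layer are assumptions, since the source does not give them.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 blocks (forget, input, candidate, output),
    each with a hidden x (hidden + input) weight matrix and one bias vector."""
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

config = {
    "hidden_sizes": [64, 32],  # from the description
    "iterations": 240,         # from the description
    "batch_size": 36,          # from the description
    "optimizer": "Adam",       # from the description
    "n_features": 5,           # assumption: number of screened input series
    "n_outputs": 1,            # assumption: one interpolated value per step
}

def model_params(cfg):
    """Total parameters: stacked LSTM layers plus an assumed dense output layer."""
    sizes = [cfg["n_features"]] + cfg["hidden_sizes"]
    lstm = sum(lstm_layer_params(i, h) for i, h in zip(sizes, sizes[1:]))
    dense = sizes[-1] * cfg["n_outputs"] + cfg["n_outputs"]
    return lstm + dense

first_layer = lstm_layer_params(config["n_features"], 64)  # 17920
second_layer = lstm_layer_params(64, 32)                   # 12416
total = model_params(config)                               # 30369
```

The count follows the single-bias form of the equations in this description; some libraries keep two bias vectors per gate and would therefore report a slightly larger number for the same architecture.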
Optionally, the training, verifying, and testing the LSTM model based on the relevant data set to obtain the LSTM model after the test is passed includes:
determining a training set and a validation set based on a continuous portion of data in the correlated data set;
determining a test set based on the data missing part in the related data set;
and respectively training, verifying and testing the LSTM model in sequence by using the training set, the verifying set and the testing set to obtain the LSTM model after the test is passed.
Specifically, the continuous data part and the missing data part in the related data set are determined. The continuous data part is divided into two parts, one used as the training set and the other as the verification set, with the data volume of the training set larger than that of the verification set. The missing data part is used as the test set; the test set may consist of a plurality of data sets with different missing rates, so as to verify and improve the generalization capability of the model.
For example, in the related data set, the continuous data portion accounts for 75%, the missing data portion accounts for 10%, and the completely missing data portion accounts for 15%. And 2/3 of the continuous data part is used as a training set, 1/3 of the continuous data part is used as a verification set, and the missing data part is used as a test set.
After a training set, a verification set and a test set are determined, firstly, fitting training is carried out on parameters of the LSTM model by using the training set; after the LSTM model is trained, verifying the trained LSTM model by using a verification set, comparing a predicted value with an actual value, and optimizing model parameters; and after the LSTM model passes the verification, finally, evaluating the LSTM model after the verification by using a test set so as to verify and improve the generalization capability of the LSTM model and obtain the LSTM model after the test passes.
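The split described above can be sketched as follows, assuming the related data set is a list of time-ordered rows and that a row belongs to the missing part when the value to be interpolated is `None`. The column name `DO`, the row layout and the function name are hypothetical.

```python
def split_related_data(rows, target):
    """rows: list of dicts in time order; target: column to be interpolated.
    The continuous part (target observed) is split 2/3 : 1/3 into training
    and verification sets; the missing part becomes the test set."""
    complete = [r for r in rows if r[target] is not None]
    missing = [r for r in rows if r[target] is None]
    cut = len(complete) * 2 // 3
    return complete[:cut], complete[cut:], missing

# Hypothetical related data set: 12 time steps, DO missing at t = 4, 5, 10.
rows = [
    {"t": i, "DO": None if i in (4, 5, 10) else 8.0 + 0.1 * i, "temp": 20.0 + i}
    for i in range(12)
]
train_set, val_set, test_set = split_related_data(rows, "DO")
```

Scoring the model on the test set requires the true values behind the gaps, so this sketch presumes the gaps were introduced artificially and the original values were kept elsewhere.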
Step 104, predicting the watershed water body missing data based on the LSTM model after the test is passed, and acquiring an interpolation value corresponding to the watershed water body missing data.
Specifically, the LSTM model after passing the test is used for predicting the watershed water body missing data, and the value output by the LSTM model after passing the test is the interpolation value corresponding to the watershed water body missing data.
For example, Table 2 is a dissolved oxygen (DO) prediction table, in which the dissolved oxygen is predicted using the LSTM model that has passed the test.
The inputs to the LSTM model are DO_{t-1}, DO_{t-2}, DO_{t-3}, ..., DO_{t-11}, and DO_{t-12}, where DO_{t-1} denotes the dissolved oxygen at time t-1, DO_{t-2} the dissolved oxygen at time t-2, and so on up to DO_{t-12}, the dissolved oxygen at time t-12. From these inputs the model predicts DO_t, DO_{t+1}, DO_{t+2}, DO_{t+3}, DO_{t+4}, and DO_{t+5} respectively, where DO_t denotes the dissolved oxygen at time t, DO_{t+1} the dissolved oxygen at time t+1, and so on up to DO_{t+5}, the dissolved oxygen at time t+5.
Table 2. Dissolved oxygen prediction table
Target    Input1    Input2    Input3    ...  Input11    Input12
DO_t      DO_{t-1}  DO_{t-2}  DO_{t-3}  ...  DO_{t-11}  DO_{t-12}
DO_{t+1}  DO_{t-1}  DO_{t-2}  DO_{t-3}  ...  DO_{t-11}  DO_{t-12}
DO_{t+2}  DO_{t-1}  DO_{t-2}  DO_{t-3}  ...  DO_{t-11}  DO_{t-12}
DO_{t+3}  DO_{t-1}  DO_{t-2}  DO_{t-3}  ...  DO_{t-11}  DO_{t-12}
DO_{t+4}  DO_{t-1}  DO_{t-2}  DO_{t-3}  ...  DO_{t-11}  DO_{t-12}
DO_{t+5}  DO_{t-1}  DO_{t-2}  DO_{t-3}  ...  DO_{t-11}  DO_{t-12}
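The input and target layout of Table 2 can be reproduced from a single time series with a sliding window. The sketch below is illustrative only; the function name `make_samples` and the toy series are assumptions, while the 12 lagged inputs and the 6 targets follow the table.

```python
def make_samples(series, n_lags=12, n_ahead=6):
    """Build (inputs, targets) pairs laid out as in Table 2.

    inputs  = [DO_{t-1}, DO_{t-2}, ..., DO_{t-12}]
    targets = [DO_t, DO_{t+1}, ..., DO_{t+5}]
    """
    samples = []
    for t in range(n_lags, len(series) - n_ahead + 1):
        inputs = [series[t - k] for k in range(1, n_lags + 1)]
        targets = [series[t + h] for h in range(n_ahead)]
        samples.append((inputs, targets))
    return samples


# Toy series in which each value equals its own time index.
series = list(range(20))
samples = make_samples(series)
first_inputs, first_targets = samples[0]
print(len(samples))    # 3
print(first_inputs)    # [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(first_targets)   # [12, 13, 14, 15, 16, 17]
```

Because the toy values equal their time indices, the printed lists show directly which time steps feed each sample.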
According to the MIC-LSTM-based water body missing data interpolation method provided by the embodiment of the present invention, data sets with different missing types and different missing degrees are constructed, the strongly correlated data in each data set are screened out using the MIC algorithm, the LSTM model is trained, validated, and tested on the strongly correlated data, and the LSTM model that has passed the test is used to perform data interpolation, thereby improving the reliability of the interpolation results.
The MIC-LSTM-based water body missing data interpolation device provided by the present invention is described below. The device described below and the MIC-LSTM-based water body missing data interpolation method described above may be read with reference to each other.
Fig. 5 is a schematic structural diagram of a MIC-LSTM-based water body missing data interpolation device according to an embodiment of the present invention. As shown in Fig. 5, the present invention further provides a MIC-LSTM-based water body missing data interpolation device, including: a building module 501, a first obtaining module 502, a second obtaining module 503, and a third obtaining module 504, wherein:
the building module 501 is configured to construct a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body;
the first obtaining module 502 is configured to perform correlation analysis on the data in each data set by using the MIC algorithm, and obtain the data in each data set whose correlation is greater than a preset threshold;
the second obtaining module 503 is configured to train, validate, and test the LSTM model based on the correlated data set, and obtain the LSTM model that has passed the test; the correlated data set is the set of data in all the data sets whose correlation is greater than the preset threshold;
the third obtaining module 504 is configured to predict the watershed water body missing data based on the LSTM model that has passed the test, and obtain the interpolation values corresponding to the watershed water body missing data.
Optionally, the first obtaining module 502 is specifically configured to:
perform, by using the MIC algorithm, correlation analysis among effect quantities of the same type in each data set and between the effect quantities and the environment quantities, to obtain the data in each data set for which the correlation among effect quantities of the same type is greater than a preset threshold and the data in each data set for which the correlation between the effect quantities and the environment quantities is greater than a preset threshold.
Optionally, the second obtaining module 503 is specifically configured to:
determine a training set and a validation set based on the continuous-data portion of the correlated data set;
determine a test set based on the missing-data portion of the correlated data set;
train, validate, and test the LSTM model in sequence using the training set, the validation set, and the test set, respectively, to obtain the LSTM model that has passed the test.
Optionally, the apparatus further comprises: a normalization module; the normalization module is configured to:
normalize the data in each data set.
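The source does not say which normalization is used. A minimal sketch, assuming min-max scaling to the range [0, 1] and `None` as the marker for missing samples (both assumptions, as are the function names), could look like this:

```python
def min_max_normalize(values):
    """Scale the observed values to [0, 1]; missing entries (None) stay None.

    Returns the scaled list together with the minimum and maximum, which are
    needed later to map model outputs back to physical units.
    """
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    span = hi - lo
    if span == 0:
        # Constant series: map every observed value to 0.0 to avoid dividing by zero.
        return [None if v is None else 0.0 for v in values], lo, hi
    return [None if v is None else (v - lo) / span for v in values], lo, hi


def denormalize(value, lo, hi):
    """Invert min_max_normalize for a single value."""
    return value * (hi - lo) + lo


scaled, lo, hi = min_max_normalize([2.0, None, 6.0, 4.0])
print(scaled)                      # [0.0, None, 1.0, 0.5]
print(denormalize(0.5, lo, hi))    # 4.0
```

Keeping the minimum and maximum alongside the scaled data matters here because the interpolation values produced by the model must be converted back to the original units.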
Optionally, there are 2 hidden layers between the input layer and the output layer of the LSTM model; wherein, the number of neurons in the first hidden layer is 64, and the number of neurons in the second hidden layer is 32.
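To give a sense of the size of such a network, the number of trainable parameters can be worked out by hand. The sketch below assumes that both hidden layers are standard LSTM layers (four gates, one bias vector per gate, no peephole connections), that each time step carries a single input feature, and that a single linear output unit follows; none of these details are stated in the source, and frameworks that keep two bias vectors per gate will report slightly larger counts.

```python
def lstm_layer_parameters(n_inputs, n_units):
    """Trainable parameters of one standard LSTM layer.

    Each of the four gates has an input weight matrix (n_inputs x n_units),
    a recurrent weight matrix (n_units x n_units), and a bias (n_units).
    """
    return 4 * (n_inputs + n_units + 1) * n_units


n_features = 1  # assumption: one measured quantity per time step

first_hidden = lstm_layer_parameters(n_features, 64)
second_hidden = lstm_layer_parameters(64, 32)
output_layer = 32 * 1 + 1  # assumption: one linear output unit with bias

print(first_hidden)    # 16896
print(second_hidden)   # 12416
print(first_hidden + second_hidden + output_layer)  # 29345
```

Under these assumptions the model has roughly 29 thousand parameters, which is small enough to train on modest monitoring data sets.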
Specifically, the MIC-LSTM-based water body missing data interpolation device provided in the embodiment of the present application can implement all the method steps of the above method embodiment and achieve the same technical effects. The parts and beneficial effects that are the same as those of the method embodiment are not described again here.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communications interface 620, and the memory 630 communicate with one another via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the MIC-LSTM-based water body missing data interpolation method, the method comprising: constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body; performing correlation analysis on the data in each data set by using the MIC algorithm to obtain the data in each data set whose correlation is greater than a preset threshold; training, validating, and testing the LSTM model based on the correlated data set to obtain the LSTM model that has passed the test, the correlated data set being the set of data in all the data sets whose correlation is greater than the preset threshold; and predicting the watershed water body missing data based on the LSTM model that has passed the test, and obtaining the interpolation values corresponding to the watershed water body missing data.
In addition, when the logic instructions in the memory 630 are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product including a computer program stored on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, the computer is capable of executing the MIC-LSTM-based water body missing data interpolation method provided above, the method including: constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body; performing correlation analysis on the data in each data set by using the MIC algorithm to obtain the data in each data set whose correlation is greater than a preset threshold; training, validating, and testing the LSTM model based on the correlated data set to obtain the LSTM model that has passed the test, the correlated data set being the set of data in all the data sets whose correlation is greater than the preset threshold; and predicting the watershed water body missing data based on the LSTM model that has passed the test, and obtaining the interpolation values corresponding to the watershed water body missing data.
In still another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the MIC-LSTM-based water body missing data interpolation method provided above, the method including: constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body; performing correlation analysis on the data in each data set by using the MIC algorithm to obtain the data in each data set whose correlation is greater than a preset threshold; training, validating, and testing the LSTM model based on the correlated data set to obtain the LSTM model that has passed the test, the correlated data set being the set of data in all the data sets whose correlation is greater than the preset threshold; and predicting the watershed water body missing data based on the LSTM model that has passed the test, and obtaining the interpolation values corresponding to the watershed water body missing data.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The terms "first," "second," and the like in the embodiments of the present application are used for distinguishing between similar elements and not for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in other sequences than those illustrated or otherwise described herein, and that the terms "first" and "second" used herein generally refer to a class and do not limit the number of objects, for example, a first object can be one or more.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A water body missing data interpolation method based on MIC-LSTM is characterized by comprising the following steps:
constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body;
performing correlation analysis on the data in each data set by using an MIC algorithm to obtain data of which the correlation is greater than a preset threshold value in each data set;
training, verifying and testing the LSTM model based on the relevant data set to obtain the LSTM model after the test is passed; the related data sets are sets of data with the correlation larger than a preset threshold value in all the data sets;
predicting watershed water body missing data based on the LSTM model after the test is passed, and obtaining an interpolation value corresponding to the watershed water body missing data.
2. The MIC-LSTM-based water body missing data interpolation method according to claim 1, wherein performing correlation analysis on the data in each data set by using the MIC algorithm to obtain the strongly correlated data in each data set comprises:
performing, by using the MIC (maximal information coefficient) algorithm, correlation analysis among effect quantities of the same type in each data set and between the effect quantities and the environment quantities, and acquiring the data in each data set for which the correlation among effect quantities of the same type is greater than a preset threshold and the data in each data set for which the correlation between the effect quantities and the environment quantities is greater than a preset threshold.
3. The MIC-LSTM-based water body missing data interpolation method according to claim 1, wherein the training, verifying and testing the LSTM model based on the relevant data set to obtain the LSTM model after the test is passed comprises:
determining a training set and a validation set based on the continuous-data portion of the correlated data set;
determining a test set based on the missing-data portion of the correlated data set;
training, validating, and testing the LSTM model in sequence using the training set, the validation set, and the test set, respectively, to obtain the LSTM model that has passed the test.
4. The MIC-LSTM-based water body missing data interpolation method of claim 1, wherein before the correlation analysis of the data in each data set using MIC algorithm, further comprising:
and normalizing the data in each data set.
5. The MIC-LSTM-based water body missing data interpolation method according to claim 1, wherein 2 hidden layers are arranged between an input layer and an output layer of the LSTM model; wherein, the number of neurons in the first hidden layer is 64, and the number of neurons in the second hidden layer is 32.
6. A water body missing data interpolation device based on MIC-LSTM is characterized by comprising:
the construction module is used for constructing a plurality of data sets with different missing types and different missing degrees based on the original monitoring data of the watershed water body;
the first acquisition module is used for carrying out correlation analysis on the data in each data set by utilizing an MIC algorithm and acquiring the data of which the correlation is greater than a preset threshold in each data set;
the second acquisition module is used for training, verifying and testing the LSTM model based on the relevant data set to acquire the LSTM model after the test is passed; the related data sets are sets of data with the relevance larger than a preset threshold value in all the data sets;
and the third acquisition module is used for predicting the watershed water body missing data based on the LSTM model after the test is passed, and acquiring an interpolation value corresponding to the watershed water body missing data.
7. The MIC-LSTM-based water body missing data interpolation device of claim 6, wherein the first obtaining module is specifically configured to:
and respectively carrying out correlation analysis on the same type of effect quantities in each data set and the effect quantities and the environment quantities by using an MIC algorithm to obtain data of which the correlation between the same type of effect quantities in each data set is greater than a preset threshold value and data of which the correlation between the effect quantities and the environment quantities in each data set is greater than a preset threshold value.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the MIC-LSTM-based water body missing data interpolation method as claimed in any one of claims 1 to 5.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the MIC-LSTM-based water body missing data interpolation method as claimed in any one of claims 1 to 5.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the MIC-LSTM-based water body missing data interpolation method as claimed in any one of claims 1 to 5.
CN202211160686.1A 2022-09-22 2022-09-22 Water body missing data interpolation method and device based on MIC-LSTM Pending CN115600105A (en)

Publications (1)

Publication Number Publication Date
CN115600105A true CN115600105A (en) 2023-01-13

Family

ID=84845489

Country Status (1)

Country Link
CN (1) CN115600105A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination