Disclosure of Invention
Aiming at the problems, the invention provides a method for predicting dissolved oxygen in a tidal river network area.
The technical scheme of the invention is as follows:
a prediction method for dissolved oxygen in a tidal river network area comprises the following steps:
s1, data acquisition: establishing a water quality automatic station in a tidal river network area needing dissolved oxygen prediction, acquiring water quality time sequence data through the water quality automatic station, and preprocessing the acquired water quality time sequence data, wherein the water quality time sequence data comprise dissolved oxygen and other environment variables;
s2, data screening: calculating the maximum mutual information coefficient of the dissolved oxygen and other environment variables in the water quality time series data obtained in the step S1, screening out other environment variables with large correlation with the dissolved oxygen, and taking the other environment variables as input variables of a long-time memory network;
s2-1, mutual information definition: mutual information is an index for measuring the degree of correlation between other environmental variables and dissolved oxygen, and a given variable A = { x = { (x) i I =1,2,. Cndot., n } and B = { y = i I =1, 2.,. N }, where n is the number of samples, and the mutual information I (a; B) of a and B defines the formula:
wherein p (x, y) is the joint probability density of A and B, p (x) is the edge probability density of A, and p (y) is the edge probability density of B;
s2-2, value domain division: let D = { (a) i B), i =1,2,.. Multidot.n } is a finite set, and meanwhile, the value ranges of the variable a and the variable B are divided into x sections and y sections respectively to obtain x × y grids G, then the mutual information MI (a, B) is calculated inside each obtained grid division to obtain the maximum value G of the mutual information MI (a, B), and then the maximum normalization value formula of the finite set D under the condition of defining the maximum value G is as follows:
MI*(D,x,y)=maxMI(D│G)
wherein D | G is a finite set D divided using a grid G, and MI (D, x, y) is a maximum normalized value;
s2-3, calculating the maximum value: the maximum value of the feature matrix composed of the maximum normalized values obtained under each grid division is obtained, and the formula of the maximum information coefficient is obtained as follows:
wherein MIC (D) is the maximum information coefficient;
s2-4, relevance analysis: calculating the value of the maximum information coefficient MIC (D) of the dissolved oxygen and other environment variables by taking the dissolved oxygen as a variable A and other environment variables as a variable B, wherein the obtained value of the maximum information coefficient MIC (D) is in a [0,1] interval, the greater the value of the maximum information coefficient MIC (D), the greater the relevance of the dissolved oxygen and other environment variables is, the smaller the value of the maximum information coefficient MIC (D), the smaller the relevance of the dissolved oxygen and other environment variables is, and the other environment variables with greater relevance to the dissolved oxygen are selected as input variables of a prediction model;
s3, establishing a long-time memory network model:
s3-1, framework construction: the long-time and short-time memory network model comprises 1 input layer, 1 output layer and a plurality of hidden layers, each hidden layer is composed of a plurality of memory units, the memory units control the updating and utilization of historical information by introducing a gating mechanism, and the gating mechanism comprises an input gate i
t Forget gate i
t f
t And an output gate o
t Input door i
t Forgetting door f
t And an output gate o
t All values of (A) are in [0,1]]The interval represents that the information is passed by a certain proportion, the cell state is reset periodically to avoid the short accumulation of the cell state, and the cell state comprises a candidate state
Internal state C
t And an external state h
t Input door i
t Controlling candidate states at the current time
How much information needs to be stored, forget the door f
t Controlling the internal state C of the last moment
t The door o is output according to the information to be forgotten
t Controlling the internal state C at the present time
t How much information needs to be output to the external state h
t Simultaneously activating the sigmoid (σ) function and the tanh hyperbolic tangent function layer as shown in the following formula:
s3-2, initialization: initializing the matrix and the vector of the memory unit for storing model parameters, storing intermediate calculation results, and storing the number of neurons of an input layer and an output layer, the number of cells of a hidden layer and a network state;
s3-3, forward propagation calculation: the long-time and short-time memory network model determines the information discarded from the cell state, and the step is completed by a forgetting gate, firstly, aiming at the input information x at the current moment t And the hidden layer external state h at the previous moment t-1 The output information of the system is processed by a sigmoid (sigma) function layer to obtain an output between 0 and 1 as an internal state C at the last moment t-1 The filtering value of (f) is obtained as the forgetting gate t The formula of (1) is as follows:
f t =σ(W xf x t +W hf h t―1 +b f )
wherein, W is a weight matrix, the subscript of W represents the connection weight between two units, b represents the bias term;
secondly, the long-time memory network model judges the information stored in the cell state, firstly, the input information x at the current time is t And the hidden layer external state h at the previous moment t-1 The output information is calculated by a sigmoid function layer to obtain an input gate i t The value is shown as the following formula:
i t =σ(W xi x t +W hi h t―1 +b i )
a candidate state is then generated by the hyperbolic tangent function layer tanh
For the renewal of the cell state, as shown in the following formula:
finally, the long-time and short-time memory network model determines the output information of the cell, and inputs the input information x at the current moment t And the hidden layer external state h at the previous moment t-1 Output information of the output gate is calculated by a sigmoid (sigma) function layer to form an output gate o t As shown in the following formula:
o t =σ(W xo x t +W ho h t―1 +b o )
then the internal state C of the current cell t Compression to [ -1, 1] by tanh function]Interval of (2), internal state C of the compressed cell t And an output gate o t Multiplying to obtain the external state h of the hidden layer at the current moment t Outputting information as shown in the following formula:
h t =o t tanh(C t )
the memory unit can be connected with other parts in the long-time and short-time memory network model, and the external state h of the hidden layer at the current moment
t As the hidden layer external state h
t Is passed on to the next instant, on the other hand as the hidden layer external state h
t The output information is transmitted to a next long-term and short-term memory network, when the next long-term and short-term memory network is a full connection layer, a transformation is carried out on a hidden layer result to obtain final output information, and therefore a predicted value of a time sequence is obtained
As shown in the following formula:
in the formula, vout is a weight matrix of the full connection layer, and b represents an offset term;
s3-4, updating the weight: solving the gradient of each weight of the long-time memory network, finding an optimal solution by using training data to perform random gradient descent, solving the gradient from the weight of an output layer to the weight of an input layer, sequentially updating each weight, resetting an internal state, designing an error function, and calculating and checking the gradient;
s3-5, root mean square error evaluation: training time sequence data of other environment variables related to dissolved oxygen through a long-short-term memory network model, training the long-short-term memory network model by taking the time sequence data of other environment variables which are normalized and subjected to MIC screening as a training data set, adding a Drapout mechanism into a training mechanism of a hidden layer in order to relieve the overfitting problem in the training process of a multivariable prediction model neural network, and calculating a root mean square error after training to evaluate the prediction result of the long-short-term memory network model, wherein the root mean square error is as shown in the following formula:
in the formula (I), the compound is shown in the specification,
y (i) is a predicted value of dissolved oxygen and an actual measured value of dissolved oxygen;
s4, k-fold cross validation: dividing the input variable obtained in the step S2-4 into k equal parts as an original data set, selecting k-1 parts as a training set each time, using 1 part as a test set, training k-1 parts and testing the
rest 1 parts by using different hyper-parameter combinations, calculating the RMSE value of the test set, repeating the steps of the step S3-2 to the step S3-5 for long and short time memory of network model training and testing until each hyper-parameter combination in the k original data set is tested, and calculating the RMSE average value of each final output information, wherein the parameter group with the minimum RMSE average value is an optimal combination as shown in the following formula:
s5, calculating and predicting: real-time data of a water quality automatic station in a tidal river network area are input into an established long-time and short-time memory network model after being preprocessed, a predicted value of dissolved oxygen is obtained through scaling of a result output by the long-time and short-time memory network model, and a trend graph of the dissolved oxygen is drawn by adopting a rolling forecasting method.
Further, other environmental variables of the water time series data in step S1 include pH, water temperature, conductivity, turbidity, water level, flow rate, ammonia nitrogen, total phosphorus, permanganate index, chemical oxygen demand, total nitrogen and DO 25h Said DO 25h For corrected time-series data of dissolved oxygen, DO 25h The correction method comprises the following steps: the duration of one tidal cycle is 24h50min, the lag time is increased to 25h, and the corrected dissolved oxygen time sequence data obtained at the moment is DO 25h 。
Further, the preprocessing method in step S1 includes: carrying out missing value interpolation and normalization processing on the collected water quality time sequence data;
s1-1, missing value interpolation: when the water quality time sequence data is missing, the average value of the data at two adjacent moments is used for interpolation;
s1-2, normalization processing: the formula of the normalization process is:
wherein x' is water quality time sequence data after normalization, x is water quality time sequence data before normalization, and x min As a minimum in water quality time series dataValue, x max Is the maximum value in the water quality time series data.
Further, when the value of the maximum information coefficient MIC (D) in step S2-4 is greater than 0.8, it is considered that the correlation between the other environmental variables and the dissolved oxygen is large. Normally dissolved oxygen and DO 25 The MIC (D) value of (1) is large, about 0.7-0.8, and the correlation calculation is performed by normalized _ mutual _ info _ score in python module skleern. Metrics. Cluster.
Further, the model parameters in step S3-2 include a weight matrix W and a bias term b, and the intermediate calculation result includes an external state h t Output information, input gate f t Forgotten door i t Output gate o t 。
Further, the Dropout mechanism in step S3-5 is: the neural units and their connections are randomly lost during the training process with time series data of other environmental variables.
Further, the long-time memory network model is built in the step S3-1 based on a TensorFlow deep learning framework.
Further, the method for scrolling and forecasting in step S5 specifically includes: according to the sampling interval of the predicted value of the existing dissolved oxygen, a reasonable prediction time step length is set, the long-time memory network model can calculate the dissolved oxygen data of t + n days and output the calculated dissolved oxygen data to obtain a true dissolved oxygen value on the assumption that the predicted time is n days according to the t-day dissolved oxygen data concentrated in the test and the important parameters screened by the method of S2, then the true dissolved oxygen value of t + n days and other environment variables screened by the method of S2 are screened out on the t +2n days, and the sequence information is updated in time by adopting a rolling prediction method, so that error accumulation is avoided.
The invention has the beneficial effects that:
the invention provides a solution for predicting dissolved oxygen in a tidal river Network area, fully considers the characteristics that the tidal river Network area is influenced by tides and the dissolved oxygen shows periodic change, selects time-lag dissolved oxygen data as input variables, identifies key factors influencing the change of the dissolved oxygen as the input variables by a Maximum Information Coefficient (MIC) method, establishes a Long-Short-Term Memory Network (LSTM) by using a deep machine learning model, effectively solves the problem of gradient disappearance in the traditional circulating Network, selects an optimal hyper-parameter combination of the model by using a K-fold cross-validation grid searching method, and improves the accuracy of predicting the dissolved oxygen in the tidal river Network area.
Detailed Description
Example 1
A method for predicting dissolved oxygen in a tidal river network area, as shown in fig. 1, comprising the following steps:
s1, data acquisition: establishing a water quality automatic station in a tidal river network area needing dissolved oxygen prediction, collecting water quality time sequence data through the water quality automatic station, and preprocessing the collected water quality time sequence data, wherein the water quality time sequence data comprise dissolved oxygen and other environmental variables, and the other environmental variables of the water quality time sequence data comprise pH, water temperature, conductivity, turbidity, water level, flow, ammonia nitrogen, total phosphorus, permanganate index, chemical oxygen demand, total nitrogen and DO 25h Said DO 25h For corrected time series data of dissolved oxygen, DO 25h The correction method comprises the following steps: the duration of one tidal cycle is 24h50min, the lag time is increased to 25h, and the corrected dissolved oxygen time sequence data obtained at the moment is DO 25h ;
The pretreatment method comprises the following steps: carrying out missing value interpolation and normalization processing on the collected water quality time sequence data;
s1-1, missing value interpolation: when the water quality time sequence data is missing, the average value interpolation of the data at two adjacent moments is used;
outliers (identified by L or more 000 in the data quadratic table) and default values of the data are identified, labeled as nan. When the water quality time sequence data is missing, the average value of the data at two adjacent moments is used for interpolation;
the sampling frequency is unified, the non-integral point or integral day recording situation can occur in the data recording of the water quality automatic station, the situations are screened, and the data are unified to the whole day or the whole hour according to the actual situation of each station;
missing value interpolation: according to the unified data sampling frequency of each station, under the condition that no effective data exists at the corresponding time point, the nearest effective data is used for filling, and if the missing data is more than 12 time steps, linear interpolation is used for interpolation.
S1-2, normalization processing: the formula for the normalization process is:
wherein x' is water quality time sequence data after normalization, x is water quality time sequence data before normalization, and x min Is the minimum value, x, in the water quality time series data max Is the maximum value in the water quality time sequence data;
s2, data screening: calculating the maximum mutual information coefficient of the dissolved oxygen and other environment variables in the water quality time series data obtained in the step S1, screening out other environment variables with large correlation with the dissolved oxygen, and taking the other environment variables as input variables of a long-time memory network;
s2-1, mutual information definition: mutual information is an index that measures the degree of correlation between other environmental variables and dissolved oxygen, givenVariable a = { x i I =1,2,. Cndot., n } and B = { y = i I =1,2,., n }, where n is the number of samples, the mutual information I (a; B) of a and B is defined by the formula:
wherein p (x, y) is the joint probability density of A and B, p (x) is the edge probability density of A, and p (y) is the edge probability density of B;
s2-2, value domain division: let D = { (a) i And B), i =1,2,.. Multidot.n } is a finite set, meanwhile, the value ranges of the variable A and the variable B are respectively divided into an x section and a y section to obtain a grid G of x y, then mutual information MI (A, B) is calculated in each obtained grid division to obtain the maximum value G of the mutual information MI (A, B), and then the maximum normalization value formula of the finite set D under the condition of defining the maximum value G is as follows:
MI*(D,x,y)=maxMI(D│G)
wherein D | G is a finite set D divided using a grid G, and MI (D, x, y) is the maximum normalized value;
s2-3, calculating the maximum value: the maximum value of the feature matrix composed of the maximum normalized values obtained under each grid division is obtained, and the formula of the maximum information coefficient is as follows:
wherein MIC (D) is the maximum information coefficient;
s2-4, correlation analysis: calculating the value of the maximum information coefficient MIC (D) of the dissolved oxygen and other environment variables by taking the dissolved oxygen as a variable A and other environment variables as a variable B, wherein the obtained value of the maximum information coefficient MIC (D) is in a [0,1] interval, the greater the value of the maximum information coefficient MIC (D), the greater the correlation between the dissolved oxygen and other environment variables, the smaller the value of the maximum information coefficient MIC (D), the smaller the correlation between the dissolved oxygen and other environment variables, selecting other environment variables with greater correlation with the dissolved oxygen as input variables of the prediction model, and considering that the correlation between the other environment variables and the dissolved oxygen is greater when the value of the maximum information coefficient MIC (D) is greater than 0.8;
s3, establishing a long-time memory network model:
s3-1, framework construction: a long-and-short-term memory network model is built based on a TensorFlow deep learning framework, the long-and-short-term memory network model comprises 1 input layer, 1 output layer and 3 hidden layers, each hidden layer is composed of 20 memory units, the memory units control updating and utilization of historical information by introducing a gating mechanism, and the gating mechanism comprises an input gate i
t Forget gate i
t f
t And an output gate o
t Input door i
t Forgetting door f
t And an output gate o
t All values of (A) are in [0,1]]The interval represents that the information is passed by a certain proportion, the cell state is reset periodically to avoid the short accumulation of the cell state, and the cell state comprises a candidate state
Internal state C
t And an external state h
t Input door i
t Controlling candidate states at the current time
How much information needs to be stored, forget the door f
t Controlling the internal state C of the last moment
t The information to be forgotten is output through an output gate o
t Controlling the internal state C at the present time
t How much information needs to be output to the external state h
t Simultaneously activating the sigmoid (σ) function and the tanh hyperbolic tangent function layer as shown in the following formula:
s3-2, initialization:initializing the matrix and vector of the memory unit for storing model parameters including weight matrix W and bias item b and storing intermediate calculation result including external state h t Output information, input gate f t Forget gate i t Output gate o t Storing the neuron number, the hidden layer cell number and the network state of the input layer and the output layer;
s3-3, forward propagation calculation: the long-time and short-time memory network model determines the information discarded from the cell state, and the step is completed by a forgetting gate, firstly, aiming at the input information x at the current moment t And the hidden layer external state h at the previous moment t-1 The output information of (2) is processed by sigmoid (sigma) function layer to obtain an output between 0 and 1 as the internal state C at the last time t-1 The filtering value of (f) is obtained as the forgetting gate t The formula of (1) is:
f t =σ(W xf x t +W hf h t―1 +b f )
wherein, W is a weight matrix, the subscript of W represents the connection weight between two units, and b represents an offset term;
secondly, the long-time and short-time memory network model judges the information stored in the cell state, firstly, the input information x at the current time is input t And the hidden layer external state h at the previous moment t-1 The output information is calculated by a sigmoid function layer to obtain an input gate i t The value is shown as the following formula:
i t =σ(W xi x t +W hi h t―1 +b i )
a candidate state is then generated by means of the tanh layer
For the renewal of the cell state, as shown in the following formula:
finally, the long-time and short-time memory network model determines the output information of the cell, and inputs the input information x at the current moment t And the hidden layer external state h at the previous moment t-1 The output information of the output gate is subjected to sigmoid (sigma) function layer calculation to obtain an output gate o t As shown in the following formula:
o t =σ(W xo x t +W ho h t―1 +b o )
then the internal state C of the current cell t Compression to [ -1, 1] by tanh function]Interval of (2), internal state C of the compressed cell t And an output gate o t Multiplying to obtain the external state h of the hidden layer at the current moment t Outputting information as shown in the following formula:
h t =o t tanh(C t )
the memory unit is also connected with other parts in the long-time memory network model, and the external state h of the hidden layer at the current time
t As the hidden layer external state h
t Is passed on to the next instant, on the other hand as the hidden layer external state h
t The output information is transmitted to a next long-term and short-term memory network, when the next long-term and short-term memory network is a full connection layer, a transformation is carried out on a hidden layer result to obtain final output information, and therefore a predicted value of a time sequence is obtained
As shown in the following formula:
in the formula, vout is a weight matrix of the full connection layer, and b represents an offset term;
s3-4, updating the weight: solving the gradient of each weight of the long-time memory network, finding an optimal solution by using training data to perform random gradient descent, solving the gradient from the weight of an output layer to the weight of an input layer, sequentially updating each weight, resetting an internal state, designing an error function, and calculating and checking the gradient;
s3-5, root mean square error evaluation: training time sequence data of other environment variables related to dissolved oxygen through a long-short-term memory network model, training the long-short-term memory network model by taking the time sequence data of other environment variables which are normalized and subjected to MIC screening as a training data set, and adding a Dropout mechanism into a training mechanism of a hidden layer in order to relieve the overfitting problem in the training process of a multivariate prediction model neural network, wherein the Dropout mechanism is as follows: the neural unit and the connection thereof are randomly lost in the process of training time sequence data of other environment variables, and after the training is finished, the root mean square error is calculated to evaluate the prediction result of the long-time and short-time memory network model, wherein the root mean square error is shown as the following formula:
in the formula (I), the compound is shown in the specification,
y (i) is a predicted value of the dissolved oxygen and an actual measurement value of the dissolved oxygen;
s4, k-fold cross validation: dividing the input variable obtained in the step S2-4 as an original data set into k equal parts, taking k as 5, selecting k-1 part as a training set each time, taking the remaining 1 part as a test set, training k-1 part and testing the
rest 1 part by using different hyper-parameter combinations, calculating the RMSE value of the test set, repeating the steps of training and testing the network model in the steps S3-2 to S3-5 at long and short times until each hyper-parameter combination in the k original data set is tested, and calculating the RMSE average value of each final output information, wherein the parameter with the smallest RMSE average value is the optimal combination, and the following formula shows that:
s5, calculating and predicting: preprocessing real-time data of a water quality automatic station in a tidal river network area, inputting the preprocessed real-time data into a built long-and-short-term memory network model, obtaining a predicted value of dissolved oxygen through scaling of a result output by the long-and-short-term memory network model, and drawing a trend graph of the dissolved oxygen by adopting a rolling forecasting method, wherein the rolling forecasting method in the step S5 specifically comprises the following steps of: according to the sampling interval of the predicted value of the existing dissolved oxygen, a reasonable prediction time step length is set, the long-time memory network model can calculate the dissolved oxygen data of t + n days and output the calculated dissolved oxygen data to obtain a true dissolved oxygen value on the assumption that the predicted time is n days according to the t-day dissolved oxygen data concentrated in the test and the important parameters screened by the method of S2, then the true dissolved oxygen value of t + n days and other environment variables screened by the method of S2 are screened out on the t +2n days, and the sequence information is updated in time by adopting a rolling prediction method, so that error accumulation is avoided.
Example 2
This embodiment is substantially the same as
embodiment 1, except that: and S3-1, the number of the hidden layers in the framework construction is different.
S3-1, framework construction: a long-short-term memory network model is built based on a TensorFlow deep learning framework, and the long-short-term memory network model comprises 1 input layer, 1 output layer and 3 hidden layers.
Example 3
This embodiment is substantially the same as
embodiment 1, except that: the values of the maximum information coefficients MIC (D) in steps S2-4 are different. The maximum information coefficient MIC (D) was 0.5, and the variables used for prediction included ammonia nitrogen and total phosphorus.
Experimental example 1
In order to verify the actual application effect of the invention, the actually measured water quality online observation data actually operated by a certain water quality automatic online site is selected for verification. The prediction is carried out by using the method for predicting the dissolved oxygen in the tidal river network area in the
embodiment 1, the selected station is a Dalongyong station, and the time span is 1/1 day in 2019 to 3/29 days in 2021. The sampling frequency of permanganate index, ammonia nitrogen, total phosphorus, and total nitrogen was 4 hours, and the time sampling frequency of the remaining variables was 1 hour, as shown in table 1.
8832 time sequence samples processed in the data acquisition of the step S1 are respectively calculated in the step S2, and according to the MIC (D) value and 0.85 as a threshold value, DO25, conductivity, water temperature, ammonia nitrogen and total nitrogen concentration are selected as prediction variables of a long-time memory network model, wherein the MIC (D) values of temperature, pH, DO25, conductivity, turbidity, permanganate index, ammonia nitrogen, total phosphorus, total nitrogen and dissolved oxygen are respectively calculated; in step S3, a long-time and short-time memory network model is built based on a mainstream TensorFlow deep learning framework, for the hyper-parameters related to the prediction model, as shown in fig. 2, in step S4, a k-fold cross validation grid search method is adopted for optimization to obtain an optimal hyper-parameter combination, 67% of data in a sample is selected as a training set, the long-time and short-time memory network model is trained, the remaining 33% of samples are used as a test set, training and test results are shown in fig. 3, calculation results of each relevant variable are shown in table 1, and a model parameter setting and result evaluation list is shown in table 2. After training, the root mean square error was calculated to evaluate model performance, with a training set RMSE of 0.29 and a test set RMSE of 0.22.
Experimental example 2
This example is basically the same as example 1, except that: the selected observation stations are different, the data of pier bases are selected to train and predict the model, the calculation results of all relevant variables are shown in table 1, the model parameter setting and result evaluation list is shown in table 2, and the training and testing results are shown in fig. 4.
Experimental example 3
This example is basically the same as example 2, except that: the number of the selected grid layers is different, the calculation results of the relevant variables are shown in table 1, the model parameter setting and result evaluation list is shown in table 2, and the training and testing results are shown in fig. 5.
Experimental example 4
This example is substantially the same as example 2 except that: based on the fact that the maximum information coefficient MIC (D) in example 3 was 0.5, variables used for prediction included ammonia nitrogen and total phosphorus, the calculation results of each relevant variable are shown in table 1, and the model parameter setting and result evaluation list is shown in table 2.
Experimental example 5
This example is basically the same as example 1, except that: the step size is changed, more input and output time steps are used, the calculation results of all relevant variables are shown in table 1, and the model parameter setting and result evaluation list is shown in table 2.
TABLE 1 List of the results of MIC (D) calculations for the maximum information coefficient for each of the relevant variables in the Large Surge site and pier-head-based site
Table 2 model parameter settings and result evaluation list in experimental cases 1-5