CN111639111A - Water transfer engineering-oriented multi-source monitoring data deep mining and intelligent analysis method

Info

Publication number: CN111639111A
Application number: CN202010516046.4A
Authority: CN
Inventors: 李明超; 任秋兵; 李明昊
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2020-06-09
Filing date: 2020-06-09
Publication date: 2020-09-08

Abstract

A multi-source monitoring data deep mining and intelligent analysis method for water transfer engineering comprises the following steps: performing gross error discrimination and cubic spline interpolation to clean the original input data; predicting data by a statistical model method that stacks kernel ridge regression, random forest and partial regression algorithms; predicting data by a BP neural network method optimized by a genetic algorithm; predicting data by an integrated LSTM neural network method; and comparing, on the data set, the mean square errors of the individual predicted values, their average and their weighted average, and selecting the optimal predicted value corresponding to the minimum mean square error. The method combines two types of neural network algorithms, uses the genetic algorithm to optimize the hyperparameters of the BP neural network, uses the stacked LSTM neural network, and takes the prediction results of the three types of models into account in the final prediction result, thereby obtaining a multi-source monitoring data deep mining and intelligent analysis method suitable for water transfer engineering.

Description

Water transfer engineering-oriented multi-source monitoring data deep mining and intelligent analysis method

Technical Field

The invention relates to the technical field of safety monitoring of hydraulic buildings, and in particular to a multi-source monitoring data deep mining and intelligent analysis method for the safety monitoring data of water transfer engineering.

Background

The mass data in a long-distance water transfer project come from various sources and differ in scale and precision. By classifying, reorganizing and intelligently analyzing the multi-source data, mining analysis at different levels and depths is achieved under different algorithms and models, so that future working condition information of the water transfer project is obtained and theoretical support is provided for safety early warning and control analysis.

The traditional statistical regression model is the earliest and most mature model in the safety monitoring analysis of hydraulic engineering. Because of their well-defined physical meaning and interpretability, regression models serve as an auxiliary analysis tool in the present method. Existing regression models perform differently on different data sets, and the multi-source data of a water transfer project must be modeled and analyzed as accurately as possible, which a single regression model cannot achieve. The feedforward neural network has long been a popular algorithm for research and application in the field of water conservancy, and its most common form is the Back Propagation (BP) neural network. The algorithm has been updated repeatedly over several decades, accounts for a large share of water conservancy safety monitoring work, and continues to be improved, so that it remains well adapted to the complex and highly nonlinear evolution behavior of hydraulic buildings. Li et al. proposed a combined algorithm of a genetic algorithm and a BP neural network, and established a relation between actually measured stress data and numerical calculation. Liu et al., addressing the unstable performance and measurement drift of automatic monitoring systems, combined a BP neural network with a gray model, optimized the parameters with a genetic algorithm, and established an observer fault self-diagnosis system combining linear and nonlinear components. The Long Short-Term Memory (LSTM) neural network is a special recurrent neural network with inherent advantages in long sequence analysis, and it has been a research hotspot in monitoring and forecasting for the water conservancy industry in recent years. Huan Jade et al. used an LSTM neural network to mine the correlation of dam deformation monitoring data over a span of time. Similarly, Liuqiong et al. applied the LSTM neural network to the time series prediction of dam deformation and compared it with the common NAR network in machine learning and the autoregressive integrated moving average (ARIMA) model; the results show that the average relative error of the LSTM neural network is only half of that of the comparison models, so the method has high precision and good stability.

However, neural network algorithms share the common problems of hyper-parameter tuning and reliance on a single model. Different hyper-parameters and different models suit different data sets. The monitoring data in a water transfer project come from many sources and differ in scale and accuracy, which means that, to maintain accuracy, the hyper-parameters of the whole set of algorithms must be adjusted continually and the prediction results of different models must be combined so as to adapt to different types of data.

Disclosure of Invention

In view of the above, the main objective of the present invention is to provide a water diversion engineering-oriented method for deep mining and intelligent analysis of multi-source monitoring data, so as to partially solve at least one of the above technical problems.

In order to achieve the above object, as an aspect of the present invention, a method for deep mining and intelligent analysis of multi-source monitoring data for water transfer engineering is provided, which includes the following steps:

step 1: performing gross error discrimination and cubic spline interpolation to clean the original input data;

step 2: predicting data by a statistical model method that stacks kernel ridge regression, random forest and partial regression algorithms;

step 3: predicting data by a BP neural network method optimized by a genetic algorithm;

step 4: predicting data by an integrated LSTM neural network method;

step 5: comparing, on the data set, the mean square errors of the predicted values obtained in step 2, step 3 and step 4, of their average, and of their weighted average, and selecting the optimal predicted value corresponding to the minimum mean square error.

Based on the technical scheme, compared with the prior art, the multisource monitoring data deep mining and intelligent analysis method for the water transfer project has at least one of the following beneficial effects:

1. The invention trains the regression models in a stacking mode, and completes the learning task by constructing and combining a plurality of regression algorithms so as to draw on the strengths of each.

2. The method also combines a novel algorithm represented by a neural network algorithm, and compared with an analysis method only relying on a single model at the present stage, the method is greatly improved in the aspect of prediction precision.

3. The method combines two types of neural network algorithms, uses the genetic algorithm to optimize the hyperparameters of the BP neural network, uses the stacked LSTM neural network, and takes the prediction results of the three types of models into account in the final prediction result, thereby obtaining a multi-source monitoring data deep mining and intelligent analysis method suitable for water transfer engineering.

Drawings

FIG. 1 is a general flow diagram of a method implementation of the present invention;

FIG. 2 shows raw data and filtered and interpolated data according to an embodiment of the present invention;

FIG. 3 is a graph of statistical model prediction data and actual monitoring instrument data in accordance with an embodiment of the present invention;

FIG. 4 shows the prediction data of the genetic algorithm optimized BP neural network model and the actual monitoring instrument data;

FIG. 5 shows the prediction data of the stacked LSTM neural network and the actual monitoring instrument data;

FIG. 6 shows the predicted effect of the algorithm of the embodiment of the present invention on the test set Q3 after integration.

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

As shown in fig. 1, an overall flow chart of a multi-source monitoring data deep mining and intelligent analysis method for water transfer engineering is provided.

Step A, performing gross error judgment and cubic spline curve interpolation cleaning work on original input data; the method comprises the following specific steps:

step A1, inputting selected instrument monitoring data and the number of days to be predicted, and preparing for data analysis;

step A2, filtering the original data using a zero phase shift formula to separate the trend and eliminate gross errors in the data:

$$Y(e^{j\omega}) = X(e^{j\omega})\,\lvert H(e^{j\omega})\rvert^{2} \qquad (1)$$

in the formula, Y is output data after filtering, and X and H are respectively input monitoring data and a filtering system function;

step A3, the filtered data contain a number of null values; the monitoring data are made uniform by cubic spline interpolation, which leaves the original monitoring values unchanged and facilitates the establishment of the monitoring model in the subsequent steps; the formula is as follows:

$$S(x) = S_i(x), \quad x \in [x_i, x_{i+1}], \quad i = 0, 1, \ldots, n-1$$

wherein S(x) is the final interpolated curve function and S_i(x) (i = 0, 1, ..., n-1) is the interpolation function on the i-th interval;

step A4, sequentially taking the monitoring data at fixed time intervals over the full time span of the data set to construct the data set of the monitoring instrument, as shown in fig. 2.
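The cleaning in steps A2 and A3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the patent does not name the filter H, so a first-order low-pass filter applied forward and then backward is assumed here, and the spline is assumed to be a natural cubic spline. The function names, the smoothing factor `alpha` and the toy readings are all hypothetical.

```python
from bisect import bisect_right


def lowpass(x, alpha):
    """First-order IIR low-pass filter; the state starts at the first sample."""
    out, state = [], x[0]
    for value in x:
        state = alpha * value + (1.0 - alpha) * state
        out.append(state)
    return out


def zero_phase_filter(x, alpha=0.5):
    """Filter forward, then backward: the phase shifts cancel and the
    magnitude response is applied twice, i.e. |H|^2 as in formula (1)."""
    forward = lowpass(x, alpha)
    return lowpass(forward[::-1], alpha)[::-1]


def natural_cubic_spline(xs, ys):
    """Return S(x) for a natural cubic spline through the points (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives; natural ends keep m[0] = m[n] = 0
    if n >= 2:
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        for k in range(1, n - 1):  # Thomas algorithm, forward elimination
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        m[n - 1] = rhs[-1] / diag[-1]
        for k in range(n - 3, -1, -1):  # back substitution
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def spline(x):
        # Locate the interval [xs[i], xs[i+1]] and evaluate its cubic piece S_i.
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return spline


# Toy daily readings (invented values): day 3 is missing, day 6 is a spike.
days = [0, 1, 2, 4, 5, 6, 7, 8]
readings = [1.0, 1.2, 1.4, 1.8, 2.0, 9.0, 2.4, 2.6]
smoothed = zero_phase_filter(readings)
spline = natural_cubic_spline(days, smoothed)
filled = [spline(d) for d in range(9)]  # evenly spaced series with day 3 filled in
```

The spline reproduces the filtered values at the days that were actually observed, which matches the requirement in step A3 that interpolation must not alter existing data. In practice a library routine such as a forward-backward filter and a spline class from a numerical package would normally replace these hand-written versions.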

Step B, predicting data by a statistical model method that stacks kernel ridge regression, random forest and partial regression algorithms; the method comprises the following specific steps:

step B1, according to the characteristic information of the data set, separating and constructing the prediction dependent variable and the influence factor independent variable of the input data;

step B2, dividing the data set into three parts in time order, corresponding to the training set Q1, the verification set Q2 and the test set Q3 respectively. A reasonable length must be determined for the training set Q1; the criterion is the root mean square error of the model on the verification set Q2;

step B3, establishing sub data sets of different lengths from the data set Q1, so as to reduce the low robustness caused by an unreasonable data set length. If the data set is too long, it contains too much past information; a general statistical model cannot learn how that information has changed, and its accuracy decreases. If the data set is too short, it contains too little information; the parameters of the statistical model are not fully optimized, and its accuracy also decreases. It is therefore necessary to find the sub data set of suitable length among the different sub data sets;

step B4, feeding each sub data set in turn into the stacked kernel ridge regression, random forest and partial regression algorithms for training;

step B5, performing kernel ridge regression with 5-fold cross validation on the training data set Q1, and adjusting the parameters of the regression model on the validation set Q2;

step B6, after the 5-fold cross validation, calculating the predicted values on the training set Q1 and on the verification set Q2, and synthesizing the predicted values on the training set Q1 into a new data set that serves as the training set of the random forest algorithm;

step B7, synthesizing the predicted values in the verification set Q2 into a new data set as a test set of the random forest algorithm;

step B8, substituting the training set and the testing set into a random forest algorithm, and repeating the steps B5, B6 and B7;

step B9, after obtaining a new training set and a new testing set, introducing a partial regression algorithm, repeating the steps B5, B6 and B7, and finally obtaining an integrated statistical model;

step B10, calculating the root mean square error MSE_static on the verification set Q2:

$$\mathrm{MSE}_{static} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

where n is the number of data in the Q2 data set, ŷ_i is the i-th predicted value of the statistical model, and y_i is the actual measured value. The closer the root mean square error is to zero, the better the prediction effect of the model and the closer the prediction is to the measured value;

step B11, repeating steps B4 to B10 to find the minimum MSE_static and the corresponding training data set Q1, and using the model corresponding to that data set to predict on the test set Q3 to obtain the predicted value of the statistical model. FIG. 3 shows the prediction effect of the statistical model on the entire data set, with the shaded portion to the right of the dashed line being the test set.
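Steps B2 to B11 can be sketched as follows. This is a minimal illustration, not the patented implementation: a one-feature least-squares line stands in for each of the three learners (kernel ridge regression, random forest, partial regression), the sub data sets are assumed to be the most recent samples before Q2, and the error is assumed to be the root mean square error. All function names and the toy series are hypothetical.

```python
import math


def rmse(pred, actual):
    """Root mean square error between predictions and measurements."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def fit_line(xs, ys):
    """Least-squares line on one feature: a placeholder base learner."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def out_of_fold(xs, ys, fit, k=5):
    """k-fold cross validation: predict each sample with a model that never saw it."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        held = set(range(fold, len(xs), k))
        model = fit([x for i, x in enumerate(xs) if i not in held],
                    [y for i, y in enumerate(ys) if i not in held])
        for i in held:
            preds[i] = model(xs[i])
    return preds


def stacked_predict(levels, x1, y1, x_new):
    """Chain the learners: each level trains on the previous level's predictions."""
    train_feat, new_feat = list(x1), list(x_new)
    for fit in levels:
        model = fit(train_feat, y1)
        new_feat = [model(v) for v in new_feat]        # becomes the next test set
        train_feat = out_of_fold(train_feat, y1, fit)  # becomes the next training set
    return new_feat


def best_training_length(x, y, lengths, n_val, n_test, levels):
    """Score each candidate length of Q1 on Q2 and keep the lowest RMSE."""
    start = len(x) - n_val - n_test  # Q2 starts here; Q3 follows Q2
    x2, y2 = x[start:start + n_val], y[start:start + n_val]
    scores = {}
    for length in lengths:
        x1, y1 = x[start - length:start], y[start - length:start]
        scores[length] = rmse(stacked_predict(levels, x1, y1, x2), y2)
    return min(scores, key=scores.get), scores


# Toy series (invented): flat until t = 30, then a linear rise.
t = list(range(60))
level = [10.0 if i < 30 else 10.0 + 2.0 * (i - 30) for i in t]
best_len, scores = best_training_length(
    t, level, lengths=[10, 30, 50], n_val=5, n_test=5, levels=[fit_line] * 3)
```

On this toy series the shortest window wins because the longer ones still contain the old flat regime, which is the effect step B3 describes. The interleaved folds used here are a simplification; for real time series a time-ordered cross validation may be more appropriate.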

Step C, predicting data by a genetic algorithm optimized BP neural network method; the method comprises the following specific steps:

step C1, according to the characteristic information of the data set, separating and constructing the forecast dependent variable and the influence factor independent variable of the input data;

step C2, dividing the data into three parts in time order, corresponding to the training set Q1, the verification set Q2 and the test set Q3 respectively. Because a neural network model has more hyper-parameters than the statistical model, and a neural network can only exert its capability for deep mining analysis when a large amount of data is available, the model is trained directly on the Q1 data set;

step C3, self-evolution iteration is carried out by utilizing a genetic algorithm to select the optimal hyperparameter, namely the number of neurons;

step C4, calculating the root mean square error MSE_nn on the validation set Q2:

$$\mathrm{MSE}_{nn} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

where n is the number of data in the verification set Q2, ŷ_i is the i-th predicted value of the BP neural network, and y_i is the actual measured value;

step C5, predicting on the test set Q3 to obtain the predicted value of the BP neural network. FIG. 4 shows the prediction effect of the BP neural network on the entire data set, with the shaded portion to the right of the dotted line being the test set.
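A minimal sketch of steps C3 and C4 is given below, assuming a one-hidden-layer tanh network trained by plain backpropagation and a very small genetic algorithm over the number of hidden neurons. The patent does not specify the network layout, the genetic operators or their rates, so the selection, crossover and mutation rules, the learning rate and all names here are illustrative assumptions, and the toy data are invented.

```python
import math
import random


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def train_bp(xs, ys, n_hidden, epochs=100, lr=0.02, seed=0):
    """One-hidden-layer tanh network trained by backpropagation (per-sample SGD)."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            hid = [math.tanh(w1[j] * x + b1[j]) for j in range(n_hidden)]
            err = sum(w2[j] * hid[j] for j in range(n_hidden)) + b2 - y
            for j in range(n_hidden):
                grad_h = err * w2[j] * (1.0 - hid[j] ** 2)  # error sent back to layer 1
                w2[j] -= lr * err * hid[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return lambda x: sum(w2[j] * math.tanh(w1[j] * x + b1[j])
                         for j in range(n_hidden)) + b2


def genetic_search(fitness, low, high, pop_size=6, generations=4, seed=0):
    """Evolve an integer hyperparameter; a lower fitness value is better."""
    rng = random.Random(seed)
    cache = {}

    def score(n):
        if n not in cache:
            cache[n] = fitness(n)
        return cache[n]

    pop = [rng.randint(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score)
        parents = pop[:pop_size // 2]           # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2                # crossover: blend two parents
            if rng.random() < 0.3:              # mutation: step by one neuron
                child += rng.choice([-1, 1])
            children.append(min(high, max(low, child)))
        pop = parents + children
    best = min(cache, key=cache.get)            # best among evaluated candidates
    return best, cache[best]


# Toy data (invented): one periodic influence factor and a quadratic response.
feature = [(t % 10) / 10 for t in range(40)]
target = [f * f for f in feature]
q1_x, q1_y = feature[:30], target[:30]   # training set Q1
q2_x, q2_y = feature[30:], target[30:]   # validation set Q2


def val_rmse(n_hidden):
    model = train_bp(q1_x, q1_y, n_hidden)
    return rmse([model(x) for x in q2_x], q2_y)


best_n, best_rmse = genetic_search(val_rmse, low=1, high=8)
```

The fitness of each candidate is its error on Q2, so the genetic algorithm only ever sees validation performance and the test set Q3 stays untouched until step C5.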

Step D, predicting data by an integrated LSTM neural network method; the method comprises the following specific steps:

step D1, according to the characteristic information of the data set, separating and constructing the forecast dependent variable and the influence factor independent variable of the input data;

step D2, dividing the data into three parts according to the time sequence, and respectively corresponding to a training set Q1, a verification set Q2 and a test set Q3;

step D3, training the integrated LSTM neural network on the training set Q1. The LSTM neural network is a special recurrent neural network that has inherent advantages for long-sequence monitoring data and is well suited to handling the large amount of long-term monitoring data in engineering;

step D4, initializing the weight distribution of the data samples in the training set Q1, wherein the weight of each data sample is the same;

step D5, training, learning and predicting by using an LSTM neural network according to the current weight distribution;

step D6, calculating the root mean square error MSE_lstm on the validation set Q2:

$$\mathrm{MSE}_{lstm} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

where n is the number of data in the verification set Q2, ŷ_i is the i-th predicted value of the LSTM neural network, and y_i is the actual measured value;

Step D7, judging the difference between the root mean square error and the preset error, if the difference is in the allowable range, stopping the calculation; if the difference value does not meet the preset value, updating the weight of the data in the training set according to the difference value between the current predicted value and the actually measured value, so that the weight of the sample with a large difference value is increased, and the updated sample weight is obtained;

step D8, training the next LSTM neural network based on the updated weight, repeating the steps D5, D6 and D7 until the number of the trained LSTM neural networks reaches the pre-designated number;

step D9, integrating all LSTM neural networks, and taking a weighted average value of the predicted values of each neural network to obtain a final model;

step D10, predicting on the test set Q3 to obtain the predicted value of the LSTM neural network. FIG. 5 shows the prediction effect of the LSTM neural network on the entire data set, with the shaded portion to the right of the dotted line being the test set.
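The reweighting loop of steps D4 to D9 can be sketched independently of the network itself. In the sketch below a weighted least-squares line is a placeholder for one LSTM network, purely to keep the example short and runnable. The patent does not give the weight update rule or the ensemble weights, so two assumptions are made: sample weights are multiplied by one plus the relative absolute error, and each member is weighted by the reciprocal of its validation error. All names and data are hypothetical.

```python
import math


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def fit_weighted_line(xs, ys, w):
    """Weighted least-squares line; a placeholder for one LSTM network."""
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def boosted_ensemble(q1, q2, fit, n_models=5, target_rmse=1e-6):
    (x1, y1), (x2, y2) = q1, q2
    weights = [1.0 / len(x1)] * len(x1)            # D4: equal initial weights
    models, scores = [], []
    for _ in range(n_models):                      # D8: stop at the preset count
        model = fit(x1, y1, weights)               # D5: train on current weights
        score = rmse([model(x) for x in x2], y2)   # D6: error on Q2
        models.append(model)
        scores.append(score)
        if score <= target_rmse:                   # D7: error small enough, stop
            break
        errors = [abs(model(x) - y) for x, y in zip(x1, y1)]
        worst = max(errors) or 1.0
        weights = [w * (1.0 + e / worst) for w, e in zip(weights, errors)]
        total = sum(weights)
        weights = [w / total for w in weights]     # D7: larger error, larger weight
    inverse = [1.0 / max(s, 1e-12) for s in scores]

    def predict(x):                                # D9: weighted average of members
        return sum(c * m(x) for c, m in zip(inverse, models)) / sum(inverse)

    return predict, scores


# Toy data (invented): a clean linear trend, then the same trend with one outlier.
x1 = list(range(10))
clean = [2.0 * x + 1.0 for x in x1]
noisy = clean[:5] + [clean[5] + 5.0] + clean[6:]
x2, y2 = [10, 11, 12], [21.0, 23.0, 25.0]
predict_clean, scores_clean = boosted_ensemble((x1, clean), (x2, y2), fit_weighted_line)
predict_noisy, scores_noisy = boosted_ensemble((x1, noisy), (x2, y2), fit_weighted_line)
```

With clean data the first member already meets the error target and the loop stops early; with the outlier present, the loop runs until the preset number of members is reached, as in step D8.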

Step E, comparing the mean square errors of the predicted value, the predicted average value and the weighted average value in the three steps on the data set, and selecting the optimal predicted value corresponding to the minimum mean square error; the method comprises the following specific steps:

step E1, calculating the mean of the predicted values of the statistical model, the BP neural network and the LSTM neural network, and its error MSE_av;

step E2, taking the reciprocals of the root mean square errors of the statistical model, the BP neural network and the LSTM neural network on the Q2 data set as the weights, and calculating the weighted average of the predicted values;

step E3, comparing the weighted average with the measured values and calculating MSE_Weighted; the errors are listed in Table 1:

TABLE 1 RMS error of three different prediction algorithms and their means and weighted averages

step E4, selecting the predicted value corresponding to the minimum of MSE_pls, MSE_nn, MSE_lstm and MSE_Weighted as the final predicted value; the prediction effect after integration is shown in fig. 6.
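Step E reduces to a small amount of arithmetic, sketched below under the assumption that every error is a root mean square error on the validation set Q2. The candidate labelled `av` is the plain mean from step E1; it is included in the comparison here because step E mentions it, although step E4 lists only four errors. The function name and the numbers are invented for illustration and are not results from the patent.

```python
import math


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def select_final(preds, actual):
    """preds maps a model name to its predictions on the validation set Q2."""
    n = len(actual)
    errors = {name: rmse(p, actual) for name, p in preds.items()}
    # E2: the weight of each model is the reciprocal of its error.
    inverse = {name: 1.0 / max(e, 1e-12) for name, e in errors.items()}
    candidates = dict(preds)
    # E1: plain average of the three predictions.
    candidates["av"] = [sum(p[i] for p in preds.values()) / len(preds)
                        for i in range(n)]
    # E2: inverse-error weighted average.
    candidates["Weighted"] = [
        sum(inverse[k] * preds[k][i] for k in preds) / sum(inverse.values())
        for i in range(n)]
    # E3 and E4: score every candidate and keep the one with the smallest error.
    scores = {name: rmse(p, actual) for name, p in candidates.items()}
    best = min(scores, key=scores.get)
    return best, candidates[best], scores


# Toy validation data (invented values).
measured = [1.0, 2.0, 3.0]
preds = {
    "pls": [1.5, 2.5, 3.5],   # statistical model, biased high
    "nn": [0.5, 1.5, 2.5],    # BP neural network, biased low
    "lstm": [1.0, 2.0, 4.0],  # LSTM ensemble, one large miss
}
best, final, scores = select_final(preds, measured)
```

In this toy case the two opposite biases cancel in the averages and the weighted average down-weights the model with the largest error, so the weighted candidate is selected.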

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

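1. A multi-source monitoring data deep mining and intelligent analysis method for water transfer engineering is characterized by comprising the following steps:

step 1: performing gross error discrimination and cubic spline interpolation to clean the original input data;

step 2: predicting data by a statistical model method that stacks kernel ridge regression, random forest and partial regression algorithms;

step 3: predicting data by a BP neural network method optimized by a genetic algorithm;

step 4: predicting data by an integrated LSTM neural network method;

step 5: comparing, on the data set, the mean square errors of the predicted values obtained in step 2, step 3 and step 4, of their average, and of their weighted average, and selecting the optimal predicted value corresponding to the minimum mean square error.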

2. The method for the deep mining and intelligent analysis of multi-source monitoring data according to claim 1, wherein the step 1 comprises the following sub-steps:

step 11: inputting selected instrument monitoring data and days needing prediction to prepare for data analysis;

step 12: filtering the original data by adopting a zero phase shift formula to separate trends and eliminate gross errors in the data:

$$Y(e^{j\omega}) = X(e^{j\omega})\,\lvert H(e^{j\omega})\rvert^{2};$$

wherein Y is output data after filtering, and X and H are respectively input monitoring data and a system function of filtering;

step 13: the filtered data contain a number of null values; the monitoring data are made uniform by cubic spline interpolation, which leaves the original monitoring values unchanged and facilitates the establishment of the monitoring model in the subsequent steps; the formula is as follows:

$$S(x) = S_i(x), \quad x \in [x_i, x_{i+1}], \quad i = 0, 1, \ldots, n-1$$

wherein S(x) is the final interpolated curve function and S_i(x) (i = 0, 1, ..., n-1) is the interpolation function on the i-th interval;

step 14: sequentially taking the monitoring data at fixed time intervals over the full time span of the data set to form the data set of the monitoring instrument.

3. The method for multi-source monitoring data deep mining and intelligent analysis according to claim 1, wherein the step 2 comprises the following sub-steps:

step 21: according to the characteristic information of the data set, separating and constructing a prediction dependent variable and an influence factor independent variable of input data;

step 22: dividing the data set into three parts in time order, corresponding to the training set Q1, the verification set Q2 and the test set Q3 respectively; a reasonable length must be determined for the training set Q1, the criterion being the root mean square error of the model on the verification set Q2;

step 23, respectively establishing sub data sets with different lengths by using the data set Q1, and finding out sub data sets with proper lengths in the different sub data sets;

step 24: bringing each subdata set into a stacked kernel ridge regression, random forest algorithm and partial regression algorithm in sequence for training;

step 25: performing kernel ridge regression of 5-fold cross validation on a training data set Q1, and adjusting parameters of a model obtained by regression on a validation set Q2;

step 26: after 5-fold cross validation, calculating to obtain a predicted value on a training set Q1 and a predicted value on a validation set Q2; synthesizing the predicted values on the training set Q1 into a new data set as a training set of a random forest algorithm;

step 27: synthesizing the predicted values on the verification set Q2 into a new data set serving as a test set of a random forest algorithm;

step 28: substituting the training set and the test set into a random forest algorithm, and repeating the steps 25, 26 and 27;

step 29: after a new training set and a new test set are obtained, a partial regression algorithm is introduced, and the steps 25, 26 and 27 are repeated to finally obtain an integrated statistical model;

step 210: calculating the root mean square error MSE_static on the validation set Q2, the calculation formula being as follows:

$$\mathrm{MSE}_{static} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

where n is the number of data in the Q2 data set, ŷ_i is the i-th predicted value of the statistical model, and y_i is the actual measured value;

step 211: repeating steps 24 to 210 to find the minimum MSE_static and the corresponding training data set Q1, and using the model corresponding to that data set to predict on the test set Q3 to obtain the predicted value of the statistical model.

4. The method for multi-source monitoring data deep mining and intelligent analysis according to claim 1, wherein the step 3 comprises the following sub-steps:

step 31: according to the characteristic information of the data set, separating and constructing a prediction dependent variable and an influence factor independent variable of input data;

step 32: dividing the data into two parts according to the time sequence, wherein one part is a training set Q1 required by a training model, and the other part is a test set Q2 corresponding to the prediction days;

step 33: training a model in a Q1 data set, and selecting the optimal hyperparameter, namely the number of neurons by self-evolution iteration of a genetic algorithm;

step 34: calculating the root mean square error MSE_nn on the validation set Q2 as follows:

$$\mathrm{MSE}_{nn} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

wherein n is the number of data in the verification set Q2, ŷ_i is the i-th predicted value of the BP neural network, and y_i is the actual measured value;

step 35: and predicting in a test set Q3 to obtain a predicted value of the BP neural network.

5. The method for multi-source monitoring data deep mining and intelligent analysis according to claim 1, wherein the step 4 comprises the following sub-steps:

step 41: according to the characteristic information of the data set, separating and constructing a prediction dependent variable and an influence factor independent variable of input data;

step 42: dividing the data into two parts according to the time sequence, wherein one part is a training set Q1 for determining an integrated LSTM neural network algorithm, and the other part is a test set Q2 corresponding to the number of predicted days;

step 43: training the integrated LSTM neural network on a training set Q1;

step 44: initializing a data sample weight distribution of a training set Q1, wherein the weight of each data sample is the same;

step 45: training, learning and predicting by using an LSTM neural network according to the current weight distribution;

step 46: calculating the root mean square error MSE_lstm on the validation set Q2 as follows:

$$\mathrm{MSE}_{lstm} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

wherein n is the number of data in the verification set Q2, ŷ_i is the i-th predicted value of the LSTM neural network, and y_i is the actual measured value;

step 47: judging the difference value between the root mean square error and a preset error, and stopping calculation if the difference value is within an allowable range; if the difference value does not meet the preset value, updating the weight of the data in the training set according to the difference value between the current predicted value and the actually measured value, so that the weight of the sample with a large difference value is increased, and the updated sample weight is obtained;

step 48: training the next LSTM neural network based on the updated weights, and repeating steps 45, 46 and 47 until the number of the trained LSTM neural networks reaches a pre-specified number;

step 49: integrating all LSTM neural networks, and taking a weighted average value of the predicted values of each neural network to obtain a final model;

step 410: prediction in test set Q3 yields the predicted values for the LSTM neural network.

6. The method for multi-source monitoring data deep mining and intelligent analysis according to claim 1, wherein the step 5 comprises the following sub-steps:

step 51: calculating the mean value of the predicted values of the statistical model, the BP neural network and the LSTM neural network;

step 52: taking the reciprocal of the root mean square error of the statistical model, the BP neural network and the LSTM neural network on the Q2 data set as the weight of the weighted average, and calculating the weighted average;

step 53: comparing the weighted average with the measured values to calculate the mean square error MSE_Weighted;

step 54: selecting the predicted value corresponding to the minimum of MSE_pls, MSE_nn, MSE_lstm and MSE_Weighted as the final predicted value.