CN107992447B

CN107992447B - Feature selection decomposition method applied to river water level prediction data

Info

Publication number: CN107992447B
Application number: CN201711330726.1A
Authority: CN
Inventors: 杨拥军; 管杰
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-12-13
Filing date: 2017-12-13
Publication date: 2019-12-17
Anticipated expiration: 2037-12-13
Also published as: CN107992447A

Abstract

The invention discloses a feature selection decomposition method applied to river water level prediction data, which introduces LASSO regression to select features of an original input set, integrates MODWT to decompose components of the selected features and adopts multiple linear regression as a basic model to test the performance of LASSO-MODWT in order to obtain the features which are most suitable for being used as model input. Tests show that the LASSO-MODWT-based feature selection decomposition method is beneficial to improving the performance and model interpretation capability of a river water level prediction model.

Description

Feature selection decomposition method applied to river water level prediction data

Technical Field

The invention belongs to the technical field of water level prediction, and particularly relates to a design of a feature selection decomposition method applied to river water level prediction data.

Background

Water level prediction plays an extremely important role in flood control and disaster reduction and in the utilization, allocation and management of water resources. A robust water level prediction model can show decision makers how the water level will change in the future and reveal potential hydrological disasters in time, so that early warning measures can be deployed sooner. In water level prediction, because the factors influencing the water level are multi-dimensional and complex, the potential input variables of a model often exhibit nonlinear dynamic relations and various correlations with one another. In addition, the number of input variables is generally large; in particular, introducing lagged values of each variable sharply increases the feature dimension and the computational complexity, although these variables actually contain a large amount of redundant information and noise. To reduce the computational complexity of the model and improve its flexibility and explanatory power, effective features with minimum redundancy must be selected from the original high-dimensional data set, so as to construct a model that is flexible, simpler, and better reflects the real pattern of water level change.

LASSO, the least absolute shrinkage and selection operator, was first proposed by Robert Tibshirani in 1996. It is a shrinkage estimator: by constructing a penalty function, it shrinks the regression coefficients and sets some of them exactly to zero, which yields a more parsimonious model. It thus retains the advantage of subset selection, and it is a biased estimation method for data with complex collinearity. The basic idea of LASSO is to minimize the residual sum of squares (RSS) under the constraint that the sum of the absolute values of the regression coefficients is less than or equal to a constant, so that some regression coefficients become exactly 0 and an interpretable model with fewer features is obtained.
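The constrained minimization described above can be written out as follows. This is the standard textbook form rather than a formula taken from the patent; the symbols $y_i$ (observed target), $x_{ij}$ (value of feature $j$ in sample $i$), $\beta_0$, $\beta_j$, $t$ and $\lambda$ are notation assumed here for illustration.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t

% Equivalent penalized (Lagrangian) form, with \lambda \ge 0 playing the role of t:
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda \sum_{j=1}^{p} |\beta_j|
```

A smaller $t$ (or a larger $\lambda$) forces more coefficients to exactly zero, which is what makes the method usable for feature selection.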

The Discrete Wavelet Transform (DWT) is widely used in models that integrate wavelets, and can extract detailed spectral information from data, such as periodicity, local variation, randomness and abrupt changes. However, because of its decimation (downsampling) effect, information may be lost during the model building phase, which biases the prediction. In addition, the DWT coefficients depend on the starting position of the transform, which introduces a degree of arbitrariness.

Based on the above defects of the DWT, researchers further proposed the Maximum Overlap Discrete Wavelet Transform (MODWT) as a feature decomposition method. MODWT is a linear filtering operation that avoids the decimation effect, and it yields multi-level wavelet coefficients with the same dimension as the observed values. In addition, the result of the transform is independent of its starting position, and the transform can be applied to data of different sample sizes. In general, MODWT can extract the components of the input signal in different frequency bands, so as to obtain richer information and reveal the underlying variation pattern of the data.

Disclosure of Invention

The invention aims to reduce the operation complexity of the existing water level prediction model and improve the flexibility and the explanatory power of the existing water level prediction model, and provides a characteristic selection decomposition method applied to river water level prediction data.

The technical scheme of the invention is as follows: a feature selection decomposition method applied to river water level prediction data comprises the following steps:

S1, acquiring the hydrological elements influencing the water level of the target prediction station (current water level information of the target station, upstream basin water level information, rainfall along the way and the like).

S2, constructing a feature set based on information theory according to each hydrological element.

S3, introducing lags for each feature in the feature set based on correlation analysis, and constructing an original input set.

S4, carrying out standardization processing on the original input set.

S5, performing feature selection on the standardized input set based on LASSO.

S6, performing feature decomposition on the input set after feature selection based on MODWT to obtain an input set optimized by LASSO-MODWT.

The invention has the beneficial effects that: the invention adopts LASSO regression to select the characteristics of the original input set and integrates MODWT to decompose the selected characteristics, thereby obviously improving the prediction performance of the river water level and being beneficial to improving the performance and the model interpretation capability of a river water level prediction model.

Further, step S2 is specifically: calculating the maximum information coefficient MIC between each hydrological element and the prediction target, analyzing the strength of the relation between each hydrological element and the prediction target, and constructing a feature set by taking as input features the hydrological elements whose MIC value with the prediction target is larger than a set threshold.

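The maximum information coefficient MIC is calculated by the formula:

$$\mathrm{MIC}[X;Y] = \max_{|X||Y|<B} \frac{I[X;Y]}{\log_2\left(\min(|X|,|Y|)\right)} \quad (1)$$

where $X$ and $Y$ are two random variables, $|X|$ and $|Y|$ are the numbers of segments into which $X$ and $Y$ are partitioned, $B$ is the segmentation limit, generally taken as the total number of data points raised to the power 0.6 or 0.55, $\mathrm{MIC}[X;Y]$ represents the maximum information coefficient between $X$ and $Y$, and $I[X;Y]$ represents the mutual information between $X$ and $Y$, calculated as follows:

$$I[X;Y] = \sum_{X}\sum_{Y} p(X,Y)\,\log_2\frac{p(X,Y)}{p(X)\,p(Y)} \quad (2)$$

where $p(X)$ and $p(Y)$ represent the probability density distribution functions of $X$ and $Y$, respectively, and $p(X,Y)$ represents the joint probability density distribution function of $X$ and $Y$.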

The beneficial effects of the above further scheme are: the maximum information coefficient MIC is used to analyze the strength of the relation between each hydrological element and the prediction target, and the feature set is constructed by taking the elements strongly related to the prediction target as input features.
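As a rough illustration of the MIC screening in step S2, the following sketch estimates mutual information on equal-width grids and normalizes it as described above. It is a simplification under stated assumptions: the full MIC also optimizes where the grid lines are placed, whereas this sketch only tries equal-width grids, and the function names `mutual_information` and `mic_estimate` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys, nx, ny):
    """Mutual information (bits) of xs and ys on an nx-by-ny equal-width grid."""
    def bin_index(v, lo, hi, n):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * n), n - 1)

    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    cells = [(bin_index(x, x_lo, x_hi, nx), bin_index(y, y_lo, y_hi, ny))
             for x, y in zip(xs, ys)]
    n = len(cells)
    joint = Counter(cells)                 # counts per grid cell
    px = Counter(i for i, _ in cells)      # marginal counts along X
    py = Counter(j for _, j in cells)      # marginal counts along Y
    return sum((c / n) * math.log2((c / n) / ((px[i] / n) * (py[j] / n)))
               for (i, j), c in joint.items())


def mic_estimate(xs, ys, alpha=0.6):
    """Simplified MIC: best normalized MI over grids with nx * ny < B = n**alpha."""
    b = len(xs) ** alpha
    best = 0.0
    for nx in range(2, int(b // 2) + 1):
        for ny in range(2, int(b // nx) + 1):
            if nx * ny >= b:               # enforce |X||Y| < B
                continue
            mi = mutual_information(xs, ys, nx, ny)
            best = max(best, mi / math.log2(min(nx, ny)))
    return best
```

A hydrological element would then be kept as an input feature when its estimate against the target series exceeds the chosen threshold; the threshold value itself is not given in the text.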

Further, step S3 is specifically: for the current water level information of the target station in the feature set, determining the lags by adopting the partial autocorrelation function (PACF), and for the other input features in the feature set, determining the lags by cross-correlation coefficient analysis; for each lag, if it exhibits a clear statistical correlation with the prediction target, i.e., reaches the 95% confidence interval, the lag is added to the input set, thereby constructing the original input set.

The beneficial effects of the above further scheme are: since the predicted target river level information is time series, the influence of introducing a lag amount should be taken into consideration when constructing the original input set.
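The lag screening for the non-target features can be sketched as below. The sketch assumes the common convention that a correlation is significant at the 95% level when its magnitude exceeds 1.96/sqrt(N); the text does not spell out the exact test, so this bound and the function names are assumptions. The PACF-based screening of the target's own series is not shown.

```python
import math


def cross_correlation(x, y, lag):
    """Pearson correlation between x[t - lag] and y[t] for a non-negative lag."""
    a = x[:len(x) - lag]
    b = y[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def significant_lags(x, y, max_lag):
    """Lags 1..max_lag whose cross-correlation exceeds the assumed 95% bound."""
    bound = 1.96 / math.sqrt(len(y))
    return [k for k in range(1, max_lag + 1)
            if abs(cross_correlation(x, y, k)) > bound]
```

Each returned lag `k` would contribute one lagged copy of the feature (its value `k` hours earlier) to the original input set.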

Further, step S4 is specifically: carrying out standardization processing on the original input set by adopting the min-max standardization method, and scaling the original input set to the [0,1] interval, wherein the processing formula is as follows:

$$x_{i,\mathrm{norm}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}\,(N_{\max} - N_{\min}) + N_{\min} \quad (3)$$

where $x_{i,\mathrm{norm}}$ is the normalized data value, $x_i$ represents the i-th data item to be normalized in the original input set, $N_{\min}$ and $N_{\max}$ are the minimum and maximum values of the scaling, i.e. 0 and 1, respectively, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values in the original input set, respectively.

The beneficial effects of the above further scheme are: because different input data have different dimensions, in order to evaluate the original input set by using the same standard, the original input set needs to be standardized to realize non-dimensionalization, and the original input set is scaled to the [0,1] interval.
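A minimal sketch of the min-max scaling of one feature column follows. The handling of a constant column (all values mapped to the lower bound) is an assumption added here to avoid division by zero; the text does not cover that case.

```python
def min_max_scale(values, n_min=0.0, n_max=1.0):
    """Scale a feature column linearly so that its minimum maps to n_min and its maximum to n_max."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumed convention for a constant column: map everything to n_min.
        return [n_min for _ in values]
    scale = (n_max - n_min) / (x_max - x_min)
    return [(x - x_min) * scale + n_min for x in values]
```

For example, `min_max_scale([2.0, 4.0, 6.0])` returns `[0.0, 0.5, 1.0]`.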

Further, step S5 specifically includes the following sub-steps:

S51, taking the standardized input set as the model input and the water level data set of the prediction target station as the model output, and constructing a LASSO regression model.

S52, training the LASSO regression model, and optimizing the parameter lambda of the LASSO regression by a grid search method to find the optimal parameter.

S53, scoring the features in the input set by using the LASSO regression model with the optimal parameter, wherein the score is the regression coefficient obtained by LASSO regression; the features whose LASSO regression coefficient is positive are kept in the input set, and the features whose LASSO regression coefficient is 0 or negative are removed from the input set, thereby realizing the feature selection of the input set.

The beneficial effects of the above further scheme are: after LASSO-based feature selection on the standardized input set, the prediction accuracy can be improved while the number of model inputs is greatly reduced.
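Sub-steps S51 to S53 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the solver is plain coordinate descent with soft thresholding, the objective is assumed to be (1/(2n))·RSS + lambda·sum|beta|, and the grid search scores each lambda on a hold-out tail of the series. All function names are hypothetical.

```python
def soft_threshold(z, g):
    """Shrink z toward zero by g; values within [-g, g] become exactly 0."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_fit(columns, y, lam, n_iter=100):
    """Fit LASSO by coordinate descent. columns is a list of feature columns."""
    n = len(y)
    y_mean = sum(y) / n
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) / n for c in centered]
    beta = [0.0] * len(columns)
    resid = [v - y_mean for v in y]          # residual of the centered model
    for _ in range(n_iter):
        for j, col in enumerate(centered):
            if norms[j] == 0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(c * r for c, r in zip(col, resid)) / n + norms[j] * beta[j]
            new = soft_threshold(rho, lam) / norms[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, means))
    return beta, intercept


def mse(columns, y, beta, intercept):
    """Mean squared prediction error of a fitted linear model."""
    n = len(y)
    return sum((intercept + sum(b * c[i] for b, c in zip(beta, columns)) - y[i]) ** 2
               for i in range(n)) / n


def grid_search_lambda(columns, y, lambdas, n_train):
    """S52: pick the lambda with the lowest error on the hold-out tail."""
    train = [c[:n_train] for c in columns]
    valid = [c[n_train:] for c in columns]
    best = None
    for lam in lambdas:
        beta, b0 = lasso_fit(train, y[:n_train], lam)
        err = mse(valid, y[n_train:], beta, b0)
        if best is None or err < best[0]:
            best = (err, lam, beta)
    return best[1], best[2]


def select_features(names, beta):
    """S53: keep only the features with a positive LASSO coefficient."""
    return [name for name, b in zip(names, beta) if b > 0]
```

Keeping only positive coefficients follows the rule stated in S53; a more conventional variant would keep every non-zero coefficient.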

Further, step S6 is specifically: performing feature decomposition on the input set after feature selection by adopting a MODWT model, and using the wavelet coefficient sets obtained by decomposing all features to construct an optimized input set.

The formula of the feature decomposition is as follows:

$$f(t) = A_M(t) + \sum_{m=1}^{M} W_m(t) \quad (4)$$

where $f(t)$ is the result of the feature decomposition, expressed as the sum of its wavelet components, $A_M(t)$ is the smooth approximation of the original signal in the M-layer decomposition, $W_m(t)$ is the decomposition wavelet of the original signal at layer $m$, $m = 1, 2, \ldots, M$, and $M$ is the number of decomposition layers, whose minimum value is:

$$M = \mathrm{int}[\log(N)] \quad (5)$$

where $N$ is the length of the input set after feature selection, $\log$ is the base-10 logarithm, and $\mathrm{int}[\cdot]$ is the round-up function.

The beneficial effects of the above further scheme are: performing feature decomposition on the input set after feature selection with the MODWT model can significantly improve the accuracy of river water level prediction.
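The decomposition in step S6 can be illustrated with the sketch below. Two assumptions are made: the level formula is read as a rounded-up base-10 logarithm (which reproduces the value 4 for a series of 8678 samples), and the Haar filter (db1) is used as a short stand-in for the db2 to db4 filters of the embodiment. The function names are hypothetical.

```python
import math


def decomposition_levels(n):
    """Minimum number of levels, read as ceil(log10(N)) (assumed interpretation)."""
    return math.ceil(math.log10(n))


def modwt_haar(signal, levels):
    """Haar MODWT with circular boundary handling.

    Returns (details, smooth): one detail series per level plus the final
    smooth approximation, each with the same length as the input signal.
    """
    n = len(signal)
    smooth = list(signal)
    details = []
    for m in range(1, levels + 1):
        shift = 2 ** (m - 1)               # the filter is dilated, not the data decimated
        prev = smooth
        smooth = [0.5 * (prev[t] + prev[(t - shift) % n]) for t in range(n)]
        details.append([0.5 * (prev[t] - prev[(t - shift) % n]) for t in range(n)])
    return details, smooth
```

Because nothing is downsampled, every output series keeps the length of the input, and for the Haar filter the smooth series plus all detail series add back up to the original signal. Each of these series can be appended to the input set as a new feature column.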

Further, the MODWT model employs Daubechies wavelet basis.

The beneficial effects of the above further scheme are: considering that irregular wavelet bases are suitable for hydrological prediction, the invention adopts the Daubechies wavelet basis, which is widely applied in the field of hydrological prediction.

Drawings

Fig. 1 is a flowchart of a feature selection decomposition method applied to river water level prediction data according to an embodiment of the present invention.

Fig. 2 shows the results of applying MODWT to WL_CS using the Daubechies db3 wavelet basis, according to an embodiment of the present invention.

Fig. 3 is a comparison graph of predicted values and actual values of three-hour prediction of different input sets according to an embodiment of the present invention.

Fig. 4 is a scatter diagram illustrating predicted values and true values of three-hour prediction for different input sets according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It is to be understood that the embodiments shown and described in the drawings are merely exemplary and are intended to illustrate the principles and spirit of the invention, not to limit the scope of the invention.

The embodiment of the invention provides a feature selection decomposition method applied to river water level prediction data, as shown in fig. 1, comprising the following steps S1-S6:

S1, acquiring the hydrological elements influencing the water level of the target prediction station (including current water level information, upstream basin water level information, rainfall along the way and other hydrological elements).

The embodiment of the invention takes the water level trend in the lower reaches of the red water river as an example, with the purpose of predicting the water level at the red water station 3 hours and 6 hours ahead. The data were collected by automatic monitoring stations along the red water river during 2015 and 2016, and the relevant station information is shown in Table 1. The data were acquired and stored hourly, giving a total of 8834 data points. Gaps inevitably occur during data acquisition and storage; analysis shows that the missing data are the WL_MT records from 2015-10-09 02:00 to 2015-10-14 07:00, 126 items in total, which were filled in by interpolation using pandas.
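The gap filling can be sketched as follows. The embodiment used pandas, but the interpolation method is not stated, so plain linear interpolation in standard-library Python is assumed here as a stand-in. The gap boundaries are read as 2015-10-09 02:00 through 2015-10-14 07:00, a reading that is consistent with the stated count of 126 hourly records. The function names are hypothetical.

```python
from datetime import datetime


def hourly_record_count(start, end):
    """Number of hourly records from start to end, both inclusive."""
    return int((end - start).total_seconds() // 3600) + 1


def fill_gaps_linear(values):
    """Linearly interpolate runs of None between known neighbours.

    Leading or trailing gaps have no neighbour on one side and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Length of the assumed gap in the WL_MT series.
gap_length = hourly_record_count(datetime(2015, 10, 9, 2), datetime(2015, 10, 14, 7))
```

For example, `fill_gaps_linear([1.0, None, None, 4.0])` returns `[1.0, 2.0, 3.0, 4.0]`.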

TABLE 1

Code	Parameter meaning	Monitoring station	Data type	Acquisition period
WL_CS	Red water station water level	Red water station	Water level	Hourly
WL_EL	Two-man station water level	Two-man station	Water level	Hourly
WL_MT	Couchgrass station water level	Couchgrass station	Water level	Hourly
RF_CS	Red water station rainfall	Red water station	Rainfall	Hourly
RF_XS	Water learning station rainfall	Water learning station	Rainfall	Hourly

The maximum information coefficient MIC between each hydrological element and the prediction target is calculated, the strength of the relation between each hydrological element and the prediction target is analyzed, and a feature set is constructed by taking as input features the hydrological elements whose MIC value with the prediction target is larger than a set threshold (namely the hydrological elements strongly related to the prediction target).

The maximum information coefficient MIC is calculated by the formula:

$$\mathrm{MIC}[X;Y] = \max_{|X||Y|<B} \frac{I[X;Y]}{\log_2\left(\min(|X|,|Y|)\right)} \quad (1)$$

where $X$ and $Y$ are two random variables, $|X|$ and $|Y|$ are the numbers of segments into which $X$ and $Y$ are partitioned, $B$ is the segmentation limit, which determines the upper limit of the discrete segmentation of $X$ and $Y$ and is generally taken as the total number of data points raised to the power 0.6 or 0.55, $\mathrm{MIC}[X;Y]$ represents the maximum information coefficient between $X$ and $Y$, and $I[X;Y]$ represents the mutual information between $X$ and $Y$, calculated as follows:

$$I[X;Y] = \sum_{X}\sum_{Y} p(X,Y)\,\log_2\frac{p(X,Y)}{p(X)\,p(Y)} \quad (2)$$

In the embodiment of the present invention, the feature set has 5 features: (1) the water level data of three hydrological monitoring stations, namely the red water station, the couchgrass station and the two-man station (with the codes WL_CS, WL_MT and WL_EL); (2) the rainfall data of two weather monitoring stations, namely the red water station and the water learning station (with the codes RF_CS and RF_XS).

Since the predicted target river level information is time series, the influence of introducing a lag amount should be taken into consideration when constructing the original input set. In the embodiment of the invention, a partial autocorrelation function PACF is adopted to determine the lag for the current water level information of a target site in a feature set, and the lag is determined by cross-correlation coefficient analysis for other input features in the feature set; for each lag, if it exhibits a clear statistical correlation (i.e., reaches a 95% confidence interval) with the predicted target, the lag is added to the input set, thereby constructing the original input set. The partial autocorrelation function PACF and the cross correlation coefficient analysis methods are all correlation analysis methods commonly used in the art, and are not described herein again.

In the embodiment of the invention, after lags are introduced for each feature through correlation analysis, the original input set has 221 features for the 3-hour prediction and 229 features for the 6-hour prediction.

S4, carrying out standardization processing on the original input set.

Because different input data have different dimensions, in order to evaluate the original input set by the same standard, the original input set needs to be standardized to make it dimensionless. In the embodiment of the invention, the min-max standardization method (Min-Max Scaler) is adopted to standardize the original input set and scale it to the [0,1] interval, and the processing formula is as follows:

$$x_{i,\mathrm{norm}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}\,(N_{\max} - N_{\min}) + N_{\min} \quad (3)$$

where $x_{i,\mathrm{norm}}$ is the normalized data value, $x_i$ represents the i-th data item to be normalized in the original input set, $N_{\min}$ and $N_{\max}$ are the minimum and maximum values of the scaling, i.e. 0 and 1, respectively, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values in the original input set, respectively.

In order to simplify the input set and select the features most suitable as input, the embodiment of the invention performs feature selection on the input set based on LASSO regression. Because LASSO regression introduces the L1 regularization term as a penalty, the regression coefficients of redundant features can be shrunk to 0, so feature selection based on LASSO regression is a sparse feature selection method.

The step S5 specifically includes the following substeps S51-S53:

In the embodiment of the invention, after LASSO-based feature selection, 49 features remain for the 3-hour prediction and 88 features remain for the 6-hour prediction. The number of input features is greatly reduced in both prediction scenarios, which in turn reduces the complexity of model construction.

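Feature decomposition is performed on the input set after feature selection by adopting a MODWT model, and the wavelet coefficient sets obtained by decomposing all features are used to construct an optimized input set.

The formula of the feature decomposition is as follows:

$$f(t) = A_M(t) + \sum_{m=1}^{M} W_m(t) \quad (4)$$

where $A_M(t)$ is the smooth approximation of the original signal in the M-layer decomposition and $W_m(t)$ is the decomposition wavelet of the original signal at layer $m$, $m = 1, 2, \ldots, M$. The minimum number of decomposition layers $M$ is:

$$M = \mathrm{int}[\log(N)] \quad (5)$$

where $N$ is the length of the input set after feature selection and $\mathrm{int}[\cdot]$ is the round-up function.

The effective input set in the embodiment of the present invention contains 8678 samples, so the minimum number of MODWT decomposition layers is M = int[log(8678)] = int[3.93] = 4. The cases M = 4 and M = 5 are tested in the embodiment of the present invention.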

Although MODWT has proven to have many advantages as a multi-resolution feature recognition tool, one challenge in building a MODWT-based model is selecting a proper wavelet basis function. There is currently no definite general standard for selecting the basis function, and no relevant literature states which basis function gives the best model performance; in theory, different application scenarios suit different basis functions. Considering that irregular wavelet bases are suitable for hydrological prediction, the embodiment of the present invention employs the Daubechies wavelet basis, which is widely used in the field of hydrological prediction. In the embodiment of the invention, three wavelet bases, db2, db3 and db4, are compared experimentally to find the one most suitable for predicting the water level of the red water river.

Fig. 2 shows the result of applying MODWT to WL_CS using the Daubechies db3 wavelet basis; the 6 sub-graphs from top to bottom show the original signal waveform, the smooth approximation waveform (a4), and the four layers of MODWT decomposition coefficients (d1, d2, d3, d4). In order to reduce the computational complexity, the embodiment of the present invention decomposes only WL_CS, the feature scored as most important by LASSO, and adds the wavelet coefficients obtained from the decomposition to the input set as new features (the 4-layer and 5-layer decompositions yield 5-dimensional and 6-dimensional coefficients, respectively), giving 53 features for the 3-hour prediction and 92 features for the 6-hour prediction.

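Since there is no universal single index for evaluating the performance of a hydrological prediction model, the embodiment of the present invention evaluates the prediction performance comprehensively through three statistical indexes: the Nash efficiency coefficient $E_{NS}$, the root mean square error RMSE and the mean absolute error MAE.

(1) Nash efficiency coefficient $E_{NS}$:

$$E_{NS} = 1 - \frac{\sum_{i=1}^{N}\left(SWL_{OBS,i} - SWL_{FOR,i}\right)^2}{\sum_{i=1}^{N}\left(SWL_{OBS,i} - \overline{SWL}_{OBS}\right)^2}$$

(2) Root mean square error RMSE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(SWL_{OBS,i} - SWL_{FOR,i}\right)^2}$$

(3) Mean absolute error MAE:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|SWL_{OBS,i} - SWL_{FOR,i}\right|$$

where $SWL_{OBS}$ is the actually measured water level, $SWL_{FOR}$ is the water level obtained by model prediction, $N$ is the number of data points, and $\overline{SWL}_{OBS}$ is the overall mean of the measured water level.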
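The three evaluation indexes can be computed as in the sketch below, which uses the standard definitions of the Nash efficiency coefficient, RMSE and MAE. The function and argument names are chosen here for illustration.

```python
import math


def nash_sutcliffe(observed, forecast):
    """Nash efficiency coefficient: 1 minus error variance over observed variance."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - f) ** 2 for o, f in zip(observed, forecast))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den


def rmse(observed, forecast):
    """Root mean square error between measured and predicted water levels."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, forecast)) / n)


def mae(observed, forecast):
    """Mean absolute error between measured and predicted water levels."""
    n = len(observed)
    return sum(abs(o - f) for o, f in zip(observed, forecast)) / n
```

A perfect forecast gives a Nash efficiency coefficient of 1 and an RMSE and MAE of 0; a coefficient of 0 means the model is no better than always predicting the mean of the observations.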

In the embodiment of the invention, an original input set obtained based on correlation analysis, an input set subjected to LASSO-based feature selection and an input set subjected to LASSO-MODWT optimization are respectively used as the input of a multiple linear regression model for predicting water level data of a red water station in 3 hours and 6 hours, and the performance of the LASSO-MODWT feature selection decomposition method is further evaluated. Table 2 is a comparison of the performance of different input sets for predicting 3 and 6 hour water levels for a red water station. As can be seen from table 2, the prediction accuracy can be improved on the premise of greatly reducing the model input parameters after the characteristic selection based on LASSO, regardless of the 3-hour prediction or the 6-hour prediction; and the integrated MODWT can obviously improve the prediction accuracy and has good performance for 3-hour prediction and 6-hour prediction.

TABLE 2

Fig. 3 compares the predicted and actual 3-hour-ahead water levels at the red water station during August 2016 for the different input sets, and Fig. 4 is a scatter diagram of the predicted and actual values for the three input sets. It can be seen that, after LASSO-MODWT feature selection decomposition, the predicted values of LASSO-W-MLR are closer to the true values and the model performance is more stable than with the original input set. Therefore, the LASSO-MODWT feature selection decomposition method can significantly improve the accuracy and stability of the red water river water level prediction model.

In order to further study the influence of different wavelet bases on the red water river water level prediction performance, the embodiment of the invention tests three wavelets, db2, db3 and db4, and two numbers of decomposition layers, 4 and 5. Table 3 shows the performance of the 3-hour and 6-hour predictions for the different wavelet bases and numbers of decomposition layers. As can be seen from Table 3, the db2 wavelet basis with 5-layer wavelet decomposition gives better prediction performance in the red water river water level prediction model. The result further shows that different application scenarios suit different wavelet bases; in actual modeling, experiments should be carried out according to the specific requirements to find the most suitable wavelet basis and number of decomposition layers, so as to improve the model accuracy.

TABLE 3

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and that the scope of the invention is not limited to these specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations remain within the scope of the invention.

Claims

1. A feature selection decomposition method applied to river water level prediction data is characterized by comprising the following steps:

S1, acquiring hydrological factors influencing the water level of the target prediction station;

S2, constructing a feature set based on an information theory according to each hydrological element;

S3, introducing lags for each feature in the feature set based on correlation analysis, and constructing an original input set;

S4, carrying out standardization processing on the original input set;

S5, performing feature selection on the input set after the standardization processing based on the LASSO;

S6, performing feature decomposition on the input set after feature selection based on MODWT to obtain an input set optimized by LASSO-MODWT;

The step S5 specifically includes the following sub-steps:

S51, taking the input set after standardization as model input, taking the water level data set of the predicted target site as model output, and constructing an LASSO regression model;

S52, training the LASSO regression model, optimizing a parameter lambda of the LASSO regression by adopting a grid search method, and searching for an optimal parameter;

2. The feature selection decomposition method according to claim 1, wherein the hydrological elements affecting the water level of the target prediction site in the step S1 include current water level information of the target site, upstream basin water level information, and rainfall along the way.

3. The method for feature selection decomposition according to claim 1, wherein the step S2 specifically comprises: calculating the maximum information coefficient MIC between each hydrological element and the prediction target, analyzing the strength of the relation between each hydrological element and the prediction target, and constructing a feature set by taking as input features the hydrological elements whose MIC value with the prediction target is larger than a set threshold.

4. The method of feature selection decomposition of claim 3, wherein the maximum information coefficient MIC is calculated as:

5. The method for feature selection decomposition according to claim 3, wherein the step S3 specifically comprises: for the current water level information of the target station in the feature set, determining the lags by adopting the partial autocorrelation function (PACF), and for the other input features in the feature set, determining the lags by cross-correlation coefficient analysis; for each lag, if it exhibits a clear statistical correlation with the prediction target, i.e., reaches the 95% confidence interval, the lag is added to the input set, thereby constructing the original input set.

6. The method for feature selection decomposition according to claim 1, wherein the step S4 specifically comprises: carrying out standardization processing on the original input set by adopting the min-max standardization method, and scaling the original input set to the [0,1] interval, wherein the processing formula is as follows:

$$x_{i,\mathrm{norm}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}\,(N_{\max} - N_{\min}) + N_{\min} \quad (3)$$

where $x_{i,\mathrm{norm}}$ is the normalized data value, $x_i$ represents the i-th data item to be normalized in the original input set, $N_{\min}$ and $N_{\max}$ are the minimum and maximum values of the scaling, i.e. 0 and 1, respectively, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values in the original input set, respectively.

7. The method for feature selection decomposition according to claim 1, wherein the step S6 specifically comprises: performing feature decomposition on the input set after feature selection by adopting a MODWT model, and using the wavelet coefficient sets obtained by decomposing all features to construct an optimized input set.

8. The method of feature selection decomposition of claim 7 wherein the formula of the feature decomposition is:

M＝int[log(N)] (5)

9. The method of feature selection decomposition of claim 7 wherein the MODWT model employs Daubechies wavelet basis.