CN112733996B

CN112733996B - GA-PSO (genetic algorithm-particle swarm optimization) based hydrological time sequence prediction method for optimizing XGboost

Info

Publication number: CN112733996B
Application number: CN202110049321.0A
Authority: CN
Inventors: 马露; 万定生; 余宇峰; 杨志勇
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2021-01-14
Filing date: 2021-01-14
Publication date: 2022-07-12
Anticipated expiration: 2041-01-14
Also published as: CN112733996A

Abstract

The invention discloses a hydrological time series prediction method based on GA-PSO-optimized XGBoost, which comprises: collecting the rainfall values corresponding to the hydrological stations and the flow of the corresponding hydrological stations, and organizing them into a hydrological time series data set; preprocessing the data and dividing the sample data set into a training set and a test set; optimizing hyperparameters of XGBoost such as the learning rate lr, the number of base learners n_estimators, the minimum leaf weight min_weights and the maximum tree depth max_depth with an improved GA-PSO combined optimization algorithm, and training the XGBoost model on the sample data set to obtain a GA-PSO-optimized XGBoost hydrological time series prediction model; and testing the GA-PSO-optimized XGBoost hydrological prediction model. Because GA-PSO is used to optimize the parameters of the XGBoost model, the model built with the optimal parameters achieves higher accuracy in hydrological prediction.

Description

GA-PSO (genetic Algorithm-particle swarm optimization) based hydrological time sequence prediction method for optimizing XGboost

Technical Field

The invention belongs to the field of hydrological prediction technology, and particularly relates to a hydrological time series prediction method based on GA-PSO (genetic algorithm-particle swarm optimization) optimized XGBoost.

Background

At present, the hydrology industry in China is moving from traditional hydrology to modern hydrology. Automatic hydrological station observation technology is spreading rapidly, data recording has gone from manual entry to automatic stations that record every few minutes or even every second, and the coverage of hydrological data is increasingly comprehensive. Hydrological data are large in volume, varied in category, spatiotemporal in nature and quickly updated; they are also influenced by conditions such as seasonal climate, geomorphic characteristics and hydrological laws, so many valuable patterns and much useful information are hidden in them. How to analyze these data effectively and extract useful information to serve hydrological forecasting, flood detection and similar tasks has become a focus of attention. In the traditional hydrology industry, a physical model is generally built according to the hydrological environment and process, and manual experience is then added for prediction. From an information perspective, if specific patterns can be mined from the long-term historical time series of a drainage basin, the future water level and flow of the basin can be predicted effectively from the approximate trend, which helps prevent flood disasters; the importance of hydrological time series prediction is therefore self-evident.

In recent years, some scholars have applied machine learning methods to hydrological time series prediction, such as LSTM and BP neural networks and support vector machines. These methods achieve good results and improve on the calculation speed and precision of traditional models, but they also have problems: LSTM and BP neural networks have strong learning ability, but they easily fall into local optima, require a large number of parameters, and converge slowly; the support vector machine predicts well, but for large-scale training samples it is slow to compute and depends on the choice of hyperparameters. It is therefore necessary to find a prediction model that offers both efficiency and accuracy.

The genetic algorithm (GA) and the particle swarm optimization (PSO) algorithm are the most frequently used and most basic optimization algorithms for tuning model parameters. In the GA optimization process, the whole population exists in encoded form and moves gradually and fairly uniformly toward the optimal region; however, GA is "memoryless" and updates individuals only through crossover and mutation, so its global search capability is stronger. In contrast, PSO "has memory": it updates each particle by changing its velocity and position, which depend closely on the position at the previous moment, so it is better suited to local search, has fewer parameters to adjust, and converges quickly, but it is prone to premature convergence.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to solve the defects in the prior art, and provides a GA-PSO optimization XGboost-based hydrological time sequence prediction method.

The technical scheme is as follows: the invention discloses a GA-PSO (genetic algorithm-particle swarm optimization) based hydrological time sequence prediction method for optimizing XGboost, which comprises the following steps of:

Step S1, collecting the rainfall values of all rainfall stations corresponding to a water system basin within a certain time period and the water levels of the corresponding water level stations, and organizing a hydrological time series data set;

Step S2, preprocessing each hydrological sample in the hydrological time series data set of step S1, and dividing the sample data set into a hydrological training data set L and a hydrological test data set T;

Step S3, optimizing hyperparameters of the XGBoost model such as the learning rate lr, the number of base learners n_estimators, the minimum leaf weight min_weights and the maximum tree depth max_depth with an improved GA-PSO combined optimization algorithm, and training the XGBoost model on the sample data set to obtain the GA-PSO-optimized XGBoost hydrological time series prediction model;

Step S4, testing the GA-PSO-optimized XGBoost hydrological prediction model.

Step S1 obtains the data set and the corresponding label information. Specifically, the current and previous 7 hours of rainfall values of the rainfall stations corresponding to the water system basin, together with the current and previous 7 hours of flow values of the corresponding hydrological station, are organized into the hydrological time series data set.
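The windowing described in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `make_lagged_samples`, the list-based data layout, and the choice of the flow `horizon` hours ahead as the target are assumptions made for the example.

```python
def make_lagged_samples(rain, flow, lags=7, horizon=1):
    """Build (features, target) pairs from hourly rainfall and flow series.

    rain: list of per-hour lists, one rainfall value per rainfall station
    flow: list of per-hour flow values at the hydrological station
    Each feature row holds rainfall and flow for hours t-lags .. t;
    the target (an assumption here) is the flow at hour t + horizon.
    """
    X, y = [], []
    for t in range(lags, len(flow) - horizon):
        row = []
        for k in range(t - lags, t + 1):
            row.extend(rain[k])   # rainfall of every station at hour k
            row.append(flow[k])   # flow at hour k
        X.append(row)
        y.append(flow[t + horizon])
    return X, y
```

With four rainfall stations and one flow station, each row would hold 8 hours × 5 values = 40 features.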

Step S2 preprocesses the data in the data set and partitions the data set. Specifically:

Step S2.1, the preprocessing of the hydrological sample data x(t) in step S2 includes missing value processing, error value correction and normalization;

the normalization formula is as follows:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein $x^{*}$ is the normalized value, $x$ is the initial value, $x_{\min}$ is the minimum value in the original sequence, and $x_{\max}$ is the maximum value in the original sequence;

Step S2.2, taking the first 80% of the preprocessed hydrological time series data set as the hydrological training data set L, and the remaining 20% as the hydrological test data set T.
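A minimal sketch of the normalization and the chronological split of step S2 follows. The helper names are hypothetical, and returning zeros for a constant sequence is an assumption added to avoid division by zero; the text does not say how that case is handled.

```python
def min_max_normalize(values):
    """Scale a sequence to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant sequence maps to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def chronological_split(samples, train_fraction=0.8):
    """Keep time order: the first part trains, the remainder tests."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]
```

The split is not shuffled, so the test set always lies after the training set in time, which matches "the first 80%" and "the remaining 20%".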

The XGBoost model has many parameters, and better parameter values improve the accuracy of the sequence prediction. The hyperparameters of the XGBoost model, namely the learning rate lr, the number of base learners n_estimators, the minimum leaf weight min_weights and the maximum tree depth max_depth, are therefore optimized with an improved GA-PSO algorithm, and step S3 specifically comprises the following steps:

Step S3.1, initializing the value ranges of the learning rate lr, the number of base learners n_estimators, the minimum leaf weight min_weights and the maximum tree depth max_depth of the XGBoost model, and setting the number of iterations of the overall GA-PSO optimization algorithm to T*;

Step S3.2, randomly generating N subgroups, wherein the chromosome of each particle in each subgroup corresponds to one set of XGBoost parameters (lr, n_estimators, min_weights, max_depth);
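One way to picture the correspondence between a chromosome and a parameter set is the decoding sketch below. It assumes real-valued genes in [0, 1], which the text does not specify. The keys follow the XGBoost scikit-learn parameter names, with `min_child_weight` assumed to be the counterpart of min_weights. The lr and n_estimators ranges are taken from the embodiment; the other two ranges are placeholders.

```python
# Hypothetical search space: (lower bound, upper bound, type).
SEARCH_SPACE = {
    "learning_rate": (0.01, 0.4, float),
    "n_estimators": (10, 220, int),
    "min_child_weight": (1, 10, float),  # placeholder range (assumption)
    "max_depth": (3, 10, int),           # placeholder range (assumption)
}


def decode(genes):
    """Map genes in [0, 1] to one set of XGBoost hyperparameters."""
    params = {}
    for gene, (name, (lo, hi, kind)) in zip(genes, SEARCH_SPACE.items()):
        value = lo + gene * (hi - lo)
        # Integer-valued parameters are rounded to the nearest integer.
        params[name] = int(round(value)) if kind is int else value
    return params
```

The resulting dictionary could be passed as keyword arguments to an XGBoost regressor when a particle's fitness is evaluated.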

Step S3.3, using R² as the individual fitness value, and initializing the individual fitness values of all particles in the N subgroups of step S3.2;

Step S3.4, performing one round of classical GA optimization on the N subgroups to finally obtain N optimal particles, wherein the specific GA optimization method is as follows: each subgroup comprises m individuals, the number of iterations of each subgroup is set to T₁, and selection, crossover and mutation operations are performed on the m encoded individuals to update the population;

Step S3.5, calculating the fitness value of each particle after mutation, and updating the optimal individual of the current iteration according to the fitness value;

Step S3.6, returning to step S3.4 to continue the classical GA optimization until the upper limit T₁ of the number of iterations is reached and the termination condition is satisfied; each subgroup then has T₁ historical optimal particles, whose fitness values are compared, and the particle with the highest fitness value is taken as the optimal individual of the subgroup, so that N optimal individuals are finally obtained from the N subgroups;
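The GA stage of steps S3.4 to S3.6 for a single subgroup might look like the sketch below. The text only names selection, crossover and mutation; binary tournament selection, single-point crossover, real-valued genes in [0, 1] and an even subgroup size m are all assumptions of this example. The fitness function is passed in, so R² of a trained model could be plugged in.

```python
import random


def run_ga(fitness, n_genes, m=20, t1=30, cp=0.85, mp=0.05, seed=0):
    """Evolve one subgroup for t1 generations and return its best individual."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(m)]
    best = max(pop, key=fitness)
    for _ in range(t1):
        # Selection: binary tournament (assumed operator).
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(m)]
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            # Crossover: swap tails after a random cut with probability cp.
            if n_genes > 1 and rng.random() < cp:
                cut = rng.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        # Mutation: resample each gene with probability mp.
        for child in children:
            for i in range(n_genes):
                if rng.random() < mp:
                    child[i] = rng.random()
        pop = children
        # Keep the best individual seen over all generations (step S3.6).
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best[:]
    return best
```

Running this once per subgroup yields the N optimal individuals that seed the PSO stage.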

Step S3.7, decoding the N optimal individuals obtained in step S3.6 to serve as the initial particle swarm of the PSO algorithm, and performing improved PSO optimization, wherein the number of iterations of the PSO algorithm is set to T₂;

Step S3.8, initializing the initial velocity of the initial particles of the PSO algorithm, still using R² as the fitness value, and updating the velocity and position of each particle with the improved formula, thereby updating the historical optimal position of each particle, denoted pbest, and the global optimal position of the swarm, denoted gbest;

the particle velocity and position update formulas in PSO are:

$$v_i^{t+1} = \omega v_i^{t} + c_1\,\mathrm{rand}_1\,(pbest_i^{t} - x_i^{t}) + c_2\,\mathrm{rand}_2\,(gbest^{t} - x_i^{t})$$

$$x_i^{t+1} = x_i^{t} + v_i^{t+1}$$

wherein $v_i^{t}$ represents the velocity of particle $i$ at the current time $t$, $x_i^{t}$ represents the position of the particle at the current time $t$, $pbest_i^{t}$ represents the individual extreme point, $gbest^{t}$ represents the global extreme point, $\omega$ is the inertia weight, $c_1$ and $c_2$ are the learning factors, and $\mathrm{rand}_1$ and $\mathrm{rand}_2$ are random numbers in the interval [0, 1];

a non-linear decreasing weight method is adopted for the weight ω:

the learning factors are likewise nonlinear functions of the weight:

Step S3.9, judging whether the current number of iterations is less than or equal to T₂; if so, returning to step S3.8 to continue the current PSO optimization; otherwise, proceeding to step S3.10;
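The PSO stage of steps S3.7 to S3.9 can be sketched with the standard velocity and position update. The nonlinear schedules for ω, c₁ and c₂ are not reproduced in this text, so the sketch keeps them as fixed arguments; the values 0.7 and 1.5 and the zero initial velocity are placeholders, not the patented settings.

```python
import random


def pso_step(positions, velocities, pbest, gbest, w, c1, c2, rng):
    """Apply one velocity and position update to every particle, in place."""
    for i in range(len(positions)):
        x, v = positions[i], velocities[i]
        for d in range(len(x)):
            r1, r2 = rng.random(), rng.random()
            v[d] = (w * v[d]
                    + c1 * r1 * (pbest[i][d] - x[d])
                    + c2 * r2 * (gbest[d] - x[d]))
            x[d] += v[d]


def run_pso(fitness, start, t2=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Refine the GA elites for t2 iterations; return (gbest, pbest)."""
    rng = random.Random(seed)
    positions = [list(p) for p in start]
    velocities = [[0.0] * len(p) for p in positions]  # assumed initial velocity
    pbest = [list(p) for p in positions]
    gbest = list(max(pbest, key=fitness))
    for _ in range(t2):
        pso_step(positions, velocities, pbest, gbest, w, c1, c2, rng)
        # Update each particle's historical best, then the swarm's best.
        for i, x in enumerate(positions):
            if fitness(x) > fitness(pbest[i]):
                pbest[i] = list(x)
        gbest = list(max(pbest, key=fitness))
    return gbest, pbest
```

Replacing the fixed `w`, `c1` and `c2` with per-iteration values would be the natural place to plug in a decreasing weight schedule.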

Step S3.10, judging whether the current total number of iterations has reached T*; if not, K particles are randomly selected from the historical optimal particles of the PSO to replace K individuals in each GA subgroup of step S3.2, and the method returns to step S3.2 to continue the optimization; if so, the optimal solution is output;
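The outer loop that ties the GA and PSO stages together (steps S3.2 to S3.10) can be sketched as a skeleton. The GA and PSO stages are passed in as callables with hypothetical signatures, so this only illustrates the control flow and the feedback of K PSO historical-best particles into each subgroup; it is one reading of the steps, not a definitive implementation.

```python
import random


def ga_pso(subgroups, evolve_subgroup, refine_swarm, fitness, t_star, k, seed=0):
    """Outer GA-PSO loop.

    subgroups: list of N subgroups, each a list of candidate solutions (lists)
    evolve_subgroup(group) -> (best_individual, new_group)  # GA stage
    refine_swarm(particles) -> per-particle best positions  # PSO stage
    """
    rng = random.Random(seed)
    best = None
    for it in range(t_star):
        # GA stage: one optimal individual per subgroup.
        elites = []
        for i, group in enumerate(subgroups):
            elite, subgroups[i] = evolve_subgroup(group)
            elites.append(elite)
        # PSO stage: the elites form the initial swarm.
        pbest = refine_swarm(elites)
        round_best = max(pbest, key=fitness)
        if best is None or fitness(round_best) > fitness(best):
            best = list(round_best)
        if it == t_star - 1:
            break  # T* reached: output the optimal solution
        # Feedback: K random PSO historical-best particles replace
        # K individuals of every GA subgroup.
        for group in subgroups:
            for slot in rng.sample(range(len(group)), k):
                group[slot] = list(rng.choice(pbest))
    return best
```

The feedback step is what lets later GA rounds start from regions the PSO stage has already found promising.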

XGBoost in step S4 is a tree ensemble model whose internal decision trees are regression trees, and the detailed process of step S4 is as follows:

the loss function of the GA-PSO-optimized XGBoost hydrological time series prediction model is set as:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma M + \frac{1}{2}\lambda \lVert w \rVert^{2}$$

wherein $l(y_i, \hat{y}_i)$ is the loss function measuring the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$; K represents the number of decision trees contained in the model; $\Omega(f_k)$ is the regularization term, in which γ is the penalty constant of the gain function for splitting leaf nodes, M is the number of leaf nodes, $w$ is the vector of leaf weights, and λ is the coefficient of the L2 regularization penalty;

the predicted value of the i-th sample in the j-th training round, i.e. under the j-th model, is:

$$\hat{y}_i^{(j)} = \hat{y}_i^{(j-1)} + f_j(x_i)$$

the simplified objective function of the j-th training model is:

$$Obj^{(j)} \approx \sum_{i=1}^{n} \left[ g_i f_j(x_i) + \frac{1}{2} h_i f_j^{2}(x_i) \right] + \Omega(f_j)$$

in the formula, $g_i = \partial_{\hat{y}_i^{(j-1)}}\, l(y_i, \hat{y}_i^{(j-1)})$ is the first derivative of the loss function, and $h_i = \partial^{2}_{\hat{y}_i^{(j-1)}}\, l(y_i, \hat{y}_i^{(j-1)})$ is the second derivative of the loss function.
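To make the first and second derivatives concrete, the sketch below evaluates them for a squared-error loss l = (y − ŷ)², which is an assumption since the text does not name the loss. It also computes the leaf weight −G/(H + λ), the standard XGBoost closed form that minimizes the simplified objective over one leaf; that formula is not stated in this document.

```python
def squared_error_grad_hess(y_true, y_pred):
    """First and second derivatives of (y - y_hat)^2 with respect to y_hat."""
    g = [2.0 * (p - y) for y, p in zip(y_true, y_pred)]
    h = [2.0 for _ in y_true]
    return g, h


def leaf_weight(g, h, lam):
    """Weight minimizing sum(g)*w + 0.5*(sum(h) + lam)*w^2 for one leaf."""
    return -sum(g) / (sum(h) + lam)
```

With λ = 0 the leaf weight is the mean residual of the samples in the leaf; a larger λ shrinks it toward zero.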

Beneficial effects: compared with the prior art, the invention has the following advantages:

The parameters of the XGBoost model are optimized with a GA-PSO combined optimization algorithm, which avoids becoming trapped in a local optimum when searching for the optimal parameters, and the model built with the optimal parameters achieves higher accuracy in hydrological prediction. While maintaining prediction accuracy, the method also converges faster, and the calculation speed on large-scale training samples is improved to a certain extent.

The XGBoost prediction model after parameter optimization has a better prediction effect and prediction precision, and the generalization capability of the prediction model is improved.

Drawings

FIG. 1 is a schematic overall flow diagram of the present invention;

FIG. 2 is a schematic diagram of GA-PSO optimization according to an embodiment of the present invention;

FIG. 3 is a graph comparing the fitness value curves of the GA-PSO, GA and PSO optimization algorithms in the embodiment;

FIG. 4 shows the detailed sequence (471, 481) for a forecast period of 1 h in the embodiment;

FIG. 5 shows the detailed sequence (2068, 2107) for a forecast period of 1 h in the embodiment.

Detailed Description

The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

As shown in FIG. 1, the hydrological time series prediction method based on GA-PSO-optimized XGBoost of the invention mainly comprises 4 steps:

Step S1, selecting data of the Longshan watershed to organize a hydrological time series data set. The data span from 01:00 on December 24, 2010 to July 25, 2014, 31416 hourly records in total, and each record consists of five attributes: the flow value of the Longshan station and the rainfall values of four rainfall stations. The four rainfall stations are: Longshan, rear love, stream and moon;

Step S2.1, the preprocessing of the hydrological sample data in step S2 includes missing value processing, error value correction and normalization;

the normalization formula is as follows:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Step S2.2, taking the first 80% of the preprocessed hydrological time series data set as the hydrological training data set L, and the remaining 20% as the hydrological test data set T. The 26000 hourly records from 01:00 on December 24, 2010 to 08:00 on December 11, 2013 are selected as the training set L, and the 5416 records from 08:00 on December 11, 2013 to 01:00 on July 25, 2014 as the test set T;
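The record counts in this embodiment can be cross-checked with a short calculation. The sketch assumes the record period runs from 01:00 on December 24, 2010 to 01:00 on July 25, 2014, which is an interpretation of the timestamps given in the text.

```python
from datetime import datetime

# Assumed reading of the record period (hourly data).
start = datetime(2010, 12, 24, 1)
end = datetime(2014, 7, 25, 1)
total_hours = int((end - start).total_seconds() // 3600)

n_train, n_test = 26000, 5416
train_share = n_train / (n_train + n_test)

print(total_hours)            # hours between the two timestamps
print(round(train_share, 3))  # share of records used for training
```

Under this reading the period spans 31416 hours, matching the stated record count, and the 26000/5416 split puts about 82.8% of the records in the training set, slightly more than the nominal 80%.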

Step S3, optimizing the learning rate lr, the number of base learners n_estimators, the minimum leaf weight min_weights and the maximum tree depth max_depth of the XGBoost model with the improved GA-PSO combined optimization algorithm, and training the XGBoost model on the sample data set L to obtain the GA-PSO-optimized XGBoost hydrological time series prediction model;

Step S3.1, initializing the value ranges of the learning rate lr, the number of base learners n_estimators, the minimum leaf weight min_weights and the maximum tree depth max_depth of the XGBoost model: the range of lr is set to (0.01, 0.4), the range of n_estimators to (10, 220), the range of gamma to (3, 10) and the range of max_depth to (0, 0.2). The number of iterations of the overall GA-PSO optimization algorithm is T*; the initial population number is set to N = 50 and T* is set to 100, the crossover probability cp of the GA is 0.85, the mutation probability mp is 0.05, the number of GA iterations is T₁ = 50, and the number of iterations of the improved PSO optimization is T₂. The parameters of the XGBoost model are then optimized with the GA-PSO optimization algorithm; the specific flow is shown in FIG. 2, and the specific steps are as follows:

Step S3.3, using R² as the individual fitness value, and initializing the individual fitness values of all particles in the N subgroups of step S3.2;

Step S3.4, performing one round of classical GA optimization on the 50 subgroups to finally obtain 50 optimal particles, wherein the specific GA optimization method is as follows: each subgroup contains 50 individuals, the number of iterations of each subgroup is set to T₁, and selection, crossover and mutation operations are performed on the 50 encoded individuals to update the population;

Step S3.6, returning to step S3.4 to continue the classical GA optimization until the upper limit T₁ of the number of iterations is reached and the termination condition is satisfied; each subgroup then has T₁ historical optimal particles, whose fitness values are compared, and the particle with the highest fitness value is taken as the optimal individual of the subgroup, so that 50 optimal individuals are finally obtained from the 50 subgroups;

Step S3.8, initializing the initial velocity of the initial particles of the PSO algorithm, still using R² as the fitness value, and updating the velocity and position of each particle with the improved formula, thereby updating the historical optimal position of each particle, denoted pbest, and the global optimal position of the swarm, denoted gbest;

the particle velocity and position update formulas in PSO are:

$$v_i^{t+1} = \omega v_i^{t} + c_1\,\mathrm{rand}_1\,(pbest_i^{t} - x_i^{t}) + c_2\,\mathrm{rand}_2\,(gbest^{t} - x_i^{t})$$

$$x_i^{t+1} = x_i^{t} + v_i^{t+1}$$

wherein $v_i^{t}$ represents the velocity of particle $i$ at the current time $t$, $x_i^{t}$ represents the position of the particle at the current time $t$, $pbest_i^{t}$ represents the individual extreme point, $gbest^{t}$ represents the global extreme point, $\omega$ is the inertia weight, $c_1$ and $c_2$ are the learning factors, and $\mathrm{rand}_1$ and $\mathrm{rand}_2$ are random numbers in the interval [0, 1];

a nonlinear decreasing weight method is adopted for the weight ω:

the learning factors are likewise nonlinear functions of the weight:

Step S3.10, judging whether the current total number of iterations has reached T*; if not, K = N/2 = 25 particles randomly selected from the historical optimal particles of the PSO replace K individuals in each GA subgroup of step S3.2, and the method returns to step S3.2 to continue the optimization; if so, the optimal solution is output;

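the loss function of the GA-PSO-optimized XGBoost hydrological time series prediction model is set as:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

wherein $l(y_i, \hat{y}_i)$ is the loss function measuring the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$;

the simplified objective function of the j-th training model is:

$$Obj^{(j)} \approx \sum_{i=1}^{n} \left[ g_i f_j(x_i) + \frac{1}{2} h_i f_j^{2}(x_i) \right] + \Omega(f_j)$$

in the formula, $g_i$ is the first derivative of the loss function, and $h_i$ is the second derivative of the loss function.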

In the embodiment, the optimal parameters of the XGBoost model with parameters optimized by the GA-PSO optimization algorithm in the forecast period of 1 to 6 hours are shown in table 1 below:

TABLE 1

The optimal model is used to predict the flow data of Longshan, and the results are compared with those of an SVM (support vector machine) model and an LSTM (long short-term memory) model; the final prediction results are shown in FIG. 4. Four evaluation indexes are used for the prediction results: MRE (mean relative error), MAE (mean absolute error), RMSE (root mean square error) and R² (coefficient of determination), calculated as follows:

$$MRE = \frac{1}{n}\sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}, \qquad MAE = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}, \qquad R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}$$

in the formula, $y_i$ is the actual measured value, $\hat{y}_i$ is the forecast value, $\bar{y}$ is the average of the measured values, and n is the number of samples.
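The four evaluation indexes can be computed as in the sketch below. It uses the common textbook definitions of MRE, MAE, RMSE and R², which are assumed to match the ones intended here; the function name `evaluate` is hypothetical, and MRE requires nonzero measured values.

```python
import math


def evaluate(y_true, y_pred):
    """Return MRE, MAE, RMSE and R^2 for measured and forecast values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    abs_err = [abs(y - p) for y, p in zip(y_true, y_pred)]
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return {
        # Relative error is undefined when a measured value is zero.
        "MRE": sum(e / abs(y) for e, y in zip(abs_err, y_true)) / n,
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }
```

The same R² value could serve as the fitness during the GA-PSO search, since the method uses R² as the individual fitness value.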

Table 2 compares the predicted values of XGBoost with the optimal parameters against those of the two prediction models SVM and LSTM at a forecast period of 1 h.

TABLE 2

Table 3 shows the comparison of the evaluation indexes of the three models in all the forecast periods.

TABLE 3

Fig. 3 shows a fitness value change curve of the GA-PSO optimization algorithm (GPSO for short) at a forecast period of 1h, compared with the classical GA and the classical PSO algorithms. Two detailed sequences (471, 481) and (2068, 2107) in the test set are selected for display in fig. 4 and fig. 5, respectively.

Claims

1. A hydrological time series prediction method based on GA-PSO-optimized XGBoost, characterized by comprising the following steps:

Step S3, optimizing the learning rate lr, the number of base learners n_estimators, the minimum leaf weight min_weights and the maximum tree depth max_depth of the XGBoost model with an improved GA-PSO combined optimization algorithm, and training the XGBoost model on the hydrological training data set L to obtain the GA-PSO-optimized XGBoost hydrological time series prediction model; the specific contents are as follows:

Step S3.4, performing classical GA optimization on the N subgroups to finally obtain N optimal particles, wherein the specific GA optimization method is as follows: each subgroup comprises m individuals, the number of iterations of each subgroup is set to T₁, and selection, crossover and mutation operations are performed on the m encoded individuals to update the population;

Step S3.6, returning to step S3.4 to continue the GA optimization of the subgroups until the upper limit T₁ of the number of iterations is reached and the termination condition is satisfied; each subgroup then has T₁ historical optimal particles, whose fitness values are compared, and the particle with the highest fitness value is taken as the optimal individual of the subgroup, so that N optimal individuals are finally obtained from the N subgroups;

Step S3.8, initializing the initial velocity of the initial particles of the PSO algorithm, still using R² as the fitness value, and updating the velocity and position of each particle with the improved formula, thereby updating the historical optimal position of each particle, denoted pbest, and the global optimal position of the swarm, denoted gbest;

Step S3.10, judging whether the current total number of iterations has reached T*; if not, K particles are randomly selected from the historical optimal particles of the PSO to replace K individuals in each GA subgroup of step S3.2, and the method returns to step S3.2 to continue the optimization; if so, the optimal solution is output;

Step S4, testing on the test set T with the optimal GA-PSO-optimized XGBoost hydrological prediction model obtained in step S3.

2. The GA-PSO optimized XGboost-based hydrological time series prediction method according to claim 1, characterized in that: the hydrologic time series data set in step S1 includes current and previous 7-hour rainfall values of the rainfall station corresponding to the water system watershed, and current and previous 7-hour flow values of the corresponding hydrologic station.

3. The GA-PSO optimized XGBoost-based hydrological time series prediction method according to claim 1, characterized in that: the preprocessing of the hydrological sample data x(t) in step S2 includes missing value processing, error value correction and normalization;

the normalization formula is as follows:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

and the first 80% of the preprocessed hydrological time series data set is taken as the hydrological training data set L, and the remaining 20% as the hydrological test data set T.

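4. The GA-PSO optimized XGBoost-based hydrological time series prediction method according to claim 1, characterized in that: the particle velocity and position update formulas in step S3.8 are:

$$v_i^{t+1} = \omega v_i^{t} + c_1\,\mathrm{rand}_1\,(pbest_i^{t} - x_i^{t}) + c_2\,\mathrm{rand}_2\,(gbest^{t} - x_i^{t})$$

$$x_i^{t+1} = x_i^{t} + v_i^{t+1}$$

wherein $v_i^{t}$ represents the velocity of particle $i$ at the current time $t$, $x_i^{t}$ represents the position of the particle at the current time $t$, $pbest_i^{t}$ represents the individual extreme point, $gbest^{t}$ represents the global extreme point, $\omega$ is the inertia weight, $c_1$ and $c_2$ are the learning factors, and $\mathrm{rand}_1$ and $\mathrm{rand}_2$ are random numbers in the interval [0, 1];

a nonlinear decreasing weight method is adopted for the weight ω:

the learning factors are likewise nonlinear functions of the weight: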

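5. The GA-PSO optimized XGBoost-based hydrological time series prediction method according to claim 1, characterized in that: the detailed process of step S4 is:

the loss function of the GA-PSO-optimized XGBoost hydrological time series prediction model is set as:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma M + \frac{1}{2}\lambda \lVert w \rVert^{2}$$

wherein $l(y_i, \hat{y}_i)$ is the loss function measuring the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$; $\Omega(f_k)$ is the regularization term, in which γ is the penalty constant of the gain function for splitting leaf nodes, M is the number of leaf nodes, $w$ is the vector of leaf weights, and λ is the coefficient of the L2 regularization penalty;

the simplified objective function of the j-th training model is:

$$Obj^{(j)} \approx \sum_{i=1}^{n} \left[ g_i f_j(x_i) + \frac{1}{2} h_i f_j^{2}(x_i) \right] + \Omega(f_j)$$

in the formula, $g_i$ is the first derivative of the loss function, and $h_i$ is the second derivative of the loss function;

and the test set is tested using the optimal parameters of the XGBoost model found by the GA-PSO optimization algorithm.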