CN115688588B - Sea surface temperature daily variation amplitude prediction method based on improved XGB method - Google Patents

Publication number: CN115688588B
Authority: CN (China)
Prior art keywords: model, xgb, representing, lds, tree
Legal status: Active
Application number: CN202211376526.0A
Other languages: Chinese (zh)
Other versions: CN115688588A
Inventors: 宋振亚, 冯跃玲, 肖衡, 杨晓丹, 高振
Current Assignee: First Institute of Oceanography MNR
Original Assignee: First Institute of Oceanography MNR
Application filed by First Institute of Oceanography MNR
Priority application: CN202211376526.0A; published as CN115688588A, granted as CN115688588B

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the technical field of ocean surface temperature prediction and provides a sea surface temperature daily variation amplitude prediction method based on an improved XGB method, comprising the following steps: S1: acquiring and preprocessing a data set, where the data set comprises wind speed data and short wave radiation value data; S2: establishing an XGBoost model; S3: modifying the algorithm weights of the XGBoost model by applying an LDS algorithm to establish an LDS-XGB model; S4: selecting a training set from the data set and training the LDS-XGB model with it; S5: predicting the daily variation amplitude of sea surface temperature with the trained LDS-XGB model. The invention innovatively applies the XGB algorithm, and machine learning generally, to SST daily variation amplitude prediction, and smooths the data label values with the LDS method, so that traditional imbalanced-classification techniques can be applied to regression problems.

Description

Sea surface temperature daily variation amplitude prediction method based on improved XGB method
Technical Field
The invention relates to the technical field of ocean surface temperature prediction, in particular to a sea surface temperature daily variation amplitude prediction method based on an improved XGB method.
Background
SST stands for sea surface temperature. The main methods currently used to study SST daily variation are observational studies, empirical models, and numerical simulation.
The development of marine observation has greatly advanced research on the SST daily variation process, but such research is limited by the available observation means and data, and understanding of the process remains deficient.
At present there is a gap in predicting SST daily variation with machine learning. Specifically:
(1) Although traditional empirical models can capture the basic characteristics of SST daily variation, their application range is limited and their precision is not high. Understanding of SST daily variation remains deficient, and traditional empirical models still suffer from low precision and complex calculation, so reasonable simulation and prediction of the SST daily variation process remains a challenge.
(2) Numerical simulation is an effective means of simulating and predicting the SST daily variation process, but it is limited by the state of the art of numerical modeling, and accurate simulation and prediction remain a challenge. In addition, uncertainty factors such as the uncertainty of the SST daily variation process itself and the parameterization of the model make the accuracy of simulation difficult to ensure.
(3) Machine learning is receiving increasing attention in marine environment research, simulation, and prediction, and is expected to play an important role in physically constrained studies of the SST daily variation process. Machine learning methods have achieved strong results in predicting sea temperature, but there is currently a gap in predicting SST daily variation with machine learning.
Disclosure of Invention
To solve the problems described in the Background, the invention provides a sea surface temperature daily variation amplitude prediction method based on an improved XGB method, comprising the following steps:
S1: acquiring and preprocessing a data set, where the data set comprises wind speed data and short wave radiation value data;
S2: establishing an XGBoost model;
S3: modifying the algorithm weights of the XGBoost model by applying an LDS algorithm to establish an LDS-XGB model;
S4: selecting a training set from the data set, and training the LDS-XGB model with the training set;
S5: predicting the daily variation amplitude of sea surface temperature with the trained LDS-XGB model.
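For concreteness, the S1-S5 flow can be sketched in Python. This is an illustrative sketch only: the synthetic wind, radiation, and amplitude values, the bin count, and the kernel width are assumptions, and a comment marks where the actual XGBoost fit of steps S2 and S5 would go (for example via the `sample_weight` argument of xgboost's `XGBRegressor.fit`).

```python
import numpy as np

def lds_weights(y, n_bins=30, sigma=2.0):
    """Step S3 (sketch): smooth the empirical label density with a Gaussian
    kernel (LDS) and return inverse-effective-density sample weights."""
    counts, edges = np.histogram(y, bins=n_bins)
    p_emp = counts / counts.sum()                    # empirical density p(y)
    k = np.arange(-7, 8, dtype=float)                # kernel support (assumed)
    kernel = np.exp(-k ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    p_eff = np.convolve(p_emp, kernel, mode="same")  # effective density
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
    w = 1.0 / np.maximum(p_eff[idx], 1e-12)          # inverse re-weighting
    return w * len(y) / w.sum()                      # normalise to mean 1

# S1 (sketch): synthetic 3-hourly wind speed and shortwave radiation
rng = np.random.default_rng(0)
n = 500
wind = rng.uniform(0.5, 12.0, n)
swr = rng.uniform(0.0, 1000.0, n)
dsst = np.maximum(0.0, 0.003 * swr / (1.0 + wind) + rng.normal(0.0, 0.05, n))

split = int(0.8 * n)                                 # S4: 8:2 train/test split
w_train = lds_weights(dsst[:split])

# S2/S5 (not run here): fit and predict with XGBoost, e.g.
#   xgboost.XGBRegressor().fit(X_train, y_train, sample_weight=w_train)
```

In practice the weights computed in S3 are simply handed to the booster; nothing else in the XGBoost training loop changes.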
In a preferred embodiment, the data set comprises average wind speed data at three-hour intervals over several days, and average short wave radiation data at the same three-hour intervals.
Further, the specific process of step S2 includes:
the XGBoost model is based on the current model and adds another model, so that the effect of the combined model is superior to that of the machine learning algorithm model of the current model, and the establishment process is as follows:
$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (1)$

where $\hat{y}_i$ represents the predicted value of the model, $K$ the number of decision trees, $f_k$ the $k$-th tree model, $x_i$ the $i$-th training sample, and $\mathcal{F}$ the set of all decision tree models;
an objective function is constructed and then optimized:
$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (2)$

where $n$ is the number of training samples. The objective function consists of two parts: a loss function $l$, here the mean square error, and a regularization term $\Omega$, the sum of the complexities of the individual trees, whose purpose is to control model complexity and prevent overfitting;
because the tree-ensemble model of formula (2) takes functions as parameters, it cannot be optimized directly with traditional optimization methods; instead it is trained by additive training: at each step the existing model is kept unchanged and a new function $f$ is added to the model, as follows:
$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (3)$

where $\hat{y}_i^{(t)}$ is the predicted value of the $i$-th sample after $t$ iterations and $\hat{y}_i^{(0)}$ is the initial value of the $i$-th sample;
constructing an optimal model by minimizing a loss function, and obtaining an objective function of a t-th round:
$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + cons \qquad (4)$

where $cons$ is a constant term, the complexity of the first $t-1$ trees;
a second-order Taylor expansion is applied to the objective function of the $t$-th round:
$Obj^{(t)} \simeq \sum_{i=1}^{n}\Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + cons \qquad (5)$

$g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big) \qquad (6)$

where $g_i$ and $h_i$ respectively represent the first and second derivatives of the loss with respect to $\hat{y}_i^{(t-1)}$;
since the loss $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is a fixed value, it is absorbed into the constant term $cons$; the constant term has no influence on the optimization and can therefore be removed. The objective function then depends only on the first and second derivatives of the loss at each sample point, giving the new objective function:

$Obj^{(t)} = \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) \qquad (7)$
next consider the complexity term $\Omega$ of the decision tree. First each tree is redefined, converting the tree-structure expression into a leaf-structure expression: the decision tree is split into a structure part $q$ and a leaf-weight part $\omega$:
$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1,2,\ldots,T\} \qquad (8)$

where $T$ is the total number of leaf nodes of the regression tree, $\omega$ is a $T$-dimensional vector consisting of the leaf-node values, $q(x)$ maps sample $x$ to a leaf node, and $\omega_{q(x)}$ is the score of that node, i.e. the model's predicted value for the sample;
the XGBoost complexity term for a tree consists of two parts, the total number of leaf nodes and the scores of the leaf nodes; an L2 smoothing term is added on each leaf score to avoid overfitting:

$\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \|\omega\|^2 = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \qquad (9)$

where $\|\omega\|^2$ is the squared norm of the leaf-node vector, $\gamma$ represents the difficulty of node segmentation, and $\lambda$ is the L2 regularization coefficient; the values of $\gamma$ and $\lambda$ set the penalty on trees with many leaf nodes;
re-writing the objective function according to the leaf structure:
$Obj^{(t)} = \sum_{i=1}^{n}\Big[ g_i \omega_{q(x_i)} + \tfrac{1}{2} h_i \omega_{q(x_i)}^2 \Big] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 = \sum_{j=1}^{T}\Big[ \big(\textstyle\sum_{i \in I_j} g_i\big)\,\omega_j + \tfrac{1}{2}\big(\textstyle\sum_{i \in I_j} h_i + \lambda\big)\,\omega_j^2 \Big] + \gamma T \qquad (10)$

where $I_j = \{\, i \mid q(x_i) = j \,\}$ is the set of samples on leaf node $j$;
the objective function comprises $T$ independent univariate quadratic functions; we can define:

$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \qquad (11)$

so that the objective function finally simplifies to:

$Obj^{(t)} = \sum_{j=1}^{T}\Big[ G_j \omega_j + \tfrac{1}{2}(H_j + \lambda)\,\omega_j^2 \Big] + \gamma T \qquad (12)$
taking the partial derivative with respect to the unknown variable $\omega_j$ and setting it to 0 gives the extreme point $\omega_j^* = -\dfrac{G_j}{H_j + \lambda}$; substituting $\omega_j^*$ into formula (12) gives the optimal objective function:

$Obj^* = -\tfrac{1}{2} \sum_{j=1}^{T} \dfrac{G_j^2}{H_j + \lambda} + \gamma T \qquad (13)$
this objective function measures the quality of the structure of the $t$-th tree. During splitting, a greedy algorithm traverses all candidate split points, computes the loss for each, and selects the split point with the largest gain; the smaller the resulting loss, the better the model's prediction. The final gain expression is as follows:
$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (14)$

where $\frac{G_L^2}{H_L + \lambda}$ represents the left-subtree score, $\frac{G_R^2}{H_R + \lambda}$ the right-subtree score, and $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ the score when not split; $\gamma$ represents the complexity cost introduced by adding a new node. If the gain is greater than 0 the node can be split; otherwise it is not split.
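Formulas (13) and (14) can be checked numerically. The sketch below is illustrative only: it assumes the loss $l = \tfrac{1}{2}(y - \hat{y})^2$, giving $g_i = \hat{y}_i - y_i$ and $h_i = 1$ (a common implementation convention; the patent's text states the mean square error), and uses made-up data.

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal leaf weight of formula (13): w* = -G_j / (H_j + lambda)."""
    return -G / (H + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Split gain of formula (14) for one candidate split."""
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# toy data: the current prediction is 0 for all samples
y = np.array([1.0, 1.0, 4.0, 4.0])
g = 0.0 - y            # g_i = yhat_i - y_i  (first derivative)
h = np.ones_like(y)    # h_i = 1             (second derivative)

left = np.array([True, True, False, False])   # candidate split {1,1} | {4,4}
gain = split_gain(g, h, left, lam=1.0, gamma=0.0)
w_left = leaf_weight(g[left].sum(), h[left].sum(), 1.0)
print(round(gain, 4), round(w_left, 4))   # 1.3333 0.6667
```

The positive gain means this split would be accepted, and the left leaf receives the shrunken weight $2/3$ rather than the plain mean $1$, illustrating the effect of the $\lambda$ penalty.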
Further, the specific process of step S3 includes:
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote a training set with sample size $n$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \mathcal{Y}$ is the label, with $y$ continuous;
in the label space $\mathcal{Y}$, we divide $y$ into $B$ groups at equal intervals, i.e. $[y_0, y_1), [y_1, y_2), \ldots, [y_{B-1}, y_B)$; we use $b(y) \in \mathcal{B}$ to denote the group index of a target value, with $\mathcal{B}$ the index space;
in predicting the SST daily variation amplitude, we define the group width $\Delta y \triangleq y_{b+1} - y_b$ and compute the density distribution of the label values (the SST daily variation amplitude) over the training set according to $\Delta y$; this is called the empirical density distribution. Previous studies have shown that, when label values are continuous, the empirical density distribution does not reflect the true label density distribution because of dependencies between data samples with adjacent labels; LDS uses kernel density estimation to mitigate this imbalance in the continuous dataset;
LDS uses a symmetric kernel function; here the Gaussian kernel $k(y, y') = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y - y')^2}{2\sigma^2}}$ is chosen. A symmetric kernel satisfies $k(y, y') = k(y', y)$ and $\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0$; it characterizes the similarity between the target values $y'$ and $y$. Convolving the empirical density distribution with the kernel yields a new distribution, called the effective density distribution, computed as follows:
$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy \qquad (15)$

where $p(y)$ represents the empirical density distribution and $\tilde{p}(y')$ the effective density of label value $y'$;
in the standard XGBoost algorithm the regression-tree loss function is usually the square loss $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. After the effective density distribution has been computed, the weights are adjusted by a re-weighting method before prediction: specifically, we weight the loss function by multiplying it by the inverse of the effective density of each training sample. The resulting loss function is:

$\tilde{l}(y_i, \hat{y}_i) = \frac{1}{\tilde{p}(y_i)}\,(y_i - \hat{y}_i)^2 \qquad (16)$

where $\tilde{l}$ represents the re-weighted loss function.
The beneficial effects achieved by the invention are as follows:
Machine learning algorithms such as Bagging and RF can also predict the SST daily variation amplitude, but their prediction error is larger than that of the XGB method and they underestimate the SST daily variation amplitude. The invention innovatively applies the XGB algorithm, and machine learning generally, to SST daily variation amplitude prediction, and smooths the data label values with the LDS method, so that traditional imbalanced-classification techniques can be applied to regression problems.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is an empirical label density versus error distribution plot;
FIG. 3 is a plot of effective label density versus error profile;
FIG. 4 is a schematic diagram of XGB model prediction results;
FIG. 5 is a schematic diagram of the LDS-XGB model prediction results.
Detailed Description
The embodiments of the invention are described below with reference to the accompanying drawings. The embodiments shown are illustrative only; the invention is not limited to them, and other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the invention.
Referring to fig. 1-5, a sea surface temperature daily variation amplitude prediction method based on an improved XGB method includes the steps of: s1: acquiring a data set and preprocessing, wherein the data set comprises wind speed data and short wave radiation value data; s2: establishing an XGBoost model; s3: modifying the algorithm weight of the XGBoost model by applying an LDS algorithm, and establishing an LDS-XGB model; s4: selecting a training set from the data set, and training the LDS-XGB model by using the training set; s5: and predicting the daily variation amplitude of the sea surface temperature by using the trained LDS-XGB model.
The specific processes of steps S2 and S3 in this embodiment are the same as described above in the Disclosure of Invention.
Example 1:
the embodiment is applied to sea surface temperature daily variation amplitude prediction, and an LDS-XGB model suitable for predicting sea surface temperature daily variation amplitude is developed. Observations during tropical ocean and global atmosphere-sea gas coupling response experiments (TOGACOARE) observation are adopted, and include parameters such as sensible heat, latent heat, short wave radiation, wind stress, sea surface temperature and the like. Buoy data of 133 stations are selected, the distribution range is 25 DEG S-21 DEG N in the global scope, the time resolution is 1 hour or 10 minutes, and the time span is 10 months in 1992 to 8 months in 2021.
The experimental data set is preprocessed and divided into training and test sets in an 8:2 ratio. The SST daily variation amplitude, the daily average wind speed, and the daily maximum short wave radiation are calculated, together with the average wind speed and the average short wave radiation for every three-hour interval. The three-hourly average wind speed and three-hourly average short wave radiation are used as inputs to predict the SST daily variation amplitude.
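The feature construction described above can be sketched as follows; the synthetic hourly series and the function names are illustrative assumptions, not the patent's actual TOGA COARE preprocessing.

```python
import numpy as np

def three_hour_means(hourly):
    """Average 24 hourly values per day into 8 three-hour block means."""
    days = np.asarray(hourly, dtype=float).reshape(-1, 24)  # one row per day
    return days.reshape(-1, 8, 3).mean(axis=2)

def sst_daily_amplitude(hourly_sst):
    """Label: daily maximum minus daily minimum SST."""
    days = np.asarray(hourly_sst, dtype=float).reshape(-1, 24)
    return days.max(axis=1) - days.min(axis=1)

# two synthetic days of hourly data
hours = np.arange(48)
wind = 5.0 + np.zeros(48)                           # constant wind speed
sst = 28.0 + np.where(hours % 24 == 14, 1.5, 0.0)   # afternoon peak of +1.5 degC

X_wind = three_hour_means(wind)      # feature matrix, shape (2, 8)
y_amp = sst_daily_amplitude(sst)     # labels [1.5, 1.5]
```

Each day thus contributes eight wind features and eight radiation features (radiation handled identically to wind) plus one amplitude label.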
First, the Pearson correlation coefficient between the empirical label density and the error distribution is calculated as -0.38, indicating only a weak correlation between them. The results are shown in FIG. 2.
The Pearson correlation coefficient between the effective label density and the error distribution is -0.56, showing that the effective label density obtained by the LDS calculation correlates well with the error distribution. The results are shown in FIG. 3.
XGBoost and LDS-XGB are each trained on the training set and then used to predict the test set. The prediction results show that the re-weighted LDS-XGB model performs well on both the training and validation sets. The prediction results of XGBoost and LDS-XGB are shown in FIG. 4 and FIG. 5.
the results of the evaluation of the fitness and prediction errors of the unmodified weight XGB model and the LDS-XGB model are shown in tables 1-2.
TABLE 1 evaluation results of SST daily Change amplitude prediction model
TABLE 2 statistics of predicted results for SST daily variation amplitude model
Tables 1-2 show that both models achieve a high degree of fit and small error values on both the training and test sets, demonstrating good performance in predicting the SST daily variation amplitude. In terms of fit, both the XGB model and the LDS-XGB model reach a fitting degree above 70%. In terms of error, using RMSE as the evaluation index, the two models reach 17.773% and 17.771% respectively. Without the modified weights, more than 99% of the predicted SST daily variation amplitudes are below 2 °C; after the weights are modified, the model can predict values above 2 °C, showing that it is effective in mitigating the data imbalance. The LDS-XGB model thus improves the prediction of high values to a certain extent.
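The evaluation metrics of Tables 1-2 can be computed as below. The "fitting degree" is assumed here to be the coefficient of determination R², which the patent does not state explicitly; the sample values are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination (assumed 'fitting degree' metric)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# made-up SST daily variation amplitudes (degrees C)
y_true = [0.2, 0.5, 1.1, 2.4]
y_pred = [0.3, 0.4, 1.0, 2.2]
print(round(rmse(y_true, y_pred), 4), round(r2(y_true, y_pred), 4))
```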
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within its scope.

Claims (2)

1. The sea surface temperature daily variation amplitude prediction method based on the improved XGB method is characterized by comprising the following steps of:
s1: acquiring a data set and preprocessing, wherein the data set comprises wind speed data and short wave radiation value data;
s2: establishing an XGBoost model;
s3: modifying the algorithm weight of the XGBoost model by applying an LDS algorithm, and establishing an LDS-XGB model;
s4: selecting a training set from the data set, and training the LDS-XGB model by using the training set;
s5: predicting the daily change amplitude of the sea surface temperature by using the trained LDS-XGB model;
the specific process of the step S2 includes:
the XGBoost model starts from the current model and adds a further model at each step, so that the combined model outperforms the current machine-learning model alone; the establishment process is as follows:

$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (1)$

where $\hat{y}_i$ represents the predicted value of the model, $K$ the number of decision trees, $f_k$ the $k$-th tree model, and $\mathcal{F}$ the set of all decision tree models;
an objective function is constructed and then optimized:
$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (2)$

where $n$ is the number of training samples. The objective function consists of two parts: a loss function $l$, here the mean square error, and a regularization term $\Omega$, the sum of the complexities of the individual trees, whose purpose is to control model complexity and prevent overfitting;
because the tree-ensemble model of formula (2) takes functions as parameters, it cannot be optimized directly with traditional optimization methods; instead it is trained by additive training: at each step the existing model is kept unchanged and a new function $f$ is added to the model, as follows:
$\hat{y}_i^{(0)} = 0$
$\hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i)$
$\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + f_2(x_i)$
……
$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (3)$

where $\hat{y}_i^{(t)}$ is the predicted value of the $i$-th sample after $t$ iterations, $\hat{y}_i^{(0)}$ is the initial value of the $i$-th sample, and $x_i$ represents the $i$-th training sample;
constructing an optimal model by minimizing a loss function, and obtaining an objective function of a t-th round:
$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + cons \qquad (4)$

where $cons$ is a constant term, the complexity of the first $t-1$ trees;
a second-order Taylor expansion is applied to the objective function of the $t$-th round:
$Obj^{(t)} \simeq \sum_{i=1}^{n}\Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + cons \qquad (5)$

$g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big) \qquad (6)$

where $g_i$ and $h_i$ respectively represent the first and second derivatives of the loss with respect to $\hat{y}_i^{(t-1)}$;
since the loss $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is a fixed value, it is absorbed into the constant term $cons$; the constant term has no influence on the optimization and can therefore be removed. The objective function then depends only on the first and second derivatives of the loss at each sample point, giving the new objective function:

$Obj^{(t)} = \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) \qquad (7)$
next consider the complexity term $\Omega$ of the decision tree. First each tree is redefined, converting the tree-structure expression into a leaf-structure expression: the decision tree is split into a structure part $q$ and a leaf-weight part $\omega$:
$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1,2,\ldots,T\} \qquad (8)$

where $T$ is the total number of leaf nodes of the regression tree, $\omega$ is a $T$-dimensional vector consisting of the leaf-node values, $q(x)$ maps sample $x$ to a leaf node, and $\omega_{q(x)}$ is the score of that node, i.e. the model's predicted value for the sample;
the XGBoost complexity term for a tree consists of two parts, the total number of leaf nodes and the scores of the leaf nodes; an L2 smoothing term is added on each leaf score to avoid overfitting:

$\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \|\omega\|^2 = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \qquad (9)$

where $\|\omega\|^2$ is the squared norm of the leaf-node vector, $\gamma$ represents the difficulty of node segmentation, and $\lambda$ is the L2 regularization coefficient; the values of $\gamma$ and $\lambda$ set the penalty on trees with many leaf nodes;
re-writing the objective function according to the leaf structure:
$$\widetilde{Obj}^{(t)} = \sum_{j=1}^{T}\left[\Big(\sum_{i \in I_j} g_i\Big)\omega_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big)\omega_j^{2}\right] + \gamma T \qquad (10)$$
wherein: $I_j = \{i \mid q(x_i) = j\}$ is the set of samples on leaf node j;
the objective function now comprises T independent univariate quadratic functions; we define:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \qquad (11)$$
so that the objective function reduces to:
$$\widetilde{Obj}^{(t)} = \sum_{j=1}^{T}\left[G_j \omega_j + \frac{1}{2}\left(H_j + \lambda\right)\omega_j^{2}\right] + \gamma T \qquad (12)$$
taking the partial derivative with respect to the unknown variable $\omega_j$ and setting it to 0 gives the extreme point
$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda} \qquad (13)$$
substituting $\omega_j^{*}$ into formula (12) yields the optimal objective function:
$$\widetilde{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T \qquad (14)$$
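The closed-form leaf weight ω_j* = −G_j/(H_j + λ) and the resulting optimal objective can be sketched directly; a minimal illustration, where G, H are the per-leaf gradient/hessian sums and `lam`, `gamma` the regularization coefficients defined above:

```python
import numpy as np

def optimal_leaf_weight(G, H, lam):
    # optimal leaf value: w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def optimal_objective(Gs, Hs, lam, gamma):
    # optimal objective: -0.5 * sum_j G_j^2/(H_j+lambda) + gamma*T
    Gs, Hs = np.asarray(Gs, dtype=float), np.asarray(Hs, dtype=float)
    return -0.5 * np.sum(Gs ** 2 / (Hs + lam)) + gamma * len(Gs)

w = optimal_leaf_weight(2.0, 3.0, 1.0)
obj = optimal_objective([2.0], [3.0], lam=1.0, gamma=0.5)
```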
this objective function measures the quality of the t-th tree structure; during splitting, a greedy algorithm traverses all candidate split points, computes the loss for each, and selects the split point with the largest gain; a larger gain means a greater reduction of the loss and hence a better split. The final gain expression is as follows:
$$Gain = \frac{1}{2}\left[\frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda}\right] - \gamma \qquad (15)$$
wherein: $\frac{G_L^{2}}{H_L + \lambda}$ represents the left-subtree score, $\frac{G_R^{2}}{H_R + \lambda}$ represents the right-subtree score, and $\frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}$ represents the score when the node is not split; γ represents the complexity cost introduced by adding a new node; if the gain is greater than 0 the node is split, otherwise it is not;
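The gain-based split selection described above can be sketched as follows; a minimal NumPy illustration, where `mask_left` (marking which samples go to the left child) is an assumed helper convention rather than part of the patent:

```python
import numpy as np

def leaf_term(G, H, lam):
    # G^2 / (H + lambda): one subtree's contribution to the gain
    return G * G / (H + lam)

def split_gain(g, h, mask_left, lam=1.0, gamma=0.0):
    """Gain of splitting a node whose samples have gradients g and hessians h:
    0.5*[GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma.
    The node is split only if the gain is positive."""
    GL, HL = g[mask_left].sum(), h[mask_left].sum()
    GR, HR = g[~mask_left].sum(), h[~mask_left].sum()
    return 0.5 * (leaf_term(GL, HL, lam) + leaf_term(GR, HR, lam)
                  - leaf_term(GL + GR, HL + HR, lam)) - gamma

g = np.array([-1.0, -1.0, 1.0, 1.0])
h = np.ones(4)
gain = split_gain(g, h, np.array([True, True, False, False]), lam=0.0, gamma=0.0)
```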
the specific process of the step S3 includes:
let $D = \{(x_i, y_i)\}_{i=1}^{n}$ represent a training set with a sample size n, wherein $x_i \in \mathbb{R}^{d}$ represents an input and $y_i \in \mathbb{R}$ represents a label, y being continuous;
in the label space $\mathcal{Y}$, we divide $\mathcal{Y}$ into B groups at equal intervals, i.e. $[y_0, y_1), [y_1, y_2), \dots, [y_{B-1}, y_B)$; we use $b \in \mathcal{B}$ to represent the group index of a target value, with $\mathcal{B} = \{1, 2, \dots, B\}$ representing the index space;
in the prediction of the SST daily variation amplitude, we define the group interval $\Delta y \triangleq y_{b+1} - y_b$; the density distribution of the label value (the SST daily variation amplitude) in the training set is computed at resolution Δy and is called the empirical density distribution; previous studies have shown that, when the label values are continuous, the empirical density distribution of the labels does not reflect the true label density distribution, owing to the dependence between data samples with adjacent labels; LDS (label distribution smoothing) therefore uses kernel density estimation to mitigate the imbalance of the continuous dataset;
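The empirical density distribution over B equal-interval groups can be computed with a simple histogram; a minimal sketch, where `n_bins` stands in for B:

```python
import numpy as np

def empirical_density(y, n_bins=10):
    """Normalized histogram of label values over B equal-interval groups,
    plus each sample's group index b."""
    counts, edges = np.histogram(y, bins=n_bins)
    p = counts / counts.sum()                                 # empirical density p(y)
    b = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)   # group index per sample
    return p, b

y = np.array([0.1, 0.2, 0.2, 0.9])
p, b = empirical_density(y, n_bins=4)
```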
LDS uses a symmetric kernel function; we choose the Gaussian kernel
$$k(y, y') = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y - y')^{2}}{2\sigma^{2}}\right)$$
a symmetric kernel satisfies $k(y, y') = k(y', y)$ and $\nabla_{y} k(y, y') + \nabla_{y'} k(y, y') = 0$; it characterizes the similarity between the target values y' and y; convolving the empirical density distribution with this kernel yields a new distribution, called the effective density distribution:
$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy \qquad (16)$$
where p(y) represents the empirical density distribution and $\tilde{p}(y')$ represents the effective density distribution of the label value y';
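A minimal sketch of the LDS smoothing step: the empirical density p is convolved with a truncated, normalized Gaussian kernel to obtain the effective density. The kernel half-width `ks` and bandwidth `sigma` are assumed hyperparameters, not values fixed by the patent:

```python
import numpy as np

def lds_effective_density(p, sigma=2.0, ks=2):
    """Effective density: convolution of the empirical density p with a
    symmetric Gaussian kernel over a window of 2*ks+1 groups."""
    x = np.arange(-ks, ks + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()                 # normalize so probability mass is preserved
    return np.convolve(p, kernel, mode='same')

p = np.zeros(11)
p[5] = 1.0                                 # all mass in a single central group
p_eff = lds_effective_density(p, sigma=1.0, ks=2)
```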
in the XGBoost algorithm, the regression-tree loss function is chosen as the square loss; after the effective density distribution is obtained, prediction is improved by re-weighting:
$$w_i \propto \frac{1}{\tilde{p}(y_i)} \qquad (17)$$
specifically, we weight the loss function by multiplying it by the inverse of the effective density of each training sample; the resulting loss function is:
$$\tilde{l}\left(y_i, \hat{y}_i\right) = w_i\, l\left(y_i, \hat{y}_i\right) = w_i \left(y_i - \hat{y}_i\right)^{2} \qquad (18)$$
wherein: $\tilde{l}$ represents the re-weighted loss function.
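The inverse-density re-weighting can be wired into gradient boosting either through per-sample weights or a custom objective; a minimal sketch, where the normalization of the weights to mean 1 is a common convention assumed here rather than specified by the patent:

```python
import numpy as np

def lds_sample_weights(b, p_eff, eps=1e-8):
    """w_i proportional to 1 / effective density of sample i's label group,
    rescaled so the weights average to 1."""
    w = 1.0 / (p_eff[b] + eps)
    return w * len(w) / w.sum()

def weighted_squared_loss_grad_hess(y_true, y_pred, w):
    # gradient/hessian of the re-weighted squared loss w_i * 0.5*(y_i - y_hat_i)^2;
    # a (grad, hess) pair like this can serve as a custom boosting objective
    return w * (y_pred - y_true), w.copy()

p_eff = np.array([0.5, 0.25, 0.25])
b = np.array([0, 1, 2, 0])
w = lds_sample_weights(b, p_eff)
```

In practice the same weights could equivalently be passed as `sample_weight` when fitting a scikit-learn-style XGBoost regressor; this is a sketch of the idea, not the patent's exact implementation.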
2. The sea surface temperature daily variation amplitude prediction method based on the improved XGB method according to claim 1, wherein: the dataset comprises three-hourly average wind speed data over several days and three-hourly average short-wave radiation data.
CN202211376526.0A 2022-11-04 2022-11-04 Sea surface temperature daily variation amplitude prediction method based on improved XGB method Active CN115688588B (en)

Publications (2)

Publication Number Publication Date
CN115688588A CN115688588A (en) 2023-02-03
CN115688588B true CN115688588B (en) 2023-06-27

Family

ID=85048709


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116976149B (en) * 2023-09-22 2023-12-29 广东海洋大学 Sea surface temperature prediction method

Citations (2)

Publication number Priority date Publication date Assignee Title
CN113537336A (en) * 2021-03-10 2021-10-22 沈阳工业大学 XGboost-based short-term thunderstorm and strong wind forecasting method
CN114595624A (en) * 2022-01-10 2022-06-07 山西中节能潞安电力节能服务有限公司 Service life state prediction method of heat tracing belt device based on XGboost algorithm

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN110543929B (en) * 2019-08-29 2023-11-14 华北电力大学(保定) Wind speed interval prediction method and system based on Lorenz system
CN111340273B (en) * 2020-02-17 2022-08-26 南京邮电大学 Short-term load prediction method for power system based on GEP parameter optimization XGboost
CN113159364A (en) * 2020-12-30 2021-07-23 中国移动通信集团广东有限公司珠海分公司 Passenger flow prediction method and system for large-scale traffic station
CN113051795B (en) * 2021-03-15 2023-04-28 哈尔滨工程大学 Three-dimensional Wen Yanchang analysis and prediction method for offshore platform guarantee
CN113256066B (en) * 2021-04-23 2022-05-06 新疆大学 PCA-XGboost-IRF-based job shop real-time scheduling method
CN113743013A (en) * 2021-09-08 2021-12-03 成都卡普数据服务有限责任公司 XGboost-based temperature prediction data correction method
CN114898819A (en) * 2022-04-06 2022-08-12 中国石油大学(北京) Mixed crude oil viscosity prediction model training method and device and application method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant