CN115688588B - Sea surface temperature daily variation amplitude prediction method based on improved XGB method - Google Patents

Publication number: CN115688588B
Authority: CN (China)
Prior art keywords: model, xgb, representing, lds, tree
Legal status: Active
Application number: CN202211376526.0A
Other languages: Chinese (zh)
Other versions: CN115688588A
Inventors: 宋振亚, 冯跃玲, 肖衡, 杨晓丹, 高振
Current Assignee: First Institute of Oceanography MNR
Original Assignee: First Institute of Oceanography MNR
Application filed by First Institute of Oceanography MNR
Priority application: CN202211376526.0A; published as CN115688588A, granted as CN115688588B

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the technical field of ocean surface temperature prediction and provides a sea surface temperature daily variation amplitude prediction method based on an improved XGB method, comprising the following steps: S1: acquiring and preprocessing a data set, where the data set comprises wind speed data and short wave radiation value data; S2: establishing an XGBoost model; S3: modifying the algorithm weights of the XGBoost model by applying an LDS algorithm to establish an LDS-XGB model; S4: selecting a training set from the data set and training the LDS-XGB model with it; S5: predicting the daily variation amplitude of sea surface temperature with the trained LDS-XGB model. The invention innovatively applies the XGB algorithm, and machine learning generally, to SST daily variation amplitude prediction, and smooths the data label values with the LDS method, so that traditional imbalanced-classification techniques can be applied to regression problems.

Description

Sea surface temperature daily variation amplitude prediction method based on improved XGB method
Technical Field
The invention relates to the technical field of ocean surface temperature prediction, in particular to a sea surface temperature daily variation amplitude prediction method based on an improved XGB method.
Background
SST stands for sea surface temperature. The main methods currently used to study SST daily variation are observational studies, empirical models, and numerical simulation.
The development of marine observation has greatly advanced research on the SST daily variation process, but such research is limited by the available observation means and data, and understanding of the process remains deficient.
At present there is a gap in predicting SST daily variation with machine learning. Specifically:
(1) Although traditional empirical models can capture the basic characteristics of SST daily variation, their application range is limited and their precision is not high. Understanding of SST daily variation remains deficient, and traditional empirical models still suffer from low precision and complex calculation, so reasonable simulation and prediction of the SST daily variation process remains a challenge.
(2) Numerical simulation is an effective means of simulating and predicting the SST daily variation process, but it is limited by the state of the art of numerical modeling, and accurate simulation and prediction remain a challenge. In addition, uncertainty factors such as the uncertainty of the SST daily variation process itself and the parameterization of the model make the accuracy of simulation difficult to ensure.
(3) Machine learning is receiving increasing attention in marine environment research, simulation, and prediction, and is expected to play an important role in physically constrained studies of the SST daily variation process. Machine learning methods have achieved strong results in predicting sea temperature, but there is currently a gap in predicting SST daily variation with machine learning.
Disclosure of Invention
To solve the problems described in the Background, the invention provides a sea surface temperature daily variation amplitude prediction method based on an improved XGB method, comprising the following steps:
S1: acquiring and preprocessing a data set, where the data set comprises wind speed data and short wave radiation value data;
S2: establishing an XGBoost model;
S3: modifying the algorithm weights of the XGBoost model by applying an LDS algorithm to establish an LDS-XGB model;
S4: selecting a training set from the data set, and training the LDS-XGB model with the training set;
S5: predicting the daily variation amplitude of sea surface temperature with the trained LDS-XGB model.
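For concreteness, the S1-S5 flow can be sketched in Python. This is an illustrative sketch only: the synthetic wind, radiation, and amplitude values, the bin count, and the kernel width are assumptions, and a comment marks where the actual XGBoost fit of steps S2 and S5 would go (for example via the `sample_weight` argument of xgboost's `XGBRegressor.fit`).

```python
import numpy as np

def lds_weights(y, n_bins=30, sigma=2.0):
    """Step S3 (sketch): smooth the empirical label density with a Gaussian
    kernel (LDS) and return inverse-effective-density sample weights."""
    counts, edges = np.histogram(y, bins=n_bins)
    p_emp = counts / counts.sum()                    # empirical density p(y)
    k = np.arange(-7, 8, dtype=float)                # kernel support (assumed)
    kernel = np.exp(-k ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    p_eff = np.convolve(p_emp, kernel, mode="same")  # effective density
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
    w = 1.0 / np.maximum(p_eff[idx], 1e-12)          # inverse re-weighting
    return w * len(y) / w.sum()                      # normalise to mean 1

# S1 (sketch): synthetic 3-hourly wind speed and shortwave radiation
rng = np.random.default_rng(0)
n = 500
wind = rng.uniform(0.5, 12.0, n)
swr = rng.uniform(0.0, 1000.0, n)
dsst = np.maximum(0.0, 0.003 * swr / (1.0 + wind) + rng.normal(0.0, 0.05, n))

split = int(0.8 * n)                                 # S4: 8:2 train/test split
w_train = lds_weights(dsst[:split])

# S2/S5 (not run here): fit and predict with XGBoost, e.g.
#   xgboost.XGBRegressor().fit(X_train, y_train, sample_weight=w_train)
```

In practice the weights computed in S3 are simply handed to the booster; nothing else in the XGBoost training loop changes.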
In a preferred embodiment, the data set comprises average wind speed data at three-hour intervals over several days, and average short wave radiation data at the same three-hour intervals.
Further, the specific process of step S2 includes:
the XGBoost model is based on the current model and adds another model, so that the effect of the combined model is superior to that of the machine learning algorithm model of the current model, and the establishment process is as follows:
$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (1)$

where $\hat{y}_i$ represents the predicted value of the model, $K$ the number of decision trees, $f_k$ the $k$-th tree model, $x_i$ the $i$-th training sample, and $\mathcal{F}$ the set of all decision tree models;
an objective function is constructed and then optimized:
$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (2)$

where $n$ is the number of training samples. The objective function consists of two parts: a loss function $l$, here the mean square error, and a regularization term $\Omega$, the sum of the complexities of the individual trees, whose purpose is to control model complexity and prevent overfitting;
because the tree-ensemble model of formula (2) takes functions as parameters, it cannot be optimized directly with traditional optimization methods; instead it is trained by additive training: at each step the existing model is kept unchanged and a new function $f$ is added to the model, as follows:
$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (3)$

where $\hat{y}_i^{(t)}$ is the predicted value of the $i$-th sample after $t$ iterations and $\hat{y}_i^{(0)}$ is the initial value of the $i$-th sample;
constructing an optimal model by minimizing a loss function, and obtaining an objective function of a t-th round:
$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + cons \qquad (4)$

where $cons$ is a constant term, the complexity of the first $t-1$ trees;
a second-order Taylor expansion is applied to the objective function of the $t$-th round:
$Obj^{(t)} \simeq \sum_{i=1}^{n}\Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + cons \qquad (5)$

$g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big) \qquad (6)$

where $g_i$ and $h_i$ respectively represent the first and second derivatives of the loss with respect to $\hat{y}_i^{(t-1)}$;
since the loss $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is a fixed value, it is absorbed into the constant term $cons$; the constant term has no influence on the optimization and can therefore be removed. The objective function then depends only on the first and second derivatives of the loss at each sample point, giving the new objective function:

$Obj^{(t)} = \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) \qquad (7)$
next consider the complexity term $\Omega$ of the decision tree. First each tree is redefined, converting the tree-structure expression into a leaf-structure expression: the decision tree is split into a structure part $q$ and a leaf-weight part $\omega$:
$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1,2,\ldots,T\} \qquad (8)$

where $T$ is the total number of leaf nodes of the regression tree, $\omega$ is a $T$-dimensional vector consisting of the leaf-node values, $q(x)$ maps sample $x$ to a leaf node, and $\omega_{q(x)}$ is the score of that node, i.e. the model's predicted value for the sample;
the XGBoost complexity term for a tree consists of two parts, the total number of leaf nodes and the scores of the leaf nodes; an L2 smoothing term is added on each leaf score to avoid overfitting:

$\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \|\omega\|^2 = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \qquad (9)$

where $\|\omega\|^2$ is the squared norm of the leaf-node vector, $\gamma$ represents the difficulty of node segmentation, and $\lambda$ is the L2 regularization coefficient; the values of $\gamma$ and $\lambda$ set the penalty on trees with many leaf nodes;
re-writing the objective function according to the leaf structure:
$Obj^{(t)} = \sum_{i=1}^{n}\Big[ g_i \omega_{q(x_i)} + \tfrac{1}{2} h_i \omega_{q(x_i)}^2 \Big] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 = \sum_{j=1}^{T}\Big[ \big(\textstyle\sum_{i \in I_j} g_i\big)\,\omega_j + \tfrac{1}{2}\big(\textstyle\sum_{i \in I_j} h_i + \lambda\big)\,\omega_j^2 \Big] + \gamma T \qquad (10)$

where $I_j = \{\, i \mid q(x_i) = j \,\}$ is the set of samples on leaf node $j$;
the objective function comprises $T$ independent univariate quadratic functions; we can define:

$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \qquad (11)$

so that the objective function finally simplifies to:

$Obj^{(t)} = \sum_{j=1}^{T}\Big[ G_j \omega_j + \tfrac{1}{2}(H_j + \lambda)\,\omega_j^2 \Big] + \gamma T \qquad (12)$
taking the partial derivative with respect to the unknown variable $\omega_j$ and setting it to 0 gives the extreme point $\omega_j^* = -\dfrac{G_j}{H_j + \lambda}$; substituting $\omega_j^*$ into formula (12) gives the optimal objective function:

$Obj^* = -\tfrac{1}{2} \sum_{j=1}^{T} \dfrac{G_j^2}{H_j + \lambda} + \gamma T \qquad (13)$
this objective function measures the quality of the structure of the $t$-th tree. During splitting, a greedy algorithm traverses all candidate split points, computes the loss for each, and selects the split point with the largest gain; the smaller the resulting loss, the better the model's prediction. The final gain expression is as follows:
$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (14)$

where $\frac{G_L^2}{H_L + \lambda}$ represents the left-subtree score, $\frac{G_R^2}{H_R + \lambda}$ the right-subtree score, and $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ the score when not split; $\gamma$ represents the complexity cost introduced by adding a new node. If the gain is greater than 0 the node can be split; otherwise it is not split.
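Formulas (13) and (14) can be checked numerically. The sketch below is illustrative only: it assumes the loss $l = \tfrac{1}{2}(y - \hat{y})^2$, giving $g_i = \hat{y}_i - y_i$ and $h_i = 1$ (a common implementation convention; the patent's text states the mean square error), and uses made-up data.

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal leaf weight of formula (13): w* = -G_j / (H_j + lambda)."""
    return -G / (H + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Split gain of formula (14) for one candidate split."""
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# toy data: the current prediction is 0 for all samples
y = np.array([1.0, 1.0, 4.0, 4.0])
g = 0.0 - y            # g_i = yhat_i - y_i  (first derivative)
h = np.ones_like(y)    # h_i = 1             (second derivative)

left = np.array([True, True, False, False])   # candidate split {1,1} | {4,4}
gain = split_gain(g, h, left, lam=1.0, gamma=0.0)
w_left = leaf_weight(g[left].sum(), h[left].sum(), 1.0)
print(round(gain, 4), round(w_left, 4))   # 1.3333 0.6667
```

The positive gain means this split would be accepted, and the left leaf receives the shrunken weight $2/3$ rather than the plain mean $1$, illustrating the effect of the $\lambda$ penalty.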
Further, the specific process of step S3 includes:
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote a training set with sample size $n$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \mathcal{Y}$ is the label, with $y$ continuous;
in the label space $\mathcal{Y}$, we divide $y$ into $B$ groups at equal intervals, i.e. $[y_0, y_1), [y_1, y_2), \ldots, [y_{B-1}, y_B)$; we use $b(y) \in \mathcal{B}$ to denote the group index of a target value, with $\mathcal{B}$ the index space;
in predicting the SST daily variation amplitude, we define the group width $\Delta y \triangleq y_{b+1} - y_b$ and compute the density distribution of the label values (the SST daily variation amplitude) over the training set according to $\Delta y$; this is called the empirical density distribution. Previous studies have shown that, when label values are continuous, the empirical density distribution does not reflect the true label density distribution because of dependencies between data samples with adjacent labels; LDS uses kernel density estimation to mitigate this imbalance in the continuous dataset;
LDS uses a symmetric kernel function; here the Gaussian kernel $k(y, y') = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y - y')^2}{2\sigma^2}}$ is chosen. A symmetric kernel satisfies $k(y, y') = k(y', y)$ and $\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0$; it characterizes the similarity between the target values $y'$ and $y$. Convolving the empirical density distribution with the kernel yields a new distribution, called the effective density distribution, computed as follows:
$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy \qquad (15)$

where $p(y)$ represents the empirical density distribution and $\tilde{p}(y')$ the effective density of label value $y'$;
in the standard XGBoost algorithm the regression-tree loss function is usually the square loss $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. After the effective density distribution has been computed, the weights are adjusted by a re-weighting method before prediction: specifically, we weight the loss function by multiplying it by the inverse of the effective density of each training sample. The resulting loss function is:

$\tilde{l}(y_i, \hat{y}_i) = \frac{1}{\tilde{p}(y_i)}\,(y_i - \hat{y}_i)^2 \qquad (16)$

where $\tilde{l}$ represents the re-weighted loss function.
The beneficial effects achieved by the invention are as follows:
Machine learning algorithms such as Bagging and RF can also predict the SST daily variation amplitude, but their prediction error is larger than that of the XGB method and they underestimate the SST daily variation amplitude. The invention innovatively applies the XGB algorithm, and machine learning generally, to SST daily variation amplitude prediction, and smooths the data label values with the LDS method, so that traditional imbalanced-classification techniques can be applied to regression problems.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is an empirical label density versus error distribution plot;
FIG. 3 is a plot of effective label density versus error profile;
FIG. 4 is a schematic diagram of XGB model prediction results;
FIG. 5 is a schematic diagram of the LDS-XGB model prediction results.
Detailed Description
The embodiments of the invention are described below with reference to the accompanying drawings. The embodiments shown are illustrative only; the invention is not limited to them, and other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the invention.
Referring to fig. 1-5, a sea surface temperature daily variation amplitude prediction method based on an improved XGB method includes the steps of: s1: acquiring a data set and preprocessing, wherein the data set comprises wind speed data and short wave radiation value data; s2: establishing an XGBoost model; s3: modifying the algorithm weight of the XGBoost model by applying an LDS algorithm, and establishing an LDS-XGB model; s4: selecting a training set from the data set, and training the LDS-XGB model by using the training set; s5: and predicting the daily variation amplitude of the sea surface temperature by using the trained LDS-XGB model.
The specific processes of steps S2 and S3 in this embodiment are the same as described above in the Disclosure of Invention.
Example 1:
the embodiment is applied to sea surface temperature daily variation amplitude prediction, and an LDS-XGB model suitable for predicting sea surface temperature daily variation amplitude is developed. Observations during tropical ocean and global atmosphere-sea gas coupling response experiments (TOGACOARE) observation are adopted, and include parameters such as sensible heat, latent heat, short wave radiation, wind stress, sea surface temperature and the like. Buoy data of 133 stations are selected, the distribution range is 25 DEG S-21 DEG N in the global scope, the time resolution is 1 hour or 10 minutes, and the time span is 10 months in 1992 to 8 months in 2021.
The experimental data set is preprocessed and divided into training and test sets in an 8:2 ratio. The SST daily variation amplitude, the daily average wind speed, and the daily maximum short wave radiation are calculated, together with the average wind speed and the average short wave radiation for every three-hour interval. The three-hourly average wind speed and three-hourly average short wave radiation are used as inputs to predict the SST daily variation amplitude.
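The feature construction described above can be sketched as follows; the synthetic hourly series and the function names are illustrative assumptions, not the patent's actual TOGA COARE preprocessing.

```python
import numpy as np

def three_hour_means(hourly):
    """Average 24 hourly values per day into 8 three-hour block means."""
    days = np.asarray(hourly, dtype=float).reshape(-1, 24)  # one row per day
    return days.reshape(-1, 8, 3).mean(axis=2)

def sst_daily_amplitude(hourly_sst):
    """Label: daily maximum minus daily minimum SST."""
    days = np.asarray(hourly_sst, dtype=float).reshape(-1, 24)
    return days.max(axis=1) - days.min(axis=1)

# two synthetic days of hourly data
hours = np.arange(48)
wind = 5.0 + np.zeros(48)                           # constant wind speed
sst = 28.0 + np.where(hours % 24 == 14, 1.5, 0.0)   # afternoon peak of +1.5 degC

X_wind = three_hour_means(wind)      # feature matrix, shape (2, 8)
y_amp = sst_daily_amplitude(sst)     # labels [1.5, 1.5]
```

Each day thus contributes eight wind features and eight radiation features (radiation handled identically to wind) plus one amplitude label.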
First, the Pearson correlation coefficient between the empirical label density and the error distribution is calculated as -0.38, indicating only a weak correlation between them. The results are shown in FIG. 2.
The Pearson correlation coefficient between the effective label density and the error distribution is -0.56, showing that the effective label density obtained by the LDS calculation correlates well with the error distribution. The results are shown in FIG. 3.
XGBoost and LDS-XGB are each trained on the training set and then used to predict the test set. The prediction results show that the re-weighted LDS-XGB model performs well on both the training and validation sets. The prediction results of XGBoost and LDS-XGB are shown in FIG. 4 and FIG. 5.
the results of the evaluation of the fitness and prediction errors of the unmodified weight XGB model and the LDS-XGB model are shown in tables 1-2.
TABLE 1 evaluation results of SST daily Change amplitude prediction model
TABLE 2 statistics of predicted results for SST daily variation amplitude model
Tables 1-2 show that both models achieve a high degree of fit and small error values on both the training and test sets, demonstrating good performance in predicting the SST daily variation amplitude. In terms of fit, both the XGB model and the LDS-XGB model reach a fitting degree above 70%. In terms of error, using RMSE as the evaluation index, the two models reach 17.773% and 17.771% respectively. Without the modified weights, more than 99% of the predicted SST daily variation amplitudes are below 2 °C; after the weights are modified, the model can predict values above 2 °C, showing that it is effective in mitigating the data imbalance. The LDS-XGB model thus improves the prediction of high values to a certain extent.
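The evaluation metrics of Tables 1-2 can be computed as below. The "fitting degree" is assumed here to be the coefficient of determination R², which the patent does not state explicitly; the sample values are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination (assumed 'fitting degree' metric)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# made-up SST daily variation amplitudes (degrees C)
y_true = [0.2, 0.5, 1.1, 2.4]
y_pred = [0.3, 0.4, 1.0, 2.2]
print(round(rmse(y_true, y_pred), 4), round(r2(y_true, y_pred), 4))
```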
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within its scope.

Claims (2)

1. The sea surface temperature daily variation amplitude prediction method based on the improved XGB method is characterized by comprising the following steps of:
s1: acquiring a data set and preprocessing, wherein the data set comprises wind speed data and short wave radiation value data;
s2: establishing an XGBoost model;
s3: modifying the algorithm weight of the XGBoost model by applying an LDS algorithm, and establishing an LDS-XGB model;
s4: selecting a training set from the data set, and training the LDS-XGB model by using the training set;
s5: predicting the daily change amplitude of the sea surface temperature by using the trained LDS-XGB model;
the specific process of the step S2 includes:
the XGBoost model starts from the current model and adds a further model at each step, so that the combined model outperforms the current machine-learning model alone; the establishment process is as follows:

$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (1)$

where $\hat{y}_i$ represents the predicted value of the model, $K$ the number of decision trees, $f_k$ the $k$-th tree model, and $\mathcal{F}$ the set of all decision tree models;
an objective function is constructed and then optimized:
$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (2)$

where $n$ is the number of training samples. The objective function consists of two parts: a loss function $l$, here the mean square error, and a regularization term $\Omega$, the sum of the complexities of the individual trees, whose purpose is to control model complexity and prevent overfitting;
because the tree-ensemble model of formula (2) takes functions as parameters, it cannot be optimized directly with traditional optimization methods; instead it is trained by additive training: at each step the existing model is kept unchanged and a new function $f$ is added to the model, as follows:
$\hat{y}_i^{(0)} = 0$
$\hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i)$
$\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + f_2(x_i)$
……
$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (3)$

where $\hat{y}_i^{(t)}$ is the predicted value of the $i$-th sample after $t$ iterations, $\hat{y}_i^{(0)}$ is the initial value of the $i$-th sample, and $x_i$ represents the $i$-th training sample;
constructing an optimal model by minimizing a loss function, and obtaining an objective function of a t-th round:
$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + cons \qquad (4)$

where $cons$ is a constant term, the complexity of the first $t-1$ trees;
a second-order Taylor expansion is applied to the objective function of the $t$-th round:
$Obj^{(t)} \simeq \sum_{i=1}^{n}\Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + cons \qquad (5)$

$g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big) \qquad (6)$

where $g_i$ and $h_i$ respectively represent the first and second derivatives of the loss with respect to $\hat{y}_i^{(t-1)}$;
since the loss $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is a fixed value, it is absorbed into the constant term $cons$; the constant term has no influence on the optimization and can therefore be removed. The objective function then depends only on the first and second derivatives of the loss at each sample point, giving the new objective function:

$Obj^{(t)} = \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) \qquad (7)$
next consider the complexity term $\Omega$ of the decision tree. First each tree is redefined, converting the tree-structure expression into a leaf-structure expression: the decision tree is split into a structure part $q$ and a leaf-weight part $\omega$:
$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1,2,\ldots,T\} \qquad (8)$

where $T$ is the total number of leaf nodes of the regression tree, $\omega$ is a $T$-dimensional vector consisting of the leaf-node values, $q(x)$ maps sample $x$ to a leaf node, and $\omega_{q(x)}$ is the score of that node, i.e. the model's predicted value for the sample;
the XGBoost complexity term for a tree consists of two parts, the total number of leaf nodes and the scores of the leaf nodes; an L2 smoothing term is added on each leaf score to avoid overfitting:

$\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \|\omega\|^2 = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \qquad (9)$

where $\|\omega\|^2$ is the squared norm of the leaf-node vector, $\gamma$ represents the difficulty of node segmentation, and $\lambda$ is the L2 regularization coefficient; the values of $\gamma$ and $\lambda$ set the penalty on trees with many leaf nodes;
re-writing the objective function according to the leaf structure:
$$\widetilde{Obj}^{(t)} = \sum_{j=1}^{T}\left[\Big(\sum_{i \in I_j} g_i\Big)\omega_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big)\omega_j^{2}\right] + \gamma T \qquad (10)$$
wherein: $I_j = \{i \mid q(x_i) = j\}$ is the set of samples on leaf node j;
the objective function now comprises T independent univariate quadratic functions; we define:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \qquad (11)$$
so that the objective function reduces to:
$$\widetilde{Obj}^{(t)} = \sum_{j=1}^{T}\left[G_j \omega_j + \frac{1}{2}\left(H_j + \lambda\right)\omega_j^{2}\right] + \gamma T \qquad (12)$$
taking the partial derivative with respect to the unknown variable $\omega_j$ and setting it to 0 gives the extreme point
$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda} \qquad (13)$$
substituting $\omega_j^{*}$ into formula (12) yields the optimal objective function:
$$\widetilde{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T \qquad (14)$$
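The closed-form leaf weight ω_j* = −G_j/(H_j + λ) and the resulting optimal objective can be sketched directly; a minimal illustration, where G, H are the per-leaf gradient/hessian sums and `lam`, `gamma` the regularization coefficients defined above:

```python
import numpy as np

def optimal_leaf_weight(G, H, lam):
    # optimal leaf value: w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def optimal_objective(Gs, Hs, lam, gamma):
    # optimal objective: -0.5 * sum_j G_j^2/(H_j+lambda) + gamma*T
    Gs, Hs = np.asarray(Gs, dtype=float), np.asarray(Hs, dtype=float)
    return -0.5 * np.sum(Gs ** 2 / (Hs + lam)) + gamma * len(Gs)

w = optimal_leaf_weight(2.0, 3.0, 1.0)
obj = optimal_objective([2.0], [3.0], lam=1.0, gamma=0.5)
```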
this objective function measures the quality of the t-th tree structure; during splitting, a greedy algorithm traverses all candidate split points, computes the loss for each, and selects the split point with the largest gain; a larger gain means a greater reduction of the loss and hence a better split. The final gain expression is as follows:
$$Gain = \frac{1}{2}\left[\frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda}\right] - \gamma \qquad (15)$$
wherein: $\frac{G_L^{2}}{H_L + \lambda}$ represents the left-subtree score, $\frac{G_R^{2}}{H_R + \lambda}$ represents the right-subtree score, and $\frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}$ represents the score when the node is not split; γ represents the complexity cost introduced by adding a new node; if the gain is greater than 0 the node is split, otherwise it is not;
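The gain-based split selection described above can be sketched as follows; a minimal NumPy illustration, where `mask_left` (marking which samples go to the left child) is an assumed helper convention rather than part of the patent:

```python
import numpy as np

def leaf_term(G, H, lam):
    # G^2 / (H + lambda): one subtree's contribution to the gain
    return G * G / (H + lam)

def split_gain(g, h, mask_left, lam=1.0, gamma=0.0):
    """Gain of splitting a node whose samples have gradients g and hessians h:
    0.5*[GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma.
    The node is split only if the gain is positive."""
    GL, HL = g[mask_left].sum(), h[mask_left].sum()
    GR, HR = g[~mask_left].sum(), h[~mask_left].sum()
    return 0.5 * (leaf_term(GL, HL, lam) + leaf_term(GR, HR, lam)
                  - leaf_term(GL + GR, HL + HR, lam)) - gamma

g = np.array([-1.0, -1.0, 1.0, 1.0])
h = np.ones(4)
gain = split_gain(g, h, np.array([True, True, False, False]), lam=0.0, gamma=0.0)
```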
the specific process of the step S3 includes:
let $D = \{(x_i, y_i)\}_{i=1}^{n}$ represent a training set with a sample size n, wherein $x_i \in \mathbb{R}^{d}$ represents an input and $y_i \in \mathbb{R}$ represents a label, y being continuous;
in the label space $\mathcal{Y}$, we divide $\mathcal{Y}$ into B groups at equal intervals, i.e. $[y_0, y_1), [y_1, y_2), \dots, [y_{B-1}, y_B)$; we use $b \in \mathcal{B}$ to represent the group index of a target value, with $\mathcal{B} = \{1, 2, \dots, B\}$ representing the index space;
in the prediction of the SST daily variation amplitude, we define the group interval $\Delta y \triangleq y_{b+1} - y_b$; the density distribution of the label value (the SST daily variation amplitude) in the training set is computed at resolution Δy and is called the empirical density distribution; previous studies have shown that, when the label values are continuous, the empirical density distribution of the labels does not reflect the true label density distribution, owing to the dependence between data samples with adjacent labels; LDS (label distribution smoothing) therefore uses kernel density estimation to mitigate the imbalance of the continuous dataset;
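The empirical density distribution over B equal-interval groups can be computed with a simple histogram; a minimal sketch, where `n_bins` stands in for B:

```python
import numpy as np

def empirical_density(y, n_bins=10):
    """Normalized histogram of label values over B equal-interval groups,
    plus each sample's group index b."""
    counts, edges = np.histogram(y, bins=n_bins)
    p = counts / counts.sum()                                 # empirical density p(y)
    b = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)   # group index per sample
    return p, b

y = np.array([0.1, 0.2, 0.2, 0.9])
p, b = empirical_density(y, n_bins=4)
```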
LDS uses a symmetric kernel function; we choose the Gaussian kernel
$$k(y, y') = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y - y')^{2}}{2\sigma^{2}}\right)$$
a symmetric kernel satisfies $k(y, y') = k(y', y)$ and $\nabla_{y} k(y, y') + \nabla_{y'} k(y, y') = 0$; it characterizes the similarity between the target values y' and y; convolving the empirical density distribution with this kernel yields a new distribution, called the effective density distribution:
$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy \qquad (16)$$
where p(y) represents the empirical density distribution and $\tilde{p}(y')$ represents the effective density distribution of the label value y';
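A minimal sketch of the LDS smoothing step: the empirical density p is convolved with a truncated, normalized Gaussian kernel to obtain the effective density. The kernel half-width `ks` and bandwidth `sigma` are assumed hyperparameters, not values fixed by the patent:

```python
import numpy as np

def lds_effective_density(p, sigma=2.0, ks=2):
    """Effective density: convolution of the empirical density p with a
    symmetric Gaussian kernel over a window of 2*ks+1 groups."""
    x = np.arange(-ks, ks + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()                 # normalize so probability mass is preserved
    return np.convolve(p, kernel, mode='same')

p = np.zeros(11)
p[5] = 1.0                                 # all mass in a single central group
p_eff = lds_effective_density(p, sigma=1.0, ks=2)
```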
in the XGBoost algorithm, the regression-tree loss function is chosen as the square loss; after the effective density distribution is obtained, prediction is improved by re-weighting:
$$w_i \propto \frac{1}{\tilde{p}(y_i)} \qquad (17)$$
specifically, we weight the loss function by multiplying it by the inverse of the effective density of each training sample; the resulting loss function is:
$$\tilde{l}\left(y_i, \hat{y}_i\right) = w_i\, l\left(y_i, \hat{y}_i\right) = w_i \left(y_i - \hat{y}_i\right)^{2} \qquad (18)$$
wherein: $\tilde{l}$ represents the re-weighted loss function.
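The inverse-density re-weighting can be wired into gradient boosting either through per-sample weights or a custom objective; a minimal sketch, where the normalization of the weights to mean 1 is a common convention assumed here rather than specified by the patent:

```python
import numpy as np

def lds_sample_weights(b, p_eff, eps=1e-8):
    """w_i proportional to 1 / effective density of sample i's label group,
    rescaled so the weights average to 1."""
    w = 1.0 / (p_eff[b] + eps)
    return w * len(w) / w.sum()

def weighted_squared_loss_grad_hess(y_true, y_pred, w):
    # gradient/hessian of the re-weighted squared loss w_i * 0.5*(y_i - y_hat_i)^2;
    # a (grad, hess) pair like this can serve as a custom boosting objective
    return w * (y_pred - y_true), w.copy()

p_eff = np.array([0.5, 0.25, 0.25])
b = np.array([0, 1, 2, 0])
w = lds_sample_weights(b, p_eff)
```

In practice the same weights could equivalently be passed as `sample_weight` when fitting a scikit-learn-style XGBoost regressor; this is a sketch of the idea, not the patent's exact implementation.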
2. The sea surface temperature daily variation amplitude prediction method based on the improved XGB method according to claim 1, wherein: the dataset comprises three-hourly average wind speed data over several days and three-hourly average short-wave radiation data.
CN202211376526.0A 2022-11-04 2022-11-04 Sea surface temperature daily variation amplitude prediction method based on improved XGB method Active CN115688588B (en)

Publications (2)

Publication Number Publication Date
CN115688588A CN115688588A (en) 2023-02-03
CN115688588B true CN115688588B (en) 2023-06-27

Family

ID=85048709


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116976149B (en) * 2023-09-22 2023-12-29 广东海洋大学 Sea surface temperature prediction method

Citations (2)

Publication number Priority date Publication date Assignee Title
CN113537336A (en) * 2021-03-10 2021-10-22 沈阳工业大学 XGboost-based short-term thunderstorm and strong wind forecasting method
CN114595624A (en) * 2022-01-10 2022-06-07 山西中节能潞安电力节能服务有限公司 Service life state prediction method of heat tracing belt device based on XGboost algorithm

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN110543929B (en) * 2019-08-29 2023-11-14 华北电力大学(保定) Wind speed interval prediction method and system based on Lorenz system
CN111340273B (en) * 2020-02-17 2022-08-26 南京邮电大学 Short-term load prediction method for power system based on GEP parameter optimization XGboost
CN113159364A (en) * 2020-12-30 2021-07-23 中国移动通信集团广东有限公司珠海分公司 Passenger flow prediction method and system for large-scale traffic station
CN113051795B (en) * 2021-03-15 2023-04-28 哈尔滨工程大学 Three-dimensional Wen Yanchang analysis and prediction method for offshore platform guarantee
CN113256066B (en) * 2021-04-23 2022-05-06 新疆大学 PCA-XGboost-IRF-based job shop real-time scheduling method
CN113743013A (en) * 2021-09-08 2021-12-03 成都卡普数据服务有限责任公司 XGboost-based temperature prediction data correction method
CN114898819A (en) * 2022-04-06 2022-08-12 中国石油大学(北京) Mixed crude oil viscosity prediction model training method and device and application method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant