CN115688588A - Sea surface temperature daily change amplitude prediction method based on improved XGB method

Publication number: CN115688588A
Authority: CN (China)
Application number: CN202211376526.0A
Other versions: CN115688588B (en)
Other languages: Chinese (zh)
Legal status: Granted, Active (as listed; not a legal conclusion)
Inventors: 宋振亚, 冯跃玲, 肖衡, 杨晓丹, 高振
Current Assignee: First Institute of Oceanography MNR
Original Assignee: First Institute of Oceanography MNR
Prior art keywords: model, tree, xgb, function, formula
Classification: Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the technical field of ocean surface temperature prediction and provides a method for predicting the daily variation amplitude of sea surface temperature (SST) based on an improved XGB method. The method comprises the following steps. S1: acquire a data set and preprocess it, the data set comprising wind speed data and shortwave radiation data. S2: establish an XGBoost model. S3: apply the LDS algorithm to modify the weights of the XGBoost model, establishing an LDS-XGB model. S4: select a training set from the data set and train the LDS-XGB model on it. S5: predict the daily variation amplitude of the sea surface temperature with the trained LDS-XGB model. The invention applies the XGB algorithm, and machine learning more generally, to the prediction of the SST daily variation amplitude, and smooths the label values with the LDS method so that techniques developed for imbalanced classification can be applied to a regression problem.

Description

Sea surface temperature daily change amplitude prediction method based on improved XGB method
Technical Field
The invention relates to the technical field of ocean surface temperature prediction, in particular to a sea surface temperature daily change amplitude prediction method based on an improved XGB method.
Background
SST stands for sea surface temperature. The main current approaches to studying the SST daily variation are observational studies, empirical models and numerical simulation.
Advances in ocean observation have greatly promoted research on the SST daily variation process, but, limited by observation methods and data, understanding of that process remains seriously incomplete.
Since the British HMS Challenger began its scientific investigation of the ocean environment in 1872, ship-based observation and offshore station or buoy observation have been the mainstream means of ocean observation for nearly 150 years. Observation capability has since been greatly improved by successive advances in ocean observation technology, such as satellite remote sensing, the TAO/TRITON buoy array, Argo floats and gliders.
Among early ocean SST records, Sverdrup et al. (1942) and Roll (1965) mentioned the daytime warming of the SST. During the Tropical Ocean Global Atmosphere Coupled Ocean-Atmosphere Response Experiment (TOGA/COARE), Webster et al. (1992) observed a complete diurnal cycle in the western Pacific warm pool region, with a maximum daily variation amplitude greater than 3 ℃. Large-area satellite observation provides a new means for SST daily variation research: Stuart-Menteth et al. (2003) produced a near-global, large-scale spatio-temporal characterization of the SST daily variation, found that large daily variation amplitudes tend to occur in tropical and mid-latitude regions, and emphasized the importance of the SST daily variation for SST measurement and the marine environment. Yan et al. (2021) used global tropical moored buoy arrays to reveal how the SST daily variation affects SST variability at seasonal time scales, showing that the daily variation has a significant nonlinear effect on variability at longer time scales. Li et al. (2021) found that the daily variation of SST in the China seas is sinusoidal, established a retrieval scheme for daily mean sea temperature products from polar-orbiting satellites based on the SST daily variation, and applied it to 7716 visible infrared imaging radiometers in the China sea area.
However, ship surveys, buoys and similar platforms can only observe at single points, and Argo is mainly used for observing the ocean interior, so its ability to describe the upper ocean is insufficient. Large-area satellite observation provides a new means for SST daily variation research, but a polar-orbiting satellite observes the same location at most twice a day and therefore cannot resolve the daily variation process, while geostationary satellites are limited by cloud and rain, so continuous and effective SST observation remains difficult. Marine scientists have carried out a great deal of research on the SST daily variation process based on the above observations, especially the hourly-resolution observations of SST, shortwave radiation, wind speed, wind direction and other variables distributed over 140°E-180°W, 10°S-10°N in 1985 provided by TOGA/COARE. Nevertheless, the SST daily variation is a complex process: on the one hand, the ocean surface is the lower boundary of the atmosphere and the SST directly affects weather and climate; on the other hand, the SST is itself affected and controlled by various atmospheric and oceanic processes. Understanding of the SST daily variation therefore remains seriously incomplete.
A series of empirical models have been developed that capture the basic characteristics of the SST daily variation, but their range of application is limited and their accuracy is low.
Price et al. (1987) proposed an early, representative empirical model that uses wind stress and ocean heat flux to estimate the SST daily variation amplitude. Webster et al. (1996) then proposed a widely used diagnostic model for the daily variation amplitude of the skin SST, based on the analysis of a large amount of observational data. Kawai and Kawamura (2002), using buoy data from tropical and mid-latitude regions, found that precipitation has no obvious influence on the SST daily variation and rewrote the formula of Webster et al. with the daily mean precipitation term removed. Zeng et al. (1999) and Gentemann et al. (2003) each proposed empirical models that estimate the hourly change of the SST relative to its daily minimum. Understanding of the SST daily variation is still seriously incomplete, and the traditional empirical models still suffer from low accuracy and complex calculation, so reasonable simulation and prediction of the SST daily variation process remains a challenge.
Numerical simulation is an effective means of simulating and predicting the SST daily variation process, but it is limited by the level of development of numerical models, and accurate simulation and prediction remains a challenge.
With the rapid development of computer technology since the middle of the last century, numerical models have been adopted quickly and widely in many fields, and scientists have carried out many studies that simulate the SST daily variation with numerical dynamical models. Ocean mixed-layer models, ocean circulation models and coupled ocean-atmosphere models have all been applied to the simulation of the SST daily variation. Ocean mixed-layer models fall into three basic types: multilayer threshold models, bulk mixed-layer models and multilayer turbulence models. One-dimensional mixed-layer models have a clear physical basis, a small computational cost and are easy to run, but they can only simulate the vertical variation of temperature, salinity and currents and cannot describe their horizontal variation. Ocean circulation models and coupled ocean-atmosphere models simulate the SST daily variation in two ways, either directly with the ocean circulation model or with a mixed-layer model nested in the ocean circulation model. In both cases the ocean model needs high vertical resolution (layers of about 1 m in the upper 50 m), which greatly increases the computational cost and can also make the model integration unstable because the vertical layers are too thin. Yan et al. were the first to apply the vertical subgrid parameterization scheme proposed by Schiller and Godfrey (2005) to the three-dimensional ocean circulation model NEMO, and simulated the SST daily variation process accurately over the global domain. Because the vertical subgrid parameterization scheme places low demands on computation, long integrations are possible.
In addition, because the SST daily variation process itself is not well understood and the models themselves contain uncertainties such as parameterization, simulation accuracy is difficult to guarantee.
Machine learning is receiving increasing attention in marine environment research, simulation and prediction, and is expected to play an important role in research on the SST daily variation process under physical constraints.
With the rapid development of deep learning, data-driven methods based on deep learning models are receiving increasing attention in the prediction of marine environmental variables. In 2007, Elisa et al. used an artificial neural network (ANN) to analyze the sea surface temperature in the western Mediterranean Sea; the model predicted the seasonal and interannual variation of sea temperature in that area well. In 2017, Qin et al. used an LSTM model to predict sea surface temperature, with a network architecture consisting of an LSTM layer and a fully connected dense layer, and verified the method on the coastal seas of China. In the same year, Jiang et al. analyzed the influence of temperature, salinity and geographical location on the thermocline and proposed an improved thermocline selection model based on an entropy method, which can effectively predict temperature changes. In 2019, Xiao et al. built an LSTM model from 36 years of satellite sea surface temperature data for the eastern and central China seas; the model gives good daily predictions of the short- and medium-term sea surface temperature field. In 2020, Xu et al. proposed the M-LCNN prediction model, which decomposes and reconstructs the time series with a wavelet transform to predict sea surface temperature variation over multiple time scales. In the same year, He et al. built the SSTP model, which uses a local search strategy and is suited to predicting long sea temperature time series. At present, however, no work applies machine learning to the prediction of the SST daily variation.
In summary:
(1) Although the traditional empirical models capture the basic characteristics of the SST daily variation, their range of application is limited and their accuracy is low. Understanding of the SST daily variation is still seriously incomplete, and the traditional empirical models still suffer from low accuracy and complex calculation, so reasonable simulation and prediction of the SST daily variation process remains a challenge.
(2) Numerical simulation is an effective means of simulating and predicting the SST daily variation process, but it is limited by the level of development of numerical models, and accurate simulation and prediction remains a challenge. In addition, because the SST daily variation process itself is not well understood and the models themselves contain uncertainties such as parameterization, simulation accuracy is difficult to guarantee.
(3) Machine learning is receiving increasing attention in marine environment research, simulation and prediction, and is expected to play an important role in research on the SST daily variation process under physical constraints. Machine learning methods have achieved strong results in sea temperature prediction and related tasks, but no work yet applies machine learning to the prediction of the SST daily variation.
Disclosure of Invention
To solve the problems described in the background, the invention provides a method for predicting the sea surface temperature daily variation amplitude based on an improved XGB method, comprising the following steps:
S1: acquiring a data set and preprocessing the data set, wherein the data set comprises wind speed data and shortwave radiation data;
S2: establishing an XGBoost model;
S3: applying the LDS algorithm to modify the weights of the XGBoost model, and establishing an LDS-XGB model;
S4: selecting a training set from the data set, and training the LDS-XGB model with the training set;
S5: predicting the daily variation amplitude of the sea surface temperature with the trained LDS-XGB model.
In a preferred scheme, the data set comprises, for a plurality of days, the average wind speed over every three hours and the average shortwave radiation over every three hours.
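The three-hourly averaging in the preferred scheme can be sketched as follows. This is a minimal illustration that assumes hourly input series with 24 values per day; the function names `block_means` and `daily_features` and the synthetic values are hypothetical and do not come from the patent.

```python
def block_means(values, block=3):
    """Average consecutive, non-overlapping blocks of `block` samples."""
    if len(values) % block != 0:
        raise ValueError("series length must be a multiple of the block size")
    return [sum(values[i:i + block]) / block
            for i in range(0, len(values), block)]


def daily_features(wind_hourly, radiation_hourly):
    """Build one feature vector per day: 8 wind means, then 8 radiation means.

    Assumes 24 hourly values per day for each variable (assumption of this sketch).
    """
    return block_means(wind_hourly) + block_means(radiation_hourly)


# Synthetic hourly data for one day (illustrative values only).
wind = [4.0 + (hour % 3) for hour in range(24)]                               # m/s
radiation = [max(0.0, 900.0 - 150.0 * abs(hour - 12)) for hour in range(24)]  # W/m^2

features = daily_features(wind, radiation)
print(len(features))
```

Each day then contributes one row of 16 features, which keeps the input dimension small compared with using the raw hourly series.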
Further, the specific process of step S2 is as follows.
XGBoost is a machine learning algorithm that adds a new model on top of the current model so that the combined model performs better than the current one. The model is built as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \tag{1}$$
where $\hat{y}_i$ is the model prediction, $K$ is the number of decision trees, $f_k$ is the $k$-th tree model, $x_i$ is the $i$-th training sample, and $\mathcal{F}$ is the set of all decision tree models.
An objective function is constructed and then optimized:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$
where $n$ is the number of training samples. The objective function consists of two parts: a loss function $l$, generally the mean squared error, and a regularization term $\Omega$, the sum of the complexities of the individual trees, which controls model complexity and prevents overfitting.
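As a rough illustration of the additive tree model and the data-fit part of the objective described above, the sketch below sums the outputs of two hand-written one-split trees. The helper names (`make_stump`, `predict`, `data_loss`), the thresholds and the leaf values are all hypothetical, and the regularization term is left out for brevity.

```python
def make_stump(feature, threshold, left_value, right_value):
    """Return a one-split regression tree as a plain function of a sample x."""
    def tree(x):
        return left_value if x[feature] < threshold else right_value
    return tree


def predict(trees, x):
    """Ensemble prediction: the sum of the outputs of all trees."""
    return sum(tree(x) for tree in trees)


def data_loss(trees, X, y):
    """Sum of squared errors over the samples (loss part of the objective only)."""
    return sum((yi - predict(trees, xi)) ** 2 for xi, yi in zip(X, y))


# Two illustrative trees: one splits on wind speed, one on shortwave radiation.
trees = [make_stump(0, 5.0, 0.2, 0.6), make_stump(1, 400.0, -0.1, 0.5)]
X = [[3.0, 600.0], [8.0, 200.0]]   # [wind speed, shortwave radiation]
y = [0.8, 0.4]                     # illustrative daily amplitudes

print(predict(trees, X[0]), data_loss(trees, X, y))
```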
The tree ensemble model in formula (2) takes functions as its parameters, so it cannot be optimized directly with traditional optimization methods; it is instead trained additively (additive learning). At each step the existing model is kept unchanged and a new function $f$ is added to it:
$$\hat{y}_i^{(0)} = 0, \quad \hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i), \quad \ldots, \quad \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{3}$$
where $\hat{y}_i^{(t)}$ is the prediction for the $i$-th sample after $t$ iterations and $\hat{y}_i^{(0)}$ is the initial value for the $i$-th sample.
The optimal model is constructed by minimizing the loss function, which gives the objective function of round $t$:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{cons} \tag{4}$$
where $\mathrm{cons}$ is a constant term, namely the complexity of the first $t-1$ trees.
A second-order Taylor expansion of the round-$t$ objective gives:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{cons} \tag{5}$$
$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \quad h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2} \tag{6}$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$.
Because the loss $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is a fixed value, it is absorbed into the constant term $\mathrm{cons}$. The constant term has no influence on the optimization and can be removed. The objective then depends only on the first and second derivatives of the loss function at each sample point, which gives the new objective:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{7}$$
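For the squared loss, the first and second derivatives used in the second-order expansion have a simple closed form, and the expansion is exact because the loss is quadratic. The sketch below checks this numerically; the function names and the sample values are illustrative assumptions.

```python
def squared_loss(y, y_hat):
    """Squared loss l(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2


def grad_hess(y, y_prev):
    """First and second derivatives of the squared loss w.r.t. the previous prediction."""
    g = 2.0 * (y_prev - y)
    h = 2.0
    return g, h


def taylor_loss(y, y_prev, f):
    """Second-order expansion of l(y, y_prev + f) around y_prev."""
    g, h = grad_hess(y, y_prev)
    return squared_loss(y, y_prev) + g * f + 0.5 * h * f * f


# Illustrative values: label, previous-round prediction, new tree output.
y, y_prev, f = 1.5, 0.9, 0.4
exact = squared_loss(y, y_prev + f)
approx = taylor_loss(y, y_prev, f)
print(exact, approx)
```

For losses that are not quadratic the two values would differ, and the expansion would only be an approximation around the previous prediction.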
Next consider the complexity term $\Omega$ of the decision tree. Each tree is first defined by converting its tree-structure expression into a leaf-structure expression, which divides the decision tree into a structure part $q$ and a leaf weight part $\omega$:
$$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1, 2, \ldots, T\} \tag{8}$$
where $T$ is the total number of leaf nodes of the regression tree, $\omega$ is a $T$-dimensional vector of leaf node values, $q(x)$ is the leaf node on which sample $x$ falls, and $\omega_{q(x)}$ is the score of that node, i.e. the model prediction for the sample.
The complexity term of a tree in the XGBoost algorithm has two parts: the total number of leaf nodes and the leaf node scores. An L2 smoothing term is applied to the score of each leaf node to avoid overfitting:
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \tag{9}$$
where $\sum_{j=1}^{T} \omega_j^2$ is the squared norm of the leaf node vector, $\gamma$ represents the difficulty of splitting a node, and $\lambda$ is the L2 regularization coefficient. The values of $\gamma$ and $\lambda$ set the penalty on trees with more leaf nodes.
Rewriting the objective function according to the leaf structure:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i \omega_{q(x_i)} + \frac{1}{2} h_i \omega_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) \omega_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) \omega_j^2 \right] + \gamma T \tag{10}$$
where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples on leaf node $j$.
The objective function consists of $T$ independent univariate quadratic functions. Define:
$$G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i \tag{11}$$
The objective function then simplifies to:
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T \tag{12}$$
Taking the partial derivative with respect to the unknown $\omega_j$ and setting it to zero gives the extreme point
$$\omega_j^* = -\frac{G_j}{H_j + \lambda}$$
Substituting $\omega_j^*$ into formula (12) gives the optimal objective function:
$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{13}$$
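The closed-form leaf weights and the resulting optimal objective can be checked with a few lines of code. This is a sketch under stated assumptions: leaves are indexed from 0 rather than 1, and the function names and the gradient values are made up for illustration.

```python
def leaf_sums(g, h, leaf_of, num_leaves):
    """Sum the gradients and Hessians of the samples that fall on each leaf."""
    G = [0.0] * num_leaves
    H = [0.0] * num_leaves
    for gi, hi, j in zip(g, h, leaf_of):
        G[j] += gi
        H[j] += hi
    return G, H


def optimal_weights(G, H, lam):
    """Leaf weight that minimizes G*w + 0.5*(H + lam)*w^2 for each leaf."""
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]


def optimal_objective(G, H, lam, gamma):
    """Objective value obtained after substituting the optimal leaf weights."""
    return -0.5 * sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * len(G)


# Three samples on a tree with two leaves (illustrative numbers).
g = [-2.0, -4.0, 6.0]
h = [2.0, 2.0, 2.0]
leaf_of = [0, 0, 1]

G, H = leaf_sums(g, h, leaf_of, num_leaves=2)
weights = optimal_weights(G, H, lam=1.0)
objective = optimal_objective(G, H, lam=1.0, gamma=0.5)
print(weights, objective)
```

A larger `lam` shrinks every leaf weight toward zero, while `gamma` adds a fixed cost per leaf to the objective.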
This objective function measures the quality of the structure of the $t$-th tree. During splitting, a greedy algorithm traverses all candidate split points, computes the loss for each, and selects the split point with the largest gain; the smaller the resulting loss, the better the model prediction. The final gain expression is:
$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{14}$$
where $\frac{G_L^2}{H_L + \lambda}$ is the score of the left subtree, $\frac{G_R^2}{H_R + \lambda}$ is the score of the right subtree, $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ is the score without splitting, and $\gamma$ is the complexity cost introduced by adding a new node. If the gain is greater than 0, the split is made; otherwise it is not.
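The greedy search over split points can be sketched for a single feature as below. This is a simplified illustration, not the XGBoost implementation: it scans the exact sorted values of one feature, and the names `best_split` and `score` and the sample data are assumptions.

```python
def score(G, H, lam):
    """Structure score G^2 / (H + lam) of a node."""
    return G * G / (H + lam)


def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan the sorted values of one feature and return (threshold, gain).

    Returns (None, 0.0) when no split has a positive gain.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    G_left = H_left = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += g[i]
        H_left += h[i]
        if x[i] == x[nxt]:
            continue  # cannot split between two identical feature values
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (score(G_left, H_left, lam) + score(G_right, H_right, lam)
                      - score(G_total, H_total, lam)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, 0.5 * (x[i] + x[nxt])
    return best_threshold, best_gain


# Squared-loss gradients for labels [0, 0, 10, 10] with previous prediction 0.
x = [1.0, 2.0, 3.0, 4.0]
g = [0.0, 0.0, -20.0, -20.0]
h = [2.0, 2.0, 2.0, 2.0]

threshold, gain = best_split(x, g, h, lam=1.0, gamma=0.0)
print(threshold, gain)
```

Raising `gamma` above the best achievable gain makes the function return no split, which matches the rule that a split is made only when the gain is positive.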
Further, the specific process of step S3 is as follows.
Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a training set of sample size $n$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \mathbb{R}$ is the label; $y$ is continuous.
The label space $\mathcal{Y}$ is divided into $B$ groups at equal intervals, i.e. $[y_0, y_1), [y_1, y_2), \ldots, [y_{B-1}, y_B)$. We use $b \in \mathcal{B}$ to denote the group index of a target value, where $\mathcal{B} = \{1, \ldots, B\} \subset \mathbb{Z}^{+}$ is the index space.
In the prediction of the SST daily variation amplitude, we define the group width
$$\delta y \triangleq y_{b+1} - y_b$$
The density distribution of the label values (the SST daily variation amplitudes) in the training set is computed according to $\delta y$ and is called the empirical density distribution. Previous research shows that, when the label values are continuous, the empirical label density distribution does not reflect the real label density distribution, because data samples with neighbouring labels depend on one another. LDS uses kernel density estimation to mitigate the imbalance in continuous data sets.
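The empirical density over equal-width label groups can be computed with a simple histogram. The sketch below is illustrative only: the group width of 0.5, the label values and the function name `empirical_density` are assumptions, since the source does not state the group width it uses.

```python
def empirical_density(labels, y_min, delta_y, num_bins):
    """Fraction of training labels falling in each equal-width group."""
    counts = [0] * num_bins
    for y in labels:
        b = int((y - y_min) // delta_y)
        b = min(max(b, 0), num_bins - 1)   # clamp out-of-range labels to the edge groups
        counts[b] += 1
    return [c / len(labels) for c in counts]


# Illustrative daily amplitudes; most are small and only one is large.
labels = [0.1, 0.2, 0.3, 0.7, 1.6]
density = empirical_density(labels, y_min=0.0, delta_y=0.5, num_bins=4)
print(density)
```

In this toy data the third group is empty, which is the kind of gap in the empirical distribution that the kernel smoothing step is meant to fill in.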
LDS uses a symmetric kernel function; we choose the Gaussian kernel
$$k(y, y') = \exp\left( -\frac{(y - y')^2}{2 \sigma^2} \right)$$
The Gaussian kernel is a symmetric kernel satisfying $k(y, y') = k(y', y)$ and
$$\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0, \quad \forall y, y' \in \mathcal{Y}$$
It characterizes the similarity between the target values $y'$ and $y$. The kernel is then convolved with the empirical density distribution to obtain a new distribution, called the effective density distribution:
$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy \tag{15}$$
where $p(y)$ is the empirical density distribution and $\tilde{p}(y')$ is the effective density distribution at label value $y'$.
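On binned labels, the convolution of the empirical density with the kernel becomes a weighted sum over group centers. The sketch below assumes a Gaussian kernel with bandwidth `sigma` and rescales the result to sum to 1; the bandwidth, the rescaling and the function names are choices made for this illustration, not values from the source.

```python
import math


def gaussian_kernel(y, y_prime, sigma):
    """Symmetric Gaussian kernel measuring the similarity of two label values."""
    return math.exp(-((y - y_prime) ** 2) / (2.0 * sigma ** 2))


def effective_density(p, centers, sigma):
    """Discrete analogue of the kernel convolution of the empirical density p."""
    smoothed = [sum(gaussian_kernel(y, y_prime, sigma) * p_y
                    for y, p_y in zip(centers, p))
                for y_prime in centers]
    total = sum(smoothed)
    return [value / total for value in smoothed]   # rescale to sum to 1 (assumption)


# Empirical density on four groups of width 0.5, with an empty third group.
p = [0.6, 0.2, 0.0, 0.2]
centers = [0.25, 0.75, 1.25, 1.75]
p_eff = effective_density(p, centers, sigma=0.5)
print(p_eff)
```

After smoothing, the empty group receives some density from its neighbours, so nearby labels are treated as related rather than independent.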
In the general XGBoost algorithm, the loss function of the regression tree is usually the squared loss:
$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 \tag{16}$$
After the effective density distribution has been computed, the loss is re-weighted for prediction. Specifically, the loss function is weighted by multiplying it by the inverse of the effective density distribution of each training sample. The resulting loss function is:
$$\tilde{l}(y_i, \hat{y}_i) = \frac{1}{\tilde{p}(y_i)} (y_i - \hat{y}_i)^2 \tag{17}$$
where $\tilde{l}$ is the re-weighted loss function.
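The inverse-density weighting can be sketched as follows. The helper names, the small floor `eps`, the rescaling of the weights to mean 1 and the density values are assumptions of this illustration. With the xgboost Python package, weights of this kind could plausibly be supplied through the `sample_weight` argument of `XGBRegressor.fit` or the `weight` argument of `DMatrix`, although the source does not say how the weighting was implemented.

```python
def lds_weights(labels, eff_density, y_min, delta_y, eps=1e-6):
    """Weight each sample by the inverse effective density of its label group."""
    weights = []
    for y in labels:
        b = int((y - y_min) // delta_y)
        b = min(max(b, 0), len(eff_density) - 1)
        weights.append(1.0 / max(eff_density[b], eps))   # eps guards against zero density
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]            # rescale to mean 1 (assumption)


def weighted_grad_hess(y, y_hat, w):
    """Derivatives of the re-weighted squared loss w * (y - y_hat)^2 w.r.t. y_hat."""
    return 2.0 * w * (y_hat - y), 2.0 * w


# Illustrative effective density on four groups of width 0.5.
eff_density = [0.5, 0.3, 0.15, 0.05]
labels = [0.1, 1.6]   # one common small amplitude, one rare large amplitude

weights = lds_weights(labels, eff_density, y_min=0.0, delta_y=0.5)
print(weights, weighted_grad_hess(1.0, 0.0, 2.0))
```

Samples with rare labels get larger weights, so their errors contribute more to the gradient and Hessian sums that drive leaf weights and split gains.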
The invention has the following beneficial effects:
Machine learning algorithms such as Bagging and random forest (RF) can also be used to predict the SST daily variation amplitude, but their prediction error is larger than that of the XGB method and they underestimate the amplitude. The invention applies the XGB algorithm, and machine learning more generally, to the prediction of the SST daily variation amplitude, and smooths the label values with the LDS method so that techniques developed for imbalanced classification can be applied to a regression problem.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a graph of empirical label density versus error;
FIG. 3 is a graph of effective label density versus error distribution;
FIG. 4 is a graphical representation of the predicted results of the XGB model;
FIG. 5 is a diagram showing the predicted result of the LDS-XGB model.
Detailed Description
The technical solutions of the present invention will be clearly and completely described below with reference to the drawings of the present invention, and the forms of the respective structures described in the following embodiments are merely examples, and the present invention is not limited to the respective structures described in the following embodiments, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts belong to the scope of the present invention.
Referring to FIGS. 1-5, a sea surface temperature daily variation amplitude prediction method based on an improved XGB method includes the following steps. S1: acquiring a data set and preprocessing the data set, wherein the data set comprises wind speed data and shortwave radiation data. S2: establishing an XGBoost model. S3: applying the LDS algorithm to modify the weights of the XGBoost model, and establishing an LDS-XGB model. S4: selecting a training set from the data set, and training the LDS-XGB model with the training set. S5: predicting the daily variation amplitude of the sea surface temperature with the trained LDS-XGB model.
The specific process of the step S2 includes:
the XGboost model is a machine learning algorithm model which is added into another model based on the current model, so that the effect of the combined model is better than that of the current model, and the building process is as follows:
Figure BDA0003926796090000101
in the formula (I), the compound is shown in the specification,
Figure BDA0003926796090000102
representing the predicted value of the model, K representing the number of decision trees, f k Representing the kth tree model, x i Which represents the (i) th training sample,
Figure BDA0003926796090000103
representing a set of all decision tree models;
an objective function is constructed and then optimized:
Figure BDA0003926796090000111
in the formula, n is the number of training samples; the objective function is composed of two parts, one part is a loss function l, generally a mean square error, and the other part is a regularization term omega, namely the sum of the complexity of each tree, so as to control the complexity of the model and prevent overfitting;
the tree set model of the formula (2) takes a function as a parameter, so that the traditional optimization method cannot be directly used for optimization, and an additive learning mode (additive learning) is adopted for training; each time the original model is kept unchanged, a new function f is added to the model as follows:
Figure BDA0003926796090000112
in the formula:
Figure BDA0003926796090000113
iterating the ith sample for t times to obtain a predicted value;
Figure BDA0003926796090000114
is the initial value of the ith sample;
constructing an optimal model by minimizing a loss function to obtain an objective function of the t-th round:
Figure BDA0003926796090000115
in the formula, cons is a constant term and is the complexity of the first t-1 trees;
and performing second-order Taylor expansion on the target function of the t-th round:
Figure BDA0003926796090000116
Figure BDA0003926796090000121
in the formula: g i ,h i Respectively representing the objective function pair
Figure BDA0003926796090000122
First and second derivatives of;
due to loss function
Figure BDA0003926796090000123
Is a fixed value and is therefore incorporated into the constant term cons; the constant term has no influence on the optimization solution, so that the constant term can be removed; the target function depends only on the first and second derivatives of each sample point on the loss function, resulting in a new target function:
Figure BDA0003926796090000124
next consider the complexity term Ω of the decision tree; firstly, defining each tree, and converting a tree structure expression into a leaf structure expression; dividing the decision tree into a structure part q and a leaf weight part omega;
f t (x)=ω q(x) ,ω∈R T ,q:R d →{1,2,…,T} (8)
in the formula: t is the total number of leaf nodes of the regression tree, ω is a T-dimensional vector composed of values of the leaf nodes, q (x) represents that a sample x is on a certain leaf node, and ω is q (x) Is the score of the node, i.e. the model prediction value of the sample;
the complexity term of the tree in the XGboost algorithm comprises two parts, one part is the total number of leaf nodes, the other part is the score of the leaf node, and an L2 smoothing term is added aiming at the score of each leaf node to avoid overfitting;
Figure BDA0003926796090000125
in the formula:
Figure BDA0003926796090000126
modulo a leaf node vector; gamma represents the difficulty of node segmentation, lambda represents an L2 regularization coefficient, and the values of gamma and lambda represent the punishment to the tree with more leaf nodes;
rewriting the objective function according to the leaf structure:
Figure BDA0003926796090000131
in the formula: i is j ={i|q(x i ) = j } is the set of samples on leaf node j;
the target function comprises T independent univariate quadratic functions; we can define:
Figure BDA0003926796090000132
the final objective function is simplified as:
Figure BDA0003926796090000133
for unknown variable omega j Calculating a partial derivative, making the derivative be 0, obtaining an input loss function after an extreme point, and obtaining an extreme value
Figure BDA0003926796090000134
Will be provided with
Figure BDA0003926796090000135
Substituting formula (12) to obtain an optimal objective function:
Figure BDA0003926796090000136
the objective function is used for measuring the quality of the t tree structure, all the division points are traversed by using a greedy algorithm in the splitting process, loss values are respectively calculated, then the division point with the maximum gain value is selected, and the smaller the maximum value of the gain loss is, the better the model prediction is represented; the final gain expression is as follows:
Figure BDA0003926796090000137
in the formula:
Figure BDA0003926796090000138
the score of the left sub-tree is represented,
Figure BDA0003926796090000139
the score of the right sub-tree is represented,
Figure BDA00039267960900001310
denotes the score without segmentation and λ denotes the complexity cost introduced by adding a new node. If the judgment value is larger than 0, the segmentation can be carried out, otherwise, the segmentation is not carried out.
Further, the specific process of step S3 includes:
is provided with
Figure BDA0003926796090000141
Representing a training set of sample size n, wherein
Figure BDA0003926796090000142
The input is represented by a representation of the input,
Figure BDA0003926796090000143
represents a label, y is of continuous type;
in the label space
Figure BDA00039267960900001412
In, we will
Figure BDA00039267960900001413
Divided into B groups at equal intervals,
i.e. [ y ] 0 ,y 1 ),[y 1 ,y 2 ),…,[y B-1 ,y B ) (ii) a We use
Figure BDA0003926796090000144
To represent a group index of the target value by
Figure BDA0003926796090000145
Representing an index space;
in the prediction of SST daily variation amplitude, we define
Figure BDA0003926796090000146
Calculating density distribution of SST daily variation amplitude of the label value in the training set according to delta y, and calling the density distribution as empirical density distribution; previous research shows that the empirical density distribution of the labels cannot reflect the real label density distribution under the condition that the label values are continuous because of the dependency between the data samples on the adjacent labels; LDS uses kernel density estimation to improve imbalance in continuous data sets;
LDS uses a symmetric kernel function; we choose the Gaussian kernel function
Figure BDA0003926796090000147
The Gaussian kernel is a symmetric kernel satisfying $k(y, y') = k(y', y)$ and $\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0,\ \forall y, y' \in \mathcal{Y}$; it characterizes the similarity between the target values y' and y. The empirical density distribution is then convolved with the kernel to obtain a new distribution, called the effective density distribution, which is calculated as:
$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy$$
where p(y) denotes the empirical density distribution and $\tilde{p}(y')$ denotes the effective density distribution of the label value y';
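The convolution step can be sketched in discrete form as follows. This is a minimal illustration that assumes an unnormalised Gaussian kernel exp(-(y - y')^2 / (2σ^2)), a bandwidth σ chosen by hand, and a final renormalisation so that the smoothed values sum to 1; these choices, the function names and the toy density are assumptions, not values taken from the patent.

```python
import math


def gaussian_kernel(y, y_prime, sigma):
    """Symmetric kernel: k(y, y') == k(y', y), largest when y == y'."""
    return math.exp(-((y - y_prime) ** 2) / (2.0 * sigma ** 2))


def effective_density(empirical, bin_centers, sigma):
    """Discrete convolution of the empirical density with the kernel.

    For every bin centre y', sum k(y, y') * p(y) over all bins y, then
    renormalise so that the result sums to 1.
    """
    smoothed = []
    for y_prime in bin_centers:
        smoothed.append(sum(gaussian_kernel(y, y_prime, sigma) * p
                            for y, p in zip(bin_centers, empirical)))
    total = sum(smoothed)
    return [s / total for s in smoothed]


# Toy empirical density with an empty bin between two populated ones.
centers = [0.5, 1.5, 2.5, 3.5]
p_emp = [0.7, 0.0, 0.3, 0.0]
p_eff = effective_density(p_emp, centers, sigma=1.0)
print(p_eff)
```

In this toy case the empty bins receive non-zero effective density from their neighbours, reflecting the dependency between samples with adjacent label values.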
in the standard XGBoost algorithm, the squared loss is usually chosen as the regression tree loss function; once the effective density distribution has been calculated, the loss is re-weighted for prediction;
Figure BDA0003926796090000151
specifically, we re-weight the loss function by multiplying it by the inverse of the effective density of each training sample's label; the resulting loss function is:
Figure BDA0003926796090000152
Figure BDA0003926796090000153
in the formula:
Figure BDA0003926796090000154
representing the re-weighted loss function.
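The re-weighting idea can be sketched as follows. The sketch assumes a squared loss of the form w_i (y_i - ŷ_i)^2 averaged over samples, weights equal to the inverse effective density scaled to mean 1, and a small floor `eps` to avoid division by zero; the exact normalisation used in the patent's formula is not recoverable here, so all of these, together with the function names and the toy density, are assumptions.

```python
def inverse_density_weights(labels, density_of, eps=1e-6):
    """Weight each sample by 1 / effective density of its label, scaled to mean 1."""
    raw = [1.0 / max(density_of(y), eps) for y in labels]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_squared_loss(y_true, y_pred, weights):
    """Mean of w_i * (y_i - p_i)^2 over all samples."""
    return sum(w * (y - p) ** 2
               for y, p, w in zip(y_true, y_pred, weights)) / len(y_true)


def weighted_grad_hess(y_true, y_pred, weights):
    """First and second derivatives of w * (y - p)^2 with respect to the prediction p."""
    grad = [2.0 * w * (p - y) for y, p, w in zip(y_true, y_pred, weights)]
    hess = [2.0 * w for w in weights]
    return grad, hess


# Hypothetical effective density: amplitudes below 2 degrees C are common, larger ones rare.
def toy_density(y):
    return 0.8 if y < 2.0 else 0.2


y_true = [0.5, 1.0, 1.5, 2.5]
y_pred = [1.0, 1.0, 1.0, 1.0]
weights = inverse_density_weights(y_true, toy_density)
print(weights)
print(weighted_squared_loss(y_true, y_pred, weights))
```

The rare high-amplitude sample receives four times the weight of the common ones, so its error contributes more to the gradient and Hessian sums used when growing each tree. When working with the XGBoost Python package, one possible route to the same effect is to pass such weights through the `sample_weight` argument of `XGBRegressor.fit`; whether the patent's implementation does so is not stated.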
Example 1:
This embodiment is applied to the prediction of the sea surface temperature (SST) daily variation amplitude and develops an LDS-XGB model suited to that task. Observation data from the observation period of the Tropical Ocean Global Atmosphere Coupled Ocean-Atmosphere Response Experiment (TOGA COARE) are used, including parameters such as sensible heat, latent heat, short-wave radiation, wind stress and sea surface temperature. Buoy data from 133 sites are selected, distributed globally between 25°S and 21°N, with a time resolution of 1 hour or 10 minutes and a time span from October 1992 to August 2021.
The experimental data set is preprocessed and divided into a training set and a test set according to a ratio of 8. The three-hourly mean wind speed and the three-hourly mean short-wave radiation are calculated at the same time, and these two quantities are used as inputs to predict the SST daily variation amplitude.
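The three-hourly averaging can be sketched as follows. The sketch assumes hourly input series and a feature layout of eight wind means followed by eight radiation means per day; that layout, the function names and the sample values are assumptions for illustration, since the text only states that three-hourly means are used as input.

```python
def three_hour_means(hourly_values):
    """Average consecutive blocks of three hourly values; a trailing partial block is dropped."""
    n_blocks = len(hourly_values) // 3
    return [sum(hourly_values[3 * b: 3 * b + 3]) / 3.0 for b in range(n_blocks)]


def daily_features(hourly_wind, hourly_swr):
    """Concatenate the three-hourly wind means and short-wave radiation means for one day."""
    return three_hour_means(hourly_wind) + three_hour_means(hourly_swr)


# Hypothetical single day: 24 hourly wind speeds (m/s) and short-wave radiation values (W/m^2).
wind = [float(hour) for hour in range(24)]
swr = [0.0] * 6 + [300.0] * 12 + [0.0] * 6
features = daily_features(wind, swr)
print(len(features), features)
```

For 10-minute data the same idea applies with blocks of 18 values instead of 3.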
First, the Pearson correlation coefficient between the empirical label density and the error distribution is calculated to be -0.38, indicating a weak correlation between the two. The results are shown in FIG. 2.
The Pearson correlation coefficient between the effective label density and the error distribution is -0.56, which shows that the effective label density calculated by LDS correlates well with the error distribution. The results are shown in FIG. 3.
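For reference, the Pearson correlation coefficient between a per-bin label density and a per-bin error can be computed as in the following minimal sketch. The two input lists are hypothetical toy values, not the data behind FIG. 2 or FIG. 3, and the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical per-bin values: denser bins tend to have lower error.
label_density = [0.50, 0.30, 0.15, 0.05]
bin_error = [0.10, 0.18, 0.35, 0.60]
print(pearson_r(label_density, bin_error))
```

A strongly negative coefficient means that bins with few samples tend to have large errors, which is the relationship the re-weighting relies on.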
XGBoost and LDS-XGB are each trained on the training set and then used to predict the test set. The prediction results show that the re-weighted LDS-XGB model performs well on both the training set and the validation set. The prediction results of XGBoost and LDS-XGB are shown in FIGS. 4 and 5.
The goodness of fit and prediction errors of the XGB model with unmodified weights and of the LDS-XGB model are shown in Tables 1 and 2.
TABLE 1 SST daily variation amplitude prediction model evaluation results
Figure BDA0003926796090000161
TABLE 2 SST daily variation amplitude model prediction statistics
Figure BDA0003926796090000162
As can be seen from Tables 1 and 2, both the training set and the test set achieve a high goodness of fit and small errors, which shows that the models perform well in predicting the SST daily variation amplitude. In terms of goodness of fit, both the XGB model and the LDS-XGB model exceed 70%. In terms of error, with RMSE as the evaluation index, the two models reach 17.773% and 17.771% respectively. Without weight modification, more than 99% of the predicted SST daily variation amplitudes are below 2 °C; after weight modification, the model can predict values above 2 °C, which shows that the re-weighting helps mitigate the data imbalance. The LDS-XGB model improves the prediction of high values to a certain extent.
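The evaluation quantities discussed above can be sketched as follows. The sketch assumes that the goodness of fit is the coefficient of determination R^2, which the text does not state explicitly; the function names and the sample values are hypothetical and do not reproduce Tables 1 and 2.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, used here as an assumed measure of goodness of fit."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def fraction_above(values, threshold):
    """Share of values strictly above a threshold, e.g. predictions above 2 degrees C."""
    return sum(1 for v in values if v > threshold) / len(values)


# Hypothetical observed and predicted amplitudes in degrees Celsius.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
print(fraction_above(predicted, 2.0))
```

Comparing `fraction_above` for the two models' predictions is one simple way to check whether the re-weighted model actually produces values above 2 °C.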
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.

Claims (4)

1. A sea surface temperature daily variation amplitude prediction method based on an improved XGB method, characterized by comprising the following steps:
S1: acquiring a data set and preprocessing the data set, wherein the data set comprises wind speed data and short-wave radiation data;
S2: establishing an XGBoost model;
S3: modifying the weights of the XGBoost model by applying the LDS algorithm to establish an LDS-XGB model;
S4: selecting a training set from the data set, and training the LDS-XGB model with the training set;
S5: predicting the sea surface temperature daily variation amplitude by using the trained LDS-XGB model.
2. The sea surface temperature daily variation amplitude prediction method based on the improved XGB method as claimed in claim 1, characterized in that: the data set includes three-hourly mean wind speed data over several days and three-hourly mean short-wave radiation data.
3. The sea surface temperature daily variation amplitude prediction method based on the improved XGB method as claimed in claim 1, characterized in that: the specific process of step S2 includes:
the XGBoost model is a machine learning model that adds another model on top of the current model so that the combined model performs better than the current model; it is established as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \tag{1}$$
where $\hat{y}_i$ denotes the predicted value of the model, K denotes the number of decision trees, $f_k$ denotes the k-th tree model, $x_i$ denotes the i-th training sample, and $\mathcal{F}$ denotes the set of all decision tree models;
an objective function is constructed and then optimized:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$
where n is the number of training samples; the objective function consists of two parts: a loss function l, generally the mean squared error, and a regularization term Ω, i.e. the sum of the complexities of all trees, which controls the complexity of the model and prevents overfitting;
the tree ensemble model in formula (2) takes functions as parameters, so traditional optimization methods cannot be applied directly, and the model is trained additively (additive learning); at each step the existing model is kept unchanged and a new function f is added to it:
$$\hat{y}_i^{(0)} = 0$$
$$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)$$
$$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$$
……
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$
where $\hat{y}_i^{(t)}$ is the predicted value of the i-th sample after t iterations, and $\hat{y}_i^{(0)}$ is the initial value of the i-th sample;
the optimal model is constructed by minimizing the loss function, which gives the objective function of the t-th round:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + cons$$
where cons is a constant term, namely the complexity of the first t-1 trees;
a second-order Taylor expansion is applied to the objective function of the t-th round:
$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + cons$$
$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \quad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$
where $g_i$ and $h_i$ denote the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$;
since the loss term $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is a fixed value, it is merged into the constant term cons; the constant term has no influence on the optimization and can therefore be removed; the objective function then depends only on the first and second derivatives of the loss function at each sample point, which gives the new objective function:
$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$
next, the complexity term Ω of the decision tree is considered; each tree is first defined by converting the tree-structure expression into a leaf-structure expression, dividing the decision tree into a structure part q and a leaf weight part ω:
$$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1, 2, \dots, T\} \tag{8}$$
where T is the total number of leaf nodes of the regression tree, ω is a T-dimensional vector composed of the leaf node values, q(x) indicates the leaf node on which sample x falls, and $\omega_{q(x)}$ is the score of that node, i.e. the model prediction for the sample;
the complexity term of a tree in the XGBoost algorithm comprises two parts: the total number of leaf nodes and the scores of the leaf nodes; an L2 smoothing term is applied to the score of each leaf node to avoid overfitting:
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \lVert \omega \rVert^2 = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2$$
where $\lVert \omega \rVert^2$ is the squared modulus of the leaf node weight vector; γ represents the difficulty of splitting a node, λ is the L2 regularization coefficient, and the values of γ and λ set the penalty on trees with more leaf nodes;
the objective function is rewritten according to the leaf structure:
$$Obj^{(t)} \approx \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right)\omega_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right)\omega_j^2\right] + \gamma T$$
where $I_j = \{i \mid q(x_i) = j\}$ is the set of samples on leaf node j;
the objective function consists of T independent univariate quadratic functions; we define:
$$G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$
the final objective function simplifies to:
$$Obj^{(t)} = \sum_{j=1}^{T}\left[G_j \omega_j + \frac{1}{2}\left(H_j + \lambda\right)\omega_j^2\right] + \gamma T \tag{12}$$
taking the partial derivative with respect to the unknown variable $\omega_j$ and setting it to 0 gives the extreme point
$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda}$$
substituting $\omega_j^{*}$ into equation (12) gives the optimal objective function:
$$Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$
the objective function measures the quality of the t-th tree structure: the smaller its value, the better the model prediction; during splitting, a greedy algorithm traverses all candidate split points, calculates the loss for each, and selects the split point with the largest gain; the final gain expression is:
$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$
where $\frac{G_L^2}{H_L+\lambda}$ is the score of the left subtree, $\frac{G_R^2}{H_R+\lambda}$ is the score of the right subtree, $\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}$ is the score without splitting, and γ is the complexity cost introduced by adding a new node; if the gain is greater than 0, the node is split; otherwise it is not.
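The leaf-weight optimisation in the derivation above can be checked numerically with a minimal sketch. It assumes the standard XGBoost closed forms, with G and H the sums of first and second derivatives on a leaf; the function names and the numbers are hypothetical and serve only as a sanity check.

```python
def optimal_leaf_weight(G, H, lam):
    """Minimiser of G*w + 0.5*(H + lam)*w^2, i.e. w* = -G / (H + lam)."""
    return -G / (H + lam)


def leaf_objective(G, H, lam, w):
    """Contribution of one leaf with weight w to the objective (without the gamma term)."""
    return G * w + 0.5 * (H + lam) * w * w


def structure_score(leaf_sums, lam, gamma):
    """Optimal objective of a tree whose leaves have the given (G_j, H_j) sums."""
    return (-0.5 * sum(G * G / (H + lam) for G, H in leaf_sums)
            + gamma * len(leaf_sums))


G, H, lam = -2.0, 3.0, 1.0
w_star = optimal_leaf_weight(G, H, lam)
print(w_star, leaf_objective(G, H, lam, w_star))

# Any other weight gives a larger (worse) value for this leaf.
print(leaf_objective(G, H, lam, w_star + 0.1) > leaf_objective(G, H, lam, w_star))

print(structure_score([(-2.0, 3.0), (1.0, 1.0)], lam=1.0, gamma=0.1))
```

The value of `leaf_objective` at `w_star` equals -0.5 * G^2 / (H + lam), which is the per-leaf term that appears in the optimal objective and in the split gain.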
4. The sea surface temperature daily variation amplitude prediction method based on the improved XGB method as claimed in claim 2, characterized in that: the specific process of step S3 includes:
let $\{(x_i, y_i)\}_{i=1}^{n}$ denote a training set of sample size n, where $x_i \in \mathbb{R}^d$ denotes the input and $y_i \in \mathbb{R}$ denotes the label, y being continuous;
in the label space $\mathcal{Y}$, we divide $\mathcal{Y}$ into B groups at equal intervals, i.e. $[y_0, y_1), [y_1, y_2), \dots, [y_{B-1}, y_B)$; we use $b \in \mathcal{B}$ to denote the group index of a target value, where $\mathcal{B} = \{1, \dots, B\} \subset \mathbb{Z}^{+}$ denotes the index space;
in the prediction of SST daily variation amplitude, we define
Figure FDA0003926796080000055
the density distribution of the SST daily variation amplitude labels in the training set is calculated according to δy and is called the empirical density distribution; previous research shows that, when label values are continuous, the empirical label density does not reflect the real label density distribution, because data samples with neighboring labels depend on one another; LDS uses kernel density estimation to mitigate imbalance in continuous data sets;
LDS uses a symmetric kernel function; we choose the Gaussian kernel function
Figure FDA0003926796080000056
the Gaussian kernel is a symmetric kernel satisfying $k(y, y') = k(y', y)$ and $\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0,\ \forall y, y' \in \mathcal{Y}$; it characterizes the similarity between the target values y' and y; the empirical density distribution is then convolved with the kernel to obtain a new distribution, called the effective density distribution, which is calculated as:
$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy$$
where p(y) denotes the empirical density distribution and $\tilde{p}(y')$ denotes the effective density distribution of the label value y';
in the standard XGBoost algorithm, the squared loss is usually chosen as the regression tree loss function; once the effective density distribution has been calculated, the loss is re-weighted for prediction;
Figure FDA00039267960800000511
specifically, we re-weight the loss function by multiplying it by the inverse of the effective density of each training sample's label; the resulting loss function is:
Figure FDA0003926796080000061
Figure FDA0003926796080000062
in the formula:
Figure FDA0003926796080000063
representing the re-weighted loss function.
CN202211376526.0A 2022-11-04 2022-11-04 Sea surface temperature daily variation amplitude prediction method based on improved XGB method Active CN115688588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211376526.0A CN115688588B (en) 2022-11-04 2022-11-04 Sea surface temperature daily variation amplitude prediction method based on improved XGB method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211376526.0A CN115688588B (en) 2022-11-04 2022-11-04 Sea surface temperature daily variation amplitude prediction method based on improved XGB method

Publications (2)

Publication Number Publication Date
CN115688588A true CN115688588A (en) 2023-02-03
CN115688588B CN115688588B (en) 2023-06-27

Family

ID=85048709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211376526.0A Active CN115688588B (en) 2022-11-04 2022-11-04 Sea surface temperature daily variation amplitude prediction method based on improved XGB method

Country Status (1)

Country Link
CN (1) CN115688588B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543929A (en) * 2019-08-29 2019-12-06 华北电力大学(保定) wind speed interval prediction method and system based on Lorenz system
CN111340273A (en) * 2020-02-17 2020-06-26 南京邮电大学 Short-term load prediction method for power system based on GEP parameter optimization XGboost
CN113159364A (en) * 2020-12-30 2021-07-23 中国移动通信集团广东有限公司珠海分公司 Passenger flow prediction method and system for large-scale traffic station
CN113256066A (en) * 2021-04-23 2021-08-13 新疆大学 PCA-XGboost-IRF-based job shop real-time scheduling method
CN113537336A (en) * 2021-03-10 2021-10-22 沈阳工业大学 XGboost-based short-term thunderstorm and strong wind forecasting method
CN113743013A (en) * 2021-09-08 2021-12-03 成都卡普数据服务有限责任公司 XGboost-based temperature prediction data correction method
CN114595624A (en) * 2022-01-10 2022-06-07 山西中节能潞安电力节能服务有限公司 Service life state prediction method of heat tracing belt device based on XGboost algorithm
CN114898819A (en) * 2022-04-06 2022-08-12 中国石油大学(北京) Mixed crude oil viscosity prediction model training method and device and application method
WO2022194045A1 (en) * 2021-03-15 2022-09-22 哈尔滨工程大学 Three-dimensional temperature-salinity field analysis and forecasting method for offshore platform guarantee

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Yangfan et al.: "Sea surface temperature prediction based on XGBoost-PredRnn++", Computer Systems & Applications *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116976149A (en) * 2023-09-22 2023-10-31 广东海洋大学 Sea surface temperature prediction method
CN116976149B (en) * 2023-09-22 2023-12-29 广东海洋大学 Sea surface temperature prediction method

Also Published As

Publication number Publication date
CN115688588B (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant