CN115587916A

CN115587916A - Construction method of GPP estimation model

Info

Publication number: CN115587916A
Application number: CN202211184260.XA
Authority: CN
Inventors: 姜佳菲; 韩舸; 蔡孟阳; 应家莹; 王亨源
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2023-01-10

Abstract

The invention discloses a method for constructing a GPP estimation model, comprising the following steps: (1) acquiring solar-induced chlorophyll fluorescence (SIF) product data and the corresponding local GPP database, combining them with vegetation data and meteorological data of the research area to form a data set, preprocessing the data set, and dividing the processed data into a training set, a verification set and a test set; (2) selecting an existing GPP model, using an XGBOOST model to add SIF, vegetation and meteorological characteristic factors to the existing GPP model, and training the parameters of the XGBOOST model with the training set; (3) verifying and further adjusting the parameters of the XGBOOST model with the verification set; and (4) resampling the test set, performing bootstrap random selection, and importing the test set into the XGBOOST model obtained in step (3) for traversal iteration to obtain an optimal GPP estimation model. By taking characteristic factors such as vegetation and weather into account in the existing SIF-GPP relationship, the invention overcomes the regional limitation of the linear SIF-GPP model and improves the efficiency and accuracy of GPP estimation.

Description

Construction method of GPP estimation model

Technical Field

The invention belongs to the technical field of machine learning, and particularly relates to a construction method of a GPP estimation model.

Background

At present, verification of the carbon sink of terrestrial ecosystems relies mainly on sample-plot measurement, which is time-consuming and labor-intensive, has obvious representativeness errors, and cannot reflect the dynamic change of the ecosystem carbon sink. Developing a new method for measuring terrestrial ecosystem carbon sinks that covers large areas with high precision and low cost therefore has important academic, social and economic value. The gross primary productivity (GPP) of a terrestrial ecosystem is the total amount of organic carbon fixed by green plants through photosynthesis per unit time and unit area. The net carbon sink of the ecosystem can be obtained by deducting factors such as autotrophic respiration, heterotrophic respiration and disturbance from GPP. Mature models for calculating respiration already exist, so the main uncertainty in the net carbon sink estimate comes from the estimate of GPP. Solar-induced chlorophyll fluorescence (SIF), a by-product of vegetation photosynthesis, reflects changes in the physiological state of vegetation quickly and accurately, and in recent years has been shown to yield accurate GPP estimates at the single-site scale. However, the SIF-GPP relationship is affected by many factors, such as temperature, humidity, vegetation type, drought, and even observation time, so regional GPP estimates carry large uncertainty. The dynamic change of the SIF-GPP relationship under different stress factors therefore needs to be addressed in order to estimate GPP more accurately.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides the method for constructing the GPP estimation model, which considers the characteristic factors such as vegetation and weather in the prior SIF-GPP relationship, breaks the regional limitation of the application of the linear SIF-GPP model, and improves the efficiency and the accuracy of the GPP measurement and calculation.

In order to solve the technical problems, the invention adopts the following technical scheme:

a construction method of a GPP estimation model comprises the following steps:

step 1, acquiring sunlight induced chlorophyll fluorescence SIF product data and a local corresponding GPP database, combining the data with vegetation data and meteorological data of a research area to form a data set, preprocessing the data in the data set, eliminating abnormal data, and dividing the processed data into a training set, a verification set and a test set;

step 2, selecting an existing GPP model, using the XGBOOST model to add SIF characteristic factors, vegetation characteristic factors and meteorological characteristic factors to the existing GPP model, thereby improving it into a GPP model suitable for large-scale use, and training the parameters of the XGBOOST model with the training set;

step 3, verifying and further adjusting the parameters of the XGBOOST model with the verification set to obtain an XGBOOST model whose accuracy meets the requirement;

step 4, resampling the test set, performing bootstrap random selection, and importing the test set into the XGBOOST model obtained in step 3 for traversal iteration to obtain an optimal GPP estimation model.

Further, the vegetation data in the step 1 comprise vegetation types and soil humidity, and the meteorological data comprise precipitation, downward radiation, average temperature, maximum temperature and minimum temperature.

Further, the existing GPP model selected is:

$$G_{PP} = S_{IF} \cdot \frac{L_{UE}}{L_{UEf} \cdot f_{esc}}$$

in the formula, $G_{PP}$ is the gross primary productivity; $S_{IF}$ is the chlorophyll fluorescence; $L_{UE}$ is the light use efficiency; $L_{UEf}$ is the fluorescence quantum yield; $f_{esc}$ is the canopy escape rate of chlorophyll fluorescence;

transforming the GPP model to obtain a transformed SIF-GPP model:

$$G_{PP} = S_{IF} \cdot T_{MINscalar} \cdot V_{PDscalar} \cdot a + b$$

in the formula, $a$ and $b$ are constants obtained by model fitting; $T_{MINscalar}$ is the minimum-temperature regulation coefficient, and $V_{PDscalar}$ is the vapor pressure deficit (VPD) regulation coefficient;

and adding vegetation characteristic factors and meteorological characteristic factors into the converted SIF-GPP model by adopting an XGBOOST model.

Further, in step 2, the XGBOOST model is:

$$\hat{y}_i = \sum_{n=1}^{N} f_n(x_i), \quad f_n \in FS$$

wherein $FS$ is the set of decision trees, $x_i$ is the vector formed by the characteristic values of the i-th data, including vegetation data and meteorological data, $f_n(x_i)$ is the n-th independent decision tree, which contains the structure and weight information of the tree, $N$ is the total number of decision trees, and $\hat{y}_i$ is the predicted value of the i-th data.

Further, in step 2, the XGBOOST model is trained by ten-fold cross validation, and the loss function in the training process is:

$$L = \sum_{i=1}^{M} l(\hat{y}_i, y_i) + \sum_{n=1}^{N} \Omega(f_n)$$

wherein $M$ represents the number of training samples, $l(\hat{y}_i, y_i)$ is the loss function between the predicted value $\hat{y}_i$ and the true value $y_i$, for which the mean square error is chosen, and $\Omega(f_n)$ is the regularization term of the n-th decision tree.

Compared with the prior art, the invention has the following beneficial effects: it combines the high precision and algorithmic simplicity of SIF-based GPP estimation with the robustness and wide applicability that XGBOOST training provides, and thereby enlarges the area over which the model can be applied. The resulting GPP estimation model directly outputs a GPP estimate once the local SIF value and the other selected characteristic values (vegetation type, soil humidity, precipitation, downward radiation, average temperature, maximum temperature and minimum temperature) are input. The invention also improves regional research in China on GPP data sets based on solar-induced chlorophyll fluorescence (SIF), overcomes the regional limitation of the linear SIF model, and yields a GPP estimation model with high precision and high robustness.

Drawings

FIG. 1 is a flowchart of a method for building a GPP estimation model according to an embodiment of the present invention;

fig. 2 is a line graph of variation trends of SIF and GPP of each site in 2008-2013 according to the embodiment of the present invention; wherein, (a) is KFS site, (b) is Kon site, (c) is Ne-1 site, (d) is Ne-3 site;

FIG. 3 is a diagram illustrating the influence of various characteristic factors on the GPP result according to an embodiment of the present invention;

fig. 4 is a schematic diagram of establishing a GPP model and a GPP estimation result by using an XGBOOST model according to the embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the following embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention.

As shown in FIG. 1, the embodiment of the present invention provides a method for constructing a GPP estimation model. An XGBOOST model is trained with existing SIF data and the corresponding GPP, vegetation data and meteorological data to obtain the importance and corresponding weight of each characteristic value, and its parameters are tuned. The SIF and characteristic values of a research area are then input into the trained XGBOOST model to predict the GPP of the research area. This improves regional research in China on GPP data sets based on solar-induced chlorophyll fluorescence (SIF), overcomes the regional limitation of the linear SIF model, and yields a GPP estimation model with high accuracy and high robustness. The method specifically comprises the following steps:

step 1, acquiring solar-induced chlorophyll fluorescence (SIF) product data and the corresponding local GPP database, combining them with vegetation data and meteorological data of the research area to form a data set, preprocessing the data in the data set, and dividing the processed data into a training set, a verification set and a test set;

In this embodiment, the SIF740 product measured by the GOME-2 sensor on the MetOp-A satellite, the existing global SIF product GOSIF for 2000-2017 derived from MODIS and MERRA-2 data, a local GPP database (MODIS data may be considered where applicable), and local vegetation data and meteorological data are combined into a data set, wherein the vegetation data include vegetation type and soil humidity, and the meteorological data include precipitation, downward radiation, average temperature, maximum temperature and minimum temperature.

The data are then preprocessed, which can be done in Python. First, the data are previewed: the data file is inspected and the characteristics of each column are noted, so that appropriate types can be specified when reading, which speeds up loading. Next, codes are established for the categorical columns and converted to numbers by a create_num function. The original df_raw must be preserved, because when the model is finally interpreted the numeric codes have to be converted back into the initial category codes. After preprocessing, abnormal data are removed, and the data are converted into a format more convenient for subsequent operations and stored as the training set.
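The category encoding step can be illustrated with a minimal sketch. The text names a create_num function and a df_raw table but does not give their contents, so the body below, the plain-list data layout and the example vegetation labels are all assumptions made for illustration.

```python
def create_num(values):
    """Map category labels to integer codes (hypothetical body for create_num).

    Returns the codes plus both lookup tables so the codes can be decoded later.
    """
    categories = sorted(set(values))
    to_code = {label: idx for idx, label in enumerate(categories)}
    to_label = {idx: label for label, idx in to_code.items()}
    codes = [to_code[v] for v in values]
    return codes, to_code, to_label


# Hypothetical raw vegetation-type column; it is left untouched, like df_raw.
raw_vegetation = ["cropland", "grassland", "cropland", "forest"]

codes, to_code, to_label = create_num(raw_vegetation)

# When interpreting the model, the numeric codes are mapped back to labels.
decoded = [to_label[c] for c in codes]
```

Keeping the reverse table next to the untouched raw column is what allows feature importances or predictions to be reported in terms of the original categories.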

Bootstrap sampling, the hold-out method and cross-validation may be used here to construct the training set, verification set and test set, providing data sets for subsequent training, validation and testing.
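A hold-out split and a bootstrap draw might look like the following sketch. The 70/15/15 proportions, the seed and the function names are assumptions; the text does not state the split ratios.

```python
import random


def holdout_split(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, verification and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def bootstrap_sample(rows, seed=0):
    """Draw len(rows) items with replacement (some repeat, some are left out)."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


rows = list(range(100))          # stand-in for 100 data records
train, val, test = holdout_split(rows)
resampled_train = bootstrap_sample(train)
```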

In this embodiment, the existing reference chlorophyll fluorescence GPP estimation model is selected as:

$$G_{PP} = S_{IF} \cdot \frac{L_{UE}}{L_{UEf} \cdot f_{esc}} \qquad (1)$$

in the formula:

$G_{PP}$: gross primary productivity, g C/(m²·d)

$S_{IF}$: chlorophyll fluorescence, mW/(m²·nm·sr)

$L_{UE}$: light use efficiency, %

$L_{UEf}$: fluorescence quantum yield, %

$f_{esc}$: canopy escape rate of chlorophyll fluorescence

Since the far-red band is less affected by the canopy, $f_{esc}$ is generally taken as 1 by default. According to the calculation method of the MOD17 product:

$$L_{UE} = L_{UEmax} \cdot T_{MINscalar} \cdot V_{PDscalar} \qquad (2)$$

The transformed SIF-GPP model is constructed as follows:

$$G_{PP} = S_{IF} \cdot T_{MINscalar} \cdot V_{PDscalar} \cdot a + b \qquad (3)$$

in the formula, $a$ and $b$ are constants obtained by model fitting.
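Because formula (3) is linear in the product of SIF and the two regulation coefficients, the constants a and b can be obtained by ordinary least squares. The sketch below assumes that fitting method (the text only says "model fitting") and uses synthetic numbers generated from a = 4 and b = 1, so it demonstrates the mechanics rather than any real site data.

```python
def fit_linear(x, y):
    """Ordinary least squares for y = a * x + b with one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical daily samples of SIF and the two regulation coefficients.
sif = [0.5, 1.0, 1.5, 2.0]
tmin_scalar = [1.0, 0.8, 1.0, 0.5]
vpd_scalar = [1.0, 1.0, 0.5, 1.0]

# Predictor of formula (3): SIF * TMINscalar * VPDscalar.
x = [s * t * v for s, t, v in zip(sif, tmin_scalar, vpd_scalar)]

# Synthetic GPP built from a = 4, b = 1, so the fit should recover both.
gpp = [4.0 * xi + 1.0 for xi in x]

a, b = fit_linear(x, gpp)
```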

wherein the minimum-temperature regulation coefficient $T_{MINscalar}$ and the vapor pressure deficit regulation coefficient $V_{PDscalar}$ are calculated as follows:

$$L_{UE} = L_{UEmax} \cdot T_{MINscalar} \cdot V_{PDscalar} \qquad (4)$$

wherein,

in the formula:

$L_{UEmax}$: maximum light use efficiency, %

$T_{MINscalar}$: minimum-temperature regulation coefficient

$T_{MIN}$: daily minimum temperature, °C

$T_{MINmin}$: daily minimum temperature at which the light use efficiency of vegetation is 0, °C

$T_{MINmax}$: daily minimum temperature at which the light use efficiency of vegetation reaches its maximum, °C

$V_{PDscalar}$: vapor pressure deficit regulation coefficient

$V_{PD}$: daytime mean vapor pressure deficit, Pa

$V_{PDmin}$: daytime mean vapor pressure deficit at which the light use efficiency of vegetation is 0, Pa

$V_{PDmax}$: daytime mean vapor pressure deficit at which the light use efficiency of vegetation is at its maximum, Pa

The values of $L_{UEmax}$, $T_{MINmin}$, $T_{MINmax}$, $V_{PDmin}$ and $V_{PDmax}$ can be found in the MOD17 product description. The annual SIF-GPP variation trends from the prior art are shown in FIG. 2; under given temporal and spatial constraints, the variations of SIF and GPP are strongly consistent, which supports formula (3).
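The two regulation coefficients can be sketched as linear ramps clipped to the range 0 to 1, in the style of the MOD17 algorithm. The linear form is an assumption here, since the text defines only the end points; the function names and threshold values are hypothetical, and real thresholds would come from the MOD17 product description.

```python
def linear_ramp(value, at_zero, at_one):
    """Scale linearly so at_zero maps to 0 and at_one maps to 1, clipped to [0, 1].

    Works for rising and falling ramps because the sign of the
    denominator follows the order of the two end points.
    """
    fraction = (value - at_zero) / (at_one - at_zero)
    return max(0.0, min(1.0, fraction))


def light_use_efficiency(lue_max, tmin, vpd,
                         tmin_zero, tmin_full, vpd_zero, vpd_full):
    """Formula (2): LUE = LUEmax * TMINscalar * VPDscalar."""
    tmin_scalar = linear_ramp(tmin, tmin_zero, tmin_full)
    vpd_scalar = linear_ramp(vpd, vpd_zero, vpd_full)
    return lue_max * tmin_scalar * vpd_scalar


# Hypothetical thresholds: "zero" is where LUE vanishes, "full" where it peaks.
lue = light_use_efficiency(
    lue_max=1.0, tmin=0.0, vpd=2000.0,
    tmin_zero=-8.0, tmin_full=8.0,
    vpd_zero=4000.0, vpd_full=1000.0,
)
```

Naming each threshold after the state of the light use efficiency it marks (zero or full) avoids any dependence on which end point a given source calls "min" or "max".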

The reference chlorophyll fluorescence GPP estimation model is thus transformed into a linear SIF-GPP model (formula (3)), which is valid over a small area, and the XGBOOST model is used to add characteristic factors (vegetation and meteorological factors) to the SIF-GPP relationship so that the model is applicable over a wider range.

With the approach settled, training of the XGBOOST model begins. XGBOOST is a machine learning algorithm developed from the GBDT (Gradient Boosting Decision Tree) algorithm; building on ensemble learning, it uses gradient information to optimize an objective function and obtain an optimal solution. The XGBOOST model is composed of a series of decision trees:

$$\hat{y}_i = \sum_{n=1}^{N} f_n(x_i), \quad f_n \in FS \qquad (7)$$

wherein $FS$ (Forest Sets) is the set of decision trees, $x_i$ is the vector formed by the characteristic values of the i-th data, here comprising the local vegetation type, soil humidity, precipitation, downward radiation, average temperature, maximum temperature and minimum temperature, $f_n(x_i)$ is the n-th independent decision tree, which contains the structure and weight information of the tree, $N$ is the total number of decision trees, and $\hat{y}_i$ is the predicted value of the i-th data.
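The additive structure described above, in which the prediction is the sum of the outputs of N trees, can be shown with a toy ensemble. The depth-1 trees, their thresholds and leaf weights, and the feature names are hypothetical and are not taken from any trained model.

```python
def make_stump(feature, threshold, left_weight, right_weight):
    """Return a depth-1 tree: one split on one feature, one weight per leaf."""
    def tree(x):
        return left_weight if x[feature] < threshold else right_weight
    return tree


def predict(trees, x):
    """Ensemble prediction: the sum of the outputs of all trees."""
    return sum(tree(x) for tree in trees)


# Hypothetical feature vector x_i for one record.
x_i = {"sif": 1.2, "soil_moisture": 0.30, "tmin": 6.0}

# Hypothetical ensemble of N = 3 trees.
trees = [
    make_stump("sif", 1.0, 2.0, 5.0),
    make_stump("tmin", 0.0, -1.0, 0.5),
    make_stump("soil_moisture", 0.2, -0.5, 0.25),
]

y_hat = predict(trees, x_i)
```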

The loss function during training is defined as:

$$L = \sum_{i=1}^{M} l(\hat{y}_i, y_i) + \sum_{n=1}^{N} \Omega(f_n) \qquad (8)$$

wherein $M$ represents the number of training samples, $l(\hat{y}_i, y_i)$ is the loss function between the predicted value $\hat{y}_i$ and the true value $y_i$, for which the mean square error is chosen, and $\Omega(f_n)$ is the regularization term of the n-th decision tree.
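A regularized objective of this kind can be evaluated as in the sketch below. It assumes the regularization term customarily used by XGBoost, namely gamma times the number of leaves plus half of lambda times the sum of squared leaf weights, and it uses the per-sample squared error as the loss; neither detail is spelled out in the text at this point, and the numbers are invented.

```python
def objective(y_true, y_pred, leaf_weights_per_tree, lam=1.0, gamma=0.0):
    """Squared-error loss over all samples plus a penalty for every tree.

    leaf_weights_per_tree holds one list of leaf weights per decision tree.
    """
    data_loss = sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred))
    penalty = 0.0
    for weights in leaf_weights_per_tree:
        # Assumed form: gamma * (number of leaves) + 0.5 * lambda * sum(w^2)
        penalty += gamma * len(weights) + 0.5 * lam * sum(w * w for w in weights)
    return data_loss + penalty


y_true = [3.0, 5.0]
y_pred = [2.0, 5.0]
leaf_weights = [[1.0, -1.0], [2.0]]   # two hypothetical trees

loss_default = objective(y_true, y_pred, leaf_weights)
loss_with_gamma = objective(y_true, y_pred, leaf_weights, gamma=0.5)
```

Raising gamma charges a fixed cost per leaf, so the same fit becomes more expensive as the trees grow.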

The parameters of XGBOOST are obtained iteratively, and the loss function of the t-th iteration can be expressed as formula (9):

To simplify the derivation, let

Substituting this into formula (9) gives formula (10):

Performing a second-order Taylor expansion on formula (10) gives formula (11):

wherein $\Delta f_t(x_i) = f_t(x_i) - f_{t-1}(x_i)$. Since $l(y_i, f_{t-1}(x_i))$ is the error value of the previous iteration and is therefore a constant, formula (11) can be reduced to formula (12) after eliminating it:

wherein,

Substituting

into formula (12) gives formula (13):

Since the samples are sorted into the $T$ leaf nodes of the decision tree, formula (13) can be rewritten as formula (14):

wherein $w_j$ is the weight of the j-th leaf node; $i \in I_j$ denotes the samples assigned to leaf node j of the decision tree; $\lambda$ is a regularization parameter that keeps the leaf-node scores from becoming too large and so prevents overfitting. $\lambda$ (lambda) defaults to 1, and the larger it is, the less likely the model is to overfit. $\gamma$ is also a regularization parameter and controls the number of leaf nodes; $\gamma$ (gamma) defaults to 0 and is the minimum loss reduction required for a further split on a leaf node of the tree, so the larger its value, the more conservative the algorithm. The optimal estimate of each leaf weight is obtained by minimizing formula (14) with respect to $w$:
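For orientation, the derivation outlined above follows the customary second-order treatment of gradient-boosted trees. The block below sketches that standard derivation using the increment $\Delta f_t$ defined in the text. It is an assumption that it matches formulas (9) to (14) symbol for symbol, and the symbols $g_i$, $h_i$, $G_j$ and $H_j$ are the conventional names for the gradient terms rather than ones confirmed by the source.

```latex
% Assumed regularization term for a tree with T leaves and leaf weights w_j
\Omega(\Delta f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}

% Second-order Taylor expansion around the previous prediction f_{t-1}(x_i)
L^{(t)} \approx \sum_{i=1}^{M} \Big[\, l\big(y_i, f_{t-1}(x_i)\big)
      + g_i\,\Delta f_t(x_i) + \tfrac{1}{2} h_i\,\Delta f_t(x_i)^{2} \Big]
      + \Omega(\Delta f_t)

% First and second derivatives of the loss at the previous prediction
g_i = \frac{\partial\, l\big(y_i, f_{t-1}(x_i)\big)}{\partial f_{t-1}(x_i)},
\qquad
h_i = \frac{\partial^{2} l\big(y_i, f_{t-1}(x_i)\big)}{\partial f_{t-1}(x_i)^{2}}

% Drop the constant term and group the samples by leaf node j
\tilde{L}^{(t)} = \sum_{j=1}^{T} \Big[\, G_j w_j
      + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T,
\qquad
G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i

% Setting the derivative with respect to w_j to zero
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\tilde{L}^{(t)*} = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```

In this form the roles of the two parameters are visible directly: $\lambda$ enlarges the denominator and shrinks every leaf weight, while $\gamma$ adds a fixed cost per leaf that a split must outweigh.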

The training set is input and the XGBOOST model is trained by ten-fold cross validation: the training set is divided evenly into 10 parts; in each round, nine of the ten parts are used for training and the remaining part is used as a held-out set to verify and output the training accuracy; this is repeated ten times and the final training accuracy is output. After training, the influence of each characteristic factor on the GPP result is shown in FIG. 3, which gives an example of the weight of each characteristic factor obtained from the trained XGBOOST model; for the reference formula of the weight factor, see formula (15).
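The ten-fold procedure can be sketched with a small index splitter. The function names are hypothetical, the records are not shuffled here, and fit_and_score stands in for whatever routine trains the model on one index list and scores it on the other.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, held_out_idx); every sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


def cross_validate(n_samples, fit_and_score, k=10):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, held_out)
              for train, held_out in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(25, k=10))

# Dummy scorer used only to show the call shape: fraction of data held out.
mean_score = cross_validate(25, lambda train, held_out: len(held_out) / 25)
```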

The num_boost_round parameter of the XGBOOST model is tuned first: it is given a sufficiently large initial value and passed to the built-in xgb.cv function, which returns the best result.
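Choosing the number of boosting rounds this way amounts to reading the best round off a per-round cross-validation error curve and stopping once the error no longer improves. The sketch below imitates that logic in plain Python on an invented error curve; it does not call xgb.cv, and the function name and patience value are assumptions.

```python
def best_round(cv_errors, patience=3):
    """Return (1-based round, error) of the best round seen before stopping.

    The scan stops once `patience` consecutive rounds fail to improve
    on the best error so far.
    """
    best_idx, best_err = 0, cv_errors[0]
    for idx, err in enumerate(cv_errors):
        if err < best_err:
            best_idx, best_err = idx, err
        elif idx - best_idx >= patience:
            break
    return best_idx + 1, best_err


# Hypothetical cross-validated RMSE after each boosting round.
cv_rmse = [2.0, 1.5, 1.2, 1.1, 1.15, 1.12, 1.13, 1.0]

rounds, err = best_round(cv_rmse, patience=3)
```

With this curve the scan stops at round 4, before it reaches the later improvement, which shows why the patience value itself is a tuning choice.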

Other parameters may be tuned using the grid search tool GridSearchCV in sklearn.

The max_depth and min_child_weight parameters have a large influence on the result and are tuned first, with a coarse search over a wide range followed by a fine search over a narrow range.
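The coarse-then-fine strategy can be sketched as two passes of an exhaustive grid search. The scoring function below is a hypothetical stand-in for a cross-validated error, constructed so that its minimum lies at max_depth = 6 and min_child_weight = 4; the grid ranges are likewise illustrative.

```python
from itertools import product


def grid_search(score, depth_values, weight_values):
    """Return the (max_depth, min_child_weight) pair with the lowest score."""
    return min(product(depth_values, weight_values), key=lambda p: score(*p))


def toy_cv_error(max_depth, min_child_weight):
    # Hypothetical stand-in for a cross-validated error surface.
    return (max_depth - 6) ** 2 + (min_child_weight - 4) ** 2


# Coarse pass: wide range, large step.
coarse = grid_search(toy_cv_error, range(3, 12, 4), range(1, 10, 4))

# Fine pass: step of 1 around the coarse optimum.
d, w = coarse
fine = grid_search(toy_cv_error, range(d - 1, d + 2), range(w - 1, w + 2))
```

The coarse pass evaluates 9 combinations and the fine pass another 9, compared with 81 for a step-1 search over the full range.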

The subsample and colsample_bytree parameters may also be adjusted. 0.6 can be taken as the initial value and adjusted gradually toward 1; under-fitting can occur when the value is too small.

The tuning result is verified with the verification set and the parameters are updated until the accuracy of the XGBOOST model reaches the target value; in this embodiment, once the average accuracy of the XGBOOST model is higher than 0.9, the XGBOOST model meeting the accuracy requirement is output.
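The acceptance check can be sketched as follows. The text does not say which accuracy measure is used, so the coefficient of determination (R squared) is assumed here purely for illustration, and the values are invented.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def meets_target(scores, target=0.9):
    """True when the average score over the verification runs exceeds target."""
    return sum(scores) / len(scores) > target


# Hypothetical verification-set GPP values and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

score = r_squared(y_true, y_pred)
accepted = meets_target([score, 0.95, 0.88])
```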

Step 4, the test set is resampled by the ADASYN method, bootstrap random selection is performed, and the test set is imported into the XGBOOST model output in step 3 for traversal iteration to obtain an optimal GPP model.

Referring to FIG. 4, in this embodiment the SIF and characteristic values in the test set (local vegetation type, soil humidity, precipitation, downward radiation, average temperature, maximum temperature and minimum temperature) are resampled by the ADASYN method, bootstrap random selection is performed, and the data are imported into the XGBOOST model obtained in step 3 for traversal iteration to obtain an optimal GPP estimation model; the optimal GPP model is then used to predict the GPP of the research area and obtain a GPP value.
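The bootstrap part of this step can be sketched as repeated evaluation on test pairs drawn with replacement. The ADASYN resampling is not reproduced here, and the error metric (RMSE), the number of rounds and the example values are assumptions.

```python
import random


def bootstrap_rmse(y_true, y_pred, n_rounds=200, seed=0):
    """Evaluate RMSE on n_rounds bootstrap resamples of the test pairs.

    Returns the mean RMSE and the list of per-round values.
    """
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    results = []
    for _ in range(n_rounds):
        sample = [rng.choice(pairs) for _ in pairs]   # draw with replacement
        mse = sum((yt - yp) ** 2 for yt, yp in sample) / len(sample)
        results.append(mse ** 0.5)
    return sum(results) / len(results), results


# Hypothetical test-set GPP values and model predictions.
y_true = [2.1, 3.4, 5.0, 6.2, 7.9]
y_pred = [2.0, 3.9, 4.6, 6.5, 7.5]

mean_rmse, per_round = bootstrap_rmse(y_true, y_pred)
spread = max(per_round) - min(per_round)   # rough stability indicator
```

The spread of the per-round values gives a rough sense of how sensitive the test error is to the particular records that happen to be in the test set.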

This completes the establishment of a GPP estimation model based on solar-induced chlorophyll fluorescence and the XGBOOST algorithm. The method improves regional research in China on GPP data sets based on solar-induced chlorophyll fluorescence (SIF), overcomes the regional limitation of the linear SIF model, improves the efficiency and accuracy of GPP estimation, and yields a GPP estimation model with high accuracy and high robustness.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

1. A method for constructing a GPP estimation model is characterized by comprising the following steps:

step 4, resampling the test set, performing bootstrap random selection, and importing the test set into the XGBOOST model obtained in step 3 for traversal iteration to obtain an optimal GPP estimation model.

2. The method of claim 1, wherein the vegetation data in step 1 includes vegetation type and soil moisture, and the meteorological data includes precipitation, downlink radiation, average temperature, maximum temperature, and minimum temperature.

3. The method of claim 1, wherein the existing GPP model is selected as:

$$G_{PP} = S_{IF} \cdot \frac{L_{UE}}{L_{UEf} \cdot f_{esc}}$$

in the formula, $G_{PP}$ is the gross primary productivity; $S_{IF}$ is the chlorophyll fluorescence; $L_{UE}$ is the light use efficiency; $L_{UEf}$ is the fluorescence quantum yield; $f_{esc}$ is the canopy escape rate of chlorophyll fluorescence;

transforming the GPP model to obtain a transformed SIF-GPP model:

$$G_{PP} = S_{IF} \cdot T_{MINscalar} \cdot V_{PDscalar} \cdot a + b$$

and adding vegetation characteristic factors and meteorological characteristic factors into the transformed SIF-GPP model by adopting an XGBOOST model.

4. The method for constructing a GPP estimation model according to claim 1, wherein in step 2, the XGBOOST model is:

$$\hat{y}_i = \sum_{n=1}^{N} f_n(x_i), \quad f_n \in FS$$

wherein $\hat{y}_i$ is the predicted value of the i-th data.

5. The method for constructing a GPP estimation model according to claim 1, wherein in step 2, the XGBOOST model is trained by ten-fold cross validation, and the loss function in the training process is:

$$L = \sum_{i=1}^{M} l(\hat{y}_i, y_i) + \sum_{n=1}^{N} \Omega(f_n)$$

wherein $M$ represents the number of training samples, and $l(\hat{y}_i, y_i)$ is the loss function between the predicted value $\hat{y}_i$