CN116756495A - Novel vorticity covariance net ecological system exchange data interpolation method - Google Patents



Publication number
CN116756495A
Authority
CN
China
Prior art keywords
data
nee
interpolation
trend
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310567613.2A
Other languages
Chinese (zh)
Inventor
高德祥
高中明
药静宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202310567613.2A
Publication of CN116756495A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/27 Regression, e.g. linear or logistic regression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a novel method for interpolating (gap-filling) net ecosystem carbon dioxide exchange (NEE) data observed by eddy covariance. Missing values are first removed from the original observations, a random forest (RF) model is trained on the gap-free data with grid search used to find the optimal parameters, and the missing values are then pre-interpolated. The resulting complete NEE time series is decomposed into a trend term and a fluctuation term using either time series additive decomposition (TSA) or empirical mode decomposition (EMD). The originally missing values are then removed from the trend term, the RF model is trained again, and the influence factor data corresponding to the missing values are input to the trained model to interpolate the trend term. Finally, the interpolated trend term and the fluctuation term obtained by decomposition are superposed to obtain the final net ecosystem exchange. The method markedly improves both the accuracy of interpolating the original observations and the interpolation performance of the RF model, and it also improves the accuracy of eddy covariance flux interpolation across long gaps.

Description

Novel vorticity covariance net ecological system exchange data interpolation method
Technical Field
The invention relates to the technical field of atmospheric science and carbon dioxide flux, and in particular to a novel method for interpolating eddy covariance net ecosystem exchange data.
Background
The eddy covariance (EC) method is an internationally accepted method for observing gas exchange between ecosystems and the atmosphere, and tens of thousands of flux tower sites have been established since the beginning of the 20th century. As a technique that directly observes the fluxes of matter and energy between terrestrial ecosystems and the atmosphere, it is an important observation method for the international flux network (FLUXNET) and for a large number of meteorological, ecological, and hydrological observation sites, and it plays an extremely important role in global change research. However, at many long-term EC sites about 20%-60% of the half-hourly data points are missing each year because of factors such as power outages, instrument malfunctions and maintenance, and data quality checks, and continuous gaps lasting longer periods (up to half a month or even a month) may also occur.
Dozens of interpolation methods exist for missing eddy covariance flux data, including linear and multiple regression models, look-up tables, multiple imputation, marginal distribution sampling, and machine learning methods. However, no interpolation method for missing flux data has yet been adopted as an international standard, the accuracy of existing methods is limited, and interpolating long gaps is more difficult still.
In addition, prior-art methods often cover only one vegetation type or apply only in specific environments, and a robust and effective carbon flux interpolation scheme is lacking for the different vegetation covers and geographical environments of sites around the world. A general and robust NEE interpolation method is therefore vital for quantifying interannual changes in the carbon budget and for studying matter and energy exchange in terrestrial ecosystems.
Disclosure of Invention
In view of the above problems, the present invention aims to provide a novel method for interpolating eddy covariance net ecosystem carbon dioxide exchange (NEE) data, which obtains complete and reliable flux data by interpolating the missing data.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The novel eddy covariance net ecosystem exchange data interpolation method is characterized by comprising the following steps:
Step 1: acquiring observed meteorological data and carbon dioxide flux (NEE) data;
Step 2: preprocessing the acquired meteorological data and carbon dioxide flux data to obtain valid NEE data and the corresponding influence factor data;
Step 3: training a random forest model on the data obtained in step 2, and pre-interpolating the missing values with the trained random forest model to obtain complete NEE time series data;
Step 4: processing the complete NEE time series data with a time series additive model (TSA) to obtain a trend term and a fluctuation term, removing the trend-term values at the originally missing positions, dividing the remaining trend-term data into data sets, training a machine learning model, and performing interpolation to obtain the interpolated trend-term data;
Step 5: processing the complete NEE time series data with the empirical mode decomposition (EMD) algorithm to obtain a trend term and a fluctuation term, removing the trend-term values at the originally missing positions, dividing the remaining trend-term data into data sets, training a machine learning model, and performing interpolation to obtain the interpolated trend-term data;
Step 6: comparing the interpolated data obtained in step 4 with the interpolated data obtained in step 5, so that the strengths of the two decomposition methods complement each other: TSA effectively extracts the trend of NEE, while EMD addresses the inaccuracy and distortion that may affect TSA trend extraction under long gaps;
Step 7: superposing the trend term obtained in step 4 or step 5 and the fluctuation term obtained from the corresponding decomposition to obtain the final NEE interpolation result.
Further, the specific operations of step 2 are as follows:
Step 21: performing cubic spline interpolation on the meteorological data corresponding to NEE to obtain data at half-hourly intervals;
Step 22: grading the quality of the carbon dioxide flux data into three levels and retaining only the high-quality data with a score of 0;
Step 23: removing the missing NEE data and the corresponding influence factor data, and retaining the valid NEE data and the corresponding influence factor data.
Further, the specific operations of step 3 include:
Step 31: dividing the valid NEE data obtained in step 2 into a training set (75%) and a test set (25%);
Step 32: training a random forest model on the training set, testing its performance on the test set, and obtaining the optimal parameter combination of the model by grid search;
Step 33: inputting the influence factor data for the missing part of NEE into the trained RF model to interpolate the missing values and obtain complete NEE time series data.
Further, the specific operations of step 4 include:
Step 41: defining T as the time series, P as the trend term, and R as the fluctuation term, such that:
$$T = P + R;$$
Step 42: denoting the complete NEE time series data obtained in step 3 as $T_i$, with $T_i = [T_1, \ldots, T_i, \ldots, T_n]$;
Step 43: decomposing the complete NEE time series data $T_i$ by the moving average method to obtain the trend term P, where N is the period of the moving average;
Step 44: subtracting the trend term from the original series to obtain the fluctuation term R:
$$R = T - P;$$
Step 45: removing the trend-term values at the gaps of the original data, training and testing the machine learning model on the remaining trend-term values, inputting the influence factor data of the missing values into the trained model to interpolate the trend term, and finally superposing the interpolation result and the corresponding fluctuation term to complete the NEE interpolation.
Further, the specific operations of step 5 include:
Step 51: finding the local maximum and minimum points of the complete NEE time series data $T(i)$, and fitting the extreme points by curve interpolation to obtain the upper envelope $T(i)_{\max}$ and the lower envelope $T(i)_{\min}$ of the signal;
Step 52: averaging the upper and lower envelopes to obtain the mean envelope $m(i)$:
$$m(i) = \frac{T(i)_{\max} + T(i)_{\min}}{2};$$
Step 53: subtracting $m(i)$ from $T(i)$ to obtain the residual signal $d(i) = T(i) - m(i)$;
Step 54: repeating steps 51-53 on the residual signal $d(i)$ until SD is smaller than the threshold, to obtain a suitable first-order modal component $c(i)$, where SD is calculated as:
$$SD = \sum_{n} \frac{\left| d_{k-1}(n) - d_{k}(n) \right|^{2}}{d_{k-1}^{2}(n)},$$
where n is the index of the time series data point and k is the index of the residual signal;
Step 55: subtracting $c(i)$ from $T(i)$ to obtain the first-order residual $r(i)$, replacing the original signal $T(i)$ with $r(i)$, and repeating steps 51-55 n times to obtain the n-th order modal function $c_n(i)$ and the final residual $r_n(i)$ that meets the stopping criterion, so that the EMD of $T(i)$ is expressed as:
$$T(i) = \sum_{j=1}^{n} c_j(i) + r_n(i);$$
Step 56: recombining the low-frequency components into an annual trend term whose proportion is close to that of the time series (TSA) trend term, and taking the sum of the remaining high-frequency components and the residual as the fluctuation term;
Step 57: removing the trend-term values at the gaps of the original data, training and testing the machine learning model on the trend term obtained from the EMD, inputting the influence factor data of the missing values into the trained model to interpolate the trend term, and finally superposing the interpolation result and the corresponding fluctuation term to complete the NEE interpolation.
Further, the influence factor data include: air temperature, shortwave radiation, precipitation, vapor pressure deficit, wind speed, soil temperature, soil water content, the normalized difference vegetation index, the enhanced vegetation index, and three fuzzy variables (the decimal day of year of each half-hourly record and its sine and cosine functions).
Compared with other interpolation methods, the invention has the following beneficial effects:
First, conventional practice commonly uses marginal distribution sampling (MDS) or other traditional interpolation methods. Although most recent studies find that a machine learning RF model performs better than the traditional schemes and interpolates long gaps better, the method provided by the invention greatly improves on the RF model alone in both interpolation accuracy and long-gap interpolation, and it is applicable to interpolating net ecosystem exchange under different vegetation covers in various geographical environments;
Second, simulation experiments show that the method not only significantly improves interpolation accuracy but also has good robustness and adaptability;
Third, the invention selects five stations whose land cover represents different vegetation types, validates gap filling with RF alone under artificial gaps of four different lengths, and compares its performance with that of the method provided by the invention. The experimental results show that the interpolation accuracy of the invention is significantly higher than that of RF filling alone.
Drawings
FIG. 1 is a flow chart of the method according to the present invention.
FIG. 2 shows box plots of the test metrics for artificial gaps of four different lengths at 25 stations in North America;
FIG. 3 is a schematic diagram of the interpolation results for the trend term extracted with the time series additive model (TSA) at the semiarid station US-Hn1;
FIG. 4 is a schematic diagram of the interpolation results for the trend term extracted with empirical mode decomposition (EMD) at the semiarid station US-Hn1.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
In order to obtain complete and reliable flux data, the invention provides a reasonable, reliable and robust interpolation method for interpolating missing data, which specifically comprises the following steps:
Step 1: acquiring meteorological data and flux data
Data, including observed meteorological data and flux data, are acquired from FLUXNET or a similar observation network website;
The meteorological data include: air temperature (TA), shortwave radiation (SW), precipitation (P), vapor pressure deficit (VPD), wind speed (WS), soil temperature (TS), and soil water content (SWC).
The input data also include the normalized difference vegetation index (NDVI) and the enhanced vegetation index (EVI) obtained from satellite observations, and three fuzzy variables (the decimal day of year of each half-hourly record and its sine and cosine functions);
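The three fuzzy variables can be illustrated with a minimal sketch. The function name `fuzzy_time_features` and the choice of one full sine/cosine cycle per calendar year are assumptions made for illustration; the text only states that the decimal day of year and its sine and cosine are used.

```python
import math
from datetime import datetime
from typing import Tuple


def fuzzy_time_features(timestamp: datetime) -> Tuple[float, float, float]:
    """Return (decimal day of year, sine term, cosine term) for one record."""
    start = datetime(timestamp.year, 1, 1)
    end = datetime(timestamp.year + 1, 1, 1)
    elapsed_days = (timestamp - start).total_seconds() / 86400.0
    # Decimal day of year: 1.0 at 00:00 on 1 January, 1.5 at noon, and so on.
    decimal_day = 1.0 + elapsed_days
    # Assumed encoding: one full cycle per calendar year (365 or 366 days).
    angle = 2.0 * math.pi * elapsed_days / (end - start).days
    return decimal_day, math.sin(angle), math.cos(angle)


# Example: the half-hourly record stamped 12:00 on 1 January 2021.
day, sin_term, cos_term = fuzzy_time_features(datetime(2021, 1, 1, 12, 0))
```

The sine and cosine pair makes late December and early January adjacent in feature space, which the raw day number alone does not.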
Step 2: processing the acquired data
The acquired meteorological data and fluxes are processed with Python. When the meteorological data corresponding to NEE (the carbon dioxide flux) contain gaps (mainly because some meteorological variables are recorded at daily rather than half-hourly intervals), cubic spline interpolation is applied to obtain data at half-hourly intervals, and useless data are then removed. The NEE data are then quality-graded so that higher-quality carbon dioxide fluxes are selected for training the model and poor-quality fluxes are treated as missing. Specifically, NEE data quality is graded into three levels: gaps caused by faults and maintenance of field instruments or electrical equipment score 2, other low-quality data score 1, and high-quality data score 0. Only the high-quality data (score 0) are used for training the model and filling the gaps. After preprocessing, the missing NEE data and the corresponding influence factor data are removed with Python, and the valid NEE data and the corresponding influence factor data are retained for training the model;
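The quality screening described above can be sketched as follows. The record layout (dictionaries with the keys `"nee"`, `"qc"`, and `"drivers"`) is hypothetical and chosen only for illustration; the cubic spline resampling of the meteorological data is not covered here.

```python
import math


def split_by_quality(records):
    """Split half-hourly records into training rows and rows to be gap-filled.

    Each record is a dict with the hypothetical keys "nee" (flux value or
    None), "qc" (quality score 0, 1, or 2), and "drivers" (influence factors).
    """
    valid, to_fill = [], []
    for rec in records:
        nee = rec["nee"]
        is_missing = nee is None or (isinstance(nee, float) and math.isnan(nee))
        # Only score-0 records with a real value are kept for training;
        # everything else is treated as missing and interpolated later.
        if rec["qc"] == 0 and not is_missing:
            valid.append(rec)
        else:
            to_fill.append(rec)
    return valid, to_fill


records = [
    {"nee": -1.2, "qc": 0, "drivers": [15.0, 320.0]},
    {"nee": 0.4, "qc": 1, "drivers": [14.5, 0.0]},
    {"nee": None, "qc": 2, "drivers": [14.0, 0.0]},
]
valid, to_fill = split_by_quality(records)
```

Keeping the influence factors attached to the rows in `to_fill` matters, because they are the model inputs when the missing NEE values are predicted.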
Step 3: obtaining complete NEE time series data with the RF model
The valid NEE data are divided into a training set (75%) and a test set (25%). An RF model is trained on the training set and evaluated on the test set, and the optimal parameters are sought by grid search or random search: grid search (GridSearchCV) is used when the data volume is small and there are few hyperparameter combinations, and random search (RandomizedSearchCV) is chosen when the data volume is large and there are many parameter combinations. The optimal hyperparameters are then selected by 5-fold cross-validation. After model training is complete, the influence factor data are input to interpolate the missing values and obtain complete NEE time series data;
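The pre-interpolation step can be sketched with scikit-learn, which provides the `GridSearchCV` class named above. The data here are synthetic stand-ins (three made-up drivers and an NEE-like target), and the small parameter grid is an assumption chosen to keep the example fast; it is not the grid used in the patent.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 hypothetical drivers and an NEE-like target.
X = rng.normal(size=(400, 3))
nee = -2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=400)
missing = rng.random(400) < 0.2          # pretend 20% of NEE is missing

# Keep only the valid rows, then split them 75% / 25%.
X_valid, y_valid = X[~missing], nee[~missing]
X_train, X_test, y_train, y_test = train_test_split(
    X_valid, y_valid, test_size=0.25, random_state=0
)

# Small illustrative grid searched with 5-fold cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [0.5, 1.0],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
test_r2 = search.best_estimator_.score(X_test, y_test)

# Pre-interpolation: predict NEE only where it was missing.
nee_filled = nee.copy()
nee_filled[missing] = search.predict(X[missing])
```

Swapping `GridSearchCV` for `RandomizedSearchCV` (with an `n_iter` budget) follows the same pattern when the number of parameter combinations is large.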
When a machine learning model is used for data interpolation (a machine learning model disclosed in the prior art is adopted), a time series additive model is used, in order to improve interpolation accuracy, to extract the overall trend term and the fluctuation term of the interpolated NEE. The specific steps are as follows:
Let T be the time series, P the trend term, and R the fluctuation term. Since the data recording interval after filling is half an hour, the trend term P is extracted by the moving average method (i.e., the time series additive model) with 48 intervals (i.e., one day) as the period, and the fluctuation term R is obtained by subtracting the trend term from the NEE time series data:
$$T = P + R$$
The data are the NEE time series that is complete at half-hourly resolution after interpolation. Let the completed NEE time series data $T_i$ be:
$$T_i = [T_1, \ldots, T_i, \ldots, T_n];$$
The NEE time series data are decomposed by the moving average method to obtain the trend term P, where N is the period; the period used by the invention is one day, i.e., N = 48. The fluctuation term is obtained by subtracting the trend term from the original series, namely $R = T - P$.
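A minimal sketch of the moving-average decomposition with a period of 48 half-hourly records follows. The centred window and the way it shrinks at the edges of the series are assumptions; the text specifies only the moving average method and the period.

```python
def moving_average_decompose(series, period=48):
    """Split a gap-free series T into trend P and fluctuation R with T = P + R."""
    n = len(series)
    half = period // 2
    trend = []
    for i in range(n):
        # Centred window of `period` samples, shrunk at the series edges
        # (an assumption; the edge treatment is not specified in the text).
        lo = max(0, i - half)
        hi = min(n, i + half)
        window = series[lo:hi]
        trend.append(sum(window) / len(window))
    fluctuation = [t - p for t, p in zip(series, trend)]
    return trend, fluctuation


# Example: a purely daily cycle sampled every half hour for 10 days.
series = [float(i % 48) for i in range(480)]
trend, fluctuation = moving_average_decompose(series, period=48)
```

Away from the edges, each window spans exactly one day, so a purely daily cycle is averaged out of the trend and ends up entirely in the fluctuation term.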
To compensate for the error that the moving average method may introduce under long gaps, the invention also provides component recombination based on empirical mode decomposition (EMD). EMD decomposes the completed NEE time series into several components, namely a number of intrinsic mode functions (IMFs) and a residual (res), which together contain high-frequency and low-frequency components. Recombination means superposing the high-frequency components to obtain the fluctuation term and superposing the low-frequency components and res to obtain the trend term. In other words, the data processed by component recombination are the completed NEE time series, and the components are the IMFs and the residual obtained by decomposing it. The EMD-based time-frequency analysis method is suitable for nonlinear, non-stationary signals as well as linear, stationary signals.
The EMD method assumes that any signal consists of different IMFs, each of which may be linear or nonlinear. An IMF component must satisfy two conditions: (1) the number of extreme points and the number of zero crossings are equal or differ by at most 1; (2) the upper and lower envelopes are locally symmetric about the time axis. IMFs are generated by sifting the raw data, which is an iterative process. The signal can be decomposed by EMD into the sum of several IMFs and a residual function.
The basic steps of the EMD algorithm are as follows:
Step 1: find the local maximum and minimum points of the completed NEE time series data $T(i)$, then fit the extreme points by curve interpolation to obtain the upper envelope $T(i)_{\max}$ and the lower envelope $T(i)_{\min}$ of the signal.
Step 2: average the upper and lower envelopes to obtain the mean envelope $m(i)$:
$$m(i) = \frac{T(i)_{\max} + T(i)_{\min}}{2}$$
Step 3: subtract the mean envelope $m(i)$ from the original signal $T(i)$ to obtain the residual signal $d(i)$. For a stationary signal, $d(i)$ is typically the first intrinsic mode function (IMF) of $T(i)$. A non-stationary signal, however, does not vary monotonically within a given region (for example, monotonically increasing) and may contain several inflection points; because of this complexity, if an inflection point reflecting a specific feature of $T(i)$ is missed, the resulting first-order modal function is inaccurate, so sifting must continue.
Step 4: repeat steps 1-3 on the residual signal $d(i)$ until SD falls below the sifting threshold (generally 0.2-0.3), finally obtaining a suitable first-order modal component $c(i)$, i.e., the first IMF. SD is determined as follows:
$$SD = \sum_{n} \frac{\left| d_{k-1}(n) - d_{k}(n) \right|^{2}}{d_{k-1}^{2}(n)}$$
where n is the index of the time series data point and k is the index of the residual signal;
Step 5: subtract $c(i)$ from $T(i)$ to obtain the first-order residual $r(i)$, replace the original signal $T(i)$ with $r(i)$, and apply steps 1-5 again. After n repetitions, the n-th order modal function $c_n(i)$ and the final residual $r_n(i)$ that meets the stopping criterion are obtained. The original signal after EMD is then expressed as:
$$T(i) = \sum_{j=1}^{n} c_j(i) + r_n(i)$$
As shown above, EMD decomposes the gap-filled NEE time series into several components $c_n(i)$ and a residual $r_n(i)$. The components are then recombined: the low-frequency components (IMFs) are chosen so that their sum differs from the total of the time series (TSA) trend term by no more than 0.15, and are recombined into an annual trend term whose proportion is close to that of the TSA trend term; the sum of the remaining high-frequency components and the residual is used as the fluctuation term.
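The sifting and recombination steps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the patent's implementation: the envelopes are piecewise linear rather than fitted by curve (spline) interpolation, SD is computed in a summed-ratio form chosen here for numerical stability, and the split point `n_high` is picked by hand rather than by the 0.15 criterion. Because the text assigns the residual to the trend term in one place and to the fluctuation term in another, the hypothetical flag `residual_in_trend` leaves that choice open.

```python
import math


def local_extrema(x):
    """Indices of strict interior local maxima and minima of x."""
    inner = range(1, len(x) - 1)
    maxima = [i for i in inner if x[i - 1] < x[i] > x[i + 1]]
    minima = [i for i in inner if x[i - 1] > x[i] < x[i + 1]]
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through x at idx, held flat beyond the ends."""
    env, k = [], 0
    for i in range(len(x)):
        while k + 1 < len(idx) and idx[k + 1] <= i:
            k += 1
        if i <= idx[0] or k + 1 == len(idx):
            env.append(x[idx[k]])
        else:
            a, b = idx[k], idx[k + 1]
            w = (i - a) / (b - a)
            env.append((1 - w) * x[a] + w * x[b])
    return env


def sift_once(d):
    """One sifting pass: subtract the mean envelope m(i) from d(i)."""
    maxima, minima = local_extrema(d)
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema to build envelopes
    upper, lower = envelope(d, maxima), envelope(d, minima)
    return [v - (u + l) / 2 for v, u, l in zip(d, upper, lower)]


def sd_value(prev, curr):
    """Summed-ratio form of SD (an assumed, numerically stable variant)."""
    denom = sum(p * p for p in prev)
    return sum((p - c) ** 2 for p, c in zip(prev, curr)) / denom if denom else 0.0


def emd(signal, sd_threshold=0.25, max_sift=50, max_imfs=8):
    """Return (imfs, residual) such that signal = sum(imfs) + residual."""
    residual, imfs = list(signal), []
    for _ in range(max_imfs):
        d = sift_once(residual)
        if d is None:
            break
        for _ in range(max_sift):
            nxt = sift_once(d)
            if nxt is None:
                break
            sd, d = sd_value(d, nxt), nxt
            if sd < sd_threshold:
                break
        imfs.append(d)
        residual = [r - c for r, c in zip(residual, d)]
    return imfs, residual


def recombine(imfs, residual, n_high, residual_in_trend=False):
    """Sum the first n_high IMFs (highest frequency) into the fluctuation term."""
    n = len(residual)
    fluct = [sum(c[i] for c in imfs[:n_high]) for i in range(n)]
    trend = [sum(c[i] for c in imfs[n_high:]) for i in range(n)]
    target = trend if residual_in_trend else fluct
    for i in range(n):
        target[i] += residual[i]
    return trend, fluct


# Example: a fast oscillation riding on a slow one.
signal = [math.sin(2 * math.pi * i / 12) + 0.5 * math.sin(2 * math.pi * i / 120)
          for i in range(480)]
imfs, residual = emd(signal)
trend, fluct = recombine(imfs, residual, n_high=1)
```

Whatever the split, the trend and fluctuation terms always add back to the input series, which is what allows the fluctuation term to be superposed on the re-interpolated trend later.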
After the trend and fluctuation terms are obtained from the EMD, the trend-term values at the gaps of the original data are again removed, and the machine learning model is trained and tested on the influence factor data and the trend term of the non-missing NEE alone. The hyperparameters are either searched and tuned again or taken directly from the first interpolation model. After training, the influence factor data of the missing values are input to interpolate the trend term, and finally the corresponding fluctuation term is superposed to complete the NEE interpolation.
The invention uses two decomposition methods, the time series additive model (TSA) and empirical mode decomposition (EMD), to decompose the completed NEE time series data: the NEE series is first completed by RF, the completed NEE is then decomposed by TSA and by EMD, and the interpolation is finally completed by a machine learning algorithm.
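The overall flow (pre-fill, decompose, re-interpolate the trend term, superpose the fluctuation term) can be sketched with pluggable pieces. The function `fill_gaps_via_trend` and its callable arguments are hypothetical names introduced for illustration; the trivial mean-based stand-ins below take the place of TSA or EMD and of the trained machine learning model.

```python
def fill_gaps_via_trend(prefilled, gap_mask, decompose, fit_predict):
    """Re-interpolate originally missing NEE values through the trend term.

    prefilled   : gap-free NEE series after the first (RF) pre-interpolation
    gap_mask    : True where the original observation was missing
    decompose   : callable returning (trend, fluctuation) with T = P + R
    fit_predict : callable(train_idx, train_trend, gap_idx) -> predicted trend
    """
    trend, fluct = decompose(prefilled)
    train_idx = [i for i, gap in enumerate(gap_mask) if not gap]
    gap_idx = [i for i, gap in enumerate(gap_mask) if gap]
    # Train only on trend values at positions that were originally observed.
    predicted = fit_predict(train_idx, [trend[i] for i in train_idx], gap_idx)
    final = list(prefilled)
    for i, p in zip(gap_idx, predicted):
        final[i] = p + fluct[i]  # predicted trend plus fluctuation term
    return final


# Trivial stand-ins, used only to make the sketch runnable.
def mean_decompose(series):
    mean = sum(series) / len(series)
    return [mean] * len(series), [v - mean for v in series]


def mean_model(train_idx, train_trend, gap_idx):
    level = sum(train_trend) / len(train_trend)
    return [level] * len(gap_idx)


nee = [1.0, 2.0, 3.0, 4.0]
mask = [False, True, False, False]
result = fill_gaps_via_trend(nee, mask, mean_decompose, mean_model)
```

Observed positions are left untouched; only the originally missing positions receive the sum of the predicted trend and the fluctuation term.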
TSA and EMD are applied in parallel, so that they complement and can be compared with each other. TSA effectively extracts the long-period trend of NEE: it captures more than 95% of the annual NEE total and smooths the data effectively, reducing their complexity. However, TSA with a daily period may extract the trend inaccurately or introduce errors under long gaps. EMD handles nonlinearity effectively, but the annual NEE total captured after recombining the EMD components (IMFs) is poorer than that of TSA. EMD is therefore run in parallel for comparison, so that the two methods complement each other.
Examples
The invention addresses the interpolation of missing data, but missing data also limit how accurately the method's performance can be verified. The usual approach is to create artificial gaps in gap-free data to verify the performance of an interpolation method, so the invention likewise verifies its performance by creating artificial gaps of four different lengths.
1. Experimental data sources: NEE data and meteorological data were observed from https:// ameriflux.
The paper "Technical note: Uncertainties in eddy covariance CO2 fluxes in a semiarid sagebrush ecosystem caused by gap-filling approaches", published in 2021, is used for comparison: the RF scheme described in that paper is compared with the invention at the semiarid site US-Hn1 under the same artificial gap lengths. The interpolation results are shown in Table 1, where the RF data are the error analysis data of the best scheme (RF) in the 2021 paper, and TSA-RF (time series additive model decomposition combined with RF) and EMD-RF (empirical mode decomposition combined with RF) are the comparative results obtained by the invention's decomposition combined with RF under the same artificial gap lengths.
TABLE 1
As can be seen from Table 1, the method significantly improves interpolation accuracy. The invention first pre-interpolates the data and then decomposes the NEE time series by EMD and by moving average, and comparison tests were carried out with four machine learning algorithms (XGBoost, RF, SVR, and BP), each applied to artificial gaps of four different lengths at 25 sites in North America. The experimental results show that, for gaps of all lengths, filling NEE by first completing and then decomposing greatly improves every metric compared with direct interpolation. For the 1-hour (short) gap length, the average R² over the 25 sites is 0.98 for both RF and XGBoost combined with EMD decomposition, with average RMSE of 0.492 and 0.480 respectively; with time series decomposition, the average R² is 0.99 and the average RMSE is 0.357 and 0.364 respectively. For the two-month (very long) gap length, the average R² of RF and XGBoost combined with EMD decomposition drops to 0.88 and 0.85, and the average RMSE rises to 1.60 and 1.59 respectively; with time series decomposition, the average R² drops to 0.88 and the average RMSE rises to 1.365 and 1.376. This shows that the method has good robustness and adaptability and greatly improves the interpolation of data in long gaps.
In order to further verify the effect of the method according to the invention, the following experiments were carried out:
The 25 eddy covariance flux towers in the experiment are all located in North America. Each site has at least one calendar year of flux observations, and the records include the complete set of input factor data required for interpolation, such as solar radiation, air temperature, humidity, and wind speed and direction. Data sampled at 1 Hz and recorded once every 30 minutes were selected; the data source website is https://ameriflux.lbl.gov/sites/site-search/. The sites are distributed across various regions of North America and span a variety of climate zones, mainly temperate continental, temperate marine, and humid subtropical climates. They cover five representative land cover types: cropland, grassland, shrubland, evergreen needleleaf forest, and deciduous broadleaf forest.
Only high-quality NEE data (score 0) were used in this experiment for training the model and filling the gaps. To evaluate the gap-filling performance, artificial gaps must be introduced into the data. The invention generates artificial gaps of four different lengths in the decomposed data set, fills them with the four methods, and finally superposes the fluctuation term at the artificial gaps to obtain the predicted and observed NEE values within the artificial gaps. It is difficult to make the size and position of artificial gaps match real gaps exactly, so to control the quality of the artificial gaps and eliminate the potential influence of sample size and gap position on the performance evaluation, 10 independent samples of artificial-gap training and test sets are created for each gap length.
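Generating artificial gaps of a given length can be sketched as follows. The function name, the non-overlapping placement rule, and the 10% target fraction used in the example are assumptions made for illustration; at half-hourly resolution a 1-hour gap corresponds to 2 consecutive records.

```python
import random


def make_artificial_gaps(n_points, gap_length, gap_fraction, seed=0):
    """Return a sorted list of indices masked as artificial gaps.

    Blocks of `gap_length` consecutive records are placed at random,
    without overlap, until about `gap_fraction` of the series is masked.
    """
    rng = random.Random(seed)
    target = int(n_points * gap_fraction)
    masked = set()
    attempts = 0
    while len(masked) < target and attempts < 10000:
        attempts += 1
        start = rng.randrange(0, n_points - gap_length + 1)
        block = range(start, start + gap_length)
        if any(i in masked for i in block):
            continue  # keep gaps non-overlapping (an assumption)
        masked.update(block)
    return sorted(masked)


# Ten independent samples of 1-hour gaps in one year of half-hourly data.
samples = [
    make_artificial_gaps(17520, gap_length=2, gap_fraction=0.10, seed=s)
    for s in range(10)
]
```

Using a different seed for each sample varies the gap positions while keeping the masked fraction fixed, so position effects can be averaged out in the evaluation.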
The present invention uses part of the observation station weather data provided by the FLUXNET2015 database as model inputs, including air temperature (TA_F), short wave radiation (SW_IN_F), precipitation (P_F), saturated vapor pressure difference (VPD_F), wind speed (WS_F), soil temperature (TS_F_MDS_1), soil moisture content (SWC_F_MDS_1), IN addition to the above-mentioned weather variables, the input variables to the ML algorithm also include Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) from a medium resolution imaging spectrometer (MODIS), and three fuzzy variables (i.e., fractional days per year and sine and cosine functions to represent seasonal variations).
In the experiment, firstly, the missing value of NEE time sequence data with a gap is removed, namely, only data with a score of 0 is reserved, then the NEE time sequence data is divided into a training set and a testing set, 400 regression trees are created for each case by using a random forest R packet, and the number of variables of the binary tree in a designated node is 3. And training the model by taking all the influence factors as input quantities and completing first interpolation to obtain complete NEE time sequence data. Then using a time sequence addition model and EMD to decompose the filled NEE time sequence into a trend item and a fluctuation item, removing data which are originally absent in trend item data, taking the data without blank as a training set, randomly generating four artificial gaps with different lengths accounting for 10% -15%, taking 80% of the rest data as the training set, taking 20% as a test set and training the model. Meanwhile, four machine learning algorithms Xgboost, RF, SVR and BP neural networks are combined, and normalization processing is carried out on all input data in order to ensure consistency of model comparison. 
The specific model parameters were designed as follows. In the RF experiments, hyperparameter optimization for the decomposed NEE trend terms used a Python grid search with 5-fold cross-validation. The hyperparameters searched include the number of trees (50-5000), the maximum number of features considered when building a decision tree (0.2-0.8), the maximum depth of a decision tree (10-360), the minimum number of samples required at each leaf node (1, 2, 4, or 6), and the minimum number of samples required to split a node (2, 5, 10, 12, or 15). The XGBoost parameters include the learning rate (0.01, 0.1, or 0.02), the minimum loss reduction required for a node split (0-0.5), the minimum sum of sample weights in a child node (1, 2, 5, or 10), and the fraction of features sampled when building each tree (0.6-1). The SVR tuning parameters include the kernel function and the cost regularization parameter (C = 1, 10, 100, or 100). For the BP neural network, the Keras library for Python was used for structural design and parameter setting. After the parameters are determined, the missing trend-term values in the artificial gaps are predicted, and the separated fluctuation terms are then superimposed to obtain the NEE filling results and verify the accuracy and feasibility of the model.
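A grid search with 5-fold cross-validation of the kind described for RF might look like the following in scikit-learn. The data here are synthetic stand-ins (12 columns, matching the count of input variables listed earlier), the grid is deliberately tiny so the example runs quickly, and the RMSE scoring choice is an assumption; none of this reproduces the experiment's actual search.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the normalized drivers and the NEE trend term.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 12))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

# A small subset of the ranges quoted in the text (illustrative only).
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [0.2, 0.8],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # assumed selection criterion
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # scores are negated RMSE values
```

With ranges as wide as those quoted (for example 50-5000 trees), an exhaustive grid becomes expensive, so a coarse grid or a randomized search over the same ranges is a common practical substitute.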
The optimal parameters of the four algorithms in the experiment are:
RF: n_estimators=1636, min_samples_split=5, min_samples_leaf=2, max_features=0.5, max_depth=None, bootstrap=False, random_state=0
XGBoost: subsample=0.8, seed=0, reg_lambda=1, reg_alpha=0, n_jobs=-1, n_estimators=3333, min_child_weight=5, max_depth=298, learning_rate=0.01, gamma=0.0, colsample_bytree=0.7
SVR: kernel='rbf', gamma=0.1, C=100
BP neural network: structure input layer-hidden layer-output layer = 120-10-1; activation function: sigmoid; 200 training iterations.
The experiment adopts four commonly used performance metrics, namely the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and bias (Bias), to statistically compare the gap-filled values with the original measured values within the artificial gaps for the NEE of each station. The formulas are as follows:
where $m_i$ is the measured value, $p_i$ is the predicted value, and $\bar{m}$ and $\bar{p}$ are the means of the measured and predicted values, respectively.
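A minimal sketch of the four metrics in Python follows. Two conventions are assumptions of this sketch: R² is computed as the squared Pearson correlation between measured and predicted values (suggested by the use of both means in the symbol definitions), and Bias is the mean of predicted minus measured. The function name `gap_fill_metrics` is hypothetical.

```python
import math

def gap_fill_metrics(measured, predicted):
    """Return R2, RMSE, MAE and Bias for paired measured/predicted values."""
    n = len(measured)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    m_bar = sum(measured) / n
    p_bar = sum(predicted) / n
    cov = sum((m - m_bar) * (p - p_bar) for m, p in zip(measured, predicted))
    var_m = sum((m - m_bar) ** 2 for m in measured)
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    # Squared Pearson correlation (assumed form); undefined for constant input.
    r2 = cov ** 2 / (var_m * var_p)
    errors = [p - m for m, p in zip(measured, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    bias = sum(errors) / n  # predicted minus measured (assumed sign)
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "Bias": bias}

# A prediction that is uniformly 0.5 too high: perfect correlation, 0.5 error.
metrics = gap_fill_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

The example shows why Bias is reported alongside R²: a constant offset leaves the correlation-based R² at 1 while RMSE, MAE, and Bias all register the 0.5 error.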
The invention uses a time series additive model and an EMD decomposition algorithm to extract and compare trend terms and fluctuation terms, then uses four machine learning algorithms (XGBoost, RF, SVR, and BP) to interpolate the extracted trend terms and superimposes the fluctuation terms. Gaps of four different lengths are tested, and the interpolation results for 25 stations in North America under artificial gaps of the four lengths are evaluated. The experimental results show that, for gaps of every length, the approach of first completing, then decomposing, and finally filling NEE greatly improves every metric compared with direct interpolation.
In general, the performance of each algorithm after EMD and moving-average decomposition decreases as the gap length increases. For all four gap lengths across all sites, RF and XGBoost outperform SVR and the BP neural network, and time series decomposition performs slightly better than EMD decomposition. For the 1-hour (short) gap length, all algorithms have the highest R² and the lowest Bias, RMSE, and MAE. For this gap length, EMD decomposition combined with RF and XGBoost gives a mean R² of 0.98 across all sites, with mean RMSE of 0.492 and 0.480, respectively; time series decomposition gives a mean R² of 0.99, with mean RMSE of 0.357 and 0.364, respectively. For the two-month (very long) gap length, however, the mean R² across all sites for EMD decomposition combined with RF and XGBoost drops to 0.88 and 0.85 and the mean RMSE rises to 1.60 and 1.59, respectively, while with time series decomposition the mean R² drops to 0.88 and the mean RMSE rises to 1.365 and 1.376, with RF slightly better than XGBoost overall. The same holds for Bias and MAE, which increase as the gaps go from short to long. Notably, time series decomposition works better for filling longer gaps.
To verify the performance improvement of the model after decomposition, five stations whose land cover represents different vegetation types were selected: cropland (GZ 1), grassland (AR 1), shrubland (SK 2), evergreen coniferous forest (Me 6), and deciduous broadleaf forest (oho). RF alone (with all parameters kept identical to those of the RF algorithm used after decomposition) was then evaluated on artificial gaps of the four lengths and compared with each algorithm after decomposition. The experimental results show that filling via the decomposed trend term performs markedly better than filling with RF alone. The medians of the mean metrics over the four gap lengths after EMD decomposition were: BP neural network (R² = 0.930, RMSE = 1.407, Bias = -0.006, MAE = 1.046), SVR (R² = 0.927, RMSE = 1.39, Bias = -0.015, MAE = 1.040), XGBoost (R² = 0.938, RMSE = 1.337, Bias = -0.068, MAE = 0.966), and RF (R² = 0.939, RMSE = 1.307, Bias = -0.016, MAE = 0.942). After time series decomposition they were: BP neural network (R² = 0.957, RMSE = 1.168, Bias = 0.0002, MAE = 0.840), SVR (R² = 0.959, RMSE = 1.159, Bias = 0.047, MAE = 0.846), XGBoost (R² = 0.964, RMSE = 1.071, Bias = -0.1352, MAE = 0.766), and RF (R² = 0.966, RMSE = 1.041, Bias = -0.016, MAE = 0.750). All of these perform better than RF alone (R² = 0.78, RMSE = 2.58, Bias = -0.17, MAE = 1.62).
The foregoing shows and describes the basic principles, principal features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims (6)

1. A novel eddy covariance net ecosystem carbon dioxide exchange (NEE) data interpolation method, characterized by comprising the following steps:
step 1: acquiring observed meteorological data and carbon dioxide flux data NEE;
step 2: preprocessing the acquired meteorological data and carbon dioxide flux data to obtain valid NEE data and the corresponding influence factor data;
step 3: training a random forest model on the data obtained in step 2, and pre-interpolating the missing values with the trained random forest model to obtain complete NEE time series data;
step 4: processing the complete NEE time series data with a time series additive model (TSA) to obtain a trend term and a fluctuation term, removing the trend-term values at the originally missing positions, dividing the data set, training a machine learning model, and performing interpolation to obtain complete interpolated trend-term data;
step 5: processing the complete NEE time series data with an Empirical Mode Decomposition (EMD) algorithm to obtain a trend term and a fluctuation term, removing the trend-term values at the originally missing positions, dividing the data set, training a machine learning model, and performing interpolation to obtain complete interpolated trend-term data;
step 6: comparing the interpolated data obtained in step 4 with the interpolated data obtained in step 5;
step 7: superimposing the trend term obtained in step 4 or step 5 and the fluctuation term obtained by the corresponding decomposition algorithm to obtain the final NEE interpolation result.
2. The novel eddy covariance net ecosystem exchange data interpolation method according to claim 1, wherein the specific operation steps of step 2 are as follows:
step 21: performing cubic spline interpolation on the gappy meteorological data corresponding to the NEE to obtain data at half-hourly intervals;
step 22: dividing the carbon dioxide flux data into three quality grades and retaining the high-quality carbon dioxide flux data with a score of 0;
step 23: removing the missing NEE data and the corresponding influence factor data, and retaining the non-missing NEE data and the corresponding influence factor data.
3. The novel eddy covariance net ecosystem exchange data interpolation method according to claim 2, wherein the specific operation steps of step 3 include:
step 31: dividing the valid NEE data obtained in step 2 into a training set (75%) and a test set (25%);
step 32: training a random forest model on the training set, testing the model's performance on the test set, and obtaining the optimal parameter combination of the model by grid search;
step 33: inputting the influence factor data of the missing NEE portions into the trained RF model to interpolate the missing values, obtaining complete NEE time series data.
4. The novel eddy covariance net ecosystem exchange data interpolation method according to claim 3, wherein the time series additive model (TSA) of step 4 comprises the following steps:
step 41: defining T as the time series, P as the trend term, and R as the fluctuation term, such that:
$T = P + R$;
step 42: denoting the NEE time series data completed in step 3 as $T = [T_1, \ldots, T_i, \ldots, T_n]$;
step 43: decomposing the complete NEE time series data $T_i$ by the moving average method to obtain the decomposed trend term:
where $P_i$ is the trend term after decomposition and $N$ is the period;
step 44: subtracting the trend term from the original series to obtain the fluctuation term R:
$R_i = T_i - P_i$;
step 45: removing the trend-term values at the gaps of the original data, training and testing the machine learning model on the remaining trend terms, inputting the influence factor data corresponding to the missing values into the trained machine learning model to interpolate the trend term, and finally superimposing the interpolation result and the corresponding fluctuation term to complete the NEE interpolation.
5. The novel eddy covariance net ecosystem exchange data interpolation method according to claim 4, wherein the Empirical Mode Decomposition (EMD) of step 5 comprises the following steps:
step 51: finding the local maximum and minimum points of the completed NEE time series data $T(i)$, and fitting the extreme points by curve interpolation to obtain the upper envelope $T(i)_{\max}$ and the lower envelope $T(i)_{\min}$ of the signal;
step 52: averaging the upper and lower envelopes to obtain the mean envelope $m(i)$:
$m(i) = \frac{T(i)_{\max} + T(i)_{\min}}{2}$;
step 53: subtracting $m(i)$ from $T(i)$ to obtain the residual signal $d(i) = T(i) - m(i)$;
step 54: repeating steps 51-53 on the residual signal $d(i)$ until SD is smaller than the threshold, obtaining a suitable first-order modal component $c(i)$, where SD is calculated as:
step 55: subtracting $c(i)$ from $T(i)$ to obtain the first-order residual $r(i)$, replacing the original signal $T(i)$ with $r(i)$, and repeating steps 51-55 n times to obtain the n-th order modal function $c_n(i)$ and the final residual $r_n(i)$ that meets the criterion, so that the EMD decomposition of $T(i)$ is expressed as:
$T(i) = \sum_{k=1}^{n} c_k(i) + r_n(i)$;
step 56: reconstructing the low-frequency components of the EMD decomposition into the trend term, and taking the sum of the other high-frequency components and the residual as the fluctuation term;
step 57: removing the trend-term values at the gaps of the original data, training and testing the machine learning model on the trend terms after EMD decomposition, inputting the influence factor data corresponding to the missing values into the trained machine learning model to interpolate the trend term, and finally superimposing the interpolation result and the corresponding fluctuation term to complete the NEE interpolation.
6. The novel eddy covariance net ecosystem exchange data interpolation method according to claim 1, wherein the influence factor data includes: air temperature, shortwave radiation, precipitation, vapor pressure deficit, wind speed, soil temperature, soil water content, normalized difference vegetation index, enhanced vegetation index, and three fuzzy variables.
CN202310567613.2A 2023-05-18 2023-05-18 Novel vorticity covariance net ecological system exchange data interpolation method Pending CN116756495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310567613.2A CN116756495A (en) 2023-05-18 2023-05-18 Novel vorticity covariance net ecological system exchange data interpolation method

Publications (1)

Publication Number Publication Date
CN116756495A true CN116756495A (en) 2023-09-15

Family

ID=87954239

Country Status (1)

Country Link
CN (1) CN116756495A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117609706A (en) * 2023-10-20 2024-02-27 北京师范大学 Method for interpolating data of carbon water flux
CN117609706B (en) * 2023-10-20 2024-06-04 北京师范大学 Method for interpolating data of carbon water flux

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination