CN116189796A - Machine learning-based spaceborne shortwave infrared CO2 column concentration estimation method - Google Patents

Info

Publication number
CN116189796A
Authority
CN
China
Prior art keywords
data
machine learning
value
column concentration
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211594763.4A
Other languages
Chinese (zh)
Inventor
盖荣丽
李静波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN202211594763.4A
Publication of CN116189796A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Investigating Or Analysing Materials By Optical Means (AREA)

Abstract

The invention discloses a machine learning-based method for estimating CO2 column concentration from spaceborne shortwave infrared observations, comprising: S1, extracting OCO-2 satellite band data to obtain data for 9 weak_CO2 bands and 6 O2 bands; S2, performing feature screening on the 9 weak_CO2 band and 6 O2 band data, the NDVI (normalized difference vegetation index) data, the SR surface reflectance data, the DEM elevation data, the ERA5 atmospheric data, the AOD aerosol data and the TCCON station observation data, and retaining the top 31 features ranked by importance; S3, performing correlation analysis on the 31 retained features using a heat map to identify the features that correlate strongly and weakly with the CO2 column concentration; S4, merging the strongly and weakly correlated features, inputting them into five ensemble-learning regression models, and comparing the prediction accuracy of the models; the extreme random forest regression model has the highest coefficient of determination R², the smallest error and the best prediction performance.

Description

Spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning

Technical Field

The present invention relates to the technical field of atmospheric satellite remote sensing prediction, and in particular to a machine learning-based method for estimating CO2 column concentration from spaceborne shortwave infrared observations.

Background Art

CO2 is the main greenhouse gas in the atmosphere and has a major influence on global climate change. Since the industrial era, the CO2 concentration has risen by about 30% and continues to grow. Monitoring CO2 is therefore important for research on global warming, and accurate knowledge of the atmospheric CO2 content and its variation supports climate prediction and environmental decision-making.

Traditional ground-based atmospheric CO2 detection methods are accurate and reliable, but they are all single-point measurements and cannot provide real-time detection over large regional or global areas, so developing methods and technologies for satellite observation of CO2 is imperative. The shortwave infrared band is more sensitive to near-surface CO2 and is therefore better suited to monitoring the dynamic changes of surface carbon sources and sinks.

At present, shortwave infrared CO2 observation data are mostly processed internationally with full-physics retrieval algorithms, which must simulate the entire optical path; solving the radiative transfer equation is complex and time-consuming. Because aerosols, water vapor and surface reflectance affect the shortwave infrared radiation process in complex ways, existing physical retrieval models require many input parameters, and these parameters carry uncertainty.

Summary of the Invention

The purpose of the present invention is to estimate the CO2 column concentration separately with decision tree, XGBoost, ordinary random forest, extreme random forest and gradient boosting regression models, compare the results to identify the model with the highest estimation accuracy, and use that model to estimate the atmospheric CO2 column concentration from satellite remote sensing. The method offers high prediction accuracy and strong interpretability, and greatly improves prediction efficiency.

To achieve the above purpose, the technical solution of the present application is a spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning, comprising:

S1. Acquire OCO-2 satellite band data and extract it through a sensitivity analysis of the atmospheric carbon dioxide retrieval parameters, obtaining data for 9 weak_CO2 bands and 6 O2 bands;

S2. Perform feature screening on the 9 weak_CO2 band and 6 O2 band data, the NDVI (normalized difference vegetation index) data, the SR surface reflectance data, the DEM elevation data, the ERA5 atmospheric data, the AOD aerosol data and the TCCON station observation data, and retain the top 31 features ranked by importance;

S3. Perform correlation analysis on the 31 retained features using a heat map to identify the features that correlate strongly and weakly with the CO2 column concentration;

S4. Merge the strongly and weakly correlated features into the input feature data set, then estimate the column-averaged CO2 concentration with the decision tree, XGBoost, ordinary random forest, extreme random forest and gradient boosting regression models respectively. Compare the models on the coefficient of determination R², root mean square error RMSE, mean absolute error MAE, mean relative error MRE and the prediction accuracy within the allowable error range; the extreme random forest regression model is found to have the highest prediction accuracy and is used to predict the column-averaged CO2 concentration.
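The correlation step S3 can be illustrated with a small sketch. The patent reads the correlations off a heat map; the code below only ranks features by the absolute Pearson correlation with the target, which is one plausible way to separate "strong" from "weak" features. The feature names and values are hypothetical placeholders, not the patent's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear correlation
    return cov / (sx * sy)

def rank_by_correlation(features, target):
    """Return (name, r) pairs sorted by |r| with the target, strongest first."""
    scores = [(name, pearson(col, target)) for name, col in features.items()]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy columns; names and numbers are illustrative assumptions.
features = {
    "weak_co2_band_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ndvi": [0.5, 0.1, 0.4, 0.2, 0.3],
    "blh": [5.0, 4.0, 3.0, 2.0, 1.0],
}
xco2 = [410.0, 411.0, 412.0, 413.0, 414.0]
ranking = rank_by_correlation(features, xco2)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Ranking by absolute value keeps strongly negative correlations (such as the toy `blh` column) alongside strongly positive ones, which matches how a heat map is usually read.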

Further, the OCO-2 satellite band data includes longitude (lon), latitude (lat), the solar zenith and azimuth angles, and the satellite zenith and azimuth angles; the ERA5 atmospheric data includes temperature, humidity, pressure, the U/V wind components, rainfall, boundary layer height (blh), cloud base height (cbh), cloud cover (tcc), total precipitation (tp) and vertical wind velocity.

Further, before the OCO-2 satellite band data is extracted, the extraction range is determined by resampling: a grid is drawn over the latitude and longitude range of the target area, the resampled resolution is set to 0.5°×0.5°, and from the longitude and latitude of each grid cell the Euclidean distance between each grid cell center and the center of each corresponding pixel of the original image is obtained as:

$$d = \sqrt{(\mathrm{lon}_k - \mathrm{lon}_i)^2 + (\mathrm{lat}_k - \mathrm{lat}_i)^2}$$

where $\mathrm{lon}_k$ and $\mathrm{lat}_k$ are the longitude and latitude of the fixed station, and $\mathrm{lon}_i$ and $\mathrm{lat}_i$ are the longitude and latitude of the grid cell.
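The distance-based matching can be sketched as follows. This is a minimal illustration that assumes the distance is the plain Euclidean distance in degrees, as the text states; the bounding box is a hypothetical choice, while the station coordinates are those given for the Tsukuba site in the description of Figure 1.

```python
import math

def euclidean_deg(lon_k, lat_k, lon_i, lat_i):
    """Euclidean distance in degrees between a station (k) and a grid cell (i)."""
    return math.sqrt((lon_k - lon_i) ** 2 + (lat_k - lat_i) ** 2)

def grid_centers(lon_min, lon_max, lat_min, lat_max, res=0.5):
    """Centers of a res x res degree grid covering the bounding box."""
    n_lon = int(round((lon_max - lon_min) / res))
    n_lat = int(round((lat_max - lat_min) / res))
    return [
        (lon_min + (a + 0.5) * res, lat_min + (b + 0.5) * res)
        for a in range(n_lon)
        for b in range(n_lat)
    ]

def nearest_cell(lon_k, lat_k, centers):
    """Grid cell center closest to the station."""
    return min(centers, key=lambda c: euclidean_deg(lon_k, lat_k, c[0], c[1]))

# Hypothetical 2 x 2 degree box around the Tsukuba station (140.12 E, 36.05 N).
centers = grid_centers(139.0, 141.0, 35.0, 37.0)
cell = nearest_cell(140.12, 36.05, centers)
print(cell)
```

A distance in degrees ignores the shrinking of longitude spacing with latitude; that simplification follows the formula in the text and is adequate only for small regions.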

Further, outliers in the 9 weak_CO2 band and 6 O2 band data are handled as follows:

$$|x - \bar{x}| > 3\sigma$$

where $x$ is a measured band value, $\bar{x}$ is the mean of that day's data and $\sigma$ is the standard deviation of that day's data; that is, all values outside ±3σ are removed as outliers, and the values of each band measured multiple times per day at each station are averaged.
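A minimal sketch of the ±3σ screening followed by daily averaging is given below. It assumes the population standard deviation of the day's readings; the text does not say whether the sample or population form is used, and the readings are made-up values.

```python
import statistics

def three_sigma_mean(values):
    """Drop readings farther than 3 sigma from the day's mean, then average the rest."""
    if len(values) < 2:
        return statistics.fmean(values)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    kept = [v for v in values if abs(v - mu) <= 3 * sigma]
    return statistics.fmean(kept)

# Hypothetical readings of one band at one station on one day, with one spike.
readings = [1.0] * 20 + [50.0]
daily_value = three_sigma_mean(readings)
print(daily_value)
```

With very few readings per day a single spike can never exceed 3σ of its own sample, so the rule only becomes effective once a station has enough soundings per day.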

Further, the decision tree uses the Gini index to select the splitting attribute. Assuming that the proportion of class-k samples in the current sample set X is $p_k$ (k = 1, 2, 3, …, y), the Gini value is:

$$\mathrm{Gini}(X) = \sum_{k=1}^{y}\sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{y} p_k^2$$

Gini(X) is the probability that two samples drawn at random from X carry different class labels; equivalently, the Gini impurity is the probability that a sample is selected multiplied by the probability that it is misclassified. The smaller Gini(X), the higher the purity of the sample set X; when all samples in a node belong to one class, the Gini impurity is 0.

Assuming that the discrete attribute a has V possible values, splitting the sample set X on a produces V branch nodes; let $X^v$ denote the v-th branch node, which contains all samples of X that take the v-th value of attribute a. The Gini index of attribute a is then defined as:

$$\mathrm{Gini}(X, a) = \sum_{v=1}^{V} \frac{|X^v|}{|X|}\,\mathrm{Gini}(X^v)$$

The Gini index Gini(X, a) represents the uncertainty of the sample set X after it is split on attribute a; the larger the Gini index, the greater the uncertainty of the samples.
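The two Gini quantities can be computed directly from label counts. The sketch below assumes the standard definitions (Gini value as one minus the sum of squared class proportions, Gini index as the size-weighted Gini value of the branches); the labels and attribute values are toy data.

```python
from collections import Counter

def gini(labels):
    """Gini value 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_index(attr_values, labels):
    """Size-weighted Gini value of the branches produced by a discrete attribute."""
    n = len(labels)
    branches = {}
    for a, y in zip(attr_values, labels):
        branches.setdefault(a, []).append(y)
    return sum(len(branch) / n * gini(branch) for branch in branches.values())

labels = ["a", "a", "b", "b"]
print(gini(labels))                              # mixed node
print(gini_index(["x", "x", "y", "y"], labels))  # attribute separates the classes
print(gini_index(["x", "y", "x", "y"], labels))  # attribute is uninformative
```

A tree builder would evaluate `gini_index` for every candidate attribute and split on the one with the smallest value.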

Further, in XGBoost, assuming there are K trees in total and F denotes the space of tree models, the predicted value $\hat{y}_i$ is expressed as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $x_i$ is the input instance, i.e. the feature vector of the i-th data point; K is the number of CART trees; and $f_k$ is the k-th CART tree.

The corresponding objective function L is:

$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l$ is the loss function, which measures the error between the predicted value and the true value; $y_i$ is the true value; and $\Omega$ is the regularization function, which prevents the model from overfitting.

Further, in the ordinary random forest, for the feature parameter set X of the data set, models $h(X, \theta_i)$, i = 1, 2, …, k, are built; m features are selected at random, and each node is split on the feature with the maximum information gain, where the information gain is expressed as:

[Equation images BDA0003996675280000051 and BDA0003996675280000052: information gain formulas, not reproduced in this text]

where i is the regression value, $p_i$ is the probability that the corresponding value occurs, w is the number of partition nodes, and the symbol in image BDA0003996675280000053 is the weight of the m-th partition leaf node.

Further, in the extreme random forest, assuming that the generalization error of the i-th individual learner is $E_i$, the weighted generalization error of the learners is:

$$\bar{E} = \sum_{i=1}^{T} w_i E_i$$

Assuming that the divergence of the i-th individual learner is $A_i$, the weighted divergence of the learners is:

$$\bar{A} = \sum_{i=1}^{T} w_i A_i$$

The generalization error of the ensemble is expressed as:

$$E = \bar{E} - \bar{A}$$

where $w_i$ is the weight of the i-th learner and T is the total number of structurally different decision trees.
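The relation between ensemble error, weighted individual error and weighted divergence can be checked numerically. The sketch below assumes squared error at a single sample, a weighted-average ensemble with weights summing to 1, and divergence measured as the squared distance of each learner from the ensemble output; these specifics are assumptions, since the text only names the quantities. The predictions are made-up values.

```python
def ambiguity_decomposition(preds, weights, y):
    """Return (ensemble error, weighted individual error, weighted divergence)."""
    ensemble = sum(w * h for w, h in zip(weights, preds))
    e_bar = sum(w * (h - y) ** 2 for w, h in zip(weights, preds))
    a_bar = sum(w * (h - ensemble) ** 2 for w, h in zip(weights, preds))
    return (ensemble - y) ** 2, e_bar, a_bar

# Three hypothetical tree predictions (ppm) for one sample whose true value is 410.0.
E, e_bar, a_bar = ambiguity_decomposition(
    preds=[411.2, 409.5, 410.9],
    weights=[1 / 3, 1 / 3, 1 / 3],
    y=410.0,
)
print(E, e_bar - a_bar)  # the two numbers agree up to rounding
```

Because the divergence term is never negative, the ensemble error cannot exceed the weighted individual error, and more diverse trees lower it further. This is the usual argument for the extra randomization in extreme random forests.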

Further, in gradient boosting, the new learner obtained in each iteration is fitted to the residuals of the previous learner, and the predictions of all trees are finally summed to complete the prediction task. The residuals are obtained as:

$$r_{ni} = y_i - f_{n-1}(x_i)$$

where $y_i$ is the measured value of the i-th sample and $f_{n-1}(x_i)$ is the prediction of the previous round's learner. Fitting the residuals yields a residual model $h_n(x)$, and the regression tree is updated as:

$$f_n(x) = f_{n-1}(x) + h_n(x)$$
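The residual-fitting loop can be sketched with one-dimensional regression stumps standing in for the regression trees. This is an illustrative assumption: the patent does not specify tree depth, learning rate or the initial model, so the sketch starts from the mean of the targets and uses no shrinkage. The data are toy values.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (single threshold) on 1-D inputs."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between distinct inputs
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=20):
    """Build f_n = f_{n-1} + h_n, with h_n fitted to r_ni = y_i - f_{n-1}(x_i)."""
    f0 = sum(ys) / len(ys)  # assumption: the initial model is the mean of the targets
    stumps = []

    def predict(x):
        return f0 + sum(h(x) for h in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

# Hypothetical 1-D feature and target values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [410.0, 410.5, 411.5, 411.0, 412.5, 413.0]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
mse_base = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boost = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_base, mse_boost)
```

Practical implementations usually scale each $h_n$ by a learning rate below 1 to limit overfitting; that refinement is omitted here to stay close to the update rule in the text.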

Further, the coefficient of determination R², root mean square error RMSE, mean absolute error MAE and mean relative error MRE are obtained as:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - f_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(f_i - y_i)^2}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|f_i - y_i|$$

$$\mathrm{MRE} = \frac{1}{N}\sum_{i=1}^{N}\frac{|f_i - y_i|}{y_i}$$

where N is the number of samples, $f_i$ is the predicted value, $y_i$ is the true value and $\bar{y}$ is the mean of the true values.
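The four evaluation metrics can be computed with a few lines of code. The sketch below assumes the standard textbook definitions, since the formula images are not available in this text, and the sample values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """R2, RMSE, MAE and MRE under their standard definitions (an assumption)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(f - y) for y, f in zip(y_true, y_pred)) / n,
        "MRE": sum(abs(f - y) / abs(y) for y, f in zip(y_true, y_pred)) / n,
    }

# Hypothetical true and predicted column concentrations in ppm.
y_true = [410.0, 411.0, 412.0, 413.0]
y_pred = [410.5, 410.5, 412.5, 412.5]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing models then amounts to calling `regression_metrics` on each model's test-set predictions and preferring the one with the highest R² and the lowest error values.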

By adopting the above technical solution, the present invention achieves the following technical effects: the method uses several ensemble learning approaches to predict the column-averaged atmospheric CO2 concentration from satellite, vegetation, surface topography, atmospheric and aerosol data, which is of practical significance. Compared with traditional physical retrieval methods, it considers a sufficient set of features, is easy to modify and interpret, is simple to operate, and greatly improves prediction efficiency. It predicts the CO2 column concentration well, allows ensemble learning models to predict CO2 column concentrations for different satellites more accurately, and provides data support for the decisions of environmental protection departments.

Brief Description of the Drawings

Figure 1 shows spectra observed by the OCO-2 satellite; the observation sample is from the Tsukuba station (36.05°N, 140.12°E) on January 1, 2019, where (a) is the weak CO2 absorption band and (b) is the O2 A-band absorption band;

Figure 2 is a bar chart of the importance of the top 31 features;

Figure 3 is a correlation map of the influencing factors in the satellite retrieval of the column-averaged CO2 concentration;

Figure 4 shows the training results of the five prediction models;

Figure 5 shows the difference between the CO2 column concentration predicted on the test set and the true value for the five prediction models;

Figure 6 shows how the prediction performance of the extreme random forest regression model varies with its own parameters;

Figure 7 is a flow chart of the spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning.

Detailed Description

The embodiments of the present invention are implemented on the basis of the technical solution of the present invention, and detailed implementations and specific operating procedures are given, but the scope of protection of the present invention is not limited to the following embodiments.

Example 1

This embodiment provides a spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning, comprising:

S1. Acquire OCO-2 satellite band data and extract it through a sensitivity analysis of the atmospheric carbon dioxide retrieval parameters; because the weak CO2 absorption band is strongly affected by water vapor, the corresponding absorption channels are selected at the strong absorption band (1.61 μm), obtaining data for 9 weak_CO2 bands and 6 O2 bands;

The OCO-2 satellite band data includes longitude (lon), latitude (lat), the solar zenith and azimuth angles, and the satellite zenith and azimuth angles.

It should be noted that, before the OCO-2 satellite band data is extracted, the extraction range is determined by resampling: a grid is drawn over the latitude and longitude range of the target area, the resampled resolution is set to 0.5°×0.5°, and from the longitude and latitude of each grid cell the Euclidean distance between each grid cell center and the center of each corresponding pixel of the original image is obtained as:

$$d = \sqrt{(\mathrm{lon}_k - \mathrm{lon}_i)^2 + (\mathrm{lat}_k - \mathrm{lat}_i)^2}$$

where $\mathrm{lon}_k$ and $\mathrm{lat}_k$ are the longitude and latitude of the fixed station, and $\mathrm{lon}_i$ and $\mathrm{lat}_i$ are the longitude and latitude of the grid cell.

Preferably, outliers in the 9 weak_CO2 band and 6 O2 band data are handled as follows:

$$|x - \bar{x}| > 3\sigma$$

where $x$ is a measured band value, $\bar{x}$ is the mean of that day's data and $\sigma$ is the standard deviation of that day's data; that is, all values outside ±3σ are removed as outliers, and the values of each band measured multiple times per day at each station are averaged.

S2. Perform feature screening on the 9 weak_CO2 band and 6 O2 band data, the NDVI (normalized difference vegetation index) data, the SR surface reflectance data, the DEM elevation data, the ERA5 atmospheric data, the AOD aerosol data and the TCCON station observation data, and retain the top 31 features ranked by importance;

Specifically, feature screening is performed using the resampling method described above.

S3. Perform correlation analysis on the 31 retained features using a heat map to identify the features that correlate strongly and weakly with the CO2 column concentration;

S4. Merge the strongly and weakly correlated features and input them into the five ensemble-learning regression models, each of which outputs its predicted CO2 column concentration;

Specifically, in the extreme random forest, assuming that the generalization error of the i-th individual learner is $E_i$, the weighted generalization error of the learners is:

$$\bar{E} = \sum_{i=1}^{T} w_i E_i$$

Assuming that the divergence of the i-th individual learner is $A_i$, the weighted divergence of the learners is:

$$\bar{A} = \sum_{i=1}^{T} w_i A_i$$

The generalization error of the ensemble can be expressed as:

$$E = \bar{E} - \bar{A}$$

where $w_i$ is the weight of the i-th learner and T is the total number of structurally different decision trees.

Comparing the accuracy of the predictions from these models shows that the extreme random forest regression model has the highest coefficient of determination R², the smallest error and the best prediction performance, clearly outperforming the other models. The data for the four evaluation metrics are shown in the following table:

Table 1. Data for the four evaluation metrics

[Table image BDA0003996675280000094: values not reproduced in this text]

The embodiments of the present invention are preferred implementations and do not limit the present invention in any form. The technical features or combinations of technical features described in the embodiments should not be regarded as isolated; they can be combined with one another to achieve better technical effects. The scope of the preferred embodiments of the present invention also includes other implementations, as will be understood by those skilled in the art to which the embodiments belong.

Claims (10)

1. A spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning, characterized by comprising:
S1. acquiring OCO-2 satellite band data and extracting it through a sensitivity analysis of the atmospheric carbon dioxide retrieval parameters, obtaining data for 9 weak_CO2 bands and 6 O2 bands;
S2. performing feature screening on the 9 weak_CO2 band and 6 O2 band data, the NDVI (normalized difference vegetation index) data, the SR surface reflectance data, the DEM elevation data, the ERA5 atmospheric data, the AOD aerosol data and the TCCON station observation data, and retaining the top 31 features ranked by importance;
S3. performing correlation analysis on the 31 retained features using a heat map to identify the features that correlate strongly and weakly with the CO2 column concentration;
S4. merging the strongly and weakly correlated features into the input feature data set, then estimating the column-averaged CO2 concentration with decision tree, XGBoost, ordinary random forest, extreme random forest and gradient boosting regression models respectively; comparing the models on the coefficient of determination R², root mean square error RMSE, mean absolute error MAE, mean relative error MRE and the prediction accuracy within the allowable error range; finding that the model with the highest prediction accuracy is the extreme random forest regression model; and using the extreme random forest regression model to predict the column-averaged CO2 concentration.
2. The spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning according to claim 1, characterized in that the OCO-2 satellite band data includes longitude (lon), latitude (lat), the solar zenith and azimuth angles, and the satellite zenith and azimuth angles; and the ERA5 atmospheric data includes temperature, humidity, pressure, the U/V wind components, rainfall, boundary layer height, cloud base height, cloud cover, total precipitation and vertical wind velocity.
3. The spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning according to claim 1, characterized in that, before the OCO-2 satellite band data is extracted, the extraction range is determined by resampling: a grid is drawn over the latitude and longitude range of the target area, the resampled resolution is set to 0.5°×0.5°, and from the longitude and latitude of each grid cell the Euclidean distance between each grid cell center and the center of each corresponding pixel of the original image is obtained as:
$$d = \sqrt{(\mathrm{lon}_k - \mathrm{lon}_i)^2 + (\mathrm{lat}_k - \mathrm{lat}_i)^2}$$
where $\mathrm{lon}_k$ and $\mathrm{lat}_k$ are the longitude and latitude of the fixed station, and $\mathrm{lon}_i$ and $\mathrm{lat}_i$ are the longitude and latitude of the grid cell.
4. The spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning according to claim 1, characterized in that outliers in the 9 weak_CO2 band and 6 O2 band data are handled as follows:
$$|x - \bar{x}| > 3\sigma$$
where $x$ is a measured band value, $\bar{x}$ is the mean of that day's data and $\sigma$ is the standard deviation of that day's data; that is, all values outside ±3σ are removed as outliers.
5. The spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning according to claim 1, characterized in that the decision tree uses the Gini index to select the splitting attribute; assuming that the proportion of class-k samples in the current sample set X is $p_k$ (k = 1, 2, 3, …, y), the Gini value is:
$$\mathrm{Gini}(X) = \sum_{k=1}^{y}\sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{y} p_k^2$$
Gini(X) is the probability that two samples drawn at random from X carry different class labels; assuming that the discrete attribute a has V possible values, splitting the sample set X on a produces V branch nodes; let $X^v$ denote the v-th branch node, which contains all samples of X that take the v-th value of attribute a; the Gini index of attribute a is then defined as:
$$\mathrm{Gini}(X, a) = \sum_{v=1}^{V} \frac{|X^v|}{|X|}\,\mathrm{Gini}(X^v)$$
The Gini index Gini(X, a) represents the uncertainty of the sample set X after it is split on attribute a; the larger the Gini index, the greater the uncertainty of the samples.
6. The spaceborne shortwave infrared CO2 column concentration estimation method based on machine learning according to claim 1, characterized in that, in XGBoost, assuming there are K trees in total and F denotes the space of tree models, the predicted value $\hat{y}_i$ is expressed as:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where $x_i$ is the input instance, i.e. the feature vector of the i-th data point; K is the number of CART trees; and $f_k$ is the k-th CART tree; the corresponding objective function L is:
$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $l$ is the loss function, which measures the error between the predicted value and the true value; $y_i$ is the true value; and $\Omega$ is the regularization function, which prevents the model from overfitting.
7. The machine-learning-based method for estimating spaceborne shortwave infrared CO2 column concentration according to claim 1, characterized in that, in the ordinary random forest, for the feature parameter set X of the data set, models h(X, θ_i), i = 1, 2, …, k are built, and m features are randomly selected so that each leaf node splits on the feature with the maximum information gain; the information gain is expressed as:
Figure FDA0003996675270000041
Figure FDA0003996675270000042
where i is the regression value, p_i is the probability that the corresponding value occurs, w is the number of partition nodes, and
Figure FDA0003996675270000043
is the weight of the m-th partition leaf node.
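Because the two equation images are missing from the text, the exact formula used in the claim cannot be recovered here. The sketch below assumes the textbook entropy-based definition: the entropy is -sum of p_i * log2(p_i), and the gain is the parent entropy minus the weighted entropy of the child nodes, with each weight assumed to be the fraction of samples falling into that child. The names `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """H = -sum_i p_i * log2(p_i) over the distinct values in the node."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(parent, children):
    """Parent entropy minus the sample-fraction-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children if ch)
    return entropy(parent) - weighted

# Hypothetical node with two equally frequent target values.
parent = [0, 0, 1, 1]
clean_split = [[0, 0], [1, 1]]   # separates the values completely
poor_split = [[0, 1], [0, 1]]    # leaves each child as mixed as the parent

print(information_gain(parent, clean_split))
print(information_gain(parent, poor_split))
```

Among the m randomly selected features, the split with the largest gain would be chosen; here that is the clean split.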
8. The machine-learning-based method for estimating spaceborne shortwave infrared CO2 column concentration according to claim 1, characterized in that, in the extreme random forest, assuming that the generalization error of an individual learner is E_i, the weighted generalization error of the learners is:
Figure FDA0003996675270000044
Assuming that the divergence (ambiguity) of an individual learner is A_i, the weighted divergence of the learners is:
Figure FDA0003996675270000045
The generalization error of the ensemble is then expressed as:
Figure FDA0003996675270000046
where w_i is the weight and T is the total number of structurally different decision trees.
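The three quantities in this claim resemble the classical error-ambiguity decomposition for a weighted-average ensemble, in which the ensemble error equals the weighted individual error minus the weighted ambiguity. The equation images are not available, so this correspondence is an assumption. The sketch below checks the identity numerically at a single point, assuming squared error and weights that sum to 1; the function name `ensemble_terms` and the numbers are hypothetical.

```python
def ensemble_terms(preds, weights, y):
    """Return (E_bar, A_bar, E) for one target value y under squared error."""
    H = sum(w * p for w, p in zip(weights, preds))                # ensemble output
    E_bar = sum(w * (p - y) ** 2 for w, p in zip(weights, preds)) # weighted error
    A_bar = sum(w * (p - H) ** 2 for w, p in zip(weights, preds)) # weighted ambiguity
    E = (H - y) ** 2                                              # ensemble error
    return E_bar, A_bar, E

# Hypothetical outputs of T = 3 trees with weights w_i summing to 1.
preds = [2.0, 3.0, 5.0]
weights = [0.5, 0.25, 0.25]
E_bar, A_bar, E = ensemble_terms(preds, weights, y=4.0)
print(E_bar, A_bar, E)
```

Under these assumptions the ensemble error is never larger than the weighted individual error, and the gap grows with the disagreement among the trees, which is the usual motivation for injecting extra randomness into tree construction.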
9. The machine-learning-based method for estimating spaceborne shortwave infrared CO2 column concentration according to claim 1, characterized in that, in gradient boosting, the new learner obtained at each iteration is fitted to the residuals of the previous learner, and the predictions of all trees are finally summed to complete the prediction task; the residual is obtained as:
r_{ni} = y_i - f_{n-1}(x_i)
where y_i is the measured value of the i-th sample and f_{n-1}(x_i) is the prediction of the previous round's learner. Fitting the residuals yields a residual model h_n(x), and the regression tree is updated as:
f_n(x) = f_{n-1}(x) + h_n(x).
10. The machine-learning-based method for estimating spaceborne shortwave infrared CO2 column concentration according to claim 1, characterized in that the coefficient of determination R², root mean square error RMSE, mean absolute error MAE, and mean relative error MRE are obtained as follows:
Figure FDA0003996675270000051
Figure FDA0003996675270000052
Figure FDA0003996675270000053
Figure FDA0003996675270000054
where N is the number of samples; f_i is the predicted value; y_i is the true value; and
Figure FDA0003996675270000055
is the mean value.
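The residual-fitting update of claim 9 and the four evaluation metrics of claim 10 can be sketched together. Everything below is an illustration under assumptions: the weak learner is a depth-1 regression tree on a single scalar feature, no learning rate is applied, the metrics use their common textbook definitions (the patent's equation images are not available, and its R² or MRE may be defined differently, for example as a squared correlation or as a percentage), and the sample values are invented for the example rather than taken from the patent.

```python
import math

def fit_stump(xs, rs):
    """Least-squares depth-1 regression tree fitted to the residuals rs."""
    best = None
    for t in sorted(set(xs))[1:]:                      # candidate thresholds
        left = [r for x, r in zip(xs, rs) if x < t]
        right = [r for x, r in zip(xs, rs) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def boost(xs, ys, rounds=20):
    """Each round fits h_n to r_ni = y_i - f_{n-1}(x_i), then f_n = f_{n-1} + h_n."""
    mean = sum(ys) / len(ys)
    learners = [lambda x: mean]                        # f_0: constant initial model
    def f(x):
        return sum(h(x) for h in learners)
    for _ in range(rounds):
        residuals = [y - f(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return f

def metrics(preds, ys):
    """Assumed textbook forms of R2, RMSE, MAE and MRE."""
    n = len(ys)
    y_mean = sum(ys) / n
    ss_res = sum((y - f) ** 2 for f, y in zip(preds, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(f - y) for f, y in zip(preds, ys)) / n,
        "MRE": sum(abs(f - y) / abs(y) for f, y in zip(preds, ys)) / n,
    }

# Invented toy data standing in for (feature, column concentration in ppm) pairs.
xs = [1, 2, 3, 4, 5, 6]
ys = [410.2, 411.0, 412.5, 413.1, 414.8, 415.9]
model = boost(xs, ys, rounds=50)
print(metrics([model(x) for x in xs], ys))
```

Because each stump is a least-squares fit to the current residuals, the training error cannot increase from one round to the next; a practical setup would evaluate the metrics on held-out data instead of the training samples used here.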
CN202211594763.4A 2022-12-13 2022-12-13 Machine learning-based satellite-borne short wave infrared CO2 column concentration estimation method Pending CN116189796A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211594763.4A CN116189796A (en) 2022-12-13 2022-12-13 Machine learning-based satellite-borne short wave infrared CO2 column concentration estimation method

Publications (1)

Publication Number Publication Date
CN116189796A true CN116189796A (en) 2023-05-30

Family

ID=86451347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211594763.4A Pending CN116189796A (en) 2022-12-13 2022-12-13 Machine learning-based satellite-borne short wave infrared CO2 column concentration estimation method

Country Status (1)

Country Link
CN (1) CN116189796A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455066A (en) * 2023-11-13 2024-01-26 哈尔滨航天恒星数据系统科技有限公司 Corn planting accurate fertilizer distribution method based on multi-strategy optimization random forest, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
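Li Jingbo (李静波) et al.: "Research on machine-learning-based estimation of spaceborne shortwave infrared CO2 column concentration" (基于机器学习的星载短波红外CO2柱浓度估算研究), China Environmental Science (《中国环境科学》), 21 November 2022 (2022-11-21), pages 1-14 *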

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination