CN115293231A

CN115293231A - A Random Forest Prediction Method for Regional Ecological Harmony

Info

Publication number: CN115293231A
Application number: CN202210747133.XA
Authority: CN
Inventors: 王涛; 杨凯越
Original assignee: Chinese Academy of Geological Sciences
Current assignee: Chinese Academy of Geological Sciences
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2022-11-04

Abstract

A random forest method for predicting regional ecological harmony comprises the following steps: 1. describing in fine detail, from the subsurface to the surface and across different time scales, the elements of mountains, rivers, forests, farmland, lakes and grasslands; 2. using long-term satellite remote sensing to produce a fine, quantitative, period-by-period interpretation of the natural elements over nearly one hundred years; 3. collecting characterization data on human factors for different years within the study area; 4. establishing a function of human activity factors by multiple regression and machine learning; 5. analyzing the period and frequency of the resulting curve, using the mathematical model to predict how each element will change in the future, and predicting the influence of human activities on the other elements. The method has excellent accuracy; runs efficiently on large data sets; can process input samples with high-dimensional features without dimensionality reduction; can evaluate the importance of each feature for the classification problem; yields an unbiased internal estimate of the generalization error while the forest is generated; and gives good results even when values are missing.

Description

Regional Ecological Harmony Random Forest Prediction Method

Technical Field

The invention belongs to the technical field of ecological environment prediction, and in particular relates to a random forest method for predicting regional ecological harmony.

Background

By analyzing the human-land system of a city and estimating its environmental carrying capacity, the city can be planned as an ecologically harmonious three-dimensional city. On the basis of this technology, and through the intersection of the natural and social sciences in the humanities, society and management, the disturbance caused by human activities must be fully analyzed, and social benefits are incorporated into a secondary evaluation using social-science evaluation methods. According to local economic development needs, the demands of different stakeholders, the human settlement environment and other factors, and in combination with predictive economic and sociological models, the most complete planning combination is selected and provided to planners, together with year-by-year land-use strategies and specific restoration measures. How to improve the efficiency and accuracy of predicting regional ecological harmony is therefore a problem in urgent need of a solution.

Summary of the Invention

To overcome the deficiencies of the prior art, the present invention provides a random forest method for predicting regional ecological harmony. Evidence of long-term and rapid environmental change in Earth's history provides a key baseline for comparison with the present. The method explores the relationships within natural systems using intuitive, quantitative mathematical models; relies on high-precision dating techniques that span both geological and human time scales to connect human history with geological history; abstracts the disturbance factors and relationships that human elements impose on the system of mountains, rivers, forests, farmland, lakes and grasslands; and uses artificial intelligence algorithms to reconstruct and predict that system with high precision.

To solve the above technical problem, the present invention adopts the following technical solution: a random forest method for predicting regional ecological harmony, comprising the following steps:

First step: for the multiple elements, and integrating different time scales, describe in fine detail the elements of mountains, rivers, forests, farmland, lakes and grasslands from the subsurface to the surface;

Second step: using long-term satellite remote sensing, produce a fine, quantitative, period-by-period interpretation of the natural elements over nearly one hundred years, the natural elements including the areas of mountains, water, forests, farmland, lakes and grasslands, calibrated with one year as the time unit; and comprehensively interpret the human elements, which include the area of built-up land;

Third step: collect characterization data on human elements for different years within the study area, the characterization data including population, GDP and industrial development intensity, calibrated with one year as the time unit, and comprehensively interpret the human elements;

Fourth step: on the basis of historical data and expert judgment, judge the environmental carrying capacity of the study area over the study period, and use these judgments as the standard for subsequent model training. Establish a function of human activity elements by multiple regression and machine learning; based on systems thinking, use multiple regression and machine learning to establish a multi-element, multi-scale fitting relationship between humans and nature, where the scales include the time scale; examine the spontaneous evolution of the natural environment and the influence of human activities on that process; and fit a time-dependent mathematical model, the mathematical model being a curve function;

Fifth step: set the time to a chosen future time, analyze the period and frequency of the curve, use the mathematical model to predict how each element will change in the future, and predict the influence of human activities on the other elements; on the basis of this work, propose a lower limit on environmental carrying capacity, delimit the areas or proportions of mountains, water, forests, farmland, lakes and grasslands in the region, and delimit the baseline for safeguarding ecological functions, the bottom line for environmental quality and safety, and the upper limit on natural resource utilization, so as to guide the three-dimensional planning of an ecologically harmonious city.
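One possible reading of the period and frequency analysis in the fifth step is a discrete Fourier transform of a yearly indicator series. The sketch below is only an illustration under that assumption: the source does not name a spectral method, the function `dominant_period` is hypothetical, and the series is synthetic rather than data from the study area.

```python
import cmath
import math

def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]      # remove the constant offset
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):            # skip k = 0 (the mean)
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic yearly series (assumption): a 10-year cycle around a constant level.
years = list(range(1951, 2021))               # 70 yearly samples
series = [100 + 5 * math.sin(2 * math.pi * (y - 1951) / 10) for y in years]
period = dominant_period(series)
print(f"dominant period: {period:.1f} years")
```

With one sample per year, the returned period is in years and its reciprocal is the frequency in cycles per year.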

The first step specifically comprises: arranging dense shallow boreholes in key dissection areas and, through fine quantitative characterization of the subsurface geological bodies, establishing a three-dimensional spatial model that includes the elements of mountains, rivers, forests, farmland, lakes and grasslands; and, using multiple boreholes in combination with high-precision dating, calibrating time at millennial resolution from the late Holocene (>5000 a) and in 100-year units from 1000 years ago onward, so as to reconstruct the paleogeography and paleoenvironmental pattern in detail and finally establish a high-precision four-dimensional spatiotemporal geological model.

The multiple regression and machine learning method in the fourth step is a random forest model algorithm, whose indicators of geographic regional conditions are listed in the following table:

Table 1. Indicators of geographic regional conditions

Assume that data are collected over two periods. The first period runs from the middle-to-late Holocene to modern times (7000 BC to 1950) and comprises 50 time points; the second period runs from 1951 to 2021 and comprises 71 time points. The first period is used mainly to establish the background of geological evolution. For 1951 to 2020, the judgment as to whether the region can be planned as an ecologically harmonious three-dimensional city is known (1: it can; 0: it cannot); the judgment for 2021 is unknown and must be predicted by the trained model. Table 1 contains 23 indicators, so the standardized matrix Z has size 23 × 120 (50 + 70 time points, excluding 2021), and the corresponding judgment matrix Y has size 120 × 1; the matrix of known indicators for the year to be predicted, Z₂₀₂₁, has size 23 × 1.
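The matrix dimensions stated above can be checked with a short sketch. The indicator values and judgments here are random placeholders, not data from the patent, and the function name `build_matrices` is hypothetical.

```python
import random

N_INDICATORS = 23                    # indicators listed in Table 1
N_FIRST_PERIOD = 50                  # time points, 7000 BC to 1950
SECOND_PERIOD = range(1951, 2022)    # 1951..2021 inclusive, 71 time points

def build_matrices(seed=0):
    """Return (Z, Y, z_2021) filled with placeholder values."""
    rng = random.Random(seed)
    # 2021 is the year to predict, so it is excluded from the training columns.
    n_train = N_FIRST_PERIOD + len(SECOND_PERIOD) - 1
    # Z: one row per indicator, one column per training time point (23 x 120).
    Z = [[rng.gauss(0.0, 1.0) for _ in range(n_train)]
         for _ in range(N_INDICATORS)]
    # Y: one 0/1 judgment per training time point (120 x 1).
    Y = [rng.randint(0, 1) for _ in range(n_train)]
    # z_2021: indicators known for the year to predict (23 x 1).
    z_2021 = [rng.gauss(0.0, 1.0) for _ in range(N_INDICATORS)]
    return Z, Y, z_2021

Z, Y, z_2021 = build_matrices()
print(len(Z), len(Z[0]), len(Y), len(z_2021))   # 23 120 120 23
```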

The random forest model algorithm applies data preprocessing to the parameters in Table 1, including data cleaning (for example, handling missing values, smoothing noise, and identifying or removing outliers) and normalization, and comprises the following steps:
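The preprocessing operations named above could be sketched for a single indicator series as follows. The specific choices (median imputation, interquartile-range clipping, a three-point moving average, z-score standardization) are assumptions made for illustration; the source does not specify which techniques are used, and all function names are hypothetical.

```python
import statistics

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def clip_outliers(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, lo), hi) for v in values]

def smooth(values, window=3):
    """Centred moving average; the window shrinks at the ends of the series."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def preprocess(series):
    """Apply the four operations in sequence to one indicator series."""
    return standardize(smooth(clip_outliers(fill_missing(series))))

# Placeholder series with one missing value and one obvious outlier.
raw = [10.0, 11.0, None, 12.0, 95.0, 13.0, 12.5, 11.5]
clean = preprocess(raw)
print([round(v, 2) for v in clean])
```

Each of the 23 indicator rows would be passed through `preprocess` separately before being assembled into the standardized matrix.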

(1) random number generation, in which the growth of each tree in the model is the key step; (2) calculation of the prediction metrics MAE and MAPE; (3) optimization of the random forest parameters; (4) selection of the optimal model on the principle of highest accuracy; (5) direct calculation of the weight of each feature (a non-zero real number) from the optimal model generated by the random forest, and selection of a certain number of the more important features in descending order of weight.

Step (1) comprises the following three main steps:

A. Bootstrap sampling: if the training set has size N, then for each tree, n training samples are drawn at random with replacement from the training set to form that tree's training set;

B. Feature randomness: if each sample has feature dimension M, a constant m << M is specified and a subset of m features is selected at random from the M features; each time the tree splits, the best of these m features is chosen;

C. Each tree is grown to the greatest possible extent, without pruning.
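Steps A and B above can be illustrated with a minimal sketch using only the standard library. It assumes n = N (a common choice that the source does not fix) and m = 4; the helper names are hypothetical, and growing the unpruned tree of step C is not shown.

```python
import random

def bootstrap_sample(n_total, n_draw, rng):
    """Step A: draw n_draw row indices from range(n_total) with replacement."""
    return [rng.randrange(n_total) for _ in range(n_draw)]

def random_feature_subset(n_features, m, rng):
    """Step B: pick m distinct feature indices out of n_features."""
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(42)
N, M = 120, 23     # training samples and features, matching the sizes given earlier
m = 4              # assumed value; must satisfy m << M

rows = bootstrap_sample(N, N, rng)          # training rows for one tree
cols = random_feature_subset(M, m, rng)     # candidate features for a split
# Rows never drawn are "out of bag" and can serve to estimate the tree's error.
out_of_bag = sorted(set(range(N)) - set(rows))

print(len(rows), cols, len(out_of_bag))
```

Because sampling is with replacement, some rows appear more than once and others not at all, which is what decorrelates the trees.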

Step (2) is specifically as follows. Let $y_i$ be the true value of the result and $\hat{y}_i$ the estimated value of the result for sample $i$, over $n$ evaluated samples.

The prediction metric MAE (Mean Absolute Error) is the mean absolute error, with range [0, +∞). It equals 0 when the predicted values coincide exactly with the true values, i.e. for a perfect model; the larger the error, the larger the MAE:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

The prediction metric MAPE (Mean Absolute Percentage Error) is the mean absolute percentage error, with range [0, +∞). It equals 0 when the predicted values coincide exactly with the true values, i.e. for a perfect model; the larger the error, the larger the MAPE:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$
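The two metrics of step (2) can be computed with a few lines of Python. This is a sketch using the standard definitions of MAE and MAPE; the function names and sample values are illustrative. Because MAPE divides by the true value, the sketch raises an error when a true value is zero.

```python
def mae(y_true, y_pred):
    """Mean absolute error between true and estimated values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return 100.0 * sum(abs((t - p) / t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only.
y_true = [2.0, 4.0, 5.0]
y_pred = [1.0, 4.0, 6.0]
print(round(mae(y_true, y_pred), 4))    # 0.6667
print(round(mape(y_true, y_pred), 2))   # 23.33
```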

Step (3) is specifically: using classical hyperparameter tuning methods from machine learning, adjust the number of trees, the method of selecting the maximum number of features, the maximum tree depth, the minimum number of samples required to split a node, the minimum number of samples per leaf node, whether to search randomly for the most suitable parameter combination, and whether to use Bayesian optimization.
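The parameter search of step (3), followed by the selection of step (4), might be organized as in the sketch below. The search space values, the parameter names and the scoring function `toy_score` are all assumptions made for illustration; in practice the score would be the accuracy of a forest trained with each candidate setting. Bayesian optimization is not shown.

```python
import itertools
import random

# Hypothetical search space covering the parameters listed in step (3).
SEARCH_SPACE = {
    "n_trees": [100, 300, 500],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

def grid_candidates(space):
    """Enumerate every parameter combination (exhaustive grid search)."""
    keys = list(space)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(space[k] for k in keys))]

def random_candidates(space, n_iter, seed=0):
    """Draw n_iter distinct combinations at random (random search)."""
    rng = random.Random(seed)
    return rng.sample(grid_candidates(space), n_iter)

def select_best(candidates, score_fn):
    """Step (4): keep the candidate with the highest score."""
    return max(candidates, key=score_fn)

def toy_score(params):
    """Placeholder standing in for the measured accuracy of a trained forest."""
    depth_bonus = 0.0 if params["max_depth"] is None else 0.01
    return (0.8 + 0.0001 * params["n_trees"]
            - 0.01 * params["min_samples_leaf"] + depth_bonus)

all_params = grid_candidates(SEARCH_SPACE)
best = select_best(all_params, toy_score)
sampled = random_candidates(SEARCH_SPACE, 10)
print(len(all_params), best)
```

Random search evaluates only a subset of the grid, which trades some thoroughness for a shorter running time.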

With the above technical solution, the present invention has the following technical effects:

Viewed spatially, the basic conditions of the geographic elements that make up the natural resources and human living environment of a region (such as a city) can be regarded as geographic information processed to three different depths according to need: perception, statistics and analysis. Establishing the multi-element, multi-scale (time-scale) fitting relationship between humans and nature is the core problem addressed by this method.

Following Ma Wanzhong and Du Qingyun (Research on the System Framework of Geographical National Conditions Monitoring [J]. Land and Resources Science and Technology Management, 2011, 28(06): 104-111), these elements can be grouped into natural environment elements, social and cultural elements, and industrial and economic elements. Following the indicator-system principles proposed by Liu Kai (Research on the evolution of ecologically fragile human-land systems and the selection of sustainable development models [D]. Shandong Normal University, 2017), the target is whether a given region can be planned as an ecologically harmonious three-dimensional city. A prediction model is built by the random forest method from indicators of two elements, the natural environment and the economy and society, and a weight analysis of each indicator's influence on the prediction result is then obtained.

The indicators listed in Table 1 may be supplemented or reduced as appropriate; the more features are included, the higher the accuracy. Indicators with large weights should be retained as far as possible, while reducing the number of features lowers the running time; it is recommended to use the data set of important features selected at a 95% threshold. As for collection time, data may be collected at several time points within the same year, for example one data point per month, which greatly increases the sample size. Increasing the sample size can improve the accuracy of the prediction model.
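The 95% threshold mentioned above is not defined further in the source. One plausible interpretation, assumed here, is to keep the highest-weighted features until their cumulative share of the total weight reaches 95%. The function name, indicator names and weights in this sketch are hypothetical.

```python
def select_by_cumulative_weight(weights, threshold=0.95):
    """Keep top-weighted features until their cumulative share reaches threshold."""
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, weight in ranked:
        selected.append(name)
        running += weight / total
        if running >= threshold:
            break
    return selected

# Hypothetical feature weights, as might be read off a trained forest.
weights = {
    "population": 0.40,
    "gdp": 0.30,
    "forest_area": 0.15,
    "lake_area": 0.08,
    "grassland_area": 0.04,
    "farmland_area": 0.03,
}
kept = select_by_cumulative_weight(weights)
print(kept)
```

With these placeholder weights the first five features account for 97% of the total, so the lowest-weighted one is dropped.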

The classification performance (error rate) of a random forest depends on two factors: the correlation between any two trees in the forest (the greater the correlation, the higher the error rate) and the classification strength of each tree (the stronger each tree, the lower the error rate of the whole forest). Reducing the number of selected features m lowers both the correlation between trees and their classification strength; increasing m raises both. The key question is therefore how to choose the optimal m (or its range), which is the one adjustable parameter to which a random forest is particularly sensitive.

The present invention uses the random forest model algorithm, which has the following advantages: 1) excellent accuracy among current algorithms; 2) efficient operation on large data sets; 3) the ability to process input samples with high-dimensional features without dimensionality reduction; 4) the ability to evaluate the importance of each feature for the classification problem; 5) an unbiased internal estimate of the generalization error, obtained while the forest is generated; 6) good results even when values are missing.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the random forest model;

Figure 2 is a flow diagram of the random forest model algorithm;

Figure 3 is a diagram of the ranking of prediction indicator weights;

Figure 4 is a diagram comparing the prediction model results with the actual results.

Detailed Description

As shown in Figures 1 to 4, the random forest method for predicting regional ecological harmony of the present invention comprises the following steps:

Step (2) is specifically as set out above, where $y_i$ is the true value of the result and $\hat{y}_i$ is the estimated value of the result.

This embodiment does not limit the shape, material, structure or other aspects of the present invention in any way. Any simple modification, equivalent change or adaptation made to the above embodiment in accordance with the technical essence of the present invention falls within the scope of protection of the technical solution of the present invention.

Claims

1. A random forest method for predicting regional ecological harmony, characterized by comprising the following steps:

First step: for the multiple elements, and integrating different time scales, describe in fine detail the elements of mountains, rivers, forests, farmland, lakes and grasslands from the subsurface to the surface;

Second step: using long-term satellite remote sensing, produce a fine, quantitative, period-by-period interpretation of the natural elements over nearly one hundred years, the natural elements including the areas of mountains, water, forests, farmland, lakes and grasslands, calibrated with one year as the time unit; and comprehensively interpret the human elements, which include the area of built-up land;

Third step: collect characterization data on human elements for different years within the study area, the characterization data including population, GDP and industrial development intensity, calibrated with one year as the time unit, and comprehensively interpret the human elements;

Fourth step: on the basis of historical data and expert judgment, judge the environmental carrying capacity of the study area over the study period, and use these judgments as the standard for subsequent model training; establish a function of human activity elements by multiple regression and machine learning; based on systems thinking, use multiple regression and machine learning to establish a multi-element, multi-scale fitting relationship between humans and nature, where the scales include the time scale; examine the spontaneous evolution of the natural environment and the influence of human activities on that process; and fit a time-dependent mathematical model, the mathematical model being a curve function;

Fifth step: set the time to a chosen future time, analyze the period and frequency of the curve, use the mathematical model to predict how each element will change in the future, and predict the influence of human activities on the other elements; on the basis of this work, propose a lower limit on environmental carrying capacity, delimit the areas or proportions of mountains, water, forests, farmland, lakes and grasslands in the region, and delimit the baseline for safeguarding ecological functions, the bottom line for environmental quality and safety, and the upper limit on natural resource utilization, so as to guide the three-dimensional planning of an ecologically harmonious city.

2. The random forest method for predicting regional ecological harmony according to claim 1, characterized in that the first step specifically comprises: arranging dense shallow boreholes in key dissection areas and, through fine quantitative characterization of the subsurface geological bodies, establishing a three-dimensional spatial model that includes the elements of mountains, rivers, forests, farmland, lakes and grasslands; and, using multiple boreholes in combination with high-precision dating, calibrating time at millennial resolution from the late Holocene (>5000 a) and in 100-year units from 1000 years ago onward, so as to reconstruct the paleogeography and paleoenvironmental pattern in detail and finally establish a high-precision four-dimensional spatiotemporal geological model.

3. The random forest method for predicting regional ecological harmony according to claim 1, characterized in that the multiple regression and machine learning method in the fourth step is a random forest model algorithm, whose indicators of geographic regional conditions are listed in the following table:

Table 1. Indicators of geographic regional conditions

Assume that data are collected over two periods. The first period runs from the middle-to-late Holocene to modern times (7000 BC to 1950) and comprises 50 time points; the second period runs from 1951 to 2021 and comprises 71 time points. The first period is used mainly to establish the background of geological evolution. For 1951 to 2020, the judgment as to whether the region can be planned as an ecologically harmonious three-dimensional city is known (1: it can; 0: it cannot); the judgment for 2021 is unknown and must be predicted by the trained model. Table 1 contains 23 indicators, so the standardized matrix Z has size 23 × 120, and the corresponding judgment matrix Y has size 120 × 1; the matrix of known indicators for the year to be predicted, Z₂₀₂₁, has size 23 × 1.

4. The random forest method for predicting regional ecological harmony according to claim 3, characterized in that the random forest model algorithm applies data preprocessing to the parameters in Table 1, including data cleaning (for example, handling missing values, smoothing noise, and identifying or removing outliers) and normalization, and comprises the following steps:

(1) random number generation, in which the growth of each tree in the model is the key step; (2) calculation of the prediction metrics MAE and MAPE; (3) optimization of the random forest parameters; (4) selection of the optimal model on the principle of highest accuracy; (5) direct calculation of the weight of each feature (a non-zero real number) from the optimal model generated by the random forest, and selection of a certain number of the more important features in descending order of weight.

5. The random forest method for predicting regional ecological harmony according to claim 4, characterized in that step (1) comprises the following three main steps:

A. Bootstrap sampling: if the training set has size N, then for each tree, n training samples are drawn at random with replacement from the training set to form that tree's training set;

B. Feature randomness: if each sample has feature dimension M, a constant m << M is specified and a subset of m features is selected at random from the M features; each time the tree splits, the best of these m features is chosen;

C. Each tree is grown to the greatest possible extent, without pruning.

6. The random forest method for predicting regional ecological harmony according to claim 5, characterized in that step (2) is specifically as follows: $y_i$ is the true value of the result and $\hat{y}_i$ is the estimated value of the result for sample $i$, over $n$ evaluated samples;

The prediction metric MAE (Mean Absolute Error) is the mean absolute error, with range [0, +∞); it equals 0 when the predicted values coincide exactly with the true values, i.e. for a perfect model; the larger the error, the larger the MAE:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

The prediction metric MAPE (Mean Absolute Percentage Error) is the mean absolute percentage error, with range [0, +∞); it equals 0 when the predicted values coincide exactly with the true values, i.e. for a perfect model; the larger the error, the larger the MAPE:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$

7. The random forest method for predicting regional ecological harmony according to claim 6, characterized in that step (3) is specifically: using classical hyperparameter tuning methods from machine learning, adjust the number of trees, the method of selecting the maximum number of features, the maximum tree depth, the minimum number of samples required to split a node, the minimum number of samples per leaf node, whether to search randomly for the most suitable parameter combination, and whether to use Bayesian optimization.