CN114398951A - Land use change driving factor mining method based on random forest and crowd-sourced geographic information - Google Patents
- Publication number
- CN114398951A (application CN202111529458.2A)
- Authority
- CN
- China
- Prior art keywords
- land
- driving
- random forest
- model
- variable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
Abstract
Description
Technical field
The invention belongs to the field of geographic big data analysis and mining, and in particular relates to a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information.
Background
As the most direct expression of urban development, the urban land use pattern is driven by human intention and shaped jointly by natural, economic, cultural and policy factors. It has undergone one of the most complex evolutionary processes on the Earth's surface and has had a profound impact on nature and ecosystems. China, the world's largest developing country, is currently in a stage of rapid urbanization. Under the dual pressure of population growth and economic development, land resources have been developed and used on a large scale, and urban land use is changing frequently and drastically. Understanding how urban spatial structure evolves and revealing the micro-level driving mechanism of land use change can give government departments a scientific basis for optimizing the allocation of urban land resources, and is of great significance for sustainable urban development.
Mining the driving factors of land use change is the basis for revealing how such change arises, how it evolves, and how future trends can be simulated, and it has long been an important direction in land use research. Scholars in China and abroad have carried out a large body of work on the driving forces of land use change. Early studies mainly used empirical analysis to reveal the macro-level driving mechanism of a single land use type in a specific region. For example, Moran pointed out that forest degradation in the Brazilian Amazon from 1975 to 1987 was mainly caused by changes in local government policy on livestock ranches rather than by population growth. By comparing grassland change in China, Russia and Mongolia between 1992 and 1995, Sneath found that modernized animal husbandry was the main cause of pasture degradation. Pulido and Bocco, studying farmers in developing countries, showed that farmers' subjective awareness and traditional culture play a decisive role in local land degradation. Although these qualitative analyses laid a good foundation for identifying land use drivers, they make it difficult to assess how strongly different human or natural factors affect land use change. Subsequent studies therefore adopted statistical methods to quantify driving forces, building linear models with multiple driving factors as independent variables and land use change as the dependent variable, using methods such as correlation analysis, logistic regression, linear regression and principal component analysis.
China's urban development is shifting from outward spatial expansion to small-scale redevelopment of urban space, such as land function renewal and old-city renovation, so research on the micro-level driving mechanism of land use change is urgently needed. Existing research, however, has two main shortcomings that prevent it from meeting this need. First, most studies explore the macro-level drivers of large-scale land use change; the research scale is too coarse and the classification of driving factors is imprecise, which makes it hard to support the discovery of patterns and mechanisms in the functional conversion of urban construction land. Second, urban land use change is shaped by the interaction of many natural and socioeconomic factors, and land use function and structure become more complex under intensive human activity. Traditional statistical models are mostly based on linear equations, which oversimplify the relationship between multiple driving factors and land use change and cannot fully and faithfully reflect the complex, nonlinear mapping between the two.
With the development of Web 2.0 technology, crowd-sourced geographic information (crowdsourcing geographic information), the massive data that people actively or passively generate in daily life, has become an important supplement to professional geographic data. By exploiting the micro-level characteristics of human activity and the social economy reflected in such information, geoscience research has gained both depth and breadth. Point of interest (POI) data is the most widely used type of crowd-sourced geographic information. Applying the large amount of dynamic, fine-grained socioeconomic information carried by POI tags to urban land use research makes it possible to mine the micro-level driving mechanism of urban land use. In practice, however, the high data redundancy and strong correlation that come with abundant crowd-sourced geographic information seriously interfere with the accurate identification and screening of core driving forces. It is therefore necessary to establish a driving force analysis method for land use change that is less affected by correlation between variables. Such a method should build a nonlinear model between driving factors and land use change while avoiding interference from data redundancy, so that the dominant, strongly contributing factors can be accurately identified among the many candidates and the core driving factors of urban land use change can be mined in greater depth and detail.
Random forest has a natural advantage in feature screening: it measures the importance of a feature variable by its contribution to the model's predictions. If a random forest model is used to relate land use categories to spatial variables, the core driving factors of land use change can be identified by computing variable importance. The random forest model provides built-in importance measures, such as the mean decrease in impurity (MDI) index based on Gini impurity, which compares variables by their effect on the heterogeneity of the observations at each node of the classification trees. This measure, however, can be seriously biased, mainly because MDI is a statistic computed from the training data and cannot fully reflect a variable's contribution to model prediction. Related studies show that when the sample data are unevenly distributed and the model overfits, MDI may misjudge variables with no real contribution as important factors. In addition, Gini-based importance tends to assign high values to continuous variables and to underestimate discrete variables. For these reasons, this work, based on the random forest model and crowd-sourced geographic information, introduces a variable permutation test to evaluate the importance of feature variables. Randomly permuting a single variable breaks the relationship the model has established between that variable and the prediction target, and the resulting change in model error is then used to compute the importance of that variable. The advantage of this approach is that it shows no obvious bias between discrete and continuous variables, so the driving factors of urban land use change can be quantified and the core factors screened more accurately.
Summary of the invention
The technical problem to be solved by the invention is as follows: to reveal the micro-level driving mechanism of land use change, a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information is proposed, which accurately reflects the importance of the factors affecting urban land use change and screens out the core driving factors.
The invention solves this technical problem with the following technical scheme:
The invention provides a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information. First, crowd-sourced geographic data, mainly POI points, are used to build a data set of multiple potential driving factors of land use change, and the data are spatialized. Next, with the potential driving factors as feature variables and the land use type from a land use thematic map as the predicted variable, a random forest classifier model is built and trained. Then, using the trained model, each single variable is randomly permuted K times to compute its importance score, and the driving factors are ranked by score. Finally, the principle of recursive feature elimination is used to screen the core driving forces of land use change.
The invention may spatialize the data as follows: according to the data type of each factor, methods such as kernel density estimation, buffer creation, Euclidean distance calculation, zonal statistics, and slope and aspect calculation are applied, producing a set of continuous raster spatial variables with a consistent resolution.
The invention may remove differences in units and scale among the spatial variables as follows: the Fuzzy tool in ArcMap 10.2 is used to apply min-max (dispersion) normalization to each variable, so that the pixel values of every spatial variable are mapped to the range 0 to 1.
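The normalization step can be illustrated outside ArcMap. The following is a minimal NumPy sketch of min-max rescaling of one raster layer; it is a stand-in for illustration, not the ArcMap Fuzzy tool itself, and the function name `min_max_normalize` and the `nodata` handling are assumptions that do not appear in the patent.

```python
import numpy as np

def min_max_normalize(raster, nodata=None):
    """Linearly rescale valid cells to [0, 1]; invalid cells become NaN.

    Hypothetical helper: `nodata` is an assumed sentinel for missing cells.
    """
    data = np.asarray(raster, dtype=float)
    valid = np.isfinite(data)
    if nodata is not None:
        valid &= data != nodata
    out = np.full(data.shape, np.nan)
    if not valid.any():
        return out
    lo, hi = data[valid].min(), data[valid].max()
    # A constant layer carries no contrast; map it to 0 instead of dividing by zero.
    out[valid] = 0.0 if hi == lo else (data[valid] - lo) / (hi - lo)
    return out
```

Applying the same function to every layer puts distance rasters (metres) and density rasters (points per unit area) on a common 0 to 1 scale before sampling.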
The invention may build the random forest classifier model as follows: the multiple potential driving factors are taken as feature variables and the land use type from the land use thematic map as the predicted variable, and a mapping between the two is constructed.
For the random forest classifier model, training samples may be collected as follows:
(1) Import the land use thematic maps of the study area for two different years into ArcMap. The spatial resolution of the data is 30 m. Land use is divided into nine types: water, forest and grassland, cultivated land, unused land, residential land, industrial land, commercial land, public administration land, and mixed-use land, with category codes 1 to 9 in that order;
(2) Use the Raster Calculator in ArcMap to detect changes in land use type between the two years. The map algebra expression is Con("landuse1.tif" != "landuse2.tif", 1, 0), which generates a new raster data set, landuse_difference.tif. A pixel value of 1 marks an area where the land use type has changed, and a pixel value of 0 marks no change;
(3) For the pixels whose land use has changed, use the spatial position index of each pixel and a random traversal sampling method, setting a traversal step for each land use type, to search and sample the whole area and form the training sample set D = [(x_1, y_1), ..., (x_n, y_n)].
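Steps (2) and (3) can be sketched with NumPy arrays standing in for the two rasters. The change mask mirrors the Con(...) expression directly. The sampler is only one plausible reading of "random traversal sampling": it walks the changed pixels of each class in row-major order with a per-class step and a random starting offset. The function names, the `step_by_class` dictionary, and the choice of which year's class map governs the step are all assumptions.

```python
import numpy as np

def change_mask(landuse_t1, landuse_t2):
    """1 where the land use code differs between the two years, else 0."""
    a, b = np.asarray(landuse_t1), np.asarray(landuse_t2)
    return np.where(a != b, 1, 0).astype(np.uint8)

def sample_changed_pixels(mask, classes, step_by_class, seed=0):
    """Return (rows, cols) of sampled changed pixels.

    Assumed scheme: for each class code in `classes`, take every
    `step`-th changed pixel in row-major order, starting at a random offset.
    Classes missing from `step_by_class` use a step of 1 (keep all).
    """
    rng = np.random.default_rng(seed)
    mask, classes = np.asarray(mask), np.asarray(classes)
    rows, cols = np.nonzero(mask)            # changed pixels, row-major order
    labels = classes[rows, cols]
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        step = max(1, int(step_by_class.get(int(c), 1)))
        start = int(rng.integers(0, min(step, idx.size)))
        picked.append(idx[start::step])
    if not picked:                            # no changed pixels at all
        return rows[:0], cols[:0]
    keep = np.sort(np.concatenate(picked))
    return rows[keep], cols[keep]
```

The sampled row and column indices can then be used to read the feature vector x_i from the stacked spatial variables and the label y_i from the land use map, giving the pairs that make up D.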
For the random forest classifier model, training may be carried out as follows: the training sample set D is fed into the random forest model, with the maximum number of features set to the square root of the number N of potential driving factors. The model is trained by iteratively increasing the number of decision trees, and the training effect is measured by the average margin of the model:
$$MG_{avg} = \frac{1}{n}\sum_{i=1}^{n} mg(x_i, y_i)$$
where MG_avg is the average margin over all samples, n is the number of samples, and mg(x_i, y_i) is the margin of a single sample. If mg(x_i, y_i) is greater than zero, the correct class receives the most votes and the final classification by voting is correct; otherwise the final classification is wrong.
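The patent text does not spell out how the margin of a single sample is computed. A common definition, assumed here, is Breiman's: the share of trees voting for the true class minus the largest share voting for any other class, which is positive exactly when the true class wins the vote. The sketch below computes per-sample margins and their average from a vote-share matrix; the function names and the matrix layout are illustrative assumptions.

```python
import numpy as np

def sample_margins(votes, y_idx):
    """Margin per sample under the assumed Breiman definition.

    votes: (n_samples, n_classes) array, each row the share of trees
           voting for each class.
    y_idx: column index of the true class for each sample.
    """
    votes = np.asarray(votes, dtype=float)
    rows = np.arange(votes.shape[0])
    true_share = votes[rows, y_idx]
    rivals = votes.copy()
    rivals[rows, y_idx] = -np.inf             # exclude the true class
    return true_share - rivals.max(axis=1)

def average_margin(votes, y_idx):
    """Mean of the per-sample margins."""
    return float(sample_margins(votes, y_idx).mean())
```

For example, a sample with vote shares (0.7, 0.2, 0.1) and true class 0 has margin 0.5, while a sample with shares (0.3, 0.6, 0.1) and true class 0 has margin -0.3 and is misclassified.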
The invention may compute and rank the importance scores of the driving factors as follows: for each feature variable j in the sample set D, randomly permute the values of that variable to generate a new, corrupted training sample, and compute the new average margin MG_j. The random permutation is repeated 50 times for each variable, and the mean is taken as the final importance of the variable:
$$i_j = MG - \frac{1}{K}\sum_{k=1}^{K} MG_{k,j}$$
where i_j is the importance score of variable j, MG is the average margin of the model before random permutation, K is the number of random permutations, and MG_{k,j} is the average margin of the model after the k-th random permutation of variable j;
Using the importance scores obtained in this step, the feature variables are sorted in descending order, which gives the importance ranking of the driving factors of land use change.
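The permutation scoring and ranking can be sketched independently of any particular random forest library. In the sketch below, `vote_fn` is a hypothetical callable that maps a feature matrix to a matrix of per-class vote shares from an already trained model, and the margin uses the assumed Breiman definition (true-class share minus the best rival share). Neither name comes from the patent.

```python
import numpy as np

def average_margin(votes, y_idx):
    """Mean margin: true-class vote share minus the best rival share."""
    votes = np.asarray(votes, dtype=float)
    rows = np.arange(votes.shape[0])
    rivals = votes.copy()
    rivals[rows, y_idx] = -np.inf
    return float((votes[rows, y_idx] - rivals.max(axis=1)).mean())

def permutation_importance_margin(vote_fn, X, y_idx, K=50, seed=0):
    """Score each column j as baseline margin minus mean permuted margin.

    vote_fn: assumed callable, X -> (n_samples, n_classes) vote shares.
    Returns (scores, ranking) with ranking sorted from most to least important.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = average_margin(vote_fn(X), y_idx)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        permuted = np.empty(K)
        for k in range(K):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between j and y
            permuted[k] = average_margin(vote_fn(Xp), y_idx)
        scores[j] = baseline - permuted.mean()
    ranking = np.argsort(-scores)                  # descending importance
    return scores, ranking
```

With scikit-learn, one possible `vote_fn` is the `predict_proba` method of a fitted `RandomForestClassifier`, provided the labels are encoded in the order of its `classes_` attribute; note that this returns averaged tree probabilities, which are soft vote shares rather than hard vote counts. scikit-learn also ships `sklearn.inspection.permutation_importance`, which follows the same permute-and-rescore idea but scores with the estimator's default metric unless a custom scorer is supplied.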
The invention may screen the core driving forces as follows: using the principle of recursive feature elimination and the importance ranking of the driving factors, start from the most important factor and add one factor at a time to form a new feature subset. Each subset is fed into the random forest model, the model is trained with cross-validation, and a new classification accuracy is obtained. These steps are repeated until the feature subset contains all driving factors.
In this screening method, the number of core driving forces may be determined as follows: plot the model's classification accuracy against the number of feature variables as that number decreases, and find the point on the curve at which the classification accuracy converges; that point gives the number of core driving factors.
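The screening loop and the choice of the convergence point can be sketched as follows. Here `score_fn` is a hypothetical callable that trains a model on a given feature subset and returns its cross-validated accuracy. The convergence rule (the smallest subset whose accuracy is within a tolerance `tol` of the best accuracy on the curve) is an assumption, since the patent only says to look for the point where the curve converges.

```python
def accuracy_curve(ranked_features, score_fn):
    """Accuracy for the top-1, top-2, ..., top-N ranked features.

    score_fn: assumed callable, list of feature names -> accuracy in [0, 1].
    """
    curve = []
    for m in range(1, len(ranked_features) + 1):
        subset = list(ranked_features[:m])    # m most important factors
        curve.append(score_fn(subset))
    return curve

def core_factor_count(curve, tol=0.005):
    """Smallest number of factors whose accuracy is within `tol` of the best.

    Assumed convergence rule; other rules (e.g. slope thresholds) are possible.
    """
    best = max(curve)
    for m, acc in enumerate(curve, start=1):
        if best - acc <= tol:
            return m
    return len(curve)
```

With scikit-learn, `score_fn` could wrap `cross_val_score` on a `RandomForestClassifier` restricted to the columns in the subset and return the mean fold accuracy.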
The method provided by the invention for mining the driving factors of land use change based on random forest and crowd-sourced geographic information is used to accurately evaluate and screen the micro-level driving factors that affect urban land use change.
Compared with the prior art, the invention has the following main technical effects:
(1) Existing research on the driving forces of land use change suffers from an overly coarse research scale and imprecise classification of driving factors. To address this, the invention introduces crowd-sourced geographic data, which carry rich socioeconomic information, and uses spatial statistics to turn the abstract driving mechanism of urban development into a quantitative feature expression in two-dimensional space. The driving factors and the land use state are thereby fully unified in data form. This not only resolves the mismatch in spatial scale between multi-source driving factors and land use change, but also greatly refines the analysis of land use driving forces, laying a sound data foundation for revealing the micro-level driving mechanism of urban evolution.
(2) Traditional statistical models oversimplify the complex relationship between multiple driving factors and land use change and cannot faithfully reflect the micro-level driving mechanism. For mining the micro-level driving factors of land use, the invention therefore designs a random-forest-based method for evaluating the importance of the driving factors of land use change. Through random permutation of a single variable, the method reconstructs the random forest model and tests the resulting change, examining independently how much each driving factor affects the model's predictive ability. This effectively avoids the adverse effect of information redundancy among multiple factors on the identification of core driving factors. Compared with the traditional MDI variable importance index of the random forest model, the invention shows no obvious bias between discrete and continuous variables when evaluating importance, and can screen out the core driving factors more accurately.
Description of drawings
Figure 1 is a flow chart of the method of the invention.
Figure 2 shows the spatialization results of the potential driving factors based on crowd-sourced geographic information.
Figure 3 shows the iterative training results of the random forest model.
Figure 4 shows the importance ranking of the driving factors.
Figure 5 shows the screening results for the core driving factors.
Detailed description
Traditional statistical models oversimplify the complex relationship between multiple driving factors and land use change and cannot faithfully reflect the micro-level driving mechanism. To address this problem, the invention proposes a method for evaluating the importance of the driving factors of land use change based on a random forest model and crowd-sourced geographic information. Crowd-sourced geographic information, which carries rich socioeconomic factors, is used to build a data set of multiple potential driving factors of land use change. A classification margin index is designed to measure the generalization error of the reconstructed model, and the importance of each feature factor is identified from the difference between the margin sequences before and after variable permutation. The method examines independently how much each driving factor affects the model's predictive ability, which effectively avoids the adverse effect of information redundancy among multiple factors on the identification of core driving factors. It also addresses the insufficient sensitivity of the traditional random forest algorithm to small perturbations of the variables, and improves the accuracy with which the micro-level driving factors of land use change are identified.
The invention is further described below with reference to an application example and the accompanying drawings, but is not limited to what is described below.
The invention provides a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information. Specifically: first, crowd-sourced geographic information, mainly POI point data (for example education, public service and transportation categories), is used to build a data set of multiple potential driving factors of land use change, and the data are spatialized. Next, with the potential driving factors as feature variables and the land use type as the predicted variable, a random forest classifier model is built and trained iteratively. Then, using the trained model, each single variable is randomly permuted K times to compute its importance score, and the driving factors are ranked by importance. Finally, the principle of recursive feature elimination is used to screen out the core driving forces of land use change. The invention can accurately quantify the importance of the driving factors of urban land use change and screen the core factors, and thereby mine the micro-level evolution mechanism of land use change.
In the above method, the data can be spatialized as follows. For the potential driving factors represented by POI point data, line data, polygon data, and raster data, the spatialization methods used are kernel density estimation, buffer creation, Euclidean distance calculation, zonal statistics, and slope and aspect calculation (the method corresponding to each driving factor is shown in Table 1). They produce a set of continuous-surface spatial variables at 30 m resolution. The methods are implemented with the Kernel Density, Multiple Ring Buffer, Euclidean Allocation, Zonal Statistics, Slope, and Aspect tools in ArcMap 10.2: the original feature data are imported into the corresponding tool and the output is uniformly set to 30 m resolution in tif format. The resulting spatial variable set is the dataset of multiple potential driving factors.
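As an illustration of what kernel density estimation does to POI points, the sketch below evaluates a point density at every cell centre of a 30 m grid. It is not the ArcMap tool: the quartic kernel, the `bandwidth` parameter, the top-left grid origin, and the function name `kernel_density` are all assumptions made for the example.

```python
import math


def kernel_density(points, n_rows, n_cols, cell_size=30.0, bandwidth=500.0):
    """Quartic-kernel point density evaluated at every cell centre.

    points: (x, y) coordinates in metres, origin at the grid's top-left corner.
    Returns an n_rows x n_cols list of lists (points per square metre).
    """
    # Normalising constant so that each 2-D quartic kernel integrates to 1.
    norm = 3.0 / (math.pi * bandwidth ** 2)
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        cy = (r + 0.5) * cell_size
        for c in range(n_cols):
            cx = (c + 0.5) * cell_size
            total = 0.0
            for px, py in points:
                # Squared distance scaled by the bandwidth.
                u2 = ((px - cx) ** 2 + (py - cy) ** 2) / bandwidth ** 2
                if u2 < 1.0:  # points beyond the bandwidth contribute nothing
                    total += norm * (1.0 - u2) ** 2
            grid[r][c] = total
    return grid


# One hypothetical POI at the centre of the top-left 30 m cell.
density = kernel_density([(15.0, 15.0)], n_rows=3, n_cols=3, bandwidth=50.0)
```

The density is highest in the cell containing the point, falls off with distance, and is zero for cells farther away than the bandwidth.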
In the above method, differences in data dimension across the spatial variable set can be eliminated as follows. The Fuzzy Membership tool in ArcMap 10.2 is used to apply min-max standardization to the variables: the tif data of each spatial variable are imported into the tool in turn with the default settings, and the output remains in tif format with pixel values converted to floating-point numbers between 0 and 1. This normalizes the pixel values of the spatial variables and removes the influence of differing dimensions and orders of magnitude between variables.
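The effect of min-max standardization on a raster can be sketched in a few lines. This is a plain-Python illustration of the arithmetic, not the ArcMap tool; the function name `min_max_normalize` and the `nodata` handling are assumptions.

```python
def min_max_normalize(grid, nodata=None):
    """Rescale valid cell values to [0, 1]; cells equal to `nodata` are left as-is."""
    values = [v for row in grid for v in row if v != nodata]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        [
            nodata if v == nodata else ((v - lo) / span if span else 0.0)
            for v in row
        ]
        for row in grid
    ]


# Hypothetical 2 x 2 raster of one driving factor.
normalized = min_max_normalize([[10, 20], [30, 50]])
```

The minimum cell maps to 0, the maximum to 1, and every other cell to its relative position between them, so variables measured in different units become comparable.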
In the above method, the random forest classifier can be built as follows: the 20 normalized spatial variables serve as feature variables (independent variables), the 9 land use types of the land use thematic map serve as the predicted variable (dependent variable), and a random forest model is used to learn the mapping between them.
In the above method, the model training samples can be collected as follows:
(1) Import the land use thematic maps of the study area for two different years into ArcMap. The data have a spatial resolution of 30 m, and land use is divided into nine types: water, forest and grassland, cultivated land, unused land, residential land, industrial land, commercial land, public management land, and mixed land, with class codes 1 to 9 in that order.
(2) Use the Raster Calculator in ArcMap to detect changes in land use type between the two years. The map algebra expression is Con("landuse1.tif"!="landuse2.tif",1,0), which generates a new raster dataset landuse_difference.tif. A pixel value of 1 marks an area where the land use type changed, and a pixel value of 0 marks no change.
(3) Write a Python program that, for the pixels where land use changed, uses the spatial position index of each pixel and a random traversal sampling method, with a traversal step size set for each land use type, to search and sample the whole area. This forms the training sample set D = [(x_1, y_1), ..., (x_n, y_n)], where x_1 ... x_n are the independent variables of the n samples in the random forest model and y_1 ... y_n are their dependent variables.
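Steps (2) and (3) above can be sketched as follows. The text does not spell out the random traversal sampling, so the reading used here (for each land use class, start at a random offset and keep every `step`-th changed pixel) is an assumption, as are the function names and the `steps` dictionary.

```python
import random


def change_mask(before, after):
    """Equivalent of Con(before != after, 1, 0) on two equally sized grids."""
    return [
        [1 if a != b else 0 for a, b in zip(row_before, row_after)]
        for row_before, row_after in zip(before, after)
    ]


def sample_changed_pixels(mask, after, factors, steps, seed=0):
    """Keep every steps[c]-th changed pixel of class c, from a random start.

    factors: list of normalized factor grids, all the same shape as `mask`.
    Returns a list of (feature_vector, land_use_class) training samples.
    """
    rng = random.Random(seed)
    by_class = {}
    for r, row in enumerate(mask):
        for c, changed in enumerate(row):
            if changed:
                by_class.setdefault(after[r][c], []).append((r, c))
    samples = []
    for land_class, cells in sorted(by_class.items()):
        step = steps.get(land_class, 1)
        start = rng.randrange(step)  # random offset within the first step
        for r, c in cells[start::step]:
            x = [layer[r][c] for layer in factors]
            samples.append((x, land_class))
    return samples


# Hypothetical 2 x 3 land use grids for two years and one factor layer.
before = [[1, 1, 3], [2, 2, 3]]
after = [[1, 5, 3], [5, 2, 5]]
mask = change_mask(before, after)
factor = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
D = sample_changed_pixels(mask, after, [factor], steps={5: 1})
```

A larger step for an abundant class and a smaller step for a rare class keeps the class sizes in D from becoming too unbalanced.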
In the above method, the random forest model can be trained iteratively as follows: write a Python program that inputs training sample D into the random forest model, with the maximum number of features set to the square root of the number of potential driving factors N. The model is trained by iteratively increasing the number of decision trees, and the training effect is measured by the average margin of the model error:
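The formula itself is not reproduced in this text. Judging from the symbol definitions that follow, it is presumably the mean of the per-sample margins, as in the first line below. The second line gives the usual random forest definition of a per-sample margin (share of votes for the true class minus the largest share for any other class); it is an assumption here, and T (number of trees), h_t (the t-th tree), and c (a class label) are symbols introduced only for illustration.

```latex
MG_{avg} = \frac{1}{n} \sum_{i=1}^{n} mg(x_i, y_i)

mg(x_i, y_i) = \frac{1}{T} \sum_{t=1}^{T} I\big(h_t(x_i) = y_i\big)
             - \max_{c \neq y_i} \frac{1}{T} \sum_{t=1}^{T} I\big(h_t(x_i) = c\big)
```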
where: MG_avg is the average margin over all samples, n is the number of samples, and mg(x_i, y_i) is the margin of a single sample. If mg(x_i, y_i) is greater than zero, the correct class holds the most votes and the final classification by voting is correct; otherwise the final classification is wrong.
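A small worked example may help make the margin concrete. The sketch assumes the vote-share definition of margin (votes for the true class minus the most votes for any other class, divided by the number of trees); the function names are hypothetical.

```python
from collections import Counter


def sample_margin(tree_predictions, true_label):
    """Vote share of the true class minus the largest vote share of any other class."""
    votes = Counter(tree_predictions)
    correct = votes[true_label]
    best_other = max((n for c, n in votes.items() if c != true_label), default=0)
    return (correct - best_other) / len(tree_predictions)


def average_margin(all_predictions, labels):
    """Mean margin over all samples."""
    margins = [sample_margin(p, y) for p, y in zip(all_predictions, labels)]
    return sum(margins) / len(margins)


# Ten hypothetical trees: six vote class 5, three vote class 6, one votes class 7.
votes_for_pixel = [5] * 6 + [6] * 3 + [7]
margin_if_true_is_5 = sample_margin(votes_for_pixel, 5)
margin_if_true_is_6 = sample_margin(votes_for_pixel, 6)
```

When the true class is 5 the margin is positive and the majority vote is right; when the true class is 6 the margin is negative and the majority vote is wrong, which matches the sign rule stated above.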
In the above method, the importance scores of the driving factors can be computed and ranked as follows: write a Python program that, for each feature variable j in sample D, randomly permutes the values of that variable to generate a new, corrupted training sample and computes the new sample margin MG_j. Each variable is randomly permuted K times (50 in this example), and the mean is taken as the final variable importance:
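The formula is not reproduced in this text. From the symbol definitions that follow and the statement that the mean over K permutations is taken, a consistent reconstruction is the average drop in margin caused by permuting variable j. This is an inference from the surrounding definitions, not a confirmed transcription.

```latex
i_j = \frac{1}{K} \sum_{k=1}^{K} \left( MG - MG_{k,j} \right)
    = MG - \frac{1}{K} \sum_{k=1}^{K} MG_{k,j}
```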
where: i_j is the importance score of variable j, MG is the model's average margin before random permutation, K is the number of random permutations, and MG_{k,j} is the model's average margin after the k-th random permutation of variable j.
The feature variables are then sorted in descending order of the importance scores obtained in this step, which gives the importance ranking of the driving factors of land use change.
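The permutation procedure can be sketched without any particular random forest library by treating the trained forest as a list of callables, each mapping a feature row to a class. That representation, the function names, and the score formula (baseline margin minus the mean permuted margin) are assumptions made for this illustration.

```python
import random
from statistics import mean


def average_margin(trees, X, y):
    """Mean over samples of (true-class vote share - best other-class vote share)."""
    margins = []
    for row, label in zip(X, y):
        votes = {}
        for tree in trees:
            pred = tree(row)
            votes[pred] = votes.get(pred, 0) + 1
        correct = votes.get(label, 0)
        best_other = max((n for c, n in votes.items() if c != label), default=0)
        margins.append((correct - best_other) / len(trees))
    return mean(margins)


def permutation_importance(trees, X, y, K=50, seed=0):
    """Importance of variable j = baseline margin - mean margin over K shuffles of column j."""
    rng = random.Random(seed)
    base = average_margin(trees, X, y)
    scores = []
    for j in range(len(X[0])):
        permuted = []
        for _ in range(K):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and the labels
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted.append(average_margin(trees, X_perm, y))
        scores.append(base - mean(permuted))
    return scores


# Toy forest of three identical stumps that only look at variable 0.
trees = [lambda row: int(row[0] > 0.5)] * 3
X = [[0.1, 0.9], [0.9, 0.2], [0.2, 0.4], [0.8, 0.7], [0.3, 0.1], [0.7, 0.6]]
y = [0, 1, 0, 1, 0, 1]
scores = permutation_importance(trees, X, y, K=50, seed=0)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In this toy case variable 1 is never used by the forest, so shuffling it leaves every margin unchanged and its score is zero, while shuffling variable 0 lowers the margin and gives it a positive score.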
In the above method, the core driving forces can be screened as follows: using the principle of recursive feature elimination and the importance ranking of the driving factors, start from the most important driving factor and add one factor at a time to form a new feature subset. Each subset is input to the random forest model, which is trained with cross-validation to obtain a new classification accuracy. These steps are repeated until the feature subset contains all driving factors.
In the above method, the number of core driving forces can be determined as follows: plot the model's classification accuracy against the number of feature variables and find the point on the curve at which the accuracy converges; the number of variables at that point is the number of core driving factors.
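The screening loop and the choice of the convergence point can be sketched as below. The text reads the convergence point off a plot; the numeric rule used here (the smallest subset after which no larger subset gains more than `tol` in accuracy) is an assumption, as are the function names and the `evaluate` callback, which stands in for cross-validated training of the random forest on a feature subset.

```python
def accuracy_curve(ranked_factors, evaluate):
    """Accuracy for the top-1, top-2, ..., top-N factors of the importance ranking.

    evaluate(subset) is a caller-supplied function returning the
    cross-validated classification accuracy for that subset.
    """
    return [evaluate(ranked_factors[:k]) for k in range(1, len(ranked_factors) + 1)]


def convergence_point(curve, tol=0.005):
    """Smallest number of factors after which accuracy never rises by more than tol."""
    for k in range(1, len(curve) + 1):
        if max(curve[k - 1:]) - curve[k - 1] <= tol:
            return k
    return len(curve)


# Hypothetical accuracies for subsets of size 1 to 6.
curve = [0.50, 0.62, 0.70, 0.74, 0.741, 0.742]
n_core = convergence_point(curve, tol=0.005)
```

With these made-up accuracies the curve flattens after four factors, so the four highest-ranked factors would be kept as the core driving factors.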
The method for mining land use change driving factors based on random forest and crowd-sourced geographic information provided by the present invention is used to reveal the micro-scale driving mechanism of urban evolution.
Application case:
This case takes the central area of Wuhan as the study area. The area covers 2724.228 square kilometers, 31.79% of the total area of Wuhan, and is the most urbanized part of the city. The invention is further described with reference to the accompanying figures and tables, using the analysis of the driving forces of land use change in this area from 2015 to 2020 as an example.
The processing steps (Figure 1) are as follows:
Step 1: from the two perspectives of socioeconomics and natural ecology, construct a dataset of multiple potential driving factors affecting land use change based on crowd-sourced geographic information, and spatialize the data. Specifically:
(1) The selected factors affecting urban land use change fall into two categories, natural-ecological and socioeconomic, with 20 factors in total. The natural-ecological factors comprise three terrain factors (elevation, slope, and aspect) and three ecological factors (soil and water conservation function, soil organic matter, and water system). The socioeconomic factors are derived from POIs and other crowd-sourced geographic information and comprise 14 factors covering population, economy, education, public services, and transportation (Table 1).
(2) Obtain raster (tif format) or vector (shapefile format) data representing all natural-ecological and socioeconomic factors in the study area and import them into the GIS processing and analysis software ArcMap 10.2. Use the Project function in the toolbox to reproject the data from the various sources to a common coordinate system, WGS_1984_UTM_Zone_49N; then use the Clip function, with the study area boundary as the clipping extent, to clip all data to the same shape.
(3) Different types of factors are spatialized in different ways, using four methods: kernel density estimation, buffer creation, Euclidean distance calculation, and zonal statistics, implemented with the Kernel Density, Multiple Ring Buffer, Euclidean Distance, and Zonal Statistics tools in ArcMap 10.2 respectively. The method used for each of the 20 factors is listed in Table 1. Processing yields a spatial variable dataset in tif format with a resolution of 30 m and 3,366,990 pixels (2090 × 1611) (Figure 2).
(4) To eliminate differences in data dimension between factors, the Fuzzy Membership tool in ArcMap 10.2 is used to apply min-max standardization to the variables, normalizing the pixel values of the spatial variables so that the range of each variable is mapped to between 0 and 1.
Step 2: with the multiple potential driving factors as feature variables and the land use type from the land use thematic map as the predicted variable, build a random forest classifier and train it. Specifically:
(1) Import the 2015 and 2020 land use thematic maps of the central area of Wuhan into ArcMap. The data have a spatial resolution of 30 m, and land use is divided into nine types: water, forest and grassland, cultivated land, unused land, residential land, industrial land, commercial land, public management land, and mixed land, with class codes 1 to 9 in that order.
(2) Use the Raster Calculator in ArcMap to detect changes in land use type from 2015 to 2020. The map algebra expression is Con("landuse2015.tif"!="landuse2020.tif",1,0), which generates a new raster dataset landuse_difference.tif. A pixel value of 1 means the land use type changed between 2015 and 2020, and a pixel value of 0 means it did not. Selecting the pixels with value 1 gives the area where land use type changed between 2015 and 2020, a total of 823,636 pixels.
(3) For the 823,636 pixels where land use changed, use the spatial position index of each pixel and a random traversal sampling method, with a traversal step size set for each land use type, to search and sample the whole area. With the 20 potential driving factors as feature variables and the 2020 land use type as the predicted variable, this forms the training sample set D = [(x_1, y_1), ..., (x_n, y_n)]. It contains 5717 water samples, 3283 forest and grassland samples, 10583 cultivated land samples, 6509 unused land samples, 6852 residential land samples, 6791 industrial land samples, 4985 commercial land samples, 2416 public management land samples, and 2570 mixed land samples.
The implementation code is as follows:
(4) Input training sample D into the random forest model for training, and set the maximum number of features to the square root of the number of potential driving factors N. This case uses 20 driving factors, so the maximum number of features is √20 ≈ 4.47. The model is trained by iteratively increasing the number of decision trees, and the training effect is analyzed using the average margin of the model error:
where: MG_avg is the average margin over all samples, n is the number of samples, and mg(x_i, y_i) is the margin of a single sample. If mg(x_i, y_i) is greater than zero, the correct class holds the most votes and the final classification by voting is correct; otherwise the final classification is wrong. The results show that the model's average margin stabilizes once the number of decision trees reaches 60 (Figure 3). The implementation code is as follows:
Step 3: evaluate variable importance. For each feature variable j in sample D, randomly permute the values of that variable to generate a new, corrupted training sample and compute the new sample margin MG_j. Because a single random permutation is unstable, each variable is randomly permuted 50 times and the mean is taken as the final variable importance:
where: i_j is the importance score of variable j, MG is the model's average margin before random permutation, K is the number of random permutations, and MG_{k,j} is the model's average margin after the k-th random permutation of variable j.
The feature variables are then sorted in descending order of the importance scores obtained in this step, which gives the ranking of the driving forces of land use change (Figure 4). In this case the 20 driving factors rank, from most to least important: population, K12 schools, industrial facilities, cultural and leisure venues, parks and green spaces, bus stops, colleges and universities, sports venues, hospitals, water system, housing prices, subway stations, soil and water conservation, commercial facilities, main roads, elevation, soil organic matter, long-distance bus stations, slope, and aspect. The implementation code is as follows:
Step 4: screen the core driving forces using the principle of recursive feature elimination. Following the importance ranking of the driving factors, start from the most important factor and add one factor at a time to form a new feature subset. Each subset is input to the random forest model, which is trained with cross-validation to obtain a new classification accuracy. These steps are repeated until the feature subset contains all driving factors. The model's classification accuracy is then plotted against the number of feature variables. The point at which the accuracy converges lies at the 15th factor (Figure 5), after which the classification accuracy remains essentially unchanged. The 15 core driving factors screened out are: population, K12 schools, industrial facilities, cultural and leisure venues, parks and green spaces, bus stops, colleges and universities, sports venues, hospitals, water system, housing prices, subway stations, soil and water conservation, commercial facilities, and main roads.
Table 1. Classification of potential land use driving factors and their spatialization methods
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111529458.2A CN114398951B (en) | 2021-12-14 | 2021-12-14 | A method for mining driving factors of land use change based on random forest and crowd-source geographic information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114398951A true CN114398951A (en) | 2022-04-26 |
CN114398951B CN114398951B (en) | 2024-12-24 |
Family
ID=81227519
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111529458.2A Active CN114398951B (en) | 2021-12-14 | 2021-12-14 | A method for mining driving factors of land use change based on random forest and crowd-source geographic information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114398951B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104156537A (en) * | 2014-08-19 | 2014-11-19 | 中山大学 | Cellular automaton urban growth simulating method based on random forest |
WO2020083400A1 (en) * | 2018-10-26 | 2020-04-30 | 江苏智通交通科技有限公司 | Traffic accident data intelligent analysis and comprehensive application system |
AU2020101854A4 (en) * | 2020-08-17 | 2020-09-24 | China Communications Construction Co., Ltd. | A method for predicting concrete durability based on data mining and artificial intelligence algorithm |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114861277A (en) * | 2022-05-23 | 2022-08-05 | 中国科学院地理科学与资源研究所 | Long-time-sequence national soil space function and structure simulation method |
CN114881834A (en) * | 2022-06-08 | 2022-08-09 | 生态环境部南京环境科学研究所 | Method and system for analyzing driving relationship of urban group ecological system service |
CN117172094A (en) * | 2023-07-25 | 2023-12-05 | 内蒙古师范大学 | Visualization and quantification methods of positive and negative impacts of driving factors of land use change |
CN117077005A (en) * | 2023-08-21 | 2023-11-17 | 广东国地规划科技股份有限公司 | Optimization method and system for urban micro-update potential |
CN117077005B (en) * | 2023-08-21 | 2024-05-10 | 广东国地规划科技股份有限公司 | Optimization method and system for urban micro-update potential |
Also Published As
Publication number | Publication date |
---|---|
CN114398951B (en) | 2024-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | A big data approach to assess progress towards Sustainable Development Goals for cities of varying sizes | |
Qiao et al. | Industrialization, urbanization, and innovation: Nonlinear drivers of carbon emissions in Chinese cities | |
CN114398951A (en) | Land use change driving factor mining method based on random forest and crowd-sourced geographic information | |
Zhao et al. | China’s population spatialization based on three machine learning models | |
Li et al. | Multidimensional poverty in rural China: Indicators, spatiotemporal patterns and applications | |
Ma et al. | Identification of the numerical patterns behind the leading counties in the US local green building markets using data mining | |
Lv et al. | Detecting the true urban polycentric pattern of Chinese cities in morphological dimensions: A multiscale analysis based on geospatial big data | |
Chen et al. | Spatiotemporal differentiation of urban-rural income disparity and its driving force in the Yangtze River Economic Belt during 2000-2017 | |
CN114462503A (en) | A method for obtaining the matching relationship between supply and demand of medical resources based on accessibility | |
Yang et al. | A two-level random forest model for predicting the population distributions of urban functional zones: A case study in Changsha, China | |
CN111984701A (en) | Method, device, equipment and storage medium for predicting village settlement evolution | |
CN114819589A (en) | Urban space high-quality utilization determination method, system, computer equipment and terminal | |
Fu et al. | How does digitalization affect the urban-rural disparity at different disparity levels: A Bayesian Quantile Regression approach | |
CN117391487A (en) | Methods for community operation matching assessment, diagnosis and planning based on machine learning | |
CN115728463A (en) | Interpretable water quality prediction method based on semi-embedded feature selection | |
CN111401683B (en) | Method and device for measuring tradition of ancient villages | |
Tian et al. | Local carbon emission zone construction in the highly urbanized regions: application of residential and transport CO2 emissions in Shanghai, China | |
Chen et al. | Unraveling nonlinear effects of environment features on green view index using multiple data sources and explainable machine learning | |
CN111914339B (en) | Landscape design method and system based on landscape performance evaluation | |
CN113743659A (en) | Urban layout prediction method based on component method and Markov cellular automaton and application | |
CN116628121A (en) | River environmental factor and microbial community structure association measurement method | |
CN116307242A (en) | Construction method of larch site quality and productivity model | |
Li et al. | Characterizing urban spatial structure through built form typologies: A new framework using clustering ensembles | |
CN117172094B (en) | Positive and negative influence visualization and quantification method for land utilization change driving factors | |
Badiei et al. | A systematic review of fractal theory and its application in geography and urban planning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||