CN114398951A - Land use change driving factor mining method based on random forest and crowd-sourced geographic information - Google Patents

Publication number
CN114398951A
Authority
CN
China
Prior art keywords
land use
variable
model
driving
random forest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111529458.2A
Other languages
Chinese (zh)
Inventor
林安琪
吴浩
罗文庭
李岩
江志猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University
Priority to CN202111529458.2A
Publication of CN114398951A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Abstract

The invention provides a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information, comprising the following steps. First, a multivariate data set of potential driving factors affecting land use change is constructed from multi-source geographic data, mainly POI points, and the data are spatialized. Next, a random forest classifier model is built and trained with the potential driving factors as feature variables and the land use type from the land use thematic map as the predicted variable. Then, using the trained model, each single variable is randomly permuted K times to compute its importance score, and the driving factors are ranked by score. Finally, the core driving forces affecting land use change are screened using the principle of recursive feature elimination. The method enables quantitative importance evaluation and core factor screening of the driving factors affecting urban land use change, thereby revealing the microscopic driving mechanism of urban evolution.

Description

Land use change driving factor mining method based on random forest and crowd-sourced geographic information
Technical Field
The invention belongs to the field of geographic big data analysis and mining, and in particular relates to a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information.
Background
The urban land use pattern is the visible expression of urban development. Driven by human decisions and shaped by the combined influence of natural, economic, cultural and policy factors, it undergoes one of the most complex evolution processes on the Earth's surface and has a profound impact on nature and ecosystems. China, the largest developing country in the world, is currently in a stage of rapid urbanization; under the dual pressure of population growth and economic development, land resources are being developed and used intensively, and urban land use is changing frequently and drastically. Understanding how urban spatial structure evolves and revealing the microscopic driving mechanism of land use change can give government departments a scientific basis for optimizing the allocation of urban land resources, and is of great significance for sustainable urban development.
Mining the driving factors of land use change is the basis for revealing how land use change occurs and evolves and for simulating future trends, and it has long been an important direction of land use research. Scholars in China and abroad have carried out a great deal of research on the driving forces of land use change. Early studies mainly used empirical analysis to reveal the macro-level driving mechanism of a particular land use type in a specific region. For example, Moran pointed out that forest degradation in the Brazilian Amazon from 1975 to 1987 was largely influenced by changes in local government policy on livestock farming rather than by population growth. Sneath, by comparing grassland changes in China, Russia and Mongolia between 1992 and 1995, found that modern farming was the main cause of pasture degradation. Pulido and Bocco, taking farmers in developing countries as their research subjects, showed that farmers' subjective awareness and traditional culture play a decisive role in local land degradation. Although these qualitative analyses laid a good foundation for identifying land use driving forces, they make it difficult to assess the degree to which different human or natural factors influence land use change. Subsequent studies therefore adopted statistical methods to quantify driving forces: the candidate driving factors are taken as independent variables and land use change as the dependent variable, and a linear model is constructed using methods such as correlation analysis, logistic regression, linear regression and principal component analysis.
As urban development in China gradually shifts from outward spatial expansion to small-scale redevelopment of urban space, such as land function renewal and old city renovation, research on the microscopic driving mechanism of land use change is urgently needed. Existing research, however, has two major shortcomings that make it ill-suited to this task. First, most existing studies focus on the macroscopic driving factors of large-scale land use change; their study scale is too coarse and their classification of driving factors too imprecise to support the discovery of patterns and mechanisms in the functional transformation of construction land inside cities. Second, urban land use change is shaped by the interaction of many factors, such as the natural environment and the social economy, and land functions and structures become more complex under intensive human activity. Traditional statistical models are based on linear equations, which simplify the relationship between the driving factors and land use change and cannot faithfully and comprehensively reflect the complex, nonlinear mapping between them.
With the development of Web 2.0 technology, crowd-sourced geographic information, that is, the massive data actively or passively generated by people in daily life, has become an important supplement to professional geographic information data. The human activity and fine-grained socio-economic characteristics reflected in crowd-sourced geographic information further increase the depth and breadth of geographic research. Point of Interest (POI) data are the most widely used type of crowd-sourced geographic information. The large amount of dynamic, fine-grained socio-economic information carried in POI labels has been applied to urban land use research and makes it possible to mine the microscopic driving mechanism of urban land use. In practice, however, the abundance of crowd-sourced geographic information brings high data redundancy and strong correlation between variables, which seriously interferes with the accurate identification and screening of core driving forces. A driving force analysis method that is less affected by correlation among variables is therefore needed: one that can build a nonlinear model between the driving factors and land use change, avoid interference from data redundancy, accurately identify the dominant, strongly contributing factors among the many candidates, and mine the core driving factors of urban land use change in greater depth and detail. Random forest has a natural advantage in feature screening: it evaluates the importance of feature variables by their contribution to the model, and this importance can serve as the driving force analysis result for land use change.
A random forest can measure the importance of a feature variable by the variable's contribution to the prediction result. By using a random forest model to relate land use categories to spatial variables and computing variable importance, the core driving factors of land use change can be analyzed. The random forest model offers built-in importance measures, for example the MDI (mean decrease in impurity) index computed from Gini impurity, which compares variables by the effect each has on the heterogeneity of observations at every node of the classification trees. This algorithm, however, can seriously bias the importance evaluation, mainly because the MDI index is a statistic computed from the model's training data and cannot fully represent a variable's contribution to model prediction. Related research shows that when sample data are unevenly distributed and the model overfits, the MDI index may misjudge variables with no significant contribution as important factors. In addition, importance evaluation based on Gini impurity tends to assign high values to continuous variables while underestimating the importance of discrete variables. The invention therefore evaluates feature importance, on the basis of a random forest model and crowd-sourced geographic information, by introducing a variable permutation test: randomly permuting a single variable breaks the relationship the model has learned between that variable and the prediction target, and the importance of the variable is then computed from the resulting change in model error.
This method shows no obvious bias between discrete and continuous variables in importance evaluation, and can more accurately perform quantitative importance evaluation and core factor screening of the driving factors affecting urban land use change.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to reveal the microscopic driving mechanism of land use change, a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information is provided, so that the importance of the feature factors affecting urban land use change is accurately reflected and the core driving factors of land use change can be screened.
The invention adopts the following technical solution to solve this technical problem:
The invention provides a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information, comprising the following steps. First, a multivariate data set of potential driving factors affecting land use change is constructed from multi-source geographic data, mainly POI points, and the data are spatialized. Next, a random forest classifier model is built and trained with the potential driving factors as feature variables and the land use type from the land use thematic map as the predicted variable. Then, using the trained model, each single variable is randomly permuted K times to compute its importance score, and the driving factors are ranked by score. Finally, the core driving forces affecting land use change are screened using the principle of recursive feature elimination.
The data spatialization may be carried out as follows: according to the data type of each factor, methods such as kernel density estimation, buffer creation, Euclidean distance calculation, zonal statistics, slope calculation and aspect calculation are used to spatialize the data, generating a set of continuous, areal spatial variables with a consistent resolution.
Differences in the dimensions (units) of the spatial variables may be eliminated as follows: the Fuzzy tool in ArcMap 10.2 is used to apply min-max (dispersion) normalization to each variable, mapping the pixel values of the spatial variable into the range 0 to 1.
The random forest classifier model may be constructed as follows: the multivariate potential driving factors are taken as feature variables and the land use type from the land use thematic map as the predicted variable, and a mapping between the feature variables and the land use type is constructed.
In constructing the random forest classifier model, the training samples may be collected as follows:
(1) The land use thematic maps of the study area for two different years are imported into ArcMap. The spatial resolution of the data is 30 m, and the land use types are divided into nine types, including water areas, forest land, grassland, cultivated land, unused land, residential land, industrial land, commercial land, public management land and mixed land, with category codes numbered from 1 to 9 in sequence;
(2) The Raster Calculator in ArcMap is used to detect changes in land use type between the two years and to generate a new raster data set, land_difference.GIF, with the map algebra expression Con("land1.tif" != "land2.tif", 1, 0), in which a pixel value of 1 denotes an area whose land use type has changed and a pixel value of 0 denotes no change;
(3) For the pixels whose land use has changed, a random traversal sampling method is applied according to the spatial position index of the pixels: a traversal step length is set for each land use type, and a global search and sampling is carried out to form the training sample set $D = [(x_1, y_1), \ldots, (x_n, y_n)]$.
In constructing the random forest classifier model, the random forest may be trained as follows: the training sample set D is input into the random forest model for training, with the maximum number of features set to the square root of the number N of potential driving factors; the model is trained while iteratively increasing the number of decision trees, and the average margin of the model is used to measure the training effect:

$$MG_{avg} = \frac{1}{n}\sum_{i=1}^{n} mg(x_i, y_i)$$

where $MG_{avg}$ is the average margin over all samples, $n$ is the number of samples, and $mg(x_i, y_i)$ is the margin of a single sample. If $mg(x_i, y_i)$ is greater than zero, the correct class receives the largest number of votes and the final classification by voting is correct; otherwise the final classification is wrong.
The importance scores of the driving factors may be calculated and ranked as follows: for each feature variable j in the sample set D, the values of that variable are randomly permuted to generate a new, corrupted training sample, on which a new average margin $MG_j$ is computed. The random permutation is repeated 50 times for each variable (K = 50), and the average is taken as the final importance of the variable:

$$i_j = \frac{1}{K}\sum_{k=1}^{K}\left(MG - MG_{k,j}\right)$$

where $i_j$ is the importance score of variable j, $MG$ is the model average margin before random permutation, $K$ is the number of random permutations, and $MG_{k,j}$ is the model average margin after the k-th random permutation of variable j.
The feature variables are then sorted in descending order of the importance scores obtained in this step, which gives the importance ranking of the driving factors of land use change.
The core driving forces may be screened as follows: following the principle of recursive feature elimination and the importance ranking of the driving factors, factors are added one at a time, starting from the most important, to form a new feature subset; each subset is input into a random forest model, which is trained with cross-validation to obtain a new classification accuracy; these steps are repeated until the feature subset contains all driving factors.
The number of core driving forces may be determined as follows: a curve of model classification accuracy against the number of feature variables, as that number is reduced, is drawn, and the point at which the classification accuracy converges is located; the number of variables at that point is the number of core driving factors.
The method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information is used to accurately evaluate and screen the microscopic driving factors affecting urban land use change.
Compared with the prior art, the invention has the following main technical effects:
(1) To address the overly coarse study scale and imprecise classification of driving factors in existing research on the driving forces of land use change, the invention introduces multi-source geographic data rich in socio-economic information and uses spatial statistics to convert the abstract driving mechanism of urban development into a quantitative feature representation in two-dimensional space. The driving factors and the land use state are thereby unified in data form, which resolves the mismatch between the spatial scale of the multi-source driving factors and that of land use change, greatly refines the analysis of land use driving forces, and lays a sound data foundation for revealing the microscopic driving mechanism of urban evolution.
(2) Because traditional statistical models oversimplify the complex relationship between the driving factors and land use change and can hardly reflect the microscopic driving mechanism, the invention designs a random-forest-based method for evaluating the importance of land use change driving factors. The method re-evaluates the random forest model after random permutation of single variables, so that the influence of each driving factor on the model's predictive ability is examined independently; this effectively avoids the adverse effect of information redundancy among the factors on the identification of core driving factors. Compared with the traditional MDI variable importance index of the random forest model, the method shows no obvious bias between discrete and continuous variables in importance evaluation and can screen out core driving factors more accurately.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of potential driver spatialization results based on crowd-sourced geographic information.
FIG. 3 is a result of iterative training of a random forest model.
FIG. 4 is a driver importance ranking graph.
FIG. 5 is a graph of core driver screening results.
Detailed Description
To address the problem that traditional statistical models oversimplify the complex relationship between the driving factors and land use change and can hardly reflect the microscopic driving mechanism, the invention provides a method for evaluating the importance of land use change driving factors based on a random forest model and crowd-sourced geographic information. The method examines the influence of each driving factor on the model's predictive ability independently, which effectively avoids the adverse effect of information redundancy among the factors on the identification of core driving factors, overcomes the insufficient sensitivity of the traditional random forest algorithm to small perturbations of a variable, and improves the accuracy with which the microscopic driving factors of land use change are identified.
The invention is further illustrated by the following examples and figures, but is not limited thereto.
The invention provides a method for mining the driving factors of land use change based on random forest and crowd-sourced geographic information, which specifically comprises the following steps. First, a multivariate data set of potential driving factors affecting land use change is constructed from crowd-sourced geographic information, mainly POI point data (such as education, public service and traffic), and the data are spatialized. Next, a random forest classifier model is built with the potential driving factors as feature variables and the land use type as the predicted variable, and the model is trained iteratively. Then, using the trained model, each single variable is randomly permuted K times to compute its importance score, and the driving factors are ranked by importance. Finally, the core driving forces affecting land use change are screened using the principle of recursive feature elimination. The method can accurately perform quantitative importance evaluation and core factor screening of the driving factors affecting urban land use change, thereby mining the microscopic evolution mechanism of land use change.
In the above method, the data spatialization may be carried out as follows: for the potential driving factors represented by POI point data, line data, polygon data and raster data, the data are spatialized by kernel density estimation, buffer creation, Euclidean distance calculation, zonal statistics, and slope and aspect calculation (the spatialization method for each driving factor is shown in Table 1), generating a set of continuous, areal spatial variables with a resolution of 30 m. The spatialization methods are implemented with the Kernel Density, Multiple Ring Buffer, Euclidean Allocation, Zonal Statistics, Slope and Aspect tools in ArcMap 10.2: the original feature data are imported into the corresponding tool, and the output is uniformly set to a result map in tif format with a resolution of 30 m, forming the set of spatial variables, that is, the multivariate potential driving factor data set.
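Two of the spatialization operations named above, kernel density estimation and Euclidean distance, can be sketched outside ArcMap as follows. This is only an illustrative NumPy sketch: the function names, the Gaussian kernel, the 300 m bandwidth and the toy POI coordinates are assumptions, and the kernel used by the ArcMap tool may differ.

```python
import numpy as np


def cell_centres(x_min, y_min, n_cols, n_rows, cell=30.0):
    """Coordinates of the cell centres of a regular grid (30 m cells by default)."""
    xs = x_min + (np.arange(n_cols) + 0.5) * cell
    ys = y_min + (np.arange(n_rows) + 0.5) * cell
    return np.meshgrid(xs, ys)            # two arrays of shape (n_rows, n_cols)


def squared_distances(points, gx, gy):
    """Squared distance from every cell centre to every point, shape (rows, cols, n_points)."""
    pts = np.asarray(points, dtype=float)
    return (gx[..., None] - pts[:, 0]) ** 2 + (gy[..., None] - pts[:, 1]) ** 2


def kernel_density(points, gx, gy, bandwidth=300.0):
    """Sum of Gaussian kernels centred on the points (assumed kernel, not ArcMap's)."""
    d2 = squared_distances(points, gx, gy)
    k = np.exp(-d2 / (2.0 * bandwidth ** 2)) / (2.0 * np.pi * bandwidth ** 2)
    return k.sum(axis=-1)


def euclidean_distance(points, gx, gy):
    """Distance from each cell centre to its nearest point."""
    return np.sqrt(squared_distances(points, gx, gy).min(axis=-1))


# Hypothetical POI locations in projected metres on a 20 x 20 grid of 30 m cells.
gx, gy = cell_centres(0.0, 0.0, n_cols=20, n_rows=20)
poi = [(105.0, 105.0), (135.0, 105.0), (465.0, 465.0)]
density = kernel_density(poi, gx, gy)
distance = euclidean_distance(poi, gx, gy)
```

Both outputs are continuous surfaces on the same grid, which is the property the method relies on when it stacks the driving factors as raster layers.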
In the above method, differences in the dimensions (units) of the spatial variables may be eliminated as follows: the Fuzzy standardization tool in ArcMap 10.2 is used to apply min-max (dispersion) normalization to the variables. The tif data of each spatial variable are imported into the tool in turn with the default settings kept; the exported result is still in tif format, with pixel values converted to floating-point numbers between 0 and 1. This normalizes the pixel values of the spatial variables and eliminates the influence of differing units and orders of magnitude among variables.
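The min-max mapping itself is x' = (x - min) / (max - min). A minimal NumPy sketch is given here, assuming the raster has already been read into an array; the function name and the nodata handling are illustrative assumptions and do not reproduce the ArcMap tool.

```python
import numpy as np


def min_max_normalise(raster, nodata=None):
    """Map valid pixel values linearly onto [0, 1]; invalid pixels become NaN."""
    arr = np.asarray(raster, dtype=float)
    valid = np.isfinite(arr)
    if nodata is not None:
        valid &= arr != nodata
    out = np.full(arr.shape, np.nan)
    lo, hi = arr[valid].min(), arr[valid].max()
    if hi > lo:
        out[valid] = (arr[valid] - lo) / (hi - lo)
    else:
        out[valid] = 0.0                  # constant layer: no spread to rescale
    return out


# Toy elevation layer with one nodata pixel (values are made up).
elevation = np.array([[12.0, 40.0], [26.0, -9999.0]])
scaled = min_max_normalise(elevation, nodata=-9999.0)
```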
In the above method, the random forest classifier model may be constructed as follows: the 20 normalized spatial variables are taken as feature variables (independent variables) and the 9 land use types of the land use thematic map as the predicted variable (dependent variable), and a random forest model is used to construct the mapping between the two.
In the above method, the model training samples may be collected as follows:
(1) The land use thematic maps of the study area for two different years are imported into ArcMap. The spatial resolution of the data is 30 m, and the land use types are divided into nine types, including water areas, forest land, grassland, cultivated land, unused land, residential land, industrial land, commercial land, public management land and mixed land, with category codes numbered from 1 to 9 in sequence.
(2) The Raster Calculator in ArcMap is used to detect changes in land use type between the two years and to generate a new raster data set, land_difference.GIF, with the map algebra expression Con("land1.tif" != "land2.tif", 1, 0), in which a pixel value of 1 denotes an area whose land use type has changed and a pixel value of 0 denotes no change.
(3) A Python program is written which, for the pixels whose land use has changed, applies a random traversal sampling method according to the spatial position index of the pixels: a traversal step length is set for each land use type, and a global search and sampling is carried out to form the training sample set $D = [(x_1, y_1), \ldots, (x_n, y_n)]$, where $x_1 \ldots x_n$ are the independent variables of the n samples in the random forest model and $y_1 \ldots y_n$ are the dependent variables of the n samples.
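Steps (2) and (3) can be sketched in NumPy as follows. The patent does not spell out the sampling rule, so this sketch makes several assumptions: the label is taken from the later-year map, "traversal step" is read as keeping every step-th changed pixel of a class after shuffling, and the function name, array layout and toy data are hypothetical.

```python
import numpy as np


def sample_changed_pixels(land1, land2, features, steps, seed=0):
    """Build (X, y) from changed pixels.

    land1, land2: 2-D arrays of class codes for the two years.
    features:     array of shape (n_features, rows, cols).
    steps:        {class_code: step}, keep every step-th changed pixel of that class.
    """
    rng = np.random.default_rng(seed)
    changed = land1 != land2              # same test as Con("land1.tif" != "land2.tif", 1, 0)
    xs, ys = [], []
    for code, step in steps.items():
        rows, cols = np.nonzero(changed & (land2 == code))
        keep = rng.permutation(rows.size)[::step]      # random order, then stride
        xs.append(features[:, rows[keep], cols[keep]].T)
        ys.append(np.full(keep.size, code))
    return np.vstack(xs), np.concatenate(ys)


# Toy rasters: three classes, with the top strip converted to class 3 in the second year.
rng = np.random.default_rng(1)
land1 = rng.integers(1, 4, size=(50, 50))
land2 = land1.copy()
land2[:10, :] = 3
features = rng.random((20, 50, 50))       # 20 normalised driving-factor layers
X, y = sample_changed_pixels(land1, land2, features, steps={3: 5})
```

A larger step for an abundant class and a smaller step for a rare one is one way to keep the sample from being dominated by the most common change type.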
In the above method, the random forest model may be trained iteratively as follows: a Python program is written that inputs the training sample set D into the random forest model for training, with the maximum number of features set to the square root of the number N of potential driving factors; the model is trained while iteratively increasing the number of decision trees, and the average margin of the model is used to measure the training effect:

$$MG_{avg} = \frac{1}{n}\sum_{i=1}^{n} mg(x_i, y_i)$$

where $MG_{avg}$ is the average margin over all samples, $n$ is the number of samples, and $mg(x_i, y_i)$ is the margin of a single sample. If $mg(x_i, y_i)$ is greater than zero, the correct class receives the largest number of votes and the final classification by voting is correct; otherwise the final classification is wrong.
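The training loop and the average margin can be sketched with scikit-learn as follows. This is an illustration under stated assumptions: the data are synthetic, the margin is evaluated on a held-out split (the patent does not say which samples are used), the margin of a sample is taken as the vote share of the true class minus the largest share of any other class, and `predict_proba` (which averages per-tree class probabilities) is used as a stand-in for vote shares.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def mean_margin(vote_share, y_idx):
    """Average of mg(x_i, y_i): true-class share minus the largest other-class share."""
    rows = np.arange(vote_share.shape[0])
    true_share = vote_share[rows, y_idx]
    others = vote_share.copy()
    others[rows, y_idx] = -np.inf         # mask the true class before taking the max
    return float(np.mean(true_share - others.max(axis=1)))


# Synthetic stand-in for the training set D (20 factors, 3 classes).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

margins = {}
for n_trees in (20, 50, 100, 200):        # iteratively increase the number of trees
    rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                random_state=0)
    rf.fit(X_fit, y_fit)
    y_idx = np.searchsorted(rf.classes_, y_val)   # column of each sample's true class
    margins[n_trees] = mean_margin(rf.predict_proba(X_val), y_idx)
```

Plotting `margins` against the number of trees gives a curve that can be inspected for the point where adding trees no longer changes the average margin.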
In the above method, the importance scores of the driving factors may be calculated and ranked as follows: a Python program is written which, for each feature variable j in the sample set D, randomly permutes the values of that variable to generate a new, corrupted training sample, on which a new average margin $MG_j$ is computed. The random permutation is repeated K times (50 times in this example) for each variable, and the average is taken as the final importance of the variable:

$$i_j = \frac{1}{K}\sum_{k=1}^{K}\left(MG - MG_{k,j}\right)$$

where $i_j$ is the importance score of variable j, $MG$ is the model average margin before random permutation, $K$ is the number of random permutations, and $MG_{k,j}$ is the model average margin after the k-th random permutation of variable j.
The feature variables are then sorted in descending order of the importance scores obtained in this step, which gives the importance ranking of the driving factors of land use change.
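The permutation scoring described above can be sketched in a model-agnostic way, here with NumPy only. The function names are hypothetical, and the toy `toy_predict_proba` stands in for a trained forest's probability output so that the example runs without training; it deliberately depends on the first column only.

```python
import numpy as np


def mean_margin(vote_share, y_idx):
    """Average margin: true-class share minus the largest other-class share."""
    rows = np.arange(vote_share.shape[0])
    others = vote_share.copy()
    others[rows, y_idx] = -np.inf
    return float(np.mean(vote_share[rows, y_idx] - others.max(axis=1)))


def permutation_importance_margin(predict_proba, X, y_idx, K=50, seed=0):
    """i_j = mean over K permutations of (MG - MG_{k,j}) for every column j."""
    rng = np.random.default_rng(seed)
    base = mean_margin(predict_proba(X), y_idx)          # MG before permutation
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(K):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to the target
            drops.append(base - mean_margin(predict_proba(X_perm), y_idx))
        scores[j] = np.mean(drops)
    return scores


def toy_predict_proba(X):
    """Stand-in model that only looks at column 0 (assumption for the demo)."""
    p1 = 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))
    return np.column_stack([1.0 - p1, p1])


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y_idx = (X[:, 0] > 0).astype(int)
scores = permutation_importance_margin(toy_predict_proba, X, y_idx, K=50)
ranking = np.argsort(-scores)             # descending importance
```

With a fitted scikit-learn forest, its `predict_proba` method could be passed in place of `toy_predict_proba`. Columns the model ignores score exactly zero here, which is the behaviour that separates this measure from a training-time statistic such as MDI.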
In the above method, the core driving forces may be screened as follows: following the principle of recursive feature elimination and the importance ranking of the driving factors, factors are added one at a time, starting from the most important, to form a new feature subset; each subset is input into a random forest model, which is trained with cross-validation to obtain a new classification accuracy; these steps are repeated until the feature subset contains all driving factors.
In the above method, the number of core driving forces may be determined as follows: a curve of model classification accuracy against the number of feature variables, as that number is reduced, is drawn, and the point at which the classification accuracy tends to converge is located; the number of variables at that point is the number of core driving factors.
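The screening loop and the choice of the convergence point can be sketched with scikit-learn as follows. Several things here are assumptions rather than the patent's procedure: the data are synthetic, the ranking is a placeholder (with `shuffle=False` the informative columns come first), 5-fold cross-validation is used, and "convergence" is read as the smallest subset whose accuracy is within a tolerance of the best accuracy on the curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def accuracy_curve(X, y, ranking, cv=5, seed=0):
    """Cross-validated accuracy for the top-1, top-2, ... ranked feature subsets."""
    curve = []
    for m in range(1, len(ranking) + 1):
        rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                    random_state=seed)
        curve.append(cross_val_score(rf, X[:, ranking[:m]], y, cv=cv).mean())
    return np.array(curve)


def convergence_point(curve, tol=0.01):
    """Smallest subset size whose accuracy is within tol of the best accuracy."""
    return int(np.argmax(curve >= curve.max() - tol)) + 1


# Synthetic data: columns 0-2 are informative, the remaining five are noise.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
ranking = np.arange(X.shape[1])           # placeholder for the importance ranking
curve = accuracy_curve(X, y, ranking)
n_core = convergence_point(curve, tol=0.01)
```

The tolerance is a judgment call; reading the point off a plotted curve, as the text describes, serves the same purpose.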
The land use change driving factor mining method based on the random forest and the crowd-sourced geographic information is used for revealing a microscopic driving mechanism of urban evolution.
The application case is as follows:
In this case, the central area of Wuhan is taken as the study area. It covers 2724.228 square kilometers, which is 31.79% of the total area of Wuhan, and is the most urbanized part of the city. The invention is further described below with reference to the drawings and tables, taking the driving force analysis of land use change in this area from 2015 to 2020 as an example.
The specific processing steps (fig. 1) are as follows:
Step 1, from the two aspects of social economy and natural ecology, constructing a multivariate potential driving factor data set affecting land use change based on crowd-sourced geographic information, and performing data spatialization, specifically as follows:
(1) Factors affecting urban land use change are selected, 20 in total, covering natural ecology and social economy. The natural ecological factors comprise 3 terrain factors (elevation, slope and aspect) and 3 ecological factors (water and soil conservation function, soil organic matter and water system). The socio-economic factors are derived from POI and other crowd-sourced geographic information and comprise 14 factors covering population, economy, education, public services and transportation (Table 1).
(2) Raster data (tif format) or vector data (shapefile format) representing each natural ecological and socio-economic factor in the study area are acquired and imported in turn into the geographic information processing and analysis software ArcMap 10.2. The Project function in the toolbox is used to convert the coordinate projection of the data from the various sources so that all data share the same projection, WGS_1984_UTM_Zone_49N. The Clip function is then used, with the boundary of the study area as the clipping range, to clip all data to the same extent.
(3) Different spatialization methods are used for different types of factors. Four methods are used: kernel density estimation, buffer creation, Euclidean distance calculation and zonal statistics, implemented with the Kernel Density, Multiple Ring Buffer, Euclidean Distance and Zonal Statistics tools in ArcMap 10.2 respectively. The spatialization method for each of the 20 factors is listed in Table 1. After processing, a spatial variable data set (fig. 2) is generated with a resolution of 30 m, 3366990 pixels (2090 × 1611) and the tif data format.
(4) To eliminate differences in data dimension between factors, the Fuzzy tool in ArcMap 10.2 is used to apply min-max (dispersion) normalization to the variables, normalizing the pixel values of the spatial variables so that each variable is mapped to the range 0 to 1.
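The normalization in item (4) can be illustrated outside ArcMap with a short Python sketch. It applies the usual min-max formula (v - min) / (max - min) to a raster held as a list of rows. The function name and the optional `nodata` argument are assumptions made for the example; the sketch shows the arithmetic only and does not reproduce the ArcMap tool itself.

```python
def min_max_normalize(grid, nodata=None):
    """Map the valid pixel values of a 2-D grid linearly onto [0, 1].
    Pixels equal to nodata are passed through unchanged."""
    values = [v for row in grid for v in row if v != nodata]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        [nodata if v == nodata else ((v - lo) / span if span else 0.0)
         for v in row]
        for row in grid
    ]

# Toy 2 x 2 raster of, for example, distances in metres.
normalized = min_max_normalize([[10, 20], [30, 50]])

# Same idea with a nodata marker outside the study area.
masked = min_max_normalize([[-9999, 2], [4, 6]], nodata=-9999)
```

After this step every factor shares the same 0 to 1 range, so no single factor dominates merely because of its units.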
Step 2, constructing a random forest classifier model with the multivariate potential driving factors as feature variables and the land use type of the land use thematic map as the target variable, and training the model, specifically as follows:
(1) The 2015 and 2020 land use thematic maps of the central area of Wuhan are imported into ArcMap. The spatial resolution of the data is 30 m, and land use is divided into nine types: water, forest and grassland, cultivated land, unused land, residential land, industrial land, commercial land, public management land and mixed land, coded 1 to 9 in that order.
(2) Changes in land use type from 2015 to 2020 are detected with the Raster Calculator in ArcMap, using the map algebra expression Con("landuse2015.GIF" != "landuse2020.GIF", 1, 0). This generates a new raster data set landuse_difference.GIF, in which a pixel value of 1 means that the land use type changed between 2015 and 2020 and a pixel value of 0 means that it did not. The pixels with value 1, i.e. the areas whose land use type changed in 2015-2020, are selected, giving 823636 pixels in total.
(3) For the 823636 pixels whose land use changed, a random traversal sampling method is applied according to the spatial position index of the pixels: a traversal step length is set for each land type, and a global search and sampling are performed. With the 20 potential driving factors as feature variables and the 2020 land use type as the target variable, a training sample set D = [(x_1, y_1), ..., (x_n, y_n)] is formed, containing 5717 water samples, 3283 forest and grassland samples, 10583 cultivated land samples, 6509 unused land samples, 6852 residential land samples, 6791 industrial land samples, 4985 commercial land samples, 2416 public management land samples and 2570 mixed land samples.
The specific implementation adopts the following codes:
Figure BDA0003410223620000111
Figure BDA0003410223620000121
Figure BDA0003410223620000131
Figure BDA0003410223620000141
Figure BDA0003410223620000151
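Since the code above survives only as image placeholders, the following Python sketch illustrates one possible reading of items (2) and (3): a pixel-wise equivalent of Con(a != b, 1, 0), followed by a traversal of the changed pixels that keeps every s-th pixel of each land type, starting from a random offset. The function names, the row-major traversal order and the random-offset rule are assumptions made for the example; the sampling rule of the original implementation may differ.

```python
import random

def change_mask(before, after):
    """Pixel-wise equivalent of Con(before != after, 1, 0)."""
    return [[1 if a != b else 0 for a, b in zip(row0, row1)]
            for row0, row1 in zip(before, after)]

def stride_sample(mask, after, steps, seed=0):
    """Walk the changed pixels in row-major order and keep every
    steps[cls]-th pixel of land type cls, from a random start offset."""
    rng = random.Random(seed)
    offsets = {cls: rng.randrange(step) for cls, step in steps.items()}
    seen = {cls: 0 for cls in steps}
    samples = []
    for r, row in enumerate(mask):
        for c, changed in enumerate(row):
            cls = after[r][c]
            if not changed or cls not in steps:
                continue
            if seen[cls] % steps[cls] == offsets[cls]:
                samples.append((r, c, cls))   # position and target label
            seen[cls] += 1
    return samples

# Toy 3 x 3 land use rasters (class codes only; not real data).
lu2015 = [[1, 1, 3], [3, 5, 5], [1, 3, 3]]
lu2020 = [[1, 5, 5], [3, 5, 6], [5, 3, 6]]

mask = change_mask(lu2015, lu2020)
samples = stride_sample(mask, lu2020, {5: 1, 6: 1})          # keep everything
thinned = stride_sample(mask, lu2020, {5: 2, 6: 1}, seed=1)  # thin class 5
```

Each sampled position would then be paired with the 20 normalized factor values at that pixel to form one (x_i, y_i) entry of the training set D. A larger step for a frequent land type reduces its sample count, which is one way to keep the classes roughly balanced.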
(4) The training sample D is input into the random forest model for training. The maximum number of features is set to the square root of the number N of potential driving factors; with the 20 driving factors used in this case, the maximum number of features is
Figure BDA0003410223620000161
The model is trained by iteratively increasing the number of decision trees, and the training effect is assessed with the average margin (interval) of the model:
MG_{avg} = \frac{1}{n} \sum_{i=1}^{n} mg(x_i, y_i)
In the formula, MG_{avg} is the average margin over all samples, n is the number of samples, and mg(x_i, y_i) is the margin of a single sample. If mg(x_i, y_i) is greater than zero, the correct class receives the largest number of votes and the final classification under voting is correct; otherwise the final classification is wrong. The results show that the average margin of the model stabilizes at 60 decision trees (fig. 3). The specific implementation adopts the following codes:
Figure BDA0003410223620000163
Figure BDA0003410223620000171
Figure BDA0003410223620000181
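As the code above is available only as image placeholders, the following Python sketch shows how the average margin could be tracked while trees are added. It assumes a hypothetical input `tree_predictions`, where `tree_predictions[t][i]` is the class predicted by tree t for sample i, and it takes mg(x, y) to be the share of trees voting for the true class minus the largest share for any other class, which is the usual random forest margin. All names and toy values are illustrative and are not taken from the case study.

```python
from collections import Counter

def margin_from_votes(tree_labels, true_label):
    """mg(x, y): vote share of the true class minus the largest vote
    share among the other classes."""
    counts = Counter(tree_labels)
    total = len(tree_labels)
    correct = counts.get(true_label, 0) / total
    wrong = max((n for label, n in counts.items() if label != true_label),
                default=0) / total
    return correct - wrong

def margin_curve(tree_predictions, y):
    """Average margin MG_avg for forests built from the first 1, 2, ..., T trees."""
    curve = []
    for t in range(1, len(tree_predictions) + 1):
        margins = [
            margin_from_votes([tree[i] for tree in tree_predictions[:t]], label)
            for i, label in enumerate(y)
        ]
        curve.append(sum(margins) / len(margins))
    return curve

# Toy example: 3 samples, 4 trees (class codes only).
y_true = [1, 2, 2]
tree_predictions = [
    [1, 2, 1],
    [1, 2, 2],
    [1, 1, 2],
    [1, 2, 2],
]
mg_curve = margin_curve(tree_predictions, y_true)
```

Plotting such a curve against the number of trees and looking for the point where it flattens corresponds to the reading of fig. 3 described in the text.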
Step 3, evaluating the importance of the variables. For each feature variable j in the sample D, the values of that variable are randomly permuted to generate a new, corrupted training sample D_j, and the average margin (interval) MG_j of the new sample is calculated. Because a single random permutation is unstable, the permutation is repeated 50 times for each variable, and the average is taken as the final importance of the variable:
I_j = \frac{1}{K} \sum_{k=1}^{K} \left( MG - MG_{k,j} \right)
In the formula, I_j is the importance score of variable j, MG is the average margin of the model before random permutation, K is the number of random permutations, and MG_{k,j} is the average margin of the model after the k-th random permutation of variable j.
The feature variables are sorted in descending order of the importance scores obtained in this step, which gives the ranking of the land use change driving forces (fig. 4). In this case, the 20 driving factors rank from most to least important as follows: population, K12 schools, industrial facilities, cultural and leisure venues, parks and green spaces, bus stations, universities, stadiums, hospitals, water systems, housing prices, subway stations, water and soil conservation, commercial facilities, major thoroughfares, elevation, soil organic matter, coach stations, and, in the last two places, slope and aspect. The specific implementation code is as follows:
Figure BDA0003410223620000184
Figure BDA0003410223620000191
Figure BDA0003410223620000201
Step 4, screening the core driving forces using the recursive feature elimination principle. Following the importance ranking of the driving factors and starting from the most important one, one factor is added at a time to form a new feature subset. The subset is input into the random forest model, the model is trained with cross-validation, and a new classification accuracy is obtained. These steps are repeated until the feature subset contains all driving factors. A curve of the model classification accuracy against the number of feature variables is then drawn. The classification accuracy converges at the 15th factor (fig. 5), beyond which it remains essentially unchanged, so the 15 core driving factors screened out are: population, K12 schools, industrial facilities, cultural and leisure venues, parks and green spaces, bus stations, universities, stadiums, hospitals, water systems, housing prices, subway stations, water and soil conservation, commercial facilities, and major thoroughfares.
TABLE 1 Classification and spatialization processing method for land utilization potential driving factors
Figure BDA0003410223620000211

Claims (10)

1. A land use change driving factor mining method based on random forest and crowd-sourced geographic information, characterized by comprising the following steps: first, constructing a multivariate potential driving factor data set affecting land use change from multi-source geographic data consisting mainly of POI points, and performing data spatialization; then, constructing a random forest classifier model with the multivariate potential driving factors as feature variables and the land use type of the land use thematic map as the target variable, and training the model; next, performing K random permutations of each single variable using the trained model to calculate the importance score of the variable, and ranking the importance of the driving factors by score; and finally, screening the core driving forces affecting land use change using the recursive feature elimination principle.
2. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 1, characterized in that the data spatialization is performed as follows: according to the data type of each factor, the data are spatialized using methods such as kernel density estimation, buffer creation, Euclidean distance calculation, zonal statistics, slope calculation and aspect calculation, generating a set of continuous, areal spatial variables with a consistent resolution.
3. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 1, characterized in that differences in data dimension within the spatial variable set are eliminated as follows: min-max (dispersion) normalization is applied to the variables using the Fuzzy tool in ArcMap 10.2, normalizing the pixel values of the spatial variables so that each variable is mapped to the range 0 to 1.
4. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 1, characterized in that the random forest classifier model is constructed as follows: with the multivariate potential driving factors as feature variables and the land use type of the land use thematic map as the target variable, a mapping between the feature variables and the land use type of the land use thematic map is constructed.
5. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 4, characterized in that, in constructing the random forest classifier model, the training samples of the random forest model are collected as follows:
(1) importing the land use thematic maps of the study area for two different years into ArcMap, the spatial resolution of the data being 30 m and the land use types being nine types, namely water, forest and grassland, cultivated land, unused land, residential land, industrial land, commercial land, public management land and mixed land, coded 1 to 9 in that order;
(2) detecting changes in land use type between the two years with the Raster Calculator in ArcMap, using the map algebra expression Con("Landuse1.GIF" != "Landuse2.GIF", 1, 0), to generate a new raster data set (Landuse_difference.GIF) in which a pixel value of 1 represents an area whose land use type changed and a pixel value of 0 represents no change;
(3) for the pixels whose land use changed, applying a random traversal sampling method according to the spatial position index of the pixels, setting a traversal step length for each land type and performing a global search and sampling, to form a training sample set D = [(x_1, y_1), ..., (x_n, y_n)].
6. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 4, characterized in that, in constructing the random forest classifier model, the random forest model is trained as follows: the training sample D is input into the random forest model for training, and the maximum number of features is set to the square root of the number N of potential driving factors; the model is trained by iteratively increasing the number of decision trees, and the training effect is measured by the average margin (interval) of the model:
MG_{avg} = \frac{1}{n} \sum_{i=1}^{n} mg(x_i, y_i)
In the formula, MG_{avg} is the average margin over all samples, n is the number of samples, and mg(x_i, y_i) is the margin of a single sample. If mg(x_i, y_i) is greater than zero, the correct class receives the largest number of votes and the final classification under voting is correct; otherwise the final classification is wrong.
7. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 1, characterized in that the importance scores of the driving factors are calculated and ranked as follows: for each feature variable j in the sample D, the values of that variable are randomly permuted to generate a new, corrupted training sample D_j, and the average margin (interval) MG_j of the new sample is calculated; the random permutation is repeated 50 times for each variable, and the average is taken as the final importance of the variable:
I_j = \frac{1}{K} \sum_{k=1}^{K} \left( MG - MG_{k,j} \right)
In the formula, I_j is the importance score of variable j, MG is the average margin of the model before random permutation, K is the number of random permutations, and MG_{k,j} is the average margin of the model after the k-th random permutation of variable j;
the feature variables are then sorted in descending order of the importance scores obtained in this step, which gives the importance ranking of the land use change driving factors.
8. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 1, characterized in that the core driving forces are screened as follows: following the importance ranking of the driving factors and starting from the most important one, one factor is added at a time to form a new feature subset; the subset is input into the random forest model, the model is trained with cross-validation, and a new classification accuracy is obtained; these steps are repeated until the feature subset contains all driving factors.
9. The land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in claim 8, characterized in that the number of core driving forces is determined as follows: a curve of the model classification accuracy against the number of feature variables is drawn, and the point at which the classification accuracy converges is located on the curve; the number of variables at that point is the number of core driving factors.
10. Use of the land use change driving factor mining method based on random forest and crowd-sourced geographic information as claimed in any one of claims 1 to 9 for accurately evaluating and screening the microscopic driving factors that affect urban land use change.
CN202111529458.2A 2021-12-14 2021-12-14 Land use change driving factor mining method based on random forest and crowd-sourced geographic information Pending CN114398951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111529458.2A CN114398951A (en) 2021-12-14 2021-12-14 Land use change driving factor mining method based on random forest and crowd-sourced geographic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111529458.2A CN114398951A (en) 2021-12-14 2021-12-14 Land use change driving factor mining method based on random forest and crowd-sourced geographic information

Publications (1)

Publication Number Publication Date
CN114398951A true CN114398951A (en) 2022-04-26

Family

ID=81227519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111529458.2A Pending CN114398951A (en) 2021-12-14 2021-12-14 Land use change driving factor mining method based on random forest and crowd-sourced geographic information

Country Status (1)

Country Link
CN (1) CN114398951A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861277A (en) * 2022-05-23 2022-08-05 中国科学院地理科学与资源研究所 Long-time-sequence national soil space function and structure simulation method
CN114881834A (en) * 2022-06-08 2022-08-09 生态环境部南京环境科学研究所 Method and system for analyzing driving relationship of urban group ecological system service
CN117077005A (en) * 2023-08-21 2023-11-17 广东国地规划科技股份有限公司 Optimization method and system for urban micro-update potential
CN117077005B (en) * 2023-08-21 2024-05-10 广东国地规划科技股份有限公司 Optimization method and system for urban micro-update potential

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination