CN115129802A

CN115129802A - Population spatialization method based on multi-source data and ensemble learning

Info

Publication number: CN115129802A
Application number: CN202210782643.0A
Authority: CN
Inventors: 夏南; 赵鑫; 姜朋辉; 周琛; 陈振杰; 徐云耘; 黄学锋; 李满春
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2022-07-05
Filing date: 2022-07-05
Publication date: 2022-09-30

Abstract

The invention discloses a population spatialization method based on multi-source data and ensemble learning, comprising the following steps: S1, acquiring and fusing multi-source data to construct a population spatialization database; S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes by the feature importance computed with ensemble learning models; S3, constructing a Pop-XGBoost population spatialization model from the relationship between the effective indexes and the community population; and S4, predicting the spatial distribution of population, aggregating the simulated grid population to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result. By combining multi-source data fusion, index screening, and ensemble learning to construct the population spatialization model, the method achieves accurate and efficient high-precision population spatialization prediction.

Description

Population spatialization method based on multi-source data and ensemble learning

Technical Field

The invention relates to the technical field of geographic big data applications, and in particular to a population spatialization method based on multi-source data and ensemble learning.

Background

Demographic information provides scientific support for regional sustainable development and spatial planning. However, current population data are usually statistical totals for each administrative region; their spatial resolution is low, and they cannot fully express spatial differences in population distribution within an administrative unit. Statistical population data are also difficult to match with studies that involve complex geographic boundaries, and are hard to integrate with other multi-source data such as remote sensing data. Population distribution data in grid units are finer, better reveal the spatial heterogeneity of population, and allow integration with resource, environment, and management information. Research on fine-scale population spatialization is therefore of great practical significance.

Common population distribution estimation methods fall into two categories according to research goals and auxiliary data: areal interpolation and statistical modeling. Population distributions obtained by areal interpolation have lower prediction accuracy and spatial resolution, mainly because the various factors influencing population distribution are not fully considered and the interpolation itself is poorly justified. Compared with areal interpolation, statistical models can effectively improve the accuracy of population spatialization. The statistical modeling approach mainly builds linear relationships between population and auxiliary variables through a specific statistical model in order to predict population spatially. Common auxiliary data include land use data, nighttime light data, roads, coastlines, MODIS/EVI, digital elevation models (DEM), impervious surfaces, and the like.

Although the above auxiliary data are effective, they still do not adequately represent the economic, social, and cultural factors behind the spatial distribution of population. More and more studies therefore apply openly available geospatial big data to map population distribution and reflect the intensity of human activity. Commonly used geographic big data include mobile phone signaling, activity trajectory data, points of interest (POIs), OpenStreetMap, social insurance accounts, house prices, taxi trajectory data, and the like. Building attribute data, such as building type, height, floor area ratio, and building area, are also closely related to population distribution. However, a single type of auxiliary data may be incomplete or contain anomalies, so the resulting simulated population correlates poorly with the real population. Integrating multi-source data to reflect the spatial distribution of population can reduce the prediction bias caused by a single type of auxiliary data. As the data and their features increase, however, data noise arises when the population spatialization model is constructed, which makes high-accuracy population spatialization prediction difficult.

Statistical models are limited in handling complex, multivariate influencing factors because of their simple structure. Machine learning models, by contrast, handle multi-source, high-dimensional features well and can accurately mine the relationship between multi-source auxiliary data and population. Decision-tree-based ensemble learning algorithms, mainly RF, GBDT, and XGBoost, achieve high simulation accuracy. In remote sensing, RF and XGBoost models have been widely applied, for example to soil nutrient estimation, daily reference evapotranspiration calculation, land use classification, PM2.5 prediction, landslide susceptibility mapping, and biomass estimation. Some studies show that, compared with RF and GBDT, XGBoost not only reduces overfitting and computational complexity but also improves prediction accuracy, so that the optimal solution of the model is found more efficiently. However, XGBoost has rarely been applied to fine-scale population spatialization, especially in studies that integrate multi-source geographic big data and building information.

Thus, there are still two major technical drawbacks to current population spatialization studies:

(1) From a data perspective, relying on a single type of auxiliary data may yield prediction results with poor accuracy.

(2) From a model perspective, adding auxiliary data introduces data noise and destabilizes model fitting, which degrades the accuracy of population spatialization. In addition, the XGBoost model has rarely been applied to fine-scale population spatialization.

Disclosure of Invention

Aiming at the problems in the related art, the invention provides a population spatialization method based on multi-source data and ensemble learning, so as to overcome the technical problems in the prior related art.

To this end, the invention adopts the following technical scheme, comprising the following steps:

S1, acquiring and fusing multi-source data, and constructing a population spatialization database;

S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes by the feature importance computed with ensemble learning models;

S3, constructing a Pop-XGBoost population spatialization model from the relationship between the effective indexes and the community population;

S4, predicting the spatial distribution of population, aggregating the simulated grid population to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result.

Further, acquiring and fusing multi-source data and constructing the population spatialization database comprises the following steps:

S11, acquiring multi-source data;

S12, selecting a plurality of indexes from the multi-source data to construct an index system;

S13, resampling each index to a grid scale of 100 m × 100 m, and aggregating to the community scale;

S14, computing building statistics at the grid scale, and splitting buildings by grid;

S15, constructing a population spatialization database in units of 100 m × 100 m grids.

Further, the multi-source data comprise basic geographic data, remote sensing data, building data, and geographic big data;

the building data include total building area, building footprint area, number of floors, building volume, and building type.

Further, the formula for segmenting the building according to the grid includes:

$$k = \frac{S(\mathrm{Building}_{Ai})}{S(\mathrm{Building}_{A})}$$

$$\mathrm{GFA}(\mathrm{Building}_{Ai}) = k \times \mathrm{GFA}(\mathrm{Building}_{A})$$

$$\mathrm{GFA}(\mathrm{Grid}_{i}) = \sum_{A} \mathrm{GFA}(\mathrm{Building}_{Ai})$$

where $S(\mathrm{Building}_{A})$ represents the building area of building A;

$S(\mathrm{Building}_{Ai})$ represents the building area of building A that lies within $\mathrm{Grid}_{i}$;

$k$ represents the segmentation coefficient;

$\mathrm{Grid}_{i}$ represents the i-th grid;

$\mathrm{GFA}(\mathrm{Building}_{A})$ represents the total floor area of building A;

$\mathrm{GFA}(\mathrm{Building}_{Ai})$ represents the floor area of building A within $\mathrm{Grid}_{i}$;

$\mathrm{GFA}(\mathrm{Grid}_{i})$ represents the total floor area of all buildings within $\mathrm{Grid}_{i}$.
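The splitting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the dictionary keys (`gfa`, `area`, `parts`) and the grid identifiers are hypothetical, and the per-grid intersection areas are assumed to have been computed beforehand in a GIS.

```python
from collections import defaultdict

def split_gfa_by_grid(buildings):
    """Distribute each building's total floor area (GFA) over the grids it covers.

    Each building is a dict with (hypothetical) keys:
      'gfa'   - GFA(Building_A), total floor area of the building
      'area'  - S(Building_A), building area of the whole building
      'parts' - {grid_id: S(Building_Ai)}, building area inside each grid
    """
    grid_gfa = defaultdict(float)
    for b in buildings:
        for grid_id, part_area in b["parts"].items():
            k = part_area / b["area"]          # segmentation coefficient k
            grid_gfa[grid_id] += k * b["gfa"]  # GFA(Building_Ai), summed per grid
    return dict(grid_gfa)

# Illustrative input: one building straddling two grids, one fully inside g2.
buildings = [
    {"gfa": 1200.0, "area": 400.0, "parts": {"g1": 300.0, "g2": 100.0}},
    {"gfa": 500.0, "area": 250.0, "parts": {"g2": 250.0}},
]
grid_totals = split_gfa_by_grid(buildings)
print(grid_totals)
```

Because the coefficients k of one building sum to 1 over the grids it covers, the total floor area is preserved by the split.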

Further, constructing an index system for model fitting from the population spatialization database and screening effective indexes by the feature importance computed with ensemble learning models comprises the following steps:

S21, constructing ensemble learning models with the community-scale statistics of the indexes as input variables and the real community population as the output target;

S22, calculating the feature importance of each index with each of the constructed ensemble learning models;

S23, selecting the indexes that have a large influence on population distribution as effective indexes according to the mean feature importance.

Further, the ensemble learning model comprises a random forest, a gradient boosting decision tree and an extreme gradient boosting decision tree.
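As a rough sketch of steps S22 and S23, the feature importances reported by the three models can be averaged per index and filtered by a threshold. The 1% threshold follows the embodiment described later in this document; the index codes are reused from the text, but all importance values below are made-up illustrative numbers, not the patent's results.

```python
def screen_indexes(importances_by_model, threshold=0.01):
    """Average per-index importance over models and keep indexes above threshold.

    importances_by_model: {model_name: {index_name: importance}}, where the
    importances of each model are assumed to sum to 1.
    """
    models = list(importances_by_model.values())
    names = list(models[0].keys())
    mean_importance = {n: sum(m[n] for m in models) / len(models) for n in names}
    selected = sorted(
        (n for n, v in mean_importance.items() if v > threshold),
        key=lambda n: mean_importance[n],
        reverse=True,
    )
    return mean_importance, selected

# Hypothetical importances for four indexes (values are illustrative only).
importances = {
    "RF":      {"D3": 0.60, "D5": 0.30, "A6": 0.095, "A9": 0.005},
    "GBDT":    {"D3": 0.55, "D5": 0.35, "A6": 0.090, "A9": 0.010},
    "XGBoost": {"D3": 0.65, "D5": 0.25, "A6": 0.094, "A9": 0.006},
}
mean_importance, selected = screen_indexes(importances)
print(selected)
```

Averaging over several models is one way to make the ranking less sensitive to the quirks of any single learner.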

Further, constructing the Pop-XGBoost population spatialization model from the relationship between the effective indexes and the community population comprises the following steps:

S31, based on the extreme gradient boosting decision tree, automatically determining the optimal parameters of the Pop-XGBoost model within specified ranges using the GridSearchCV module of the sklearn library;

S32, constructing the Pop-XGBoost model with the effective indexes as input indexes and 75% of the communities and their real population data as the training set;

S33, validating and analyzing the Pop-XGBoost model with the remaining 25% of the communities and their population data as the test set.

Further, predicting the spatial distribution of population, aggregating the simulated grid population to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result comprises the following steps:

S41, importing the XGBRegressor module (the scikit-learn-style interface of XGBoost) and estimating the population of each grid;

S42, redistributing the population of each 100 m × 100 m grid according to the proportion of the grid's population, as estimated by the Pop-XGBoost model, in the total estimated population of all grids in the county;

S43, selecting evaluation factors to evaluate the accuracy of the predicted population.

Further, the allocation formula is as follows:

$$P_{i} = D_{j} \times \frac{M_{i}}{M_{j}}$$

where $i$ represents each grid within the region;

$j$ represents the administrative region where the grid is located;

$P_{i}$ represents the corrected population within each grid;

$D_{j}$ represents the total population of the county in which the grid is located;

$M_{i}$ represents the population of the grid estimated by the Pop-XGBoost model;

$M_{j}$ represents the model-estimated population of all grids of the administrative region in which the grid is located.
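The proportional redistribution of step S42 can be illustrated as follows: each grid's model estimate is scaled so that the grids of one county sum to the county's statistical total. This is a minimal sketch; the grid identifiers and numbers are hypothetical.

```python
def redistribute(model_estimates, county_total):
    """Scale grid-level model estimates so they sum to the county total.

    model_estimates: {grid_id: M_i} for all grids of one county
    county_total:    D_j, the statistical population of that county
    """
    m_j = sum(model_estimates.values())  # M_j: sum of model estimates in the county
    return {g: county_total * m_i / m_j for g, m_i in model_estimates.items()}

# Illustrative values only.
estimates = {"g1": 120.0, "g2": 60.0, "g3": 20.0}
corrected = redistribute(estimates, county_total=1000.0)
print(corrected)
```

The correction keeps the relative pattern predicted by the model while tying the overall total to the official statistics.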

Further, the evaluation factors include the relative mean absolute error, the root mean square error, and the coefficient of determination.

The beneficial effects of the invention are as follows. By combining multi-source data fusion, index screening, and ensemble learning to construct the population spatialization model, the method achieves accurate and efficient high-precision population spatialization prediction. First, basic geographic data, remote sensing data, building data (such as building type and floor area ratio), and geographic big data (such as house price distribution) are fused to construct a population spatialization database in units of 100 m × 100 m grids, which provides data support for fine-scale population spatialization. Second, to address the increased noise and unstable model fitting caused by additional auxiliary data, feature importance is calculated with several ensemble learning models and the indexes are screened to reduce data noise, which improves the accuracy of the fine-scale result. Finally, the Pop-XGBoost population spatialization model is constructed, which further improves the accuracy of the fine-scale population spatialization result.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;

FIG. 2 is a schematic diagram of the administrative regions of Shenzhen in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;

FIG. 3 is a technical roadmap of a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;

FIG. 4 shows the feature importance of different indexes in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;

FIG. 5 is a schematic diagram of the Pop-XGBoost population spatialization prediction result in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;

FIG. 6 is a schematic diagram comparing the Pop-XGBoost and WorldPop population spatialization results (100 m) in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;

FIG. 7 is a schematic diagram comparing the population distribution of the four areas a, b, c, and d of FIG. 5 in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention.

Detailed Description

To further explain the embodiments, the invention provides drawings, which form part of the disclosure. They mainly illustrate the embodiments and, together with the description, explain their principles of operation. With reference to them, one skilled in the art will understand other possible embodiments and the advantages of the invention. Elements in the drawings are not drawn to scale, and like reference numerals generally denote like elements.

According to an embodiment of the invention, a population spatialization method based on multi-source data and ensemble learning is provided.

In the embodiment of the invention, the population of Shenzhen is selected as the object of analysis. Shenzhen is a coastal city in southern China, adjacent to Hong Kong, China, and lies south of the Tropic of Cancer (113°46′ to 114°37′ E, 22°27′ to 22°52′ N). It is China's first special economic zone and comprises 10 administrative districts (Futian, Luohu, Nanshan, Yantian, Bao'an, Longgang, Guangming, Longhua, and Pingshan Districts and Dapeng New District), 74 subdistricts, and 734 communities (FIG. 2). The population of Shenzhen was only 310,000 in 1979, and its permanent resident population had grown to 13.44 million by 2019. In 2019, the total area of Shenzhen was 1997.47 square kilometers, and the built-up area was 927.96 square kilometers.

The population is predicted and verified using data and parameters from the following five aspects.

(1) Demographic data

The study obtained 2019 population statistics for 270 communities from the Shenzhen municipal land resources commission as real data for model training and model validation. Of these, 202 communities (74.8%) form the training set and 68 communities (25.2%) form the test set, which serves as an independent validation set and does not participate in model training.
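A split of 270 communities into 202 training and 68 test communities can be reproduced with a simple shuffled partition. The sketch below is only one way to do it: the random seed, the integer community identifiers, and the truncation rule for the training size are assumptions, since the text does not say how the communities were assigned.

```python
import random

def split_communities(ids, train_fraction=0.75, seed=0):
    """Shuffle community ids and split them into training and test sets."""
    rng = random.Random(seed)      # fixed seed (assumed) for reproducibility
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # 270 * 0.75 = 202.5 -> 202
    return shuffled[:n_train], shuffled[n_train:]

train_ids, test_ids = split_communities(range(270))
print(len(train_ids), len(test_ids))
```

The test communities are held out entirely, so they can serve as an independent check on the fitted model.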

(2) Geographic big data

The geographic big data obtained in the study comprise Baidu POIs, subway lines, house price, floor area ratio, greening rate, total number of households, parking spaces, OpenStreetMap road network data, and the like.

The POI data used in the study come from Baidu Map, the most widely used and largest Internet map service provider in China. POI data were acquired by calling the Place API of Baidu Map. Each acquired POI includes latitude and longitude coordinates, type, name, and other attribute information. The acquired data were cleaned by removing duplicate records and records with missing attributes, leaving 802,668 valid records. The POIs of each type were processed by kernel density analysis to obtain raster data (100 m × 100 m) for 10 categories: government institutions, education, enterprises, shopping, finance, healthcare, entertainment, transportation, catering, and life services.

The subway lines of the study area were acquired from the Baidu subway map JavaScript API, and the distance between each grid (100 m × 100 m) and the subway lines was calculated with the Euclidean Distance tool in ArcGIS 10.8.

Data on 7,714 residential communities in Shenzhen were acquired from the Anjuke website, with 5 attributes: house price, floor area ratio, greening rate, total number of households, and parking spaces. Anjuke is a widely used online real estate platform in China that mainly publishes information on properties for sale or rent. The 5 attributes were interpolated over the whole study area by kriging to obtain the house price, floor area ratio, greening rate, total number of households, and parking spaces of each grid (100 m × 100 m).

The road network data of the study area were acquired from OpenStreetMap. The road network information in OpenStreetMap is rich and accurate, and can reflect the spatial distribution of population within a city. The OpenStreetMap road network comprises 16 road types, including cycleways, footways, primary roads, and motorways. Road network density is the ratio of the total road length within a study unit to the area of the study unit.

(3) Remote sensing data

The remote sensing data obtained in the study comprise a digital elevation model (DEM), Landsat 8 OLI data, an impervious surface product, and nighttime light data.

Landsat 8 OLI data (2018/04) and the DEM were obtained from the Geospatial Data Cloud. The Landsat 8 OLI data were processed with ENVI 5.3 to obtain the normalized difference vegetation index (NDVI) and the land surface temperature. The DEM was processed with ArcGIS 10.8 to extract aspect and slope.

The global impervious surface product (MSMT_IS30) was obtained from the National Earth System Science Data Center, with an overall accuracy of 95.1% and a resolution of 30 m, and the impervious surface proportion of each grid in the study area was calculated from it. The impervious surface proportion is the ratio of the impervious surface area within a study unit to the area of the study unit.

The nighttime light data are Luojia-1 nighttime light data with a spatial resolution of 130 m.

Finally, the NDVI (30 m resolution), elevation, impervious surface proportion, aspect, slope, land surface temperature, and nighttime light data were resampled to 100 m.

(4) Basic geographic data

The study acquired basic geographic data, such as land use and administrative boundaries, from the Shenzhen Municipal Planning and Land Resources Commission. The distance from each evaluation unit grid (100 m × 100 m) to the water surface/coastline, to the city center, and to green space/parks was calculated by the Euclidean distance method. The construction land proportion is the ratio of the area of construction land within a study unit to the area of the study unit.

(5) Building data

The building data come from the Shenzhen Municipal Planning and Land Resources Commission and comprise 5 attributes: total building area, building footprint area, number of floors, building volume, and building type. Based on these 5 attributes, 7 indexes, such as floor area ratio, building density, total building area, number of floors, building volume, and building type, were calculated for each 100 m × 100 m grid (Table 1). The national standard, Uniform Standard for Design of Civil Buildings (GB 50352-2019), classifies building types into four categories: residential buildings, public buildings, industrial buildings, and agricultural buildings. Building types are quantified according to the distribution of population among the different building types. The quantification equation is $D7 = Q_{i} \times D6$, where $Q_{i}$ is the building type weight. The weights of residential, public, industrial, and agricultural buildings are 10, 5, 3, and 2, respectively. Building density is the ratio of the sum of the footprint areas of all buildings in a zone to the area of the zone. The total building area ratio is the ratio of the total area of all buildings in a study unit to the area of the study unit.
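The two building indexes defined above can be written out directly. This is a minimal sketch: the weights 10, 5, 3, and 2 come from the text, while the English type labels, the function names, and the sample values are assumptions for illustration. D6 is taken as given from Table 1, which is not reproduced in this text.

```python
# Building type weights Q_i stated in the text.
BUILDING_TYPE_WEIGHTS = {
    "residential": 10,
    "public": 5,
    "industrial": 3,
    "agricultural": 2,
}

def building_type_index(building_type, d6):
    """D7 = Q_i * D6, with Q_i looked up from the building type."""
    return BUILDING_TYPE_WEIGHTS[building_type] * d6

def building_density(footprint_areas, cell_area=100.0 * 100.0):
    """Sum of building footprint areas divided by the area of the grid cell."""
    return sum(footprint_areas) / cell_area

d7 = building_type_index("residential", d6=2.0)  # illustrative D6 value
density = building_density([1500.0, 500.0])      # two footprints in m^2
print(d7, density)
```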

Referring now to the drawings and the detailed description, there is provided a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the present invention, as shown in fig. 1-7, the method including the steps of:

S1, acquiring and fusing multi-source data, and constructing a population spatialization database, comprising the following steps:

S11, acquiring multi-source data, including basic geographic data, remote sensing data, building data, and geographic big data;

S14, computing building statistics at the grid scale, and splitting buildings by grid, where the splitting formulas are:

$$k = \frac{S(\mathrm{Building}_{Ai})}{S(\mathrm{Building}_{A})}$$

$$\mathrm{GFA}(\mathrm{Building}_{Ai}) = k \times \mathrm{GFA}(\mathrm{Building}_{A})$$

$$\mathrm{GFA}(\mathrm{Grid}_{i}) = \sum_{A} \mathrm{GFA}(\mathrm{Building}_{Ai})$$

where $S(\mathrm{Building}_{A})$ represents the building area of building A;

$S(\mathrm{Building}_{Ai})$ represents the building area of building A that lies within $\mathrm{Grid}_{i}$;

$k$ represents the segmentation coefficient;

$\mathrm{Grid}_{i}$ represents the i-th grid;

$\mathrm{GFA}(\mathrm{Building}_{A})$ represents the total floor area of building A;

$\mathrm{GFA}(\mathrm{Building}_{Ai})$ represents the floor area of building A within $\mathrm{Grid}_{i}$;

$\mathrm{GFA}(\mathrm{Grid}_{i})$ represents the total floor area of all buildings within $\mathrm{Grid}_{i}$.

S15, constructing a population spatialization database in units of 100 m × 100 m grids.

The multi-source data comprise four types of data, namely basic geographic data, remote sensing data, building data, and geographic big data, from which 35 indexes are selected in total, as shown in Table 1.

The building data include attributes such as total building area, building footprint area, number of floors, building volume, and building type.

Table 1: population spatialization index system

S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes by the feature importance computed with ensemble learning models, comprising the following steps:

S21, constructing ensemble learning models with the community-scale statistics of the indexes as input variables and the real community population as the output target;

Ensemble learning completes a learning task by constructing and combining multiple learners, and is also called a multi-classifier system. It aims to improve generalization ability and robustness over a single base learner by combining the predictions of multiple machine learning models. The ensemble learning models here comprise random forest (RF), gradient boosting decision tree (GBDT), and extreme gradient boosting decision tree (XGBoost). The RF, GBDT, and XGBoost models have advantages in evaluating feature importance and in processing high-dimensional data.

Specifically, random forest (RF) is an ensemble algorithm based on decision trees. RF is an extended variant of Bagging: on top of a Bagging ensemble that uses decision trees as base learners, it further introduces random feature selection into the training of each decision tree. RF is implemented as follows: first, $k$ sample sets $\{D_{1}, D_{2}, \dots, D_{k}\}$ are drawn from the original training data set; second, each sample set is trained to obtain $k$ weak learners $\{h_{1}, h_{2}, \dots, h_{k}\}$; finally, the final result is obtained by majority vote or by averaging the predictions.
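The sampling-and-averaging part of this procedure can be shown with a deliberately tiny example. It is not a random forest: the base learner here is just the mean of each sample set (a constant predictor), there is no random feature selection, and the bootstrap (sampling with replacement) scheme, seed, and data values are assumptions chosen for illustration.

```python
import random
from statistics import mean

def bootstrap_sets(data, k, seed=0):
    """Draw k sample sets D_1..D_k of the same size as data, with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [[rng.choice(data) for _ in range(n)] for _ in range(k)]

def bagged_mean_prediction(y_train, k=10, seed=0):
    """Fit one constant 'weak learner' per sample set and average their outputs."""
    sample_sets = bootstrap_sets(y_train, k, seed)
    learners = [mean(s) for s in sample_sets]  # h_1..h_k (constant predictors)
    return mean(learners)                      # averaged prediction

y_train = [120.0, 80.0, 150.0, 95.0, 110.0]    # illustrative community populations
sets_ = bootstrap_sets(y_train, k=10)
prediction = bagged_mean_prediction(y_train, k=10)
print(prediction)
```

Replacing the constant predictor with a decision tree trained on a random subset of features would turn this skeleton into the RF scheme the text describes.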

Gradient boosting decision trees (GBDT) are widely used in classification, regression, and other tasks, and combine decision trees with gradient boosting. The main difference between GBDT and RF is that each decision tree in GBDT is fitted to the residuals of the previous trees. Thus, GBDT can reduce bias and variance. In each iteration, the goal of GBDT is to build a regression-tree weak learner that reduces the loss function, which can be expressed as

$$\hat{y}_{i}^{(m)} = \hat{y}_{i}^{(m-1)} + f_{m}(x_{i}), \qquad f_{m} = \arg\min_{f} \sum_{i=1}^{n} L\left(y_{i},\ \hat{y}_{i}^{(m-1)} + f(x_{i})\right)$$

where $\hat{y}_{i}^{(m)}$ is the predicted value of the $m$-th iteration, $L$ is the loss function, and $f_{m}(x_{i})$ is the regression tree that minimizes the loss function.

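Extreme gradient boosting decision trees (XGBoost) are a further optimization of GBDT. The loss function can be expressed as

$$\mathcal{L}^{(m)} = \sum_{i=1}^{n} L\left(y_{i},\ \hat{y}_{i}^{(m-1)} + f_{m}(x_{i})\right)$$

where $y_{i}$ is the true value and $\hat{y}_{i}^{(m-1)}$ is the predicted value of the $(m-1)$-th iteration. GBDT uses only first-derivative information in optimization. XGBoost performs a second-order Taylor expansion of the loss function to obtain the first and second derivatives, which can be expressed as

$$\mathcal{L}^{(m)} \simeq \sum_{i=1}^{n} \left[ L\left(y_{i},\ \hat{y}_{i}^{(m-1)}\right) + g_{i} f_{m}(x_{i}) + \frac{1}{2} h_{i} f_{m}^{2}(x_{i}) \right]$$

where $g_{i}$ and $h_{i}$ are the first- and second-order gradient statistics of the loss function, which can be expressed as

$$g_{i} = \frac{\partial L\left(y_{i},\ \hat{y}_{i}^{(m-1)}\right)}{\partial \hat{y}_{i}^{(m-1)}}, \qquad h_{i} = \frac{\partial^{2} L\left(y_{i},\ \hat{y}_{i}^{(m-1)}\right)}{\partial \left(\hat{y}_{i}^{(m-1)}\right)^{2}}$$

In addition, XGBoost adds a regularization term $\Omega(f_{m})$ to the loss function to control the complexity of the model; a smaller value indicates lower complexity and higher generalization ability. Therefore, XGBoost can not only reduce the variance and complexity of the model, but also avoid the overfitting problem.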
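To make the first- and second-order gradient statistics concrete, the sketch below computes them for a squared-error loss and derives the value of a single leaf that minimizes the regularized second-order objective. The choice of squared-error loss, the L2 penalty 0.5·λ·w² standing in for the regularization term, and the value λ = 1 are assumptions for illustration; the text does not specify them.

```python
def squared_error_grad_stats(y_true, y_pred_prev):
    """g_i and h_i of the loss 0.5 * (y - y_hat)^2 at the previous prediction."""
    g = [p - t for t, p in zip(y_true, y_pred_prev)]  # first derivative: y_hat - y
    h = [1.0 for _ in y_true]                         # second derivative: constant 1
    return g, h

def optimal_leaf_weight(g, h, reg_lambda=1.0):
    """Leaf value w minimizing sum(g_i*w + 0.5*h_i*w^2) + 0.5*reg_lambda*w^2.

    Setting the derivative to zero gives w = -sum(g) / (sum(h) + reg_lambda).
    """
    return -sum(g) / (sum(h) + reg_lambda)

y = [1.0, 2.0, 3.0]      # illustrative targets
prev = [0.0, 0.0, 0.0]   # previous-iteration predictions
g, h = squared_error_grad_stats(y, prev)
w = optimal_leaf_weight(g, h, reg_lambda=1.0)
print(g, h, w)
```

Without the penalty the leaf would take the mean residual (2.0 here); the regularization shrinks it toward zero, which is the complexity control the text refers to.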

S3, constructing a Pop-XGBoost population spatialization model from the relationship between the effective indexes and the community population, comprising the following steps:

Here, n_estimators is the maximum number of boosting iterations, that is, the maximum number of weak learners, and generally does not affect the complexity of the model. max_depth is the maximum depth of a tree; when there are many features, limiting max_depth is recommended to avoid overfitting. n_estimators and max_depth have a great influence on the fitting accuracy of the Pop-XGBoost model. learning_rate can improve the robustness of the model by reducing the weight of each step. min_child_weight is the minimum sum of sample weights in a leaf node. A large min_child_weight prevents the model from learning locally specific samples, but if it is too high it leads to under-fitting. gamma specifies the minimum loss reduction required for a node to split; the larger this value, the more conservative the algorithm. subsample is the proportion of randomly sampled training instances; reducing it may avoid overfitting, but setting it too small may lead to under-fitting. colsample_bytree is the column sampling rate used when each decision tree is constructed.

Therefore, finding the optimal values of parameters such as learning_rate, min_child_weight, gamma, subsample, and colsample_bytree is very important for the Pop-XGBoost model.

Table 2: Parameter ranges

Parameter	Value range of XGBoost parameter	Automatic search step size
n_estimators	10-201	10
max_depth	3-10	1
learning_rate	0-0.3	/
min_child_weight	1-6	1
gamma	0-0.5	/
subsample	0.5-1	/
colsample_bytree	0.5-1	/
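The ranges in Table 2 can be turned into an explicit search grid. The sketch below builds the grid with the standard library only. For the parameters whose step size is given as "/", the candidate values are assumptions made up for illustration, and the upper bound 201 for n_estimators is read as an exclusive end, so the candidates are 10, 20, ..., 200.

```python
from itertools import product

param_grid = {
    "n_estimators": list(range(10, 201, 10)),  # 10, 20, ..., 200 (step 10)
    "max_depth": list(range(3, 11)),           # 3..10 (step 1)
    "min_child_weight": list(range(1, 7)),     # 1..6 (step 1)
    # Table 2 gives no step for the parameters below; candidates are assumed.
    "learning_rate": [0.05, 0.1, 0.2, 0.3],
    "gamma": [0.0, 0.25, 0.5],
    "subsample": [0.5, 0.75, 1.0],
    "colsample_bytree": [0.5, 0.75, 1.0],
}

def iter_candidates(grid):
    """Yield every parameter combination of the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)

first = next(iter_candidates(param_grid))
print(n_candidates, first)
```

A dictionary of this shape is what scikit-learn's `GridSearchCV` accepts as `param_grid`, with an `XGBRegressor` as the estimator. Since an exhaustive search over every combination grows multiplicatively, tuning the parameters in stages may be more practical.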

S4, predicting the spatial distribution of population, aggregating the simulated grid population to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result, comprising the following steps:

Because the population of each grid estimated by the optimal model is derived from community-level population, the total of the simulated grid population must be constrained by county-scale statistical population data.

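The allocation formula is:

$$P_{i} = D_{j} \times \frac{M_{i}}{M_{j}}$$

where $i$ represents each grid within the region;

$j$ represents the administrative region where the grid is located;

$P_{i}$ represents the corrected population within each grid;

$D_{j}$ represents the total population of the county in which the grid is located;

$M_{i}$ represents the population of the grid estimated by the Pop-XGBoost model;

$M_{j}$ represents the model-estimated population of all grids of the administrative region in which the grid is located.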

The evaluation factors include the relative mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination (R²). The smaller the MAE and RMSE, the more accurate the estimate; R² lies in the interval (0, 1), and the closer it is to 1, the closer the predicted population is to the real population.
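The three evaluation factors can be computed as follows. This is a minimal sketch with illustrative numbers: it implements the plain mean absolute error; if a relative MAE is intended, one option (an assumption, not stated in the text) would be to divide the MAE by the mean of the true values.

```python
from math import sqrt

def evaluate(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted populations."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Illustrative community populations (true vs. aggregated grid predictions).
mae, rmse, r2 = evaluate([100, 200, 300, 400], [110, 190, 310, 390])
print(mae, rmse, r2)
```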

The following are specific examples developed using the present invention and the results analysis:

(I) Index screening results

The mean feature importance (relative values, expressed as percentages) of the 35 indexes shows that the building data together account for more than 69% of the feature importance, indicating that building data have a large impact on the spatial distribution of population. Building area (D3) has the highest feature importance, 33.02%, followed by number of floors (D5) and building type (D7), at 14.63% and 10.29% respectively. The geographic big data together account for about 21.73%. Within them, POI density accounts for about 16.57%, with the kernel density of medical facilities (A6) contributing the most, 6.45%, and the density of restaurants (A9) the least, 0.38%. In addition, total number of households (A14) and distance to subway (A16) have relatively high contributions, 1.34% and 1.03% respectively. The remote sensing data together account for about 6.17%; DEM (B1), impervious surface proportion (B3), and nighttime light intensity (B6) have relatively high contributions, 1.27%, 1.30%, and 1.31% respectively. The basic geographic data together account for about 3.05%, of which only the distance to water/coastline (C1) has a relatively large contribution (1%). In summary, 21 low-redundancy indexes with importance scores greater than 1% are selected: 7 building data indexes, 10 geographic big data indexes, 3 remote sensing data indexes, and 1 basic geographic data index.

(II) Constructing the Pop-XGBoost population spatialization model based on the effective indexes

A Pop-XGBoost population spatialization model is constructed with the 21 effective indexes as input variables and the real population of 202 communities (75% of the 270 communities) as the output target. The optimal parameters are shown in Table 3. The real population of the remaining 68 communities (25% of the 270 communities) is used as the test set to evaluate the accuracy of the constructed model. The results show an accuracy of up to 99% on the training set and up to 92% on the test set.
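The 202/68 division of the 270 communities is consistent with a 25% hold-out in which the test size is rounded up. The sketch below reproduces those counts under that assumption; the patent does not state how the split was drawn, so the shuffling, the seed and the function name `split_communities` are hypothetical.

```python
import math
import random


def split_communities(ids, test_fraction=0.25, seed=0):
    """Shuffle community ids and hold out a fraction of them as the test set."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    # Assumption: the test size is rounded up (270 * 0.25 = 67.5 -> 68).
    n_test = math.ceil(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = split_communities(range(270))
print(len(train_ids), len(test_ids))
```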

Table 3: optimal parameters of different models

(III) Prediction results of the grid-scale population spatial distribution

The 21 selected indexes are aggregated to a 100 m × 100 m grid and used as the input variables of the Pop-XGBoost model to predict the population spatial distribution of Shenzhen. The results are shown in FIG. 5, where the value of each grid cell represents the population within that cell (1 hectare). As can be seen from FIG. 5, areas of high population density are mainly concentrated in the southeast of Futian District (Nanyuan, Fubao and Futian Subdistricts, etc.), the southwest of Luohu District (Nanhu and Dongmen Subdistricts, etc.), the middle of Longhua District (Longhua Subdistrict and the Longgang area), and the south of Longgang District (Buji and Nanwan Subdistricts). In general, the population is concentrated in areas with favorable locations and better economic conditions, where production and daily life are more convenient. The Pop-XGBoost prediction of the population spatial distribution is therefore consistent with the actual distribution characteristics of the population.

(IV) Accuracy verification

The 2019 WorldPop dataset for Shenzhen was obtained. WorldPop is a widely used and well-recognized gridded population dataset with good accuracy, whose finest available spatial resolution is 100 m × 100 m. Comparing it with the prediction of the Pop-XGBoost model (FIG. 6) shows that the two exhibit approximately the same overall trend in population spatial distribution, and that their high-value population clusters are located in approximately the same areas.

Sentinel-2 remote sensing imagery from 2019 with a spatial resolution of 20 m is selected to verify the population prediction results of the Pop-XGBoost model. The detailed comparison of the four regions (a, b, c, d) in FIG. 7 shows that the predictions of the Pop-XGBoost model and the WorldPop dataset are broadly consistent across regions. However, WorldPop simulates low-population-density areas (e.g., region b) poorly and is inconsistent with the remote sensing imagery there, mainly because the data used by WorldPop is not fine enough. The prediction of the Pop-XGBoost model proposed herein is basically consistent with the remote sensing imagery and better reflects the fine-scale population spatial distribution.

The grid population results produced by the model are aggregated to the community scale and compared with the community-scale statistics, and indicators such as MAE, RMSE and R² are calculated. The results show that the Pop-XGBoost model has an RMSE of 12783.82, an MAE of 8006.07 and an R² of 80.54%, indicating good overall accuracy.
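The aggregation from grid cells to communities can be sketched as a grouped sum. The sketch assumes that every grid cell has already been assigned to exactly one community (for example by its centroid), which the text does not specify; the identifiers and population values are hypothetical.

```python
from collections import defaultdict


def aggregate_to_communities(grid_population, grid_to_community):
    """Sum predicted grid-cell populations for each community."""
    totals = defaultdict(float)
    for grid_id, population in grid_population.items():
        # Assumption: each grid cell belongs to exactly one community.
        totals[grid_to_community[grid_id]] += population
    return dict(totals)


# Hypothetical grid predictions and cell-to-community assignments.
grid_population = {"g1": 120.0, "g2": 80.0, "g3": 50.0}
grid_to_community = {"g1": "c1", "g2": "c1", "g3": "c2"}
community_population = aggregate_to_communities(grid_population, grid_to_community)
print(community_population)
```

The resulting community totals are the values that would be compared against the community-scale statistics when MAE, RMSE and R² are calculated.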

In summary, with the technical scheme of the invention, a population spatialization model is constructed by combining multi-source data fusion, index screening, ensemble learning and related techniques, so that high-precision population spatialization prediction is achieved accurately and efficiently. First, basic geographic data, remote sensing data, building data (such as building type and floor area ratio) and geographic big data (such as house price distribution) are fused to construct a population spatialization database in units of 100 m × 100 m grids, providing data support for fine-scale population spatialization. Second, to address the increased noise and unstable model fitting caused by adding auxiliary data, feature importance is calculated with several ensemble learning models and the indexes are screened to reduce data noise. Finally, a Pop-XGBoost population spatialization model is constructed, improving the accuracy of the fine-scale population spatialization result.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A population spatialization method based on multi-source data and ensemble learning is characterized by comprising the following steps:

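S1, acquiring and fusing multi-source data, and constructing a population spatialization database;

S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes through the feature importance calculated by ensemble learning models;

S3, constructing a Pop-XGBoost population spatialization model by combining the relationship between the effective indexes and the community population;

and S4, predicting the population spatial distribution, aggregating the grid population simulation data to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result.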

2. The population spatialization method based on multi-source data and ensemble learning according to claim 1, wherein acquiring and fusing the multi-source data and constructing the population spatialization database comprises the following steps:

S11, acquiring multi-source data;

S14, computing the grid-scale building data by dividing the buildings according to the grid;

3. The population spatialization method based on multi-source data and ensemble learning according to claim 2, wherein the multi-source data comprises basic geographic data, remote sensing data, building data and geographic big data;

the building data includes total building area, building floor area, number of building floors, building volume, and building type.

4. The population spatialization method based on multi-source data and ensemble learning according to claim 3, wherein the formula for segmenting the building according to the grid comprises:

$$k = \frac{S(\mathrm{Building}_{Ai})}{S(\mathrm{Building}_{A})}$$

$$\mathrm{GFA}(\mathrm{Building}_{Ai}) = k \times \mathrm{GFA}(\mathrm{Building}_{A})$$

$$\mathrm{GFA}(\mathrm{Grid}_{i}) = \sum \mathrm{GFA}(\mathrm{Building})$$

in the formula, $S(\mathrm{Building}_{A})$ represents the building area of building A;

$k$ represents the segmentation coefficient;

$\mathrm{Grid}_{i}$ represents the i-th grid;

$\mathrm{GFA}(\mathrm{Building}_{A})$ represents the total area of building A;

$\mathrm{GFA}(\mathrm{Building}_{Ai})$ represents the area of building A within $\mathrm{Grid}_{i}$;

$\mathrm{GFA}(\mathrm{Grid}_{i})$ represents the total building area of all buildings within $\mathrm{Grid}_{i}$.
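A worked instance of the segmentation formulas may help. The sketch below reads S(·) as a footprint area and GFA(·) as a total floor area, which is an interpretation rather than something the claim states; the function names and the building dimensions are hypothetical.

```python
def split_building(footprint_in_grid, footprint_total, gfa_total):
    """Return the share of a building's total area allocated to one grid cell."""
    # k = S(Building_Ai) / S(Building_A)
    k = footprint_in_grid / footprint_total
    # GFA(Building_Ai) = k * GFA(Building_A)
    return k * gfa_total


def grid_gfa(parts):
    """Sum the allocated areas of all building parts inside one grid cell."""
    return sum(split_building(*part) for part in parts)


# Hypothetical buildings intersecting Grid_i, as
# (footprint inside the cell, total footprint, total floor area) in m^2.
building_a = (250.0, 1000.0, 8000.0)  # a quarter of building A lies in the cell
building_b = (400.0, 400.0, 1200.0)   # building B lies entirely in the cell
gfa_a_in_grid = split_building(*building_a)
gfa_grid = grid_gfa([building_a, building_b])
print(gfa_a_in_grid, gfa_grid)
```

With these values k is 0.25 for building A, so 2000 m² of its area is assigned to the cell, and the cell total is 3200 m².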

5. The population spatialization method based on multi-source data and ensemble learning of claim 1, wherein an index system for model fitting is constructed from the population spatialization database, and effective indexes are screened out through feature importance calculated by an ensemble learning model, and the method comprises the following steps:

6. The population spatialization method based on multi-source data and ensemble learning according to claim 5, wherein the ensemble learning model comprises a random forest, a gradient boosting decision tree and an extreme gradient boosting decision tree.

7. The population spatialization method based on multi-source data and ensemble learning according to claim 6, wherein a Pop-XGboost population spatialization model is constructed by combining the relationship between the effective index and the community population, and the method comprises the following steps:

S31, based on the extreme gradient boosting decision tree, automatically searching for the optimal parameters of the Pop-XGBoost model within a specified range using the GridSearchCV module of the sklearn library;

8. The method of claim 7, wherein the predicting of the spatial population distribution, the summarizing of the simulation grid population data to the community scale, the comparing with the real community demographic data, and the verification of the result accuracy comprises the following steps:

S42, redistributing the population of each 100 m × 100 m grid according to the ratio of the grid population predicted by the Pop-XGBoost model to the total predicted population of all grids in the district or county where the grid is located;

9. The population spatialization method based on multi-source data and ensemble learning according to claim 8, wherein the distribution formula is as follows:

where i represents each grid within the region;

j represents the administrative region where the grid is located;

$P_i$ represents the corrected population within each grid;
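The proportional redistribution can be sketched as follows. The distribution formula itself is not reproduced in this text, so the sketch follows only the wording of step S42 and assumes that each grid's predicted share of its district total is multiplied by a known district-level population; the function name, the identifiers and all numeric values are hypothetical.

```python
def redistribute(predicted, grid_to_district, district_population):
    """Rescale grid predictions so that each district sums to its known total."""
    predicted_sum = {}
    for grid_id, value in predicted.items():
        district = grid_to_district[grid_id]
        predicted_sum[district] = predicted_sum.get(district, 0.0) + value
    corrected = {}
    for grid_id, value in predicted.items():
        district = grid_to_district[grid_id]
        # Assumption: corrected P_i = predicted share within district j * district total.
        corrected[grid_id] = value / predicted_sum[district] * district_population[district]
    return corrected


# Hypothetical predictions for grids i in administrative regions j.
predicted = {"g1": 30.0, "g2": 10.0, "g3": 5.0}
grid_to_district = {"g1": "d1", "g2": "d1", "g3": "d2"}
district_population = {"d1": 80.0, "d2": 4.0}
corrected = redistribute(predicted, grid_to_district, district_population)
print(corrected)
```

Under this reading, the corrected grid populations within each administrative region sum exactly to that region's total.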

10. The population spatialization method based on multi-source data and ensemble learning according to claim 9, wherein the evaluation factors include the mean absolute error, the root mean square error and the coefficient of determination.