CN115129802A - Population spatialization method based on multi-source data and ensemble learning - Google Patents

Population spatialization method based on multi-source data and ensemble learning

Info

Publication number
CN115129802A
Authority
CN
China
Prior art keywords
population
building
grid
data
spatialization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210782643.0A
Other languages
Chinese (zh)
Inventor
夏南
赵鑫
姜朋辉
周琛
陈振杰
徐云耘
黄学锋
李满春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210782643.0A
Publication of CN115129802A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Primary Health Care (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a population spatialization method based on multi-source data and ensemble learning, comprising the following steps: S1, acquiring and fusing multi-source data and constructing a population spatialization database; S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes by the feature importance calculated with ensemble learning models; S3, constructing a Pop-XGBoost population spatialization model from the relationship between the effective indexes and the community population; and S4, predicting the spatial distribution of population, aggregating the simulated grid population to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result. By combining multi-source data fusion, index screening and ensemble learning to construct the population spatialization model, high-precision population spatialization prediction is achieved accurately and efficiently.

Description

Population spatialization method based on multi-source data and ensemble learning
Technical Field
The invention relates to the technical field of geographic big data applications, in particular to a population spatialization method based on multi-source data and ensemble learning.
Background
Demographic information may provide scientific support for regional sustainable development and space planning. However, the current population data is usually a statistical value of each administrative region, the spatial resolution is low, and the spatial difference of population distribution in the administrative unit cannot be fully expressed. Also, statistical population data is difficult to match with some studies involving complex geographic boundaries, and is not conducive to integrating other multi-source data, such as remote sensing data. The population distribution data taking the grid as a unit is finer, the spatial heterogeneity of population can be better revealed, and the integration of resources, environment and management information is realized. Therefore, the development of the fine-scale population spatialization research is of great practical significance.
Common population distribution estimation methods fall into two categories according to research goals and auxiliary data: areal interpolation and statistical modeling. Population distributions obtained by areal interpolation have lower prediction accuracy and spatial resolution, mainly because the various factors influencing population distribution are not fully considered and the interpolation itself is poorly justified. Compared with areal interpolation, statistical models can effectively improve the accuracy of population spatialization. Statistical modeling mainly builds linear relationships between population and auxiliary variables through a specific statistical model in order to predict the spatial distribution of population. Common auxiliary data include land use data, night light data, roads, coastlines, MODIS/EVI, digital elevation models (DEM), impervious surfaces, and the like.
While the above auxiliary data are effective, they still do not adequately represent the economic, social and cultural factors behind the spatial distribution of population. Therefore, more and more studies apply openly available geospatial big data to map population distribution and thereby reflect the intensity of human activity. Commonly used geographic big data include mobile phone signaling, behavior trajectory data, points of interest (POIs), OpenStreetMap, social insurance accounts, house prices, taxi trajectory data and the like. In addition, building attribute data such as building type, height, floor area ratio and building area are closely related to population distribution. However, a single type of auxiliary data may be incomplete or contain anomalies, so that the simulated population correlates poorly with the real population. Integrating multi-source data to reflect the spatial distribution of population can therefore reduce the prediction bias caused by a single type of auxiliary data. However, as the data and their features increase, noise is introduced when the population spatialization model is constructed, which makes high-precision population spatialization prediction difficult.
Statistical models are limited in handling complex, multivariate influencing factors because of their simple structure. Machine learning models, by contrast, can handle multi-source, multi-dimensional features well and accurately mine the relationship between multi-source auxiliary data and population. Ensemble learning algorithms based on decision trees, mainly RF, GBDT and XGBoost, have high simulation accuracy. In remote sensing, RF and XGBoost models have been widely applied, for example to soil nutrient estimation, daily reference evapotranspiration calculation, land use classification, PM2.5 prediction, landslide susceptibility mapping and biomass estimation. Some research results show that, compared with RF and GBDT, XGBoost can reduce over-fitting and computational complexity while improving prediction accuracy, so that the optimal solution of the model is found more efficiently. However, XGBoost has seldom been applied to fine-scale population spatialization, especially in studies that integrate multi-source geographic big data with building information.
Thus, current population spatialization studies still have two major technical drawbacks:
(1) From a data perspective, relying on a single type of auxiliary data may lead to prediction results with poor accuracy.
(2) From a model perspective, adding auxiliary data can introduce noise and make model fitting unstable, which affects the accuracy of the population spatialization result. In addition, the XGBoost model has seldom been applied to fine-scale population spatialization.
Disclosure of Invention
In view of the problems in the related art, the invention provides a population spatialization method based on multi-source data and ensemble learning to overcome the technical problems of the prior art.
To this end, the invention adopts the following technical scheme, which comprises the following steps:
S1, acquiring and fusing multi-source data and constructing a population spatialization database;
S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes by the feature importance calculated with ensemble learning models;
S3, constructing a Pop-XGBoost population spatialization model from the relationship between the effective indexes and the community population;
and S4, predicting the spatial distribution of population, aggregating the simulated grid population to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result.
Further, acquiring and fusing multi-source data and constructing the population spatialization database comprises the following steps:
S11, acquiring multi-source data;
S12, selecting a plurality of indexes from the multi-source data to construct an index system;
S13, resampling each index to a grid scale of 100 m × 100 m and aggregating it to the community scale;
S14, computing building statistics at the grid scale and splitting buildings by grid;
and S15, constructing a population spatialization database with a 100 m × 100 m grid as the unit.
Further, the multi-source data comprises basic geographic data, remote sensing data, building data and geographic big data;
the building data includes total building area, building floor area, number of building floors, building volume, building type.
Further, the formulas for splitting buildings by grid are:
$$k = \frac{S(\mathrm{Building}_{A_i})}{S(\mathrm{Building}_{A})}$$
$$\mathrm{GFA}(\mathrm{Building}_{A_i}) = k \times \mathrm{GFA}(\mathrm{Building}_{A})$$
$$\mathrm{GFA}(\mathrm{Grid}_i) = \sum_{A} \mathrm{GFA}(\mathrm{Building}_{A_i})$$
where $S(\mathrm{Building}_{A})$ represents the building area of building A;
$S(\mathrm{Building}_{A_i})$ represents the building area of building A that lies within $\mathrm{Grid}_i$;
$k$ represents the segmentation coefficient;
$\mathrm{Grid}_i$ represents the i-th grid;
$\mathrm{GFA}(\mathrm{Building}_{A})$ represents the total building area of building A;
$\mathrm{GFA}(\mathrm{Building}_{A_i})$ represents the total building area of building A within $\mathrm{Grid}_i$;
$\mathrm{GFA}(\mathrm{Grid}_i)$ represents the total building area of all buildings within $\mathrm{Grid}_i$.
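A minimal sketch of the segmentation formulas, assuming axis-aligned rectangular footprints and grid cells so that intersection areas can be computed without a GIS library. The function names, the rectangle representation `(xmin, ymin, xmax, ymax)` and the sample values are hypothetical; real footprints are polygons and would need a polygon intersection routine.

```python
from collections import defaultdict


def overlap_area(a, b):
    """Intersection area of two axis-aligned rectangles (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def split_gfa(buildings, grids):
    """Distribute each building's total building area over the grids it overlaps.

    buildings: list of (footprint_rect, total_building_area)
    grids:     dict mapping grid id -> cell rectangle
    Returns a dict mapping grid id -> GFA(Grid_i).
    """
    gfa_per_grid = defaultdict(float)
    for footprint, gfa in buildings:
        s_building = overlap_area(footprint, footprint)  # S(Building_A)
        if s_building == 0:
            continue  # degenerate footprint, nothing to distribute
        for grid_id, cell in grids.items():
            k = overlap_area(footprint, cell) / s_building  # segmentation coefficient
            if k > 0:
                gfa_per_grid[grid_id] += k * gfa  # GFA(Building_Ai)
    return dict(gfa_per_grid)


# Two adjacent 100 m x 100 m cells and two hypothetical buildings.
grids = {0: (0, 0, 100, 100), 1: (100, 0, 200, 100)}
buildings = [
    ((80, 10, 120, 30), 4000.0),  # straddles both cells equally
    ((10, 50, 30, 70), 1200.0),   # entirely inside cell 0
]
result = split_gfa(buildings, grids)
```

Because every building's coefficients sum to 1 over the grids it touches, the total building area is preserved by the split.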
Further, constructing an index system for model fitting from the population spatialization database and screening effective indexes by the feature importance calculated with ensemble learning models comprises the following steps:
S21, constructing ensemble learning models with the community-scale statistics of the indexes as input variables and the real community population as the output target;
S22, calculating the feature importance of each index with each of the constructed ensemble learning models;
and S23, selecting the indexes that have a large influence on population distribution as effective indexes according to the average feature importance.
Further, the ensemble learning model comprises a random forest, a gradient boosting decision tree and an extreme gradient boosting decision tree.
Further, constructing the Pop-XGBoost population spatialization model from the relationship between the effective indexes and the community population comprises the following steps:
S31, based on the extreme gradient boosting decision tree, automatically determining the optimal parameters of the Pop-XGBoost model within a specified range using the GridSearchCV module of the sklearn library;
S32, constructing the Pop-XGBoost model with the effective indexes as input indexes and 75% of the communities and their real population data as the training set;
and S33, verifying and analyzing the Pop-XGBoost model with the remaining 25% of the communities and their population data as the test set.
Further, predicting the spatial distribution of population, aggregating the simulated grid population to the community scale, comparing it with real community demographic data and verifying the accuracy of the result comprises the following steps:
S41, introducing the XGBRegressor module through the sklearn interface and estimating the population of each grid;
S42, redistributing the population of each 100 m × 100 m grid according to the proportion of the grid's population, as estimated by the Pop-XGBoost model, in the total estimated population of all grids in the county;
and S43, selecting evaluation factors to evaluate the accuracy of the predicted population.
Further, the allocation formula is as follows:
$$P_i = D_j \times \frac{M_i}{M_j}$$
where $i$ represents each grid within the region;
$j$ represents the administrative region where the grid is located;
$P_i$ represents the corrected population within each grid;
$D_j$ represents the total population of the county in which the grid is located;
$M_i$ represents the population of the grid estimated by the Pop-XGBoost model;
$M_j$ represents the total model-estimated population of all grids in the administrative region where the grid is located.
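The proportional redistribution can be sketched as follows: each grid receives the share of the county total that corresponds to its share of the model estimates within that county. The function name, the dictionary layout and the sample numbers are hypothetical.

```python
def redistribute(model_estimates, region_of, region_totals):
    """Scale model estimates so each region's grids sum to its census total.

    model_estimates: dict grid id -> M_i (model-estimated population)
    region_of:       dict grid id -> region id j
    region_totals:   dict region id j -> D_j (statistical population)
    Returns a dict grid id -> P_i (corrected population).
    """
    # M_j: sum of the model estimates of all grids in region j.
    region_sums = {}
    for grid_id, m_i in model_estimates.items():
        j = region_of[grid_id]
        region_sums[j] = region_sums.get(j, 0.0) + m_i

    corrected = {}
    for grid_id, m_i in model_estimates.items():
        j = region_of[grid_id]
        m_j = region_sums[j]
        # Guard against a region whose estimates are all zero.
        corrected[grid_id] = region_totals[j] * m_i / m_j if m_j > 0 else 0.0
    return corrected


estimates = {"g1": 30.0, "g2": 50.0, "g3": 20.0, "g4": 10.0}
region_of = {"g1": "A", "g2": "A", "g3": "A", "g4": "B"}
totals = {"A": 200.0, "B": 25.0}
corrected = redistribute(estimates, region_of, totals)
```

After correction, the grid populations of each region sum to that region's statistical total, which is what makes the grid product consistent with the census figures.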
Further, the evaluation factors include the relative mean absolute error, the root mean square error and the coefficient of determination.
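A minimal sketch of the three evaluation factors. The text does not define the relative mean absolute error, so the sketch assumes it is the mean absolute error divided by the mean observed population; the function name and sample values are likewise hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return relative MAE, RMSE and R^2 for paired observed/predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / n

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    ss_res = sum(e * e for e in errors)                        # residual sum of squares
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)   # total sum of squares

    return {
        "relative_mae": mae / mean_observed,  # assumed definition, see lead-in
        "rmse": rmse,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical community populations: census values versus aggregated grid estimates.
metrics = evaluate([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```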
The beneficial effects of the invention are as follows: by combining multi-source data fusion, index screening and ensemble learning to construct the population spatialization model, high-precision population spatialization prediction is achieved accurately and efficiently. First, basic geographic data, remote sensing data, building data (such as building type and floor area ratio) and geographic big data (such as house price distribution) are fused to construct a population spatialization database with a 100 m × 100 m grid as the unit, which provides data support for fine-scale population spatialization. Second, to address the increased noise and unstable model fitting caused by additional auxiliary data, feature importance is calculated with several ensemble learning models and the indexes are screened to reduce data noise, which improves the accuracy of the fine-scale population spatialization result. Finally, a Pop-XGBoost population spatialization model is constructed, which further improves the accuracy of the fine-scale population spatialization result.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required by the embodiments are briefly described below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the administrative regions of Shenzhen in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;
FIG. 3 is a technical roadmap of a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;
FIG. 4 illustrates the feature importance of different indexes in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;
FIG. 5 is a schematic diagram of the Pop-XGBoost population spatialization prediction result in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;
FIG. 6 is a schematic diagram comparing the Pop-XGBoost and WorldPop population spatialization results (100 m) in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention;
FIG. 7 is a schematic diagram comparing the population distribution of the four areas a, b, c and d in FIG. 5 in a population spatialization method based on multi-source data and ensemble learning according to an embodiment of the invention.
Detailed Description
To further explain the embodiments, the invention provides drawings, which form part of the disclosure, illustrate the embodiments and, together with the description, explain their principles of operation. With reference to them, those skilled in the art will understand other possible embodiments and the advantages of the invention. Elements in the drawings are not drawn to scale, and like reference numerals generally denote like elements.
According to an embodiment of the invention, a population spatialization method based on multi-source data and ensemble learning is provided.
In this embodiment, the population of Shenzhen is selected as the object of analysis. Shenzhen is a coastal city in southern China, adjacent to Hong Kong and located south of the Tropic of Cancer (113°46′ to 114°37′ E, 22°27′ to 22°52′ N). It is China's first special economic zone and comprises 10 administrative districts (Futian, Luohu, Nanshan, Yantian, Bao'an, Longgang, Guangming, Longhua, Pingshan and Dapeng New District), 74 subdistricts and 734 communities (FIG. 2). Shenzhen had a population of only 310,000 in 1979; by 2019 its permanent resident population had grown to 13.44 million. In 2019 the total area of Shenzhen was 1997.47 square kilometers and the built-up area was 927.96 square kilometers.
Population is predicted and verified using data of the following five types.
(1) Demographic data
The study obtained 2019 population statistics for 270 communities from the Shenzhen land resources commission as real data for model training and validation: 202 communities (74.9%) form the training set, and 68 communities (25.1%) form an independent validation set that does not participate in model training.
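The text does not say how the 270 communities were assigned to the two sets. A minimal sketch, assuming a seeded random shuffle followed by a 75% cut, reproduces the 202/68 sizes; the function name, the seed and the use of integer ids for communities are hypothetical.

```python
import random


def split_communities(community_ids, train_fraction=0.75, seed=0):
    """Randomly split community ids into a training set and a held-out test set."""
    shuffled = list(community_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_fraction)  # 270 * 0.75 = 202.5, truncated to 202
    return shuffled[:n_train], shuffled[n_train:]


train_ids, test_ids = split_communities(range(270))
```

Keeping the test communities out of every fitting step, including hyperparameter search, is what makes them an independent validation set.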
(2) Geographic big data
The geographic big data obtained in the study include Baidu POIs, subway lines, house price, floor area ratio, greening rate, total number of households, parking spaces, OpenStreetMap road network data and the like.
The POI data used in the study come from Baidu Maps, the most widely used and largest Internet map service provider in China. The POI data were acquired by calling the Baidu Maps Place API. Each POI record includes longitude and latitude coordinates, type, name and other attribute information. After the acquired data were sorted and duplicate records and records with missing attributes were removed, 802,668 valid records remained. The POIs were processed by kernel density analysis to obtain grid data (100 m × 100 m) for 10 categories: government institutions, education, enterprises, shopping, finance, health care, entertainment, transportation, catering and living services.
The subway lines of the study area were acquired from the Baidu subway map JavaScript API, and the distance from each grid (100 m × 100 m) to the subway lines was calculated with the Euclidean Distance tool in ArcGIS 10.8.
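The distance index can be illustrated with a vector-based sketch: the Euclidean distance from a grid center to the nearest point on a polyline. This is not the raster-based ArcGIS tool named in the text, only a stand-in under the assumption of projected coordinates in meters; the function names and the sample line are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)  # segment degenerates to a point
    # Projection parameter of p on the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_line(p, polyline):
    """Distance from point p to the nearest segment of a polyline."""
    return min(
        point_segment_distance(p, a, b) for a, b in zip(polyline, polyline[1:])
    )


# Hypothetical subway line (meters) and the center of one 100 m x 100 m grid.
subway_line = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0)]
distance = distance_to_line((50.0, 350.0), subway_line)
```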
Data on 7714 residential communities in Shenzhen were acquired from the Anjuke website, covering 5 attributes: house price, floor area ratio, greening rate, total number of households and parking spaces. Anjuke is the most commonly used online platform of its kind in China and mainly publishes information on real estate for sale or rent. The 5 attributes were interpolated over the whole study area by kriging to obtain the house price, floor area ratio, greening rate, total number of households and parking spaces of each grid (100 m × 100 m).
Road network data for the study area were acquired from OpenStreetMap. The road network information in OpenStreetMap is rich and precise, and can reflect the spatial distribution of population within a city. The OpenStreetMap road network comprises 16 types, including bicycle lanes, sidewalks, main roads and expressways. The road network density is the ratio of the total road length within a study unit to the area of the study unit.
(3) Remote sensing data
The remote sensing data obtained in the study include a digital elevation model (DEM), Landsat 8 OLI data, an impervious surface product and night light data.
The Landsat 8 OLI data (2018/04) and the DEM were obtained from the Geospatial Data Cloud. The Landsat 8 OLI data were processed with ENVI 5.3 to obtain the normalized difference vegetation index (NDVI) and the land surface temperature. The DEM was processed with ArcGIS 10.8 to extract aspect and slope.
The global impervious surface product (MSMT_IS30) was obtained from the National Earth System Science Data Center. It has an overall accuracy of 95.1% and a resolution of 30 m, and was used to calculate the impervious surface proportion of each grid in the study area. The impervious surface proportion is the ratio of the impervious surface area within a study unit to the area of the study unit.
The night light data are Luojia-1 night light data with a spatial resolution of 130 m.
Finally, the 30 m NDVI, elevation, impervious surface proportion, aspect, slope and land surface temperature, together with the night light data, were resampled to 100 m.
(4) Basic geographic data
The study acquired basic geographic data such as land use and administrative boundaries from the Shenzhen Planning and Land Resources Commission. The distances from each evaluation unit grid (100 m × 100 m) to the water surface/coastline, to the central urban area and to green space/parks were calculated with the Euclidean distance method. The construction land proportion is the ratio of the construction land area within a study unit to the area of the study unit.
(5) Building data
The building data come from the Shenzhen Planning and Land Resources Commission and include 5 attributes: total building area, building base area, number of floors, building volume and building type. Based on these 5 attributes, 7 indexes, including building floor area ratio, building density, total building area, number of building floors, building volume and building type, were calculated for each 100 m × 100 m grid (Table 1). The national standard Uniform Standard for Design of Civil Buildings (GB 50352-2019) classifies buildings into four categories: residential, public, industrial and agricultural. Building types were quantified according to how population is distributed among them, using the formula D7 = Qi × D6, where Qi is the building type weight. The weights of residential, public, industrial and agricultural buildings are 10, 5, 3 and 2, respectively. Building density is the ratio of the total base area of all buildings in an area to the area of that area. The total building area ratio is the ratio of the total building area of all buildings in a study unit to the area of the study unit.
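The ratio indexes and the building type quantification can be sketched as below. The weights are the ones stated in the text; everything else is an assumption: the function names are hypothetical, the sample areas are invented, and `d6` simply stands for whichever index Table 1 labels D6, since the table is only available as an image.

```python
# Building type weights Q_i as stated in the text.
BUILDING_TYPE_WEIGHTS = {
    "residential": 10,
    "public": 5,
    "industrial": 3,
    "agricultural": 2,
}


def building_type_index(building_type, d6):
    """D7 = Q_i * D6, where d6 is the value of the index labeled D6 in Table 1."""
    return BUILDING_TYPE_WEIGHTS[building_type] * d6


def building_density(base_areas, unit_area):
    """Sum of building base areas in a unit divided by the unit's area."""
    return sum(base_areas) / unit_area


def total_building_area_ratio(building_areas, unit_area):
    """Sum of total building areas in a unit divided by the unit's area."""
    return sum(building_areas) / unit_area


cell_area = 100.0 * 100.0  # one 100 m x 100 m grid, in square meters
density = building_density([400.0, 800.0], cell_area)
area_ratio = total_building_area_ratio([4000.0, 1600.0], cell_area)
d7 = building_type_index("residential", 2.5)
```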
Referring now to the drawings and the detailed description, a population spatialization method based on multi-source data and ensemble learning is provided according to an embodiment of the invention. As shown in FIGS. 1-7, the method includes the following steps:
S1, acquiring and fusing multi-source data and constructing a population spatialization database, including the following steps:
S11, acquiring multi-source data, including basic geographic data, remote sensing data, building data and geographic big data;
S12, selecting a plurality of indexes from the multi-source data to construct an index system;
S13, resampling each index to a grid scale of 100 m × 100 m and aggregating it to the community scale;
S14, computing building statistics at the grid scale and splitting buildings by grid, using the following formulas:
$$k = \frac{S(\mathrm{Building}_{A_i})}{S(\mathrm{Building}_{A})}$$
$$\mathrm{GFA}(\mathrm{Building}_{A_i}) = k \times \mathrm{GFA}(\mathrm{Building}_{A})$$
$$\mathrm{GFA}(\mathrm{Grid}_i) = \sum_{A} \mathrm{GFA}(\mathrm{Building}_{A_i})$$
where $S(\mathrm{Building}_{A})$ represents the building area of building A;
$S(\mathrm{Building}_{A_i})$ represents the building area of building A that lies within $\mathrm{Grid}_i$;
$k$ represents the segmentation coefficient;
$\mathrm{Grid}_i$ represents the i-th grid;
$\mathrm{GFA}(\mathrm{Building}_{A})$ represents the total building area of building A;
$\mathrm{GFA}(\mathrm{Building}_{A_i})$ represents the total building area of building A within $\mathrm{Grid}_i$;
$\mathrm{GFA}(\mathrm{Grid}_i)$ represents the total building area of all buildings within $\mathrm{Grid}_i$.
S15, constructing a population spatialization database with a 100 m × 100 m grid as the unit.
The multi-source data comprises four types of data, namely basic geographic data, remote sensing data, building data and geographic big data, and 35 indexes are selected in total, wherein the indexes are shown in table 1;
the building data includes attributes such as total building area, building floor area, number of building floors, building volume, building type, and the like.
Table 1: population spatialization index system
Figure BDA0003730209640000081
Figure BDA0003730209640000091
S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes by the feature importance calculated with ensemble learning models, including the following steps:
S21, constructing ensemble learning models with the community-scale statistics of the indexes as input variables and the real community population as the output target;
Ensemble learning completes a learning task by constructing and combining multiple learners, and is also called a multiple classifier system. It aims to improve on the generalization ability and robustness of a single base learner by combining the predictions of several machine learning models. The ensemble learning models used here are random forest (RF), gradient boosting decision tree (GBDT) and extreme gradient boosting decision tree (XGBoost). The RF, GBDT and XGBoost models have advantages in evaluating feature importance and processing high-dimensional data.
Specifically, random forest (RF) is an ensemble algorithm based on decision trees. RF is an extension of bagging: on top of a bagging ensemble with decision trees as base learners, it further introduces random feature selection into the training of each tree. RF proceeds as follows: first, $k$ sample sets $\{D_1, D_2, \dots, D_k\}$ are drawn from the original training data set; second, each sample set is used to train a weak learner, giving $k$ weak learners $\{h_1, h_2, \dots, h_k\}$; finally, the final result is obtained by majority vote (for classification) or by averaging the predictions (for regression).
Gradient Boosting Decision Trees (GBDTs) are widely used in classification, regression, and other tasks, and are a combination of decision trees and gradient boosting. The main difference between GBDT and RF is that the fit of the decision tree in GBDT is based on the previous decision tree residuals. Thus, GBDT can reduce bias and variance. During each iteration, the goal of GBDT is to establish a weak classifier of the regression tree to reduce the iterations of the loss function, which can be expressed as
Figure BDA0003730209640000101
Figure BDA0003730209640000102
The predicted value of the m-th iteration,
Figure BDA0003730209640000103
is a loss function, f m (x i ) For minimizing the loss function.
The extreme gradient boosting decision tree (XGBoost) is a further optimization of GBDT. Its loss function can be expressed as
Figure BDA0003730209640000104
where y_i is the true value and
Figure BDA0003730209640000105
is the predicted value of the (m-1)-th iteration. GBDT uses only first-derivative information in optimization, whereas XGBoost performs a second-order Taylor expansion of the loss function and uses both the first- and second-order derivatives, which can be expressed as
Figure BDA0003730209640000106
Figure BDA0003730209640000107
where g_i and h_i are the first- and second-order gradient statistics of the loss function, which can be expressed as
Figure BDA0003730209640000108
Figure BDA0003730209640000109
In addition, XGBoost adds a regularization term Ω(f_m) to the loss function to control the complexity of the model; a smaller value indicates lower complexity and higher generalization ability. Therefore, XGBoost can not only reduce the variance and complexity of the model but also help avoid over-fitting.
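For reference, the quantities described above are commonly written as follows in the XGBoost literature. This is the standard textbook formulation, given under the assumption that it matches the patent's equation images; the notation in those images may differ. Here l is the per-sample loss, n the number of samples, and the remaining symbols are as defined in the text.

```latex
\begin{aligned}
\hat{y}_i^{(m)} &= \hat{y}_i^{(m-1)} + f_m(x_i) \\
\mathcal{L}^{(m)} &= \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(m-1)} + f_m(x_i)\right) + \Omega(f_m) \\
&\approx \sum_{i=1}^{n} \left[\, l\!\left(y_i, \hat{y}_i^{(m-1)}\right) + g_i\, f_m(x_i) + \tfrac{1}{2}\, h_i\, f_m^{2}(x_i) \right] + \Omega(f_m) \\
g_i &= \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(m-1)}\right)}{\partial\, \hat{y}_i^{(m-1)}}, \qquad
h_i = \frac{\partial^{2} l\!\left(y_i, \hat{y}_i^{(m-1)}\right)}{\partial \left(\hat{y}_i^{(m-1)}\right)^{2}}
\end{aligned}
```

For example, with the squared-error loss l = ½(y_i − ŷ_i)², the statistics reduce to g_i = ŷ_i^{(m-1)} − y_i and h_i = 1.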
S22, calculating the feature importance of each index with each of the constructed ensemble learning models;
and S23, selecting the indexes that have a large influence on population distribution as effective indexes according to the average value of the feature importance.
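Steps S22 and S23 can be sketched as averaging the per-model importances and keeping the indexes above a threshold. The sketch below assumes the 1% cut-off mentioned in the example results; the index names, the importance values, and the function name `screen_indexes` are hypothetical and are not the patent's data.

```python
from statistics import mean

# Hypothetical feature importances per model (fractions), for illustration only.
importances = {
    "RF":      {"index_1": 0.40, "index_2": 0.30, "index_3": 0.295, "index_4": 0.005},
    "GBDT":    {"index_1": 0.50, "index_2": 0.20, "index_3": 0.292, "index_4": 0.008},
    "XGBoost": {"index_1": 0.45, "index_2": 0.25, "index_3": 0.294, "index_4": 0.006},
}


def screen_indexes(importances, threshold=0.01):
    """Average each index's importance over all models and keep those above threshold."""
    names = list(next(iter(importances.values())))
    average = {
        name: mean(model[name] for model in importances.values()) for name in names
    }
    # Effective indexes, ordered from most to least important.
    selected = sorted(
        (name for name, value in average.items() if value > threshold),
        key=lambda name: -average[name],
    )
    return average, selected


average_importance, effective_indexes = screen_indexes(importances)
```

Averaging over several models makes the ranking less dependent on the quirks of any single importance measure.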
S3, constructing a Pop-XGBoost population spatialization model by combining the relationship between the effective indexes and the community population, comprising the following steps:
S31, based on the extreme gradient boosting decision tree, automatically determining the optimal parameters of the Pop-XGBoost model within a specified range by using the GridSearchCV module of the sklearn library;
Here, n_estimators is the maximum number of boosting iterations, i.e., the maximum number of weak learners; too few learners tend to under-fit, while too many increase the training cost and can over-fit. max_depth is the maximum depth of a tree; when there are many features, it is recommended to limit max_depth so that over-fitting can be avoided. n_estimators and max_depth have a great influence on the fitting accuracy of the Pop-XGBoost model. learning_rate improves the robustness of the model by shrinking the weight of each step. min_child_weight is the minimum sum of sample weights required in a leaf node; a larger value prevents the model from learning locally special samples, but too large a value leads to under-fitting. gamma specifies the minimum reduction of the loss function required to split a node; the larger this value, the more conservative the algorithm. subsample is the proportion of samples randomly drawn for each tree; reducing it can help avoid over-fitting, but too small a value may lead to under-fitting. colsample_bytree is the proportion of columns (features) sampled when each decision tree is constructed.
Therefore, finding the optimal values of parameters such as learning_rate, min_child_weight, gamma, subsample, and colsample_bytree is very important for the Pop-XGBoost model.
Table 2: Parameter ranges
Parameter           Value range of XGBoost parameter    Automatic search step size
n_estimators        10-201                              10
max_depth           3-10                                1
learning_rate       0-0.3                               /
min_child_weight    1-6                                 1
gamma               0-0.5                               /
subsample           0.5-1                               /
colsample_bytree    0.5-1                               /
S32, constructing the Pop-XGBoost model by taking the effective indexes as input indexes and 75% of the communities and their real population data as the training set;
and S33, verifying and analyzing the Pop-XGBoost model by taking the remaining 25% of the communities and their population data as the test set.
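Steps S31 to S33 can be outlined with the standard library alone: build the search space of Table 2, enumerate candidate parameter sets, and split the communities 75%/25%. This is a sketch under stated assumptions. Reading "10-201" with step 10 as `range(10, 201, 10)` is an interpretation. Table 2 gives no step for learning_rate, gamma, subsample, or colsample_bytree, so the candidate values for those are hypothetical. The helper names are also hypothetical.

```python
import itertools
import random

# Search space following Table 2 (see the assumptions stated above).
param_grid = {
    "n_estimators": list(range(10, 201, 10)),   # 10, 20, ..., 200
    "max_depth": list(range(3, 11)),            # 3..10, step 1
    "min_child_weight": list(range(1, 7)),      # 1..6, step 1
    "learning_rate": [0.05, 0.1, 0.2, 0.3],     # hypothetical candidates in (0, 0.3]
    "gamma": [0.0, 0.1, 0.3, 0.5],              # hypothetical candidates in [0, 0.5]
    "subsample": [0.5, 0.75, 1.0],              # hypothetical candidates in [0.5, 1]
    "colsample_bytree": [0.5, 0.75, 1.0],       # hypothetical candidates in [0.5, 1]
}


def iter_param_combinations(grid):
    """Yield every parameter combination in the grid as a dict."""
    keys = list(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def split_communities(community_ids, train_fraction=0.75, seed=0):
    """Shuffle the community IDs and split them into training and test sets."""
    ids = list(community_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Size of the full grid: the product of the candidate counts.
n_combinations = 1
for candidates in param_grid.values():
    n_combinations *= len(candidates)

first_combination = next(iter_param_combinations(param_grid))
train_ids, test_ids = split_communities(range(270))
```

With 270 communities the split yields 202 training and 68 test communities. In practice a dictionary of this shape would be passed as the `param_grid` argument of GridSearchCV, as the text names; because the full grid is large, tuning the parameters in smaller groups is a common way to keep the search tractable.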
S4, predicting the spatial distribution of the population, aggregating the simulated grid population data to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result, comprising the following steps:
S41, introducing the XGBRegressor module (sklearn interface) and estimating the population of each grid;
Because the population of each grid estimated by the optimal model is derived from the community population, the total of the simulated grid population needs to be controlled with the statistical population data at the county scale.
S42, redistributing the population of each 100 m × 100 m grid according to the proportion of that grid's population, as estimated by the Pop-XGBoost model, in the total estimated population of all grids in the county;
wherein the allocation formula is:
$$P_i = D_j \times \frac{M_i}{M_j}$$
where i represents each grid within the region;
j represents the administrative region in which the grid is located;
$P_i$ represents the corrected population within each grid;
$D_j$ represents the total population of the county in which the grid is located;
$M_i$ represents the population of the grid estimated by the Pop-XGBoost model;
$M_j$ represents the model-estimated population of all grids in the administrative region in which the grid is located.
And S43, selecting evaluation factors to evaluate the accuracy of the predicted population.
The evaluation factors include the relative Mean Absolute Error (MAE), the Root Mean Square Error (RMSE), and the coefficient of determination (R²). The smaller the MAE and RMSE values, the more accurate the estimate; R² takes values in the interval (0, 1), and the closer it is to 1, the closer the predicted population is to the real population.
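The county-level correction of step S42 amounts to rescaling the model estimates so that they sum to the county's statistical population. A minimal sketch follows, assuming M_j is the sum of the model estimates M_i over all grids in county j; the function name and the numbers are hypothetical.

```python
def redistribute_population(grid_estimates, county_total):
    """Apply P_i = D_j * M_i / M_j to every grid of one county.

    grid_estimates maps grid ID -> model estimate M_i;
    county_total is the county's statistical population D_j.
    """
    model_total = sum(grid_estimates.values())  # M_j
    if model_total <= 0:
        raise ValueError("model estimates must sum to a positive value")
    return {
        grid_id: county_total * estimate / model_total
        for grid_id, estimate in grid_estimates.items()
    }


# Hypothetical model estimates for three grids in one county.
estimates = {"g1": 120.0, "g2": 60.0, "g3": 20.0}
corrected = redistribute_population(estimates, county_total=1000.0)
```

The correction preserves the relative distribution predicted by the model while forcing the grid populations of each county to add up to the census figure.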
The following is a specific example developed using the present invention, together with an analysis of the results:
(I) Index screening results
From the mean feature importance (relative values, expressed as percentages) of the 35 indexes, it can be seen that the summed feature importance of the five classes of building data exceeds 69%, indicating that building data has a large impact on the spatial distribution of the population. The building area (D3) has the highest feature importance, 33.02%, followed by the number of building floors (D5) and the building type (D7), at 14.63% and 10.29% respectively. The summed feature importance of the geographic big data is about 21.73%. Within the geographic big data, the summed feature importance of POI density is about 16.57%; the kernel density of medical facilities (A6) contributes the most, 6.45%, and the density of restaurants (A9) contributes the least, 0.38%. In addition, the total number of houses (A14) and the subway distance (A16) have relatively high contribution rates, 1.34% and 1.03% respectively. The summed feature importance of the remote sensing data is about 6.17%; the DEM (B1), the impervious surface ratio (B3), and the nighttime light intensity (B6) have the higher contribution rates, 1.27%, 1.30%, and 1.31% respectively. The summed feature importance of the basic geographic data is about 3.05%; only water/shoreline (C1) has a relatively large contribution (1%). In summary, 21 low-redundancy indexes with importance scores greater than 1% are selected: 7 building data indexes, 10 geographic big data indexes, 3 remote sensing data indexes, and 1 basic geographic data index.
(II) Constructing the Pop-XGBoost population spatialization model based on the effective indexes
A Pop-XGBoost population spatialization model is constructed by taking the 21 effective indexes as input variables and the real population of 202 communities (75% of the 270 communities) as the output target. The optimal parameters are shown in Table 3. The accuracy of the constructed model is judged with the real population of 68 communities (25% of the 270 communities) as the validation set. The results show that the accuracy on the training set is as high as 99% and the accuracy on the test set is as high as 92%.
Table 3: Optimal parameters of different models
Figure BDA0003730209640000121
Figure BDA0003730209640000131
(III) Prediction results of the grid-scale population spatial distribution
The 21 selected indexes are aggregated to the 100 m × 100 m grid scale and used as input variables of the Pop-XGBoost model to predict the spatial distribution of the population of Shenzhen. The results are shown in FIG. 5, where the value of each grid represents the population within the grid area (1 hectare). As can be seen from FIG. 5, the regions with high population density are mainly concentrated in the southeast of Futian District (Nanyuan Subdistrict, Fubao Subdistrict, Futian Subdistrict, etc.), the southwest of Luohu District (Nanhu Subdistrict, Dongmen Subdistrict, etc.), the middle of Longhua District (Longhua Subdistrict and the Longgang area), and the south of Longgang District (Buji Subdistrict and Nanwan Subdistrict). In general, the population is concentrated in areas with favorable locational and economic conditions, which are convenient for production and daily life. Therefore, the population spatial distribution predicted by Pop-XGBoost accords with the actual distribution characteristics of the population.
(IV) Accuracy verification
The 2019 WorldPop dataset for Shenzhen was obtained. WorldPop is a widely used and well-recognized gridded population dataset with good accuracy, and its finest spatial resolution reaches 100 m × 100 m. Comparing it with the prediction result of the Pop-XGBoost model (FIG. 6) shows that the two have approximately the same overall trend in predicted population spatial distribution and approximately the same high-population clusters.
Sentinel-2 remote sensing imagery from 2019 with a spatial resolution of 20 m was selected to verify the population prediction result of the Pop-XGBoost model. The detailed comparison of the four regions (a, b, c, d) in FIG. 7 shows that the predictions of the Pop-XGBoost model and the WorldPop dataset are substantially consistent across regions. However, WorldPop does not simulate low-population-density areas (e.g., region b) well and is inconsistent with the remote sensing imagery there, mainly because the data it uses are not fine enough. The prediction result of the proposed Pop-XGBoost model is substantially consistent with the remote sensing imagery and better reflects the fine-scale spatial distribution of the population.
The grid population results obtained by the model are aggregated to the community scale and compared with the community-scale statistics to calculate indexes such as MAE, RMSE, and R². The results show that the Pop-XGBoost model has an RMSE of 12783.82, an MAE of 8006.07, and an R² of 80.54%, indicating good overall accuracy.
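The community-scale check described above can be sketched as two steps: sum the grid populations by community, then compute the error measures against the statistical values. The sketch uses the conventional definitions of MAE, RMSE, and R², which is an assumption about how the patent computes them; the function names and all numbers are hypothetical and unrelated to the reported results.

```python
import math
from collections import defaultdict


def aggregate_to_communities(grid_population, grid_to_community):
    """Sum grid-level populations into community-level totals."""
    totals = defaultdict(float)
    for grid_id, population in grid_population.items():
        totals[grid_to_community[grid_id]] += population
    return dict(totals)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical grid predictions and grid-to-community assignment.
grid_population = {"g1": 100.0, "g2": 200.0, "g3": 300.0, "g4": 400.0}
grid_to_community = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
statistics_by_community = {"A": 320.0, "B": 680.0}  # hypothetical census values

predicted = aggregate_to_communities(grid_population, grid_to_community)
communities = sorted(statistics_by_community)
y_true = [statistics_by_community[c] for c in communities]
y_pred = [predicted[c] for c in communities]

scores = {"MAE": mae(y_true, y_pred), "RMSE": rmse(y_true, y_pred), "R2": r2(y_true, y_pred)}
```

Because MAE and RMSE are expressed in persons, their magnitude depends on community size, whereas R² is scale-free.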
In summary, by means of the technical scheme of the invention, a population spatialization model is constructed by combining multi-source data fusion, index screening, and ensemble learning, so that high-precision population spatialization is achieved accurately and efficiently. First, basic geographic data, remote sensing data, building data (such as building types, volume rates and the like), and geographic big data (house price distribution and the like) are fused to construct a population spatialization database with the 100 m × 100 m grid as the unit, which provides data support for fine-scale population spatialization. Second, to address the increased noise and unstable model fitting caused by adding auxiliary data, the feature importance is calculated with several ensemble learning models and the indexes are screened to reduce data noise. Finally, the Pop-XGBoost population spatialization model is constructed on the screened indexes, which improves the accuracy of the fine-scale population spatialization result.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A population spatialization method based on multi-source data and ensemble learning is characterized by comprising the following steps:
S1, acquiring and fusing multi-source data, and constructing a population spatialization database;
S2, constructing an index system for model fitting from the population spatialization database, and screening effective indexes through the feature importance calculated by ensemble learning models;
S3, constructing a Pop-XGBoost population spatialization model by combining the relationship between the effective indexes and the community population;
and S4, predicting the spatial distribution of the population, aggregating the simulated grid population data to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result.
2. The population spatialization method based on multi-source data and ensemble learning according to claim 1, wherein the multi-source data is obtained and fused to construct a population spatialization database, and the method comprises the following steps:
S11, acquiring multi-source data;
S12, selecting a plurality of indexes from the multi-source data to construct an index system;
S13, resampling each index to the 100 m × 100 m grid scale, and aggregating to the community scale;
S14, counting the building data at the grid scale, and dividing the buildings according to the grid;
and S15, constructing a population spatialization database with the 100 m × 100 m grid as the unit.
3. The population spatialization method based on multi-source data and ensemble learning according to claim 2, wherein the multi-source data comprises basic geographic data, remote sensing data, building data and geographic big data;
the building data includes total building area, building floor area, number of building floors, building volume, and building type.
4. The population spatialization method based on multi-source data and ensemble learning according to claim 3, wherein the formula for segmenting the building according to the grid comprises:
$$k = S(\text{Building}_{Ai}) / S(\text{Building}_{A})$$
$$GFA(\text{Building}_{Ai}) = k \times GFA(\text{Building}_{A})$$
$$GFA(\text{Grid}_{i}) = \sum GFA(\text{Building})$$
in the formulas, $S(\text{Building}_{A})$ represents the building area of building A;
$S(\text{Building}_{Ai})$ represents the building area of building A within $\text{Grid}_{i}$;
k represents the segmentation coefficient;
$\text{Grid}_{i}$ represents the i-th grid;
$GFA(\text{Building}_{A})$ represents the total area of building A;
$GFA(\text{Building}_{Ai})$ represents the area of building A within $\text{Grid}_{i}$;
$GFA(\text{Grid}_{i})$ represents the total building area of all buildings within $\text{Grid}_{i}$.
5. The population spatialization method based on multi-source data and ensemble learning of claim 1, wherein an index system for model fitting is constructed from the population spatialization database, and effective indexes are screened out through feature importance calculated by an ensemble learning model, and the method comprises the following steps:
S21, constructing ensemble learning models by taking the statistical values of the community-scale indexes as input variables and the real community population as the output target;
S22, calculating the feature importance of each index with each of the constructed ensemble learning models;
and S23, selecting the indexes that have a large influence on population distribution as effective indexes according to the average value of the feature importance.
6. The population spatialization method based on multi-source data and ensemble learning according to claim 5, wherein the ensemble learning model comprises a random forest, a gradient boosting decision tree and an extreme gradient boosting decision tree.
7. The population spatialization method based on multi-source data and ensemble learning according to claim 6, wherein a Pop-XGBoost population spatialization model is constructed by combining the relationship between the effective indexes and the community population, and the method comprises the following steps:
S31, based on the extreme gradient boosting decision tree, automatically searching for the optimal parameters of the Pop-XGBoost model within a specified range by using the GridSearchCV module of the sklearn library;
S32, constructing the Pop-XGBoost model by taking the effective indexes as input indexes and 75% of the communities and their real population data as the training set;
and S33, verifying and analyzing the Pop-XGBoost model by taking the remaining 25% of the communities and their population data as the test set.
8. The population spatialization method based on multi-source data and ensemble learning according to claim 7, wherein predicting the spatial distribution of the population, aggregating the simulated grid population data to the community scale, comparing it with real community demographic data, and verifying the accuracy of the result comprises the following steps:
S41, introducing the XGBRegressor module in the sklearn library, and estimating the population of each grid;
S42, redistributing the population of each 100 m × 100 m grid according to the proportion of that grid's population, as estimated by the Pop-XGBoost model, in the total estimated population of all grids in the county in which the grid is located;
and S43, selecting evaluation factors to evaluate the accuracy of the predicted population.
9. The population spatialization method based on multi-source data and ensemble learning according to claim 8, wherein the distribution formula is as follows:
$$P_i = D_j \times \frac{M_i}{M_j}$$
where i represents each grid within the region;
j represents the administrative region in which the grid is located;
$P_i$ represents the corrected population within each grid;
$D_j$ represents the total population of the county in which the grid is located;
$M_i$ represents the population of the grid estimated by the Pop-XGBoost model;
$M_j$ represents the model-estimated population of all grids in the administrative region in which the grid is located.
10. The population spatialization method based on multi-source data and ensemble learning according to claim 9, wherein the evaluation factors include the relative mean absolute error, the root mean square error, and the coefficient of determination.
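As an illustration of the building segmentation formulas recited in claim 4, the sketch below splits each building's total area across grids in proportion to its per-grid building area and then sums per grid. It assumes that S(Building_A) equals the sum of the per-grid parts S(Building_Ai); the function names and all numbers are hypothetical.

```python
def split_building_gfa(total_gfa, footprint_by_grid):
    """Split one building's total area GFA(Building_A) across grids.

    footprint_by_grid maps grid ID -> S(Building_Ai), the building area in that grid.
    For each grid: k = S(Building_Ai) / S(Building_A), GFA(Building_Ai) = k * GFA(Building_A).
    """
    footprint_total = sum(footprint_by_grid.values())  # S(Building_A), by assumption
    if footprint_total <= 0:
        raise ValueError("building area must be positive")
    return {
        grid_id: total_gfa * area / footprint_total
        for grid_id, area in footprint_by_grid.items()
    }


def grid_gfa(buildings):
    """GFA(Grid_i): sum of the split areas of all buildings that fall in each grid."""
    totals = {}
    for total_gfa, footprint_by_grid in buildings:
        for grid_id, gfa in split_building_gfa(total_gfa, footprint_by_grid).items():
            totals[grid_id] = totals.get(grid_id, 0.0) + gfa
    return totals


# Hypothetical buildings: (total area, {grid ID: building area inside that grid}).
buildings = [
    (1000.0, {"g1": 60.0, "g2": 40.0}),  # building A straddles two grids
    (500.0, {"g2": 50.0}),               # building B lies entirely in g2
]
gfa_by_grid = grid_gfa(buildings)
```

Splitting by area share keeps the total area of each building unchanged while assigning it to the grids it actually covers.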
CN202210782643.0A 2022-07-05 2022-07-05 Population spatialization method based on multi-source data and ensemble learning Pending CN115129802A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210782643.0A CN115129802A (en) 2022-07-05 2022-07-05 Population spatialization method based on multi-source data and ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210782643.0A CN115129802A (en) 2022-07-05 2022-07-05 Population spatialization method based on multi-source data and ensemble learning

Publications (1)

Publication Number Publication Date
CN115129802A true CN115129802A (en) 2022-09-30

Family

ID=83382368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210782643.0A Pending CN115129802A (en) 2022-07-05 2022-07-05 Population spatialization method based on multi-source data and ensemble learning

Country Status (1)

Country Link
CN (1) CN115129802A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024100937A1 (en) * 2022-11-07 2024-05-16 株式会社Nttドコモ Population output device and estimation model
CN116956133A (en) * 2023-07-26 2023-10-27 中国地震局地质研究所 Building function identification method based on time sequence mobile phone signaling data and machine learning
CN116956133B (en) * 2023-07-26 2024-02-27 中国地震局地质研究所 Building function identification method based on time sequence mobile phone signaling data and machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination