CN112669976A - Crowd health assessment method and system based on ecological environment change

Info

Publication number: CN112669976A
Application number: CN202110288401.1A
Authority: CN
Inventors: 俞乐; 赵剑桥; 刘晓暄; 黄小猛; 周峥
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-03-18
Filing date: 2021-03-18
Publication date: 2021-04-16
Anticipated expiration: 2041-03-18
Also published as: CN112669976B

Abstract

The invention provides a crowd health assessment method and system based on ecological environment change. The method comprises the following steps: performing pixelization processing on acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data; determining model training set data and verification set data based on the vector boundary data, ecological environment data corresponding to the statistical yearbook data, and historical disease incidence data; constructing and training a crowd health prediction model based on the model training set data and verification set data; and performing health prediction for the corresponding diseases of the crowd based on the trained crowd health prediction model. With this method, the influence of yearbook statistics and the ecological environment on the health of the crowd can be analyzed quantitatively, and the accuracy of disease incidence prediction is improved.

Description

Crowd health assessment method and system based on ecological environment change

Technical Field

The invention relates to the field of crowd health assessment, in particular to a crowd health assessment method and system based on ecological environment change.

Background

Ecological environment change is one of the core issues of current global change research and is closely related to climate change, biodiversity, ecological environment evolution, human health and the like. Over the last hundred years, land use change has disrupted the balance of energy flow and material circulation in urban ecosystems, causing serious social, economic and ecological environmental problems. The threat that various diseases brought about by urbanization pose to human beings has come to the fore and become an important concern for countries around the world.

However, quantitative research on the relationship between ecological environment change and population health is currently lacking, and in particular socio-economic data and ecological environment data are rarely used together. Chinese statistical yearbook data is an important statistical data source in domestic research. With the development of science and technology and the growth of interdisciplinary applications, the spatialization of statistical data is receiving more and more attention and has become one of the hot topics of current geographic and social science research. How to combine statistical data with geographical distribution to explore spatially varying influences has become a problem to be solved.

However, there are many statistical yearbooks, such as the China Health Statistical Yearbook, the China Population Statistical Yearbook, the China Agriculture Yearbook, the China City Statistical Yearbook and the China Forestry Statistical Yearbook. These yearbooks cover a wide range of topics and contain numerous variables, so collecting and collating yearbook data is tedious work. Furthermore, over the historical evolution of the administrative units recorded in the statistical yearbooks, changes in administrative division boundaries and names mean that the units do not match existing geographic unit vector maps in space, so the needs of interdisciplinary research between the natural and social sciences are difficult to meet.

At present, research on the spatialization of statistical data focuses mainly on one or a few fields, such as population and gross domestic product indexes, and is not suitable for spatializing multi-field, multivariable data such as the Chinese statistical yearbooks. In addition, most existing research focuses on a single area (such as Tianjin or Chaoyang District); little attention is paid to spatialization over a large range (such as all prefecture-level cities and county-level units in China covered by the national statistical yearbooks), and pixel-level differences at specific geographic positions are rarely studied. Therefore, a modeling scheme is needed that, on this basis, evaluates health impact in combination with ecological environment data.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a method and a system for assessing the health of a population based on ecological environment changes, so as to solve problems in current health impact research such as coverage of only a single field, limited spatial extent, and poor impact assessment effectiveness.

The invention provides a crowd health assessment method based on ecological environment change, which comprises the following steps: performing pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data; determining model training set data and verification set data based on the vector boundary data, ecological environment data corresponding to the statistical yearbook data and historical disease incidence data; constructing and training a crowd health prediction model based on model training set data and verification set data; and performing health prediction and evaluation on the corresponding diseases of the crowd based on the trained crowd health prediction model.

In addition, a preferred technical solution is that the step of performing pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data includes: acquiring all statistical yearbook data, wherein the statistical yearbook data comprises yearbook data for prefecture-level cities and county-level units; arranging the yearbook data into Excel tables, and standardizing the yearbook data through the standard row and column numbers of the Excel tables; meanwhile, obtaining map atlases of the prefecture-level cities and county-level units, and vectorizing the map atlases to obtain the corresponding vector data; and pixelizing the standardized statistical yearbook data based on the vector data to acquire vector boundary data corresponding to the statistical yearbook data.

In addition, a preferred technical solution is that the process of acquiring all the statistical yearbook data includes: searching for statistical yearbooks in the CNKI database, and downloading the statistical yearbook data found; and supplementing the downloaded statistical yearbook data to form all the statistical yearbook data; wherein the data supplement includes statistical yearbook data obtained from data websites and copied from books in the national library.

In addition, a preferred technical solution is that the standardization process of the statistical yearbook data includes: digitizing the statistical yearbook data; and establishing spatio-temporal row and column labels in Excel tables for the digitized statistical yearbook data; wherein each Excel table represents one variable, each row in a table represents a spatially consistent prefecture-level city or county-level unit, and each column represents a different time.

In addition, a preferred technical solution is that the process of obtaining map atlases of the prefecture-level cities and county-level units and vectorizing them to obtain the corresponding vector data includes: scanning a paper map atlas and storing it in a graphic format; when the map atlas reflects an adjustment of administrative divisions, recording the change using a new edition of the administrative division map and a comparison table of historical place-name changes; creating a new layer based on the scanned map atlas, and setting the new layer to be visible and editable; calling ArcGIS tools on the new layer to draw paths, then merging all drawn layers and performing a topology check; and adding vector attributes to the layers after the topology check to obtain the vector data corresponding to the map atlas.

In addition, a preferred technical solution is that the process of pixelizing the standardized statistical yearbook data based on the vector data includes: classifying land conditions nationwide based on annual Chinese land use data, and determining the annual land distribution of each prefecture-level city and county-level unit; meanwhile, acquiring pixelized auxiliary data for a preset range of years, and determining the spatialization type of each variable in the statistical yearbooks; and pixelizing the statistical yearbook data based on the spatialization type of the variable and the standard row and column numbers, in the Excel tables, of each prefecture-level city and county-level unit for each year.

In addition, a preferred technical solution is that the auxiliary data comprises population data, age structure data and night light data; the spatialization types of the variables include: correlated with population density only; jointly distributed by land use and night light data; jointly distributed by population density and land use; jointly distributed by population density and age structure; and no geographic distribution feature.

In addition, a preferred technical solution is that the ecological environment data comprises administrative district longitude and latitude data, atmospheric pollution data, bioclimatic data and biodiversity data.

In addition, a preferred technical solution is that the process of determining the model training set data and the verification set data includes: acquiring a spatial weight matrix of the prefecture-level cities and county-level units based on the vector boundary data; based on the spatial weight matrix, taking historical disease incidence as the variable and obtaining a univariate local spatial autocorrelation index; according to the spatial autocorrelation index, acquiring the spatial relationships of historical disease incidence among the prefecture-level cities and county-level units, wherein the spatial relationships include high-high spatial clustering, low-low spatial clustering, high-low spatial clustering, and low-high spatial clustering; and constructing the model training set data and verification set data based on the clustering of the spatial relationships, the vector boundary data and the ecological environment data.

According to another aspect of the present invention, there is also provided a system for assessing the health of a population based on changes in the ecological environment, the system comprising: a vector boundary data acquisition unit configured to perform pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data; the training and verification data acquisition unit is used for determining model training set data and verification set data based on vector boundary data, ecological environment data corresponding to the statistical yearbook data and historical disease incidence data; the model building and training unit is used for building and training a crowd health prediction model based on model training set data and verification set data; and the health prediction and evaluation unit is used for performing health prediction and evaluation on corresponding diseases of the crowd based on the trained crowd health prediction model.

With the crowd health assessment method and system based on ecological environment change, the acquired statistical yearbook data is pixelized to obtain the corresponding vector boundary data; model training set data and verification set data are then obtained based on the vector boundary data, the ecological environment data and the historical disease incidence data; and a crowd health prediction model is constructed and trained on the training set data and verification set data. In this way, the distribution characteristics and patterns of socio-economic data within prefecture-level and county-level administrative units can be revealed, the values of socio-economic data at specific geographic positions can be analyzed more deeply, finely and comprehensively, accurate pixel-level data support is provided for spatial analysis in each field, and data and technical guidance are provided for long-term, multi-region, multi-factor analysis.

To the accomplishment of the foregoing and related ends, one or more aspects of the invention comprise the features hereinafter fully described. The following description and the annexed drawings set forth in detail certain illustrative aspects of the invention. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed. Further, the present invention is intended to include all such aspects and their equivalents.

Drawings

Other objects and results of the present invention will become more apparent and more readily appreciated as the same becomes better understood by reference to the following description taken in conjunction with the accompanying drawings. In the drawings:

FIG. 1 is a flow chart of a method for assessing the health of a population based on changes in the ecological environment according to an embodiment of the present invention;

FIG. 2 is a schematic block diagram illustrating a method for assessing the health of a population based on changes in the ecological environment according to an embodiment of the present invention;

FIG. 3 is a logic block diagram of a system for assessing the health of a population based on changes in the ecological environment according to an embodiment of the present invention.

The same reference numbers in all figures indicate similar or corresponding features or functions.

Detailed Description

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be evident, however, that such embodiment(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more embodiments.

In order to describe the method and system for assessing the health of the population based on the change of the ecological environment in detail, the following describes an embodiment of the invention in detail with reference to the accompanying drawings.

Fig. 1 and 2 respectively show a flow and a principle of a method for assessing the health of a population based on ecological environment changes according to an embodiment of the present invention.

As shown in fig. 1 and fig. 2, the method for assessing the health of a population based on ecological environment changes according to the embodiment of the present invention includes the following steps:

S110: Perform pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data.

Wherein the step of performing pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data includes:

S111: acquiring all statistical yearbook data, wherein the statistical yearbook data comprises yearbook data for prefecture-level cities and county-level units;

S112: arranging the yearbook data into Excel tables, and standardizing the yearbook data through the standard row and column numbers of the Excel tables; meanwhile,

S113: acquiring map atlases of the prefecture-level cities and county-level units, and vectorizing the map atlases to acquire the corresponding vector data;

S114: pixelizing the standardized statistical yearbook data based on the vector data to acquire vector boundary data corresponding to the statistical yearbook data.

In the above steps, step S111 further includes: searching for statistical yearbooks in the CNKI database, and downloading the statistical yearbook data found; and supplementing the downloaded statistical yearbook data to form all the statistical yearbook data. The data supplement includes, for example, statistical yearbook data obtained from data websites and copied from books in the national library.

Step S112 further includes: digitizing the statistical yearbook data; and establishing spatio-temporal row and column labels in Excel tables for the digitized statistical yearbook data. Each Excel table represents one variable, each row in a table represents a spatially consistent prefecture-level city or county-level unit, and each column represents a different time (year).
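The table layout described in step S112 can be illustrated with a minimal Python sketch. It assumes the digitized yearbook records for one variable are available as (unit, year, value) tuples and uses CSV text as a plain-text stand-in for one Excel sheet; the unit names, values and the `to_wide_table` helper are hypothetical and are not part of the described method.

```python
import csv
import io

# Hypothetical yearbook records for one variable: (unit name, year, value).
records = [
    ("County A", 2004, 120.5),
    ("County A", 2005, 131.0),
    ("County B", 2004, 88.2),
    ("County B", 2005, 90.1),
]

def to_wide_table(records):
    """Pivot long records into rows = administrative units, columns = years."""
    years = sorted({year for _, year, _ in records})
    units = sorted({unit for unit, _, _ in records})
    lookup = {(unit, year): value for unit, year, value in records}
    header = ["unit"] + [str(y) for y in years]
    # Missing unit/year combinations are left as empty cells.
    rows = [[unit] + [lookup.get((unit, y), "") for y in years] for unit in units]
    return header, rows

header, rows = to_wide_table(records)

# Write the table as CSV text, standing in for one Excel sheet per variable.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
table_csv = buffer.getvalue()
```

Keeping one table per variable with a fixed row order means the row number of a unit can serve as its join key against the vector map in later steps.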

Step S113 further includes: scanning a paper map atlas and storing it in a graphic format; when the map atlas reflects an adjustment of administrative divisions, recording the change using a new edition of the administrative division map and a comparison table of historical place-name changes; creating a new layer based on the scanned map atlas, and setting the new layer to be visible and editable; calling ArcGIS tools on the new layer to draw paths, then merging all drawn layers and performing a topology check; and adding vector attributes to the layers after the topology check to obtain the vector data corresponding to the map atlas.

Specifically, a paper map atlas is scanned with a scanner and stored in a graphic format. If administrative divisions have been adjusted, the change can be recorded using a new edition of the administrative division map and a comparison table of historical place-name changes. The scanned map is opened in specialized image processing software, a new layer is then created for the scanned map atlas using ArcGIS, and the layer is set to be visible and editable. Next, with the scanned map image as reference, ArcGIS tools are called on the new layer to draw paths, including points, lines, polylines, arcs, polygons, rectangles and the like. Finally, all drawn layers are merged, vector attributes are added once the topology check is complete, the names and regions of the prefecture-level cities and county-level units are compared against each year's statistical yearbook data, and the yearly vector maps are unified. If administrative divisions have been adjusted, the changed polygons can be edited individually while the topological relationship with the surrounding polygons is maintained.

The map atlases should cover Chinese prefecture-level cities and county-level units since 1949 as comprehensively as possible, and map vectorization is carried out after scanning. Map vectorization is the process of converting image data into vector data. The processed vector data is then used, with reference to remote sensing raster data such as pixelized land use data, to pixelize the statistical yearbook data. Pixelization is the process of placing the statistical data arranged in the Excel tables onto map space according to grid positions, that is, spatializing the statistical data at the raster pixel level.

Step S114 further includes: classifying land conditions nationwide based on annual Chinese land use data, and determining the annual land distribution of each prefecture-level city and county-level unit; meanwhile, acquiring pixelized auxiliary data for a preset range of years, and determining the spatialization type of each variable in the statistical yearbooks; and pixelizing the statistical yearbook data based on the spatialization type of the variable and the standard row and column numbers, in the Excel tables, of each prefecture-level city and county-level unit for each year, to obtain vector boundary data corresponding to the statistical yearbook data.

The auxiliary data comprises population data, age structure data and night light data; the spatialization types of the variables include: correlated with population density only; jointly distributed by land use and night light data; jointly distributed by population density and land use; jointly distributed by population density and age structure; and no geographic distribution feature.

Specifically, annual Chinese land use data is used to classify land conditions nationwide and to determine the annual land distribution of each prefecture-level city and county-level unit. Auxiliary data covering multiple years, such as pixelized population data, age structure data and night light data, is then downloaded. The variables in the statistical yearbook data include gross regional product, growth rate of gross regional product, per capita gross regional product, local public finance revenue, local public finance expenditure, year-end registered population, natural growth rate, number of beds in hospitals and health centers, number of hospitals and health centers, year-end urban unit employees in the tertiary industry (health, social security and social welfare), industrial sulfur dioxide production (tons), industrial sulfur dioxide emissions (tons), industrial wastewater discharge (ten thousand tons), industrial smoke (dust) production, industrial smoke (dust) emissions and the like. According to the prior knowledge of experts from multiple fields and the meaning that each statistic represents, these variables are divided into spatialization types: a, correlated with population density only; b, jointly distributed by land use and night light data; c, jointly distributed by population density and land use; d, jointly distributed by population density and age structure; e, no geographic distribution feature; and so on. Then, using the classification of each variable and the number in Excel of each prefecture-level city and county-level unit (administrative unit) for each year, spatialization is carried out according to the category of the variable. Specifically, the value of each pixel of the remote sensing data for the spatialization type is used as a weight, and the total value of the variable for the administrative unit is distributed to each pixel according to its weight. This completes the pixelization processing of the statistical data and yields the vector boundary data.
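The weight-based allocation described above can be sketched in Python as follows. This is only a minimal illustration: the function name `allocate_to_pixels`, the sample values, and the even-split fallback for a unit whose weights are all zero are assumptions, not details given in the description.

```python
def allocate_to_pixels(unit_total, pixel_weights):
    """Distribute an administrative unit's total over its pixels by weight."""
    weight_sum = sum(pixel_weights)
    if weight_sum == 0:
        # Assumption: fall back to an even split when no weight is available.
        share = unit_total / len(pixel_weights)
        return [share] * len(pixel_weights)
    return [unit_total * w / weight_sum for w in pixel_weights]

# Hypothetical county covering four pixels; the weights are population
# densities, as for a variable of the "population density only" type.
population_density = [10.0, 30.0, 0.0, 60.0]
hospital_beds_total = 500.0
beds_per_pixel = allocate_to_pixels(hospital_beds_total, population_density)
```

Because each pixel receives its proportional share, the pixel values of a unit sum back to the yearbook total, so the spatialized layer stays consistent with the published statistics.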

S120: Determine model training set data and verification set data based on the vector boundary data, the ecological environment data corresponding to the statistical yearbook data, and the historical disease incidence data.

Specifically, the ecological environment data includes administrative district longitude and latitude data, atmospheric pollution data, bioclimatic data, biodiversity data and the like, and the historical disease incidence data includes county-level incidence data for the corresponding diseases collected from published literature and statistical yearbooks. The historical diseases can be of various types, such as infectious diseases and chronic diseases such as cardiovascular disease.

The longitude and latitude data of the administrative districts can be obtained by extracting, in decimal degrees, the longitude and latitude coordinates of the centroid of each district and county based on the 2015 district and county boundaries; the specific year can be set according to the disease type or the evaluation requirement.
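As an illustration of centroid extraction, the following Python sketch computes the planar centroid of a simple polygon with the standard shoelace formula. It assumes the boundary is given as a list of (longitude, latitude) vertices in decimal degrees and treats the coordinates as planar, which is an approximation; the function name and the sample rectangle are hypothetical.

```python
def polygon_centroid(vertices):
    """Return the planar centroid (lon, lat) of a simple polygon.

    vertices: list of (lon, lat) pairs in decimal degrees; the first
    vertex does not need to be repeated at the end.
    """
    area2 = 0.0  # twice the signed area of the polygon
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon")
    # Centroid = sum / (6 * area) = sum / (3 * area2).
    return cx / (3.0 * area2), cy / (3.0 * area2)

# Hypothetical county boundary: a 1 degree by 1 degree rectangle.
county = [(116.0, 39.0), (117.0, 39.0), (117.0, 40.0), (116.0, 40.0)]
lon, lat = polygon_centroid(county)
```

In practice a GIS package would normally perform this step on the vector boundary layer; the sketch only shows what the resulting lon and lat variables represent.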

The bioclimatic data can be generated from monthly temperature and monthly precipitation; it is biologically meaningful and can reflect annual trends, seasonality, and extreme or limiting environmental factors. For example, 19 bioclimatic variables, BIO1 to BIO19, can be included, and the mean value of the bioclimatic data in 1970-. The variable abbreviations and their meanings are as follows:
ampibians - amphibian species abundance
bareland - bare land area percentage
BC - BC emissions
BIO1 - annual mean temperature
BIO2 - mean diurnal range (monthly mean of (maximum temperature - minimum temperature))
BIO3 - isothermality (BIO2/BIO7) (×100)
BIO4 - temperature seasonality (standard deviation × 100)
BIO5 - maximum temperature of the warmest month
BIO6 - minimum temperature of the coldest month
BIO7 - annual temperature range (BIO5 - BIO6)
BIO8 - mean temperature of the wettest quarter
BIO9 - mean temperature of the driest quarter
BIO10 - mean temperature of the warmest quarter
BIO11 - mean temperature of the coldest quarter
BIO12 - annual precipitation
BIO13 - precipitation of the wettest month
BIO14 - precipitation of the driest month
BIO15 - precipitation seasonality (coefficient of variation)
BIO16 - precipitation of the wettest quarter
BIO17 - precipitation of the driest quarter
BIO18 - precipitation of the warmest quarter
BIO19 - precipitation of the coldest quarter
bird - bird species abundance
CO - CO emissions
CO2 - CO2 emissions
cropland - cropland area percentage
dmsp - night light
forest - forest area percentage
GDP - gross regional product
GDP_rate - growth rate of gross regional product
grassland - grassland area percentage
hospital_bednum - number of beds in hospitals and health centers
hospital_num - number of hospitals and health centers
household_pop - registered household population
impervious - impervious surface area percentage
lat - centroid latitude
lon - centroid longitude
mammals - mammal species abundance
NH3 - NH3 emissions
NOx - NOx emissions
OC - OC emissions
PAHs - PAHs emissions
per_GDP - per capita gross regional product
PM10 - PM10 emissions
PM2.5 - PM2.5 emissions
pop_growth - natural population growth rate
revenue_in - public finance revenue
revenue_out - public finance expenditure
s02_production - industrial SO2 production
shrubland - shrubland area percentage
smoke_emission - industrial smoke (dust) emissions
smoke_production - industrial smoke (dust) production
SO2 - SO2 emissions
SO2_emission - industrial SO2 emissions
srtm90 - elevation
srtmtpi - multi-scale topographic position index
srtmtopographicdersity - topographic diversity
tertiary_industry - year-end urban unit employees in the tertiary industry (health, social security and social welfare)
TSP - TSP emissions
tundra - tundra area percentage
water_emission - industrial wastewater discharge
water - water area percentage
wetland - wetland area percentage

The atmospheric pollution data, which may also be referred to as atmospheric pollutant emission data, covers black carbon (BC), carbon monoxide (CO), carbon dioxide (CO2), ammonia (NH3), nitrogen oxides (NOx), organic carbon (OC), polycyclic aromatic hydrocarbons (PAHs), inhalable particulate matter (PM10), fine particulate matter (PM2.5), sulfur dioxide (SO2) and total suspended particulate matter (TSP). The atmospheric pollution data can be derived from month-by-month emission data (unit: g/km²/month) for these pollutants in 2004-2014 by calculating the annual total emissions of each pollutant for each year and then taking the mean annual total over the time series.
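The aggregation of monthly emissions into a mean annual total can be sketched as below. The data layout (a dictionary mapping each year to twelve monthly values for one pollutant in one county) and all numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def mean_annual_total(monthly_by_year):
    """Sum monthly emissions per year, then average the yearly totals.

    monthly_by_year: {year: [12 monthly values in g/km^2/month]}.
    """
    annual_totals = {year: sum(months) for year, months in monthly_by_year.items()}
    mean_total = sum(annual_totals.values()) / len(annual_totals)
    return annual_totals, mean_total

# Hypothetical PM2.5 emissions for one county over three years.
monthly_pm25 = {
    2004: [10.0] * 12,
    2005: [12.0] * 12,
    2006: [14.0] * 12,
}
annual_totals, mean_total = mean_annual_total(monthly_pm25)
```

The same two-step aggregation would be repeated for each of the pollutants and each county to produce one input value per pollutant.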

The biodiversity data includes, but is not limited to, total species abundance data for three groups: birds, mammals and amphibians. The bird and mammal data can be the version updated in 2018, and the amphibian data the version updated in 2017. The data records the global distribution of 10035 bird species, 5270 mammal species and 6188 amphibian species, and is highly accurate and widely used. From it, the average abundance of birds, mammals and amphibians in every county in China can be obtained and used as input variables of the crowd health prediction model in the crowd health assessment based on ecological environment change.

Further, the process of obtaining the model training set data and the verification set data includes: acquiring a spatial weight matrix of the prefecture-level cities and county-level units based on the vector boundary data; based on the spatial weight matrix, taking historical disease incidence as the variable and obtaining a univariate local spatial autocorrelation index; according to the spatial autocorrelation index, acquiring the spatial relationships of historical disease incidence among the prefecture-level cities and county-level units, wherein the spatial relationships include high-high spatial clustering, low-low spatial clustering, high-low spatial clustering, and low-high spatial clustering; and constructing the model training set data and verification set data based on the clustering of the spatial relationships, the vector boundary data and the ecological environment data.

Specifically, for the crowd health prediction model, the spatial relationship of the incidence rate per ten thousand people can be tested for significance and summarized using Local Indicators of Spatial Association (LISA). Spatial autocorrelation analysis of the districts and counties of China is carried out with the spatial analysis software GeoDa. Because a global analysis would ignore pronounced spatial heterogeneity, the method uses local spatial autocorrelation to analyze the difference in spatial characteristics between each district or county and its neighboring districts and counties, which fully reflects the spatial heterogeneity and instability of local areas.

First, an adaptive kernel method based on spatial distance weights is adopted, and a spatial weight matrix is obtained from the vector boundary data of all districts and counties in China; the matrix reflects the spatial dependency among districts and counties. Then, based on this spatial weight matrix and with the average incidence rate per ten thousand people as the variable, the univariate local Moran's I is calculated to reflect the spatial difference in incidence between each district or county and its neighbors. The Moran's I index has a value range of [-1, 1]. When the index is greater than 0, the variable under study shows positive spatial correlation, that is, spatial objects with similar characteristics cluster together, and the larger the index, the more pronounced the spatial correlation. The spatial autocorrelation index Moran's I calculated for the crowd health prediction model is 0.539, which is greater than 0 and indicates that the incidence rate per ten thousand people is positively spatially correlated, with high spatial autocorrelation.
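The local Moran's I calculation can be illustrated with the following Python sketch. For simplicity it assumes a binary adjacency matrix that is row-standardized, rather than the adaptive kernel distance weights described above, and it uses four hypothetical counties with made-up incidence values; it is not a reproduction of the GeoDa workflow.

```python
def row_standardize(weights):
    """Scale each row of a square weight matrix so that it sums to 1."""
    result = []
    for row in weights:
        total = sum(row)
        result.append([w / total if total else 0.0 for w in row])
    return result

def local_morans_i(values, weights):
    """Return the local Moran's I of each unit.

    I_i = (z_i / m2) * sum_j(w_ij * z_j), where z holds the deviations
    from the mean and m2 is their mean square.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n
    lags = [sum(w * d for w, d in zip(row, z)) for row in weights]
    return [z[i] * lags[i] / m2 for i in range(n)]

# Hypothetical chain of four counties: each one borders the next.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
incidence = [8.0, 6.0, 2.0, 0.0]  # cases per ten thousand people

w = row_standardize(adjacency)
local_i = local_morans_i(incidence, w)
# With row-standardized weights, the global Moran's I is the mean local value.
global_i = sum(local_i) / len(local_i)
```

In this toy example high-incidence counties sit next to each other and so do low-incidence ones, so every local value and the global index are positive, matching the interpretation of a positive index given above.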

Finally, the spatial clustering of the incidence rate per ten thousand people in each district and county can also be identified. Significance is tested at the 0.05 level based on 999 Monte Carlo random permutations. If the p value of a county is greater than 0.05, there is no significant spatial relationship between the incidence rate of that county and that of its neighboring counties. If the p value is less than or equal to 0.05, the spatial relationship is significant and can be divided into four types: high-high spatial clustering, low-low spatial clustering, high-low spatial clustering, and low-high spatial clustering. High-high spatial clustering means that both the county and its neighboring counties have a high incidence rate per ten thousand people; low-low spatial clustering means that both the county and its neighboring counties have a low incidence rate; high-low spatial clustering means that a county with a high incidence rate has neighboring counties with a low incidence rate; and low-high spatial clustering means that a county with a low incidence rate has neighboring counties with a high incidence rate.
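The significance test and the four cluster types can be sketched in Python as follows. The sketch assumes a conditional permutation scheme (the value of the county under test is held fixed while the other values are shuffled) and a two-sided comparison of the spatial lag; these choices, the function names and the sample data are assumptions made for illustration and may differ from the exact procedure used in GeoDa.

```python
import random

def pseudo_p_value(i, z, weights_row, permutations=999, seed=0):
    """Pseudo p-value for unit i from random permutations of the other units."""
    rng = random.Random(seed)
    others = [z[j] for j in range(len(z)) if j != i]
    other_weights = [weights_row[j] for j in range(len(z)) if j != i]
    observed = sum(w * v for w, v in zip(other_weights, others))
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(others)
        simulated = sum(w * v for w, v in zip(other_weights, others))
        if abs(simulated) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (permutations + 1)

def classify_cluster(z_i, lag_i, p_value, alpha=0.05):
    """Map a unit to one of the four cluster types, or 'not significant'."""
    if p_value > alpha:
        return "not significant"
    if z_i >= 0:
        return "high-high" if lag_i >= 0 else "high-low"
    return "low-low" if lag_i < 0 else "low-high"

# Hypothetical deviations from the mean incidence for four counties, and
# the row-standardized weights of county 0 (its only neighbor is county 1).
z = [4.0, 2.0, -2.0, -4.0]
row0 = [0.0, 1.0, 0.0, 0.0]
lag0 = sum(w * v for w, v in zip(row0, z))
p0 = pseudo_p_value(0, z, row0)
label0 = classify_cluster(z[0], lag0, p0)
```

With only four units no county can reach significance, so the example county is labeled "not significant"; a real analysis over all districts and counties would yield a mix of the four cluster types.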

In addition, in order to improve the evaluation accuracy of the crowd health prediction model, socio-economic data can be used as input variables of the crowd health prediction model. The socio-economic data comprise at least 15 socio-economic indicators, including: gross regional product, growth rate of gross regional product, per capita gross regional product, local public fiscal revenue, local public fiscal expenditure, year-end registered households and population, natural growth rate, number of beds in hospitals and health centers, year-end number of urban unit employees in the tertiary industry sector of health, social security and social welfare, industrial sulfur dioxide generated (tons), industrial sulfur dioxide emitted, industrial wastewater discharged (10,000 tons), industrial smoke (powder) dust generated, and industrial smoke (powder) dust emitted. For each county, the socio-economic data are obtained by calculating the average value of each of the 15 indicators over the years 2004-2015.

In other words, the input variables of the crowd health prediction model may include the vector boundary data, the ecological environment data, the training set data and verification set data obtained according to the historical disease incidence data, and the like. In the process of pixelizing the statistical yearbook data, a land cover map and a population density map are also needed, and these two maps can be understood as the map atlas. Based on China's year-by-year land cover data, the Google Earth Engine platform is used to extract, for each district and county and for each year in 2004-2015, the area proportions of 9 land cover types: cropland, forest, grassland, shrubland, wetland, water body, tundra, impervious surface and bare land. Night light information for 2004-. Based on corrected time series data of the Shuttle Radar Topography Mission (SRTM), the multi-scale terrain position index (TRI), the topographic diversity and the elevation of each district and county are extracted year by year for 2004-2015. For each county, the extracted data comprise the 9 land cover types, the night light data, the terrain position index, the topographic diversity and the elevation, 13 indicators in total, and the average value of each indicator over 2004-2015 is calculated.
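As a non-limiting illustration of the per-county indicators described above, the following minimal sketch derives area proportions of land cover types from classified pixels and then averages them over several years. The function names, the class labels and the four-pixel toy county are assumptions made for illustration; the actual extraction is described above as being performed on the Google Earth Engine platform.

```python
from collections import Counter
from statistics import mean


def cover_proportions(pixel_classes):
    """Share of a county's pixels that falls into each land cover class."""
    counts = Counter(pixel_classes)
    total = len(pixel_classes)
    return {cls: counts[cls] / total for cls in counts}


def multi_year_mean(yearly_proportions, classes):
    """Average each class proportion over all years; absent classes count as 0."""
    return {
        cls: mean(p.get(cls, 0.0) for p in yearly_proportions.values())
        for cls in classes
    }


# Hypothetical toy county with four pixels and two years of land cover labels.
pixels_by_year = {
    2004: ["cropland", "cropland", "forest", "water"],
    2005: ["cropland", "forest", "forest", "water"],
}
yearly = {year: cover_proportions(px) for year, px in pixels_by_year.items()}
averaged = multi_year_mean(yearly, ["cropland", "forest", "water"])
```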

S130: and constructing and training a crowd health prediction model based on the model training set data and the verification set data.

S140: and performing health prediction and evaluation on the corresponding diseases of the crowd based on the trained crowd health prediction model.

As a specific example, the process of constructing and training the crowd health prediction model will be described below by taking an infectious disease as an example.

First, based on the spatial autocorrelation analysis, 497 counties whose total incidence rate per 10,000 people belongs to high-high clusters and 807 counties belonging to low-low clusters are selected as samples, 1304 counties in total.

From the 1304 samples, 70% are randomly selected as training samples (training set data) and the remaining 30% as test samples (verification set data). Random forests are then constructed, and the following 8 combinations are tested:

a1: longitude and latitude 2 + land 13 -> total incidence rate per 10,000 people

a2: longitude and latitude 2 + land 13 + biological diversity 3 -> total incidence rate per 10,000 people

b1: longitude and latitude 2 + land 13 + climate 19 + biological diversity 3 -> total incidence rate per 10,000 people

b2: longitude and latitude 2 + land 13 + climate 19 -> total incidence rate per 10,000 people

c1: longitude and latitude 2 + land 13 + climate 19 + atmospheric pollution 11 -> total incidence rate per 10,000 people

c2: longitude and latitude 2 + land 13 + climate 19 + atmospheric pollution 11 + biological diversity 3 -> total incidence rate per 10,000 people

d1: longitude and latitude 2 + land 13 + climate 19 + atmospheric pollution 11 + socio-economic 15 -> total incidence rate per 10,000 people

d2: longitude and latitude 2 + land 13 + climate 19 + atmospheric pollution 11 + socio-economic 15 + biological diversity 3 -> total incidence rate per 10,000 people

Note: the number following each type of data represents the number of variables that the type of data contains.
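As a non-limiting illustration, the 70%/30% random split and the variable counts of the 8 combinations listed above can be sketched as follows. The group keys, the function names, the fixed random seed and the rounding of the 70% cut point are assumptions of this sketch; only the group sizes and the composition of each combination are taken from the list above.

```python
import random

# Number of variables in each data type, as given in the combinations above.
GROUP_SIZES = {
    "lonlat": 2, "land": 13, "climate": 19,
    "air_pollution": 11, "socioeconomic": 15, "biodiversity": 3,
}

COMBINATIONS = {
    "a1": ["lonlat", "land"],
    "a2": ["lonlat", "land", "biodiversity"],
    "b1": ["lonlat", "land", "climate", "biodiversity"],
    "b2": ["lonlat", "land", "climate"],
    "c1": ["lonlat", "land", "climate", "air_pollution"],
    "c2": ["lonlat", "land", "climate", "air_pollution", "biodiversity"],
    "d1": ["lonlat", "land", "climate", "air_pollution", "socioeconomic"],
    "d2": ["lonlat", "land", "climate", "air_pollution", "socioeconomic",
           "biodiversity"],
}


def variable_count(groups):
    """Total number of input variables contributed by the listed data types."""
    return sum(GROUP_SIZES[g] for g in groups)


def split_samples(sample_ids, train_fraction=0.7, seed=0):
    """Randomly split sample identifiers into training and verification parts."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = round(train_fraction * len(ids))  # rounding rule is an assumption
    return ids[:cut], ids[cut:]


counts = {name: variable_count(groups) for name, groups in COMBINATIONS.items()}
train_ids, test_ids = split_samples(range(1304))
```

Under these assumptions, combination d2 contains all 63 variables, and the 1304 counties are divided into 913 training samples and 391 verification samples.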

Then, in order to reduce the uncertainty of the random forests as much as possible, the crowd health prediction model (hereinafter referred to as the model) executes the random forest algorithm 10 times and averages the results, so as to check the reliability of the prediction result of each combination and to analyze the importance of each input variable.

Each time the model is run, it constructs 8 random forests, one for each of the 8 combinations. The number of decision trees in each random forest is set to 1000. A random forest regression model is adopted, in which the number of variables used by the decision trees is set to one third of the number of variables input into the random forest. For each combination, the Pearson correlation coefficient r between the predicted result of the combination and the verification samples is calculated and recorded in each of the 10 runs, and the mean and sample variance of the 10 r values are then calculated. As shown in Table 1 below, the sample variances of the 10 r values of each of the 8 combinations are all on the order of e-05 or smaller, indicating that the r values of each combination vary very little. The mean r values of the 8 combinations all exceed 0.6 and increase with the number of variables input into the model, which shows that the more input variables are used, the higher the reliability of the model.
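As a non-limiting illustration of the evaluation described above, the following minimal sketch computes the Pearson correlation coefficient r between predictions and verification values, and the mean and sample variance of r over repeated runs. The function names, the helper `variables_per_tree` (with a floor of 1) and the numeric r values are hypothetical and are not results of the invention; fitting the random forests themselves is outside the scope of this sketch.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equally long sequences."""
    mp, mo = mean(predicted), mean(observed)
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    sp = sqrt(sum((p - mp) ** 2 for p in predicted))
    so = sqrt(sum((o - mo) ** 2 for o in observed))
    return cov / (sp * so)


def summarise_runs(r_values):
    """Mean and sample variance (n - 1 denominator) of the r values of all runs."""
    return mean(r_values), variance(r_values)


def variables_per_tree(n_inputs):
    """One third of the input variables, at least 1 (the floor is an assumption)."""
    return max(1, n_inputs // 3)


# Hypothetical r values from three runs of one combination.
r_mean, r_var = summarise_runs([0.60, 0.62, 0.64])
```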

In addition, Table 2 below shows, for each model run, the p value of the significance test corresponding to the correlation calculated for each combination. The significance level in the invention is set to 0.01; the mean p values of the 8 combinations are all far below 0.001 and therefore pass the significance test, so the correlations represented by the r values are significant.

In addition, the model also assesses the importance of all 63 input variables through two indices, namely IncMSE and IncNodePurity. IncMSE represents the increase in the mean squared error (MSE) of the prediction result when the values of a variable are randomly permuted; the larger the value, the greater the influence of the variable on the prediction result and the higher its importance. IncNodePurity reflects the decrease in the residual sum of squares (RSS) obtained from splits on a variable, that is, the gain in node purity, which plays the role that the decrease of the Gini index plays in classification; the larger the value, the greater the importance of the corresponding variable.
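As a non-limiting illustration of the idea behind IncMSE, the following minimal sketch measures how much the mean squared error of an arbitrary predictor grows when one input column is randomly permuted. The function names, the toy predictor and the toy data are assumptions made for illustration; random forest implementations typically compute this quantity on out-of-bag samples and may normalize it, which this simplified sketch does not do.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of `predict` over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, column, seed=0):
    """Increase in MSE after randomly permuting the values of one column."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, targets) - baseline


# Hypothetical toy model: the target depends only on the first variable.
def toy_predict(row):
    return 2.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [2.0 * r[0] for r in rows]

importance_first = permutation_importance(toy_predict, rows, targets, column=0)
importance_second = permutation_importance(toy_predict, rows, targets, column=1)
```

In this toy case, permuting the first variable degrades the prediction, whereas permuting the second variable, which the predictor ignores, leaves the error unchanged.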

Since both IncMSE and IncNodePurity are positively correlated with variable importance, in order to represent the importance of each input variable more intuitively, the model sorts the variables in descending order of IncMSE and of IncNodePurity respectively and records the rank of each variable. That is, the variable with rank 1 has the highest importance among all variables. In each model run, 8 sets of variable importance rankings (corresponding to the 8 input variable combinations) are obtained, and the average of the 8 rankings of each variable is calculated. Because the 8 combinations contain different numbers of input variables and only group d2 contains all 63 variables, null values are ignored when calculating the mean rank of a variable, and the mean is taken only over the combinations that contain that variable.
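As a non-limiting illustration of the ranking step described above, the following minimal sketch ranks variables in descending order of an importance index and averages each variable's rank over only those combinations that contain it. The function names and the importance values are hypothetical; the variable abbreviations are used merely as labels.

```python
from statistics import mean


def rank_descending(importance):
    """Map each variable to its rank, where rank 1 has the largest importance."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def mean_rank(rankings):
    """Average rank per variable, ignoring combinations that lack the variable."""
    names = set().union(*rankings)
    return {n: mean(r[n] for r in rankings if n in r) for n in names}


# Hypothetical importance values for two combinations of one model run.
combo_small = {"lon": 5.0, "cropland": 3.0}
combo_large = {"lon": 4.0, "cropland": 6.0, "BIO13": 5.0}

rankings = [rank_descending(combo_small), rank_descending(combo_large)]
averaged = mean_rank(rankings)
```

Here `BIO13` appears in only one combination, so its mean rank is taken over that single combination rather than being diluted by a null value.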

In summary, based on the two indexes, namely the IncMSE and the IncNodePurity, the importance of 63 variables in the model is determined, and the flow is as follows:

First, in each model run, the mean importance rank of each variable over the 8 combinations is obtained and regarded as the final importance rank of that variable in the current run; second, the average of the 10 final importance ranks that each variable obtains over the 10 model runs is calculated. The feature importance rankings are shown in Tables 3-1, 3-2, 3-3, 4-1, 4-2 and 4-3 below; for the meanings of the abbreviations in the tables, reference may be made to the correspondence between the variable abbreviations and their meanings.

It can be seen that, in the two importance rankings based on IncMSE and IncNodePurity, 9 variables appear in the top ten of both rankings, namely: lon, cropland, BIO13, dmsp, BIO16, GDP_rate, BIO18, BIO12 and BIO14, indicating that these variables are key variables for assessing the incidence of infectious diseases.
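As a non-limiting illustration, the variables shared by the top positions of two importance rankings can be found with a simple set intersection. The function names and the rank values below are hypothetical, and a cutoff of 2 is used instead of 10 only to keep the toy example short.

```python
def top_k(ranking, k=10):
    """Names of the k variables with the smallest (best) mean rank."""
    ordered = sorted(ranking.items(), key=lambda item: item[1])
    return {name for name, _ in ordered[:k]}


def shared_top(rank_a, rank_b, k=10):
    """Variables that appear in the top k of both rankings."""
    return top_k(rank_a, k) & top_k(rank_b, k)


# Hypothetical mean ranks under two importance indices.
by_inc_mse = {"lon": 1.2, "cropland": 2.0, "BIO13": 3.1, "dmsp": 4.5}
by_inc_node_purity = {"lon": 1.0, "dmsp": 2.2, "BIO13": 3.0, "cropland": 4.0}

common = shared_top(by_inc_mse, by_inc_node_purity, k=2)
```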

The crowd health assessment method based on ecological environment change provided by the invention can reveal the distribution characteristics and patterns of socio-economic data within city-level and county-level administrative units, so that the exact values of the socio-economic data at each geographical position can be analyzed more deeply, finely and comprehensively. It can thereby provide accurate pixel-level data support for spatial analysis in various fields, and provide data and technical guidance for long-term, multi-factor analysis across multiple regions.

Corresponding to the crowd health assessment method based on the ecological environment change, the invention also provides a crowd health assessment system based on the ecological environment change.

FIG. 3 illustrates the logic of the system for assessing the health of a population based on changes in the ecological environment, according to an embodiment of the present invention.

As shown in fig. 3, the system 200 for assessing the health of a population based on ecological environment changes according to the embodiment of the present invention includes: a vector boundary data acquisition unit 210 for performing pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data; a training and verification data obtaining unit 220 for determining model training set data and verification set data based on the vector boundary data, the ecological environment data corresponding to the statistical yearbook data, and the historical disease incidence data; a model construction and training unit 230, configured to construct and train a crowd health prediction model based on model training set data and verification set data; and the health prediction and evaluation unit 240 is used for performing health prediction and evaluation on corresponding diseases of the crowd based on the trained crowd health prediction model.

It should be noted that, for the embodiment of the crowd health assessment system 200 based on ecological environment changes provided by the present invention, reference may be made to the description in the embodiment of the crowd health assessment method based on ecological environment changes, and details are not repeated here.

According to the method and system for assessing crowd health based on ecological environment change provided by the invention, the completeness, accuracy and reliability of the data are ensured to the greatest extent by using the CNKI database and other data sources; an Excel digital statistical yearbook database is established along the three dimensions of time, space and variable; map atlases of prefecture-level cities and county-level units in China are scanned, vectorized and corrected according to actual requirements; and finally, with the support of pixel-level remote sensing data such as land use data, population data, age structure data and night light data, the statistical yearbook data of prefecture-level and county-level administrative units nationwide are spatially pixelized through the above steps and methods. The distribution characteristics and patterns of socio-economic data within city-level and county-level administrative units can thus be revealed, so that the exact values of the socio-economic data at each geographical position can be analyzed more deeply, finely and comprehensively, accurate pixel-level data support is provided for spatial analysis in various fields, and data and technical guidance are provided for long-term, multi-factor analysis across multiple regions.

The method and system for assessing the health of a population based on changes in the ecological environment according to the present invention are described above by way of example with reference to the accompanying drawings. However, it should be understood by those skilled in the art that various modifications can be made to the above-described method and system for assessing the health of the population based on the change of the ecological environment without departing from the scope of the present invention. Therefore, the scope of the present invention should be determined by the contents of the appended claims.

Claims

1. A crowd health assessment method based on ecological environment change is characterized by comprising the following steps:

performing pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data;

determining model training set data and validation set data based on the vector boundary data, the ecological environment data corresponding to the statistical yearbook data, and historical disease incidence data;

constructing and training a crowd health prediction model based on the model training set data and the verification set data;

and performing health prediction and evaluation on the corresponding diseases of the crowd based on the trained crowd health prediction model.

2. The method for assessing the health of a population based on changes in the ecological environment according to claim 1, wherein the step of pixelating the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data comprises:

acquiring all statistical yearbook data, wherein the statistical yearbook data comprises yearbook data information of prefecture-level cities and county-level units;

arranging the statistical yearbook data into an excel table, and normalizing the statistical yearbook data through a standard row-column number of the excel table; meanwhile,

acquiring map atlas of a city and county level unit, and carrying out vectorization processing on the map atlas to acquire vector data corresponding to the map atlas;

and performing pixelization processing on the normalized statistical yearbook data based on the vector data to acquire vector boundary data corresponding to the statistical yearbook data.

3. The method of claim 2, wherein the step of obtaining all the data of the statistical yearbook comprises:

searching the statistical yearbook by using a CNKI database, and downloading the searched statistical yearbook data;

performing data supplement on the searched statistical yearbook data to form all the statistical yearbook data; wherein the data supplement includes statistical yearbook data obtained from data websites and transcribed from yearbook volumes held in the national library.

4. The method of claim 2, wherein the normative process of the statistical yearbook data comprises:

carrying out digital processing on the statistical yearbook data;

establishing spatio-temporal row and column labels of excel tables for the digitized statistical yearbook data; wherein each excel table represents one variable, each row in each excel table represents a spatially consistent prefecture-level city or county-level unit, and each column represents a different time.

5. The method for assessing the health of the crowd based on the ecological environment change as claimed in claim 2, wherein the step of obtaining the map atlas of the city and county level units, and performing vectorization processing on the map atlas, and obtaining the vector data corresponding to the map atlas comprises:

scanning a paper map atlas and storing the map atlas in a graphic format; when the administrative divisions in the map atlas have been adjusted, recording the changes by means of a new edition of the administrative division map and a comparison table of the historical changes of place names;

establishing a new map layer based on the scanned map atlas, and setting the new map layer into a visible and editable mode;

calling an ArcGIS tool on the new layer to draw paths, and merging all the drawn new layers and checking their topology;

and adding vector attributes to the new map layer after the topology inspection, and acquiring vector data corresponding to the map atlas.

6. The method as claimed in claim 2, wherein the process of pixelating the normalized statistical yearbook data based on the vector data comprises:

classifying nationwide land conditions based on China's annual land use data, and determining the land distribution of the prefecture-level cities and county-level units in each year; meanwhile,

acquiring auxiliary data for pixelization within preset years, and determining the spatialization type of each variable in the statistical yearbook;

and pixelating the statistical yearbook data based on the spatialization type of the variable and the standard row-column number of each prefecture-level city and county-level unit corresponding to each year in the excel table.

7. The method of claim 6, wherein the auxiliary data includes demographic data, age structure data, and night light data;

the spatialization types of the variables comprise: correlation with population density only, combined distribution of land use and night light data, combined distribution of population density and land use, combined distribution of population density and age structure, and no geographic distribution feature.

8. The method for assessing the health of a population based on changes in the ecological environment of claim 1, wherein the ecological environment data comprises administrative district latitude and longitude data, atmospheric pollution data, biological climate data and biological diversity data.

9. The method for assessing the health of a population based on changes in the ecological environment of claim 2, wherein the process of determining the model training set data and the validation set data comprises:

acquiring a spatial weight matrix of the prefecture-level cities and county-level units based on the vector boundary data;

based on the spatial weight matrix, taking the incidence of the historical diseases as a variable, and obtaining a univariate local spatial autocorrelation index;

according to the spatial autocorrelation index, acquiring spatial relationships between the prefecture-level cities and county-level units and the historical disease incidence; wherein the spatial relationships comprise high-high spatial clustering, low-low spatial clustering, high-low spatial clustering, and low-high spatial clustering;

and constructing model training set data and verification set data based on clustering processing of the spatial relationship, the vector boundary data and the ecological environment data.

10. A crowd health assessment system based on ecological environment changes, comprising:

a vector boundary data acquisition unit configured to perform pixelization processing on the acquired statistical yearbook data to acquire vector boundary data corresponding to the statistical yearbook data;

a training and verification data acquisition unit for determining model training set data and verification set data based on the vector boundary data, ecological environment data corresponding to the statistical yearbook data and historical disease incidence data;

the model building and training unit is used for building and training a crowd health prediction model based on the model training set data and the verification set data;

and the health prediction and evaluation unit is used for performing health prediction and evaluation on corresponding diseases of the crowd based on the trained crowd health prediction model.