CN110610446A

CN110610446A - County town classification method based on two-step clustering thought

Info

Publication number: CN110610446A
Application number: CN201910675029.2A
Authority: CN
Inventors: 陈茜; 肖润华; 丁雪茹; 史晓宇; 沙颖萱; 戴卓然; 王凯艺
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2019-07-25
Filing date: 2019-07-25
Publication date: 2019-12-24

Abstract

A county town classification method based on two-step clustering thought comprises the following steps: 1) establishing a basic database from the collected county statistical data; 2) performing first-level clustering by hierarchical cluster analysis on indexes that reflect the urbanization level, such as economy and population, and dividing the county town samples into high, medium and low urbanization levels; 3) performing second-level clustering: screening out the main factors that reflect the functional characteristics of towns, performing hierarchical cluster analysis on the county town data indexes, and dividing the indexes into four major categories; 4) determining the cumulative variance contribution rate of each index combination by principal component analysis, so as to verify the effectiveness of the second-level clustering; 5) in view of the different statistical distributions of the data indexes, using an improved Nelson classification method to judge the dominance degree of each index within each category; 6) judging the dominance combination of each major category by a combination analysis method, carrying out 1-0 binary scoring, and comprehensively determining the characteristic classification of the county town.

Description

County town classification method based on two-step clustering thought

Technical Field

The invention relates to the field of carbon control planning for county towns, and in particular to a town characteristic classification method based on a two-step clustering thought.

Background

Urbanization in China's counties continues to advance. The urban population, urban road area and built-up area of county seats keep growing, and many farmers have moved into apartment buildings in county seats in pursuit of better employment, medical care and education, and ultimately a better life. As a result, county residents increasingly demand more public service facilities, such as hospitals, schools and parks, and a higher level of service from them.

Changes in population distribution and infrastructure demand in county areas call for changes to existing infrastructure such as transportation and public services; at the same time, carbon control planning techniques are being studied in increasing depth. Existing town classification methods cannot accommodate the large differences in scale among the many county areas, lack a low-carbon perspective on county town planning under a carbon control planning system, and mostly apply hierarchical cluster analysis directly to produce a "one-size-fits-all" classification.

The traditional Nelson classification method takes the number of employees as its data object and sets the dominance levels at "mean + n standard deviations". In the county sample data collected here, however, economic data, in particular GDP, total retail sales of social consumer goods and per-capita disposable income, are distributed very unevenly and show an overall spatial pattern of "high in the east and low in the west, strong in the south and weak in the north": counties with a high level of economic development and population concentration are mainly located in urban agglomerations such as the Pearl River Delta, the Yangtze River Delta and Beijing-Tianjin-Hebei. County development in China is extremely unbalanced, and a small number of economically dominant county samples tend to inflate the overall mean and variance, so that the traditional dominance threshold fails to reflect the index levels of most county samples and too few samples reach the threshold.

Disclosure of Invention

The purpose of the invention is as follows: to solve the problems in the prior art, the invention provides a county town classification method that serves research on new infrastructure configuration modes for county towns. A basic database is established by collecting basic panel statistical data on the population, economy and industry, location, traffic characteristics, spatial construction and other aspects of many county towns. First-level clustering is then performed by hierarchical cluster analysis on indexes that reflect the urbanization level, such as economy and population, dividing the county samples into high, medium and low urbanization levels. Second-level clustering follows: the main factors reflecting the functional characteristics of towns are screened out, hierarchical cluster analysis of the county town data indexes divides them into four major categories, and principal component analysis is used to determine the cumulative variance contribution rate of each category's index combination, so as to verify the effectiveness of the second-level clustering. An improved Nelson classification method is then used to judge the dominance degree of each index within each category. Finally, a combination analysis method assigns binary scores to each major category, and the characteristic classification of the county town is determined comprehensively.

The technical scheme is as follows: a county town classification method based on two-step clustering thought comprises the following steps:

step 1, basic data cleaning and processing:

1 a: the data types of the samples comprise statistical data (class I data) and spatial measurement data (class II data);

1 b: cleaning the collected data, eliminating samples with missing data, and then carrying out dimensionless standardization processing to ensure that data values are distributed in the interval of [0,1] so as to be convenient for subsequent comparative analysis;

step 2, first-level clustering based on economic and population indexes:

2 a: carrying out hierarchical sample clustering on indexes of the county sample data such as total GDP, per-capita GDP, permanent-population density and central town population, to obtain economic and population level clustering results for the different counties;

2 b: according to the result of sample cluster analysis, the data distribution interval of each classified sample is defined, so that the basis and the data range of the division of the urbanization level type in the first-level cluster are determined;

step 3, index screening and second-level clustering: on the basis of the first-level clustering result of step 2, iteratively screening out some of the strongly correlated indexes, so as to ensure the validity of the classification result and the significance of the dimension reduction achieved by principal component analysis;

3 a: performing R-type cluster analysis on all index variables, checking the correlation matrix of the second-level cluster analysis result, discarding some of the indexes whose correlation is greater than 0.8 (for example, town population and central town population), and repeating the screening until no strongly correlated indexes remain;

3 b: using Ward's method with squared Euclidean distance as the clustering distance measure, performing second-level hierarchical cluster analysis to obtain the categories of data indexes and the corresponding classification results;

step 4, checking the validity of the clustering result of the secondary indexes;

step 5, determining the dominance degree of each index by adopting an improved Nelson classification method:

5 a: for the valid data indexes of the second-level classification result verified in step 4, judging dominance with the traditional Nelson classification method using the threshold "mean + standard deviation", and calculating the dominance rate of each index;

5 b: for those indexes whose dominance rate under the traditional Nelson classification method is lower than 40%, using the median as the dominance threshold instead; a county whose value for an index is greater than the threshold is regarded as having dominance in that index and is judged to be of the "dominant type", and a county whose value is smaller than the threshold is judged to be of the "general type";

step 6, judging the dominance combination of each major category by adopting a combination analysis method, carrying out 1-0 binary scoring, and comprehensively determining the characteristic classification of the county town:

performing combination analysis on the dominance degrees of the four major index categories obtained in step 5, separately within each first-level class: at the first level, a county whose four major categories all have a dominance degree of 1 is classified as the "comprehensive dominance type"; at the second level, a county is classified as the "common maturity type" if only the dominance degree of economic level and facility supply is 1, as the "tourism landscape type" if the dominance degree of tourism landscape and characteristic development is 1, as the "industrial energy type" if the dominance degree of industrial structure and energy consumption level is 1, and as the "regional communication type" if the dominance degrees of both economic level and facility supply and space and traffic travel are 1; at the third level, a county is classified as the "regional communication type" if the dominance degree of any one index is 1, and otherwise as the "general type".

Further, in step 1, the statistical data include total GDP, per-capita GDP, total permanent population, permanent-population density, central town population, urbanization rate, town population, GDP shares of the three industrial sectors, total retail sales of social consumer goods, total annual freight volume, highway mileage open to traffic, per-capita highway mileage, number of public transport vehicles in operation, average highway passenger transport distance, motor vehicle ownership, annual mileage driven by motor vehicles, per-capita disposable income, energy consumption per 10,000 yuan of GDP, comprehensive energy consumption, number of tourists received, and number of scenic spots of AAA level and above. The spatial measurement data include the area of the administrative region, average elevation, spatial distance to the main urban area of the county or city, number of expressway entrances and exits, distance to the nearest expressway entrance or exit, and distance to the nearest railway station.

Further, in the step 1, a Z-score standardization method is adopted to perform dimensionless standardization processing on the data.
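The standardization step can be sketched in Python as follows. This is a minimal illustration rather than the SPSS procedure itself: the toy matrix `raw`, the function names `z_score` and `min_max`, and the use of the sample standard deviation (`ddof=1`) are all assumptions. A plain Z-score is centred on 0 with unit standard deviation and is not confined to [0,1], so the optional `min_max` rescale is included only as one assumed way of reaching that interval.

```python
import numpy as np

def z_score(X):
    """Column-wise Z-score: (x - column mean) / column sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def min_max(X):
    """Assumed extra step: rescale each column linearly onto [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

# Hypothetical toy data: rows are counties, columns are two indexes.
raw = np.array([[120.0, 3.1],
                [80.0, 2.4],
                [200.0, 5.0],
                [150.0, 4.2]])

# Step 1b: eliminate samples with missing values before standardizing.
raw = raw[~np.isnan(raw).any(axis=1)]

z = z_score(raw)       # mean 0 and standard deviation 1 per column
scaled = min_max(z)    # every column now spans exactly [0, 1]
```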

Further, in step 4, the cumulative variance contribution rate is calculated by principal component analysis. A cumulative contribution rate above 75% is generally taken to indicate that the dimensionality reduction is of practical value, so when verifying the clustered index data the cumulative variance contribution rate should, as far as possible, exceed 75% to ensure that the index classification is sound and effective.
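One way to compute the cumulative variance contribution rate is sketched below. It assumes that the principal components are taken from the correlation matrix of the indexes in one category. The function names, the synthetic data and the reading of the 75% criterion as "at least 0.75" are illustrative assumptions.

```python
import numpy as np

def cumulative_variance_ratio(X):
    """Cumulative share of variance explained by successive principal components."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)        # PCA on the correlation matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, largest first
    return np.cumsum(eigvals) / eigvals.sum()

def components_needed(X, threshold=0.75):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    ratios = cumulative_variance_ratio(X)
    k = int(np.searchsorted(ratios, threshold)) + 1
    return k, ratios

# Synthetic example: 62 samples, 5 indexes driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(62, 2))
loadings = rng.normal(size=(2, 5))
X = latent @ loadings + 0.3 * rng.normal(size=(62, 5))

k, ratios = components_needed(X)
print(k, np.round(ratios, 4))
```

If the ratio for the chosen number of components stays below the threshold, the index grouping would be revisited before continuing.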

In step 2, the first-level clustering based on economic and population indexes uses data such as the county's total GDP, per-capita GDP, permanent-population density and central town population. The number of clusters is set to 3, which yields three definite classes. In theory, the more data samples there are, the more the interval boundaries derived from the three classes stabilize around fixed values.
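A minimal sketch of this first-level clustering using SciPy is given below. The text does not state which linkage is used at this level, so Ward linkage is an assumption here, as are the function names and the synthetic data. `cluster_intervals` illustrates how a data range could be read off for each resulting class.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def first_level_clusters(X_std, n_clusters=3):
    """Hierarchical clustering of samples (rows), cut into n_clusters groups."""
    tree = linkage(X_std, method="ward")   # assumed linkage for this level
    return fcluster(tree, t=n_clusters, criterion="maxclust")

def cluster_intervals(X, labels):
    """Per-cluster (minimum, maximum) of every index, i.e. the data interval."""
    return {int(c): (X[labels == c].min(axis=0), X[labels == c].max(axis=0))
            for c in np.unique(labels)}

# Synthetic example: 15 counties, 4 indexes, three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([0.0, 5.0, 10.0])
X = np.vstack([c + 0.3 * rng.normal(size=(5, 4)) for c in centers])

labels = first_level_clusters(X)
intervals = cluster_intervals(X, labels)
for c, (lo, hi) in intervals.items():
    print(c, np.round(lo, 2), np.round(hi, 2))
```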

The second-level index clustering in step 3 is an iterative process of screening the data rather than a single pass: the correlation matrix of each clustering result is examined, and screening and cluster analysis are repeated until no pair of data indexes has a correlation above 0.8, at which point step 3 ends.
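The iterative screening can be sketched as a loop over the correlation matrix. The text does not say which member of a strongly correlated pair is discarded, so the rule used here (drop the later column of the most correlated pair) is an assumption. The index names and values are hypothetical.

```python
import numpy as np

def screen_indexes(X, names, threshold=0.8):
    """Repeatedly drop one index of the most correlated pair until none exceeds threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)              # ignore self-correlation
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] <= threshold:
            break                                # no strongly correlated pair left
        del keep[int(max(i, j))]                 # assumed rule: drop the later column
    return [names[k] for k in keep]

# Hypothetical data: the second index is almost a copy of the first.
x1 = np.arange(10.0)
x3 = np.array([1.0, -1.0] * 5)
x2 = x1 + 0.1 * x3
X = np.column_stack([x1, x2, x3])
names = ["town_population", "central_town_population", "visitors"]

kept = screen_indexes(X, names)
print(kept)
```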

The improved Nelson classification method in step 5 differs from the traditional Nelson classification method, which uses only the mean and standard deviation as the discrimination threshold. It adjusts the rule for computing the dominance threshold according to the characteristics of the data samples, so that county samples with dominance in particular indexes can be identified and classified: for data indexes whose dominance rate under the traditional Nelson classification method is below 40%, the median is used as the dominance threshold.

Compared with the prior art, the invention has the following notable progress: unlike earlier special-purpose classifications that focus on a single aspect such as urban economic function, industrial structure or natural resources, and unlike earlier single-level clustering, the method is based on a two-level clustering idea, classifies county towns from the perspectives of energy consumption and carbon emissions, and serves as a basis for studying infrastructure configuration modes for different types of county towns from a low-carbon planning perspective.
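The threshold rule of the improved Nelson classification method can be illustrated with the standard library alone. The sketch assumes that the "dominance rate" of an index is the share of samples lying above the "mean + standard deviation" threshold, and that the sample standard deviation is meant. Both are interpretations, and the GDP figures are made up.

```python
from statistics import mean, median, stdev

def dominance_flags(values, min_rate=0.40):
    """Return (threshold, flags); flag 1 marks a sample with dominance in the index."""
    threshold = mean(values) + stdev(values)     # traditional Nelson threshold
    rate = sum(v > threshold for v in values) / len(values)
    if rate < min_rate:
        threshold = median(values)               # improved rule: fall back to the median
    flags = [1 if v > threshold else 0 for v in values]
    return threshold, flags

# Hypothetical skewed index: one very large county inflates mean and deviation.
gdp = [10, 12, 11, 13, 95]
threshold, flags = dominance_flags(gdp)
print(threshold, flags)   # 12 [0, 0, 0, 1, 1]
```

With the traditional threshold only one of the five samples would qualify. After the fallback to the median, two samples are flagged as dominant.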

Detailed Description

The technical solution of the present invention will be further described in detail with reference to the following specific examples.

A county town classification method based on two-step clustering thought applies a two-step cluster analysis to statistical data on the population, economy and industry, location, traffic characteristics, spatial construction and other aspects of many county towns, and divides the county towns into types from a quantitative perspective. The method comprises the following steps: 1) establishing a basic database from the collected county statistical data; 2) performing first-level clustering by hierarchical cluster analysis on indexes that reflect the urbanization level, such as economy and population, and dividing the county town samples into high, medium and low urbanization levels; 3) performing second-level clustering: screening out the main factors that reflect the functional characteristics of towns, performing hierarchical cluster analysis on the county town data indexes, and dividing the indexes into four major categories; 4) determining the cumulative variance contribution rate of each index combination by principal component analysis, so as to verify the effectiveness of the second-level clustering; 5) in view of the different statistical distributions of the data indexes, using an improved Nelson classification method to judge the dominance degree of each index within each category; 6) judging the dominance combination of each major category by a combination analysis method, carrying out 1-0 binary scoring, and comprehensively determining the characteristic classification of the county town.

Examples

1) Collecting data and establishing a database;

Starting from the samples of 4 demonstration counties of a national key research and development project, county sample data were collected from each human-geographic region of China, 62 samples in total.

The data types of the samples include statistical data (class I data) and spatial measurement data (class II data), see Table 2. The data sources include:

① 2016 Statistical Bulletins on National Economic and Social Development published on the websites of the county and city statistics bureaus

② 2017 statistical yearbooks published on the websites of the county and city statistics bureaus

③ China City Statistical Yearbook 2016

④ Sixth National Population Census data (by township)

⑤ Google Earth (GE) elevation DEM data

⑥ Baidu Maps (coordinate conversion)

TABLE 2 Town-classified data collection index

The unit and the obtaining method of each to-be-selected data index are described in table 3:

TABLE 3 candidate data index description

2) Data cleaning and processing;

Because the data indexes differ in dimension and order of magnitude, the data must be standardized to dimensionless values. In this example the data were standardized with the Z-score standardization method of SPSS, giving values between 0 and 1, as follows:

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{S_j}$$

where $Z_{ij}$ is the standardized value of the i-th sample for the j-th index, $X_{ij}$ is the observed value of the i-th sample for the j-th index, and $\bar{X}_j$ and $S_j$ are the mean and standard deviation of the j-th index. In this way, values with different dimensions are converted into standard values falling within the interval [0,1], which facilitates comparison and analysis.

3) First-stage classification;

In the first-level town classification, guided by the three-stage S-shaped curve theory of town development, hierarchical sample clustering is carried out on indexes of the county sample data such as total GDP, per-capita GDP, permanent-population density and central town population, giving economic and population level cluster analysis results for the different counties. Corresponding data intervals are then derived from the clustering results. The first-level classification results are shown in Table 4.

TABLE 4 classification results of economic and demographic index levels of county towns

4) Secondary classification;

After the index data are standardized, R-type hierarchical cluster analysis is carried out first. During R-type cluster analysis, some of the strongly correlated indexes need to be screened out to ensure the validity of the classification result and the significance of the dimension reduction achieved by principal component analysis. Hierarchical index clustering is performed with Ward's method, using squared Euclidean distance as the clustering distance measure, and the index classification result is obtained after the indexes have been screened.
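R-type clustering groups the indexes (columns) rather than the samples (rows), which in code amounts to clustering the transposed data matrix. The sketch below uses SciPy's Ward linkage, which works on Euclidean distances and merges clusters by the increase in within-cluster sum of squares. Treating it as equivalent to the SPSS setting described above is an assumption, and the index names and data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_indexes(X_std, names, n_groups=4):
    """Group index columns into n_groups categories by Ward hierarchical clustering."""
    tree = linkage(X_std.T, method="ward")   # transpose: each index is one observation
    labels = fcluster(tree, t=n_groups, criterion="maxclust")
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(int(label), []).append(name)
    return groups

# Synthetic example: 62 samples, 8 indexes built as noisy copies of 4 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(62, 4))
X = np.repeat(factors, 2, axis=1) + 0.05 * rng.normal(size=(62, 8))
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
names = ["a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2"]

groups = cluster_indexes(X_std, names)
print(groups)
```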

5) The principal component analysis verifies the clustering result of the secondary indexes;

The cumulative variance contribution rate is calculated by principal component analysis (PCA). A cumulative contribution rate above 75% is generally taken to indicate that the dimensionality reduction is of practical value, so when verifying the clustered index data the cumulative variance contribution rate should, as far as possible, exceed 75% to ensure that the index classification is sound and effective. PCA was performed on the four categories of indexes, and the results include scree plots, eigenvalues and contribution rates. Based on the scree plots and eigenvalues computed by PCA, 2, 3 and 5 principal components were selected for the four index categories; the cumulative variance contribution rates are 84.14%, 85.62%, 87.62% and 81.62% respectively, all greater than 75%. The PCA results can therefore serve as a basis for the cluster analysis and effectively support its result. On this basis the dimension reduction and the cluster analysis are considered valid, and the resulting clustering of the data indexes is shown in Table 5.

TABLE 5 data index clustering results

6) Determining index dominance with the improved Nelson classification method;

The improved Nelson classification method adjusts the rule for computing the dominance threshold according to the characteristics of the data samples, so that county samples with dominance in particular indexes can be identified and classified. For those data indexes whose dominance rate is lower than 40%, the median is taken as the dominance threshold: a county whose value for the index is greater than the threshold is considered to have dominance in that index, and on this basis the samples are divided into the "dominant type" and the "general type". Finally, a combination analysis method is applied to the county samples of each urbanization level to compute the combined score of each sample's "dominant type" indexes and determine the comprehensive classification of the county town.

7) Determining the classification of towns by a combined analysis method;

Combination analysis is performed on the dominance degrees of the four major index categories obtained in the preceding step, separately within each first-level class. At the first level, a county whose four major categories all have a dominance degree of 1 is classified as the "comprehensive dominance type". At the second level, a county is classified as the "common maturity type" if only the dominance degree of economic level and facility supply is 1, as the "tourism landscape type" if the dominance degree of tourism landscape and characteristic development is 1, as the "industrial energy type" if the dominance degree of industrial structure and energy consumption level is 1, and as the "regional communication type" if the dominance degrees of both economic level and facility supply and space and traffic travel are 1. At the third level, a county is classified as the "regional communication type" if the dominance degree of any one index is 1, and otherwise as the "general type". The classification results are shown in Table 6.
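The combination rules can be read as a small decision function over the four binary category scores. The sketch below is one possible reading only: it takes the 1-0 score of each major category as given, applies the rules in the order in which the text lists them (an assumed precedence), and follows the literal wording for the third-level rule. The function and parameter names are hypothetical.

```python
def classify_town(economy, industry, tourism, traffic):
    """Map four 1-0 category dominance scores to a town type (assumed rule order).

    economy  : economic level and facility supply
    industry : industrial structure and energy consumption level
    tourism  : tourism landscape and characteristic development
    traffic  : space and traffic travel
    """
    flags = (economy, industry, tourism, traffic)
    if all(flags):                               # first level
        return "comprehensive dominance type"
    if flags == (1, 0, 0, 0):                    # second level
        return "common maturity type"
    if tourism:
        return "tourism landscape type"
    if industry:
        return "industrial energy type"
    if economy and traffic:
        return "regional communication type"
    if any(flags):                               # third level, literal reading
        return "regional communication type"
    return "general type"

# Hypothetical score vectors for three counties.
for scores in [(1, 1, 1, 1), (1, 0, 0, 1), (0, 0, 0, 0)]:
    print(scores, "->", classify_town(*scores))
```

Because several second-level conditions can hold at once (for example, tourism and industry both equal to 1), any real implementation would need the precedence to be confirmed against the original classification table.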

TABLE 6 results of city and town classification in county

Claims

1. A county and town classification method based on two-step clustering thought is characterized by comprising the following steps:

step 1, basic data cleaning and processing:

1 a: the county sample data index type comprises a statistical data index and a spatial measurement data index;

1 b: cleaning the collected data, eliminating samples with missing data, and then carrying out dimensionless standardization processing to ensure that the data values are distributed in the interval of [0,1];

step 2, performing first-level clustering by adopting a hierarchical cluster analysis method:

2 a: performing hierarchical sample clustering on the county sample data indexes to obtain economic and population level clustering results of different counties;

step 3, index screening and secondary clustering:

3 a: performing R-type cluster analysis on all index variables, checking a correlation matrix of a secondary cluster analysis result, partially discarding and screening indexes with correlation greater than 0.8, and continuously repeating the screening work until no indexes with strong correlation repeatedly appear;

5 a: judging the valid data indexes of the second-level classification result verified in step 4 by using the traditional Nelson classification method with the threshold "mean + standard deviation", and calculating the dominance rate of each index;

2. The county town classification method based on two-step clustering thought as claimed in claim 1, wherein: the statistical data indexes comprise total GDP, per-capita GDP, total permanent population, permanent-population density, central town population, urbanization rate, town population, GDP shares of the three industrial sectors, total retail sales of social consumer goods, total annual freight volume, highway mileage open to traffic, per-capita highway mileage, number of public transport vehicles in operation, average highway passenger transport distance, motor vehicle ownership, annual mileage driven by motor vehicles, per-capita disposable income, energy consumption per 10,000 yuan of GDP, comprehensive energy consumption, number of tourists received, and number of scenic spots of AAA level and above.

3. The county town classification method based on two-step clustering thought as claimed in claim 1, wherein: the spatial measurement data indexes comprise the area of the administrative region, average elevation, spatial distance to the main urban area of the county or city, number of expressway entrances and exits, distance to the nearest expressway entrance or exit, and distance to the nearest railway station.

4. The county town classification method based on two-step clustering thought as claimed in claim 1, wherein: the data were subjected to dimensionless normalization using the Z-score normalization method.

5. The county town classification method based on two-step clustering idea as claimed in claim 1, wherein in the step 4, the cumulative variance contribution rate is calculated by using principal component analysis.