CN112990976A

CN112990976A - Commercial network site selection method, system, equipment and medium based on open source data mining

Info

Publication number: CN112990976A
Application number: CN202110332552.2A
Authority: CN
Inventors: 魏宗财; 刘雨飞; 魏纾晴; 彭丹丽; 陈旭华; 刘晨瑜; 唐琦婧
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2021-06-18
Anticipated expiration: 2041-03-29
Also published as: CN112990976B

Abstract

The invention discloses a commercial network site selection method, a system, equipment and a medium based on open source data mining, wherein the method comprises the following steps: acquiring data of a target area through a multi-source data open platform; carrying out grid division and numbering on the target area, and constructing an index system for clustering and addressing according to the obtained data; preprocessing data of a target area; respectively linking the divided grids according to the preprocessed data, and counting the value of each influence factor in the index system; counting the value of each influencing factor in the index system according to the number of the grid, and analyzing by using a two-step clustering algorithm; and giving site selection suggestions of different types and different scales of commercial network points according to the analysis result of the two-step clustering algorithm. The method is based on data mined from open source data, and is combined with a two-step clustering algorithm to carry out analysis, so that according to the analysis result, assistance and reference can be provided for site selection of different-scale and different-category commercial network points of cities.

Description

Commercial network site selection method, system, equipment and medium based on open source data mining

Technical Field

The invention relates to a commercial site selection method, in particular to a commercial site selection method, a commercial site selection system, a commercial site selection device and a commercial site selection medium based on open source data mining.

Background

Site selection for commercial network points is of great significance. From the perspective of macroscopic urban planning, the commercial network is an important component of high-quality urban development and influences both the vitality of a city and the travel of its citizens; a reasonable commercial network layout can increase the operating efficiency of the city. From the perspective of individual enterprises and operators, a commercial network point is the basic unit of operation and development. Compared with other operating factors, a site is long-term and fixed: when the external environment changes, other operating factors can be adjusted, but a site is difficult to change once it has been determined. A well-chosen site therefore benefits enterprises and individuals for a long time.

Existing commercial site selection methods mainly consider factors such as population, traffic, existing commercial aggregation and shop rent. These are core indexes for commercial site selection, but they are not comprehensive. Existing methods also use little data, do not consider the city as a whole, and make it difficult to determine with a unified standard whether commercial network points of different types and scales are suitable for construction. Scholars have used shared bicycle trip data to analyze the traffic hot spots of a city and have demonstrated a high correlation between traffic hot spots and commerce. In addition, the compatibility between different types of urban land and commerce is a key factor in whether a chosen site can actually be implemented.

Disclosure of Invention

In order to solve the defects of the prior art, the invention provides a commercial site selection method, a system, equipment and a medium based on open source data mining, wherein the method, the system, the equipment and the medium are based on the data mined by the open source data and are combined with a two-step clustering algorithm for analysis, and according to the analysis result, the method, the system, the equipment and the medium can provide assistance and reference for site selection of commercial sites of different scales and different classes in cities.

A first object of the invention is to provide a commercial network site selection method based on open source data mining.

A second object of the invention is to provide a commercial network site selection system based on open source data mining.

It is a third object of the invention to provide a computer apparatus.

It is a fourth object of the present invention to provide a storage medium.

The first purpose of the invention can be achieved by adopting the following technical scheme:

a method for commercial site selection based on open source data mining, the method comprising:

acquiring data of a target area through a multi-source data open platform;

performing grid division and numbering on the target area, and constructing an index system for clustering and addressing according to the acquired data;

preprocessing the data of the target area;

respectively linking the divided grids according to the preprocessed data, and counting the value of each influence factor in the index system;

counting the value of each influencing factor in the index system according to the number of the grid, and analyzing by using a two-step clustering algorithm;

and giving site selection suggestions of different types and different scales of commercial network points according to the analysis result of the two-step clustering algorithm.

Further, performing grid division and numbering on the target area, each grid being a basic unit for site selection, and constructing an index system for clustering-based site selection according to the acquired data specifically comprises:

extracting the boundary of the target area, creating grid polygon features that cover the boundary of the target area, and clipping them to the boundary to obtain correspondingly numbered grids;

according to the factors to be considered in the site selection of the commercial network points, the clustering indexes are divided into six categories, namely population factors, shared bicycle traveling factors, shop rent factors, traffic comprehensive factors, commercial aggregation factors and land utilization factors.

Further, the data of the target area comprises population density, shared bicycle travel, shop rent, urban road traffic, commercial POI and land utilization data; the business POI data comprises catering business POIs, financial business POIs and shopping business POI data;

the preprocessing the data of the target area specifically includes:

dividing the population density raster data into five classes by the natural breaks classification method, assigning values to the corresponding intervals from low to high to obtain a population reclassification raster map, and converting the population reclassification raster map into ordered categorical variables;

selecting a line tracking interval tool and, according to the shared bicycle travel data, performing line tracking analysis based on the starting points and ending points to obtain path line data of the shared bicycles;

selecting the rent field and processing the shop rent data by kernel density estimation to obtain a shop rent evaluation raster map; classifying the shop rent evaluation raster map by the natural breaks classification method, assigning values to the corresponding intervals from low to high to obtain a shop rent reclassification raster map, and converting the shop rent reclassification raster map into ordered categorical variables;

processing the commercial POI data by kernel density estimation to obtain a catering commercial concentration raster map, a shopping commercial concentration raster map and a financial commercial concentration raster map, respectively;

classifying the catering, shopping and financial commercial concentration raster maps by the natural breaks classification method, assigning values to the corresponding intervals from low to high to obtain catering, shopping and financial commercial concentration reclassification raster maps, and converting all reclassification raster maps into ordered categorical variables;

for the urban main road POI data, urban secondary main road POI data, urban subway station POI data and urban bus station POI data in the urban road traffic data, establishing buffer zones around the urban main roads, urban secondary main roads, bus stations and subway stations according to distance, and assigning a value to each buffer zone according to its distance, so that the urban main roads, urban secondary main roads, bus stations and subway stations are converted into ordered categorical variables;

and converting the current land use map into vector data, and assigning values to the commercial land parcels and the non-commercial land parcels respectively to obtain land use classification polygon features.

Further, the step of respectively linking the divided grids according to the preprocessed data, and counting the value of each influence factor in the index system specifically includes:

converting the population reclassification raster map, the shop rent reclassification raster map, and the catering, shopping and financial commercial concentration reclassification raster maps into polygon features, and spatially joining each of them with the divided grids to obtain the population density factor evaluation, shop rent factor evaluation, catering commercial concentration evaluation, shopping commercial concentration evaluation and financial commercial concentration evaluation, respectively; wherein the joined value of a single grid is the average of the raster values within that grid;

spatially joining the path line data of the shared bicycles with the divided grids to obtain the shared bicycle travel path length factor evaluation; wherein the joined value of a single grid is the total path length within that grid;

obtaining a traffic comprehensive factor map by applying a multi-factor weighted overlay analysis to the buffer zones of the urban main roads, urban secondary main roads, bus stations and subway stations;

converting the traffic comprehensive factor map into polygon features and spatially joining them with the divided grids to obtain the urban road traffic factor evaluation; wherein the joined value of a single grid is the average of the raster values within that grid;

spatially joining the land use classification polygon features with the divided grids to obtain the land use evaluation; wherein the joined value of a single grid is the average of the feature values within that grid.

Further, the multi-factor weighted overlay analysis specifically includes:

performing an overlay analysis of four factors, namely main road distance, secondary main road distance, subway station distance and bus station distance, to obtain the traffic comprehensive factor evaluation, wherein the evaluation model is as follows:

$$S = \sum_{i=1}^{n} W_i x_i$$

wherein S is the final traffic comprehensive factor evaluation, W_i is the weight of the i-th factor, and x_i is the i-th variable factor; the weight of the main road distance is 0.3, the weight of the secondary main road distance is 0.2, the weight of the bus station distance is 0.2, and the weight of the subway station distance is 0.3.
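As an illustration of the weighted overlay described above, the following minimal Python sketch combines four ordered scores using the weights stated in the embodiment (0.3, 0.2, 0.2, 0.3). The factor names and the `traffic_score` function are hypothetical, and the sketch assumes that each location already carries a score on the 1 to 5 scale for every factor.

```python
# Weights of the four traffic factors, as stated in the embodiment.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "main_road": 0.3,
    "secondary_road": 0.2,
    "bus_station": 0.2,
    "subway_station": 0.3,
}


def traffic_score(factors, weights=WEIGHTS):
    """Return S = sum_i W_i * x_i for one location."""
    return sum(weights[name] * factors[name] for name in weights)


# Hypothetical buffer scores of one location (5 = nearest ring).
example = {"main_road": 5, "secondary_road": 3, "bus_station": 1, "subway_station": 4}
score = traffic_score(example)  # 0.3*5 + 0.2*3 + 0.2*1 + 0.3*4
```

Because the weights sum to 1, the combined score stays on the same 1 to 5 scale as the inputs.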

Further, the counting of the value of each influencing factor in the index system according to the number of the grid and the analysis by using a two-step clustering algorithm specifically include:

counting the values of the population density factor evaluation, shared bicycle travel path length factor evaluation, shop rent factor evaluation, traffic factor evaluation, catering commercial concentration evaluation, financial commercial concentration evaluation, shopping commercial concentration evaluation and land use evaluation into a statistical table;

according to the statistical table, a two-step clustering algorithm is used for analyzing to generate a clustering result table, so that the class numbers correspond to corresponding grid numbers;

obtaining a final result graph according to the clustering result table;

and giving site selection suggestions of different types and different scales of commercial network points according to the spatial distribution of the clustering result table and the final result graph.

Further, the two-step clustering algorithm comprises a pre-clustering stage and a clustering stage, wherein distance measures are used in the pre-clustering stage and the clustering stage;

the pre-clustering stage comprises: adopting the CF tree growth idea of the BIRCH algorithm, reading the data points in the data set one by one, and, while generating the CF tree, pre-clustering the data points in dense areas to form a plurality of small sub-clusters;

the clustering stage comprises: taking the sub-clusters produced by the pre-clustering stage as objects, and merging the sub-clusters one by one by an agglomerative method until the desired number of clusters is reached.
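The two stages can be sketched in Python as follows. This is a deliberately simplified illustration, not the SPSS implementation: sub-clusters are kept in a flat list rather than a CF tree, each sub-cluster stores only a count and a linear sum (a reduced clustering feature), and Euclidean distance between centroids is assumed as the distance measure. The function names and the `threshold` parameter are hypothetical.

```python
import math


def _centroid(cluster):
    count, linear_sum = cluster
    return [s / count for s in linear_sum]


def pre_cluster(points, threshold):
    """Stage 1: read points one by one; absorb a point into the nearest
    sub-cluster if it is within `threshold`, otherwise start a new one."""
    subclusters = []  # each entry is [count, linear_sum]
    for p in points:
        nearest, nearest_d = None, None
        for sc in subclusters:
            d = math.dist(p, _centroid(sc))
            if nearest is None or d < nearest_d:
                nearest, nearest_d = sc, d
        if nearest is not None and nearest_d <= threshold:
            nearest[0] += 1
            nearest[1] = [s + v for s, v in zip(nearest[1], p)]
        else:
            subclusters.append([1, list(p)])
    return subclusters


def agglomerate(subclusters, k):
    """Stage 2: repeatedly merge the two closest sub-clusters until k remain."""
    clusters = [[n, list(ls)] for n, ls in subclusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(_centroid(clusters[i]), _centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = [clusters[i][0] + clusters[j][0],
                  [a + b for a, b in zip(clusters[i][1], clusters[j][1])]]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [_centroid(c) for c in clusters]


def two_step(points, threshold, k):
    """Run both stages and label every point with its nearest final centroid."""
    centroids = agglomerate(pre_cluster(points, threshold), k)
    labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
              for p in points]
    return labels, centroids


points = [(1, 1), (1.2, 0.9), (0.9, 1.1), (5, 5), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = two_step(points, threshold=0.5, k=2)
```

In this sketch the number of clusters `k` is supplied by the caller; any automatic selection of the cluster number is outside its scope.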

The second purpose of the invention can be achieved by adopting the following technical scheme:

a commercial site selection system based on open source data mining, the system comprising:

the data acquisition module is used for acquiring data of a target area through the multi-source data open platform;

the grid division module is used for carrying out grid division and numbering on the target area and constructing an index system of clustering address selection according to the acquired data;

the data preprocessing module is used for preprocessing the data of the target area;

the statistical module is used for respectively linking the divided grids according to the preprocessed data and counting the value of each influence factor in the index system;

the statistic and analysis module is used for carrying out statistics on the value of each influence factor in the index system according to the serial number of the grid and analyzing by using a two-step clustering algorithm;

and the site selection suggestion module of the commercial network points is used for giving site selection suggestions of the commercial network points of different types and different scales according to the analysis result of the two-step clustering algorithm.

The third purpose of the invention can be achieved by adopting the following technical scheme:

a computer device comprises a processor and a memory for storing a program executable by the processor, wherein the processor executes the program stored in the memory to realize the commercial site addressing method.

The fourth purpose of the invention can be achieved by adopting the following technical scheme:

a storage medium stores a program which, when executed by a processor, implements the above-described commercial site selection method.

Compared with the prior art, the invention has the following beneficial effects:

the invention is based on data mined from open source data and comprehensively considers various factors at the scale of the whole city, including the compatibility between different types of urban land and commerce, which is a key factor in whether a chosen site can be implemented; combined with the two-step clustering algorithm, it provides assistance and reference for site selection of commercial network points of different types and different scales in a city.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.

Fig. 1 is a flowchart of a method for selecting a site of a commercial site based on open source data mining according to embodiment 1 of the present invention.

Fig. 2 is a population density factor evaluation chart in example 1 of the present invention.

Fig. 3 is a travel path length diagram of the shared bicycle in embodiment 1 of the present invention.

Fig. 4 is a view showing the evaluation of commercial rent factors in example 1 of the present invention.

FIG. 5 is a graph showing the evaluation of the commercial concentration of catering in example 1 of the present invention.

Fig. 6 is a diagram showing the evaluation of the aggregation of financial businesses in example 1 of the present invention.

Fig. 7 is a diagram showing the evaluation of the shopping category business concentration in embodiment 1 of the present invention.

Fig. 8 is a traffic factor evaluation chart of embodiment 1 of the present invention.

FIG. 9 is a land use evaluation chart of example 1 of the present invention.

FIG. 10 is a schematic model diagram of embodiment 1 of the present invention.

FIG. 11 is a cluster quality map of example 1 of the present invention.

Fig. 12 is a schematic diagram of a clustering result table in embodiment 1 of the present invention.

Fig. 13 is a diagram of a two-step clustering algorithm-based business addressing partition in accordance with embodiment 1 of the present invention.

Fig. 14 is a block diagram of a structure of a commercial site location system based on open-source data mining according to embodiment 2 of the present invention.

Fig. 15 is a block diagram of a computer device according to embodiment 3 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention.

Example 1

This embodiment is described below with reference to the accompanying drawings. Taking Tianhe District of Guangzhou as an example, it provides a method for aided planning and site selection of commercial network points based on open source data mining.

Fig. 1 is a flowchart of a method for addressing a business site based on open source data mining according to embodiment 1 of the present invention.

S101, acquiring data of a target area through a multi-source data open platform.

The data comprises population density, shared bicycle trip, shop rent, urban road traffic, commercial POI and land utilization data; the business POI data comprises catering business POIs, financial business POIs and shopping business POI data; the urban road traffic data comprises urban road data, bus station distance and subway station distance, and the urban road data comprises urban main road distance and urban secondary main road distance.

In an implementation case, for the data acquisition in step S101, the specific implementation method is as follows:

the population density of Tianhe District, Guangzhou, is obtained from the WorldPop website; the shared bicycle trip points of Tianhe District, in Excel format, are data for September 16, 2019 provided by the Mobike app; the shop rent data, urban road data, urban bus station data, urban subway station data, catering commercial POI data, financial commercial POI data and shopping commercial POI data of Tianhe District are obtained from the Amap (Gaode Map) database; the current land use map of Tianhe District is obtained from official documents on the government website.

S102, carrying out grid division and numbering on the target area, and constructing an index system for clustering-based site selection according to the acquired data.

In an implementation case, for step S102, the target area is gridded according to the administrative boundary of Tianhe District, the grids are numbered, and the clustering index system is constructed. The specific implementation method is as follows:

the administrative boundary of Tianhe District, Guangzhou, is extracted and imported into GIS software, 300 m × 300 m grid polygon features are created to cover the whole administrative boundary, and the features are clipped to the boundary to obtain 1729 correspondingly numbered grids. According to the factors to be considered in site selection of commercial network points, the clustering indexes are divided into population factors, shared bicycle trip factors, shop rent factors, traffic comprehensive factors, commercial aggregation factors and land utilization factors, involving 10 indexes in total, as shown in Table 1 below:

TABLE 1 clustering index Table
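The grid division step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: coordinates are taken to be in a projected system measured in metres, the boundary is a simple polygon given as a list of vertices, and a cell is kept when its centre falls inside the boundary, which is a simplification of the clipping performed in GIS software. The function names `point_in_polygon` and `make_grid` are hypothetical.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count the edges crossed by a horizontal ray going right from (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def make_grid(boundary, cell=300.0):
    """Cover the bounding box of the boundary with square cells and number
    the cells whose centre lies inside the boundary, starting from 1."""
    xs = [p[0] for p in boundary]
    ys = [p[1] for p in boundary]
    min_x, min_y = min(xs), min(ys)
    n_cols = math.ceil((max(xs) - min_x) / cell)
    n_rows = math.ceil((max(ys) - min_y) / cell)
    grid = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_x + col * cell
            y0 = min_y + row * cell
            if point_in_polygon(x0 + cell / 2, y0 + cell / 2, boundary):
                grid.append({"id": len(grid) + 1, "x0": x0, "y0": y0,
                             "x1": x0 + cell, "y1": y0 + cell})
    return grid


# Hypothetical 900 m x 900 m square boundary, giving a 3 x 3 grid.
square = [(0, 0), (900, 0), (900, 900), (0, 900)]
grid = make_grid(square)
```

Each returned cell carries its number and extent, so later steps can attach factor values to the same `id`.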

S103, preprocessing the data of the target area.

In an implementation case, for step S103, the acquired Tianhe District data on population density, shared bicycle trip points, shop rent, urban road traffic, the various commercial POIs and land utilization are preprocessed as follows:

for the population factor, the population density raster data of Tianhe District are imported into GIS software, and the reclassification tool is used to divide them into five classes by the natural breaks classification method; the intervals are assigned the values 1, 2, 3, 4 and 5 from low to high to obtain the Tianhe District population reclassification raster map, which is converted into ordered categorical variables to facilitate the subsequent cluster analysis.

The natural breaks classification method identifies class intervals from the natural groupings inherent in the data, grouping similar values together and maximizing the differences between classes. The data are divided into several classes whose boundaries are set where the differences between data values are relatively large.

For the array to be classified, first calculate the sum of squared deviations from the array mean (SDAM). Denote the array by A; its mean $\bar{X}$ is:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (1)$$

and the sum of squared deviations from the array mean (SDAM) is:

$$SDAM = \sum_{i=1}^{n} \left(x_i - \bar{X}\right)^2 \quad (2)$$

In formulas (1) and (2), n is the number of elements in the array and x_i is the value of the i-th element.

Next, for each combination of class ranges in the classification result, calculate the sum of squared deviations from the class means (SDCM), and record the smallest value as SDCM_min. Dividing the n elements into k classes yields k subsets; for one such division, [X_1, X_2, …, X_i], [X_{i+1}, X_{i+2}, …, X_j], …, [X_{j+1}, X_{j+2}, …, X_n], the sum of squared deviations of each subset, SDAM_i, SDAM_j, …, SDAM_n, is calculated, and their total SDCM_1 is:

$$SDCM_1 = SDAM_i + SDAM_j + \dots + SDAM_n \quad (3)$$

The n elements can also be divided into k classes in other ways, for which SDCM_2, … are calculated in turn. The smallest of these values is selected as the final result SDCM_min, which is then checked by a goodness-of-fit test.

The goodness of variance fit gvf_i of each classification is:

$$gvf_i = \frac{SDAM - SDCM_i}{SDAM}$$

gvf_i ranges from 1 (perfect fit) to 0 (poor fit), and a higher value indicates greater differences between classes. Experiments have shown that the classification obtained through SDCM_min has the largest gvf value, so the result of the natural breaks classification method can be considered ideal.
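The procedure above can be illustrated with a brute-force Python sketch that enumerates every way of cutting a sorted array into k contiguous classes, keeps the division with the smallest SDCM, and reports its goodness of variance fit. The function names are hypothetical, and exhaustive enumeration is only practical for small arrays; it is used here purely to mirror the description, not as a production method.

```python
from itertools import combinations


def sdam(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks(values, k):
    """Return (classes, gvf) for the k-class division with minimal SDCM."""
    data = sorted(values)
    n = len(data)
    best_sdcm, best_classes = None, None
    # Each choice of k-1 cut positions defines one division into k classes.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        classes = [data[bounds[i]:bounds[i + 1]] for i in range(k)]
        sdcm = sum(sdam(c) for c in classes)
        if best_sdcm is None or sdcm < best_sdcm:
            best_sdcm, best_classes = sdcm, classes
    total = sdam(data)
    gvf = 1.0 if total == 0 else (total - best_sdcm) / total
    return best_classes, gvf


# Hypothetical sample with three obvious groups.
sample = [1, 2, 3, 10, 11, 12, 50, 51, 52]
classes, gvf = natural_breaks(sample, 3)
```

For the reclassification in the embodiment, the class index of each value (1 to 5 for k = 5) would serve as the ordered categorical variable.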

For the shared bicycle travel factor, the weekend shared bicycle trip point data of Tianhe District, Guangzhou, are imported into ArcGIS software, the line tracking interval tool is selected, and line tracking analysis is performed based on the starting points and ending points to obtain the path line feature data of the shared bicycles. Site selection for commercial network points is closely related to residents' travel, and weekend shared bicycle paths reflect, to a certain extent, the routes of residents' activities; therefore, the higher the line density of shared bicycle paths, that is, the denser the paths, the higher the commercial value.

For the commercial aggregation factor, the catering, shopping and financial commercial POI points in Tianhe District are processed by kernel density estimation to obtain concentration distribution raster maps for catering, shopping and financial commerce. Commercial concentration is an important reference factor in site selection of commercial network points. Taking catering as an example, the higher the concentration of restaurants, the larger the corresponding passenger flow, and a virtuous cycle of competition can form between the restaurants and their surroundings; areas with a higher concentration of related categories are therefore generally more suitable for siting commercial network points.

For the shop rent factor, the shop POI points are processed by kernel density estimation with the rent field selected as the value field, yielding a shop rent factor evaluation map of Tianhe District. Shop rent is of great significance for site selection of commercial network points.

Kernel density estimation estimates the density of a point or line pattern by means of a moving cell. Given sample points x_1, x_2, …, x_n, kernel estimation is used to model the detailed distribution of the attribute variable data. For two-dimensional data, d takes the value 2, and a commonly used kernel density estimation function is:

$$f(x, y) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{\sqrt{(x - x_i)^2 + (y - y_i)^2}}{h}\right)$$

where K(·) is the kernel function, (x − x_i)² + (y − y_i)² is the squared distance between the point (x_i, y_i) and (x, y), h is the bandwidth, and n is the number of points in the study range.

In kernel density estimation, the bandwidth is a free parameter that determines the amount of smoothing; a bandwidth that is too large or too small affects the result of f(x). Under the assumption that f(x) is normal, Silverman's "rule of thumb" for the optimal bandwidth can be simplified, following Ker, A. P. and B. K. Goodwin, to:

$$h = \left(\frac{4}{3n}\right)^{1/5} \sigma \approx 1.06\,\sigma\,n^{-1/5}$$

where σ is the sample standard deviation.
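A small Python sketch of a two-dimensional kernel density estimate with a rule-of-thumb bandwidth is given below. It assumes a Gaussian kernel, which is one common choice and is not stated in the text, and it accepts optional per-point weights to stand in for a value field such as rent. The function names, the weighting scheme and the choice of which one-dimensional sample to feed into the bandwidth rule are all assumptions of this sketch.

```python
import math
from statistics import stdev


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5) for a 1-D sample."""
    n = len(values)
    return 1.06 * stdev(values) * n ** (-1 / 5)


def gaussian_kernel_2d(u):
    """Standard bivariate Gaussian kernel evaluated at scaled distance u."""
    return math.exp(-0.5 * u * u) / (2 * math.pi)


def kde_2d(x, y, points, h, weights=None):
    """Density at (x, y): (1 / (n h^2)) * sum_i w_i * K(dist_i / h)."""
    n = len(points)
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        dist = math.hypot(x - xi, y - yi)
        w = 1.0 if weights is None else weights[i]
        total += w * gaussian_kernel_2d(dist / h)
    return total / (n * h ** 2)


# Hypothetical shop locations in metres: a tight cluster and one distant shop.
shops = [(0, 0), (10, 5), (5, 10), (8, 8), (500, 500)]
h = silverman_bandwidth([p[0] for p in shops])  # bandwidth from x-coordinates only
near = kde_2d(5, 5, shops, h)
far = kde_2d(250, 250, shops, h)
```

Evaluating `kde_2d` at the centre of every raster cell would produce a density surface comparable in spirit to the concentration maps described above.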

After the Tianhe District catering, shopping and financial commercial concentration raster maps and the shop rent evaluation raster map are obtained, the reclassification tool is used to divide each of them into five classes by the natural breaks classification method; the intervals are assigned the values 1, 2, 3, 4 and 5 from low to high to obtain the Tianhe District catering, shopping and financial commercial concentration reclassification raster maps and the shop rent reclassification raster map, which are converted into ordered categorical variables to facilitate the subsequent cluster analysis.

For the traffic factor, the main road, secondary main road, urban subway station POI and urban bus station POI data of Tianhe District are imported into ArcGIS software, and the multiple ring buffer tool is selected. Buffer zones are established at 25 m, 50 m, 75 m, 100 m and 125 m for urban main roads; at 20 m, 40 m, 60 m, 80 m and 100 m for urban secondary main roads; at 30 m, 60 m, 90 m, 120 m and 150 m for bus stations; and at 50 m, 100 m, 150 m, 200 m and 250 m for subway stations. Because site selection of commercial network points is closely related to road traffic accessibility, and locations closer to public transport stations and main roads are more accessible and more suitable for commercial network points, the buffer zones are assigned the values 1, 2, 3, 4 and 5 from far to near, converting them into ordered categorical variables to facilitate the subsequent cluster analysis.
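The ring distances above map directly to a lookup from distance to ordered score, as in the following Python sketch. The dictionary keys and the `buffer_score` function are hypothetical names, and returning 0 for locations outside the outermost ring is an assumption of this sketch, since the text does not say how such locations are scored.

```python
from bisect import bisect_left

# Outer radius of each buffer ring in metres, as listed in the embodiment.
BUFFER_RINGS = {
    "main_road": [25, 50, 75, 100, 125],
    "secondary_road": [20, 40, 60, 80, 100],
    "bus_station": [30, 60, 90, 120, 150],
    "subway_station": [50, 100, 150, 200, 250],
}


def buffer_score(distance_m, rings):
    """Score 5 for the nearest ring down to 1 for the farthest ring.

    Returns 0 outside all rings (an assumption of this sketch).
    """
    ring = bisect_left(rings, distance_m)  # index of the first radius >= distance
    if ring >= len(rings):
        return 0
    return len(rings) - ring


# A location 60 m from a main road falls in the 50-75 m ring.
main_road_score = buffer_score(60, BUFFER_RINGS["main_road"])
```

A distance exactly on a ring boundary is counted in the inner ring, matching the usual convention that a buffer includes its outer edge.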

For the land utilization factor, the current land use map of Tianhe District is converted into vector data and imported into ArcGIS software. Commercial land parcels are assigned the value 5 and non-commercial land parcels the value 1, so that the higher the correlation with commercial land, the higher the land use classification value; the land use classification polygon features are thus obtained.

This completes the preliminary processing of the data for each clustering factor.

And S104, respectively linking the divided grids according to the preprocessed data, and counting the value of each influence factor in the index system.

In an implementation case, for the step S104, linking the grid of the river cell according to the preprocessed data, and counting the values of the various influencing factors, the specific implementation method is as follows:

converting the Chinese river course population reclassification grid map obtained in the step S103 into a surface element in a GIS by using a grid-to-surface element tool, carrying out spatial linkage on the surface element and the 300 m-300 m grids, and obtaining an average value of grid values in the grids according to corresponding ordered category variables and the linkage value of a single grid to obtain a graph 2;

spatially linking the element data of the route lines of the shared bicycle on the weekend of the sky river area obtained in the step S103 with the grids of 300m × 300m, and obtaining the total value of the route length in the grids according to the corresponding ordered category variables and the link value of the single grid to obtain a graph 3;

converting the riverzone shop rent reclassification grid map obtained in the step S103 into a surface element in a GIS by using a grid-to-surface element tool, carrying out spatial linkage on the surface element and the 300 m-300 m grids, and averaging the grid values in the grids according to the corresponding ordered category variables and the linkage value of each grid to obtain a graph 4;

converting the commercial aggregation reclassification grid map of catering, finance and shopping in the river region obtained in the step S103 into surface elements by using a grid-to-surface element tool in ArcGIS, performing spatial link with 300m × 300m divided grids in the river region, and taking the average value of grid values in the grids according to corresponding ordered category variables and link values of single grids to obtain a graph 5, a graph 6 and a graph 7;

applying a multi-factor weighted overlay analysis to the buffer zones of the main roads, secondary roads and public transport stations in Tianhe District obtained in step S103, so that the main road distance, secondary road distance, subway station distance and bus station distance are combined into a comprehensive traffic factor evaluation, wherein the evaluation model is as follows:

$$S = \sum_{i=1}^{n} w_i x_i \tag{7}$$

In formula (7), S is the final comprehensive traffic factor evaluation, w_i is the weight of the i-th factor, x_i is the i-th variable factor, and n is the number of factors (four here). The weight of the main road distance is 0.3, the weight of the secondary road distance is 0.2, the weight of the bus station distance is 0.2, and the weight of the subway station distance is 0.3, which gives the comprehensive traffic factor evaluation of Tianhe District. The comprehensive traffic factor evaluation raster is converted into polygon features with the raster-to-polygon tool and spatially linked with the 300 m × 300 m grids of Tianhe District, the link value of each grid being the average of the raster values within that grid, to obtain FIG. 8;

spatially linking the land-use classification polygon features of Tianhe District obtained in step S103 with the 300 m × 300 m grids, the link value of each grid being the average of the feature values within that grid, to obtain FIG. 9;
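The linking steps above reduce to two operations: a weighted sum per raster cell, and a per-grid aggregation (mean or total). The following is a minimal sketch of both in plain Python, not the GIS workflow itself. The field names (`grid_id`, `main_road`, and so on) and the sample scores are hypothetical; only the four weights come from the description.

```python
from collections import defaultdict

# Weights stated in the description for the four distance factors.
WEIGHTS = {"main_road": 0.3, "secondary_road": 0.2, "bus_stop": 0.2, "subway": 0.3}


def traffic_score(cell):
    """Weighted overlay S = sum(w_i * x_i) for one raster cell."""
    return sum(WEIGHTS[name] * cell[name] for name in WEIGHTS)


def link_to_grids(cells, value_of, how="mean"):
    """Aggregate raster-cell values to their enclosing grid by mean or sum."""
    buckets = defaultdict(list)
    for cell in cells:
        buckets[cell["grid_id"]].append(value_of(cell))
    if how == "sum":
        return {g: sum(v) for g, v in buckets.items()}
    return {g: sum(v) / len(v) for g, v in buckets.items()}


# Hypothetical reclassified distance scores for three raster cells
# that fall inside two 300 m x 300 m grids.
cells = [
    {"grid_id": 1, "main_road": 5, "secondary_road": 4, "bus_stop": 3, "subway": 5},
    {"grid_id": 1, "main_road": 3, "secondary_road": 4, "bus_stop": 3, "subway": 1},
    {"grid_id": 2, "main_road": 1, "secondary_road": 2, "bus_stop": 2, "subway": 1},
]

grid_traffic = link_to_grids(cells, traffic_score)  # mean per grid
print(grid_traffic)
```

Passing `how="sum"` gives the total-per-grid variant used for the shared-bicycle route length.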

S105, counting the value of each influencing factor in the index system according to the grid number, and analyzing with a two-step clustering algorithm.

All influencing factor values are collected into one table according to the numbers of the unit grids of Tianhe District and analyzed with a two-step clustering algorithm, and planning and site selection suggestions for businesses of different categories and scales in Tianhe District are given according to the clustering result.

In an implementation case, step S105 is carried out as follows:

The values from step S104 shown in FIGS. 2, 3, 4, 5, 6, 7, 8 and 9 are entered into a single Excel table according to the grid number, the table is imported into SPSS, and the two-step clustering tool is selected for analysis.
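As an illustration of collecting all factor values into one table keyed by grid number, the sketch below merges per-factor results into one row per grid and writes them as CSV text. It is only a stand-in for the Excel and SPSS steps in the source; the factor names and values are hypothetical.

```python
import csv
import io


def build_table(factors):
    """Merge {factor_name: {grid_id: value}} into one row per grid number."""
    grid_ids = sorted(set().union(*(f.keys() for f in factors.values())))
    rows = []
    for gid in grid_ids:
        row = {"grid_id": gid}
        for name, values in factors.items():
            row[name] = values.get(gid)  # None if the grid has no value
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text, one column per influencing factor."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical factor values for two grids.
rows = build_table({"rent": {1: 3, 2: 5}, "population": {1: 4}})
print(to_csv(rows))
```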

The two-step clustering algorithm comprises two stages:

A pre-clustering stage. Following the idea of CF-tree growth in the BIRCH algorithm, the data points in the data set are read one by one, and while the CF tree is being built, data points in dense regions are clustered in advance to form many small sub-clusters.

A clustering stage. The sub-clusters produced by the pre-clustering stage are taken as the input objects and merged step by step with an agglomerative method until the desired number of clusters is reached.

Both stages use a distance measure, mainly the Euclidean distance or the log-likelihood distance.
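The two stages can be illustrated with a deliberately simplified sketch. It makes several assumptions that differ from SPSS: a flat list of sub-clusters replaces the CF tree, only continuous variables are handled, the Euclidean distance between centroids is used instead of the log-likelihood distance, and the `threshold` parameter is hypothetical.

```python
import math


def centroid(sub):
    """Centroid of a sub-cluster stored as [count, linear_sum, member_indices]."""
    return [v / sub[0] for v in sub[1]]


def pre_cluster(points, threshold):
    """Stage 1: one sequential pass; a point joins the nearest sub-cluster
    if it lies within `threshold`, otherwise it starts a new sub-cluster."""
    subs = []
    for idx, p in enumerate(points):
        best, best_d = None, threshold
        for s in subs:
            d = math.dist(p, centroid(s))
            if d <= best_d:
                best, best_d = s, d
        if best is None:
            subs.append([1, list(p), [idx]])
        else:
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], p)]
            best[2].append(idx)
    return subs


def merge_to_k(subs, k):
    """Stage 2: repeatedly merge the two sub-clusters with the closest centroids."""
    subs = [list(s) for s in subs]
    while len(subs) > k:
        pairs = [(i, j) for i in range(len(subs)) for j in range(i + 1, len(subs))]
        i, j = min(pairs, key=lambda ij: math.dist(centroid(subs[ij[0]]),
                                                   centroid(subs[ij[1]])))
        a, b = subs[i], subs[j]
        merged = [a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2]]
        subs = [s for n, s in enumerate(subs) if n not in (i, j)] + [merged]
    return [sorted(s[2]) for s in subs]


# Hypothetical two-variable observations.
points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (6, 6)]
subs = pre_cluster(points, threshold=1.0)
clusters = merge_to_k(subs, k=2)
print(clusters)
```

Storing only the count and the linear sum per sub-cluster mirrors the idea behind a CF entry: the centroid can be updated without keeping every point.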

The Euclidean distance is the distance between two class centers, where a class center is the mean of all variables within a class. Assume a data set Q with m samples, each with n variable indices. Then:

$$Q = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

In this matrix (which is not stored during the calculation), $x_{ij}$ is the observed value of the j-th variable of the i-th sample ($1 \le i \le m$; $1 \le j \le n$), and the observation of each sample, $x_i = (x_{i1}, x_{i2}, \cdots, x_{ik}, \cdots, x_{in})$, can be regarded as a point in n-dimensional space. Before clustering, k observations are selected (or set by the system) as the initial cluster centers, and every observation is assigned to the class of its nearest center according to the minimum-distance principle, forming the k classes of the first iteration. The mean of each variable is then computed from the observations in each class; the n means of each class form k points in n-dimensional space, and these k points are the class centers for the second iteration. Iteration continues in this way until the specified number of iterations is reached or the stopping criterion is met, at which point clustering ends.
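The iterative procedure just described (choose k initial centers, assign each observation to the nearest center, recompute the class means, repeat) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the SPSS routine: the stopping criterion used here is "assignments no longer change", and the sample data are hypothetical.

```python
import math


def k_means(samples, centers, max_iter=100):
    """Assign by minimum Euclidean distance, recompute class means, repeat."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
            for x in samples
        ]
        if new_labels == labels:  # stopping criterion: assignments are stable
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == c]
            if members:  # keep the old center if a class becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical observations with n = 2 variables; the first and third
# observations serve as the k = 2 initial centers.
samples = [(1, 1), (1, 2), (8, 8), (9, 8)]
labels, centers = k_means(samples, [samples[0], samples[2]])
print(labels, centers)
```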

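In this process, the Euclidean distance is denoted $d_{ij}$ and is calculated as the square root of the squared Euclidean distance:

$$d_{ij} = \sqrt{\sum_{k=1}^{n} \left(x_{ik} - x_{jk}\right)^2}$$

where $x_{ik}$ and $x_{jk}$ are the values of the k-th variable at the two points being compared.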

The log-likelihood distance can handle both continuous and categorical variables. It is a probability-based distance: the distance between two classes is related to the decrease in log-likelihood that occurs when the two classes are merged into one. When the log-likelihood is calculated, continuous variables should ideally follow a normal distribution, categorical variables a multinomial distribution, and the variables are assumed to be independent of one another. The distance between class j and class s is defined as d(j, s):

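$$d(j,s) = \xi_j + \xi_s - \xi_{\langle j,s \rangle} \tag{9}$$

where $\xi_{\langle j,s \rangle}$ refers to the class formed by merging class j and class s.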
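The definition of ξ is not reproduced in this text. For reference, the published algorithm description of the SPSS TwoStep Cluster procedure defines it as below; this is offered as an assumption about the intended definition, not as a quotation from the present document.

```latex
\xi_v = -N_v \left( \sum_{k=1}^{K^A} \frac{1}{2} \log\left( \hat{\sigma}_k^2 + \hat{\sigma}_{vk}^2 \right)
        + \sum_{k=1}^{K^B} \hat{E}_{vk} \right),
\qquad
\hat{E}_{vk} = - \sum_{l=1}^{L_k} \frac{N_{vkl}}{N_v} \log \frac{N_{vkl}}{N_v}
```

Here $N_v$ is the number of records in class v, $\hat{\sigma}_k^2$ is the variance of the k-th continuous variable over the whole data set, $\hat{\sigma}_{vk}^2$ is its variance within class v, and $N_{vkl}$ is the number of records in class v whose k-th categorical variable takes its l-th category. $K^A$, $K^B$ and $L_k$ are the numbers of continuous variables, categorical variables and categories of the k-th categorical variable, respectively.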

In this process, the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) is calculated for each candidate number of classes to obtain an initial estimate of the number of classes, and the final number of clusters is the one that maximizes the distance between the two closest classes within the initial estimate. Assuming that the number of clusters is J, the calculation formula is as follows:

wherein N is the total number of observations, $K^A$ is the total number of continuous variables used in the process, $K^B$ is the total number of categorical variables used in the process, and $L_k$ is the number of categories of the k-th categorical variable.
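The formula itself is missing from this text. A form consistent with the symbols defined above is the one given in the published SPSS TwoStep Cluster algorithm description, shown here as an assumption about what the original formula contained:

```latex
\mathrm{BIC}(J) = -2 \sum_{j=1}^{J} \xi_j + m_J \log N,
\qquad
\mathrm{AIC}(J) = -2 \sum_{j=1}^{J} \xi_j + 2 m_J,
\qquad
m_J = J \left( 2 K^A + \sum_{k=1}^{K^B} \left( L_k - 1 \right) \right)
```

As a purely illustrative calculation, with one continuous variable ($K^A = 1$), seven categorical variables ($K^B = 7$) and a hypothetical five categories per categorical variable, $m_J = J(2 + 7 \times 4) = 30J$.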

Categorical variables selected: 7 sub-items, namely the shop rent factor, catering business concentration, shopping business concentration, financial business concentration, traffic factor evaluation, population density factor evaluation and land-use factor evaluation.

Continuous variable selected: 1 sub-item, namely the shared-bicycle travel path length factor.

The analysis results are shown in FIG. 10, FIG. 11, FIG. 12 and Table 2 below. The model summary and the cluster quality are shown in FIG. 10 and FIG. 11, respectively. The clustering result is divided into 6 classes, and the sub-item factors of each class are listed in Table 2. A clustering result table is generated in SPSS, as shown in FIG. 12, in which each class number corresponds to the ID number of the corresponding 300 m × 300 m grid; this table is re-imported into ArcGIS to obtain the final result shown in FIG. 13.

TABLE 2 Sub-item factors of each class
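A per-class summary of the kind reported in Table 2 can be derived by joining the class numbers back to the grid table by grid ID. The sketch below assumes a hypothetical layout: `rows` holds one record per grid, `labels` maps grid ID to class number, categorical factors are summarised by their most common level, and continuous factors by their mean. The source does not state how Table 2 was computed, so this is illustrative only.

```python
from collections import Counter, defaultdict
from statistics import mean


def cluster_profile(rows, labels, categorical, continuous):
    """Per class: most common level of each categorical factor and
    mean of each continuous factor."""
    groups = defaultdict(list)
    for row in rows:
        groups[labels[row["grid_id"]]].append(row)
    profile = {}
    for cluster, members in sorted(groups.items()):
        summary = {"n_grids": len(members)}
        for name in categorical:
            summary[name] = Counter(m[name] for m in members).most_common(1)[0][0]
        for name in continuous:
            summary[name] = mean(m[name] for m in members)
        profile[cluster] = summary
    return profile


# Hypothetical grid records and class numbers.
rows = [
    {"grid_id": 1, "rent": 3, "bike_km": 1.0},
    {"grid_id": 2, "rent": 3, "bike_km": 2.0},
    {"grid_id": 3, "rent": 1, "bike_km": 0.5},
]
labels = {1: 1, 2: 1, 3: 2}
profile = cluster_profile(rows, labels, categorical=["rent"], continuous=["bike_km"])
print(profile)
```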

And S106, giving site selection suggestions of different types and different scales of commercial network points according to the analysis result of the two-step clustering algorithm.

Based on the spatial distribution in FIG. 13 and the data in FIG. 12, the site selection suggestions for commercial network points of different categories and scales are shown in Table 3 below:

TABLE 3 location suggestions for different types and different sizes of commercial outlets

It should be noted that while the method operations of the above-described embodiments are described in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, or one step broken down into multiple step executions.

Example 2:

As shown in FIG. 14, this embodiment provides a commercial network site selection system based on open source data mining, which includes a data acquisition module 1401, a grid division module 1402, a data preprocessing module 1403, a statistics module 1404, a statistics and analysis module 1405, and a site selection suggestion module 1406, where the specific functions of the modules are as follows:

a data acquisition module 1401, configured to acquire data of a target area through a multi-source data open platform;

a grid division module 1402, configured to perform grid division and numbering on the target area, and construct an index system for clustering and addressing according to the obtained data;

a data preprocessing module 1403, configured to preprocess the data of the target region;

a statistics module 1404, configured to link the preprocessed data with the divided grids respectively, and count the value of each influencing factor in the index system;

a statistics and analysis module 1405, configured to perform statistics on the value of each influence factor in the index system according to the number of the grid, and perform analysis by using a two-step clustering algorithm;

a site selection suggestion module 1406, configured to give site selection suggestions for commercial network points of different categories and scales according to the analysis result of the two-step clustering algorithm.

For the specific implementation of each module in this embodiment, refer to embodiment 1; it is not repeated here. It should be noted that the system provided in this embodiment is illustrated only by the above division of functional modules; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.
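The module division can be pictured as a simple sequential pipeline in which each module consumes the output of the previous one. The sketch below is a hypothetical skeleton only: the stub modules merely record that they ran, and the key names are invented for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Module = Callable[[Context], Context]


def run_modules(modules: List[Tuple[int, Module]], context: Context) -> Context:
    """Run the modules in order; each receives the previous module's output."""
    for number, module in modules:
        context = module(dict(context))
        context["trace"] = context.get("trace", []) + [number]
    return context


def stub(key: str) -> Module:
    """Hypothetical stand-in that only records that its step ran."""
    def module(ctx: Context) -> Context:
        ctx[key] = True
        return ctx
    return module


# Reference numerals follow FIG. 14; the behaviour is a placeholder.
modules = [
    (1401, stub("data_acquired")),
    (1402, stub("grids_divided")),
    (1403, stub("data_preprocessed")),
    (1404, stub("factors_counted")),
    (1405, stub("clusters_computed")),
    (1406, stub("suggestions_given")),
]
result = run_modules(modules, {"target_area": "example area"})
print(result["trace"])
```

Because each module is just a callable, functions can be regrouped into fewer or more modules without changing the runner, which matches the note that the module division is illustrative.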

Example 3:

This embodiment provides a computer device, which may be a computer. As shown in FIG. 15, it includes a processor 1502, a memory, an input device 1503, a display 1504 and a network interface 1505 connected by a system bus 1501. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium 1506 and an internal memory 1507; the non-volatile storage medium 1506 stores an operating system, a computer program and a database, and the internal memory 1507 provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. When the processor 1502 executes the computer program stored in the memory, the commercial network site selection method of embodiment 1 is implemented as follows:

acquiring data of a target area through a multi-source data open platform;

preprocessing the data of the target area;

Example 4:

This embodiment provides a storage medium, which is a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the commercial network site selection method of embodiment 1 is implemented as follows:

acquiring data of a target area through a multi-source data open platform;

preprocessing the data of the target area;

It should be noted that the computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In summary, the invention obtains multiple items of index data through the multi-source data open platform, counts the value of each influencing factor in each index, counts the values of all the influencing factors into one table, applies two-step clustering algorithm analysis, and gives business site selection suggestions of different types and different scales according to the clustering algorithm analysis result. The method can provide assistance and reference for planning and site selection of different types and scales of commercial network points of cities.

The above embodiments represent only possible embodiments of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A commercial network site selection method based on open source data mining, characterized by comprising the following steps:

acquiring data of a target area through a multi-source data open platform;

preprocessing the data of the target area;

2. The method as claimed in claim 1, wherein the grid division and numbering of the target area are performed, each grid is a basic unit for site selection, and an index system for clustering site selection is constructed according to the obtained data, specifically comprising:

3. The method of claim 1, wherein the data of the target area comprises population density, shared bicycle travel, shop rent, urban road traffic, business POI, land utilization data; the business POI data comprises catering business POIs, financial business POIs and shopping business POI data;

the preprocessing the data of the target area specifically includes:

4. The method as claimed in claim 3, wherein the step of respectively linking the divided grids according to the preprocessed data and counting the value of each influencing factor in the index system comprises:

5. The commercial network site selection method according to claim 4, characterized in that the multi-factor weighted overlay analysis method is specifically:

6. The method as claimed in claim 4, wherein counting the value of each influencing factor in the index system according to the grid number and analyzing with the two-step clustering algorithm specifically comprises:

obtaining a final result graph according to the clustering result table;

7. The commercial network site selection method according to any one of claims 1 to 6, wherein the two-step clustering algorithm comprises a pre-clustering stage and a clustering stage, both of which use distance measures;

8. A commercial network site selection system based on open source data mining, the system comprising:

9. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the method of any one of claims 1 to 7.

10. A storage medium storing a program which, when executed by a processor, implements the method of any one of claims 1 to 7.