CN108427965B

CN108427965B - Hot spot area mining method based on road network clustering

Info

Publication number: CN108427965B
Application number: CN201810179464.1A
Authority: CN
Inventors: 仇国庆; 赵婉滢; 马俊; 张少昀
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2018-03-05
Filing date: 2018-03-05
Publication date: 2022-08-23
Anticipated expiration: 2038-03-05
Also published as: CN108427965A

Abstract

The present invention discloses a travel hotspot area mining method based on road network trajectory clustering. Taxi trajectories are mapped onto the road network, and points of interest are clustered together with trajectories collected from actual roads. Building on the density peak clustering algorithm, the method proposes DP-OPAM, an OPAM algorithm whose initial centers are optimized by density peaks. The algorithm computes the local density of each data point and its shortest distance to any point of higher density, then uses a decision graph to select the points for which both values are high as the initial cluster centers. Starting from these centers, the OPAM clustering algorithm with reverse learning produces the clustering result. Compared with the original OPAM algorithm, the new algorithm determines the cluster centers automatically, improves clustering accuracy and running time, and enables analysis of user travel hotspots.

Description

A method for mining hot spots based on road network clustering

Technical Field

The invention belongs to the field of data mining, and in particular relates to a taxi trajectory clustering method based on a road network.

Background Art

Intelligent transportation is a focus of transportation development worldwide. While supporting transportation management, it places growing emphasis on meeting the travel needs of the public and of public transit users. In recent years, intelligent transportation systems have developed rapidly, and many advanced technologies have been widely applied in them. The widespread use of GPS devices has made trajectory extraction more convenient. These devices collect large volumes of moving-position sequences and vehicle status information, and this data carries rich information about traffic and user behavior. By analyzing and mining trajectory data, we can understand traffic conditions, plan trips sensibly, discover crowd behavior patterns, and help improve traffic conditions.

Taxi trajectories cover urban road network traffic comprehensively. They reflect real-time traffic density and flow as well as the travel patterns and regional characteristics of the population. Analyzing massive taxi trajectory data therefore uncovers deeper information hidden in the data. With data mining techniques, one can derive an overall characterization of the data and forecast how the traffic situation will develop, which plays a major role in supporting traffic monitoring and road control by traffic management departments.

Cluster analysis is a commonly used data mining technique. It serves as a tool for obtaining the distribution of the data, makes it easy to observe the characteristics of each cluster, and allows further analysis to focus on specific clusters. It can also serve as a preprocessing step for other algorithms, such as classification and qualitative induction algorithms. Trajectory clustering of moving objects discovers their movement regularities and behavior patterns by finding similar trajectories and extracting motion features. A taxi trajectory consists of a discrete sequence of points. When measuring trajectory similarity, traditional trajectory cluster analysis mostly considers the straight-line distance between points and ignores whether that distance is actually reachable in the real world.

There are two main approaches to cluster analysis of vehicle trajectories. The first classifies and compares each whole trajectory as a single object. The second divides trajectories into sub-trajectory segments according to some criterion and classifies the resulting segments. The first approach is simple and makes it easy to evaluate the similarity between trajectories intuitively, but it cannot distinguish local features of trajectories well, and its clustering results are often unsatisfactory. The second approach mitigates that problem and clusters trajectories of different shapes better. Its drawback is that the segmentation method strongly affects the clustering result, and different segmentation methods can produce very different results.

Summary of the Invention

The present invention aims to solve the above problems of the prior art. It proposes a hotspot area mining method based on road network clustering that significantly improves the clustering result and enables mining of user travel areas. The technical solution of the present invention is as follows:

A hotspot area mining method based on road network clustering, comprising the following steps:

Step 1: Collect a taxi trajectory data set and preprocess it, including data standardization and normalization; retain the valid fields and delete redundant data to obtain the preprocessed pick-up and drop-off trajectory points.

Step 2: Determine the latitude and longitude range of the city, and extract the city's points of interest, including shopping malls and schools, from an open source website.

Step 3: Obtain the road network information of the city and map the trajectory points onto the road network.

Step 4: Select 80% of the pick-up and drop-off trajectory points preprocessed in step 1 as the training set, and use the improved OPAM algorithm, whose initial centers are optimized by density peaks, to cluster the areas that represent pick-up and drop-off hotspots. The main improvement is that the initial cluster centers are selected using density peaks, which makes the selection of initial points more accurate and convenient. The remaining 20% serve as the test set, used to test the clustering performance of the model built from the 80% training set.

Step 5: Input the points of interest with road network information collected in step 2 into the model of step 4 and cluster them to obtain the residents' hotspot activity areas with road network characteristics; compare the clustering result with the collected points of interest to determine the hotspot areas of residents' travel.

Further, step 1 is specifically: first collect the taxi trajectory data set of the city for a given month, and select one week of trajectory data in which the city's data volume is relatively concentrated; then preprocess the data, retaining valid fields such as the latitude and longitude of the pick-up and drop-off trajectory points and the pick-up and drop-off times, and deleting redundant data.

Further, in step 2, determining the latitude and longitude range of the city and extracting the city's points of interest, including shopping malls and schools, from an open source website is specifically:

First, enter the latitude and longitude range of the target city on the open source website openstreetmap and download the map of the entire city. In the exported OSM map data, a way represents a user's movement trajectory and a node represents a path. Nodes tagged residence, school, or shop are selected to represent points of interest.

Further, in step 3, obtaining the road network information of the city and mapping the trajectory points onto the road network is specifically:

Electronic map data is obtained through the TAREEG web service, and the road network information of the city is extracted from it. After the urban road network data has been extracted, the ST-Matching model projects the GPS trajectories obtained above onto the road network map, which yields, for each road segment e, the trajectory points p_i, …, p_j at the (j-1+1) consecutive moments during which the driver is on that segment.

Further, step 4 is specifically: first select 80% of the processed pick-up and drop-off trajectory points as the training set, and use the improved OPAM algorithm (partitioning around medoids with reverse learning) to cluster the areas that represent pick-up and drop-off hotspots. The improved OPAM algorithm has three stages. The first stage is initialization: construct the decision graph and select the density peak points in the upper-right region, far from most samples, as the initial cluster centers; the number of density peak points is the number of clusters k. The second stage constructs the initial cluster centers: for each point in the data set, find the minimum of its distances to the cluster centers, assign the remaining sample points to the nearest initial cluster center to form the initial partition, and compute the clustering sum of squared errors. The third stage applies reverse learning and feeds the result into the partitioning around medoids (PAM) clustering algorithm: the k clusters obtained by the standard PAM algorithm and the k reverse clusters obtained by reverse learning are combined to give k×k cluster combinations, and the combination with the largest silhouette coefficient is selected.

Further, the steps of the PAM algorithm are as follows:

(1) Arbitrarily select k elements from the given data set D and mark them as the initial representative objects, or seeds, o_j;

(2) Using the Euclidean distance, compute the distance between every non-representative object o_i in D and each of the k representative objects, and assign o_i to the cluster of its nearest representative object;

(3) Arbitrarily select a non-representative object o_random;

(4) Compute the total cost S:

$S = \mathrm{dist}(p, o_{random}) - \mathrm{dist}(p, o_j)$

(5) If the total cost S < 0, the non-representative object o_random is the better solution; o_random replaces o_j to form a new set of k representative objects, and the procedure returns to step (2) for a new round of object assignment;

(6) If the total cost S > 0, the representative object o_j is the better solution; go to step (3) and select another non-representative object for the total-cost comparison, until the total cost S no longer changes, which yields the k clusters with the minimum total cost.
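The swap loop in steps (1) to (6) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `total_cost` and `pam` are hypothetical, the cost of a medoid set is taken as the sum of distances from every point to its nearest medoid, and swap candidates are tried exhaustively instead of at random so that the result is deterministic.

```python
import math


def total_cost(points, medoids):
    """Sum of Euclidean distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, initial_medoids):
    """Swap-based PAM: replace a medoid whenever the swap lowers the cost."""
    medoids = list(initial_medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(len(medoids)):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                # S = cost(trial) - cost(current); accept the swap only if S < 0
                if cost - best < -1e-12:
                    medoids, best, improved = trial, cost, True
    # Final assignment of every point to its nearest medoid
    labels = [min(range(len(medoids)),
                  key=lambda c: math.dist(p, points[medoids[c]]))
              for p in points]
    return medoids, labels
```

Passing density-peak indices as `initial_medoids` instead of arbitrary ones corresponds to the improvement the invention describes for the initialization stage.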

The advantages and beneficial effects of the present invention are as follows:

The invention combines the analysis of residents' travel hotspots with the road network, clustering the regions of interest in the actual road network together with the original clusters. When the original clusters are aggregated again into new clusters, the region-of-interest features contained in the new clusters indicate the hotspot areas of residents' travel, which overcomes the temporal and spatial shortcomings of clustering in Euclidean space. The method determines the initial centers by constructing a decision graph based on density peaks, which reduces computation and raises clustering accuracy. Clustering special points of interest together with trajectories also addresses data sparsity and heavy computation, and enables analysis of users' trajectory behavior.

Brief Description of the Drawings

Fig. 1 is a flow chart of the PAM clustering algorithm in a preferred embodiment of the present invention;

Fig. 2 is a flow chart of the OPAM clustering algorithm.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.

The technical solution by which the present invention solves the above technical problems is:

As shown in Fig. 2, the specific steps of the hotspot area mining method, which combines the road network with the OPAM clustering algorithm whose initial centers are optimized by density peaks, are as follows:

Step 1: Collect the taxi trajectory data set of a city for a given month, and select one week of trajectory data in which the city's data volume is relatively concentrated. Preprocess the data: retain valid fields such as the latitude and longitude of the pick-up and drop-off trajectory points and the pick-up and drop-off times, and delete redundant data.
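A minimal sketch of this preprocessing, assuming each raw record is a dictionary: the field names `lat`, `lon`, `time`, and `event` are hypothetical, and "redundant data" is interpreted here as extra fields, incomplete records, and exact duplicates.

```python
def preprocess(records):
    """Keep only the valid pick-up/drop-off fields and drop redundant records."""
    seen, cleaned = set(), []
    for rec in records:
        key = (rec.get("lat"), rec.get("lon"), rec.get("time"), rec.get("event"))
        if None in key or key in seen:
            continue  # skip incomplete or duplicate records
        seen.add(key)
        cleaned.append({"lat": key[0], "lon": key[1],
                        "time": key[2], "event": key[3]})
    return cleaned
```

Any standardization or normalization of coordinates would follow this filtering step; it is left out here because the text does not specify the scheme.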

Step 2: Enter the latitude and longitude range of the target city on the open source website openstreetmap and download the map of the entire city. In the exported OSM map data, a way represents a user's movement trajectory and a node represents a path. Because way objects in the OSM source data record user movement trajectories, a way does not only represent road information; it can also represent building information, such as the outline of a building. The way objects therefore need to be filtered to remove unneeded way information. In addition, the downloaded map source data contains a very large number of node sampling points, whereas route planning in practice only needs the intersections where ways cross. Ways and nodes carry many tags representing different attributes. Among ways, only those whose highway attribute has one of the values residential, service, living_street, unclassified, trunk, trunk_link, secondary, secondary_link, primary, tertiary, or tertiary_link are kept; among nodes, duplicates that appear in different ways but are the same node are deleted. Nodes tagged residence, school, or shop are selected to represent points of interest.
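The tag filtering described in step 2 might look like the following sketch, which parses an OSM XML export with Python's standard `xml.etree.ElementTree`. The function name `filter_osm` and the inline `SAMPLE` document are hypothetical, and treating a node as a point of interest when any tag key or value matches is an assumption, since the text does not say which tag key carries the label.

```python
import xml.etree.ElementTree as ET

# highway values kept for the road network, as listed in step 2
ROAD_TYPES = {
    "residential", "service", "living_street", "unclassified", "trunk",
    "trunk_link", "secondary", "secondary_link", "primary", "tertiary",
    "tertiary_link",
}
# Labels used to mark points of interest, as listed in step 2
POI_LABELS = {"residence", "school", "shop"}


def filter_osm(xml_text):
    """Return (road_way_ids, poi_node_ids) from an OSM XML export."""
    root = ET.fromstring(xml_text)
    road_ways, poi_nodes = [], []
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if tags.get("highway") in ROAD_TYPES:
            road_ways.append(way.get("id"))
    for node in root.iter("node"):
        tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
        # Assumption: a node is a POI if any tag key or value matches a label
        if POI_LABELS & (set(tags) | set(tags.values())):
            poi_nodes.append(node.get("id"))
    return road_ways, poi_nodes


SAMPLE = """<osm>
  <node id="1" lat="29.5" lon="106.5"><tag k="amenity" v="school"/></node>
  <node id="2" lat="29.6" lon="106.6"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="primary"/></way>
  <way id="11"><nd ref="1"/><nd ref="2"/><tag k="building" v="yes"/></way>
</osm>"""
```

On `SAMPLE`, the building outline (way 11) is dropped and only the primary road and the school node remain.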

Step 3:

The drivers' GPS trajectories collected above contain instantaneous position coordinates, but they do not directly show which road segment the driver is on at each moment. Electronic map data is obtained through the TAREEG web service, and the road network information of the city is extracted from it. After the urban road network data has been extracted, the ST-Matching model projects the GPS trajectories onto the road network map, which yields, for each road segment e, the trajectory points p_i, …, p_j at the (j-1+1) consecutive moments during which the driver is on that segment.
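ST-Matching itself combines spatial and temporal analysis over candidate road segments and is not reproduced here. The sketch below only illustrates the geometric building block that any such map matcher relies on: projecting a GPS point onto its nearest road segment. The function names are hypothetical, and coordinates are treated as planar, which is an approximation for latitude and longitude.

```python
import math


def project_to_segment(p, a, b):
    """Closest point to p on the segment a-b, in planar coordinates."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return a  # degenerate segment: both endpoints coincide
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp the projection to the segment
    return (ax + t * dx, ay + t * dy)


def snap_to_network(p, segments):
    """Return (segment_index, projected_point) for the nearest road segment."""
    candidates = ((i, project_to_segment(p, a, b))
                  for i, (a, b) in enumerate(segments))
    return min(candidates, key=lambda item: math.dist(p, item[1]))
```

A full map matcher would additionally score candidate segments by road topology and travel speed between consecutive points before choosing the path.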

Step 4:

Select 80% of the processed pick-up and drop-off trajectory points as the training set, and use the improved OPAM algorithm to cluster the areas that represent pick-up and drop-off hotspots. The DP-OPAM algorithm has three stages: initialization, construction of the initial cluster centers, and reverse learning followed by the PAM algorithm.
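The 80/20 split can be sketched as follows. The function name `split_train_test` and the fixed random seed are assumptions made for reproducibility; the text does not say how the 80% are chosen.

```python
import random


def split_train_test(points, train_ratio=0.8, seed=0):
    """Shuffle the points and split them into training and test sets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```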

Let i denote a sample data point and ρ_i its local density. The local density ρ_i of data point i is computed with the cutoff kernel used in density peak clustering:

$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c)$

where

$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$

and d_c is the cutoff distance. For large data sets, the local density is essentially a relative density between data points, so the result is robust to the choice of d_c. Define δ_i as the minimum distance from data point i to any point of higher density:

$\delta_i = \min_{j:\,\rho_j > \rho_i} (d_{ij})$

The point with the largest local density needs special treatment; its value is generally taken as:

$\delta_i = \max_j (d_{ij})$
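The two quantities can be computed directly from the pairwise distances. The sketch below assumes the cutoff kernel, in which ρ_i counts the other points closer than d_c; the function name `density_peaks` is hypothetical.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta) for each point using the cutoff kernel."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # rho_i: number of other points closer than the cutoff distance d_c
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour takes the maximum distance instead
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta
```

For four collinear points at x = 0, 1, 2, 10 with d_c = 1.5, the middle point of the tight group gets the highest ρ and a large δ, while the isolated point at x = 10 gets a large δ but ρ = 0, which is how the decision graph separates centers from outliers.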

First stage: initialization

(1) Compute the distance matrix D = {d_ij}, i, j = 1, ..., n, between all data points, and determine the cutoff distance d_c.

(2) Compute the local density ρ_i of each sample from the local density definition given above, and compute its high-density distance δ_i using

$\delta_i = \min_{j:\,\rho_j > \rho_i} (d_{ij})$

(3) Construct the decision graph with ρ on the horizontal axis and δ on the vertical axis. Select the data points whose local density ρ and high-density distance δ are both high, that is, the density peak points in the upper-right region that lie clearly apart from most samples, as the initial cluster centers. The number of density peak points is the number of clusters k.
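The text selects centers by inspecting the decision graph. One simple way to automate that choice, shown here purely as an assumption, is to apply thresholds to both axes; `rho_min` and `delta_min` are hypothetical parameters that the user would read off the graph.

```python
def select_centers(rho, delta, rho_min, delta_min):
    """Indices of points lying in the upper-right region of the decision graph."""
    return [i for i, (r, d) in enumerate(zip(rho, delta))
            if r >= rho_min and d >= delta_min]
```

The number of indices returned plays the role of the cluster count k. A point with large δ but small ρ is rejected, which keeps isolated outliers from becoming centers.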

Second stage: construction of the initial cluster centers

(1) For each point in the data set, find the minimum of its distances to the cluster centers, assign each remaining sample point to its nearest initial cluster center to form the initial partition, and compute the clustering sum of squared errors.
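The initial partition and its sum of squared errors can be sketched as below, assuming centers are given as indices into the point list; the function name `assign_and_sse` is hypothetical.

```python
import math


def assign_and_sse(points, centers):
    """Assign each point to its nearest center; return (labels, SSE)."""
    labels, sse = [], 0.0
    for p in points:
        dists = [math.dist(p, points[c]) for c in centers]
        k = dists.index(min(dists))  # position of the nearest center
        labels.append(k)
        sse += dists[k] ** 2
    return labels, sse
```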

Third stage: reverse learning and the PAM algorithm

(1) Apply reverse learning to the k original clusters obtained above to obtain the k corresponding reverse clusters.

(2) Combine the k clusters obtained by the standard PAM clustering algorithm with the k reverse clusters obtained by reverse learning to form k×k cluster combinations.

(3) For each cluster combination, compute the intra-cluster distance a(o), the inter-cluster distance b(o), and the silhouette coefficient s(o); compare s_1, s_2, ..., s_{k×k} and select the cluster combination with the largest silhouette coefficient.
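The silhouette comparison in (3) can be sketched with the usual definition s(o) = (b(o) - a(o)) / max(a(o), b(o)), where a(o) is the mean distance to the other members of o's cluster and b(o) is the smallest mean distance to any other cluster. The function name `silhouette`, the averaging over all points, and the score of 0 for singleton clusters are assumptions; the text does not spell these details out.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient s(o) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a(o): mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(o): smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Evaluating this function on each of the k×k candidate labelings and keeping the one with the highest value corresponds to the selection rule in (3).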

The steps of the PAM algorithm are as follows:

(4) Compute the total cost S:

$S = \mathrm{dist}(p, o_{random}) - \mathrm{dist}(p, o_j)$

Step 5:

Input the points of interest with road network information collected in step 2 into the model of step 4. Measure the similarity between the k clusters produced by the clustering and the representative points of interest collected, and determine which points of interest the residents' travel areas belong to, thereby identifying the residents' hotspot areas. The clustering yields the residents' hotspot activity areas with road network characteristics. Compare the clustering result with the collected points of interest to determine the hotspot areas of residents' travel.

For the trajectories tr_a and tr_b to be measured, the Hausdorff distance is used to measure trajectory similarity:

$H(tr_a, tr_b) = \max\{h(tr_a, tr_b),\ h(tr_b, tr_a)\}$

where

$h(tr_a, tr_b) = \max_{p \in tr_a} \min_{q \in tr_b} \mathrm{dist}(p, q)$

The Hausdorff distance computes, for each point on either trajectory, the minimum distance to all points on the other trajectory, and then takes the maximum over the resulting sets of minima. When the distance is below the similarity threshold, the trajectory is considered spatially similar to the point of interest and is saved to the candidate set. Trajectories in the candidate set whose distance exceeds a given threshold are deleted, which leaves the road network points of interest closest to the trajectories, that is, the hotspot areas of residents' travel.
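The Hausdorff computation described above can be written as a direct translation of the max-min definition. The function names are hypothetical and trajectories are assumed to be lists of planar coordinate tuples.

```python
import math


def directed_hausdorff(tr_a, tr_b):
    """h(tr_a, tr_b): largest nearest-neighbour distance from tr_a to tr_b."""
    return max(min(math.dist(p, q) for q in tr_b) for p in tr_a)


def hausdorff(tr_a, tr_b):
    """H(tr_a, tr_b) = max{h(tr_a, tr_b), h(tr_b, tr_a)}."""
    return max(directed_hausdorff(tr_a, tr_b), directed_hausdorff(tr_b, tr_a))
```

The directed distance is asymmetric: a trajectory that is a subset of another has h = 0 in one direction but not the other, which is why the symmetric H takes the maximum of both.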

The above embodiments should be understood as merely illustrating the present invention and not as limiting its scope of protection. After reading the contents of this description, those skilled in the art may make various changes or modifications to the present invention, and such equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.

Claims

1. A hotspot area mining method based on road network clustering, characterized by comprising the following steps:

step 1: collecting a taxi trajectory data set, performing data preprocessing including data standardization and normalization, retaining valid fields, and deleting redundant data to obtain preprocessed pick-up and drop-off trajectory points;

step 2: determining the latitude and longitude range of a city, and extracting points of interest of the city, including shopping malls and schools, from an open source website;

step 3: obtaining the road network information of the city, and mapping the trajectory points onto the road network;

step 4: selecting 80% of the pick-up and drop-off trajectory points preprocessed in step 1 as a training set, and clustering the areas representing pick-up and drop-off hotspots using an improved partitioning around medoids clustering algorithm based on reverse learning, wherein the improvement is: selecting the initial cluster centers using density peaks; the remaining 20% is used as a test set to test the clustering performance of the model built from the 80% training set;

step 5: inputting the points of interest with road network information acquired in step 3 into the model of step 4, clustering to obtain the residents' hotspot activity areas with road network characteristics, comparing the clustering result with the acquired points of interest, and determining the hotspot areas of residents' travel;

wherein step 4 specifically comprises: first selecting 80% of the processed pick-up and drop-off trajectory points as a training set, and clustering the areas representing pick-up and drop-off hotspots using the improved partitioning around medoids clustering algorithm based on reverse learning, the improved algorithm being divided into three stages: the first stage is initialization, in which a decision graph is constructed and the density peak points in the upper-right region far from most samples are selected as the initial cluster centers, the number of density peak points being the number of clusters k; the second stage constructs the initial cluster centers, in which the minimum distance between each point in the data set and the cluster centers is calculated, the remaining sample points are assigned to the nearest initial cluster center to form an initial partition, and the clustering sum of squared errors is calculated; the third stage applies reverse learning and the partitioning around medoids clustering algorithm, in which the k clusters obtained by the partitioning around medoids clustering algorithm and the k reverse clusters obtained by reverse learning are combined to obtain k×k cluster combinations, and the cluster combination with the largest silhouette coefficient is found;

the steps of the partitioning around medoids clustering algorithm are as follows:

(1) arbitrarily selecting k elements from a given data set D, and marking the selected k elements as initial representative objects, or seeds, o_j;

(2) calculating, using the Euclidean distance, the distance between any non-representative object o_i in the data set D and the k representative objects, and assigning o_i to the cluster represented by its nearest representative object;

(3) arbitrarily selecting a non-representative object o_random;

(4) calculating the total cost S:

$S = \mathrm{dist}(p, o_{random}) - \mathrm{dist}(p, o_j)$,

(5) if the total cost S < 0, the non-representative object o_random is the better solution, and the element o_random replaces the element o_j to form a new set of k representative objects, returning to step (2) for a new round of object assignment;

(6) if the total cost S > 0, the representative object o_j is the better solution, going to step (3) and reselecting a non-representative object to compare the total cost until the total cost S no longer changes, thereby obtaining the k clusters with the minimum total cost;

for the trajectories tr_a and tr_b to be measured, trajectory similarity is measured using the Hausdorff distance, $H(tr_a, tr_b) = \max\{h(tr_a, tr_b),\ h(tr_b, tr_a)\}$, where

$h(tr_a, tr_b) = \max_{p \in tr_a} \min_{q \in tr_b} \mathrm{dist}(p, q)$

the Hausdorff distance is applied to calculate, for each point on either trajectory, the minimum distance to all points on the other trajectory, and the maximum is then found from the respective sets of minima; when the distance is smaller than the similarity threshold, the trajectory is considered spatially similar to the point of interest and is stored in the candidate set; trajectories in the candidate set whose distance is greater than a certain threshold are deleted to obtain the road network points of interest closest to the trajectories, namely the hotspot areas of residents' travel;

letting the data point of the sample be i and its local density be ρ_i, the local density ρ_i of data point i is calculated with the cutoff kernel used in density peak clustering:

$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c)$

wherein

$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$

and d_c is the cutoff distance; δ_i is defined as the minimum distance from data point i to any point with a greater local density: $\delta_i = \min_{j:\,\rho_j > \rho_i} (d_{ij})$; the point with the largest local density needs special treatment, and its value is taken as: $\delta_i = \max_j (d_{ij})$;

first stage: initialization

(1) calculating the distance matrix D = {d_ij}, i, j = 1, ..., n, between the data points, and determining the cutoff distance;

(2) calculating the local density ρ_i from the local density definition given above, and calculating the high-density distance δ_i of each sample using the formula

$\delta_i = \min_{j:\,\rho_j > \rho_i} (d_{ij})$;

(3) constructing a decision graph with ρ as the horizontal axis and δ as the vertical axis, selecting the data points whose local density ρ and high-density distance δ are both high, taking the density peak points in the upper-right region clearly apart from most samples as the initial cluster centers, and taking the number of density peak points as the number of clusters k.

2. The hotspot area mining method based on road network clustering according to claim 1, wherein step 1 is specifically: first collecting the taxi trajectory data set of a city for a given month, selecting one week of trajectory data from the city's data set, and performing data preprocessing, in which valid fields such as the latitude and longitude of the pick-up and drop-off trajectory points and the pick-up and drop-off times are retained and redundant data are deleted.

3. The hotspot area mining method based on road network clustering according to claim 1, wherein in step 2, determining the latitude and longitude range of the city and extracting the points of interest of the city, including shopping malls and schools, from an open source website is specifically:

first entering the latitude and longitude range of the target city on the open source website openstreetmap and downloading the map of the entire city, wherein in the exported OSM map data a way represents a user's movement trajectory and a node represents a path, and selecting nodes tagged residence, school, or shop to represent points of interest.

4. The hotspot area mining method based on road network clustering according to claim 3, wherein in step 3, obtaining the road network information of the city and mapping the trajectory points onto the road network is specifically:

obtaining electronic map data through the TAREEG web service, extracting the road network information of the city, and, after extracting the urban road network data, projecting the obtained movement trajectories onto the obtained road network map through the ST-Matching model, to obtain the trajectory points p_i, …, p_j of a driver at j consecutive moments on each road segment e.