CN110598755A - OD flow clustering method based on vector constraint - Google Patents

Info

Publication number
CN110598755A
Authority
CN
China
Prior art keywords
cluster
vector
point
flow
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910764133.9A
Other languages
Chinese (zh)
Other versions
CN110598755B (en)
Inventor
张健钦
郭小刚
徐志洁
张学东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture
Priority to CN201910764133.9A
Publication of CN110598755A
Application granted
Publication of CN110598755B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses an OD flow clustering method based on vector constraint, comprising the following steps. Step one: obtain an OD flow data set; extract an event point from each OD flow to represent the spatial position of the OD flow; calculate a vector for each OD flow from the geographic coordinates of its O point and D point, and represent the OD flow by this vector. Step two: using a partition-based clustering algorithm, cluster all OD flows by the spatial feature distance of the event points, thereby dividing all OD flows into a plurality of spatial clusters. Step three: using a partition-based clustering algorithm, cluster the OD flows contained in each spatial cluster separately by the geometric feature distance of the vectors, thereby dividing the OD flows in each spatial cluster into a plurality of vector clusters. The method simplifies OD flow similarity calculation, optimizes the feature matrix dimension in the clustering process, and places more emphasis on the overall spatial distribution and movement trend of the OD flows.

Description

OD flow clustering method based on vector constraint
Technical Field
The invention relates to the field of software, in particular to an OD flow clustering method based on vector constraint.
Background
An OD (origin-destination) flow is a semantic simplification and feature extraction of complex trajectory data. It clearly expresses the geographic information of the origin and destination of a real trajectory, the implied flow direction, the flow distance, and specific thematic attributes (such as population migration volume, freight volume, and traffic flow). With the spread of GPS positioning and the proliferation of Internet of Things sensors, massive amounts of movement trajectory data are generated, and how to discover flow patterns and explore human-land interaction in dense OD trajectory data is an important problem in trajectory data mining. Using visual analysis, scholars have addressed the crossing and overlapping of edges and the resulting cluttered display through methods such as edge bundling, OD point aggregation, and edge shape positioning, thereby highlighting the OD flow clusters with larger flow volumes. Other scholars apply spatial clustering to O points, D points, OD points, and OD flows (edges) to discover patterns in different application scenarios. In existing approaches to OD flow clustering, most researchers regard OD flow data as a set of O points and D points, adapt a point clustering algorithm, and perform a double iteration using the spatial features of the OD points to cluster the OD flows. These algorithms are easily constrained by the spatial distribution of the OD points and by the settings of the search radius or internal connectivity parameters, and they cannot actively discover irregular flow clusters.
Disclosure of Invention
An object of the present invention is to solve at least the above problems and/or disadvantages and to provide at least the advantages described hereinafter.
To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided an OD stream clustering method based on vector constraints, comprising the steps of:
step one, obtaining an OD stream data set; extracting an event point from each OD stream for representing the spatial position of the OD stream; calculating a vector of each OD flow by using the O point geographic coordinate and the D point geographic coordinate of each OD flow, and representing the OD flow by using the vector;
step two, clustering all OD flows based on the spatial characteristic distance of the event points by adopting a partition-based clustering algorithm, thereby dividing all the OD flows into a plurality of spatial clusters;
and step three, clustering the OD flows contained in each spatial cluster separately, based on the geometric characteristic distance of the vectors, by adopting a partition-based clustering algorithm, thereby dividing the OD flows contained in each spatial cluster into a plurality of vector clusters.
Preferably, in the OD flow clustering method based on vector constraint, the clustering algorithm based on partitioning is a k-means clustering algorithm.
Preferably, in the OD stream clustering method based on vector constraint, in the second step, a clustering algorithm based on partitioning is used to cluster all OD streams based on spatial feature distances of event points, so as to partition all OD streams into a plurality of spatial clusters, and the specific process includes:
step (1) traversing all values of $k_1$ and clustering at each value, calculating the error sum of squares and the overall contour coefficient of the current clustering at each value of $k_1$ to obtain an error-sum-of-squares curve and a contour coefficient curve, and selecting a plurality of contour coefficient maxima from the contour coefficient curve, where $k_1 = 2, 3, 4, \ldots, K_1$;
step (2) extracting an elbow inflection point to be checked from the error-sum-of-squares curve by a key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the error-sum-of-squares curve, and the point on the curve with the maximum vertical distance from the straight line is taken as the elbow inflection point to be checked;
step (3) searching the plurality of contour coefficient maxima for the nearest neighbor of the elbow inflection point to be checked, and taking the $k_1$ corresponding to that nearest neighbor as the optimal $k_1$ value, the optimal $k_1$ value being the number of spatial clusters.
Preferably, in the OD stream clustering method based on vector constraint, in step (1) of step two, the process of traversing all values of $k_1$ for clustering includes: in each cycle, calculating the cluster contour coefficient of each spatial cluster under the current value of $k_1$, and, in the next cycle, randomly generating a new cluster center within the spatial cluster that has the smallest cluster contour coefficient.
Preferably, in the OD stream clustering method based on vector constraint, in the third step, a clustering algorithm based on partitioning is adopted to cluster OD streams contained in any spatial cluster based on a geometric feature distance of a vector, so as to partition the OD streams contained in the spatial cluster into a plurality of vector clusters, and the specific process includes:
step (1) traversing all values of $k_2$ and clustering at each value, calculating the error sum of squares and the overall contour coefficient of the current clustering at each value of $k_2$ to obtain an error-sum-of-squares curve and a contour coefficient curve, and selecting a plurality of contour coefficient maxima from the contour coefficient curve, where $k_2 = 2, 3, 4, \ldots, K_2$;
step (2) extracting an elbow inflection point to be checked from the error-sum-of-squares curve by a key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the error-sum-of-squares curve, and the point on the curve with the maximum vertical distance from the straight line is taken as the elbow inflection point to be checked;
step (3) searching the plurality of contour coefficient maxima for the nearest neighbor of the elbow inflection point to be checked, and taking the $k_2$ corresponding to that nearest neighbor as the optimal $k_2$ value, the optimal $k_2$ value being the number of vector clusters contained in the spatial cluster.
Preferably, in the OD stream clustering method based on vector constraint, in step (1) of step three, the process of traversing all values of $k_2$ for clustering includes: in each cycle, calculating the cluster contour coefficient of each vector cluster under the current value of $k_2$, and, in the next cycle, randomly generating a new cluster center within the vector cluster that has the smallest cluster contour coefficient.
Preferably, in step three, before the OD flows contained in each spatial cluster are clustered by the geometric feature distance of the vectors using the partition-based clustering algorithm, the OD flows in the same spatial cluster are assumed to be translated so that their spatial positions coincide; the OD flows in the same spatial cluster are thereby unified under one vector space coordinate system, and each spatial cluster has its own separate vector space coordinate system.
Preferably, in the OD stream clustering method based on vector constraint, the event point is a geometric midpoint of the OD stream; the spatial characteristic distance is Euclidean distance, and the geometric characteristic distance is a difference value of the modified cosine similarity.
Preferably, the OD stream clustering method based on vector constraint further includes:
step four, visually displaying the plurality of spatial clusters on a map and defining each spatial cluster as an OD flow community, and/or calculating the event point mean and the vector mean of each vector cluster to obtain a representative OD flow of each vector cluster, and visually displaying all the representative OD flows on the map.
Preferably, in the OD stream clustering method based on vector constraint, the OD stream data set is obtained by: obtaining taxi moving track data generated by GPS positioning, and obtaining track data with passenger getting-on and getting-off position information through semantic extraction.
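The representative OD flow of step four can be illustrated with a short sketch. It rests on assumptions that the text does not spell out: coordinates are treated as planar pairs, the event point is the geometric midpoint of each flow, and the representative flow is rebuilt by centering the mean vector on the mean event point. The function name `representative_flow` is hypothetical.

```python
def representative_flow(flows):
    """flows: list of ((xo, yo), (xd, yd)) belonging to one vector cluster.

    Returns an (origin, destination) pair for the representative OD flow.
    """
    n = len(flows)
    # Mean event point: average of the flow midpoints.
    px = sum((o[0] + d[0]) / 2 for o, d in flows) / n
    py = sum((o[1] + d[1]) / 2 for o, d in flows) / n
    # Mean vector: average of the displacements D - O.
    vx = sum(d[0] - o[0] for o, d in flows) / n
    vy = sum(d[1] - o[1] for o, d in flows) / n
    # Assumption: place the mean vector so its midpoint is the mean event point.
    return (px - vx / 2, py - vy / 2), (px + vx / 2, py + vy / 2)


cluster = [((0.0, 0.0), (2.0, 0.0)), ((0.0, 2.0), (2.0, 2.0))]
rep_origin, rep_destination = representative_flow(cluster)
```

For the two parallel eastbound flows in `cluster`, the representative flow runs along the line halfway between them.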
The invention at least comprises the following beneficial effects:
The invention provides a two-step clustering algorithm for OD flows. Event points represent the spatial position of the OD flows and vectors represent the OD flows themselves, so that the pattern features of the OD flows are expressed through OD flow event points and OD flow vectors. A partition-based clustering algorithm first clusters the flows by the spatial feature distance of the event points, so that OD flows that are spatially close fall into the same spatial cluster; a partition-based clustering algorithm then clusters the OD flows within each spatial cluster by the geometric feature distance of the vectors, so that OD flows with similar flow patterns fall into one vector cluster. The method simplifies OD flow similarity calculation, optimizes the feature matrix dimension in the clustering process, places more emphasis on the overall spatial distribution and movement trend of the OD flows, can mine OD flow clusters of arbitrary shape with representative features, can discover OD flow communities, and helps to optimize traffic community planning and to analyze OD flow dynamics.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a flow diagram of an OD flow clustering method based on vector constraints in one embodiment;
FIG. 2 is a logic diagram of an OD flow clustering method based on vector constraints in one embodiment;
FIG. 3 is a flow diagram of an OD flow clustering method based on vector constraints in another embodiment;
FIG. 4 is a schematic diagram of the calculation process of the key point perception algorithm in one embodiment;
FIG. 5 is a schematic diagram of identifying the elbow inflection point to be checked in an SSE curve using the key point perception algorithm in one embodiment;
FIG. 6 is a schematic diagram of an elbow inflection point verification using a contour coefficient curve in one embodiment;
FIG. 7 is a flow diagram illustrating the selection of a new cluster center in the next cycle in one embodiment;
FIG. 8 is a flight-line diagram of the original taxi OD flows in one embodiment;
FIG. 9(a) is a flight-line diagram of the 1st taxi OD flow spatial cluster based on event point clustering in one embodiment;
FIG. 9(b) is a flight-line diagram of the 2nd taxi OD flow spatial cluster based on event point clustering in one embodiment;
FIG. 9(c) is a flight-line diagram of the 3rd taxi OD flow spatial cluster based on event point clustering in one embodiment;
FIG. 9(d) is a flight-line diagram of the 4th taxi OD flow spatial cluster based on event point clustering in one embodiment;
FIG. 10 is a flight-line diagram of the representative OD flows in a clustering result in one embodiment.
Detailed Description
The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.
Referring to fig. 1, the present invention provides an OD flow clustering method based on vector constraint, including the following steps:
step one, obtaining an OD stream data set; extracting an event point from each OD stream data for representing the spatial position of the OD stream; and calculating a vector of each OD flow by using the O point geographic coordinates and the D point geographic coordinates of each OD flow data, and representing the OD flow by using the vector.
From the viewpoint of a spatio-temporal point process, an OD flow can be regarded as an event of urban crowd activity and abstracted as a point process, so that the spatial attribute of the OD flow is represented by its event point. In map generalization, lines can likewise be abstracted as points at small and medium scales. The invention therefore defines $P_{od}$ as the OD flow event point, which characterizes the spatial attribute of the OD flow. The purpose of representing the spatial position of an OD flow by point coordinates is to treat the OD flow as a whole, as a line object, and to represent its overall spatial position by the OD flow event point. In one embodiment, the OD flows are taxi trajectory data containing the semantic information of passengers getting on and off the taxi. Each OD flow represents one passenger completing one activity by taxi, which can be regarded as a single event of a point process, so the spatial attribute of a taxi OD flow can be represented by an event point.
The OD flow is a directed line segment semantically extracted from complex trajectory data. It does not correspond to a physical entity in geographic space; it represents, in semantic space, a flow in geographic space. Although an OD flow has no real road-network trajectory, it has a definite direction of crowd movement and a spatial and temporal distance between O and D. A vector is a quantity having a magnitude and a direction. In the present invention, an OD flow is regarded as a geometric vector $\vec{V}_{od}$, and the magnitude and direction of the OD flow are determined by the modulus and direction of $\vec{V}_{od}$.
$\vec{V}_{od}$ is calculated as follows:
$$\vec{V}_{od} = (X_D - X_O,\; Y_D - Y_O)$$
where $\{X_O, Y_O\}$ are the geographic coordinates of the O point of the OD flow and $\{X_D, Y_D\}$ are the geographic coordinates of the D point. In each taxi OD flow, point O is the GPS position of the taxi when the passenger gets on, and point D is the GPS position of the taxi when the passenger gets off.
The direction of the OD flow is represented by geometric vector features, and the distribution and distance in space are replaced by the distribution density and distance of the OD flow event points. The present invention therefore defines the data structure of an OD stream as:
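The representation described above can be pictured with a small container type. This is a minimal sketch under stated assumptions: the class and field names are hypothetical, coordinates are treated as planar pairs, and the event point is taken to be the geometric midpoint of the O-D segment, as in the preferred embodiment mentioned earlier.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class ODFlow:
    """Hypothetical container for one OD flow (names are illustrative)."""
    origin: Point       # (X_O, Y_O): pick-up position
    destination: Point  # (X_D, Y_D): drop-off position

    @property
    def event_point(self) -> Point:
        # Assumption: the event point is the midpoint of the O-D segment.
        (xo, yo), (xd, yd) = self.origin, self.destination
        return ((xo + xd) / 2.0, (yo + yd) / 2.0)

    @property
    def vector(self) -> Point:
        # Displacement from O to D; its modulus and direction describe the flow.
        (xo, yo), (xd, yd) = self.origin, self.destination
        return (xd - xo, yd - yo)


flow = ODFlow(origin=(0.0, 0.0), destination=(4.0, 2.0))
```

Keeping only the origin and destination as stored fields and deriving the event point and vector on demand avoids the two views drifting out of sync.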
Step two, clustering all OD flows based on the spatial characteristic distance of the event points by adopting a partition-based clustering algorithm, thereby dividing all OD flows into a plurality of spatial clusters.
Step three, clustering the OD flows contained in each spatial cluster separately, based on the geometric characteristic distance of the vectors, by adopting a partition-based clustering algorithm, thereby dividing the OD flows contained in each spatial cluster into a plurality of vector clusters.
The invention selects the clustering algorithm according to how clusters are defined. The expected clustering result is that the OD flows within a cluster are spatially close and geometrically similar, so a partition-based clustering algorithm is chosen for the two-step clustering in step two and step three.
Vector features are the descriptive form that best reflects the state of a line: the magnitude and direction of a vector describe the length and angle of the line, and thus express the distance and direction of an OD flow. Vector features cannot, however, describe information in the spatial dimension. Since any high-dimensional object can be mapped to a point object in two-dimensional space, the invention uses the spatial features of the OD flow event points to reflect the spatial attribute of the OD flow. Any OD flow object can thus be expressed by its event point and its vector, and the two-step clustering proceeds along the spatial dimension and the geometric dimension.
Fig. 2 is a logic diagram of the OD flow clustering method based on vector constraint according to the present invention. It traces, in terms of algorithm logic, how OD flow data are converted from point space to flow space and vector space. The raw OD data reside in a space of discrete GPS trajectory points. Semantic extraction yields paired OD point data, that is, a point space carrying passenger pick-up and drop-off information. OD flow vectors are then calculated from the OD point pairs, a flow model of the OD flows is constructed, and a traffic flow space containing the OD flows is established. The event points of the OD flows are further extracted, giving an OD flow space with OD flow event points. In the OD flow space, OD flow data are described in two dimensions, the spatial dimension and the geometric feature dimension, and the elements of an OD flow cluster are required to be similar in both. The OD flow clustering process can therefore be realized by a two-step clustering method. The first step is the "spatial partitioning" process: a clustering algorithm based on OD flow event points divides the OD flows into a plurality of spatial clusters in the OD flow space. The OD flows within each spatial cluster differ in magnitude and direction, but their spatial positions are relatively close. The second step is the "vector clustering" process: only the geometric features of the OD flows within each spatial cluster are considered, and the OD flow vectors are used to calculate similarity for clustering.
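The two-step logic can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the patented implementation: the cluster counts `k1` and `k2` are fixed by hand rather than selected automatically, seeding uses a deterministic farthest-point rule (chosen here only for reproducibility), and `1 - cosine similarity` stands in for the geometric feature distance. All function names are hypothetical.

```python
import math


def farthest_point_seeds(items, k, dist):
    # Deterministic seeding (an assumption made for reproducibility).
    centers = [items[0]]
    while len(centers) < k:
        centers.append(max(items, key=lambda x: min(dist(x, c) for c in centers)))
    return centers


def mean(vectors):
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def kmeans(items, k, dist, iters=100):
    """Plain Lloyd iteration; returns a cluster label for every item."""
    centers = farthest_point_seeds(items, k, dist)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(x, centers[c])) for x in items]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(items, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
    return labels


def cosine_distance(u, v):
    # 1 - cos(angle); a stand-in for the geometric feature distance.
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0.0:
        return 1.0
    return 1.0 - (u[0] * v[0] + u[1] * v[1]) / norm


def two_step_cluster(flows, k1, k2):
    """flows: list of ((xo, yo), (xd, yd)); returns [(spatial, vector), ...]."""
    points = [((xo + xd) / 2, (yo + yd) / 2) for (xo, yo), (xd, yd) in flows]
    vectors = [(xd - xo, yd - yo) for (xo, yo), (xd, yd) in flows]
    spatial = kmeans(points, k1, math.dist)  # step two: spatial partitioning
    result = [None] * len(flows)
    for s in sorted(set(spatial)):           # step three: one run per spatial cluster
        idx = [i for i, lab in enumerate(spatial) if lab == s]
        sub = kmeans([vectors[i] for i in idx], min(k2, len(idx)), cosine_distance)
        for i, v in zip(idx, sub):
            result[i] = (s, v)
    return result


# Two eastbound and two northbound flows near the origin, then the same shifted by 100.
near = [((0, 0), (2, 0)), ((0, 1), (2, 1)), ((1, 0), (1, 2)), ((0, 0), (0, 2))]
far = [((xo + 100, yo + 100), (xd + 100, yd + 100)) for (xo, yo), (xd, yd) in near]
labels = two_step_cluster(near + far, k1=2, k2=2)
```

Because the vector clustering runs separately inside each spatial cluster, the vector labels are only meaningful together with their spatial label.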
The invention does not compute a composite flow distance that mixes spatial distance and morphological distance. The reason is that fusing the two is very complex, since the two features are interdependent and affect each other. Expressing the flow space distance with a weighted distance function also has shortcomings: it cannot properly handle multi-scale expression and global normalization. From the viewpoint of the global flow distribution, the scale differences produced by different lengths, angles, and spatial positions cannot be resolved well; from the viewpoint of the local flow distribution, the density distributions of the different dimensions significantly influence the clustering result and cannot be handled uniformly from a global perspective. By separating the dimensions, the invention handles the influence of the global distribution through spatial clustering, obtaining a set of flow clusters with tight intra-cluster spatial relationships. It then clusters by geometric feature distance within each spatial cluster, which addresses uneven local feature density and yields the representative vector clusters of each spatial cluster.
The method first clusters by spatial similarity, which helps to discover OD flow communities and to optimize traffic community planning. It then clusters the OD flows within the same spatial cluster by their vector features, placing more emphasis on the overall spatial distribution and movement trend of the OD flows; this allows OD flow clusters of arbitrary shape with representative features to be mined and helps in analyzing OD flow dynamics.
The invention simplifies OD flow similarity calculation and optimizes the feature matrix dimension in the clustering process. By separating the dimensions, it first clusters by spatial similarity, which simplifies large-scale OD flow data. The OD flows are grouped by spatial cluster and unified under a common vector coordinate system, and representative vector features are then obtained within each cluster using the modified cosine similarity. The method also reduces the normalization error and local feature loss caused in previous research by fixing vector features into 4 or 8 orientations.
Assuming that the distance matrix is a symmetric matrix, in the existing line clustering algorithm, a 2 × n matrix with a diagonal of 0 needs to be constructed for the O and D points for n OD flows. However, in the algorithm proposed by the present invention, the feature matrix size is:
The clustering algorithm provided by the invention is suitable for a distributed computing environment; in step three in particular, each spatial cluster can perform its geometric feature clustering independently.
In a preferred embodiment, in the vector constraint-based OD stream clustering method, the partition-based clustering algorithm is a k-means clustering algorithm.
The selection of seed points (that is, cluster centers) is an important step of k-means clustering: the value of k determines the number of clusters, and the choice of k affects the iteration efficiency of the algorithm. When determining k, the conventional elbow method relies on a subjective choice of the inflection point, while the contour coefficient may have several maxima. In the two-step clustering of step two and step three, the invention uses a PIP (perceptually important point) key point perception algorithm to identify the SSE (sum of squared errors) elbow inflection point automatically, and checks it against the contour coefficient curve to improve the accuracy of the elbow inflection point, so that the optimal k value is extracted automatically.
By traversing all values of k, the error sum of squares and the overall contour coefficient of the current clustering are calculated for each value of k, giving the SSE curve, in which the elbow inflection point is sought, and the contour coefficient curve.
The elbow method explores the relation between the value of k and the true number of clusters by calculating the error sum of squares. The SSE is calculated as follows:
$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} \left| p - m_i \right|^2$$
where $C_i$ is the i-th cluster, $p$ is a sample point in $C_i$, and $m_i$ is the mean of all samples in $C_i$. When k is smaller than the true number of clusters, increasing k greatly reduces the SSE; once k reaches the true number of clusters, the gain in cohesion obtained by increasing k drops off rapidly. The k value corresponding to the elbow inflection point of the SSE curve is therefore the true number of clusters in the data.
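As a small worked illustration of the SSE definition, the sketch below sums squared Euclidean distances from each sample to its cluster mean. The function name `sse` and the list-of-lists input layout are assumptions made for the example.

```python
import math


def sse(clusters):
    """clusters: list of clusters, each a non-empty list of coordinate tuples."""
    total = 0.0
    for members in clusters:
        n = len(members)
        centroid = tuple(sum(col) / n for col in zip(*members))  # m_i
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


split = sse([[(0, 0), (2, 0)], [(5, 5)]])    # two clusters
merged = sse([[(0, 0), (2, 0), (5, 5)]])     # the same points as one cluster
```

Splitting the three points into two clusters lowers the SSE relative to the merged clustering, which is the monotonic decrease the elbow method relies on.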
The contour coefficient (also known as the silhouette coefficient) is a way of evaluating clustering quality that combines cohesion and separation. For any vector i, its contour coefficient s(i) is:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
where a(i) is the average distance from vector i to all other points in the cluster to which it belongs, and b(i) is the minimum, over all other clusters, of the average distance from vector i to all points in that cluster. Averaging the contour coefficients of all points gives the total contour coefficient of the clustering result (that is, the overall contour coefficient of the current clustering). The closer the contour coefficient is to 1, the better the cohesion and separation. However, the contour coefficient is a relative evaluation index: it fluctuates as k changes, the contour coefficient curve is non-convex, and several local optima exist. The elbow method is therefore needed as an aid, and the k value corresponding to a maximum of the contour coefficient is selected as the optimal number of clusters.
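The per-point coefficient can be computed directly from the definitions of a(i) and b(i). The following is a minimal sketch assuming Euclidean distance by default; the function name is hypothetical, and assigning 0 to a point in a singleton cluster is a common convention rather than something stated in the text.

```python
import math


def silhouette_values(points, labels, dist=math.dist):
    """Per-point s(i) = (b - a) / max(a, b); needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters (assumption)
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not change the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(pts, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_values(pts, [0, 1, 0, 1])   # clusters that straddle the gap
overall_good = sum(good) / len(good)
overall_bad = sum(bad) / len(bad)
```

The overall coefficient is the plain average of the per-point values; averaging only the points of one cluster gives that cluster's own coefficient.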
Fig. 4 shows a schematic diagram of the calculation process of the PIP key point perception algorithm. A curve is defined as a sequence P. The first point p1 and the last point p2 of the sequence are taken as the first and second PIPs (key points) and connected to construct the straight line p1p2. The vertical distance from each point in the sequence P to the line p1p2 is calculated, and the point with the largest vertical distance is identified as the inflection point.
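The vertical distance from the test point p3 to the straight line p1p2 in fig. 4 is calculated by the formula:
$$VD(p_3, p_c) = \left| y_c - y_3 \right| = \left| \left( y_1 + (x_c - x_1)\,\frac{y_2 - y_1}{x_2 - x_1} \right) - y_3 \right|$$
where $x_c = x_3$, $p_1 = (x_1, y_1)$, $p_2 = (x_2, y_2)$, $p_3 = (x_3, y_3)$, and $p_c = (x_c, y_c)$ is the point on the line p1p2 that has the same abscissa as p3.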
The invention uses the key point perception algorithm to identify the elbow inflection point to be checked in the SSE curve, which improves the accuracy of elbow identification. Fig. 5 illustrates the process. The SSE curve is taken as the sequence P; its first point (head point) and last point (tail point) are taken, and a straight line is constructed through these two adjacent PIPs; the vertical distance from each point on the SSE curve to this line is calculated; and the point with the maximum vertical distance is taken as the highly fluctuating point, that is, the elbow inflection point to be checked.
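The identification of the elbow inflection point to be checked can be sketched as follows. The example assumes the SSE curve is given as parallel lists of k values and SSE values with at least two distinct k values; the function names and the sample numbers are illustrative only.

```python
def vertical_distance(p1, p2, p3):
    """Vertical distance from p3 to the line through p1 and p2 (x1 != x2)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    yc = y1 + (y2 - y1) * (x3 - x1) / (x2 - x1)  # height of the line at x_c = x3
    return abs(yc - y3)


def elbow_candidate(ks, sse_values):
    """Return the k whose SSE point lies farthest (vertically) from the chord
    joining the first and last points of the SSE curve."""
    first = (ks[0], sse_values[0])
    last = (ks[-1], sse_values[-1])
    best = max(zip(ks, sse_values), key=lambda p: vertical_distance(first, last, p))
    return best[0]


ks = [2, 3, 4, 5, 6]
sse_curve = [100.0, 40.0, 20.0, 15.0, 12.0]  # made-up values for illustration
candidate = elbow_candidate(ks, sse_curve)
```

Only one pass over the curve is needed because a single PIP beyond the two endpoints is sought.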
In the present invention the sequence P, namely the SSE curve, is monotonic, and its inflection point usually lies near the third PIP, so only one round of PIP identification is needed. The PIP key point perception algorithm is typically used to compress static data and does not give a stable solution for sequences of varying length: the third PIP shifts slightly around the true elbow inflection point as the tail point changes. The invention therefore uses the contour coefficient maxima as a constraint to assist in selecting the optimal k value.
Fig. 6 shows a schematic diagram of elbow inflection point verification using the contour coefficient curve. To illustrate the identification process more clearly, the SSE sequence and the contour coefficient sequence are normalized into the same coordinate system. The maxima of the contour coefficient are then calculated; in fig. 6 the contour coefficient curve has three maximum points, each marked with a circle. The nearest neighbor of the elbow inflection point to be checked (that is, the third PIP) is searched for among the contour coefficient maxima: the horizontal distance between each maximum and the elbow inflection point to be checked is calculated, and the maximum with the smallest horizontal distance is taken as the nearest neighbor. The third PIP is then moved to the k value of this nearest neighbor, giving the PIP inflection point under the contour coefficient constraint, and the corresponding k value is taken as the optimal number of clusters.
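The verification step can be sketched as a search over the local maxima of the contour coefficient curve. Two details below are assumptions, since the text does not specify them: curve endpoints are compared against their single neighbor, and a tie in horizontal distance resolves to the smaller k. The function names and sample scores are illustrative.

```python
def local_maxima(ks, scores):
    """k values whose score is not lower than either neighbor
    (endpoints are compared against their single neighbor)."""
    peaks = []
    for i, k in enumerate(ks):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i < len(ks) - 1 else float("-inf")
        if scores[i] >= left and scores[i] >= right:
            peaks.append(k)
    return peaks


def optimal_k(ks, scores, elbow_k):
    """Contour coefficient maximum closest in k to the elbow candidate."""
    peaks = local_maxima(ks, scores)
    # Horizontal distance is the difference in k; ties go to the smaller k.
    return min(peaks, key=lambda k: (abs(k - elbow_k), k))


ks = [2, 3, 4, 5, 6, 7, 8]
scores = [0.50, 0.62, 0.58, 0.66, 0.60, 0.55, 0.57]  # made-up values
peaks = local_maxima(ks, scores)
best_k = optimal_k(ks, scores, elbow_k=6)
```

Because the distance is measured along the k axis only, the normalization of the two curves matters for plotting but not for this selection.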
During the traversal, a new cluster center must be selected in each new cycle. To improve the efficiency of the clustering algorithm in each cycle, the invention evaluates the cluster contour coefficient of each cluster in the current cycle and generates the new cluster center within the range of the cluster with the smallest cluster contour coefficient for the next round of calculation. Fig. 7 shows the selection process of the new cluster center in the next cycle. The cluster contour coefficient of a cluster is obtained by first calculating the contour coefficient of each point and then averaging the contour coefficients of all points in the cluster. The value range of the cluster with the smallest cluster contour coefficient is then determined, a coordinate pair is randomly generated within that range (the objects in the cluster exist in the form of coordinate pairs), and this coordinate pair is used as the new seed point (that is, the new cluster center).
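The seed selection just described might look like the sketch below. It interprets the "value range" of a cluster as its axis-aligned bounding box, which is an assumption, and it takes precomputed per-point contour coefficients as input. The function name and the `rng` parameter are hypothetical.

```python
import random


def new_seed(points, labels, coefficients, rng=random):
    """Pick a random coordinate pair inside the bounding box of the cluster
    whose mean per-point contour coefficient is lowest.

    Returns (label_of_that_cluster, (x, y)).
    """
    by_cluster = {}
    for p, lab, s in zip(points, labels, coefficients):
        by_cluster.setdefault(lab, []).append((p, s))

    def cluster_coefficient(lab):
        members = by_cluster[lab]
        return sum(s for _, s in members) / len(members)

    worst = min(by_cluster, key=cluster_coefficient)
    xs = [p[0] for p, _ in by_cluster[worst]]
    ys = [p[1] for p, _ in by_cluster[worst]]
    # Assumption: the cluster's value range is its axis-aligned bounding box.
    seed = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
    return worst, seed


pts = [(0, 0), (1, 1), (10, 10), (12, 14)]
labs = [0, 0, 1, 1]
coefs = [0.9, 0.8, 0.2, 0.3]  # made-up per-point values
worst_cluster, seed = new_seed(pts, labs, coefs, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps runs repeatable while leaving the default behavior random.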
In the two clustering passes of step two and step three, a PIP (perceptually important point) key point perception algorithm is used to automatically identify the SSE elbow inflection point, and the result is verified against the contour coefficient curve, so that the optimal number of spatial clusters and the optimal number of vector clusters are determined. Fig. 3 shows the OD flow clustering method based on vector constraint in one embodiment, in which the sum-of-squared-errors key point perception algorithm that takes the contour coefficient into account is used in both step two and step three.
Furthermore, the above method is used to select new cluster centers in the two clustering passes of step two and step three, which improves iteration efficiency and reduces the number of iterations.
In a preferred embodiment, in the OD flow clustering method based on vector constraint, in step two a partition-based clustering algorithm is used to cluster all OD flows based on the spatial feature distance of the event points, thereby dividing all OD flows into a plurality of spatial clusters. The specific process includes:
step (1): traverse all values of k1 and perform clustering for each value, where k1 = 2, 3, 4, ..., K1; at each value of k1, calculate the sum of squared errors (SSE) of the current clustering and the overall contour coefficient of the current clustering, thereby obtaining the SSE curve and the contour coefficient curve, and select a plurality of contour coefficient maxima from the contour coefficient curve;
step (2): extract the elbow inflection point to be checked from the SSE curve using the key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the SSE curve, and the point on the SSE curve with the maximum vertical distance from this line is calculated and taken as the elbow inflection point to be checked;
step (3): search the plurality of contour coefficient maxima for the nearest neighbor of the elbow inflection point to be checked, and take the k1 corresponding to that nearest neighbor as the optimal k1 value; the optimal k1 value is the number of spatial clusters.
The value of K1 can be determined empirically.
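Step (2) above can be sketched as follows. The sketch measures the perpendicular distance to the line through the first and last points; measuring the distance along the vertical axis instead differs only by a constant factor for a fixed line, so both readings select the same point. The function name and the SSE values are hypothetical, and the curve is assumed to have distinct first and last points.

```python
import math


def elbow_candidate(ks, sse):
    """Return the k whose (k, SSE) point lies farthest from the first-to-last chord."""
    x1, y1, x2, y2 = ks[0], sse[0], ks[-1], sse[-1]
    chord = math.hypot(x2 - x1, y2 - y1)
    best_k, best_d = ks[0], -1.0
    for k, s in zip(ks, sse):
        # Distance from (k, s) to the line through (x1, y1) and (x2, y2).
        d = abs((y2 - y1) * k - (x2 - x1) * s + x2 * y1 - y2 * x1) / chord
        if d > best_d:
            best_k, best_d = k, d
    return best_k


ks = [2, 3, 4, 5, 6, 7, 8]
sse = [1.00, 0.60, 0.30, 0.24, 0.20, 0.17, 0.15]  # made-up, already normalized
k_elbow = elbow_candidate(ks, sse)
```

For this made-up curve the farthest point is at k = 4, which would then be checked against the contour coefficient maxima as in step (3).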
In a preferred embodiment, in the OD flow clustering method based on vector constraint, in step (1) of step two, the process of traversing all values of k1 for clustering includes: in each cycle, calculating the cluster contour coefficient of each spatial cluster under the current value of k1, and randomly generating the new cluster center for the next cycle within the spatial cluster that has the smallest cluster contour coefficient.
In a preferred embodiment, in the OD flow clustering method based on vector constraint, in step three a partition-based clustering algorithm is used to cluster the OD flows contained in any spatial cluster based on the geometric feature distance of the vectors, thereby dividing the OD flows contained in that spatial cluster into a plurality of vector clusters. The specific process includes:
step (1): traverse all values of k2 and perform clustering for each value, where k2 = 2, 3, 4, ..., K2; at each value of k2, calculate the sum of squared errors (SSE) of the current clustering and the overall contour coefficient of the current clustering, thereby obtaining the SSE curve and the contour coefficient curve, and select a plurality of contour coefficient maxima from the contour coefficient curve;
step (2): extract the elbow inflection point to be checked from the SSE curve using the key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the SSE curve, and the point on the SSE curve with the maximum vertical distance from this line is calculated and taken as the elbow inflection point to be checked;
step (3): search the plurality of contour coefficient maxima for the nearest neighbor of the elbow inflection point to be checked, and take the k2 corresponding to that nearest neighbor as the optimal k2 value; the optimal k2 value is the number of vector clusters contained in the spatial cluster.
The value of K2 can be determined empirically.
In a preferred embodiment, in the OD flow clustering method based on vector constraint, in step (1) of step three, the process of traversing all values of k2 for clustering includes: in each cycle, calculating the cluster contour coefficient of each vector cluster under the current value of k2, and randomly generating the new cluster center for the next cycle within the vector cluster that has the smallest cluster contour coefficient.
In a preferred embodiment, in step three, before the OD flows contained in each spatial cluster are clustered based on the geometric feature distance of the vectors using the partition-based clustering algorithm, the OD flows in the same spatial cluster are assumed to be translated so that their spatial positions coincide. The OD flows in the same spatial cluster are thereby unified under one vector coordinate system, and each spatial cluster has its own separate vector space coordinate system.
Referring to Fig. 2, before vector clustering the OD flows in each spatial cluster are assumed to be translated so that the spatial positions of the OD flows in the same spatial cluster overlap. This ignores the differences in spatial position among the OD flows of one spatial cluster and unifies the OD flow vector coordinate system. After the spatial partitioning, each spatial cluster in the OD flow space is thus converted into its own independent vector space, and the clustering of all vector clusters can run in parallel.
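The effect of the assumed translation can be illustrated with a small sketch. It reads "spatial position" as the event point (midpoint) of the flow and represents a flow as a pair of planar coordinate tuples; both choices, and the function names, are assumptions made for illustration.

```python
def center_on_midpoint(flow):
    """Translate one OD flow so that its midpoint lands on (0, 0)."""
    (xo, yo), (xd, yd) = flow
    mx, my = (xo + xd) / 2, (yo + yd) / 2
    return ((xo - mx, yo - my), (xd - mx, yd - my))


def vector(flow):
    """Vector from the O point to the D point; unchanged by translation."""
    (xo, yo), (xd, yd) = flow
    return (xd - xo, yd - yo)


flows = [((0.0, 0.0), (4.0, 2.0)), ((10.0, 10.0), (14.0, 12.0))]
centered = [center_on_midpoint(f) for f in flows]
```

The two sample flows lie far apart but have the same length and direction, so after translation they coincide, and their vectors are the same before and after the translation.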
In the process of vector clustering, each vector coordinate system is normalized first.
In a preferred embodiment, in the OD flow clustering method based on vector constraint, the event point is the geometric midpoint of the OD flow; the spatial feature distance is the Euclidean distance, and the geometric feature distance is the difference value of the modified cosine similarity.
$P_{od}$ is the midpoint of the geometric line of the OD flow, calculated as follows:
$$P_{od} = \left(\frac{X_O + X_D}{2},\ \frac{Y_O + Y_D}{2}\right)$$
where $\{X_O, Y_O\}$ are the geographic coordinates of the O point of the OD flow and $\{X_D, Y_D\}$ are the geographic coordinates of the D point of the OD flow. In this embodiment, the spatial attribute of an OD flow is represented by the midpoint of its geometric line.
The method treats the OD flow as a whole object and constructs the corresponding distance functions from the spatial attribute and the geometric attribute of the OD flow. In high-dimensional data analysis, however, data of different dimensions cannot be compared or operated on directly, and assigning weights when constructing a single combined distance function is highly subjective. Therefore, the invention performs two-step clustering based on spatial similarity and on geometric feature similarity respectively, using the Euclidean distance as the spatial distance function and a distance derived from the modified cosine similarity as the geometric feature distance function.
The spatial feature distance $D_{spatial}(i, j)$ of OD flows is defined as the geographic Euclidean distance $D_{EUC}(P_i, P_j)$ between the event point $P_i$ of OD flow $OD_i$ and the event point $P_j$ of OD flow $OD_j$. The calculation formula is as follows:
$$D_{spatial}(i, j) = D_{EUC}(P_i, P_j) = \sqrt{(X_{P_i} - X_{P_j})^2 + (Y_{P_i} - Y_{P_j})^2}$$
where the geographic coordinates of event point $P_i$ are $(X_{P_i}, Y_{P_i})$ and the geographic coordinates of event point $P_j$ are $(X_{P_j}, Y_{P_j})$.
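The event point and the spatial feature distance can be sketched as below. The sketch treats the coordinates as planar values; applying it directly to longitude and latitude without a projection is an assumption the text does not address. The function names are hypothetical.

```python
import math


def event_point(flow):
    """Midpoint of the straight line between the O point and the D point."""
    (xo, yo), (xd, yd) = flow
    return ((xo + xd) / 2, (yo + yd) / 2)


def spatial_distance(flow_i, flow_j):
    """Euclidean distance between the event points of two OD flows."""
    (xi, yi), (xj, yj) = event_point(flow_i), event_point(flow_j)
    return math.hypot(xi - xj, yi - yj)


a = ((0.0, 0.0), (2.0, 2.0))  # event point (1, 1)
b = ((3.0, 4.0), (5.0, 6.0))  # event point (4, 5)
d_ab = spatial_distance(a, b)
```

The two event points differ by 3 in x and 4 in y, so the distance is 5.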
The vector is calculated as follows:
$$\overrightarrow{OD} = (X_D - X_O,\ Y_D - Y_O)$$
where $\{X_O, Y_O\}$ are the geographic coordinates of the O point of the OD flow and $\{X_D, Y_D\}$ are the geographic coordinates of the D point of the OD flow. The geometric feature distance $D_{vector}(i, j)$ of OD flows is defined as the difference value of the modified cosine similarity. The calculation formula is as follows:
$$D_{vector}(i, j) = 1 - Sim_{AdjCos}(OD_i, OD_j)$$
$$Sim_{AdjCos}(OD_i, OD_j) = \frac{(x_i - R_x)(x_j - R_x) + (y_i - R_y)(y_j - R_y)}{\sqrt{(x_i - R_x)^2 + (y_i - R_y)^2}\ \sqrt{(x_j - R_x)^2 + (y_j - R_y)^2}}$$
where $Sim_{AdjCos}(OD_i, OD_j)$ is the modified cosine similarity between vector $\overrightarrow{OD_i} = (x_i, y_i)$ and vector $\overrightarrow{OD_j} = (x_j, y_j)$, $R_x$ is the mean value within the cluster in dimension x, and $R_y$ is the mean value within the cluster in dimension y. The modified cosine similarity normalizes the different dimensions while accounting for the angular difference between vectors, indirectly takes the vector modulus into account, and thus jointly measures similarity in vector magnitude and vector direction. Because the similarity takes values in [-1, 1], the invention uses the dissimilarity calculated by this difference as the distance function.
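A minimal sketch of this distance follows. It takes the distance to be 1 minus the modified cosine similarity, which is one reading of "difference value", and it returns 1.0 when a centered vector has zero length; that zero-length rule and the function name are assumptions introduced here.

```python
import math


def adjusted_cosine_distance(v_i, v_j, mean):
    """1 minus the cosine of the two vectors after subtracting the cluster mean."""
    ax, ay = v_i[0] - mean[0], v_i[1] - mean[1]
    bx, by = v_j[0] - mean[0], v_j[1] - mean[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0.0:
        # Assumption: a vector equal to the mean has no direction; use distance 1.
        return 1.0
    return 1.0 - (ax * bx + ay * by) / norm


vectors = [(3.0, 0.0), (0.0, 3.0), (-3.0, 0.0), (0.0, -3.0)]
mean = (sum(v[0] for v in vectors) / 4, sum(v[1] for v in vectors) / 4)
d_same = adjusted_cosine_distance(vectors[0], vectors[0], mean)
d_orth = adjusted_cosine_distance(vectors[0], vectors[1], mean)
d_opp = adjusted_cosine_distance(vectors[0], vectors[2], mean)
```

Under this reading the distance ranges from 0 for identical centered directions, through 1 for orthogonal ones, to 2 for opposite ones.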
In a preferred embodiment, the OD flow clustering method based on vector constraint further includes step four: visually displaying the plurality of spatial clusters on a map and defining each spatial cluster as an OD flow community; and/or calculating the event point mean and the vector mean of each vector cluster so as to obtain a representative OD flow for each vector cluster, and visually displaying all representative OD flows on the map.
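One way to obtain a drawable representative OD flow from the two means is sketched below. Placing the mean vector symmetrically around the mean event point is an assumption made here for illustration; the text only states that the two means are calculated. The function name is hypothetical.

```python
def representative_flow(flows):
    """Average event points and vectors of a cluster, then rebuild one O-D pair."""
    n = len(flows)
    mid_x = sum((xo + xd) / 2 for (xo, _), (xd, _) in flows) / n
    mid_y = sum((yo + yd) / 2 for (_, yo), (_, yd) in flows) / n
    vec_x = sum(xd - xo for (xo, _), (xd, _) in flows) / n
    vec_y = sum(yd - yo for (_, yo), (_, yd) in flows) / n
    # Centre the mean vector on the mean event point.
    origin = (mid_x - vec_x / 2, mid_y - vec_y / 2)
    dest = (mid_x + vec_x / 2, mid_y + vec_y / 2)
    return origin, dest


cluster = [((0.0, 0.0), (4.0, 2.0)), ((2.0, 2.0), (6.0, 6.0))]
origin, dest = representative_flow(cluster)
```

For the two sample flows the mean event point is (3, 2.5) and the mean vector is (4, 3), so the representative flow runs from (1, 1) to (5, 4).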
In a preferred embodiment, in the OD flow clustering method based on vector constraint, the OD flow data set is obtained by: obtaining taxi movement trajectory data generated by GPS positioning, and obtaining, through semantic extraction, trajectory data carrying passenger pick-up and drop-off position information.
Example one
A taxi OD flow is a trajectory carrying passenger pick-up and drop-off position information, obtained by semantic extraction from taxi movement trajectory data generated by GPS positioning. Compared with a complex real trajectory, an OD flow does not depend entirely on real road network distribution data, directly reflects the travel characteristics of urban residents, and is an important data source for mining the spatio-temporal activity patterns of urban crowds. The experimental data are part of the taxi GPS trajectory data (12,000 records) for Beijing from 6 a.m. to 9 a.m. on January 11, 2008, after preliminary OD semantic extraction. The data keep the original taxi GPS trajectory information structure and include fields such as the encrypted taxi number, GPS feedback time, real-time longitude and latitude, passenger-carrying state, passenger-carrying event, speed and direction angle. To facilitate visualization of the clustering results, all experiments were carried out with JavaScript + HTML + CSS front-end web development technology, and the clustering and visualization algorithms were written in JavaScript. The visualization of the unclustered taxi OD flows is shown in Fig. 8.
Clustering is first performed based on the spatial feature distance of the event points. During clustering, the optimal k1 value, i.e., the number of spatial clusters of taxi OD flow event points, is obtained automatically using the sum-of-squared-errors key point perception algorithm that takes the contour coefficient into account; the resulting k1 value is 4. The 4 OD flow spatial clusters generated by the clustering are shown in Figs. 9(a), 9(b), 9(c) and 9(d), each of which shows one OD flow community.
Clustering is then performed based on the geometric feature distance of the vectors. During clustering, the optimal k2 value is obtained automatically using the same sum-of-squared-errors key point perception algorithm that takes the contour coefficient into account; the k2 values of the vector clusters for the four spatial clusters are 4, 4, 5 and 4 respectively. In addition, to make the overall flow trend of the 17 flow clusters easier to observe, the mean OD flow event point and the mean OD flow vector of each flow cluster are calculated as the representative visual descriptors of that flow cluster; the visualization result is shown in Fig. 10.
Comparing Figs. 9(a), 9(b), 9(c) and 9(d) with Fig. 10 shows that the result of the first clustering step, "spatial partitioning", constrains the OD flow vector coordinate system: the OD flow clusters are clearly divided into 4 bundles, each of which in turn constrains a different set of clusters. This composite flow pattern is mainly determined by how the OD flow event points are calculated. An OD flow is not a real trajectory and has no trajectory midpoint; the flow event point defined here can be understood as a spatial abstraction of the OD flow. The midpoint of an OD flow therefore has a certain physical meaning in cluster analysis and pattern recognition, and flow bundling patterns are easier to find when the OD flow midpoint is used as the event point. Physically, the taxi OD flow clusters are traffic flow clusters with clear spatial divisions, produced by the attraction of urban functional areas and by the interaction among transportation hubs. Because the formation of traffic flow communities depends on urban transportation hubs, the invention compares the bundling pattern with the spatial distribution of Beijing's transportation hubs (Beijing West Railway Station and the Liuliqiao main passenger transport hub; Beijing South Railway Station, the Songjiazhuang transportation hub and Nanyuan Airport; Beijing Railway Station and the Liuliqiao and Sihui transportation hubs; Beijing Capital Airport and the Xiyuan transportation hub) and finds an obvious spatial correlation between the two. It can also be seen that a transportation hub not only serves as a place where passenger flow gathers and disperses, but also attracts the surrounding traffic interaction. The line clustering method can therefore better identify OD flow communities and find OD flow clusters with potential spatial connections.
The clustering algorithm can find representative geometric feature clusters of arbitrary shape on the basis of the identified OD flow event point spatial clusters. In the prior art, the similarity of OD points is often constrained by defining a regular search space, an additional partition into geographic cells, or a uniform continuous density space, which yields regular clusters with similar geometric forms, or irregular clusters with similar OD point semantic features or uniform OD point density. The morphological structure of these clusters depends on the parameter definitions of hierarchy-based and density-based clustering algorithms. However, because parameters such as the search radius and intra-cluster connectivity are fixed, existing clustering algorithms cannot handle line sets with uneven global density well. The invention therefore uses event point spatial clustering, based on a partition-based clustering algorithm, to constrain the vector coordinate system and normalizes the vector coordinate origin of the OD flows within each spatial cluster. Vector geometric feature clustering is then performed with a distance function based on the modified cosine similarity. No parameter needs to be preset anywhere in the algorithm; the optimal k values of the spatial clusters and of the vector clusters are calculated from indices such as the contour coefficient and the sum of squared errors. Performing spatial clustering first and using the modified cosine similarity avoids, as far as possible, the feature loss caused by global data normalization.
Moreover, because the algorithm does not apply the conventional two-point constraint to geometric form and is affected only by the optimal k value of the geometric clusters within each spatial cluster and by the modified cosine similarity, it can find irregularly distributed clusters of OD flows. The algorithm can thus identify not only clusters with similar patterns but also clusters with convergent or divergent patterns. As shown in Fig. 10, the invention finds a traffic evacuation and convergence pattern influenced by Beijing Capital International Airport.
The invention provides a two-step flow clustering algorithm for OD data. It expresses the pattern characteristics of OD flows through OD flow event points and OD flow vectors, maps the OD data from a massive point set space into independent vector feature spaces, reduces the complexity of OD flow similarity calculation, and pays more attention to the overall spatial distribution and movement trend of the OD flows. Compared with the prior art, it departs from the line clustering approach based on two-point clustering, focuses on the overall (high-dimensional) similarity of OD flows, reduces the dimension of the feature matrix used during clustering, and determines the optimal number of clusters automatically without any preset parameter. It can mine OD flow clusters of arbitrary shape with representative features and discover OD flow communities, which helps optimize traffic zone planning and analyze OD flow dynamics problems.
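The overall two-step structure can be sketched in a few lines. This is a simplified illustration rather than the patented procedure: the numbers of clusters are passed in as fixed hypothetical parameters `k1` and `k2` instead of being selected automatically, cluster centers are seeded with a deterministic farthest-point rule instead of the contour-coefficient-based seeding, and all function names are made up for the example.

```python
import math


def kmeans(points, k, dist, iters=20):
    """Tiny Lloyd-style k-means with a pluggable distance; returns one label per point."""
    # Deterministic farthest-point seeding: an assumption made for reproducibility.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def adjcos_around(mean):
    """Build a distance: 1 minus the cosine of two vectors centred on `mean`."""
    def dist(u, v):
        ax, ay = u[0] - mean[0], u[1] - mean[1]
        bx, by = v[0] - mean[0], v[1] - mean[1]
        norm = math.hypot(ax, ay) * math.hypot(bx, by)
        return 1.0 if norm == 0.0 else 1.0 - (ax * bx + ay * by) / norm
    return dist


def two_step(flows, k1, k2):
    """Return (spatial label, vector label) for every OD flow."""
    mids = [((xo + xd) / 2, (yo + yd) / 2) for (xo, yo), (xd, yd) in flows]
    vecs = [(xd - xo, yd - yo) for (xo, yo), (xd, yd) in flows]
    spatial = kmeans(mids, k1, euclid)               # step two: event points
    result = [None] * len(flows)
    for s in set(spatial):
        idx = [i for i, lab in enumerate(spatial) if lab == s]
        sub = [vecs[i] for i in idx]
        mean = tuple(sum(v) / len(sub) for v in zip(*sub))
        vlabels = kmeans(sub, min(k2, len(sub)), adjcos_around(mean))  # step three
        for i, vl in zip(idx, vlabels):
            result[i] = (s, vl)
    return result


flows = [
    ((-1.0, 0.0), (1.0, 0.0)), ((-1.0, 0.2), (1.0, 0.4)),      # near (0, 0), eastbound
    ((1.0, 0.1), (-1.0, 0.1)), ((1.0, -0.1), (-1.0, -0.3)),    # near (0, 0), westbound
    ((10.0, 9.0), (10.0, 11.0)), ((10.2, 9.0), (10.4, 11.0)),  # near (10, 10), northbound
    ((10.0, 11.0), (10.0, 9.0)), ((9.9, 11.0), (9.7, 9.0)),    # near (10, 10), southbound
]
result = two_step(flows, k1=2, k2=2)
```

On this toy data the spatial pass separates the flows around (0, 0) from those around (10, 10), and the vector pass then splits each group by direction. Because every spatial cluster is processed independently in the loop, the vector passes could also be run in parallel.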
While embodiments of the invention have been disclosed above, the invention is not limited to the uses set forth in the specification and embodiments. It can be applied in any field to which it is suited, and additional modifications will readily occur to those skilled in the art. The invention is therefore not limited to the specific details and illustrations shown and described herein, provided that it remains within the scope of the appended claims and their equivalents.

Claims (10)

1. An OD flow clustering method based on vector constraint, characterized by comprising the following steps:
step one, obtaining an OD flow data set; extracting an event point from each OD flow to represent the spatial position of the OD flow; calculating a vector of each OD flow from the geographic coordinates of the O point and the geographic coordinates of the D point of the OD flow, and representing the OD flow by the vector;
step two, clustering all OD flows based on the spatial feature distance of the event points using a partition-based clustering algorithm, thereby dividing all OD flows into a plurality of spatial clusters;
step three, clustering the OD flows contained in each spatial cluster separately, based on the geometric feature distance of the vectors, using a partition-based clustering algorithm, thereby dividing the OD flows contained in each spatial cluster into a plurality of vector clusters.
2. The vector constraint-based OD flow clustering method of claim 1, wherein the partition-based clustering algorithm is a k-means clustering algorithm.
3. The vector constraint-based OD flow clustering method of claim 2, wherein in step two a partition-based clustering algorithm is used to cluster all OD flows based on the spatial feature distance of the event points, thereby dividing all OD flows into a plurality of spatial clusters, and the specific process includes:
step (1): traversing all values of k1 and performing clustering for each value, where k1 = 2, 3, 4, ..., K1; at each value of k1, calculating the sum of squared errors of the current clustering and the overall contour coefficient of the current clustering, thereby obtaining the sum-of-squared-errors curve and the contour coefficient curve, and selecting a plurality of contour coefficient maxima from the contour coefficient curve;
step (2): extracting an elbow inflection point to be checked from the sum-of-squared-errors curve using a key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the sum-of-squared-errors curve, and the point on the sum-of-squared-errors curve with the maximum vertical distance from the straight line is calculated and taken as the elbow inflection point to be checked;
step (3): searching the plurality of contour coefficient maxima for the nearest neighbor of the elbow inflection point to be checked, and taking the k1 corresponding to the nearest neighbor as the optimal k1 value, the optimal k1 value being the number of spatial clusters.
4. The vector constraint-based OD flow clustering method of claim 3, wherein in step (1) of step two, the process of traversing all values of k1 for clustering includes: in each cycle, calculating the cluster contour coefficient of each spatial cluster under the current value of k1, and randomly generating the new cluster center for the next cycle within the spatial cluster that has the smallest cluster contour coefficient.
5. The vector constraint-based OD flow clustering method of claim 2, wherein in step three a partition-based clustering algorithm is used to cluster the OD flows contained in any spatial cluster based on the geometric feature distance of the vectors, thereby dividing the OD flows contained in the spatial cluster into a plurality of vector clusters, and the specific process includes:
step (1): traversing all values of k2 and performing clustering for each value, where k2 = 2, 3, 4, ..., K2; at each value of k2, calculating the sum of squared errors of the current clustering and the overall contour coefficient of the current clustering, thereby obtaining the sum-of-squared-errors curve and the contour coefficient curve, and selecting a plurality of contour coefficient maxima from the contour coefficient curve;
step (2): extracting an elbow inflection point to be checked from the sum-of-squared-errors curve using a key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the sum-of-squared-errors curve, and the point on the sum-of-squared-errors curve with the maximum vertical distance from the straight line is calculated and taken as the elbow inflection point to be checked;
step (3): searching the plurality of contour coefficient maxima for the nearest neighbor of the elbow inflection point to be checked, and taking the k2 corresponding to the nearest neighbor as the optimal k2 value, the optimal k2 value being the number of vector clusters contained in the spatial cluster.
6. The vector constraint-based OD flow clustering method of claim 5, wherein in step (1) of step three, the process of traversing all values of k2 for clustering includes: in each cycle, calculating the cluster contour coefficient of each vector cluster under the current value of k2, and randomly generating the new cluster center for the next cycle within the vector cluster that has the smallest cluster contour coefficient.
7. The vector constraint-based OD flow clustering method of claim 1, wherein in step three, before the OD flows contained in each spatial cluster are clustered based on the geometric feature distance of the vectors using the partition-based clustering algorithm, the OD flows in the same spatial cluster are assumed to be translated so that their spatial positions coincide, whereby the OD flows in the same spatial cluster are unified under the same vector space coordinate system and each spatial cluster exists under a separate vector space coordinate system.
8. The OD flow clustering method based on vector constraints of claim 1, wherein the event point is a geometric midpoint of the OD flow; the spatial characteristic distance is Euclidean distance, and the geometric characteristic distance is a difference value of the modified cosine similarity.
9. The vector constraint-based OD flow clustering method of claim 1, further comprising:
step four, visually displaying the plurality of spatial clusters on a map and defining each spatial cluster as an OD flow community; and/or calculating the event point mean and the vector mean of each vector cluster so as to obtain a representative OD flow for each vector cluster, and visually displaying all representative OD flows on the map.
10. The OD flow clustering method based on vector constraints of claim 1, wherein the OD flow data set is obtained by: obtaining taxi moving track data generated by GPS positioning, and obtaining track data with passenger getting-on and getting-off position information through semantic extraction.
CN201910764133.9A 2019-08-19 2019-08-19 OD flow clustering method based on vector constraint Active CN110598755B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910764133.9A CN110598755B (en) 2019-08-19 2019-08-19 OD flow clustering method based on vector constraint

Publications (2)

Publication Number Publication Date
CN110598755A (en) 2019-12-20
CN110598755B (en) 2022-07-15

Family

ID=68854975

Country Status (1)

Country Link
CN (1) CN110598755B (en)

Also Published As

Publication number Publication date
CN110598755B (en) 2022-07-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant