CN110598755A

CN110598755A - OD flow clustering method based on vector constraint

Info

Publication number: CN110598755A
Application number: CN201910764133.9A
Authority: CN
Inventors: 张健钦; 郭小刚; 徐志洁; 张学东
Original assignee: Beijing University of Civil Engineering and Architecture
Current assignee: Beijing University of Civil Engineering and Architecture
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2019-12-20
Anticipated expiration: 2039-08-19
Also published as: CN110598755B

Abstract

The invention discloses an OD flow clustering method based on vector constraint, which comprises the following steps: step one, obtaining an OD flow data set; extracting an event point from each OD flow to represent the spatial position of the OD flow; calculating a vector of each OD flow from the O-point geographic coordinates and the D-point geographic coordinates of the OD flow, and representing the OD flow by the vector; step two, clustering all OD flows based on the spatial characteristic distance of the event points by adopting a partition-based clustering algorithm, thereby dividing all the OD flows into a plurality of spatial clusters; and step three, clustering the OD flows contained in each spatial cluster separately, based on the geometric characteristic distance of the vectors, by adopting a partition-based clustering algorithm, thereby dividing the OD flows contained in each spatial cluster into a plurality of vector clusters. The method simplifies the complexity of OD flow similarity calculation, optimizes the dimension of the feature matrix in the clustering process, and pays more attention to the overall spatial distribution and movement trend of the OD flows.

Description

OD flow clustering method based on vector constraint

Technical Field

The invention relates to the field of software, in particular to an OD flow clustering method based on vector constraint.

Background

An OD flow is a semantic simplification and feature extraction of complex trajectory data. It clearly expresses the geographic information of the origin and destination points of a real trajectory, the implied flow direction and flow distance, and specific thematic attributes (such as population migration volume, logistics freight volume, and traffic flow). With the popularization of GPS positioning and the proliferation of Internet of Things sensors, massive amounts of mobile trajectory data are generated, and how to find flow patterns and explore human-land interaction in dense OD trajectory data is an important problem in mobile trajectory data mining. Using visual analysis, scholars have addressed the crossing and overlapping of edges and the resulting cluttered display by methods such as edge bundling, OD point aggregation, and edge shape positioning, thereby highlighting OD flow clusters with larger flow. Other scholars use spatial clustering of O points, D points, OD points, and OD flows (edges) to discover patterns in different application scenarios. In existing approaches to OD flow clustering, most researchers regard OD flow data as a set of O points and D points, adapt a point clustering algorithm, and perform a double iteration using the spatial characteristics of the OD points, thereby realizing OD flow clustering. These OD flow clustering algorithms are highly sensitive to the spatial distribution of the OD points and to the limits imposed by search radius or internal connectivity parameters, and they cannot actively find irregular flow clusters.

Disclosure of Invention

An object of the present invention is to solve at least the above problems and/or disadvantages and to provide at least the advantages described hereinafter.

To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided an OD stream clustering method based on vector constraints, comprising the steps of:

step one, obtaining an OD stream data set; extracting an event point from each OD stream for representing the spatial position of the OD stream; calculating a vector of each OD flow by using the O point geographic coordinate and the D point geographic coordinate of each OD flow, and representing the OD flow by using the vector;

step two, clustering all OD flows based on the spatial characteristic distance of the event points by adopting a partition-based clustering algorithm, thereby dividing all the OD flows into a plurality of spatial clusters;

and step three, clustering the OD flows contained in each spatial cluster separately, based on the geometric characteristic distance of the vectors, by adopting a partition-based clustering algorithm, thereby dividing the OD flows contained in each spatial cluster into a plurality of vector clusters.

Preferably, in the OD flow clustering method based on vector constraint, the clustering algorithm based on partitioning is a k-means clustering algorithm.

Preferably, in the OD stream clustering method based on vector constraint, in the second step, a clustering algorithm based on partitioning is used to cluster all OD streams based on spatial feature distances of event points, so as to partition all OD streams into a plurality of spatial clusters, and the specific process includes:

step (1) traversing all values of k₁ and performing clustering; calculating, at each value of k₁, the error square sum of the current clustering and the overall contour coefficient of the current clustering, thereby obtaining an error square sum curve and a contour coefficient curve; and selecting a plurality of contour coefficient maximum values from the contour coefficient curve, where k₁ = 2, 3, 4, ..., K₁;

step (2) extracting an elbow inflection point to be checked from the error square sum curve by adopting a key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the error square sum curve, and the point on the error square sum curve with the maximum vertical distance from the straight line is calculated and taken as the elbow inflection point to be checked;

step (3) searching, among the plurality of contour coefficient maximum values, for the nearest neighbor point of the elbow inflection point to be checked, and taking the k₁ corresponding to the nearest neighbor point as the optimal k₁ value, the optimal k₁ value being the number of spatial clusters.

Preferably, in the OD stream clustering method based on vector constraint, in step (1) of step two, the process of traversing all values of k₁ and performing clustering includes: in each cycle, calculating the cluster contour coefficient of each spatial cluster under the current value of k₁, and, in the next cycle, randomly generating a new cluster center within the spatial cluster having the minimum cluster contour coefficient.

Preferably, in the OD stream clustering method based on vector constraint, in the third step, a clustering algorithm based on partitioning is adopted to cluster OD streams contained in any spatial cluster based on a geometric feature distance of a vector, so as to partition the OD streams contained in the spatial cluster into a plurality of vector clusters, and the specific process includes:

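step (1) traversing all values of k₂ and performing clustering; calculating, at each value of k₂, the error square sum of the current clustering and the overall contour coefficient of the current clustering, thereby obtaining an error square sum curve and a contour coefficient curve; and selecting a plurality of contour coefficient maximum values from the contour coefficient curve, where k₂ = 2, 3, 4, ..., K₂;

step (2) extracting an elbow inflection point to be checked from the error square sum curve by adopting a key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the error square sum curve, and the point on the error square sum curve with the maximum vertical distance from the straight line is calculated and taken as the elbow inflection point to be checked;

step (3) searching, among the plurality of contour coefficient maximum values, for the nearest neighbor point of the elbow inflection point to be checked, and taking the k₂ corresponding to the nearest neighbor point as the optimal k₂ value, the optimal k₂ value being the number of vector clusters contained in the spatial cluster.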

Preferably, in the OD stream clustering method based on vector constraint, in step (1) of step three, the process of traversing all values of k₂ and performing clustering includes: in each cycle, calculating the cluster contour coefficient of each vector cluster under the current value of k₂, and, in the next cycle, randomly generating a new cluster center within the vector cluster having the minimum cluster contour coefficient.

Preferably, in the third step, before the OD streams included in each spatial cluster are respectively clustered based on the vector geometric feature distance by using a partition-based clustering algorithm, it is assumed that the OD streams in the same spatial cluster are translated to make the spatial positions of the OD streams in the same spatial cluster coincide, so that the OD streams in the same spatial cluster are unified under the same vector spatial coordinate system, and each spatial cluster exists under a separate vector spatial coordinate system.

Preferably, in the OD stream clustering method based on vector constraint, the event point is a geometric midpoint of the OD stream; the spatial characteristic distance is Euclidean distance, and the geometric characteristic distance is a difference value of the modified cosine similarity.

Preferably, the OD stream clustering method based on vector constraint further includes:

step four, visually displaying the plurality of spatial clusters on a map and defining each spatial cluster as an OD flow community, and/or calculating the event point mean value and the vector mean value of each vector cluster so as to obtain a representative OD flow of each vector cluster, and visually displaying all the representative OD flows on the map.

Preferably, in the OD stream clustering method based on vector constraint, the OD stream data set is obtained by: obtaining taxi moving track data generated by GPS positioning, and obtaining track data with passenger getting-on and getting-off position information through semantic extraction.

The invention at least comprises the following beneficial effects:

The invention provides a two-step clustering algorithm for OD flows. Event points represent the spatial positions of the OD flows and vectors represent the OD flows themselves, so that the pattern characteristics of the OD flows are expressed through the OD flow event points and the OD flow vectors. A partition-based clustering algorithm first clusters the OD flows based on the spatial characteristic distance of the event points, so that OD flows whose spatial positions are relatively close are divided into the same spatial cluster. The OD flows contained in each spatial cluster are then clustered separately by a partition-based clustering algorithm based on the geometric characteristic distance of the vectors, so that OD flows with similar flow patterns are grouped into one vector cluster. The method simplifies the complexity of OD flow similarity calculation, optimizes the dimension of the feature matrix in the clustering process, pays more attention to the overall spatial distribution and movement trend of the OD flows, can mine OD flow clusters of arbitrary shape with representative characteristics and discover OD flow communities, and facilitates optimizing traffic community planning and analyzing OD flow dynamics.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.

Drawings

FIG. 1 is a flow diagram of an OD flow clustering method based on vector constraints in one embodiment;

FIG. 2 is a logic diagram of an OD flow clustering method based on vector constraints in one embodiment;

FIG. 3 is a flow diagram of an OD flow clustering method based on vector constraints in another embodiment;

FIG. 4 is a schematic diagram of a computing process of a keypoint awareness algorithm in one embodiment;

FIG. 5 is a schematic diagram of identifying elbow inflection points to be verified in an SSE curve using a keypoint perception algorithm in one embodiment;

FIG. 6 is a schematic diagram of an elbow inflection point verification using a contour coefficient curve in one embodiment;

FIG. 7 is a flow diagram illustrating the selection of a new cluster center in the next cycle in one embodiment;

FIG. 8 is a flight-line diagram of the original taxi OD flows in one embodiment;

FIG. 9(a) is a flight-line diagram of the 1st taxi OD flow spatial cluster based on event point clustering in one embodiment;

FIG. 9(b) is a flight-line diagram of the 2nd taxi OD flow spatial cluster based on event point clustering in one embodiment;

FIG. 9(c) is a flight-line diagram of the 3rd taxi OD flow spatial cluster based on event point clustering in one embodiment;

FIG. 9(d) is a flight-line diagram of the 4th taxi OD flow spatial cluster based on event point clustering in one embodiment;

FIG. 10 is a flight-line diagram of the representative OD flows in a clustering result in one embodiment.

Detailed Description

The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.

Referring to fig. 1, the present invention provides an OD flow clustering method based on vector constraint, including the following steps:

step one, obtaining an OD stream data set; extracting an event point from each OD stream data for representing the spatial position of the OD stream; and calculating a vector of each OD flow by using the O point geographic coordinates and the D point geographic coordinates of each OD flow data, and representing the OD flow by using the vector.

From the perspective of a spatio-temporal point process, an OD flow can be regarded as an event of urban crowd activity and abstracted as a point process, with the spatial attribute of the OD flow represented by its event point. In map generalization, a line can likewise be abstracted as a point at small and medium scales. The invention therefore defines P_od as the OD flow event point, used to characterize the spatial attribute of the OD flow. The purpose of representing the spatial position of an OD flow by point coordinates is to treat the OD flow as a whole, as a line object, and to represent its overall spatial position attribute by the OD flow event point. In one embodiment, the OD flows are taxi trajectory data containing semantic information on passengers getting on and off; the generation of one OD flow represents a passenger completing one activity by taxi, which can be regarded as a single point process, so the spatial attribute of a taxi OD flow can be represented by an event point.

An OD flow is a directed line segment obtained by semantic extraction from complex trajectory data. It has no physical entity in geographic space; rather, it represents, in semantic space, a flow in geographic space. Although an OD flow has no specific road-network-based real trajectory, it has a definite direction of crowd movement and definite spatial and temporal distances between O and D. A vector is a quantity having magnitude and direction. In the present invention, an OD flow is regarded as a geometric vector $\overrightarrow{OD}$, and the magnitude and direction of the OD flow are given by the modulus and direction of $\overrightarrow{OD}$.

$\overrightarrow{OD}$ is calculated as follows:

$$\overrightarrow{OD} = (X_D - X_O,\; Y_D - Y_O)$$

where $\{X_O, Y_O\}$ are the O-point geographic coordinates of the OD flow and $\{X_D, Y_D\}$ are the D-point geographic coordinates of the OD flow. In each taxi OD flow, point O is the GPS position of the taxi when the passenger gets on, and point D is the GPS position of the taxi when the passenger gets off.

The direction of the OD flow is represented by geometric vector features, and the distribution and distance in space are replaced by the distribution density and distance of the OD flow event points. The present invention therefore defines the data structure of an OD stream as:
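A minimal sketch of one way such a structure could look, assuming it pairs the event point with the flow vector and that the event point is the geometric midpoint of O and D (the preferred choice stated elsewhere in this description). The names `ODFlow` and `make_od_flow` are hypothetical, and the coordinates are treated as planar for simplicity.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class ODFlow:
    """Hypothetical container: an event point plus a flow vector."""
    event_point: Point  # P_od, assumed here to be the midpoint of O and D
    vector: Point       # (X_D - X_O, Y_D - Y_O)


def make_od_flow(o: Point, d: Point) -> ODFlow:
    """Build an ODFlow from O-point and D-point coordinates."""
    (x_o, y_o), (x_d, y_d) = o, d
    midpoint = ((x_o + x_d) / 2.0, (y_o + y_d) / 2.0)
    vector = (x_d - x_o, y_d - y_o)
    return ODFlow(event_point=midpoint, vector=vector)


# Example: a flow from (0, 0) to (4, 2).
flow = make_od_flow((0.0, 0.0), (4.0, 2.0))
print(flow.event_point, flow.vector)
```

With real longitude and latitude values, a map projection would normally be applied before computing midpoints and vectors; that step is omitted here.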

step two, clustering all OD flows based on the spatial characteristic distance of the event points by adopting a partition-based clustering algorithm, thereby dividing all OD flows into a plurality of spatial clusters.

The invention selects a clustering algorithm based on the definition and selection of clusters. The expected clustering result of the invention is the OD flows with close spatial relationship and similar geometric shape in the cluster, so the clustering algorithm based on division is selected in the second step and the third step for two-step clustering.

The vector characteristics are description forms which can reflect the line state most, and the size and the direction of the vector describe the length and the angle of the line, so that the distance and the direction of OD flow are expressed. However, the vector feature cannot describe information of a spatial dimension, and any high-dimensional object can be mapped into a point object in a two-dimensional space, so that the spatial feature of an event point of an OD stream is adopted in the invention to reflect the spatial attribute of the OD stream. The invention can express any OD stream object through the event point and the vector of the OD stream, and carries out two-step clustering from the space dimension and the geometric dimension.

Fig. 2 is a logic diagram of the OD flow clustering method based on vector constraint according to the present invention, which simulates the process of converting OD flow data from point space to flow space and vector space in clustering algorithm logic. The raw OD data resides in discrete GPS trace point space. Paired OD point set data can be obtained through semantic extraction, namely a point space with passenger getting-on and getting-off information is obtained. And then calculating an OD flow vector through the OD point pair, constructing a flow model of the OD flow, and establishing a traffic flow space with the OD flow. The event points of the OD stream are further extracted, thereby constructing an OD stream space with OD stream event points. In the OD flow space, the expression of OD flow data is described as two dimensions: spatial dimension, geometric feature dimension, and defining elements in the OD stream clusters should satisfy both spatial and geometric feature dimension similarity. Thus, the OD stream clustering process can be realized by a two-step clustering method. The first is the "spatial partitioning" process. OD flows are divided into a plurality of space clusters in an OD flow space by using a clustering algorithm based on OD flow event points, the size and the direction of the OD flows in each space cluster are different, but the spatial position relationship among the OD flows is relatively tight. This is followed by a "vector clustering" process. In the process, only the geometric characteristics of the OD flows in each spatial cluster are considered, and the OD flow vectors are used for calculating the similarity to perform clustering.
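The two-step logic described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `kmeans` is a plain Lloyd's iteration seeded with the first k points, the numbers of clusters `k1` and `k2` are given rather than selected automatically, and the vector step uses Euclidean distance between unit vectors as a stand-in for the modified cosine similarity distance. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def kmeans(points: Sequence[Point], k: int, iters: int = 50) -> List[int]:
    """Plain Lloyd's k-means (Euclidean); seeds are simply the first k points."""
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep the old center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def unit(v: Point) -> Point:
    """Scale a vector to unit length so that only its direction is compared."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n) if n else (0.0, 0.0)


def two_step_cluster(event_points: Sequence[Point], vectors: Sequence[Point],
                     k1: int, k2: int) -> List[Optional[Tuple[int, int]]]:
    """Return (spatial cluster id, vector cluster id) for every OD flow."""
    spatial = kmeans(event_points, k1)              # spatial partitioning
    result: List[Optional[Tuple[int, int]]] = [None] * len(event_points)
    for s in range(k1):                             # each spatial cluster is independent
        idx = [i for i, lab in enumerate(spatial) if lab == s]
        if not idx:
            continue
        sub = [unit(vectors[i]) for i in idx]
        sub_labels = kmeans(sub, min(k2, len(sub)))  # vector clustering
        for i, v in zip(idx, sub_labels):
            result[i] = (s, v)
    return result


events = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (10.2, 9.8), (0.1, 0.4), (9.9, 10.1)]
vecs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (2.0, 0.1), (0.0, 2.0)]
print(two_step_cluster(events, vecs, k1=2, k2=2))
```

Because the inner loop touches one spatial cluster at a time, each iteration could be dispatched to a separate worker.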

The invention does not adopt a method for mashup calculation of a composite stream distance expression by using the space distance and the form distance. The reason for this is that the fusion of spatial distance and morphological distance is very complex, and the two features are interdependent and affect each other. The expression mode of the flow space distance by using the weighted distance function also has the defect, and the problems of multi-scale expression and global normalization cannot be well solved. From the view of global flow distribution, the scale difference generated by different lengths, angles and spatial positions cannot be solved well; from the view of local flow distribution, the density distribution of different dimensions has a significant influence on the clustering result, and the uniform solution cannot be performed from the global perspective. According to the invention, through dimension division, the influence of global distribution is solved through spatial clustering, so that a stream cluster set with a tight intra-cluster spatial relationship is obtained, then clustering is carried out through the geometric characteristic distance in the spatial clusters, the problem of uneven local characteristic density is solved, and representative vector clusters in different spatial clusters are respectively obtained.

The method first clusters on spatial similarity, which helps discover OD flow communities and facilitates optimizing traffic community planning and analyzing OD flow dynamics. It then clusters the OD flows within the same spatial cluster based on their vector characteristics, paying more attention to the overall spatial distribution and movement trend of the OD flows; it can mine OD flow clusters of arbitrary shape with representative characteristics, which also facilitates analyzing OD flow dynamics.

The invention simplifies the complexity of OD flow similarity calculation and optimizes the dimension of the feature matrix in the clustering process. Through dimension deconstruction, the invention first clusters on spatial similarity, which simplifies large-scale OD flow data; the OD flows are grouped by spatial cluster and brought into a unified vector coordinate system, and representative vector characteristics are then extracted within each cluster using the modified cosine similarity. The method also reduces the normalization error and loss of local features caused in previous research by fixing vector features to 4 or 8 orientations.

Assuming that the distance matrix is a symmetric matrix, in the existing line clustering algorithm, a 2 × n matrix with a diagonal of 0 needs to be constructed for the O and D points for n OD flows. However, in the algorithm proposed by the present invention, the feature matrix size is:

the clustering algorithm provided by the invention is suitable for a distributed operation environment, and especially for the step three, each spatial cluster can independently perform geometric characteristic clustering operation.

In a preferred embodiment, in the vector constraint-based OD stream clustering method, the partition-based clustering algorithm is a k-means clustering algorithm.

The selection of seed points (that is, cluster centers) is an important step of k-means clustering. The value of k determines the number of clusters, and the choice of k affects the iteration efficiency of the algorithm. When determining k, the conventional elbow method relies on subjectively selecting an inflection point, while the contour coefficient can exhibit multiple maxima. In the invention, a PIP (perceptually important points) key point perception algorithm is adopted in the two-step clustering of step two and step three to automatically identify the elbow inflection point of the SSE (sum of squared errors) curve, and the accuracy of the elbow inflection point judgment is improved by checking against the contour coefficient curve, so that the optimal k value is extracted automatically.

By traversing all values of k, the sum of squared errors and the overall contour coefficient of the current clustering are calculated for each value of k, which yields the SSE curve, from which the elbow inflection point is found, and the contour coefficient curve.

The elbow method explores the relationship between the k value and the true number of clusters by calculating the sum of squared errors (SSE). The SSE is calculated as follows:

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2$$

where $C_i$ is the i-th cluster, $p$ is a sample point in $C_i$, and $m_i$ is the mean of all samples in $C_i$. When k is smaller than the true number of clusters, increasing k greatly decreases the SSE; once k reaches the true number of clusters, the gain in cohesion obtained by increasing k diminishes rapidly. The k value corresponding to the elbow inflection point of the SSE curve is therefore taken as the true number of clusters of the data.
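As an illustration of the sum of squared errors described above, the following sketch computes it for two-dimensional points that have already been grouped into clusters. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def cluster_mean(points: Sequence[Point]) -> Point:
    """Mean m_i of all samples in one cluster."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sum_squared_errors(clusters: Sequence[Sequence[Point]]) -> float:
    """SSE: squared distance of every sample to its own cluster mean, summed."""
    total = 0.0
    for points in clusters:
        if not points:
            continue  # an empty cluster contributes nothing
        m = cluster_mean(points)
        total += sum((p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 for p in points)
    return total


# Two clusters: {(0, 0), (2, 0)} with mean (1, 0), and the single point (5, 5).
print(sum_squared_errors([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]))
```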

The contour coefficient (commonly known as the silhouette coefficient) is a clustering evaluation measure that combines the two factors of cohesion and separation. For any vector i, its contour coefficient s(i) is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the average distance from vector i to all other points in the cluster to which it belongs, and b(i) is the minimum, over all other clusters, of the average distance from vector i to all points in that cluster. Averaging the contour coefficients of all points gives the total contour coefficient of the clustering result (that is, the overall contour coefficient of the current clustering). The closer the contour coefficient is to 1, the better the cohesion and separation. However, the contour coefficient is a relative evaluation index: it fluctuates as k changes, its curve is non-convex, and a plurality of local optimal solutions exist. The elbow method is therefore needed to assist in selecting which contour coefficient maximum, and hence which k value, is taken as the optimal number of clusters.
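The per-point and overall contour coefficients described above can be sketched in plain Python as follows. Euclidean distance is assumed, at least two clusters must be present, and a point that is alone in its cluster is given the value 0 by convention; these choices and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _mean_distance(p: Point, others: Sequence[Point]) -> float:
    return sum(math.dist(p, q) for q in others) / len(others)


def contour_coefficients(points: Sequence[Point], labels: Sequence[int]) -> List[float]:
    """Return s(i) for every point; assumes at least two distinct labels."""
    cluster_ids = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # assumed convention for a single-member cluster
            continue
        a = _mean_distance(p, own)  # cohesion: mean distance within own cluster
        b = min(                    # separation: closest other cluster on average
            _mean_distance(p, [q for j, q in enumerate(points) if labels[j] == c])
            for c in cluster_ids
            if c != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def overall_contour_coefficient(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Average of s(i) over all points."""
    values = contour_coefficients(points, labels)
    return sum(values) / len(values)


pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(overall_contour_coefficient(pts, [0, 0, 1, 1]))  # well-separated grouping
print(overall_contour_coefficient(pts, [0, 1, 0, 1]))  # poor grouping, lower value
```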

Fig. 4 shows a schematic diagram of the calculation process of the PIP key point perception algorithm. A curve is defined as a sequence P. The first point p1 and the last point p2 of the sequence are taken as the first PIP (key point) and the second PIP of the algorithm. The two key points are connected to construct the straight line p1p2, the vertical distance from each point in the sequence P to the straight line p1p2 is calculated, and the point in the sequence with the largest vertical distance is identified as the inflection point.

The vertical distance from the test point p3 to the straight line p1p2 in Fig. 4 is calculated by the formula:

$$VD(p_3, p_c) = \lvert y_c - y_3 \rvert = \left\lvert \left( y_1 + (y_2 - y_1)\,\frac{x_c - x_1}{x_2 - x_1} \right) - y_3 \right\rvert$$

where $x_c = x_3$, and $p_c = (x_c, y_c)$ is the point on the straight line p1p2 with the same horizontal coordinate as p3.
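A short sketch of this vertical-distance computation, applied to locating an elbow candidate on a curve of error square sum values indexed by k. The function names and the sample values are hypothetical, and the curve is assumed to contain at least two points with distinct k values.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def vertical_distance(p1: Point, p2: Point, p3: Point) -> float:
    """Vertical distance from p3 to the straight line through p1 and p2."""
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    y_c = y1 + (y2 - y1) * (x3 - x1) / (x2 - x1)  # point on the line at x_c = x3
    return abs(y_c - y3)


def elbow_candidate(ks: Sequence[int], sse: Sequence[float]) -> int:
    """Return the k whose SSE point lies farthest (vertically) from the chord."""
    first = (float(ks[0]), sse[0])
    last = (float(ks[-1]), sse[-1])
    best_k, best_d = ks[0], -1.0
    for k, s in zip(ks, sse):
        d = vertical_distance(first, last, (float(k), s))
        if d > best_d:
            best_k, best_d = k, d
    return best_k


ks = [2, 3, 4, 5, 6, 7, 8]
sse = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0, 19.0]  # made-up illustrative values
print(elbow_candidate(ks, sse))
```

For these illustrative values the chord runs from (2, 100) to (8, 19), and the point at k = 4 lies farthest below it.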

The invention adopts the key point perception algorithm to identify an elbow inflection point to be verified from the SSE curve, so as to improve the accuracy of identifying the elbow inflection point. Fig. 5 illustrates the process. The SSE curve is taken as the sequence P; its first point (head point) and last point (tail point) are captured as adjacent PIPs and the straight line through them is constructed; the vertical distance from each point on the SSE curve to this straight line is calculated; and the point with the maximum vertical distance is captured as the height fluctuation point (that is, the elbow inflection point to be checked).

In the present invention, the sequence P (the SSE curve) is a monotonic curve whose inflection point usually lies near the third PIP, so only one PIP identification is needed. The PIP key point perception algorithm is typically used to compress static data and does not give a stable result for sequences of varying length: the third PIP fluctuates slightly around the true elbow inflection point as the tail point changes. The invention therefore uses the contour coefficient maximum points as a constraint to assist in selecting the optimal k value.

Fig. 6 shows a schematic diagram of elbow inflection point verification using the contour coefficient curve. To better illustrate the identification process, the SSE sequence and the contour coefficient sequence are normalized into the same coordinate system. The maxima of the contour coefficient are then calculated; in Fig. 6 there are three contour coefficient maximum points on the contour coefficient curve, each marked by a circle. The nearest neighbor of the elbow inflection point to be checked (that is, the third PIP) is searched for among the contour coefficient maxima: the horizontal distance between each contour coefficient maximum and the elbow inflection point to be checked is calculated, and the maximum point with the smallest horizontal distance is taken as the nearest neighbor. The third PIP is then moved to the k value position of the nearest neighbor, giving the PIP inflection point under the contour coefficient constraint, and the corresponding k value is taken as the optimal number of clusters.
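The verification step can be sketched as follows. Two details are assumptions made for the sketch, since the text does not fix them: an end point of the curve counts as a maximum when it exceeds its only neighbor, and a tie in horizontal distance is resolved in favor of the smaller k. The function names and scores are hypothetical.

```python
from typing import List, Sequence


def contour_maxima(ks: Sequence[int], scores: Sequence[float]) -> List[int]:
    """k values at which the contour coefficient curve has a local maximum."""
    maxima = []
    n = len(scores)
    for i in range(n):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i < n - 1 else float("-inf")
        if scores[i] > left and scores[i] > right:
            maxima.append(ks[i])
    return maxima


def constrained_k(elbow_k: int, maxima_ks: Sequence[int]) -> int:
    """Move the elbow candidate to the nearest maximum (smaller k wins a tie)."""
    return min(maxima_ks, key=lambda k: (abs(k - elbow_k), k))


ks = [2, 3, 4, 5, 6, 7, 8]
scores = [0.40, 0.55, 0.50, 0.52, 0.62, 0.48, 0.51]  # made-up illustrative values
maxima = contour_maxima(ks, scores)
print(maxima, constrained_k(4, maxima))
```

Here the maxima fall at k = 3, 6, and 8, so an elbow candidate at k = 4 is moved to k = 3.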

During the traversal, a new cluster center must be selected in each new cycle. To optimize the efficiency of the clustering algorithm in each cycle, the invention evaluates the cluster contour coefficient of each cluster in the current cycle and generates a new cluster center within the range of the cluster with the minimum cluster contour coefficient for the next round of calculation. Fig. 7 shows the selection process of the new cluster center in the next cycle. The cluster contour coefficient of a cluster is obtained by first calculating the contour coefficient of each point and then averaging the contour coefficients of all points in the cluster. The value range of the cluster with the minimum cluster contour coefficient is then determined, and a coordinate pair is randomly generated within that range (the objects in the cluster exist in the form of coordinate pairs); this coordinate pair is used as the new seed point (that is, the new cluster center).
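The seed selection described above can be sketched as follows, assuming the "value range" of a cluster is its axis-aligned bounding box and that the cluster contour coefficients have already been computed. The function name and inputs are hypothetical.

```python
import random
from typing import Sequence, Tuple

Point = Tuple[float, float]


def new_seed_in_worst_cluster(clusters: Sequence[Sequence[Point]],
                              cluster_scores: Sequence[float],
                              rng: random.Random) -> Point:
    """Draw a random coordinate pair inside the cluster with the lowest score."""
    worst = min(range(len(clusters)), key=lambda i: cluster_scores[i])
    xs = [p[0] for p in clusters[worst]]
    ys = [p[1] for p in clusters[worst]]
    # Assumed interpretation: the value range is the bounding box of the members.
    return (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))


clusters = [[(0.0, 0.0), (1.0, 1.0)], [(10.0, 10.0), (14.0, 12.0)]]
scores = [0.8, 0.2]  # the second cluster has the smaller cluster contour coefficient
print(new_seed_in_worst_cluster(clusters, scores, random.Random(0)))
```

The new seed is added to the existing centers, so the cluster that is currently least coherent is the one split when k increases by one.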

In the two-step clustering process of step two and step three, the PIP (perceptually important points) key point perception algorithm is adopted to automatically identify the SSE elbow inflection point, which is verified against the contour coefficient curve, so that the optimal number of spatial clusters and the optimal number of vector clusters are determined. Fig. 3 shows an OD flow clustering method based on vector constraint in one embodiment, in which the error square sum key point perception algorithm that takes the contour coefficient into account is adopted in both step two and step three.

Furthermore, the above method of selecting new cluster centers is adopted in the two-step clustering process of both step two and step three, which improves iteration efficiency and reduces the number of iterations.

In a preferred embodiment, in the OD stream clustering method based on vector constraint, in the second step, a clustering algorithm based on partitioning is used to cluster all OD streams based on spatial feature distances of event points, so as to partition all OD streams into a plurality of spatial clusters, and the specific process includes:

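step (1) traversing all values of k₁ and performing clustering; calculating, at each value of k₁, the error square sum of the current clustering and the overall contour coefficient of the current clustering, thereby obtaining an error square sum curve and a contour coefficient curve; and selecting a plurality of contour coefficient maximum values from the contour coefficient curve, where k₁ = 2, 3, 4, ..., K₁;

step (2) extracting an elbow inflection point to be checked from the error square sum curve by adopting a key point perception algorithm, wherein a straight line is constructed through the first point and the last point of the error square sum curve, and the point on the error square sum curve with the maximum vertical distance from the straight line is calculated and taken as the elbow inflection point to be checked;

step (3) searching, among the plurality of contour coefficient maximum values, for the nearest neighbor point of the elbow inflection point to be checked, and taking the k₁ corresponding to the nearest neighbor point as the optimal k₁ value, the optimal k₁ value being the number of spatial clusters.

The value of K₁ can be determined empirically.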

In a preferred embodiment, in the OD stream clustering method based on vector constraint, in step (1) of step two, the process of traversing all values of k₁ and performing clustering includes: in each cycle, calculating the cluster contour coefficient of each spatial cluster under the current value of k₁, and, in the next cycle, randomly generating a new cluster center within the spatial cluster having the minimum cluster contour coefficient.

In a preferred embodiment, in the OD stream clustering method based on vector constraint, in the third step, a clustering algorithm based on partitioning is adopted to cluster OD streams contained in any spatial cluster based on a geometric feature distance of a vector, so as to partition the OD streams contained in the spatial cluster into a plurality of vector clusters, and the specific process includes:

Extracting an elbow inflection point from the error square sum curve by adopting a key point perception algorithm, wherein a straight line is constructed through a first point and a last point in the error square sum curve, and a point with the maximum vertical distance from the straight line on the error square sum curve is calculated and is taken as the elbow inflection point to be verified;
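The key point extraction described in this step can be sketched as below. The sketch uses the perpendicular distance to the chord; for a fixed straight line this differs from the vertical distance only by a constant factor, so both readings select the same point. The function name and the sample SSE values are hypothetical.

```python
import math


def elbow_by_chord_distance(ks, sse):
    """Return the k whose (k, SSE) point lies farthest from the chord
    joining the first and last points of the SSE curve."""
    x1, y1 = ks[0], sse[0]
    x2, y2 = ks[-1], sse[-1]
    chord = math.hypot(x2 - x1, y2 - y1)
    best_k, best_d = ks[0], -1.0
    for k, e in zip(ks, sse):
        # Perpendicular distance from (k, e) to the line through both endpoints.
        d = abs((y2 - y1) * k - (x2 - x1) * e + x2 * y1 - y2 * x1) / chord
        if d > best_d:
            best_k, best_d = k, d
    return best_k


ks = [2, 3, 4, 5, 6, 7]
sse = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # made-up values
print(elbow_by_chord_distance(ks, sse))
```

The returned k is only the candidate elbow; in the method it is still checked against the contour coefficient maxima before being accepted.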

The value of K₂ can be determined empirically.

In a preferred embodiment, in the OD stream clustering method based on vector constraint, in step (1) of step three, the process of traversing all values of k₂ for clustering includes: in each cycle, calculating the cluster contour coefficient of each vector cluster under the current value of k₂, and randomly generating the new cluster center for the next cycle within the vector cluster having the minimum cluster contour coefficient.

In a preferred embodiment, in the third step, before the OD streams included in each spatial cluster are respectively clustered based on the vector geometric feature distance by using a partition-based clustering algorithm, it is assumed that the OD streams in the same spatial cluster are translated to make the spatial positions of the OD streams in the same spatial cluster coincide, so that the OD streams in the same spatial cluster are unified under the same vector coordinate system, and each spatial cluster exists under a separate vector spatial coordinate system.

Referring to fig. 2, it is assumed that the OD streams in each spatial cluster are translated before vector clustering is performed, so that spatial positions of the OD streams in the same spatial cluster are overlapped, so as to ignore spatial position differences of the OD streams in the same spatial cluster, and further unify an OD stream vector coordinate system. Therefore, after the space division, each space cluster in the OD stream space is converted into an independent vector space separately, and the clustering process of all vector clusters can be run in parallel.

In the process of vector clustering, each vector coordinate system is normalized first.

In a preferred embodiment, in the OD stream clustering method based on vector constraint, the event point is a geometric midpoint of the OD stream; the spatial characteristic distance is Euclidean distance, and the geometric characteristic distance is a difference value of the modified cosine similarity.

P_od is the geometric line midpoint of the OD flow, calculated as follows:

$$P_{od} = \left( \frac{X_O + X_D}{2},\ \frac{Y_O + Y_D}{2} \right)$$

where {X_O, Y_O} are the geographic coordinates of the O point of the OD flow and {X_D, Y_D} are the geographic coordinates of the D point of the OD flow. In this embodiment, the spatial properties of the OD flows are represented by the geometric line midpoints of the OD flows.

The method takes the OD flow as an integral object, and constructs a corresponding distance function through the space attribute and the geometric attribute of the OD flow. However, in high-dimensional data analysis, data of different dimensions cannot be directly compared and operated, and weight distribution has strong subjectivity when a distance function is constructed. Therefore, in the invention, two-step clustering is respectively carried out based on the similarity of the spatial dimension and the geometric characteristic, and the Euclidean distance and the modified cosine similarity are adopted as a spatial distance function and a geometric characteristic distance function.

The spatial feature distance D_spatial(i, j) of OD flows is defined as the geographic Euclidean distance D_EUC(P_i, P_j) between the event point P_i of OD flow OD_i and the event point P_j of OD flow OD_j. The calculation formula is as follows:

$$D_{spatial}(i,j) = D_{EUC}(P_i, P_j) = \sqrt{(X_{P_i} - X_{P_j})^2 + (Y_{P_i} - Y_{P_j})^2}$$

where the geographic coordinates of event point P_i are {X_{P_i}, Y_{P_i}} and the geographic coordinates of event point P_j are {X_{P_j}, Y_{P_j}}.
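The event point and the spatial feature distance can be sketched as follows. The sketch assumes each OD flow is given as a pair of planar (X, Y) coordinate tuples; with raw longitude and latitude the Euclidean value is only an approximation of ground distance. The function names are hypothetical.

```python
import math


def event_point(o, d):
    """Geometric midpoint of the segment from origin o to destination d."""
    return ((o[0] + d[0]) / 2.0, (o[1] + d[1]) / 2.0)


def spatial_distance(flow_i, flow_j):
    """Euclidean distance between the event points of two OD flows."""
    xi, yi = event_point(*flow_i)
    xj, yj = event_point(*flow_j)
    return math.hypot(xi - xj, yi - yj)


flow_a = ((0.0, 0.0), (4.0, 0.0))  # event point (2, 0)
flow_b = ((2.0, 3.0), (2.0, 5.0))  # event point (2, 4)
print(spatial_distance(flow_a, flow_b))
```

Because only midpoints enter this distance, two flows with very different directions or lengths can still fall into the same spatial cluster; separating them is left to the vector clustering step.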

The vector is calculated as follows:

$$\overrightarrow{OD} = (x, y) = (X_D - X_O,\ Y_D - Y_O)$$

where {X_O, Y_O} are the geographic coordinates of the O point of the OD flow and {X_D, Y_D} are the geographic coordinates of the D point of the OD flow. The geometric feature distance D_vector(i, j) of OD flows is defined as the difference value of the modified cosine similarity. The calculation formula is as follows:

$$D_{vector}(i,j) = 1 - Sim_{AdjCos}(OD_i, OD_j)$$

$$Sim_{AdjCos}(OD_i, OD_j) = \frac{(x_i - R_x)(x_j - R_x) + (y_i - R_y)(y_j - R_y)}{\sqrt{(x_i - R_x)^2 + (y_i - R_y)^2}\ \sqrt{(x_j - R_x)^2 + (y_j - R_y)^2}}$$

where Sim_AdjCos(OD_i, OD_j) represents the modified cosine similarity between the vector (x_i, y_i) of OD_i and the vector (x_j, y_j) of OD_j, R_x is the mean value within the cluster in the specified dimension x, and R_y is the mean value within the cluster in the specified dimension y. The modified cosine similarity normalizes the different dimensions while accounting for the angular difference between vectors, indirectly takes the vector modulus into account, and thus comprehensively measures similarity in both vector magnitude and vector direction. Because the similarity value range is [-1, 1], the invention uses the dissimilarity obtained from the difference as the distance function.
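A minimal sketch of the geometric feature distance follows. It assumes the distance is taken as 1 minus the modified cosine similarity and that the means R_x and R_y are computed over the vectors of one spatial cluster; the handling of a zero-length centered vector is also an assumption. The function names are hypothetical.

```python
import math


def flow_vector(o, d):
    """Vector of an OD flow; subtracting o moves every flow to a common origin."""
    return (d[0] - o[0], d[1] - o[1])


def cluster_mean(vectors):
    """Per-dimension mean (R_x, R_y) of the vectors in one spatial cluster."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def adjusted_cosine_distance(v_i, v_j, mean):
    """1 - cosine similarity of the two vectors after centering on the mean."""
    ax, ay = v_i[0] - mean[0], v_i[1] - mean[1]
    bx, by = v_j[0] - mean[0], v_j[1] - mean[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0.0:
        return 1.0  # assumption: similarity is taken as 0 when undefined
    return 1.0 - (ax * bx + ay * by) / norm


vectors = [(1.0, 0.0), (3.0, 0.0), (2.0, 3.0)]
mean = cluster_mean(vectors)  # (2.0, 1.0)
print(adjusted_cosine_distance(vectors[0], vectors[1], mean))
print(adjusted_cosine_distance(vectors[0], vectors[2], mean))
```

Centering on the cluster mean is what makes the measure sensitive to magnitude: the first two sample vectors point in the same direction but differ in length, and after centering they are orthogonal rather than identical.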

In a preferred embodiment, the OD stream clustering method based on vector constraint further includes: step four, visually displaying the plurality of spatial clusters on a map and defining each spatial cluster as an OD flow community, and/or calculating the event point mean value and the vector mean value of each vector cluster to obtain a representative OD flow of each vector cluster, and visually displaying all the representative OD flows on the map.
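One way to turn the two mean values into a drawable representative OD flow is sketched below. The text states only that the event point mean and the vector mean are computed; placing the mean vector so that it is centered on the mean event point is an assumption, and the function name is hypothetical.

```python
def representative_flow(flows):
    """Build one representative OD flow for a vector cluster.

    flows: list of ((X_O, Y_O), (X_D, Y_D)) pairs belonging to the cluster.
    """
    n = len(flows)
    # Mean of the event points (geometric midpoints).
    mid_x = sum((o[0] + d[0]) / 2.0 for o, d in flows) / n
    mid_y = sum((o[1] + d[1]) / 2.0 for o, d in flows) / n
    # Mean of the OD flow vectors.
    vec_x = sum(d[0] - o[0] for o, d in flows) / n
    vec_y = sum(d[1] - o[1] for o, d in flows) / n
    # Assumption: center the mean vector on the mean event point.
    origin = (mid_x - vec_x / 2.0, mid_y - vec_y / 2.0)
    destination = (mid_x + vec_x / 2.0, mid_y + vec_y / 2.0)
    return origin, destination


cluster = [((0.0, 0.0), (2.0, 2.0)), ((2.0, 0.0), (4.0, 4.0))]
print(representative_flow(cluster))
```

Under this placement the representative flow runs from the mean O point to the mean D point of the cluster, which makes the result easy to check against the raw data.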

In a preferred embodiment, in the OD stream clustering method based on vector constraint, the OD stream data set is obtained by: obtaining taxi moving track data generated by GPS positioning, and obtaining track data with passenger getting-on and getting-off position information through semantic extraction.

Example one

The taxi OD flow is a trajectory carrying passenger pick-up and drop-off position information, obtained by semantic extraction from taxi movement trajectory data generated by GPS positioning. Compared with a complex real trajectory, the OD flow does not depend entirely on real road network distribution data, directly reflects the travel characteristics of urban residents, and is an important data source for mining the spatio-temporal activity patterns of urban crowds. The experimental data are part of the taxi GPS trajectory data of Beijing (12,000 records) from 6 a.m. to 9 a.m. on January 11, 2008, after preliminary OD semantic extraction. The data format is the original taxi GPS trajectory information structure, with fields including the encrypted taxi number, GPS feedback time, real-time longitude and latitude, passenger-carrying state, passenger-carrying event, speed, and direction angle. To facilitate visualization of the aggregation results, all experiments were carried out with JavaScript + HTML + CSS front-end web development technology, and the clustering and visualization algorithms were written in JavaScript. The visualization of the unclustered taxi OD flows is shown in FIG. 8.

Clustering is carried out based on the spatial feature distance of the event points, and the optimal k₁ value is obtained automatically during clustering by the error square sum key point perception algorithm considering contour coefficients; the k₁ value of the spatial clusters based on the taxi OD flow event points is 4. The 4 OD flow spatial clusters generated during clustering are shown in fig. 9(a), 9(b), 9(c) and 9(d), each of which shows one OD flow community.

Clustering is carried out based on the geometric feature distance of the vectors, and the optimal k₂ value is obtained automatically during clustering by the error square sum key point perception algorithm considering contour coefficients; the k₂ values of the vector clusters corresponding to the four spatial clusters are 4, 4, 5 and 4, respectively. In addition, to facilitate observation of the overall flow trend of the 17 flow clusters, the OD flow event point mean value and the OD flow vector mean value of each flow cluster are calculated as representative visual description indexes of the flow cluster, and the visual result is shown in fig. 10.

By comparing fig. 9(a), 9(b), 9(c) and 9(d) with fig. 10, it can be seen that the result of the first clustering step, "spatial partitioning", constrains the OD flow vector coordinate system: the OD flow clusters are clearly divided into 4 bundles, each of which in turn constrains different vector clusters. This composite flow pattern is mainly affected by the calculation of the OD flow event points. An OD flow is not a real trajectory and has no trajectory midpoint; defining a flow event point can be understood as a spatial abstraction of the OD flow. Therefore, the midpoint of the OD flow has a certain physical significance in cluster analysis and pattern recognition, and flow bundling patterns are easier to find when the OD flow midpoint is used as the event point. The physical meaning of taxi OD flow clustering is traffic flow clustering with obvious spatial division, generated by the attraction of urban functional areas and by interaction around transportation hubs. Because the formation of traffic flow communities depends on urban transportation hubs, the invention compares the bundling pattern with the spatial distribution of the transportation hubs of Beijing (Beijing West Railway Station and the Liuliqiao passenger transport hub; Beijing South Railway Station, the Songjiazhuang transport hub and Nanyuan Airport; Beijing Railway Station and the Liuliqiao and Sihui transport hubs; Beijing Capital Airport and the Xiyuan transport hub) and finds an obvious spatial correlation between the two. It can also be seen that a transportation hub not only serves as a place where passenger flow gathers and disperses, but also attracts the surrounding traffic interaction. Therefore, the line aggregation method can better identify OD flow communities and find OD flow clusters with potential spatial connections.

The aggregation algorithm can find representative, arbitrarily shaped geometric feature clusters on the basis of identifying the spatial clusters of OD flow event points. In the prior art, the similarity of OD points is often constrained by defining a regular search space, an additional partition into geographic cells, or a uniform continuous density space, so as to obtain regular clusters with similar geometric forms, or irregular clusters with similar OD point semantic features or uniform OD point density. The morphological structure of these clusters depends on the parameter definitions of hierarchy-based and density-based clustering algorithms. However, because parameters such as the search radius and intra-cluster connectivity are fixed, existing aggregation algorithms cannot deal well with line sets whose global density is uneven. Therefore, based on a partition-based clustering algorithm, the invention constrains the vector coordinate system through spatial clustering of event points, normalizing the origin of the vector coordinates of the OD flows within each spatial cluster, and then performs vector geometric feature clustering through a distance function based on the modified cosine similarity. No stage of the algorithm requires parameters to be preset; only the optimal k values of the spatial clusters and of the vector clusters are calculated, respectively, from indexes such as the contour coefficient and the error square sum. Performing spatial clustering first and adopting the modified cosine similarity mitigates, as far as possible, the feature loss caused by global data normalization. Moreover, because the algorithm does not apply the conventional two-point constraint to the geometric form and is influenced only by the optimal k value of the geometric clusters within each spatial cluster and by the modified cosine similarity, it can find irregularly distributed aggregation clusters of OD flows. Thus the algorithm can identify not only clusters with similar patterns but also clusters with a divergence or convergence pattern. As shown in fig. 10, the invention can find a traffic evacuation and convergence pattern influenced by Beijing Capital International Airport.

The invention provides a two-step flow clustering algorithm for OD data, which expresses the pattern characteristics of OD flows through OD flow event points and OD flow vectors and maps the OD data from a massive point set space to independent vector feature spaces, thereby simplifying the OD flow similarity calculation and paying more attention to the overall spatial distribution and motion trend of the OD flows. Compared with the prior art, it departs from the line clustering idea based on two-point clustering, pays more attention to the overall (high-dimensional) similarity of OD flows, optimizes the feature matrix dimension in the clustering process, and resolves the optimal number of clusters automatically without any preset parameter. It can mine OD flow clusters of arbitrary shape with representative characteristics and discover OD flow communities, which facilitates the optimization of traffic zone planning and the analysis of OD flow dynamics problems.

While embodiments of the invention have been disclosed above, it is not intended to be limited to the uses set forth in the specification and examples. It can be applied to all kinds of fields suitable for the present invention. Additional modifications will readily occur to those skilled in the art. It is therefore intended that the invention not be limited to the exact details and illustrations described and illustrated herein, but fall within the scope of the appended claims and equivalents thereof.

Claims

1. The OD flow clustering method based on vector constraint is characterized by comprising the following steps of:

2. The vector constraint-based OD flow clustering method of claim 1, wherein the partition-based clustering algorithm is a k-means clustering algorithm.

3. The method as claimed in claim 2, wherein in the second step, all the OD streams are clustered based on the spatial feature distance of the event point by using a clustering algorithm based on partitioning, so as to partition all the OD streams into a plurality of spatial clusters, and the specific process includes:

step (1): traversing all values of k₁ for clustering, calculating the error square sum and the overall contour coefficient of the current clustering at each value of k₁ to obtain the error square sum curve and the contour coefficient curve, and selecting a plurality of contour coefficient maximum values from the contour coefficient curve, where k₁ = 2, 3, 4, ..., K₁;

4. The vector constraint-based OD flow clustering method of claim 3, wherein in step (1) of step two, the process of traversing all values of k₁ for clustering includes: in each cycle, calculating the cluster contour coefficient of each spatial cluster under the current value of k₁, and randomly generating the new cluster center for the next cycle within the spatial cluster having the minimum cluster contour coefficient.

5. The method according to claim 2, wherein in the third step, a clustering algorithm based on partitioning is used to cluster the OD streams contained in any spatial cluster based on the geometric feature distance of the vector, so as to partition the OD streams contained in the spatial cluster into a plurality of vector clusters, and the specific process includes:

step (1): traversing all values of k₂ for clustering, calculating the error square sum and the overall contour coefficient of the current clustering at each value of k₂ to obtain the error square sum curve and the contour coefficient curve, and selecting a plurality of contour coefficient maximum values from the contour coefficient curve, where k₂ = 2, 3, 4, ..., K₂;

6. The vector constraint-based OD flow clustering method of claim 5, wherein in step (1) of step three, the process of traversing all values of k₂ for clustering includes: in each cycle, calculating the cluster contour coefficient of each vector cluster under the current value of k₂, and randomly generating the new cluster center for the next cycle within the vector cluster having the minimum cluster contour coefficient.

7. The method according to claim 1, wherein in the third step, before the OD streams contained in each spatial cluster are clustered based on the vector-based geometric feature distance by using the partition-based clustering algorithm, it is assumed that the OD streams in the same spatial cluster are translated to make the spatial positions of the OD streams in the same spatial cluster coincide, so that the OD streams in the same spatial cluster are unified under the same vector space coordinate system, and each spatial cluster exists under a separate vector space coordinate system.

8. The OD flow clustering method based on vector constraints of claim 1, wherein the event point is a geometric midpoint of the OD flow; the spatial characteristic distance is Euclidean distance, and the geometric characteristic distance is a difference value of the modified cosine similarity.

9. The vector constraint-based OD flow clustering method of claim 1, further comprising:

10. The OD flow clustering method based on vector constraints of claim 1, wherein the OD flow data set is obtained by: obtaining taxi moving track data generated by GPS positioning, and obtaining track data with passenger getting-on and getting-off position information through semantic extraction.