CN107679558A

CN107679558A - User trajectory similarity measurement method based on metric learning

Info

Publication number: CN107679558A
Application number: CN201710847477.7A
Authority: CN
Inventors: 邵俊明; 刘松灵; 杨勤丽; 于忠靖; 朱庆
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-09-19
Filing date: 2017-09-19
Publication date: 2018-02-09
Anticipated expiration: 2037-09-19
Also published as: CN107679558B

Abstract

The invention discloses a user trajectory similarity measurement method based on metric learning, in which the similarity between user trajectories is obtained by calculating the distance between them in combination with a metric learning method. First, the location-time joint probability distribution matrix of each user is generated. Next, the initial similarity between different user trajectories is calculated from the user distribution matrices by means of KL divergence, and the initial categories of the user trajectories are generated by spectral clustering (that is, the user trajectories are divided into different categories according to the similarity matrix, to facilitate the subsequent calculation of the similarity metric function). Finally, on the basis of the initial similarity matrix S and the initial trajectory category set C, a metric learning technique is applied to obtain similarity characterization vectors of the user trajectories, which carry the user preference patterns and share the same dimension, together with a metric function A; on this basis, the distance between user trajectories is calculated to obtain the user trajectory similarity.

Description

User trajectory similarity measurement method based on metric learning

Technical Field

The invention belongs to the technical field of track similarity measurement, and particularly relates to a user track similarity measurement method based on metric learning.

Background

With the development of positioning satellites, personal positioning devices and wireless networks, user trajectory data has grown explosively. Because of the potential social value of user trajectory data mining, the field is attracting more and more attention, especially from computer science, geographic information science and social science. Meanwhile, in industry, the analysis and mining of user trajectory data creates great commercial value in various fields. For example, a traffic management department can analyze traffic flow data to avoid urban congestion at travel peaks and to address related urban traffic and urban environment problems; an enterprise whose business involves user travel can, by mining user trajectory data and establishing an effective model, address problems such as user travel path planning, neighbor user recommendation and customer location optimization.

User trajectory data mining algorithms, such as trajectory clustering, trajectory prediction and abnormal trajectory detection, often involve the measurement of user trajectory similarity. User trajectory similarity measurement is a core technology in user trajectory data mining and has important theoretical and practical value. Current user trajectory similarity measures fall mainly into two groups: measures in the spatio-temporal space and measures in the feature space.

In the spatio-temporal space, because a user trajectory has a time characteristic, the similarity measurement methods usually extend time-series similarity measures from a time-attribute sequence to a three-dimensional time-space-attribute sequence, for example longest common subsequence, dynamic time warping and edit distance. The common drawback of these methods is that all coordinates and time information in the user trajectory are treated equally, so key location or time information present in the trajectory is ignored. Some similarity measures consider only coordinate information; since the spatial and temporal aspects of a user trajectory are tightly coupled, they cannot measure user trajectory similarity effectively.

In the feature space, the basic idea is to extract inherent features of the user trajectory, such as its speed, curvature, length and starting point. These methods rely on expert knowledge, so the extracted features easily become highly redundant, and key information that distinguishes one user trajectory from another is lost.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a user trajectory similarity measurement method based on metric learning, which generates a similarity representation that captures the global properties of user trajectories and the user preference patterns and has a consistent feature dimension.

In order to achieve the above object, the present invention provides a user trajectory similarity measurement method based on metric learning, which is characterized by comprising the following steps:

(1) User mobile data collection and cleaning

Collecting user mobile data, and sorting and cleaning the user mobile data according to analysis requirements: extracting the time and position information of the key points hidden in the user mobile data by adopting a key point information extraction technology (namely POI (Point of Interest) extraction), to obtain a trajectory representation of the user based on the key points;

(2) User location-time joint distribution calculation

Firstly, clustering the time and position information of key places of all users by using a clustering algorithm (such as DBSCAN) to obtain a hot spot area, then obtaining the key places accessed by all users at high frequency by combining the position information of the known key places of the city, and extracting P key places which are ranked at the front, namely more accessed key places as places accessed by a user track.

The overall user activity time is dynamically divided according to its distribution in the time dimension to obtain T time periods, and, based on the locations that the user trajectories tend to visit and the time period division, a location-time joint probability distribution matrix is obtained for each user i, i = 1, 2, …, m, where m is the number of users; the matrix directly reflects the distribution of each user trajectory in the space dimension and the time dimension;

(3) Obtaining the initial similarity matrix of the user track

Based on the location-time joint probability distribution matrix of each user, calculate the initial similarity matrix S between the user trajectories.

The initial similarity matrix S is a symmetric similarity matrix whose element s_{i,j} represents the similarity between user i and user j. It is defined from the KL divergence d_{i,j} between the distribution matrices of the two users, with a function width parameter σ that is determined according to specific implementation conditions.

In the KL divergence d_{i,j}, w_i(p, t) is the element of the location-time joint probability distribution matrix of user i giving the probability that user i appears at visited location p in time period t, and w_j(p, t) is the element of the location-time joint probability distribution matrix of user j giving the probability that user j appears at visited location p in time period t;

(4) Initial category acquisition of trajectory

Summing each row of the initial similarity matrix S, using the sums in row order as the diagonal elements of a diagonal matrix D, then calculating the Laplacian matrix L = D - S, and solving, through SVD (singular value decomposition), for the k smallest eigenvalues of L and the corresponding eigenvectors;

Constructing a matrix M: the eigenvectors are taken in sequence as columns to form a matrix M with m rows and k columns, wherein each row of M corresponds to a row of the original initial similarity matrix S and is a k-dimensional representation of a user trajectory;

Finally, on the k-dimensional representation, the category label information of each user trajectory is obtained by K-Means to form the initial trajectory category set C;

(5) Trajectory similarity metric learning

The initial similarity matrix S and the initial trajectory category set C correspond respectively to the two elements of metric learning, namely the similarity matrix and the side information. The user trajectories are processed with a metric learning method to obtain a learned and optimized metric function A, together with the similarity characterization vectors sv_i of the user trajectories in the same feature space.

Finally, combining the similarity characterization vectors sv_i of the user trajectories with the metric function A, the distance between user trajectories is calculated as a Mahalanobis distance:

$$dist(sv_i, sv_j) = \sqrt{(sv_i - sv_j)^{T} A \,(sv_i - sv_j)}$$

The smaller the distance dist(sv_i, sv_j) between user trajectories, the greater their similarity, and vice versa.

The object of the invention is thus achieved.

The user trajectory similarity measurement method based on metric learning obtains the similarity between user trajectories by calculating the user trajectory distance in combination with a metric learning method. First, user mobile data is collected, sorted and cleaned, and a clustering method is then used to extract the P locations that user trajectories tend to visit (hereinafter referred to as visited locations); meanwhile, the overall user activity time is dynamically divided according to its distribution in the time dimension to obtain T time periods, so that a location-time joint probability distribution matrix (hereinafter referred to as a user distribution matrix) is generated for each user. Next, the initial similarity between different user trajectories is calculated from the user distribution matrices by means of KL divergence, and the initial categories of the user trajectories are generated by spectral clustering (that is, the user trajectories are divided into different categories according to the similarity matrix, to facilitate the subsequent calculation of the similarity metric function). Finally, on the basis of the initial similarity matrix S and the initial trajectory category set C, a metric learning technique is applied to obtain the similarity characterization vectors of the user trajectories, which carry the user preference patterns and share the same dimension, together with the metric function A; on this basis, the distance between user trajectories is calculated to obtain the user trajectory similarity.

The invention addresses two problems of traditional methods in the spatio-temporal space, namely the non-uniform spatio-temporal scale of trajectories and the difficulty of extracting effective features. By obtaining the spatio-temporal distribution features of user trajectories and applying a metric learning method, it improves the effectiveness of user trajectory similarity measurement, so that similar trajectories are measured as similar and dissimilar trajectories as dissimilar. The method can be widely applied in various user trajectory data mining technologies.

Drawings

FIG. 1 is a flow chart of an embodiment of a method for measuring track similarity based on metric learning according to the present invention;

FIG. 2 is a schematic diagram of the dynamic time period partitioning in the present invention, wherein T1-T4 represent four time periods, and the diagram shows the trajectory coordinate probability distribution under partitioning by a fixed time period (1 hour) and the trajectory probability distribution after dynamic partitioning;

FIG. 3 is a system diagram of a user trajectory similarity measure method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the principle of the metric learning technique in the present invention, in which (a) represents the distance between samples of two classes before metric learning, and (b) represents the distance between samples of two classes after metric learning.

Detailed Description

Embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It is expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.

Fig. 1 is a flowchart of a specific embodiment of the method for measuring track similarity based on metric learning according to the present invention.

In this embodiment, as shown in fig. 1, the trajectory similarity measurement method based on metric learning of the present invention includes the following steps:

step S1: user mobile data collection and cleaning

User movement data generally contains a user number and trajectory information, where the trajectory information is an ordered sequence of (location coordinates, timestamp) pairs. Common GPS (Global Positioning System) trajectories, such as vehicle positioning GPS trajectories and personal mobile phone GPS trajectories, are coordinate-plus-time series acquired at a relatively high frequency, and therefore need to be sorted and cleaned.

Because GPS data contains a large amount of redundant information, it is necessary to extract the time and position information of the key (important) places that characterize its spatio-temporal distribution. The invention adopts a traditional POI (Point of Interest) extraction method to extract the position information of the key (important) places hidden in the GPS data, together with the time distribution information of the corresponding positions, to obtain the time and position information of the key places of the user trajectory, namely the trajectory representation of the user based on key places. In a specific implementation, other similar extraction methods or expert knowledge may also be used to extract the time and position information of the key places from the position data.

Step S2: user location-time joint distribution computation

For location POI extraction, the invention adopts the density-based DBSCAN clustering method to obtain the hotspot areas visited by all user trajectories; a hotspot area implicitly indicates that the area has a higher probability of being visited by any user. Combining these with the position information of the known key places of the city gives the key places visited at high frequency by all users, and the P top-ranked, i.e. most visited, key places are extracted as the locations that user trajectories tend to visit. In this way the user trajectory length can be significantly reduced while the original movement pattern of the user is retained.
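As an illustration of the final selection step only, the following sketch counts visits per key place and keeps the P most visited ones. The function name `top_key_places`, the `(user_id, place_id)` visit format and the sample data are hypothetical; the DBSCAN hotspot clustering and the matching against known city key places are assumed to have been done beforehand.

```python
from collections import Counter


def top_key_places(visits, P):
    """Return the P most frequently visited key places.

    visits: iterable of (user_id, place_id) pairs, one per recorded visit.
    """
    counts = Counter(place for _, place in visits)
    # most_common sorts by count (descending); ties keep first-seen order.
    return [place for place, _ in counts.most_common(P)]


# Hypothetical visit log for three users.
visits = [
    ("u1", "station"), ("u1", "office"), ("u2", "station"),
    ("u2", "mall"), ("u3", "station"), ("u3", "office"),
]
print(top_key_places(visits, 2))
```

Each trajectory can then be rewritten as a sequence over these P places, which is what shortens it.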

The overall user activity time is dynamically divided according to its distribution in the time dimension to obtain T time periods. For the time period division, statistical analysis is performed on the time dimension of the user trajectories, resulting in a time probability distribution as shown in FIG. 2. Based on the differences in how often trajectory points occur in different time slices, the invention replaces the original fixed time period division (such as the 24 hours of a day) with a dynamic time period division. For example, assume that a user trajectory data set has the following distribution in the time dimension:

(0:00 to 1:00, w = 0.3), (1:00 to 2:00, w = 0.05), (3:00 to 4:00, w = 0.35), (4:00 to 5:00, w = 0.3)

wherein w is the probability of a user appearing in the time period. The time periods can then be divided dynamically:

wherein p_t is the probability of a user appearing in time period t, δ is the parameter that controls the span of the divided time periods, takes a value between 0 and 1 and can be adjusted automatically, and T_t denotes time period t. In this embodiment δ = 0.1 is taken, which gives the new time distribution shown in FIG. 2. Before dynamic time division, most users appear between 3:00 and 5:00 and very few between 1:00 and 2:00. As a result, when the user distribution matrix is calculated, the elements for 1:00 to 2:00 approach zero, and if the number of time periods is large the whole user distribution matrix becomes sparse. After dynamic time division, the span of a time period in which users are sparse changes from the original fixed width (1 hour) to δ/p_t, which is equivalent to extending the original time period both backward and forward by 0.5 × (δ/p_t - 1), so that more visited locations are included and the sparsity problem is alleviated. In this example, it is assumed that the partition in the time dimension changes after the dynamic partitioning process, and the probability of trajectory coordinates appearing in the period that was originally 1:00 to 2:00 is significantly increased.
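The widening rule described above can be sketched as follows. The partition formula itself is not reproduced in this text, so the sketch assumes that only periods with p_t below δ are widened (to a span of δ/p_t original widths) and that denser periods keep their original span; the function name `widen_sparse_periods` and the tuple layout are hypothetical.

```python
def widen_sparse_periods(periods, delta):
    """Widen time periods whose occupancy probability p_t is below delta.

    periods: list of (start_hour, end_hour, p_t) with a fixed original width.
    Returns a list of (start_hour, end_hour) after dynamic partitioning.
    """
    widened = []
    for start, end, p_t in periods:
        width = end - start
        if 0.0 < p_t < delta:
            # New span is delta / p_t original widths, i.e. the period grows
            # by 0.5 * (delta / p_t - 1) widths on each side.
            extension = 0.5 * (delta / p_t - 1.0) * width
            widened.append((start - extension, end + extension))
        else:
            # Assumption: dense (or empty) periods keep their original span.
            widened.append((start, end))
    return widened


# The example distribution from the description, with delta = 0.1.
periods = [(0, 1, 0.3), (1, 2, 0.05), (3, 4, 0.35), (4, 5, 0.3)]
for original, new in zip(periods, widen_sparse_periods(periods, delta=0.1)):
    print(original[:2], "->", new)
```

With these inputs only the 1:00 to 2:00 period changes; it doubles in span and therefore overlaps its neighbours. How overlaps are resolved is not specified here and is left open in the sketch.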

The spatial and temporal features obtained in this way are used to generate the user distribution matrix of each user.
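One plausible way to build such a user distribution matrix is to count a user's visits per (location, time period) cell and normalize by the total number of visits. This is a minimal sketch under that assumption; `joint_distribution` and the index-pair input format are hypothetical names, not taken from the patent.

```python
def joint_distribution(visits, num_places, num_periods):
    """Build a P x T location-time joint probability matrix for one user.

    visits: list of (place_index, period_index) pairs, one per visit.
    """
    counts = [[0] * num_periods for _ in range(num_places)]
    for place, period in visits:
        counts[place][period] += 1
    total = len(visits)
    if total == 0:
        raise ValueError("a user needs at least one visit")
    # Normalize so that all entries sum to 1.
    return [[count / total for count in row] for row in counts]


# Hypothetical user with four visits over 2 places and 2 time periods.
user_visits = [(0, 0), (0, 0), (1, 1), (0, 1)]
matrix = joint_distribution(user_visits, num_places=2, num_periods=2)
print(matrix)
```

Rows index the P visited locations and columns the T time periods, so entry (p, t) plays the role of w_i(p, t) used in step S3.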

and step S3: user trajectory initial similarity matrix acquisition

For the user distribution matrices obtained in step S2, the invention calculates the initial similarity matrix S of the user trajectories by means of KL divergence.

The initial similarity matrix S is a symmetric similarity matrix whose element s_{i,j} represents the similarity between user i and user j.

In the KL divergence between user i and user j, w_i(p, t) is the element of the location-time joint probability distribution matrix of user i giving the probability that user i appears at visited location p in time period t, and w_j(p, t) is the element of the location-time joint probability distribution matrix of user j giving the probability that user j appears at visited location p in time period t;

The initial similarity matrix S can be regarded as a new characterization of the user trajectories in the similarity space: each row of the matrix corresponds to one user trajectory, and the subsequent metric learning is also based on this similarity matrix. Intuitively, if two trajectories have the same similarities to all the other trajectories, they can be considered similar; if their similarities to the remaining trajectories differ significantly, they are considered dissimilar.
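A sketch of step S3 is given below. The exact expressions are not reproduced in this text, so several choices here are assumptions: the KL divergence is smoothed with a small `eps` to avoid division by zero, it is symmetrized by averaging both directions (so that S comes out symmetric), and the similarity is taken as a Gaussian kernel exp(-d² / (2σ²)) of width σ. All function names are hypothetical.

```python
import math


def kl_divergence(w_i, w_j, eps=1e-9):
    """KL divergence between two joint distribution matrices (smoothed)."""
    total = 0.0
    for row_i, row_j in zip(w_i, w_j):
        for a, b in zip(row_i, row_j):
            # eps keeps the logarithm finite when an entry is zero.
            total += a * math.log((a + eps) / (b + eps))
    return total


def similarity_matrix(matrices, sigma=1.0):
    """Symmetric similarity matrix from pairwise symmetrized KL divergence."""
    m = len(matrices)
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            # Assumption: average both directions to make d symmetric.
            d = 0.5 * (kl_divergence(matrices[i], matrices[j])
                       + kl_divergence(matrices[j], matrices[i]))
            # Assumption: Gaussian kernel with width parameter sigma.
            S[i][j] = math.exp(-d * d / (2.0 * sigma * sigma))
    return S


# Three hypothetical users: the first two visit the same place.
users = [
    [[0.5, 0.5], [0.0, 0.0]],
    [[0.4, 0.6], [0.0, 0.0]],
    [[0.0, 0.0], [0.5, 0.5]],
]
S = similarity_matrix(users)
print(S[0][1] > S[0][2])
```

Each row of S is then the similarity-space characterization of one user trajectory, as described above.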

And step S4: trajectory initial class acquisition

Once the initial similarity matrix S of all user track data is obtained, clustering division is carried out on the user tracks on the similar space by using a spectral clustering method, and category label information of each user track is obtained.

In this embodiment, the specific method of cluster partitioning is as follows:

4.1) Sum each row of the initial similarity matrix S of all user trajectory data to obtain the row sums d_i of the similarity matrix of the trajectory data set:

$$d_i = \sum_{j=1}^{m} s_{i,j}$$

The row sums are then taken as the elements on the diagonal of the diagonal matrix D. Physically, each diagonal element can be interpreted as the sum of the similarity weights between one user trajectory and the other user trajectories similar to it. We then calculate:

L = D - S

wherein L is the Laplacian matrix. After L is obtained, matrix decomposition is carried out on it to output, in order, the k smallest eigenvalues and the corresponding eigenvectors.
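The construction of D and L can be sketched directly from the definitions above. The function name `laplacian` and the sample matrix are hypothetical; the eigen-decomposition itself is left out, and in practice a numerical routine such as `numpy.linalg.eigh` could be applied to the result.

```python
def laplacian(S):
    """Unnormalized graph Laplacian L = D - S of a similarity matrix S."""
    m = len(S)
    # d_i is the sum of row i; D is diagonal with d_i on the diagonal.
    degrees = [sum(row) for row in S]
    return [
        [(degrees[i] if i == j else 0.0) - S[i][j] for j in range(m)]
        for i in range(m)
    ]


# Hypothetical 3 x 3 symmetric similarity matrix.
S_example = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
L_example = laplacian(S_example)
for row in L_example:
    print(row)
```

Every row of L sums to zero, which is a quick sanity check on the construction.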

4.2) Selecting the number k of initial trajectory categories

Before labeling a user track with labels obtained by clustering division, the number k of specific category labels needs to be initialized. From a practical point of view, it is difficult for one to determine the magnitude of the k value a priori. The invention provides a method for selecting a k value based on a minimum description length so as to perform clustering operation on a user track.

Specifically, n different k values are initialized, and for each k value a description length based on the model parameters can be calculated for the similarity matrix of the trajectory data set. By the minimum description length principle, the k that minimizes the cost of encoding the observed data with the model, i.e. all similarity vectors of the trajectory data set, can be considered the optimal choice.

Suppose there are two candidate parameters k₁ and k₂; we calculate:

wherein θ is k₁ or k₂, |C_i| is the number of samples contained in the i-th cluster, s_j is the j-th row of the trajectory similarity matrix, and μ_i is the mean of the i-th cluster. dist is a distance function; in this example the Euclidean distance is used. If Loss₁ > Loss₂, then the parameter k₂ encodes all the data better, so this value is chosen as the number-of-labels parameter of the model.
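Since the loss expression is not reproduced in this text, the sketch below covers only the data-coding part that the symbol definitions suggest: the summed Euclidean distance from each row s_j to the mean μ_i of its cluster. The function `coding_loss` is hypothetical, and it omits any model-cost term that depends on k or |C_i|; without such a term the distance sum alone tends to shrink as k grows, so it should not be used by itself to pick k.

```python
import math


def coding_loss(rows, labels):
    """Sum of Euclidean distances from each row to its cluster mean."""
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    loss = 0.0
    for members in clusters.values():
        # Column-wise mean of the cluster members.
        mean = [sum(col) / len(members) for col in zip(*members)]
        for row in members:
            loss += math.sqrt(sum((a - b) ** 2 for a, b in zip(row, mean)))
    return loss


# Hypothetical rows and two candidate labelings of the same data.
rows = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
loss_1 = coding_loss(rows, [0, 1, 0, 1])
loss_2 = coding_loss(rows, [0, 0, 1, 1])
print(loss_1, loss_2)
```

Here the second labeling groups nearby rows together and gives the smaller value, matching the rule that the candidate with the lower loss is preferred.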

4.3) K-Means clustering to obtain user trajectory category label information

Constructing a matrix M: the eigenvectors are taken in sequence as columns to form a matrix M with m rows and k columns, wherein each row of M corresponds to a row of the original initial similarity matrix S and is a k-dimensional representation of a user trajectory. Finally, on the k-dimensional representation, the category label information of each user trajectory is obtained by K-Means to form the initial trajectory category set C.
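The K-Means step on the rows of M can be sketched with plain Lloyd iterations. This is an illustrative version only: the function `k_means`, the fixed iteration count and the deterministic initialization from the first k rows are assumptions made to keep the example reproducible, and the rows of `M_rows` are made-up stand-ins for eigenvector coordinates.

```python
def k_means(rows, k, iterations=20):
    """Cluster rows with Lloyd's algorithm; returns one label per row."""
    # Assumption: initialize centroids from the first k rows.
    centroids = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, row in enumerate(rows):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(row, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Hypothetical 2-dimensional spectral embedding of four trajectories.
M_rows = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
C_labels = k_means(M_rows, k=2)
print(C_labels)
```

The returned labels correspond to the initial trajectory category set C used in step S5.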

Step S5: trajectory similarity metric learning

In the invention, metric learning is used to obtain a representation of the user trajectories with a uniform feature dimension and a more accurate and robust similarity metric function.

5.1) Definitions involved in metric learning

5.1.1) In a specific metric learning task there are usually three kinds of sample-pair sets: the must-link (necessarily connected) set, the cannot-link (necessarily unconnected) set and the similarity difference set. An example is given below to help the reader understand them;

Assume that there is a sample set {A, B, C}, and it is known that A and B are highly similar to each other while B and C have low similarity. Then there is a must-link set S = {(A, B)}, a cannot-link set D = {(B, C)}, and a similarity difference set Diff = {((A, B), C)}. The samples in each set express the similarity or difference between samples, which is then added as a constraint to the subsequent learning process. In this example, the category label information of each user trajectory has already been obtained in step S4, and on that basis a must-link set and a cannot-link set can be established in turn. The criterion for assigning sample pairs to the sets is: if the labels of two user trajectories are the same, the pair is added to the must-link set; if the labels are different, the pair is added to the cannot-link set.
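The assignment criterion above translates directly into code. In this minimal sketch the function name `build_constraints` and the use of index pairs are hypothetical; the labels stand for the category labels obtained in step S4.

```python
from itertools import combinations


def build_constraints(labels):
    """Split all index pairs into must-link and cannot-link sets by label."""
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            must_link.append((i, j))    # same category label
        else:
            cannot_link.append((i, j))  # different category labels
    return must_link, cannot_link


# Three hypothetical trajectories: the first two share a label.
must_link, cannot_link = build_constraints([0, 0, 1])
print(must_link, cannot_link)
```

Enumerating all pairs grows quadratically with the number of trajectories, so a real implementation might sample pairs instead; that choice is outside what the text specifies.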

5.1.2) From the Euclidean distance function

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^{T}(x_i - x_j)}$$

the extended distance metric function, namely the Mahalanobis distance, can be obtained:

$$d_A(x_i, x_j) = \sqrt{(x_i - x_j)^{T} A \,(x_i - x_j)}$$

wherein x_i and x_j are two samples, and the transformation matrix A ∈ R^{d×d} must be a positive semidefinite matrix. It can be seen that when A = I, the Mahalanobis distance degenerates to the Euclidean distance. When A is constrained to be a diagonal matrix, what is learned is a metric function with a different weight in each feature dimension. If A is decomposed to obtain A^{1/2}, multiplying a sample in the original space by A^{1/2} is equivalent to transforming each dimension of the sample, which gives the representation of the sample in the new feature space.
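The two special cases mentioned above (A = I and a diagonal A) can be checked with a short sketch of the Mahalanobis distance. The function name `mahalanobis` and the sample vectors are hypothetical; A is assumed to be positive semidefinite, so the value under the square root is non-negative.

```python
import math


def mahalanobis(x, y, A):
    """Distance sqrt((x - y)^T A (x - y)) for a positive semidefinite A."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    # Matrix-vector product A * diff.
    A_diff = [sum(A[r][c] * diff[c] for c in range(n)) for r in range(n)]
    return math.sqrt(sum(d * v for d, v in zip(diff, A_diff)))


x, y = [0.0, 0.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
weighted = [[4.0, 0.0], [0.0, 0.0]]  # only the first dimension counts

print(mahalanobis(x, y, identity))  # Euclidean distance
print(mahalanobis(x, y, weighted))  # first dimension weighted by 4
```

With the identity matrix the result equals the Euclidean distance; with the diagonal matrix the second dimension is ignored entirely, which illustrates per-dimension weighting.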

5.2) Although the previous steps have embedded the user trajectories into a feature space of consistent dimension, the distance metric is biased, so that the distance between samples that actually belong to the same class may be larger than the distance between samples that belong to different classes. At this point we have obtained an initial trajectory similarity measure and a constructed constraint set, as shown in FIG. 3. Using the category information or constraint information, and on the basis of the initial similarity matrix, we perform one of the most critical steps of the invention, namely iteratively optimizing the metric function.

The result that the invention needs to achieve, i.e. learning a new metric function A, is determined as follows. To calculate the new metric function in combination with the user similarity constraint information, an optimization model is constructed, subject to

A ⪰ 0

wherein A ⪰ 0 denotes that A is a positive semidefinite matrix. The optimization problem is solved by gradient descent with iterative projection, and the similarity metric function matrix A is finally returned.
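The following is a simplified sketch of gradient descent with projection, not the optimization model of the invention, whose objective is not reproduced in this text. It assumes a diagonal A (per-dimension weights), a surrogate objective that sums squared must-link distances and subtracts squared cannot-link distances, projection onto non-negative weights (which is what positive semidefiniteness means for a diagonal matrix), and a rescaling step that fixes the overall scale. The name `learn_diagonal_metric` and the data are hypothetical.

```python
def learn_diagonal_metric(X, must_link, cannot_link, lr=0.01, steps=100):
    """Learn non-negative per-dimension weights by projected gradient descent."""
    dim = len(X[0])
    # The surrogate objective is linear in the weights, so its gradient
    # is constant: must-link pairs add, cannot-link pairs subtract.
    grad = [0.0] * dim
    for i, j in must_link:
        for d in range(dim):
            grad[d] += (X[i][d] - X[j][d]) ** 2
    for i, j in cannot_link:
        for d in range(dim):
            grad[d] -= (X[i][d] - X[j][d]) ** 2

    weights = [1.0] * dim  # start from the Euclidean metric (A = I)
    for _ in range(steps):
        weights = [w - lr * g for w, g in zip(weights, grad)]
        # Projection: a diagonal A is PSD iff all weights are >= 0.
        weights = [max(w, 0.0) for w in weights]
        total = sum(weights)
        if total == 0.0:
            return [1.0] * dim  # degenerate case: fall back to A = I
        # Assumption: rescale so the weights sum to dim (fixes the scale).
        weights = [w * dim / total for w in weights]
    return weights


# Dimension 0 separates the two classes; dimension 1 is mostly noise.
X = [[0.0, 0.0], [0.1, 3.0], [5.0, 0.2], [5.1, 3.1]]
must = [(0, 1), (2, 3)]
cannot = [(0, 2), (0, 3), (1, 2), (1, 3)]
learned = learn_diagonal_metric(X, must, cannot)
print(learned)
```

The learned weights emphasize the dimension that separates the classes. For a full matrix A, the projection step would instead clip negative eigenvalues to zero after each gradient update.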

As shown in FIG. 4, assume that different shapes represent different user trajectory categories. The physical meaning of the model is that, for two trajectories belonging to the must-link set, the distance between them (using their characterization in the similarity space) is minimized, i.e. the distances between circles and between triangles in the figure are shortened, while the distance between user trajectories belonging to the cannot-link set is maximized, i.e. the distance between circles and triangles is enlarged. Thus, by changing the weights of the different dimensions in the metric space accordingly, the distance metric function is optimized (corresponding, in the figure, to the distance between the shaded sample and the other samples being successfully corrected).

Specifically, in the invention, the initial similarity matrix S and the initial trajectory category set C correspond respectively to the two elements of metric learning, namely the similarity matrix and the side information. The whole user trajectory set is processed with a metric learning method to obtain the learned and optimized metric function A, together with the similarity characterization vectors sv_i of all user trajectories in the same feature space.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they are within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims

1. A user trajectory similarity measurement method based on metric learning is characterized by comprising the following steps:

(1) User mobile data collection and cleaning

Collecting user mobile data, and sorting and cleaning the user mobile data according to analysis requirements: extracting the time and position information of the key points hidden in the user mobile data by adopting a key point information extraction technology (namely POI (Point of Interest) extraction), so as to obtain a trajectory representation of the user based on the key points;

(2) User location-time joint distribution calculation

Firstly, clustering key location time and position information of all users by using a clustering algorithm (such as DBSCAN) to obtain a hot spot area, then obtaining key locations accessed by all users at high frequency by combining position information of known key locations of cities, and extracting P key locations ranked ahead, namely more accessed key locations as locations that user tracks tend to access.

The overall user activity time is dynamically divided according to its distribution in the time dimension to obtain T time periods, and, based on the locations that the user trajectories tend to visit and the time period division, a location-time joint probability distribution matrix is obtained for each user i, i = 1, 2, …, m, where m is the number of users; the matrix directly reflects the distribution of each user trajectory in the space dimension and the time dimension;

(3) Obtaining the initial similarity matrix of the user track

wherein w_i(p, t) is the element of the location-time joint probability distribution matrix of user i giving the probability that user i appears at visited location p in time period t, and w_j(p, t) is the element of the location-time joint probability distribution matrix of user j giving the probability that user j appears at visited location p in time period t;

(4) Initial trajectory category acquisition

Summing each row of the initial similarity matrix S, using the sums in row order as the elements on the diagonal of a diagonal matrix D, then calculating the Laplacian matrix L = D - S, and solving, through SVD (singular value decomposition), for the k smallest eigenvalues of L and the corresponding eigenvectors;

finally, on the k-dimensional representation, obtaining the category label information of each user trajectory by K-Means to form the initial trajectory category set C;

(5) Trajectory similarity metric learning

The smaller the distance dist(sv_i, sv_j) between user trajectories, the greater the similarity; conversely, the larger the distance, the smaller the similarity.

2. The method according to claim 1, wherein the dynamic partitioning is:

wherein p_t is the probability of a user appearing in time period t, δ is the parameter that controls the span of the divided time periods, takes a value between 0 and 1 and can be adjusted automatically, and T_t denotes time period t.