CN103744861B

CN103744861B - Lookup method and device for frequent sub-trajectories in trajectory data

Info

Publication number: CN103744861B
Application number: CN201310687107.3A
Authority: CN
Inventors: 黄鑫; 罗军
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Guangdong Gaohang Intellectual Property Operation Co ltd; Jiangsu Aerospace Dawei Technology Co Ltd
Priority date: 2013-12-12
Filing date: 2013-12-12
Publication date: 2017-05-03
Anticipated expiration: 2033-12-12
Also published as: CN103744861A

Abstract

The invention belongs to the technical field of data processing, and provides a lookup method and device for frequent sub-trajectories in trajectory data. The lookup method comprises the steps of separating spatial information and time information in the trajectory data; encoding the spatial information into first-kind characters, wherein each first-kind character is used for representing a geographic position; encoding the time information into second-kind characters, wherein each second-kind character is used for representing a time interval; according to the spatial information encoded into the first-kind characters and the time information encoded into the second-kind characters, establishing a generalized suffix tree; looking up frequent sub-character-strings in the generalized suffix tree; converting the looked-up frequent sub-character-strings into the frequent sub-trajectories. According to the frequent sub-trajectory lookup method and device in the trajectory data, the efficient character string algorithm is used for processing the complex multi-dimensional numerical data, so that the computation complexity of the whole frequent sub-trajectory lookup process is greatly reduced.

Description

Lookup method and device for frequent sub-trajectories in trajectory data

Technical field

The invention belongs to the technical field of data processing, and in particular relates to a lookup method and device for frequent sub-trajectories in trajectory data.

Background

Trajectory data is data obtained by sampling the motion of one or more moving objects in a spatio-temporal environment. It includes the position of each sampling point, the sampling time, the speed and so on, and these sample records, arranged in sampling order, make up the trajectory data. Common trajectory data includes vehicle driving trajectories, the movement trajectories of mobile Internet users, the check-in trajectories of mobile Internet users, and so on. Massive trajectory data contains rich information: its frequent sub-trajectories can reveal the behavior patterns and habits of most people, or the patterns of change in the weather, and so on.

Because trajectory data is numerical, the mature algorithms for finding frequent substrings in character strings cannot be applied to it directly to find frequent sub-trajectories. The prior art therefore mostly partitions and clusters the trajectory data directly: a trajectory of length O(n) is divided into O(n²) sub-trajectories, and cluster analysis is then performed on these sub-trajectories to find the frequent ones. The whole process has high computational complexity and a long running time.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a lookup method for frequent sub-trajectories in trajectory data, intended to solve the problem that existing algorithms for finding frequent sub-trajectories in trajectory data have high computational complexity.

The embodiments of the present invention are realized as follows. A lookup method for frequent sub-trajectories in trajectory data includes:

separating the spatial information and the time information in the trajectory data;

encoding the spatial information into first-kind characters, each first-kind character being used to represent a geographic position;

encoding the time information into second-kind characters, each second-kind character being used to represent a time interval;

establishing a generalized suffix tree according to the spatial information encoded into the first-kind characters and the time information encoded into the second-kind characters;

looking up frequent substrings in the generalized suffix tree;

converting the frequent substrings found into frequent sub-trajectories.

Another purpose of the embodiments of the present invention is to provide a lookup device for frequent sub-trajectories in trajectory data, including:

a separation unit, for separating the spatial information and the time information in the trajectory data;

a first encoding unit, for encoding the spatial information into first-kind characters, each first-kind character being used to represent a geographic position;

a second encoding unit, for encoding the time information into second-kind characters, each second-kind character being used to represent a time interval;

an establishing unit, for establishing a generalized suffix tree according to the spatial information encoded into the first-kind characters and the time information encoded into the second-kind characters;

a lookup unit, for looking up frequent substrings in the generalized suffix tree;

a conversion unit, for converting the frequent substrings found into frequent sub-trajectories.

The embodiments of the present invention combine data mining techniques, the suffix tree algorithm and inexact matching to achieve a better lookup of frequent sub-trajectories in trajectory data. By using an efficient string algorithm to process the more complex multi-dimensional numerical data, the computational complexity of the whole frequent sub-trajectory lookup process is greatly reduced.

Brief description of the drawings

Fig. 1 is a flowchart of the lookup method for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 2 is a detailed flowchart of step S102 of the lookup method for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the clustering of spatial information in the lookup method for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 4 is a detailed flowchart of step S103 of the lookup method for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the generalized suffix tree established by the lookup method for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 6 is a structural block diagram of the lookup device for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

Fig. 1 shows the flow of the lookup method for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention, detailed as follows:

In S101, the spatial information and the time information in the trajectory data are separated.

Trajectory data contains both spatial information and time information. The spatial information generally includes the longitude, latitude and so on of a position, while the time information is generally represented by Unix timestamps.

Table 1 is a specific example of a segment of trajectory data, in which the recorded time information is the Unix timestamp of entering the corresponding longitude and latitude:

Table 1

In S101, the spatial information and the time information in the trajectory data are first separated, splitting the trajectory data into a spatial information sequence, for example {(113.333, 22.368), (113.111, 23.013), ...}, and a time information sequence, for example {1385521584, 1385521233, ...}.
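The separation in S101 can be sketched in a few lines of Python. The record layout (longitude, latitude, Unix timestamp) and the function name `separate` are assumptions made for illustration; the source only states that the two kinds of information are split into two sequences.

```python
from typing import List, Tuple

# Assumed record layout: (longitude, latitude, unix_timestamp).
Record = Tuple[float, float, int]

def separate(trajectory: List[Record]):
    """Split a trajectory into a spatial sequence and a time sequence."""
    spatial = [(lon, lat) for lon, lat, _ in trajectory]
    temporal = [ts for _, _, ts in trajectory]
    return spatial, temporal

# The two sample points quoted in the description.
trajectory = [(113.333, 22.368, 1385521584), (113.111, 23.013, 1385521233)]
spatial, temporal = separate(trajectory)
```

The two returned lists stay aligned by position, so the i-th timestamp still belongs to the i-th coordinate pair.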

In S102, the spatial information is encoded into first-kind characters, each first-kind character being used to represent a geographic position.

In this embodiment, the spatial information separated in S101 is clustered, converted into the corresponding geographic positions and then encoded into characters, each of which represents the corresponding geographic position. As shown in Fig. 2, S102 is specifically:

In S201, the spatial information is clustered to generate N clusters, N being an integer greater than 1.

The spatial information of the trajectory data is clustered by longitude and latitude, based on the spatial information sequence separated in S101. Specifically, the density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can be used to cluster the spatial information. As an implementation example of the present invention, as shown in Fig. 3, each point represents one record in the spatial information sequence. The points are clustered according to their longitude and latitude values, finally producing three clusters A, B and C; the remaining isolated points are excluded as noise and take no part in the subsequent computation.
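To make the clustering step concrete, here is a minimal pure-Python DBSCAN sketch. It is not the patent's implementation: the parameters `eps` and `min_pts`, the use of plain Euclidean distance on degrees (a great-circle distance would be more faithful for real coordinates), and the sample points are all assumptions chosen for illustration.

```python
import math

NOISE = -1  # label given to isolated points

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, 2, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: keep expanding
    return labels

points = [
    (113.333, 22.368), (113.334, 22.369), (113.332, 22.367),  # dense group 1
    (113.111, 23.013), (113.112, 23.014), (113.110, 23.012),  # dense group 2
    (114.500, 22.000),                                        # isolated point
]
labels = dbscan(points, eps=0.01, min_pts=3)
```

With these sample values the two dense groups become clusters 0 and 1, and the isolated point is labeled as noise, mirroring how Fig. 3 drops isolated points.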

In S202, the geographic position corresponding to each generated cluster is determined.

In this embodiment, the positions on a map are compared against the position information, such as longitude and latitude, covered by each cluster, in order to determine the geographic position corresponding to each cluster. Typically, in real data collection, a generated cluster may represent a business district or a residential community in a city, and so on.

In S203, character encoding is performed according to the geographic position corresponding to each cluster, generating the first-kind character corresponding to each cluster.

After the steps shown in Fig. 2, the spatial information in the trajectory data, for example {(113.333, 22.368), (113.111, 23.013), ...}, can be converted to {A, C, ...}, where A is the character corresponding to the cluster containing (113.333, 22.368) and C is the character corresponding to the cluster containing (113.111, 23.013). This converts the spatial information in the trajectory data from the original two-dimensional numerical values to geographic areas.
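As a small illustration of S203, the sketch below maps integer cluster labels to letters. The convention that labels are 0, 1, 2, ... with -1 for noise, the function name `encode_clusters`, and the limit of 26 clusters are assumptions of this example, not part of the source.

```python
import string

def encode_clusters(labels):
    """Map cluster ids 0, 1, 2, ... to 'A', 'B', 'C', ...; noise (-1) becomes None.

    Assumes at most 26 clusters; a larger alphabet would be needed otherwise.
    """
    return [string.ascii_uppercase[label] if label >= 0 else None
            for label in labels]

# Points falling in clusters 0 and 2, one noise point, then cluster 1.
chars = encode_clusters([0, 2, -1, 1])
```

Records encoded as None correspond to noise points and would be dropped, together with their timestamps, before the later steps.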

In S103, the time information is encoded into second-kind characters, each second-kind character being used to represent a time interval.

As shown in Fig. 4, S103 is specifically:

In S401, the time information is converted from timestamps into interval times.

In S102, the spatial information sequence has already been converted into character string form, while the original time information is a timestamp sequence in which each timestamp represents the time of entering the corresponding geographic position in the spatial information sequence. In S103 the timestamps therefore need to be converted into interval times, to determine the time difference between entering A and entering B. For example, if the converted spatial information sequence is {A, A, B, ...} and the corresponding timestamp sequence is {t₁, t₂, t₃, ...}, then the time of entering A is t₁ and the time of entering B is t₃, so the time interval from entering A to entering B is t₃ - t₁.
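The interval computation in S401 can be sketched as follows. The function name `entry_intervals` and the sample timestamps are hypothetical; the logic follows the {A, A, B, ...} example, where the entry time of a place is the timestamp of the first record in a run of identical characters.

```python
def entry_intervals(chars, timestamps):
    """Collapse runs of the same place and return (places, intervals).

    intervals[i] is the time from entering places[i] to entering places[i + 1].
    """
    places, entries = [], []
    for char, ts in zip(chars, timestamps):
        if not places or places[-1] != char:
            places.append(char)   # a new place is entered
            entries.append(ts)    # remember when it was entered
    intervals = [later - earlier for earlier, later in zip(entries, entries[1:])]
    return places, intervals

places, intervals = entry_intervals(
    ["A", "A", "B", "B", "C"],
    [100, 160, 400, 450, 1000],
)
```

Here A is entered at 100 and B at 400, so the first interval is 300, which matches the t₃ - t₁ rule in the text.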

In S402, the interval times are standardized.

In S402, before the converted interval times are standardized, the excessively long interval times among them are first removed as noise, and the remaining interval times are then standardized. Specifically, the interval times can be standardized by the following formula:

$$\mathrm{interval}(k) = \frac{\mathrm{interval}(k)}{\max_{m=1:n} \mathrm{interval}(m)}$$

where interval(k) denotes the k-th interval time and $\max_{m=1:n} \mathrm{interval}(m)$ denotes the maximum of the n valid interval times. The standardization takes the maximum of all valid interval times and divides the k-th interval time by that maximum, which gives the standardized k-th interval time. As an implementation example of the present invention, the standardized k-th interval time can be accurate to 0.001.

After standardization, all interval times become values such as 0.303, 0.349 and 0.788.
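A minimal sketch of the standardization in S402 follows. The noise cutoff `max_allowed` is a hypothetical parameter (the source only says that overly long intervals are removed), and discarding non-positive intervals is an extra guard added here to avoid division by zero.

```python
def standardize(intervals, max_allowed):
    """Drop overly long intervals as noise, then divide by the maximum."""
    valid = [x for x in intervals if 0 < x <= max_allowed]
    if not valid:
        return []
    peak = max(valid)  # maximum of the n valid interval times
    return [round(x / peak, 3) for x in valid]  # accurate to 0.001

# 50000 s is treated as noise under the assumed one-hour cutoff.
normalized = standardize([303, 349, 788, 1000, 50000], max_allowed=3600)
```

Every surviving value lies in (0, 1], with the largest valid interval mapped to 1.0.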

In S403, a second-kind character is matched to each standardized interval time.

Generally, the simplest way to turn a numerical value into a character is to round it, for example:

interval time 1 = 0.349 ≈ 0.3 and interval time 2 = 0.350 ≈ 0.4; then, according to a preset correspondence between numerical values and characters, 0.3 is converted into character 3 and 0.4 into character 4. In fact, however, the true values of interval time 1 and interval time 2 are extremely close, yet they are matched to different characters, so the matching result cannot reflect the true gap between different values. Therefore, as an embodiment of the present invention, this problem can be solved by inexact matching: each standardized interval time is matched to a combination of second-kind characters, the combination containing two second-kind characters.

Specifically, the following inexact matching method can be used:

interval time 1 = 0.349 ≈ 0.3 || 0.35 = character 6 || character 7;

interval time 2 = 0.350 ≈ 0.35 || 0.4 = character 7 || character 8.

That is, the preset numerical interval in which the standardized interval time lies is first determined, together with the two numerical endpoints of that interval. Then, according to the preset correspondence between second-kind characters and numerical endpoints, the two second-kind characters corresponding to the two endpoints are matched to the standardized interval time, so that each standardized interval time is converted into a combination of two second-kind characters (character k, character k+1).
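The endpoint matching can be sketched as below. The step of 0.05 between endpoints, with character k standing for the value k × 0.05, is inferred from the example above (0.3 is character 6, 0.35 is character 7, 0.4 is character 8) and is an assumption; the arithmetic is done in whole thousandths to avoid floating-point surprises at the interval boundaries.

```python
def match_interval(value, step_thousandths=50):
    """Return the pair (k, k + 1) of second-kind character numbers.

    `value` is a standardized interval time accurate to 0.001. Character k is
    assumed to stand for the endpoint k * 0.05, so the pair names the two
    endpoints of the preset interval that contains `value`.
    """
    k = round(value * 1000) // step_thousandths
    return (k, k + 1)

pair_1 = match_interval(0.349)  # lies in [0.30, 0.35)
pair_2 = match_interval(0.350)  # lies in [0.35, 0.40)
```

The two nearly equal values now share character 7, which is what lets a later comparison treat them as close.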

After S102 and S103, the trajectory data can be converted into a character string sequence in which spatial information and time information alternate, for example:

A (character 6) B (character 6) C ...

In S104, a generalized suffix tree is established according to the spatial information encoded into the first-kind characters and the time information encoded into the second-kind characters.

A suffix tree is a data structure that supports efficient string matching and querying. In this embodiment, because the time information is encoded as a combination of two second-kind characters, the character in the combination that represents the smaller value can be used to represent it when the tree is built. For example, if interval time 1 = character 6 || character 7, then interval time 1 can be represented by character 6 when the tree is built.

When the time information on a suffix tree node is compared with time information that has not yet been placed on the suffix tree, the inexact matching works as follows:

node n = character k = character k || character (k+1),

for example, node n = character 6 = character 6 || character 7; that is, when the interval time to be compared is encoded as character 6 and character 7, it can match the node n corresponding to character 6.

Spatial information, by contrast, is still placed on the suffix tree by exact matching.
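One way to picture the string that is fed into the tree is the sketch below, which interleaves place characters with the smaller character of each time pair. The token spelling `t6` for "character 6" and the function name `interleave` are conventions invented for this example.

```python
def interleave(places, pairs):
    """Build the alternating place/time token sequence for one trajectory.

    `pairs` holds one (k, k + 1) combination per move between places; only the
    smaller character k is written into the sequence, as described above.
    """
    if not places:
        return []
    tokens = [places[0]]
    for place, (k, _k_plus_1) in zip(places[1:], pairs):
        tokens.append(f"t{k}")  # second-kind character (time)
        tokens.append(place)    # first-kind character (place)
    return tokens

tokens = interleave(["A", "B", "C"], [(6, 7), (6, 7)])
```

The result corresponds to the example string A (character 6) B (character 6) C from the text.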

In this embodiment, the construction of the generalized suffix tree can be completed with the Ukkonen algorithm. The generalized suffix tree established by the above method is shown in Fig. 5, in which each non-root node represents a substring. A count attribute is added to each node in the suffix tree to count the number of times the character string corresponding to the node occurs in the generalized suffix tree. The number of occurrences of a substring is then the sum of the count attributes of the leaf nodes among all the child nodes of its node:

Count(s) = count_leaf1 + count_leaf2 + count_leaf3 + ...,

where s denotes a character string, Count(s) denotes the count attribute value of the node reached from the root node along the path s, and count_leaf1, count_leaf2, count_leaf3, ... denote the count attribute values of all the leaf nodes among the child nodes of that node.

Conversely, the count attribute of each leaf node is the number of two-dimensional indexes held by that leaf node. If In denotes a two-dimensional index of a leaf node, then:

In = (index1, index2),

where index1 indicates which character string the substring corresponding to the leaf node occurs in, and index2 indicates the starting position at which that substring occurs in the character string.

As shown in Fig. 5, (0, 5) and (1, 2) are the two two-dimensional indexes of the leftmost leaf node, and the character string represented by this leaf node is the suffix "A". Then (0, 5) indicates that the suffix "A" occurs in the 0th character string "BANANA" with starting position 5 (note: following the convention of computer science, string numbers and positions here are counted from 0), and (1, 2) indicates that the suffix "A" occurs in the 1st character string "ANA" with starting position 2.
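The count attribute and the two-dimensional indexes can be illustrated with a deliberately naive structure. The sketch below is an uncompressed suffix trie built in quadratic time, not Ukkonen's linear-time algorithm; the `Node` class, the shared terminator `$` and the helper names are assumptions of this example.

```python
END = "$"  # assumed terminator shared by all strings

class Node:
    def __init__(self):
        self.children = {}
        self.count = 0     # number of suffixes that pass through this node
        self.indexes = []  # (string number, start position); leaves only

def build(strings):
    """Insert every suffix of every string into one trie."""
    root = Node()
    for sid, s in enumerate(strings):
        tokens = list(s) + [END]
        for start in range(len(s)):
            node = root
            for tok in tokens[start:]:
                node = node.children.setdefault(tok, Node())
                node.count += 1
            node.indexes.append((sid, start))  # node is the leaf for this suffix
    return root

def find(root, path):
    """Follow `path` from the root; return the node reached, or None."""
    node = root
    for tok in path:
        node = node.children.get(tok)
        if node is None:
            return None
    return node

root = build(["BANANA", "ANA"])
leaf_a = find(root, "A$")    # leaf for the suffix "A"
count_a = find(root, "A").count
```

Because every suffix ends in exactly one leaf, the count of an inner node equals the sum of the counts of the leaves below it, which is the Count(s) relation above. For the two sample strings, the leaf for the suffix "A" holds the indexes (0, 5) and (1, 2), as in the Fig. 5 discussion.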

In S105, frequent substrings in the generalized suffix tree are looked up.

The generalized suffix tree established in S104 is traversed breadth-first. If the value of the count attribute of a node satisfies the following condition:

Count(node A) > min_repeat_times, where Count(node A) denotes the count attribute value of node A in the generalized suffix tree and min_repeat_times denotes a preset threshold,

then the substring represented by that node is a frequent substring; that is, the character strings corresponding to the nodes in the generalized suffix tree whose count attribute is greater than the preset threshold are determined to be the frequent substrings.

Conversely, if a node is judged not to be a frequent substring, the subtree rooted at that node is pruned, and the child nodes of that node are no longer traversed in the subsequent process, which improves the efficiency of the frequent substring lookup.
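The breadth-first search with pruning might be sketched as follows. To stay self-contained the example rebuilds a small dictionary-based suffix trie (again a naive stand-in for a real suffix tree); `build_trie` and `frequent_substrings` are hypothetical names. Pruning is safe because a child's count can never exceed its parent's count.

```python
from collections import deque

def build_trie(strings):
    """Naive suffix trie; each node counts the suffixes passing through it."""
    root = {"count": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for tok in s[start:]:
                node = node["children"].setdefault(
                    tok, {"count": 0, "children": {}})
                node["count"] += 1
    return root

def frequent_substrings(root, min_repeat_times):
    """Breadth-first traversal that skips every subtree below the threshold."""
    found = []
    queue = deque([((), root)])
    while queue:
        prefix, node = queue.popleft()
        for tok, child in node["children"].items():
            if child["count"] > min_repeat_times:
                path = prefix + (tok,)
                found.append(("".join(path), child["count"]))
                queue.append((path, child))
            # Otherwise the child is not enqueued: its subtree is pruned.
    return found

result = frequent_substrings(build_trie(["BANANA", "ANA"]), min_repeat_times=2)
```

With a threshold of 2, only substrings occurring at least three times across "BANANA" and "ANA" survive; the branch starting with "B" is cut at its first node.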

In S106, the frequent substrings found are converted into frequent sub-trajectories.

A frequent substring found in S105, such as A (character 6) B (character 7) C ..., can be translated character by character into the spatial information or time information that each character represents. Specifically:

if the character is a first-kind character, it represents spatial information, so the character is converted into its corresponding geographic position;

if the character is a second-kind character, it represents time information, so the character and its neighbor character are taken, converted into their corresponding numerical values, and averaged. For example, when the character representing the smaller value in the combination of second-kind characters is used to represent the time information, if the character is 6, its neighbor character is 7; the numerical values corresponding to these two characters are 0.3 and 0.35, so the average of 0.3 and 0.35 is taken, thereby restoring the time information.
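The decoding in S106 can be sketched under the same assumptions used for encoding: time tokens are spelled `t<k>`, character k stands for the value k × 0.05, and place characters are looked up in a table. The place names below are invented for the example.

```python
STEP = 0.05  # assumed spacing between preset numerical endpoints

def decode(tokens, place_names):
    """Translate a frequent substring back into places and interval values."""
    trajectory = []
    for tok in tokens:
        if tok.startswith("t"):
            k = int(tok[1:])
            # Average the values of character k and its neighbor k + 1.
            trajectory.append(round((k * STEP + (k + 1) * STEP) / 2, 3))
        else:
            trajectory.append(place_names[tok])
    return trajectory

decoded = decode(["A", "t6", "B"], {"A": "Downtown", "B": "Airport"})
```

Character 6 and its neighbor 7 stand for 0.3 and 0.35, so the recovered standardized interval is their average, 0.325.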

Table 2 shows an example of the frequent sub-trajectories finally generated from the generalized suffix tree shown in Fig. 5:

Table 2

The embodiments of the present invention combine data mining techniques, the suffix tree algorithm and inexact matching to achieve a better lookup of frequent sub-trajectories in trajectory data. By using an efficient string algorithm to process the more complex multi-dimensional numerical data, the computational complexity of the whole frequent sub-trajectory lookup process is greatly reduced, and a reasonable clustering method also makes the clustering of the spatial information of the trajectory data more accurate.

Fig. 6 shows a structural block diagram of the lookup device for frequent sub-trajectories in trajectory data provided by an embodiment of the present invention. The device can be used to run the lookup method for frequent sub-trajectories in trajectory data described in the embodiments of Fig. 1 to Fig. 5. For ease of explanation, only the parts related to this embodiment are shown.

With reference to Fig. 6, the device includes:

a separation unit 61, which separates the spatial information and the time information in the trajectory data;

a first encoding unit 62, which encodes the spatial information into first-kind characters, each first-kind character being used to represent a geographic position;

a second encoding unit 63, which encodes the time information into second-kind characters, each second-kind character being used to represent a time interval;

an establishing unit 64, which establishes a generalized suffix tree according to the spatial information encoded into the first-kind characters and the time information encoded into the second-kind characters;

a lookup unit 65, which looks up frequent substrings in the generalized suffix tree;

a conversion unit 66, which converts the frequent substrings found into frequent sub-trajectories.

Optionally, the first encoding unit 62 includes:

a clustering subunit, which clusters the spatial information to generate N clusters, N being an integer greater than 1;

a determination subunit, which determines the geographic position corresponding to each generated cluster;

an encoding subunit, which performs character encoding according to the geographic position corresponding to each generated cluster, generating the first-kind character corresponding to each cluster.

Optionally, the second encoding unit 63 includes:

a conversion subunit, which converts the time information from timestamps into interval times;

a standardization subunit, which standardizes the interval times;

a matching subunit, which matches a second-kind character to each standardized interval time.

Optionally, the matching subunit is specifically configured to:

determine the two numerical endpoints of the preset numerical interval in which the standardized interval time lies;

match the two second-kind characters respectively corresponding to the two numerical endpoints to the standardized interval time.

Optionally, the device further includes:

an adding unit, which adds a count attribute to each node in the generalized suffix tree, the count attribute being used to count the number of times the character string corresponding to the node occurs in the generalized suffix tree;

the lookup unit 65 is specifically configured to:

determine the character strings corresponding to the nodes in the generalized suffix tree whose count attribute is greater than a preset threshold to be the frequent substrings.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A lookup method for frequent sub-trajectories in trajectory data, characterized by including:

separating the spatial information and the time information in the trajectory data;

encoding the spatial information into first-kind characters, each first-kind character being used to represent a geographic position;

encoding the time information into second-kind characters, each second-kind character being used to represent a time interval;

establishing a generalized suffix tree according to the spatial information encoded into the first-kind characters and the time information encoded into the second-kind characters;

looking up frequent substrings in the generalized suffix tree;

converting the frequent substrings found into frequent sub-trajectories;

wherein encoding the spatial information into first-kind characters, each first-kind character being used to represent a geographic position, includes:

clustering the spatial information to generate N clusters, N being an integer greater than 1; determining the geographic position corresponding to each generated cluster; performing character encoding according to the geographic position corresponding to each generated cluster, to generate the first-kind character corresponding to each cluster;

wherein encoding the time information into second-kind characters, each second-kind character being used to represent a time interval, includes:

converting the time information from timestamps into interval times; standardizing the interval times by $\mathrm{interval}(k) = \mathrm{interval}(k) / \max_{m=1:n} \mathrm{interval}(m)$, where interval(k) denotes the k-th interval time and $\max_{m=1:n} \mathrm{interval}(m)$ denotes the maximum of the n valid interval times; matching a second-kind character to each standardized interval time.

2. the method for claim 1, it is characterised in that described is matching of described interval time after each standardization the Two class characters include：

The interval time after described two numerical end points are distinguished into corresponding two Equations of The Second Kind character match to the standardization.

3. the method for claim 1, it is characterised in that it is described set up generalized suffix tree after, described in the lookup Before frequent substring in generalized suffix tree, methods described also includes：

Increase a count attribute for each node in the generalized suffix tree, the count attribute is used for the node correspondence The number of times that occurs in the generalized suffix tree of character string counted；

The frequent substring searched in the generalized suffix tree includes：

The character string that the count attribute in the generalized suffix tree is more than corresponding to the node of predetermined threshold value is defined as into institute State frequent substring.

4. A lookup device for frequent sub-trajectories in trajectory data, characterized by including:

a first encoding unit, for encoding the spatial information into first-kind characters, each first-kind character being used to represent a geographic position;

a second encoding unit, for encoding the time information into second-kind characters, each second-kind character being used to represent a time interval;

an establishing unit, for establishing a generalized suffix tree according to the spatial information encoded into the first-kind characters and the time information encoded into the second-kind characters;

a conversion unit, for converting the frequent substrings found into frequent sub-trajectories;

wherein the first encoding unit includes:

a clustering subunit, for clustering the spatial information to generate N clusters, N being an integer greater than 1;

a determination subunit, for determining the geographic position corresponding to each generated cluster;

an encoding subunit, for performing character encoding according to the geographic position corresponding to each generated cluster, to generate the first-kind character corresponding to each cluster;

and the second encoding unit includes:

a conversion subunit, for converting the time information from timestamps into interval times;

a standardization subunit, for standardizing the interval times by $\mathrm{interval}(k) = \mathrm{interval}(k) / \max_{m=1:n} \mathrm{interval}(m)$, where interval(k) denotes the k-th interval time and $\max_{m=1:n} \mathrm{interval}(m)$ denotes the maximum of the n valid interval times;

a matching subunit, for matching a second-kind character to each standardized interval time.

5. The device of claim 4, characterized in that the matching subunit is specifically configured to:

6. The device of claim 4, characterized in that the device further includes:

an adding unit, for adding a count attribute to each node in the generalized suffix tree, the count attribute being used to count the number of times the character string corresponding to the node occurs in the generalized suffix tree;

the lookup unit is specifically configured to: