CN103744861A - Lookup method and device for frequent sub-trajectories in trajectory data

Info

Publication number: CN103744861A
Application number: CN201310687107.3A
Authority: CN
Inventors: 黄鑫; 罗军
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Guangdong Gaohang Intellectual Property Operation Co ltd; Jiangsu Aerospace Dawei Technology Co Ltd
Priority date: 2013-12-12
Filing date: 2013-12-12
Publication date: 2014-04-23
Anticipated expiration: 2033-12-12
Also published as: CN103744861B

Abstract

The invention belongs to the technical field of data processing and provides a method and device for finding frequent sub-trajectories in trajectory data. The method comprises: separating the spatial information and the time information in the trajectory data; encoding the spatial information into first-kind characters, each of which represents a geographic position; encoding the time information into second-kind characters, each of which represents a time interval; building a generalized suffix tree from the spatial information encoded as first-kind characters and the time information encoded as second-kind characters; finding frequent substrings in the generalized suffix tree; and converting the frequent substrings found into frequent sub-trajectories. Because efficient string algorithms are applied to complex multi-dimensional numerical data, the computational complexity of the whole frequent sub-trajectory search is greatly reduced.

Description

Method and device for finding frequent sub-trajectories in trajectory data

Technical field

The invention belongs to the technical field of data processing, and in particular relates to a method and device for finding frequent sub-trajectories in trajectory data.

Background art

Trajectory data is the data obtained by sampling the motion of one or more moving objects in a spatio-temporal environment. Each sample records the position, the sampling time, the speed and so on, and the samples, ordered by sampling time, form the trajectory data. Common trajectory data includes vehicle driving trajectories, the travel trajectories of mobile Internet users, the check-in trajectories of mobile Internet users, and so on. Massive trajectory data contains rich information: its frequent sub-trajectories can reveal the behavior patterns and habits of most people, or the patterns of change of the weather, and so on.

Because trajectory data is numerical, the mature algorithms for finding frequent substrings in character strings cannot be applied to it directly. Most prior-art approaches therefore partition and cluster the trajectory data directly: a trajectory of length O(n) is divided into O(n²) sub-trajectories, and these sub-trajectories are then clustered to find the frequent ones. The whole process has high computational complexity and a long running time.

Summary of the invention

An object of the embodiments of the present invention is to provide a method for finding frequent sub-trajectories in trajectory data, intended to solve the problem that existing algorithms for finding frequent sub-trajectories in trajectory data have high computational complexity.

The embodiments of the present invention are achieved as follows: a method for finding frequent sub-trajectories in trajectory data, comprising:

separating spatial information and time information in the trajectory data;

encoding the spatial information into first-kind characters, each first-kind character representing a geographic position;

encoding the time information into second-kind characters, each second-kind character representing a time interval;

building a generalized suffix tree from the spatial information encoded as the first-kind characters and the time information encoded as the second-kind characters;

finding frequent substrings in the generalized suffix tree;

converting the frequent substrings found into frequent sub-trajectories.

Another object of the embodiments of the present invention is to provide a device for finding frequent sub-trajectories in trajectory data, comprising:

a separation unit, for separating spatial information and time information in the trajectory data;

a first encoding unit, for encoding the spatial information into first-kind characters, each first-kind character representing a geographic position;

a second encoding unit, for encoding the time information into second-kind characters, each second-kind character representing a time interval;

a building unit, for building a generalized suffix tree from the spatial information encoded as the first-kind characters and the time information encoded as the second-kind characters;

a lookup unit, for finding frequent substrings in the generalized suffix tree;

a conversion unit, for converting the frequent substrings found into frequent sub-trajectories.

The embodiments of the present invention combine data mining techniques, a suffix tree algorithm and inexact matching to find frequent sub-trajectories in trajectory data. By processing relatively complex multi-dimensional numerical data with relatively efficient string algorithms, the computational complexity of the whole frequent sub-trajectory search is greatly reduced.

Brief description of the drawings

Fig. 1 is a flowchart of the method for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 2 is a detailed flowchart of step S102 of the method for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the clustering of spatial information in the method for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 4 is a detailed flowchart of step S103 of the method for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the generalized suffix tree built by the method for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention;

Fig. 6 is a structural block diagram of the device for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.

Fig. 1 shows the flow of the method for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention, detailed as follows:

In S101, the spatial information and the time information in the trajectory data are separated.

Trajectory data contains spatial information and time information. The spatial information generally includes the longitude and latitude of a position and so on, and the time information is usually represented by a Unix timestamp.

Table 1 is a concrete example of a segment of trajectory data, in which the recorded time information is the Unix timestamp of entering the corresponding longitude and latitude:

Table 1

In S101, the spatial information and the time information in the trajectory data are first separated: the trajectory data is split into a spatial information sequence, for example {(113.333, 22.368), (113.111, 23.013), ...}, and a time information sequence, for example {1385521584, 1385521233, ...}.
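The separation in S101 can be sketched as follows. This is a minimal illustration, assuming each record is a `(longitude, latitude, unix_timestamp)` tuple; the record layout and the function name `separate` are assumptions, not part of the original disclosure.

```python
# Hypothetical record layout: (longitude, latitude, unix_timestamp).
records = [
    (113.333, 22.368, 1385521584),
    (113.111, 23.013, 1385521233),
]


def separate(records):
    """Split trajectory records into a spatial sequence and a time sequence."""
    spatial = [(lon, lat) for lon, lat, _ in records]
    temporal = [ts for _, _, ts in records]
    return spatial, temporal


spatial, temporal = separate(records)
```

Both sequences keep the sampling order, so the i-th timestamp still belongs to the i-th position.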

In S102, the spatial information is encoded into first-kind characters, each first-kind character representing a geographic position.

In this embodiment, the spatial information separated out in S101 is clustered and converted into the corresponding geographic positions, which are then encoded into characters; each resulting character represents the corresponding geographic position. As shown in Fig. 2, S102 is specifically:

In S201, the spatial information is clustered to generate N clusters, where N is an integer greater than 1.

Based on the spatial information sequence separated out in S101, the spatial information of the trajectory data is clustered by longitude and latitude. Specifically, a density-based clustering algorithm (Density-Based Spatial Clustering of Applications with Noise, DBSCAN) can be used. As one implementation example of the present invention, shown in Fig. 3, each point represents one record in the spatial information sequence. The points are clustered by their longitude and latitude values, which finally yields three clusters A, B and C; the remaining isolated points are excluded as noise and take no part in the subsequent computation.
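The clustering step can be illustrated with a small, self-contained DBSCAN. This is only a sketch: the parameter values `eps` and `min_pts`, the sample points, and the use of plain Euclidean distance on longitude/latitude degrees are assumptions made for illustration, and a production system would more likely use a library implementation and a geodesic distance.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query (includes the point itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # j is a core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


# Hypothetical sample: two dense groups and one isolated point.
points = [
    (113.333, 22.368), (113.334, 22.368), (113.333, 22.369),
    (113.111, 23.013), (113.112, 23.013), (113.111, 23.014),
    (114.500, 22.000),
]
labels = dbscan(points, eps=0.002, min_pts=3)
```

The isolated point receives the label `NOISE` and would be dropped before the later steps, as the text describes.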

In S202, the geographic position corresponding to each generated cluster is determined.

In this embodiment, the geographic position corresponding to each cluster is determined by comparing the position information covered by the cluster, such as longitude and latitude, against a map. In real data collection, a generated cluster typically represents a commercial district of a city, a city itself, and so on.

In S203, character encoding is performed according to the geographic position corresponding to each cluster, generating the first-kind character corresponding to each cluster.

Through the steps shown in Fig. 2, the spatial information in the trajectory data, for example {(113.333, 22.368), (113.111, 23.013), ...}, can be converted into {A, C, ...}, where A is the character of the cluster containing (113.333, 22.368) and C is the character of the cluster containing (113.111, 23.013). The spatial information is thereby converted from raw two-dimensional numerical values into geographic regions.
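A minimal sketch of the character encoding in S203 follows. The mapping `cluster_to_char` from cluster label to first-kind character is hypothetical; in the described method it would follow from the geographic position assigned to each cluster in S202. The noise label `-1` is likewise an assumption.

```python
def encode_spatial(labels, cluster_to_char, noise=-1):
    """Return (record_index, character) for every record that is not noise.

    Keeping the record index lets the caller pick the matching timestamps
    after noise records have been dropped.
    """
    return [
        (i, cluster_to_char[label])
        for i, label in enumerate(labels)
        if label != noise
    ]


# Hypothetical cluster labels for four records; the third one is noise.
cluster_to_char = {0: "A", 1: "B", 2: "C"}
encoded = encode_spatial([0, 2, -1, 1], cluster_to_char)
```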

In S103, the time information is encoded into second-kind characters, each second-kind character representing a time interval.

As shown in Fig. 4, S103 is specifically:

In S401, the time information is converted from timestamps into time intervals.

S102 has already changed the format of the spatial information sequence, whereas the original time information is still a sequence of timestamps, each of which is the time of entering the corresponding geographic position in the spatial information sequence. In S103 the timestamps therefore need to be converted into time intervals, so as to determine the time difference between entering A and entering B. For example, if the converted spatial information sequence is {A, A, B, ...} and the corresponding time sequence is {t₁, t₂, t₃, ...}, then the time of entering A is t₁ and the time of entering B is t₃, so the time interval from entering A to entering B is t₃ - t₁.
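The conversion in S401 can be sketched as below, under the assumption (drawn from the A, A, B example above) that consecutive records at the same place are collapsed into one visit whose entry time is the first timestamp of the run. The function name `to_intervals` is hypothetical.

```python
from itertools import groupby


def to_intervals(places, timestamps):
    """Collapse consecutive repeats of a place and return (visits, intervals)."""
    visits, entry_times = [], []
    for place, run in groupby(zip(places, timestamps), key=lambda pt: pt[0]):
        visits.append(place)
        entry_times.append(next(run)[1])  # first timestamp of the run = entry time
    # Interval i is the time from entering visit i to entering visit i + 1.
    intervals = [b - a for a, b in zip(entry_times, entry_times[1:])]
    return visits, intervals


# A, A, B with t1 = 100, t2 = 160, t3 = 400 gives one interval t3 - t1.
visits, intervals = to_intervals(["A", "A", "B"], [100, 160, 400])
```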

In S402, the time intervals are normalized.

In S402, before the converted time intervals are normalized, excessively long time intervals are first removed as noise, and the remaining time intervals are then normalized. Specifically, the time intervals can be normalized by the following formula:

$$\mathrm{interval}(k) = \frac{\mathrm{interval}(k)}{\max_{m=1:n} \mathrm{interval}(m)}$$

where interval(k) denotes the k-th time interval and $\max_{m=1:n} \mathrm{interval}(m)$ denotes the maximum of the n valid time intervals. That is, the maximum of all valid time intervals is taken and the k-th time interval is divided by it, which gives the normalized k-th time interval. As one implementation example of the present invention, the normalized value can be accurate to 0.001.

After normalization, every time interval is converted into a value such as 0.303, 0.349 or 0.788.
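The normalization in S402 might look like the following sketch. The text does not say how an "excessively long" interval is identified, so the cut-off parameter `max_valid` is an assumption introduced only for illustration.

```python
def normalize(intervals, max_valid):
    """Drop intervals longer than max_valid, then divide by the largest survivor."""
    valid = [x for x in intervals if x <= max_valid]  # longer ones are treated as noise
    longest = max(valid)                              # raises ValueError if nothing is left
    return [round(x / longest, 3) for x in valid]     # accurate to 0.001


# Hypothetical intervals in seconds; 99999 is removed as noise.
normalized = normalize([303, 349, 788, 1000, 99999], max_valid=3600)
```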

In S403, second-kind characters are matched to each normalized time interval.

Usually, the simplest way to turn a numerical value into a character is to round it, for example:

time interval 1 = 0.349 ≈ 0.3 and time interval 2 = 0.350 ≈ 0.4; according to a preset correspondence between values and characters, 0.3 is converted into character 3 and 0.4 into character 4. In fact, however, time interval 1 and time interval 2 are very close in value, yet they are matched to different characters, so the matching result cannot reflect the true difference between values. Therefore, as one embodiment of the present invention, inexact matching can be used to solve this problem: each normalized time interval is matched to a combination of second-kind characters, and the combination contains two second-kind characters.

Specifically, the following inexact matching method can be used:

time interval 1 = 0.349 ≈ 0.3 || 0.35 = character 6 || character 7;

time interval 2 = 0.350 ≈ 0.35 || 0.4 = character 7 || character 8.

That is, the preset value range that contains the normalized time interval is determined first, together with the two endpoints of that range. Then, according to the preset correspondence between second-kind characters and range endpoints, the two second-kind characters corresponding to the two endpoints are matched to the normalized time interval, so that each normalized time interval is converted into a combination (character k, character k+1) of two second-kind characters.
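The two-character encoding can be sketched as follows. The range width of 0.05 and the rule "character k corresponds to endpoint k × 0.05" are inferred from the example values above (0.3 ↔ 6, 0.35 ↔ 7, 0.4 ↔ 8) and are assumptions rather than stated parameters. The value is converted to integer thousandths first, because dividing floats such as 0.35 by 0.05 directly can fall just below the intended boundary.

```python
STEP_MILLI = 50  # assumed range width of 0.05, expressed in thousandths


def encode_interval(value):
    """Return the pair (k, k + 1) of second-kind character codes for a normalized interval."""
    milli = round(value * 1000)  # integer thousandths avoid float division artifacts
    k = milli // STEP_MILLI      # index of the lower endpoint of the enclosing range
    return k, k + 1


pair_1 = encode_interval(0.349)  # range [0.30, 0.35)
pair_2 = encode_interval(0.350)  # range [0.35, 0.40)
```

The two intervals now share character 7, which reflects that they are close in value even though they fall into neighboring ranges.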

After S102 and S103, the trajectory data has been converted into a character string in which spatial information and time information alternate, for example:

A (character 6) B (character 6) C

In S104, a generalized suffix tree is built from the spatial information encoded as the first-kind characters and the time information encoded as the second-kind characters.

A suffix tree is a data structure that supports efficient string matching and querying. In this embodiment, because the time information is encoded as a combination of two second-kind characters, the combination can be represented, when the tree is built, by the character in it that stands for the smaller value. For example, if time interval 1 = character 6 || character 7, then time interval 1 can be represented by character 6 during tree building.

When time information already on a suffix tree node is compared with time information that has not yet been placed on the suffix tree, inexact matching works as follows:

node n = character k = character k || character (k+1).

For example, with node n = character 6 = character 6 || character 7, a time interval to be compared that is encoded as character 6 and character 7 matches node n, which corresponds to character 6.

Spatial information, by contrast, is still placed on the suffix tree by exact matching.
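One possible reading of this token scheme is sketched below: place characters are kept as strings, each time interval is stored as the smaller code of its pair, and a comparison against a node succeeds when the node's code equals that smaller code. The function names and the choice of representing time tokens as integers are assumptions for illustration only.

```python
def to_tokens(visits, interval_pairs):
    """Interleave place characters with the smaller code of each interval pair."""
    tokens = [visits[0]]
    for place, (low, _high) in zip(visits[1:], interval_pairs):
        tokens.append(low)    # time token: the smaller code stands for the pair
        tokens.append(place)  # place token: compared exactly
    return tokens


def matches(node_token, candidate):
    """Compare a node token with a candidate place or a (k, k + 1) interval pair."""
    if isinstance(candidate, tuple):
        return node_token == candidate[0]  # pair (k, k + 1) matches node k
    return node_token == candidate         # spatial information: exact match


tokens = to_tokens(["A", "B", "C"], [(6, 7), (6, 7)])
```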

In this embodiment, Ukkonen's algorithm can be used to build the generalized suffix tree. The generalized suffix tree built in this way is shown in Fig. 5. Each non-root node represents a substring, and a count attribute is added to each node of the suffix tree to count the number of times the character string corresponding to that node occurs in the generalized suffix tree. The number of occurrences of the substring is then the sum of the count attributes of the leaf nodes among all the descendants of the node:

Count(s) = count_leaf1 + count_leaf2 + count_leaf3 + ...,

where s denotes a character string, Count(s) denotes the count attribute value of the node reached from the root node along the path s, and count_leaf1, count_leaf2, count_leaf3, ... denote the count attribute values of all the leaf nodes among the descendants of that node.

Conversely, the count attribute of each leaf node is the number of two-dimensional indices that the leaf node holds. Let in denote a two-dimensional index of a leaf node:

in = (index1, index2),

where index1 indicates in which character string the substring corresponding to the leaf node occurs, and index2 indicates the starting position at which that substring occurs in the character string.

As shown in Fig. 5, (0, 5) and (1, 2) are the two two-dimensional indices of the leftmost leaf node, which represents the suffix "A". (0, 5) means that the suffix "A" occurs in character string 0, "BANANA", starting at position 5 (note: following the convention in computer science, indices and positions here are counted from 0); (1, 2) means that the suffix "A" occurs in character string 1, "ANA", starting at position 2.
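The count attribute and the two-dimensional indices can be illustrated with a naive generalized suffix trie. This is deliberately not Ukkonen's algorithm: it inserts every suffix character by character, which takes quadratic time, and it uses no terminator symbols, so a suffix may end at an internal node. It is only meant to show how counts and (index1, index2) pairs relate; the class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.indices = []  # (string_id, start) of the suffixes that end exactly here
        self.count = 0     # occurrences of the substring spelled by the path to this node


def build(strings):
    """Insert every suffix of every string into one shared trie."""
    root = Node()
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for token in s[start:]:
                node = node.children.setdefault(token, Node())
                node.count += 1  # each suffix passing through is one occurrence
            node.indices.append((sid, start))
    return root


def find(root, pattern):
    """Follow pattern from the root; return the node reached or None."""
    node = root
    for token in pattern:
        node = node.children.get(token)
        if node is None:
            return None
    return node


root = build(["BANANA", "ANA"])
node_a = find(root, "A")
```

Here `node_a.indices` holds the two indices of the suffix "A" from the Fig. 5 example, and `node_a.count` equals the total number of suffix indices stored in its subtree, which matches the Count(s) formula above.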

In S105, frequent substrings are looked up in the generalized suffix tree.

The generalized suffix tree built in S104 is traversed breadth-first. If the count attribute of a node satisfies the following condition:

Count(node A) > min_repeat_times, where Count(node A) denotes the count attribute value of node A in the generalized suffix tree and min_repeat_times denotes a preset threshold,

then the substring represented by the node is a frequent substring. That is, the character string corresponding to a node whose count attribute in the generalized suffix tree is greater than the preset threshold is determined to be a frequent substring.

Otherwise, if a node is judged not to represent a frequent substring, the subtree rooted at that node is pruned from the tree and the children of that node are not traversed in the subsequent process, which improves the efficiency of the frequent substring search.
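The breadth-first search with pruning can be sketched as follows. The tree is represented here as nested dictionaries with `count` and `children` keys, which is an assumption chosen to keep the example self-contained; the counts in the sample tree are made up. Pruning is safe because a node's count can never exceed its parent's count.

```python
from collections import deque


def make_node(count, children=None):
    """Helper for building a small hand-written tree."""
    return {"count": count, "children": children or {}}


def frequent_substrings(root, min_repeat_times):
    """Breadth-first search that skips every subtree whose root is not frequent."""
    result = []
    queue = deque([((), root)])
    while queue:
        path, node = queue.popleft()
        for token, child in node["children"].items():
            if child["count"] > min_repeat_times:  # strict '>' as in the text
                child_path = path + (token,)
                result.append((child_path, child["count"]))
                queue.append((child_path, child))
            # Otherwise the child and its whole subtree are never visited.
    return result


# Hypothetical tree: the branch under "B" is pruned at its root.
tree = make_node(0, {
    "A": make_node(5, {"N": make_node(3, {"A": make_node(3)})}),
    "B": make_node(1, {"A": make_node(1)}),
})
found = frequent_substrings(tree, min_repeat_times=2)
```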

In S106, the frequent substrings found are converted into frequent sub-trajectories.

A frequent substring found in S105, such as A (character 6) B (character 7) C ..., can be converted character by character into the spatial information or time information that each character represents. Specifically:

If the character is a first-kind character, it represents spatial information, so the character is converted into the geographic position corresponding to it.

If the character is a second-kind character, it represents time information, so the character and its neighbor character are taken, converted into their corresponding values, and averaged. For example, when the time information is represented by the character that stands for the smaller value in the combination of second-kind characters, if the character is 6, its neighbor character is 7; the values corresponding to these two characters are 0.3 and 0.35 respectively, so the average of 0.3 and 0.35 is taken, which restores the time information.
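The decoding in S106 can be sketched as below, assuming that time tokens are integers, that character k corresponds to the value k × 0.05 (as in the 6 ↔ 0.3, 7 ↔ 0.35 example), and that a lookup table `char_to_place` maps each first-kind character to its geographic position. The table contents and function name are hypothetical, and the decoded time value is still the normalized interval.

```python
STEP_MILLI = 50  # assumed range width of 0.05, expressed in thousandths


def decode_token(token, char_to_place):
    """Turn one character of a frequent substring back into a place or an interval."""
    if isinstance(token, int):
        low = token * STEP_MILLI          # value of the character itself
        high = (token + 1) * STEP_MILLI   # value of its neighbor character
        return (low + high) / 2 / 1000    # average of the two values
    return char_to_place[token]


# Hypothetical place names for the first-kind characters.
char_to_place = {"A": "district A", "B": "district B", "C": "district C"}
decoded = [decode_token(t, char_to_place) for t in ["A", 6, "B", 7, "C"]]
```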

Table 2 shows an example of the frequent sub-trajectories finally generated from the generalized suffix tree shown in Fig. 5:

Table 2

The embodiments of the present invention combine data mining techniques, a suffix tree algorithm and inexact matching to find frequent sub-trajectories in trajectory data. By processing relatively complex multi-dimensional numerical data with relatively efficient string algorithms, the computational complexity of the whole frequent sub-trajectory search is greatly reduced, and a suitable clustering method also makes the clustering of the spatial information of the trajectory data more accurate.

Fig. 6 is a structural block diagram of the device for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention. The device can be used to run the method for finding frequent sub-trajectories in trajectory data described in the embodiments of Fig. 1 to Fig. 5. For ease of description, only the parts relevant to this embodiment are shown.

With reference to Fig. 6, the device comprises:

A separation unit 61, which separates the spatial information and the time information in the trajectory data.

A first encoding unit 62, which encodes the spatial information into first-kind characters, each first-kind character representing a geographic position.

A second encoding unit 63, which encodes the time information into second-kind characters, each second-kind character representing a time interval.

A building unit 64, which builds a generalized suffix tree from the spatial information encoded as the first-kind characters and the time information encoded as the second-kind characters.

A lookup unit 65, which finds frequent substrings in the generalized suffix tree.

A conversion unit 66, which converts the frequent substrings found into frequent sub-trajectories.

Optionally, the first encoding unit 62 comprises:

A clustering subunit, which clusters the spatial information to generate N clusters, where N is an integer greater than 1.

A determining subunit, which determines the geographic position corresponding to each generated cluster.

An encoding subunit, which performs character encoding according to the geographic position corresponding to each generated cluster and generates the first-kind character corresponding to each cluster.

Optionally, the second encoding unit 63 comprises:

A conversion subunit, which converts the time information from timestamps into time intervals.

A normalization subunit, which normalizes the time intervals.

A matching subunit, which matches second-kind characters to each normalized time interval.

Optionally, the matching subunit is specifically used to:

determine the two endpoints of the preset value range that contains the normalized time interval;

match the two second-kind characters corresponding to the two endpoints to the normalized time interval.

Optionally, the device further comprises:

An adding unit, which adds a count attribute to each node in the generalized suffix tree, the count attribute counting the number of times the character string corresponding to the node occurs in the generalized suffix tree;

The lookup unit 65 is specifically used to:

determine the character string corresponding to a node whose count attribute in the generalized suffix tree is greater than a preset threshold to be a frequent substring.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for finding frequent sub-trajectories in trajectory data, characterized by comprising:

separating spatial information and time information in the trajectory data;

finding frequent substrings in the generalized suffix tree;

2. The method of claim 1, characterized in that encoding the spatial information into first-kind characters comprises:

clustering the spatial information to generate N clusters, where N is an integer greater than 1;

determining the geographic position corresponding to each generated cluster;

performing character encoding according to the geographic position corresponding to each generated cluster, and generating the first-kind character corresponding to each cluster.

3. The method of claim 1, characterized in that encoding the time information into second-kind characters, each second-kind character representing a time interval, comprises:

converting the time information from timestamps into time intervals;

normalizing the time intervals;

matching second-kind characters to each normalized time interval.

4. The method of claim 3, characterized in that matching second-kind characters to each normalized time interval comprises:

5. The method of claim 1, characterized in that, after building the generalized suffix tree and before finding the frequent substrings in the generalized suffix tree, the method further comprises:

adding a count attribute to each node in the generalized suffix tree, the count attribute counting the number of times the character string corresponding to the node occurs in the generalized suffix tree;

finding the frequent substrings in the generalized suffix tree comprises:

6. A device for finding frequent sub-trajectories in trajectory data, characterized by comprising:

7. The device of claim 6, characterized in that the first encoding unit comprises:

a clustering subunit, for clustering the spatial information to generate N clusters, where N is an integer greater than 1;

a determining subunit, for determining the geographic position corresponding to each generated cluster;

an encoding subunit, for performing character encoding according to the geographic position corresponding to each generated cluster and generating the first-kind character corresponding to each cluster.

8. The device of claim 6, characterized in that the second encoding unit comprises:

a conversion subunit, for converting the time information from timestamps into time intervals;

a normalization subunit, for normalizing the time intervals;

a matching subunit, for matching second-kind characters to each normalized time interval.

9. The device of claim 8, characterized in that the matching subunit is specifically used to:

10. The device of claim 6, characterized in that the device further comprises:

an adding unit, for adding a count attribute to each node in the generalized suffix tree, the count attribute counting the number of times the character string corresponding to the node occurs in the generalized suffix tree;

the lookup unit is specifically used to: