Detailed Description of the Embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it.
Fig. 1 shows the implementation flow of the method for finding frequent sub-trajectories in track data provided by an embodiment of the present invention.
Details are as follows:
In S101, the spatial information and temporal information in the track data are separated.
Track data contains both spatial information and temporal information. The spatial information generally comprises the longitude, latitude and the like of a position, while the temporal information is generally expressed as a Unix timestamp.
Table 1 is a specific example of a segment of track data, in which the recorded temporal information is the Unix timestamp at which the corresponding longitude and latitude were entered:
Table 1
In S101, the spatial information and temporal information in the track data are first separated, so that the track data is split into a spatial information sequence, for example {(113.333, 22.368), (113.111, 23.013), ...}, and a temporal information sequence, for example {1385521584, 1385521233, ...}.
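The separation in S101 can be sketched as follows. This is a minimal illustration only; the record layout (longitude, latitude, Unix timestamp) and the function name split_track are assumptions, since the contents of Table 1 are not reproduced here.

```python
from typing import List, Tuple

# Assumed record layout: (longitude, latitude, unix_timestamp).
Record = Tuple[float, float, int]


def split_track(records: List[Record]) -> Tuple[List[Tuple[float, float]], List[int]]:
    """Split a track into a spatial sequence and a temporal sequence."""
    spatial = [(lon, lat) for lon, lat, _ in records]
    temporal = [ts for _, _, ts in records]
    return spatial, temporal


# The two records used as examples in the text above.
track = [(113.333, 22.368, 1385521584), (113.111, 23.013, 1385521233)]
spatial_seq, temporal_seq = split_track(track)
```

The two sequences keep the same order, so the i-th timestamp still belongs to the i-th coordinate pair.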
In S102, the spatial information is encoded into first-kind characters, each first-kind character representing one geographical position.
In this embodiment, the spatial information separated in S101 is clustered, converted into corresponding geographical positions, and then encoded into characters, each character representing the corresponding geographical position. As shown in Fig. 2, S102 specifically comprises:
In S201, the spatial information is clustered to generate N clusters, N being an integer greater than 1.
The spatial information of the track data is clustered by longitude and latitude, based on the spatial information sequence separated in S101. Specifically, a density-based clustering algorithm (Density-Based Spatial Clustering of Applications with Noise, DBSCAN) may be used to cluster the spatial information. As one implementation example of the present invention, as shown in Fig. 3, each point represents one record in the spatial information sequence. The points are clustered according to their longitude and latitude values, finally producing three clusters A, B and C; the remaining isolated points are excluded as noise and do not take part in the subsequent computation.
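The clustering in S201 can be sketched with a small pure-Python DBSCAN. This is a simplified illustration, not the embodiment's implementation: the parameter names eps and min_pts, the use of plain Euclidean distance on raw degree values, and the noise label -1 are all assumptions, and the neighbour search is quadratic for brevity.

```python
from math import hypot
from typing import List, Optional, Tuple

NOISE = -1  # assumed label for isolated points


def dbscan(points: List[Tuple[float, float]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE marks isolated points."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        xi, yi = points[i]
        # Includes the point itself, as in the usual DBSCAN definition.
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return [lab if lab is not None else NOISE for lab in labels]


pts = [(0.0, 0.0), (0.0, 0.01), (0.01, 0.0),
       (1.0, 1.0), (1.0, 1.01), (1.01, 1.0),
       (5.0, 5.0)]
labels = dbscan(pts, eps=0.05, min_pts=3)
```

In this toy input the first three points form one cluster, the next three form another, and the last point is left as noise, mirroring the isolated points excluded in Fig. 3.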
In S202, the geographical position corresponding to each generated cluster is determined.
In this embodiment, the geographical position corresponding to each cluster is determined by comparing the position information covered by the cluster, such as longitude and latitude, against positions on a map. Typically, with real collected data, a generated cluster may represent a city, or a business district, a community or the like within a city.
In S203, character encoding is performed according to the geographical position corresponding to each cluster, generating the first-kind character corresponding to each cluster.
Through the steps shown in Fig. 2, the spatial information in the track data, for example {(113.333, 22.368), (113.111, 23.013), ...}, can be converted into {A, C, ...}, where A is the character of the cluster containing (113.333, 22.368) and C is the character of the cluster containing (113.111, 23.013). The spatial information in the track data is thereby converted from raw two-dimensional values into geographical regions.
In S103, the temporal information is encoded into second-kind characters, each second-kind character representing one interval time.
As shown in Fig. 4, S103 specifically comprises:
In S401, the temporal information is converted from timestamps into interval times.
In S102 the spatial information sequence has been converted into character form, whereas the original temporal information is a sequence of timestamps, each of which gives the time associated with the corresponding geographical position in the spatial information sequence. In S103 the timestamps therefore need to be converted into interval times, so as to determine the time difference between entering A and entering B. For example, if the converted spatial information sequence is {A, A, B, ...} and the corresponding timestamp sequence is {t1, t2, t3, ...}, then the time of entering A is t1 and the time of entering B is t3, so the interval between entering A and entering B is t3 - t1.
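The conversion in S401 can be sketched as follows, assuming the location characters and timestamps are aligned one to one and the timestamps are in ascending order. The function name entry_intervals is hypothetical.

```python
from typing import List, Tuple


def entry_intervals(locations: List[str], timestamps: List[int]) -> Tuple[List[str], List[int]]:
    """Collapse consecutive repeats of a location and return the time
    differences between successive entries into different locations."""
    visits: List[Tuple[str, int]] = []  # (location, time of entering it)
    for loc, ts in zip(locations, timestamps):
        if not visits or visits[-1][0] != loc:
            visits.append((loc, ts))  # first record of a new stay
    places = [loc for loc, _ in visits]
    intervals = [later[1] - earlier[1] for earlier, later in zip(visits, visits[1:])]
    return places, intervals


# {A, A, B, C} with timestamps t1..t4: entering A at t1, B at t3, C at t4.
places, intervals = entry_intervals(["A", "A", "B", "C"], [100, 160, 400, 1000])
```

With these illustrative timestamps the interval from A to B is t3 - t1 = 300, matching the rule stated above.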
In S402, the interval times are standardized.
In S402, before the converted interval times are standardized, excessively long interval times are first removed as noise, and the remaining interval times are then standardized. Specifically, the interval times may be standardized by the following formula:
$$\mathrm{interval}(k) = \frac{\mathrm{interval}(k)}{\max_{m=1:n} \mathrm{interval}(m)}$$
where interval(k) denotes the k-th interval time and $\max_{m=1:n} \mathrm{interval}(m)$ denotes the maximum of the n valid interval times. That is, the maximum of all valid interval times is taken, and the k-th interval time is divided by this maximum to obtain the standardized k-th interval time. As one implementation example of the present invention, the standardized k-th interval time may be accurate to 0.001.
After standardization, all interval times are converted into values such as 0.303, 0.349 and 0.788.
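The standardization in S402 can be sketched as below. The text does not say how an interval is judged to be too long, so the cut-off parameter max_allowed is an assumption, as is the function name.

```python
from typing import List


def standardize_intervals(intervals: List[float], max_allowed: float) -> List[float]:
    """Drop over-long intervals as noise, then divide by the largest
    remaining interval and round to a precision of 0.001."""
    valid = [x for x in intervals if x <= max_allowed]  # noise removal
    if not valid or max(valid) <= 0:
        return []  # nothing meaningful to standardize
    longest = max(valid)
    return [round(x / longest, 3) for x in valid]


standardized = standardize_intervals([303, 349, 788, 1000, 99999], max_allowed=1000)
```

Here the interval 99999 is discarded as noise and the remaining values are divided by 1000, giving values in the range (0, 1].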
In S403, a second-kind character is matched to each standardized interval time.
Generally, the simplest way to convert a numerical value into a character is to round it, for example: interval time 1 = 0.349 ≈ 0.3 and interval time 2 = 0.350 ≈ 0.4; then, according to a preset correspondence between numerical values and characters, 0.3 is converted into character 3 and 0.4 into character 4. In fact, however, the true values of interval time 1 and interval time 2 are extremely close, yet they are matched to different characters, so the matching result fails to reflect the true gap between values. Therefore, as one embodiment of the present invention, this problem can be solved by inexact matching: each standardized interval time is matched to a combination of second-kind characters, the combination containing two second-kind characters.
Specifically, the following inexact matching method may be used:
interval time 1 = 0.349 ≈ 0.3 || 0.35 = character 6 || character 7;
interval time 2 = 0.350 ≈ 0.35 || 0.4 = character 7 || character 8.
That is, the preset numerical interval in which the standardized interval time lies is determined first, together with the two numerical endpoints of that interval. Then, according to a preset correspondence between second-kind characters and numerical endpoints, the two second-kind characters corresponding to the two endpoints are matched to the standardized interval time, so that each standardized interval time is converted into a combination of two second-kind characters (character k, character k+1).
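The inexact matching in S403 can be sketched as follows. The bin width of 0.05 and the correspondence "character k stands for the endpoint 0.05 × k" are inferred from the two examples above (0.3 is character 6, 0.35 is character 7, 0.4 is character 8) and should be read as assumptions; second-kind characters are represented here simply by their integer index k.

```python
from typing import Tuple


def match_time_chars(value: float, step_thousandths: int = 50) -> Tuple[int, int]:
    """Return (k, k + 1), the characters of the two endpoints of the preset
    interval [k * step, (k + 1) * step) that contains the value."""
    # Work in integer thousandths (the stated precision is 0.001) so that
    # floating-point division such as 0.35 / 0.05 cannot fall just below 7.
    thousandths = round(value * 1000)
    k = thousandths // step_thousandths
    return k, k + 1


pair_1 = match_time_chars(0.349)  # lies in [0.30, 0.35)
pair_2 = match_time_chars(0.350)  # lies in [0.35, 0.40)
```

The two close values 0.349 and 0.350 now share character 7, which is what lets a later comparison treat them as neighbours instead of unrelated symbols.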
After S102 and S103, the track data is converted into a string in which spatial information and temporal information alternate, for example:
A (character 6) B (character 6) C ...
In S104, a generalized suffix tree is built from the spatial information encoded into first-kind characters and the temporal information encoded into second-kind characters.
A suffix tree is a data structure that supports efficient string matching and querying. In this embodiment, since the temporal information is encoded as a combination of two second-kind characters, the character in the combination that represents the smaller value may be used to represent the combination when building the tree. For example, if interval time 1 = character 6 || character 7, then interval time 1 may be represented by character 6 during tree construction.
When the temporal information on a suffix tree node is compared with temporal information that has not yet been placed on the suffix tree, inexact matching works as follows:
node n = character k = character k || character (k+1).
For example, node n = character 6 = character 6 || character 7; that is, when the interval time to be compared is encoded as character 6 and character 7, it matches the node n corresponding to character 6.
Spatial information, by contrast, is still placed on the suffix tree by exact matching.
In this embodiment, the generalized suffix tree may be constructed using the Ukkonen algorithm. The generalized suffix tree built by the above method is shown in Fig. 5, in which each non-root node represents a substring. A count attribute is added to each node of the suffix tree to count the number of times the string corresponding to that node occurs in the generalized suffix tree. The number of occurrences of a substring is then the sum of the count attributes of all leaf nodes among the descendants of its node:
Count (s)=count_leaf1+count_leaf2+count_leaf3+ ...,
where s denotes a string, Count(s) denotes the count attribute value of the node reached from the root node along the path s, and count_leaf1, count_leaf2, count_leaf3, ... denote the count attribute values of all leaf nodes among the descendants of that node.
Conversely, the count attribute of each leaf node is the number of 2-d indices that the leaf node contains. If In denotes a 2-d index of a leaf node, then:
In=(index1, index2),
where index1 indicates in which string the substring corresponding to the leaf node occurs, and index2 indicates the start position at which that substring occurs in the string.
As shown in Fig. 5, (0,5) and (1,2) are the two 2-d indices of the leftmost leaf node, whose string is the suffix "A". Here (0,5) indicates that the suffix "A" occurs in the 0th string "BANANA" with start position 5 (note: following computer-science convention, string numbers and positions are counted from 0), and (1,2) indicates that the suffix "A" occurs in the 1st string "ANA" with start position 2.
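The count attribute and the 2-d indices can be illustrated with the sketch below. To stay short it builds an uncompressed generalized suffix trie by inserting every suffix directly, which takes quadratic time; it is a stand-in for, not an implementation of, the Ukkonen construction named above. The class and function names and the terminator symbol "$" (assumed not to occur in the input strings) are illustrative.

```python
from typing import Dict, List, Tuple


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.indices: List[Tuple[int, int]] = []  # 2-d indices, leaves only
        self.count = 0


def fill_counts(node: Node) -> int:
    """Leaf count = number of 2-d indices; inner count = sum over its leaves."""
    if not node.children:
        node.count = len(node.indices)
    else:
        node.count = sum(fill_counts(child) for child in node.children.values())
    return node.count


def build_generalized_suffix_trie(strings: List[str], end: str = "$") -> Node:
    root = Node()
    for index1, s in enumerate(strings):        # which string
        for index2 in range(len(s)):            # start position of the suffix
            node = root
            for ch in s[index2:] + end:         # terminator makes every suffix a leaf
                node = node.children.setdefault(ch, Node())
            node.indices.append((index1, index2))
    fill_counts(root)
    return root


def find(root: Node, path: str) -> Node:
    node = root
    for ch in path:
        node = node.children[ch]
    return node


root = build_generalized_suffix_trie(["BANANA", "ANA"])
leaf_a = find(root, "A$")   # leaf for the suffix "A"
node_a = find(root, "A")    # node for the substring "A"
```

For the strings of Fig. 5, the leaf for the suffix "A" carries the 2-d indices (0,5) and (1,2), and the node for the substring "A" has a count equal to the number of suffixes that start with "A".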
In S105, the frequent substrings in the generalized suffix tree are searched for.
The generalized suffix tree built in S104 is traversed breadth-first. If the value of a node's count attribute satisfies the following condition:
Count(node A) > min_repeat_times,
where Count(node A) denotes the count attribute value of node A in the generalized suffix tree and min_repeat_times denotes a preset threshold,
then the substring represented by that node is a frequent substring. That is, the string corresponding to any node of the generalized suffix tree whose count attribute is greater than the preset threshold is determined to be a frequent substring.
Conversely, if the substring of a node is judged not to be frequent, the subtree rooted at that node is pruned and the children of that node are not visited in the remainder of the traversal, which improves the efficiency of the frequent substring search.
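The breadth-first search with pruning can be sketched as follows. The nested-dictionary node layout and the counts in the small example tree are made up for illustration. Pruning is safe because a child's substring extends its parent's substring and therefore cannot occur more often than the parent's.

```python
from collections import deque
from typing import Deque, List, Tuple

# Assumed node layout: {"count": int, "children": {symbol: node}}.


def frequent_substrings(root: dict, min_repeat_times: int) -> List[Tuple[str, int]]:
    """Breadth-first traversal that keeps nodes with count > threshold and
    skips the whole subtree below any node that fails the test."""
    result: List[Tuple[str, int]] = []
    queue: Deque[Tuple[str, dict]] = deque(root["children"].items())
    while queue:
        substring, node = queue.popleft()
        if node["count"] <= min_repeat_times:
            continue  # prune: descendants cannot be more frequent
        result.append((substring, node["count"]))
        for symbol, child in node["children"].items():
            queue.append((substring + symbol, child))
    return result


# Illustrative excerpt of a counted tree (counts are invented for the example).
tree = {"count": 9, "children": {
    "A": {"count": 5, "children": {
        "N": {"count": 3, "children": {}},
        "B": {"count": 2, "children": {}},
    }},
    "B": {"count": 1, "children": {
        "A": {"count": 1, "children": {}},
    }},
    "N": {"count": 3, "children": {}},
}}
found = frequent_substrings(tree, min_repeat_times=2)
```

In this example the node "B" fails the test, so its child "BA" is never visited.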
In S106, the frequent substrings found are converted into frequent sub-trajectories.
A frequent substring found in S105, such as A (character 6) B (character 7) C ..., can be translated character by character into the spatial information or temporal information that each character represents. Specifically:
If the character is a first-kind character, it represents spatial information, so the character is converted into its corresponding geographical position.
If the character is a second-kind character, it represents temporal information, so the character and its neighbouring character are converted into their corresponding values and averaged. For example, when the temporal information is represented by the character for the smaller value in the combination of second-kind characters, if the character is 6, its neighbouring character is 7; the values corresponding to these two characters are 0.3 and 0.35, so the average of 0.3 and 0.35 is taken to recover the temporal information.
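The character-by-character conversion in S106 can be sketched as below. Representing first-kind characters as strings and second-kind characters as integers, the place names, and the 0.05 endpoint spacing (taken from the example in which characters 6 and 7 correspond to 0.3 and 0.35) are assumptions of this sketch.

```python
from typing import Dict, List, Union

Symbol = Union[str, int]  # str: first-kind (location), int: second-kind (time)


def decode(symbols: List[Symbol], places: Dict[str, str],
           step_thousandths: int = 50) -> List[Union[str, float]]:
    """Translate a frequent substring back into locations and interval values."""
    out: List[Union[str, float]] = []
    for sym in symbols:
        if isinstance(sym, str):
            out.append(places[sym])  # location character -> geographical position
        else:
            low = sym * step_thousandths          # value of character k
            high = (sym + 1) * step_thousandths   # value of neighbour k + 1
            out.append((low + high) / 2 / 1000)   # mean of the two endpoints
    return out


# Hypothetical place names for the clusters A and B.
places = {"A": "district A", "B": "district B"}
decoded = decode(["A", 6, "B"], places)
```

The decoded time value is still a standardized interval; mapping it back to seconds would additionally require the maximum interval used during standardization, which this sketch does not model.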
Table 2 shows an example of the frequent sub-trajectories finally generated from the generalized suffix tree shown in Fig. 5:
Table 2
The embodiment of the present invention combines data mining techniques, a suffix tree algorithm and inexact matching to achieve an improved search for frequent sub-trajectories in track data. Processing the relatively complex multi-dimensional numerical data with more efficient string algorithms substantially reduces the computational complexity of the whole frequent sub-trajectory search, and a reasonable clustering method makes the clustering of the spatial information in the track data more accurate.
Fig. 6 shows a structural block diagram of the device for finding frequent sub-trajectories in track data provided by an embodiment of the present invention. The device can be used to carry out the method for finding frequent sub-trajectories in track data described in the embodiments of Fig. 1 to Fig. 5. For ease of description, only the parts relevant to this embodiment are shown.
Referring to Fig. 6, the device includes:
A separation unit 61, configured to separate the spatial information and temporal information in the track data.
A first coding unit 62, configured to encode the spatial information into first-kind characters, each first-kind character representing one geographical position.
A second coding unit 63, configured to encode the temporal information into second-kind characters, each second-kind character representing one interval time.
A building unit 64, configured to build a generalized suffix tree from the spatial information encoded into first-kind characters and the temporal information encoded into second-kind characters.
A searching unit 65, configured to search for the frequent substrings in the generalized suffix tree.
A converting unit 66, configured to convert the frequent substrings found into frequent sub-trajectories.
Optionally, the first coding unit 62 includes:
A clustering subunit, configured to cluster the spatial information to generate N clusters, N being an integer greater than 1.
A determining subunit, configured to determine the geographical position corresponding to each generated cluster.
A coding subunit, configured to perform character encoding according to the geographical position corresponding to each generated cluster, generating the first-kind character corresponding to each cluster.
Optionally, the second coding unit 63 includes:
A converting subunit, configured to convert the temporal information from timestamps into interval times.
A standardizing subunit, configured to standardize the interval times.
A matching subunit, configured to match a second-kind character to each standardized interval time.
Optionally, the matching subunit is specifically configured to:
determine the two numerical endpoints of the preset numerical interval in which the standardized interval time lies; and
match the two second-kind characters respectively corresponding to the two numerical endpoints to the standardized interval time.
Optionally, the device further includes:
An adding unit, configured to add a count attribute to each node in the generalized suffix tree, the count attribute being used to count the number of times the string corresponding to the node occurs in the generalized suffix tree.
The searching unit 65 is specifically configured to:
determine the string corresponding to any node in the generalized suffix tree whose count attribute is greater than a preset threshold to be a frequent substring.
The embodiment of the present invention combines data mining techniques, a suffix tree algorithm and inexact matching to achieve an improved search for frequent sub-trajectories in track data. Processing the relatively complex multi-dimensional numerical data with more efficient string algorithms substantially reduces the computational complexity of the whole frequent sub-trajectory search, and a reasonable clustering method makes the clustering of the spatial information in the track data more accurate.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.