Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.
Fig. 1 shows the flow of the method for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention, which is described in detail as follows:
In S101, the spatial information and the temporal information in the trajectory data are separated.
Trajectory data comprises spatial information and temporal information. The spatial information generally includes the longitude and latitude of a position, and the temporal information is usually represented by a Unix timestamp.
Table 1 is a concrete example of a segment of trajectory data, in which the recorded temporal information is the Unix timestamp of entering the corresponding longitude and latitude:
Table 1
In S101, the spatial information and the temporal information in the trajectory data are first separated, so that the trajectory data is split into a spatial information sequence, for example {(113.333, 22.368), (113.111, 23.013), ...}, and a temporal information sequence, for example {1385521584, 1385521233, ...}.
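As a non-limiting illustration, the separation in S101 can be sketched as follows. The record layout (longitude, latitude, timestamp) and the function name `separate` are assumptions made for this sketch; the embodiment does not prescribe a storage format.

```python
def separate(track):
    """Split trajectory records into a spatial and a temporal sequence.

    Assumption: each record is a (longitude, latitude, unix_timestamp) tuple.
    """
    spatial = [(lon, lat) for lon, lat, _ in track]
    temporal = [ts for _, _, ts in track]
    return spatial, temporal


track = [(113.333, 22.368, 1385521584), (113.111, 23.013, 1385521233)]
spatial, temporal = separate(track)
```

The two returned sequences stay aligned by position, so the i-th timestamp still belongs to the i-th coordinate pair.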
In S102, the spatial information is encoded into first-type characters, each first-type character representing a geographic location.
In this embodiment, the spatial information separated in S101 is clustered, converted into the corresponding geographic locations, and then encoded into characters, each encoded character representing the corresponding geographic location. As shown in Fig. 2, S102 specifically comprises:
In S201, the spatial information is clustered to generate N clusters, where N is an integer greater than 1.
Based on the spatial information sequence separated in S101, the spatial information of the trajectory data is clustered by longitude and latitude. Specifically, the density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) may be used to cluster the spatial information. As an implementation example of the present invention, as shown in Fig. 3, each point represents one record in the spatial information sequence. The points are clustered according to their longitude and latitude values, finally yielding three clusters A, B, and C. The remaining isolated points are excluded as noise and do not participate in the subsequent computation.
In S202, the geographic location corresponding to each generated cluster is determined.
In this embodiment, the geographic location corresponding to each cluster is determined by comparing the longitude, latitude, and other positional information covered by the cluster against positions on a map. In practice, in real data collection, a generated cluster may represent, for example, a business district within a city, or a city itself.
In S203, character encoding is performed according to the geographic location corresponding to each cluster, generating the first-type character corresponding to each cluster.
Through the steps shown in Fig. 2, the spatial information in the trajectory data, for example {(113.333, 22.368), (113.111, 23.013), ...}, can be converted into {A, C, ...}, where A is the character of the cluster containing (113.333, 22.368) and C is the character of the cluster containing (113.111, 23.013). The spatial information in the trajectory data is thereby converted from raw two-dimensional values into geographic regions.
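As a non-limiting illustration of S201 to S203, the following is a minimal pure-Python sketch of DBSCAN followed by character encoding. Several points are assumptions of this sketch and not part of the embodiment: the parameters `eps` and `min_pts`, the use of plain Euclidean distance on degree values, the assignment of letters in order of cluster discovery (rather than by map lookup as in S202), and the limit of 26 clusters.

```python
from string import ascii_uppercase


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index >= 0, or -1 for noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def encode_clusters(labels):
    """Map cluster indices to first-type characters; noise becomes None."""
    return [ascii_uppercase[c] if c >= 0 else None for c in labels]


points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (9, 0)]
labels = dbscan(points, eps=0.2, min_pts=3)
chars = encode_clusters(labels)
```

Records whose character is `None` correspond to the noise points of Fig. 3; a caller would drop them, together with their timestamps, before the later steps.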
In S103, the temporal information is encoded into second-type characters, each second-type character representing an interval of time.
As shown in Fig. 4, S103 specifically comprises:
In S401, the temporal information is converted from timestamps into interval times.
In S102, the spatial information sequence has been converted into a character sequence, whereas the original temporal information is a sequence of timestamps, each timestamp representing the time of entering the corresponding geographic location in the spatial information sequence. In S103, the timestamps need to be converted into interval times, so as to determine the time difference between entering A and entering B. For example, if the converted spatial information sequence is {A, A, B, ...} and the corresponding time sequence is {$t_1$, $t_2$, $t_3$, ...}, then the time of entering A is $t_1$ and the time of entering B is $t_3$, so the interval from entering A to entering B is $t_3 - t_1$.
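As a non-limiting illustration of S401, the sketch below collapses consecutive records at the same location and takes the difference between successive entry times. The function name `to_intervals` and the sample timestamps are assumptions of this sketch.

```python
def to_intervals(places, timestamps):
    """Collapse consecutive repeats of a place and return (visits, intervals).

    The entry time of a visit is the timestamp of its first record, so for
    places A, A, B with times t1, t2, t3 the interval A -> B is t3 - t1.
    """
    visits, entry_times = [], []
    for place, ts in zip(places, timestamps):
        if not visits or visits[-1] != place:
            visits.append(place)
            entry_times.append(ts)
    intervals = [later - earlier
                 for earlier, later in zip(entry_times, entry_times[1:])]
    return visits, intervals


visits, intervals = to_intervals(["A", "A", "B", "C"], [100, 160, 400, 1000])
```

With these sample values the second record at A (time 160) is ignored, and the A to B interval is 400 - 100.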
In S402, the interval times are normalized.
In S402, before the converted interval times are normalized, excessively long interval times are first removed as noise, and the remaining interval times are then normalized. Specifically, the interval times may be normalized by the following formula:
$$\mathrm{interval}(k) = \frac{\mathrm{interval}(k)}{\max_{m=1:n} \mathrm{interval}(m)}$$
where $\mathrm{interval}(k)$ denotes the k-th interval time and $\max_{m=1:n} \mathrm{interval}(m)$ denotes the maximum of the n valid interval times. That is, the maximum of all valid interval times is taken, and the k-th interval time is divided by this maximum to obtain the normalized k-th interval time. As an implementation example of the present invention, the normalized result may be given to a precision of 0.001.
After normalization, all interval times are converted into values such as 0.303, 0.349, and 0.788.
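As a non-limiting illustration of S402, the sketch below removes overly long intervals and divides the rest by their maximum. The embodiment does not state how an interval is judged to be too long; the cutoff parameter `max_valid` is an assumption of this sketch, as are the sample values.

```python
def normalize(intervals, max_valid):
    """Drop intervals longer than max_valid, then scale the rest into [0, 1].

    max_valid is a hypothetical noise cutoff; results are rounded to 0.001.
    """
    valid = [x for x in intervals if x <= max_valid]
    if not valid or max(valid) == 0:
        return [0.0] * len(valid)
    peak = max(valid)
    return [round(x / peak, 3) for x in valid]


normalized = normalize([303, 349, 788, 1000, 50000], max_valid=1000)
```

Here 50000 is discarded as noise and the remaining values are divided by 1000, the largest valid interval.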
In S403, a second-type character is matched to each normalized interval time.
Generally, the simplest way to convert a value into a character is to round it, for example:
Interval time 1 = 0.349 ≈ 0.3 and interval time 2 = 0.350 ≈ 0.4; according to a preset correspondence between values and characters, 0.3 is converted into character 3 and 0.4 into character 4. However, the actual values of interval time 1 and interval time 2 are very close, yet they are matched to different characters, so the matching result cannot reflect the true difference between values. Therefore, as an embodiment of the present invention, inexact matching may be used to solve this problem: through inexact matching, each normalized interval time is matched to a combination of second-type characters, the combination comprising two second-type characters.
Specifically, the following inexact matching method may be used:
Interval time 1 = 0.349 ≈ 0.3 || 0.35 = character 6 || character 7;
Interval time 2 = 0.350 ≈ 0.35 || 0.4 = character 7 || character 8.
That is, the preset value interval in which the normalized interval time falls is first determined, together with the two endpoints of that interval. Then, according to a preset correspondence between second-type characters and endpoint values, the two second-type characters corresponding to the two endpoints are matched to the normalized interval time, so that each normalized interval time is converted into a combination (character k, character k+1) comprising two second-type characters.
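As a non-limiting illustration of S403, the sketch below maps a normalized interval time to the pair of characters for the endpoints of its value interval. The interval width of 0.05 and the numbering of characters as endpoint / 0.05 are inferred from the example above (0.3 is character 6, 0.35 is character 7); treating an interval as closed on the left, clamping the value 1.0 into the last interval, and working in integer thousandths to avoid floating-point boundary errors are assumptions of this sketch.

```python
def encode_interval(value, step_milli=50):
    """Return (k, k + 1), the characters of the two interval endpoints.

    value is a normalized interval time in [0, 1], accurate to 0.001.
    step_milli is the interval width in thousandths (50 means 0.05).
    """
    milli = round(value * 1000)          # integer thousandths, e.g. 349
    last = 1000 // step_milli - 1        # index of the last interval
    k = min(milli // step_milli, last)   # clamp 1.0 into the last interval
    return (k, k + 1)


pair_1 = encode_interval(0.349)
pair_2 = encode_interval(0.350)
```

The two close values 0.349 and 0.350 now share character 7, which is what allows them to be treated as matching in the later steps.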
After S102 and S103, the trajectory data is converted into a string sequence in which spatial information and temporal information alternate, for example:
A(character 6) B(character 6) C
In S104, a generalized suffix tree is built from the spatial information encoded into first-type characters and the temporal information encoded into second-type characters.
A suffix tree is a data structure that supports efficient string matching and querying. In this embodiment, because the temporal information is encoded as a combination of two second-type characters, the combination may be represented, when building the tree, by the character that stands for the smaller value. For example, if interval time 1 = character 6 || character 7, then interval time 1 may be represented by character 6 during tree construction.
When the temporal information on a suffix tree node is compared with temporal information not yet placed in the suffix tree, inexact matching works as follows:
Node n = character k = character k || character (k+1),
For example, with node n = character 6 = character 6 || character 7, an interval time encoded as character 6 and an interval time encoded as character 7 can both be matched to node n, which corresponds to character 6.
Spatial information, by contrast, is still placed in the suffix tree by exact matching.
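As a non-limiting illustration, the comparison rule applied while placing symbols in the tree can be sketched as a single predicate. Representing first-type characters as strings and second-type characters as integers is an assumption of this sketch, as is the function name `symbol_matches`.

```python
def symbol_matches(node_symbol, incoming_symbol):
    """Compare a symbol on a tree node with a symbol being inserted.

    Temporal symbols (ints): node k stands for k || k+1, so it accepts an
    incoming k or k+1. Spatial symbols (strs): exact match only.
    """
    if isinstance(node_symbol, int) and isinstance(incoming_symbol, int):
        return incoming_symbol in (node_symbol, node_symbol + 1)
    return node_symbol == incoming_symbol


same_bucket = symbol_matches(6, 7)      # node 6 = 6 || 7 accepts 7
different_place = symbol_matches("A", "B")
```

Note that the rule is not symmetric: a node holding character 6 accepts an incoming 7, but a node holding character 7 does not accept an incoming 6.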
In this embodiment, Ukkonen's algorithm may be used to build the generalized suffix tree. The generalized suffix tree built by the above method is shown in Fig. 5, in which each non-root node represents a substring. A count attribute is added to each node of the suffix tree to count the number of times the string corresponding to that node occurs in the generalized suffix tree. The number of occurrences of the substring is then the sum of the count attributes of the leaf nodes among all descendants of that node:
Count(s)=count_leaf1+count_leaf2+count_leaf3+…,
where s denotes a string, Count(s) denotes the count attribute value of the node reached from the root node along the path s, and count_leaf1, count_leaf2, count_leaf3, ... denote the count attribute values of all leaf nodes among the descendants of that node.
Conversely, the count attribute of each leaf node is the number of two-dimensional indices held by that leaf node. Let in denote a two-dimensional index of a leaf node:
in=(index1,index2),
where index1 indicates in which string the substring corresponding to the leaf node occurs, and index2 indicates the starting position at which that substring occurs in that string.
As shown in Fig. 5, (0, 5) and (1, 2) are the two two-dimensional indices of the leftmost leaf node, whose string is the suffix "A". (0, 5) indicates that the suffix "A" occurs in string 0, "BANANA", at starting position 5 (note: following computer science convention, both string numbers and positions are counted from 0 here), and (1, 2) indicates that the suffix "A" occurs in string 1, "ANA", at starting position 2.
In S105, frequent substrings are searched for in the generalized suffix tree.
The generalized suffix tree built in S104 is traversed breadth-first. If the value of the count attribute of a node satisfies the following condition:
Count(node A) > min_repeat_times, where Count(node A) denotes the count attribute value of node A in the generalized suffix tree and min_repeat_times denotes a preset threshold,
then the substring represented by that node is a frequent substring. That is, the string corresponding to any node in the generalized suffix tree whose count attribute is greater than the preset threshold is determined to be a frequent substring.
Otherwise, if a node is judged not to represent a frequent substring, the subtree rooted at that node is pruned from the tree, and the child nodes of that node are no longer traversed in the subsequent process, thereby improving the efficiency of the frequent substring search.
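As a non-limiting illustration of S105, the sketch below performs a breadth-first traversal with pruning. To stay self-contained it rebuilds a small counting trie as nested dictionaries by naive insertion, which is an assumption of this sketch and not the Ukkonen construction; the names `count_trie` and `frequent_substrings` are likewise hypothetical.

```python
from collections import deque


def count_trie(strings):
    """Naive suffix trie; each node is {"count": int, "children": dict}."""
    root = {"count": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for symbol in s[start:]:
                node = node["children"].setdefault(
                    symbol, {"count": 0, "children": {}})
                node["count"] += 1
    return root


def frequent_substrings(root, min_repeat_times):
    """Breadth-first search keeping nodes with count > min_repeat_times."""
    found = []
    queue = deque([("", root)])
    while queue:
        prefix, node = queue.popleft()
        for symbol, child in node["children"].items():
            if child["count"] > min_repeat_times:
                found.append((prefix + symbol, child["count"]))
                queue.append((prefix + symbol, child))
            # otherwise the whole subtree under child is skipped (pruned)
    return found


result = frequent_substrings(count_trie(["BANANA", "ANA"]), min_repeat_times=2)
```

Pruning is safe because a longer substring can never occur more often than its own prefix, so counts never increase along a path away from the root.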
In S106, the frequent substrings found are converted into frequent sub-trajectories.
A frequent substring found in S105, such as A (character 6) B (character 7) C ..., can be translated character by character into the spatial information or temporal information that each character represents. Specifically:
If the character is a first-type character, it represents spatial information, so the character is converted into its corresponding geographic location;
If the character is a second-type character, it represents temporal information, so the character and its neighboring character are taken, converted into their corresponding values, and averaged. For example, when the combination of second-type characters is represented by the character standing for the smaller value, if the character is 6, its neighboring character is 7. The values corresponding to these two characters are 0.3 and 0.35 respectively, so the average of 0.3 and 0.35 is taken to restore the temporal information.
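As a non-limiting illustration of S106, the sketch below decodes a frequent substring symbol by symbol. Representing second-type characters as integers, the interval width of 0.05, the mapping `place_names`, and the place names themselves are assumptions of this sketch.

```python
def decode(symbols, place_names, step_milli=50):
    """Translate a frequent substring back into places and interval times.

    An integer k stands for the pair (k, k + 1); the restored value is the
    average of the two endpoint values, i.e. (2k + 1) * step / 2.
    """
    decoded = []
    for symbol in symbols:
        if isinstance(symbol, int):
            decoded.append((2 * symbol + 1) * step_milli / 2000)
        else:
            decoded.append(place_names[symbol])
    return decoded


place_names = {"A": "district A", "B": "district B"}  # hypothetical names
sub_trajectory = decode(["A", 6, "B"], place_names)
```

The restored time is still a normalized value; for character 6 it is the mean of 0.3 and 0.35, namely 0.325.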
Table 2 shows an example of the frequent sub-trajectories finally generated from the generalized suffix tree shown in Fig. 5:
Table 2
The embodiment of the present invention combines data mining techniques, a suffix tree algorithm, and inexact matching, and thereby finds frequent sub-trajectories in trajectory data effectively. By processing relatively complex multi-dimensional numerical data with relatively efficient string algorithms, the computational complexity of the whole frequent sub-trajectory search is greatly reduced, and a sound clustering method also makes the spatial partitioning of the trajectory data more accurate.
Fig. 6 shows a structural block diagram of the apparatus for finding frequent sub-trajectories in trajectory data provided by an embodiment of the present invention. The apparatus may be used to run the method for finding frequent sub-trajectories in trajectory data described in the embodiments of Figs. 1 to 5. For ease of description, only the parts relevant to this embodiment are shown.
Referring to Fig. 6, the apparatus comprises:
A separation unit 61, configured to separate the spatial information and the temporal information in the trajectory data.
A first encoding unit 62, configured to encode the spatial information into first-type characters, each first-type character representing a geographic location.
A second encoding unit 63, configured to encode the temporal information into second-type characters, each second-type character representing an interval of time.
A building unit 64, configured to build a generalized suffix tree from the spatial information encoded into first-type characters and the temporal information encoded into second-type characters.
A search unit 65, configured to search for frequent substrings in the generalized suffix tree.
A conversion unit 66, configured to convert the frequent substrings found into frequent sub-trajectories.
Optionally, the first encoding unit 62 comprises:
A clustering subunit, configured to cluster the spatial information to generate N clusters, where N is an integer greater than 1.
A determination subunit, configured to determine the geographic location corresponding to each generated cluster.
An encoding subunit, configured to perform character encoding according to the geographic location corresponding to each generated cluster, generating the first-type character corresponding to each cluster.
Optionally, the second encoding unit 63 comprises:
A conversion subunit, configured to convert the temporal information from timestamps into interval times.
A normalization subunit, configured to normalize the interval times.
A matching subunit, configured to match a second-type character to each normalized interval time.
Optionally, the matching subunit is specifically configured to:
Determine the two endpoints of the preset value interval in which the normalized interval time falls;
Match the two second-type characters corresponding to the two endpoints to the normalized interval time.
Optionally, the apparatus further comprises:
An adding unit, configured to add a count attribute to each node in the generalized suffix tree, the count attribute counting the number of times the string corresponding to that node occurs in the generalized suffix tree;
The search unit 65 is specifically configured to:
Determine as a frequent substring the string corresponding to any node in the generalized suffix tree whose count attribute is greater than a preset threshold.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.