CN110263108B

CN110263108B - Keyword Skyline fuzzy query method and system based on road network

Info

Publication number: CN110263108B
Application number: CN201910388590.2A
Authority: CN
Inventors: 秦小麟; 李星罗; 王宁; 鲍斌国; 张彤; 陈骏岭
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2019-05-10
Filing date: 2019-05-10
Publication date: 2023-07-11
Anticipated expiration: 2039-05-10
Also published as: CN110263108A

Abstract

The invention discloses a keyword Skyline fuzzy query method and system based on a road network. The method comprises the following steps: first, a KR-Tree index is constructed; then the keywords input by a user are converted into a triplet form <L, R, T> that a computer can recognize; then the KR-Tree index is called to quickly search the database according to the user's query condition; finally, the search result is returned to the user. The system comprises an index construction module for constructing the KR-Tree index, an input module for inputting keywords, a conversion module for converting the keywords into the triplet form, a retrieval module for searching the database, and an output module for displaying the search results. The invention fully meets the preference requirements of the querying user, increases the fault tolerance of the query, and improves query efficiency by improving the pruning efficiency of irrelevant nodes during the query.

Description

Keyword Skyline fuzzy query method and system based on road network

Technical Field

The invention belongs to the technical field of space database query, and particularly relates to a keyword Skyline fuzzy query method and system based on a road network.

Background

With the rapid development of GPS positioning technology and the popularity of wireless terminal devices, location-based services and applications have generated vast amounts of spatial text data, for example the merchant information of the mobile application Ele.me (including the spatial position of each merchant and its descriptive labels) and the messages posted by users on Sina Weibo (including the geographic position and the text keywords at the time of posting). Given such massive spatial text data and the multi-preference queries posed by users, how to select a group of results that are relatively good in several dimensions at once has become a hot topic of current research.

Skyline queries are often used to solve multi-objective decision-making problems. For the objects in a multi-dimensional dataset, if object A is not weaker than object B in every dimension and is better than object B in at least one dimension, object A is said to dominate object B. The result set of a Skyline query is the set of objects that are not dominated by any other object. In practice, how dominance is evaluated during the computation is often determined by the preferences of the querying user.

Existing spatial keyword Skyline query methods take the shortest path between a point of interest (POI) and the query point as the measured distance, but as the road network becomes more complex, the time overhead of computing road network distances grows and query efficiency drops sharply. In addition, a user may mistype a keyword such as "starbucks" during an actual query and therefore fail to obtain the desired result. Conventional solutions filter keyword similarity with a single uniform threshold, which makes it difficult to measure similarity for query keywords of different lengths; moreover, they mainly address fuzzy matching of a single keyword and offer little support for fuzzy matching of multiple keywords.

The patent applications related to this document are as follows:

[1] Ciphertext-based multi-keyword fuzzy query method in a cloud environment (application date: 2018.05.23, publication number: CN 108710698A).

[2] A hybrid spatial index mechanism for handling geographical text Skyline queries (application date: 2017.10.12, publication number: CN 108052514A).

[3] Skyline query method based on spatial time-series data stream applications (application date: 2016.12.14, publication number: CN 106708989A).

Disclosure of Invention

Object of the invention: the invention provides a keyword Skyline fuzzy query method and system based on a road network, to solve the prior-art problem that fuzzy matching efficiency drops when keywords of different lengths are entered incorrectly.

The technical scheme is as follows: the invention provides a keyword Skyline fuzzy query method based on a road network, which comprises the following steps:

Step 1, constructing a corresponding KR-Tree index in memory for a database stored on disk;

Step 2, converting the keywords input by a user into a triplet form <L, R, T> that a computer can recognize, wherein L is the spatial position of the user, R is the radius of the user's query area, and T is the set of keywords input by the user;

Step 3, calling the KR-Tree index and searching the database with it according to the depth-first principle and <L, R, T>;

Step 4, returning the search result to the user after the search is finished.
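The triplet of step 2 can be pictured as a small record. The following is a minimal sketch in Python; the class name `Query`, the helper `parse_query`, the lower-casing, and the whitespace-separated keyword input are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Query:
    """The triplet <L, R, T> of step 2 (class and field layout are illustrative)."""
    L: Tuple[float, float]   # spatial position of the user
    R: float                 # radius of the user's query area
    T: FrozenSet[str]        # set of keywords entered by the user


def parse_query(position, radius, raw_keywords):
    """Turn raw user input into a Query; splitting on whitespace is an assumption."""
    if radius <= 0:
        raise ValueError("query radius must be positive")
    keywords = frozenset(word.lower() for word in raw_keywords.split())
    return Query(L=(float(position[0]), float(position[1])), R=float(radius), T=keywords)


# Hypothetical input: a position, a radius and two keywords typed by the user.
q = parse_query((118.79, 31.94), 2.0, "coffee Wifi")
```

Keeping the record immutable makes it safe to pass the same query object to every node visited during the search of step 3.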

Further, the specific method for constructing the KR-Tree index is as follows:

Step A, estimating the size of the required memory space according to the number of points of interest in the database, and requesting a memory space of the corresponding size from the computer;

Step B, initializing an index head in the starting section of the memory space so as to generate a root node, and starting to access the leaf nodes under the root node from top to bottom;

Step C, traversing all points of interest in the database, inserting the spatial coordinates of all points of interest into leaf nodes in turn as key values, and introducing parent nodes where necessary, thereby completing the construction of the KR-Tree index frame;

Step D, traversing the keywords held by all points of interest and inserting them into specific positions of the AK-Table indexes of the corresponding leaf nodes, thereby constructing the AK-Table indexes of the leaf nodes; the corresponding leaf node is the leaf node in which the coordinates of the point of interest holding a given keyword are located;

Step E, after the AK-Table indexes of the leaf nodes are built, building the AK-Table indexes of all parent nodes;

Step F, after the whole KR-Tree index is built, writing the index in memory to disk in blocks, with the node as the basic unit.

Further, the specific method in the step C is as follows:

Step C1, inserting a key value into the current node; if the number of key values in the node exceeds the maximum value F, splitting the node according to the neighbor principle, thereby generating two new nodes; if a parent node points down to the current node, making that parent node point down to the two new nodes and updating the information of the parent node; if no parent node points down to the current node, generating a new parent node on the upper layer, making its pointers point down to the two new nodes produced by the split, and then updating the information of the parent node;

Step C2, starting again from the root node, inserting the key values of the remaining points of interest into the existing leaf nodes in turn according to the neighbor principle, until the key values of all points of interest have been inserted into leaf nodes.

Further, the specific operation of updating the information of the parent node is as follows: taking the parent node as the current node, and inserting the spatial region coordinate information represented by each of the two new nodes into the parent node as two key values.

Further, the specific method in step D is as follows: all keywords held by the current point of interest are input into a Hash function one by one to obtain the Hash value of each keyword; each keyword is inserted into a specific position of the AK-Table index of the corresponding leaf node according to its Hash value, and the id of the point of interest is recorded; if that position already records the same keyword from another point of interest, a subsequent linked-list node is added at that position, and the keyword and the id of the current point of interest are recorded in that linked-list node; this continues until all keywords of all points of interest have been inserted into the AK-Table index.

Further, the specific method in step E is as follows: the keywords of each parent node are the set of all keywords of the next-layer nodes connected to that parent node; all keywords held by each parent node are input into a Hash function one by one to obtain the Hash value of each keyword; each keyword is inserted into a specific position of the AK-Table index of the parent node according to its Hash value, and the id of the position area where the keyword is located is recorded; if that position already holds a record, a subsequent linked-list node is added behind it, and the keyword and the id of the position area where the keyword is located are recorded in that linked-list node.
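Steps D and E describe a hash table with chaining, filled first for leaf nodes and then for parent nodes. The sketch below illustrates that idea in Python under several assumptions: the class name `AKTable`, the toy hash function (the patent does not name one), the slot count, and the use of Python lists in place of linked lists are all illustrative.

```python
class AKTable:
    """Hash table with chaining: each slot holds (keyword, owner id) records."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _slot(self, keyword):
        # Deterministic toy hash (sum of code points); an assumption of this sketch.
        return sum(ord(ch) for ch in keyword) % len(self.slots)

    def insert(self, keyword, owner_id):
        chain = self.slots[self._slot(keyword)]
        if (keyword, owner_id) not in chain:
            chain.append((keyword, owner_id))   # plays the role of a new linked-list node

    def owners(self, keyword):
        return [oid for kw, oid in self.slots[self._slot(keyword)] if kw == keyword]

    def keywords(self):
        return {kw for chain in self.slots for kw, _ in chain}


def build_leaf_table(points):
    """Step D: points maps a point-of-interest id to the keywords it holds."""
    table = AKTable()
    for poi_id, kws in points.items():
        for kw in kws:
            table.insert(kw, poi_id)
    return table


def build_parent_table(children):
    """Step E: children maps a child region id to that child's AKTable."""
    table = AKTable()
    for region_id, child in children.items():
        for kw in child.keywords():
            table.insert(kw, region_id)
    return table


leaf1 = build_leaf_table({1: ["coffee", "wifi"], 2: ["coffee"]})
leaf2 = build_leaf_table({3: ["tea"]})
parent = build_parent_table({"A": leaf1, "B": leaf2})
```

In this sketch a leaf table answers "which points of interest hold this keyword", while a parent table answers "which child region holds this keyword", which is what allows whole subtrees to be skipped during a query.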

Further, the specific method for searching the database is as follows:

Step 3.1, starting from the root node of the KR-Tree index, accessing the nodes of the index from top to bottom according to the query condition;

Step 3.2, judging whether the current node is a leaf node; if so, going to step 3.8; otherwise going to step 3.3;

Step 3.3, judging whether the region of the current node overlaps the query area in space; if so, going to step 3.4; otherwise going to step 3.6;

Step 3.4, calculating whether the text similarity between the set of keywords in the AK-Table index of the current node and T is smaller than or equal to a threshold K; if so, going to step 3.5; otherwise going to step 3.6;

Step 3.5, accessing a successor of the current node, namely a next-layer node connected to the current node, according to the depth-first principle, and going to step 3.2;

Step 3.6, judging whether the current node has a sibling node that has not yet been accessed; if so, accessing that sibling node and going to step 3.2; otherwise, stopping the downward access, returning to the previous node, and going to step 3.7;

Step 3.7, judging whether the current node is the root node; if so, going to step 3.9; otherwise going to step 3.6;

Step 3.8, comparing all points of interest in the current leaf node with all points of interest in the candidate set, eliminating from the candidate set and the current leaf node the points of interest that are dominated by other points of interest, and keeping the remaining points of interest as the new candidate set; after the comparison is finished, going to step 3.6;

Step 3.9, all leaf nodes meeting the query condition having been traversed, taking the points of interest in the final candidate set as the query result.
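The control flow of steps 3.1 to 3.9 can be sketched as a recursive depth-first walk, where recursion takes the place of the explicit sibling and backtracking steps. This is a hedged illustration only: the `Node` layout, the rectangular node regions, the circular query area, and the caller-supplied predicates `text_ok` (standing in for the threshold test of step 3.4) and `visit_leaf` (standing in for step 3.8) are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    box: Tuple[float, float, float, float]            # (xmin, ymin, xmax, ymax)
    keywords: Set[str] = field(default_factory=set)   # keywords in the node's AK-Table
    children: List["Node"] = field(default_factory=list)
    pois: List[str] = field(default_factory=list)     # only filled for leaf nodes

    @property
    def is_leaf(self):
        return not self.children


def overlaps(box, center, radius):
    """Step 3.3: does the rectangle intersect the circular query area?"""
    xmin, ymin, xmax, ymax = box
    nx = min(max(center[0], xmin), xmax)   # point of the box nearest to the centre
    ny = min(max(center[1], ymin), ymax)
    return (nx - center[0]) ** 2 + (ny - center[1]) ** 2 <= radius ** 2


def search(node, center, radius, text_ok, visit_leaf):
    """Depth-first walk; pruned subtrees are never descended into."""
    if node.is_leaf:                              # step 3.2, then step 3.8
        visit_leaf(node)
        return
    if not overlaps(node.box, center, radius):    # step 3.3
        return
    if not text_ok(node.keywords):                # step 3.4
        return
    for child in node.children:                   # steps 3.5 to 3.7
        search(child, center, radius, text_ok, visit_leaf)


leaf_a = Node((0, 0, 1, 1), {"coffee"}, pois=["p1"])
leaf_b = Node((5, 5, 6, 6), {"tea"}, pois=["p2"])
inner_a = Node((0, 0, 1, 1), {"coffee"}, [leaf_a])
inner_b = Node((5, 5, 6, 6), {"tea"}, [leaf_b])
root = Node((0, 0, 6, 6), {"coffee", "tea"}, [inner_a, inner_b])

visited = []
search(root, (0.5, 0.5), 1.0,
       lambda kws: "coffee" in kws,
       lambda leaf: visited.extend(leaf.pois))
```

Here the subtree under `inner_b` is pruned by the spatial test alone, so its leaf is never compared against the candidate set.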

Further, the method for calculating the text similarity in the step 3.4 is as follows:

if the user inputs a single keyword, the text similarity is calculated by adopting the following formula:

wherein S(t_q, T_o) represents the text similarity between the keyword t_q input by the user q and the set T_o of all keywords in the region o where the current node is located; ED(t_q, t_o) is the minimum number of editing operations (insertions, deletions and modifications) needed to change the keyword t_q into the keyword t_o; W(t_q) is the weight value of the keyword t_q; Max is the weight value of the keyword with the largest weight in the database; W(t_q) = TF(t_q, T) * IDF(t_q, U), where TF(t_q, T) represents the frequency of occurrence of the keyword t_q in the keyword set T, and IDF(t_q, U) is the inverse document frequency, representing the reciprocal of the frequency of occurrence of the keyword t_q in all points of interest U of the database;

if the user inputs multiple keywords, the text similarity is calculated by adopting the following formula:

where |T| represents the number of query keywords.
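The two ingredients named in these definitions, the edit distance ED and the weight W, can be sketched independently of the similarity formula itself, which is not reproduced in this text and is therefore not attempted here. The sketch assumes that TF is the share of t_q among the query keywords and that the "frequency of occurrence" behind IDF is the fraction of points of interest holding the keyword; both readings, and the function names, are assumptions.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute, or keep if equal
        prev = curr
    return prev[-1]


def weight(t_q, query_keywords, poi_keyword_sets):
    """W(t_q) = TF(t_q, T) * IDF(t_q, U) under the assumptions stated above."""
    tf = query_keywords.count(t_q) / len(query_keywords)
    holders = sum(1 for kws in poi_keyword_sets if t_q in kws)
    if holders == 0:
        return 0.0   # keyword absent from the database; returning 0 is an assumption
    idf = len(poi_keyword_sets) / holders   # reciprocal of the occurrence frequency
    return tf * idf


# Hypothetical database of four points of interest.
database = [{"coffee"}, {"tea"}, {"coffee", "wifi"}, {"bar"}]
w = weight("coffee", ["coffee", "wifi"], database)
```

With this toy database, "coffee" makes up half of the query and is held by half of the points of interest, so its weight is 0.5 * 2 = 1.0.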

Further, the specific method for comparing dominance relations in step 3.8 is as follows: if point of interest i is not weaker than point of interest j in the non-spatial attributes, and the text spatial distance of point of interest i to the user is smaller than that of point of interest j, then point of interest i dominates point of interest j; the text spatial distance D_t(q, i) between the user and a point of interest is calculated as follows, where the points of interest i and j are points of interest in leaf nodes:

D_t(q, i) = D_r(q, i) / S

wherein D_r(q, i) represents the road network distance from the user q to the queried point of interest i, and S is the text similarity between the keyword or keyword set input by the user and the keyword set T_i held by the point of interest i; if the user inputs a single keyword, S = S(t_q, T_i); if the user inputs multiple keywords, S = S(T, T_i).
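A short worked example shows how dividing by the similarity reshapes the ranking. The sketch assumes that S lies in (0, 1] and treats a similarity of zero as an infinite distance; both are assumptions of this illustration, as are the numbers used.

```python
def text_spatial_distance(road_distance, similarity):
    """D_t(q, i) = D_r(q, i) / S."""
    if similarity <= 0:
        return float("inf")   # no textual match at all; this handling is an assumption
    return road_distance / similarity


# A nearby point with a poor keyword match versus a farther point with an exact match.
near_but_poor_match = text_spatial_distance(1.0, 0.25)
far_but_exact_match = text_spatial_distance(3.0, 1.0)
```

The nearby point ends up at a text spatial distance of 4.0 and the farther one at 3.0, so the better textual match is treated as "closer" to the user despite the longer road network distance.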

Further, a keyword Skyline fuzzy query system based on a road network comprises: an index construction module, an input module, a conversion module, a retrieval module and an output module;

the index construction module is used for sequentially inserting coordinates of the interest points and keywords held by the interest points into leaf nodes of the index in the memory so as to construct a KR-Tree index;

the input module is used for inputting keywords by a user and transmitting the input keywords to the conversion module;

the conversion module is used for converting the keywords input by the user into a form which can be identified by a computer and transmitting the keywords to the retrieval module;

the retrieval module utilizes the KR-Tree index constructed by the index construction module to retrieve based on keywords input by a user, and transmits retrieval results to the output module;

the output module is used for displaying the search result.

The beneficial effects are that:

(1) The invention provides a space text index structure KR-Tree, which stores space information and text information in an index node at the same time. In the space region query process, query keyword information can be utilized to efficiently prune the query region, so that the Skyline query efficiency is further improved.

(2) Aiming at the problem that error input possibly exists in the user query process, the invention provides a keyword similarity measurement scheme based on the edit distance, and the TF-IDF model is utilized to endow each existing keyword with initialization weight, so that the measurement value is more in accordance with user preference, and the query fault tolerance is increased.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a flow chart of comparing dominant relationships between points of interest according to the present invention;

FIG. 3 is a schematic diagram of a KR-Tree index structure according to the present invention;

FIG. 4 is a flow chart of KR-Tree index creation according to the present invention.

Detailed Description

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

In order to solve the problem of multi-preference demand of user query, the invention provides a keyword Skyline fuzzy query method based on a road network, and the specific flow of the method is shown in figure 1:

The specific flow of the step 3 is as follows:

Step 3.1: starting from the root node of the KR-Tree index, accessing the nodes of the index from top to bottom according to the query condition;

Step 3.2: judging whether the current node is a leaf node; if so, going to step 3.8; otherwise going to step 3.3;

Step 3.3: judging whether the region of the current node overlaps the query area in space; if so, going to step 3.4; otherwise going to step 3.6;

Step 3.4: calculating whether the text similarity between the set of keywords in the AK-Table index of the current node and T is smaller than or equal to a threshold K; if so, going to step 3.5; otherwise going to step 3.6;

Step 3.5: accessing a successor of the current node, namely a next-layer node connected to the current node, according to the depth-first principle, and going to step 3.2;

Step 3.6: judging whether the current node has a sibling node that has not yet been accessed; if so, accessing that sibling node and going to step 3.2; otherwise, stopping the downward access, returning to the previous node, and going to step 3.7;

Step 3.7: judging whether the current node is the root node; if so, going to step 3.9; otherwise going to step 3.6;

Step 3.8: comparing all points of interest in the current leaf node with all points of interest in the candidate set, eliminating from the candidate set and the current leaf node the points of interest that are dominated by other points of interest, and keeping the remaining points of interest as the new candidate set; after the comparison is finished, going to step 3.6;

Step 3.9: all leaf nodes meeting the query condition having been traversed, the points of interest in the final candidate set are taken as the query result.

The method for calculating the text similarity in the step 3.4 comprises the following steps:

wherein S(t_q, T_o) represents the text similarity between the keyword t_q input by the user q and the set T_o of all keywords in the region o where the current node is located; ED(t_q, t_o) is the minimum number of editing operations (insertions, deletions and modifications) needed to change the keyword t_q into the keyword t_o; W(t_q) is the weight value of the keyword t_q; Max is the weight value of the keyword with the largest weight in the database; W(t_q) = TF(t_q, T) * IDF(t_q, U), where TF(t_q, T) represents the frequency of occurrence of the keyword t_q in the keyword set T, and IDF(t_q, U) is the inverse document frequency, representing the reciprocal of the frequency of occurrence of the keyword t_q in all points of interest U of the database;

where |T| represents the number of query keywords.

The invention mainly provides a keyword weight model based on a TF-IDF model, a fuzzy keyword measurement method is realized by using the model, and a space text dominant relation calculation method is further provided, and the method flow is shown in figure 2 and comprises the following key steps:

Step S1, for each keyword t_q input by the user, calculating its weight W(t_q) according to the TF-IDF model:

Step S2, calculating the edit distance ED(t_q, t_i), where t_i is a keyword of the point of interest, t_i ∈ T_i, and T_i is the set of keywords held by point of interest i;

Step S3, calculating the text similarity S between the keyword or keyword set input by the user and the keyword set held by point of interest i (the points of interest are all in leaf nodes); if the user inputs a single keyword, S = S(t_q, T_i); if the user inputs multiple keywords, S = S(T, T_i);

Step S4, calculating the spatial text distance from the user q to the queried point of interest i:

D_t(q, i) = D_r(q, i) / S

wherein D_r(q, i) represents the road network distance from the querying user q to the queried point of interest i.

Step S5, performing the spatial text dominance judgment between the points of interest in the current leaf node and the points of interest in the candidate set, with the following rule:

if point of interest i is not weaker than point of interest j in the non-spatial attributes (attributes other than spatial ones, such as the positive rating of a merchant, i.e. a point of interest, on Ele.me), and the text spatial distance of point of interest i to the user is smaller than that of point of interest j, then point of interest i dominates point of interest j.
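The dominance rule of step S5, together with the candidate-set update it feeds, can be sketched as follows. The sketch assumes a single non-spatial attribute (a rating where higher is better) and precomputed text spatial distances; the class `Poi`, the function names and the sample values are illustrative and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Poi:
    name: str
    rating: float   # the only non-spatial attribute in this sketch; higher is better
    d_t: float      # text spatial distance D_t(q, i); lower is better


def dominates(i, j):
    """i dominates j: not weaker in the non-spatial attribute and strictly closer."""
    return i.rating >= j.rating and i.d_t < j.d_t


def merge_leaf(candidates, leaf_points):
    """Keep only the points that no other point in the combined pool dominates."""
    pool = list(candidates) + list(leaf_points)
    return [p for p in pool if not any(dominates(o, p) for o in pool)]


candidates = [Poi("a", 4.5, 2.0)]
leaf = [Poi("b", 4.0, 3.0), Poi("c", 4.8, 5.0), Poi("d", 4.5, 1.5)]
new_candidates = merge_leaf(candidates, leaf)
```

In this example "d" dominates both "a" and "b", while "c" survives because no other point matches its rating, so the new candidate set contains "c" and "d".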

The invention provides an efficient space text index structure KR-Tree by combining with an IR-Tree index, which can effectively improve the fuzzy matching of keywords and pruning efficiency of a space region irrelevant to query, wherein the construction process of the index is shown in figure 4:

Step A, estimating the size of the required memory space according to the number of points of interest in the database, and requesting a memory space of the corresponding size from the computer.

Step B, initializing an index head in the starting section of the memory space, and generating a root node.

Step C, traversing the coordinates of all points of interest in the database, inserting the spatial coordinates of all points of interest into leaf nodes in turn as key values, and introducing parent nodes where necessary, thereby completing the construction of the KR-Tree index frame.

The specific method is as follows: a key value is inserted into the current node; if the number of key values in the node exceeds the maximum value F, the node is split according to the neighbor principle, thereby generating two new nodes; if a parent node points down to the current node, that parent node is made to point down to the two new nodes and the information of the parent node is updated; if no parent node points down to the current node, a new parent node is generated on the upper layer, its pointers are made to point down to the two new nodes produced by the split, and the information of the parent node is updated;

starting again from the root node, the key values of the remaining points of interest are inserted into the existing leaf nodes in turn according to the neighbor principle, until the key values of all points of interest have been inserted into leaf nodes;

the specific operation of updating the parent node information is as follows: the parent node is taken as the current node, and the spatial region coordinate information represented by each of the two new nodes is inserted into the parent node as two key values.
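The overflow handling of step C can be illustrated for a single node. The patent names a neighbor principle without spelling it out, so this sketch substitutes a simple stand-in: points are sorted along the wider axis of their bounding box and cut in the middle. That split rule, the function names, and the representation of a node as a list of coordinate pairs are all assumptions.

```python
def bounding_box(points):
    """Spatial region (xmin, ymin, xmax, ymax) covered by a node's key values."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def split_node(points):
    """Split an overflowing node into two groups of nearby points (stand-in rule)."""
    xmin, ymin, xmax, ymax = bounding_box(points)
    axis = 0 if (xmax - xmin) >= (ymax - ymin) else 1   # cut across the wider axis
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def insert_key(node_points, point, max_keys):
    """Insert one key value; returns one node, or two nodes after a split."""
    node_points = node_points + [point]
    if len(node_points) <= max_keys:
        return [node_points]
    return list(split_node(node_points))


# A node with capacity F = 3 receives a fourth key value and splits.
nodes = insert_key([(0, 0), (1, 0), (9, 0)], (10, 1), max_keys=3)
parent_keys = [bounding_box(n) for n in nodes]
```

The two bounding boxes in `parent_keys` correspond to the two key values that the text says are inserted into the parent node after a split.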

Step D, traversing the keyword information held by the current point of interest to construct the AK-Table of the leaf node, as shown in FIG. 3.

Each keyword held by the current point of interest is taken as a key, its Hash value is obtained through a Hash function, and the keyword is inserted at the corresponding position of the AK-Table index according to that Hash value. If that position already stores a record, a subsequent linked-list node is added and the record information is updated. The record holds the id of the current point of interest and is used to locate the record quickly during a query.

Step E, once all points of interest have been inserted, constructing the AK-Tables of the non-leaf nodes from bottom to top.

The specific method is as follows: the keywords of each parent node are the set of all keywords of the next-layer nodes connected to that parent node; all keywords held by each parent node are input into a Hash function in turn to obtain the Hash value of each keyword; each keyword is inserted into a specific position of the AK-Table index of the parent node according to its Hash value, and the id of the position area where the keyword is located is recorded; if that position already holds a record, a subsequent linked-list node is added behind it, and the keyword and the id of the position area where the keyword is located are recorded in that linked-list node.

Step F, after the whole KR-Tree index is built, writing the index currently in memory to disk in blocks, with the node as the basic unit.
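Writing the index "in blocks, with the node as the basic unit" can be pictured as giving every node one fixed-size block, so that node k can be read back by seeking to a computed offset. The sketch below uses an in-memory byte stream in place of a disk file; the block size, the JSON encoding and the node layout are assumptions made purely for illustration.

```python
import io
import json

BLOCK_SIZE = 256  # bytes per node block; an illustrative value


def write_nodes(stream, nodes):
    """Write each node as one padded block so node k starts at byte k * BLOCK_SIZE."""
    for node in nodes:
        payload = json.dumps(node).encode("utf-8")
        if len(payload) > BLOCK_SIZE:
            raise ValueError("node does not fit into one block")
        stream.write(payload.ljust(BLOCK_SIZE, b" "))   # pad with spaces to a full block


def read_node(stream, index):
    """Read back a single node without touching the other blocks."""
    stream.seek(index * BLOCK_SIZE)
    return json.loads(stream.read(BLOCK_SIZE).decode("utf-8"))


buf = io.BytesIO()   # stands in for a file opened in binary mode
write_nodes(buf, [{"id": 0, "box": [0, 0, 1, 1]}, {"id": 1, "box": [5, 5, 6, 6]}])
second = read_node(buf, 1)
```

Fixed-size blocks trade some wasted space for direct addressing, which fits an index whose nodes are loaded one at a time during a depth-first search.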

A keyword Skyline fuzzy query system based on a road network, the system comprising: the device comprises an index construction module, an input module, a conversion module, a retrieval module and an output module;

the input module is used by a user to input keywords, and transmits the input keywords to the conversion module;

the conversion module is used for converting the keywords input by the user into a form that the system can recognize, and transmits them to the retrieval module;

the retrieval module uses the index constructed by the index construction module to search based on the keywords input by the user, and transmits the search result to the output module;

the output module is used for displaying the search result.

The invention addresses the storage and indexing of spatial text datasets and the multi-preference queries of users. It provides a spatial text dominance model that combines spatial attributes and text attributes, so that the preference requirements of querying users are fully met. To cope with input errors that may occur during a user's query, a keyword fuzzy matching method is provided to increase the fault tolerance of the query. Based on the characteristics of the IR-Tree index, a spatial text index structure, the KR-Tree, is provided; by using the text information stored in the index nodes, it improves the pruning efficiency of irrelevant nodes during the query and thereby improves query efficiency. The method can be widely applied in scenarios involving Skyline queries over road networks.

In addition, the specific features described in the above embodiments may be combined in any suitable manner without contradiction. The various possible combinations of the invention are not described in detail in order to avoid unnecessary repetition.

Claims

1. The keyword Skyline fuzzy query method based on the road network is characterized by comprising the following steps of:

step 4, after the search is finished, returning a search result to the user;

the specific method for constructing the corresponding KR-Tree index is as follows:

step C, traversing all the interest points in the database, sequentially inserting the space coordinates of all the interest points into leaf nodes as key values, and introducing parent nodes, thereby completing the construction of a KR-Tree index frame;

step D, traversing the keywords held by all the interest points, and inserting the keywords held by all the interest points into the AK-Table indexes of the corresponding leaf nodes, so as to construct the AK-Table indexes of the leaf nodes; the corresponding leaf nodes are the leaf nodes where the coordinates of the interest points of a certain keyword are located;

2. The method according to claim 1, wherein the specific method of step C is:

3. The method according to claim 2, wherein the specific operation of updating the information of the parent node is: and taking the father node as the current node, and inserting the space region coordinate information respectively represented by the two new nodes into the father node as two key values.

4. The method according to claim 1, wherein the specific method of step D is: inputting all keywords held by a current interest point into a Hash function one by one to obtain a Hash value of each keyword in the interest point, inserting the keywords into AK-Table indexes of corresponding leaf nodes according to the Hash value of each keyword, recording the id of the interest point, and if the position has recorded the same keywords as the keywords in other interest points, newly adding a subsequent linked list node in the position, and recording the keywords and the id of the current interest point in the subsequent linked list node; until all keywords in all interest points are inserted into the AK-Table index.

5. The method according to claim 1, wherein the specific method of step E is: the key words of each father node are the set of all key words of the next layer node connected with the father node, all key words held by each father node are input into a Hash function one by one to obtain a Hash value of each key word, the Hash value of each key word is inserted into an AK-Table index of the father node according to the Hash value of each key word, the id of the position area where the key word is located is recorded, if the position is already recorded, a subsequent linked list node is newly added behind the position, and the key words and the id of the position area where the key word is located are recorded in the subsequent linked list node.

6. The method according to claim 1, wherein the specific method for searching the database is as follows:

7. The method of claim 6, wherein the method of calculating text similarity in step 3.4 is as follows:

wherein S(t_q, T_o) represents the text similarity between the keyword t_q input by the user q and the set T_o of all keywords in the region o where the current node is located; ED(t_q, t_o) is the minimum number of editing operations (insertions, deletions and modifications) needed to change the keyword t_q into the keyword t_o; W(t_q) is the weight value of the keyword t_q; Max is the weight value of the keyword with the largest weight in the database; W(t_q) = TF(t_q, T) * IDF(t_q, U), where TF(t_q, T) represents the frequency of occurrence of the keyword t_q in the keyword set T, and IDF(t_q, U) is the inverse document frequency, representing the reciprocal of the frequency of occurrence of the keyword t_q in all points of interest of the database; U represents all points of interest of the database;

where |T| represents the number of query keywords.

8. The method according to claim 7, wherein the specific method for comparing dominance relations in step 3.8 is as follows: if point of interest i is not weaker than point of interest j in the non-spatial attributes, and the text spatial distance of point of interest i to the user is smaller than that of point of interest j, then point of interest i dominates point of interest j; the text spatial distance D_t(q, i) between the user and a point of interest is calculated as follows, where the points of interest i and j are points of interest in leaf nodes:

D_t(q, i) = D_r(q, i) / S

wherein D_r(q, i) represents the road network distance from the user q to the queried point of interest i, and S is the text similarity between the keyword or keyword set input by the user and the keyword set T_i held by the point of interest i; if the user inputs a single keyword, S = S(t_q, T_i); if the user inputs multiple keywords, S = S(T, T_i).