CN114791942B

CN114791942B - Spatial text density clustering retrieval method

Info

Publication number: CN114791942B
Application number: CN202210704570.3A
Authority: CN
Inventors: 李晓涛; 王艺沾; 朱海平; 罗昌银; 张卫平; 金炯华; 倪明堂; 黄培; 吴淑敏
Original assignee: Guangdong Intelligent Robotics Institute
Current assignee: Guangdong Intelligent Robotics Institute
Priority date: 2022-06-21
Filing date: 2022-06-21
Publication date: 2022-09-20
Anticipated expiration: 2042-06-21
Also published as: CN114791942A

Abstract

The invention discloses a spatial text density clustering retrieval method, which comprises the following steps: acquiring road network information and query keywords, the road network information including text objects; constructing a hybrid index structure based on an inverted file and according to the road network information, the inverted file comprising a posting list of the query keywords; obtaining an object set according to the hybrid index structure and the posting list of the query keywords; calculating the shortest path between any two text objects in the object set and the number of text objects on the shortest path; determining the mutual reachable distance of any two text objects; establishing a minimum spanning tree according to the mutual reachable distance between any two text objects, and storing the minimum spanning tree in a queue after processing; and taking the values in the queue as the retrieval result. The method can solve the problem of top-k spatial text clustering retrieval.

Description

Spatial text density clustering retrieval method

Technical Field

The invention relates to the technical field of spatial keyword retrieval, in particular to a spatial text density clustering retrieval method.

Background

The spatial keyword query takes a position and a group of keywords as parameters, returns objects related to the parameters, and plays an indispensable role in geographic text information retrieval and personalized services. The location in the query represents the user's location intent, while the keywords describe the user's actual needs.

In recent years, spatial keyword queries have become an active research direction, and many different types of spatial keyword query have been proposed. However, these queries all differ from the top-k spatial text clustering retrieval problem. Top-k spatial text clustering retrieval returns k sets of spatial text objects containing the query keywords. Each set is a cluster obtained through a density function, so that each cluster contains spatial web objects related to the query keywords and the density of each cluster satisfies the query constraints. A cost function is established from the spatial distance of the clusters and their text relevance to the query parameters, and the clusters are ranked by it. With this design the retrieval region no longer has to be a fixed-size rectangle or circle, and the robustness of the algorithm is improved.

However, existing spatial text clustering retrieval methods consider only Euclidean space and ignore the actual distance to the target. In practical applications, the position and accessibility of spatial text objects are constrained by network connectivity. Under this premise, a method that solves the top-k spatial text clustering query problem on a road network has practical value and significance.

Disclosure of Invention

Therefore, the technical problem to be solved by the present invention is to overcome the defects in the prior art, and to provide a spatial text density clustering retrieval method.

The invention provides a spatial text density clustering retrieval method, which comprises the following steps:

acquiring road network information and query keywords; the road network information includes a text object;

constructing a hybrid index structure based on an inverted file and according to the road network information; the inverted file comprises a posting list of the query keywords;

the hybrid index structure is used for organizing and storing the text objects in the road network information; obtaining an object set according to the hybrid index structure and the posting list of the query keywords;

calculating the shortest path between any two text objects in the object set and the number of the text objects on the shortest path; determining the mutual reachable distance of any two text objects according to the shortest path and the number of the text objects on the shortest path; establishing a minimum spanning tree according to the mutual reachable distance between any two text objects, and storing the minimum spanning tree to a queue after processing; and taking the value in the queue as a retrieval result.

Preferably, a hybrid index structure is constructed based on the road network information, the G tree and the inverted file;

the process of obtaining the hybrid index structure is as follows: constructing a G tree according to the road network information and, on the basis of the G tree, adding a distance matrix and a pointer to an inverted file to each node of the G tree from bottom to top, thereby constructing the hybrid index structure;

text objects are stored in the leaf nodes of the hybrid index structure.

Preferably, any two text objects are denoted a and b respectively; the mutual reachable distance of the two text objects is denoted d_mreach-k(a, b), and the calculation formula is as follows:

$$d_{\mathrm{mreach}\text{-}k}(a, b) = \max\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a, b)\}$$

wherein core_k(a) represents the spatial distance between the text object a and its k-th nearest neighbor text object; core_k(b) represents the spatial distance between the text object b and its k-th nearest neighbor text object; d(a, b) represents the road network distance between the text object a and the text object b.

Preferably, a k-order triangulation structure is adopted, a subgraph is formed according to the mutual reachable distance of any two text objects, and then a minimum spanning tree is established.

Preferably, the specific process of establishing the minimum spanning tree is as follows:

the text objects are regarded as a weighted graph, in which the text objects are the vertices and the mutual reachable distance between any two text objects is the weight of the edge between them; a k-order triangulation structure is adopted to reduce the number of edges between text objects that must be considered when establishing the minimum spanning tree; the remaining edges of the weighted graph and the text objects form a subgraph, from which the minimum spanning tree is built.

Preferably, the set of objects is a set of objects related to the query keyword.

Preferably, compressing the minimum spanning tree, extracting density clusters, and storing the density clusters into a queue; and selecting the density clusters in the queue as retrieval results.

Preferably, density clusters include cluster stability and cluster persistence; the stability of a cluster is denoted as:

$$S(\mathrm{cluster}) = \sum_{p \in \mathrm{cluster}} \left(\lambda_p - \lambda_{\mathrm{birth}}\right)$$

The persistence of a cluster is denoted as:

$$\lambda = \frac{1}{\mathrm{distance}}$$

wherein λ_p represents the reciprocal of distance at the moment the node p of the minimum spanning tree under the current cluster is separated from the current cluster; λ_birth represents the reciprocal of distance at the moment the current cluster is generated by splitting; distance represents the length of the edge in the minimum spanning tree at which the current cluster is separated; λ represents the persistence of the cluster and is the reciprocal of distance; as λ increases, the cluster shrinks continuously until it disappears or is split into sub-clusters; cluster denotes the current cluster.

Preferably, the process of extracting density clusters is as follows: all edges in the minimum spanning tree are sorted in increasing order, and for each edge a union-find (disjoint-set) structure is used to merge the two subgraphs linked by that edge, so that the minimum spanning tree is compressed and converted into a tree structure; the tree structure is traversed bottom-up from the leaf nodes, the stability of every cluster in the tree structure is calculated, and the clusters with the best stability are extracted; when the sum of the stabilities of the sub-clusters is greater than the stability of the cluster, the stability of the cluster is replaced with the sum of the stabilities of the sub-clusters; otherwise, all the sub-clusters are merged; and when the traversal reaches the root node of the tree structure, the extracted clusters are taken as the density clusters.

Preferably, the density clusters selected in the queue are screened through a cost function, and the cost function is recorded as: cost; the cost function calculation formula is as follows:

wherein α ∈ (0, 1] indicates the user preference, and tr_q.ψ(R) represents the maximum text relevance value of the text objects in the density cluster.

The technical scheme of the invention has the following advantages: the efficient pruning of the spatial information can be realized through the mixed index structure, and the efficient retrieval can be further realized.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is an exemplary diagram of k-STC on a road network in an implementation of the present invention;

FIG. 2 is a flow chart of a retrieval method in an implementation of the present invention;

FIG. 3 is a schematic diagram of a road network in which the present invention is implemented;

FIG. 4 is a schematic view of the road network shown in FIG. 3 after being divided;

FIG. 5 is a diagram illustrating a hybrid index structure in accordance with an embodiment of the present invention;

FIG. 6 is a diagram illustrating inverted files in the hybrid index structure shown in FIG. 5;

FIG. 7 is a diagram illustrating the distance matrices and shortcuts in the hybrid index structure shown in FIG. 5.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in FIG. 1, eight objects o₁, o₂, o₃, …, o₈ are dispersed in the road network, and each object has a separate keyword as its tag. FIG. 1 searches for a set of results containing restaurants, hotels and shops. If the actual distance between objects is not taken into account, the four objects o₂, o₄, o₅, o₇ in the dashed box in the lower right corner are the optimal result. In practice, however, the distance between objects is determined by the shortest path between them, and the query point cannot directly reach the positions of o₂, o₄, o₅ across the highway, so the four objects o₁, o₃, o₆, o₈ in the dashed box in the upper left corner may be a better choice. Therefore, research on spatial text clustering queries on a road network cannot ignore the actual distance to the target.

As shown in fig. 2, this embodiment provides a spatial text density clustering retrieval method, which adopts an HDBSCAN clustering method, and the retrieval method includes the steps of:

As shown in FIG. 3, this embodiment provides a road network comprising 14 nodes and 17 edges, in which 11 text objects are distributed; v₀, v₁, v₂, …, v₁₃ represent nodes of the road network, and o₁, o₂, …, o₁₁ represent text objects.

Table 1 gives the document vector of each text object.

in this embodiment, the inverted file further includes a query keyword, an object containing the query keyword, and a weight (frequency) thereof.
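The inverted file described above can be sketched as a mapping from each keyword to a posting list of (object, weight) pairs. This is only an illustrative sketch: the function name `build_inverted_file` and the document vectors below are hypothetical, since Table 1 is not reproduced in this text.

```python
from collections import defaultdict


def build_inverted_file(doc_vectors):
    """Map each keyword to a posting list of (object_id, weight) pairs."""
    inverted = defaultdict(list)
    for obj_id, vector in doc_vectors.items():
        for keyword, weight in vector.items():
            inverted[keyword].append((obj_id, weight))
    # Sort each posting list by object id so lookups are deterministic.
    return {kw: sorted(postings) for kw, postings in inverted.items()}


# Hypothetical document vectors (keyword -> frequency), not the patent's Table 1.
doc_vectors = {
    "o1": {"restaurant": 2, "hotel": 1},
    "o2": {"shop": 3},
    "o3": {"restaurant": 1, "shop": 1},
}
inverted_file = build_inverted_file(doc_vectors)
```

Looking up `inverted_file["restaurant"]` then yields every object containing that keyword together with its weight.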

Specifically, a hybrid index structure is constructed based on road network information, a G tree and inverted files;

The process of obtaining the hybrid index structure is as follows: a G tree is constructed according to the road network information; then, based on the G tree, a distance matrix and a pointer to an inverted file are added to each node of the G tree from bottom to top.

Leaf nodes additionally store text objects. The distance matrix of a leaf node stores the distances from the vertices to the boundary points of the corresponding subgraph, and the inverted file it points to contains the keywords and weights of all related text objects in that subgraph; that is, the inverted file indexes the text information of all the text objects stored in the leaf node.

For a non-leaf node, the distance matrix stores the shortest distances between the boundary points of all its child nodes, and the inverted file it points to stores the keywords contained in its child nodes together with the maximum weight of each keyword. The hybrid index structure is constructed by this process.
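The textual part of this structure can be sketched as follows. The sketch assumes a binary tree in which G₁ has children G₃ and G₄ and G₂ has children G₅ and G₆ (the parent-child layout is inferred from the group numbering, not stated explicitly), and the keywords and weights are invented for illustration. The class `IGNode` and both functions are hypothetical names; the distance matrix is left as an empty placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class IGNode:
    name: str
    children: list = field(default_factory=list)
    objects: dict = field(default_factory=dict)          # leaf only: object id -> {keyword: weight}
    distance_matrix: dict = field(default_factory=dict)  # placeholder, not filled in this sketch
    inverted_file: dict = field(default_factory=dict)


def build_inverted_files(node):
    """Fill inverted files bottom-up: leaves index objects, inner nodes keep max weights."""
    if not node.children:
        for oid, vector in node.objects.items():
            for kw, weight in vector.items():
                node.inverted_file.setdefault(kw, {})[oid] = weight
        return
    for child in node.children:
        build_inverted_files(child)
        for kw, entry in child.inverted_file.items():
            # Keep only the maximum weight of the keyword below this child.
            node.inverted_file.setdefault(kw, {})[child.name] = max(entry.values())


def collect_objects(node, keywords):
    """Descend only into children whose inverted-file entry mentions a query keyword."""
    if not node.children:
        return {oid for kw in keywords for oid in node.inverted_file.get(kw, {})}
    relevant = {name for kw in keywords for name in node.inverted_file.get(kw, {})}
    found = set()
    for child in node.children:
        if child.name in relevant:
            found |= collect_objects(child, keywords)
    return found


# Group membership follows the description of FIG. 5; keywords are hypothetical.
g3 = IGNode("G3", objects={"o6": {"hotel": 1}, "o7": {"shop": 2}, "o8": {"restaurant": 1}})
g4 = IGNode("G4", objects={"o9": {"shop": 1}, "o10": {"shop": 1}, "o11": {"restaurant": 3}})
g5 = IGNode("G5", objects={"o1": {"hotel": 2}, "o2": {"restaurant": 1}})
g6 = IGNode("G6", objects={"o3": {"shop": 1}, "o4": {"shop": 2}, "o5": {"restaurant": 2}})
root = IGNode("G0", children=[IGNode("G1", children=[g3, g4]), IGNode("G2", children=[g5, g6])])
build_inverted_files(root)
hotel_objects = collect_objects(root, ["hotel"])
```

Because a non-leaf entry records which children contain a keyword, subtrees without the keyword (here G₄ and G₆ for "hotel") are never visited, which is the pruning effect attributed to the hybrid index.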

As shown in FIG. 4, the G tree divides the road network into several subgraphs of almost the same size and creates a tree over the subgraphs. On the basis of the G tree, shortcuts are created at selected nodes (v₃, v₄, v₅, v₆, v₇, v₈, v₉, v₁₀, v₁₁, v₁₂, v₁₃); each shortcut stores the distance between the boundaries of two nodes, so that the shortest-path distance between two vertices can be calculated efficiently. The nodes of FIG. 3 are grouped according to the partitioning rule of the G tree, first into G₁ and G₂; then, with the number of partitions at each step specified as 2 and the maximum number of nodes contained in each group as 4, they are further divided into four groups G₃, G₄, G₅ and G₆. In the figure, if an edge exists between two groups and the groups can be connected through that edge, the two end points of the edge are boundary points of the two groups respectively and are stored in the nodes. Taking FIG. 4 as an example, v₀, v₂, v₃ are the boundary points of G₅ and G₆, and v₀, v₁, v₄ are the boundary points of G₁ and G₂. The G tree does not store the distance between every pair of nodes; it stores only the shortest-path distances between boundary points, and these are kept in a distance matrix. For example, the boundary point of G₅ is v₂, so the distance matrix of G₅ contains the shortest paths from v₂ to all the nodes in G₅ (v₂, v₁₀, v₁₁). In the G tree, two shortcuts S₁ and S₂ connect G₃ with G₅ and G₄ with G₆ respectively. Thus, to query the distance between v₄ and v₁₀, it is only necessary to access the shortcut S₁ and the distance matrices of the corresponding groups.
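The way a distance query combines a shortcut with two distance matrices can be sketched as a minimum over boundary-point pairs. This is a simplified illustration rather than the full G tree assembly procedure: the function name, the boundary labels `x1` and `x2`, and all distances are hypothetical, and only v₂ as the boundary point of G₅ is taken from the text.

```python
def assembled_distance(src_to_boundary, shortcut, boundary_to_dst):
    """Shortest distance obtained by joining source group, shortcut and target group.

    src_to_boundary: boundary point of the source group -> distance from the source
    shortcut:        (source boundary, target boundary) -> stored distance
    boundary_to_dst: boundary point of the target group -> distance to the target
    """
    return min(
        src_to_boundary[b1] + shortcut[(b1, b2)] + boundary_to_dst[b2]
        for (b1, b2) in shortcut
        if b1 in src_to_boundary and b2 in boundary_to_dst
    )


# Hypothetical numbers: the source vertex sits in a group with boundary points x1, x2;
# the target vertex sits in a group whose only boundary point is v2.
src_to_boundary = {"x1": 2.0, "x2": 5.0}
shortcut = {("x1", "v2"): 4.0, ("x2", "v2"): 0.5}
boundary_to_dst = {"v2": 3.0}
distance = assembled_distance(src_to_boundary, shortcut, boundary_to_dst)
```

Only the two leaf matrices and one shortcut are read, which is why the query does not need a full graph search.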

FIG. 5 provides a structural example of the hybrid index structure (IG tree); the inverted files in FIG. 5 (IF₀, IF₁, IF₂, IF₃, IF₄, IF₅, IF₆) are shown in FIG. 6; the distance matrices in FIG. 5 (G₀, G₁, G₂, G₃, G₄, G₅, G₆) and the shortcuts (S₁, S₂) are shown in FIG. 7.

Each group stores all the nodes it contains, together with a matrix holding the distances from the nodes to the boundary points. Each group also contains a pointer to an inverted file that indexes the text information of all the text objects stored in the node. As shown in FIG. 5, group G₃ contains objects o₆, o₇, o₈; group G₄ contains objects o₉, o₁₀, o₁₁; group G₅ contains objects o₁, o₂; group G₆ contains objects o₃, o₄, o₅. Each group constitutes a leaf node of the hybrid index structure (IG tree), and each leaf node also contains a pointer to its inverted file.

The hybrid index structure is used for organizing and storing the text objects in the road network information; an object set is obtained according to the hybrid index structure and the posting list of the query keywords; the object set is the set of objects related to the query keywords.

Calculating the shortest path between any two text objects in the object set and the number of the text objects on the shortest path; determining the mutual reachable distance of any two text objects according to the shortest path and the number of the text objects on the shortest path; establishing a minimum spanning tree according to the mutual reachable distance between any two text objects, compressing the minimum spanning tree, extracting density clusters, and storing the density clusters into a queue; and selecting top-k density clusters in the queue as retrieval results.

In this embodiment, any two text objects are denoted a and b respectively; the mutual reachable distance of the two text objects is denoted d_mreach-k(a, b), and the calculation formula is as follows:

$$d_{\mathrm{mreach}\text{-}k}(a, b) = \max\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a, b)\}$$

wherein core_k(a) represents the spatial distance between the text object a and its k-th nearest neighbor text object; core_k(b) represents the spatial distance between the text object b and its k-th nearest neighbor text object; d(a, b) represents the road network distance between the text object a and the text object b.

When the number of text objects on the shortest path between the two text objects a and b (not counting a and b) is greater than k-2, the shortest-path distance between the two text objects is their mutual reachable distance. When a lies within the circle centered at b with radius core_k(b), the mutual reachable distance between the two text objects is core_k(b). When b lies within the circle centered at a with radius core_k(a), the mutual reachable distance between the two text objects is core_k(a).
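The following sketch computes core distances and the mutual reachable distance, assuming the standard HDBSCAN definition in which the mutual reachable distance is the maximum of core_k(a), core_k(b) and d(a, b). The function names and the road network distances are hypothetical, and the k-th nearest neighbor is counted without the object itself, which is one of several conventions in use.

```python
def core_distance(dist, a, k):
    """Distance from a to its k-th nearest neighbor, not counting a itself."""
    neighbor_distances = sorted(d for b, d in dist[a].items() if b != a)
    return neighbor_distances[k - 1]


def mutual_reachable_distance(dist, a, b, k):
    """Maximum of the two core distances and the road network distance d(a, b)."""
    return max(core_distance(dist, a, k), core_distance(dist, b, k), dist[a][b])


# Hypothetical symmetric road network distances between four text objects.
dist = {
    "o1": {"o2": 1.0, "o3": 4.0, "o4": 6.0},
    "o2": {"o1": 1.0, "o3": 2.0, "o4": 5.0},
    "o3": {"o1": 4.0, "o2": 2.0, "o4": 3.0},
    "o4": {"o1": 6.0, "o2": 5.0, "o3": 3.0},
}
k = 2
d_12 = mutual_reachable_distance(dist, "o1", "o2", k)
d_14 = mutual_reachable_distance(dist, "o1", "o4", k)
```

For o₁ and o₂ the road distance is small but the core distance of o₁ dominates, so the pair is pushed apart; for o₁ and o₄ the road distance itself is the largest term.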

Furthermore, a k-order triangulation structure is adopted, a subgraph is formed according to the mutual reachable distance of any two text objects, and then a minimum spanning tree is established.

The specific process of establishing the minimum spanning tree is as follows:

The text objects are regarded as a weighted graph, in which the text objects are the vertices and the mutual reachable distance between any two text objects is the weight of the edge between them. A k-order triangulation structure is adopted to reduce the number of edges between text objects that must be considered when establishing the minimum spanning tree. Under HDBSCAN clustering, the state of a text object can change as neighboring objects are added: a new cluster can be generated, and two old clusters can be fused into a new cluster. For two points p and q, if there exists a circle with p and q on its boundary and at most k points belonging to D inside it, the edge between p and q is called a k-order Delaunay edge, abbreviated as k-od edge. According to the definition of the mutual reachable distance of any two text objects, if an edge is not a k-od edge, then p and q are already in the same connected component; the minimum spanning tree of the subgraph containing only k-od edges is therefore consistent with that generated from the original graph. The k-order triangulation structure is thus used as an auxiliary computation so that the minimum spanning tree is constructed faster; the remaining edges of the weighted graph and the text objects form a subgraph, from which the minimum spanning tree is built.
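Building the minimum spanning tree from a set of candidate edges can be sketched with Kruskal's algorithm, which is one common choice and is not stated to be the method used here. The k-order Delaunay pruning is not implemented: the sketch simply takes whatever edge list it is given, and the weights below are hypothetical mutual reachable distances.

```python
def kruskal_mst(vertices, weighted_edges):
    """Return the minimum spanning tree as a list of (a, b, weight) edges."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the component root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # the edge joins two different components
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical candidate edges (weight, a, b); in the method described above this
# list would contain only the edges kept by the k-order triangulation structure.
vertices = ["o1", "o2", "o3", "o4"]
edges = [
    (4.0, "o1", "o2"), (4.0, "o1", "o3"), (6.0, "o1", "o4"),
    (3.0, "o2", "o3"), (5.0, "o2", "o4"), (5.0, "o3", "o4"),
]
mst = kruskal_mst(vertices, edges)
```

Reducing the candidate edge list before this step shrinks the sort, which is where the pruning saves time.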

In this embodiment, density clusters include cluster stability and cluster persistence; the stability of a cluster is denoted as:

$$S(\mathrm{cluster}) = \sum_{p \in \mathrm{cluster}} \left(\lambda_p - \lambda_{\mathrm{birth}}\right)$$

The persistence of a cluster is denoted as:

$$\lambda = \frac{1}{\mathrm{distance}}$$

Here λ_p represents the reciprocal of distance at the moment the node p of the minimum spanning tree under the current cluster is separated from the current cluster; λ_birth represents the reciprocal of distance at the moment the current cluster is generated by splitting; distance represents the length of the corresponding minimum spanning tree edge at which the current cluster is separated; λ represents the persistence of the cluster and is the reciprocal of distance. As λ increases (that is, as distance decreases), the cluster becomes smaller and smaller until it disappears or splits into sub-clusters; cluster denotes the current cluster. By selecting a suitable λ, a more stable cluster can be selected.

The process of extracting density clusters is as follows: all edges in the minimum spanning tree are sorted in increasing order, and for each edge a union-find (disjoint-set) structure is used to merge the two subgraphs linked by that edge, so that the minimum spanning tree is compressed and converted into a tree structure; the tree structure is traversed bottom-up from the leaf nodes, the stability of every cluster in the tree structure is calculated, and the clusters with the best stability are extracted. When the sum of the stabilities of the sub-clusters is greater than the stability of the cluster, the stability of the cluster is replaced with the sum of the stabilities of the sub-clusters; otherwise, all the sub-clusters are merged. When the traversal reaches the root node of the tree structure, the extracted clusters are taken as the density clusters.
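The extraction procedure can be sketched in two steps: a union-find pass over the sorted edges that records each merge as a tree node, and a bottom-up pass that compares a cluster's stability with the sum over its sub-clusters. The sketch makes several simplifying assumptions that are not in the text: there is no minimum cluster size, every member is treated as leaving a cluster at the moment that cluster splits, single objects have stability 0, and the root is born at λ = 0. All names and edge weights are hypothetical.

```python
def build_hierarchy(points, mst_edges):
    """Merge components in increasing edge order; each merge becomes a tree node."""
    parent = {p: p for p in points}
    cluster_of = {p: {"members": [p], "children": [], "dist": None} for p in points}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    root = None
    for a, b, weight in sorted(mst_edges, key=lambda e: e[2]):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue
        left, right = cluster_of[root_a], cluster_of[root_b]
        parent[root_a] = root_b
        root = {"members": left["members"] + right["members"],
                "children": [left, right], "dist": weight}
        cluster_of[root_b] = root
    return root


def select_clusters(node, lambda_birth=0.0):
    """Return (stability, selected clusters) for the subtree rooted at node."""
    if not node["children"]:
        return 0.0, []                      # a single object is not a cluster here
    lambda_split = 1.0 / node["dist"]       # lambda at which this cluster splits
    own = len(node["members"]) * (lambda_split - lambda_birth)
    child_total, child_selected = 0.0, []
    for child in node["children"]:
        stability, selected = select_clusters(child, lambda_split)
        child_total += stability
        child_selected += selected
    if child_total > own:
        return child_total, child_selected  # keep the sub-clusters
    return own, [sorted(node["members"])]   # merge the sub-clusters into this cluster


# Hypothetical spanning tree: two tight groups joined by one long edge.
points = ["o1", "o2", "o3", "o4", "o5", "o6"]
mst_edges = [("o1", "o2", 1.0), ("o2", "o3", 1.0), ("o4", "o5", 1.0),
             ("o5", "o6", 1.0), ("o3", "o4", 10.0)]
total, clusters = select_clusters(build_hierarchy(points, mst_edges))
```

In this example the root cluster is short-lived in λ terms, so the two tight groups together are more stable than their union and are the ones extracted.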

For the density clusters selected in the queue, in this embodiment, the density clusters are screened through a cost function, and the cost function is recorded as: cost; the cost function calculation formula is as follows:

It should be understood that the above embodiments are given only for clarity of illustration and do not limit the implementations. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims

1. A spatial text density clustering retrieval method is characterized by comprising the following steps:

acquiring road network information and query keywords; the road network information comprises a text object;

constructing a hybrid index structure based on the road network information, the G tree and the inverted file;

the process of obtaining the hybrid index structure is as follows: constructing a G tree according to the road network information and, based on the G tree, adding a distance matrix and a pointer to an inverted file to each node of the G tree from bottom to top, thereby constructing the hybrid index structure;

the text objects are stored in the leaf nodes of the hybrid index structure;

for a non-leaf node, the distance matrix stores the shortest distances between the boundary points of all child nodes of the non-leaf node; the inverted file pointed to stores the keywords contained in the child nodes of the non-leaf node and the maximum weight of each keyword;

the hybrid index structure is used for organizing and storing the text objects in the road network information; obtaining an object set according to the hybrid index structure and the posting list of the query keywords;

calculating the shortest path between any two text objects in the object set and the number of the text objects on the shortest path; determining the mutual reachable distance of any two text objects according to the shortest path and the number of the text objects on the shortest path; establishing a minimum spanning tree according to the mutual reachable distance between any two text objects, and storing the minimum spanning tree to a queue after processing; taking the value in the queue as a retrieval result;

forming a subgraph according to the mutual reachable distance of any two text objects by adopting a k-order triangulation structure, and further establishing a minimum spanning tree;

the specific process of establishing the minimum spanning tree is as follows:

regarding the minimum spanning tree as a weighted graph, wherein the text object is used as a vertex, and the mutual reachable distance between any two text objects is used as the weight of an edge between any two text objects; reducing edges between any two text objects to be considered for establishing the minimum spanning tree by adopting a k-order triangulation structure; the remaining edges in the weighted graph and the text objects form subgraphs, and a minimum spanning tree is built according to the subgraphs.

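2. The spatial text density clustering retrieval method according to claim 1, characterized in that any two text objects are denoted a and b respectively; the mutual reachable distance of the two text objects is denoted d_mreach-k(a, b); the calculation formula is as follows:

$$d_{\mathrm{mreach}\text{-}k}(a, b) = \max\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a, b)\}$$

wherein core_k(a) represents the spatial distance between the text object a and its k-th nearest neighbor text object; core_k(b) represents the spatial distance between the text object b and its k-th nearest neighbor text object; d(a, b) represents the road network distance between the text object a and the text object b.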

3. The method of claim 1, wherein the set of objects is a set of objects related to the query keyword.

4. The spatial text density clustering retrieval method according to claim 2, characterized in that the minimum spanning tree is compressed, density clusters are extracted, and the density clusters are stored in the queue; and selecting the density clusters in the queue as retrieval results.

5. The spatial text density clustering retrieval method according to claim 4, wherein the density clusters include cluster stability and cluster persistence; the stability of a cluster is denoted as:

$$S(\mathrm{cluster}) = \sum_{p \in \mathrm{cluster}} \left(\lambda_p - \lambda_{\mathrm{birth}}\right)$$

the persistence of a cluster is denoted as:

$$\lambda = \frac{1}{\mathrm{distance}}$$

wherein λ_p represents the reciprocal of distance when the node p of the minimum spanning tree under the current cluster is separated from the current cluster; λ_birth represents the reciprocal of distance when the current cluster is generated by splitting; distance represents the length of the corresponding minimum spanning tree edge when the current cluster is separated; λ represents the persistence of the cluster, and λ is the reciprocal of distance; as λ increases, the cluster shrinks continuously until it disappears or is split into sub-clusters; cluster denotes the current cluster.

6. The spatial text density clustering retrieval method according to claim 5, wherein the process of extracting density clusters is: all edges in the minimum spanning tree are sorted in increasing order, and for each edge a union-find (disjoint-set) structure is used to merge the two subgraphs linked by that edge, so that the minimum spanning tree is compressed and converted into a tree structure; the tree structure is traversed bottom-up from the leaf nodes, the stability of every cluster in the tree structure is calculated, and the clusters with the best stability are extracted; when the sum of the stabilities of the sub-clusters is greater than the stability of the cluster, the stability of the cluster is replaced with the sum of the stabilities of the sub-clusters; otherwise, all the sub-clusters are merged; and when the traversal reaches the root node of the tree structure, the extracted clusters are taken as the density clusters.

7. The spatial text density clustering retrieval method according to claim 6, characterized in that the density clusters selected in the queue are screened by a cost function, and the cost function is recorded as: cost; the cost function calculation formula is as follows:

wherein α ∈ (0, 1] indicates the user preference, and tr_q.ψ(R) represents the maximum text relevance value of the text objects in the density cluster.