CN111813778B

CN111813778B - Approximate keyword storage and query method for large-scale road network data

Info

Publication number: CN111813778B
Application number: CN202010650465.7A
Authority: CN
Inventors: 邰伟鹏; 陶荣荣; 付山; 赵佳俊; 胡涛; 王小林
Original assignee: Anhui Gongda Information Technology Co ltd; Anhui University of Technology AHUT
Current assignee: Anhui Gongda Information Technology Co ltd; Anhui University of Technology AHUT
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2024-03-29
Anticipated expiration: 2040-07-08
Also published as: CN111813778A

Abstract

The invention discloses an approximate keyword storage and query method for large-scale road network data, and belongs to the technical field of road network data processing. The storage method comprises the following steps: preprocessing original road network data to obtain a preprocessing result; constructing an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index according to the preprocessing result, wherein the indexes are associated with one another through pointers; and storing the constructed indexes in a memory storage space. The query method performs queries using the indexes stored by the storage method. The invention aims to overcome the low utilization of road network data storage space and the low query efficiency of road network data in the prior art, and provides an approximate keyword storage and query method for large-scale road network data that improves storage space utilization and greatly improves query efficiency.

Description

Approximate keyword storage and query method for large-scale road network data

Technical Field

The invention relates to the technical field of road network data processing, in particular to a method for storing and inquiring approximate keywords for large-scale road network data.

Background

In recent years, with the rapid development of science and technology and popularization of social informatization, the traditional mode of manually collecting information is gradually replaced by electronic equipment. Meanwhile, with the diversification of data acquisition equipment, the rapid development of a 5G network and the more and more intensive research of the field of space geographic information, geographic space data with huge scale is generated, the data volume is exponentially increased, and meanwhile, the high computational complexity of the space data brings great challenges to a series of processing procedures such as data storage and query.

In a traditional database, data is organized and stored as two-dimensional tables, and the storage and representation modes are limited. Spatial data, by contrast, is multidimensional, large in volume and highly complex, so a traditional database cannot effectively store, process and display it. Spatial databases were developed to remedy these deficiencies. Data query is a basic function of both traditional and spatial databases. When the required information must be retrieved from large-scale data stored in a database system, query efficiency is usually improved by combining an index with a query algorithm. As for index structures, traditional databases generally use B-Tree, B+-Tree and the like, while spatial databases generally use the R-Tree spatial index and its variants, such as the R*-Tree; the structure and performance of the index directly affect the performance of the database. Queries in a spatial database are complex because many attributes of an object must be considered, such as its functional description and its geographic position. These attributes are described as text, so cases such as missing words, misspelled words and ambiguity must also be handled during the query. This gives rise to the problem of spatial approximate keyword query, a concept that has already been proposed and studied in depth.

Faced with large-scale data in a spatial database, traditional index structures and query methods can hardly meet users' growing efficiency demands, and overall query speed continues to decline as spatial data grows. Researchers have proposed many solutions, among which reducing the data source is the most fundamental and simplest. However, growth of the data volume in a database is unavoidable, so directly reducing the total data volume is not feasible; the amount of data to be queried can only be reduced indirectly by partitioning the query region. With the rapid development of parallel technology, efficiency has also been improved through parallel spatial index mechanisms. In addition, query efficiency can be effectively improved by using an improved index structure built on the original index, such as the MHR-Tree, together with a new query algorithm, such as a balanced partition adaptive algorithm. Road network data, as a special case of spatial data, is attracting increasing attention. As with spatial data query in general, road network data query also suffers from gradually decreasing efficiency.

In summary, how to improve the utilization efficiency of the space for storing the road network data and the query efficiency of the road network data is a problem to be solved in the prior art.

Disclosure of Invention

1. Problems to be solved

The invention aims to overcome the defects of lower utilization rate of a road network data storage space and lower query efficiency of road network data in the prior art, and provides a method for storing and querying approximate keywords for large-scale road network data, which can improve the utilization rate of the road network data storage space and greatly improve the query efficiency of the road network data.

2. Technical proposal

In order to solve the problems, the technical scheme adopted by the invention is as follows:

the invention discloses a method for storing approximate keywords for large-scale road network data, which comprises the following steps: preprocessing original road network data to obtain a preprocessing result; then constructing an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index according to the preprocessing result, wherein the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index are mutually related through pointers; and then storing the constructed R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index into a memory storage space.

Further, the specific process of preprocessing the original road network data is as follows: intersection points are obtained from the original road network data; the characteristic intersection points are clustered by density to obtain a clustering result; non-characteristic intersection points are filled into the clustering result through road network expansion; the clustering results are merged using a one-pass clustering algorithm to obtain a rectangular full coverage map; and the interest points are mapped into the road network according to an interest point mapping algorithm to obtain an interest point set.

Further, the specific process of obtaining the intersection points from the original road network data is as follows: the basic constituent elements of the original road network data are extracted, and a node.txt file and a way.txt file are constructed from them, where the node.txt file stores node information and the way.txt file stores road information; the intersection points are then obtained through a definition applied to the node.txt file.

Further, the specific process of constructing the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index according to the preprocessing result is as follows: the preprocessing result comprises the node.txt file and the way.txt file; the R-Tree index is constructed according to the rectangular full coverage map and node.txt; the point index is constructed by traversing the leaf nodes of the R-Tree built from the node.txt file; the adjacent point B-Tree index is constructed according to the node.txt file and the way.txt file; and the interest point B-Tree index is constructed according to the interest point set and the way.txt file.

Furthermore, the specific process of filling the non-characteristic intersection points into the clustering result through road network expansion is as follows: the clustering result is first wrapped with a minimum bounding polygon; the minimum bounding polygon is then expanded in the direction from high-degree intersections to low-degree intersections, and a minimum bounding rectangle is taken around the expanded polygon.

Further, the specific process of merging the clustering results using a one-pass clustering algorithm to obtain the rectangular full coverage map is as follows: a box plot is used to find the outlier threshold of the areas of the minimum bounding rectangles, and the regions whose area is smaller than this threshold are merged using the one-pass clustering algorithm to obtain the rectangular full coverage map.

Further, the specific process of constructing the adjacent point B-Tree index according to the node.txt file and the way.txt file is as follows: road segment identifiers formed from the basic constituent elements in the node.txt file and the way.txt file are selected as the keywords in the nodes, and the adjacent point B-Tree index is built from the bottom layer up to the first layer according to these keywords, where each upper-layer node is constructed by extracting the first keyword of each lower-layer node.

Further, the specific process of constructing the interest point B-Tree index according to the interest point set and the way.txt file is as follows: an inverted list of the interest point keywords is constructed, and the interest point B-Tree index is built from the bottom layer up to the first layer according to the interest point keywords, where each upper-layer node is constructed by extracting the first interest point keyword of each lower-layer node.

The approximate keyword query method for the large-scale road network data adopts the index stored by the approximate keyword storage method for the large-scale road network data for query.

Further, the query method comprises: first acquiring a query request; searching the corresponding indexes according to the query request, and obtaining a path list of the approximate keyword set from the indexes; and then reading the data file corresponding to the query request according to the path list; wherein the indexes comprise the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index.

3. Advantageous effects

Compared with the prior art, the invention has the beneficial effects that:

(1) The approximate keyword storage method for the large-scale road network data is characterized in that R-Tree indexes, point indexes, adjacent point B-Tree indexes and interest point B-Tree indexes are constructed, indexes are related through pointers, so that the compactness of the road network data in logic is enhanced, the relevance of the data is improved, and the utilization efficiency of storage space is further improved.

(2) In the approximate keyword query method for large-scale road network data of the invention, in the step of performing approximate keyword queries with the constructed external storage index files, a KNN query method is provided in which the spatial query and the approximate keyword query run in parallel. This replaces the traditional approach of running the two query modes serially and improves query efficiency; multi-neighbor query tasks can be completed at the level of seconds, achieving near-real-time queries and further enabling the processing of big data.

Drawings

FIG. 1 is a schematic flow diagram of a method for storing approximate keywords for large-scale road network data according to the present invention;

FIG. 2 is a schematic flow diagram of a method for querying approximate keywords for large-scale road network data according to the invention;

FIG. 3 is a schematic diagram of index association in Example 2.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention; moreover, the embodiments are not independent, and can be combined with each other as required, so that a better effect is achieved. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

For a further understanding of the present invention, the present invention will be described in detail with reference to the drawings and examples.

Example 1

With reference to fig. 1, the approximate keyword storage method for large-scale road network data of the present invention constructs an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index, and the indexes are related to each other by pointers, so that the compactness of the road network data in logic is enhanced, the relevance of the data is improved, and the utilization efficiency of the storage space is further improved.

The invention discloses a method for storing approximate keywords for large-scale road network data, which comprises the following specific steps:

(1) Preprocessing original road network data

The original road network data is preprocessed to obtain a preprocessing result. It should be noted that preprocessing the original road network data yields the structured data required for building the external storage index files. The specific preprocessing process is as follows:

1-1. Intersection points are obtained from the original road network data. Specifically, the basic constituent elements of the original road network data are extracted, and a node.txt file and a way.txt file are constructed from them, where the node.txt file stores node information and the way.txt file stores road information. It should be noted that, since a way is composed of nodes, the information of each way includes the information of all nodes on that way. In this embodiment, xml.etree.cElementTree in Python is used to preprocess the original road network data in XML format, information is extracted from the elements tagged node, nd and way, and the two files node.txt and way.txt are constructed.
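By way of illustration only, the extraction in step 1-1 might be sketched as follows. The sketch uses xml.etree.ElementTree, because the cElementTree alias was removed in Python 3.9. The sample data, the function name parse_osm and the in-memory record layout are assumptions made for this example; the actual layout of node.txt and way.txt is not reproduced here.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny hypothetical OSM-style extract: four nodes and two ways.
SAMPLE_OSM = """<osm>
  <node id="1" lat="41.0" lon="-72.0"/>
  <node id="2" lat="41.1" lon="-72.0"/>
  <node id="3" lat="41.2" lon="-72.0"/>
  <node id="4" lat="41.1" lon="-71.9"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><nd ref="3"/></way>
  <way id="11"><nd ref="2"/><nd ref="4"/></way>
</osm>"""


def parse_osm(stream):
    """Stream an OSM XML file and collect node coordinates, ways and neighbor sets."""
    coords = {}                   # node_id -> (lon, lat)
    ways = {}                     # way_id -> ordered list of node ids on the way
    neighbors = defaultdict(set)  # node_id -> ids of directly connected nodes
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "node":
            coords[elem.get("id")] = (float(elem.get("lon")), float(elem.get("lat")))
            elem.clear()          # release memory while streaming a large file
        elif elem.tag == "way":
            refs = [nd.get("ref") for nd in elem.findall("nd")]
            ways[elem.get("id")] = refs
            # Consecutive nodes on a way are neighbors of each other.
            for a, b in zip(refs, refs[1:]):
                neighbors[a].add(b)
                neighbors[b].add(a)
            elem.clear()
    return coords, ways, neighbors


coords, ways, neighbors = parse_osm(io.BytesIO(SAMPLE_OSM.encode("utf-8")))
```

Streaming with iterparse and clearing each element after use keeps memory bounded, which matters for inputs of the size described in Example 2.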

In addition, each node record in the node.txt file includes the node identifier node_id and Nn, where Nn is the set of neighbor points of node_id. Each way record in the way.txt file has the structure [way_id, way_info], where way_info is the set of ids of the nodes on the way.

1-2. Intersection points are then defined from the node.txt file, and all intersection points that make up long ways are filtered out, which reduces the interference of long ways with data processing. A long way is defined in this embodiment as follows: all intersection points are marked on the map according to their coordinates, and the map is divided evenly into a 40 × 40 grid. For each way, the Euclidean distance $d_o$ and the Manhattan distance $d_m$ between its two endpoints are calculated, with the distance measured in number of grid cells. If $\max(d_o, d_m) \ge 10$, that is, the larger of the two distance measures is at least 10 grid cells, the way is considered a long way. The endpoints of the long ways are obtained from this definition, and the points belonging to the long ways are found from the way.txt file.
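The long-way test of step 1-2 can be sketched as below. This is a minimal sketch under stated assumptions: the bounding box tuple, the helper names to_cell and is_long_way, and the choice to measure both distances between grid cell indices are illustrative and not taken from the embodiment.

```python
import math

GRID = 40             # the map is divided evenly into GRID x GRID cells
LONG_WAY_CELLS = 10   # threshold from the embodiment, in grid cells


def to_cell(point, bbox, grid=GRID):
    """Map a coordinate to its (column, row) cell index inside bbox."""
    x_min, y_min, x_max, y_max = bbox
    x, y = point
    cx = min(int((x - x_min) / (x_max - x_min) * grid), grid - 1)
    cy = min(int((y - y_min) / (y_max - y_min) * grid), grid - 1)
    return cx, cy


def is_long_way(start, end, bbox):
    """Return True when max(d_o, d_m) between the endpoints is at least 10 cells."""
    (ax, ay), (bx, by) = to_cell(start, bbox), to_cell(end, bbox)
    d_o = math.hypot(ax - bx, ay - by)   # Euclidean distance in cells
    d_m = abs(ax - bx) + abs(ay - by)    # Manhattan distance in cells
    return max(d_o, d_m) >= LONG_WAY_CELLS


bbox = (0.0, 0.0, 40.0, 40.0)  # hypothetical map extent: one unit per cell
long_example = is_long_way((0.5, 0.5), (10.5, 0.5), bbox)
short_example = is_long_way((0.5, 0.5), (5.5, 4.5), bbox)
```

Because the Manhattan distance is never smaller than the Euclidean distance between the same two points, under this interpretation the test is decided by $d_m$ alone.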

1-3. The characteristic intersection points are clustered by density to obtain a clustering result. A density clustering algorithm is used; specifically, this embodiment uses a density clustering algorithm based on a partitioned auxiliary grid (GDSCAN), which is an improvement on the original DBSCAN density clustering algorithm. Using the grid cell in which a point lies and the cells directly adjacent to it, the algorithm excludes the points far from the point's neighborhood and computes distances only for the points near the neighborhood, thereby reducing the amount of computation.
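The grid-assisted neighborhood lookup described in step 1-3 can be illustrated with the following sketch. It is not the GDSCAN algorithm itself; it only shows the pruning idea, assuming a square cell whose side equals the neighborhood radius eps, so that every point within eps of a query point lies in the 3 × 3 block of cells around it. The function names are hypothetical.

```python
import math
from collections import defaultdict


def build_grid(points, eps):
    """Bucket point indices by the grid cell (side length eps) they fall into."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / eps), math.floor(y / eps))].append(idx)
    return grid


def region_query(points, grid, idx, eps):
    """Return indices of points within eps of points[idx], scanning only nearby cells."""
    x, y = points[idx]
    cx, cy = math.floor(x / eps), math.floor(y / eps)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # Only the cell containing the point and its eight neighbors are examined.
            for j in grid.get((cx + dx, cy + dy), ()):
                if math.dist(points[j], (x, y)) <= eps:
                    found.append(j)
    return sorted(found)


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (0.9, 0.9)]
grid = build_grid(points, eps=1.0)
near_first = region_query(points, grid, 0, eps=1.0)
```

A density clustering pass would call region_query for each point in place of a scan over the whole point set.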

1-4. Non-characteristic intersection points are filled into the clustering result through road network expansion. Specifically, the clustering result is wrapped with a minimum bounding polygon; the minimum bounding polygon is then expanded in the direction from high-degree intersections to low-degree intersections, and a minimum bounding rectangle is taken around the expanded polygon. It should be noted that four-way intersections tend to gather around five-way and higher-degree intersections, so if clustering starts directly from the four-way intersections, the step of expanding the clustering result to the five-way and higher-degree intersections can be omitted, which greatly saves time.

1-5. The clustering results are merged using a one-pass clustering algorithm to obtain a rectangular full coverage map. Specifically, a box plot is used to find the outlier threshold of the areas of the minimum bounding rectangles, and the regions whose area is smaller than this threshold are merged using the one-pass clustering algorithm to obtain the rectangular full coverage map. It should be noted that the minimum bounding rectangles (MBRs) taken from the expanded bounding polygons differ in area and leave gaps between them, which is why the invention merges the small-area MBRs with a one-pass clustering algorithm.

1-6. The interest points are mapped into the road network according to an interest point mapping algorithm to obtain the preprocessing result. Interest points are information points, that is, points in the road network that correspond to real-world objects and carry certain information, such as bus stations and shops. They can be selected directly during preprocessing, and mapping them into the road network fills in the complete road network information. Specifically, each interest point is mapped into the road network structure by the interest point mapping algorithm, and its offset distance and the identifier of the road segment to which it belongs are added to the road network structure. It should be noted that the offset distance and the segment identifier make it possible to quickly collect the set of interest points closest to an intersection point within a road segment.
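As a hedged illustration of how an offset distance and a segment identifier could be derived for one interest point, the sketch below projects the point onto each candidate segment and keeps the nearest one. The brute-force scan over segments, the planar coordinates and the names project_onto_segment and map_poi are assumptions for this example; the Voronoi-based candidate selection of Example 2 is not reproduced.

```python
import math


def project_onto_segment(p, a, b):
    """Return (distance from p to segment ab, offset of the projection measured from a)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    # Parameter t of the orthogonal projection, clamped to the segment.
    t = 0.0 if seg_len_sq == 0 else ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    qx, qy = ax + t * dx, ay + t * dy
    return math.hypot(px - qx, py - qy), t * math.sqrt(seg_len_sq)


def map_poi(p, segments):
    """Pick the nearest segment for p and return (segment_id, offset distance)."""
    best = min(segments, key=lambda s: project_onto_segment(p, *segments[s])[0])
    return best, project_onto_segment(p, *segments[best])[1]


# Hypothetical segments keyed by road segment identifier.
segments = {"s1": ((0.0, 0.0), (10.0, 0.0)), "s2": ((0.0, 5.0), (10.0, 5.0))}
segment_id, offset = map_poi((3.0, 1.0), segments)
```

Storing the pair (segment_id, offset) with each interest point is what allows interest points on a segment to be ordered by their distance from the segment's starting intersection.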

(2) Building an index

An R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index are constructed according to the preprocessing result, and these indexes are associated with one another through pointers. Specifically, the preprocessing result includes the node.txt file and the way.txt file. It should be noted that the ids of the nodes in the node.txt file are remapped sequentially starting from 0, with the sequence number incremented by one each time, and the ids in the corresponding adjacent point sets are looked up among the mapped ids by binary search, which makes the nodes in the files convenient to use later.
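The id remapping with binary search might look like the following sketch. It assumes that the new id of a node is its rank among the sorted original ids and that every neighbor id also appears as a node id; the record layout and the name remap_ids are hypothetical.

```python
from bisect import bisect_left


def remap_ids(records):
    """Remap node ids to 0..n-1 and translate neighbor ids by binary search.

    records: list of (node_id, neighbor_ids) pairs with arbitrary integer ids.
    """
    sorted_ids = sorted(node_id for node_id, _ in records)

    def new_id(old):
        # Binary search for the old id; its position is the new sequential id.
        pos = bisect_left(sorted_ids, old)
        if pos == len(sorted_ids) or sorted_ids[pos] != old:
            raise KeyError(old)
        return pos

    return [(new_id(n), sorted(new_id(m) for m in nbrs)) for n, nbrs in records]


records = [(1005, [2040, 77]), (77, [1005]), (2040, [1005])]
remapped = remap_ids(records)
```

With dense ids starting at 0, a node's record can later be located by position rather than by searching for its original identifier.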

An R-Tree index is then constructed according to the rectangular full coverage map and node.txt, where the region formed by all points in a lower-layer node is extracted as the basic building block of the upper-layer node. It should be noted that, when the region point set R-Tree is constructed, overlapping regions are handled by storing redundant data. In this embodiment, the leaf nodes of the R-Tree index store 12-byte entries (point identifier and point coordinates) together with a point count, so a single 4 KB sector holds at most 341 entries. Each non-leaf node stores leaf node rectangular regions in the form $((x_{min}, y_{min}), (x_{max}, y_{max}))$, 20 bytes in total per entry, plus 16 bytes for {tree depth, previous and next sibling pointers}, which determines the maximum number of entries it can hold; in this embodiment $16 + 204 \times 20 = 4096$.
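The node capacities quoted above follow directly from the 4 KB node size, as the short sketch below checks. The concrete field types (a 32-bit unsigned id, two 32-bit floats, a 32-bit count placed at the end of the page) and the name pack_leaf are assumptions chosen only so that the sizes add up to the figures in this embodiment.

```python
import struct

PAGE_SIZE = 4096                    # one index node occupies a 4 KB sector
LEAF_ENTRY = struct.Struct("<Iff")  # assumed layout: id, x, y -> 12 bytes
LEAF_COUNT = struct.Struct("<I")    # assumed 4-byte entry count per leaf
INNER_HEADER_SIZE = 16              # tree depth and sibling pointers
INNER_ENTRY_SIZE = 20               # one rectangular region entry

leaf_capacity = (PAGE_SIZE - LEAF_COUNT.size) // LEAF_ENTRY.size
inner_capacity = (PAGE_SIZE - INNER_HEADER_SIZE) // INNER_ENTRY_SIZE


def pack_leaf(points):
    """Serialize up to leaf_capacity (id, x, y) entries into one 4 KB page."""
    if len(points) > leaf_capacity:
        raise ValueError("too many entries for one leaf page")
    body = b"".join(LEAF_ENTRY.pack(i, x, y) for i, x, y in points)
    padding = b"\x00" * (PAGE_SIZE - LEAF_COUNT.size - len(body))
    return body + padding + LEAF_COUNT.pack(len(points))


page = pack_leaf([(1, 0.5, 1.5), (2, 2.5, 3.5)])
```

Here 341 × 12 + 4 = 4096 for a leaf node and 16 + 204 × 20 = 4096 for a non-leaf node, so neither node type wastes space in a full page.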

The point index is then constructed by traversing the leaf nodes of the R-Tree built from the node.txt file. Since the R-Tree leaf nodes are themselves built by traversing node.txt, the point index is established by directly traversing the leaf nodes of the R-Tree; that is, the entries of the list in a point index leaf node correspond one-to-one to the entries of the list in an R-Tree leaf node. Furthermore, since the traversed list structure contains the adjacent point sets, the point index can be associated through this relationship with the adjacent point B-Tree index built subsequently.

The adjacent point B-Tree index is then constructed according to the node.txt file and the way.txt file. Specifically, road segment identifiers formed from the basic constituent elements in the node file and the way file are used as the keywords in the nodes: the adjacent point set of each point structure produced by preprocessing is traversed, and each point in the adjacent point set is combined with the head point of the list to form a road segment identifier, which serves as a keyword in an adjacent point B-Tree node. The traversal follows the same order as before. The memory allocated to one node is filled completely before the next node is filled, and each upper-layer node is constructed by extracting the first keyword of each lower-layer node, so that the adjacent point B-Tree index is built from the bottom layer up to the first layer. Finally, the interest point B-Tree index is constructed according to the interest point set and the way.txt file. Specifically, because the points lie on roads, and for convenient management alongside the adjacent point B-Tree, road segment identifiers are used as the keywords in the nodes; an inverted list of the interest point keywords is constructed, each upper-layer node is constructed by extracting the first interest point keyword of each lower-layer node, and the interest point B-Tree index is finally obtained. It should be noted that the constructed R-Tree index, point index, adjacent point B-Tree index and interest point B-Tree index are associated with one another through pointers, which improves data compactness, facilitates subsequent queries, and can greatly improve the query efficiency of road network data.
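The bottom-up construction, in which each node is filled before the next one is started and each upper layer takes the first keyword of every lower-layer node, can be sketched as follows. Integer keys stand in for road segment identifiers, and the fanout value and the name bulk_load are assumptions made for readability.

```python
def bulk_load(sorted_keys, fanout):
    """Build tree levels bottom-up from sorted keys; returns levels with the root first."""
    # Bottom layer: fill each node completely before starting the next one.
    levels = [[sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]]
    while len(levels[-1]) > 1:
        # The upper layer is built from the first keyword of every lower-layer node.
        firsts = [node[0] for node in levels[-1]]
        levels.append([firsts[i:i + fanout] for i in range(0, len(firsts), fanout)])
    return levels[::-1]


levels = bulk_load(list(range(1, 11)), fanout=3)
```

In this run the bottom layer holds four nodes, the middle layer holds [1, 4, 7] and [10], and the first layer is the single node [1, 10]. Filling nodes completely is what keeps the fixed-size pages fully used.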

(3) Index storage

The constructed R-Tree index, point index, adjacent point B-Tree index and interest point B-Tree index are stored in the memory storage space; in this embodiment, they are stored on disk as external storage index files. It should be noted that the node size of every index tree is set to 4 KB at storage time, so that node space can be fully used during construction, which improves the utilization of the road network data storage space.

The approximate keyword query method for the large-scale road network data, disclosed by the invention, is used for querying indexes stored by the approximate keyword storage method for the large-scale road network data, and a reasonable and efficient data index structure is utilized to support the approximate keyword query, so that the inverted index based on q-grams is realized, and efficient filtering is provided for the approximate keyword query.

Referring to FIG. 2, in the approximate keyword query method for large-scale road network data, the four indexes are queried in combination, duplicate data is filtered out during the query, and the q-gram inverted list is consulted. If a round of querying does not yield K results, the next round follows immediately, and the next round of spatial comparison is carried out at the same time as string comparison. It should be noted that, when a query does not meet the requirement, the original query range is enlarged and the query is repeated.

Referring to FIG. 2, the approximate keyword query method for large-scale road network data of the invention comprises the following steps:

First, a query request is acquired; the corresponding indexes are searched according to the query request, and a path list of the approximate keyword set is obtained from the indexes; the source data file corresponding to the query request is then read according to the path list. The indexes comprise the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index. Specifically, the part of the region point set R-Tree index that intersects the query request is traversed down to the leaf nodes; from the traversal result, after duplicate nodes that may arise from redundant storage are filtered out, the adjacent point B-Tree index is searched to locate the road segment identifiers relevant to the request; after duplicate road segment identifiers are filtered out, the exact road segments are located in the interest point B-Tree index according to the road segment identifiers, and the interest points on those segments are extracted; finally, approximate keyword comparison is performed on the keywords of the interest points.
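The final approximate keyword comparison, supported by a q-gram inverted list, might be sketched as below. This is a simplified filter-and-verify illustration under stated assumptions: q = 2, padding characters '#' and '$', candidates restricted to keywords that share at least one q-gram with the query, a length filter, and Levenshtein edit distance as the similarity measure. None of these specifics are given by the embodiment, and keywords sharing no q-gram with the query are not considered by this sketch.

```python
from collections import defaultdict


def qgrams(word, q=2):
    """Split a padded word into overlapping substrings of length q."""
    padded = "#" * (q - 1) + word + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_inverted(keywords, q=2):
    """Map each q-gram to the set of keywords containing it."""
    index = defaultdict(set)
    for kw in keywords:
        for g in qgrams(kw, q):
            index[g].add(kw)
    return index


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def approx_search(query, index, tau, q=2):
    """Return keywords within edit distance tau among q-gram candidates."""
    candidates = set()
    for g in qgrams(query, q):
        candidates |= index.get(g, set())
    return sorted(kw for kw in candidates
                  if abs(len(kw) - len(query)) <= tau   # cheap length filter
                  and edit_distance(query, kw) <= tau)  # exact verification


index = build_inverted(["station", "statoin", "school", "shop"])
matches = approx_search("station", index, tau=2)
```

The inverted list keeps the expensive edit-distance computation away from keywords that have nothing in common with the query string.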

In the step of performing approximate keyword queries with the constructed external storage index files, the invention provides a KNN query method in which the spatial query and the approximate keyword query run in parallel. This replaces the traditional approach of running the two query modes serially and improves query efficiency; multi-neighbor query tasks can be completed at the level of seconds, achieving near-real-time queries and further enabling the processing of big data.

Example 2

The content of this embodiment is basically the same as that of embodiment 1, and the specific process of the approximate keyword storage method for large-scale road network data in this embodiment is as follows:

(1) Preprocessing original road network file

The original road network data is taken from the official OSM website and covers the northeastern region of the United States; the downloaded data is about 16 GB and is structured as an XML file. To process the original road network data quickly, this embodiment uses xml.etree.cElementTree from Python's XML parsing library together with the jit decorator from the numba library, and improves data processing efficiency by decorating the functions. This processing method is fast, stable and well suited to large-scale road network data, its time consumption grows slowly as the data size increases, and it can quickly process the original 16 GB of data to obtain about 96 million points.

(2) Crossing point

Intersection points are defined from the node.txt file: a point is considered an n-way intersection when its adjacent point set contains n point ids; for example, if the adjacent point set of a point contains 3 point ids, the point is considered a 3-way intersection. Defining intersection points establishes points as the basic objects of the subsequent processing steps, identifies the characteristic points, and determines that clustering is used for subsequent processing, which improves the efficiency and accuracy of processing the original data.
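A minimal sketch of this definition is given below. The lower bound of 3 for treating a point as an intersection is an assumption drawn from the example in the text, and the function name and sample data are hypothetical.

```python
def classify_intersections(neighbor_sets, min_ways=3):
    """Return {node_id: n} for nodes whose adjacent point set contains n >= min_ways ids."""
    degrees = {}
    for node_id, nbrs in neighbor_sets.items():
        ways = len(set(nbrs))   # number of distinct adjacent point ids
        if ways >= min_ways:
            degrees[node_id] = ways
    return degrees


neighbor_sets = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a", "c", "e", "f"],
}
intersections = classify_intersections(neighbor_sets)
```

Here node a is classified as a 3-way intersection and node d as a 4-way intersection, while b and c are ordinary points on a way.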

(3) Point clustering

For processing the point set, the invention proposes an improved density clustering algorithm. By reasonably partitioning the region of the original point set and then filtering out the points far from a point's neighborhood, the algorithm selects only the points near the neighborhood for calculation, which greatly improves efficiency and still gives good results on point sets from large-scale road network data. The improvement lies in the grid division: the auxiliary grid of the GDSCAN algorithm is divided evenly over the original scatter plot, but with the shortest side of the map divided evenly by the cluster neighborhood radius $\epsilon$. That is, with the length and width of the original scatter plot set to $ERec_{length}$ and $ERec_{width}$ respectively, the following should be satisfied:

$$ERec_{width} \ge \mathrm{MIN}(X_{max} - X_{min},\ Y_{max} - Y_{min}) / \epsilon$$

$$ERec_{length} = \mathrm{MAX}(X_{max} - X_{min},\ Y_{max} - Y_{min}) / ERec_{width}$$

where $X_{max}$, $X_{min}$, $Y_{max}$ and $Y_{min}$ are taken from the lower-left vertex coordinates $(X_{min}, Y_{min})$ and the upper-right vertex coordinates $(X_{max}, Y_{max})$, respectively.

(4) Optimizing primary clustering results

After the primary clustering, information about the other intersection points is missing, so it must be supplemented by a road network expansion algorithm. This embodiment uses the adjacent point sets of the points to fill the low-degree intersection points into the convex hull of the clustering result in turn, and takes the outermost points for overall expansion, which improves efficiency. Since the expanded bounding rectangle clusters still differ in size and leave large gaps between them, this embodiment uses a box plot to find the cut-off value for abnormal areas. With the upper quartile $Q_3$ and the lower quartile $Q_1$ of the box plot, the abnormal area value is given by $S = Q_3 - 1.5 \times IQR$, where $IQR = Q_3 - Q_1$. The bounding rectangles whose area is smaller than $S$ are then merged using a one-pass clustering algorithm to resolve these problems. The algorithm requires a cluster radius threshold r, which is determined by the following steps:

1) Selecting all point objects in a dataset

2) Calculating the distance between any two points and saving the distance as D

3) Calculating the mean EX and standard deviation DX of D

4) The threshold r is taken between EX-0.5DX and EX+0.5DX
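The area cut-off S and the range for the radius threshold r can be computed as in the sketch below. The quartile method ("inclusive") and the use of the population standard deviation for DX are assumptions, since the embodiment does not say which variants are used; the function names are hypothetical.

```python
import math
import statistics
from itertools import combinations


def area_threshold(areas):
    """Cut-off S = Q3 - 1.5 * IQR over the bounding rectangle areas."""
    q1, _median, q3 = statistics.quantiles(areas, n=4, method="inclusive")
    iqr = q3 - q1
    return q3 - 1.5 * iqr


def radius_range(points):
    """Return (EX - 0.5*DX, EX + 0.5*DX) over all pairwise distances D."""
    d = [math.dist(p, q) for p, q in combinations(points, 2)]
    ex = statistics.mean(d)     # mean of D
    dx = statistics.pstdev(d)   # standard deviation of D (population form assumed)
    return ex - 0.5 * dx, ex + 0.5 * dx


s = area_threshold([1.0, 2.0, 3.0, 4.0, 5.0])
r_low, r_high = radius_range([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

Any value of r between r_low and r_high satisfies step 4); computing all pairwise distances is quadratic in the number of points, so on large inputs a sample of the points may be preferable.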

(5) Mapping points of interest onto a road network using a point of interest mapping algorithm

The interest point mapping algorithm comprises the following steps:

1) Construction of Voronoi diagram for points on road network

2) Initializing a set of edges to null

3) And taking the construction point of the Voronoi polygon where the point p to be mapped is located, and inserting edges which start and end with the construction point into the edge set.

4) Traversing the edge set, calculating the shortest distance from the point p to each edge, and taking the edge where the minimum value of the result is located.

5) And calculating the offset distance of the point p on the road network, and reconstructing the point p.

(6) Establishing index information

The method has a strict real-time requirement for queries. Even without the points of interest, the intersection points alone number more than 96 million, a huge and widely scattered set. For this situation the method provides several indexes, stored in external memory, so that the different kinds of point information are handled in a divide-and-conquer manner, while the indexes are logically associated with one another through pointers. To improve I/O efficiency, each index node is fixed at 4 KB. The index structure is described in detail below with reference to FIG. 3:

Region point set R-Tree index (Region R-Tree in FIG. 3): each point is stored as a record of the form {id, x, y}, 12 bytes in total, so a 4 KB sector holds 341 records. The remaining 4 bytes record the number of records stored in the index node, because the last node is not necessarily filled with 341 records. Every 341 points form a group. The minimum bounding rectangle of the group, computed from the point coordinates, defines a region, and its coordinates are extracted and stored in the node one layer up. A minimum bounding rectangle is recorded as (X_min, Y_min), (X_max, Y_max), 20 bytes in total. A non-leaf node also holds 16 bytes for the node depth, the previous and next sibling pointers, and the number of leaf nodes, so it can store at most 204 region blocks. Region blocks are combined layer by layer from the bottom up until the region point set R-Tree is formed. Because the node structure directly affects the efficiency of index traversal and of transfers between internal and external memory, the index is laid out to make full use of node space.
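The leaf layout and the capacity figures above can be checked with a short sketch. The byte counts come from the text; the field types (a 32-bit unsigned id, 32-bit floats for x and y, little-endian order) and all names are assumptions made for illustration only.

```python
import struct

NODE_SIZE = 4096                    # 4 KB index node, as stated in the text
RECORD = struct.Struct("<Iff")      # {id, x, y}: 12 bytes (field types assumed)
COUNT = struct.Struct("<I")         # record count kept in the remaining 4 bytes
LEAF_CAPACITY = (NODE_SIZE - COUNT.size) // RECORD.size   # 341 records
INTERNAL_CAPACITY = (NODE_SIZE - 16) // 20                # 204 region blocks


def pack_leaf(points):
    """Pack up to LEAF_CAPACITY (id, x, y) records into one 4 KB node."""
    if len(points) > LEAF_CAPACITY:
        raise ValueError("too many points for one leaf node")
    buf = bytearray(NODE_SIZE)
    for i, (pid, x, y) in enumerate(points):
        RECORD.pack_into(buf, i * RECORD.size, pid, x, y)
    COUNT.pack_into(buf, NODE_SIZE - COUNT.size, len(points))
    return bytes(buf)


def unpack_leaf(buf):
    """Read back the records stored in a packed leaf node."""
    (n,) = COUNT.unpack_from(buf, NODE_SIZE - COUNT.size)
    return [RECORD.unpack_from(buf, i * RECORD.size) for i in range(n)]


def mbr(points):
    """Minimum bounding rectangle ((x_min, y_min), (x_max, y_max))."""
    xs = [x for _, x, _ in points]
    ys = [y for _, _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))
```

The rectangle returned by `mbr` for each group of 341 points is what would be stored in the layer above. The sketch does not model the 20-byte entry format of non-leaf nodes, since the text gives only its total size.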

Point index (Point-index in FIG. 3): this index is obtained by traversing the node.txt file after the ids have been mapped sequentially. It sacrifices a small amount of space to make joining the indexes more efficient, which directly improves index traversal efficiency.

Adjacent point B-Tree index (Adjacency List B-Tree in FIG. 3): the node size of the adjacent point B-Tree is set to 4 KB, the same as a sector. Each adjacent point and its list header are extracted from the adjacent point list to form a road segment identifier of the form <id₁, id₂>, following the rule that the first id is larger than the second so that no road segment is stored twice. Each leaf node also carries 16 bytes for the number of road segment identifiers, the previous and next sibling pointers, and the parent node pointer. A road segment identifier together with its additional information occupies 40 bytes, so a leaf node can hold 102 road segment identifiers. A non-leaf node extracts the first road segment identifier of each leaf node as a key and stores it with a pointer to that leaf node, 12 bytes in total. With a further 16 bytes for the sibling pointers, the node depth, and the number of keys, a non-leaf node can store 340 keys, that is, point to 340 leaf nodes. The extraction is repeated, building upward from the bottom layer until the top layer of the tree is reached. As with the R-Tree, the node structure directly affects the efficiency of index traversal and of transfers between internal and external memory, so the index is laid out to make full use of node space.
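The capacity arithmetic and the bottom-up construction described above can be sketched as follows. The byte sizes are taken from the text; the function names and the in-memory representation of nodes as plain Python lists are assumptions for illustration, and pointers and additional information are omitted.

```python
NODE_SIZE = 4096       # 4 KB node
HEADER = 16            # per-node bookkeeping bytes stated in the text
LEAF_ENTRY = 40        # road segment identifier plus additional information
INTERNAL_ENTRY = 12    # key plus pointer to a child node

LEAF_CAPACITY = (NODE_SIZE - HEADER) // LEAF_ENTRY          # 102
INTERNAL_CAPACITY = (NODE_SIZE - HEADER) // INTERNAL_ENTRY  # 340


def segment_ids(adjacency):
    """Build sorted, de-duplicated road segment identifiers.

    Each identifier is (larger id, smaller id), so the segment between
    two points is stored once no matter which endpoint lists it.
    """
    return sorted({(max(u, v), min(u, v))
                   for u, neighbors in adjacency.items()
                   for v in neighbors if u != v})


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_levels(keys):
    """Return the tree levels, bottom layer first.

    Each upper-layer node holds the first key of each lower-layer node.
    """
    levels = [chunk(keys, LEAF_CAPACITY)]
    while len(levels[-1]) > 1:
        first_keys = [node[0] for node in levels[-1]]
        levels.append(chunk(first_keys, INTERNAL_CAPACITY))
    return levels
```

For example, `build_levels(segment_ids(adjacency))` yields the leaf layer followed by each layer of extracted first keys, ending with a single root node.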

Point of interest B-Tree index (POI List B-Tree in FIG. 3): the point-of-interest index exploits the fact that points of interest lie on road segments and uses the road segment identifier as the key, so a search only needs to locate the road segment. As in the indexes above, the node size is set to 4 KB and the node space is fully used, which directly improves the efficiency of index traversal and of transfers between internal and external memory.

The approximate keyword query method for large-scale road network data of the invention proceeds as follows.

In the step of performing an approximate keyword query on the constructed external-memory index files, an efficient road network KNN query algorithm is provided. Its specific steps are:

1) For a given query Q, set a query region R_Q with a side length of 50.

2) Traverse the region point set R-Tree. If R_Q intersects the region covered by a node of the R-Tree, traverse the children of that node until the leaf nodes are reached. If there is no intersection, continue traversing the region point set R-Tree in order.

3) Remove duplicate points (result points can be duplicated because the index stores some points redundantly).

4) Use the point index to connect to the adjacent point B-Tree index.

5) Query the adjacent point B-Tree for the road segments involved.

6) Remove duplicate road segments (a road segment consists of two points, so querying each of its endpoints returns it twice).

7) Query the point of interest B-Tree index by road segment identifier to obtain the points of interest on each road segment.

8) Run the approximate keyword query against the keywords of the points of interest. K results are required. If fewer than K results are found, enlarge the side length of the query region by 50 and return to step 2).
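The query loop above can be sketched with in-memory dictionaries standing in for the four external-memory indexes. Everything here is a hedged illustration: the names are hypothetical, the query region is assumed to be a square centered on the query location, approximate matching is assumed to mean edit distance within `max_edits` (the text does not name a similarity measure), and `max_side` is an added stop condition so that the loop ends when too few matches exist.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def knn_keyword_query(q, keyword, k, nodes, adjacency, pois,
                      step=50, max_side=1000, max_edits=1):
    """Expand a square region around q until k matching POIs are found.

    nodes:     {node id: (x, y)}                      (stands in for the R-Tree)
    adjacency: {node id: [neighbor ids]}              (adjacent point index)
    pois:      {(larger id, smaller id): [keywords]}  (point of interest index)
    """
    side = step
    results = []
    while side <= max_side:
        half = side / 2
        # Steps 2-3: points inside the region; a set removes duplicates.
        hits = {i for i, (x, y) in nodes.items()
                if abs(x - q[0]) <= half and abs(y - q[1]) <= half}
        # Steps 4-6: road segments touching those points, de-duplicated
        # by always putting the larger id first.
        segments = {(max(u, v), min(u, v))
                    for u in hits for v in adjacency.get(u, [])}
        # Steps 7-8: approximate keyword match on the POIs of each segment.
        results = sorted({(seg, name)
                          for seg in segments
                          for name in pois.get(seg, [])
                          if edit_distance(name, keyword) <= max_edits})
        if len(results) >= k:
            return results[:k]
        side += step  # fewer than k results: enlarge the region and retry
    return results
```

The sketch returns matches in sorted order for determinism. Ranking the results by road network distance, as a KNN query normally would, is left out because the steps above do not describe how the ranking is done.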

The invention has been described in detail above with reference to specific exemplary embodiments. It will be understood that various modifications and changes may be made without departing from the scope of the invention as defined by the appended claims. The detailed description and drawings are to be regarded as illustrative rather than restrictive, and any such modifications and variations are intended to fall within the scope of the invention described herein. Furthermore, the background art is intended to illustrate the state and significance of the development of the technology and is not intended to limit the invention or its application and field of application.

Claims

1. The approximate keyword storage method for the large-scale road network data is characterized by comprising the following steps of:

preprocessing original road network data to obtain a preprocessing result;

constructing an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index according to the preprocessing result, wherein the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index are mutually related through pointers;

storing the constructed R-Tree index, point index, adjacent point B-Tree index and interest point B-Tree index into a memory storage space; the specific process of constructing the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index according to the preprocessing result comprises the following steps:

the preprocessing result comprises a node.txt file and a way.txt file, and an R-Tree index is constructed according to the rectangular full coverage map and the node.txt file; constructing a point index by traversing the leaf nodes of the R-Tree according to the node.txt file; constructing an adjacent point B-Tree index according to the node.txt file and the way.txt file; and constructing an interest point B-Tree index according to the interest point set and the way.

2. The approximate keyword storage method for large-scale road network data according to claim 1, wherein the specific process of preprocessing the original road network data is as follows:

obtaining intersection points according to original road network data, clustering intersection points with density characteristic representation to obtain a clustering result, filling non-characteristic intersection points into the clustering result through road network expansion, merging the clustering result by using a one-pass clustering algorithm to obtain a rectangular full coverage map, and mapping the interest points into the road network according to an interest point mapping algorithm to obtain an interest point set.

3. The approximate keyword storage method for large-scale road network data according to claim 2, wherein the specific process of obtaining the intersection point according to the original road network data is as follows:

extracting basic composition elements in original road network data, and constructing a node.txt file and a way.txt file according to the basic composition elements, wherein the node.txt file stores node information, and the way.txt file stores road information; and then the intersection point is obtained through the definition of the node.

4. The method for storing approximate keywords of large-scale road network data according to claim 2, wherein the specific process of filling non-characteristic intersection points into the clustering result through road network expansion is as follows:

firstly, wrapping the clustering result by using a minimum bounding polygon;

expanding the minimum bounding polygon, wherein the expansion proceeds in the direction from high-intersection points to low-intersection points, and wrapping the expanded minimum bounding polygon with its minimum bounding rectangle.

5. The method for storing approximate keywords for large-scale road network data according to claim 4, wherein the specific process of merging the clustering results by using a one-pass clustering algorithm to obtain the rectangular full coverage map comprises the following steps:

finding the outlier threshold of the areas of the minimum bounding rectangles by using a box plot, and merging the regions whose areas are smaller than the outlier threshold by using a one-pass clustering algorithm to obtain the rectangular full coverage map.

6. The approximate keyword storage method for large-scale road network data according to claim 1, wherein the specific process of constructing the adjacent point B-Tree index according to the node. Road segment identifiers formed from the basic constituent elements in the node.txt file and the way.txt file are selected as the keys in the nodes, and the adjacent point B-Tree index is constructed from the bottom layer up to the first layer according to these keys, wherein each upper-layer node is constructed by extracting the first key of each lower-layer node.

7. The approximate keyword storage method for large-scale road network data according to claim 1, wherein the specific process of constructing the interest point B-Tree index according to the interest point set and the way.

constructing an inverted list of the interest point keywords, and constructing the interest point B-Tree index from the bottom layer up to the first layer according to the interest point keywords, wherein each upper-layer node is constructed by extracting the first interest point keyword of each lower-layer node.

8. The approximate keyword query method for large-scale road network data is characterized in that the query is performed using the index stored by the approximate keyword storage method for large-scale road network data according to any one of claims 1-7.

9. The approximate keyword query method for large-scale road network data according to claim 8, comprising the following steps: first, acquiring a query request; searching the corresponding index according to the query request, and obtaining an approximate keyword set path list according to the index; and then reading the data file corresponding to the query request according to the path list.