CN111813778B - Approximate keyword storage and query method for large-scale road network data - Google Patents
Approximate keyword storage and query method for large-scale road network data
- Publication number
- CN111813778B, CN202010650465.7A
- Authority
- CN
- China
- Prior art keywords
- road network
- point
- index
- network data
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Remote Sensing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an approximate keyword storage and query method for large-scale road network data, and belongs to the technical field of road network data processing. The storage method comprises the following steps: preprocessing original road network data to obtain a preprocessing result; constructing an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index according to the preprocessing result, wherein the indexes are associated with one another through pointers; and storing the constructed indexes in a memory storage space. The query method comprises querying according to the indexes stored by the storage method. The invention aims to overcome the low utilization of road network data storage space and the low query efficiency of road network data in the prior art, and provides an approximate keyword storage and query method for large-scale road network data that improves the utilization of the storage space and greatly improves the query efficiency.
Description
Technical Field
The invention relates to the technical field of road network data processing, and in particular to an approximate keyword storage and query method for large-scale road network data.
Background
In recent years, with the rapid development of science and technology and the spread of social informatization, manual information collection has gradually been replaced by electronic equipment. At the same time, the diversification of data acquisition equipment, the rapid development of 5G networks and increasingly intensive research in spatial geographic information have produced geospatial data on a huge scale. The data volume is growing exponentially, and the high computational complexity of spatial data poses great challenges to processing steps such as data storage and query.
In a traditional database, data is organized and stored as two-dimensional tables, so the storage and representation modes are limited. Spatial data is multidimensional, large in volume and highly complex, and a traditional database cannot effectively store, process and display it. Spatial databases were developed to remedy these deficiencies. Data query is a basic function of both traditional and spatial databases. When the required information must be retrieved from large-scale data stored in a database system, query efficiency is usually improved by combining an index with an algorithm. As for index structures, traditional databases generally use structures such as the B-Tree and B+-Tree, while spatial databases generally use the R-Tree and its variant, the R*-Tree; the structure and performance of the index directly affect the performance of the database. Querying data in a spatial database is complex, because many attributes of an object must be considered, such as its functional description and geographical position. These attributes are described as text, so special cases such as missing words, misplaced words and ambiguity must also be handled during a query. This is the problem of spatial approximate keyword query, a concept that already exists and has been studied in depth.
Faced with large-scale data in spatial databases, traditional index structures and query methods can hardly meet users' growing efficiency demands, and overall query speed continues to fall as spatial data keeps growing. Researchers have proposed many solutions to this problem, of which reducing the data source is the most fundamental and simplest. However, growth of the data volume in a database is unavoidable, so directly reducing the total data volume is not feasible; the size of the data to be queried can only be reduced indirectly by dividing the query region. With the rapid development of parallel technology, parallel spatial index mechanisms have also improved efficiency. In addition, query efficiency can be effectively improved by using an improved index structure such as the MHR-Tree on top of the original index, together with a new query algorithm such as a balanced partition adaptive algorithm. Road network data, as a special case of spatial data, is attracting increasing attention. As with spatial data query, road network data query also suffers from gradually decreasing efficiency.
In summary, how to improve the utilization efficiency of the space for storing the road network data and the query efficiency of the road network data is a problem to be solved in the prior art.
Disclosure of Invention
1. Problems to be solved
The invention aims to overcome the defects of lower utilization rate of a road network data storage space and lower query efficiency of road network data in the prior art, and provides a method for storing and querying approximate keywords for large-scale road network data, which can improve the utilization rate of the road network data storage space and greatly improve the query efficiency of the road network data.
2. Technical proposal
In order to solve the problems, the technical scheme adopted by the invention is as follows:
The approximate keyword storage method for large-scale road network data of the invention comprises the following steps: preprocessing original road network data to obtain a preprocessing result; then constructing an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index according to the preprocessing result, wherein the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index are associated with one another through pointers; and then storing the constructed R-Tree index, point index, adjacent point B-Tree index and interest point B-Tree index in a memory storage space.
Further, the specific process of preprocessing the original road network data is as follows: obtaining intersection points from the original road network data; clustering the intersection points that have a density characteristic representation to obtain a clustering result; filling the non-characteristic intersection points into the clustering result through road network expansion; merging the clustering result with a one-pass clustering algorithm to obtain a rectangular full coverage map; and mapping the interest points into the road network according to an interest point mapping algorithm to obtain an interest point set.
Further, the specific process of obtaining the intersection points from the original road network data is as follows: extracting the basic constituent elements of the original road network data, and constructing a node.txt file and a way.txt file from them, wherein the node.txt file stores node information and the way.txt file stores road information; the intersection points are then obtained through the definition applied to the node.txt file.
Further, the specific process of constructing the R-Tree index, the point index, the adjacent point B*-Tree index and the interest point B*-Tree index according to the preprocessing result is as follows: the preprocessing result comprises the node.txt file and the way.txt file; the R-Tree index is constructed according to the rectangular full coverage map and the node.txt file; the point index is constructed by traversing the leaf nodes of the R-Tree according to the node.txt file; the adjacent point B*-Tree index is constructed according to the node.txt file and the way.txt file; and the interest point B*-Tree index is constructed according to the interest point set and the way.txt file.
Furthermore, the specific process of filling the non-characteristic intersection points into the clustering result through road network expansion is as follows: first, the clustering result is wrapped in a minimum bounding polygon; the minimum bounding polygon is then expanded in the direction from higher-order intersections to lower-order intersections, and the expanded polygon is wrapped in a minimum bounding rectangle.
Further, the specific process of merging the clustering result with a one-pass clustering algorithm to obtain the rectangular full coverage map is as follows: a box plot is used to find the outlier values among the areas of the minimum bounding rectangles, and the rectangles whose areas are smaller than the outlier values are merged with the one-pass clustering algorithm to obtain the rectangular full coverage map.
Further, the specific process of constructing the adjacent point B*-Tree index according to the node.txt file and the way.txt file is as follows: road section identifiers formed from basic constituent elements selected from the node.txt file and the way.txt file are used as the keywords in the nodes, and the adjacent point B*-Tree index is constructed from the bottom layer up to the first layer according to these keywords, wherein each upper-layer node is constructed by extracting the first keyword of each lower-layer node.
Further, the specific process of constructing the interest point B*-Tree index according to the interest point set and the way.txt file is as follows: an inverted list of the interest point keywords is constructed, and the interest point B*-Tree index is constructed from the bottom layer up to the first layer according to the interest point keywords, wherein each upper-layer node is constructed by extracting the first interest point keyword of each lower-layer node.
The approximate keyword query method for large-scale road network data of the invention performs queries using the indexes stored by the above approximate keyword storage method for large-scale road network data.
Further, the query method comprises: first acquiring a query request; searching the corresponding index according to the query request, and obtaining a path list of the approximate keyword set according to the index; and then reading the data file corresponding to the query request according to the path list. The index comprises the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index.
3. Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
(1) The approximate keyword storage method for large-scale road network data constructs an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index, and associates the indexes with one another through pointers. This makes the road network data logically more compact, improves the relevance of the data, and thereby improves the utilization of the storage space.
(2) In the approximate keyword query method for large-scale road network data, the step of performing approximate keyword queries on the constructed external memory index files uses a KNN query method that runs the spatial query and the approximate keyword query in parallel. This replaces the traditional approach of running the two query modes serially and improves query efficiency. Multi-neighbor query tasks can be completed within seconds, which achieves near-real-time queries and makes the processing of big data feasible.
Drawings
FIG. 1 is a schematic flow diagram of a method for storing approximate keywords for large-scale road network data according to the present invention;
FIG. 2 is a schematic flow diagram of a method for querying approximate keywords for large-scale road network data according to the invention;
FIG. 3 is a schematic diagram of index association in Example 2.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention; moreover, the embodiments are not independent, and can be combined with each other as required, so that a better effect is achieved. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For a further understanding of the present invention, the present invention will be described in detail with reference to the drawings and examples.
Example 1
With reference to fig. 1, the approximate keyword storage method for large-scale road network data of the present invention constructs an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index, and the indexes are related to each other by pointers, so that the compactness of the road network data in logic is enhanced, the relevance of the data is improved, and the utilization efficiency of the storage space is further improved.
The invention discloses a method for storing approximate keywords for large-scale road network data, which comprises the following specific steps:
(1) Preprocessing original road network data
The original road network data is preprocessed to obtain a preprocessing result. Note that the structured data required for building the external memory index files is obtained by this preprocessing. The specific preprocessing procedure is as follows:
1-1, obtaining intersection points from the original road network data. Specifically, the basic constituent elements of the original road network data are extracted, and a node.txt file and a way.txt file are constructed from them, wherein the node.txt file stores node information and the way.txt file stores road information. Note that, because a way is made up of nodes, the information for each way includes the information of all nodes on that way. In this embodiment, xml.etree.cElementTree in Python is used to preprocess the original road network data, which has an XML structure: information is extracted from the elements tagged node, nd and way, and the two files node.txt and way.txt are constructed.
In addition, each node in the node.txt file is stored with the following structure:
where Nn is the set of neighbor points of node_id. Each way in the way.txt file has the structure [way_id, way_info], where way_info is the set of ids of the nodes on the way.
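The extraction in step 1-1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's code: it uses `xml.etree.ElementTree.iterparse` from the standard library (the `cElementTree` module named in the text was removed in Python 3.9), it keeps the results in dictionaries instead of writing node.txt and way.txt, and the function name `parse_osm` and the in-memory layout (including storing coordinates with each node) are hypothetical. The attribute names `id`, `lat`, `lon` and `ref` follow the OSM XML format.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def parse_osm(source):
    """Stream an OSM XML document and return (nodes, ways, neighbors)."""
    nodes = {}                    # node_id -> (lat, lon); layout is an assumption
    ways = {}                     # way_id  -> ordered node ids (way_info)
    neighbors = defaultdict(set)  # node_id -> Nn, the set of neighbor node ids
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            nodes[int(elem.get("id"))] = (float(elem.get("lat")),
                                          float(elem.get("lon")))
            elem.clear()          # free memory for already-processed elements
        elif elem.tag == "way":
            refs = [int(nd.get("ref")) for nd in elem.iter("nd")]
            ways[int(elem.get("id"))] = refs
            # Consecutive nodes on a way are neighbors of each other.
            for a, b in zip(refs, refs[1:]):
                neighbors[a].add(b)
                neighbors[b].add(a)
            elem.clear()
    return nodes, ways, dict(neighbors)

SAMPLE = b"""<osm>
  <node id="1" lat="0.0" lon="0.0"/>
  <node id="2" lat="0.0" lon="1.0"/>
  <node id="3" lat="1.0" lon="1.0"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><nd ref="3"/></way>
</osm>"""

nodes, ways, neighbors = parse_osm(io.BytesIO(SAMPLE))
print(ways)          # way 10 lists nodes 1, 2, 3 in order
print(neighbors[2])  # node 2 is adjacent to nodes 1 and 3
```

Streaming with `iterparse` avoids loading the whole file at once, which matters for inputs of the size the text describes.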
1-2, the intersection points are then defined through the node.txt file, and all intersection points that form long ways are filtered out, thereby reducing the interference of long ways with data processing. In this embodiment a long way is defined as follows: all intersection points are marked on the map according to their coordinates, the map is divided evenly into 40 × 40 grids, and for each way the Euclidean distance $d_o$ and the Manhattan distance $d_m$ between its two endpoints are calculated, with distance measured in number of grids. If $\max(d_o, d_m) \ge 10$, that is, the larger of the two distance measures is not less than 10 grids, the way is considered a long way. The endpoints of the long ways are obtained from this definition, and the points related to the long ways are then found from the way.txt file.
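The long-way test can be sketched as below. This is an illustrative reading of the definition: the function names, the bounding-box argument and the use of fractional grid units (rather than whole cell counts) are assumptions, not details given in the text.

```python
import math

GRID = 40          # the map is divided evenly into GRID x GRID cells
LONG_WAY_MIN = 10  # threshold, in grid cells

def to_grid_units(point, bbox, grid=GRID):
    """Convert map coordinates to (fractional) grid-cell coordinates."""
    min_x, min_y, max_x, max_y = bbox
    cell_w = (max_x - min_x) / grid
    cell_h = (max_y - min_y) / grid
    return (point[0] - min_x) / cell_w, (point[1] - min_y) / cell_h

def is_long_way(start, end, bbox, grid=GRID, threshold=LONG_WAY_MIN):
    """Return True when max(d_o, d_m) >= threshold for the two endpoints."""
    sx, sy = to_grid_units(start, bbox, grid)
    ex, ey = to_grid_units(end, bbox, grid)
    dx, dy = abs(sx - ex), abs(sy - ey)
    d_o = math.hypot(dx, dy)  # Euclidean distance in grid cells
    d_m = dx + dy             # Manhattan distance in grid cells
    return max(d_o, d_m) >= threshold

bbox = (0.0, 0.0, 40.0, 40.0)                      # one map unit per cell
print(is_long_way((0.0, 0.0), (8.0, 8.0), bbox))   # True:  d_m = 16
print(is_long_way((0.0, 0.0), (3.0, 3.0), bbox))   # False: d_m = 6
```

With continuous grid units the Manhattan distance is never smaller than the Euclidean distance, so in this sketch the maximum is always $d_m$; the two measures are kept separate here only to mirror the wording of the definition.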
1-3, clustering the intersection points that have a density characteristic representation to obtain a clustering result. The clustering uses a density clustering algorithm. Specifically, this embodiment uses GDSCAN, a density clustering algorithm based on a partitioned auxiliary grid that improves on the original DBSCAN algorithm: using the grid cell in which a point lies and the cells directly adjacent to it, points far from the point's neighborhood are excluded and only points near the neighborhood are computed, which reduces the amount of calculation.
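The grid-assisted neighborhood search that this step describes can be illustrated as follows. This is a sketch of the general idea only, not the GDSCAN algorithm itself: it assumes square cells whose side equals the neighborhood radius `eps`, so that every point within `eps` of a query point lies in the query point's cell or one of the eight surrounding cells. The function names are hypothetical.

```python
import math
from collections import defaultdict

def build_grid(points, eps):
    """Bucket point indices by grid cell; cell side equals eps (assumption)."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / eps), math.floor(y / eps))].append(idx)
    return grid

def region_query(points, grid, idx, eps):
    """Indices of points within eps of points[idx], scanning only 3x3 cells."""
    x, y = points[idx]
    cx, cy = math.floor(x / eps), math.floor(y / eps)
    found = []
    for gx in (cx - 1, cx, cx + 1):
        for gy in (cy - 1, cy, cy + 1):
            for j in grid.get((gx, gy), ()):
                if j != idx and math.dist(points[j], (x, y)) <= eps:
                    found.append(j)
    return sorted(found)

points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (0.9, 0.9)]
grid = build_grid(points, eps=1.0)
print(region_query(points, grid, 1, eps=1.0))  # neighbors of (0.5, 0.0)
```

A DBSCAN-style clustering loop would call `region_query` in place of a scan over all points; distant points such as `(3.0, 3.0)` are never compared because their cell is outside the 3 × 3 block.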
1-4, filling the non-characteristic intersection points into the clustering result through road network expansion. Specifically, the clustering result is wrapped in a minimum bounding polygon; the minimum bounding polygon is expanded in the direction from higher-order intersections to lower-order intersections, and the expanded polygon is wrapped in a minimum bounding rectangle. Note that, because four-way intersections tend to gather around five-way and higher-order intersections, clustering directly on the four-way intersections makes it unnecessary to expand the clustering result to the five-way and higher-order intersections as a separate step, which saves considerable time.
1-5, merging the clustering result with a one-pass clustering algorithm to obtain a rectangular full coverage map. Specifically, a box plot is used to find the outlier values among the areas of the minimum bounding rectangles, and the rectangles whose areas are smaller than the outlier values are merged with the one-pass clustering algorithm to obtain the rectangular full coverage map. Note that the minimum bounding rectangles (MBRs) taken from the expanded bounding polygons differ in area and leave gaps between them; the invention uses the one-pass clustering algorithm to merge the small-area MBRs.
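The box-plot step can be illustrated with the sketch below. It shows one possible reading only: the outlier threshold is taken to be the upper fence $Q_3 + 1.5\,(Q_3 - Q_1)$, a common box-plot convention that the text does not specify, and the MBRs at or below that fence are treated as the candidates for one-pass merging. The merging itself is not shown, and the function name is hypothetical.

```python
import statistics

def split_by_box_plot(areas, whisker=1.5):
    """Split MBR areas into merge candidates and outliers by the upper fence."""
    q1, _, q3 = statistics.quantiles(areas, n=4)   # quartiles of the areas
    upper_fence = q3 + whisker * (q3 - q1)         # assumed outlier threshold
    small = [a for a in areas if a <= upper_fence] # candidates for merging
    large = [a for a in areas if a > upper_fence]  # outlier-sized MBRs
    return upper_fence, small, large

areas = [1, 2, 3, 4, 5, 6, 7, 100]
fence, small, large = split_by_box_plot(areas)
print(small)  # the seven small rectangles
print(large)  # the single very large rectangle
```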
1-6, mapping the interest points into the road network according to an interest point mapping algorithm to obtain the preprocessing result. Interest points are information points, that is, points in the road network that correspond to the real world and carry certain information, such as bus stations and shops. They can be selected directly during preprocessing, and mapping them into the road network fills in the complete road network information. Specifically, the interest point mapping algorithm maps each interest point into the road network structure and adds to it the offset distance and the identifier of the road section to which the interest point belongs. Note that the offset distance and the road section identifier make it possible to quickly find the set of interest points closest to an intersection point within a road section.
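The text does not spell out the interest point mapping algorithm. As a stand-in, the sketch below projects an interest point onto the nearest road section and records the section identifier together with the offset measured along the section from its first endpoint. The choice of nearest-segment projection, the function names and the string section identifiers are all assumptions made for illustration.

```python
import math

def project_onto_segment(p, a, b):
    """Return (offset along a->b of the closest point to p, distance from p)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        t = 0.0  # degenerate segment: both endpoints coincide
    else:
        t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    cx, cy = ax + t * dx, ay + t * dy
    return t * math.sqrt(seg_len_sq), math.hypot(px - cx, py - cy)

def map_poi(poi, segments):
    """segments: {section_id: (a, b)}. Return (section_id, offset)."""
    seg_id, (a, b) = min(
        segments.items(),
        key=lambda item: project_onto_segment(poi, *item[1])[1],
    )
    return seg_id, project_onto_segment(poi, a, b)[0]

segments = {"1-2": ((0.0, 0.0), (10.0, 0.0)),
            "2-3": ((10.0, 0.0), (10.0, 10.0))}
print(map_poi((4.0, 1.0), segments))  # lands on section "1-2"
```

Sorting the interest points of a section by this offset gives the ones nearest to either end of the section without further distance computations.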
(2) Building an index
An R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index are constructed according to the preprocessing result, and the four indexes are associated with one another through pointers. Specifically, the preprocessing result includes the node.txt file and the way.txt file. Note that the node ids in the node.txt file are remapped to consecutive sequence numbers starting from 0 and increasing by one, and the ids in each neighbor point set are translated to the mapped ids by binary search, which makes the nodes in the files convenient to use later.
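The id remapping described above can be sketched as follows. The function name and the dictionary input format are assumptions; the binary search uses `bisect_left` from the standard library.

```python
from bisect import bisect_left

def remap_ids(node_neighbors):
    """Map original node ids to 0..n-1 and rewrite neighbor sets accordingly.

    node_neighbors: {original_id: set of original neighbor ids}.
    Returns (remapped, original): remapped[i] holds the sorted new ids of the
    neighbors of the node whose original id is original[i].
    """
    original = sorted(node_neighbors)  # position in this list is the new id

    def new_id(old):
        pos = bisect_left(original, old)  # binary search over the sorted ids
        if pos == len(original) or original[pos] != old:
            raise KeyError(old)
        return pos

    remapped = [sorted(new_id(n) for n in node_neighbors[old])
                for old in original]
    return remapped, original

remapped, original = remap_ids({1007: {52, 9000}, 52: {1007}, 9000: {1007}})
print(original)  # sorted original ids; index = new id
print(remapped)  # neighbor lists expressed in new ids
```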
The R-Tree index is then constructed according to the rectangular full coverage map and node.txt, with the region formed by all points in a lower-layer node extracted as the basic constituent structure of the upper-layer node. Note that, when constructing the R-Tree over the region point sets, overlapping regions are stored redundantly. In this embodiment, each leaf-node entry of the R-Tree index is stored in 12 bytes (point identification, point coordinates and point number), which gives a maximum of 341 entries per 4 KB sector. Each non-leaf-node entry stores the rectangular area of a leaf node, $((x_{min}, y_{min}), (x_{max}, y_{max}))$, in 20 bytes in total, and the node additionally holds {tree depth, front and back sibling pointers} in 16 bytes; this determines the maximum number of leaf nodes that a non-leaf node can accommodate, which in this embodiment is 204, since 16 + 204 × 20 = 4096.
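The capacity figures quoted above follow directly from the byte sizes. The short check below reproduces the arithmetic; the constant names are illustrative only.

```python
NODE_SIZE = 4096    # one 4 KB sector per index node
LEAF_ENTRY = 12     # bytes per leaf entry
INNER_ENTRY = 20    # bytes per child rectangle in a non-leaf node
INNER_HEADER = 16   # bytes for tree depth and sibling pointers

leaf_capacity = NODE_SIZE // LEAF_ENTRY                      # 341
inner_capacity = (NODE_SIZE - INNER_HEADER) // INNER_ENTRY   # 204

print(leaf_capacity, inner_capacity)
print(INNER_HEADER + inner_capacity * INNER_ENTRY)           # 4096
```

A leaf node therefore leaves 4096 − 341 × 12 = 4 bytes unused, while a non-leaf node with 204 entries fills the sector exactly.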
The point index is then constructed by traversing the leaf nodes of the R-Tree according to the node.txt file. Because the R-Tree leaf nodes are themselves obtained by traversing node.txt, the point index is established directly by traversing the R-Tree leaf nodes, so that the list in each point index leaf node corresponds one to one with the elements of the list in the R-Tree leaf node. Furthermore, because the traversed list structure contains the set of adjacent points, this relationship associates the point index with the adjacent point B-Tree index that is constructed next.
The adjacent point B-Tree index is then constructed according to the node.txt file and the way.txt file. Specifically, road section identifiers formed from basic elements selected from the node.txt file and the way.txt file are used as the keywords in the nodes: the adjacent point set of each preprocessed point structure is traversed, and each point in the adjacent point set is combined with the head point of the list to form a road section identifier, which serves as a keyword in an adjacent point B-Tree node. The traversal proceeds in the same order; each upper-layer node is constructed by extracting the first keyword of each lower-layer node, and the memory allocated to one node is filled before the next node is started, so that the adjacent point B-Tree index is constructed from the bottom layer up to the first layer.
Finally, the interest point B-Tree index is constructed according to the interest point set and the way.txt file. Specifically, because the interest points lie on roads, and to make management alongside the adjacent point B-Tree convenient, road section identifiers are again used as the keywords in the nodes. An inverted list of the interest point keywords is constructed, each upper-layer node is constructed by extracting the first interest point keyword of each lower-layer node, and the interest point B-Tree index is finally obtained.
Note that the constructed R-Tree index, point index, adjacent point B-Tree index and interest point B-Tree index are associated with one another through pointers, which makes the data more compact, facilitates subsequent data queries, and greatly improves the query efficiency for road network data.
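The bottom-up construction described above, in which each node is filled before the next is started and each upper layer is built from the first keyword of every lower-layer node, can be sketched as follows. This illustrates the layering only: nodes are plain Python lists, pointers and on-disk layout are omitted, and the function name and the tuple form of the road section identifiers are assumptions.

```python
def build_bottom_up(sorted_keys, fanout):
    """Return the tree as a list of levels, leaves first, root level last.

    Each level is a list of nodes and each node is a list of keys. An upper
    node holds the first key of each lower-level node that it covers.
    """
    # Fill each leaf completely before starting the next one.
    levels = [[sorted_keys[i:i + fanout]
               for i in range(0, len(sorted_keys), fanout)]]
    while len(levels[-1]) > 1:
        first_keys = [node[0] for node in levels[-1]]
        levels.append([first_keys[i:i + fanout]
                       for i in range(0, len(first_keys), fanout)])
    return levels

# Road section identifiers as (head point id, adjacent point id) pairs.
sections = sorted([(0, 1), (0, 2), (1, 0), (1, 3), (2, 0), (3, 1), (3, 4)])
for depth, level in enumerate(build_bottom_up(sections, fanout=3)):
    print(depth, level)
```

Because the input is sorted once and every node is filled in order, the build is a single pass per level and leaves nodes fully packed, which is consistent with the storage-utilization goal stated in the text.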
(3) Index storage
The constructed R-Tree index, point index, adjacent point B*-Tree index and interest point B*-Tree index are stored in the memory storage space; in this embodiment they are stored on disk. Note that the node size of each index tree is set to 4 KB at storage time, so that the node space can be fully used during construction and the utilization of the road network data storage space is improved.
The approximate keyword query method for large-scale road network data of the invention performs queries using the indexes stored by the above approximate keyword storage method. It relies on a reasonable and efficient data index structure to support approximate keyword queries, and implements an inverted index based on q-grams that provides efficient filtering for those queries.
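How a q-gram inverted index filters approximate keyword candidates can be sketched as follows. This is a generic illustration of the technique, not the patent's implementation: it uses padded 2-grams, the standard count filter (two strings within edit distance k share at least max(|s|, |t|) + q − 1 − k·q padded q-grams), and a plain dynamic-programming edit distance for verification. All function names and the padding characters are assumptions.

```python
from collections import Counter, defaultdict

def qgrams(word, q=2):
    """Multiset of padded q-grams of a lower-cased word."""
    padded = "#" * (q - 1) + word.lower() + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def build_index(keywords, q=2):
    """Inverted index: gram -> {keyword position: occurrences in keyword}."""
    index = defaultdict(dict)
    for pos, word in enumerate(keywords):
        for gram, n in qgrams(word, q).items():
            index[gram][pos] = n
    return index

def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def approx_search(query, keywords, index, k=1, q=2):
    """Keywords within edit distance k of query, using the count filter."""
    shared = Counter()
    for gram, n in qgrams(query, q).items():
        for pos, m in index.get(gram, {}).items():
            shared[pos] += min(n, m)
    hits = []
    for pos, common in shared.items():
        word = keywords[pos]
        # Count filter; only complete while this bound stays positive.
        need = max(len(query), len(word)) + q - 1 - k * q
        if common >= need and edit_distance(query.lower(), word.lower()) <= k:
            hits.append(word)
    return sorted(hits)

keywords = ["station", "stadium", "statoin", "shop"]
index = build_index(keywords)
print(approx_search("staton", keywords, index, k=1))
```

Only keywords that share enough q-grams with the query reach the comparatively expensive edit-distance check, which is where the filtering benefit comes from.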
Referring to FIG. 2, in the approximate keyword query method for large-scale road network data, the four indexes are queried in combination, duplicate data is filtered out during the query, and the q-gram inverted list is used for keyword matching. If a round of querying does not yield K results, the next round follows immediately, and the next round of spatial comparison is carried out at the same time as the string comparison. Note that, when one round does not meet the requirement, the original query range is enlarged and the query is repeated.
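The range-enlargement behaviour described above can be sketched as a simple loop. This is an illustration only: a linear scan stands in for the combined index query, the doubling growth factor and the round limit are assumptions, and the overlap of spatial comparison with string comparison is not modelled.

```python
import math

def knn_with_expansion(points, center, k, radius, growth=2.0, max_rounds=10):
    """Return up to k nearest points, enlarging the range until k are found."""
    hits = []
    for _ in range(max_rounds):
        # Stand-in for the index query: collect points inside the range.
        hits = sorted((math.dist(p, center), p) for p in points
                      if math.dist(p, center) <= radius)
        if len(hits) >= k:
            break
        radius *= growth  # requirement not met: enlarge the range and retry
    return [p for _, p in hits[:k]]

points = [(1.0, 0.0), (2.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
print(knn_with_expansion(points, (0.0, 0.0), k=3, radius=1.0))
```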
Referring to FIG. 2, the approximate keyword query method for large-scale road network data of the invention comprises the following steps:
First, a query request is acquired; the corresponding index is searched according to the query request, and a path list of the approximate keyword set is obtained according to the index; the source data file corresponding to the query request is then read according to the path list. The index comprises the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index. Specifically, the intersecting part of the R-Tree index over the region point sets is traversed according to the query request, down to the leaf nodes. From the traversal result, after filtering out the duplicate nodes that redundant storage may cause, the adjacent point B-Tree index is searched to locate the road section identifiers related to the request. After duplicate road section identifiers are filtered out, the exact road sections are located in the interest point B-Tree index according to the road section identifiers, and the interest points in those road sections are extracted. Finally, approximate keyword comparison is performed on the keywords of the interest points.
In the step of performing approximate keyword queries on the constructed external memory index files, the invention provides a KNN query method that runs the spatial query and the approximate keyword query in parallel. This replaces the traditional approach of running the two query modes serially and improves query efficiency. Multi-neighbor query tasks can be completed within seconds, which achieves near-real-time queries and makes the processing of big data feasible.
Example 2
The content of this embodiment is basically the same as that of embodiment 1, and the specific process of the approximate keyword storage method for large-scale road network data in this embodiment is as follows:
(1) Preprocessing original road network file
The original road network data is taken from the official OpenStreetMap (OSM) website, covers the northeastern region of the United States, is about 16 GB when downloaded, and is structured as an XML file. To process the original road network data quickly, the embodiment uses xml.etree.cElementTree from Python's XML parsing library together with the jit decorator from the numba library, and improves data processing efficiency by modifying the functions accordingly. This processing method is fast, is stable and adapts well to large-scale road network data, its time consumption grows slowly as the data scale increases, and it can quickly process the original 16 GB of data to obtain about 96 million points.
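A minimal sketch of the streaming parse described above, under stated assumptions: it uses `xml.etree.ElementTree.iterparse` (the `cElementTree` alias was removed in Python 3.9, and `ElementTree` picks up the C accelerator automatically), it omits the numba `jit` step, and the function name `parse_osm` and the returned dictionaries are hypothetical rather than the patent's node.txt and way.txt formats.

```python
import io
import xml.etree.ElementTree as ET

def parse_osm(source):
    """Stream an OSM XML source and return (nodes, ways).

    nodes: node id -> (lon, lat)
    ways:  way id  -> ordered list of node ids
    Elements are cleared once handled so memory stays low on large files.
    """
    nodes, ways = {}, {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            nodes[int(elem.get("id"))] = (float(elem.get("lon")), float(elem.get("lat")))
            elem.clear()
        elif elem.tag == "way":
            ways[int(elem.get("id"))] = [int(nd.get("ref")) for nd in elem.findall("nd")]
            elem.clear()
    return nodes, ways

# Tiny hand-written sample standing in for the 16 GB download.
SAMPLE = b"""<osm>
  <node id="1" lat="42.0" lon="-71.0"/>
  <node id="2" lat="42.1" lon="-71.1"/>
  <node id="3" lat="42.2" lon="-71.2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><nd ref="3"/></way>
</osm>"""

nodes, ways = parse_osm(io.BytesIO(SAMPLE))
```

Passing a file path instead of the `BytesIO` object works the same way, since `iterparse` accepts either.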
(2) Intersection points
The intersection points are obtained by applying a definition to the node.txt file: a point whose adjacent point set contains n point ids is regarded as an n-way intersection; for example, a point whose adjacent point set contains 3 point ids is regarded as a 3-way intersection. Defining the intersection points fixes points as the basic object of the subsequent processing steps, identifies the points with characteristic representation, and establishes that clustering is used in the subsequent processing, which improves the efficiency and accuracy of processing the original data.
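The definition above can be sketched as follows. The input shape (a mapping from way id to an ordered list of node ids), the function names, and the `min_degree` cut-off for what counts as a characteristic intersection are assumptions for illustration only.

```python
from collections import defaultdict

def adjacency_from_ways(ways):
    """Build point id -> set of adjacent point ids from ordered way node lists."""
    adj = defaultdict(set)
    for refs in ways.values():
        for a, b in zip(refs, refs[1:]):
            if a != b:
                adj[a].add(b)
                adj[b].add(a)
    return adj

def intersections(adj, min_degree=3):
    """Return point id -> n for points whose adjacent set holds at least min_degree ids."""
    return {pid: len(nbrs) for pid, nbrs in adj.items() if len(nbrs) >= min_degree}

# Two roads meeting at point 2: 1-2-3 and 2-4.
adj = adjacency_from_ways({100: [1, 2, 3], 101: [2, 4]})
print(intersections(adj))  # point 2 has three adjacent ids, so it is a 3-way intersection
```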
(3) Point clustering
The invention provides an improved density clustering algorithm for processing the point set. The algorithm divides the original point set region reasonably, filters out the points far from the neighborhood, and directly selects the points near the neighborhood for calculation, which greatly improves efficiency and still gives good results on point sets from large-scale road network data. Improvement in grid division: the auxiliary grid of the GDSCAN algorithm is divided evenly over the original scatter plot, whereas here the division is made evenly according to the shortest map side length divided by the cluster neighborhood radius ε. That is, with the length and width of the original scatter plot set to ERec_length and ERec_width respectively, the following should be satisfied:
$$ERec_{width} \ge \frac{\min(X_{max} - X_{min},\ Y_{max} - Y_{min})}{\varepsilon}$$
$$ERec_{length} = \frac{\max(X_{max} - X_{min},\ Y_{max} - Y_{min})}{ERec_{width}}$$
wherein $X_{max}$, $X_{min}$, $Y_{max}$ and $Y_{min}$ are taken respectively from the lower-left corner vertex coordinates $(X_{min}, Y_{min})$ and the upper-right corner vertex coordinates $(X_{max}, Y_{max})$.
(4) Optimizing primary clustering results
After primary clustering, the information of the other intersection points is missing and must be supplemented by a road network expansion algorithm. This embodiment uses the adjacent point sets of the points to fill the low-crossing points into the convex hull of the clustering result in turn, and takes the outermost points for overall expansion, which improves efficiency. Because large gaps and size differences still exist among the expanded bounding-rectangle clusters, this embodiment uses a box plot to find the outlier boundary of the rectangle areas. With the upper quartile $Q_3$ and the lower quartile $Q_1$ of the box plot, the outlier area value is given by $S = Q_3 - 1.5 \cdot IQR$, where $IQR = Q_3 - Q_1$. The bounding rectangles whose area is smaller than S are then merged with a one-pass clustering algorithm to resolve these problems, and the radius threshold r of the clusters is determined by the following specific steps:
1) Selecting all point objects in a dataset
2) Calculating the distance between any two points and saving the distance as D
3) Calculating the mean EX and standard deviation DX of D
4) The threshold r is taken between EX-0.5DX and EX+0.5DX
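The area bound S and the four steps above can be sketched as below. The assumptions are: Euclidean distance between points, the population standard deviation for DX, and Python's default quartile method; the function names are hypothetical. The formula for S is evaluated exactly as written in the text.

```python
import math
import statistics
from itertools import combinations

def area_outlier_bound(areas):
    """S = Q3 - 1.5 * IQR, as stated in the text.

    Note: the more common Tukey lower fence is Q1 - 1.5 * IQR; this sketch
    follows the text's formula rather than substituting that one.
    """
    q1, _median, q3 = statistics.quantiles(areas, n=4)
    return q3 - 1.5 * (q3 - q1)

def radius_threshold_range(points):
    """Steps 1) to 4): return the interval (EX - 0.5*DX, EX + 0.5*DX) for r."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]  # step 2: all pairs
    ex = statistics.mean(dists)     # step 3: mean EX
    dx = statistics.pstdev(dists)   # step 3: standard deviation DX (population form assumed)
    return ex - 0.5 * dx, ex + 0.5 * dx

low, high = radius_threshold_range([(0, 0), (3, 4), (6, 8)])
r = (low + high) / 2  # any value inside the interval is admissible; the midpoint is EX
```

Computing every pairwise distance is quadratic in the number of points, so on a large data set a sample of the points would presumably be used; the text does not say how this is handled.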
(5) Mapping points of interest onto a road network using a point of interest mapping algorithm
The interest point mapping algorithm comprises the following steps:
1) Construct a Voronoi diagram for the points on the road network.
2) Initialize the edge set to empty.
3) Take the construction point of the Voronoi polygon in which the point p to be mapped lies, and insert the edges that start or end at that construction point into the edge set.
4) Traverse the edge set, calculate the shortest distance from the point p to each edge, and take the edge with the minimum distance.
5) Calculate the offset distance of the point p on the road network, and reconstruct the point p.
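The mapping steps can be sketched without building the Voronoi diagram explicitly, because a point lies in the Voronoi polygon of its nearest construction point. The sketch below therefore substitutes a brute-force nearest-point search for steps 1) and 3); planar coordinates and all function and variable names are assumptions.

```python
import math

def project_onto_segment(p, a, b):
    """Return (foot point, offset from a along the segment, distance from p to the segment)."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return a, 0.0, math.dist(p, a)
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))  # clamp so the foot stays on the segment
    foot = (ax + t * dx, ay + t * dy)
    return foot, t * math.sqrt(seg_len2), math.dist(p, foot)

def map_poi(p, nodes, adjacency):
    """Map point p onto the road network.

    nodes: point id -> (x, y); adjacency: point id -> iterable of adjacent ids.
    Returns ((u, v), foot, offset, distance), or None if the nearest point has no edges.
    """
    # Nearest construction point, i.e. the site of the Voronoi polygon containing p.
    nearest = min(nodes, key=lambda nid: math.dist(p, nodes[nid]))
    best = None
    for nb in adjacency.get(nearest, ()):  # edge set: edges incident to that point
        foot, offset, dist = project_onto_segment(p, nodes[nearest], nodes[nb])
        if best is None or dist < best[3]:
            best = ((nearest, nb), foot, offset, dist)
    return best

nodes = {1: (0.0, 0.0), 2: (10.0, 0.0), 3: (0.0, 10.0)}
adjacency = {1: [2, 3], 2: [1], 3: [1]}
edge, foot, offset, dist = map_poi((4.0, 1.0), nodes, adjacency)
```

Here the point (4, 1) is mapped onto the edge between points 1 and 2, and the offset is measured from the nearest construction point along that edge.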
(6) Establishing index information
The method has a high real-time requirement for queries. Even without the points of interest, the intersection points alone number more than 96 million, and they are numerous and widely scattered. For this situation the method builds several indexes and stores them in external memory, so that the different kinds of point information are handled in a divide-and-conquer manner, while pointers logically associate the indexes with one another. In addition, each index node is fixed at 4 KB in order to improve I/O efficiency. The index structure is described in detail below in conjunction with FIG. 3:
Regional point set R-Tree index (Region R-Tree in FIG. 3): each point is stored in the form {id, x, y}, 12 bytes in total, so a 4 KB sector holds 341 points (341 × 12 = 4092 bytes); the remaining 4 bytes record the number of points stored in the index node, because the last node is not necessarily filled with 341 points. Every 341 points form a group; the minimum bounding rectangle of the group's point coordinates is taken as a region, and the rectangle coordinates are extracted and stored in the node one layer up. A minimum bounding rectangle is recorded as $(X_{min}, Y_{min})$, $(X_{max}, Y_{max})$, 20 bytes in total. A non-leaf node also holds 16 bytes for the node depth, the previous and next sibling pointers, and the number of leaf nodes, so it can store at most 204 region blocks (204 × 20 + 16 = 4096 bytes). The region blocks are merged layer by layer from the bottom up until the regional point set R-Tree is formed. The index makes full use of the space and node structure, which directly affects index traversal and the efficiency of interaction between internal and external memory.
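The byte arithmetic above can be checked with a short sketch. The field types are assumptions: the text gives only the byte totals, so a 4-byte integer id, 4-byte float coordinates, a 4-byte child pointer in each rectangle entry, and a count stored at the end of the leaf block are illustrative choices, as are all names.

```python
import struct

SECTOR = 4096  # index node size in bytes

POINT = struct.Struct("<iff")           # id, x, y -> 12 bytes (types assumed)
LEAF_COUNT = struct.Struct("<i")        # number of points in the leaf -> 4 bytes
MBR_ENTRY = struct.Struct("<ffffi")     # Xmin, Ymin, Xmax, Ymax, child pointer -> 20 bytes
INNER_HEADER = struct.Struct("<iiii")   # depth, prev sibling, next sibling, count -> 16 bytes

def leaf_capacity():
    """Points per leaf node: (4096 - 4) // 12."""
    return (SECTOR - LEAF_COUNT.size) // POINT.size

def inner_capacity():
    """Region blocks per non-leaf node: (4096 - 16) // 20."""
    return (SECTOR - INNER_HEADER.size) // MBR_ENTRY.size

def pack_leaf(points):
    """Serialize (id, x, y) tuples into one 4 KB block, padding unused slots with zeros."""
    if len(points) > leaf_capacity():
        raise ValueError("too many points for one leaf node")
    body = b"".join(POINT.pack(*p) for p in points)
    return body.ljust(SECTOR - LEAF_COUNT.size, b"\x00") + LEAF_COUNT.pack(len(points))

def mbr(points):
    """Minimum bounding rectangle of a group of (id, x, y) points."""
    xs = [x for _pid, x, _y in points]
    ys = [y for _pid, _x, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

group = [(1, 0.5, 1.5), (2, 2.0, 0.25), (3, 1.0, 3.0)]
block = pack_leaf(group)
print(leaf_capacity(), inner_capacity(), len(block), mbr(group))
```

Under these assumptions the capacities come out as 341 and 204, matching the figures in the text.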
Point index (Point-index in FIG. 3): the index is obtained by traversing the node.txt file after the ids have been mapped sequentially; sacrificing a small amount of space improves the efficiency of connecting the indexes and thereby directly improves index traversal efficiency.
Neighbor point B x-Tree index (Adjacency List B x-Tree in fig. 3): the node size of the neighbor point B x-Tree is likewise set to one 4 KB sector. Each adjacent point and the head of its adjacency list are extracted from the adjacent point list to form a road section identifier such as <id_1, id_2>, following the rule that the first id is larger than the second to prevent duplicate data. Each leaf node also holds 16 bytes for the road section identifier count, the previous and next sibling pointers, and the parent node pointer; a road section identifier with its additional information takes 40 bytes, so a leaf node can hold 102 road section identifiers (102 × 40 + 16 = 4096 bytes). A non-leaf node takes the first road section identifier of each leaf node as a key which, together with a pointer to that leaf node, takes 12 bytes; with a further 16 bytes for the sibling pointers, node depth and number of keys, a non-leaf node can store 340 keys, that is, point to 340 leaf nodes (340 × 12 + 16 = 4096 bytes). The extraction is repeated, building upward from the bottom layer until the top of the tree is reached. This index likewise makes full use of the space and node structure, which directly affects index traversal and the efficiency of interaction between internal and external memory.
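A minimal in-memory sketch of the two ideas in this paragraph: the canonical road section identifier (larger id first) and the bottom-up construction in which each upper node keeps the first key of each lower node. The fan-out values are derived from the byte counts in the text; the function names and the list-of-lists tree representation are hypothetical, and no on-disk layout is attempted.

```python
SECTOR, HEADER = 4096, 16
LEAF_FANOUT = (SECTOR - HEADER) // 40    # 102 road section identifiers per leaf
INNER_FANOUT = (SECTOR - HEADER) // 12   # 340 keys per non-leaf node

def segment_id(a, b):
    """Canonical road section identifier: the larger point id comes first."""
    return (a, b) if a > b else (b, a)

def segments_from_adjacency(adj):
    """adj: head point id -> iterable of adjacent ids. Returns sorted, de-duplicated ids."""
    return sorted({segment_id(h, n) for h, nbrs in adj.items() for n in nbrs if h != n})

def build_levels(keys, leaf_fanout=LEAF_FANOUT, inner_fanout=INNER_FANOUT):
    """Bottom-up build: levels[0] holds the leaves, levels[-1] the top layer.

    Each upper node stores the first key of each node in the layer below.
    """
    level = [keys[i:i + leaf_fanout] for i in range(0, len(keys), leaf_fanout)]
    levels = [level]
    while len(level) > 1:
        firsts = [node[0] for node in level]
        level = [firsts[i:i + inner_fanout] for i in range(0, len(firsts), inner_fanout)]
        levels.append(level)
    return levels

# Each edge appears twice in an adjacency list but only once as an identifier.
segments = segments_from_adjacency({1: [2, 3], 2: [1], 3: [1]})
levels = build_levels(segments)
```

Storing the larger id first means the edge between points 1 and 2 is recorded once as (2, 1), regardless of which endpoint's adjacency list it was read from.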
Point of interest B x-Tree index (POI List B x-Tree in fig. 3): the interest point index structure exploits the fact that interest points lie on road sections and uses the road section identifier as the key, so a search only needs to locate the road section. As with the other indexes, the node size is set to 4 KB so that node space is fully used, which directly improves index traversal and the efficiency of interaction between internal and external memory.
The specific process of the approximate keyword query method for large-scale road network data of the invention is as follows:
In the step of performing the approximate keyword query with the constructed external-memory index files, an efficient road network KNN query algorithm is provided, with the following specific steps:
1) For a given query Q, set a query region $R_Q$ with a side length of 50.
2) Traverse the regional point set R-Tree: if $R_Q$ intersects the region contained in a node of the R-Tree, traverse that node's children until a leaf node is reached. If there is no intersection, continue traversing the regional point set R-Tree in order.
3) Remove duplicate points (result points may be duplicated because of the redundant storage of the index).
4) Connect to the adjacent point B-Tree index through the point index.
5) Query the road sections involved on the adjacent point B-Tree.
6) Remove duplicate road sections (because a road section is composed of two points, querying each of its two points returns the same road section).
7) Query the interest point B-Tree index according to the road section identifiers to obtain the interest points on the road sections.
8) Perform the approximate keyword query on the keywords of the interest points; the required number of query results is K. If fewer than K results are obtained, enlarge the side length of the original query region by 50 and return to step 2).
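The eight steps can be sketched with in-memory dictionaries standing in for the four external-memory indexes. Everything beyond the steps themselves is an assumption: a square region centred on the query location, Levenshtein distance with a fixed `max_dist` as the approximate match, ranking by Euclidean distance, and a `max_side` stopping bound (the text does not say when the expansion stops). All names are hypothetical.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def in_region(pt, center, side):
    """True if pt lies in the axis-aligned square of the given side centred on center."""
    half = side / 2
    return abs(pt[0] - center[0]) <= half and abs(pt[1] - center[1]) <= half

def knn_keyword_query(center, keyword, k, points, adjacency, pois,
                      max_dist=1, step=50, max_side=1000):
    """points: id -> (x, y); adjacency: id -> adjacent ids;
    pois: (larger id, smaller id) -> list of (name, (x, y)).
    Returns (up to k matching POIs nearest to center, final side length)."""
    side = step                                              # step 1
    while True:
        # Steps 2-3: spatial filter; the set removes duplicate points.
        hits = {pid for pid, xy in points.items() if in_region(xy, center, side)}
        # Steps 4-6: road sections of the hit points, canonical ids, duplicates removed.
        segments = {(max(p, n), min(p, n)) for p in hits for n in adjacency.get(p, ())}
        # Steps 7-8: POIs on those sections whose keyword approximately matches.
        matches = [poi for seg in segments for poi in pois.get(seg, ())
                   if edit_distance(poi[0].lower(), keyword.lower()) <= max_dist]
        if len(matches) >= k or side >= max_side:
            matches.sort(key=lambda poi: math.dist(poi[1], center))
            return matches[:k], side
        side += step                                         # enlarge and go back to step 2

points = {1: (0, 0), 2: (10, 0), 3: (100, 0), 4: (110, 0)}
adjacency = {1: [2], 2: [1], 3: [4], 4: [3]}
pois = {(2, 1): [("cafe", (5, 0))], (4, 3): [("cafe", (105, 0)), ("bank", (102, 0))]}
result, side = knn_keyword_query((0, 0), "caffe", 2, points, adjacency, pois)
```

In this toy data the first round (side 50) finds only one matching POI, so the region grows in steps of 50 until point 3 falls inside it and the second match is reached. The sketch re-scans all points each round and runs the rounds one after another; it does not reproduce the R-Tree traversal or the parallel spatial and string comparison described earlier in the document.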
The invention has been described in detail hereinabove with reference to specific exemplary embodiments thereof. It will be understood that various modifications and changes may be made without departing from the scope of the invention as defined by the appended claims. The detailed description and drawings are to be regarded in an illustrative rather than a restrictive sense, and any such modifications and variations are intended to be included within the scope of the invention described herein. Furthermore, the background art is intended to illustrate the state and significance of the development of the technology and is not intended to limit the invention or its application and field of application.
Claims (9)
1. The approximate keyword storage method for the large-scale road network data is characterized by comprising the following steps of:
preprocessing original road network data to obtain a preprocessing result;
constructing an R-Tree index, a point index, an adjacent point B-Tree index and an interest point B-Tree index according to the preprocessing result, wherein the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index are mutually related through pointers;
storing the constructed R-Tree index, point index, adjacent point B-Tree index and interest point B-Tree index into a memory storage space; the specific process of constructing the R-Tree index, the point index, the adjacent point B-Tree index and the interest point B-Tree index according to the preprocessing result comprises the following steps:
the preprocessing result comprises a node.txt file and a way.txt file, and an R-Tree index is constructed according to the rectangular full coverage map and the node.txt file; a point index is constructed by traversing the leaf nodes of the R-Tree according to the node.txt file; an adjacent point B-Tree index is constructed according to the node.txt file and the way.txt file; and an interest point B-Tree index is constructed according to the interest point set and the way.
2. The approximate keyword storage method for large-scale road network data according to claim 1, wherein the specific process of preprocessing the original road network data is as follows:
obtaining intersection points according to original road network data, clustering intersection points with density characteristic representation to obtain a clustering result, filling non-characteristic intersection points into the clustering result through road network expansion, merging the clustering result by using a one-pass clustering algorithm to obtain a rectangular full coverage map, and mapping the interest points into the road network according to an interest point mapping algorithm to obtain an interest point set.
3. The approximate keyword storage method for large-scale road network data according to claim 2, wherein the specific process of obtaining the intersection point according to the original road network data is as follows:
extracting basic composition elements from the original road network data, and constructing a node.txt file and a way.txt file according to the basic composition elements, wherein the node.txt file stores node information and the way.txt file stores road information; the intersection points are then obtained through the definition of the node.
4. The method for storing approximate keywords of large-scale road network data according to claim 2, wherein the specific process of filling non-characteristic intersection points into the clustering result through road network expansion is as follows:
firstly, wrapping the clustering result with a minimum bounding polygon;
expanding the minimum bounding polygon, wherein the expansion proceeds in the direction from high crossings to low crossings, and wrapping the expanded minimum bounding polygon with its minimum bounding rectangle.
5. The method for storing approximate keywords for large-scale road network data according to claim 4, wherein the specific process of merging the clustering results by using a one-pass clustering algorithm to obtain the rectangular full coverage map comprises the following steps:
finding the outlier points of the areas of the minimum bounding rectangles by using a box plot, and merging the areas smaller than the outlier points by using a one-pass clustering algorithm to obtain the rectangular full coverage map.
6. The approximate keyword storage method for large-scale road network data according to claim 1, wherein the specific process of constructing the neighbor point B x-Tree index according to the node. Selecting road section identifiers formed by basic constituent elements from the node.txt file and the way.txt file as the keywords in the nodes, and constructing from the bottom layer to the first layer according to the keywords to obtain the adjacent point B-Tree index, wherein each upper-layer node is constructed by extracting the first keyword of each lower-layer node.
7. The approximate keyword storage method for large-scale road network data according to claim 1, wherein the specific process of constructing the interest point B x-Tree index according to the interest point set and the way.
constructing an inverted list of the interest point keywords, and building the interest point B-Tree index from the bottom layer to the first layer according to the interest point keywords, wherein each upper-layer node is constructed by extracting the first interest point keyword of each lower-layer node.
8. An approximate keyword query method for large-scale road network data, characterized in that the query is performed using the index stored by the approximate keyword storage method for large-scale road network data according to any one of claims 1-7.
9. The approximate keyword query method for large-scale road network data according to claim 8, comprising the following steps: firstly, acquiring a query request; searching the corresponding index according to the query request, and obtaining an approximate keyword set path list according to the index; and then reading the data file corresponding to the query request according to the path list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010650465.7A CN111813778B (en) | 2020-07-08 | 2020-07-08 | Approximate keyword storage and query method for large-scale road network data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111813778A | 2020-10-23 |
CN111813778B | 2024-03-29 |
Family
ID=72841986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010650465.7A Active CN111813778B (en) | 2020-07-08 | 2020-07-08 | Approximate keyword storage and query method for large-scale road network data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111813778B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |