CN111639075B

CN111639075B - Non-relational database vector data management method based on flattened R tree

Info

Publication number: CN111639075B
Application number: CN202010387252.XA
Authority: CN
Inventors: 向隆刚; 王越
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2020-05-09
Filing date: 2020-05-09
Publication date: 2023-05-12
Anticipated expiration: 2040-05-09
Also published as: CN111639075A

Abstract

The invention provides a vector data management method for distributed non-relational databases. The method designs an index structure for vector data based on an R tree flattening strategy; establishes a library table structure comprising the vector data and the index structure, and associates the library table structure with the vector data; encodes the vector data, stores it in the database, and constructs a flattened R tree index; provides a spatial query processing algorithm for vector data based on the flattened R tree; and maintains the vector data in the non-relational database, including updating and deleting. By establishing an R tree index based on the flattening strategy, the invention gives the non-relational database R tree supported query processing for vector data, supports the organization and management of large-scale vector data, and allows vector data types to benefit from the mass storage, parallel computing, high availability and high reliability of the non-relational database.

Description

Non-relational database vector data management method based on flattened R tree

Technical Field

The invention belongs to the technical field of databases, and particularly relates to a vector data management method in a non-relational database.

Background

More than 85% of real-world data is related to geographic location. According to a report by the McKinsey Global Institute, the global volume of geospatial data exceeded 6000 PB in 2016 and is still increasing at a petabyte-level rate each year. Compared with raster data, whose structure is simple, vector data has a complex structure and carries the main tasks of spatial analysis and spatial data query. Heterogeneous, unstructured vector data is difficult to manage with a traditional relational database; in the face of massive data sets whose volume is huge and continuously growing, the relational database also has scalability problems that are difficult to overcome.

The non-relational database follows the CAP theorem and the BASE principle. While weakening transactional guarantees, it emphasizes schema freedom, read-write efficiency and horizontal scalability, and can provide efficient random access, multi-format data storage and highly concurrent data reads and writes; its strong scalability and computing capability offer a new approach to the above problems. A non-relational database system generally stores data using a Key-Value storage model, and efficient data queries are ensured by automatically building an index on the Key. In addition, the query capability of the database can be enriched by establishing secondary indexes.

The purpose of a spatial index is to improve query efficiency. Traditional spatial indexes were not designed for distributed environments; when they are used to store and manage massive vector data, data storage is difficult to organize and real-time query requirements are difficult to meet. The native spatial indexes of non-relational databases support vector data poorly. Taking MongoDB as an example, the 2d index and the 2dsphere index are the two spatial indexes natively supported by MongoDB; the 2d index only supports indexing point elements, and the 2dsphere index does not support planar coordinate data and adapts poorly to data, so it is difficult for either to support query processing of vector data.

Therefore, when a non-relational database is used to manage massive vector data, no suitable index is available, and the data cannot be efficiently organized and managed using the native spatial indexes, so the high-concurrency advantage of the non-relational database is difficult to exploit.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a vector data management method in a non-relational database that gives a distributed non-relational database the capability to organize and manage massive vector data.

The technical scheme adopted by the invention for solving the technical problems is as follows: a method for vector data management in a non-relational database, characterized by: the method comprises the following steps:

S1, in a non-relational database environment, designing an auxiliary index structure for vector data based on an R tree flattening strategy;

S2, establishing a library table structure comprising vector data and an index structure, wherein all data tables in the library table structure are associated through explicit association records and implicit naming rules;

S3, encoding the vector data and storing it in the database, organizing geometric and attribute information in GeoJSON form and JSON form respectively, and constructing a flattened R tree index;

S4, when a query request is received, determining an index metadata ID according to the query condition, further obtaining the R tree root node ID, then searching the vector data in parallel based on the R tree index table, and finally returning the query result;

and S5, maintaining vector data in the non-relational database, including updating and deleting.

According to the above method, in the step S1, the specific design steps of the auxiliary index structure based on the R tree flattening strategy are as follows:

1.1, abstracting a vector object into a minimum bounding rectangle MBR (Minimum Bounding Rectangle), recursively combining MBRs with adjacent spatial positions into MBRs of a higher level, and finally forming a layered tree structure based on the minimum bounding rectangle;

1.2, expanding the R tree index structure into a flattened index node set, namely, expressing each index node as a JSON structure, and taking a unique identification of the node as a pointer of a parent index item to a child index node;

and 1.3, setting a fan-out coefficient M of the R tree, and constraining the number of child nodes of every R tree node other than the root node to lie within the interval [2, M].

According to the method, in the R tree nodes, the record format of a leaf node is <OID, MBR>, and the record format of an intermediate node is <OID, Pointer, MBR>, where OID is the unique identifier of the node, Pointer points to its child node, and MBR is the minimum bounding rectangle.
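Steps 1.1 and 1.3 can be illustrated with a minimal Python sketch. The tuple layout `(min_x, min_y, max_x, max_y)` and the helper names `mbr_of`, `union` and `valid_fanout` are assumptions made for illustration; the method itself does not prescribe them.

```python
from typing import Iterable, List, Tuple

# Assumed layout for illustration: (min_x, min_y, max_x, max_y).
MBR = Tuple[float, float, float, float]


def mbr_of(points: Iterable[Tuple[float, float]]) -> MBR:
    """Abstract a vector object (its vertex list) into its MBR (step 1.1)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def union(mbrs: Iterable[MBR]) -> MBR:
    """Combine several adjacent MBRs into the MBR of the next higher level."""
    boxes: List[MBR] = list(mbrs)
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def valid_fanout(child_count: int, M: int, is_root: bool) -> bool:
    """Step 1.3: every node except the root holds between 2 and M children."""
    return is_root or 2 <= child_count <= M


road = mbr_of([(0, 0), (4, 1), (6, 3)])
park = mbr_of([(5, 5), (8, 9)])
parent = union([road, park])
```

Applying `union` repeatedly, level by level, yields the layered tree of bounding rectangles that the flattening step then stores node by node.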

According to the above method, in the step S2, the library table structure is designed as follows:

2.1, managing multi-source heterogeneous vector data in a data set form, wherein each vector data set organizes logically related vector data with the same type;

2.2, designing a vector data table, an R tree index table, a vector metadata table and an index metadata table, wherein the vector elements, the index structures and the metadata of the vector data set are respectively stored;

and 2.3, establishing association relations among the vector data table, the R tree index table, the vector metadata table and the index metadata table, wherein each vector data set corresponds to one vector data table and one R tree index table, and metadata description is carried out in the vector metadata table and the index metadata table respectively.

According to the above method, the step S3 specifically includes:

3.1, taking a vector data set as a unit, encoding all vector elements in the vector data set, writing them into the vector data table, and organizing geometric and attribute information in GeoJSON form and JSON form respectively; GeoJSON is a geospatial information data exchange format based on JavaScript Object Notation;

3.2, inquiring geometric metadata information of a space domain where the vector elements are in the vector metadata table, and acquiring corresponding index metadata ID;

3.3, acquiring an R tree index table and a root node ID thereof from the index metadata table, navigating to a target leaf node of the R tree index table according to the geometric relationship between the minimum bounding rectangle of the vector element and the bounding rectangle of the index item, inserting the index item related to the vector element, and updating the R tree index table;

and 3.4, after the writing of the vector data set is completed, updating the vector metadata table and the index metadata table.

According to the above method, in 3.3, the specific mode of ID node navigation and R tree index table update includes the following steps:

3.3.1, searching for the best insertion node by ID node navigation according to the geometric relationship between the minimum bounding rectangle of the vector element and the bounding rectangle of the index item, and judging whether the number of child nodes of the node exceeds the set fan-out coefficient; if so, executing step 3.3.2, otherwise executing step 3.3.3;

3.3.2, performing node splitting operation, equally dividing the node into two new nodes through an R tree node splitting algorithm, and navigating again to find the optimal inserted node;

3.3.3, inserting the index item of the vector element into the node, and updating the node;

and 3.3.4, if the root node splits, updating the information of the root node in the index metadata table.
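The overflow check and split of steps 3.3.1 to 3.3.3 can be sketched as follows. The text does not name a particular R tree node splitting algorithm, so this sketch substitutes a simple one: entries are sorted by MBR centre along the longer axis and divided into two halves. It also appends the entry before checking for overflow, which is a simplification of the check, split and re-navigate order described above. The function names and the `_split` ID suffix are hypothetical.

```python
from typing import List, Tuple

# An entry is assumed to be {"P": id, "M": [min_x, min_y, max_x, max_y]}.
Entry = dict


def split_entries(entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Divide entries into two equal halves along the longer axis."""
    boxes = [e["M"] for e in entries]
    width = max(b[2] for b in boxes) - min(b[0] for b in boxes)
    height = max(b[3] for b in boxes) - min(b[1] for b in boxes)
    axis = 0 if width >= height else 1
    # min + max along the axis is twice the centre, which sorts identically.
    ordered = sorted(entries, key=lambda e: e["M"][axis] + e["M"][axis + 2])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def insert_into_node(node: dict, entry: Entry, M: int) -> List[dict]:
    """Return the updated node, or two nodes if the fan-out M is exceeded."""
    entries = node["D"] + [entry]
    if len(entries) <= M:
        return [{**node, "C": len(entries), "D": entries}]
    left, right = split_entries(entries)
    return [
        {"ID": node["ID"], "L": node["L"], "C": len(left), "D": left},
        {"ID": node["ID"] + "_split", "L": node["L"], "C": len(right), "D": right},
    ]


a = {"P": "a", "M": [0, 0, 1, 1]}
b = {"P": "b", "M": [2, 0, 3, 1]}
c = {"P": "c", "M": [8, 0, 9, 1]}
d = {"P": "d", "M": [9, 0, 10, 1]}
node = {"ID": "N4", "L": 0, "C": 3, "D": [c, a, b]}
result = insert_into_node(node, d, M=3)
```

With M = 3 the fourth entry overflows the node, so `result` holds two nodes of two entries each; the caller would then write both nodes back and update the parent, or the index metadata table if the split node was the root.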

According to the method, the structure of the JSON is { ID, L, C, D }, wherein ID is the unique identifier of the index node, namely the OID; L (Level) is the level at which the node is located in the tree; C (Count) is the number of child nodes owned by the node; D (Descendants) is a JSON nested structure that records the unique identifier and the minimum bounding rectangle of each child node owned by the node. D (Descendants) is defined as D { { P, M }, …, { P, M } }, where P (Pointer) points to the OID of the child node and M (MBR) is the minimum bounding rectangle of the child node, organized in GeoJSON form.
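As an illustration of the { ID, L, C, D } structure, the following Python sketch builds one flattened index node and round-trips it through JSON. Representing D as a JSON array of { P, M } objects, encoding each MBR as a GeoJSON Polygon ring, and numbering levels upward from the leaves are assumptions of this sketch; the helper names are hypothetical.

```python
import json


def mbr_to_geojson(min_x, min_y, max_x, max_y):
    """Encode an MBR as a closed GeoJSON Polygon ring."""
    return {
        "type": "Polygon",
        "coordinates": [[
            [min_x, min_y], [max_x, min_y], [max_x, max_y],
            [min_x, max_y], [min_x, min_y],
        ]],
    }


def make_node(oid, level, children):
    """children is a list of (child_oid, (min_x, min_y, max_x, max_y))."""
    return {
        "ID": oid,                # unique identifier of the index node (OID)
        "L": level,               # level of the node in the tree
        "C": len(children),       # number of child nodes
        "D": [{"P": child, "M": mbr_to_geojson(*box)} for child, box in children],
    }


root = make_node("N0", 2, [("N1", (0, 0, 5, 5)), ("N2", (4, 4, 10, 10))])
serialized = json.dumps(root)      # value stored under the key "N0"
restored = json.loads(serialized)  # deserialized again when the node is read
```

Because each P is only an identifier, following a pointer becomes a key lookup in the index table rather than an in-memory reference.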

According to the above method, the step S4 specifically includes:

4.1, giving query conditions such as data set names, query ranges and the like by a user;

4.2, inquiring geometric metadata information of the space domain in the vector metadata table according to a given inquiry condition, and acquiring ID information of corresponding index metadata;

4.3, obtaining the ID of the R tree index table and the root node thereof from the index metadata table;

4.4, querying the R tree index table, and taking out index items of vector elements meeting query conditions;

and 4.5, extracting vector data from the vector data table, and performing fine filtering to finally obtain a query result.
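The metadata lookups in steps 4.2 and 4.3 form a short chain from data set name to root node ID. A minimal sketch, with both metadata tables modeled as in-memory dictionaries; the field names `index_metadata_id`, `index_table` and `root_id`, and the sample values, are hypothetical and stand in for whatever the metadata tables actually record.

```python
# Vector metadata table: row key is the data set name (sample content).
VECTOR_METADATA = {
    "buildings": {"index_metadata_id": "Rtree_wuhan_demo"},
}

# Index metadata table: row key is the index table name (sample content).
INDEX_METADATA = {
    "Rtree_wuhan_demo": {"index_table": "Rtree_wuhan_demo", "root_id": "N0"},
}


def resolve_root(dataset_name):
    """Return (index table name, root node ID), or None if nothing is registered."""
    vector_meta = VECTOR_METADATA.get(dataset_name)
    if vector_meta is None:
        return None  # no metadata, so no element can satisfy the query
    index_meta = INDEX_METADATA[vector_meta["index_metadata_id"]]
    return index_meta["index_table"], index_meta["root_id"]


found = resolve_root("buildings")
missing = resolve_root("rivers")
```

Returning early when the vector metadata is absent lets a query end before any index row is read.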

According to the above method, in 4.4, the specific way of querying the R tree index table includes the following steps:

4.4.1, acquiring a corresponding key value pair in the R tree index table through ID information of the root node, and taking out and deserializing the key value pair;

4.4.2, judging the relation between the MBR of each child node and the query range by using the geometric information in GeoJSON, finding the child nodes whose MBR intersects the query range or lies within the query range, taking out the corresponding child node key value pairs according to the pointers from the parent index items to the child index nodes, and deserializing them;

4.4.3, repeating the step 4.4.2 until the leaf nodes of the R tree are inquired.
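Steps 4.4.1 to 4.4.3 amount to a level-by-level traversal in which every pointer is resolved by a key lookup. The sketch below is a sequential version under several assumptions: the index table is a plain dictionary from node ID to serialized JSON, each MBR is kept as a four-number list `[min_x, min_y, max_x, max_y]` instead of full GeoJSON for brevity, and leaf nodes carry L = 0.

```python
import json


def intersects(a, b):
    """True if two rectangles [min_x, min_y, max_x, max_y] overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(index_table, root_id, window):
    """Return the element IDs whose MBR intersects the query window."""
    hits, frontier = [], [root_id]
    while frontier:
        next_frontier = []
        for node_id in frontier:
            node = json.loads(index_table[node_id])  # fetch and deserialize
            matched = [e["P"] for e in node["D"] if intersects(e["M"], window)]
            if node["L"] == 0:
                hits.extend(matched)           # leaf entries point at elements
            else:
                next_frontier.extend(matched)  # follow pointers to child nodes
        frontier = next_frontier
    return hits


def row(oid, level, entries):
    return json.dumps({"ID": oid, "L": level, "C": len(entries),
                       "D": [{"P": p, "M": m} for p, m in entries]})


index_table = {
    "N0": row("N0", 1, [("N1", [0, 0, 5, 5]), ("N2", [6, 6, 10, 10])]),
    "N1": row("N1", 0, [("v1", [0, 0, 1, 1]), ("v2", [4, 4, 5, 5])]),
    "N2": row("N2", 0, [("v3", [6, 6, 7, 7]), ("v4", [9, 9, 10, 10])]),
}
candidates = search(index_table, "N0", [3, 3, 6.5, 6.5])
```

The returned IDs are only candidates selected by bounding rectangle; the fine filtering of step 4.5 still has to check the actual geometry.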

According to the above method, in the step S5, the process of deleting vector data includes:

5.1, inquiring geometric metadata information of a space domain where data to be deleted are located in a vector metadata table, and acquiring corresponding index metadata information;

5.2, acquiring an R tree index table and a root node thereof from the index metadata table, and locating the index item associated with the vector element to be deleted according to the geometric relationship between the minimum bounding rectangle of the R tree node and the query frame;

5.3, deleting the corresponding vector data in the vector data table;

and 5.4, deleting the index item associated with the vector data in the R tree index table.
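Steps 5.2 to 5.4 can be sketched as a locate-then-delete routine. The tables are modeled as dictionaries, MBRs as four-number lists, and leaves as nodes with L = 0; all of these are assumptions for illustration. The sketch omits shrinking ancestor MBRs and handling nodes that fall below the minimum fan-out, which a complete implementation would need to address.

```python
def intersects(a, b):
    """True if two rectangles [min_x, min_y, max_x, max_y] overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def locate_leaf(index, node_id, element_id, box):
    """Step 5.2: find the leaf whose entries include the element."""
    node = index[node_id]
    for entry in node["D"]:
        if not intersects(entry["M"], box):
            continue
        if node["L"] == 0:
            if entry["P"] == element_id:
                return node_id
        else:
            found = locate_leaf(index, entry["P"], element_id, box)
            if found is not None:
                return found
    return None


def delete_element(data, index, root_id, element_id, box):
    leaf_id = locate_leaf(index, root_id, element_id, box)
    if leaf_id is None:
        return False
    data.pop(element_id, None)                                  # step 5.3
    leaf = index[leaf_id]
    leaf["D"] = [e for e in leaf["D"] if e["P"] != element_id]  # step 5.4
    leaf["C"] = len(leaf["D"])
    return True


data = {"v1": {"NAME": "a"}, "v2": {"NAME": "b"}}
index = {
    "N0": {"ID": "N0", "L": 1, "C": 1, "D": [{"P": "N1", "M": [0, 0, 5, 5]}]},
    "N1": {"ID": "N1", "L": 0, "C": 2, "D": [
        {"P": "v1", "M": [0, 0, 1, 1]},
        {"P": "v2", "M": [4, 4, 5, 5]},
    ]},
}
deleted = delete_element(data, index, "N0", "v2", [4, 4, 5, 5])
```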

According to the above scheme, the index information of the corresponding data is updated only after the data insertion operation is completed, so as to ensure the integrity of the content items. In a non-relational database, failures are the norm. If the index item were inserted first and the system went down after the index item was inserted, then after restarting the system would consider the data to have been stored in the database, and the data would be lost.
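The effect of this write order can be shown with a small simulation. The sketch below writes the data row first and the index entry second, then simulates a failure between the two writes; the function names and the `crash_after_data` flag are hypothetical and exist only for the demonstration.

```python
def ingest(data_table, leaf, element_id, record, box, crash_after_data=False):
    """Write the data row first, then the index entry that points at it."""
    data_table[element_id] = record
    if crash_after_data:
        raise RuntimeError("simulated crash between the two writes")
    leaf["D"].append({"P": element_id, "M": box})
    leaf["C"] = len(leaf["D"])


def dangling_index_entries(data_table, leaf):
    """Index entries that point at data rows which do not exist."""
    return [e["P"] for e in leaf["D"] if e["P"] not in data_table]


def unindexed_rows(data_table, leaf):
    """Data rows that no index entry points at yet."""
    indexed = {e["P"] for e in leaf["D"]}
    return [key for key in data_table if key not in indexed]


data_table = {}
leaf = {"ID": "N1", "L": 0, "C": 0, "D": []}
ingest(data_table, leaf, "v1", {"NAME": "a"}, [0, 0, 1, 1])
try:
    ingest(data_table, leaf, "v2", {"NAME": "b"}, [2, 2, 3, 3], crash_after_data=True)
except RuntimeError:
    pass  # after "restart" the data row exists but is not yet indexed
```

After the simulated failure the index never points at a missing row; the worst case is a stored row that still needs its index entry, which can be detected and repaired.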

The beneficial effects of the invention are as follows: the method organizes the geometric information of vector data in GeoJSON form and encodes the data as key value pairs for storage; by designing an index structure based on a flattened R tree, spatially adjacent entities are stored in the same or adjacent storage nodes; R tree supported query processing of vector data is provided for the distributed non-relational database, and R tree query operations executed in parallel on multiple nodes can be performed by utilizing the distributed storage characteristic of the non-relational database, so that the distributed and high-concurrency characteristics of the non-relational database are fully utilized and query efficiency is greatly improved; updating and deleting vector data does not cause index errors, and the requirement of real-time data access is met.

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the invention.

Fig. 2 is a spatial distribution of a vector element MBR according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of the R tree structure corresponding to fig. 2.

Fig. 4 is a block diagram illustrating an example of a storage structure of vector data according to an embodiment of the present invention.

FIG. 5 shows a library table structure and association relationship including vector data and index structures according to an embodiment of the present invention.

FIG. 6 is a flow chart of vector data insertion and R-tree index construction.

Fig. 7 is a vector data query flow chart.

Fig. 8 is a vector data deletion flowchart.

Detailed Description

The invention will be further described with reference to specific examples and figures.

As shown in FIG. 1, the method for managing vector data of a non-relational database based on a flattened R tree provided by the invention specifically comprises the following steps:

101. In a non-relational database environment, an auxiliary index structure based on an R tree flattening strategy is designed for vector data.

Specifically, the embodiment of the invention designs a flattened R tree index storage scheme for an HBase database, wherein the scheme takes a unique identifier of an R tree node as a row key, and node information is stored in the database in a JSON nested format as each column in a column family E. Wherein, the spatial information of the vector data is organized in a GeoJSON format. Each row in the data table represents a node, and all nodes with the same index are stored in the same R tree index table, and the detailed structure is shown in the table 1-1.

Table 1-1 R tree index table structure

The various fields and type descriptions of the R-Tree index Table are shown in tables 1-2.

Table 1-2 R tree index table structure description

Illustratively, in the embodiment of the present invention, the vector data in a preset area are located in the same spatial domain, and their distribution is shown in fig. 2. The fan-out coefficient of the R tree is set to M = 3, the boundary range of each node is represented by a minimum bounding rectangle MBR, and the index items of the vector elements are stored in leaf nodes; the corresponding R tree structure is shown in fig. 3. Based on the R tree index table structure in table 1-2, the R tree in fig. 3 is expanded into a flattened document set, as shown in table 1-3, so that query operations on the tree can be completed by node ID navigation.

Table 1-3 Flattened storage of R tree nodes

The JSON detail structure is { ID, L, C, D }. Wherein the ID (OID) is a unique identifier of the index node, L (Level) is the number of layers in which the node is located in the tree, C (Count) is the number of child nodes owned by the node, D (Descendants) is a JSON nested structure, and the unique identifier of the child node owned by the node and the minimum bounding box are recorded.

D (Descendants) is defined as D { { P, M }, …, { P, M } }, where P (Pointer) points to the OID of the child node and M (MBR) is the minimum bounding rectangle of the child node, organized in GeoJSON form.
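The row layout described for this embodiment, with the node identifier as row key and the node information as columns of column family E, can be modeled as follows. Since the contents of tables 1-1 to 1-3 are not reproduced here, the column qualifiers `E:L`, `E:C` and `E:D` are assumptions of this sketch, as is the use of UTF-8 bytes for keys and values.

```python
import json


def node_to_row(node):
    """Map one index node to (row key, {column: value}) for column family E."""
    row_key = node["ID"].encode("utf-8")
    columns = {
        b"E:L": str(node["L"]).encode("utf-8"),
        b"E:C": str(node["C"]).encode("utf-8"),
        b"E:D": json.dumps(node["D"]).encode("utf-8"),  # nested JSON as one cell
    }
    return row_key, columns


def row_to_node(row_key, columns):
    """Rebuild the node from a stored row (the deserialization step)."""
    return {
        "ID": row_key.decode("utf-8"),
        "L": int(columns[b"E:L"].decode("utf-8")),
        "C": int(columns[b"E:C"].decode("utf-8")),
        "D": json.loads(columns[b"E:D"].decode("utf-8")),
    }


node = {"ID": "N1", "L": 1, "C": 2, "D": [
    {"P": "N3", "M": [0, 0, 2, 2]},
    {"P": "N4", "M": [3, 3, 5, 5]},
]}
row_key, columns = node_to_row(node)
restored = row_to_node(row_key, columns)
```

Keeping D in a single cell means that one row read returns every child pointer and MBR needed to decide which rows to fetch next.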

102. Establishing a library table structure comprising vector data and an index structure, wherein all data tables in the library table structure are associated through explicit association records and implicit naming rules.

The storage, organization and management of vector data in the non-relational database HBase in the embodiment of the present invention relate to various library tables, and other library table structure designs besides the R tree index table designed as described above are described as follows:

The vector metadata table stores vector metadata, describes each vector data set in the database in detail, and helps the indexing system filter out meaningless requests. The vector metadata table is named "vo_metadata", and its row key is the name of the data set (DatasetName). The metadata table contains two column families: the required column family (E) and the optional column family (F); user-defined fields are placed under column family F. The structure of the metadata table is shown in table 2-1.

Table 2-1 vector metadata table structure

The various fields and type descriptions of the vector metadata table are shown in table 2-2.

Table 2-2 vector metadata table structure description

The index metadata stored in the index metadata table describes the R tree spatial index. This information is referenced by the vector metadata, links the vector data table with the R tree index table, and, by recording the R tree parameters, determines the internal structure of the R tree nodes and the position of the node from which the algorithms start. The index metadata table is named "idx_metadata", and its row key is the name of the index table (IndexTableName); the detailed structure is shown in table 3-1.

Table 3-1 index metadata Table Structure

The various fields and type descriptions of the index metadata table are shown in table 3-2.

Table 3-2 Index metadata table structure description

The vector data table stores the original vector data. Fig. 4 shows the storage structure of one vector element in the vector data table in an embodiment of the present invention, in which the geometric information of the vector data is organized in GeoJSON form. Specifically, the geometric information is stored in a "GEOINFO" field, where the "type" field identifies the geometry type of the element and the "pivot" field stores an array of vertex coordinates of the geometric object. The non-geometric information of the element is stored in separate fields, such as the "NAME" field, which holds the name of the vector element. The detailed structure of the vector data table is shown in table 4-1.

Table 4-1 Vector data table structure

The various fields and types of the vector data table are described in Table 4-2.

Table 4-2 Vector data table structure description
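The storage of a vector element described above can be sketched in Python as follows. The "GEOINFO" and "NAME" fields come from the description; the `AREA` attribute and the helper names are hypothetical. Standard GeoJSON names the vertex array member `coordinates`, and the sketch uses that name.

```python
import json


def encode_element(name, geometry_type, coordinates, **attributes):
    """Encode one vector element as a flat record of string values."""
    record = {
        "GEOINFO": json.dumps({"type": geometry_type, "coordinates": coordinates}),
        "NAME": name,
    }
    # Remaining attribute information is stored as JSON, one field each.
    record.update({key: json.dumps(value) for key, value in attributes.items()})
    return record


def _points(coords):
    """Yield every [x, y] position from arbitrarily nested coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from _points(part)


def element_mbr(record):
    """Compute (min_x, min_y, max_x, max_y) from the stored geometry."""
    geometry = json.loads(record["GEOINFO"])
    xs, ys = zip(*_points(geometry["coordinates"]))
    return (min(xs), min(ys), max(xs), max(ys))


ring = [[[0, 0], [4, 0], [4, 3], [0, 3], [0, 0]]]
record = encode_element("Mall A", "Polygon", ring, AREA=3000)
box = element_mbr(record)
```

The MBR derived here is what the index item of the element would carry, while the record itself is what the fine filtering step reads back.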

Specifically, for the characteristics of the HBase database, a vector database mode supported by the R tree index is designed, as shown in fig. 5. The association rules of the vector data table, the R tree index table, the vector metadata table and the index metadata table are as follows:

The spatial domain name of the vector data table and the corresponding namespace are recorded in the vector metadata table; when one vector data set contains more than one spatial domain, a plurality of records are stored in the vector metadata table.

The R tree index table corresponding to each spatial domain is bound to its vector data table through a specific naming convention: the index table is named "Rtree_spatial domain name_namespace", and the data table is named "spatial domain name_namespace".

The index metadata information is associated with the vector metadata table by means of record IDs, while the root node in the R-tree index table is also associated with the index metadata table by means of record IDs.
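The implicit naming rule can be expressed as a pair of small helper functions. The helper names and the sample domain and namespace values are hypothetical; only the "Rtree_" prefix and the underscore-joined layout come from the description.

```python
INDEX_PREFIX = "Rtree_"


def data_table_name(domain, namespace):
    """Name of the vector data table for one spatial domain."""
    return f"{domain}_{namespace}"


def index_table_name(domain, namespace):
    """Name of the R tree index table bound to that data table."""
    return INDEX_PREFIX + data_table_name(domain, namespace)


def data_table_for(index_table):
    """Recover the data table name from an index table name."""
    if not index_table.startswith(INDEX_PREFIX):
        raise ValueError(f"not an R tree index table name: {index_table!r}")
    return index_table[len(INDEX_PREFIX):]


index_name = index_table_name("wuhan", "roads")
data_name = data_table_for(index_name)
```

Because the binding is derived from the names alone, no extra lookup is needed to move between an index table and its data table.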

103. Encoding the vector data and storing it in the database, organizing the geometric and attribute information in GeoJSON form and JSON form respectively, and constructing a flattened R tree index.

An exemplary flow chart of vector data insertion and R tree index construction according to an embodiment of the present invention is shown in fig. 6. First, taking a vector data set as a unit, all vector elements in the vector data set are encoded and written into the vector data table. It is then checked whether geometric metadata information of the spatial domain where the vector elements are located exists in the vector metadata table; if not, it is stored in the vector metadata table and the index metadata information is updated.

Illustratively, an index metadata ID is obtained, an R tree index table and its root node ID are obtained from the index metadata table, and navigation is performed to find the best inserted node. Wherein, it is necessary to determine whether the number of child nodes of the best inserted node exceeds a preset fan-out coefficient.

Specifically, starting from the root node, it is first judged whether the MBR of the current node contains the MBR of the vector element to be inserted; if not, the next node is examined in the same way until a node whose MBR contains the MBR of the vector element to be inserted is found, and it is then judged whether the MBRs of that node's child nodes contain it. The best insertion node is one whose own MBR contains the MBR of the vector element to be inserted while none of its child node MBRs does. After navigating to the target leaf node, it is judged whether the number of child nodes of the current node exceeds the preset fan-out coefficient; if not, the index item of the vector element is inserted; otherwise, the node is split using an R tree splitting strategy, ID node navigation is performed again, and the index item of the vector element is inserted into the best node after splitting. If the root node splits, the information of the root node is updated in the index metadata table. Finally, the R tree index items and the metadata table information are updated to complete the data insertion operation.
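The ID node navigation toward a target leaf can be sketched as a descent over the flattened node set. This sketch swaps in the common least-enlargement criterion: at each level it follows the child whose MBR needs the smallest area increase to cover the new element, so a child that already contains the element (zero enlargement) is preferred, and ties are broken by smaller area. That criterion, the four-number MBR lists and the convention that leaves have L = 0 are assumptions of the sketch, not statements of the described method.

```python
def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def cover(box, other):
    """Smallest rectangle covering both boxes."""
    return [min(box[0], other[0]), min(box[1], other[1]),
            max(box[2], other[2]), max(box[3], other[3])]


def enlargement(box, new_box):
    """Extra area needed for box to also cover new_box (0 if it contains it)."""
    return area(cover(box, new_box)) - area(box)


def choose_leaf(index, root_id, new_box):
    """Return the list of node IDs visited from the root down to a leaf."""
    path = [root_id]
    node = index[root_id]
    while node["L"] > 0:
        best = min(node["D"],
                   key=lambda e: (enlargement(e["M"], new_box), area(e["M"])))
        path.append(best["P"])       # follow the pointer by node ID
        node = index[best["P"]]
    return path


index = {
    "N0": {"ID": "N0", "L": 1, "C": 2, "D": [
        {"P": "N1", "M": [0, 0, 5, 5]},
        {"P": "N2", "M": [6, 6, 10, 10]},
    ]},
    "N1": {"ID": "N1", "L": 0, "C": 0, "D": []},
    "N2": {"ID": "N2", "L": 0, "C": 0, "D": []},
}
contained = choose_leaf(index, "N0", [7, 7, 8, 8])
between = choose_leaf(index, "N0", [5, 5, 6, 6])
```

The returned path also identifies the ancestors whose MBRs and counts must be rewritten once the entry has been inserted.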

The index information of the corresponding data is updated only after the data insertion operation is completed, in order to ensure the integrity of the content items. In a non-relational database, failures are the norm. If the index item were inserted first and the system went down after the index item was inserted, then after restarting the system would consider the data to have been stored in the database, and the data would be lost.

104. Providing query support for vector data: when a query request is received, the index metadata ID is determined according to the query condition, and then the R tree root node ID is obtained, so that the retrieval of vector data is executed in parallel based on an R tree index table, and finally a query result is returned.
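Parallel retrieval over the R tree index table can be sketched with a thread pool that fetches and evaluates all nodes of one level concurrently. The in-memory dictionary standing in for the index table, the four-number MBR lists, the convention that leaves have L = 0 and the default worker count are all assumptions of this sketch.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def intersects(a, b):
    """True if two rectangles [min_x, min_y, max_x, max_y] overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def expand(index_table, node_id, window):
    """Fetch one node, deserialize it, and return (is_leaf, matching pointers)."""
    node = json.loads(index_table[node_id])
    matched = [e["P"] for e in node["D"] if intersects(e["M"], window)]
    return node["L"] == 0, matched


def parallel_search(index_table, root_id, window, workers=4):
    hits, frontier = [], [root_id]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # All nodes of the current level are fetched and tested in parallel.
            results = list(pool.map(
                lambda node_id: expand(index_table, node_id, window), frontier))
            frontier = []
            for is_leaf, matched in results:
                (hits if is_leaf else frontier).extend(matched)
    return sorted(hits)


def row(oid, level, entries):
    return json.dumps({"ID": oid, "L": level, "C": len(entries),
                       "D": [{"P": p, "M": m} for p, m in entries]})


index_table = {
    "R": row("R", 2, [("A", [0, 0, 10, 10]), ("B", [20, 20, 30, 30])]),
    "A": row("A", 1, [("A1", [0, 0, 4, 4]), ("A2", [6, 6, 10, 10])]),
    "B": row("B", 1, [("B1", [20, 20, 30, 30])]),
    "A1": row("A1", 0, [("e1", [1, 1, 2, 2]), ("e2", [3, 3, 4, 4])]),
    "A2": row("A2", 0, [("e3", [6, 6, 7, 7])]),
    "B1": row("B1", 0, [("e4", [25, 25, 26, 26])]),
}
candidates = parallel_search(index_table, "R", [3.5, 3.5, 6.5, 6.5])
```

Flattening is what makes this possible: each child is an independent row addressed by ID, so sibling subtrees can be read by different threads, or by different storage nodes, without shared in-memory pointers.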

An exemplary vector data query flow chart provided in the embodiment of the present invention is shown in fig. 7, and includes the following steps:

Step 1: the user gives query conditions such as the data set name and the query range; in the embodiment of the invention, the query range is shown by the query frame in fig. 2, from which the query polygon area is obtained;

Step 2: querying the vector metadata table for geometric metadata information of the spatial domain; if the information does not exist, the query ends, indicating that no vector element in the data set satisfies the query conditions; if the information exists, the ID of the corresponding index metadata is acquired;

Step 3: querying the index metadata table, and acquiring the R tree index table corresponding to the query polygon area and the ID of its root node;

Step 4: querying the R tree index table, taking out the key value pair of the root node and deserializing it. From the geometric information recorded in GeoJSON it is determined that, in fig. 2, the query box intersects the MBR of child node N1 and is contained by the MBR of child node N2. According to the pointers from parent index items to child index nodes, the key value pairs corresponding to nodes N1 and N2 are taken out of the index table in parallel using multiple threads; after deserialization, the geometric relationships are further evaluated, showing that the query box intersects the MBRs of child nodes N4, N6 and N7. Recursively, the key value pairs corresponding to N4, N6 and N7 are taken out in parallel and deserialized, and parallel multithreaded evaluation of the geometric relationships shows that the MBRs of child nodes L10, L16 and L19 fall within the query box and the MBRs of child nodes L12 and L18 intersect the query box. Since the leaf nodes have now been reached, the set of leaf nodes satisfying the condition is {L10, L12, L16, L18, L19};

Step 5: extracting vector data from the vector data table according to the index items, and filtering the geometric information of the obtained data through fine query to obtain the vector data set satisfying the query conditions. In addition, the filtering process may also filter on attribute information, such as whether the floor area of a building is greater than 2500 m² or whether a name string contains the name of a certain mall.
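The fine filtering of step 5 can be sketched as a pair of predicates applied to the candidate records. The geometric test here is deliberately simplified: it accepts an element if any of its vertices lies inside a rectangular query window, which is not a complete intersection test. The `AREA` and `vertices` field names and the sample records are hypothetical.

```python
def vertex_in_window(record, window):
    """Simplified geometric check: any vertex inside [min_x, min_y, max_x, max_y]."""
    return any(window[0] <= x <= window[2] and window[1] <= y <= window[3]
               for x, y in record["vertices"])


def refine(candidates, geometry_ok, attribute_ok=lambda record: True):
    """Keep candidates that pass both the geometric and the attribute test."""
    return [r for r in candidates if geometry_ok(r) and attribute_ok(r)]


candidates = [
    {"NAME": "Central Mall", "AREA": 3200, "vertices": [(1, 1), (2, 2)]},
    {"NAME": "Corner Shop", "AREA": 120, "vertices": [(1, 1)]},
    {"NAME": "North Mall", "AREA": 5000, "vertices": [(9, 9)]},
]
window = [0, 0, 5, 5]

in_window = refine(candidates, lambda r: vertex_in_window(r, window))
large_malls = refine(
    candidates,
    lambda r: vertex_in_window(r, window),
    lambda r: r["AREA"] > 2500 and "Mall" in r["NAME"],
)
```

Running the inexpensive MBR test in the index and the exact tests only on the surviving candidates keeps the number of full records that must be read and evaluated small.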

105. Vector data is maintained, including updated and deleted.

Specifically, the updating of the data can be performed in real time through a pre-established R tree index based on a flattening strategy, and the implementation mode of the updating process is similar to that of the establishment process of the R tree index. When the vector data stored in the database is modified, the index data of the nodes affected by the modified data are updated at the same time, so that the effectiveness of the data is ensured.

An exemplary vector data deleting flow chart provided by the embodiment of the present invention is shown in fig. 8, and the vector data deleting flow corresponding to the leaf node L17 index item in the R tree includes the following steps:

Step 1: querying the vector metadata table for geometric metadata information of the spatial domain where the data to be deleted are located, and obtaining the corresponding index metadata information;

Step 2: querying the index metadata table, acquiring the R tree index table and its root node, and locating the index item associated with the vector element to be deleted according to the geometric relationship between the minimum bounding rectangle of the R tree node and the query frame;

Step 3: deleting the corresponding vector data in the vector data table;

Step 4: deleting the index item associated with the vector data in the R tree index table;

Step 5: updating the R tree index table and the index metadata table.

The invention provides a non-relational database vector data management method based on a flattened R tree, oriented to novel non-relational databases. By establishing an R tree index based on a flattening strategy, the method provides R tree supported query processing of vector data for a distributed non-relational database, and can use the distributed storage characteristic of the non-relational database to perform R tree query operations executed in parallel on multiple nodes. It thereby supports the organization and management of large-scale vector data in non-relational databases, and allows vector data types to benefit from the mass storage, parallel computing, high availability and high reliability of the non-relational database.

The above embodiments merely illustrate the design concept and features of the present invention and are intended to enable those skilled in the art to understand and implement it; the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas of the present invention fall within the scope of the present invention.

Claims

1. A method for vector data management in a non-relational database, characterized by: the method comprises the following steps:

S5, maintaining vector data in the non-relational database, including updating and deleting;

in the step S1, the specific design steps of the auxiliary index structure based on the R tree flattening strategy are as follows:

1.1, abstracting a vector object into a minimum bounding rectangle MBR, recursively combining MBRs with adjacent spatial positions into an MBR of a higher level, and finally forming a layered tree structure based on the minimum bounding rectangle;

1.3, setting a fan-out coefficient M of the R tree, and constraining the number of child nodes of every R tree node other than the root node to lie within the interval [2, M];

in the step S2, the library table structure is designed as follows:

2.3, establishing association relations among the vector data table, the R tree index table, the vector metadata table and the index metadata table, wherein each vector data set corresponds to one vector data table and one R tree index table, and metadata description is carried out in the vector metadata table and the index metadata table respectively;

the step S3 specifically comprises the following steps:

2. The vector data management method according to claim 1, characterized in that: in the R tree nodes, the record format of a leaf node is <OID, MBR>, and the record format of an intermediate node is <OID, Pointer, MBR>, where OID is the unique identifier of the node, Pointer points to its child node, and MBR is the minimum bounding rectangle.

3. The vector data management method according to claim 1, characterized in that: in 3.3, the specific modes of the ID node navigation and the R tree index table updating comprise the following steps:

4. The vector data management method according to claim 1, characterized in that: the structure of the JSON is { ID, L, C, D }; wherein ID is the unique identifier of the index node, namely the OID; L is the level at which the node is located in the tree; C is the number of child nodes owned by the node; D is a JSON nested structure that records the unique identifier and the minimum bounding rectangle of each child node owned by the node;

the detailed structure of D is D { { P, M }, …, { P, M } }, where P is an abbreviation of Pointer and points to the OID of the child node; M is an abbreviation of the minimum bounding rectangle MBR of the child node, organized in GeoJSON form.

5. The vector data management method according to claim 1, characterized in that: the step S4 specifically comprises the following steps:

6. The vector data management method according to claim 5, wherein: in 4.4, the specific way of querying the R tree index table includes the following steps:

7. The vector data management method according to claim 1, characterized in that: in the step S5, the process of deleting vector data includes:

5.3, deleting the corresponding vector data in the vector data table;