CN111639075A

CN111639075A - Non-relational database vector data management method based on flattened R tree

Info

Publication number: CN111639075A
Application number: CN202010387252.XA
Authority: CN
Inventors: 向隆刚; 王越
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2020-05-09
Filing date: 2020-05-09
Publication date: 2020-09-08
Anticipated expiration: 2040-05-09
Also published as: CN111639075B

Abstract

The invention provides a method for managing vector data in a non-relational database. Targeting distributed non-relational databases, the method designs an index structure for vector data based on an R-tree flattening strategy; establishes a set of interrelated database tables that hold the vector data and the index structure; encodes vector data into the database while building the flattened R-tree index; provides a spatial query processing algorithm for vector data based on the flattened R-tree; and maintains the vector data in the non-relational database, including updates and deletions. By building an R-tree index with a flattening strategy, the invention gives the non-relational database R-tree-backed query processing for vector data and supports the organization and management of large-scale vector data, so that vector data types benefit from the database's mass storage, parallel computing, high availability, and high reliability.

Description

A non-relational database vector data management method based on flattened R tree

Technical Field

The invention belongs to the technical field of databases, and in particular relates to a vector data management method in a non-relational database.

Background

More than 85% of real-world data is related to geographic location. According to a McKinsey Global Institute report, the global volume of geospatial data exceeded 6000 PB in 2016 and continues to grow at the petabyte scale every year. Compared with raster data, which has a simple structure, vector data is structurally complex and carries most spatial analysis and spatial query workloads. Heterogeneous, unstructured vector data is difficult to manage with a traditional relational database, and relational databases also face scalability problems that are hard to overcome when datasets are massive and keep growing.

Non-relational databases follow the CAP theorem and the BASE principle. They relax transactional guarantees while emphasizing schema freedom, read/write efficiency, and horizontal scalability, and they provide efficient random access, multi-format data storage, and highly concurrent reads and writes. Their scalability and computing power offer a new approach to this problem. Non-relational database systems usually store data in a key-value model and automatically index the key to keep queries efficient. Secondary indexes can additionally be built to enrich the database's query capabilities.

The purpose of a spatial index is to improve query efficiency. Traditional spatial indexes were not designed for distributed environments; when they are used to store and manage massive vector data, data organization is difficult and real-time query requirements are hard to meet. Meanwhile, the native spatial indexes of non-relational databases support vector data poorly. Taking MongoDB as an example, the 2d index and the 2dsphere index are the two spatial indexes it supports natively. The 2d index supports only point features, while the 2dsphere index does not support planar coordinate data and adapts poorly to the data, so neither is well suited to query processing over vector data.

Consequently, no suitable index is available when a non-relational database is used to manage massive vector data. With the native spatial indexes, the database cannot organize and manage the data efficiently, and the high-concurrency advantage of non-relational databases is hard to exploit.

Summary of the Invention

The technical problem to be solved by the invention is to provide a vector data management method in a non-relational database that gives distributed non-relational databases the ability to organize and manage massive vector data.

The technical solution adopted by the invention to solve the above problem is a vector data management method in a non-relational database, comprising the following steps:

S1. In a non-relational database environment, design an auxiliary index structure for vector data based on an R-tree flattening strategy;

S2. Establish a database table structure that includes the vector data and the index structure, in which the tables are associated with one another through explicit association records and implicit naming rules;

S3. Encode the vector data and store it in the database, organizing geometry and attribute information as GeoJSON and JSON respectively, and build the flattened R-tree index at the same time;

S4. When a query request is received, determine the index metadata ID from the query conditions, obtain the R-tree root node ID from it, retrieve the vector data in parallel based on the R-tree index table, and return the query result;

S5. Maintain the vector data in the non-relational database, including updates and deletions.

According to the above method, the auxiliary index structure based on the R-tree flattening strategy in S1 is designed as follows:

1.1. Abstract each vector object as its minimum bounding rectangle (MBR); spatially adjacent MBRs are recursively merged into higher-level MBRs, finally forming a hierarchical tree structure based on minimum bounding rectangles;

1.2. Expand the R-tree index structure into a flattened set of index nodes: each index node is expressed as a JSON structure, and a node's unique identifier serves as the pointer from a parent index entry to the child index node;

1.3. Set the fan-out coefficient M of the R-tree and require that every R-tree node other than the root has a number of children in the interval [2, M].
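Steps 1.1 and 1.3 can be illustrated with a minimal Python sketch. The tuple layout `(min_x, min_y, max_x, max_y)` and all helper names are assumptions made for illustration; the patent does not prescribe them. The sketch also assumes that the root is exempt only from the lower bound of the interval.

```python
from typing import Iterable, Sequence, Tuple

# Assumed layout: (min_x, min_y, max_x, max_y)
MBR = Tuple[float, float, float, float]


def mbr_of_points(points: Sequence[Tuple[float, float]]) -> MBR:
    """Minimum bounding rectangle of a vector object's vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def merge_mbrs(mbrs: Iterable[MBR]) -> MBR:
    """Higher-level MBR that encloses a group of neighbouring MBRs."""
    boxes = list(mbrs)
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def child_count_ok(count: int, fan_out: int, is_root: bool = False) -> bool:
    """Step 1.3: a non-root node holds between 2 and M children.

    Assumption: the root may hold fewer than 2 children but never more than M.
    """
    return count <= fan_out and (is_root or count >= 2)


# Two hypothetical vector objects and the parent rectangle that encloses both.
road = mbr_of_points([(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)])
lake = mbr_of_points([(4.0, 4.0), (6.0, 5.0)])
parent = merge_mbrs([road, lake])
```

Merging is applied recursively, level by level, until a single root rectangle remains.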

According to the above method, the record format of an R-tree leaf node is <OID, MBR>, and the record format of an intermediate node is <OID, Pointer, MBR>, where OID is the unique identifier of the node, Pointer holds the OID of a child node, and MBR is the minimum bounding rectangle.

According to the above method, the database table structure in S2 is designed as follows:

2.1. Manage multi-source, heterogeneous vector data as datasets, each vector dataset grouping logically related vector data of the same type;

2.2. Design a vector data table, an R-tree index table, a vector metadata table, and an index metadata table, which store the vector features of a vector dataset, its index structure, and the metadata of each, respectively;

2.3. Establish the associations among these four kinds of tables: each vector dataset corresponds to one vector data table and one R-tree index table, which are described in the vector metadata table and the index metadata table respectively.

According to the above method, S3 specifically comprises:

3.1. Taking the vector dataset as the unit, encode all of its vector features and write them into the vector data table, organizing geometry and attribute information as GeoJSON and JSON respectively; GeoJSON is a geospatial data interchange format based on JavaScript Object Notation;

3.2. Query the vector metadata table for the geometric metadata of the spatial domain in which the vector features lie, and obtain the corresponding index metadata ID;

3.3. Obtain the R-tree index table and its root node ID from the index metadata table; based on the geometric relationship between the feature's minimum bounding rectangle and the bounding rectangles of the index entries, navigate to the target leaf node of the R-tree index table, insert the index entry for the vector feature, and update the R-tree index table;

3.4. After the vector dataset has been written, update the vector metadata table and the index metadata table.

According to the above method, the ID-based node navigation and the update of the R-tree index table in 3.3 comprise the following steps:

3.3.1. Based on the geometric relationship between the feature's minimum bounding rectangle and the bounding rectangles of the index entries, use ID-based node navigation to find the best insertion node, and determine whether its number of children exceeds the configured fan-out coefficient; if it does, go to step 3.3.2, otherwise go to step 3.3.3;

3.3.2. Split the node: divide it evenly into two new nodes with the R-tree node splitting algorithm, then navigate again to find the best insertion node;

3.3.3. Insert the index entry of the vector feature into the node and update the node;

3.3.4. If the root node was split, update the root node information in the index metadata table.

According to the above method, the JSON structure is {ID, L, C, D}, where ID is the unique identifier of the index node (its OID); L (Level) is the level of the node in the tree; C (Count) is the number of children the node has; and D (Descendants) is a nested JSON structure recording the unique identifier and minimum bounding rectangle of each child. In detail, D has the form D:{{P,M},…,{P,M}}, where P (Pointer) holds the OID of a child node and M (MBR) is that child's minimum bounding rectangle, organized as GeoJSON.
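The {ID, L, C, D} layout can be sketched in Python as follows. This is a minimal illustration, not the patented encoding: it assumes that D is serialized as a JSON list of {P, M} objects (plain JSON objects need keys, so a list is the closest valid form) and that each MBR is written as a closed GeoJSON Polygon ring. The helper names and node IDs are hypothetical.

```python
import json


def mbr_to_geojson(mbr):
    """Encode (min_x, min_y, max_x, max_y) as a closed GeoJSON Polygon ring."""
    min_x, min_y, max_x, max_y = mbr
    ring = [[min_x, min_y], [max_x, min_y], [max_x, max_y],
            [min_x, max_y], [min_x, min_y]]
    return {"type": "Polygon", "coordinates": [ring]}


def make_node(oid, level, children):
    """Build one index node; children is a list of (child_oid, mbr) pairs."""
    return {
        "ID": oid,
        "L": level,
        "C": len(children),
        "D": [{"P": child, "M": mbr_to_geojson(mbr)} for child, mbr in children],
    }


# Flattened storage: one key-value pair per node, keyed by the node's OID.
store = {}
for node in (
    make_node("N1", 1, [("L1", (0, 0, 1, 1)), ("L2", (2, 2, 3, 3))]),
    make_node("ROOT", 2, [("N1", (0, 0, 3, 3))]),
):
    store[node["ID"]] = json.dumps(node)

# Following a pointer is a plain key lookup, not an in-memory reference.
root = json.loads(store["ROOT"])
child = json.loads(store[root["D"][0]["P"]])
```

Because every pointer is an OID, a tree traversal turns into a sequence of key-value reads, which is what lets the tree live in a key-value store.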

According to the above method, S4 specifically comprises:

4.1. The user supplies query conditions such as the dataset name and the query range;

4.2. Based on the given query conditions, query the vector metadata table for the geometric metadata of the spatial domain and obtain the ID of the corresponding index metadata;

4.3. Obtain the R-tree index table and the ID of its root node from the index metadata table;

4.4. Query the R-tree index table and retrieve the index entries of the vector features that satisfy the query conditions;

4.5. Fetch the vector data from the vector data table and apply fine filtering to obtain the final query result.
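The fine filtering of step 4.5 can be sketched as follows. The R-tree compares bounding rectangles only, so a candidate may fall inside the bounding rectangle of the query region yet outside the region itself. This sketch assumes point features, a single-ring query polygon, and a hypothetical `AREA` attribute column; none of these names come from the patent.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices without a repeated end point."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def refine(candidates, query_ring, attribute_test=None):
    """Keep candidates that truly lie in the query polygon and pass the attribute test."""
    result = []
    for row in candidates:
        x, y = row["GEOINFO"]["coordinates"]
        if not point_in_polygon(x, y, query_ring):
            continue
        if attribute_test is not None and not attribute_test(row):
            continue
        result.append(row["ID"])
    return result


# Triangular query region; its bounding rectangle is (0, 0, 10, 10).
triangle = [(0, 0), (10, 0), (0, 10)]
candidates = [
    {"ID": "a", "GEOINFO": {"type": "Point", "coordinates": [2, 2]}, "AREA": 3000},
    {"ID": "b", "GEOINFO": {"type": "Point", "coordinates": [8, 8]}, "AREA": 4000},
    {"ID": "c", "GEOINFO": {"type": "Point", "coordinates": [1, 1]}, "AREA": 1200},
]

by_geometry = refine(candidates, triangle)
by_both = refine(candidates, triangle, lambda row: row["AREA"] > 2500)
```

Candidate `b` passes a rectangle test against the query region's bounding box but is rejected by the exact geometry test, which is the reason a refinement pass is needed after the index lookup.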

According to the above method, querying the R-tree index table in 4.4 comprises the following steps:

4.4.1. Use the root node ID to fetch the corresponding key-value pair from the R-tree index table and deserialize it;

4.4.2. Using the geometry recorded in GeoJSON, determine the relationship between each child's MBR and the query range, find the children whose MBR intersects or lies within the query range, and follow the pointers from the parent index entries to fetch and deserialize the corresponding child key-value pairs;

4.4.3. Repeat step 4.4.2 until the leaf nodes of the R-tree are reached.
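Steps 4.4.1 to 4.4.3 amount to a traversal in which every hop is a key lookup followed by deserialization. The sketch below is a single-threaded illustration under several assumptions: a Python dict stands in for the R-tree index table, each MBR is a plain `[min_x, min_y, max_x, max_y]` list rather than GeoJSON, and a node with `L == 0` is treated as the leaf level whose pointers are feature row keys. All IDs and coordinates are made up.

```python
import json


def intersects(a, b):
    """True if rectangles a and b (min_x, min_y, max_x, max_y) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_query(store, root_id, query):
    """Return the feature keys whose MBR intersects the query rectangle."""
    hits = []
    pending = [root_id]                             # node OIDs still to visit
    while pending:
        node = json.loads(store[pending.pop()])     # fetch by ID, then deserialize
        for entry in node["D"]:
            if not intersects(entry["M"], query):
                continue                            # prune this subtree
            if node["L"] == 0:
                hits.append(entry["P"])             # leaf level: P is a feature key
            else:
                pending.append(entry["P"])          # inner level: P is a child OID
    return sorted(hits)


store = {
    "root": json.dumps({"ID": "root", "L": 1, "C": 2, "D": [
        {"P": "n1", "M": [0, 0, 4, 4]},
        {"P": "n2", "M": [5, 5, 9, 9]}]}),
    "n1": json.dumps({"ID": "n1", "L": 0, "C": 2, "D": [
        {"P": "f1", "M": [0, 0, 1, 1]},
        {"P": "f2", "M": [3, 3, 4, 4]}]}),
    "n2": json.dumps({"ID": "n2", "L": 0, "C": 2, "D": [
        {"P": "f3", "M": [5, 5, 6, 6]},
        {"P": "f4", "M": [8, 8, 9, 9]}]}),
}

result = range_query(store, "root", (3.5, 3.5, 5.5, 5.5))
```

The result is only a candidate set: it contains every feature whose rectangle touches the query range, and exact geometry is checked afterwards against the vector data table.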

According to the above method, deleting vector data in S5 comprises:

5.1. Query the vector metadata table for the geometric metadata of the spatial domain in which the data to be deleted lies, and obtain the corresponding index metadata;

5.2. Obtain the R-tree index table and its root node from the index metadata table and, based on the geometric relationship between the minimum bounding rectangles of the R-tree nodes and the query box, locate the index entry associated with the vector feature to be deleted;

5.3. Delete the corresponding vector data from the vector data table;

5.4. Delete the index entry associated with that vector data from the R-tree index table.
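Steps 5.2 to 5.4 can be sketched as below. Dicts stand in for the vector data table and the R-tree index table, MBRs are plain lists, and `L == 0` marks the leaf level; these are illustrative assumptions. The sketch deliberately omits what a full R-tree deletion would also do, namely tightening ancestor MBRs and handling nodes that drop below two children.

```python
import json


def contains(outer, inner):
    """True if rectangle outer fully encloses rectangle inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def find_holder(index, node_id, feature_id, feature_mbr):
    """Return the OID of the leaf-level node that holds the feature's index entry."""
    node = json.loads(index[node_id])
    for entry in node["D"]:
        if not contains(entry["M"], feature_mbr):
            continue
        if node["L"] == 0:
            if entry["P"] == feature_id:
                return node_id
        else:
            found = find_holder(index, entry["P"], feature_id, feature_mbr)
            if found is not None:
                return found
    return None


def delete_feature(data, index, root_id, feature_id, feature_mbr):
    holder = find_holder(index, root_id, feature_id, feature_mbr)  # step 5.2
    if holder is None:
        return False
    del data[feature_id]                                           # step 5.3
    node = json.loads(index[holder])                               # step 5.4
    node["D"] = [e for e in node["D"] if e["P"] != feature_id]
    node["C"] = len(node["D"])
    index[holder] = json.dumps(node)
    return True


data = {"f1": {"NAME": "a"}, "f2": {"NAME": "b"}}
index = {
    "root": json.dumps({"ID": "root", "L": 1, "C": 1, "D": [
        {"P": "n1", "M": [0, 0, 3, 3]}]}),
    "n1": json.dumps({"ID": "n1", "L": 0, "C": 2, "D": [
        {"P": "f1", "M": [0, 0, 1, 1]},
        {"P": "f2", "M": [2, 2, 3, 3]}]}),
}

first = delete_feature(data, index, "root", "f2", [2, 2, 3, 3])
second = delete_feature(data, index, "root", "f2", [2, 2, 3, 3])
```

The second call finds no matching entry and leaves both tables untouched.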

According to the above scheme, the index information for a piece of data is updated only after the data itself has been inserted, in order to preserve the integrity of content entries. In a non-relational database, failures are the norm. If the index entry were inserted first and the system crashed right afterwards, the system would assume after restart that the data had already been stored in the database, resulting in data loss.

The beneficial effects of the invention are as follows. The method stores the geometric information of vector data as GeoJSON and encodes the data into the database as key-value pairs. The index structure based on the flattened R-tree keeps spatially adjacent entities in the same or adjacent storage nodes. The method gives distributed non-relational databases R-tree-backed query processing for vector data and can use their distributed storage to execute R-tree queries in parallel across multiple nodes, which exploits the distributed, highly concurrent nature of non-relational databases and greatly improves query efficiency. Updating or deleting vector data does not introduce index errors, which meets the requirement for real-time data access.

Description of the Drawings

FIG. 1 is a flowchart of the method according to an embodiment of the invention.

FIG. 2 shows the spatial distribution of the MBRs of vector features according to an embodiment of the invention.

FIG. 3 is a schematic diagram of the R-tree structure corresponding to FIG. 2.

FIG. 4 is an example of the storage structure of vector data in an embodiment of the invention.

FIG. 5 shows the database table structure, including vector data and index structure, and the associations among the tables established in an embodiment of the invention.

FIG. 6 is a flowchart of vector data insertion and R-tree index construction.

FIG. 7 is a flowchart of vector data query.

FIG. 8 is a flowchart of vector data deletion.

Detailed Description

The invention is further described below with reference to specific examples and the accompanying drawings.

As shown in FIG. 1, the invention provides a non-relational database vector data management method based on a flattened R-tree, which specifically comprises:

101. In a non-relational database environment, design an auxiliary index structure for vector data based on the R-tree flattening strategy.

Specifically, this embodiment designs a flattened R-tree index storage scheme for the HBase database. The scheme uses the unique identifier of each R-tree node as the row key and stores the node information, in nested JSON format, as columns of column family E. The spatial information of the vector data is organized in GeoJSON format. Each row of the table represents one node, and all nodes of the same index are stored in the same R-tree index table. The detailed structure is shown in Table 1-1.

Table 1-1 R-tree index table structure

The fields and types of the R-tree index table are described in Table 1-2.

Table 1-2 Description of the R-tree index table structure

For example, in this embodiment the vector data within a preset area lie in the same spatial domain and are distributed as shown in FIG. 2. The fan-out coefficient of the R-tree is set to M = 3, the extent of each node is represented by its minimum bounding rectangle (MBR), and the leaf nodes store the index entries for the vector features; the corresponding R-tree structure is shown in FIG. 3. Based on the R-tree index table structure in Table 1-2, the R-tree in FIG. 3 is expanded into a flattened document collection, as shown in Table 1-3, so that tree queries can be carried out by node-ID navigation.

Table 1-3 Flattened storage of R-tree nodes

The detailed JSON structure is {ID, L, C, D}, where ID (OID) is the unique identifier of the index node, L (Level) is the level of the node in the tree, C (Count) is the number of children the node has, and D (Descendants) is a nested JSON structure recording the unique identifier and minimum bounding rectangle of each child.

In detail, D (Descendants) has the form D:{{P,M},…,{P,M}}, where P (Pointer) holds the OID of a child node and M (MBR) is that child's minimum bounding rectangle, organized as GeoJSON.

102. Establish a database table structure that includes the vector data and the index structure, in which the tables are associated with one another through explicit association records and implicit naming rules.

For example, in this embodiment the storage, organization, and management of vector data in the non-relational database HBase involve several kinds of tables. In addition to the R-tree index table designed above, the other tables are designed and described as follows:

The vector metadata table stores vector metadata, which describes each vector dataset in the database in detail and helps the index system filter out meaningless requests. The vector metadata table is named "VO_METADATA", and its row key is the dataset name (DatasetName). It contains two column families: the required column family (E) and the optional column family (F). Other user-defined fields are placed under column family F. Its structure is shown in Table 2-1.

Table 2-1 Vector metadata table structure

The fields and types of the vector metadata table are described in Table 2-2.

Table 2-2 Description of the vector metadata table structure

The index metadata stored in the index metadata table describes the R-tree spatial index. It is referenced by the vector metadata and links the vector data table to the R-tree index table; by recording the R-tree parameters, it determines the internal structure of the R-tree nodes and the node at which the algorithms start. The index metadata table is named "IDX_METADATA", and its row key is the name of the index table (IndexTableName). Its detailed structure is shown in Table 3-1.

Table 3-1 Index metadata table structure

The fields and types of the index metadata table are described in Table 3-2.

Table 3-2 Description of the index metadata table structure

The vector data table stores the original vector data. FIG. 4 shows how a vector feature is stored in the vector data table in this embodiment; the geometric information is organized as GeoJSON. Specifically, the geometry is stored in the "GEOINFO" field, in which the "type" field identifies the geometry type of the feature and the "coordinate" field holds the array of vertex coordinates of the geometric object. Non-geometric information is stored in separate fields, such as the "NAME" field, which holds the feature's name. The detailed structure of the vector data table is shown in Table 4-1.
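A stored feature of this kind might look like the following Python sketch, which also derives the feature's MBR from its geometry. The field names `GEOINFO` and `NAME` come from the text; the coordinate values and the helper name are made up. Note that the text above names the member "coordinate", whereas the GeoJSON specification (RFC 7946) spells it `coordinates`; the sketch follows the specification.

```python
import json

# One row of the vector data table, with made-up sample values.
feature_row = {
    "GEOINFO": json.dumps({
        "type": "Polygon",
        "coordinates": [[
            [114.35, 30.52], [114.37, 30.52], [114.37, 30.54],
            [114.35, 30.54], [114.35, 30.52],
        ]],
    }),
    "NAME": "Example Building",
}


def mbr_of_geometry(geometry):
    """MBR (min_x, min_y, max_x, max_y) of a GeoJSON geometry with a coordinates member."""
    def positions(coords):
        # A position is a list that starts with a number; anything else is nested.
        if isinstance(coords[0], (int, float)):
            yield coords
        else:
            for part in coords:
                yield from positions(part)

    points = list(positions(geometry["coordinates"]))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


building_mbr = mbr_of_geometry(json.loads(feature_row["GEOINFO"]))
```

The resulting rectangle is what the index entry for this feature would carry in the R-tree index table.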

Table 4-1 Vector data table structure

The fields and types of the vector data table are described in Table 4-2.

Table 4-2 Description of the vector data table structure

Specifically, a vector database schema supported by the R-tree index is designed around the characteristics of HBase, as shown in FIG. 5. The vector data table, R-tree index table, vector metadata table, and index metadata table are associated according to the following rules:

The vector metadata table records the spatial domain name of each vector data table and its corresponding namespace. When a vector dataset spans more than one spatial domain, multiple records are stored in the vector metadata table.

The R-tree index table of each spatial domain is bound to its vector data table through a naming convention: the index table is named "Rtree_<spatial domain name>_<namespace>", and the data table is named "<spatial domain name>_<namespace>".

The index metadata is associated with the vector metadata table by recording its ID, and the root node of the R-tree index table is likewise associated with the index metadata table by recording its ID.
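The association rules above combine an implicit link (the naming convention) with explicit links (recorded IDs). A minimal Python sketch of both follows. The table names `VO_METADATA` and `IDX_METADATA` and the naming pattern come from the text; the column names `DOMAIN`, `NAMESPACE`, `INDEX_ID`, `ROOT_ID`, and `FAN_OUT`, as well as the sample values, are hypothetical, because the table definitions are not reproduced here.

```python
def data_table_name(domain: str, namespace: str) -> str:
    """Implicit rule: <spatial domain name>_<namespace>."""
    return f"{domain}_{namespace}"


def index_table_name(domain: str, namespace: str) -> str:
    """Implicit rule: Rtree_<spatial domain name>_<namespace>."""
    return f"Rtree_{domain}_{namespace}"


# Hypothetical rows; dicts stand in for the two metadata tables.
VO_METADATA = {
    "roads": {"DOMAIN": "wuhan", "NAMESPACE": "demo", "INDEX_ID": "Rtree_wuhan_demo"},
}
IDX_METADATA = {
    "Rtree_wuhan_demo": {"ROOT_ID": "N0", "FAN_OUT": 3},
}


def resolve(dataset: str):
    """Follow dataset name -> vector metadata -> index metadata -> root node ID."""
    meta = VO_METADATA.get(dataset)
    if meta is None:
        return None  # unknown dataset: the request can be rejected early
    index_id = meta["INDEX_ID"]
    return {
        "data_table": data_table_name(meta["DOMAIN"], meta["NAMESPACE"]),
        "index_table": index_id,
        "root_id": IDX_METADATA[index_id]["ROOT_ID"],
    }


resolved = resolve("roads")
```

Every insert, query, and delete begins with this lookup chain, which is why the root node ID must be rewritten in the index metadata whenever the root is split.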

103. Encode the vector data and store it in the database, organizing geometry and attribute information as GeoJSON and JSON respectively, and build the flattened R-tree index at the same time.

For example, FIG. 6 shows the flow of vector data insertion and R-tree index construction in this embodiment. First, taking the vector dataset as the unit, all of its vector features are encoded and written into the vector data table. Then the vector metadata table is checked for the geometric metadata of the spatial domain in which the features lie; if it does not exist, the geometric metadata is stored in the vector metadata table and the index metadata is updated at the same time.

Next, the index metadata ID is obtained, the R-tree index table and its root node ID are fetched from the index metadata table, and navigation locates the best insertion node. It must then be determined whether the number of children of the best insertion node exceeds the preset fan-out coefficient.

Specifically, starting from the root node, first determine whether the MBR of the current node contains the MBR of the feature to be inserted. If not, move on to the next node, until a node whose MBR contains the feature's MBR is found; then determine whether any of that node's child MBRs contains the feature's MBR. The best insertion node is one whose own MBR contains the MBR of the feature to be inserted while none of its child MBRs does. After navigating to the target leaf node, determine whether its number of children exceeds the preset fan-out coefficient. If not, insert the index entry for the vector feature; otherwise, split the node with the R-tree splitting strategy, perform ID-based node navigation again, and insert the index entry into the best node after the split. If the root node is split, update the root node information in the index metadata table. Finally, update the R-tree index entries and the metadata tables to complete the insertion.
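The navigation and insertion just described can be sketched as follows. This is not the patented procedure verbatim: when no child MBR contains the new rectangle, the sketch falls back to the classic least-enlargement rule from Guttman's R-tree, which is an assumption, and it reports a full leaf instead of splitting it. Dicts stand in for the index table, MBRs are plain lists, `L == 0` marks the leaf level, and a leaf counts as full when adding one more entry would exceed the fan-out.

```python
import json


def area(m):
    return (m[2] - m[0]) * (m[3] - m[1])


def union(a, b):
    return [min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])]


def choose_child(entries, new_mbr):
    """Pick the entry needing the least enlargement (zero if it already contains new_mbr)."""
    return min(entries, key=lambda e: (area(union(e["M"], new_mbr)) - area(e["M"]),
                                       area(e["M"])))


def insert_entry(store, root_id, feature_id, new_mbr, fan_out):
    """Insert an index entry; return the leaf OID, or None if a split is required."""
    node = json.loads(store[root_id])
    path = []                                   # (parent node, chosen entry), root first
    while node["L"] > 0:
        entry = choose_child(node["D"], new_mbr)
        path.append((node, entry))
        node = json.loads(store[entry["P"]])    # hop to the child by its OID
    if node["C"] >= fan_out:
        return None                             # leaf is full: split, then navigate again
    node["D"].append({"P": feature_id, "M": list(new_mbr)})
    node["C"] = len(node["D"])
    store[node["ID"]] = json.dumps(node)
    for parent, entry in path:                  # grow ancestor rectangles to cover new_mbr
        entry["M"] = union(entry["M"], new_mbr)
        store[parent["ID"]] = json.dumps(parent)
    return node["ID"]


store = {
    "root": json.dumps({"ID": "root", "L": 1, "C": 2, "D": [
        {"P": "n1", "M": [0, 0, 4, 4]}, {"P": "n2", "M": [6, 6, 9, 9]}]}),
    "n1": json.dumps({"ID": "n1", "L": 0, "C": 2, "D": [
        {"P": "f1", "M": [0, 0, 1, 1]}, {"P": "f2", "M": [3, 3, 4, 4]}]}),
    "n2": json.dumps({"ID": "n2", "L": 0, "C": 3, "D": [
        {"P": "f3", "M": [6, 6, 7, 7]}, {"P": "f4", "M": [8, 8, 9, 9]},
        {"P": "f5", "M": [7, 7, 8, 8]}]}),
}

placed = insert_entry(store, "root", "f6", (4, 4, 5, 5), fan_out=3)
rejected = insert_entry(store, "root", "f7", (8.5, 8.5, 8.8, 8.8), fan_out=3)
```

In the sample tree the first insertion lands in `n1` and widens the root's entry for `n1`, while the second targets the already full `n2` and therefore signals that a split is needed.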

The index information for a piece of data is updated only after the data itself has been inserted, in order to preserve the integrity of content entries. In a non-relational database, failures are the norm. If the index entry were inserted first and the system crashed right afterwards, the system would assume after restart that the data had already been stored in the database, resulting in data loss.
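The effect of this write ordering can be shown with a small simulation. The sketch below is illustrative only: two dicts stand in for the vector data table and the index, the index is simplified to a flat mapping from feature ID to MBR, and the crash is simulated with an exception. All names are hypothetical.

```python
class SimulatedCrash(Exception):
    """Stands in for a node failure between the two writes."""


def insert_feature(data_table, index_table, fid, feature, mbr, crash_between=False):
    data_table[fid] = feature       # 1. persist the vector data first
    if crash_between:
        raise SimulatedCrash(fid)
    index_table[fid] = mbr          # 2. publish the index entry afterwards


def dangling_entries(data_table, index_table):
    """Index entries that point at data which was never stored."""
    return sorted(k for k in index_table if k not in data_table)


def unindexed_rows(data_table, index_table):
    """Stored rows that have no index entry yet; these can be re-indexed later."""
    return sorted(k for k in data_table if k not in index_table)


data, index = {}, {}
insert_feature(data, index, "f1", {"NAME": "a"}, [0, 0, 1, 1])
try:
    insert_feature(data, index, "f2", {"NAME": "b"}, [2, 2, 3, 3], crash_between=True)
except SimulatedCrash:
    pass
```

After the simulated crash the index contains no entry that points at missing data; the worst case is a stored row that still needs indexing, which is recoverable, whereas the reverse order would leave an index entry claiming that data exists when it does not.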

104、提供对矢量数据的查询支持：当收到查询请求时，根据查询条件确定索引元数据ID，进而获取R树根节点ID，从而基于R树索引表，并行执行矢量数据的检索，最终返回查询结果。104. Provide query support for vector data: When a query request is received, the index metadata ID is determined according to the query conditions, and then the R-tree root node ID is obtained, so that the vector data retrieval is performed in parallel based on the R-tree index table, and finally returned search result.

示例性的，本发明实施例提出的一种矢量数据查询流程图如图7所示，包括如下步骤：Exemplarily, a vector data query flowchart proposed by an embodiment of the present invention is shown in FIG. 7 , including the following steps:

步骤1：用户给定数据集名称、查询范围等查询条件，本发明实施例中对应的查询范围如图2中查询框所示，获取查询多边形区域；Step 1: The user specifies query conditions such as the name of the dataset and the query range, and the corresponding query range in the embodiment of the present invention is shown in the query box in FIG. 2, and the query polygon area is obtained;

步骤2：查询矢量元数据表中该空间域的几何元数据信息，若信息不存在，则查询结束，数据集中不存在符合查询条件的矢量要素；若信息存在，获取对应索引元数据的ID；Step 2: Query the geometric metadata information of the spatial domain in the vector metadata table. If the information does not exist, the query ends, and there are no vector elements that meet the query conditions in the dataset; if the information exists, obtain the ID of the corresponding index metadata;

步骤3：查询索引元数据表，获取与查询多边形区域对应的R树索引表及其根节点的ID；Step 3: query the index metadata table to obtain the R-tree index table corresponding to the query polygon area and the ID of its root node;

步骤4：查询R树索引表，取出根节点的键值对并反序列化。由GeoJSON中记录的几何信息判断得知：图2中查询框与子节点N1的MBR相交，被子节点N2的MBR包含。根据父索引项指向子索引节点的指针，以多线程并行的方式从索引表中取出N1、N2节点对应的键值对，反序列化后进一步判断几何关系可知：查询框与子节点N4、N6、N7的MBR相交。递归的，并行取出N4、N6、N7对应的键值对并反序列化，使用多线程的方式并行判断几何关系得知：子节点L10、L16和L19的MBR落入查询框内，子节点L12和L18的MBR与查询框相交。判断可知，当前已查询至叶子节点，可得满足条件的叶子节点集合为：{L10,L12,L16,L18,L19}；Step 4: Query the R-tree index table, take out the key-value pair of the root node and deserialize it. Judging from the geometric information recorded in GeoJSON: the query box in Figure 2 intersects with the MBR of the child node N1, and is contained by the MBR of the child node N2. According to the pointer of the parent index item to the child index node, the key-value pairs corresponding to the N1 and N2 nodes are extracted from the index table in a multi-threaded parallel manner. , the MBR of N7 intersects. Recursively, the key-value pairs corresponding to N4, N6, and N7 are taken out in parallel and deserialized, and the geometric relationship is judged in parallel by using multi-threading. and the MBR of L18 intersects the query box. It can be seen from the judgment that the current leaf node has been queried, and the set of leaf nodes that meet the conditions can be obtained as: {L10, L12, L16, L18, L19};

Step 5: Using the index entries, fetch the vector data from the vector data table and apply a refinement query that filters the results on their geometry, yielding the set of vector data that satisfies the query conditions. The filter may also test attribute information, for example whether a building's floor area exceeds 2500 m² or whether its name string contains the name of a particular shopping mall.
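Steps 4 and 5 can be illustrated with a minimal Python sketch. It is not the patented implementation: the in-memory dictionaries `INDEX_TABLE` and `VECTOR_TABLE`, the sample nodes and features, and the helper names are all hypothetical. It further assumes that MBRs are stored as `[minx, miny, maxx, maxy]` lists rather than GeoJSON and that leaf nodes carry `L == 1`. Each tree level is fetched and deserialized in parallel by a thread pool, and an attribute filter stands in for the refinement step (the exact geometry test is omitted).

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the R-tree index table: node ID -> serialized node.
# Assumptions: MBRs are [minx, miny, maxx, maxy] lists rather than GeoJSON,
# and leaf nodes carry L == 1, with P pointing at a feature ID.
INDEX_TABLE = {
    "root": json.dumps({"ID": "root", "L": 2, "C": 2, "D": [
        {"P": "N1", "M": [0, 0, 5, 5]}, {"P": "N2", "M": [4, 4, 10, 10]}]}),
    "N1": json.dumps({"ID": "N1", "L": 1, "C": 2, "D": [
        {"P": "f1", "M": [0, 0, 1, 1]}, {"P": "f2", "M": [4, 4, 5, 5]}]}),
    "N2": json.dumps({"ID": "N2", "L": 1, "C": 2, "D": [
        {"P": "f3", "M": [4, 4, 6, 6]}, {"P": "f4", "M": [8, 8, 10, 10]}]}),
}
# Hypothetical vector data table: feature ID -> attribute information.
VECTOR_TABLE = {"f1": {"area": 900}, "f2": {"area": 3000},
                "f3": {"area": 1200}, "f4": {"area": 5000}}


def intersects(a, b):
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def fetch(node_id):
    """Read one key-value pair from the index table and deserialize it."""
    return json.loads(INDEX_TABLE[node_id])


def search(root_id, query, pool):
    """Descend level by level, fetching every node of a level in parallel."""
    frontier, hits = [root_id], []
    while frontier:
        nodes = list(pool.map(fetch, frontier))  # parallel fetch + deserialize
        frontier = []
        for node in nodes:
            for entry in node["D"]:
                if intersects(entry["M"], query):
                    # Leaf entries name features; inner entries name child nodes.
                    (hits if node["L"] == 1 else frontier).append(entry["P"])
    return hits


with ThreadPoolExecutor(max_workers=4) as pool:
    candidates = search("root", [3, 3, 7, 7], pool)

# Refinement by attribute: keep features whose area exceeds 2500.
large = [fid for fid in candidates if VECTOR_TABLE[fid]["area"] > 2500]
```

Because every node is addressed by its ID, each level of the descent reduces to a batch of independent key lookups, which is what makes the parallel fetch straightforward.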

105. Maintain the vector data, including updates and deletions.

Specifically, data can be updated in real time through the previously built R-tree index based on the flattening strategy; the update procedure is implemented in much the same way as the construction of the R-tree index. When vector data stored in the database is modified, the index data of the nodes affected by the modification is updated at the same time, which ensures the validity of the data.

By way of example, FIG. 8 shows a flowchart of vector data deletion according to an embodiment of the present invention. Deleting the vector data referenced by an index entry of leaf node L17 in the R-tree comprises the following steps:

Step 1: Query the vector metadata table for the geometric metadata of the spatial domain containing the data to be deleted, and obtain the corresponding index metadata;

Step 2: Query the index metadata table to obtain the R-tree index table and its root node, then locate the index entry associated with the vector feature to be deleted according to the geometric relationship between the minimum bounding rectangles of the R-tree nodes and the query box;

Step 3: Delete the corresponding vector data from the vector data table;

Step 4: Delete the index entry associated with that vector data from the R-tree index table;

Step 5: Update the R-tree index table and the index metadata table.
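The deletion steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual code: the tables are hypothetical in-memory dictionaries, MBRs are `[minx, miny, maxx, maxy]` lists, leaf nodes carry `L == 1`, and the `count` field of `index_meta` is an invented example of index metadata. Node underflow is handled only by dropping an emptied node; re-insertion or merging of sparse nodes is not shown.

```python
import json

# Hypothetical tables; all names and contents are assumptions for illustration.
index_table = {
    "root": json.dumps({"ID": "root", "L": 2, "C": 2, "D": [
        {"P": "N1", "M": [0, 0, 5, 5]}, {"P": "N2", "M": [4, 4, 10, 10]}]}),
    "N1": json.dumps({"ID": "N1", "L": 1, "C": 2, "D": [
        {"P": "f1", "M": [0, 0, 1, 1]}, {"P": "f2", "M": [4, 4, 5, 5]}]}),
    "N2": json.dumps({"ID": "N2", "L": 1, "C": 2, "D": [
        {"P": "f3", "M": [4, 4, 6, 6]}, {"P": "f4", "M": [8, 8, 10, 10]}]}),
}
vector_table = {"f1": {}, "f2": {}, "f3": {}, "f4": {}}
index_meta = {"root": "root", "count": 4}


def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def union(boxes):
    return [min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes)]


def locate(node_id, fid, box):
    """Step 2: return the root-to-leaf path of nodes leading to feature fid."""
    node = json.loads(index_table[node_id])
    for entry in node["D"]:
        if node["L"] == 1:
            if entry["P"] == fid:
                return [node]
        elif contains(entry["M"], box):
            found = locate(entry["P"], fid, box)
            if found:
                return [node] + found
    return None


def delete_feature(fid, box):
    path = locate(index_meta["root"], fid, box)
    if path is None:
        return False
    vector_table.pop(fid, None)                      # Step 3: remove the data
    leaf = path[-1]
    leaf["D"] = [e for e in leaf["D"] if e["P"] != fid]   # Step 4: remove entry
    leaf["C"] = len(leaf["D"])
    index_table[leaf["ID"]] = json.dumps(leaf)
    child = leaf
    for parent in reversed(path[:-1]):               # Step 5: update ancestors
        if child["D"]:
            for e in parent["D"]:
                if e["P"] == child["ID"]:
                    e["M"] = union([c["M"] for c in child["D"]])  # tighten MBR
        else:
            parent["D"] = [e for e in parent["D"] if e["P"] != child["ID"]]
            parent["C"] = len(parent["D"])
            del index_table[child["ID"]]             # drop the emptied node
        index_table[parent["ID"]] = json.dumps(parent)
        child = parent
    index_meta["count"] -= 1
    return True


deleted = delete_feature("f2", [4, 4, 5, 5])
root_after = json.loads(index_table["root"])
```

After the call, the entry for `N1` in the root shrinks to the MBR of the one remaining feature, so later queries no longer descend into `N1` for the region that `f2` used to occupy.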

The present invention provides a method for managing vector data in a non-relational database based on a flattened R-tree. Aimed at new non-relational databases, it builds an R-tree index based on a flattening strategy, giving distributed non-relational databases R-tree-supported vector data query processing. It can also exploit the distributed storage of the non-relational database to run R-tree queries in parallel across multiple nodes. The method thereby supports the organization and management of large-scale vector data in non-relational databases, so that vector data types benefit from the mass storage, parallel computing, high availability, and high reliability of such databases.

The above embodiments serve only to illustrate the design ideas and features of the present invention, so that those skilled in the art can understand its content and implement it accordingly; the scope of protection of the present invention is not limited to these embodiments. All equivalent changes or modifications made according to the principles and design ideas disclosed herein therefore fall within the scope of protection of the present invention.

Claims

1. A vector data management method in a non-relational database, characterized in that the method comprises the following steps:

S1, in a non-relational database environment, designing an auxiliary index structure based on an R-tree flattening strategy for vector data;

S2, establishing a library table structure including the vector data and the index structure, wherein all data tables in the library table structure are associated through explicit association records and implicit naming rules;

S3, encoding and storing the vector data, organizing the geometric and attribute information in GeoJSON and JSON forms respectively, and constructing a flattened R-tree index;

S4, on receiving a query request, determining the index metadata ID according to the query conditions and from it obtaining the R-tree root node ID, then retrieving the vector data in parallel based on the R-tree index table, and finally returning the query result;

and S5, maintaining the vector data in the non-relational database, including updating and deleting.

2. The vector data management method according to claim 1, characterized in that: in S1, the specific design steps of the auxiliary index structure based on the R-tree flattening strategy are as follows:

1.1, abstracting each vector object into a minimum bounding rectangle (MBR), recursively combining spatially adjacent MBRs into higher-level MBRs, and finally forming a hierarchical tree structure based on minimum bounding rectangles;

1.2, expanding the R-tree index structure into a flattened set of index nodes, namely expressing each index node as a JSON structure, and using the unique identifier of a node as the pointer from a parent index entry to its child index node;

1.3, setting a fan-out coefficient M of the R-tree such that, except for the root node, the number of child nodes of every R-tree node lies in the interval [2, M].

3. The vector data management method according to claim 2, characterized in that: among the R-tree nodes, the record format of a leaf node is <OID, MBR>, and the record format of an intermediate node is <OID, Pointer, MBR>; wherein OID is the unique identifier of the node, Pointer points to the OID of a child node, and MBR is the minimum bounding rectangle.

4. The vector data management method according to claim 1, characterized in that: in S2, the library table structure is designed as follows:

2.1, managing multi-source heterogeneous vector data in the form of data sets, wherein each vector data set organizes logically related vector data of the same type;

2.2, designing a vector data table, an R-tree index table, a vector metadata table and an index metadata table, which respectively store the vector elements of a vector data set, its index structure, the metadata of the vector elements, and the metadata of the index structure;

and 2.3, establishing association relations among the four types of tables, namely the vector data table, the R-tree index table, the vector metadata table and the index metadata table, wherein each vector data set corresponds to one vector data table and one R-tree index table, which are described by metadata in the vector metadata table and the index metadata table respectively.

5. The vector data management method according to claim 1, characterized in that: the S3 specifically includes:

3.1, encoding all vector elements of a vector data set, taking the vector data set as the unit, and writing them into the vector data table, with the geometric and attribute information organized in GeoJSON and JSON forms respectively; GeoJSON is a geospatial data interchange format based on JavaScript Object Notation;

3.2, querying the vector metadata table for the geometric metadata of the spatial domain in which the vector elements are located, and acquiring the corresponding index metadata ID;

3.3, acquiring the R-tree index table and its root node ID from the index metadata table, navigating to the target leaf node of the R-tree index table according to the geometric relationship between the minimum bounding rectangle of the vector element and the bounding rectangles of the index entries, inserting an index entry for the vector element, and updating the R-tree index table;

and 3.4, after the writing of the vector data set is completed, updating the vector metadata table and the index metadata table.

6. The vector data management method according to claim 5, wherein: in step 3.3, node navigation by ID and the update of the R-tree index table comprise the following steps:

3.3.1, according to the geometric relationship between the minimum bounding rectangle of the vector element and the bounding rectangles of the index entries, searching for the optimal insertion node by ID-based node navigation, and judging whether the number of child nodes of that node exceeds the set fan-out coefficient; if so, executing step 3.3.2, otherwise executing step 3.3.3;

3.3.2, performing a node splitting operation, dividing the node equally into two new nodes through an R-tree node splitting algorithm, and navigating again to find the optimal insertion node;

3.3.3, inserting the index entry of the vector element into the node, and updating the node;

3.3.4, if the root node has been split, updating the root node information in the index metadata table.

7. The vector data management method according to claim 5, wherein: the JSON structure of an index node is { ID, L, C, D }; wherein ID is the unique identifier of the index node, namely its OID; L is the level of the node in the tree; C is the number of child nodes owned by the node; D is a nested JSON structure that records the unique identifier and the minimum bounding rectangle of each child node owned by the node;

the detailed structure of D is D { { P, M }, …, { P, M } }, wherein P is an abbreviation of Pointer and points to the OID of a child node; M is an abbreviation of MBR, the minimum bounding rectangle of that child node, organized in GeoJSON form.

8. The vector data management method according to claim 1, characterized in that: the S4 specifically includes:

4.1, receiving query conditions given by a user, such as the data set name and the query range;

4.2, querying the vector metadata table for the geometric metadata of the spatial domain according to the given query conditions, to obtain the ID of the corresponding index metadata;

4.3, acquiring the R-tree index table and the ID of its root node from the index metadata table;

4.4, querying the R-tree index table, and taking out the index entries of the vector elements that meet the query conditions;

and 4.5, taking out the vector data from the vector data table, and carrying out fine filtering to obtain the final query result.

9. The vector data management method according to claim 8, wherein: in 4.4, querying the R-tree index table comprises the following steps:

4.4.1, locating the corresponding key-value pair in the R-tree index table through the ID of the root node, taking out the key-value pair and deserializing it;

4.4.2, judging the relationship between the MBR of each child node and the query range according to the geometric information in the GeoJSON, finding the child nodes whose MBR intersects or lies within the query range, and taking out and deserializing the corresponding child node key-value pairs according to the pointers from the parent index entries to the child index nodes;

4.4.3, repeating step 4.4.2 until the leaf nodes of the R-tree are reached.

10. The vector data management method according to claim 1, characterized in that: in S5, the process of deleting the vector data includes:

5.1, querying the vector metadata table for the geometric metadata of the spatial domain in which the data to be deleted is located, and acquiring the corresponding index metadata;

5.2, acquiring the R-tree index table and its root node from the index metadata table, and locating the index entry associated with the vector element to be deleted according to the geometric relationship between the minimum bounding rectangles of the R-tree nodes and the query box;

5.3, deleting the corresponding vector data from the vector data table;

and 5.4, deleting the index entry associated with the vector data from the R-tree index table.
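
For illustration only, and forming no part of the claims, the node structure of claim 7 and the split-then-insert flow of claim 6 might be sketched in Python as below. Everything here is an assumption made for the example: the helper names, the `[minx, miny, maxx, maxy]` box representation, the split rule (sort by MBR centre along x and cut in half, standing in for the unspecified R-tree node splitting algorithm), and the least-enlargement rule used to choose the insertion node after a split.

```python
import json


def mbr_to_geojson(box):
    """Encode [minx, miny, maxx, maxy] as a closed GeoJSON Polygon ring."""
    minx, miny, maxx, maxy = box
    return {"type": "Polygon", "coordinates": [[
        [minx, miny], [maxx, miny], [maxx, maxy], [minx, maxy], [minx, miny]]]}


def make_node(node_id, level, entries):
    """Serialize entries [(child_id, box), ...] as an {ID, L, C, D} node."""
    return json.dumps({
        "ID": node_id, "L": level, "C": len(entries),
        "D": [{"P": cid, "M": mbr_to_geojson(box)} for cid, box in entries]})


def union(boxes):
    return [min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes)]


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def enlargement(entries, box):
    """Growth in MBR area if box were added to a group of entries."""
    current = union([b for _, b in entries])
    return area(union([current, box])) - area(current)


def split_entries(entries):
    """Assumed split rule: order by MBR centre along x, then cut in half."""
    ordered = sorted(entries, key=lambda e: e[1][0] + e[1][2])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def insert_entry(entries, new_entry, fan_out):
    """Return one entry list, or two if the full node had to be split first."""
    if len(entries) < fan_out:
        return [entries + [new_entry]]
    left, right = split_entries(entries)
    # Navigate again: pick the half whose MBR grows least.
    target = min((left, right), key=lambda half: enlargement(half, new_entry[1]))
    target.append(new_entry)
    return [left, right]


full = [("a", [0, 0, 1, 1]), ("b", [1, 1, 2, 2]),
        ("c", [8, 8, 9, 9]), ("e", [7, 7, 8, 8])]
groups = insert_entry(full, ("d", [9, 9, 10, 10]), fan_out=4)
nodes = [make_node(f"N{i}", 1, g) for i, g in enumerate(groups, start=1)]
```

With a fan-out of 4, the full node is split into two halves of two entries each, and the new entry joins the half that lies near it, so both resulting nodes stay within the [2, M] bound of claim 2.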