CN116821127A

CN116821127A - Method for implementing a hash index in a KV-storage distributed database

Info

Publication number: CN116821127A
Application number: CN202310739518.6A
Authority: CN
Inventors: 柴毅; 徐佳庆; 牟冠学; 蒋家超
Original assignee: Shanghai Yunxi Technology Co ltd
Current assignee: Shanghai Yunxi Technology Co ltd
Priority date: 2023-06-21
Filing date: 2023-06-21
Publication date: 2023-09-29

Abstract

The invention discloses a method for implementing a hash index in a KV-storage distributed database, and relates to the technical field of distributed KV storage. The method is based on a distributed database that uses RocksDB as its KV storage engine. It creates, deletes, and modifies hash indexes; performs the insert, update, delete, and data backfill operations associated with a hash index; and adds query plan support for hash indexes. Hash index information is stored in the index metadata, and data belonging to different partitions is kept apart by modifying the key prefix of the hash index. When a query involving the hash index is executed, the query request is distributed to every partition; when the conditions allow a specific partition to be located, part of the plan is modified so that the request is sent only to that partition. The method addresses the database write hot spot problem and improves AP performance for certain exact-match queries.

Description

Method for implementing a hash index in a KV-storage distributed database

Technical Field

The invention relates to the technical field of distributed KV storage, and in particular to a method for implementing a hash index in a KV-storage distributed database.

Background

A distributed relational database is designed primarily for scalability, strong consistency, and high reliability. To improve scalability, it adopts a fully decentralized architecture in which all nodes in the cluster are peers. The underlying storage organizes data into ordered key-value pairs that form a KV map, and the KV map is logically divided by key range into a large number of key spaces, each of which is called a Range. Each Range is replicated and distributed across multiple nodes to guarantee its high availability. Queries are executed concurrently on the data nodes as distributed tasks, and the Raft consensus protocol is used among the replicas of a Range. Only one of the replicas of a Range is the leader, which is responsible for executing the corresponding KV operations. Leader election also relies on the Raft protocol.

The key of a user table is an arbitrary byte array, composed as follows:

Primary key: globally unique table ID + primary key ID + encoded primary key.

Ordinary index: globally unique table ID + index ID + encoded index key + encoded primary key.
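The two key layouts above can be sketched as follows. This is only an illustration: it assumes fixed-width big-endian integers for the IDs and raw bytes for the encoded columns, which the text does not specify, and the function names and the sample IDs (53, 1, 2) are hypothetical.

```python
import struct

def encode_primary_key(table_id, pk_index_id, pk_bytes):
    """table ID + primary key ID + encoded primary key."""
    return struct.pack(">II", table_id, pk_index_id) + pk_bytes

def encode_secondary_key(table_id, index_id, index_bytes, pk_bytes):
    """table ID + index ID + encoded index key + encoded primary key."""
    return struct.pack(">II", table_id, index_id) + index_bytes + pk_bytes

pk = struct.pack(">Q", 42)           # encoded primary key value (assumed 8 bytes)
idx = struct.pack(">Q", 20230621)    # encoded indexed column value (assumed 8 bytes)

print(encode_primary_key(53, 1, pk).hex())
print(encode_secondary_key(53, 2, idx, pk).hex())
```

Because the primary key is appended to the ordinary index key, two rows with the same indexed value still produce distinct keys.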

The distributed relational database described herein uses range partitioning by default. For common range scan queries, range partitioning performs better than hash partitioning. For scan queries over a particular key range, however, the load across ranges can become unbalanced, because such queries touch only a small subset of the ranges. This is especially true of sequential inserts, for which hash partitioning is the better fit.

The bottom layer of the distributed relational database stores key-value pairs in a single global key space, to which all tables and indexes are mapped. The key space is divided into ordered, contiguous blocks called ranges, each with a configurable size limit. As data is added or deleted, ranges split or merge, so the number of ranges grows or shrinks.

When a range is accessed frequently, the database attempts to split it into several smaller ranges (load-based splitting), and it also attempts to redistribute ranges across the cluster according to load so that the load is spread evenly. If a sequential insert workload always lands on one boundary of a range, however, the range cannot be split in this way, which results in a single-range hot spot. Data is appended to the end of one range until that range reaches its maximum size threshold, and then to the end of the new range, so insert and query performance is limited by the performance of a single range. The hash index is introduced to solve this problem.
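The hot spot described above can be illustrated with a small simulation. It is a sketch under simplifying assumptions: keys are plain integers, the split points are made up, and `range_for_key` is a hypothetical helper rather than a database API.

```python
import bisect
from collections import Counter

def range_for_key(split_points, key):
    """Index of the range holding key; range i covers
    [split_points[i-1], split_points[i])."""
    return bisect.bisect_right(split_points, key)

# Four ranges: below 1000, 1000-1999, 2000-2999, and 3000 upward.
split_points = [1000, 2000, 3000]

# Monotonically increasing keys, e.g. an auto-increment ID or a timestamp.
sequential_keys = range(3000, 3100)

hits = Counter(range_for_key(split_points, k) for k in sequential_keys)
print(hits)
```

Every insert falls into the last range, so one range (and its leader) absorbs the whole write load while the others stay idle.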

Disclosure of Invention

In view of the needs and shortcomings of the prior art, the invention provides a method for implementing a hash index in a KV-storage distributed database.

To solve the above technical problem, the method for implementing a hash index in a KV-storage distributed database adopts the following technical solution:

A method for implementing a hash index in a KV-storage distributed database is based on a distributed database that uses RocksDB as its KV storage engine, and comprises: creating, deleting, and modifying the hash index; performing the insert, update, delete, and data backfill operations associated with the hash index; and adding query plan support for the hash index. The method thereby improves the performance of large-volume data insertion/import and the efficiency of exact-match queries.

Optionally, when the hash index is created, the information items ColumnNames, columnIDs, name, and ID are added according to the hash index columns and the hash buckets defined by the user in the creation syntax, and if the table already contains data, a data backfill operation for the hash index is triggered.

When the hash index is created, the database pre-splits the corresponding ranges by partition ID in advance; thereafter, the range split/merge operations caused by adding or deleting data take place within each pre-split range.
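Pre-splitting by partition ID can be pictured as choosing one split key per hash partition. The sketch below assumes that a split key is the prefix table ID + index ID + partition ID encoded as fixed-width big-endian integers; that encoding, the function name, and the sample IDs are illustrative assumptions.

```python
import struct

def presplit_points(table_id, index_id, num_partitions):
    """One split key per hash partition: table ID + index ID + partition ID."""
    return [struct.pack(">III", table_id, index_id, pid)
            for pid in range(num_partitions)]

points = presplit_points(53, 2, 4)
for point in points:
    print(point.hex())
```

With the key space cut at these points before any data arrives, writes to different partitions land in different ranges from the start, and later size-based splits stay inside one partition's range.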

Preferably, the key of the hash index is encoded as follows:

globally unique table ID + index ID + partition ID + encoded index key + encoded primary key.
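The effect of placing the partition ID ahead of the index key can be sketched as follows, again assuming fixed-width big-endian fields (an assumption of this example, not something the text specifies) and a hypothetical `encode_hash_index_key` helper.

```python
import struct

def encode_hash_index_key(table_id, index_id, partition_id, index_bytes, pk_bytes):
    """table ID + index ID + partition ID + encoded index key + encoded primary key."""
    prefix = struct.pack(">III", table_id, index_id, partition_id)
    return prefix + index_bytes + pk_bytes

def u64(value):
    return struct.pack(">Q", value)

keys = [
    encode_hash_index_key(53, 2, 1, u64(10), u64(1)),
    encode_hash_index_key(53, 2, 0, u64(30), u64(2)),
    encode_hash_index_key(53, 2, 0, u64(20), u64(3)),
]

# Byte-wise ordering groups keys by partition first, then by index value.
ordered = sorted(keys)
for key in ordered:
    partition = struct.unpack(">III", key[:12])[2]
    index_value = struct.unpack(">Q", key[12:20])[0]
    print(partition, index_value)
```

Keys of one partition are contiguous in the key space and stay sorted by index value inside it, while the global order across partitions is lost.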

Optionally, when the hash index is deleted, the hash index data in every partition must be deleted.

Optionally, modifying the hash index includes modifying its number of partitions or its partition names, and changing the hash index to another partitioning mode.

The operations involved in modifying the hash index are as follows: the index metadata is modified according to the information in the syntax, and if the modification involves the number of partitions or the partitioning mode, a data backfill operation is required.
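The decision of whether a modification needs a backfill can be sketched as a comparison of old and new metadata. The `HashIndexMeta` class and its fields are hypothetical and only loosely modeled on the metadata items named above; the rule encoded here is the one stated in the text, namely that a change to the partition count or partitioning mode requires a backfill, while a rename is a metadata-only change.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class HashIndexMeta:
    name: str
    column_names: tuple
    num_partitions: int
    partition_names: tuple
    partition_mode: str = "hash"

def needs_backfill(old, new):
    """True when existing index entries would sit under the wrong key prefix."""
    return (old.num_partitions != new.num_partitions
            or old.partition_mode != new.partition_mode)

old = HashIndexMeta("idx_ts", ("ts",), 4, ("p0", "p1", "p2", "p3"))
renamed = replace(old, partition_names=("a", "b", "c", "d"))
resized = replace(old, num_partitions=8,
                  partition_names=tuple(f"p{i}" for i in range(8)))

print(needs_backfill(old, renamed))
print(needs_backfill(old, resized))
```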

Optionally, when an insert, update, delete, or data backfill operation is performed on the hash index, a hash value is first calculated from the key value of the hash distribution column; a modulo operation is then applied to the hash value to obtain the ID of the hash partition in which the data resides; and the key prefix of the index is modified according to that partition ID.
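The hash, modulo, and prefix steps can be sketched as below. Two things are assumptions of this example: `zlib.crc32` from the standard library stands in for the hash function so that the code runs without third-party packages, and the modulus is taken to be the number of partitions. The function names and sample IDs are hypothetical.

```python
import struct
import zlib

def partition_id(hash_column_bytes, num_partitions):
    """Hash the distribution column value, then reduce it modulo the partition count."""
    # zlib.crc32 is a stand-in hash for illustration only.
    return zlib.crc32(hash_column_bytes) % num_partitions

def hash_index_prefix(table_id, index_id, hash_column_bytes, num_partitions):
    """Key prefix of the index entry: table ID + index ID + partition ID."""
    pid = partition_id(hash_column_bytes, num_partitions)
    return struct.pack(">III", table_id, index_id, pid)

for value in (1001, 1002, 1003):
    col = struct.pack(">Q", value)
    print(value, partition_id(col, 4), hash_index_prefix(53, 2, col, 4).hex())
```

The same value always maps to the same partition, which is what later allows an exact-match query to be routed to a single partition.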

Further optionally, when an update operation on the hash index changes the value that is hashed, the partition ID must be recalculated using the xxhash hash algorithm, and if the partition changes, the update operation must be converted into an insert plus a delete.
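The conversion of an update into an insert plus a delete can be sketched as a small planning function. The text names xxhash; this example substitutes `zlib.crc32` as the default so that it needs only the standard library, and it accepts an injectable `hash_fn` so the two cases can be shown deterministically. The operation tuples are a hypothetical representation.

```python
import zlib

def default_hash(value):
    # Stand-in for xxhash, chosen only to avoid a third-party dependency.
    return zlib.crc32(value)

def plan_index_update(old_value, new_value, pk, num_partitions, hash_fn=default_hash):
    """Return the index operations needed when the hashed column changes."""
    if old_value == new_value:
        return []
    old_pid = hash_fn(old_value) % num_partitions
    new_pid = hash_fn(new_value) % num_partitions
    if old_pid == new_pid:
        return [("update", old_pid, old_value, new_value, pk)]
    # The entry moves to another partition, so its key prefix changes.
    return [("delete", old_pid, old_value, pk),
            ("insert", new_pid, new_value, pk)]

def first_byte(value):
    """Toy hash used to make the demonstration predictable."""
    return value[0]

print(plan_index_update(b"\x01", b"\x05", b"pk1", 4, first_byte))
print(plan_index_update(b"\x01", b"\x02", b"pk1", 4, first_byte))
```

With the toy hash, the first call stays in one partition and the second moves the entry from one partition to another.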

Further optionally, the hash index scatters the originally ordered index data across the hash partitions; the data within each partition remains ordered.

With the plan support for the hash index added, an index query must be executed in every partition, and if the result set must be ordered, a merge-sort operation is added above the per-partition queries.

With the plan support for the hash index added, when an index query is executed and the WHERE expression contains an exact-match filter condition on the hash column, the query can be converted into a query on a single partition.
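The two plan shapes described above, a fan-out to every partition followed by a merge and a pruned lookup in one partition, can be sketched with in-memory lists. This is an illustration only: `zlib.crc32` stands in for the hash function, integer values are hashed through their decimal string, and all function names are hypothetical.

```python
import heapq
import zlib

def partition_id(value, num_partitions):
    # Stand-in hash for illustration.
    return zlib.crc32(str(value).encode()) % num_partitions

def build_partitions(values, num_partitions):
    """Scatter values across partitions; each partition stays sorted."""
    parts = [[] for _ in range(num_partitions)]
    for value in values:
        parts[partition_id(value, num_partitions)].append(value)
    for part in parts:
        part.sort()
    return parts

def ordered_scan(parts, lo, hi):
    """Range query: scan every partition, then k-way merge the sorted results."""
    per_partition = ([v for v in part if lo <= v <= hi] for part in parts)
    return list(heapq.merge(*per_partition))

def point_lookup(parts, value):
    """Exact match on the hash column: only one partition is consulted."""
    pid = partition_id(value, len(parts))
    return pid, value in parts[pid]

parts = build_partitions(range(100), 4)
print(ordered_scan(parts, 10, 20))
print(point_lookup(parts, 42))
```

The merge step restores a global order at a cost proportional to the number of partitions, whereas the point lookup avoids the fan-out entirely.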

Preferably, when a lookup join must be performed on the hash index, the query request must be distributed to every partition when the ranges of the index query are generated; when a specific partition can be located from the value of the hash column, the query request can instead be sent only to that partition.
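Range generation for a lookup join can be sketched as producing the set of key prefixes to scan for each lookup. The sketch assumes the fixed-width prefix layout used for illustration elsewhere, uses `zlib.crc32` as a stand-in hash, and the `lookup_join_prefixes` name and its parameters are hypothetical.

```python
import struct
import zlib

def partition_id(value_bytes, num_partitions):
    # Stand-in hash for illustration.
    return zlib.crc32(value_bytes) % num_partitions

def lookup_join_prefixes(table_id, index_id, num_partitions, hash_column_value=None):
    """Key prefixes to scan for one lookup.

    If the hash column value is known, only its partition is targeted;
    otherwise the lookup fans out to every partition.
    """
    if hash_column_value is not None:
        pids = [partition_id(hash_column_value, num_partitions)]
    else:
        pids = list(range(num_partitions))
    return [struct.pack(">III", table_id, index_id, pid) for pid in pids]

fan_out = lookup_join_prefixes(53, 2, 4)
targeted = lookup_join_prefixes(53, 2, 4, struct.pack(">Q", 1001))
print(len(fan_out), len(targeted))
```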

Compared with the prior art, the method for implementing a hash index in a KV-storage distributed database has the following beneficial effects:

(1) The method stores the hash index information in the index metadata and keeps data in different partitions apart by modifying the key prefix of the hash index. When a query involving the hash index is executed, the query request generally has to be distributed to every partition; when a specific partition can be located under certain conditions, part of the plan is modified so that the exact-match query request is sent to that partition. This addresses the database write hot spot problem and improves AP performance for certain exact-match queries.

(2) When incremental data is inserted continuously, creating a hash index improves performance by about 40% compared with an ordinary index. Under certain specific query conditions the hash index also outperforms an ordinary index, and in other cases its performance does not differ significantly from that of an ordinary index.

Drawings

Fig. 1 is a flow chart of a method according to a first embodiment of the present invention.

Detailed Description

To make the technical solution, the technical problem to be solved, and the technical effects of the invention clearer, the technical solution is described below clearly and completely with reference to specific embodiments.

Embodiment one:

With reference to Fig. 1, this embodiment proposes a method for implementing a hash index in a KV-storage distributed database. The method is based on a distributed database that uses RocksDB as its KV storage engine; it creates, deletes, and modifies the hash index, performs the insert, update, delete, and data backfill operations associated with the hash index, and adds query plan support for the hash index, thereby improving the performance of large-volume data insertion/import and the efficiency of exact-match queries.

(I) Creating, deleting, and modifying the hash index.

(1) When the hash index is created, the information items ColumnNames, columnIDs, name, and ID are added according to the hash index columns and the hash buckets defined by the user in the creation syntax, and if the table already contains data, a data backfill operation for the hash index is triggered.

Preferably, the key of the hash index is encoded as follows:

globally unique table ID + index ID + partition ID + encoded index key + encoded primary key.

(2) When the hash index is deleted, the hash index data in every partition must be deleted.

(3) Modifying the hash index includes modifying its number of partitions or its partition names, and changing the hash index to another partitioning mode.
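The backfill triggered on creation and the per-partition cleanup on deletion can be sketched against an in-memory dictionary standing in for the KV store. The row format, the fixed-width key layout, the `zlib.crc32` stand-in hash, and the function names are all assumptions of this example.

```python
import struct
import zlib

def partition_id(value_bytes, num_partitions):
    # Stand-in hash for illustration.
    return zlib.crc32(value_bytes) % num_partitions

def backfill(rows, table_id, index_id, num_partitions):
    """Build hash index entries for rows that already exist in the table.

    rows: iterable of (encoded primary key, encoded indexed column).
    """
    kv = {}
    for pk, col in rows:
        pid = partition_id(col, num_partitions)
        key = struct.pack(">III", table_id, index_id, pid) + col + pk
        kv[key] = b""
    return kv

def drop_index(kv, table_id, index_id, num_partitions):
    """Delete the index entries partition by partition; return how many were removed."""
    removed = 0
    for pid in range(num_partitions):
        prefix = struct.pack(">III", table_id, index_id, pid)
        doomed = [key for key in kv if key.startswith(prefix)]
        for key in doomed:
            del kv[key]
        removed += len(doomed)
    return removed

rows = [(struct.pack(">Q", i), struct.pack(">Q", i * 10)) for i in range(50)]
kv = backfill(rows, 53, 2, 4)
print(len(kv))
print(drop_index(kv, 53, 2, 4))
```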

(II) Performing the insert, update, delete, and data backfill operations associated with the hash index.

When an insert, update, delete, or data backfill operation is performed on the hash index, a hash value is first calculated from the key value of the hash distribution column; a modulo operation is then applied to the hash value to obtain the ID of the hash partition in which the data resides; and the key prefix of the index is modified according to that partition ID.

When an update operation on the hash index changes the value that is hashed, the partition ID must be recalculated using the xxhash hash algorithm, and if the partition changes, the update operation must be converted into an insert plus a delete.

(III) Adding plan support for the hash index.

The hash index scatters the originally ordered index data across the hash partitions; the data within each partition remains ordered.

When a lookup join must be performed on the hash index, the query request must be distributed to every partition when the ranges of the index query are generated; when a specific partition can be located from the value of the hash column, the query request can instead be sent only to that partition.

In summary, with the method for implementing a hash index in a KV-storage distributed database according to the invention, data stored in different partitions is kept apart by modifying the key prefix of the hash index, and an exact-match query request can be sent to the relevant partition by modifying part of the plan. This addresses the database write hot spot problem and improves AP performance for certain exact-match queries.

The foregoing describes the principles and embodiments of the present invention by means of specific examples, in order to aid understanding of the invention. Any improvement or modification made by those skilled in the art on the basis of the above embodiments, without departing from the principles of the present invention, shall fall within the scope of protection of the present invention.

Claims

1. A method for implementing a hash index in a KV-storage distributed database, characterized in that the method is based on a distributed database that uses RocksDB as its KV storage engine, and comprises creating, deleting, and modifying the hash index, performing the insert, update, delete, and data backfill operations associated with the hash index, and adding query plan support for the hash index, so that the performance of large-volume data insertion/import and the efficiency of exact-match queries are improved.
2. The method for implementing a hash index in a KV-storage distributed database according to claim 1, wherein when the hash index is created, the information items ColumnNames, columnIDs, name, and ID are added according to the hash index columns and the hash buckets defined by the user in the creation syntax, and if the table already contains data, a data backfill operation for the hash index is triggered;

when the hash index is created, the database pre-splits the corresponding ranges by partition ID in advance, and thereafter the range split/merge operations caused by adding or deleting data take place within each pre-split range.
3. The method for implementing a hash index in a KV-storage distributed database according to claim 2, wherein the key of the hash index is encoded as follows:

globally unique table ID + index ID + partition ID + encoded index key + encoded primary key.
4. The method for implementing a hash index in a KV-storage distributed database according to claim 2, wherein when the hash index is deleted, the hash index data in every partition must be deleted.
5. The method for implementing a hash index in a KV-storage distributed database according to claim 2, wherein modifying the hash index includes modifying its number of partitions or its partition names, and changing the hash index to another partitioning mode;

the operations involved in modifying the hash index are as follows: the index metadata is modified according to the information in the syntax, and if the modification involves the number of partitions or the partitioning mode, a data backfill operation is required.
6. The method for implementing a hash index in a KV-storage distributed database according to claim 5, wherein when an insert, update, delete, or data backfill operation is performed on the hash index, a hash value is first calculated from the key value of the hash distribution column, a modulo operation is then applied to the hash value to obtain the ID of the hash partition in which the data resides, and the key prefix of the index is modified according to that partition ID.
7. The method for implementing a hash index in a KV-storage distributed database according to claim 6, wherein when an update operation on the hash index changes the value that is hashed, the partition ID must be recalculated using the xxhash hash algorithm, and if the partition changes, the update operation must be converted into an insert plus a delete.
8. The method for implementing a hash index in a KV-storage distributed database according to claim 7, wherein the hash index scatters the originally ordered index data across the hash partitions, and the data within each partition remains ordered;

with the plan support for the hash index added, an index query must be executed in every partition, and if the result set must be ordered, a merge-sort operation is added above the per-partition queries;

with the plan support for the hash index added, when an index query is executed and the WHERE expression contains an exact-match filter condition on the hash column, the query can be converted into a query on a single partition.
9. The method for implementing a hash index in a KV-storage distributed database according to claim 8, wherein when a lookup join must be performed on the hash index, the query request must be distributed to every partition when the ranges of the index query are generated, and when a specific partition can be located from the value of the hash column, the query request can instead be sent only to that partition.