CN111966654A

CN111966654A - Hybrid filter based on Trie dictionary tree

Info

Publication number: CN111966654A
Application number: CN202010829227.2A
Authority: CN
Inventors: 张炜刚; 贾德星; 孙思清; 高传集
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2020-08-18
Filing date: 2020-08-18
Publication date: 2020-11-20

Abstract

The invention discloses a hybrid filter based on a Trie dictionary tree and relates to the field of data storage. A Trie dictionary tree is used as the basic structure of the hybrid filter: the prefix length of the Trie dictionary tree is set, a binary array is added to the root nodes of the last layer of the Trie dictionary tree, and the prefix length and the binary array length are adjusted according to memory occupation. The filter enables efficient file filtering during range queries and reduces I/O throughput, so that a database built on an LSM-Tree (log-structured merge tree) storage engine achieves better range query performance.

Description

Hybrid filter based on Trie dictionary tree

Technical Field

The invention discloses a hybrid filter, relates to the field of data storage, and particularly relates to a hybrid filter based on a Trie dictionary tree.

Background

Storage engines designed for high-speed writing are implemented with an LSM-Tree (log-structured merge tree). During a write, data is first written to the Memtable in memory and is then asynchronously and sequentially flushed to disk. A large number of SST files are generated in the process, so a read must determine whether a given Key is contained in each SST file; RocksDB currently makes this determination with a Bloom filter. However, to support multi-version control, a version number is appended to each written Key, which turns a Key query into a range query. Because a Bloom filter performs its hit test through hash functions, it supports only single-value queries. An LSM-Tree based storage engine must therefore read all SST files on disk to serve a range query, and the disk I/O cost is very high.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a hybrid filter based on a Trie dictionary tree (TbMF). TbMF is a space-efficient hit-test data structure. Unlike the existing Bloom filter, which supports filtering only for single-value queries, TbMF can filter both single-value queries and range queries.

The specific scheme provided by the invention is as follows:

A method for implementing a hybrid filter based on a Trie dictionary tree, in which the Trie dictionary tree is taken as the basic structure of the hybrid filter,

the prefix length of the Trie dictionary tree is set, and a binary array is added to the root nodes of the last layer of the Trie dictionary tree,

and the prefix length and the binary array length are adjusted according to memory occupation.

In the method for implementing the hybrid filter based on the Trie dictionary tree, the Trie dictionary tree is compressed by merging nodes that have no branches along a path.
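The merging of unbranched nodes can be illustrated with a minimal sketch. It assumes the Trie is stored as nested dictionaries over fixed-length prefixes, and it uses five short sample prefixes with "_" as a hypothetical padding character; the function names are illustrative and the patent does not specify an encoding.

```python
def build_trie(prefixes):
    """Build a plain character trie as nested dicts (one node per character)."""
    root = {}
    for prefix in prefixes:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one multi-character edge."""
    out = {}
    for label, child in node.items():
        # Keep absorbing the child while it has exactly one outgoing edge
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


def depth(node):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not node else 1 + max(depth(c) for c in node.values())


# Sample prefixes of length 6; "_" is a hypothetical padding character
trie = build_trie(["data__", "date__", "datati", "databa", "datase"])
compressed = compress(trie)
print(depth(trie), depth(compressed))
```

In this sketch the shared chain d, a, t collapses into a single edge, so lookups traverse fewer nodes while the set of stored prefixes is unchanged.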

A Trie-based hybrid filter: taking the Trie dictionary tree as a basic structure,

In the hybrid filter based on the Trie dictionary tree, the Trie dictionary tree is compressed by merging nodes that have no branches along a path.

The invention also provides application of the hybrid filter based on the Trie dictionary tree.

The hybrid filter based on the Trie dictionary tree is applied to a storage engine based on an LSM-Tree, where it filters files during range queries and single-value queries.

The hybrid filter based on the Trie dictionary tree is applied to RocksDB, where it filters files during range queries and single-value queries.

In the application of the hybrid filter based on the Trie dictionary tree, the query process is as follows:

the prefix is used to find a matching root node,

if the match succeeds, hash hit verification is performed on the Key,

if hit verification fails for any one hash function, the Key is not in the set.

The invention has the advantages that:

The invention provides a hybrid filter based on a Trie dictionary tree that helps an LSM-Tree based storage engine filter files efficiently during range queries and single-value queries, improving query performance and reducing disk I/O consumption and I/O throughput. Compression coding of the filter structure reduces memory consumption, and configuring the prefix length and the binary array length lets the trade-off between memory consumption and false positive rate (FPR) be tuned to the user's scenario.

Drawings

FIG. 1 is a schematic diagram of a truncated Trie tree according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a root node adding binary array in accordance with an embodiment of the present invention;

FIG. 3 is a schematic diagram of compressing TbMF according to an embodiment of the present invention;

FIG. 4 is a flow chart of a query process in the application of the present invention.

Detailed Description

The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.

The invention provides a method for realizing a hybrid filter based on a Trie dictionary tree, which takes the Trie dictionary tree as the basic structure of the hybrid filter,

The method yields a Trie based Mixed Filter (TbMF for short) built on the Trie dictionary tree. Limiting the length of the Trie structure facilitates encoding and querying. Adding binary arrays to the root nodes of the last layer avoids misjudgments in which the prefix of a string matches but its suffix does not. Memory occupation is adjusted by setting the prefix length and the length of the binary arrays, which effectively balances memory occupation against the false positive rate (FPR). In addition, a Trie dictionary tree used as the basic structure can count, sort, and store large numbers of strings, and is often used by search engine systems for text word-frequency statistics. It uses the common prefixes of strings to reduce query time and minimizes unnecessary string comparisons, so its query efficiency is higher than that of a hash tree.
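The balance between memory occupation and FPR can be made concrete with a rough estimate. The sketch below assumes that each binary array behaves like a small Bloom filter, so that the standard Bloom filter approximation (1 - e^(-kn/m))^k applies with n Keys per node, m bits per array, and k hash functions. That assumption, the function names, and the sample numbers are illustrative and are not taken from the patent.

```python
import math


def estimated_fpr(keys_per_node, array_bits, num_hashes):
    """Standard Bloom filter approximation, applied per binary array (assumption)."""
    return (1.0 - math.exp(-num_hashes * keys_per_node / array_bits)) ** num_hashes


def array_memory_bytes(last_layer_nodes, array_bits):
    """Memory used by the binary arrays alone, ignoring the Trie nodes themselves."""
    return last_layer_nodes * array_bits / 8


# Hypothetical workload: 1000 last-layer nodes, 8 Keys per node, 2 hash functions
for bits in (32, 64, 128, 256):
    print(bits, array_memory_bytes(1000, bits), round(estimated_fpr(8, bits, 2), 4))
```

Under these assumptions, lengthening the binary arrays lowers the estimated FPR at the cost of proportionally more memory. A longer prefix generally produces more last-layer nodes with fewer Keys each, which shifts the same balance.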

On this basis, when the prefix length of the Trie dictionary tree is set, Keys shorter than that length are padded. A binary array is added to the root nodes of the last layer of the Trie dictionary tree, and the TbMF is compressed: nodes without branches along a path of the Trie dictionary tree can be merged into one node. Referring to FIGS. 1-3, FIG. 1 illustrates the structure of the Trie after string Keys are written. In this case, 5 Keys are written: data, date, datating, database, and dataset. A node marked with oblique lines indicates that the path to that node forms a complete Key. To make the Trie smaller, Keys are truncated to a length set to 6, and Keys shorter than that length are padded. In FIG. 2, a binary array is added to the root nodes of the truncated Trie, which improves the query performance for a single Key. FIG. 3 shows the TbMF compression process: nodes on a path that has no branches can be merged, which reduces the depth of the tree.
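The construction described for FIG. 1 and FIG. 2 can be sketched as follows, using the five Keys and the length of 6 from the example. The padding character, the array length, the number of hash functions, and the use of salted SHA-256 digests are all assumptions made for illustration; the patent does not specify them.

```python
import hashlib

PREFIX_LEN = 6      # truncation length from the example
PAD = "\x00"        # hypothetical padding character for short Keys
ARRAY_BITS = 16     # hypothetical binary array length per last-layer node
NUM_HASHES = 2      # hypothetical number of hash functions


def pad_prefix(key, length=PREFIX_LEN):
    """Truncate the Key to `length` characters and pad it if it is shorter."""
    return key[:length].ljust(length, PAD)


def bit_positions(key, bits=ARRAY_BITS, hashes=NUM_HASHES):
    """Derive one bit position per hash function from salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % bits
        for i in range(hashes)
    ]


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.bits = None    # binary array as an int, set only on last-layer nodes


def build(keys):
    """Insert each truncated Key and record the full Key in the node's array."""
    root = Node()
    for key in keys:
        node = root
        for ch in pad_prefix(key):
            node = node.children.setdefault(ch, Node())
        if node.bits is None:
            node.bits = 0
        for pos in bit_positions(key):
            node.bits |= 1 << pos
    return root


def last_layer_prefixes(node, path=""):
    """Collect the prefixes of all last-layer nodes in sorted order."""
    if node.bits is not None:
        return [path]
    found = []
    for ch in sorted(node.children):
        found.extend(last_layer_prefixes(node.children[ch], path + ch))
    return found


KEYS = ["data", "date", "datating", "database", "dataset"]
tbmf_root = build(KEYS)
prefixes = last_layer_prefixes(tbmf_root)
```

Hashing the full Key into the array of its last-layer node is what lets the structure distinguish Keys that share the same 6-character prefix but differ in their suffix.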

The present invention further provides a hybrid filter corresponding to the implementation method. Since the structure of the hybrid filter is consistent with that obtained by the method, its embodiments are based on the same concept; for specific details, refer to the description of the method embodiments, which is not repeated here.

The invention also provides an application of the hybrid filter. It is mainly applied to storage engines based on the LSM-Tree, where it filters files efficiently during range queries and single-value queries, improving query performance and reducing disk I/O consumption. RocksDB is taken as an example of an LSM-Tree based storage engine; it is an open source Key/Value storage engine implemented in C++. RocksDB makes efficient use of fast storage such as solid state disks and flash memory, and is also suited to commodity hosts with multiple CPU cores. Many common distributed databases and software stacks (such as the Inspur Yunxi database, TiDB, Cockroach, MyRocks, and Ceph) therefore adopt RocksDB as their embedded KV storage engine.

RocksDB uses an LSM-Tree (log-structured merge tree) to achieve high-speed writing. During a write, data is first written to the Memtable in memory and is then asynchronously and sequentially flushed to disk. Since sequential writes to disk are much faster than random writes, this yields a large performance gain. However, this structure generates a large number of SST files during writing, and a read must determine whether the Key is contained in an SST file. RocksDB currently performs this task with a Bloom filter, which can reach one of two conclusions: 1. the Key is possibly in the set; 2. the Key is definitely not in the set. If the Key is not in the set, retrieval from that SST file can be skipped, which reduces disk I/O and improves query performance. However, when RocksDB serves as the KV storage engine of a database, a version number is appended to each written Key in order to support MVCC (multi-version concurrency control), so range queries are required, and a Bloom filter cannot serve them.
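The limitation can be illustrated with a small sketch. The Bloom filter below is a generic textbook version, and the Key encoding (user Key, a "#" separator, and a zero-padded version number) is a hypothetical format chosen for readability; neither reflects the actual RocksDB implementation or its internal Key layout.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter; answers only exact-Key membership questions."""

    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def may_contain(self, key):
        # False means "definitely absent"; True means "possibly present"
        return all((self.array >> pos) & 1 for pos in self._positions(key))


def versioned(user_key, version):
    """Hypothetical encoding: user Key + separator + zero-padded version."""
    return f"{user_key}#{version:08d}"


def version_range(user_key):
    """Bounds that enclose every version of user_key under this encoding."""
    return f"{user_key}#", f"{user_key}#\uffff"


bloom = BloomFilter()
stored = [versioned("data", 3), versioned("data", 7), versioned("date", 1)]
for key in stored:
    bloom.add(key)

# A reader that does not know which versions exist must scan a range,
# and the Bloom filter offers no way to test a range for emptiness.
low, high = version_range("data")
hits = [key for key in stored if low <= key <= high]
```

The filter can confirm or rule out one exact encoded Key such as `versioned("data", 3)`, but it has no operation that takes `low` and `high`, which is the gap a range-capable filter is meant to close.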

The hybrid filter of the invention supports both single-value queries and range queries. The query process is as follows: the prefix is matched to the corresponding root node (if a match exists), and hash hit verification is then performed on the Key; if hit verification fails for any one hash function, the Key is not in the set.
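The query process can be sketched as below. The class name, the default parameters, the hashing scheme, and in particular the range check are assumptions: the text describes prefix matching followed by hash verification for a Key, and this sketch extends the prefix step to ranges by testing whether any stored prefix lies between the truncated bounds.

```python
import hashlib


class TbMFSketch:
    """Illustrative prefix-trie filter; not the patented encoding."""

    def __init__(self, prefix_len=6, array_bits=64, num_hashes=2, pad="\x00"):
        self.prefix_len = prefix_len
        self.array_bits = array_bits
        self.num_hashes = num_hashes
        self.pad = pad
        self.root = {}     # nested dicts: character -> child node
        self.arrays = {}   # padded prefix -> binary array stored as an int

    def _prefix(self, key):
        return key[: self.prefix_len].ljust(self.prefix_len, self.pad)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.array_bits

    def add(self, key):
        prefix = self._prefix(key)
        node = self.root
        for ch in prefix:
            node = node.setdefault(ch, {})
        bits = self.arrays.get(prefix, 0)
        for pos in self._positions(key):
            bits |= 1 << pos
        self.arrays[prefix] = bits

    def _match(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def may_contain(self, key):
        prefix = self._prefix(key)
        if not self._match(prefix):
            return False  # no matching prefix node: Key is not in the set
        bits = self.arrays[prefix]
        # One failed hash check is enough to prove the Key is absent
        return all((bits >> pos) & 1 for pos in self._positions(key))

    def may_contain_range(self, low, high):
        # Assumed range check: truncation with minimum-character padding
        # preserves order, so any Key in [low, high] has a prefix in [lo, hi].
        # A linear scan keeps the sketch short; a real filter would walk the Trie.
        lo, hi = self._prefix(low), self._prefix(high)
        return any(lo <= p <= hi for p in self.arrays)


tbmf = TbMFSketch()
for key in ["data", "date", "datating", "database", "dataset"]:
    tbmf.add(key)
```

For example, `tbmf.may_contain("zebra")` is rejected at the prefix step without any hashing, and `tbmf.may_contain_range("e", "f")` is rejected because no stored prefix falls between the truncated bounds.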

On this basis, after the hybrid filter is compressed, the compressed TbMF effectively reduces the FPR and I/O throughput. Verification with a 100G data set in RocksDB showed a performance improvement of about 2 times for range queries, and the improvement is more pronounced when a higher proportion of range queries return empty results.
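The reason empty range results benefit most can be seen in a toy model of file filtering. In the sketch below, a set of truncated prefixes stands in for the per-file TbMF, and the file names, Keys, and helper names are all hypothetical; it only counts how many files a range query would need to read and makes no claim about actual RocksDB behavior.

```python
def truncated(key, prefix_len=6, pad="\x00"):
    """Truncate a Key to the prefix length, padding short Keys."""
    return key[:prefix_len].ljust(prefix_len, pad)


class SstFile:
    """Stand-in for an SST file with an attached range-capable filter."""

    def __init__(self, name, keys):
        self.name = name
        self.keys = sorted(keys)
        # A plain set of truncated prefixes stands in for the TbMF structure
        self.prefixes = {truncated(k) for k in keys}

    def may_overlap(self, low, high):
        lo, hi = truncated(low), truncated(high)
        return any(lo <= p <= hi for p in self.prefixes)

    def scan(self, low, high):
        return [k for k in self.keys if low <= k <= high]


def range_query(files, low, high):
    """Return matching Keys and the number of files that had to be read."""
    results, files_read = [], 0
    for f in files:
        if not f.may_overlap(low, high):
            continue  # filter says no overlap: skip the disk read
        files_read += 1
        results.extend(f.scan(low, high))
    return sorted(results), files_read


files = [
    SstFile("000001.sst", ["apple", "apricot"]),
    SstFile("000002.sst", ["data", "database", "dataset"]),
    SstFile("000003.sst", ["zebra", "zone"]),
]
results, files_read = range_query(files, "data", "date")
empty_results, empty_reads = range_query(files, "m", "n")
```

In this toy model the query for the range "m" to "n" reads no files at all, whereas without a range-capable filter every file would have to be opened to learn that the result is empty.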

Databases built on RocksDB, such as Langchao (Inspur) Yunxi DB, TiDB, MyRocks, and Cockroach, can use the hybrid filter to obtain better range query performance and thereby improve overall database query performance.

The above-mentioned embodiments are merely preferred embodiments intended to fully illustrate the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims

1. A method for realizing a hybrid filter based on a Trie dictionary tree is characterized in that the Trie dictionary tree is used as a basic structure of the hybrid filter,

2. The method of claim 1, wherein the Trie is compressed, and nodes without branches on a line are merged.

3. A hybrid filter based on a Trie dictionary tree, characterized in that the Trie dictionary tree is used as a basic structure,

4. The Trie-based hybrid filter according to claim 3, wherein the Trie is compressed by merging nodes without branches along a path.

5. Use of a Trie-based hybrid filter according to claim 3 or 4.

6. The use of the Trie-based hybrid filter according to claim 5, wherein the filter is applied to a storage engine based on an LSM-Tree for filtering files during range queries and single-value queries.

7. The use of the Trie-based hybrid filter according to claim 5, wherein the filter is applied to RocksDB for filtering files during range queries and single-value queries.

8. The use of the Trie-based hybrid filter according to claim 6 or 7, wherein the query process is performed as follows:

the prefix is used to find a matching root node,