CN111966654A - Mixed filter based on Trie dictionary tree - Google Patents

Info

Publication number
CN111966654A
Authority
CN
China
Prior art keywords
trie
dictionary tree
tree
hybrid filter
trie dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010829227.2A
Other languages
Chinese (zh)
Inventor
张炜刚
贾德星
孙思清
高传集
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd filed Critical Inspur Cloud Information Technology Co Ltd
Priority to CN202010829227.2A priority Critical patent/CN111966654A/en
Publication of CN111966654A publication Critical patent/CN111966654A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hybrid filter based on a Trie dictionary tree and relates to the field of data storage. The Trie dictionary tree serves as the basic structure of the hybrid filter: a prefix length is set for the Trie dictionary tree, a binary array is added on the root node of its last layer, and the prefix length and the binary array length are adjusted according to memory occupation. The filter enables efficient file filtering during range queries and reduces I/O throughput, so that databases built on an LSM-Tree (log-structured merge tree) storage engine obtain better range query performance.

Description

Mixed filter based on Trie dictionary tree
Technical Field
The invention relates to the field of data storage, and in particular to a hybrid filter based on a Trie dictionary tree.
Background
Storage engines designed for high-speed writes are commonly implemented with an LSM-Tree (log-structured merge tree): a write first goes to the Memtable in memory and is later flushed to disk asynchronously and sequentially. This write path produces a large number of SST files, and a read must determine whether each SST file contains the requested Key. RocksDB currently makes this determination with a Bloom filter. However, to support multi-version control, a version number is appended to every written Key, which turns a Key lookup into a range query. Because a Bloom filter decides hits through hash functions, it supports only single-value queries, so an LSM-Tree-based storage engine must read all SST files on disk to answer a range query, and the disk I/O cost is very high.
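The effect of version suffixes on lookups can be illustrated with a minimal sketch. The encoding below (a fixed-width, big-endian version number appended to the user key) and the names `versioned_key` and `version_range` are assumptions made for illustration only; the source does not specify how versions are encoded, and real engines use their own formats.

```python
from typing import Tuple

VERSION_BYTES = 8  # hypothetical fixed-width version suffix


def versioned_key(user_key: bytes, version: int) -> bytes:
    """Append a fixed-width big-endian version number to the user key."""
    return user_key + version.to_bytes(VERSION_BYTES, "big")


def version_range(user_key: bytes) -> Tuple[bytes, bytes]:
    """Smallest and largest stored key that can belong to user_key."""
    lo = versioned_key(user_key, 0)
    hi = versioned_key(user_key, 2 ** (8 * VERSION_BYTES) - 1)
    return lo, hi


# Two versions of the same user key are stored.
stored = {versioned_key(b"data", v) for v in (3, 7)}
lo, hi = version_range(b"data")

# An exact-match lookup on the bare key finds nothing ...
exact_hit = b"data" in stored
# ... while a scan over [lo, hi] finds both versions.
range_hits = sorted(k for k in stored if lo <= k <= hi)
```

Because the stored keys never equal the bare user key, a filter that only answers exact-match questions cannot rule a file out for this lookup; the question has to be asked about the whole interval `[lo, hi]`.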
Disclosure of Invention
To address these problems in the prior art, the invention provides a hybrid filter based on a Trie dictionary tree (Trie-based Mixed Filter, TbMF). TbMF is a space-efficient hit-test data structure. Unlike the existing Bloom filter, which supports filtering only for single-value queries, TbMF can filter both single-value queries and range queries.
The specific scheme provided by the invention is as follows:
A method for implementing a hybrid filter based on a Trie dictionary tree: the Trie dictionary tree is taken as the basic structure of the hybrid filter,
the prefix length of the Trie dictionary tree is set, and a binary array is added on the root node of the last layer of the Trie dictionary tree,
and the prefix length and the binary array length are adjusted according to memory occupation.
In this method, the Trie dictionary tree is compressed by merging the nodes on a path that has no branches.
A hybrid filter based on a Trie dictionary tree: the Trie dictionary tree is taken as the basic structure,
the prefix length of the Trie dictionary tree is set, and a binary array is added on the root node of the last layer of the Trie dictionary tree,
and the prefix length and the binary array length are adjusted according to memory occupation.
In this hybrid filter, the Trie dictionary tree is compressed by merging the nodes on a path that has no branches.
The invention also provides an application of the hybrid filter based on the Trie dictionary tree.
In this application, the hybrid filter is used in a storage engine based on an LSM-Tree to filter files during range queries and single-value queries.
In this application, the hybrid filter is used in RocksDB to filter files during range queries and single-value queries.
In this application, the query process is as follows:
the matching root node is located by the prefix,
if the match succeeds, Hash hit verification is performed on the Key,
if hit verification fails for any one Hash function, the Key is not in the set.
The invention has the advantages that:
The invention provides a hybrid filter based on a Trie dictionary tree that helps a storage engine based on an LSM-Tree filter files efficiently during range queries and single-value queries, which improves query performance and reduces disk I/O consumption and I/O throughput. Compression coding of the filter structure reduces memory consumption, and configuring the prefix length and the binary array length lets users trade off memory consumption against the false positive rate (FPR) to suit their scenario.
Drawings
FIG. 1 is a schematic diagram of a truncated Trie tree according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of adding a binary array to a root node according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of compressing TbMF according to an embodiment of the present invention;
FIG. 4 is a flow chart of the query process in the application of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The invention provides a method for implementing a hybrid filter based on a Trie dictionary tree, in which the Trie dictionary tree is taken as the basic structure of the hybrid filter,
the prefix length of the Trie dictionary tree is set, and a binary array is added on the root node of the last layer of the Trie dictionary tree,
and the prefix length and the binary array length are adjusted according to memory occupation.
This method yields a Trie-based Mixed Filter (TbMF for short). Limiting the length of the Trie structure simplifies encoding and querying. Adding a binary array on the root node of the last layer avoids misjudgments in which the prefix of a string matches but its suffix does not. Setting the prefix length and the binary array length adjusts memory occupation, so memory occupation and the false positive rate (FPR) can be balanced effectively. In addition, a Trie dictionary tree can count, sort, and store large numbers of strings and is often used by search engine systems for text word-frequency statistics; it uses the common prefixes of strings to reduce query time and minimize unnecessary string comparisons, and its query efficiency is higher than that of a hash tree.
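The trade-off between memory occupation and FPR can be made concrete with the standard Bloom-filter estimate. Applying it to a single binary array of the TbMF is an assumption made here for illustration, as are the symbols: m is the number of bits in one binary array, n is the number of Keys hashed into that array, and k is the number of Hash functions. The source gives no formula of its own.

```latex
\mathrm{FPR} \approx \left(1 - e^{-kn/m}\right)^{k}
```

Under this assumption, lengthening the binary array (larger m) lowers the FPR at the cost of memory, while a longer prefix spreads the Keys over more last-layer nodes (smaller n per array) at the cost of a larger Trie.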
On this basis, when the prefix length of the Trie dictionary tree is set, Keys shorter than that length are padded. A binary array is added on the root node of the last layer of the Trie dictionary tree, and the TbMF is then compressed: nodes on a path of the Trie dictionary tree that has no branches can be merged into one node. Referring to FIGS. 1-3, FIG. 1 illustrates the structure of the Trie after string Keys are written. In this example, 5 Keys are written: data, date, datating, database, and dataset. A node marked with oblique lines indicates that the path to that node forms a complete Key. To make the Trie smaller, Keys are truncated to a Length of 6, and shorter Keys are padded. In FIG. 2, a binary array is added to the root node of the truncated Trie, which improves the query performance for a single Key. FIG. 3 shows the TbMF compression process: nodes on a path without branches are merged, which reduces the depth of the tree.
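The truncation, padding, and merging steps described above can be sketched as follows, using the five example Keys and the Length of 6. This is a minimal illustration, not the patented encoding: the padding symbol `PAD`, the nested-dictionary representation, and the function names are all assumptions, and the binary arrays are left out to keep the focus on the tree shape.

```python
PREFIX_LEN = 6  # truncation length from the example
PAD = "\0"      # hypothetical padding symbol; the source does not name one


def truncate(key: str) -> str:
    """Cut key to PREFIX_LEN characters, padding keys that are too short."""
    return key[:PREFIX_LEN].ljust(PREFIX_LEN, PAD)


def build_trie(keys):
    """Plain character trie over the truncated keys, as nested dicts."""
    root = {}
    for key in keys:
        node = root
        for ch in truncate(key):
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge each chain of single-child nodes into one edge with a longer label."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


keys = ["data", "date", "datating", "database", "dataset"]
trie = compress(build_trie(keys))
```

With these inputs the uncompressed trie is six levels deep, while the compressed one has three: the shared edge `dat`, then the branch point at the fourth character, then the merged tails.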
The present invention further provides a hybrid filter corresponding to the implementation method, and since the structure of the hybrid filter obtained in the implementation method of the present invention is consistent, the embodiments are also based on the same concept, and specific contents may refer to descriptions in the embodiments of the method of the present invention, and are not described herein again.
The invention also provides an application of the hybrid filter, mainly in storage engines based on the LSM-Tree, where it filters files efficiently during range queries and single-value queries, improves query performance, and reduces disk I/O consumption. RocksDB is taken as an example of an LSM-Tree-based storage engine. RocksDB is an open-source Key/Value storage engine implemented in C++. It makes efficient use of fast storage such as solid-state disks and flash memory, and is also well suited to commodity hosts with multiple CPU cores. Many common distributed databases and software stacks (such as Inspur (Langchao) Yunxi DB, TiDB, Cockroach, MyRocks, and Ceph) therefore adopt RocksDB as their embedded KV storage engine.
RocksDB uses an LSM-Tree (log-structured merge tree) to achieve high-speed writes. A write first goes to the Memtable in memory and is later flushed to disk asynchronously and sequentially. Because sequential disk writes are much faster than random writes, this yields a large performance gain. However, the structure generates a large number of SST files during writing, and a read must determine whether a Key is contained in an SST file. RocksDB currently does this with a Bloom filter, which can reach one of two conclusions: 1. the Key may be in the set; 2. the Key is definitely not in the set. If the Key is not in the set, the SST file can be skipped, which reduces disk I/O and improves query performance. When RocksDB serves as the KV storage engine of a database, however, version numbers are appended to written Keys to support MVCC (multi-version concurrency control), so lookups become range queries, which a Bloom filter cannot filter.
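The two conclusions a Bloom filter can reach are easiest to see in a small sketch. The class below is a generic textbook Bloom filter written for illustration; it is not RocksDB's implementation, and the name `BloomSketch`, the sizes, and the use of salted SHA-256 as the hash family are assumptions.

```python
import hashlib


class BloomSketch:
    """Tiny Bloom filter: num_bits bits and num_hashes salted SHA-256 hashes."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # One bit position per hash function, derived from a salted digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key: bytes) -> bool:
        # False: the key is definitely not in the set.
        # True: the key is possibly in the set (false positives can occur).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


sst_filter = BloomSketch()
for key in (b"data", b"date", b"database"):
    sst_filter.add(key)
```

`may_contain` takes one exact key, because the bit positions are derived from a hash of the whole key. An interval of keys has no single hash, so the only way to test a range against such a filter would be to probe every possible key inside it.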
The hybrid filter of the invention supports both single-value queries and range queries. The query process is as follows: the prefix is matched to the corresponding root node (if it can be matched), and Hash hit verification is then performed on the Key; if hit verification fails for any one Hash function, the Key is not in the set.
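The query process can be sketched under stated assumptions. The class `TbMFSketch`, its constants, the padding symbol, and the salted SHA-256 hash family are all hypothetical. The single-value path follows the two steps in the text (prefix match, then Hash hit verification). The range path is an assumption of this sketch: the source does not spell out how a range is tested, so the code simply checks whether any stored prefix lies between the truncated range bounds.

```python
import hashlib

PREFIX_LEN = 6   # assumed prefix length
BITS = 64        # assumed binary array length per last-layer node
NUM_HASHES = 3   # assumed number of Hash functions
PAD = "\0"       # hypothetical padding symbol


def _prefix(key):
    return key[:PREFIX_LEN].ljust(PREFIX_LEN, PAD)


def _positions(key):
    """Bit positions for the full key, one per Hash function."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % BITS


class TbMFSketch:
    def __init__(self):
        self.root = {}

    def add(self, key):
        node = self.root
        for ch in _prefix(key):
            node = node.setdefault(ch, {})
        # The binary array lives on the last-layer node, stored as an int
        # under the multi-character key "bits" (edges are single characters).
        bits = node.get("bits", 0)
        for pos in _positions(key):
            bits |= 1 << pos
        node["bits"] = bits

    def may_contain(self, key):
        node = self.root
        for ch in _prefix(key):
            node = node.get(ch)
            if node is None:
                return False  # prefix not matched: Key is not in the set
        bits = node["bits"]
        # Any single failed Hash check means the Key is not in the set.
        return all((bits >> pos) & 1 for pos in _positions(key))

    def _stored_prefixes(self, node, path=""):
        if len(path) == PREFIX_LEN:
            yield path
            return
        for ch, child in node.items():
            yield from self._stored_prefixes(child, path + ch)

    def may_overlap(self, lo, hi):
        """Assumed range test: does any stored prefix fall within [lo, hi]?"""
        lo_p, hi_p = _prefix(lo), _prefix(hi)
        return any(lo_p <= p <= hi_p for p in self._stored_prefixes(self.root))


tbmf = TbMFSketch()
for k in ["data", "date", "datating", "database", "dataset"]:
    tbmf.add(k)
```

In this sketch a `False` from either method lets the caller skip the file, while a `True` only means the file has to be read. The range test enumerates all stored prefixes for simplicity; a practical version would walk the Trie and prune subtrees outside the bounds.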
On this basis, compressing the hybrid filter yields a compressed TbMF that effectively reduces the FPR and I/O throughput. Verification with a 100G data set in RocksDB showed about a 2x performance improvement for range queries, and the improvement is more pronounced when a higher proportion of range queries return empty results.
Databases built on RocksDB, such as Langchao Yunxi DB, TiDB, MyRocks, and Cockroach, can use the hybrid filter to obtain better range query performance and improve overall query performance.
The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and the scope of protection of the present invention is not limited to them. Equivalent substitutions or changes made by those skilled in the art on the basis of the present invention all fall within the scope of protection of the present invention. The scope of protection of the present invention is defined by the claims.

Claims (8)

1. A method for implementing a hybrid filter based on a Trie dictionary tree, characterized in that the Trie dictionary tree is used as the basic structure of the hybrid filter,
the prefix length of the Trie dictionary tree is set, and a binary array is added on the root node of the last layer of the Trie dictionary tree,
and the prefix length and the binary array length are adjusted according to memory occupation.
2. The method of claim 1, wherein the Trie dictionary tree is compressed by merging the nodes on a path that has no branches.
3. A hybrid filter based on a Trie dictionary tree, characterized in that the Trie dictionary tree is used as the basic structure,
the prefix length of the Trie dictionary tree is set, and a binary array is added on the root node of the last layer of the Trie dictionary tree,
and the prefix length and the binary array length are adjusted according to memory occupation.
4. The hybrid filter based on a Trie dictionary tree according to claim 3, wherein the Trie dictionary tree is compressed by merging the nodes on a path that has no branches.
5. Use of the hybrid filter based on a Trie dictionary tree according to claim 3 or 4.
6. The use of the hybrid filter based on a Trie dictionary tree according to claim 5, wherein the filter is applied in a storage engine based on an LSM-Tree to filter files during range queries and single-value queries.
7. The use of the hybrid filter based on a Trie dictionary tree according to claim 5, wherein the filter is applied in RocksDB to filter files during range queries and single-value queries.
8. The use of the hybrid filter based on a Trie dictionary tree according to claim 6 or 7, wherein the query process is as follows:
the matching root node is located by the prefix,
if the match succeeds, Hash hit verification is performed on the Key,
if hit verification fails for any one Hash function, the Key is not in the set.
CN202010829227.2A 2020-08-18 2020-08-18 Mixed filter based on Trie dictionary tree Pending CN111966654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010829227.2A CN111966654A (en) 2020-08-18 2020-08-18 Mixed filter based on Trie dictionary tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010829227.2A CN111966654A (en) 2020-08-18 2020-08-18 Mixed filter based on Trie dictionary tree

Publications (1)

Publication Number Publication Date
CN111966654A true CN111966654A (en) 2020-11-20

Family

ID=73388190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010829227.2A Pending CN111966654A (en) 2020-08-18 2020-08-18 Mixed filter based on Trie dictionary tree

Country Status (1)

Country Link
CN (1) CN111966654A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103428093A (en) * 2013-07-03 2013-12-04 北京邮电大学 Route prefix storing, matching and updating method and device based on names
CN103873371A (en) * 2014-02-21 2014-06-18 北京邮电大学 Name routing fast matching search method and device
CN103942289A (en) * 2014-04-12 2014-07-23 广西师范大学 Memory caching method oriented to range querying on Hadoop
CN104052669A (en) * 2013-03-12 2014-09-17 西普联特公司 Apparatus and Method for Processing Alternately Configured Longest Prefix Match Tables
CN105117417A (en) * 2015-07-30 2015-12-02 西安交通大学 Read-optimized memory database Trie tree index method
KR101587756B1 (en) * 2015-02-17 2016-01-21 이화여자대학교 산학협력단 Apparatus and method for searching string data using bloom filter pre-searching
CN109284299A (en) * 2015-06-08 2019-01-29 南京航空航天大学 Reconstruct the method with the hybrid index of storage perception

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHRUTI MISHRA 等: "Improved Search Technique Using Wildcards or Truncation", 2009 IEEE, 31 December 2009 (2009-12-31), pages 1 - 4 *
曹广顺 等: "一种基于key-value数据库的快速地名地址输入提示方法", 计算机应用研究, vol. 34, no. 11, 30 November 2017 (2017-11-30), pages 3334 - 3338 *
袁津生 等编著: "搜索引擎原理与实践", 30 November 2008, 北京邮电大学出版社, pages: 109 - 116 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667636A (en) * 2020-12-30 2021-04-16 杭州趣链科技有限公司 Index establishing method, device and storage medium
CN113434661A (en) * 2021-06-29 2021-09-24 平安科技(深圳)有限公司 Method and device for prompting draft simulation of official document, electronic equipment and storage medium
WO2023125630A1 (en) * 2021-12-31 2023-07-06 华为技术有限公司 Data management method and related apparatus
CN116414304A (en) * 2022-12-30 2023-07-11 蜂巢科技(南通)有限公司 Data storage device and storage control method based on log structured merging tree
CN116414304B (en) * 2022-12-30 2024-03-12 蜂巢科技(南通)有限公司 Data storage device and storage control method based on log structured merging tree

Similar Documents

Publication Publication Date Title
CN111966654A (en) Mixed filter based on Trie dictionary tree
US8255398B2 (en) Compression of sorted value indexes using common prefixes
US11169978B2 (en) Distributed pipeline optimization for data preparation
US11461304B2 (en) Signature-based cache optimization for data preparation
JP6553649B2 (en) Clustering storage method and apparatus
US7373464B2 (en) Efficient data storage system
CN109299086B (en) Optimal sort key compression and index reconstruction
CN109325032B (en) Index data storage and retrieval method, device and storage medium
CN105069111A (en) Similarity based data-block-grade data duplication removal method for cloud storage
EP1866776A1 (en) Method for detecting the presence of subblocks in a reduced-redundancy storage system
CN113535670B (en) Virtual resource mirror image storage system and implementation method thereof
US20240160609A1 (en) System and method for providing randomly-accessible compacted data
CN108475508B (en) Simplification of audio data and data stored in block processing storage system
KR20170065374A (en) Method for Hash collision detection that is based on the sorting unit of the bucket
CN106940708A (en) A kind of method and system that the positioning of IP scopes is realized based on binary chop
CN113468571A (en) Tracing method based on block chain
CN110515897B (en) Method and system for optimizing reading performance of LSM storage system
CN102214216A (en) Aggregation summarization method for keyword search result of hierarchical relation data
US11991290B2 (en) Associative hash tree
CN106599326B (en) Recorded data duplication eliminating processing method and system under cloud architecture
US20130111164A1 (en) Hardware compression using common portions of data
WO2020238750A1 (en) Data processing method and apparatus, electronic device, and computer storage medium
KR101299555B1 (en) Apparatus and method for text search using index based on hash function
US11288447B2 (en) Step editor for data preparation
US11422975B2 (en) Compressing data using deduplication-like methods

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination