CN111373389B - Data storage system and method for providing a data storage system

Data storage system and method for providing a data storage system

Info

Publication number
CN111373389B
Authority
CN
China
Prior art keywords
lock
node
symbol
storage system
data storage
Prior art date
Legal status
Active
Application number
CN201780097046.1A
Other languages
Chinese (zh)
Other versions
CN111373389A (en)
Inventor
Sergey Romanovich Bashirov (谢尔盖·罗曼诺维奇·巴希罗夫)
Aleksandr Aleksandrovich Simak (阿莱克桑德尔·阿列克山德罗维奇·西马克)
Jun Xu (徐君)
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN111373389A
Application granted granted Critical
Publication of CN111373389B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees


Abstract

The present invention provides a data storage system 100 having a data storage 104 and a data controller 101 for implementing a prefix tree 600 having a plurality of nodes, wherein the data controller 101 is configured to initiate a write operation in the prefix tree 600, and the data controller 101 is configured to provide a lock for a single symbol of a plurality of symbols at a node 602, 603 under the write operation.

Description

Data storage system and method for providing a data storage system
Technical Field
The present invention relates to a data storage system and a method for providing a data storage system. More particularly, the present invention relates to a data storage system and method for improving scalability of write operations between multiple writers.
Background
Data lookup is essential to all computer programs, including information databases and storage systems. A program needs to locate data quickly for subsequent retrieval or computation. Typically, such lookup functionality is provided by index data structures that contain the corresponding locations or addresses of the stored data, organized by certain data attributes. Thus, it is not necessary to traverse the entire stored data set; it suffices to find the corresponding index record. An index is usually a stand-alone tool with an efficient interface for quickly retrieving data locations by the requested attributes.
In the presence of concurrency, where several participants modify the index at the same time, the overall operational efficiency depends largely on the method of concurrency synchronization. Typically, writers modify an index structure record exclusively or in separate node copies, so that the index structure always represents a complete and correct state of the stored data set. Synchronization of tree-based data structures is generally considered a bottleneck.
Various tree-based indexing structures have been known for decades. Although they differ in nature, many of them share certain common features, such as a single root node, internal nodes that interconnect to form a tree, and last-level nodes called leaves. Typically, the payload data is held within the leaves, and the internal nodes store a number of different attribute values that are selected during the traversal lookup.
Prefix trees are a special case of tree-based index data structures. Instead of storing the attribute values inside the internal nodes, this information is stored in the node interconnections. Therefore, during prefix tree traversal, the child nodes do not need to be checked one by one against the searched attribute value; the child node corresponding to the attribute value index is simply selected, if it exists. Although such trees are significantly simpler and their traversal algorithms are compact, they are not widely used as index structures for general-purpose databases or storage systems due to the large memory overhead. In practice, each prefix tree node needs to hold the full range of possible attribute values, even though only a few of the corresponding child nodes actually exist. This implies an exponential increase in memory consumption.
In recent years, prefix trees have been adapted for general-purpose use. The most important modification is to provide nodes of variable size. This means that there are internal nodes of different capacities, and a node capacity is selected during tree modification according to the number of child nodes required. This method is commonly referred to as horizontal compression. Another valuable improvement is to skip all internal nodes with only a single child node, which requires keeping the common attributes inside the intermediate node and storing the search attributes in the corresponding leaf. This approach is commonly referred to as vertical compression or key sequence skipping. Both improvements lead directly to more moderate memory consumption.
Another important problem with known prefix trees is the method of concurrent access, or synchronization. Traditionally, synchronization techniques serialize writers that attempt to modify the same tree node at the same time, even when they change different portions of the node. Such solutions are simple but scale poorly. More advanced techniques allow several writers to modify a tree node in parallel. However, they require complex data structures and are often more difficult to implement. Hardware transactional memory may be considered as an alternative, but proper hardware support is still limited at the time of writing.
Typically, the original prefix tree design requires a large amount of memory, and therefore the use of such trees is very limited. The so-called adaptive radix tree (Adaptive Radix Tree, ART) is an advanced prefix tree design that is highly optimized for memory usage. ART integrates all known prefix tree compression techniques: instead of just one symbol per internal node, each internal node may contain several symbols, called a common prefix; if the rest of the key is unique, key sequence skipping is applied to reduce the number of internal nodes. Further, each internal node has an adaptive size and grows or shrinks according to the number of child nodes. In general, a modification in a prefix tree affects at most two nodes: the node under change and possibly its parent. Thus, several known concurrency control mechanisms can be effectively applied to ART or prefix trees.
Lock coupling is a standard method for synchronizing B-trees and can easily be applied to prefix trees. The idea is to hold at most two locks at a time during a tree traversal. Starting from the root node, each time the traversal descends one level in the prefix tree, the lock on the child node is acquired and the parent lock is released. The method allows simultaneous modification of different prefix tree nodes when they have different parent nodes. The problem with the lock coupling algorithm is high contention on the first levels of the tree: every participant starts from the root node, so there is high contention at level zero of the tree and a high probability of contention at the first level.
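As an illustration of the classic lock coupling descent described above, the following minimal C sketch assumes a simplistic node type with one mutex and a full 256-way child array; the type, field, and function names are assumptions made for this sketch and are not taken from the patent.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative prefix tree node: one mutex plus a full 256-way child array. */
typedef struct pt_node {
    pthread_mutex_t lock;
    struct pt_node *children[256];
} pt_node;

/* Classic lock coupling: hold at most two locks while descending.
 * The parent lock is released only after the child lock is acquired,
 * so writers working under different parents can proceed in parallel. */
static pt_node *descend_lock_coupled(pt_node *root, const unsigned char *key, size_t len)
{
    pthread_mutex_lock(&root->lock);
    pt_node *cur = root;
    for (size_t i = 0; i < len; i++) {
        pt_node *child = cur->children[key[i]];
        if (child == NULL)
            break;                      /* insertion point found; 'cur' stays locked */
        pthread_mutex_lock(&child->lock);
        pthread_mutex_unlock(&cur->lock);
        cur = child;
    }
    return cur;                         /* caller modifies and then unlocks 'cur' */
}
```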
Another approach is optimistic lock coupling, an enhanced lock coupling method with better performance. Rather than acquiring a lock at every step to prevent modifications to the node, each participant assumes that there are no concurrent modifications. Modifications are detected using a version counter, and if a modification is detected, the operation is restarted. Otherwise, at most two locks are needed to perform the modification (for the node and its parent node). Optimistic lock coupling shows better performance because locks are acquired on demand and the total contention on the tree nodes is not high. However, its main problem is coarse-grained locks. A coarse-grained lock may block multiple tree nodes that are not actually being modified. For example, if one participant locks a tree node, another participant cannot lock its child node, because it would also need to lock the parent node.
The read-optimized write-exclusion (ROWEX) method shows better performance. Each internal node is extended with a level field indicating the total length of the key sequence up to the node's level. Through atomic modification of the prefix and prefix length, and of the node pointers, readers can operate without any obstruction (locking, retries, etc.). Writers work in a similar way as in optimistic lock coupling, but perform additional actions so that readers always observe the prefix tree in a correct state. For writers, ROWEX uses a locking method similar to optimistic lock coupling: each writer holds two exclusive locks, on the tree node that needs modification and on its parent node. The main difference is that the writer performs additional actions to make readers work without locks. Most of these additional actions are atomic updates to the tree node fields, which are important for readers to read the tree correctly.
Copy-on-write (COW) methods are commonly used for lock-free algorithms. The key idea is to make a hidden copy of the modified node, perform all changes on that copy, and then make it visible. Conventional locks, compare-and-swap operations, and node versioning may be used to implement COW. The COW method is suitable for read-oriented workloads, e.g., workloads with 95% read operations and 5% write operations. If a writer detects a concurrent change while preparing the tree node copy, it restarts the operation and redoes the copy. The additional memory copying and allocation may also result in poor write performance. Alternatively, COW may be used with a lock to avoid restarting; in this case, the scalability of write operations is quite low. In addition, for non-volatile memories, where the write speed is slower but the read speed is close to DRAM, the additional copying takes more time, so fine-grained locks may actually be faster.
Hardware transactional memory (Hardware Transactional Memory, HTM) is a hardware mechanism for detecting conflicts and undoing any changes made to shared data. The goal of such hardware is to transparently support code regions marked as transactions by enforcing atomicity, consistency, and isolation. In this scenario, all participants always observe a consistent state of the tree nodes, since modifications are only performed within a transaction and thus become visible to all participants only after a successful commit. When the commit operation fails, the modification needs to be restarted because the nodes have been changed concurrently. HTM is very specialized hardware that is not yet well supported, and today there is little opportunity to use it in production.
Software transactional memory (Software Transactional Memory, STM) is an emulation of HTM on a machine without HTM support. However, an optimal STM implementation requires hardware support for wide compare-and-swap instructions (e.g., DCAS); otherwise the implementation of this approach is complex and exceptionally slow. Unfortunately, DCAS operations are supported only by the latest processors and are not widely available. Furthermore, in practice, the performance of STM systems also suffers compared to fine-grained lock-based systems, mainly due to the overhead associated with maintaining logs and the time it takes to commit transactions.
Concurrent implementation of radix trees is complex because key compression may leave the tree in an intermediate state for a period of time until the transition is complete. Simultaneous access to a node by readers and writers needs to be handled properly to avoid exposing such an intermediate state of the tree.
Due to key compression, a write operation may need to decompress a portion of the key in the active node and branch off another path. This is the most complex case and results in a change in the structure of the tree. For example, in an adaptive radix tree, the node may need to be reallocated at a different size. Such operations are complex and often force one writer to wait until another writer completes its write operation.
Disclosure of Invention
In view of the above problems and disadvantages, the present invention is directed to improving the performance of prefix trees. It is therefore an object of the present invention to improve the scalability of write operations between a plurality of writers.
The object of the invention is achieved by the solution provided in the attached independent claims. Advantageous embodiments of the invention are further defined in the dependent claims.
In particular, the present invention proposes a data storage system comprising a memory efficient index data structure and a unique concurrency control method for synchronizing several writers, providing high scalability in writing.
In databases and storage systems, the performance of the index data structure is very important and has a significant impact on the final performance. To support simultaneous access by data users, the index data structure is provided with a concurrency control mechanism. Known concurrency control mechanisms are complex and cause additional overhead; sometimes they even serialize parallel accesses, which adversely limits the scalability of the system on multi-core hardware. The invention addresses the scalability problem of the index data structure by improving the concurrency control mechanism for write operations.
The invention can be seen as a fine-grained locking method suitable as a concurrency control mechanism for prefix trees used as index data structures in databases and storage systems. The prefix tree may also apply vertical and horizontal compression, key sequence skipping, variable-size nodes, and advanced synchronization between writers and readers. The tree operates over a uniform search domain of all words over a finite alphabet, i.e., strings. The branching factor of the tree nodes is defined by the cardinality of the alphabet. The invention builds on concepts such as variable node size, common node prefixes, and node locks, extending node locks with multi-symbol semantics. The variable node size technique allows the actual branching factor of a particular node to be determined based on actual demand. The common node prefix enables vertical compression of the radix tree when child nodes share several symbols, or a prefix, within their keys.
A multi-symbol lock allows acquiring either an exclusive lock on the entire node or a selective lock on a specific symbol of the alphabet within the node. Locking a particular symbol of the alphabet provides a tool to lock only the one corresponding child node, while an exclusive lock locks all child nodes. Writers can therefore modify different child nodes of the same parent node simultaneously. Furthermore, if writers work on different symbols of the alphabet, they may insert or delete in the same node at the same time. In more complex situations, which occur infrequently, for example when the common prefix needs to be changed or the variable-size node should be expanded or contracted, a writer is still able to acquire an exclusive lock. The proposed optimistic relaxed lock coupling synchronization method is an enhancement of the optimistic lock coupling scheme in which the at most two exclusive locks used for tree modification are replaced by multi-symbol locks. This approach achieves better scalability and better performance in a concurrent environment.
A first aspect of the present invention provides a data storage system having a data storage and a data controller for implementing a prefix tree having a plurality of nodes, wherein the data controller is for initiating a write operation in the prefix tree; the data controller is to provide a lock for a single symbol of a plurality of symbols at a node under a write operation.
An advantage of the invention is that, in addition to acquiring a fully exclusive lock, it is possible to acquire a relaxed lock covering only one symbol. For prefix trees, this means that even if multiple child nodes have the same parent node, they can be modified at the same time, since their prefixes are actually different. A further advantage is better scalability than optimistic lock coupling with exclusive locks, achieving an average throughput gain of 30% in benchmark tests. Another advantage of the method of the present invention over copy-on-write methods arises on persistent memory, where the additional copying is costly.
The data storage system according to the present invention can bring several positive effects, such as higher throughput, reduced operation latency, improved system scalability, and reduced CPU load, to concurrent databases and storage systems, as well as other applications that rely heavily on index data structures.
In an implementation manner of the first aspect, the data controller is configured to initiate insertion or deletion of a specific node; the data controller is configured to provide a lock for the specific node, and to provide a lock, at the parent node of the specific node, for the single symbol of the plurality of symbols that represents the specific node. This operation allows locking the child node under operation, while only a single symbol is locked at the parent node. Thus, more than one child node of a single parent node may be modified simultaneously.
In another implementation of the first aspect, the data controller is configured to initiate a change to a single symbol of a plurality of symbols of a particular node; a lock is provided for the single symbol. Such single symbol locks allow multi-symbol operation at a single node because only the corresponding symbol is locked, not the entire node. Furthermore, the parent node need not be locked.
In another implementation manner of the first aspect, the data controller is configured to provide a lock structure with a bit length of 0 to n-1 for a node, where n is a branching factor of the prefix tree, and wherein the data controller is configured to provide a lock for at least one bit of the lock structure. This lock structure allows full range locking of all symbols with resolution of a single symbol. This approach may increase memory usage. However, the integration complexity and execution time are approximately the same as standard procedures.
In another implementation manner of the first aspect, the data controller is configured to provide a lock structure for a node, where the lock structure includes: a symbol section including an array having a length of 0 to k-1 and storing symbols; a lock mask segment with a bit length of 0 to k-1; where k < n, n is the branching factor of the prefix tree, where the data controller is to provide a lock for at least one bit of the lock mask segment. This lock construction allows the use of a compact lock, which can be used as a direct replacement for a conventional lock. Memory usage and execution time are approximately the same as standard procedures, although integration complexity may be higher.
In another implementation manner of the first aspect, the data controller is configured to: initiating a first write operation to a first node in the prefix tree; providing a lock for the first node; providing a lock for a single symbol representing the first node at a parent node of the first node; the data controller is configured to initiate a second write operation to a second node in the prefix tree that has the same parent node as the first node. Two parallel write operations may be performed to two child nodes of a common parent node. Two writers are activated to write simultaneously so that multiple wait-free write operations can be performed at one point in time.
In another implementation of the first aspect, the data controller is configured to provide a lock for the second node; a lock is provided for a single symbol at the parent node representing the second node. The locking operation for the second symbol at the parent node is optional, e.g., when there are only two concurrent write operations, no locking operation is required. For more than two concurrent write operations, corresponding symbols at the parent node are locked.
In another implementation form of the first aspect, the data controller is configured to initiate a first write operation on a first symbol of a node in the prefix tree, where the first symbol represents a first child node of the node; providing a lock for the first symbol; the data controller is configured to initiate a second write operation to a second symbol in the node, where the second symbol represents a second child node of the node. Two concurrent write operations may be performed on two symbols at one node. Two writers are activated to modify multiple symbols at one node at the same time so that multiple wait-free write operations can be performed at one point in time.
In another implementation of the first aspect, the data controller is configured to provide a lock for the second symbol. The locking operation for the second symbol at the parent node is optional, e.g., when there are only two concurrent write operations, no locking operation is required. For more than two concurrent write operations, corresponding symbols at the parent node are locked.
In another implementation manner of the first aspect, the data controller is configured to: provide a lock for the child node under the write operation; and provide locks, at all parent nodes of the child node under the write operation, for the single symbol representing the corresponding child node, such that only a single path to the child node under the write operation is locked. Locking only a single path of single symbols causes minimal interference with other single-node modifications or write operations in the tree and leaves sufficient room for more concurrent write operations. In the prior art, whole subtrees need to be locked.
In another implementation of the first aspect, the data controller is configured to release the lock of the symbol and/or the node after the write operation is completed. Upon termination of the modification, the release gives an opportunity to further write to that particular node or symbol.
In another implementation of the first aspect, the data controller is configured to provide an exclusive lock for a particular node by locking the plurality of symbols at the particular node. The lock concept of the present application still allows the use of exclusive locks, i.e. locks with all symbols at the node.
In another implementation manner of the first aspect, the prefix tree is a radix tree or an adaptive radix tree. The lock concept of the present application is well suited to such trees because for radix trees, each node may represent several symbols.
A second aspect of the present application provides a method for providing a data storage system for implementing a prefix tree having a plurality of nodes, the method comprising: initiating a write operation in the prefix tree; a lock is provided for a single symbol of the plurality of symbols at the node under the write operation. The same modifications and advantages as described above apply.
A third aspect of the application provides a computer program having a program code for performing a method as described above when the computer program is run on a computer or a data storage system as described above. The same modifications and advantages as described above apply.
It should be noted that all devices, elements, units and means described in the present application may be implemented in software or hardware elements or any kind of combination thereof. All steps performed by the various entities described in this application and the functions described to be performed by the various entities are intended to indicate that the various entities are adapted to or for performing the respective steps and functions. Although in the following description of specific embodiments, specific functions or steps performed by external entities are not reflected in the description of specific elements of the entity performing the specific steps or functions, it should be clear to a skilled person that the methods and functions may be implemented in respective hardware or software elements or any combination thereof.
Drawings
The aspects of the invention and the manner of attaining them will be elucidated with reference to the embodiments described hereinafter, taken in conjunction with the accompanying drawings, wherein:
FIG. 1 illustrates a system architecture diagram of a data storage system having an index structure.
Fig. 2 shows a diagram of a full range lock configuration.
Fig. 3 shows an example of a 256-bit full range lock structure.
Fig. 4 shows a diagram of a compact lock structure.
Fig. 5 shows an example of a 64-bit sized compact lock structure.
FIG. 6 illustrates an index tree with two concurrent write operations.
Fig. 7 shows a flow chart of the key insertion algorithm.
Fig. 8 shows a flow chart of a multi-symbol lock algorithm.
Fig. 9 shows a flow chart of the full range lock algorithm.
Fig. 10 shows a flow chart of the compact lock algorithm.
FIG. 11 illustrates a method for providing a data storage system.
Detailed Description
Fig. 1 illustrates a data storage system 100 including a data controller 101. The data controller 101 includes a concurrency control mechanism 102 for a prefix tree or radix tree with vertical compression and key sequence skipping that is used as an index data structure 103. The data storage system 100 also includes a data storage 104. The data controller 101 may be implemented in main memory using dynamic random access memory (Dynamic Random Access Memory, DRAM) or the like, and the data storage 104 typically includes mass storage such as a hard disk or a solid-state drive (SSD).
According to claim 1 of the present invention, the data storage system 100 comprises a data controller 101 and a data storage 104.
Therefore, to locate data quickly, only the index record within the index data structure 103 that corresponds to the request needs to be found. In general, an index data structure organizes raw data into logical units and retains the direct address of each unit on a storage medium, thereby minimizing interactions with the storage medium during data lookup.
Data users 111 may interact with a database or storage system via a network or the internet, or a low-level device protocol, or may run on the same machine. Each data user may act as a writer 112 for adding and modifying data, or as a reader 113 for locating and retrieving data from the data storage 104. In the presence of concurrency, when several participants 111 perform data lookup and modification simultaneously, the overall operational efficiency depends largely on the method 102 of concurrency synchronization or control, which is intended to keep the stored data in a consistent state.
The data storage system 100 may be viewed as providing an advanced synchronization scheme between the plurality of writers 112, applicable as a concurrency control mechanism 102 for an index or prefix tree used as the index data structure 103 in the data storage system 100 and in information databases. Such a prefix tree based index data structure 103 may also be provided with vertical compression, horizontal compression, key sequence skipping, and advanced synchronization between writers 112 and readers 113.
The invention provides a novel prefix tree node locking tool: the multi-symbol lock, together with two implementations thereof, compact locks and full range locks. For any alphabet, the maximum branching factor of a prefix tree node is virtually equal to the cardinality of the alphabet. However, a tree modification mostly affects only one child node of a tree node. Thus, in most cases, a mechanism that locks only a single child node is quite advantageous. Multi-symbol locks are designed for this purpose.
Fig. 2 shows a full range lock configuration 200. Such a full range lock structure 200 has n bits, where n is the number of symbols in a given alphabet. For example, for a C-style byte string, four 64-bit unsigned integers are used. The full range lock structure 200 may have the form of an array of bits 0 to n-1, where n is the branching factor of the prefix tree.
Fig. 3 shows an example of a 256-bit full range lock structure 200 for C strings. For better explanation, an ASCII table is provided on the right side of FIG. 3. In a first example, the full range lock structure 200a is not locked; thus, all bits from 0 to 255 are set to 0. In a second example, the full range lock structure 200b is exclusively locked; thus, all bits from 0 to 255 are set to 1. In a third example, in the full range lock structure 200c, the symbol d is locked; thus, all bits from 0 to 255 are set to 0 except the 100th bit (representing symbol d, see ASCII table), which is set to 1. In a fourth example, in the full range lock structure 200d, the symbols d and K are locked; thus, all bits from 0 to 255 are set to 0 except the 100th bit (representing symbol d) and the 75th bit (representing symbol K), which are set to 1.
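The full range lock layout of Fig. 2 and Fig. 3 can be illustrated with a short C sketch. The struct and helper names below are assumptions for illustration; only the bit layout (one bit per symbol, four 64-bit words for a 256-symbol alphabet) follows the description above, and the atomic acquire/release protocol discussed later in connection with Fig. 9 is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Full range lock layout: one bit per symbol of a 256-symbol alphabet,
 * stored in four 64-bit words as in Fig. 2 and Fig. 3. */
typedef struct {
    uint64_t bits[4];     /* bit i == 1 means the symbol with code i is locked */
} full_range_lock;

static void frl_init(full_range_lock *l)             { memset(l->bits, 0, sizeof l->bits); }
static void frl_set(full_range_lock *l, uint8_t s)   { l->bits[s >> 6] |=  UINT64_C(1) << (s & 63); }
static void frl_clear(full_range_lock *l, uint8_t s) { l->bits[s >> 6] &= ~(UINT64_C(1) << (s & 63)); }
static bool frl_test(const full_range_lock *l, uint8_t s)
{
    return (l->bits[s >> 6] >> (s & 63)) & 1;
}

/* Example mirroring Fig. 3: lock the symbols 'd' (code 100) and 'K' (code 75). */
static void frl_example(void)
{
    full_range_lock l;
    frl_init(&l);
    frl_set(&l, 'd');     /* bit 100 set */
    frl_set(&l, 'K');     /* bit 75 set  */
    (void)frl_test(&l, 'd');
    frl_clear(&l, 'K');
}
```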
Fig. 4 shows a compact lock structure 400. Such a compact lock structure 400 includes a symbol interval section or segment 401 of the node and a lock mask 402. The symbol section 401 may include an array of intervals 0 to k-1 for storing symbols, and the lock mask segment 402 has a bit length of 0 to k-1, where k is a number less than n and n is the number of symbols in the given alphabet, i.e., the branching factor of the prefix tree. The value k can be smaller than n because it is unlikely that all symbols need to be locked at once. When k is equal to n, the compact lock becomes the full range lock, because for each symbol there is a corresponding bit in the lock mask and no symbol interval is required. In the special case where the lock mask 402 is used up, a further write operation must wait. The compact lock structure 400 requires a k-bit mask and k value intervals, each capable of representing a symbol; it is a shortened version of the full range lock. For example, for a C-style byte string, the most reasonable values of k are 7 and 14. For k equal to 7, a 64-bit unsigned integer may be used, where the first byte serves as the bit mask and the other bytes serve as symbol values. Similarly, for k equal to 14, the compact lock may be represented using a 128-bit unsigned integer, where the first two bytes serve as the bit mask and the other bytes as symbol values.
Data controller 101 is configured to provide a lock for at least one bit of lock mask segment 402.
Fig. 5 shows an example of a 64-bit compact lock structure 400 for C strings. In a first example, the compact lock structure 400a is not locked; thus, all bits from 0 to 6 are set to 0. In a second example, the compact lock structure 400b is exclusively locked; thus, all bits from 0 to 6 are set to 1. In a third example, in the compact lock structure 400c, the symbol d is locked; therefore, all bits from 0 to 6 are set to 0 except the 1st bit (representing symbol d), which is set to 1. Bit 1 of the lock mask 402 corresponds to the 1st interval of the symbol segment 401. In a fourth example, in the compact lock structure 400d, the symbols d and K are locked; therefore, all bits from 0 to 6 are set to 0 except the 1st bit (representing symbol d) and the 4th bit (representing symbol K), which are set to 1.
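The 64-bit compact lock layout of Fig. 4 and Fig. 5 (k = 7) can likewise be sketched in C. The helper names below are assumptions for illustration, and the sketch simply pairs mask bit i with symbol interval i; the atomic acquire/release protocol of Fig. 10 is shown later.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compact lock layout for k = 7, packed into one 64-bit word as described
 * above: byte 0 holds the 7-bit lock mask, bytes 1..7 hold the locked symbols. */
typedef uint64_t compact_lock;   /* an all-zero value means "not locked" */

static uint8_t cl_mask(compact_lock l)             { return (uint8_t)(l & 0x7F); }
static uint8_t cl_slot(compact_lock l, unsigned i) { return (uint8_t)(l >> (8u * (i + 1))); }

/* True if 'symbol' is currently stored in a slot whose mask bit is set. */
static bool cl_symbol_locked(compact_lock l, uint8_t symbol)
{
    for (unsigned i = 0; i < 7; i++)
        if (((cl_mask(l) >> i) & 1) && cl_slot(l, i) == symbol)
            return true;
    return false;
}

/* Returns the word that additionally locks 'symbol' in slot i:
 * mask bit i is set and the symbol value is stored in interval i. */
static compact_lock cl_with_symbol(compact_lock l, unsigned i, uint8_t symbol)
{
    l |= (compact_lock)1 << i;                          /* set mask bit i         */
    l &= ~((compact_lock)0xFF << (8u * (i + 1)));       /* clear symbol slot i    */
    l |= (compact_lock)symbol << (8u * (i + 1));        /* store symbol in slot i */
    return l;
}
```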
For the detailed example of a multi-symbol lock implementation given above, the alphabet used is a single-byte character set: each symbol is represented within 8 bits, giving an overall alphabet cardinality of 256 symbols. All symbols are used to form an input key string; the 255 non-zero bytes are the symbols used to encode information, and a zero byte indicates the end of the string. This convention is called a C-style string or a null-terminated string.
The prefix tree may apply key sequence skipping and vertical compression techniques. In this case, the internal nodes are extended with a common prefix field, while the leaf nodes store the full keys. During tree traversal, if certain symbols are skipped, an additional comparison is needed to ensure that a particular leaf node corresponds to the intended key. In addition, internal nodes may be split or merged when the common prefix changes after an insertion or deletion. Adaptive branching factors are typically applied to significantly reduce memory consumption. The internal nodes are provided with variable-size arrays of child nodes; for example, they may be expanded or contracted to sizes 4, 16, 48 and 256 as needed. All of these memory optimization techniques complicate the internal node, and it is not possible to atomically update all fields of the internal node on general-purpose hardware. Thus, an advanced synchronization mechanism is needed to build a multi-threaded prefix tree.
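For orientation, the following sketch shows adaptive node sizes in the spirit of the adaptive radix tree, with a common prefix field for vertical compression; all type and field names are assumptions for illustration, not definitions from the patent.

```c
#include <stdint.h>

/* Adaptive node sizes: an internal node grows from 4 to 16 to 48 to 256 child
 * slots as needed and keeps a short common prefix for vertical compression. */
enum node_kind { NODE4, NODE16, NODE48, NODE256 };

typedef struct inner_header {
    enum node_kind kind;
    uint8_t  prefix_len;       /* number of compressed symbols at this node  */
    uint8_t  prefix[8];        /* common prefix shared by all children       */
    uint16_t child_count;
} inner_header;

typedef struct node4 {         /* smallest variant: up to 4 children         */
    inner_header h;
    uint8_t  keys[4];          /* symbol leading to each child               */
    void    *children[4];
} node4;

typedef struct node256 {       /* largest variant: direct indexing by symbol */
    inner_header h;
    void    *children[256];
} node256;
```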
Fig. 6 shows a radix tree 600 with a root node 601, two internal nodes 602 and 603, and a leaf node 604. The advantages of the multi-symbol lock are explained in connection with the radix tree 600. Instead of acquiring a fully exclusive lock, a multi-symbol lock allows acquiring a relaxed lock covering only one symbol. For prefix trees, this means that even if several child nodes have the same parent node, they can be modified at the same time; in practice, only their prefixes need to be different. A compact lock variant of the multi-symbol lock is employed here.
It is assumed that the first writer 605 and the second writer 606 modify the radix tree 600 at the same time. The left side shows the operation with a multi-symbol lock. For comparison, the right side shows the behavior of a conventional fully exclusive lock.
The first writer 605 modifies the internal node 602. In accordance with the present invention, the first writer 605 locks the bit 607 at the root node 601, i.e. sets its value to 1. The bit 607 corresponds to the symbol c (interval 608), i.e., the symbol representing the node 602 in a write operation. Further, the first writer 605 locks all symbols of the node 602 under write operation exclusively as indicated by all bits set to 1 in the lock mask segment 402.
At parent node 601, only one bit 607 is set. Thus, the second writer 606 is able to perform a concurrent second write operation to another internal node 603. Thus, the second writer 606 locks 609 the bit at the root node 601, i.e. sets its value to 1. The bit 609 corresponds to the symbol r (interval 610), i.e. the symbol representing the node 603 in the second write operation. Further, the second writer 606 locks all symbols of the node 603 under write operation exclusively as indicated by all bits set to 1 in the lock mask segment 402.
The situation with a fully exclusive lock, shown on the right, is quite different. Here the second writer 606, which starts writing first, fully locks the root node 601 and the internal node 603. Thus, the first writer 605 needs to wait for the second writer 606 to complete its operation. In contrast, the present invention allows parallel write operations of the writers 605 and 606.
FIG. 7 depicts a flowchart of the optimistic relaxed lock coupling concurrency control for a prefix tree insertion operation. To perform a modification, the writer initially follows the synchronization scheme selected for readers to traverse the tree and find the required key or the location at which to insert the new node. To perform an insertion or deletion, the writer requires at most two locks: one for the node at which the traversal has stopped and one for the parent node of that node. On the parent node, the writer always acquires a single-symbol lock for the corresponding symbol. Thus, other writers may modify different child nodes of the same parent node. A single-symbol lock should also be used for the affected node when a child node can be inserted or deleted without otherwise affecting the node, e.g., when adding a new child node.
To apply the optimistic relaxed lock coupling method to prefix trees, all internal nodes are extended with multi-symbol locks, using either of the proposed implementations: compact locks or full range locks. Generally, the idea is to enhance the optimistic lock coupling algorithm with relaxed locks. That is, wherever possible only a single symbol is locked rather than taking an exclusive lock. The algorithm requires synchronization between concurrent writers to maintain prefix tree consistency. For synchronizing concurrent writers and readers, any known method may be used, such as node versioning or lock-free methods based on atomic operations. Alternatively, readers may apply the classical lock coupling algorithm, using single-symbol locks rather than exclusive locks.
In step 700, the process begins and proceeds to step 702, where the writer looks up the key, acting as a reader. In step 703, it is determined whether the key is found. If found, the corresponding value is updated in step 704. Subsequently, the routine is terminated in an exit step 705.
If the key is not found in step 703, the active node for the insertion is selected in step 706. Then, in step 707, the corresponding single symbol in the parent node is locked. In step 708, it is determined whether the current node at the locked symbol is still a direct child of the parent node. If not, a new node has been inserted concurrently; in step 709, the symbol in the parent node is unlocked and the procedure restarts at step 702.
The routine transitions from step 708 to step 710 when the current node is still the direct child of the parent node at the locked symbol. In step 710, it is determined whether a node conversion is required. If so, the complete active node is locked in step 711, since the entire node is affected by the modification. If not, only the symbol in the active node is locked in step 712, since only that symbol is affected by the modification. In both cases, the corresponding insertion is performed in step 713. Before exiting the procedure in step 705, all acquired locks are released in step 714.
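A condensed C sketch of this insertion path (steps 702 to 714) is given below. The node type, the tree helper functions, and the msl_* multi-symbol lock API are declared but not defined here; all of these names are assumptions made for illustration, and only the locking order follows the flowchart of Fig. 7.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type, tree helpers, and multi-symbol lock API. */
typedef struct art_node art_node;

extern bool      tree_lookup_and_update(const uint8_t *key, size_t len, void *value);
extern art_node *tree_find_active_node(const uint8_t *key, size_t len, art_node **parent,
                                       uint8_t *parent_symbol, uint8_t *active_symbol);
extern bool      is_child_at_symbol(const art_node *node, const art_node *parent, uint8_t symbol);
extern bool      needs_node_conversion(const art_node *node);  /* resize, prefix change, ... */
extern void      insert_into_node(art_node *node, const uint8_t *key, size_t len, void *value);

extern void msl_lock_symbol(art_node *n, uint8_t s);
extern void msl_unlock_symbol(art_node *n, uint8_t s);
extern void msl_lock_exclusive(art_node *n);
extern void msl_unlock_exclusive(art_node *n);

void orlc_insert(const uint8_t *key, size_t len, void *value)
{
    for (;;) {
        if (tree_lookup_and_update(key, len, value))      /* steps 702-704: key exists */
            return;

        art_node *parent;
        uint8_t psym, asym;
        art_node *active = tree_find_active_node(key, len, &parent, &psym, &asym); /* 706 */

        msl_lock_symbol(parent, psym);                    /* step 707 */
        if (!is_child_at_symbol(active, parent, psym)) {  /* step 708: tree changed meanwhile */
            msl_unlock_symbol(parent, psym);              /* step 709: unlock and restart     */
            continue;
        }

        bool exclusive = needs_node_conversion(active);   /* step 710 */
        if (exclusive)
            msl_lock_exclusive(active);                   /* step 711: whole node affected    */
        else
            msl_lock_symbol(active, asym);                /* step 712: only one symbol        */

        insert_into_node(active, key, len, value);        /* step 713 */

        if (exclusive)                                    /* step 714: release all locks      */
            msl_unlock_exclusive(active);
        else
            msl_unlock_symbol(active, asym);
        msl_unlock_symbol(parent, psym);
        return;                                           /* step 705 */
    }
}
```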
Fig. 8 depicts four flowcharts of a multi-symbol lock algorithm supporting four primary operations. The multi-symbol lock algorithm may be implemented by a full range lock algorithm (explained later in connection with fig. 9) or a compact lock algorithm (explained later in connection with fig. 10).
According to a first algorithm, beginning at step 800, an exclusive lock is acquired. In step 801, a lock state is acquired or observed. In step 802, it is determined whether any locks have been held, i.e., whether some kind of lock is set. If so, operation proceeds back to step 801, thereby forming a loop that observes the lock state. If not, then an exclusive lock is set in step 803, as described above. Finally, in step 804, the routine is exited with an exclusive lock set at the corresponding node.
According to a second algorithm, starting from step 810, the exclusive lock is released. In step 810, it is described how to release the set exclusive lock. In step 811, the exclusive lock is released. How this operation is actually performed depends on the algorithm used. Either a full range lock algorithm or a compact lock algorithm may be employed. Finally, in step 812, the routine is exited, with a release lock at the corresponding node.
According to a third algorithm, starting at step 820, a single symbol lock is acquired. In step 821, the lock state is acquired or observed. In step 822, a determination is made as to whether an exclusive lock has been maintained. If so, operation passes back to step 821, forming a loop that observes the lock state of the node. If not, go to step 823. In step 823, it is determined whether the current symbol has been locked. If so, operation proceeds back to step 821, forming a loop that observes the lock state of the symbol. If not, a single symbol lock is set in step 824, as described above. Finally, in step 825, the routine is exited with a single sign lock set at the corresponding node.
According to a fourth algorithm, starting from step 830, the single symbol lock is released. In step 831, the single symbol lock is released. How this operation is actually performed depends on the algorithm used. Either a full range lock algorithm or a compact lock algorithm may be employed. Finally, in step 832, the routine is exited, with a release lock at the corresponding node.
Fig. 9 shows five full range lock algorithms. The full range lock is physically represented as a block of memory of length n bits, as shown in fig. 2. Each bit i (i is 0 to n-1) corresponds to a symbol of the alphabet having the code i. The full range lock is an accurate solution, providing maximum scalability, but may consume more memory than conventional locks.
According to a first algorithm, beginning at step 900, an exclusive lock is acquired. In step 901, the lock value is read and in step 902, it is determined whether the lock value is equal to 0. If not, operation proceeds back to step 901, forming a loop to read the lock value. If so, then in step 903, all n bits are set to a lock value of 1, thereby setting or acquiring an exclusive lock. Finally, in step 904, the routine is exited with an exclusive lock set at the corresponding node.
According to a second algorithm, beginning at step 910, an exclusive lock is acquired actively. The algorithm does not let other participants acquire single-symbol locks while a participant attempts to acquire an exclusive lock, i.e., an exclusive lock has a higher priority than a single-symbol lock. In step 911, an n-bit local variable, the remaining mask, is initialized by setting all n bits to 1. The remaining mask bits represent the bits of the lock that still need to be acquired. In step 912, the lock value is read, and in step 913, the remaining mask is updated by performing an AND operation with the lock value. In step 914, all bits of the lock are set to 1. In step 915, it is determined whether the remaining mask value is equal to 0, indicating that all bits of the lock are held. If not, the operation transitions back to step 912 to continue updating the remaining mask. If so, in step 916, the routine is exited with an exclusive lock set at the corresponding node.
According to a third algorithm, starting at step 920, the exclusive lock is released. In step 921, the lock is released. To release the exclusive lock, n bits are all set to 0. Finally, in step 922, the routine is exited, with a release lock at the corresponding node.
According to a fourth algorithm, starting from step 930, a single-symbol lock is acquired. In step 931, the local variable i is set to the symbol number, thereby selecting the corresponding symbol. In step 932, the lock value is read. In step 933, it is determined whether bit i is set. If so, operation proceeds back to step 932, thereby forming a loop to read the lock value. If not, then in step 934, bit i is set to 1, thereby setting or acquiring the single-symbol lock. Finally, in step 935, the routine is exited, with a single-symbol lock set at the corresponding node.
According to a fifth algorithm, starting from step 940, the single-symbol lock is released. In step 941, the local variable i is set to the symbol number, thereby selecting the corresponding symbol. In step 942, bit i is set to 0, thereby releasing the single-symbol lock. Finally, in step 943, the routine is exited with a released single-symbol lock at the corresponding node.
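A minimal sketch of these full range lock operations using C11 atomics is given below, for an alphabet of at most 64 symbols so that the whole bitmap fits one atomic word; a 256-symbol lock as in Fig. 2 would carry four such words and acquire them with the remaining-mask variant of the second algorithm. All names are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Full range lock sketch: the whole bitmap fits one atomic 64-bit word. */
typedef struct { _Atomic uint64_t bits; } frl64;

/* Fig. 9, first algorithm: wait until no bit is held, then take all bits. */
static void frl64_lock_exclusive(frl64 *l)
{
    for (;;) {
        uint64_t cur = atomic_load_explicit(&l->bits, memory_order_acquire);
        if (cur != 0)
            continue;                                    /* some lock is held: re-read */
        if (atomic_compare_exchange_weak_explicit(&l->bits, &cur, UINT64_MAX,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;                                      /* all bits now set           */
    }
}

/* Fig. 9, third algorithm: clear all bits at once. */
static void frl64_unlock_exclusive(frl64 *l)
{
    atomic_store_explicit(&l->bits, 0, memory_order_release);
}

/* Fig. 9, fourth algorithm: re-read while bit i is held, then set only bit i. */
static void frl64_lock_symbol(frl64 *l, unsigned i)
{
    const uint64_t bit = UINT64_C(1) << i;
    for (;;) {
        uint64_t cur = atomic_load_explicit(&l->bits, memory_order_acquire);
        if (cur & bit)
            continue;                                    /* symbol already locked      */
        if (atomic_compare_exchange_weak_explicit(&l->bits, &cur, cur | bit,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;                                      /* only bit i was set by us   */
    }
}

/* Fig. 9, fifth algorithm: clear bit i. */
static void frl64_unlock_symbol(frl64 *l, unsigned i)
{
    atomic_fetch_and_explicit(&l->bits, ~(UINT64_C(1) << i), memory_order_release);
}
```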
Fig. 10 shows five compact lock algorithms. Compact locks are optimized for memory consumption, but may result in a slight decrease in scalability. Compact locks may use the same amount of memory as conventional locks and thus serve as a direct replacement for them. Fig. 4 shows the representation of a compact lock in memory as a number of symbol intervals and a bit mask. The number of intervals k is also the number of bits in the mask, where k < n. Each interval stores a single locked symbol and must be of sufficient size to represent a symbol of the alphabet. Each bit in the mask indicates whether the symbol stored in the corresponding interval is locked.
According to a first algorithm, beginning at step 1000, an exclusive lock is acquired. In step 1001, the lock mask is read, and in step 1002, it is determined whether the lock mask is equal to 0. If not, operation proceeds back to step 1001, forming a loop to read the lock mask. If so, then in step 1003, all k bits of the lock mask are set to 1, thereby setting or acquiring an exclusive lock. While setting all k bits in the mask, the interval part holding the symbols is left untouched. Finally, in step 1004, the routine is exited with an exclusive lock set at the corresponding node.
According to a second algorithm, beginning at step 1010, an exclusive lock is acquired actively. The algorithm does not let other participants acquire single-symbol locks while a participant attempts to acquire an exclusive lock, i.e., an exclusive lock has a higher priority than a single-symbol lock. In step 1011, a k-bit local variable, the remaining mask, is initialized by setting all k bits to 1. The remaining mask bits represent the bits of the lock mask that still need to be acquired. In step 1012, the lock mask is read, and in step 1013, the remaining mask is updated by performing an AND operation with the lock mask. In step 1014, all bits of the lock mask are set to 1. While setting all k bits in the mask, the interval part holding the symbols is left untouched. In step 1015, it is determined whether the remaining mask value is equal to 0, indicating that all bits of the lock mask are held. If not, the operation transitions back to step 1012 to continue updating the remaining mask. If so, in step 1016, the routine is exited with an exclusive lock set at the corresponding node.
According to a third algorithm, starting at step 1020, the exclusive lock is released. In step 1021, the lock mask is set to 0 by zeroing all k bits in the mask, leaving the interval portion that holds the symbols untouched. Finally, in step 1022, the routine is exited, with the exclusive lock released at the corresponding node.
According to a fourth algorithm, starting at step 1030, a single symbol lock is acquired. In step 1031, the local variable i is set to the symbol number, thereby selecting the corresponding symbol.
In step 1032, the lock mask is read, and in step 1033, the bits of the lock mask are iterated over while there are more bits. In steps 1034 and 1035, if the current bit in the lock mask is set (step 1034) and the corresponding symbol interval value is equal to i (step 1035), operation transitions back to step 1032, forming a loop to read the lock mask; otherwise, operation transitions back to step 1033 and continues with the next bit in the lock mask. If there are no more bits (step 1033), operation proceeds to step 1036. In step 1036, it is determined whether there is any unset bit j in the lock mask. If not, the compact lock is full and operation transitions back to step 1032 to retry the locking. Optionally, the operation in step 1036 may search for unset bits in the lock mask starting from a random index or from the symbol number modulo k, to reduce the number of collisions that may occur when several writers attempt to occupy the first available zero bit.
Thus, all intervals whose corresponding bits are set in the lock mask are scanned to ensure that the symbol has not already been locked. Then, any bit j not set in the lock mask is searched for, j being 0 to k-1.
If it is determined in step 1036 that there is an unset bit j, operation proceeds to step 1037, where bit j in the lock mask is set and symbol interval j is set to i. In other words, bit j in the lock mask is set and the symbol is stored in interval j. In optional step 1038, the value of j is returned, which can be used to release the single-symbol lock at a later time. Finally, in step 1039, the routine is exited with a single-symbol lock set at the corresponding node.
According to a fifth algorithm, starting from step 1040, the single-symbol lock is released. In step 1041, it is determined whether the optional parameter index j is provided. If not, then in step 1042, the local variable i is set to the symbol number, thereby selecting the corresponding symbol. In step 1043, all intervals whose corresponding bits are set in the mask are scanned to find the bit j whose symbol interval j holds the symbol i, i being a symbol number from 0 to n-1. In step 1044, the found bit j of the lock mask is cleared, while the symbol interval is left untouched. If it is determined in step 1041 that the optional parameter index j is provided, operation proceeds directly to step 1044. Finally, in step 1045, the routine is exited with a released single-symbol lock at the corresponding node.
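The single-symbol acquire and release of Fig. 10 can be sketched as follows on the 64-bit compact lock layout shown earlier (byte 0 holds the 7-bit mask, bytes 1 to 7 hold the symbol slots). The names and the compare-and-swap retry loop are assumptions made for illustration; only the scan-then-claim order follows the flowchart.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Compact lock word as sketched for Fig. 4 and Fig. 5. */
typedef struct { _Atomic uint64_t word; } cl_atomic;

/* Fig. 10, fourth algorithm: acquire a single-symbol lock.
 * Returns the slot index j that now holds 'symbol' (used later to release it). */
static int cl_lock_symbol(cl_atomic *l, uint8_t symbol)
{
    for (;;) {
        uint64_t cur  = atomic_load_explicit(&l->word, memory_order_acquire);
        uint8_t  mask = (uint8_t)(cur & 0x7F);

        int free_slot = -1, symbol_held = 0;
        for (int j = 0; j < 7; j++) {
            if ((mask >> j) & 1) {                            /* steps 1033-1035       */
                if ((uint8_t)(cur >> (8 * (j + 1))) == symbol)
                    symbol_held = 1;                          /* symbol already locked */
            } else if (free_slot < 0) {
                free_slot = j;                                /* step 1036: unset bit  */
            }
        }
        if (symbol_held || free_slot < 0)                     /* locked, or lock full  */
            continue;                                         /* re-read and retry     */

        uint64_t next = cur;
        next |= UINT64_C(1) << free_slot;                         /* set mask bit j    */
        next &= ~(UINT64_C(0xFF) << (8 * (free_slot + 1)));       /* clear slot j      */
        next |= (uint64_t)symbol << (8 * (free_slot + 1));        /* store the symbol  */
        if (atomic_compare_exchange_weak_explicit(&l->word, &cur, next,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return free_slot;                                 /* steps 1037-1039 */
    }
}

/* Fig. 10, fifth algorithm with the optional index j: clear mask bit j;
 * the stale slot contents are simply ignored, as in step 1044. */
static void cl_unlock_symbol(cl_atomic *l, int j)
{
    atomic_fetch_and_explicit(&l->word, ~(UINT64_C(1) << j), memory_order_release);
}
```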
Fig. 11 illustrates a method for providing a data storage system 100 according to claim 14.
According to a first step 1100, a data storage system 100 is provided for implementing a prefix tree having a plurality of nodes. In step 1101, a write operation is initiated in the prefix tree. In step 1102, a lock is provided for a single symbol of a plurality of symbols at a node under a write operation.
The invention has been described in connection with different embodiments and implementations as examples. However, other variations can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the invention, and the independent claims. In the claims and in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (13)

1. A data storage system (100) having a data storage (104) and a data controller (101) for implementing a prefix tree (600) having a plurality of nodes, wherein
-the data controller (101) is configured to initiate a write operation in the prefix tree (600);
the data controller (101) is configured to provide a lock for a single symbol of a plurality of symbols at a node (602, 603) under a write operation;
the data controller (101) is specifically configured to: providing a lock structure with bit length 0 to n-1 for a node, wherein n is a branching factor of the prefix tree (600), the lock structure comprising: a symbol section (401) comprising an array of lengths 0 to k-1 for storing symbols; a lock mask segment (402) having a bit length of 0 to k-1, wherein k < n, said data controller (101) is configured to provide a lock for at least one bit of said lock mask segment (402).
2. The data storage system (100) of claim 1, wherein
The data controller (101) is used for initiating insertion or deletion of a specific node;
-the data controller (101) is adapted to provide a lock for the specific node; a lock is provided for a single symbol of the plurality of symbols at the parent node of the particular node representing the particular node.
3. The data storage system (100) of claim 1 or 2, wherein
The data controller (101) is configured to initiate a change to a single symbol of a plurality of symbols of a particular node; a lock is provided for the single symbol.
4. A data storage system (100) according to any of claims 1 to 3, characterized in that
The data controller (101) is configured to: initiating a first write operation to a first node in the prefix tree (600); providing a lock for the first node; providing a lock for a single symbol representing the first node at a parent node of the first node; wherein the method comprises the steps of
The data controller (101) is configured to initiate a second write operation to a second node in the prefix tree (600) having the same parent node as the first node.
5. The data storage system (100) of claim 4, wherein
-the data controller (101) is configured to provide a lock for the second node; a lock is provided for a single symbol at the parent node representing the second node.
6. The data storage system (100) of any of claims 1 to 5, wherein
-the data controller (101) is configured to initiate a first write operation on a first symbol of a node in the prefix tree (600), wherein the first symbol represents a first child node of the node; providing a lock for the first symbol; wherein the method comprises the steps of
The data controller (101) is configured to initiate a second write operation to a second symbol in the node, wherein the second symbol represents a second child node of the node.
7. The data storage system (100) of claim 6, wherein
The data controller (101) is configured to provide a lock for the second symbol.
8. The data storage system (100) of any of claims 1 to 7, wherein
The data controller (101) is configured to: providing a lock for child nodes under write operations; locks are provided at all parent nodes of a child node under the write operation for a single symbol representing the corresponding child node, such that only a single path to the child node under the write operation is locked.
9. The data storage system (100) of any of claims 1 to 8, wherein
The data controller (101) is configured to release the lock of the symbol and/or node after completion of the write operation.
10. The data storage system (100) according to any one of claims 1 to 9, wherein
The data controller (101) is configured to provide an exclusive lock for a particular node by locking a plurality of symbols at the particular node.
11. The data storage system (100) of any of claims 1 to 10, wherein
The prefix tree (600) is a radix tree or an adaptive radix tree.
12. A method for providing a data storage system (100), the data storage system (100) for implementing a prefix tree (600) having a plurality of nodes, the method comprising:
initiating a write operation in the prefix tree (600);
providing a lock for a single symbol of a plurality of symbols at a node (602, 603) under a write operation;
the providing a lock for a single symbol of a plurality of symbols at a node (602, 603) under a write operation, comprising: providing a lock structure with bit length 0 to n-1 for a node, wherein n is a branching factor of the prefix tree (600), the lock structure comprising: a symbol section (401) comprising an array of lengths 0 to k-1 for storing symbols; a lock mask segment (402) having a bit length of 0 to k-1, wherein k < n, the data controller (101) is configured to provide a lock for at least one bit of the lock mask segment (402).
13. A computer-readable storage medium, characterized in that a computer program is stored for performing the method of claim 12 when run on a computer or a data storage system (100) of any one of claims 1 to 11.
CN201780097046.1A 2017-11-20 2017-11-20 Data storage system and method for providing a data storage system Active CN111373389B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2017/000856 WO2019098870A1 (en) 2017-11-20 2017-11-20 Data storage system and method of providing a data storage system

Publications (2)

Publication Number Publication Date
CN111373389A CN111373389A (en) 2020-07-03
CN111373389B true CN111373389B (en) 2023-11-17

Family

ID=60766119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780097046.1A Active CN111373389B (en) 2017-11-20 2017-11-20 Data storage system and method for providing a data storage system

Country Status (2)

Country Link
CN (1) CN111373389B (en)
WO (1) WO2019098870A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784117B (en) * 2021-01-06 2023-06-02 北京信息科技大学 Advanced radix tree construction method and construction system for mass data
CN113674821B (en) * 2021-10-21 2022-03-22 浙江太美医疗科技股份有限公司 Network interaction method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0438958A2 (en) * 1990-01-22 1991-07-31 International Business Machines Corporation Byte stream file management using shared and exclusive locks
CN103020060A (en) * 2011-09-20 2013-04-03 佳都新太科技股份有限公司 Number segment matching algorithm based on tree structure and realization method of number segment matching algorithm
CN105117417A (en) * 2015-07-30 2015-12-02 西安交通大学 Read-optimized memory database Trie tree index method
AU2016230539A1 (en) * 2015-03-11 2017-10-05 Ntt Communications Corporation Retrieval device, retrieval method, program, and recording medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7899067B2 (en) * 2002-05-31 2011-03-01 Cisco Technology, Inc. Method and apparatus for generating and using enhanced tree bitmap data structures in determining a longest prefix match
US8868531B2 (en) * 2012-09-10 2014-10-21 Apple Inc. Concurrent access methods for tree data structures
US9208258B2 (en) * 2013-04-11 2015-12-08 Apple Inc. Locking and traversal methods for ordered tree data structures
US9817919B2 (en) * 2013-06-10 2017-11-14 Nvidia Corporation Agglomerative treelet restructuring for bounding volume hierarchies
US10496283B2 (en) * 2016-01-22 2019-12-03 Suraj Prabhakar WAGHULDE Adaptive prefix tree based order partitioned data storage system


Also Published As

Publication number Publication date
WO2019098870A1 (en) 2019-05-23
CN111373389A (en) 2020-07-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant