WO2019098871A1 - Data storage system and method of providing a data storage system - Google Patents

Info

Publication number
WO2019098871A1
WO2019098871A1 (PCT/RU2017/000857)
Authority
WO
WIPO (PCT)
Prior art keywords
node
prefix
common
tree
write operation
Prior art date
Application number
PCT/RU2017/000857
Other languages
French (fr)
Inventor
Aleksandr Aleksandrovich SIMAK
Sergei Romanovich BASHIROV
Xuecang ZHANG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN201780096673.3A priority Critical patent/CN111316255B/en
Priority to PCT/RU2017/000857 priority patent/WO2019098871A1/en
Publication of WO2019098871A1 publication Critical patent/WO2019098871A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — Information retrieval of structured data, e.g. relational data
    • G06F16/22 — Indexing; Data structures therefor; Storage structures
    • G06F16/2228 — Indexing structures
    • G06F16/2246 — Trees, e.g. B+trees

Definitions

  • the present invention relates to a data storage system, a method of providing a data storage system and a computer program with a program code.
  • the present invention relates to data structures used for data lookups, and more particularly, to a prefix tree data structure for locating data stored in a database with a novel method of synchronization between writers and readers providing linear scalability on read.
  • payload data is kept within the tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup. A lookup traverses the tree according to a key or search key: the search follows matches between the attribute values stored in nodes and the key. The search starts at the root node, i.e. a parent node, and branches through child nodes, each of which depends on exactly one parent node.
  • a radix tree is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node this information is preserved within node interconnections. Thus, during radix tree traversing there is no need to look through child nodes and compare searching attribute values, but just to choose the child that corresponds to the attribute value index, if it exists.
  • radix trees were not widely used for general-purpose database or storage system index structures due to prohibitively high memory overheads. Indeed, each radix tree node is supposed to keep the whole range of possible attribute values even when very few of such child nodes actually exist, which implies an exponential growth in memory consumption. Recently, radix trees have been adapted for general-purpose use. The most important modification is the provision of variable-size nodes, meaning internal nodes differ in capacity. There are several pre-set capacities, and depending on demand the appropriate one is chosen during tree modification.
  • Such approach is usually called horizontal compression.
  • Another valuable improvement is to skip all internal nodes with a single child only. It is necessary then to keep common attributes inside the intermediate node and to store searching attributes within the corresponding leaf.
  • Such approach is usually called vertical compression or key sequence skip. Both those improvements immediately lead to more accurate, moderate memory consumption.
  • an internal node with a limited capacity of only 16 children has a common prefix eliminating at least three intermediate nodes with only one child each. Also, corresponding leaf nodes contain the rest of the search attributes.
  • Another known issue of radix trees is the method of concurrent access, or synchronization.
  • synchronization techniques utilize locks to keep the reader waiting until the modification is complete, or detect the presence of changes and then restart the reader.
  • Such solutions are simple, but hardly scalable.
  • More advanced techniques allow actors to access nodes while hiding pending modifications, and operate wait-free.
  • They require sophisticated data structures, generally consume more memory and are complex to implement.
  • An alternative option to consider is to use hardware transactional memory.
  • ART Adaptive Radix Tree
  • Inner nodes can have several symbols onboard called common node prefix instead of just one symbol, such approach is known as vertical compression and is used to decrease the number of tree levels.
  • A well-proven addition to such vertical compression is the so-called key sequence skip, which is applied to reduce the number of inner nodes when the rest of the key, stored in the leaf node, is unique.
  • Lock Coupling is a standard method for synchronizing B-trees and can easily be applied to radix trees as well. The idea is to hold at most two locks at any single moment during tree traversal. Starting from the root node, at every step one level down a lock is acquired on the corresponding child and the parent lock is released. Lock Coupling algorithms suffer from high contention on the first level of the tree: every actor starts from the root node by acquiring its lock. Thus, there is high contention at the root and a good probability of contention on the next level, too.
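The Lock Coupling descent described above can be sketched in a few lines of Python. This is an illustrative sketch only; the class and function names are ours, not the patent's, and a real implementation would use in-tree locks on the actual node layout.

```python
import threading

# Illustrative Lock Coupling descent: at most two locks are held at any
# moment while walking from the root toward a leaf.
class Node:
    def __init__(self, children=None):
        self.lock = threading.Lock()
        self.children = children or {}

def descend(root, path):
    node = root
    node.lock.acquire()
    visited = [node]
    for symbol in path:
        child = node.children[symbol]
        child.lock.acquire()   # couple: lock the child first...
        node.lock.release()    # ...then release the parent
        node = child
        visited.append(node)
    node.lock.release()
    return visited

leaf = Node()
mid = Node({"b": leaf})
root = Node({"a": mid})
order = descend(root, ["a", "b"])
```

Because every descent starts by locking the root, the contention described above concentrates on the first level of the tree.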
  • Optimistic Lock Coupling is the enhanced version of the previous method with better performance. Instead of acquiring locks at each step to prevent modifications of nodes, actors assume that no concurrent modifications happen during tree traversal. Modifications are detected afterwards using version counters and the operation is restarted, if needed. It shows higher performance because locks are acquired only on demand and overall contention in tree nodes is not too high. However, it mostly suffers from the coarse-grained nature of the locks: several tree nodes are blocked while their parent is locked even though no modification is meant for them. For instance, one actor acquires a tree node lock, which implies that other actors cannot lock that node’s children, due to lock coupling, before the lock is released.
  • the Read-Optimized Write Exclusion (ROWEX) approach shows even better performance. Inner nodes are extended with a level field which is set only once when the node is created and indicates the total length of the key sequence at the node level. The level field is never changed in later steps. Assuming modifications of the common prefix are conducted atomically, as well as updates of node references, readers can operate wait-free with no locks or retries. However, writers are supposed to perform extra actions leaving the radix tree in a complete and correct state for readers. For writers, ROWEX uses a similar locking approach to Optimistic Lock Coupling. Each writer keeps two exclusive locks: one on the node to be modified and one on the parent. The major difference lies in the extra actions performed by writers to let readers work without locks. These extra actions are mostly atomic updates of the tree node fields that readers need in order to read the tree correctly.
  • COW Copy on Write
  • Conventional lock, compare and swap operation and version counters can be used to implement COW.
  • COW approaches are beneficial for read oriented workloads, e.g. 95 percent of read operations and 5 percent of write operations in workload.
  • a writer detects changes during preparation of a tree node copy and restarts the operation redoing the copy again. Those extra memory copies and allocations also lead to poor write performance.
  • COW can be used with locks to avoid restarts. In this case, scalability of write operations is quite low.
  • extra copies would require more time.
  • Hardware Transactional Memory is a hardware mechanism to detect conflicts and undo any changes on shared data.
  • the goal of such hardware is to transparently support regions of code marked as transactions by enforcing atomicity, consistency and isolation.
  • all actors always observe consistent state of tree nodes since modifications are performed only within transactions, thus are visible for all actors only after successful commit.
  • If the commit operation fails, a restart of the modification is required because the tree node has been changed simultaneously.
  • HTM requires very specific hardware which is not yet well supported, so there are few opportunities to use it in production today.
  • STM Software Transactional Memory
  • DCAS double compare and swap instruction support
  • STM systems also suffer performance hit compared to fine-grained lock-based systems due primarily to the overheads associated with maintaining the log and the time spent committing transactions.
  • In order to traverse the tree correctly, synchronization between readers and writers of nodes and interconnections is essential. Typically, readers access nodes when no ongoing modifications are visible to them. Thus, nodes represent a complete and correct tree state.
  • the present invention aims to improve the performance of prefix trees.
  • the present invention has thereby the object to improve the synchronization between writers and readers in prefix trees.
  • the present invention relates to the way computers lookup data stored in information databases or storage systems and proposes a novel method of concurrent access synchronization to locate data faster.
  • Fast data lookup is essential for all computer programs, including information databases and storage systems. Programs need to locate data for consequent retrieval or computation. Storage systems and information databases usually have large volume storage media to store huge amount of data into it. Storage media is a slow memory device compared to the CPU main memory, therefore exhaustive search for looking up data is not feasible.
  • This is achieved with indexing data structures, preferably placed within the fast main memory, representing a set of associations between data search attributes and stored data locations on the storage media. Thus, to locate data promptly there is no need to traverse the whole stored data set on the storage media; instead, the index records corresponding to the request are found within the index data structure.
  • the invention directly relates to indexing data structures based on a prefix tree, for example a radix tree, and discloses a unique method of concurrency control for synchronization between writers and readers providing linear scalability on read operations.
  • the overall operation efficiency depends significantly on the method of concurrent synchronization.
  • readers access records of index structure when no ongoing modifications are visible to them.
  • the index structure represents the complete and correct state of the stored data set.
  • Known methods to achieve this often assume some sort of waiting or extra memory overheads. So, the synchronization has a significant impact on the system overall performance and is considered as a bottleneck for tree-based data structures.
  • actors can perform more or less independently with respect to the synchronization model.
  • the present invention is directed towards data storage systems and information databases with an index data structure based on a prefix tree like a radix tree and a method of synchronization between writers and readers providing linear scalability on read operations.
  • a first aspect of the present invention provides a data storage system with a data storage and a data controller configured to implement a prefix tree with a plurality of nodes, wherein the data controller is configured to provide a common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth; wherein the node depth is the absolute offset from the beginning of a key to the beginning of the common prefix.
  • a prefix tree is provided with a common node prefix at at least one inner node.
  • a node depth is also included.
  • the node depth is the absolute offset from the beginning of a key or search key to the beginning of the common prefix, or in other words the effective offset within the key. Now, the reader can rely on the node depth as specified at the node and does not have to count the depth during tree traversal.
  • the prefix length can be provided by a dedicated data field or it can be derived from the common prefix.
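The common node prefix described above can be illustrated with a minimal Python sketch. Field names here are illustrative assumptions, not the patent's implementation; it also shows the option, mentioned above, of deriving the prefix length from the common prefix instead of storing it in a dedicated field.

```python
from dataclasses import dataclass

# Hypothetical sketch of the common node prefix: the symbols shared by
# all children of an inner node, plus the node depth, i.e. the absolute
# offset from the beginning of the key to the beginning of the prefix.
@dataclass(frozen=True)
class CommonNodePrefix:
    common_prefix: bytes
    node_depth: int

    @property
    def prefix_length(self) -> int:
        # The prefix length need not be a dedicated field; it can be
        # derived from the common prefix itself.
        return len(self.common_prefix)

# Example: an inner node reached via symbol "c" whose children all share "omp".
cnp = CommonNodePrefix(common_prefix=b"omp", node_depth=1)
```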
  • the invention provides a memory efficient index data structure based on an index tree and an enhanced concurrency control mechanism between writers and readers supporting linear scalability on data index lookup operation by readers.
  • the readers can operate wait-free.
  • the invention leads to an index data structure and concurrency control that can provide several kinds of advantages to information databases, storage systems and other applications heavily relying on efficient data lookup and retrieval. These advantages include near to linear scalability of index data structure lookup, a high performance of lookup and data retrieval, reduced memory overhead with vertical compression applied, and the support of long node common prefixes.
  • the data controller is configured to initiate a write operation in the prefix tree thereby setting a node depth for an inner node under write operation, and is configured to initiate a concurrent read operation including the inner node under write operation thereby using the set node depth, the common prefix and the prefix length for tree traversal.
  • the use of the set node depth, the common prefix and the prefix length for tree traversal by the reader allows a wait-free read operation concurrent to the write operation.
  • the set node depth, which can be set directly or atomically, lets the reader traverse to the correct node even in case of a node split or a node merge.
  • the node depth leaves the reader independent from the ongoing count of nodes at tree traversal.
  • the data controller is configured to initiate a write operation for an inner node, to provide an auxiliary data structure for the inner node under write operation, and to provide a common node prefix for the auxiliary data structure, wherein the common node prefix reflects the changes of the write operation.
  • the auxiliary data structure contains at least the three fields of the common node prefix. It can include more fields if desired.
  • the auxiliary data structure enables easy implementation of the common node prefix into an existing prefix tree structure. The writer creates a new auxiliary data structure and sets the proper depth and common prefix fields, thereby reflecting the changes of the write operation. The reference or structure within the corresponding node under write operation can then be atomically updated.
  • Traversing readers obtain the correct information at all times. Before the write operation, a reader reads the unchanged node. During the write operation, a reader reads the auxiliary data structure, which already includes the correct information. After the write, the reader reads the updated node.
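The writer protocol above — fully populate an auxiliary structure, then publish it with a single reference update — can be sketched as follows. All names are illustrative; plain Python assignment stands in for the atomic pointer store a real implementation would use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommonNodePrefix:
    common_prefix: bytes
    node_depth: int

class InnerNode:
    def __init__(self, cnp: CommonNodePrefix):
        # Readers only ever dereference self.cnp; the structure itself is
        # immutable and replaced wholesale, never mutated in place.
        self.cnp = cnp

def writer_update(node: InnerNode, new_prefix: bytes, new_depth: int) -> None:
    # Build the auxiliary structure completely first...
    aux = CommonNodePrefix(common_prefix=new_prefix, node_depth=new_depth)
    # ...then publish it in one step. In C/C++ this would be an atomic
    # pointer store; a concurrent reader sees either the old or the new
    # structure, never a half-written one.
    node.cnp = aux

node = InnerNode(CommonNodePrefix(b"aron", 1))
writer_update(node, b"ron", 2)
```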
  • the data controller is configured to initiate a read operation concurrent to the write operation and including the inner node under write operation such that the read operation reads the inner node under write operation and the auxiliary data structure.
  • the auxiliary data structure provides a concurrent reader already with the correct information reflecting the changes of the write operation. By the use of the node depth the reader reads values reflecting the status after the write operation.
  • the data controller is configured to replace the auxiliary data structure wherein the common node prefix reflects the changes of the write operation of the inner node under write operation, after the write operation is completed.
  • the auxiliary data structure of the inner node is replaced with the new auxiliary data structure with the common node prefix updated according to the conducted changes of the inner node within the tree.
  • the data controller is configured to set the node depth and/or the prefix length of the new auxiliary data structure such that the sum of the node depth and the prefix length of the new auxiliary data structure equals the sum of the node depth and the prefix length of the existing auxiliary data structure of the inner node under write operation.
  • the common node prefix is provided within the auxiliary data structure within the inner node. This implementation keeps all data internal to the node.
  • the common node prefix is provided within the auxiliary data structure within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency. Moreover, such an indirect approach overcomes hardware limitations on atomic operations for longer common prefixes, which is beneficial for general-purpose indexing data structures while internal nodes remain quite compact.
  • the common node prefix is provided within a separate structure, wherein the auxiliary data structure is provided within the inner node and wherein at least one pointer is provided at the auxiliary data structure pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency.
  • references to children and/or a children node counter of an inner node are provided within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure.
  • the prefix tree is a radix tree or an adaptive radix tree.
  • the radix tree and the adaptive radix tree (ART) fit very well with the proposed field for node depth at inner nodes.
  • the prefix tree comprises a horizontal compression, a vertical compression and/or a key sequence skip. Such approaches reduce the number of nodes so that memory usage is decreased. Horizontal compression may introduce several inner nodes of different capacity and variable size thereby reducing memory cost accordingly to the real number of node children.
  • a second aspect of the present invention provides a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, comprising providing the common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth.
  • the method comprising initiating a write operation for an inner node; and providing an auxiliary data structure for the inner node under write operation comprising a common node prefix of the inner node; wherein the common node prefix reflects the changes of the write operation.
  • a third aspect of the present invention provides a computer program with a program code for performing the method as described above when the computer program runs on a computer or the data storage system as described above.
  • the same advantages and modifications as above apply.
  • Fig. 1 shows a diagram of the system architecture of a data storage system with an index structure.
  • Fig. 2 shows an example of a radix tree with vertical and horizontal compression.
  • Fig. 3 shows the structure of the common node prefix of inner nodes.
  • Fig. 4 shows a concurrently operating adaptive radix tree with wait-free readers.
  • Fig. 5 shows a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes for wait-free reader operation in a radix tree.
  • Fig. 6 shows a flowchart of wait-free reader operation in a radix tree.
  • Fig. 1 shows a data storage system 100 including a data controller 101.
  • the data controller 101 includes a concurrency control mechanism 102 for a prefix or radix tree with vertical compression and key sequence skip used as an index data structure 103.
  • the data storage system 100 further includes a data storage 104.
  • the data controller 101 may be implemented in a main memory using DRAM (Dynamic Random Access Memory) or the like, and the data storage 104 typically includes mass storage like hard disks, Solid-State Disks (SSD) or the like.
  • the data storage system 100 includes the data controller 101 and the data storage 104.
  • With the data controller 101, to locate data promptly there is no need to traverse the complete stored data set in the data storage 104. Instead, only the index records within the index data structure 103 corresponding to the request need to be searched.
  • Data users 111 can act as writers 112 adding and modifying data, or readers 113 looking up and retrieving data stored in the data storage 104. In the presence of concurrency, when several actors 111 perform data lookup and modification simultaneously, the overall operation efficiency depends significantly on the method of concurrent synchronization or control 102.
  • the data storage system 100 can be seen as a variant of advanced synchronization between writers 112 and readers 113 applicable as a concurrency control mechanism 102 for index or radix trees with vertical compression and key sequence skip used as an index data structure 103 in data storage systems 100 and information databases.
  • Such radix tree based index data structure 103 can also be equipped with horizontal compression and advanced synchronization between writers, too.
  • Fig. 2 shows an example of a radix tree 200.
  • the radix tree 200 includes a single root node 201, internal nodes 203 interconnected to form a tree, and last level nodes, so-called leafs 205.
  • payload data is kept within tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup.
  • the radix tree 200 is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node, this information is preserved within node interconnections. Thus, during radix tree traversal there is no need to look through child nodes and compare search attribute values; it suffices to choose the child that corresponds to the attribute value index, if any exists.
  • Internal index tree nodes 203 can be of variable size or adaptive capacity. There are four types of internal tree nodes differing in capacity only: 4, 16, 48 and 256 children, respectively.
  • the internal node 203 contains a children compartment according to its capacity of 16 children in this example.
  • the root node 201 has a capacity of 256 children.
  • the inner node 203 is a child node of the root node 201 and linked to the root node 201 by the symbol “c”.
  • the common prefix in the inner node 203 is “omp”. Common prefix means that all leaf nodes 205, which are children of the inner node 203, contain the prefix “omp”.
  • the tree 200 can be described in a bottom-up fashion starting with alphabet definition, key strings encoding and index structure tree nodes interconnections.
  • the alphabet for such a tree-based index data structure is a single-byte character set. This means each symbol is represented within 8 bits, and the whole alphabet cardinality, i.e. the number of elements, is 256 symbols. All symbols are used to form input strings: 255 non-zero byte symbols serve to encode information and a zero-byte symbol indicates the end of an input string. This encoding corresponds to C-style or null-terminated strings.
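The null-terminated key encoding described above can be sketched briefly. The function name is ours; the point is simply that every key ends with a reserved zero byte, as in C strings.

```python
def encode_key(s: str) -> bytes:
    # Keys use a single-byte alphabet: 255 non-zero symbols carry
    # information, and a zero byte terminates the string (C-style).
    data = s.encode("latin-1")
    assert b"\x00" not in data, "the zero byte is reserved as terminator"
    return data + b"\x00"

key = encode_key("comp")
```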
  • the interconnections between tree nodes of the index structure can be implemented with the so-called augmented pointer technique, as well as with ordinary node pointers.
  • a current pointer size is 8 bytes or 64 bits, and modern MMUs usually operate within a 48-bit address space, leaving the rest unused. Thus, it is possible to reuse up to 16 bits of ordinary 64-bit pointers to preserve important information there.
  • Augmented pointers, which are used as node interconnections, contain an augmented 1-byte alphabet symbol denoting the symbol to which the particular interconnection points.
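The augmented pointer idea — packing a 1-byte symbol into the unused upper bits of a 64-bit pointer — can be demonstrated with plain integer arithmetic. This is a sketch under the stated 48-bit address-space assumption; function names and the shift position are illustrative choices.

```python
ADDR_MASK = (1 << 48) - 1  # modern MMUs typically use 48-bit addresses

def augment(ptr: int, symbol: int) -> int:
    # Store a 1-byte alphabet symbol in otherwise unused upper bits
    # of a 64-bit pointer value.
    assert 0 <= symbol < 256 and ptr == (ptr & ADDR_MASK)
    return (symbol << 48) | ptr

def pointer_of(aug: int) -> int:
    # Recover the original 48-bit address.
    return aug & ADDR_MASK

def symbol_of(aug: int) -> int:
    # Recover the embedded alphabet symbol.
    return (aug >> 48) & 0xFF

aug = augment(0x7F00_DEAD_BEEF, ord("c"))
```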
  • Fig. 3 shows the common node prefix 300 in detail.
  • the common node prefix 300 is a data compartment within internal node 203.
  • the common node prefix 300 has a layout to keep a number of key symbols which are shared by all node children. Those symbols are called common prefix 301 and arranged in a byte array compartment for prefix symbols.
  • In the prefix length field 302, the length of the common prefix 301 is provided.
  • the invention extends the common node prefix 300 with an additional field node depth 303 to indicate the absolute offset from the beginning of the key.
  • Major internal node fields like the children compartment and the children counter also can be a separate memory layout with indirect access through a pointer from the internal node structure.
  • Such an approach implies additional memory management overheads, but decreases the number of parent node updates when child nodes expand and shrink, thus simplifying node capacity management.
  • Radix tree leafs 205, or terminal nodes, consist of the primary fields key and payload, as well as the optional fields lock and uplink. During operation, writers perform all updates to ordinary and augmented pointers atomically.
  • the common node prefix 300 is also supposed to be updated atomically. An indirect memory layout does not pose any difficulty, and pointer updates are handled by atomic operations of general commodity hardware. Internal nodes undergo expansion or shrinking when children counters indicate a high or low level of occupancy. Splitting or merging a node usually implies another node’s insertion or extraction and common node prefix changes.
  • the node depth 303 and the common prefix 301 are kept together in a single separate data structure. Thus, both of them are updated consistently by referencing that particular structure instance, i.e. the common node prefix 300. A reader then never needs to count its depth during tree traversal. Instead, the reader reads the node depth 303 from the field of the common node prefix 300. It immediately follows that a reader always compares symbols at correct positions within the search key. Further, no read locks are required in the presence of concurrent write operations.
  • the common node prefix 300 can be created by the following routine:
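The routine itself is not reproduced in this text; a plausible Python sketch, with purely illustrative names, might populate all three fields before the structure is published to readers:

```python
def make_common_node_prefix(key: bytes, node_depth: int, prefix_length: int) -> dict:
    # Cut the shared symbols out of a representative key, starting at the
    # node's absolute offset; the structure is fully populated here, before
    # any reader can observe it.
    common_prefix = key[node_depth:node_depth + prefix_length]
    return {
        "common_prefix": common_prefix,
        "prefix_length": prefix_length,
        "node_depth": node_depth,
    }

# Example from Fig. 4: a node at depth 1 sharing the four symbols "aron".
cnp = make_common_node_prefix(b"aaronitic", 1, 4)
```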
  • Such a common node prefix 300 may pose a 4-byte constant overhead per node, where the 4-byte overhead actually may depend on the maximum possible length of the key.
  • Fig. 4 shows an example of a radix tree 400 with a plurality of nodes. In each node the node depth and the common prefix are shown. The prefix length can be derived from the common prefix. These fields are stored at each inner node in the common node prefix. A first reader 401 parses through the tree 400.
  • the search key or key“Aaronitic” is looked up by the first reader 401.
  • the search starts at a root node 402 and commences to inner node 403 as this is the child of the root node 402 that includes the next symbols of the key.
  • the inner node 403 has the common prefix “aron” and a node depth of 1.
  • the first reader 401 moves further down the tree 400 to inner node 404, which comprises a node depth of 5 and a prefix length of 1.
  • the next step corresponds to the status shown in Figure 4.
  • the first reader 401 is at inner node 405, which has the common prefix “t” and a node depth of 6.
  • the offset of six symbols, the prefix length of one and the common prefix “t” are shown.
  • the leaf node 406 matches the complete key “Aaronitic”. Hence, the search was successful.
  • the first reader 401 has already passed inner node 403, which is now under a write operation by a writer 407. Because of the new depth field in the common node prefix data structure, readers are allowed to eliminate waiting by only obligating writers to update the common node prefix consistently during an internal node split or merge, here at inner node 403. Consistently here literally means the writer 407 updates the common node prefix symbols together with the depth field at once, i.e. atomically. There are no other obligations on the order of writer operations or on synchronization between writers.
  • a second reader 408 is accessing the internal node 403 with a possibly ongoing modification, when the common node prefix has already been updated but the new internal node is not inserted yet. Then, readers use the depth value from the new field of the common node prefix instead of the nominal depth counted while traversing through tree levels. They are thus able to compare the common node prefix to the correct symbols of the search key regardless of the actual node level within the tree. Due to concurrency, some nodes possibly underwent a split or merge during the readers’ operation. However, readers proceed forward with tree traversal, comparing the following symbols of the requested key. It is then possible to get a false positive match, accidentally skipping several symbols from the comparison. If such uncertainty is detected, the situation is resolved nicely at the final step by comparing the stored key of the located index leaf with the requested one.
  • the writer 407 creates an auxiliary data structure 409 at the inner node 403 containing the common node prefix with the node depth and the common prefix of the new modified node.
  • the common node prefix reflects the changes of the write operation.
  • the writer 407 creates a new auxiliary structure 409, sets the proper depth and common prefix fields, and atomically updates the reference, e.g. a pointer, or the structure within the corresponding node 403. Writers perform all updates atomically to maintain data structure consistency.
  • the common node prefix is also updated atomically.
  • the node depth of the auxiliary data structure 409 equals 2 and the common prefix of the auxiliary data structure 409 is “ron”, having a length of 3.
  • This new common node prefix of the auxiliary data structure 409 corresponds to a node split in which a new node is inserted between the root node 402 and the inner node 403 under write operation.
  • Such a new node has a node depth of 1 and a common prefix “a”.
  • the sum, i.e. 5, of the node depth, i.e. 2, and the prefix length, i.e. 3, of the new auxiliary data structure 409 equals the sum, i.e. 5, of the node depth, i.e. 1, and the prefix length, i.e. 4, of the common node prefix of the auxiliary data structure of the inner node 403 under write operation.
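The invariant stated above — a split must keep the sum of node depth and prefix length constant, so readers keep comparing symbols at the right key offset — can be checked mechanically. The helper name is ours, and the figures come from the example of node 403.

```python
def split_preserves_offset(old_depth: int, old_prefix: bytes,
                           new_depth: int, new_prefix: bytes) -> bool:
    # After a node split or merge, depth + prefix length must still
    # point at the same absolute position within the key; otherwise a
    # concurrent reader would compare symbols at the wrong offset.
    return old_depth + len(old_prefix) == new_depth + len(new_prefix)

# Node 403 goes from depth 1, prefix "aron" to depth 2, prefix "ron"
# when a depth-1 node with prefix "a" is split off above it.
ok = split_preserves_offset(1, b"aron", 2, b"ron")
```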
  • the auxiliary data structure 409 and/or its common node prefix can be stored directly in the inner node or in a separate structure to which one or more pointers refer.
  • the pointer can be stored at the inner node or at the auxiliary data structure 409.
  • the second reader 408 when reaching the inner node 403 during tree traversal, reads the common node prefix of the auxiliary data structure 409.
  • the second reader 408 relies on the node depth field rather than on the counted depth, i.e. it calculates the position from the node depth field, i.e. 2, and the prefix length, i.e. 3. Therefore, the second reader 408 reaches the correct inner node 404 despite the write operation on the inner node 403.
  • readers can operate wait-free, proceeding forward from the root node to the leaf, because they use the depth value from the common node prefix, if such a depth value already exists.
  • the reader may skip some symbols from the comparison due to concurrency and may resolve a possible false match by comparing the search key with the key from the located leaf.
  • Fig. 5 depicts a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, enabling wait-free reader operation in a radix tree.
  • a data storage system is provided that is configured to implement a prefix tree with a plurality of nodes.
  • the common node prefix is provided per inner node.
  • the common node prefix includes a common prefix, a prefix length and a node depth.
  • Fig. 6 shows an operation flowchart of the wait-free reader operation.
  • the flowchart starts at a level start 600 that is repeated for each node level.
  • In step 601 it is decided whether the node is a leaf. If yes, the procedure branches to step 602. There, it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 603. Then, the procedure branches to step 604. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 602.
  • In step 604 it is decided whether a false positive or an uncertainty exists. If yes, the search key and the leaf key are matched in step 605. For a positive outcome, the method branches to step 606 and decides true, i.e. the key has been correctly found in the leaf; for a negative outcome, it branches to step 607, false. The method also takes step 606 when no false positive or uncertainty exists at step 604.
  • For a no at step 601, i.e. the node is not a leaf, the operation branches to step 608 where it is decided whether a common node prefix exists. If not, the method branches to step 609. There, it is decided whether a next level child exists. If not, it is branched to step 607, false, because a node was found that does not store the key. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
  • When a common node prefix exists, i.e. a positive decision at step 608, the operation branches to step 611. There, it is decided whether the prefix and the key at the position of the node match. In other words, symbols of the common node prefix are compared with the search key at positions defined by the common prefix depth. If no, it is branched to the false step 607 and the operation is terminated. If yes, it is branched to step 612.
  • In step 612 it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 613. Then, the procedure branches to step 609. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 612.
  • In step 609 it is decided whether a next level child exists. If not, it is branched to step 607, false, because a node was found that does not store the key. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
  • Readers operate wait-free, traversing the tree down to the leaf with no locks despite concurrent modifications possibly introduced by writers.
  • a reader compares symbols of the common prefix with the search key at corresponding positions. The corresponding positions are determined by extracting the depth field from the common node prefix, if it exists. The reader proceeds forward to the next level if there is a match, and otherwise returns false.
  • the reader may detect a key sequence skip or a case of uncertainty, when a concurrent writer modification excluded one or several symbols from the comparison due to a node split or merge. Such cases are usually resolved at the final step by comparing the search key with the key extracted from the terminal leaf node. If there are no detected obstacles, the reader returns true when a terminal leaf node was found, or false otherwise.
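The reader operation of Figs. 4 to 6 can be condensed into a minimal illustrative sketch (the class and function names below are hypothetical, not the claimed implementation): the position for each prefix comparison is taken from the node depth field of the common node prefix rather than counted during traversal, and a final comparison with the full leaf key resolves false positives from key sequence skip as well as uncertainty from concurrent modifications.

```python
class Leaf:
    def __init__(self, key, payload):
        self.key, self.payload = key, payload

class Inner:
    def __init__(self, depth, prefix, children):
        self.depth = depth        # node depth: absolute offset within the key
        self.prefix = prefix      # common prefix shared by all children
        self.children = children  # next key symbol -> child node

def lookup(root, key):
    node = root
    while not isinstance(node, Leaf):
        d = node.depth  # taken from the common node prefix, not counted on the way down
        if key[d:d + len(node.prefix)] != node.prefix:
            return None  # prefix mismatch: the key is not stored in the tree
        pos = d + len(node.prefix)
        child = node.children.get(key[pos:pos + 1])
        if child is None:
            return None
        node = child
    # the final comparison with the full leaf key resolves false positives
    # caused by key sequence skip or by concurrent node splits and merges
    return node.payload if node.key == key else None

# a tiny tree in the spirit of Fig. 2: "c" leads to a node with prefix "omp"
root = Inner(0, "", {"c": Inner(1, "omp", {"a": Leaf("company", 1),
                                           "u": Leaf("compute", 2)})})
assert lookup(root, "company") == 1
assert lookup(root, "comrade") is None   # prefix mismatch at "omp"
assert lookup(root, "computer") is None  # resolved by the final leaf key check
```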


Abstract

The present invention provides a data storage system (100) with a data storage (104) and a data controller (101) configured to implement a prefix tree (200) with a plurality of nodes (201, 203, 205), wherein the data controller (101) is configured to provide a common node prefix per inner node (203), the common node prefix (300) including a common prefix (301), a prefix length (302) and a node depth (303); wherein the node depth (303) is the absolute offset from the beginning of a key to the beginning of the common prefix (301).

Description

DATA STORAGE SYSTEM AND METHOD OF PROVIDING A DATA STORAGE
SYSTEM
TECHNICAL FIELD
The present invention relates to a data storage system, a method of providing a data storage system and a computer program with a program code. In particular, the present invention relates to data structures used for data lookups, and more particularly, to a prefix tree data structure for locating data stored in a database with a novel method of synchronization between writers and readers providing linear scalability on read.
BACKGROUND
Different tree-based index data structures have already been known for decades. Despite distinct designs, many of them have certain features in common, such as a single root node, internal nodes interconnected to form a tree, and last level nodes, so-called leafs.
Usually, payload data is kept within the tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup. Traversing the tree follows the entering of a key or search key. The search traverses the tree according to matches between the attribute values stored in nodes and the key. The search starts at the root node, i.e. a parent node, and branches through child nodes, each of which depends on one parent node. A radix tree is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node, this information is preserved within the node interconnections. Thus, during radix tree traversal there is no need to look through child nodes and compare search attribute values, but just to choose the child that corresponds to the attribute value index, if it exists. Despite their evident simplicity and neat traversing algorithm, such radix trees were not widely used for general purpose database or storage system index structures due to prohibitively high memory overheads. Indeed, each radix tree node is supposed to keep the whole range of possible attribute values even when very few of such child nodes actually exist. This implies an exponential growth in memory consumption. Recently, radix trees have been prepared for general purpose use. The most important modification is the provision of variable size nodes. It means internal nodes differ in capacity. There are several pre-set kinds of capacities, and depending on the demand the particular one is chosen during tree modification.
Such an approach is usually called horizontal compression. Another valuable improvement is to skip all internal nodes with a single child only. It is then necessary to keep common attributes inside the intermediate node and to store search attributes within the corresponding leaf. Such an approach is usually called vertical compression or key sequence skip. Both improvements immediately lead to more accurate, moderate memory consumption. In particular, an internal node with a limited capacity of only 16 children has a common prefix mitigating at least three intermediate nodes with only one child each. Also, corresponding leaf nodes contain the rest of the search attributes.
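Vertical compression can be illustrated with a small sketch (illustrative code, not the claimed implementation): the leading symbols shared by all keys below a node become that node's common prefix, replacing a chain of single-child inner nodes.

```python
import os

def node_common_prefix(keys):
    # the leading symbols shared by all keys below a node become the node's
    # common prefix, replacing a chain of single-child inner nodes
    return os.path.commonprefix(list(keys))

# without vertical compression: one inner node each for "r", "o" and "m";
# with vertical compression: a single inner node holding the prefix "rom"
assert node_common_prefix(["romane", "romanus", "romulus"]) == "rom"
```

The remaining, unique tail of each key (e.g. "ane", "anus", "ulus") can then stay in the leaf node, which corresponds to the key sequence skip described above.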
Another known issue of radix trees is the method of concurrent access, or synchronization. Traditionally, synchronization techniques utilize locks to keep the reader waiting until the modification is complete, or detect the presence of changes and then restart the reader. Such solutions are simple, but hardly scalable. More advanced techniques allow accessing nodes while hiding pending modifications and operate wait-free. However, they require sophisticated data structures, generally consume more memory and are complex to implement. An alternative option to consider is hardware transactional memory. However, certain obstacles regarding proper hardware support are still present.
A naive radix tree design implies a huge amount of memory, so such trees have been quite limited in use. A new approach optimized for memory usage is called Adaptive Radix Tree (ART). ART accumulates known tree compression techniques. Inner nodes can have several symbols onboard, called a common node prefix, instead of just one symbol; such an approach is known as vertical compression and is used to decrease the number of tree levels. A well-proven addition to such vertical compression is the so-called key sequence skip, which is applied to reduce the number of inner nodes when the rest of the key stored in the leaf node is unique. There is an observation that a single modification affects at most two nodes of a radix tree: the modified node itself and its parent node. Thus, there are several known concurrent access synchronization mechanisms exploiting this observation that can be effectively applied to radix trees in general.
Lock Coupling is a standard method for synchronizing B-trees and can easily be applied to radix trees as well. The idea is to hold at most two locks at every single moment during tree traversal. Starting from the root node, on every step one level down a lock is acquired on the corresponding child and the parent lock is released. Lock Coupling algorithms suffer from high contention on the first level of the tree: every actor starts from the root node by acquiring its lock. Thus, there is definitely high contention there, and a good probability of contention on the next level, too.
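Lock Coupling can be sketched as follows (hypothetical `Node` class and `descend` helper; a simplification, not a full radix tree):

```python
import threading

class Node:
    def __init__(self, children=None):
        self.lock = threading.Lock()
        self.children = children or {}

def descend(root, path):
    # Lock Coupling: at most two locks are held at any single moment;
    # the child is locked first, then the parent lock is released
    node = root
    node.lock.acquire()
    for symbol in path:
        child = node.children[symbol]
        child.lock.acquire()
        node.lock.release()
        node = child
    return node  # returned with its own lock still held

leaf = Node()
root = Node({"a": Node({"b": leaf})})
found = descend(root, "ab")
assert found is leaf and found.lock.locked()
found.lock.release()
```

Every traversal begins with `root.lock.acquire()`, which illustrates why contention concentrates on the first tree level.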
A more advanced technique called Optimistic Lock Coupling is an enhanced version of the previous method with better performance. Instead of acquiring locks on each step to prevent modifications of nodes, actors assume that no concurrent modifications happen during tree traversal. Modifications are detected afterwards using version counters and the operation is restarted, if needed. It shows higher performance because locks are acquired only on demand and the overall contention in tree nodes is not too high. However, it mostly suffers from the coarse-grained nature of the locks. Several tree nodes are locked while their parent is locked, although no modification is meant for them. For instance, one actor acquires a tree node lock, which implies that other actors cannot lock that node's children due to lock coupling before the lock is released.
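The optimistic variant can be sketched as follows (hypothetical `VNode` class; version handling simplified): a reader records the version counter, reads without locking, and validates afterwards, restarting only if a concurrent writer interfered.

```python
class VNode:
    def __init__(self, value):
        self.value = value
        self.version = 0  # even: node is stable; odd: a write is in progress

def read_optimistically(node, read_fn):
    # read without taking a lock, then validate with the version counter;
    # a changed counter means a concurrent writer interfered -> restart
    while True:
        v = node.version
        if v % 2 == 1:
            continue              # writer currently active, retry
        result = read_fn(node)
        if node.version == v:     # version unchanged: the read was consistent
            return result

n = VNode(42)
assert read_optimistically(n, lambda x: x.value) == 42
```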
The Read-Optimized Write Exclusion (ROWEX) approach shows even better performance. Inner nodes are extended with a level field which is set only once when the node is created and indicates the total length of the key sequence at the node level. The level field is never changed in later steps. Assuming that modifications of the common prefix are conducted atomically, as well as updates of node references, readers can operate wait-free with no locks or retries. However, writers are supposed to perform extra actions leaving the radix tree in a complete and correct state for readers. For writers, ROWEX uses a similar locking approach as Optimistic Lock Coupling. Each writer keeps two exclusive locks: one on the node to be modified and one on the parent. The major difference lies in the extra actions performed by writers to let readers work without locks. These extra actions are mostly atomic updates of the tree node fields important for readers to read the tree correctly.
Copy on Write (COW) methods are usually applied in lock-free algorithms. The main idea is to create a hidden copy of the node being modified, introduce all changes to this copy and then make it visible. A conventional lock, a compare-and-swap operation and version counters can be used to implement COW. COW approaches are beneficial for read-oriented workloads, e.g. 95 percent read operations and 5 percent write operations in the workload. A writer detects changes during preparation of a tree node copy and restarts the operation, redoing the copy again. These extra memory copies and allocations also lead to poor write performance. On the other hand, COW can be used with locks to avoid restarts. In this case, the scalability of write operations is quite low. In addition, for non-volatile memory, where write speed is slow but read speed is close to DRAM, extra copies would require more time.
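The COW idea can be sketched as follows (hypothetical `Ref` and `CNode` types; a real system would publish the copy with an atomic compare-and-swap): changes are made on a hidden copy, which then replaces the old node through a single pointer update, so readers see either the old or the new node, never a half-modified one.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CNode:
    prefix: str
    depth: int

class Ref:
    # a mutable cell standing in for an atomically updated node pointer
    def __init__(self, node):
        self.node = node

def cow_update(ref, **changes):
    # Copy on Write: modify a hidden copy, then publish it with a single
    # pointer swap (compare-and-swap in a real implementation)
    new = replace(ref.node, **changes)
    ref.node = new
    return new

r = Ref(CNode(prefix="omp", depth=1))
cow_update(r, prefix="ron", depth=2)
assert (r.node.prefix, r.node.depth) == ("ron", 2)
```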
Hardware Transactional Memory (HTM) is a hardware mechanism to detect conflicts and undo any changes on shared data. The goal of such hardware is to transparently support regions of code marked as transactions by enforcing atomicity, consistency and isolation. In this scenario, all actors always observe a consistent state of tree nodes since modifications are performed only within transactions and thus are visible for all actors only after a successful commit. When the commit operation fails, a restart of the modification is required because the tree node has been simultaneously changed. HTM is very specific hardware and not well supported yet; there are no wide opportunities to use it in production today.
Software Transactional Memory (STM) is an emulation of HTM on machines without HTM support. However, an optimal implementation of STM requires double compare-and-swap instruction support (DCAS); otherwise the implementation of such an approach is sophisticated and quite slow. Unfortunately, the DCAS operation is supported only by the newest processors and is not widely available. Moreover, in practice, STM systems also suffer a performance hit compared to fine-grained lock-based systems, due primarily to the overheads associated with maintaining the log and the time spent committing transactions. In order to traverse the tree correctly, synchronization between readers and writers of nodes and interconnections is essential. Typically, readers access nodes when no ongoing modifications are visible to them. Thus, nodes represent a complete and correct tree state.
Traditionally, synchronization techniques for prefix or radix trees use locks to keep a reader waiting until the modification is done, or detect the presence of changes and restart. Such solutions are simple, but scale inefficiently. Advanced techniques allow reading nodes while hiding pending modifications and operate wait-free. However, they require sophisticated data structures and generally are complex to implement.
SUMMARY
In view of the above-mentioned problems and disadvantages, the present invention aims to improve the performance of prefix trees. The present invention has thereby the object to improve the synchronization between writers and readers in prefix trees.
The object of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims.
In particular, the present invention relates to the way computers lookup data stored in information databases or storage systems and proposes a novel method of concurrent access synchronization to locate data faster.
Fast data lookup is essential for all computer programs, including information databases and storage systems. Programs need to locate data for subsequent retrieval or computation. Storage systems and information databases usually have large volume storage media to store huge amounts of data. Storage media are slow memory devices compared to the CPU main memory, therefore an exhaustive search for looking up data is not feasible.
Modern systems widely use so-called indexing data structures, which are preferably placed within the fast main memory and represent a set of associations between data search attributes and stored data locations on the storage media. Thus, to locate data promptly there is no need to traverse the whole stored data set from the storage media; instead, the index records corresponding to the request are found within the index data structure.
More particularly, the invention directly relates to indexing data structures based on a prefix tree, for example a radix tree, and discloses a unique method of concurrency control for synchronization between writers and readers providing linear scalability on read operations.
Data users can act as writers, adding and modifying data, or readers, looking up and retrieving stored data. In the presence of concurrency, when several actors perform data lookup and modification simultaneously, the overall operation efficiency depends significantly on the method of concurrent synchronization. Usually, readers access records of the index structure when no ongoing modifications are visible to them. Thus, the index structure represents the complete and correct state of the stored data set. Known methods to achieve this often assume some sort of waiting or extra memory overheads. So, the synchronization has a significant impact on the system's overall performance and is considered a bottleneck for tree-based data structures. In general, during concurrent operation over some data structure, actors can perform more or less independently with respect to the synchronization model. In the worst case, they serialize their actions, thereby proportionally increasing total execution time; in the best case, actors complete their operations in parallel. Linear scalability represents the upper theoretical bound for the aforementioned best case, and with regard to the lookup operation it immediately implies that readers never wait despite ongoing modifications conducted concurrently by writers.
The present invention is directed towards data storage systems and information databases with an index data structure based on a prefix tree like a radix tree and a method of synchronization between writers and readers providing linear scalability on read operations.
A first aspect of the present invention provides a data storage system with a data storage and a data controller configured to implement a prefix tree with a plurality of nodes, wherein the data controller is configured to provide a common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth; wherein the node depth is the absolute offset from the beginning of a key to the beginning of the common prefix.
According to the present invention, a prefix tree is provided with a common node prefix at at least one inner node. Besides the common prefix and the prefix length, a node depth is also included. The node depth is the absolute offset from the beginning of a key or search key to the beginning of the common prefix, or in other words the effective offset within the key. Now, the reader can rely on the node depth as specified at the node and does not have to count the depth during tree traversal. The prefix length can be provided by a dedicated data field or it can be derived from the common prefix.
The invention provides a memory efficient index data structure based on an index tree and an enhanced concurrency control mechanism between writers and readers supporting linear scalability on data index lookup operation by readers. The readers can operate wait-free.
The invention leads to an index data structure and concurrency control that can provide several kinds of advantages to information databases, storage systems and other applications heavily relying on efficient data lookup and retrieval. These advantages include near to linear scalability of index data structure lookup, a high performance of lookup and data retrieval, reduced memory overhead with vertical compression applied, and the support of long node common prefixes.
In an implementation form of the first aspect, the data controller is configured to initiate a write operation in the prefix tree, thereby setting a node depth for an inner node under write operation, and is configured to initiate a concurrent read operation including the inner node under write operation, thereby using the set node depth, the common prefix and the prefix length for tree traversal. The use of the set node depth, the common prefix and the prefix length for tree traversal by the reader allows a wait-free read operation concurrent to the write operation. The set node depth, which can be set directly or atomically, lets the reader traverse to the correct node even in case of a node split or a node merge. The node depth makes the reader independent of the ongoing count of nodes during tree traversal. In a further implementation form of the first aspect, the data controller is configured to initiate a write operation for an inner node, to provide an auxiliary data structure for the inner node under write operation, and to provide a common node prefix for the auxiliary data structure, wherein the common node prefix reflects the changes of the write operation. The auxiliary data structure contains at least the three fields of the common node prefix. It can include more fields if desired. The auxiliary data structure enables easy implementation of the common node prefix into an existing prefix tree structure. The writer creates a new auxiliary data structure and sets the proper depth and common prefix fields, thereby reflecting the changes of the write operation. The reference or structure within the corresponding node under write operation can then be atomically updated. Traversing readers obtain the correct information at all times. Before the write operation, a reader reads the unchanged node. During the write operation, a reader reads the auxiliary data structure, which already includes the correct information.
After the write, the reader reads the updated node.
In a further implementation form of the first aspect, the data controller is configured to initiate a read operation concurrent to the write operation and including the inner node under write operation such that the read operation reads the inner node under write operation and the auxiliary data structure. As pointed out above, the auxiliary data structure provides a concurrent reader already with the correct information reflecting the changes of the write operation. By the use of the node depth the reader reads values reflecting the status after the write operation.
In a further implementation form of the first aspect, the data controller is configured to replace the auxiliary data structure wherein the common node prefix reflects the changes of the write operation of the inner node under write operation, after the write operation is completed. The auxiliary data structure of the inner node is replaced with the new auxiliary data structure with the updated common node prefix accordingly to the conducted changes of the inner node within the tree.
In a further implementation form of the first aspect, the data controller is configured to set the node depth and/or the prefix length of the new auxiliary data structure such that the sum of the node depth and the prefix length of the new auxiliary data structure equals the sum of the node depth and the prefix length of the previous auxiliary data structure of the inner node under write operation. Such a rule allows easy calculation and keeps consistency for the reader, as the sum is the same before and during the write operation.
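This rule can be illustrated with the numbers of the node split example of Fig. 4 (node depth 1, common prefix "aron", split after "a"; the `split_prefix` helper is hypothetical, for illustration only):

```python
def split_prefix(depth, prefix, split_len):
    # a node split moves the first split_len prefix symbols to a newly
    # inserted parent node; the rest stays with the node under write operation
    parent = (depth, prefix[:split_len])
    child = (depth + split_len, prefix[split_len:])
    # the consistency rule: node depth + prefix length stays the same,
    # so a concurrent reader computes the same key positions
    assert child[0] + len(child[1]) == depth + len(prefix)
    return parent, child

# the example of Fig. 4: depth 1, common prefix "aron", split after "a"
parent, child = split_prefix(1, "aron", 1)
assert parent == (1, "a")   # new node: depth 1, prefix "a"
assert child == (2, "ron")  # modified node: depth 2, prefix "ron"; 2 + 3 == 1 + 4
```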
In a further implementation form of the first aspect, the common node prefix is provided within the auxiliary data structure within the inner node. This implementation keeps all data internal to the node.
In a further implementation form of the first aspect, the common node prefix is provided within the auxiliary data structure within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency. Moreover, such an indirect approach surpasses hardware limitations on atomic operations for longer common prefixes, which is beneficial for general purpose indexing data structures while the internal nodes remain quite compact.
In a further implementation form of the first aspect, the common node prefix is provided within a separate structure, wherein the auxiliary data structure is provided within the inner node and wherein at least one pointer is provided at the auxiliary data structure pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency.
In a further implementation form of the first aspect, the references to children and/or a children node counter of an inner node are provided within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure. Such an approach may imply additional memory management overheads, but decreases the number of parent node updates during child node expands and shrinks, thereby simplifying node capacity management.
In a further implementation form of the first aspect, the prefix tree is a radix tree or an adaptive radix tree. The radix tree and the adaptive radix tree, ART, fit very well with the proposed node depth field at inner nodes. In a further implementation form of the first aspect, the prefix tree comprises a horizontal compression, a vertical compression and/or a key sequence skip. Such approaches reduce the number of nodes so that memory usage is decreased. Horizontal compression may introduce several inner nodes of different capacity and variable size, thereby reducing memory cost according to the real number of node children.
A second aspect of the present invention provides a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, comprising providing a common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth. The same advantages and modifications as above apply.
In an implementation form of the second aspect, the method comprising initiating a write operation for an inner node; and providing an auxiliary data structure for the inner node under write operation comprising a common node prefix of the inner node; wherein the common node prefix reflects the changes of the write operation.
A third aspect of the present invention provides a computer program with a program code for performing the method as described above when the computer program runs on a computer or the data storage system as described above. The same advantages and modifications as above apply.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF THE DRAWINGS
The above described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
Fig. 1 shows a diagram of the system architecture of a data storage system with an index structure.
Fig. 2 shows an example of a radix tree with vertical and horizontal compression.
Fig. 3 shows the structure of the common node prefix of inner nodes.
Fig. 4 shows a concurrently operating adaptive radix tree with wait-free readers.
Fig. 5 shows a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, enabling wait-free reader operation in a radix tree.
Fig. 6 shows a flowchart of wait-free reader operation in a radix tree.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Fig. 1 shows a data storage system 100 including a data controller 101. The data controller 101 includes a concurrency control mechanism 102 for a prefix or radix tree with vertical compression and key sequence skip used as an index data structure 103. The data storage system 100 further includes a data storage 104. The data controller 101 may be implemented in a main memory using DRAM (Dynamic Random Access Memory) or the like and the data storage 104 typically includes mass storage like hard disks, Solid-State Disks (SSD) or the like.
According to claim 1 of the present invention, the data storage system 100 includes the data controller 101 and the data storage 104. Thus, to locate data promptly there is no need to traverse the complete data set stored in the data storage 104. Instead, only the index records corresponding to the request need to be searched within the index data structure 103.
Data users 111 can act as writers 112, adding and modifying data, or readers 113, looking up and retrieving data stored in the data storage 104. In the presence of concurrency, when several actors 111 perform data lookup and modification simultaneously, the overall operation efficiency depends significantly on the method of the concurrent synchronization or control 102.
The data storage system 100 can be seen as a variant of advanced synchronization between writers 112 and readers 113, applicable as a concurrency control mechanism 102 for prefix or radix trees with vertical compression and key sequence skip used as an index data structure 103 in data storage systems 100 and information databases. Such a radix tree based index data structure 103 can also be equipped with horizontal compression and advanced synchronization between writers, too.
Fig. 2 shows an example of a radix tree 200. The radix tree 200 includes a single root node 201, internal nodes 203 interconnected to form a tree, and last level nodes, so-called leafs 205. Usually, payload data is kept within the tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup.
The radix tree 200 is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node, this information is preserved within the node interconnections. Thus, during radix tree traversal there is no need to look through child nodes and compare search attribute values, but just to choose the child that corresponds to the attribute value index, if any exists.
Internal index tree nodes 203, as mentioned above, can be of variable size or adaptive capacity. There are four types of internal tree nodes differing in capacity only: 4, 16, 48 and 256 children, respectively. The internal node 203 contains a children compartment according to the capacity of 16 children in this example. The root node 201 has a capacity of 256 children. In this example, the inner node 203 is a child node of the root node 201 and linked to the root node 201 by the symbol "c". The common prefix in the inner node 203 is "omp". Common prefix means that all leaf nodes 205, which are children of the inner node 203, contain the prefix "omp".
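The choice among the four node capacities can be sketched as follows (the `pick_capacity` helper is illustrative, not part of the claimed implementation): the smallest pre-set capacity that fits the current number of children is selected, which is the essence of horizontal compression.

```python
def pick_capacity(n_children):
    # horizontal compression: choose the smallest of the four pre-set
    # inner node capacities that can hold the current number of children
    for capacity in (4, 16, 48, 256):
        if n_children <= capacity:
            return capacity
    raise ValueError("an inner node cannot have more than 256 children")

assert pick_capacity(3) == 4
assert pick_capacity(16) == 16
assert pick_capacity(17) == 48
assert pick_capacity(200) == 256
```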
In a more general form, the tree 200 can be described in a bottom-up fashion, starting with the alphabet definition, key string encoding and the interconnections of the index structure tree nodes. The alphabet for such a tree-based index data structure is a single-byte character set. This means that each symbol is represented within 8 bits and the whole alphabet cardinality, i.e. the number of elements, is 256 symbols. All symbols are used to form input strings: 255 non-zero byte symbols serve to encode information and a zero-byte symbol indicates the end of an input string. This encoding corresponds to C-style or null-terminated strings.
The interconnections between tree nodes of the index structure can be implemented with the so-called augmented pointers technique, as well as with ordinary node pointers. A current pointer size is 8 bytes or 64 bits, and modern MMUs usually operate within a 48-bit address space, leaving the rest unused. Thus, it is possible to reuse up to 16 bits of ordinary 64-bit pointers to preserve important information there. Augmented pointers which are used as node interconnections carry a 1-byte alphabet symbol denoting the symbol to which the particular interconnection corresponds.
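A minimal sketch of this augmented pointer packing, assuming a 48-bit address space; the identifiers (`augptr`, `augptr_make`, etc.) are illustrative assumptions, not names from the patent:

```c
#include <stdint.h>

/* An augmented pointer: the low 48 bits hold the node address, one of
 * the spare upper bytes holds the edge symbol of the interconnection. */
typedef uint64_t augptr;

#define AUGPTR_ADDR_MASK ((UINT64_C(1) << 48) - 1)

/* Pack a node address and its 1-byte alphabet symbol into one word. */
static augptr augptr_make(void *node, unsigned char symbol)
{
    return ((uint64_t)(uintptr_t)node & AUGPTR_ADDR_MASK)
         | ((uint64_t)symbol << 48);
}

/* Recover the plain node address (upper bits cleared). */
static void *augptr_node(augptr p)
{
    return (void *)(uintptr_t)(p & AUGPTR_ADDR_MASK);
}

/* Recover the edge symbol stored in bits 48..55. */
static unsigned char augptr_symbol(augptr p)
{
    return (unsigned char)((p >> 48) & 0xFF);
}
```

Because the whole augmented pointer fits in a single 64-bit word, it can be read and written with the same atomic operations as an ordinary pointer.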
Fig. 3 shows the common node prefix 300 in detail. The common node prefix 300 is a data compartment within the internal node 203. The common node prefix 300 has a layout to keep a number of key symbols which are shared by all node children. Those symbols are called the common prefix 301 and are arranged in a byte array compartment for prefix symbols. A further field, the prefix length field 302, provides the length of the common prefix 301. The invention extends the common node prefix 300 with an additional field, the node depth 303, to indicate the absolute offset from the beginning of the key.
Major internal node fields, like the children compartment and the children counter, can also reside in a separate memory layout with indirect access through a pointer from the internal node structure. Such an approach implies additional memory management overhead, but decreases the number of parent node updates during child node expands and shrinks, and so simplifies node capacity management. Radix tree leafs 205, or terminal nodes, consist of the primary fields key and payload, as well as the optional fields lock and uplink. During operation, writers perform all updates to ordinary and augmented pointers atomically. The common node prefix 300 is also supposed to be updated atomically. An indirect memory layout does not pose any difficulty, and pointer updates are handled by atomic operations of general commodity hardware. Internal nodes undergo expansion or shrinking when children counters indicate a high or low level of occupancy. Splitting or merging a node usually implies insertion or extraction of other nodes and common node prefix changes.
According to the common node prefix 300 presented here, the node depth 303 and the common prefix 301 are kept together in a single separate data structure. Thus, both of them are updated consistently by referencing that particular structure instance, i.e. the common node prefix 300. A reader then never accounts its own depth during the tree traversal. Instead, the reader reads the node depth 303 from the field of the common node prefix 300. It immediately follows that a reader always compares symbols at the correct positions within the search key. Further, no read locks are required despite concurrent write operations.
The common node prefix 300 can be declared by the following structure:
struct node_prefix {
    unsigned depth;
    unsigned length;
    char prefix[MAX_LEN];
};
Such a common node prefix 300 may pose a 4-byte constant overhead per node, wherein the overhead actually may depend on the maximum possible length of the key.
Fig. 4 shows an example of a radix tree 400 with a plurality of nodes. In each node, the node depth and the common prefix are shown. The prefix length can be derived from the common prefix. These fields are stored at each inner node in the common node prefix. A first reader 401 traverses the tree 400.
On the left of Figure 4, the progression of the first reader 401 through the tree 400 is shown. The search key "Aaronitic" is looked up by the first reader 401. The search starts at a root node 402 and continues to inner node 403, as this is the child of the root node 402 that includes the next symbols of the key. The inner node 403 has the common prefix "aron" and a node depth of 1. Accordingly, the first reader 401 moves further down the tree 400 to inner node 404, which comprises a node depth of 5 and a common prefix of length 1. The next step corresponds to the status shown in Figure 4. The first reader 401 is at inner node 405, which has the common prefix "t" and a node depth of 6. On the left side, the offset of six symbols, the prefix length of one and the common prefix "t" are shown. The leaf node 406 matches the complete key "Aaronitic". Hence, the search was successful.
The first reader 401 has already passed inner node 403, which is now under a write operation by a writer 407. Because of the new depth field in the common node prefix data structure, readers are allowed to eliminate waiting by only obligating writers to update the common node prefix consistently during an internal node split or merge, here at inner node 403. Consistently here literally means the writer 407 updates the common node prefix symbols together with the depth field at once, i.e. atomically. There are no other obligations on the order of writer operations or on synchronization between writers.
Thus, a second reader 408 is accessing the internal node 403 with a possibly ongoing modification, where the common node prefix has already been updated but the new internal node is not inserted yet. Readers then use the depth value from the new field of the common node prefix instead of the nominal depth accounted while traversing through tree levels. They are thus able to compare the common node prefix to the correct symbols of the search key regardless of the actual node level within the tree. Due to concurrency, some nodes may have undergone a split or merge during the readers' operation. However, readers proceed forward with the tree traversal, comparing the following symbols of the requested key. It is then possible to have a false positive match, accidentally skipping several symbols from the comparison. In case such uncertainty is detected, the situation is resolved nicely at the final step by comparing the stored key of the located index leaf with the requested one.
In the following it is described how the writer 407 actually performs such an operation and how the second reader 408 traverses inner node 403 under the write operation. The writer 407 creates an auxiliary data structure 409 at the inner node 403 containing the common node prefix with the node depth and the common prefix of the new modified node. Hence, the common node prefix reflects the changes of the write operation. In other words, the writer 407 creates a new auxiliary structure 409, sets the proper depth and common prefix fields, and atomically updates the reference, e.g. a pointer or the structure itself, within the corresponding node 403. Writers perform all updates atomically to maintain data structure consistency. The common node prefix is also updated atomically.
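The writer's publish step can be sketched with C11 atomics as follows. This is an illustrative interpretation under stated assumptions, not the patent's implementation; identifiers such as `publish_prefix` and `inner_node` are invented for the sketch, and safe reclamation of the replaced instance is deliberately left out:

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 8  /* illustrative bound */

struct node_prefix {
    unsigned depth;
    unsigned length;
    char prefix[MAX_LEN];
};

/* An inner node referencing its common node prefix indirectly, so the
 * writer can publish depth, length and prefix symbols together with a
 * single atomic pointer store. */
struct inner_node {
    _Atomic(struct node_prefix *) common;
    /* children compartment, counter etc. omitted */
};

/* Writer side: build the new common node prefix aside, then publish it
 * atomically; a concurrent reader sees either the complete old version
 * or the complete new version, never a mix of fields. */
static void publish_prefix(struct inner_node *n, unsigned depth,
                           const char *prefix, unsigned length)
{
    struct node_prefix *p = malloc(sizeof *p);
    p->depth = depth;
    p->length = length;
    memcpy(p->prefix, prefix, length);
    atomic_store_explicit(&n->common, p, memory_order_release);
    /* Reclaiming the old instance safely (e.g. epoch-based
     * reclamation) is out of scope for this sketch. */
}
```

Readers would load the pointer with `memory_order_acquire` and use the depth and prefix of whichever instance they obtained.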
In the example of Figure 4, the node depth of the auxiliary data structure 409 equals 2 and the common prefix of the auxiliary data structure 409 is "ron", having a length of 3. This new common node prefix of the auxiliary data structure 409 corresponds to a node split in which a new node is inserted between the root node 402 and the inner node 403 under the write operation. Such a new node has a node depth of 1 and a common prefix "a".
It can be seen that the sum, i.e. 5, of the node depth, i.e. 2, and the prefix length, i.e. 3, of the new auxiliary data structure 409 equals the sum, i.e. 5, of the node depth, i.e. 1, and the prefix length, i.e. 4, of the previous common node prefix of the inner node 403 under the write operation.
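This invariant can be illustrated with the numbers of the worked example. The split arithmetic and all identifiers below are assumptions mirroring Figure 4, not code from the patent:

```c
/* A (depth, length) view of a common node prefix. */
struct prefix_view { unsigned depth, length; };

/* Splitting the old prefix after `cut` symbols: the newly inserted
 * upper node keeps the first `cut` symbols at the old depth... */
static struct prefix_view split_upper(struct prefix_view old, unsigned cut)
{
    struct prefix_view v = { old.depth, cut };
    return v;
}

/* ...and the surviving lower node keeps the remaining symbols, so its
 * depth plus prefix length stays equal to the old depth plus length. */
static struct prefix_view split_lower(struct prefix_view old, unsigned cut)
{
    struct prefix_view v = { old.depth + cut, old.length - cut };
    return v;
}
```

With the values of Figure 4 (old depth 1, prefix "aron" of length 4, cut after "a"), the upper node gets depth 1 and length 1 and the lower node gets depth 2 and length 3, preserving the sum 5.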
The auxiliary data structure 409 and/or its common node prefix can be stored directly in the inner node or in a separate structure to which one or more pointers refer. The pointer can be stored at the inner node or at the auxiliary data structure 409.
The second reader 408, when reaching the inner node 403 during tree traversal, reads the common node prefix of the auxiliary data structure 409. The second reader 408 relies on the node depth field rather than on the accounted depth; it calculates the next position from the node depth field, i.e. 2, and the prefix length, i.e. 3. Therefore, the second reader 408 reaches the correct inner node 404 despite the write operation on the inner node 403.
In a broader sense, readers can operate wait-free, proceeding forward from the root node to the leaf, because they use the depth value from the common node prefix, if such a depth value already exists. The reader may skip some symbols from the comparison due to concurrency and may resolve a possible false match by comparing the search key with the key from the located leaf.
Fig. 5 depicts a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes supporting wait-free reader operation in a radix tree. In step 500, a data storage system is provided that is configured to implement a prefix tree with a plurality of nodes. In step 501, the common node prefix is provided per inner node. The common node prefix includes a common prefix, a prefix length and a node depth.
Fig. 6 shows an operation flowchart of the wait-free reader operation. The flowchart starts at a level start 600 that is repeated for each node level.
At step 601, it is decided whether the node is a leaf. If yes, the procedure branches to step 602. There, it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 603. Then, the procedure branches to step 604. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 602.
At step 604, it is decided whether a false positive or an uncertainty exists. If yes, the search key and the leaf key are matched in step 605. For a positive outcome, the method branches to step 606 and returns true, i.e. the key has been correctly found in the leaf. The method also takes step 606 when no false positive or uncertainty exists at step 604.
Back to step 601 for the case that the node is not a leaf: the operation then branches to step 608, where it is decided whether a common node prefix exists. If not, the method branches to step 609. There, it is decided whether a next-level child exists. If not, the method branches to step 607, false, because no node storing the key was found. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
When a common node prefix exists, i.e. a positive decision at step 608, the operation branches to step 611. There, it is decided whether the prefix and the key at the position of the node match. In other words, the symbols of the common node prefix are compared with the search key at the positions defined by the common node prefix depth. If no, the operation branches to the false step 607 and is terminated. If yes, it branches to step 612.
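The comparison at step 611 amounts to matching the prefix symbols against the search key at the absolute positions given by the stored node depth, not by any depth accounted while descending. A hedged sketch, with illustrative identifiers:

```c
#include <stdbool.h>
#include <string.h>

/* Does the common prefix match the search key at the absolute
 * positions [depth, depth + length)? The bounds check guards against
 * keys shorter than the stored offset. */
static bool prefix_matches(const char *key, size_t key_len,
                           unsigned depth, const char *prefix,
                           unsigned length)
{
    if ((size_t)depth + length > key_len)
        return false;
    return memcmp(key + depth, prefix, length) == 0;
}
```

For the example of Figure 4, the key "Aaronitic" matches the prefix "aron" at depth 1 and the prefix "t" at depth 6.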
There, it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 613. Then, the procedure branches to step 609. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 612.
At step 609, it is decided whether a next-level child exists. If not, the method branches to step 607, false, because no node storing the key was found. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
From a broader context, the operation may be described as follows. Readers operate wait-free, traversing the tree down to the leaf with no locks despite concurrent modifications possibly introduced by writers. On each level, starting from the root, a reader compares symbols of the common prefix with the search key at the corresponding positions. The corresponding positions are determined by extracting the depth field from the common node prefix, if it exists. The reader proceeds forward to the next level if there is a match, and otherwise returns false. During the traversal, the reader may detect a key sequence skip or a case of uncertainty, when a concurrent writer modification excluded one or several symbols from the comparison due to a node split or merge. Such cases are usually resolved at the final step by comparing the search key with the key extracted from the terminal leaf node. If there are no detected obstacles, the reader returns true when a terminal leaf node was found, or false otherwise.
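The whole reader operation can be summarized in a deliberately simplified, single-threaded sketch. The node layout below (one child slot per symbol, no augmented pointers, no concurrency machinery) and all identifiers are illustrative assumptions, not the patent's structures:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 8

struct node_prefix { unsigned depth, length; char prefix[MAX_LEN]; };

struct node {
    bool is_leaf;
    const char *key;                  /* leaf only: full stored key   */
    const struct node_prefix *common; /* inner only: may be NULL      */
    struct node *child[256];          /* inner only: children by symbol */
};

/* Wait-free style lookup: positions come from the stored depth field
 * where a common node prefix exists; a detected skip sets the
 * uncertainty flag, resolved by a full key comparison at the leaf. */
static bool lookup(const struct node *n, const char *key)
{
    size_t key_len = strlen(key);
    size_t pos = 0;           /* depth accounted while descending */
    bool uncertain = false;

    while (n != NULL) {
        if (n->is_leaf)
            return uncertain ? strcmp(n->key, key) == 0 : true;
        if (n->common != NULL) {
            const struct node_prefix *p = n->common;
            if (p->depth > pos)
                uncertain = true;  /* some symbols were skipped */
            if ((size_t)p->depth + p->length > key_len
                || memcmp(key + p->depth, p->prefix, p->length) != 0)
                return false;
            pos = p->depth + p->length;
        }
        if (pos >= key_len)
            return false;
        n = n->child[(unsigned char)key[pos]];
        pos++;                    /* the interconnection consumes a symbol */
    }
    return false;
}
```

Building the nodes of the "Aaronitic" example by hand and calling `lookup` on the root reproduces the traversal of Figure 4.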
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art practicing the claimed invention, from studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. Data storage system (100) with a data storage (104) and a data controller (101) configured to implement a prefix tree (200) with a plurality of nodes (201, 203, 205), wherein
the data controller (101) is configured to provide a common node prefix (300) per inner node (203), the common node prefix (300) including a common prefix (301), a prefix length (302) and a node depth (303); wherein the node depth (303) is the absolute offset from the beginning of a key to the beginning of the common prefix (301).
2. Data storage system (100) according to claim 1, wherein
the data controller (101) is configured to initiate a write operation in the prefix tree (200), thereby setting a node depth (303) for an inner node (203) under write operation, and is configured to initiate a concurrent read operation including the inner node (203) under write operation, thereby using the set node depth (303), the common prefix (301) and the prefix length (302) for tree traversal.
3. Data storage system (100) according to claim 1 or 2, wherein
the data controller (101) is configured to initiate a write operation for an inner node (203), to provide an auxiliary data structure (409) for the inner node (203) under write operation, and to provide a common node prefix (300) for the auxiliary data structure (409), wherein the common node prefix (300) reflects the changes of the write operation.
4. Data storage system (100) according to claim 3, wherein
the data controller (101) is configured to initiate a read operation concurrent to the write operation and including the inner node (203) under write operation, such that the read operation reads the inner node (203) under write operation and the auxiliary data structure (409).
5. Data storage system (100) according to claim 3 or 4, wherein
the data controller (101) is configured to replace the auxiliary data structure (409), wherein the common node prefix (300) reflects the changes of the write operation of the inner node (203) under write operation, after the write operation is completed.
6. Data storage system (100) according to one of the claims 3 to 5, wherein the data controller (101) is configured to set the node depth (303) and/or the prefix length (302) of the auxiliary data structure (409) such that the sum of the node depth (303) and the prefix length (302) of the auxiliary data structure (409) equals the sum of the node depth (303) and the prefix length (302) of the common node prefix (300) of the inner node (203) under write operation.
7. Data storage system (100) according to one of the claims 1 to 6, wherein
the common node prefix (300) is provided within the auxiliary data structure (409) within the inner node (203).
8. Data storage system (100) according to one of the claims 1 to 7, wherein
the common node prefix (300) is provided within the auxiliary data structure (409) within a separate structure, and wherein at least one pointer is provided at the inner node (203) pointing to the separate structure.
9. Data storage system (100) according to one of the claims 1 to 8, wherein
the common node prefix (300) is provided within a separate structure, wherein the auxiliary data structure (409) is provided within the inner node (203) and wherein at least one pointer is provided at the auxiliary data structure (409) pointing to the separate structure.
10. Data storage system (100) according to one of the claims 1 to 9, wherein
references to children and/or a children node counter of an inner node (203) are provided within a separate structure, and wherein at least one pointer is provided at the inner node (203) pointing to the separate structure.
11. Data storage system (100) according to one of the claims 1 to 10, wherein
the prefix tree (200) is a radix tree or an adaptive radix tree.
12. Data storage system (100) according to one of the claims 1 to 11, wherein
the prefix tree (200) comprises a horizontal compression, a vertical compression and/or a key sequence skip.
13. Method of providing a data storage system (100) configured to implement a prefix tree (200) with a plurality of nodes, comprising
providing (501) the common node prefix (300) per inner node (203), the common node prefix (300) including a common prefix (301), a prefix length (302) and a node depth (303).
14. Method according to claim 13, comprising
initiating a write operation for an inner node (203); and
providing an auxiliary data structure (409) for the inner node (203) under write operation comprising a common node prefix (300) of the inner node (203);
wherein the common node prefix (300) reflects the changes of the write operation.
15. A computer program with a program code for performing the method according to claim 13 or 14 when the computer program runs on a computer or the data storage system (100) according to one of the claims 1 to 12.