WO2019098871A1 - Data storage system and method of providing a data storage system - Google Patents

Info

Publication number
WO2019098871A1
WO2019098871A1 (PCT/RU2017/000857)
Authority
WO
WIPO (PCT)
Prior art keywords
node
prefix
common
tree
write operation
Prior art date
Application number
PCT/RU2017/000857
Other languages
French (fr)
Inventor
Aleksandr Aleksandrovich SIMAK
Sergei Romanovich BASHIROV
Xuecang ZHANG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN201780096673.3A priority Critical patent/CN111316255B/en
Priority to PCT/RU2017/000857 priority patent/WO2019098871A1/en
Publication of WO2019098871A1 publication Critical patent/WO2019098871A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — Information retrieval of structured data, e.g. relational data
    • G06F16/22 — Indexing; Data structures therefor; Storage structures
    • G06F16/2228 — Indexing structures
    • G06F16/2246 — Trees, e.g. B+trees

Definitions

  • the present invention relates to a data storage system, a method of providing a data storage system and a computer program with a program code.
  • the present invention relates to data structures used for data lookups, and more particularly, to a prefix tree data structure for locating data stored in a database with a novel method of synchronization between writers and readers providing linear scalability on read.
  • payload data is kept within the tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup. A lookup traverses the tree according to a key or search key: the search follows matches between the attribute values stored in nodes and the key. The search starts at the root node, i.e. a parent node, and branches through child nodes, each of which depends on exactly one parent node.
  • a radix tree is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node this information is preserved within node interconnections. Thus, during radix tree traversing there is no need to look through child nodes and compare searching attribute values, but just to choose the child that corresponds to the attribute value index, if it exists.
  • radix trees were not widely used for general-purpose database or storage system index structures due to prohibitively high memory overheads. Indeed, each radix tree node is supposed to keep the whole range of possible attribute values even when very few of such child nodes actually exist, which implies an exponential growth in memory consumption. Recently, radix trees have been adapted for general-purpose use. The most important modification is the provision of variable-size nodes, meaning internal nodes differ in capacity. There are several pre-set capacities, and depending on demand the appropriate one is chosen during tree modification.
  • Such approach is usually called horizontal compression.
  • Another valuable improvement is to skip all internal nodes with a single child only. It is necessary then to keep common attributes inside the intermediate node and to store searching attributes within the corresponding leaf.
  • Such approach is usually called vertical compression or key sequence skip. Both those improvements immediately lead to more accurate, moderate memory consumption.
  • an internal node with a limited capacity of only 16 children has a common prefix eliminating at least three intermediate nodes with only one child each. Also, corresponding leaf nodes contain the rest of the search attributes.
  • Another known issue of radix trees is the method of concurrent access, or synchronization.
  • synchronization techniques utilize locks to keep the reader waiting until the modification is complete, or detect the presence of changes and then restart the reader.
  • Such solutions are simple, but hardly scalable.
  • More advanced techniques allow actors to access nodes while hiding pending modifications, and operate wait-free.
  • They require sophisticated data structures, generally consume more memory and are complex to implement.
  • An alternative option to consider is to use hardware transactional memory.
  • ART Adaptive Radix Tree
  • Inner nodes can have several symbols onboard called common node prefix instead of just one symbol, such approach is known as vertical compression and is used to decrease the number of tree levels.
  • A well-proven addition to such vertical compression is the so-called key sequence skip, which is applied to reduce the number of inner nodes when the rest of the key, stored in the leaf node, is unique.
  • Lock Coupling is a standard method for synchronizing B-trees and can easily be applied to radix trees as well. The idea is to hold at most two locks at any single moment during tree traversal. Starting from the root node, at every step one level down a lock is acquired on the corresponding child and the parent lock is released. Lock Coupling algorithms suffer from high contention on the first level of the tree: every actor starts from the root node by acquiring its lock. Thus, there is high contention at the root and a good probability of contention on the next level, too.
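The Lock Coupling descent described above can be sketched in a few lines of Python. This is an illustrative sketch only; the class and function names are ours, not the patent's, and a real implementation would use in-tree locks on the actual node layout.

```python
import threading

# Illustrative Lock Coupling descent: at most two locks are held at any
# moment while walking from the root toward a leaf.
class Node:
    def __init__(self, children=None):
        self.lock = threading.Lock()
        self.children = children or {}

def descend(root, path):
    node = root
    node.lock.acquire()
    visited = [node]
    for symbol in path:
        child = node.children[symbol]
        child.lock.acquire()   # couple: lock the child first...
        node.lock.release()    # ...then release the parent
        node = child
        visited.append(node)
    node.lock.release()
    return visited

leaf = Node()
mid = Node({"b": leaf})
root = Node({"a": mid})
order = descend(root, ["a", "b"])
```

Because every descent starts by locking the root, the contention described above concentrates on the first level of the tree.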
  • Optimistic Lock Coupling is the enhanced version of the previous method with better performance. Instead of acquiring locks at each step to prevent modifications of nodes, actors assume that no concurrent modifications happen during tree traversal. Modifications are detected afterwards using version counters and the operation is restarted, if needed. It shows higher performance because locks are acquired only on demand and overall contention in tree nodes is not too high. However, it mostly suffers from the coarse-grained nature of the locks: several tree nodes are blocked while their parent is locked even though no modification is meant for them. For instance, one actor acquires a tree node lock, which implies that other actors cannot lock that node’s children, due to lock coupling, before the lock is released.
  • the Read-Optimized Write Exclusion (ROWEX) approach shows even better performance. Inner nodes are extended with a level field which is set only once when the node is created and indicates the total length of the key sequence at the node level. The level field is never changed in later steps. Assuming modifications of the common prefix are conducted atomically, as well as updates of node references, readers can operate wait-free with no locks or retries. However, writers are supposed to perform extra actions leaving the radix tree in a complete and correct state for readers. For writers, ROWEX uses a similar locking approach to Optimistic Lock Coupling. Each writer keeps two exclusive locks: one on the node to be modified and one on the parent. The major difference lies in the extra actions performed by writers to let readers work without locks. These extra actions are mostly atomic updates of the tree node fields that readers need in order to read the tree correctly.
  • COW Copy on Write
  • Conventional lock, compare and swap operation and version counters can be used to implement COW.
  • COW approaches are beneficial for read oriented workloads, e.g. 95 percent of read operations and 5 percent of write operations in workload.
  • a writer detects changes during preparation of a tree node copy and restarts the operation redoing the copy again. Those extra memory copies and allocations also lead to poor write performance.
  • COW can be used with locks to avoid restarts. In this case, scalability of write operations is quite low.
  • extra copies would require more time.
  • Hardware Transactional Memory is a hardware mechanism to detect conflicts and undo any changes on shared data.
  • the goal of such hardware is to transparently support regions of code marked as transactions by enforcing atomicity, consistency and isolation.
  • all actors always observe consistent state of tree nodes since modifications are performed only within transactions, thus are visible for all actors only after successful commit.
  • If the commit operation fails, a restart of the modification is required because the tree node has been changed simultaneously.
  • HTM requires very specific hardware which is not yet well supported, so there are few opportunities to use it in production today.
  • STM Software Transactional Memory
  • DCAS double compare and swap instruction support
  • STM systems also suffer performance hit compared to fine-grained lock-based systems due primarily to the overheads associated with maintaining the log and the time spent committing transactions.
  • In order to traverse the tree correctly, synchronization between readers and writers of nodes and interconnections is essential. Typically, readers access nodes when no ongoing modifications are visible to them. Thus, nodes represent a complete and correct tree state.
  • the present invention aims to improve the performance of prefix trees.
  • the present invention has thereby the object to improve the synchronization between writers and readers in prefix trees.
  • the present invention relates to the way computers lookup data stored in information databases or storage systems and proposes a novel method of concurrent access synchronization to locate data faster.
  • Fast data lookup is essential for all computer programs, including information databases and storage systems. Programs need to locate data for consequent retrieval or computation. Storage systems and information databases usually have large volume storage media to store huge amount of data into it. Storage media is a slow memory device compared to the CPU main memory, therefore exhaustive search for looking up data is not feasible.
  • This is achieved with indexing data structures, preferably placed within the fast main memory, representing a set of associations between data search attributes and stored data locations on the storage media. Thus, to locate data promptly there is no need to traverse the whole stored data set on the storage media; instead, the index records corresponding to the request are found within the index data structure.
  • the invention directly relates to indexing data structures based on a prefix tree, for example a radix tree, and discloses a unique method of concurrency control for synchronization between writers and readers providing linear scalability on read operations.
  • the overall operation efficiency depends significantly on the method of concurrent synchronization.
  • readers access records of index structure when no ongoing modifications are visible to them.
  • the index structure represents the complete and correct state of the stored data set.
  • Known methods to achieve this often assume some sort of waiting or extra memory overheads. So, the synchronization has a significant impact on the system overall performance and is considered as a bottleneck for tree-based data structures.
  • actors can perform more or less independently with respect to the synchronization model.
  • the present invention is directed towards data storage systems and information databases with an index data structure based on a prefix tree like a radix tree and a method of synchronization between writers and readers providing linear scalability on read operations.
  • a first aspect of the present invention provides a data storage system with a data storage and a data controller configured to implement a prefix tree with a plurality of nodes, wherein the data controller is configured to provide a common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth; wherein the node depth is the absolute offset from the beginning of a key to the beginning of the common prefix.
  • a prefix tree is provided with a common node prefix at at least one inner node.
  • a node depth is also included.
  • the node depth is the absolute offset from the beginning of a key or search key to the beginning of the common prefix, or in other words the effective offset within the key. Now, the reader can rely on the node depth as specified at the node and does not have to count the depth during tree traversal.
  • the prefix length can be provided by a dedicated data field or it can be derived from the common prefix.
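The common node prefix described above can be illustrated with a minimal Python sketch. Field names here are illustrative assumptions, not the patent's implementation; it also shows the option, mentioned above, of deriving the prefix length from the common prefix instead of storing it in a dedicated field.

```python
from dataclasses import dataclass

# Hypothetical sketch of the common node prefix: the symbols shared by
# all children of an inner node, plus the node depth, i.e. the absolute
# offset from the beginning of the key to the beginning of the prefix.
@dataclass(frozen=True)
class CommonNodePrefix:
    common_prefix: bytes
    node_depth: int

    @property
    def prefix_length(self) -> int:
        # The prefix length need not be a dedicated field; it can be
        # derived from the common prefix itself.
        return len(self.common_prefix)

# Example: an inner node reached via symbol "c" whose children all share "omp".
cnp = CommonNodePrefix(common_prefix=b"omp", node_depth=1)
```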
  • the invention provides a memory efficient index data structure based on an index tree and an enhanced concurrency control mechanism between writers and readers supporting linear scalability on data index lookup operation by readers.
  • the readers can operate wait-free.
  • the invention leads to an index data structure and concurrency control that can provide several kinds of advantages to information databases, storage systems and other applications heavily relying on efficient data lookup and retrieval. These advantages include near to linear scalability of index data structure lookup, a high performance of lookup and data retrieval, reduced memory overhead with vertical compression applied, and the support of long node common prefixes.
  • the data controller is configured to initiate a write operation in the prefix tree thereby setting a node depth for an inner node under write operation, and is configured to initiate a concurrent read operation including the inner node under write operation thereby using the set node depth, the common prefix and the prefix length for tree traversal.
  • the use of the set node depth, the common prefix and the prefix length for tree traversal by the reader allows a wait-free read operation concurrent to the write operation.
  • the set node depth, which can be set directly or atomically, lets the reader traverse to the correct node even in case of a node split or a node merge.
  • the node depth leaves the reader independent from the ongoing count of nodes at tree traversal.
  • the data controller is configured to initiate a write operation for an inner node, to provide an auxiliary data structure for the inner node under write operation, and to provide a common node prefix for the auxiliary data structure, wherein the common node prefix reflects the changes of the write operation.
  • the auxiliary data structure contains at least the three fields of the common node prefix. It can include more fields if desired.
  • the auxiliary data structure enables easy implementation of the common node prefix into an existing prefix tree structure. The writer creates a new auxiliary data structure and sets the proper depth and common prefix fields, thereby reflecting the changes of the write operation. The reference or structure within the corresponding node under write operation can then be atomically updated.
  • Traversing readers obtain the correct information at all times. Before the write operation, a reader reads the unchanged node. During the write operation, a reader reads the auxiliary data structure, which already includes the correct information. After the write, the reader reads the updated node.
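The writer protocol above — fully populate an auxiliary structure, then publish it with a single reference update — can be sketched as follows. All names are illustrative; plain Python assignment stands in for the atomic pointer store a real implementation would use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommonNodePrefix:
    common_prefix: bytes
    node_depth: int

class InnerNode:
    def __init__(self, cnp: CommonNodePrefix):
        # Readers only ever dereference self.cnp; the structure itself is
        # immutable and replaced wholesale, never mutated in place.
        self.cnp = cnp

def writer_update(node: InnerNode, new_prefix: bytes, new_depth: int) -> None:
    # Build the auxiliary structure completely first...
    aux = CommonNodePrefix(common_prefix=new_prefix, node_depth=new_depth)
    # ...then publish it in one step. In C/C++ this would be an atomic
    # pointer store; a concurrent reader sees either the old or the new
    # structure, never a half-written one.
    node.cnp = aux

node = InnerNode(CommonNodePrefix(b"aron", 1))
writer_update(node, b"ron", 2)
```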
  • the data controller is configured to initiate a read operation concurrent to the write operation and including the inner node under write operation such that the read operation reads the inner node under write operation and the auxiliary data structure.
  • the auxiliary data structure provides a concurrent reader already with the correct information reflecting the changes of the write operation. By the use of the node depth the reader reads values reflecting the status after the write operation.
  • the data controller is configured to replace the auxiliary data structure wherein the common node prefix reflects the changes of the write operation of the inner node under write operation, after the write operation is completed.
  • the auxiliary data structure of the inner node is replaced with the new auxiliary data structure with the common node prefix updated according to the conducted changes of the inner node within the tree.
  • the data controller is configured to set the node depth and/or the prefix length of the new auxiliary data structure such that the sum of the node depth and the prefix length of the new auxiliary data structure equals the sum of the node depth and the prefix length of the existing auxiliary data structure of the inner node under write operation.
  • the common node prefix is provided within the auxiliary data structure within the inner node. This implementation keeps all data internal to the node.
  • the common node prefix is provided within the auxiliary data structure within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency. Moreover, such an indirect approach overcomes hardware limitations on atomic operations for longer common prefixes, which is beneficial for general-purpose indexing data structures while internal nodes remain quite compact.
  • the common node prefix is provided within a separate structure, wherein the auxiliary data structure is provided within the inner node and wherein at least one pointer is provided at the auxiliary data structure pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency.
  • references to children and/or a children node counter of an inner node are provided within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure.
  • the prefix tree is a radix tree or an adaptive radix tree.
  • the radix tree and the adaptive radix tree (ART) fit very well with the proposed field for node depth at inner nodes.
  • the prefix tree comprises a horizontal compression, a vertical compression and/or a key sequence skip. Such approaches reduce the number of nodes so that memory usage is decreased. Horizontal compression may introduce several inner nodes of different capacity and variable size thereby reducing memory cost accordingly to the real number of node children.
  • a second aspect of the present invention provides a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, comprising providing the common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth.
  • the method comprising initiating a write operation for an inner node; and providing an auxiliary data structure for the inner node under write operation comprising a common node prefix of the inner node; wherein the common node prefix reflects the changes of the write operation.
  • a third aspect of the present invention provides a computer program with a program code for performing the method as described above when the computer program runs on a computer or the data storage system as described above.
  • the same advantages and modifications as above apply.
  • Fig. 1 shows a diagram of the system architecture of a data storage system with an index structure.
  • Fig. 2 shows an example of a radix tree with vertical and horizontal compression.
  • Fig. 3 shows the structure of the common node prefix of inner nodes.
  • Fig. 4 shows a concurrently operating adaptive radix tree with wait-free readers.
  • Fig. 5 shows a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes for wait-free reader operation in a radix tree.
  • Fig. 6 shows a flowchart of wait-free reader operation in a radix tree.
  • Fig. 1 shows a data storage system 100 including a data controller 101.
  • the data controller 101 includes a concurrency control mechanism 102 for a prefix or radix tree with vertical compression and key sequence skip used as an index data structure 103.
  • the data storage system 100 further includes a data storage 104.
  • the data controller 101 may be implemented in a main memory using DRAM (Dynamic Random Access Memory) or the like, and the data storage 104 typically includes mass storage like hard disks, Solid-State Disks (SSD) or the like.
  • the data storage system 100 includes the data controller 101 and the data storage 104.
  • With the data controller 101, to locate data promptly there is no need to traverse the complete stored data set in the data storage 104. Instead, only the index records within the index data structure 103 corresponding to the request need to be searched.
  • Data users 111 can act as writers 112 adding and modifying data, or readers 113 looking up and retrieving data stored in the data storage 104. In the presence of concurrency, when several actors 111 perform data lookup and modification simultaneously, the overall operation efficiency depends significantly on the method of concurrent synchronization or control 102.
  • the data storage system 100 can be seen as a variant of advanced synchronization between writers 112 and readers 113 applicable as a concurrency control mechanism 102 for index or radix trees with vertical compression and key sequence skip used as an index data structure 103 in data storage systems 100 and information databases.
  • Such radix tree based index data structure 103 can also be equipped with horizontal compression and advanced synchronization between writers, too.
  • Fig. 2 shows an example of a radix tree 200.
  • the radix tree 200 includes a single root node 201, internal nodes 203 interconnected to form a tree, and last level nodes, so-called leafs 205.
  • payload data is kept within tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup.
  • the radix tree 200 is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node, this information is preserved within node interconnections. Thus, during radix tree traversal there is no need to look through child nodes and compare search attribute values; it suffices to choose the child that corresponds to the attribute value index, if any exists.
  • Internal index tree nodes 203 can be of variable size or adaptive capacity. There are four types of internal tree nodes differing in capacity only: 4, 16, 48 and 256 children, respectively.
  • the internal node 203 contains a children compartment according to its capacity of 16 children in this example.
  • the root node 201 has a capacity of 256 children.
  • the inner node 203 is a child node of the root node 201 and linked to the root node 201 by the symbol “c”.
  • the common prefix in the inner node 203 is “omp”. Common prefix means that all leaf nodes 205, which are children of the inner node 203, contain the prefix “omp”.
  • the tree 200 can be described in a bottom-up fashion starting with alphabet definition, key strings encoding and index structure tree nodes interconnections.
  • the alphabet for such a tree-based index data structure is a single-byte character set. This means each symbol is represented within 8 bits, and the whole alphabet cardinality, i.e. the number of elements, is 256 symbols. All symbols are used to form input strings: 255 non-zero byte symbols serve to encode information and a zero-byte symbol indicates the end of an input string. This encoding corresponds to C-style or null-terminated strings.
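The null-terminated key encoding described above can be sketched briefly. The function name is ours; the point is simply that every key ends with a reserved zero byte, as in C strings.

```python
def encode_key(s: str) -> bytes:
    # Keys use a single-byte alphabet: 255 non-zero symbols carry
    # information, and a zero byte terminates the string (C-style).
    data = s.encode("latin-1")
    assert b"\x00" not in data, "the zero byte is reserved as terminator"
    return data + b"\x00"

key = encode_key("comp")
```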
  • the interconnections between tree nodes of the index structure can be implemented with the so-called augmented pointer technique, as well as with ordinary node pointers.
  • a current pointer size is 8 bytes or 64 bits, and modern MMUs usually operate within a 48-bit address space, leaving the rest unused. Thus, it is possible to reuse up to 16 bits of ordinary 64-bit pointers to preserve important information there.
  • Augmented pointers, which are used as node interconnections, contain an augmented 1-byte alphabet symbol denoting the symbol to which the particular interconnection points.
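The augmented pointer idea — packing a 1-byte symbol into the unused upper bits of a 64-bit pointer — can be demonstrated with plain integer arithmetic. This is a sketch under the stated 48-bit address-space assumption; function names and the shift position are illustrative choices.

```python
ADDR_MASK = (1 << 48) - 1  # modern MMUs typically use 48-bit addresses

def augment(ptr: int, symbol: int) -> int:
    # Store a 1-byte alphabet symbol in otherwise unused upper bits
    # of a 64-bit pointer value.
    assert 0 <= symbol < 256 and ptr == (ptr & ADDR_MASK)
    return (symbol << 48) | ptr

def pointer_of(aug: int) -> int:
    # Recover the original 48-bit address.
    return aug & ADDR_MASK

def symbol_of(aug: int) -> int:
    # Recover the embedded alphabet symbol.
    return (aug >> 48) & 0xFF

aug = augment(0x7F00_DEAD_BEEF, ord("c"))
```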
  • Fig. 3 shows the common node prefix 300 in detail.
  • the common node prefix 300 is a data compartment within internal node 203.
  • the common node prefix 300 has a layout to keep a number of key symbols which are shared by all node children. Those symbols are called common prefix 301 and arranged in a byte array compartment for prefix symbols.
  • In the prefix length field 302, the length of the common prefix 301 is provided.
  • the invention extends the common node prefix 300 with an additional field node depth 303 to indicate the absolute offset from the beginning of the key.
  • Major internal node fields like the children compartment and the children counter also can be a separate memory layout with indirect access through a pointer from the internal node structure.
  • Such an approach implies additional memory management overheads, but decreases the number of parent node updates when child nodes expand and shrink, thus simplifying node capacity management.
  • Radix tree leafs 205, or terminal nodes, consist of the primary fields key and payload, as well as the optional fields lock and uplink. During operation, writers perform all updates to ordinary and augmented pointers atomically.
  • the common node prefix 300 is also supposed to be updated atomically. An indirect memory layout does not pose any difficulty, and pointer updates are handled by atomic operations of general commodity hardware. Internal nodes undergo expansion or shrinking when children counters indicate a high or low level of occupancy. Splitting or merging a node usually implies another node’s insertion or extraction and common node prefix changes.
  • the node depth 303 and the common prefix 301 are kept together in a single separate data structure. Thus, both of them are updated consistently by referencing that particular structure instance, i.e. the common node prefix 300. A reader then never needs to count its depth during tree traversal. Instead, the reader reads the node depth 303 from the field of the common node prefix 300. It immediately follows that a reader always compares symbols at correct positions within the search key. Further, no read locks are required in the presence of concurrent write operations.
  • the common node prefix 300 can be created by the following routine:
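The routine itself is not reproduced in this text; a plausible Python sketch, with purely illustrative names, might populate all three fields before the structure is published to readers:

```python
def make_common_node_prefix(key: bytes, node_depth: int, prefix_length: int) -> dict:
    # Cut the shared symbols out of a representative key, starting at the
    # node's absolute offset; the structure is fully populated here, before
    # any reader can observe it.
    common_prefix = key[node_depth:node_depth + prefix_length]
    return {
        "common_prefix": common_prefix,
        "prefix_length": prefix_length,
        "node_depth": node_depth,
    }

# Example from Fig. 4: a node at depth 1 sharing the four symbols "aron".
cnp = make_common_node_prefix(b"aaronitic", 1, 4)
```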
  • Such a common node prefix 300 may pose a 4-byte constant overhead per node, where the 4-byte overhead actually may depend on the maximum possible length of the key.
  • Fig. 4 shows an example of a radix tree 400 with a plurality of nodes. In each node the node depth and the common prefix are shown. The prefix length can be derived from the common prefix. These fields are stored at each inner node in the common node prefix. A first reader 401 parses through the tree 400.
  • the search key or key“Aaronitic” is looked up by the first reader 401.
  • the search starts at a root node 402 and commences to inner node 403 as this is the child of the root node 402 that includes the next symbols of the key.
  • the inner node 403 has the common prefix “aron” and a node depth of 1.
  • the first reader 401 moves further down the tree 400 to inner node 404, which comprises a node depth of 5 and a prefix length of 1.
  • the next step corresponds to the status shown in Figure 4.
  • the first reader 401 is at inner node 405, which has the common prefix “t” and a node depth of 6.
  • the offset of six symbols, the prefix length of one and the common prefix “t” are shown.
  • the leaf node 406 matches the complete key “Aaronitic”. Hence, the search was successful.
  • the first reader 401 has already passed inner node 403, which is now under a write operation by a writer 407. Because of the new depth field in the common node prefix data structure, readers are allowed to eliminate waiting by only obligating writers to update the common node prefix consistently during an internal node split or merge, here at inner node 403. Consistently here literally means the writer 407 updates the common node prefix symbols together with the depth field at once, i.e. atomically. There are no other obligations on the order of writer operations or on synchronization between writers.
  • a second reader 408 is accessing the internal node 403 with a possibly ongoing modification, when the common node prefix has already been updated but the new internal node is not inserted yet. Then, readers use the depth value from the new field of the common node prefix instead of the nominal depth counted while traversing through tree levels. They are thus able to compare the common node prefix to the correct symbols of the search key regardless of the actual node level within the tree. Due to concurrency, some nodes possibly underwent a split or merge during the readers’ operation. However, readers proceed forward with tree traversal, comparing the following symbols of the requested key. It is then possible to get a false positive match, accidentally skipping several symbols from the comparison. If such uncertainty is detected, the situation is resolved nicely at the final step by comparing the stored key of the located index leaf with the requested one.
  • the writer 407 creates an auxiliary data structure 409 at the inner node 403 containing the common node prefix with the node depth and the common prefix of the new modified node.
  • the common node prefix reflects the changes of the write operation.
  • the writer 407 creates a new auxiliary structure 409, sets the proper depth and common prefix fields, and atomically updates the reference, e.g. a pointer, or the structure within the corresponding node 403. Writers perform all updates atomically to maintain data structure consistency.
  • the common node prefix is also updated atomically.
  • the node depth of the auxiliary data structure 409 equals 2 and the common prefix of the auxiliary data structure 409 is “ron”, having a length of 3.
  • This new common node prefix of the auxiliary data structure 409 corresponds to a node split in which a new node is inserted between the root node 402 and the inner node 403 under write operation.
  • Such a new node has a node depth of 1 and a common prefix “a”.
  • the sum, i.e. 5, of the node depth, i.e. 2, and the prefix length, i.e. 3, of the new auxiliary data structure 409 equals the sum, i.e. 5, of the node depth, i.e. 1, and the prefix length, i.e. 4, of the common node prefix of the auxiliary data structure of the inner node 403 under write operation.
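The invariant stated above — a split must keep the sum of node depth and prefix length constant, so readers keep comparing symbols at the right key offset — can be checked mechanically. The helper name is ours, and the figures come from the example of node 403.

```python
def split_preserves_offset(old_depth: int, old_prefix: bytes,
                           new_depth: int, new_prefix: bytes) -> bool:
    # After a node split or merge, depth + prefix length must still
    # point at the same absolute position within the key; otherwise a
    # concurrent reader would compare symbols at the wrong offset.
    return old_depth + len(old_prefix) == new_depth + len(new_prefix)

# Node 403 goes from depth 1, prefix "aron" to depth 2, prefix "ron"
# when a depth-1 node with prefix "a" is split off above it.
ok = split_preserves_offset(1, b"aron", 2, b"ron")
```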
  • the auxiliary data structure 409 and/or its common node prefix can be stored directly in the inner node or in a separate structure to which one or more pointers refer.
  • the pointer can be stored at the inner node or at the auxiliary data structure 409.
  • the second reader 408 when reaching the inner node 403 during tree traversal, reads the common node prefix of the auxiliary data structure 409.
  • the second reader 408 relies on the node depth field rather than on the counted depth, i.e. it calculates the position from the node depth field, i.e. 2, and the prefix length, i.e. 3. Therefore, the second reader 408 reaches the correct inner node 404 despite the write operation on the inner node 403.
  • readers can operate wait-free, proceeding forward from the root node to the leaf, because they use the depth value from the common node prefix, if such a depth value already exists.
  • the reader may skip some symbols from the comparison due to concurrency and may resolve a possible false match by comparing the search key with the key from the located leaf.
  • Fig. 5 depicts a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, enabling wait-free reader operation in a radix tree.
  • a data storage system is provided that is configured to implement a prefix tree with a plurality of nodes.
  • the common node prefix is provided per inner node.
  • the common node prefix includes a common prefix, a prefix length and a node depth.
  • Fig. 6 shows an operation flowchart of the wait-free reader operation.
  • the flowchart starts at a level start 600 that is repeated for each node level.
  • In step 601 it is decided whether the node is a leaf. If yes, the procedure branches to step 602. There, it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 603. Then, the procedure branches to step 604. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 602.
  • In step 604 it is decided whether a false positive or an uncertainty exists. If yes, the search key and the leaf key are matched in step 605. For a positive outcome, the method branches to step 606 and decides true, i.e. the key has been correctly found in the leaf; for a negative outcome, it branches to step 607, false. The method also takes step 606 when no false positive or uncertainty exists at step 604.
  • For a no at step 601, i.e. the node is not a leaf, the operation branches to step 608 where it is decided whether a common node prefix exists. If not, the method branches to step 609. There, it is decided whether a next level child exists. If not, it is branched to step 607, false, because a node was found that does not store the key. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
  • When a common node prefix exists, i.e. a positive decision at step 608, the operation branches to step 611. There, it is decided whether the prefix and the key at the position of the node match. In other words, symbols of the common node prefix are compared with the search key at positions defined by the common prefix depth. If no, it is branched to the false step 607 and the operation is terminated. If yes, it is branched to step 612.
  • In step 612 it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 613. Then, the procedure branches to step 609. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 612.
  • In step 609 it is decided whether a next level child exists. If not, it is branched to step 607, false, because a node was found that does not store the key. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
  • Readers operate wait-free, traversing the tree down to the leaf with no locks despite concurrent modifications possibly introduced by writers.
  • a reader compares symbols of the common prefix with the search key at corresponding positions. The corresponding positions are determined by extracting the depth field from the common node prefix, if it exists. The reader proceeds forward to the next level if there is a match, and otherwise returns false.
  • the reader may detect a key sequence skip or a case of uncertainty, when a concurrent writer modification excluded one or several symbols from the comparison due to a node split or merge. Such cases are usually resolved at the final step by comparing the search key with the key extracted from the terminal leaf node. If there are no detected obstacles, the reader returns true when a terminal leaf node was found, or false otherwise.
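The reader operation of Figs. 4 to 6 can be condensed into a minimal illustrative sketch (the class and function names below are hypothetical, not the claimed implementation): the position for each prefix comparison is taken from the node depth field of the common node prefix rather than counted during traversal, and a final comparison with the full leaf key resolves false positives from key sequence skip as well as uncertainty from concurrent modifications.

```python
class Leaf:
    def __init__(self, key, payload):
        self.key, self.payload = key, payload

class Inner:
    def __init__(self, depth, prefix, children):
        self.depth = depth        # node depth: absolute offset within the key
        self.prefix = prefix      # common prefix shared by all children
        self.children = children  # next key symbol -> child node

def lookup(root, key):
    node = root
    while not isinstance(node, Leaf):
        d = node.depth  # taken from the common node prefix, not counted on the way down
        if key[d:d + len(node.prefix)] != node.prefix:
            return None  # prefix mismatch: the key is not stored in the tree
        pos = d + len(node.prefix)
        child = node.children.get(key[pos:pos + 1])
        if child is None:
            return None
        node = child
    # the final comparison with the full leaf key resolves false positives
    # caused by key sequence skip or by concurrent node splits and merges
    return node.payload if node.key == key else None

# a tiny tree in the spirit of Fig. 2: "c" leads to a node with prefix "omp"
root = Inner(0, "", {"c": Inner(1, "omp", {"a": Leaf("company", 1),
                                           "u": Leaf("compute", 2)})})
assert lookup(root, "company") == 1
assert lookup(root, "comrade") is None   # prefix mismatch at "omp"
assert lookup(root, "computer") is None  # resolved by the final leaf key check
```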


Abstract

The present invention provides a data storage system (100) with a data storage (104) and a data controller (101) configured to implement a prefix tree (200) with a plurality of nodes (201, 203, 205), wherein the data controller (101) is configured to provide a common node prefix per inner node (203), the common node prefix (300) including a common prefix (301), a prefix length (302) and a node depth (303); wherein the node depth (303) is the absolute offset from the beginning of a key to the beginning of the common prefix (301).

Description

DATA STORAGE SYSTEM AND METHOD OF PROVIDING A DATA STORAGE
SYSTEM
TECHNICAL FIELD
The present invention relates to a data storage system, a method of providing a data storage system and a computer program with a program code. In particular, the present invention relates to data structures used for data lookups, and more particularly, to a prefix tree data structure for locating data stored in a database with a novel method of synchronization between writers and readers providing linear scalability on read.
BACKGROUND
Different tree-based index data structures have already been known for decades. Despite distinct designs, many of them have certain features in common, such as a single root node, internal nodes interconnected to form a tree, and last level nodes, so-called leafs.
Usually, payload data is kept within the tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup. Traversing the tree follows the entering of a key or search key. The search traverses the tree according to matches between the attribute values stored in nodes and the key. The search starts at the root node, i.e. a parent node, and branches through child nodes, each of which depends on one parent node. A radix tree is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node, this information is preserved within the node interconnections. Thus, during radix tree traversal there is no need to look through child nodes and compare search attribute values, but just to choose the child that corresponds to the attribute value index, if it exists. Despite their evident simplicity and neat traversing algorithm, such radix trees were not widely used for general purpose database or storage system index structures due to prohibitively high memory overheads. Indeed, each radix tree node is supposed to keep the whole range of possible attribute values even when very few of such child nodes actually exist. This implies an exponential growth in memory consumption. Recently, radix trees have been prepared for general purpose use. The most important modification is the provision of variable size nodes. It means internal nodes differ in capacity. There are several pre-set kinds of capacities, and depending on the demand the particular one is chosen during tree modification.
Such an approach is usually called horizontal compression. Another valuable improvement is to skip all internal nodes with a single child only. It is then necessary to keep common attributes inside the intermediate node and to store search attributes within the corresponding leaf. Such an approach is usually called vertical compression or key sequence skip. Both improvements immediately lead to more accurate, moderate memory consumption. In particular, an internal node with a limited capacity of only 16 children has a common prefix mitigating at least three intermediate nodes with only one child each. Also, corresponding leaf nodes contain the rest of the search attributes.
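Vertical compression can be illustrated with a small sketch (illustrative code, not the claimed implementation): the leading symbols shared by all keys below a node become that node's common prefix, replacing a chain of single-child inner nodes.

```python
import os

def node_common_prefix(keys):
    # the leading symbols shared by all keys below a node become the node's
    # common prefix, replacing a chain of single-child inner nodes
    return os.path.commonprefix(list(keys))

# without vertical compression: one inner node each for "r", "o" and "m";
# with vertical compression: a single inner node holding the prefix "rom"
assert node_common_prefix(["romane", "romanus", "romulus"]) == "rom"
```

The remaining, unique tail of each key (e.g. "ane", "anus", "ulus") can then stay in the leaf node, which corresponds to the key sequence skip described above.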
Another known issue of radix trees is the method of concurrent access, or synchronization. Traditionally, synchronization techniques utilize locks to keep the reader waiting until the modification is complete, or detect the presence of changes and then restart the reader. Such solutions are simple, but hardly scalable. More advanced techniques allow accessing nodes while hiding pending modifications and operate wait-free. However, they require sophisticated data structures, generally consume more memory and are complex to implement. An alternative option to consider is hardware transactional memory. However, certain obstacles regarding proper hardware support are still present.
A naive radix tree design implies a huge amount of memory, so such trees have been quite limited in use. A new approach optimized for memory usage is called Adaptive Radix Tree (ART). ART accumulates known tree compression techniques. Inner nodes can have several symbols onboard, called a common node prefix, instead of just one symbol; such an approach is known as vertical compression and is used to decrease the number of tree levels. A well-proven addition to such vertical compression is the so-called key sequence skip, which is applied to reduce the number of inner nodes when the rest of the key stored in the leaf node is unique. There is an observation that a single modification affects at most two nodes of a radix tree: the modified node itself and its parent node. Thus, there are several known concurrent access synchronization mechanisms exploiting this observation that can be effectively applied to radix trees in general.
Lock Coupling is a standard method for synchronizing B-trees and can easily be applied to radix trees as well. The idea is to hold at most two locks at every single moment during tree traversal. Starting from the root node, on every step one level down a lock is acquired on the corresponding child and the parent lock is released. Lock Coupling algorithms suffer from high contention on the first level of the tree: every actor starts from the root node by acquiring its lock. Thus, there is definitely high contention there, and a good probability of contention on the next level, too.
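Lock Coupling can be sketched as follows (hypothetical `Node` class and `descend` helper; a simplification, not a full radix tree):

```python
import threading

class Node:
    def __init__(self, children=None):
        self.lock = threading.Lock()
        self.children = children or {}

def descend(root, path):
    # Lock Coupling: at most two locks are held at any single moment;
    # the child is locked first, then the parent lock is released
    node = root
    node.lock.acquire()
    for symbol in path:
        child = node.children[symbol]
        child.lock.acquire()
        node.lock.release()
        node = child
    return node  # returned with its own lock still held

leaf = Node()
root = Node({"a": Node({"b": leaf})})
found = descend(root, "ab")
assert found is leaf and found.lock.locked()
found.lock.release()
```

Every traversal begins with `root.lock.acquire()`, which illustrates why contention concentrates on the first tree level.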
A more advanced technique called Optimistic Lock Coupling is an enhanced version of the previous method with better performance. Instead of acquiring locks on each step to prevent modifications of nodes, actors assume that no concurrent modifications happen during tree traversal. Modifications are detected afterwards using version counters and the operation is restarted, if needed. It shows higher performance because locks are acquired only on demand and the overall contention in tree nodes is not too high. However, it mostly suffers from the coarse-grained nature of the locks. Several tree nodes are locked while their parent is locked, although no modification is meant for them. For instance, one actor acquires a tree node lock, which implies that other actors cannot lock that node's children due to lock coupling before the lock is released.
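The optimistic variant can be sketched as follows (hypothetical `VNode` class; version handling simplified): a reader records the version counter, reads without locking, and validates afterwards, restarting only if a concurrent writer interfered.

```python
class VNode:
    def __init__(self, value):
        self.value = value
        self.version = 0  # even: node is stable; odd: a write is in progress

def read_optimistically(node, read_fn):
    # read without taking a lock, then validate with the version counter;
    # a changed counter means a concurrent writer interfered -> restart
    while True:
        v = node.version
        if v % 2 == 1:
            continue              # writer currently active, retry
        result = read_fn(node)
        if node.version == v:     # version unchanged: the read was consistent
            return result

n = VNode(42)
assert read_optimistically(n, lambda x: x.value) == 42
```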
The Read-Optimized Write Exclusion (ROWEX) approach shows even better performance. Inner nodes are extended with a level field which is set only once when the node is created and indicates the total length of the key sequence at the node level. The level field is never changed in later steps. Assuming that modifications of the common prefix are conducted atomically, as well as updates of node references, readers can operate wait-free with no locks or retries. However, writers are supposed to perform extra actions leaving the radix tree in a complete and correct state for readers. For writers, ROWEX uses a similar locking approach as Optimistic Lock Coupling. Each writer keeps two exclusive locks: one on the node to be modified and one on the parent. The major difference lies in the extra actions performed by writers to let readers work without locks. These extra actions are mostly atomic updates of the tree node fields important for readers to read the tree correctly.
Copy on Write (COW) methods are usually applied in lock-free algorithms. The main idea is to create a hidden copy of the node being modified, introduce all changes to this copy and then make it visible. A conventional lock, a compare-and-swap operation and version counters can be used to implement COW. COW approaches are beneficial for read-oriented workloads, e.g. 95 percent read operations and 5 percent write operations in the workload. A writer detects changes during preparation of a tree node copy and restarts the operation, redoing the copy again. These extra memory copies and allocations also lead to poor write performance. On the other hand, COW can be used with locks to avoid restarts. In this case, the scalability of write operations is quite low. In addition, for non-volatile memory, where write speed is slow but read speed is close to DRAM, extra copies would require more time.
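The COW idea can be sketched as follows (hypothetical `Ref` and `CNode` types; a real system would publish the copy with an atomic compare-and-swap): changes are made on a hidden copy, which then replaces the old node through a single pointer update, so readers see either the old or the new node, never a half-modified one.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CNode:
    prefix: str
    depth: int

class Ref:
    # a mutable cell standing in for an atomically updated node pointer
    def __init__(self, node):
        self.node = node

def cow_update(ref, **changes):
    # Copy on Write: modify a hidden copy, then publish it with a single
    # pointer swap (compare-and-swap in a real implementation)
    new = replace(ref.node, **changes)
    ref.node = new
    return new

r = Ref(CNode(prefix="omp", depth=1))
cow_update(r, prefix="ron", depth=2)
assert (r.node.prefix, r.node.depth) == ("ron", 2)
```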
Hardware Transactional Memory (HTM) is a hardware mechanism to detect conflicts and undo any changes on shared data. The goal of such hardware is to transparently support regions of code marked as transactions by enforcing atomicity, consistency and isolation. In this scenario, all actors always observe a consistent state of tree nodes since modifications are performed only within transactions and thus are visible for all actors only after a successful commit. When the commit operation fails, a restart of the modification is required because the tree node has been simultaneously changed. HTM is very specific hardware and not well supported yet; there are no wide opportunities to use it in production today.
Software Transactional Memory (STM) is an emulation of HTM on machines without HTM support. However, an optimal implementation of STM requires double compare-and-swap instruction support (DCAS); otherwise the implementation of such an approach is sophisticated and quite slow. Unfortunately, the DCAS operation is supported only by the newest processors and is not widely available. Moreover, in practice, STM systems also suffer a performance hit compared to fine-grained lock-based systems, due primarily to the overheads associated with maintaining the log and the time spent committing transactions. In order to traverse the tree correctly, synchronization between readers and writers of nodes and interconnections is essential. Typically, readers access nodes when no ongoing modifications are visible to them. Thus, nodes represent a complete and correct tree state.
Traditionally, synchronization techniques for prefix or radix trees use locks to keep a reader waiting until the modification is done, or detect the presence of changes and restart. Such solutions are simple, but scale inefficiently. Advanced techniques allow reading nodes while hiding pending modifications and operate wait-free. However, they require sophisticated data structures and generally are complex to implement.
SUMMARY
In view of the above-mentioned problems and disadvantages, the present invention aims to improve the performance of prefix trees. The present invention has thereby the object to improve the synchronization between writers and readers in prefix trees.
The object of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims.
In particular, the present invention relates to the way computers lookup data stored in information databases or storage systems and proposes a novel method of concurrent access synchronization to locate data faster.
Fast data lookup is essential for all computer programs, including information databases and storage systems. Programs need to locate data for subsequent retrieval or computation. Storage systems and information databases usually have large volume storage media to store huge amounts of data. Storage media are slow memory devices compared to the CPU main memory, therefore an exhaustive search for looking up data is not feasible.
Modern systems widely use so-called indexing data structures, which are preferably placed within the fast main memory and represent a set of associations between data search attributes and stored data locations on the storage media. Thus, to locate data promptly there is no need to traverse the whole stored data set from the storage media; instead, the index records corresponding to the request are found within the index data structure.
More particularly, the invention directly relates to indexing data structures based on a prefix tree, for example a radix tree, and discloses a unique method of concurrency control for synchronization between writers and readers providing linear scalability on read operations.
Data users can act as writers, adding and modifying data, or readers, looking up and retrieving stored data. In the presence of concurrency, when several actors perform data lookup and modification simultaneously, the overall operation efficiency depends significantly on the method of concurrent synchronization. Usually, readers access records of the index structure when no ongoing modifications are visible to them. Thus, the index structure represents the complete and correct state of the stored data set. Known methods to achieve this often assume some sort of waiting or extra memory overheads. So, the synchronization has a significant impact on the system's overall performance and is considered a bottleneck for tree-based data structures. In general, during concurrent operation over some data structure, actors can perform more or less independently with respect to the synchronization model. In the worst case, they serialize their actions, thereby proportionally increasing total execution time; in the best case, actors complete their operations in parallel. Linear scalability represents the upper theoretical bound for the aforementioned best case, and with regard to the lookup operation it immediately implies that readers never wait despite ongoing modifications conducted concurrently by writers.
The present invention is directed towards data storage systems and information databases with an index data structure based on a prefix tree like a radix tree and a method of synchronization between writers and readers providing linear scalability on read operations.
A first aspect of the present invention provides a data storage system with a data storage and a data controller configured to implement a prefix tree with a plurality of nodes, wherein the data controller is configured to provide a common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth; wherein the node depth is the absolute offset from the beginning of a key to the beginning of the common prefix.
According to the present invention, a prefix tree is provided with a common node prefix at at least one inner node. Besides the common prefix and the prefix length, a node depth is also included. The node depth is the absolute offset from the beginning of a key or search key to the beginning of the common prefix, or in other words the effective offset within the key. Now, the reader can rely on the node depth as specified at the node and does not have to count the depth during tree traversal. The prefix length can be provided by a dedicated data field or it can be derived from the common prefix.
The invention provides a memory efficient index data structure based on an index tree and an enhanced concurrency control mechanism between writers and readers supporting linear scalability on data index lookup operation by readers. The readers can operate wait-free.
The invention leads to an index data structure and concurrency control that can provide several kinds of advantages to information databases, storage systems and other applications heavily relying on efficient data lookup and retrieval. These advantages include near to linear scalability of index data structure lookup, a high performance of lookup and data retrieval, reduced memory overhead with vertical compression applied, and the support of long node common prefixes.
In an implementation form of the first aspect, the data controller is configured to initiate a write operation in the prefix tree, thereby setting a node depth for an inner node under write operation, and is configured to initiate a concurrent read operation including the inner node under write operation, thereby using the set node depth, the common prefix and the prefix length for tree traversal. The use of the set node depth, the common prefix and the prefix length for tree traversal by the reader allows a wait-free read operation concurrent to the write operation. The set node depth, which can be set directly or atomically, lets the reader traverse to the correct node even in case of a node split or a node merge. The node depth makes the reader independent of the ongoing count of nodes during tree traversal. In a further implementation form of the first aspect, the data controller is configured to initiate a write operation for an inner node, to provide an auxiliary data structure for the inner node under write operation, and to provide a common node prefix for the auxiliary data structure, wherein the common node prefix reflects the changes of the write operation. The auxiliary data structure contains at least the three fields of the common node prefix. It can include more fields if desired. The auxiliary data structure enables easy implementation of the common node prefix into an existing prefix tree structure. The writer creates a new auxiliary data structure and sets the proper depth and common prefix fields, thereby reflecting the changes of the write operation. The reference or structure within the corresponding node under write operation can then be atomically updated. Traversing readers obtain the correct information at all times. Before the write operation, a reader reads the unchanged node. During the write operation, a reader reads the auxiliary data structure, which already includes the correct information.
After the write, the reader reads the updated node.
In a further implementation form of the first aspect, the data controller is configured to initiate a read operation concurrent to the write operation and including the inner node under write operation such that the read operation reads the inner node under write operation and the auxiliary data structure. As pointed out above, the auxiliary data structure provides a concurrent reader already with the correct information reflecting the changes of the write operation. By the use of the node depth the reader reads values reflecting the status after the write operation.
In a further implementation form of the first aspect, the data controller is configured to replace the auxiliary data structure wherein the common node prefix reflects the changes of the write operation of the inner node under write operation, after the write operation is completed. The auxiliary data structure of the inner node is replaced with the new auxiliary data structure with the updated common node prefix accordingly to the conducted changes of the inner node within the tree.
In a further implementation form of the first aspect, the data controller is configured to set the node depth and/or the prefix length of the new auxiliary data structure such that the sum of the node depth and the prefix length of the new auxiliary data structure equals the sum of the node depth and the prefix length of the previous auxiliary data structure of the inner node under write operation. Such a rule allows easy calculation and keeps consistency for the reader, as the sum is the same before and during the write operation.
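This rule can be illustrated with the numbers of the node split example of Fig. 4 (node depth 1, common prefix "aron", split after "a"; the `split_prefix` helper is hypothetical, for illustration only):

```python
def split_prefix(depth, prefix, split_len):
    # a node split moves the first split_len prefix symbols to a newly
    # inserted parent node; the rest stays with the node under write operation
    parent = (depth, prefix[:split_len])
    child = (depth + split_len, prefix[split_len:])
    # the consistency rule: node depth + prefix length stays the same,
    # so a concurrent reader computes the same key positions
    assert child[0] + len(child[1]) == depth + len(prefix)
    return parent, child

# the example of Fig. 4: depth 1, common prefix "aron", split after "a"
parent, child = split_prefix(1, "aron", 1)
assert parent == (1, "a")   # new node: depth 1, prefix "a"
assert child == (2, "ron")  # modified node: depth 2, prefix "ron"; 2 + 3 == 1 + 4
```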
In a further implementation form of the first aspect, the common node prefix is provided within the auxiliary data structure within the inner node. This implementation keeps all data internal to the node.
In a further implementation form of the first aspect, the common node prefix is provided within the auxiliary data structure within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency. Moreover, such an indirect approach surpasses hardware limitations on atomic operations for longer common prefixes, which is beneficial for general purpose indexing data structures while the internal nodes remain quite compact.
In a further implementation form of the first aspect, the common node prefix is provided within a separate structure, wherein the auxiliary data structure is provided within the inner node and wherein at least one pointer is provided at the auxiliary data structure pointing to the separate structure. Accessing the common prefix through the pointer is more convenient in terms of modification consistency.
In a further implementation form of the first aspect, the references to children and/or a children node counter of an inner node are provided within a separate structure, and wherein at least one pointer is provided at the inner node pointing to the separate structure. Such an approach may imply additional memory management overheads, but decreases the number of parent node updates during child node expands and shrinks, thereby simplifying node capacity management.
In a further implementation form of the first aspect, the prefix tree is a radix tree or an adaptive radix tree. The radix tree and the adaptive radix tree, ART, fit very well with the proposed node depth field at inner nodes. In a further implementation form of the first aspect, the prefix tree comprises a horizontal compression, a vertical compression and/or a key sequence skip. Such approaches reduce the number of nodes so that memory usage is decreased. Horizontal compression may introduce several inner nodes of different capacity and variable size, thereby reducing memory cost according to the real number of node children.
A second aspect of the present invention provides a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, comprising providing a common node prefix per inner node, the common node prefix including a common prefix, a prefix length and a node depth. The same advantages and modifications as above apply.
In an implementation form of the second aspect, the method comprising initiating a write operation for an inner node; and providing an auxiliary data structure for the inner node under write operation comprising a common node prefix of the inner node; wherein the common node prefix reflects the changes of the write operation.
A third aspect of the present invention provides a computer program with a program code for performing the method as described above when the computer program runs on a computer or the data storage system as described above. The same advantages and modifications as above apply.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF THE DRAWINGS
The above described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
Fig. 1 shows a diagram of the system architecture of a data storage system with an index structure.
Fig. 2 shows an example of a radix tree with vertical and horizontal compression.
Fig. 3 shows the structure of the common node prefix of inner nodes.
Fig. 4 shows a concurrently operating adaptive radix tree with wait-free readers.
Fig. 5 shows a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes, enabling wait-free reader operation in a radix tree.
Fig. 6 shows a flowchart of wait-free reader operation in a radix tree.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Fig. 1 shows a data storage system 100 including a data controller 101. The data controller 101 includes a concurrency control mechanism 102 for a prefix or radix tree with vertical compression and key sequence skip used as an index data structure 103. The data storage system 100 further includes a data storage 104. The data controller 101 may be implemented in a main memory using DRAM (Dynamic Random Access Memory) or the like and the data storage 104 typically includes mass storage like hard disks, Solid-State Disks (SSD) or the like.
According to claim 1 of the present invention, the data storage system 100 includes the data controller 101 and the data storage 104. Thus, to locate data promptly there is no need to traverse the complete data set stored in the data storage 104. Instead, only the index records corresponding to the request need to be searched within the index data structure 103.
Data users 111 can act as writers 112, adding and modifying data, or readers 113, looking up and retrieving data stored in the data storage 104. In the presence of concurrency, when several actors 111 perform data lookup and modification simultaneously, the overall operation efficiency depends significantly on the method of the concurrent synchronization or control 102.
The data storage system 100 can be seen as a variant of advanced synchronization between writers 112 and readers 113, applicable as a concurrency control mechanism 102 for prefix or radix trees with vertical compression and key sequence skip used as an index data structure 103 in data storage systems 100 and information databases. Such a radix tree based index data structure 103 can also be equipped with horizontal compression and advanced synchronization between writers, too.
Fig. 2 shows an example of a radix tree 200. The radix tree 200 includes a single root node 201, internal nodes 203 interconnected to form a tree, and last level nodes, so-called leafs 205. Usually, payload data is kept within the tree leafs, and internal nodes store some distinct attribute value to choose from during the traversing lookup.
The radix tree 200 is a special case of tree-based index data structures. Instead of keeping attribute values inside the internal node, this information is preserved within the node interconnections. Thus, during radix tree traversal there is no need to look through child nodes and compare search attribute values, but just to choose the child that corresponds to the attribute value index, if any exists.
Internal index tree nodes 203, as mentioned above, can be of variable size or adaptive capacity. There are four types of internal tree nodes differing in capacity only: 4, 16, 48 and 256 children, respectively. The internal node 203 contains a children compartment according to the capacity of 16 children in this example. The root node 201 has a capacity of 256 children. In this example, the inner node 203 is a child node of the root node 201 and linked to the root node 201 by the symbol "c". The common prefix in the inner node 203 is "omp". Common prefix means that all leaf nodes 205, which are children of the inner node 203, contain the prefix "omp".
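The choice among the four node capacities can be sketched as follows (the `pick_capacity` helper is illustrative, not part of the claimed implementation): the smallest pre-set capacity that fits the current number of children is selected, which is the essence of horizontal compression.

```python
def pick_capacity(n_children):
    # horizontal compression: choose the smallest of the four pre-set
    # inner node capacities that can hold the current number of children
    for capacity in (4, 16, 48, 256):
        if n_children <= capacity:
            return capacity
    raise ValueError("an inner node cannot have more than 256 children")

assert pick_capacity(3) == 4
assert pick_capacity(16) == 16
assert pick_capacity(17) == 48
assert pick_capacity(200) == 256
```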
In a more general form, the tree 200 can be described in a bottom-up fashion, starting with the alphabet definition, key string encoding and the interconnections of the index structure tree nodes. The alphabet for such a tree-based index data structure is a single-byte character set. This means that each symbol is represented within 8 bits and the whole alphabet cardinality, i.e. the number of elements, is 256 symbols. All symbols are used to form input strings: 255 non-zero byte symbols serve to encode information and a zero-byte symbol indicates the end of an input string. This encoding corresponds to C-style or null-terminated strings.
The interconnections between tree nodes of the index structure can be implemented with the so-called augmented pointers technique, as well as with ordinary node pointers. A current pointer size is 8 bytes or 64 bits, and modern MMUs usually operate within a 48-bit address space, leaving the rest unused. Thus, it is possible to reuse up to 16 bits of ordinary 64-bit pointers to preserve important information there. Augmented pointers which are used as node interconnections carry a 1-byte alphabet symbol denoting the symbol to which the particular interconnection corresponds.
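A minimal sketch of this augmented pointer packing, assuming a 48-bit address space; the identifiers (`augptr`, `augptr_make`, etc.) are illustrative assumptions, not names from the patent:

```c
#include <stdint.h>

/* An augmented pointer: the low 48 bits hold the node address, one of
 * the spare upper bytes holds the edge symbol of the interconnection. */
typedef uint64_t augptr;

#define AUGPTR_ADDR_MASK ((UINT64_C(1) << 48) - 1)

/* Pack a node address and its 1-byte alphabet symbol into one word. */
static augptr augptr_make(void *node, unsigned char symbol)
{
    return ((uint64_t)(uintptr_t)node & AUGPTR_ADDR_MASK)
         | ((uint64_t)symbol << 48);
}

/* Recover the plain node address (upper bits cleared). */
static void *augptr_node(augptr p)
{
    return (void *)(uintptr_t)(p & AUGPTR_ADDR_MASK);
}

/* Recover the edge symbol stored in bits 48..55. */
static unsigned char augptr_symbol(augptr p)
{
    return (unsigned char)((p >> 48) & 0xFF);
}
```

Because the whole augmented pointer fits in a single 64-bit word, it can be read and written with the same atomic operations as an ordinary pointer.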
Fig. 3 shows the common node prefix 300 in detail. The common node prefix 300 is a data compartment within the internal node 203. The common node prefix 300 has a layout to keep a number of key symbols which are shared by all node children. Those symbols are called the common prefix 301 and are arranged in a byte array compartment for prefix symbols. A further field, the prefix length field 302, provides the length of the common prefix 301. The invention extends the common node prefix 300 with an additional field, the node depth 303, to indicate the absolute offset from the beginning of the key.
Major internal node fields, like the children compartment and the children counter, can also reside in a separate memory layout with indirect access through a pointer from the internal node structure. Such an approach implies additional memory management overhead, but decreases the number of parent node updates during child node expands and shrinks, and so simplifies node capacity management. Radix tree leafs 205, or terminal nodes, consist of the primary fields key and payload, as well as the optional fields lock and uplink. During operation, writers perform all updates to ordinary and augmented pointers atomically. The common node prefix 300 is also supposed to be updated atomically. An indirect memory layout does not pose any difficulty, and pointer updates are handled by atomic operations of general commodity hardware. Internal nodes undergo expansion or shrinking when children counters indicate a high or low level of occupancy. Splitting or merging a node usually implies insertion or extraction of other nodes and common node prefix changes.
According to the common node prefix 300 presented here, the node depth 303 and the common prefix 301 are kept together in a single separate data structure. Thus, both of them are updated consistently by referencing that particular structure instance, i.e. the common node prefix 300. A reader then never accounts its own depth during the tree traversal. Instead, the reader reads the node depth 303 from the field of the common node prefix 300. It immediately follows that a reader always compares symbols at the correct positions within the search key. Further, no read locks are required despite concurrent write operations.
The common node prefix 300 can be declared by the following structure:
struct node_prefix {
    unsigned depth;
    unsigned length;
    char prefix[MAX_LEN];
};
Such a common node prefix 300 may pose a 4-byte constant overhead per node, wherein the overhead actually may depend on the maximum possible length of the key.
Fig. 4 shows an example of a radix tree 400 with a plurality of nodes. In each node, the node depth and the common prefix are shown. The prefix length can be derived from the common prefix. These fields are stored at each inner node in the common node prefix. A first reader 401 traverses the tree 400.
On the left of Figure 4, the progression of the first reader 401 through the tree 400 is shown. The search key "Aaronitic" is looked up by the first reader 401. The search starts at a root node 402 and continues to inner node 403, as this is the child of the root node 402 that includes the next symbols of the key. The inner node 403 has the common prefix "aron" and a node depth of 1. Accordingly, the first reader 401 moves further down the tree 400 to inner node 404, which comprises a node depth of 5 and a common prefix of length 1. The next step corresponds to the status shown in Figure 4. The first reader 401 is at inner node 405, which has the common prefix "t" and a node depth of 6. On the left side, the offset of six symbols, the prefix length of one and the common prefix "t" are shown. The leaf node 406 matches the complete key "Aaronitic". Hence, the search was successful.
The first reader 401 has already passed inner node 403, which is now under a write operation by a writer 407. Because of the new depth field in the common node prefix data structure, readers are allowed to eliminate waiting by only obligating writers to update the common node prefix consistently during an internal node split or merge, here at inner node 403. Consistently here literally means the writer 407 updates the common node prefix symbols together with the depth field at once, i.e. atomically. There are no other obligations on the order of writer operations or on synchronization between writers.
Thus, a second reader 408 is accessing the internal node 403 with a possibly ongoing modification, where the common node prefix has already been updated but the new internal node is not inserted yet. Readers then use the depth value from the new field of the common node prefix instead of the nominal depth accounted while traversing through tree levels. They are thus able to compare the common node prefix to the correct symbols of the search key regardless of the actual node level within the tree. Due to concurrency, some nodes may have undergone a split or merge during the readers' operation. However, readers proceed forward with the tree traversal, comparing the following symbols of the requested key. It is then possible to have a false positive match, accidentally skipping several symbols from the comparison. In case such uncertainty is detected, the situation is resolved nicely at the final step by comparing the stored key of the located index leaf with the requested one.
In the following it is described how the writer 407 actually performs such an operation and how the second reader 408 traverses inner node 403 under the write operation. The writer 407 creates an auxiliary data structure 409 at the inner node 403 containing the common node prefix with the node depth and the common prefix of the new modified node. Hence, the common node prefix reflects the changes of the write operation. In other words, the writer 407 creates a new auxiliary structure 409, sets the proper depth and common prefix fields, and atomically updates the reference, e.g. a pointer or the structure itself, within the corresponding node 403. Writers perform all updates atomically to maintain data structure consistency. The common node prefix is also updated atomically.
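The writer's publish step can be sketched with C11 atomics as follows. This is an illustrative interpretation under stated assumptions, not the patent's implementation; identifiers such as `publish_prefix` and `inner_node` are invented for the sketch, and safe reclamation of the replaced instance is deliberately left out:

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 8  /* illustrative bound */

struct node_prefix {
    unsigned depth;
    unsigned length;
    char prefix[MAX_LEN];
};

/* An inner node referencing its common node prefix indirectly, so the
 * writer can publish depth, length and prefix symbols together with a
 * single atomic pointer store. */
struct inner_node {
    _Atomic(struct node_prefix *) common;
    /* children compartment, counter etc. omitted */
};

/* Writer side: build the new common node prefix aside, then publish it
 * atomically; a concurrent reader sees either the complete old version
 * or the complete new version, never a mix of fields. */
static void publish_prefix(struct inner_node *n, unsigned depth,
                           const char *prefix, unsigned length)
{
    struct node_prefix *p = malloc(sizeof *p);
    p->depth = depth;
    p->length = length;
    memcpy(p->prefix, prefix, length);
    atomic_store_explicit(&n->common, p, memory_order_release);
    /* Reclaiming the old instance safely (e.g. epoch-based
     * reclamation) is out of scope for this sketch. */
}
```

Readers would load the pointer with `memory_order_acquire` and use the depth and prefix of whichever instance they obtained.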
In the example of Figure 4, the node depth of the auxiliary data structure 409 equals 2 and the common prefix of the auxiliary data structure 409 is "ron", having a length of 3. This new common node prefix of the auxiliary data structure 409 corresponds to a node split in which a new node is inserted between the root node 402 and the inner node 403 under the write operation. Such a new node has a node depth of 1 and a common prefix "a".
It can be seen that the sum, i.e. 5, of the node depth, i.e. 2, and the prefix length, i.e. 3, of the new auxiliary data structure 409 equals the sum, i.e. 5, of the node depth, i.e. 1, and the prefix length, i.e. 4, of the previous common node prefix of the inner node 403 under the write operation.
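This invariant can be illustrated with the numbers of the worked example. The split arithmetic and all identifiers below are assumptions mirroring Figure 4, not code from the patent:

```c
/* A (depth, length) view of a common node prefix. */
struct prefix_view { unsigned depth, length; };

/* Splitting the old prefix after `cut` symbols: the newly inserted
 * upper node keeps the first `cut` symbols at the old depth... */
static struct prefix_view split_upper(struct prefix_view old, unsigned cut)
{
    struct prefix_view v = { old.depth, cut };
    return v;
}

/* ...and the surviving lower node keeps the remaining symbols, so its
 * depth plus prefix length stays equal to the old depth plus length. */
static struct prefix_view split_lower(struct prefix_view old, unsigned cut)
{
    struct prefix_view v = { old.depth + cut, old.length - cut };
    return v;
}
```

With the values of Figure 4 (old depth 1, prefix "aron" of length 4, cut after "a"), the upper node gets depth 1 and length 1 and the lower node gets depth 2 and length 3, preserving the sum 5.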
The auxiliary data structure 409 and/or its common node prefix can be stored directly in the inner node or in a separate structure to which one or more pointers refer. The pointer can be stored at the inner node or at the auxiliary data structure 409.
The second reader 408, when reaching the inner node 403 during tree traversal, reads the common node prefix of the auxiliary data structure 409. The second reader 408 relies on the node depth field rather than on the accounted depth; it calculates the next position from the node depth field, i.e. 2, and the prefix length, i.e. 3. Therefore, the second reader 408 reaches the correct inner node 404 despite the write operation on the inner node 403.
In a broader sense, readers can operate wait-free, proceeding forward from the root node to the leaf, because they use the depth value from the common node prefix, if such a depth value already exists. The reader may skip some symbols from the comparison due to concurrency and may resolve a possible false match by comparing the search key with the key from the located leaf.
Fig. 5 depicts a method of providing a data storage system configured to implement a prefix tree with a plurality of nodes supporting wait-free reader operation in a radix tree. In step 500, a data storage system is provided that is configured to implement a prefix tree with a plurality of nodes. In step 501, the common node prefix is provided per inner node. The common node prefix includes a common prefix, a prefix length and a node depth.
Fig. 6 shows an operation flowchart of the wait-free reader operation. The flowchart starts at a level start 600 that is repeated for each node level.
At step 601, it is decided whether the node is a leaf. If yes, the procedure branches to step 602. There, it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 603. Then, the procedure branches to step 604. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 602.
At step 604, it is decided whether a false positive or an uncertainty exists. If yes, the search key and the leaf key are matched in step 605. For a positive outcome, the method branches to step 606 and returns true, i.e. the key has been correctly found in the leaf. The method also takes step 606 when no false positive or uncertainty exists at step 604.
Back to step 601 for the case that the node is not a leaf: the operation then branches to step 608, where it is decided whether a common node prefix exists. If not, the method branches to step 609. There, it is decided whether a next-level child exists. If not, the method branches to step 607, false, because no node storing the key was found. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
When a common node prefix exists, i.e. a positive decision at step 608, the operation branches to step 611. There, it is decided whether the prefix and the key at the position of the node match. In other words, the symbols of the common node prefix are compared with the search key at the positions defined by the common node prefix depth. If no, the operation branches to the false step 607 and is terminated. If yes, it branches to step 612.
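The comparison at step 611 amounts to matching the prefix symbols against the search key at the absolute positions given by the stored node depth, not by any depth accounted while descending. A hedged sketch, with illustrative identifiers:

```c
#include <stdbool.h>
#include <string.h>

/* Does the common prefix match the search key at the absolute
 * positions [depth, depth + length)? The bounds check guards against
 * keys shorter than the stored offset. */
static bool prefix_matches(const char *key, size_t key_len,
                           unsigned depth, const char *prefix,
                           unsigned length)
{
    if ((size_t)depth + length > key_len)
        return false;
    return memcmp(key + depth, prefix, length) == 0;
}
```

For the example of Figure 4, the key "Aaronitic" matches the prefix "aron" at depth 1 and the prefix "t" at depth 6.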
There, it is decided whether some symbols of the search key have been skipped, for example due to a key sequence skip or to a concurrent writer modification excluding several symbols from the comparison due to a node split or merge. If yes, an uncertainty may have occurred and an uncertainty flag or the like is set in step 613. Then, the procedure branches to step 609. This is also the case when no symbols of the search key have been skipped, i.e. for a no at step 612.
At step 609, it is decided whether a next-level child exists. If not, the method branches to step 607, false, because no node storing the key was found. If yes, it proceeds at step 610 to the next level, i.e. a new start at step 600.
From a broader context, the operation may be described as follows. Readers operate wait-free, traversing the tree down to the leaf with no locks despite concurrent modifications possibly introduced by writers. On each level, starting from the root, a reader compares symbols of the common prefix with the search key at the corresponding positions. The corresponding positions are determined by extracting the depth field from the common node prefix, if it exists. The reader proceeds forward to the next level if there is a match, and otherwise returns false. During the traversal, the reader may detect a key sequence skip or a case of uncertainty, when a concurrent writer modification excluded one or several symbols from the comparison due to a node split or merge. Such cases are usually resolved at the final step by comparing the search key with the key extracted from the terminal leaf node. If there are no detected obstacles, the reader returns true when a terminal leaf node was found, or false otherwise.
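The whole reader operation can be summarized in a deliberately simplified, single-threaded sketch. The node layout below (one child slot per symbol, no augmented pointers, no concurrency machinery) and all identifiers are illustrative assumptions, not the patent's structures:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 8

struct node_prefix { unsigned depth, length; char prefix[MAX_LEN]; };

struct node {
    bool is_leaf;
    const char *key;                  /* leaf only: full stored key   */
    const struct node_prefix *common; /* inner only: may be NULL      */
    struct node *child[256];          /* inner only: children by symbol */
};

/* Wait-free style lookup: positions come from the stored depth field
 * where a common node prefix exists; a detected skip sets the
 * uncertainty flag, resolved by a full key comparison at the leaf. */
static bool lookup(const struct node *n, const char *key)
{
    size_t key_len = strlen(key);
    size_t pos = 0;           /* depth accounted while descending */
    bool uncertain = false;

    while (n != NULL) {
        if (n->is_leaf)
            return uncertain ? strcmp(n->key, key) == 0 : true;
        if (n->common != NULL) {
            const struct node_prefix *p = n->common;
            if (p->depth > pos)
                uncertain = true;  /* some symbols were skipped */
            if ((size_t)p->depth + p->length > key_len
                || memcmp(key + p->depth, p->prefix, p->length) != 0)
                return false;
            pos = p->depth + p->length;
        }
        if (pos >= key_len)
            return false;
        n = n->child[(unsigned char)key[pos]];
        pos++;                    /* the interconnection consumes a symbol */
    }
    return false;
}
```

Building the nodes of the "Aaronitic" example by hand and calling `lookup` on the root reproduces the traversal of Figure 4.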
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art practicing the claimed invention, from studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. Data storage system (100) with a data storage (104) and a data controller (101) configured to implement a prefix tree (200) with a plurality of nodes (201, 203, 205), wherein
the data controller (101) is configured to provide a common node prefix (300) per inner node (203), the common node prefix (300) including a common prefix (301), a prefix length (302) and a node depth (303); wherein the node depth (303) is the absolute offset from the beginning of a key to the beginning of the common prefix (301).
2. Data storage system (100) according to claim 1, wherein
the data controller (101) is configured to initiate a write operation in the prefix tree (200), thereby setting a node depth (303) for an inner node (203) under write operation, and is configured to initiate a concurrent read operation including the inner node (203) under write operation, thereby using the set node depth (303), the common prefix (301) and the prefix length (302) for tree traversal.
3. Data storage system (100) according to claim 1 or 2, wherein
the data controller (101) is configured to initiate a write operation for an inner node (203), to provide an auxiliary data structure (409) for the inner node (203) under write operation, and to provide a common node prefix (300) for the auxiliary data structure (409), wherein the common node prefix (300) reflects the changes of the write operation.
4. Data storage system (100) according to claim 3, wherein
the data controller (101) is configured to initiate a read operation concurrent to the write operation and including the inner node (203) under write operation, such that the read operation reads the inner node (203) under write operation and the auxiliary data structure (409).
5. Data storage system (100) according to claim 3 or 4, wherein
the data controller (101) is configured to replace the auxiliary data structure (409), wherein the common node prefix (300) reflects the changes of the write operation of the inner node (203) under write operation, after the write operation is completed.
6. Data storage system (100) according to one of the claims 3 to 5, wherein the data controller (101) is configured to set the node depth (303) and/or the prefix length (302) of the auxiliary data structure (409) such that the sum of the node depth (303) and the prefix length (302) of the auxiliary data structure (409) equals the sum of the node depth (303) and the prefix length (302) of the common node prefix (300) of the inner node (203) under write operation.
7. Data storage system (100) according to one of the claims 1 to 6, wherein
the common node prefix (300) is provided within the auxiliary data structure (409) within the inner node (203).
8. Data storage system (100) according to one of the claims 1 to 7, wherein
the common node prefix (300) is provided within the auxiliary data structure (409) within a separate structure, and wherein at least one pointer is provided at the inner node (203) pointing to the separate structure.
9. Data storage system (100) according to one of the claims 1 to 8, wherein
the common node prefix (300) is provided within a separate structure, wherein the auxiliary data structure (409) is provided within the inner node (203) and wherein at least one pointer is provided at the auxiliary data structure (409) pointing to the separate structure.
10. Data storage system (100) according to one of the claims 1 to 9, wherein
references to children and/or a children node counter of an inner node (203) are provided within a separate structure, and wherein at least one pointer is provided at the inner node (203) pointing to the separate structure.
11. Data storage system (100) according to one of the claims 1 to 10, wherein
the prefix tree (200) is a radix tree or an adaptive radix tree.
12. Data storage system (100) according to one of the claims 1 to 11, wherein
the prefix tree (200) comprises a horizontal compression, a vertical compression and/or a key sequence skip.
13. Method of providing a data storage system (100) configured to implement a prefix tree (200) with a plurality of nodes, comprising
providing (501) the common node prefix (300) per inner node (203), the common node prefix (300) including a common prefix (301), a prefix length (302) and a node depth (303).
14. Method according to claim 13, comprising
initiating a write operation for an inner node (203); and
providing an auxiliary data structure (409) for the inner node (203) under write operation comprising a common node prefix (300) of the inner node (203);
wherein the common node prefix (300) reflects the changes of the write operation.
15. A computer program with a program code for performing the method according to claim 13 or 14 when the computer program runs on a computer or the data storage system (100) according to one of the claims 1 to 12.