CN109407979B - Multithreading persistent B + tree data structure design and implementation method - Google Patents

Multithreading persistent B + tree data structure design and implementation method

Info

Publication number
CN109407979B
Authority
CN
China
Prior art keywords
tree
node
linked list
main memory
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811129623.3A
Other languages
Chinese (zh)
Other versions
CN109407979A (en)
Inventor
舒继武
陆游游
胡庆达
刘昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority claimed from CN201811129623.3A
Publication of CN109407979A
Application granted
Publication of CN109407979B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0643Management of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]

Abstract

The invention discloses a design and implementation method for a multithreading persistent B + tree data structure, comprising the following steps: introducing a layer of shadow leaf nodes based on a chained structure into a preset B + tree; applying a data layout policy based on hybrid main memory that stores the linked-list-based leaf nodes in NVM, forming a chain layer based on a linked-list structure, and stores the other parts of the index data structure in DRAM, forming a tree layer based on an array structure, so that the persistence overhead of balancing and sorting is avoided by this layered design of a volatile tree structure and a persistent linked-list structure; and designing an embedded fine-grained lock mechanism and an optimistic write mechanism for concurrency control between read-write operations and between write-write operations, respectively. The method uses a hybrid main memory data structure combining nonvolatile and volatile memory, increases the concurrency of data retrieval, realizes persistent data storage, solves the problem of amplified lock overhead, and accelerates the system recovery process of the data structure.

Description

Multithreading persistent B + tree data structure design and implementation method
Technical Field
The invention relates to the technical field of nonvolatile main memory storage, in particular to a multithreading persistent B + tree data structure design and implementation method.
Background
Non-Volatile Main Memory (NVM) is a new type of memory storage medium with the advantages of byte addressability, data retention after power failure, high storage density, no need for dynamic refresh, and low static power consumption. However, it also has disadvantages, such as asymmetric read/write performance, limited write endurance, and high write power consumption. The emergence of the hybrid memory architecture brings great new opportunities and challenges to the storage field and has triggered a wave of research in industry and academia on heterogeneous hybrid memory architectures and the related system software. Non-volatile memory has profound implications for computer system architecture, system software, software libraries, and applications. Non-volatile memory devices can form a hybrid main memory together with Dynamic Random Access Memory (DRAM) devices, where temporary data of an application is stored in DRAM and data that needs to be stored persistently is stored in NVM. The advent of non-volatile main memory has prompted researchers to design main-memory-based storage systems, including file systems and database systems. The index structure is a key module of such storage systems and largely determines their performance. In a storage system based on non-volatile main memory, the index structure must simultaneously guarantee efficient consistency and multithreaded scalability, which poses a new challenge to designers of index structures.
In a traditional index data structure such as a B + tree, sorting and balancing operations account for a large proportion of the total overhead of tree operations. Worse, persistence delay further lengthens the time for which tree operations hold locks, so persistent B + trees in the related art face serious performance problems in multithreaded scenarios: as the persistence delay of the non-volatile main memory increases, the time a tree operation holds a lock grows almost linearly, and the performance of the B + tree degrades severely.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a method for designing and implementing a multithreaded persistent B + tree data structure, which uses a hybrid main memory data structure combining nonvolatile and volatile memory, increases the concurrency of data retrieval, realizes persistent data storage, solves the problem of amplified lock overhead, and accelerates the system recovery process of the data structure.
In order to achieve the above object, an embodiment of the present invention provides a design and implementation method for a multithreading persistent B + tree data structure, including the following steps: introducing a layer of shadow leaf nodes based on a chained structure into a preset B + tree; by a data layout policy based on hybrid main memory, storing the linked-list-based leaf nodes in NVM to generate a chain layer based on a linked-list structure, and storing the other parts of the index data structure in DRAM to generate a tree layer based on an array structure, so that the persistence overhead of balancing and sorting is avoided by the layered design of a volatile tree structure and a persistent linked-list structure; and designing an embedded fine-grained lock mechanism and an optimistic write mechanism for concurrency control between read-write operations and between write-write operations, respectively.
The multithreading persistent B + tree data structure design and implementation method of the embodiment of the invention, by using a hybrid main memory data structure combining nonvolatile and volatile memory, realizes search operations with good spatial locality and balance and effectively reduces expensive persistence operations; it designs an embedded fine-grained lock and an optimistic write mechanism to solve the problem of amplified lock overhead, and at the same time adopts a multithreaded recovery mechanism and a persistent garbage collector to support consistency management of the nonvolatile main memory and to accelerate the system recovery process of the data structure.
In addition, the method for designing and implementing the multithreaded persistent B + tree data structure according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the embedded fine-grained lock mechanism designs an update flag bit and a delete flag bit for each linked list node to remove the persistence delay that does not satisfy the preset condition from the version verification path of the read operation, and the optimistic write mechanism separates the concurrent control mechanisms of the tree nodes and the linked list nodes to remove the persistence delay from the locking path of the tree node granularity.
Further, in an embodiment of the present invention, each node of the array-structure-based tree layer in the DRAM can accommodate a preset number of key-value pairs, wherein each key-value pair of a tree node points to a tree node or a linked list node of the next layer, so that when the number of key-value pairs of any tree node exceeds or falls below a preset threshold, the tree node performs a split or merge operation and one key-value pair is inserted into or deleted from the tree node of the upper layer.
Further, in an embodiment of the present invention, the chain layer, based on a linked-list structure, is stored in the non-volatile main memory, wherein the chain layer is an ordered linked list, each linked list node stores only one key-value pair and the nodes are connected by right pointers, and CPU atomic operations are used to ensure the atomicity and consistency of insert/delete/update operations.
Further, in an embodiment of the present invention, each tree operation starts to search from the root node until finding the corresponding leaf node, wherein before accessing any one tree node, a prefetch instruction is executed to read the whole tree node into the CPU cache to cover the access delay of the whole tree node, and the key array and the value array are respectively stored in different main memory spaces to prefetch only the key array, thereby reducing the total data amount of each prefetch operation.
Optionally, a key array size of a preset threshold is selected, a linear lookup operation may be used instead of a binary lookup operation, the linear lookup operation is performed on a main memory space and is accelerated by using a SIMD instruction, wherein each key value pair is provided with a fingerprint of 1B, and each fingerprint is a hash value of a corresponding key value, and the fingerprint array is stored at a head of a leaf node.
Further, in an embodiment of the present invention, if a conflict occurs between read and write operations, a version-number-based concurrency control mechanism is adopted, wherein a version number counter is kept on each tree node and the version number is incremented each time the state of the tree node changes; for an insert, delete, or update operation, a lock is applied before the tree node is modified and the corresponding version number is set to dirty, and after the operation is completed and the version number is increased by 1, the lock of the corresponding tree node is released; if the version number has been modified or is locked, the read operation repeatedly executes the above process until the version number passes verification. If a conflict occurs between write operations, a locking mechanism of tree-node granularity is adopted, wherein the tree-node-granularity lock ensures that write operations modifying different tree nodes execute simultaneously; the leaf nodes are connected by right pointers, the split direction of a leaf node is restricted to be from left to right only, locks on tree nodes are applied from the bottom up, and when a tree node splits or is deleted, the lock of the tree node of the upper layer is applied; the linked list nodes and the key-value pairs of the leaf nodes have a one-to-one correspondence, so that a write operation can modify a linked list node only after obtaining the lock of the corresponding leaf node of the tree layer.
Further, in an embodiment of the present invention, rather than calling the system main memory allocator for every allocation and release of a linked list node, a larger non-volatile main memory space is allocated from the system main memory allocator each time, and the address and length of this space are persisted in a persistent linked list; the allocated main memory space is divided into main memory blocks of a preset size and maintained through a volatile free main memory block linked list for the main memory allocation and release operations of the chain layer; when the system is recovered, the recovery thread scans the metadata information on the persistent linked list and the nodes of the chain layer, determines which main memory blocks are in use and which are not, and thereby rebuilds the volatile free main memory block linked list.
Further, in an embodiment of the present invention, the method may further include: and correctly recovering the released tree nodes and linked list nodes by maintaining a global epoch counter and three garbage recovery linked lists, wherein before executing related operations, a working thread firstly registers the existing epoch number, and for each deleted tree/linked list node, the working thread places the deleted tree/linked list node into the corresponding garbage recovery linked list according to the current global epoch number.
Further, in an embodiment of the present invention, the method further includes: when the system is shut down normally, all volatile internal tree nodes and the garbage collector are persisted to a preset position of the nonvolatile main memory, and after the system is restarted, a recovery thread copies all volatile internal tree nodes and the garbage collector from the nonvolatile main memory into the DRAM.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram of a method for designing and implementing a multithreaded persistent B + tree data structure, according to one embodiment of the invention;
FIG. 2 is a diagram of a chained structure based multithreaded persistent B + tree structure, according to an embodiment of the invention;
FIG. 3 is a diagram of an optimization strategy for read and write conflicts and write-write conflicts, according to one embodiment of the present invention;
FIG. 4 is a diagram of a persistent B + tree restricted multithreading extensibility analysis, according to one embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes a method for designing and implementing a multithreading persistent B + tree data structure according to an embodiment of the present invention with reference to the drawings.
FIG. 1 is a flow diagram of a method for designing and implementing a multithreaded persistent B + tree data structure, according to an embodiment of the invention.
As shown in FIG. 1, the method for designing and implementing the multithreading persistent B + tree data structure comprises the following steps:
In step S101, a layer of shadow leaf nodes based on a chained structure is introduced into a preset B + tree.
Further, in an embodiment of the present invention, each tree operation starts to search from the root node until finding the corresponding leaf node, wherein before accessing any one tree node, a prefetch instruction is executed to read the whole tree node into the CPU cache to cover the access delay of the whole tree node, and the key array and the value array are respectively stored in different main memory spaces to prefetch only the key array, thereby reducing the total data amount of each prefetch operation.
In step S102, based on a data layout policy for the hybrid main memory, the linked-list-based leaf nodes are stored in the NVM to generate a chain layer based on a linked-list structure, and the other parts of the index data structure are stored in the DRAM to generate a tree layer based on an array structure, so that the persistence overhead of balancing and sorting is avoided by the design of the layered volatile tree structure and persistent linked-list structure.
Further, in one embodiment of the present invention, each node of the array-structure-based tree layer in the DRAM may contain a preset number of key-value pairs, wherein each key-value pair of a tree node points to a tree node or a linked list node of the next layer, so that when the number of key-value pairs of any tree node exceeds or falls below a preset threshold, the tree node performs a split or merge operation and one key-value pair is inserted into or deleted from the tree node of the upper layer.
Further, in an embodiment of the present invention, the chain layer, based on a linked-list structure, is stored in the non-volatile main memory, wherein the chain layer is an ordered linked list, each linked list node stores only one key-value pair and the nodes are connected by right pointers, and CPU atomic operations are used to ensure atomic and consistent insert/delete/update operations.
Optionally, a key array size of a preset threshold is selected, a linear lookup operation may be used instead of a binary lookup operation, the linear lookup operation is performed on a main memory space, and is accelerated by using a SIMD instruction, wherein each key value pair is provided with a fingerprint of 1B, and each fingerprint is a hash value of a corresponding key value, and the fingerprint array is stored at a head of a leaf node.
In step S103, an embedded fine-grained lock mechanism and an optimistic write mechanism are designed to be used for concurrent control between read-write operations and between write-write operations, respectively.
Further, in an embodiment of the present invention, the embedded fine-grained lock mechanism designs an update flag bit and a delete flag bit for each linked list node to remove the persistence delay that does not satisfy the preset condition from the version verification path of the read operation, and the optimistic write mechanism separates the concurrent control mechanisms of the tree nodes and the linked list nodes to remove the persistence delay from the locking path of the tree node granularity.
Further, in an embodiment of the present invention, if a conflict occurs between read-write operations, a concurrent control mechanism based on a version number is adopted, wherein a version number counter is adopted on each tree node, the version number is incremented each time the state of the tree node is changed, for an insert, delete, or update operation, a lock is applied before the tree node is modified, a corresponding version number is set to be dirty, after the operation is completed and the version number is increased by 1, the lock of the corresponding tree node is released, and if the version number is modified or locked, the read operation will repeatedly execute the above process until the version number is verified; and if the write operation conflicts, a lock mechanism of tree node granularity is adopted, wherein the lock of the tree node granularity is adopted to ensure that the write operation of modifying different tree nodes is executed simultaneously, the leaf nodes are connected through a right pointer, the splitting direction of the preset leaf node can only be from left to right, the lock of the tree node is applied from bottom to top, and when the tree node is split or deleted, the lock of the tree node at the previous layer is applied, the linked list node and the key value pair of the leaf node have a one-to-one correspondence relationship, so that the linked list node can be modified only after the lock of the leaf node corresponding to the tree layer is obtained.
Further, in an embodiment of the present invention, rather than calling the system main memory allocator for every allocation and release of a linked list node, a larger non-volatile main memory space is allocated from the system main memory allocator each time, and the address and length of this space are persisted in a persistent linked list; the allocated main memory space is divided into main memory blocks of a preset size and maintained through a volatile free main memory block linked list for the main memory allocation and release operations of the chain layer; when the system is recovered, the recovery thread scans the metadata information on the persistent linked list and the nodes of the chain layer, determines which main memory blocks are in use and which are not, and thereby rebuilds the volatile free main memory block linked list.
Further, in an embodiment of the present invention, the method may further include: and correctly recovering the released tree nodes and linked list nodes by maintaining a global epoch counter and three garbage recovery linked lists, wherein before executing related operations, a working thread firstly registers the existing epoch number, and for each deleted tree/linked list node, the working thread places the deleted tree/linked list node into the corresponding garbage recovery linked list according to the current global epoch number.
Further, in an embodiment of the present invention, the method further includes: when the system is shut down normally, all volatile internal tree nodes and the garbage collector are persisted to a preset position of the nonvolatile main memory, and after the system is restarted, all volatile internal tree nodes and the garbage collector are copied from the nonvolatile main memory into the DRAM by the recovery thread.
The embodiment of the invention provides a hybrid main memory data structure that uses both nonvolatile and volatile memory: a traditional tree data structure is adopted in the volatile memory and a chained data structure is adopted in the nonvolatile memory. The tree data structure increases the concurrency of data retrieval, while the chained data structure realizes persistent storage of data on the nonvolatile medium; the tree structure provides search operations with good spatial locality and balance, and the chained structure effectively reduces expensive persistence operations. An embedded fine-grained lock and an optimistic write mechanism are designed in the data structure to solve the problem of amplified lock overhead, and a multithreaded recovery mechanism and a persistent garbage collector are adopted to support consistency management of the nonvolatile main memory and to accelerate the system recovery process of the data structure.
The optimized data structure mainly comprises two layers: the first layer is a tree layer (Tree Layer) based on an array structure and stored in the DRAM, and the second layer is a chain layer (List Layer) based on a linked-list structure and stored in the NVM, where the chain layer effectively reduces the persistence operations of the data structure, and the tree layer provides search operations with good spatial locality and balance.
The optimized data structure specifically includes the following features:
(1) The tree layer is located in DRAM and based on an array structure (see the sketch following this list), where each node can hold a fixed number of key-value pairs, and the sorted key-value pairs are stored in a contiguous main memory space, which ensures good spatial locality and supports tree operations with O(log n) time complexity. Each key-value pair of a tree node points to a tree node or a linked list node of the next layer. If the number of key-value pairs of a tree node exceeds or falls below a certain threshold, the tree node executes a split or merge operation and a key-value pair is inserted into or deleted from the tree node of the upper layer, without generating any persistence overhead. Because the tree layer is only used to accelerate the search performance of the chain layer, the volatile tree layer can be recovered from the durable chain layer after a system failure; since balancing and sorting operations are performed only in DRAM, the method does not introduce excessive persistent-write overhead and effectively improves the performance of the index structure.
(2) The chain layer, based on a linked-list structure, is located in the NVM; in particular, only the chain layer is stored in the non-volatile main memory. The chain layer is an ordered linked list, each linked list node stores only one key-value pair, the nodes are connected by right pointers, and CPU atomic operations (the x86 platform supports aligned 64-bit atomic operations) are used to ensure atomic and consistent insert/delete/update operations. Taking the insert operation as an example, after the correct insertion position is found, the order of the chain layer can be guaranteed by executing only the following two persistence operations (also illustrated in the sketch following this list): the first persists the newly generated linked list node (which already points to the successor node), and the second persists the pointer of the predecessor linked list node (which now points to the newly generated linked list node). If a system error occurs between the two operations, the consistency of the chain layer is not affected, because the newly generated linked list node has not yet been inserted into the chain layer; for new nodes that were not successfully inserted, the persistent garbage collector avoids the loss of this block of main memory. Because the chain layer can accommodate an unlimited number of linked list nodes, it naturally eliminates balancing operations.
(3) Each tree operation searches from the root node until the corresponding leaf node is found and must read all the tree nodes on the search path, so the memory access latency of the tree nodes becomes the main factor affecting tree layer performance. Before accessing a tree node, the embodiment of the invention executes a prefetch instruction to read the whole tree node into the CPU cache, which hides the access latency of the whole tree node; the key array and the value array are stored in separate main memory spaces, so that only the key array is prefetched, reducing the total amount of data of each prefetch operation.
(4) For the tree data structure located in DRAM, a key array size below a certain threshold may be chosen so that linear lookup can be used instead of binary lookup. The linear lookup operates on a contiguous main memory space and is accelerated using SIMD instructions: for a lookup operation, a SIMD comparison instruction compares the target key with multiple different keys at the same time, and similar strategies are used in sorting and balancing so that multiple data items are moved simultaneously, improving index performance.
(5) For the leaf nodes of the data structure, a 1-byte fingerprint is provided for each key-value pair, where each fingerprint is a hash value of the corresponding key, and the fingerprint array is stored at the head of the leaf node. For a lookup operation, the key corresponding to a fingerprint is compared only when the hash value of the target key equals that fingerprint. Since each fingerprint is far smaller than a key, comparison based on the fingerprint array further increases the degree of parallelism available to the comparison.
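As referenced in items (1) and (2) above, the following C++ sketch illustrates the two-layer layout and the two persistence points of a chain-layer insert. It is illustrative only: the field names, the fan-out of 32, the fingerprint placement, and the persist() helper (which assumes a CLWB-capable CPU) are assumptions rather than the patent's actual code.

    #include <immintrin.h>   // _mm_clwb, _mm_sfence (assumes CLWB support)
    #include <cstdint>
    #include <cstddef>

    constexpr int kFanout = 32;                    // fixed slots per tree node

    struct ListNode {                              // chain layer, lives in NVM
        uint64_t  key;
        uint64_t  value;
        ListNode* right;                           // ordered singly linked list
    };

    struct TreeNode {                              // tree layer, lives in DRAM
        uint32_t  version;                         // lock/root/leaf bits + counter
        uint16_t  count;                           // keys currently stored
        uint8_t   fingerprints[kFanout];           // 1-byte hash per key (leaf only)
        uint64_t  keys[kFanout];                   // sorted, contiguous key array
        void*     children[kFanout];               // child TreeNode* or ListNode*
    };

    // Flush the cache lines covering [p, p + len) and fence, making them durable.
    static void persist(const void* p, size_t len) {
        const char* q = static_cast<const char*>(p);
        for (uintptr_t a = (uintptr_t)q & ~63ull; a < (uintptr_t)q + len; a += 64)
            _mm_clwb(reinterpret_cast<void*>(a));
        _mm_sfence();
    }

    // Chain-layer insert under the leaf lock: exactly two persistence points.
    void chain_insert(ListNode* prev, ListNode* fresh) {
        fresh->right = prev->right;                // link to the successor first
        persist(fresh, sizeof(*fresh));            // persist point 1: the new node
        prev->right = fresh;                       // now make it reachable
        persist(&prev->right, sizeof(ListNode*));  // persist point 2: the pointer
    }

If a crash happens between the two persistence points, the new node is simply not reachable from the chain layer, which matches the consistency argument in item (2).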
Further, the embodiment of the present invention describes a concurrency control mechanism based on version number of the data structure, and the concurrency control mechanism mainly includes the following contents: for the conflict between the read-write operations, a concurrent control mechanism based on the version number is adopted, and for the conflict between the write-write operations, a tree node granularity locking mechanism is adopted.
On one hand, for conflicts between read and write operations, a version number counter is kept on each tree node as the communication medium between concurrently executing read and write operations, avoiding the overhead of acquiring a lock for every read operation. The version number is incremented each time the state of the tree node changes. For an insert, delete, or update operation, a lock is applied before the tree node is modified and the version number is set to dirty; after the operation is completed, the version number is increased by 1 and the lock of the tree node is released. If the version number has been modified or is locked, the read operation is re-executed until the version number passes verification.
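A rough illustration of the reader side of this protocol follows (an assumed sketch, not the patent's code; the lock-bit position follows the 32-bit version word described in the detailed embodiment below):

    #include <atomic>
    #include <cstdint>

    constexpr uint32_t LOCK_BIT = 1u << 31;        // assumed "locked/dirty" bit

    // Optimistic read: snapshot the version, read, and re-check; retry if a
    // writer locked or changed the node in between.
    template <typename ReadFn>
    auto optimistic_read(const std::atomic<uint32_t>& version, ReadFn read_body) {
        for (;;) {
            uint32_t v1 = version.load(std::memory_order_acquire);
            if (v1 & LOCK_BIT) continue;           // node is being modified
            auto result = read_body();             // read keys/values/children
            uint32_t v2 = version.load(std::memory_order_acquire);
            if (v1 == v2) return result;           // version unchanged: success
        }                                          // otherwise retry
    }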
On the other hand, for conflicts between write operations, tree-node-granularity locks ensure that write operations modifying different tree nodes execute simultaneously, because a write operation only needs to hold the lock of the tree node it modifies. Each write operation reaches the target leaf node through version number verification; the leaf nodes are connected by right pointers, and the split direction of a leaf node is specified to be from left to right only, which prevents a split operation from making the target key-value pair unreachable. Locks on tree nodes are then applied from the bottom up, and the lock of the upper-layer tree node is applied only when a tree node splits or is deleted. Because the linked list nodes and the key-value pairs of the leaf nodes have a one-to-one correspondence, a write operation can modify a linked list node only after acquiring the lock of the corresponding leaf node in the tree layer. In this way, the concurrency control mechanism of the tree nodes simultaneously resolves concurrency conflicts at the linked-list level.
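A corresponding writer-side sketch is shown below (again illustrative; folding the "dirty" marker into the lock bit is an assumption about the encoding):

    #include <atomic>
    #include <cstdint>

    constexpr uint32_t LOCK_BIT = 1u << 31;        // same meaning as in the read sketch

    // Lock the tree node, apply the modification, then bump the version and
    // release; concurrent readers see the lock bit as "dirty" and retry.
    template <typename MutateFn>
    void locked_update(std::atomic<uint32_t>& version, MutateFn mutate) {
        uint32_t v;
        do {                                       // spin until the lock bit is free
            v = version.load(std::memory_order_acquire) & ~LOCK_BIT;
        } while (!version.compare_exchange_weak(v, v | LOCK_BIT,
                                                std::memory_order_acquire));
        mutate();                                  // modify the node in DRAM
        version.store(v + 1, std::memory_order_release);  // unlock + increment
    }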
Further, embodiments of the present invention propose a concurrency control mechanism by which the data structure supports optimistic reads. In the optimistic read mechanism, a read operation takes a snapshot of the current version without locking it, then reads the data and checks the version again: if the version has not changed and is not marked dirty, the read succeeds; because no lock bit needs to be set, read concurrency is improved. For write-write conflicts, the data structure uses a per-tree-node lock combined with the version word. First, a write operation locates the node that needs to be written through a top-down read; once the node is located, the write operation sets its lock bit and starts the modification and persistence process. For any write operation that requires rebalancing, locks are taken from the bottom level upward, so the data structure only locks the tree nodes on the affected path rather than the entire tree. For an insert operation, the lock bit is first set and the version is marked dirty before the write and persistence are performed; then the write and persistence operations are carried out on the leaf node; finally the version number is incremented and the lock is released. For a read operation, the lookup and read are performed and the snapshot version is compared with the latest version; if the version is dirty or has been modified, the read operation fails and verification restarts until it succeeds.
Specifically, for read-write conflicts, because the version is set to dirty at the leaf node level during a write operation and is incremented only after the write finishes, a longer write execution time results in a higher probability of read aborts: a read operation, even one reading other keys of the same node, keeps retrying until the version number is clean and unchanged, which leads to a higher read abort rate. The method removes the persistence delay from the critical path of version verification: because the chain layer is organized per element, concurrency control at element granularity is possible. Linked list nodes therefore use embedded mini-locks to perform persistence operations without blocking reads of other keys; after the persistence operation, the linked list node updates the version number in the array-layer node, so the persistence delay is removed from the critical path of version verification. In embodiments of the present invention, for update and delete operations, the data structure sets an embedded bit to indicate that a linked list node is being modified. After inserting and persisting a linked list node, the array layer is updated using the version mechanism. Finally, the data structure clears the embedded bit and unlocks the array node; the embedded bit is used only for coordination and therefore does not need to be persisted. For delete operations, only the delete bit is set and the main memory space is reclaimed later, preventing read operations from accessing a dangling pointer.
For write-write conflicts, similar to read-write conflicts, the persistence overhead in a write operation can also delay the locking of other write operations. The data structure allows concurrent writes to different keys of the same leaf node by handling the chain layer at element granularity beneath the leaf node level. Two kinds of linked list nodes are involved. The first are newly generated nodes that have not yet been linked into the linked list; these can be written and persisted freely without coordination. The second are nodes being modified, including nodes being inserted, deleted, or updated; an atomic CAS operation changes the state of such linked list nodes, which is made possible by decoupling the concurrency control of the linked-list layer from that of the array layer. In particular, nodes that are not yet reachable can be accessed freely. An insert operation involves two persistence operations: the first persists the newly generated linked list node together with its pointer to the successor node, and the second makes the node reachable in the linked list. Accordingly, an insert operation does not need to hold a lock while the linked list node is generated and persisted; the lock is needed only when the pointer of the predecessor node is updated to point to the new node and persisted.
First, each insert operation obtains the insertion position, i.e., the predecessor and successor linked list nodes, through version verification; the new linked list node's sibling pointer is connected to the successor node and the whole node is persisted. Second, the array-layer lock is acquired and it is determined whether the predecessor or successor node has been modified; if not, the pointer of the predecessor node is connected directly to the new node and persisted, and the array layer is updated using the version mechanism; otherwise the insert operation falls back to the traditional lock-based path. Finally the lock is released; in this way the persistence cost of the linked list node is removed from the locked path. By decoupling the concurrency control of the array layer and the chain layer, the chain layer could, in DRAM, realize a lock-free concurrency mechanism through a sequence of CAS instructions alone. However, CAS instructions do not guarantee persistent atomic writes in NVM. A persistent CAS must guarantee two things: first, the update of the shared variable is atomic; second, the cache line containing the shared variable is persisted to make the update durable. A volatile CAS causes incorrect behavior in persistent memory: when a concurrent read operation reads the value of the shared variable and issues a persistent write based on it, an inconsistency arises if a system failure occurs during that write. To ensure the consistency of concurrent operations, the data structure therefore requires persistent CAS semantics when operating on linked list nodes: while a modification is not yet visible at the leaf layer, visibility is controlled through the embedded micro-lock, and the persistent CAS is realized by decoupling the atomicity of the chain layer from the persistence-before-visibility of the array layer. For each insert operation, first the predecessor and successor of the target are determined, then the new linked list node is pointed at the successor and persisted; second, a CAS atomically modifies the sibling pointer of the predecessor node, which is then persisted. The newly inserted element becomes visible only when it is inserted into the upper layer; if the CAS instruction fails, execution restarts from the first step, and otherwise the element is inserted into the upper-layer node using the lock-based mechanism and becomes globally visible.
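A sketch of this CAS-based insertion path follows (hedged: the atomic right pointer, the retry policy, and reuse of the persist() helper from the earlier sketch are assumptions):

    #include <atomic>
    #include <cstdint>
    #include <cstddef>

    struct ListNode {
        uint64_t key, value;
        std::atomic<ListNode*> right;              // sibling pointer in NVM
    };
    void persist(const void* p, size_t len);       // flush + fence, as sketched earlier

    // Returns false if the predecessor changed (or was marked deleted), in which
    // case the caller re-locates the insert position and retries.
    bool cas_insert(ListNode* prev, ListNode* next, ListNode* fresh) {
        fresh->right.store(next, std::memory_order_relaxed);
        persist(fresh, sizeof(*fresh));            // durable but not yet reachable
        ListNode* expected = next;
        if (!prev->right.compare_exchange_strong(expected, fresh))
            return false;                          // someone modified prev: retry
        persist(&prev->right, sizeof(ListNode*));  // make the new link durable too
        return true;   // caller then inserts the key into the tree layer (locked)
    }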
Further, for each delete operation, the node to be deleted is first located and logically deleted by atomically setting its delete marker with a CAS instruction. Second, the node is physically deleted by modifying the pointer of the predecessor node to point to the successor node and persisting it. The data structure also uses CAS instructions to check whether the target node is being modified or deleted and whether the predecessor node has been modified. The concurrency control mechanism of each update operation that modifies an existing key is similar to that of a delete operation, except that the linked list node notifies other threads that an update is in progress by setting the update bit.
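Continuing the previous sketch, one possible shape of the two-phase delete is shown below; packing the delete marker into the low bit of the right pointer is an assumption about how the "state stored in the right pointer" is encoded.

    // ListNode and persist() are as in the insertion sketch above.
    constexpr uintptr_t kDeleted = 0x1;            // assumed mark bit in the pointer

    bool cas_delete(ListNode* prev, ListNode* victim) {
        ListNode* succ = victim->right.load();
        if (reinterpret_cast<uintptr_t>(succ) & kDeleted)
            return false;                          // already logically deleted
        // Phase 1: logical delete; also blocks inserts after the victim.
        auto marked = reinterpret_cast<ListNode*>(
            reinterpret_cast<uintptr_t>(succ) | kDeleted);
        if (!victim->right.compare_exchange_strong(succ, marked))
            return false;                          // victim changed concurrently
        // Phase 2: physical delete; unlink and persist the predecessor's pointer.
        ListNode* expected = victim;
        if (prev->right.compare_exchange_strong(expected, succ))
            persist(&prev->right, sizeof(void*));
        return true;   // the key-value pair is then removed from the tree layer
    }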
Specifically, in the consistent main memory management mechanism of the data structure of the embodiment of the present invention, rather than calling the system main memory allocator for every allocation and release of a linked list node, a larger block of non-volatile main memory space is allocated from the system main memory allocator each time, and the address and length of this block are persisted in a persistent linked list; the allocated main memory space is then divided into main memory blocks of a specific size and maintained through a volatile free main memory block linked list for the main memory allocation and release operations of the chain layer. When the system is recovered, the recovery thread scans the metadata information on the persistent linked list and the nodes of the chain layer and determines which main memory blocks are in use and which are not, thereby rebuilding the volatile free main memory block linked list. Only after these small main memory blocks are exhausted is new main memory allocated again from the system main memory allocator.
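A rough sketch of this allocation scheme (the names, the 64-byte chunk size, and the Region record are assumptions; NVM primitives are represented by the persist() helper above):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    void persist(const void* p, size_t len);       // flush + fence, as above

    struct Region { void* addr; size_t len; Region* next; };  // persisted in NVM

    class ChainAllocator {
        static constexpr size_t kChunk = 64;       // one linked-list node per chunk
        std::vector<void*> free_list_;             // volatile free-block list (DRAM)
    public:
        // Take one large NVM region, persist its {addr, len} so it can never be
        // leaked, then carve the remainder into chunks on the volatile free list.
        void grow(void* region, size_t len, Region*& durable_head) {
            Region* r = static_cast<Region*>(region);
            *r = Region{region, len, durable_head};
            persist(r, sizeof(Region));
            durable_head = r;                      // real code persists this link too
            for (char* p = static_cast<char*>(region) + kChunk;
                 p + kChunk <= static_cast<char*>(region) + len; p += kChunk)
                free_list_.push_back(p);
        }
        void* alloc() {                            // assumes grow() was called first
            void* p = free_list_.back();
            free_list_.pop_back();
            return p;
        }
        void release(void* p) { free_list_.push_back(p); }
    };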
Specifically, the consistent main memory management mechanism of the data structure in the embodiment of the present invention correctly recycles released tree nodes and linked list nodes by maintaining a global epoch counter and three garbage collection linked lists. Before executing relevant operations, firstly, a working thread registers the existing epoch number, and for each deleted tree/linked list node, the thread places the deleted tree/linked list node into a corresponding garbage recycling linked list according to the current global epoch number. If the current epoch number is T, the deleted node is placed in the [ T mod 3] th garbage collection linked list, when the garbage collector wants to move the main memory blocks on the garbage collection linked list to the idle main memory block linked list, firstly, whether all the working threads are in the current epoch number is checked, and if the checking is successful, the global epoch number is incremented. By the method, all threads are ensured to be in the ranges of the epoch numbers T and T +1, so that the main memory blocks on the garbage recycling linked list corresponding to the epoch number T-1 are safely recycled.
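The epoch scheme might look roughly like this (a simplified, single-structure sketch; real worker registration and per-thread retire lists are omitted, and all names are assumptions):

    #include <atomic>
    #include <cstdint>
    #include <vector>

    class EpochGC {
        std::atomic<uint64_t> global_epoch_{0};
        std::vector<void*> retired_[3];                      // one list per (epoch mod 3)
        std::vector<std::atomic<uint64_t>*> worker_epochs_;  // registered workers
    public:
        void enter(std::atomic<uint64_t>& my_epoch) {        // before each tree operation
            my_epoch.store(global_epoch_.load());
        }
        void retire(void* node) {                            // on tree/list node delete
            retired_[global_epoch_.load() % 3].push_back(node);
        }
        // Advance only if every worker has observed epoch T; afterwards all
        // workers are in T or T+1, so the list for epoch T-1 is safe to reclaim.
        void try_advance() {
            uint64_t t = global_epoch_.load();
            for (auto* e : worker_epochs_)
                if (e->load() != t) return;
            global_epoch_.store(t + 1);
            reclaim(retired_[(t + 2) % 3]);                  // (t - 1) mod 3
        }
        void reclaim(std::vector<void*>& list) { list.clear(); /* free the blocks */ }
    };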
Specifically, according to the multithread recovery mechanism of the data structure in the embodiment of the present invention, when the system is normally shut down, all volatile internal tree nodes and garbage collectors are persisted to a specific location of the nonvolatile main memory, and after the system is restarted, all volatile internal tree nodes and garbage collectors are copied from the nonvolatile main memory to the DRAM by the recovery thread, so that the system restart process can be completed in a short time. And when the system is recovered after the system is abnormal, the recovery thread scans all the linked list nodes in an offline state, and reconstructs all the internal tree nodes and the garbage recoverer. Specifically, in the normal execution process of the system, a group of persistent trackers is used for recording the positions of some linked list nodes, when ten thousand insertion operations are executed, the trackers can randomly record the main memory address of a new linked list node and persist the main memory address to a reserved area of a nonvolatile main memory, and when the tracked linked list node is deleted, the corresponding tracker is also reset. The recovery process of the system mainly comprises two stages: firstly, in the first stage, the trackers are sequenced according to the key values of the linked list nodes recorded by the trackers and then distributed to the recovery threads, and each thread independently scans the linked list nodes of the disjoint link layers to reconstruct the data structure. Second, in the second phase, after the disjoint parts are reconstructed, a thread is used to build the parts into a complete data structure.
The embodiment of the invention takes the index data structure of a storage system in a nonvolatile memory scenario as the object of optimization. For current main-memory-based storage systems, it proposes introducing a layer of shadow leaf nodes based on a chained structure into a traditional B + tree, and adopts a data layout strategy based on hybrid main memory that stores the linked-list-based leaf nodes in NVM and the other parts in DRAM, thereby eliminating the persistence overhead caused by sorting and balancing operations. An embedded fine-grained lock and an optimistic write mechanism are designed for concurrency control between read-write operations and between write-write operations, respectively: the embedded fine-grained lock mechanism designs an update flag bit and a delete flag bit for each linked list node and removes unnecessary persistence delay from the version verification path of read operations, while the optimistic write mechanism separates the concurrency control of the tree nodes from that of the linked list nodes, further removing the persistence delay from the locking path of tree-node granularity and reducing concurrency conflicts between write operations. Optionally, the embodiment of the present invention further designs a persistent garbage collector to support consistency management of the nonvolatile main memory, and finally accelerates the recovery process of the data structure after a system crash through a multithreaded recovery technique.
The following describes the design and implementation of the multithreaded persistent B + tree data structure according to an exemplary embodiment.
As shown in fig. 2, the B + tree supporting multithreaded persistent concurrent access according to the embodiment of the present invention uses a hybrid DRAM and NVM main memory architecture. The part in DRAM is a tree structure similar to a conventional B + tree and is used for indexing at run time; the part in NVM is a linked-list-based data structure and stores all user data and their relationships. In the non-running state the system keeps only the linked-list structure located on the NVM; when the system restarts or recovers from a failure, the tree data structure in DRAM is reconstructed from the linked-list structure on the NVM, and at run time the tree data structure accelerates the concurrent indexing process.
In the embodiment of the present invention, a prefetch mechanism may be adopted to reduce access latency. Since a tree search starts from the root node and proceeds until the corresponding leaf node is found, all tree nodes on the search path must be read, and their access latency seriously affects the search performance of the whole tree. To solve this problem, in the embodiment of the invention, before each tree node is accessed, a prefetch instruction is executed to prefetch the whole tree node into the CPU cache, hiding the access latency of the whole tree node; the key array and the value array are stored in separate main memory spaces so that only the key array is prefetched, reducing the total amount of data of each prefetch operation.
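A minimal illustration of this prefetching idea (the 64-byte stride and the _MM_HINT_T0 hint are assumptions):

    #include <xmmintrin.h>   // _mm_prefetch
    #include <cstdint>

    // Prefetch only the key array of a tree node before searching it; values and
    // children stay out of the prefetch to keep the fetched volume small.
    void prefetch_keys(const uint64_t* keys, int count) {
        const char* p   = reinterpret_cast<const char*>(keys);
        const char* end = reinterpret_cast<const char*>(keys + count);
        for (; p < end; p += 64)
            _mm_prefetch(p, _MM_HINT_T0);          // pull one cache line toward L1
    }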
In embodiments of the invention, a SIMD mechanism is employed to speed up processing. Linear lookup operations are performed on a contiguous main memory space, so they can be accelerated using Single Instruction Multiple Data (SIMD) instructions. Most modern processors support SIMD instructions, which execute the same arithmetic or comparison operation on multiple data items simultaneously. For a lookup operation, a SIMD compare instruction compares the target key with multiple different keys at the same time. Similar optimization strategies are used in the sorting and balancing operations, so that multiple data items are moved simultaneously. The embodiment of the invention uses a 24-core Intel processor supporting 256-bit SIMD operations, so the data structure can compare 32 fingerprints at once, accelerating the search within leaf nodes.
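A hedged sketch of the 32-way fingerprint comparison with AVX2 (the hash function and layout are illustrative, not the patent's):

    #include <immintrin.h>   // AVX2
    #include <cstdint>

    static inline uint8_t fingerprint(uint64_t key) {       // assumed 1-byte hash
        return static_cast<uint8_t>((key * 0x9E3779B97F4A7C15ull) >> 56);
    }

    // Compare the target key's fingerprint against all 32 leaf fingerprints at
    // once; bit i of the result says slot i needs a full key comparison.
    uint32_t probe(const uint8_t fingerprints[32], uint64_t key) {
        __m256i fp  = _mm256_set1_epi8(static_cast<char>(fingerprint(key)));
        __m256i arr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(fingerprints));
        __m256i eq  = _mm256_cmpeq_epi8(fp, arr);
        return static_cast<uint32_t>(_mm256_movemask_epi8(eq));
    }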
In the embodiment of the invention, a version-number-based concurrency control mechanism and tree-node-granularity locks are adopted to ensure that write operations modifying different tree nodes can execute simultaneously. The structure of the version number is given in fig. 2: the version number is a 32-bit word in which the first bit indicates whether the node is locked, the second bit indicates whether it is the root node, the third bit indicates whether it is a leaf node, and the remaining 29 bits hold the incrementing version counter, which is incremented each time the state of the tree node changes.
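One plausible reading of this layout in code (the exact bit positions are an assumption consistent with the description above):

    #include <cstdint>

    constexpr uint32_t kLockBit     = 1u << 31;         // node currently locked/dirty
    constexpr uint32_t kRootBit     = 1u << 30;         // node is the root
    constexpr uint32_t kLeafBit     = 1u << 29;         // node is a leaf
    constexpr uint32_t kCounterMask = (1u << 29) - 1;   // incrementing counter

    inline bool     is_locked(uint32_t v) { return v & kLockBit; }
    inline uint32_t counter(uint32_t v)   { return v & kCounterMask; }
    inline uint32_t bumped(uint32_t v)    {             // +1, flag bits preserved
        return (v & ~kCounterMask) | ((counter(v) + 1) & kCounterMask);
    }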
Before modifying the tree node, applying for the lock of the tree node, then setting the version number as dirty, adding 1 to the version number after finishing the operation, and then releasing the lock of the tree node. For the query operation, the version number of the node is recorded before the tree node is read, the latest version number of the tree node is compared with the previously recorded version number after the read operation is completed, whether the tree node is modified by other operations in the reading process is judged, and if the version number is modified or locked, the read operation is executed again until the version number is verified.
In embodiments of the present invention, an excessive persistence delay may block other write operations to different key-value pairs of the same leaf node. Because every key-value pair in an array-structured tree node is strongly associated with its neighboring key-value pairs, any write operation may trigger an expensive balancing operation (modifying most key-value pairs of the same tree node), so it is difficult to design a lock of key-value-pair granularity; instead, an optimistic write mechanism is used to coordinate concurrent write operations that access different key-value pairs of the same leaf node and to remove the persistence delay from the locking path of the tree node.
In the embodiment of the invention, the index structure uses a mutual exclusion lock to serialize write operations that modify critical-section data, thereby avoiding concurrency conflicts between write operations; persistence operations that modify non-critical-section data can thus be removed from the mutex-locked path.
As shown in fig. 3 and 4, the data structure handles read-write conflicts and write-write conflicts as follows. For read-write conflicts, each insert operation first locates the insertion position to obtain the predecessor and successor linked list nodes. In the first step, a new linked list node is allocated, its right pointer is pointed at the successor node, and the whole linked list node is persisted. In the second step, the lock of the corresponding leaf node of the tree layer is acquired, and it is checked whether the predecessor and successor nodes have been modified. In the third step, if they have not been modified, the pointer of the predecessor node is pointed at the new node and persisted. In the fourth step, the key-value pair and the version number of the leaf node are updated. In the fifth step, if they have been modified, the insert operation is performed in the conventional lock-based manner and the lock is then released. Through this optimization strategy, persistence operations that do not modify the existing chain layer are removed from the locking path. It is worth noting that a system crash occurring between the first and third steps does not cause leakage of the non-volatile main memory.
In the embodiment of the present invention, a read operation obtains the pointer of the corresponding linked list node through version number verification as in the first, second, and third steps described above, reads the data of the linked list node as in the fourth step, and then checks the embedded flag bits of this node. If a flag bit is set to dirty, the linked list node being read is in an updated or deleted state: if the update flag bit is dirty, the read operation waits until the update operation is completed and persisted; if the delete flag bit is dirty, the read operation is redone from the root node. The difference between the basic read-write concurrency control mechanism and the one based on the embedded fine-grained lock is that the latter removes the persistence overhead of write operations to different key-value pairs of the same leaf node from the version number verification path of read operations.
In the embodiment of the invention, the chain layer can perform atomicity-guaranteed update operations using only a sequence of CAS operations, and the embedded fine-grained lock ensures that a modified key-value pair becomes visible only after its persistence operation is completed. Through these two techniques, the concurrency control mechanisms of the tree layer and the chain layer are separated: the persistent chain layer uses concurrency control at key-value-pair granularity, while the volatile tree layer uses a locking mechanism at tree-node granularity, so the persistence overhead of the chain layer is removed from the locking path of the tree node. Each insert operation locates the insertion position through version number verification to obtain the predecessor and successor nodes. In the first step, a new linked list node is allocated, pointed at the successor node, and then persisted. In the second step, the right pointer of the predecessor node is pointed at the new linked list node through a CAS instruction. Because the state of the predecessor node is stored in its right pointer, the CAS operation also prevents a new node from being inserted behind a deleted linked list node; note that conventional atomic operations alone cannot guarantee data persistence.
In the embodiment of the invention, the newly inserted linked list node can be seen only after being updated to the upper tree node, thereby avoiding the new node which is not persisted from being seen by other operations. The data structure, after persisting the pointers to the new linked-list node and the predecessor node, uses a conventional lock-based approach to insert the new key-value pair into the leaf node of the upper layer to be visible. For each delete operation, one existing key-value pair is deleted. Firstly, positioning a target linked list node, setting a deletion marker bit by using CAS operation, finishing logic deletion operation, and avoiding other threads from inserting any newly generated linked list node behind the node to be deleted, which may cause the loss of a new node. Second, the right pointer of the front node is atomically modified, pointed to the back node and persisted, the physical delete operation is completed, the CAS operation is used to check whether the target node is being deleted or updated, and whether the front node is being deleted. The data structure deletes the key-value pair from the upper tree node after completing the above operation. For each update operation, the value of an existing key-value pair is modified. Except for the target update node, the update operation does not affect other nodes of the link layer, so the concurrency control mechanism is very simple. By updating the flag bit, other threads are notified that this linked list node is in the process of being updated. FIG. 3(c) shows the concurrent execution of two insert operations, with the data structure described in embodiments of the present invention removing the persistence delay of the linked list hierarchy from the locking path at the tree node granularity.
In embodiments of the present invention, the consistency and persistence of the chain layer must also be ensured: in the event of a system crash, incomplete operations (e.g., insert and delete operations) may lose newly allocated linked list nodes and thus leak main memory space, and read operations may see a linked list node or tree node that is being deleted by another thread. To solve these problems, the data structure provides lightweight consistent main memory management and a persistent garbage collector.
In the embodiment of the invention, a relatively large nonvolatile main memory space is allocated from the system main memory each time; the address and length of this block are persisted into a persistent linked list, the allocated space is divided into main memory blocks of a specific size, and these blocks are maintained through a volatile free main memory block linked list and used for the main memory allocation and release operations of the chain layer. When the system recovers, a recovery thread scans the metadata information on the persistent linked list and the nodes of the chain layer, determines which main memory blocks are in use and which are not, and rebuilds the volatile free main memory block linked list accordingly.
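By way of a hedged example, the chunk-based main memory management could be sketched as follows in C++; the chunk size, block size, ChunkList layout, and the pm_system_alloc and persist helpers are all assumptions introduced here, not the disclosed implementation.

```cpp
// Assumed sketch: a large NVM chunk is reserved, its address and length are
// persisted to a chunk list, and the chunk is carved into fixed-size blocks
// kept on a volatile free list for chain-layer allocation and release.
#include <cstddef>
#include <cstdint>
#include <vector>

void persist(const void* addr, size_t len);      // hypothetical flush + fence wrapper
void* pm_system_alloc(size_t len);               // hypothetical system NVM allocator

constexpr size_t CHUNK_SIZE = 16 * 1024 * 1024;  // assumed chunk granularity
constexpr size_t BLOCK_SIZE = 64;                // assumed linked-list node size

struct ChunkRecord { void* addr; size_t len; };
struct ChunkList {                               // assumed to live at a fixed NVM location
    uint64_t count;
    ChunkRecord records[1024];
};

struct Allocator {
    ChunkList* chunks;                           // persistent chunk list (NVM)
    std::vector<void*> free_blocks;              // volatile free-block list (DRAM)

    void grow() {
        void* chunk = pm_system_alloc(CHUNK_SIZE);
        chunks->records[chunks->count] = {chunk, CHUNK_SIZE};
        persist(&chunks->records[chunks->count], sizeof(ChunkRecord));
        chunks->count += 1;                      // persist the count after the record
        persist(&chunks->count, sizeof(chunks->count));
        for (size_t off = 0; off + BLOCK_SIZE <= CHUNK_SIZE; off += BLOCK_SIZE)
            free_blocks.push_back(static_cast<char*>(chunk) + off);
    }

    void* alloc_block() {                        // used for new linked-list nodes
        if (free_blocks.empty()) grow();
        void* b = free_blocks.back();
        free_blocks.pop_back();
        return b;
    }

    void free_block(void* b) { free_blocks.push_back(b); }
};
```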
In the embodiment of the invention, read operations are prevented from seeing linked list nodes and tree nodes that have been deleted by other threads, and released tree nodes and linked list nodes are correctly reclaimed, by maintaining a global epoch counter and three garbage collection linked lists. Before executing an operation on the data structure, a worker thread first registers the current epoch number. For each deleted tree or linked list node, the thread places the deleted node into the garbage collection linked list corresponding to the current global epoch number: if the current epoch number is T, the deleted node is placed into the [T mod 3] garbage collection linked list. When the garbage collector wants to move the main memory blocks of a garbage collection linked list to the free main memory block linked list, it first checks whether all worker threads are in the current epoch; if the check succeeds, it increments the global epoch number. This guarantees that all threads are within epochs T and T+1, so the main memory blocks on the garbage collection linked list corresponding to epoch T-1 can be safely reclaimed.
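The following C++ sketch illustrates, under assumed names and a fixed thread bound, how the global epoch counter and the three garbage collection linked lists described above could interact; it is an illustration, not the patented implementation.

```cpp
// Assumed sketch: a global epoch, per-thread registered epochs, and three retire
// lists indexed by epoch mod 3; the (T-1) mod 3 list becomes safe to reclaim
// once every active thread is observed in the current epoch.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr int MAX_THREADS = 64;                     // assumed upper bound on worker threads

std::atomic<uint64_t> global_epoch{1};
std::atomic<uint64_t> thread_epoch[MAX_THREADS];    // 0 means "not inside an operation"
std::vector<void*> retire_list[3];                  // the three garbage-collection lists
std::mutex retire_mutex[3];

void enter_operation(int tid) {                     // register before any tree/list operation
    thread_epoch[tid].store(global_epoch.load(std::memory_order_acquire),
                            std::memory_order_release);
}
void exit_operation(int tid) { thread_epoch[tid].store(0, std::memory_order_release); }

void retire(void* node) {                           // called for each logically deleted node
    uint64_t e = global_epoch.load(std::memory_order_acquire);
    std::lock_guard<std::mutex> g(retire_mutex[e % 3]);
    retire_list[e % 3].push_back(node);
}

// Returns the list that is now safe to move to the free-block list, or nullptr.
std::vector<void*>* try_advance_epoch() {
    uint64_t e = global_epoch.load(std::memory_order_acquire);
    for (int t = 0; t < MAX_THREADS; ++t) {
        uint64_t te = thread_epoch[t].load(std::memory_order_acquire);
        if (te != 0 && te != e) return nullptr;     // some thread is still in an old epoch
    }
    global_epoch.store(e + 1, std::memory_order_release);
    return &retire_list[(e + 2) % 3];               // the (e - 1) mod 3 list is now safe
}
```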
In the embodiment of the invention, a multithreaded recovery mechanism is adopted to accelerate the system recovery process, and all volatile internal tree nodes and the garbage collector are persisted to a specific location of the nonvolatile main memory. When the system restarts after a normal shutdown, a recovery thread copies all volatile internal tree nodes and the garbage collector from the nonvolatile main memory into DRAM, and the restart completes within a short time. When the system recovers from an abnormal shutdown, the recovery threads scan all linked list nodes offline and reconstruct all internal tree nodes and the garbage collector. Specifically, during normal execution a group of persistent trackers records the positions of some linked list nodes: for roughly every ten thousand insertion operations, a tracker records the main memory address of a new linked list node and persists it to a reserved area of the nonvolatile main memory, and when a tracked linked list node is deleted, the corresponding tracker is reset.
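As an illustration only, the persistent trackers might be sketched as below; the reserved-area size, the slot layout, the sampling rate constant, and the persist helper are assumptions.

```cpp
// Assumed sketch: roughly once every 10,000 insertions the address of a new
// linked-list node is recorded in a reserved NVM area; the slot is cleared
// again if the tracked node is later deleted.
#include <atomic>
#include <cstddef>
#include <cstdint>

void persist(const void* addr, size_t len);          // hypothetical flush + fence wrapper

constexpr size_t NUM_TRACKERS = 4096;                // assumed size of the reserved area
struct TrackerArea {                                 // assumed to live at a fixed NVM offset
    std::atomic<uint64_t> slots[NUM_TRACKERS];       // main-memory addresses of list nodes
};

TrackerArea* trackers;                               // mapped from NVM at startup
std::atomic<uint64_t> insert_count{0};
std::atomic<uint64_t> next_slot{0};

void maybe_track(void* new_node) {                   // called on the insert path
    if (insert_count.fetch_add(1, std::memory_order_relaxed) % 10000 != 0) return;
    uint64_t slot = next_slot.fetch_add(1, std::memory_order_relaxed) % NUM_TRACKERS;
    trackers->slots[slot].store(reinterpret_cast<uint64_t>(new_node),
                                std::memory_order_release);
    persist(&trackers->slots[slot], sizeof(uint64_t));
}

void untrack_if_tracked(void* node) {                // called when a node is deleted
    uint64_t addr = reinterpret_cast<uint64_t>(node);
    for (size_t i = 0; i < NUM_TRACKERS; ++i) {
        uint64_t expected = addr;
        if (trackers->slots[i].compare_exchange_strong(expected, 0)) {
            persist(&trackers->slots[i], sizeof(uint64_t));
            return;
        }
    }
}
```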
When the system recovers, the recovery process comprises two stages. In the first stage, the trackers are sorted according to the key values of the linked list nodes they record and are then distributed to the recovery threads; each recovery thread independently scans a disjoint portion of the chain layer's linked list nodes and reconstructs its part of the data structure. In the second stage, after the disjoint parts have been reconstructed, a single thread assembles them into the complete data structure. This mechanism effectively reduces conflicts among threads.
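A hedged C++ sketch of the two-stage recovery follows; rebuild_range, merge_subtrees, the SubTree type, and the partitioning scheme are hypothetical stand-ins for the steps described above.

```cpp
// Assumed sketch: trackers are sorted by the keys of the nodes they record and
// split among recovery threads, each thread rebuilds the tree over its disjoint
// key range, and a final single-threaded pass stitches the partial trees together.
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

struct ListNode { uint64_t key; ListNode* right; };
struct SubTree;                                           // partial volatile tree (assumed type)

SubTree* rebuild_range(ListNode* begin, ListNode* end);   // hypothetical: scan and index [begin, end)
SubTree* merge_subtrees(std::vector<SubTree*>& parts);    // hypothetical: single-threaded merge

SubTree* recover(std::vector<ListNode*> tracked, unsigned nthreads) {
    // Stage 1: sort tracked nodes by key and rebuild disjoint segments in parallel.
    std::sort(tracked.begin(), tracked.end(),
              [](ListNode* a, ListNode* b) { return a->key < b->key; });
    std::vector<SubTree*> parts(nthreads, nullptr);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < nthreads; ++i) {
        ListNode* begin = tracked[i * tracked.size() / nthreads];
        ListNode* end   = (i + 1 < nthreads)
                              ? tracked[(i + 1) * tracked.size() / nthreads]
                              : nullptr;                  // nullptr = scan to the list tail
        workers.emplace_back([&, i, begin, end] { parts[i] = rebuild_range(begin, end); });
    }
    for (auto& w : workers) w.join();

    // Stage 2: a single thread joins the partial trees into the full data structure.
    return merge_subtrees(parts);
}
```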
With the multithreaded persistent B + tree data structure design and implementation method of the embodiment of the invention, the hybrid main memory data structure combining nonvolatile memory and volatile memory provides balanced search operations with good spatial locality while effectively reducing expensive persistence operations; the embedded fine-grained lock and the optimistic writing mechanism solve the problem of amplified lock overhead; and the multithreaded recovery mechanism together with the persistent garbage collector supports consistency management of the nonvolatile main memory and accelerates the system recovery process of the data structure.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples, and features of different embodiments or examples, described in this specification can be combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A multithreading persistent B + tree data structure design and implementation method is characterized by comprising the following steps:
introducing a layer of shadow leaf nodes based on a chain structure into a preset B + tree;
storing the linked-list-based leaf nodes in NVM and the other portions of the index data structure in DRAM through a data layout policy based on hybrid main memory, so as to generate a tree layer based on an array structure and a chain layer based on a linked list structure, such that the persistence overhead of balancing and ordering is avoided by the design of a hierarchical volatile tree structure and persistent linked list structure;
designing an embedded fine-grained locking mechanism and an optimistic writing mechanism respectively for concurrency control between read and write operations and between write and write operations, wherein the optimistic writing mechanism separates the concurrency control mechanisms of tree nodes and linked list nodes so as to remove the persistence delay from the locking path at tree node granularity.
2. The method of claim 1, wherein the embedded fine-grained lock mechanism is configured to design an update flag bit and a delete flag bit for each linked list node to remove persistence delays that do not satisfy a predetermined condition from a version verification path of a read operation.
3. The method of claim 1, wherein each node of the tree layer in the DRAM is capable of holding a preset number of key-value pairs, and each key-value pair of a tree node points to a next-level tree node or a linked list node, so that when the number of key-value pairs of any tree node exceeds or falls below a preset threshold, the tree node performs a split or merge operation, and one key-value pair is accordingly inserted into or deleted from the upper-level tree node.
4. The method of claim 1, wherein the chain layer is an ordered linked list, each linked list node stores only one key-value pair and the nodes are connected by right pointers, and CPU atomic operations are used to ensure the atomicity and consistency of insertion, deletion and update operations.
5. The method of claim 1, wherein each tree operation searches from the root node until the corresponding leaf node is found, wherein a prefetch instruction is executed before any tree node is accessed so that the whole tree node is read into the CPU cache to hide its access latency, and the key array and the value array are stored in separate main memory spaces so that only the key array is prefetched, reducing the total amount of data of each prefetch operation.
6. The method of claim 1, wherein a key array size within a preset threshold is selected and a linear lookup operation is used in place of a binary lookup operation, the linear lookup operation being performed in main memory and accelerated by SIMD instructions, wherein each key-value pair is provided with a 1B fingerprint, each fingerprint is a hash value of the corresponding key, and the fingerprint array is stored at the head of the leaf node.
7. The method of claim 1, wherein,
1) for conflicts between read and write operations, a version-number-based concurrency control mechanism is adopted, wherein a version number counter is maintained on each tree node and the version number is increased whenever the state of the tree node changes; for an insert, delete or update operation, a lock is applied before the tree node is modified and the corresponding version number is set to dirty, and after the operation completes and the version number is increased by 1, the lock of the corresponding tree node is released; if the version number has been modified or is locked, the read operation repeats process 1) until version number verification passes;
2) for conflicts between write operations, a locking mechanism at tree node granularity is adopted, wherein locks at tree node granularity ensure that write operations modifying different tree nodes execute concurrently; leaf nodes are connected by right pointers, the split direction of a leaf node is restricted to be from left to right only, and locks on tree nodes are applied from the bottom up; when a tree node splits or is deleted, the lock of the upper-layer tree node is applied for; and because linked list nodes and the key-value pairs of leaf nodes have a one-to-one correspondence, a write operation can modify a linked list node only after obtaining the lock of the corresponding leaf node in the tree layer.
8. The method of claim 1, wherein, before linked list nodes are allocated and released, a nonvolatile main memory space is allocated from the system main memory allocator each time, and the address and length of the nonvolatile main memory space are persisted in a persistent linked list; the allocated main memory space is divided into main memory blocks of a preset size, which are maintained through a volatile free main memory block linked list for the main memory allocation and release operations of the chain layer; and when the system recovers, a recovery thread scans the metadata information on the persistent linked list and the nodes of the chain layer, determines which main memory blocks are in use and which are not, and reconstructs the volatile free main memory block linked list.
9. The method of claim 1, further comprising:
and correctly reclaiming the released tree nodes and linked list nodes by maintaining a global epoch counter and three garbage collection linked lists, wherein, before executing the related operations, a worker thread first registers the current epoch number, and for each deleted tree or linked list node, the worker thread places the deleted node into the corresponding garbage collection linked list according to the current global epoch number.
10. The method of claim 1, further comprising:
when the system is shut down normally, all volatile internal tree nodes and the garbage collector are persisted to a preset location of the nonvolatile main memory, and after the system restarts, a recovery thread copies all volatile internal tree nodes and the garbage collector from the nonvolatile main memory into the DRAM.
CN201811129623.3A 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method Active CN109407979B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811129623.3A CN109407979B (en) 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811129623.3A CN109407979B (en) 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method

Publications (2)

Publication Number Publication Date
CN109407979A CN109407979A (en) 2019-03-01
CN109407979B true CN109407979B (en) 2020-07-28

Family

ID=65465484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811129623.3A Active CN109407979B (en) 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method

Country Status (1)

Country Link
CN (1) CN109407979B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10996975B2 (en) * 2019-08-22 2021-05-04 Micron Technology, Inc. Hierarchical memory systems
CN112286928B (en) * 2019-09-16 2023-11-28 重庆傲雄在线信息技术有限公司 Chain type storage system
CN110825734B (en) * 2019-10-09 2023-04-28 上海交通大学 Concurrent updating method of balance tree and read-write system
CN111159056A (en) * 2019-12-11 2020-05-15 上海交通大学 Scalable memory allocation method and system for nonvolatile memory
CN111352860B (en) * 2019-12-26 2022-05-13 天津中科曙光存储科技有限公司 Garbage recycling method and system in Linux Bcache
CN111274456B (en) * 2020-01-20 2023-09-12 中国科学院计算技术研究所 Data indexing method and data processing system based on NVM (non-volatile memory) main memory
CN111309258B (en) * 2020-02-14 2021-10-15 苏州浪潮智能科技有限公司 B + tree access method and device and computer readable storage medium
CN111611246B (en) * 2020-05-25 2023-04-25 华中科技大学 Method and system for optimizing index performance of B+ tree based on persistent memory
CN111651455A (en) * 2020-05-26 2020-09-11 上海交通大学 Efficient concurrent index data structure based on machine learning
CN114328500A (en) * 2020-09-29 2022-04-12 北京金山云网络技术有限公司 Data access method, device, equipment and computer readable storage medium
CN112543237B (en) * 2020-11-27 2023-07-11 互联网域名系统北京市工程研究中心有限公司 Lock-free DNS caching method and DNS server
CN112612803B (en) * 2020-12-22 2022-07-12 浙江大学 Key value pair storage system based on persistent memory and data concurrent insertion method
CN112732725B (en) * 2021-01-22 2022-03-25 上海交通大学 NVM (non volatile memory) hybrid memory-based adaptive prefix tree construction method, system and medium
CN112947856A (en) * 2021-02-05 2021-06-11 彩讯科技股份有限公司 Memory data management method and device, computer equipment and storage medium
CN113656444B (en) * 2021-08-26 2024-02-27 友安云(厦门)数据科技有限公司 Data persistence method, server and management equipment
CN115905246B (en) * 2023-03-14 2023-05-09 智者四海(北京)技术有限公司 KV caching method and device based on dynamic compression prefix tree
CN116719832B (en) * 2023-08-07 2023-11-24 金篆信科有限责任公司 Database concurrency control method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103765381A (en) * 2011-08-29 2014-04-30 英特尔公司 Parallel operation on B+ trees
KR20140070834A (en) * 2012-11-28 2014-06-11 연세대학교 산학협력단 Modified searching method and apparatus for b+ tree
CN104881371A (en) * 2015-05-29 2015-09-02 清华大学 Persistent internal memory transaction processing cache management method and device
CN105930280A (en) * 2016-05-27 2016-09-07 诸葛晴凤 Efficient page organization and management method facing NVM (Non-Volatile Memory)
CN106775435A (en) * 2015-11-24 2017-05-31 腾讯科技(深圳)有限公司 Data processing method, device and system in a kind of storage system
CN107273443A (en) * 2017-05-26 2017-10-20 电子科技大学 A kind of hybrid index method based on big data model metadata
CN107463447A (en) * 2017-08-21 2017-12-12 中国人民解放军国防科技大学 B + tree management method based on remote direct nonvolatile memory access

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9037557B2 (en) * 2011-02-28 2015-05-19 International Business Machines Corporation Optimistic, version number based concurrency control for index structures with atomic, non-versioned pointer updates
CN103268291B (en) * 2013-05-23 2016-02-24 清华大学 The method of persistence index metadata is postponed in flash-memory storage system
US10275164B2 (en) * 2015-05-27 2019-04-30 Nutech Ventures Enforcing persistency for battery-backed mobile devices

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于非易失性存储器的存储系统技术研究进展;舒继武等;《科技导报》;20160728(第14期);86-92 *

Also Published As

Publication number Publication date
CN109407979A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109407979B (en) Multithreading persistent B + tree data structure design and implementation method
Wang et al. Easy lock-free indexing in non-volatile memory
US8250047B2 (en) Hybrid multi-threaded access to data structures using hazard pointers for reads and locks for updates
CN109407978B (en) Design and implementation method of high-concurrency index B + linked list data structure
Benson et al. Viper: An efficient hybrid pmem-dram key-value store
Levandoski et al. High performance transactions in deuteronomy
US9047351B2 (en) Cluster of processing nodes with distributed global flash memory using commodity server technology
US9047334B1 (en) Merge-update for efficient atomic memory modification in concurrent computer systems
CA2532054C (en) Ownership reassignment in a shared-nothing database system
US6970991B2 (en) Computer system with virtual memory and paging mechanism
Bohannon et al. The architecture of the Dali main-memory storage manager
Feldman et al. A wait-free multi-word compare-and-swap operation
US20100185703A1 (en) Lock-free hash table based write barrier buffer for large memory multiprocessor garbage collectors
CN112597254B (en) Hybrid DRAM-NVM (dynamic random Access memory-non volatile memory) main memory oriented online transactional database system
CN110515705B (en) Extensible persistent transactional memory and working method thereof
Kim et al. {ListDB}: Union of {Write-Ahead} logs and persistent {SkipLists} for incremental checkpointing on persistent memory
Wang et al. Persisting RB-Tree into NVM in a consistency perspective
Ramalhete et al. Efficient algorithms for persistent transactional memory
Li et al. Phast: Hierarchical concurrent log-free skip list for persistent memory
Zou et al. A write-optimal and concurrent persistent dynamic hashing with radix tree assistance
Moreno et al. On the implementation of memory reclamation methods in a lock-free hash trie design
Jin et al. SAL-hashing: a self-adaptive linear hashing index for SSDs
CN111752685B (en) Persistent memory transaction submitting method under multi-core architecture
Chen et al. Lock-free high-performance hashing for persistent memory via PM-aware holistic optimization
Marathe et al. Efficient nonblocking software transactional memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant