CN109407979A - Design and implementation method of a multithreaded persistent B+-tree data structure - Google Patents

Design and implementation method of a multithreaded persistent B+-tree data structure

Info

Publication number
CN109407979A
CN109407979A (application CN201811129623.3A)
Authority
CN
China
Prior art keywords
node
tree
persistence
linked list
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811129623.3A
Other languages
Chinese (zh)
Other versions
CN109407979B (en)
Inventor
舒继武
陆游游
胡庆达
刘昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201811129623.3A priority Critical patent/CN109407979B/en
Publication of CN109407979A publication Critical patent/CN109407979A/en
Application granted granted Critical
Publication of CN109407979B publication Critical patent/CN109407979B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F 12/023 Free address space management
    • G06F 12/0238 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F 12/0246 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0673 Single storage device
    • G06F 3/0679 Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a design and implementation method for a multithreaded persistent B+-tree data structure. The method comprises: introducing into a preset B+ tree a layer of shadow leaf nodes based on a linked-list structure; following a data layout strategy for hybrid main memory, storing the linked-list-based leaf nodes in NVM to form a list-based link layer, and storing the remaining parts of the index data structure in DRAM to form an array-based tree layer, so that the layered design of a volatile tree structure over a persistent list structure avoids the persistence overhead of balancing and sorting; and designing an embedded fine-grained locking mechanism and an optimistic write mechanism for concurrency control between read and write operations and between write operations, respectively. The method uses a hybrid main-memory data structure of non-volatile memory and volatile memory, increases the concurrency of data retrieval while providing durable data storage, solves the problem of amplified lock overhead, and accelerates recovery of the data structure after a system failure.

Description

Design and implementation method of a multithreaded persistent B+-tree data structure
Technical field
The present invention relates to the technical field of non-volatile main memory storage, and in particular to a design and implementation method for a multithreaded persistent B+-tree data structure.
Background technique
Non-volatile memory (NVM) is a new class of memory/storage media with byte addressability, retention of information across power loss, high storage density, no need for dynamic refresh, and low static power consumption. It also has drawbacks, such as asymmetric read/write performance, limited write endurance, and relatively high write energy. Its emergence brings great opportunities and challenges to the storage field and has triggered a research boom in industry and academia on heterogeneous hybrid memory architectures and the related system software. Non-volatile memory has many new implications for computer architecture, system software, software libraries, and application programs. NVM devices can be combined with dynamic random access memory (DRAM) devices to form a hybrid main memory, in which an application keeps temporary data in DRAM and stores data that must be durable on NVM. The advent of non-volatile main memory has prompted researchers to design memory-based storage systems, including file systems and database systems. The index structure is a key module of a storage system and to a large extent determines its performance. In a storage system based on non-volatile main memory, the index structure must simultaneously guarantee efficient consistency and multithreaded scalability, which poses new challenges to its designers.
In traditional index data structures such as the B+ tree, sorting and balancing operations account for a very large share of the overall cost of tree operations. Worse, persistence latency further increases the time for which a tree operation holds a lock, so the persistent B+ trees of the related art face severe performance problems in multithreaded scenarios. Under multithreading, as the persistence latency of non-volatile main memory grows, the lock-holding time of tree operations grows roughly linearly, and B+-tree performance degrades drastically.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, an object of the present invention is to provide a design and implementation method for a multithreaded persistent B+-tree data structure. The method uses a hybrid main-memory data structure of non-volatile memory and volatile memory, increases the concurrency of data retrieval while providing durable data storage, solves the problem of amplified lock overhead, and accelerates recovery of the data structure.
To achieve the above objects, an embodiment of the present invention proposes a design and implementation method for a multithreaded persistent B+-tree data structure, comprising the following steps: introducing into a preset B+ tree a layer of shadow leaf nodes based on a linked-list structure; following a data layout strategy for hybrid main memory, storing the linked-list-based leaf nodes in NVM to form a list-based link layer, and storing the remaining parts of the index data structure in DRAM to form an array-based tree layer, so that the layered design of a volatile tree structure over a persistent list structure avoids the persistence overhead of balancing and sorting; and designing an embedded fine-grained locking mechanism and an optimistic write mechanism for concurrency control between read and write operations and between write operations, respectively.
In the design and implementation method of the embodiment of the present invention, the hybrid main-memory data structure of non-volatile and volatile memory gives search operations good spatial locality and balance while effectively reducing expensive persistence operations; the embedded fine-grained lock and the optimistic write mechanism solve the problem of amplified lock overhead; and a multithreaded recovery mechanism together with a persistent garbage collector supports consistency management of non-volatile main memory and accelerates recovery of the data structure.
In addition, the design and implementation method according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the embedded fine-grained locking mechanism designs an update flag bit and a deletion flag bit for each list node, removing persistence latency that does not satisfy a preset condition from the version-validation path of read operations; the optimistic write mechanism decouples the concurrency control of tree nodes from that of list nodes, removing persistence latency from the locking path at tree-node granularity.
Further, in one embodiment of the present invention, in the array-based tree layer located in DRAM, each node can hold a preset number of key-value pairs, wherein each key-value pair of a tree node points to a tree node of the next layer or to a list node; when the number of key-value pairs in a tree node exceeds or falls below a preset threshold, the tree node performs a split or merge operation, inserting or deleting one key-value pair in the tree node of the layer above.
Further, in one embodiment of the present invention, for the list-based link layer located in NVM, the link layer is stored in non-volatile main memory, wherein the link layer is an ordered linked list; each list node stores exactly one key-value pair and is connected by a right pointer, and CPU atomic operations are used to guarantee the atomicity and consistency of insert, delete, and update operations.
Further, in one embodiment of the present invention, each tree operation searches from the root node until the corresponding leaf node is found, wherein before accessing any tree node a prefetch instruction is executed to read the entire tree node into the CPU cache, hiding the memory-access latency of the whole node; the key array and the value array are stored in separate regions of main memory, so that only the key array is prefetched, reducing the amount of data fetched per prefetch.
Optionally, for key arrays below a preset threshold size, linear search can replace binary search; the linear search is performed within main memory and accelerated with SIMD instructions, wherein each key-value pair is given a 1-byte fingerprint, each fingerprint is the hash of the corresponding key, and the fingerprint array is stored at the head of the leaf node.
Further, in one embodiment of the present invention, conflicts between read and write operations are handled with a version-number-based concurrency control mechanism, in which each tree node carries a version counter that is incremented whenever the node's state changes. An insert, delete, or update operation acquires the lock before modifying a tree node and marks the corresponding version number dirty; after the operation completes and the version number is incremented, the lock on the tree node is released. If the version number has been modified or is locked, a read operation repeats the above process until version validation succeeds. Conflicts between write operations are handled with a lock mechanism at tree-node granularity, which ensures that write operations modifying different tree nodes execute simultaneously. Leaf nodes are connected by right pointers, and leaf splits are constrained to proceed only from left to right; locks on tree nodes are acquired bottom-up, and the lock of the tree node one layer above is requested only when a node splits or merges. List nodes and the key-value pairs of leaf nodes are in one-to-one correspondence, so a write operation may modify a list node only after acquiring the lock on the corresponding leaf node in the tree layer.
Further, in one embodiment of the present invention, instead of allocating and freeing one list node at a time, a block of non-volatile main memory is allocated from the system allocator each time, and the address and length of this region are persisted into a persistent list; the allocated region is divided into memory blocks of a preset size and maintained through a volatile free-block list, serving allocation and free operations for the link layer. During system recovery, a recovery thread scans the metadata in the persistent list and the nodes of the link layer, determines which memory blocks are in use and which are not, and rebuilds the volatile free-block list.
Further, in one embodiment of the present invention, the method may further comprise: maintaining one global epoch counter and three garbage-collection lists to correctly reclaim freed tree nodes and list nodes, wherein before performing an operation a worker thread first registers the current epoch number, and each deleted tree or list node is placed by its thread into the corresponding garbage-collection list according to the current global epoch number.
Further, in one embodiment of the present invention, the method further comprises: on normal shutdown, persisting all volatile internal tree nodes and the garbage collector to a preset location in non-volatile main memory; after reboot, a recovery thread copies all volatile internal tree nodes and the garbage collector from non-volatile main memory back into DRAM.
Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from the description, or may be learned by practice of the invention.
Detailed description of the invention
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the design and implementation method for a multithreaded persistent B+-tree data structure according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the multithreaded persistent B+-tree structure based on a linked list according to an embodiment of the invention;
Fig. 3 illustrates the optimization strategies for read-write conflicts and write-write conflicts according to an embodiment of the invention;
Fig. 4 is an analysis chart of the limited multithreaded scalability of persistent B+ trees according to an embodiment of the invention.
Specific embodiment
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numbers throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended to explain the invention, and are not to be construed as limiting it.
The design and implementation method for a multithreaded persistent B+-tree data structure proposed according to embodiments of the present invention is described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the design and implementation method for a multithreaded persistent B+-tree data structure of one embodiment of the invention.
As shown in Fig. 1, the design and implementation method comprises the following steps:
In step S101, a layer of shadow leaf nodes based on a linked-list structure is introduced into a preset B+ tree.
Further, in one embodiment of the present invention, each tree operation searches from the root node until the corresponding leaf node is found, wherein before accessing any tree node a prefetch instruction is executed to read the entire tree node into the CPU cache, hiding the memory-access latency of the whole node; the key array and the value array are stored in separate regions of main memory, so that only the key array is prefetched, reducing the amount of data fetched per prefetch.
In step S102, following a data layout strategy for hybrid main memory, the linked-list-based leaf nodes are stored in NVM to form a list-based link layer, and the remaining parts of the index data structure are stored in DRAM to form an array-based tree layer, so that the layered design of a volatile tree structure over a persistent list structure avoids the persistence overhead of balancing and sorting.
Further, in one embodiment of the present invention, in the array-based tree layer located in DRAM, each node can hold a preset number of key-value pairs, wherein each key-value pair of a tree node points to a tree node of the next layer or to a list node; when the number of key-value pairs in a tree node exceeds or falls below a preset threshold, the tree node performs a split or merge operation, inserting or deleting one key-value pair in the tree node of the layer above.
Further, in one embodiment of the present invention, for the list-based link layer located in NVM, the link layer is stored in non-volatile main memory, wherein the link layer is an ordered linked list; each list node stores exactly one key-value pair and is connected by a right pointer, and CPU atomic operations are used to guarantee the atomicity and consistency of insert, delete, and update operations.
Optionally, for key arrays below a preset threshold size, linear search can replace binary search; the linear search is performed within main memory and accelerated with SIMD instructions, wherein each key-value pair is given a 1-byte fingerprint, each fingerprint is the hash of the corresponding key, and the fingerprint array is stored at the head of the leaf node.
In step S103, an embedded fine-grained locking mechanism and an optimistic write mechanism are designed for concurrency control between read and write operations and between write operations, respectively.
Further, in one embodiment of the present invention, the embedded fine-grained locking mechanism designs an update flag bit and a deletion flag bit for each list node, removing persistence latency that does not satisfy a preset condition from the version-validation path of read operations; the optimistic write mechanism decouples the concurrency control of tree nodes from that of list nodes, removing persistence latency from the locking path at tree-node granularity.
Further, in one embodiment of the present invention, conflicts between read and write operations are handled with a version-number-based concurrency control mechanism, in which each tree node carries a version counter that is incremented whenever the node's state changes. An insert, delete, or update operation acquires the lock before modifying a tree node and marks the corresponding version number dirty; after the operation completes and the version number is incremented, the lock on the tree node is released. If the version number has been modified or is locked, a read operation repeats the above process until version validation succeeds. Conflicts between write operations are handled with a lock mechanism at tree-node granularity, which ensures that write operations modifying different tree nodes execute simultaneously. Leaf nodes are connected by right pointers, and leaf splits are constrained to proceed only from left to right; locks on tree nodes are acquired bottom-up, and the lock of the tree node one layer above is requested only when a node splits or merges. List nodes and the key-value pairs of leaf nodes are in one-to-one correspondence, so a write operation may modify a list node only after acquiring the lock on the corresponding leaf node in the tree layer.
Further, in one embodiment of the present invention, instead of allocating and freeing one list node at a time, a block of non-volatile main memory is allocated from the system allocator each time, and the address and length of this region are persisted into a persistent list; the allocated region is divided into memory blocks of a preset size and maintained through a volatile free-block list, serving allocation and free operations for the link layer. During system recovery, a recovery thread scans the metadata in the persistent list and the nodes of the link layer, determines which memory blocks are in use and which are not, and rebuilds the volatile free-block list.
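An illustrative model of this coarse-grained allocation scheme, with integers standing in for NVM addresses: large regions are recorded in a persistent region list, carved into fixed-size blocks tracked by a volatile free list, and recovery rebuilds the free list from the set of blocks still referenced by live list nodes. The region and block sizes and all names here are assumptions for illustration, not the patented implementation.

```python
# Sketch of the link-layer allocator: persistent region metadata plus a
# volatile free-block list that recovery reconstructs by scanning live nodes.
REGION_SIZE, BLOCK_SIZE = 1024, 64     # illustrative sizes, not from the patent

persistent_regions = []                # survives a crash: (address, length) pairs
free_blocks = []                       # volatile: lost on crash, rebuilt on recovery

def grow():
    """Allocate one large region from the system allocator and persist its metadata."""
    base = len(persistent_regions) * REGION_SIZE
    persistent_regions.append((base, REGION_SIZE))            # persisted record
    free_blocks.extend(range(base, base + REGION_SIZE, BLOCK_SIZE))

def alloc():
    """Serve one list-node allocation from the volatile free-block list."""
    if not free_blocks:
        grow()
    return free_blocks.pop()

def recover(live_blocks):
    """Rebuild the volatile free list: every block inside a persisted region
    that is not referenced by a live link-layer node is considered free."""
    free_blocks.clear()
    for base, length in persistent_regions:
        for addr in range(base, base + length, BLOCK_SIZE):
            if addr not in live_blocks:
                free_blocks.append(addr)
```

In this model a crash can never leak a region (its address and length were persisted before use), and blocks whose node insertion never completed simply return to the free list on recovery.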
Further, in one embodiment of the present invention, the method may further comprise: maintaining one global epoch counter and three garbage-collection lists to correctly reclaim freed tree nodes and list nodes, wherein before performing an operation a worker thread first registers the current epoch number, and each deleted tree or list node is placed by its thread into the corresponding garbage-collection list according to the current global epoch number.
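A hedged sketch of such epoch-based reclamation with one global epoch counter and three garbage lists: a retired node is reclaimed only after the epoch has advanced past every registered worker, i.e. two epochs after it was retired. The exact advancement rule below is an assumption for illustration; the patent only specifies the counter and the three lists.

```python
# Epoch-based reclamation sketch: workers register an epoch on entry, retired
# nodes go into the list for the current epoch, and a list is recycled only
# once no registered worker can still observe nodes retired in it.
global_epoch = 0
garbage = {0: [], 1: [], 2: []}     # three reclamation lists, keyed by epoch % 3
active_epochs = {}                  # worker id -> epoch registered on entry

def enter(worker):
    active_epochs[worker] = global_epoch

def leave(worker):
    active_epochs.pop(worker, None)

def retire(node):
    garbage[global_epoch % 3].append(node)

def advance():
    """Advance the global epoch and reclaim the list from two epochs ago,
    which no currently registered worker can still reference."""
    global global_epoch
    if any(e < global_epoch for e in active_epochs.values()):
        return []                   # a straggler still works in an older epoch
    global_epoch += 1
    slot = (global_epoch + 1) % 3   # the slot filled two epochs earlier
    reclaimed, garbage[slot] = garbage[slot], []
    return reclaimed
```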
Further, in one embodiment of the present invention, the method further comprises: on normal shutdown, persisting all volatile internal tree nodes and the garbage collector to a preset location in non-volatile main memory; after reboot, a recovery thread copies all volatile internal tree nodes and the garbage collector from non-volatile main memory back into DRAM.
The embodiment of the present invention proposes a hybrid main-memory data structure using non-volatile memory and volatile memory: a traditional tree data structure is used in volatile memory, while a chained data structure is used in non-volatile memory. The tree structure increases the concurrency of data retrieval, and the linked structure provides durable data storage on the non-volatile medium; the tree offers search operations with good spatial locality and balance, while the linked structure effectively reduces expensive persistence operations. An embedded fine-grained lock and an optimistic write mechanism are also designed for this data structure, solving the problem of amplified lock overhead, while a multithreaded recovery mechanism and a persistent garbage collector support consistency management of non-volatile main memory and accelerate recovery of the data structure.
Specifically, the embodiment of the present invention proposes a data structure optimized for a hybrid main-memory storage system of non-volatile memory (NVM) and volatile memory (DRAM). The optimized data structure mainly has the following features: it consists of two levels, the first being an array-based tree layer (Tree Layer) stored in DRAM, and the second a list-based link layer (List Layer) stored in NVM. The link layer effectively reduces the persistence operations of the data structure, while the tree layer provides search operations with good spatial locality and balance.
The optimized data structure specifically includes the following features:
(1) The array-based tree layer located in DRAM, in which each node can hold a fixed number of key-value pairs. Sorted key-value pairs are stored in contiguous main memory, which guarantees good spatial locality and supports tree operations with O(log n) time complexity. Each key-value pair of a tree node points to a tree node of the next layer or to a list node. If the number of key-value pairs in a tree node exceeds or falls below a specific threshold, the node performs a split or merge operation, inserting or deleting one key-value pair in the tree node of the layer above, without incurring any persistence overhead. Because the tree layer serves only to accelerate searches over the link layer, the volatile tree layer can be rebuilt from the persistent link layer after a system failure; and because balancing and sorting occur only in DRAM, the method does not introduce excessive persistent write overhead and can effectively improve index performance.
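The two-level layout described above can be illustrated with a minimal sketch: a volatile sorted key array (standing in for the DRAM tree layer) is rebuilt from a persistent ordered linked list (standing in for the NVM link layer), and a lookup searches the volatile layer before touching a single list node. The Python below is an illustrative model only; all names (`ListNode`, `rebuild_tree_layer`, `lookup`) are assumptions, not the patented implementation.

```python
# Two-layer model: volatile sorted index (DRAM) over a persistent ordered list (NVM).
import bisect

class ListNode:
    """One key-value pair per node, linked by a right pointer (the NVM layer)."""
    def __init__(self, key, value, right=None):
        self.key, self.value, self.right = key, value, right

def build_list_layer(pairs):
    """Build the ordered list; in the real structure this layer is durable."""
    head = None
    for key, value in sorted(pairs, reverse=True):
        head = ListNode(key, value, head)
    return head

def rebuild_tree_layer(head):
    """Volatile index rebuilt from the persistent list, as done after a crash."""
    keys, nodes = [], []
    node = head
    while node is not None:
        keys.append(node.key)
        nodes.append(node)
        node = node.right
    return keys, nodes

def lookup(tree_layer, key):
    keys, nodes = tree_layer
    i = bisect.bisect_left(keys, key)   # O(log n) search in the volatile layer
    if i < len(keys) and keys[i] == key:
        return nodes[i].value           # final hop reads one list node
    return None
```

Note that splits, merges, and sorting affect only the volatile index, matching the claim that rebalancing incurs no persistence overhead.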
(2) The list-based link layer located in NVM; the link layer is stored only in non-volatile main memory. The link layer is an ordered linked list in which each list node stores exactly one key-value pair and is connected by a right pointer; CPU atomic operations (the 64-bit atomic operations that the x86 platform supports for aligned words) guarantee the atomicity and consistency of insert, delete, and update operations. Taking insertion as an example: after the correct insertion position is found, only two persistence operations are needed to keep the link layer consistent. The first persists the newly created list node (which already points to its successor); the second persists the pointer of the preceding list node (now pointing to the newly created node). If a system failure occurs between the two operations, the consistency of the link layer is unaffected, because the new node has not yet been linked into the list; for a node whose insertion did not complete, the persistent garbage collector prevents this block of main memory from being leaked. Because the link layer can hold an unbounded number of list nodes, balancing operations are eliminated entirely.
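The two-persistence-operation insertion order described above can be sketched as follows. This is an illustrative Python model, not NVM code: `persist` merely records where a cache-line flush and fence (e.g. `clwb`/`sfence` on x86) would occur, and linking the node models the single aligned pointer store that the hardware makes atomic.

```python
# Crash-consistent ordered-list insert: persist the new node first, then the
# predecessor pointer. Between the two persists the new node is unreachable,
# so a crash leaves the list consistent.
persist_log = []

def persist(label):
    """Stand-in for a cache-line flush + fence; records durability points."""
    persist_log.append(label)

class ListNode:
    def __init__(self, key, value, right=None):
        self.key, self.value, self.right = key, value, right

def insert(head, key, value):
    new = ListNode(key, value)
    if head is None or key < head.key:          # insert at the head
        new.right = head
        persist("new-node")                     # 1st persist: node + successor ptr
        head = new
        persist("predecessor-ptr")              # 2nd persist: link made durable
        return head
    prev = head
    while prev.right is not None and prev.right.key < key:
        prev = prev.right
    new.right = prev.right
    persist("new-node")                         # 1st persist
    prev.right = new                            # atomic aligned pointer store
    persist("predecessor-ptr")                  # 2nd persist
    return head
```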
(3) Every tree operation searches from the root node until the corresponding leaf node is found and must read all tree nodes on the search path, so the memory-access latency of tree nodes becomes the dominant factor in tree-layer performance. Before accessing a tree node, the embodiment of the present invention executes a prefetch instruction to read the entire node into the CPU cache, masking its memory-access latency; it also stores the key array and the value array in separate regions of main memory, so that only the key array is prefetched, reducing the amount of data per prefetch.
(4) For the tree data structure located in DRAM, for key arrays under a certain threshold size, linear search can replace binary search. Further, the linear search is carried out within main memory and accelerated with SIMD instructions: for a search, SIMD instructions compare the target key with multiple different keys simultaneously; a similar strategy is used for sorting and balancing, moving multiple entries at once to improve index performance.
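As an illustrative stand-in for the SIMD compare described in (4), the sketch below uses NumPy's vectorized equality to match the target key against every lane of a small key array at once; the function name and the approach are assumptions, not the patented instruction sequence.

```python
# Vectorized linear search: one "compare all lanes" operation plays the role
# of the SIMD key comparison over a small node-local key array.
import numpy as np

def simd_style_search(keys, target):
    """Return the index of target in keys, or -1; all comparisons happen
    in a single vectorized operation rather than an element-wise loop."""
    keys = np.asarray(keys)
    hits = np.nonzero(keys == target)[0]
    return int(hits[0]) if hits.size else -1
```

For node-sized arrays this kind of branch-free scan is exactly why a linear search below a threshold size can beat binary search.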
(5) On the leaf nodes of the data structure, each key-value pair is given a 1-byte fingerprint, where each fingerprint is the hash of the corresponding key, and the fingerprint array is stored at the head of the leaf node. During a lookup, the full key is compared only when the hash of the target key matches some fingerprint. Since each fingerprint is far smaller than each key, scanning the fingerprint array first adds little cost while filtering out most unnecessary key comparisons, further increasing effective throughput.
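The fingerprint filter of (5) can be modelled as follows: the 1-byte fingerprint array at the head of the leaf is scanned first, and a full key comparison is performed only on a fingerprint hit. The hash used here is an illustrative stand-in for the patent's fingerprint function.

```python
# Leaf lookup with a 1-byte fingerprint filter: cheap byte comparisons first,
# full key comparison only when a fingerprint matches.
def fingerprint(key):
    return hash(key) & 0xFF          # 1-byte hash of the key (illustrative)

class Leaf:
    def __init__(self):
        self.fps, self.keys, self.values = [], [], []   # fps stored at the head

    def insert(self, key, value):
        self.fps.append(fingerprint(key))
        self.keys.append(key)
        self.values.append(value)

    def lookup(self, key):
        fp = fingerprint(key)
        for i, f in enumerate(self.fps):          # scan 1-byte fingerprints
            if f == fp and self.keys[i] == key:   # full key checked on hit only
                return self.values[i]
        return None
```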
Further, the embodiment of the present invention describes the version-number-based concurrency control mechanism of the data structure. Its main content is as follows: for conflicts between read and write operations, version-number-based concurrency control is used; for conflicts between write operations, locking at tree-node granularity is used.
On the one hand, for conflicts between read and write operations, a version counter on each tree node serves as the communication medium between concurrently executing reads and writes, sparing every read operation the overhead of acquiring a lock. The version number is incremented whenever the node's state changes. An insert, delete, or update operation acquires the lock before modifying the tree node and marks the version number dirty; after the operation completes, it increments the version number and then releases the node's lock. If a read operation finds the version number modified or locked, it repeats the process until version validation succeeds.
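A minimal sketch of this per-node version protocol, under the assumption that the dirty mark is folded into an odd/even version counter (an assumption made for compactness; the patent describes a dirty mark alongside the version): writers lock and bump the version around the modification, while readers snapshot the version, read, and re-validate without ever locking.

```python
# Optimistic read with a per-node version counter: writers lock and increment
# the version twice (odd = dirty, even = stable); readers validate lock-free.
import threading

class VersionedNode:
    def __init__(self, value):
        self.value = value
        self.version = 0            # even = stable, odd = writer in progress
        self._lock = threading.Lock()

    def write(self, value):
        with self._lock:
            self.version += 1       # mark dirty before modifying
            self.value = value
            self.version += 1       # publish: version is even and larger

    def read(self):
        while True:
            v = self.version
            if v % 2:               # dirty: a writer holds the node, retry
                continue
            value = self.value
            if self.version == v:   # validation: unchanged since the snapshot
                return value
```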
On the other hand, for conflicts between write operations, tree-node-granularity locks ensure that write operations modifying different tree nodes can execute simultaneously, because a write operation only needs to hold the locks of the tree nodes it will modify. Each write operation reaches its target leaf node through version validation; leaf nodes are connected by right-sibling pointers, and leaf splits are required to proceed from left to right, which prevents a split from making a target key-value pair temporarily unfindable. Tree-node locks are then acquired bottom-up, and the lock of the parent tree node is requested only when a tree node splits or is deleted. Because linked-list nodes and leaf nodes have a one-to-one correspondence, a write operation modifies a linked-list node only after obtaining the lock of the corresponding leaf node in the tree layer. The concurrency control of the tree nodes therefore also resolves concurrency conflicts at the linked-list level.
Further, the embodiment of the present invention proposes a concurrency-control mechanism that, through version counters, supports optimistic reads. Under the optimistic-read mechanism, a read operation takes a snapshot of the current version without locking it, then reads the data and re-checks the version; if the version has not changed and is not flagged dirty, the read succeeds. Because readers require no lock bit, read concurrency is improved. For write-write conflicts, the data structure applies both versioning and a lock to each tree node: a write operation first locates the node to be written through a top-down read pass; once the target node is located, the write operation locks it and begins the write and persistence process. A write operation that requires rebalancing must acquire locks from the bottom of the tree upward; in this way the data structure holds locks only on the affected tree nodes rather than on the whole tree. For an insert operation: first, the lock bit is set and the version counter is incremented before the write and persistence begin; second, the write and persistence are applied to the leaf node; finally, the version counter is incremented again and the lock is released. For a read operation, the lookup and read are executed and the snapshot version is compared with the latest version; if the version is dirty or has changed, the read fails and restarts validation until it succeeds.
Specifically, for read-write conflicts: because a write operation marks the leaf-layer version dirty during the write and increments it only afterward, a long write (including its persistence) raises the probability that reads stall. A read operation, even one touching other keys, keeps retrying until the version number becomes clean, which raises the read-abort rate. The persistence latency can, however, be removed from the version-validation critical path. The detailed process lets the linked-list layer perform element-granularity control based on how its elements are organized: first, the data structure uses embedded bits comprising an update bit and a delete bit; second, the combination of the leaf node's array layer and the linked-list layer simultaneously supports optimistic reads and element-granularity write locking, with a version counter for each linked-list node, so the linked-list layer provides locking for each element being modified. On this basis, a linked-list node performs persistence under its embedded miniature lock without stalling reads of other keys. After persistence completes, the linked-list node updates the version number in the array node, so the persistence latency is removed from the version-validation critical path. In the embodiment of the present invention, the embedded bits are set for update and delete operations to indicate that the linked-list node is being modified; for inserts, the array layer is updated via the version mechanism after the linked-list node has been persisted. Finally, the data structure clears the embedded bit and unlocks the array node; the embedded bit serves only for scheduling and does not itself need to be persisted. For a delete operation, only the delete bit is set and the memory space reclaimed, preventing read operations from following a dangling pointer.
For write-write conflicts, similarly to read-write conflicts, the persistence overhead inside a write operation delays the release of the write lock. For the linked-list layer beneath the leaf layer, the data structure allows concurrent writes to different keys of the same leaf node. To achieve this, nodes fall into two classes. The first class consists of newly created insert nodes that have not yet been linked into the list; such nodes can be written and persisted freely. The second class consists of nodes being modified, including nodes being inserted, deleted, or updated; an atomic CAS operation changes the state of a linked-list node, made possible by decoupling the concurrency control of the linked-list layer from that of the array layer. Specifically, unlinked nodes may be accessed freely. An insert operation involves two persistence steps: the first persists the newly created linked-list node together with its pointer to the next node; the second makes the node reachable in the list. On this basis, an insert operation needs no lock while the new linked-list node is generated and persisted; a lock is needed only when the predecessor's pointer is updated to point at the new node and persisted.
First, each insert operation locates, through version validation, the predecessor and successor linked-list nodes of the insertion position; the new linked-list node's sibling pointer is connected to the successor node, and the entire node is persisted. Second, the array-layer lock is acquired and it is determined whether the predecessor or successor node has been modified; if not, the predecessor's pointer is linked directly to the new node, the pointer is persisted, and the array layer is updated via the version mechanism; otherwise, the insert falls back to the traditional lock-based path. Finally, the lock is released; the persistence cost of the linked-list node has thus been removed from the locking path. The concurrency control of the array layer and the linked-list layer is decoupled: in DRAM the linked-list layer realizes lock-free concurrency atomically through a series of CAS instructions. A CAS instruction alone, however, does not guarantee the atomicity of a persistent write to NVM. A persistent CAS must guarantee the following: first, an atomic update of the shared variable; second, persistence of the cache line containing the shared variable, so that the update itself is durable. A volatile CAS causes incorrect behavior on persistent memory: a concurrent read may observe the value of the shared variable and issue a persistent write based on it, and if the system crashes during that write, the system becomes inconsistent. To ensure the consistency of concurrent operations, the data structure requires the persistent CAS to wait for the persistence of the linked-list node; the modification is not visible to the leaf layer until then, with visibility realized through the embedded micro-lock, and the persistent CAS is realized by decoupling the atomicity of the linked-list layer from the persistence visibility of the array layer. For each insert operation: first, the predecessor and successor of the target are determined, the new linked-list node is pointed at the successor, and the node is persisted; second, the sibling pointer of the predecessor is modified with an atomic CAS and persisted. The newly inserted element becomes visible only when it is inserted into the upper layer; if the CAS instruction fails, execution restarts from the first step; otherwise the lock-based mechanism inserts the element into the upper-layer node and makes it globally visible.
Further, for each delete operation, the node to be deleted is located, and an atomic CAS instruction logically deletes it by setting its delete mark. The node is then physically deleted by modifying and persisting the pointer of the predecessor node so that it points at the successor node. The data structure also uses CAS instructions to check whether the target node is being modified or deleted, and whether the predecessor node is being modified. The concurrency control of each update operation, which modifies an existing key, is similar to that of a delete, except that the update bit of the linked-list node announces that an update is in progress.
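The two-phase delete above can be sketched as follows: a CAS-style operation first sets the delete mark (logical delete, so no new node can be linked behind the dying node), then the predecessor's next pointer is swung past it (physical delete, persisted on NVM in the real structure). The CAS is modeled with a per-node lock; all names are assumptions.

```python
import threading

class ListNode:
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.next = None
        self.deleted = False      # embedded delete mark
        self._lock = threading.Lock()

    def cas_mark_deleted(self):
        """Atomically set the delete mark; fail if already set."""
        with self._lock:
            if self.deleted:
                return False
            self.deleted = True   # logical delete
            return True

def delete(head, key):
    prev, node = head, head.next
    while node is not None and node.key != key:
        prev, node = node, node.next
    if node is None or not node.cas_mark_deleted():
        return False
    prev.next = node.next         # physical delete (persisted on NVM)
    return True
```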
Specifically, in the consistent main-memory management mechanism of the data structure of the embodiment of the present invention, before individual linked-list nodes are allocated or released, a larger block of non-volatile main-memory space is first allocated from the system heap allocator, and the address and length of this block are persisted onto a persistent list. The allocated space is then divided into main-memory blocks of a specific size, maintained through a volatile free-block list that serves the allocation and release operations of the linked-list layer. During system recovery, the recovery threads scan the metadata on the persistent list and the linked-list-layer nodes, determine which main-memory blocks are in use and which are not, and thereby rebuild the volatile free-block list. Only after all the small main-memory blocks have been used is new main memory allocated from the system heap allocator again.
Specifically, the consistent main-memory management mechanism of the data structure of the embodiment of the present invention correctly reclaims released tree nodes and linked-list nodes by maintaining one global epoch counter and three garbage-reclamation lists. Before executing an operation, a worker thread first registers the current epoch number. Each deleted tree or linked-list node is placed by the thread onto the corresponding garbage-reclamation list according to the current global epoch number: if the current epoch number is T, the deleted node is placed on garbage-reclamation list [T mod 3]. When the garbage collector wants to move the main-memory blocks on a garbage-reclamation list back onto the free-block list, it first checks whether all worker threads are already in the current epoch; only if the check succeeds is the global epoch number incremented. This method ensures that all threads are within the range of epochs T and T+1, so that the main-memory blocks on the garbage-reclamation list corresponding to epoch T-1 can be safely reclaimed.
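A simplified, single-process sketch of the epoch scheme above: a node retired at epoch T goes on list [T mod 3], and a list may be emptied only after every worker thread has re-registered at the current epoch, at which point the global epoch advances and the list for epoch T-1 becomes safe to reclaim. Class and method names are assumptions.

```python
class EpochGC:
    def __init__(self, n_threads):
        self.epoch = 0
        self.lists = [[], [], []]             # three garbage-reclamation lists
        self.thread_epochs = [0] * n_threads

    def register(self, tid):
        self.thread_epochs[tid] = self.epoch  # thread enters current epoch

    def retire(self, node):
        self.lists[self.epoch % 3].append(node)

    def try_reclaim(self):
        # all threads must be in the current epoch before advancing
        if any(e != self.epoch for e in self.thread_epochs):
            return []
        self.epoch += 1                       # threads now span [T, T+1]
        freed = self.lists[(self.epoch - 2) % 3]   # epoch T-1 list is safe
        self.lists[(self.epoch - 2) % 3] = []
        return freed
```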
Specifically, in the multithreaded recovery mechanism of the data structure of the embodiment of the present invention, upon a normal shutdown all volatile internal tree nodes and the garbage collector are persisted to a specific location of the non-volatile main memory; after restart, the recovery threads copy them from non-volatile main memory back into DRAM, completing the restart in a very short time. When recovering after a crash, the recovery threads scan all linked-list nodes offline and rebuild all internal tree nodes and the garbage collector. Specifically, during normal execution a group of persistent trackers records the positions of some linked-list nodes: every 10,000 insert operations, a tracker records the address of a randomly chosen new linked-list node and persists it to a reserved area of non-volatile main memory; when a tracked linked-list node is deleted, the corresponding tracker is reset. The recovery process consists of two stages: in the first stage, the trackers are sorted by the keys of the linked-list nodes they record and then distributed to the recovery threads, each of which independently scans a disjoint portion of the linked-list layer and rebuilds its part of the data structure; in the second stage, after the disjoint parts have been rebuilt, a single thread assembles them into one complete data structure.
The embodiment of the present invention takes the index data structure of a storage system under a non-volatile-memory scenario as the optimization target. For current main-memory-based storage systems, it proposes introducing into a traditional B+ tree one layer of shadow leaf nodes based on a linked-list structure, and adopts a data-layout strategy based on hybrid main memory: the linked-list-based leaf nodes are stored in NVM while the other parts are stored in DRAM, eliminating the persistence overhead caused by sorting and balancing operations. It designs an embedded fine-grained lock mechanism and an optimistic write mechanism for concurrency control between read and write operations and between write operations, respectively. The embedded fine-grained lock mechanism provides each linked-list node with an update mark bit and a delete mark bit, and this fine-grained concurrency control removes unnecessary persistence latency from the version-validation path of read operations; the optimistic write mechanism separates the concurrency control of tree nodes from that of linked-list nodes and further removes persistence latency from the tree-node-granularity locking path, reducing concurrency conflicts between write operations. Optionally, the embodiment of the present invention also designs a persistent garbage collector to support consistency management of the non-volatile main memory, and finally accelerates the recovery of the data structure after a system crash through a multithreaded recovery technique.
Next, the multithreaded persistent B+ tree data structure design and implementation method of the present invention is described in detail according to specific embodiments.
As shown in Fig. 2, the B+ tree supporting multithreaded persistent concurrent access of the embodiment of the present invention adopts a hybrid DRAM/NVM main-memory architecture: in DRAM it is a tree similar to a traditional B+ tree, serving as the runtime index, while in NVM it is a linked-list-based data structure storing all user data and its relationships. When not running, the system preserves only the list structure located on NVM; on restart or crash recovery, the tree structure in DRAM is reconstructed from the list structure on NVM, and at runtime the tree structure is used to accelerate concurrent indexing.
In the embodiment of the present invention, a prefetch mechanism reduces memory-access latency. A tree search starts at the root node and proceeds until the corresponding leaf node is found; since this process must read every tree node on the search path, the memory-access latency of these nodes seriously affects the search performance of the whole tree. To solve this problem, the embodiment of the present invention executes a prefetch instruction before each tree node is accessed, prefetching the entire tree node into the CPU cache to hide its memory-access latency. The key array and value array are buffered in separate main-memory regions and only the key array is prefetched, reducing the total amount of data per prefetch.
In the embodiment of the present invention, a SIMD mechanism accelerates processing. The linear search operates over a contiguous main-memory region, so it can be accelerated with Single Instruction Multiple Data (SIMD) instructions; most modern processors support SIMD instructions, which perform the same arithmetic or comparison operation on multiple data items at once. For a search operation, SIMD compare instructions compare the target key with multiple different keys simultaneously; a similar optimization is applied to sorting and balancing operations so that multiple data items can be moved at once. The 24-core Intel processor used in the embodiment of the present invention supports 256-bit SIMD operations, so the data structure can compare 32 one-byte fingerprints simultaneously, accelerating the search within a leaf node.
In the embodiment of the present invention, the version-number-based concurrency-control mechanism and tree-node-granularity locks ensure that write operations modifying different tree nodes can execute simultaneously. Fig. 2 shows the structure of the version number, which uses a 32-bit word: the first bit indicates whether the node is locked, the second whether it is the root node, and the third whether it is a leaf node; the remaining 29 bits form an incrementing version number, which is incremented whenever the state of the tree node changes.
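The 32-bit version word above can be sketched with bit masks; the exact bit positions (flags in the high bits here) are an assumption consistent with the description of one lock bit, one root bit, one leaf bit, and a 29-bit version.

```python
LOCK_BIT     = 1 << 31        # assumed position of the "locked" flag
ROOT_BIT     = 1 << 30        # assumed position of the "is root" flag
LEAF_BIT     = 1 << 29        # assumed position of the "is leaf" flag
VERSION_MASK = (1 << 29) - 1  # low 29 bits: incrementing version

def lock(word):
    return word | LOCK_BIT

def is_locked(word):
    return bool(word & LOCK_BIT)

def unlock_and_bump(word):
    """Release the lock and increment the 29-bit version (wrapping)."""
    flags = word & (ROOT_BIT | LEAF_BIT)
    ver = ((word & VERSION_MASK) + 1) & VERSION_MASK
    return flags | ver
```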
Before modifying a tree node, the lock of that tree node is acquired and the version number is marked dirty; after the operation completes, the version number is incremented by one and the lock of the tree node is released. For a query operation, the version number of the node is recorded before the tree node is read; after the read completes, the latest version number of the tree node is compared with the recorded one to judge whether the tree node was modified by another operation during the read. If the version number was modified or locked, the read operation re-executes the above process until version validation succeeds.
In the embodiment of the present invention, excessive persistence latency can block other write operations on different key-value pairs of the same leaf node. In an array-based tree node, each key-value pair is strongly related to its neighbors, and any write operation may trigger an expensive balancing operation that modifies most key-value pairs of the same tree node; it is therefore difficult to design key-value-pair-granularity locks to coordinate concurrent writes to different key-value pairs of the same leaf node. Instead, the optimistic write mechanism removes the persistence latency from the tree node's locking path.
In the embodiment of the present invention, the index structure serializes with a mutex those write operations that modify critical-section data, avoiding concurrency conflicts between write operations. The persistence operations that modify non-critical data can therefore be removed from the mutex path.
As shown in Fig. 3 and Fig. 4, the data structure handles read-write conflicts and write-write conflicts as follows. For read-write conflicts, each insert operation first locates the insertion position and obtains the predecessor and successor linked-list nodes. In the first step, a new linked-list node is allocated, its right pointer is pointed at the successor node, and the entire linked-list node is persisted. In the second step, the lock of the tree-layer leaf node is acquired, and it is judged whether the predecessor and successor nodes have been modified. In the third step, if they have not been modified, the predecessor's pointer is pointed at the new node and persisted. In the fourth step, the key-value pair and the version number of the leaf node are updated. In the fifth step, if they have been modified, the insert is executed in the traditional lock-based way, and the lock is released. Through this optimization strategy, the persistence operations of the linked-list layer are removed from the locking path. It is worth noting that a system crash between the first and third steps will not cause a leak of non-volatile main memory.
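The happy path of the stepwise insert above can be sketched with persistence modeled as a recorded event, making the ordering visible: the new node is built and persisted before the leaf lock is taken, and only the pointer swing happens under the lock. The `persist` log, the `Node` class, and the omission of the conflict fallback are all illustrative assumptions.

```python
import threading

persist_log = []

def persist(what):
    persist_log.append(what)   # stands in for clflush/sfence on NVM

class Node:
    def __init__(self, key=None, val=None):
        self.key, self.val, self.next = key, val, None

def insert_after(prev, key, val, leaf_lock):
    # Step 1: build and persist the new list node before linking it in
    node = Node(key, val)
    node.next = prev.next
    persist(("node", key))
    # Step 2: take the tree-layer leaf lock
    with leaf_lock:
        # Step 3: link the predecessor to the new node and persist the pointer
        prev.next = node
        persist(("ptr", key))
        # Step 4: update the DRAM leaf (key array + version) - elided here
    # Step 5 (fallback to the lock-based path on conflict) is omitted
    return node
```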
In the embodiment of the present invention, a read operation obtains the pointer of the corresponding linked-list node through version validation in the first, second, and third steps described above, reads the data of the linked-list node in the fourth step, and then checks the mark bits of this node. If a mark bit is set dirty, the linked-list node being read is being updated or deleted: if the update bit is dirty, the read waits until the update completes and is persisted; if the delete bit is dirty, the read restarts from the root node. Specifically, the difference between the basic read-write concurrency control and the read-write concurrency control based on the embedded fine-grained lock is that the latter removes the persistence overhead of writes to different key-value pairs of the same leaf node from the version-validation path of read operations.
In the embodiment of the present invention, the linked-list layer only needs to perform atomic updates through a series of CAS operations, and the embedded fine-grained lock ensures that a modified key-value pair becomes visible only after its persistence completes. These two techniques separate the concurrency control of the tree layer from that of the linked-list layer: key-value-pair-granularity concurrency control is used in the persistent linked-list layer, while lock-based tree-node-granularity concurrency control is used in the volatile tree layer, removing the persistence overhead of the linked-list layer from the tree node's locking path. Each insert operation locates the insertion position through version validation and obtains the predecessor and successor nodes. In the first step, a new linked-list node is allocated, pointed at the successor node, and then persisted. In the second step, the right pointer of the predecessor node is pointed at the new linked-list node through a CAS instruction; because the state of the predecessor node is stored in its right pointer, the CAS also prevents the new node from being inserted after an already-deleted linked-list node. A traditional atomic operation alone cannot guarantee the persistence of the data.
In the embodiment of the present invention, a newly inserted linked-list node becomes visible only after the upper-layer tree node is updated, so that other operations cannot see a node that has not yet been persisted. After persisting the new linked-list node and the predecessor's pointer, the data structure uses the traditional lock-based approach to insert the new key-value pair into the upper-layer leaf node and make it visible. Each delete operation removes an existing key-value pair. First, the target linked-list node is located and its delete mark bit is set with a CAS operation, completing the logical delete; this prevents other threads from inserting a newly created linked-list node after the node being deleted, which would otherwise lose the new node. Second, the right pointer of the predecessor node is atomically modified to point at the successor node and persisted, completing the physical delete; CAS operations check whether the target node is being deleted or updated and whether the predecessor node is being deleted. After completing these operations, the data structure deletes the key-value pair from the upper-layer tree node. Each update operation modifies the value of an existing key-value pair; apart from the target node, an update does not affect other nodes of the linked-list layer, so its concurrency control is very simple: the update mark bit notifies other threads that this linked-list node is being updated. Fig. 3(c) shows the concurrent execution of two insert operations; the data structure of the embodiment of the present invention removes the persistence latency of the linked-list level from the tree-node-granularity locking path.
In the embodiment of the present invention, the consistency and persistence of the linked-list layer must be guaranteed: on a system crash, unfinished operations (such as inserts and deletes) may lose newly allocated linked-list nodes and leak main-memory space, and read operations may see linked-list or tree nodes deleted by other threads. To solve these problems, the data structure provides lightweight consistent main-memory management and a persistent garbage collector.
In the embodiment of the present invention, a larger block of non-volatile main-memory space is allocated from the system heap each time, and the address and length of this block are persisted onto a persistent list; the allocated space is divided into main-memory blocks of a specific size, maintained through a volatile free-block list that serves the allocation and release operations of the linked-list layer. During system recovery, the recovery threads scan the metadata on the persistent list and the nodes of the linked-list layer, judge which main-memory blocks are in use and which are not, and thereby rebuild the volatile free-block list. Only after these small main-memory blocks have all been used is new main memory allocated from the system heap allocator again.
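The allocator above can be sketched as follows: a large region is carved into fixed-size blocks, the region's address and length go on a persistent list, and the free blocks live on a volatile list that recovery rebuilds by scanning which blocks are still referenced by live list nodes. Region and block sizes, and all names, are illustrative assumptions; addresses are modeled as integers.

```python
REGION_SIZE = 1 << 20      # hypothetical: 1 MiB carved from the NVM heap
BLOCK_SIZE  = 64           # hypothetical list-node block size

class NvmAllocator:
    def __init__(self):
        self.persistent_regions = []   # (addr, length), persisted on NVM
        self.free_blocks = []          # volatile free-block list (DRAM)
        self.next_addr = 0

    def _grow(self):
        addr = self.next_addr
        self.next_addr += REGION_SIZE
        self.persistent_regions.append((addr, REGION_SIZE))  # persisted
        self.free_blocks = [addr + off
                            for off in range(0, REGION_SIZE, BLOCK_SIZE)]

    def alloc(self):
        if not self.free_blocks:
            self._grow()               # only then touch the system heap again
        return self.free_blocks.pop()

    def free(self, block):
        self.free_blocks.append(block)

    def recover(self, live_blocks):
        """Rebuild the volatile free list from the persistent region list."""
        live = set(live_blocks)
        self.free_blocks = [addr + off
                            for addr, length in self.persistent_regions
                            for off in range(0, length, BLOCK_SIZE)
                            if addr + off not in live]
```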
In the embodiment of the present invention, to prevent read operations from seeing linked-list nodes and tree nodes deleted by other threads, released tree nodes and linked-list nodes are correctly reclaimed by maintaining one global epoch counter and three garbage-reclamation lists. Before executing an operation on the data structure, a worker thread first registers the current epoch number. Each deleted tree or linked-list node is placed by the thread onto the corresponding garbage-reclamation list according to the current global epoch number: if the current epoch number is T, the deleted node is placed on garbage-reclamation list [T mod 3]. When the garbage collector wants to move the main-memory blocks on a garbage-reclamation list onto the free-block list, it first checks whether all worker threads are already in the current epoch; only if the check succeeds is the global epoch number incremented. This ensures that all threads are within the range of epochs T and T+1, so that the main-memory blocks on the garbage-reclamation list corresponding to epoch T-1 can be safely reclaimed.
In the embodiment of the present invention, the multithreaded recovery mechanism accelerates system recovery. On a normal shutdown, all volatile internal tree nodes and the garbage collector are persisted to a specific location of the non-volatile main memory; when the system restarts, the recovery threads copy them from non-volatile main memory back into DRAM, completing the restart in a very short time. On crash recovery, the recovery threads scan all linked-list nodes offline and rebuild all internal tree nodes and the garbage collector. Specifically, during normal execution a group of persistent trackers records the positions of some linked-list nodes: every 10,000 insert operations, a tracker records the address of a randomly chosen new linked-list node and persists it to a reserved area of non-volatile main memory; when a tracked linked-list node is deleted, the corresponding tracker is also reset.
During system recovery, the recovery process consists of two stages. In the first stage, the trackers are sorted by the keys of the linked-list nodes they record and then distributed to the recovery threads; each thread independently scans a disjoint portion of the linked-list layer and reconstructs its part of the data structure. In the second stage, after the disjoint parts have been rebuilt, a single thread assembles these parts into one complete data structure; this design effectively reduces conflicts between threads.
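The partitioning step of this two-stage recovery can be sketched as follows: the sampled tracker keys become shard boundaries, each worker rebuilds one disjoint shard, and a single final pass merges the partial indexes. Parallelism is elided; the function names and the use of a dict as the rebuilt index are assumptions.

```python
def partition_by_trackers(items, tracker_keys):
    """Split sorted (key, val) items into per-worker shards by tracker keys."""
    bounds = sorted(tracker_keys)
    shards = [[] for _ in range(len(bounds) + 1)]
    for key, val in items:
        i = sum(1 for b in bounds if key >= b)   # shard index for this key
        shards[i].append((key, val))
    return shards

def recover(items, tracker_keys):
    shards = partition_by_trackers(items, tracker_keys)
    partial = [dict(s) for s in shards]   # stage 1: independent rebuilds
    index = {}
    for p in partial:                     # stage 2: single-thread merge
        index.update(p)
    return index
```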
The multithreaded persistent B+ tree data structure design and implementation method of the embodiment of the present invention, by adopting a hybrid data structure of non-volatile memory and volatile memory, retains search and balancing operations with good spatial locality while effectively reducing expensive persistence operations; it also designs the embedded fine-grained lock and the optimistic write mechanism to solve the problem of amplified lock overhead, and uses the multithreaded recovery mechanism together with the persistent garbage collector to support consistency management of the non-volatile main memory and accelerate the system recovery of the data structure.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, such as two, three, etc., unless otherwise specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms need not refer to the same embodiment or example. Moreover, the described specific features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples. In addition, without mutual contradiction, those skilled in the art may combine the features of the different embodiments or examples described in this specification.
Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those skilled in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (10)

1. A multithreaded persistent B+ tree data structure design and implementation method, characterized by comprising the following steps:
introducing a layer of shadow leaf nodes based on a linked-list structure into a preset B+ tree;
storing the linked-list-based leaf nodes in NVM through a data layout strategy based on hybrid main memory so as to generate a linked layer based on the linked-list structure, and storing the other parts of the index data structure in DRAM so as to generate a tree layer based on an array structure, so that the layered design of a volatile tree structure and a persistent linked-list structure avoids the persistence overhead of balancing and sorting; and
designing an embedded fine-grained lock mechanism and an optimistic write mechanism, used respectively for concurrency control between read and write operations and between write operations.
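As a concrete illustration of the layered design in claim 1, the following minimal Python sketch (not part of the patent; all class and function names are illustrative) models an array-based tree layer, as if held in DRAM, indexing an ordered linked layer of one-key nodes, as if held in NVM:

```python
class ListNode:
    """One key-value pair per node, linked by a right pointer (the persistent linked layer)."""
    def __init__(self, key, value, right=None):
        self.key, self.value, self.right = key, value, right

class TreeNode:
    """Array-based inner node of the volatile tree layer: a sorted key array
    whose entries point to next-layer tree nodes or linked-list nodes."""
    def __init__(self):
        self.keys = []
        self.children = []

def build_link_layer(pairs):
    """Build the ordered linked layer from (key, value) pairs."""
    head = None
    for key, value in sorted(pairs, reverse=True):
        head = ListNode(key, value, head)  # prepend so the list ends up ascending
    return head

def build_tree_layer(head):
    """Build one array-based tree level over the linked layer."""
    root = TreeNode()
    node = head
    while node is not None:
        root.keys.append(node.key)
        root.children.append(node)
        node = node.right
    return root

def lookup(root, key):
    """Search the tree layer, then read the value from the linked layer."""
    for k, child in zip(root.keys, root.children):
        if k == key:
            return child.value
    return None
```

Rebalancing touches only the volatile tree layer, while durable data lives solely in the linked layer, which is the point of the split.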
2. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized in that the embedded fine-grained lock mechanism designs one update mark bit and one delete mark bit for each linked-list node, removing persistence delays that do not satisfy a preset condition from the version-verification path of read operations; and the optimistic write mechanism separates the concurrency control of tree nodes from that of linked-list nodes, removing the persistence delay from the locking path of tree-node granularity.
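The embedded mark bits of claim 2 can be pictured as a small status word carried by each linked-list node, so a reader can validate a node with a single load rather than waiting on a coarse lock. The sketch below assumes one update bit and one delete bit; the bit positions are chosen arbitrarily for illustration:

```python
UPDATE_BIT = 1 << 0  # node is being updated in place (assumed bit position)
DELETE_BIT = 1 << 1  # node is logically deleted (assumed bit position)

class MarkedNode:
    """Linked-list node with an embedded fine-grained mark word."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.status = 0  # both mark bits clear

    def begin_update(self):
        self.status |= UPDATE_BIT

    def end_update(self):
        self.status &= ~UPDATE_BIT

    def mark_deleted(self):
        self.status |= DELETE_BIT

    def readable(self):
        """A reader accepts the node only if neither mark bit is set."""
        return self.status & (UPDATE_BIT | DELETE_BIT) == 0
```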
3. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized in that, in the tree layer based on the array structure and located in the DRAM, each tree node can accommodate a preset number of key-value pairs, wherein each key-value pair of a tree node points to a tree node of the next layer or to a linked-list node; and when the number of key-value pairs in any tree node exceeds or falls below a preset threshold, a split or merge operation is performed on the tree node, and a key-value pair is inserted into or deleted from the tree node of the upper layer.
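The split/merge trigger of claim 3 reduces to a threshold check on the node's key count. A toy version, with `MAX_KEYS` and `MIN_KEYS` as assumed stand-ins for the preset thresholds:

```python
MAX_KEYS = 4  # assumed upper threshold for keys per tree node
MIN_KEYS = 2  # assumed lower threshold

def split(keys):
    """Split an overfull node's key array; the separator key is
    the one inserted into the upper-layer tree node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid:], keys[mid]

def maybe_rebalance(keys):
    """Decide whether a node must split, merge, or stay as-is."""
    if len(keys) > MAX_KEYS:
        return 'split'
    if len(keys) < MIN_KEYS:
        return 'merge'
    return None
```

Because the tree layer is volatile, these split and merge operations need no persistence ordering at all.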
4. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized in that the linked layer based on the linked-list structure is located in the NVM, that is, the linked layer is stored in non-volatile main memory, wherein the linked layer is an ordered linked list, each linked-list node stores only one key-value pair and is connected by a right pointer, and CPU atomic operations are used to guarantee the atomicity and consistency of its insert, delete, and update operations.
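Claim 4's single-pointer publication step can be sketched as follows. On real hardware the final pointer store would be an 8-byte atomic compare-and-swap followed by a persist barrier; here a plain assignment guarded by an expected-value check stands in for the CAS, and persistence is elided. Names are illustrative:

```python
class Node:
    def __init__(self, key, value, right=None):
        self.key, self.value, self.right = key, value, right

def cas_right(node, expected, new):
    """Simulated compare-and-swap on the right pointer."""
    if node.right is expected:
        node.right = new
        return True
    return False

def insert(head, key, value):
    """Insert into the ordered list with one atomic pointer update.

    `head` is a sentinel node whose key is smaller than any real key.
    """
    prev, cur = head, head.right
    while cur is not None and cur.key < key:
        prev, cur = cur, cur.right
    new = Node(key, value, cur)       # fully initialize (and persist) first,
    assert cas_right(prev, cur, new)  # then publish with a single pointer update
    return new
```

Because the new node is completed before the single publishing store, a crash at any point leaves the list either without the node or with it fully linked, which is how the atomicity/consistency guarantee of the claim arises.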
5. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized in that each tree operation searches downward from the root node until the corresponding leaf node is found, wherein, before any tree node is accessed, a prefetch instruction is executed to read the entire tree node into the CPU cache so as to hide the memory-access latency of the entire tree node; and the key array and the value array are stored in different main-memory spaces respectively, so that only the key array is prefetched each time, reducing the total amount of prefetched data.
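Claim 5 separates the key array from the value array so that only the key array must be brought into cache during a search. The sketch below shows that separated layout; the prefetch instruction itself (e.g. `__builtin_prefetch` in C) has no Python equivalent and is only noted in comments. Names are illustrative:

```python
class SeparatedNode:
    """Tree node with key and value arrays in separate memory regions,
    so a search scans (and, in C, prefetches) only the key array."""
    def __init__(self, pairs):
        pairs = sorted(pairs)
        self.keys = [k for k, _ in pairs]    # the only array the search touches
        self.values = [v for _, v in pairs]  # touched only on a hit

    def get(self, key):
        # In C, a prefetch for self.keys would be issued before this loop.
        for i, k in enumerate(self.keys):
            if k == key:
                return self.values[i]
        return None
```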
6. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized in that a key array size of a preset threshold is chosen and a linear search operation is used in place of a binary search operation, the linear search operation being performed within one main-memory space and accelerated with SIMD instructions, wherein each key-value pair is provided with a 1-byte fingerprint, each fingerprint is the hash value of the corresponding key, and the fingerprint array is stored at the head of the leaf node.
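The 1-byte fingerprints of claim 6 let a contiguous linear scan rule out most keys before any full key comparison; in C, the loop over the fingerprint array would be a single SIMD compare. The hash function below is an assumption, not the patent's:

```python
def fingerprint(key: int) -> int:
    """1-byte hash of the key (illustrative hash choice)."""
    return hash(key) & 0xFF

class FingerprintLeaf:
    """Leaf node with the fingerprint array stored at its head."""
    def __init__(self):
        self.fps = []     # contiguous fingerprint array, scanned first
        self.keys = []
        self.values = []

    def insert(self, key, value):
        self.fps.append(fingerprint(key))
        self.keys.append(key)
        self.values.append(value)

    def get(self, key):
        fp = fingerprint(key)
        # Linear scan over 1-byte fingerprints; full key compare only on a hit.
        for i, f in enumerate(self.fps):
            if f == fp and self.keys[i] == key:
                return self.values[i]
        return None
```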
7. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized in that:
for a conflict between a read operation and a write operation, a version-number-based concurrency control mechanism is used, wherein a version-number counter is adopted on each tree node and the version number is incremented each time the state of the tree node changes; for an insert, delete, or update operation, a lock is acquired before the tree node is modified and the corresponding version number is set dirty, and after the operation completes and the version number has been incremented by 1, the lock of the corresponding tree node is released; and if the version number has been modified or is locked, the read operation repeats the above process until the version number passes verification;
for a conflict between write operations, a lock mechanism of tree-node granularity is used, wherein the tree-node-granularity lock ensures that write operations modifying different tree nodes can be executed simultaneously; leaf nodes are connected by right pointers, the split direction of a leaf node is preset to be from left to right only, the locks of tree nodes are acquired bottom-up, and when a tree node splits or is deleted, the lock of the tree node of the upper layer is acquired; and the key-value pairs of linked-list nodes and leaf nodes correspond one to one, so that a write operation can modify a linked-list node only after it has obtained the lock of the corresponding leaf node in the tree layer.
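The read/write protocol of claim 7 can be condensed into a version counter whose low bit doubles as the lock/dirty flag: writers set it before modifying and bump the counter on release; readers retry until they observe the same clean version before and after reading. The sketch runs single-threaded purely to show the protocol; the bit-packing choice is an assumption:

```python
DIRTY = 1  # low bit of the version doubles as the lock/dirty flag

class VersionedNode:
    def __init__(self, value):
        self.version = 0  # even = clean, odd = locked/dirty
        self.value = value

    def write(self, value):
        self.version |= DIRTY                        # acquire: set version dirty
        self.value = value                           # modify the node
        self.version = (self.version & ~DIRTY) + 2   # release: clear dirty, bump version

    def read(self):
        while True:
            v1 = self.version
            if v1 & DIRTY:
                continue              # a writer is in progress; retry
            value = self.value
            if self.version == v1:    # version unchanged: the read is consistent
                return value
```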
8. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized in that, for the allocation and release of linked-list nodes, a block of non-volatile main-memory space is allocated from the system main-memory allocator each time, the address and length of the non-volatile main-memory space are persisted into a persistent linked list, and the allocated main-memory space is divided into main-memory blocks of a preset size maintained through a volatile free-block linked list, to be used for the main-memory allocation and release operations of the linked layer; and, upon system recovery, a recovery thread scans the metadata information on the persistent linked list and the nodes of the linked layer, determines which main-memory blocks are in use and which are unused, and thereby rebuilds the volatile free-block linked list.
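A sketch of the claim-8 allocator: large chunks come from the system allocator with their (address, length) records persisted in a durable chunk list, node-sized blocks are handed out from a volatile free list, and after a crash the free list is rebuilt by scanning the chunk list against the set of blocks the linked layer still references. The block size, growth size, and fake address space are assumptions:

```python
BLOCK = 64  # assumed bytes per linked-list node slot

class NVMAllocator:
    def __init__(self):
        self.persistent_chunks = []  # durable metadata: (address, length) per chunk
        self.free_blocks = []        # volatile free-block list
        self._next_addr = 0x1000     # fake NVM address space for illustration

    def grow(self, length):
        """Obtain one chunk from the system allocator and persist its metadata."""
        addr = self._next_addr
        self._next_addr += length
        self.persistent_chunks.append((addr, length))
        self.free_blocks += range(addr, addr + length, BLOCK)

    def alloc(self):
        if not self.free_blocks:
            self.grow(4 * BLOCK)
        return self.free_blocks.pop()

    def free(self, addr):
        self.free_blocks.append(addr)

    def recover(self, live_blocks):
        """Rebuild the volatile free list after a restart: every block of every
        persisted chunk that the linked layer does not reference is free."""
        self.free_blocks = [a for addr, length in self.persistent_chunks
                            for a in range(addr, addr + length, BLOCK)
                            if a not in live_blocks]
```

Only the coarse chunk list ever needs persistence, so the per-node allocate/free fast path touches volatile state only.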
9. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized by further comprising:
correctly reclaiming released tree nodes and linked-list nodes by maintaining one global epoch counter and three garbage-reclamation linked lists, wherein, before executing a relevant operation, a worker thread first registers the current epoch number, and for each deleted tree node or linked-list node, the thread places it into the corresponding garbage-reclamation linked list according to the current global epoch number.
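A toy version of the claim-9 scheme: one global epoch counter and three reclamation lists, with a node retired in epoch e placed in list e mod 3 and freed only once the epoch has advanced far enough that no registered thread can still observe it. The exact advance rule below is an assumed simplification:

```python
class EpochGC:
    def __init__(self):
        self.epoch = 0
        self.limbo = [[], [], []]  # the three garbage-reclamation lists
        self.active = {}           # thread id -> epoch registered at

    def register(self, tid):
        """A worker thread records the current epoch before operating."""
        self.active[tid] = self.epoch

    def unregister(self, tid):
        del self.active[tid]

    def retire(self, node):
        """A deleted node joins the list for the current global epoch."""
        self.limbo[self.epoch % 3].append(node)

    def advance(self):
        """Advance the epoch; the list from two epochs back is unreachable
        by any thread still registered, so it may be reclaimed."""
        self.epoch += 1
        if not self.active or min(self.active.values()) >= self.epoch - 1:
            slot = (self.epoch + 1) % 3
            freed, self.limbo[slot] = self.limbo[slot], []
            return freed
        return []
```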
10. The multithreaded persistent B+ tree data structure design and implementation method according to claim 1, characterized by further comprising:
when the system shuts down normally, persisting all volatile internal tree nodes and the garbage collector to a preset location in non-volatile main memory; and, after system restart, a recovery thread copying all the volatile internal tree nodes and the garbage collector from the non-volatile main memory back into the DRAM.
CN201811129623.3A 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method Active CN109407979B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811129623.3A CN109407979B (en) 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811129623.3A CN109407979B (en) 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method

Publications (2)

Publication Number Publication Date
CN109407979A true CN109407979A (en) 2019-03-01
CN109407979B CN109407979B (en) 2020-07-28

Family

ID=65465484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811129623.3A Active CN109407979B (en) 2018-09-27 2018-09-27 Multithreading persistent B + tree data structure design and implementation method

Country Status (1)

Country Link
CN (1) CN109407979B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825734A (en) * 2019-10-09 2020-02-21 上海交通大学 Concurrent updating method and read-write system for balance tree
CN111159056A (en) * 2019-12-11 2020-05-15 上海交通大学 Scalable memory allocation method and system for nonvolatile memory
CN111274456A (en) * 2020-01-20 2020-06-12 中国科学院计算技术研究所 Data indexing method and data processing system based on NVM (non-volatile memory) main memory
CN111352860A (en) * 2019-12-26 2020-06-30 天津中科曙光存储科技有限公司 Method and system for recycling garbage in Linux Bcache
CN111611246A (en) * 2020-05-25 2020-09-01 华中科技大学 Method and system for optimizing B + tree index performance based on persistent memory
CN111651455A (en) * 2020-05-26 2020-09-11 上海交通大学 Efficient concurrent index data structure based on machine learning
CN112286928A (en) * 2019-09-16 2021-01-29 重庆傲雄在线信息技术有限公司 Chain type storage system
CN112543237A (en) * 2020-11-27 2021-03-23 互联网域名系统北京市工程研究中心有限公司 Lock-free DNS (Domain name Server) caching method and DNS server
CN112612803A (en) * 2020-12-22 2021-04-06 浙江大学 Key value pair storage system based on persistent memory and data concurrent insertion method
CN112732725A (en) * 2021-01-22 2021-04-30 上海交通大学 NVM (non volatile memory) hybrid memory-based adaptive prefix tree construction method, system and medium
CN112947856A (en) * 2021-02-05 2021-06-11 彩讯科技股份有限公司 Memory data management method and device, computer equipment and storage medium
CN113656444A (en) * 2021-08-26 2021-11-16 傲网信息科技(厦门)有限公司 Data persistence method, server and management equipment
WO2022068289A1 (en) * 2020-09-29 2022-04-07 北京金山云网络技术有限公司 Data access method, apparatus and device, and computer-readable storage medium
CN114341817A (en) * 2019-08-22 2022-04-12 美光科技公司 Hierarchical memory system
US20230078081A1 (en) * 2020-02-14 2023-03-16 Inspur Suzhou Intelligent Technology Co., Ltd. B-plus tree access method and apparatus, and computer-readable storage medium
CN115905246A (en) * 2023-03-14 2023-04-04 智者四海(北京)技术有限公司 KV cache method and device based on dynamic compression prefix tree
CN116719832A (en) * 2023-08-07 2023-09-08 金篆信科有限责任公司 Database concurrency control method and device, electronic equipment and storage medium
CN117131012A (en) * 2023-08-28 2023-11-28 中国科学院软件研究所 Sustainable and extensible lightweight multi-version ordered key value storage system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120221538A1 (en) * 2011-02-28 2012-08-30 International Business Machines Corporation Optimistic, version number based concurrency control for index structures with atomic, non-versioned pointer updates
CN103268291A (en) * 2013-05-23 2013-08-28 清华大学 Method for delaying persistent indexing metadata in flash memory storage system
CN103765381A (en) * 2011-08-29 2014-04-30 英特尔公司 Parallel operation on B+ trees
KR20140070834A (en) * 2012-11-28 2014-06-11 연세대학교 산학협력단 Modified searching method and apparatus for b+ tree
CN104881371A (en) * 2015-05-29 2015-09-02 清华大学 Persistent internal memory transaction processing cache management method and device
CN105930280A (en) * 2016-05-27 2016-09-07 诸葛晴凤 Efficient page organization and management method facing NVM (Non-Volatile Memory)
US20160350015A1 (en) * 2015-05-27 2016-12-01 Nutech Ventures Enforcing Persistency for Battery-Backed Mobile Devices
CN106775435A (en) * 2015-11-24 2017-05-31 腾讯科技(深圳)有限公司 Data processing method, device and system in a kind of storage system
CN107273443A (en) * 2017-05-26 2017-10-20 电子科技大学 A kind of hybrid index method based on big data model metadata
CN107463447A (en) * 2017-08-21 2017-12-12 中国人民解放军国防科技大学 B + tree management method based on remote direct nonvolatile memory access


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHU Jiwu et al.: "Research Progress on Storage System Technology Based on Non-Volatile Memory", Science & Technology Review (《科技导报》) *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11650843B2 (en) 2019-08-22 2023-05-16 Micron Technology, Inc. Hierarchical memory systems
CN114341817A (en) * 2019-08-22 2022-04-12 美光科技公司 Hierarchical memory system
CN112286928B (en) * 2019-09-16 2023-11-28 重庆傲雄在线信息技术有限公司 Chain type storage system
CN112286928A (en) * 2019-09-16 2021-01-29 重庆傲雄在线信息技术有限公司 Chain type storage system
CN110825734B (en) * 2019-10-09 2023-04-28 上海交通大学 Concurrent updating method of balance tree and read-write system
CN110825734A (en) * 2019-10-09 2020-02-21 上海交通大学 Concurrent updating method and read-write system for balance tree
CN111159056A (en) * 2019-12-11 2020-05-15 上海交通大学 Scalable memory allocation method and system for nonvolatile memory
CN111352860A (en) * 2019-12-26 2020-06-30 天津中科曙光存储科技有限公司 Method and system for recycling garbage in Linux Bcache
CN111352860B (en) * 2019-12-26 2022-05-13 天津中科曙光存储科技有限公司 Garbage recycling method and system in Linux Bcache
CN111274456A (en) * 2020-01-20 2020-06-12 中国科学院计算技术研究所 Data indexing method and data processing system based on NVM (non-volatile memory) main memory
CN111274456B (en) * 2020-01-20 2023-09-12 中国科学院计算技术研究所 Data indexing method and data processing system based on NVM (non-volatile memory) main memory
US20230078081A1 (en) * 2020-02-14 2023-03-16 Inspur Suzhou Intelligent Technology Co., Ltd. B-plus tree access method and apparatus, and computer-readable storage medium
US11762827B2 (en) * 2020-02-14 2023-09-19 Inspur Suzhou Intelligent Technology Co., Ltd. B-plus tree access method and apparatus, and computer-readable storage medium
CN111611246A (en) * 2020-05-25 2020-09-01 华中科技大学 Method and system for optimizing B + tree index performance based on persistent memory
CN111611246B (en) * 2020-05-25 2023-04-25 华中科技大学 Method and system for optimizing index performance of B+ tree based on persistent memory
CN111651455A (en) * 2020-05-26 2020-09-11 上海交通大学 Efficient concurrent index data structure based on machine learning
WO2022068289A1 (en) * 2020-09-29 2022-04-07 北京金山云网络技术有限公司 Data access method, apparatus and device, and computer-readable storage medium
CN112543237B (en) * 2020-11-27 2023-07-11 互联网域名系统北京市工程研究中心有限公司 Lock-free DNS caching method and DNS server
CN112543237A (en) * 2020-11-27 2021-03-23 互联网域名系统北京市工程研究中心有限公司 Lock-free DNS (Domain name Server) caching method and DNS server
CN112612803A (en) * 2020-12-22 2021-04-06 浙江大学 Key value pair storage system based on persistent memory and data concurrent insertion method
CN112612803B (en) * 2020-12-22 2022-07-12 浙江大学 Key value pair storage system based on persistent memory and data concurrent insertion method
CN112732725B (en) * 2021-01-22 2022-03-25 上海交通大学 NVM (non volatile memory) hybrid memory-based adaptive prefix tree construction method, system and medium
CN112732725A (en) * 2021-01-22 2021-04-30 上海交通大学 NVM (non volatile memory) hybrid memory-based adaptive prefix tree construction method, system and medium
CN112947856A (en) * 2021-02-05 2021-06-11 彩讯科技股份有限公司 Memory data management method and device, computer equipment and storage medium
CN112947856B (en) * 2021-02-05 2024-05-03 彩讯科技股份有限公司 Memory data management method and device, computer equipment and storage medium
CN113656444A (en) * 2021-08-26 2021-11-16 傲网信息科技(厦门)有限公司 Data persistence method, server and management equipment
CN113656444B (en) * 2021-08-26 2024-02-27 友安云(厦门)数据科技有限公司 Data persistence method, server and management equipment
CN115905246A (en) * 2023-03-14 2023-04-04 智者四海(北京)技术有限公司 KV cache method and device based on dynamic compression prefix tree
CN116719832A (en) * 2023-08-07 2023-09-08 金篆信科有限责任公司 Database concurrency control method and device, electronic equipment and storage medium
CN116719832B (en) * 2023-08-07 2023-11-24 金篆信科有限责任公司 Database concurrency control method and device, electronic equipment and storage medium
CN117131012A (en) * 2023-08-28 2023-11-28 中国科学院软件研究所 Sustainable and extensible lightweight multi-version ordered key value storage system
CN117131012B (en) * 2023-08-28 2024-04-16 中国科学院软件研究所 Sustainable and extensible lightweight multi-version ordered key value storage system

Also Published As

Publication number Publication date
CN109407979B (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN109407979A (en) Multithreading persistence B+ data tree structure design and implementation methods
US11288252B2 (en) Transactional key-value store
Fang et al. High performance database logging using storage class memory
US11023453B2 (en) Hash index
Levandoski et al. LLAMA: A cache/storage subsystem for modern hardware
JP5647203B2 (en) Memory page management
KR930002331B1 (en) Method and apparatus for concurrent modification of an index tree
Lee et al. A case for flash memory SSD in enterprise database applications
CN109407978A (en) The design and implementation methods of high concurrent index B+ linked list data structure
CN104246764B (en) The method and apparatus for placing record in non-homogeneous access memory using non-homogeneous hash function
CN100412823C (en) Method and system for managing atomic updates on metadata tracks in a storage system
CN100367239C (en) Cache-conscious concurrency control scheme for database systems
US20180011892A1 (en) Foster twin data structure
US20060265373A1 (en) Hybrid multi-threaded access to data structures using hazard pointers for reads and locks for updates
CN105408895A (en) Latch-free, log-structured storage for multiple access methods
JPH0887511A (en) Method and system for managing b-tree index
CN112597254B (en) Hybrid DRAM-NVM (dynamic random Access memory-non volatile memory) main memory oriented online transactional database system
CN107665219B (en) Log management method and device
US20180004798A1 (en) Read only bufferpool
CN111414134B (en) Transaction write optimization framework method and system for persistent memory file system
Wang et al. Persisting RB-Tree into NVM in a consistency perspective
TW202139000A (en) Method of data storage, key -value store and non-transitory computer readable medium
CN111414320B (en) Method and system for constructing disk cache based on nonvolatile memory of log file system
Li et al. Phast: Hierarchical concurrent log-free skip list for persistent memory
US8001084B2 (en) Memory allocator for optimistic data access

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant