CN111274456B - Data indexing method and data processing system based on NVM (non-volatile memory) main memory - Google Patents


Publication number: CN111274456B
Authority: CN (China)
Prior art keywords: index, data, node, NVM, item
Legal status: Active (an assumption, not a legal conclusion)
Application number: CN202010064770.8A
Other languages: Chinese (zh)
Other versions: CN111274456A (en)
Inventors: 陈世敏, 刘霁航
Current and original assignee: Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS; priority to CN202010064770.8A
Publication of CN111274456A; application granted; publication of CN111274456B

Classifications

    • G — Physics
    • G06 — Computing; calculating or counting
    • G06F — Electric digital data processing
    • G06F16/00 — Information retrieval; database structures therefor; file system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/901 — Indexing; data structures therefor; storage structures
    • G06F16/9027 — Trees
    • Y — General tagging of new technological developments
    • Y02 — Technologies or applications for mitigation or adaptation against climate change
    • Y02D — Climate change mitigation technologies in information and communication technologies (ICT)
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a data indexing method based on an NVM main memory, comprising the following steps: placing the leaf nodes of a tree index structure in the NVM main memory; when new data is written to a leaf node, judging whether the leaf node has a free index item: if so, performing the data write operation; otherwise, first performing and completing a node splitting operation and then performing the data write operation. The data write operation includes: if the first index line of the leaf node has a free index item, writing the new data into that free index item; otherwise, migrating the new data and the stored data held in the first index line to free index items of the middle and/or tail index lines. The node splitting operation includes: constructing a new leaf node and migrating part of the stored data of the leaf node to free index items of the new leaf node.

Description

Data indexing method and data processing system based on NVM (non-volatile memory) main memory
Technical Field
The invention relates to the fields of database systems and big data systems, and in particular to a tree index structure, an indexing method, and a system for non-volatile main memory.
Background
1.1 New generation nonvolatile memory technology
A new generation of non-volatile memory (NVM) is a class of computer memory that serves as an alternative or complement to existing DRAM (dynamic random access memory) main-memory technology. Current integrated circuit feature sizes have reached 7nm, and continuing to scale DRAM technology down to smaller feature sizes presents a significant challenge. The new generation of NVM technology stores 0/1 by changing the resistance of the storage medium and can support smaller feature sizes, providing a viable solution to this problem. New-generation NVM technologies include Phase Change Memory (PCM), spin-transfer torque magnetic random access memory (STT-MRAM), Memristor, 3DXPoint, and the like.
Compared with DRAM technology, NVM technology has the following characteristics: (1) NVM has read/write performance similar to DRAM but slower (e.g., 3 times); (2) the write performance of NVM is worse than its read performance, write power consumption is high, and writes may have an endurance limit, i.e., if the number of writes to the same memory cell exceeds a certain threshold, the cell is damaged; (3) data written to NVM does not disappear after power failure, whereas data in DRAM and the CPU Cache does; (4) to ensure that contents in the CPU Cache are written back to the NVM, cache-line flush instructions such as clwb/clflush and memory-ordering instructions such as sfence/mfence must be executed, and the performance cost of these special instructions is higher than that of an ordinary write (e.g., 10 times); (5) the basic unit of CPU access to NVM is a cache line (e.g., 64B); (6) the basic access unit inside the NVM module may be larger than a cache line (e.g., the internal access unit of Intel Optane DC Persistent Memory is 256B).
The performance of NVM technology is at least 2 orders of magnitude higher than flash memory, and NVM allows in-place writes without requiring erase operations like flash. The characteristics of NVM technology are therefore much closer to DRAM, and it is seen as an alternative or complement to DRAM main-memory technology.
1.2 Computer systems including new-generation NVM main memory
Two configurations of computer systems containing 3DXPoint-based Intel Optane DC Persistent Memory are shown in Figs. 1A and 1B. As shown in FIG. 1A, the first configuration uses DRAM as a buffer for the 3DXPoint; the memory controller completes the buffering automatically, and the main-memory size visible to the system is the 3DXPoint size. The DRAM is fully controlled by hardware and is not visible to software. As shown in FIG. 1B, in the second configuration both the DRAM and the 3DXPoint are visible to software, which can decide which data is placed in the volatile DRAM and which in the non-volatile 3DXPoint. The first configuration is mainly suitable for running unmodified applications, which can use large-capacity 3DXPoint as main memory without changes; however, it cannot guarantee that data is persistently stored in the NVM, so new NVM-oriented main-memory applications will use the second configuration.
1.3 B+ -Tree index
B+ -Tree is a tree-structured index. The leaf nodes store index items, all reside in the same layer, and are connected by a sibling linked list from left to right, so the index items are stored in order from small to large. Each internal node may have multiple pointers, each pointing to a child node. The keys in an internal node are ordered from small to large and separate the different child subtrees. Thus, the B+ -Tree is an ordered index.
B+ -Tree is widely used in databases and big data systems to support fast querying and updating of data. Because the index is invoked at high frequency, it can affect the overall performance of the system. Therefore, optimizing the design of the B+ -Tree index structure for non-volatile main memory has important theoretical and practical significance for databases and big data systems based on non-volatile main memory.
1.4 prior art: NVM (non-volatile memory) -optimization-oriented B+ -Tree index structure
The existing B+ -Tree index structures oriented to NVM optimization mainly include WB+ -Tree, FP-Tree, BzTree, and the like. The main optimization ideas include the following aspects:
focusing on CPU Cache performance, making the node size an integer multiple of the Cache Line size (i.e., 64B) and significantly smaller than the hard disk-based B+ -Tree node size (e.g., 4 KB);
using out-of-order leaf nodes to reduce NVM writes;
placing the internal node in DRAM and the leaf node in NVM, thereby improving access performance of the internal node, and simultaneously rebuilding the internal node from the leaf node when recovering, thereby maintaining durability and downtime consistency of the data structure;
the NVM atomic write is utilized to avoid the use of journals or shadow copies before writing, so that the number of NVM writes and forced writes back can be reduced.
First, the prior art was designed based on theoretical NVM characteristics; before real NVM hardware existed, research relied on simulation. After the appearance of real 3DXPoint hardware, the inventors found that it has important new characteristics: (1) the granularity of internal data access is 256B, larger than the 64B CPU Cache Line; (2) the cost of writing each 64B line to the NVM does not vary with the write content, unlike the assumptions of previous studies.
Second, the B+ -Tree node splitting operation is costly in the prior art, typically employing a log to ensure downtime consistency of the node splitting operation, and the log incurs the cost of additional NVM writes and forced writeback. Forced write back on 3DXPoint may incur performance costs that are more than 10 times greater than normal write, so how to reduce additional operations such as journaling is a very important issue.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data indexing method based on the NVM main memory, which utilizes the characteristics of nonvolatile main memory atomic write operation and 3DXPoint real hardware to construct a tree index structure so as to reduce the expenditure of data write operation in the NVM main memory.
Specifically, the data indexing method of the present invention comprises: constructing a tree index structure; the root node and internal nodes of the tree index structure are placed in the DRAM main memory of the data processing system, and the leaf nodes of the tree index structure are placed in the NVM main memory of the data processing system; the leaf nodes all reside in the same layer and are connected by a sibling linked list from left to right; a leaf node comprises an A index unit, which in turn comprises a first index line, middle index lines, and a tail index line, each containing a number of index items; when new data is written to the current leaf node, judging whether the A index unit of the current leaf node has a free index item: if so, performing the data write operation; otherwise, first performing and completing a node splitting operation and then performing the data write operation; wherein the data write operation includes: if the first index line has a free index item, writing the new data into that free index item; otherwise, migrating the new data and the stored data held in the first index line to free index items of the middle and/or tail index lines; and the node splitting operation includes: constructing a new leaf node, adding it to the sibling linked list, and migrating part of the stored data of the current leaf node to free index items of the index unit of the new leaf node.
In the data indexing method of the present invention, the index unit NVMLine comprises M index lines, each line having N data-item storage locations. Line 1 is the first index line L_h; the 1st data-item location of L_h is the index header H, and the remaining data-item locations of L_h are index items. Lines 2 to M-1 are the middle index lines L_i, all of whose data-item locations are index items. Line M is the tail index line L_t; the Nth data-item location of L_t is the sibling-linked-list pointer item S, and the remaining data-item locations of L_t are index items. M and N are positive integers. The index header H comprises a write-lock bit (lock bit), an alternate-control bit (alt bit), an index-occupancy bitmap, and index fingerprint bits F; S comprises a pointer S_0 and a pointer S_1. The lock bit is used to set the write state of the current leaf node. The alt bit is set by an NVM atomic write so that one of S_0 and S_1 is the valid pointer and the other is the invalid pointer. The bitmap records the occupancy state of each index item. F is a fingerprint array that records the fingerprint of each index item. The valid pointer of S is used to connect the sibling linked list.
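The header fields described above (lock bit, alt bit, 14-bit occupancy bitmap) fit in a single 8B word, which is what makes publication by one atomic NVM write possible. A minimal Python sketch of this packing; the bit positions are chosen for illustration and are not fixed by the patent text:

```python
# Hypothetical bit layout of the header word H: 1-bit write lock,
# 1-bit alternate-control (alt) bit, 14-bit occupancy bitmap.
LOCK_BIT = 1 << 15           # write-lock bit for concurrency control
ALT_BIT = 1 << 14            # selects which of S_0/S_1 is the valid pointer
BITMAP_MASK = (1 << 14) - 1  # one occupancy bit per index item

def pack_header(lock, alt, bitmap):
    """Pack the three fields into one word (published atomically)."""
    assert 0 <= bitmap <= BITMAP_MASK
    return (LOCK_BIT if lock else 0) | (ALT_BIT if alt else 0) | bitmap

def unpack_header(word):
    """Recover (lock, alt, bitmap) from the packed word."""
    return bool(word & LOCK_BIT), bool(word & ALT_BIT), word & BITMAP_MASK
```

Because all three fields live in one word, a single write both flips the alt bit and updates the occupancy bitmap, which is the basis of the zero-log state switch.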
In the data indexing method of the present invention, the node splitting operation specifically comprises: when all index items of leaf node Node_n are occupied, allocating a new leaf node Node_n', which has the same structure as Node_n; copying part of the stored data of Node_n to the index items of Node_n', and modifying the index-occupancy bitmap' of the index header H' in the first index line L_h' of Node_n'; pointing the valid pointer of the sibling-linked-list pointer item S' of Node_n' to the right sibling leaf node Node_n+1 of Node_n, and pointing the invalid pointer of the sibling-linked-list pointer item S of Node_n to Node_n'; persisting Node_n' in the NVM main memory; with one NVM atomic write, setting the alt bit so that the pointer of S pointing to Node_n' becomes the valid pointer and the pointer of S pointing to Node_n+1 becomes the invalid pointer; and, with an NVM atomic write modifying the bitmap, clearing the index items of Node_n that held the migrated data so that they become free index items.
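The split sequence above can be modeled as a toy simulation: the leaf keeps two sibling-pointer slots, and flipping the alt bit (one atomic write in the real structure) switches which slot is valid, so no log is needed. Names (`Leaf`, `split`) are illustrative, not from the patent; the persistence instructions are indicated only as a comment:

```python
class Leaf:
    """Toy leaf: a set of keys plus the S_0/S_1 sibling slots and alt bit."""
    def __init__(self, entries, right=None):
        self.entries = set(entries)   # occupied index items (keys only here)
        self.s = [right, None]        # sibling pointer slots S_0 and S_1
        self.alt = 0                  # which slot is currently valid

    def sibling(self):
        return self.s[self.alt]      # only the valid pointer is followed

def split(leaf):
    keys = sorted(leaf.entries)
    move = keys[len(keys) // 2:]               # upper half migrates
    new = Leaf(move, right=leaf.sibling())     # new node links the old right sibling
    leaf.s[1 - leaf.alt] = new                 # write only the *invalid* slot
    # ...persist `new` and the invalid slot here (clwb + sfence)...
    leaf.alt = 1 - leaf.alt                    # single atomic flip publishes the split
    leaf.entries -= set(move)                  # then clear the migrated items
    return new
```

Before the flip, readers still see the old sibling chain; after it, the new node is linked. A crash at any point leaves one of the two consistent states, which is why no log entry is written.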
In the data indexing method of the present invention, the leaf node further comprises at least one B index unit NVMLine'; each NVMLine' comprises M lines, each line having N data-item storage locations. The 1st data-item location of line 1 of NVMLine' is the first index header H0, and the Nth data-item location of line M of NVMLine' is the tail index header H1. H0 comprises an index-occupancy bitmap' and index fingerprint bits F'; H1 has the same structure as H0. The alt bit is set by an NVM atomic write so that one of H0 and H1 is the valid index header and the other is the invalid index header. The bitmap' of the valid index header records the occupancy state of each index item of the current NVMLine', and the F' of the valid index header is a fingerprint array recording the fingerprint of each index item of the current NVMLine'.
In the data indexing method of the present invention, a data write operation that exclusively occupies a leaf node is performed only after the write-lock bit has been set to the locked state and the hardware transaction used for concurrency control has exited.
In the data indexing method of the present invention, when a failure recovery operation is performed, the root node and internal nodes are rebuilt in the DRAM main memory from all leaf nodes to recover the tree index structure; if the write-lock bit of an index unit NVMLine is in the locked state, it is reset to the unlocked state, and the stored data of the index unit NVMLine is recovered.
In the data indexing method of the present invention, by modifying the bitmap, some or all index items of NVMLine are set as free index items, and by modifying the bitmap', some or all index items of NVMLine' are set as free index items.
The present invention also proposes a computer-readable storage medium storing executable instructions for performing the NVM-based data indexing method described above.
The invention also proposes a data processing system comprising: a processor; the main memory is connected with the processor and comprises a DRAM main memory and an NVM main memory which are connected in parallel; a computer readable storage medium, the processor retrieving and executing executable instructions in the computer readable storage medium for NVM-based data indexing.
The LB+ -Tree provided by this patent optimizes index write performance for the characteristics of real hardware and realizes zero-log node splitting.
Drawings
Figs. 1A and 1B are schematic diagrams of prior-art computer system architectures containing NVM.
FIG. 2 is a schematic representation of the LB+ -Tree structure of the 256B leaf node of the present invention.
FIG. 3 is a schematic diagram of the LB+ -Tree point query algorithm of the present invention.
FIG. 4 is a schematic diagram of a first line index entry migration of the present invention.
Fig. 5 is a schematic diagram of a 256B leaf node lb+ -Tree insertion algorithm of the present invention.
Fig. 6 is a schematic diagram of a zero log leaf node splitting of the present invention.
Fig. 7 is a schematic diagram of a 256B leaf node zero log splitting algorithm of the present invention.
Fig. 8 is a schematic diagram of a multi-256B leaf node structure of the present invention.
FIG. 9 is a diagram illustrating the index insertion performance of the present invention compared to the prior art.
FIG. 10 is a schematic diagram of a data processing system of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following describes the data indexing method and the data processing system based on NVM main storage according to the present invention in further detail with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The inventors studied the 3DXPoint hardware characteristics and reconsidered the existing B+ -Tree index structures oriented to NVM optimization, finding that the prior art is not suited to the new characteristics of real hardware and that the existing node splitting methods are costly. The inventors therefore propose a new B+ -Tree design that absorbs the advantages of the prior art, solves its problems, and achieves a significant performance improvement.
First, the following definitions are made:
CacheLineSize: the size of the CPU Cache line is the granularity of the CPU to read and write the main memory data, and is generally 64B.
NVMLineSize: the granularity of data access within the NVM, e.g., 256B in 3DXPoint; typically an integer multiple of CacheLineSize.
The invention provides a Tree index structure LB+ -Tree oriented to nonvolatile main memory, which is a B+ -Tree index structure oriented to real 3DXPoint hardware optimization. The LB+ -Tree of the invention supports the consistency of persistent storage and downtime in nonvolatile main memory, supports multithreading concurrent operation, optimizes the node structure, reduces the expenditure of writing operation by utilizing the characteristics of nonvolatile main memory atomic writing operation and 3DXPoint real hardware, and realizes the persistent support of Tree index structure of zero log.
The leaf nodes of LB+ -Tree are stored in the NVM, all the leaf node initial addresses are aligned according to NVMLineSize, the leaf node size is an integer multiple of NVMLineSize, namely 1 or more NVMLineSize, and the data access bandwidth inside the real NVM hardware can be fully utilized.
A leaf node of LB+ -Tree consists of 1 or more index units (NVMLine), each of size NVMLineSize. An NVMLine comprises an index header, index items, and sibling-linked-list pointers; the header controls which sibling pointer is currently valid. This leaf node structure supports the head-line index migration technique and the zero-log node splitting technique.
For the leaf node insertion algorithm, the invention proposes head-line index migration. First, insertion is completed, as far as possible, by modifying only a free index item in the cache line where the header is located (called the head index line, Line0). Second, when the index items of the head index line are full, insertion must use a free index item of another index line. In this case, the index items in the head line are migrated to the index line where the insertion takes place, emptying as many slots as possible in the head line so that subsequent insertions can find free index items there. This reduces insertion cost, reduces the number of NVM lines written, and improves insertion performance.
For the leaf node split triggered by insertion into a full leaf node, the invention also proposes a zero-log splitting algorithm: the switch of the leaf node state (sibling linked-list pointers and header) is completed with one NVM atomic write, avoiding the cost of writing logs and improving node splitting performance.
Compared with the prior art, the LB+ -Tree optimizes the node structure, reduces the cost of write operations by exploiting the characteristics of non-volatile main-memory atomic writes and real 3DXPoint hardware, and realizes zero-log persistence support for the tree index structure.
FIG. 2 is a schematic representation of the LB+ -Tree structure with 256B leaf nodes of the present invention. As shown in fig. 2, the present invention proposes a data index structure LB+ -Tree, which includes a root node and internal nodes (collectively, non-leaf nodes) placed in the DRAM main memory, and leaf nodes placed in the NVM main memory (3DXPoint). The overall LB+ -Tree structure adopts the ideas of existing NVM-oriented B+ -Tree designs, mainly: the leaf nodes are placed in the NVM to ensure persistent storage, while the internal nodes are placed in the DRAM, disappear after power failure, and are rebuilt from the leaf nodes on recovery; the internal nodes use an ordered index-item array to support binary search, while the index items in leaf nodes are unordered, reducing the NVM writes caused by moving index items.
In particular, the present invention proposes a tree index structure LB+ -Tree oriented to non-volatile main memory (NVM). In the embodiments of the present invention, depending on the implementation environment, CacheLineSize = 64B and NVMLineSize = 256B may be taken, but the invention is not limited thereto.
In this embodiment, the leaf node size is an integer multiple of 256B, so as to adapt to the characteristic that the internal data transmission size of 3DXPoint is 256B, and fully exert the internal bandwidth of 3DXPoint. There are two designs of leaf nodes, which correspond to the case where the leaf node size is 256B and the size is at least 512B, respectively, hereinafter referred to as 256B leaf node and multi-256B leaf node.
1. 256B leaf node structure
The 256B leaf node consists of one index unit NVMLine (a 256B unit). Assuming a single index item is (8B key, 8B val), the index unit consists of four 64B index lines, namely a first index line (Line0), middle index lines (Line1, Line2), and a tail index line (Line3), with 14 index items of length 16B in total; the index items store the valid key and val of the data.
The first index line Line0 has 4 data items. The 1st data item (16B) is the index header H, which in order comprises a 1-bit write-lock bit (lock bit), a 1-bit alternate-control bit (alt bit), a 14-bit index-occupancy bitmap, and 14B of index fingerprint bits F. The lock bit is used for concurrency control. The alt bit determines, by its 0/1 value, which of the pointers S_0 and S_1 in the tail index line is the valid pointer (the other being a free pointer), and is used to realize zero-log node splitting. The bitmap records whether each index item of the index unit NVMLine is occupied: 1 means the corresponding index item is occupied and stores a valid key and val; 0 means it is free. F is a 14B fingerprint array storing 14 1B fingerprints, corresponding respectively to the 14 index items; a 1B fingerprint is obtained from the 8B key by a hash function. The 3 data items after the header are index items. In the aligned 64B index line containing H, all remaining space besides H is used only for index items, so that the number of index items in H's 64B line is as large as possible.
The middle index lines Line1 and Line2 each have 4 data items, 8 in total, and all of their data items are index items.
The tail index line Line3 has 4 data items: the first 3 are index items, and the 4th is the sibling-linked-list pointer item, which contains two 8B pointers S_0 and S_1. One of the two is the valid pointer, pointing to the next leaf node to the right in the LB+ -Tree leaf layer (NULL if this is the rightmost leaf); the other is a free pointer. The valid state of S_0 and S_1 is set by the alternate-control bit (alt bit) in the header of the first index line Line0.
2. Concurrency control
The LB+ -Tree of the present invention can adopt various concurrency control strategies; one combines the HTM (hardware transactional memory) provided by the CPU with operations on the leaf node lock bit. The basic ideas include:
Access to internal nodes in DRAM and read operations on leaf nodes in NVM are protected by HTM. Note that cache-line flush instructions such as clwb/clflush cause hardware transaction aborts, so HTM cannot be used to protect write operations on leaf nodes in NVM. HTM is supported by a variety of mainstream CPU architectures, including the Intel x86, ARM, and IBM Power8 architectures.
A write operation on a leaf node sets the lock bit and completes its hardware transaction, while a read operation checks whether the lock bit is set. The write operation's modification of the lock bit therefore conflicts with the read operation's read of it, ensuring that the write operation exclusively occupies the leaf node and is mutually exclusive with read operations and other write operations, which guarantees the correctness of leaf-node concurrency control.
The write operation on a leaf node proceeds only after the lock bit has been set and the hardware transaction has exited, so that the leaf node is exclusively occupied.
3. LB+ -Tree point query algorithm of 256B leaf node
FIG. 3 is a schematic diagram of the LB+ -Tree point query algorithm of the present invention. As shown in fig. 3, the main flow of the LB+ -Tree point query algorithm is to search the internal nodes first and then the leaf nodes. The basic method follows the general approach of B+ -Tree, with two main features. The first feature is concurrency control with HTM, where _xbegin, _xend, and _xabort are based on the Intel x86 implementation. _xbegin starts a hardware transaction and returns _XBEGIN_STARTED on success. When a hardware transaction fails due to a data conflict or a call to _xabort and must roll back, the CPU discards all related data modifications in the CPU Cache and transfers control back to _xbegin; from the software's perspective, _xbegin returns an error code. On transaction failure, the point query algorithm retries (see line 3 of the algorithm).
The second feature is the search operation on leaf nodes. First, the algorithm computes the 1B fingerprint value of the query key and compares it against the 14 1B values in the fingerprint array in the header using SIMD vector operations. Only if a fingerprint matches can the corresponding index item match. Second, for each index item whose fingerprint matches, the bitmap is checked to determine whether the index item is valid, and the index keys are further compared. This method follows the existing FP-Tree and reduces the cost of comparing each index item in turn.
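The two-step filter (fingerprint match, then bitmap and full-key check) can be sketched in Python. The hash function below is an arbitrary stand-in, and the real structure compares all 14 fingerprints at once with SIMD rather than in a loop:

```python
def fingerprint(key):
    """1B fingerprint of an 8B integer key (multiplier is illustrative)."""
    return ((key * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) >> 56

def leaf_lookup(keys, vals, bitmap, fps, query):
    """Point query within one 14-item leaf: fingerprint filter first,
    then bitmap validity check, then full key comparison."""
    qfp = fingerprint(query)
    for i, fp in enumerate(fps):          # one SIMD compare in practice
        if (bitmap >> i) & 1 and fp == qfp and keys[i] == query:
            return vals[i]
    return None
```

Most non-matching items are rejected by the 1B fingerprint alone, so full 8B key comparisons are rare, which is the point of the FP-Tree-style filter.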
4. Insertion algorithm of 256B leaf node
The data insertion operation of the leaf node of the present invention includes:
(1) Head-line index item migration. In real NVM hardware, the write performance of each 64B line of data does not change with the amount of modified data: when i bytes of a 64B line change and 64-i bytes are unchanged, NVM write performance is the same for all 1 ≤ i ≤ 64. (Note that simulation studies before real NVM hardware appeared sometimes assumed that fewer modified bytes produce better performance.) Based on this finding, the invention seeks to minimize the number of 64B lines modified in the leaf node; since the header must always be modified on insertion, it is best if an insertion modifies only the first index line Line0.
FIG. 4 is a schematic diagram of a first line index entry migration of the present invention. As shown in FIG. 4, when inserting 6, line0 has a free index entry, so writing 6 to this free index entry and modifying the header occur within Line0, and the insertion of data only causes 1 index Line to be written, which is the best case.
However, when the index items of the head line are all occupied, insertion into the leaf node necessarily requires writing a free index item outside the head line, and the header in the head line must also be modified, so 2 index lines will be written.
In this case, head-line index item migration moves the occupied index items in Line0 to the index line where the insertion is taking place, emptying Line0 as much as possible so that subsequent insertions can fully use the emptied index items in the head line, again achieving writes of only 1 index line. As shown in fig. 4, when inserting 3, Line0 is full, so a free index item in Line1 must be used. At this point, the items 9, 6, and 4 in Line0 are all migrated to Line1, freeing 3 slots in Line0. Then, when inserting 7, a free index item can be found in Line0 to complete the insertion, achieving the best case.
Note that when Line0 index item migration is performed, there are more writes to the migrated-to line to accept the migrated index items, and more writes to Line0 to modify the fingerprint array, but the number of lines written to the NVM is unchanged, so write performance is unchanged, while subsequent insert operations benefit after the Line0 index items are migrated.
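The migration policy can be illustrated with a toy Python simulation that tracks only which 64B lines an insert touches. The slot-to-line mapping (3 items in Line0, 4 each in Line1/Line2, 3 in Line3) follows the 256B layout; the function names are illustrative:

```python
def line_of(slot):
    """Which 64B line a 16B index-item slot (0..13) belongs to."""
    return 0 if slot < 3 else 1 + (slot - 3) // 4

def insert(occupied, key, store):
    """Insert one key; return the set of 64B lines written.
    Prefers a free Line0 slot (1 line written); otherwise inserts into
    another line and migrates Line0's items there (still 2 lines)."""
    free = [i for i in range(14) if i not in occupied]
    line0_free = [i for i in free if line_of(i) == 0]
    if line0_free:
        slot = line0_free[0]
        occupied.add(slot); store[slot] = key
        return {0}                          # only the head line written
    slot = free[0]
    target = line_of(slot)
    occupied.add(slot); store[slot] = key
    # migrate Line0's occupied items into free slots of the target line
    for src in [i for i in sorted(occupied) if line_of(i) == 0]:
        dsts = [i for i in free if line_of(i) == target and i not in occupied]
        if not dsts:
            break
        d = dsts[0]
        store[d] = store.pop(src)
        occupied.discard(src); occupied.add(d)
    return {0, target}                      # header line + target line
```

After a 2-line insert with migration, Line0 has free slots again, so the following inserts are back to the 1-line best case, which is how the average number of NVM line writes drops.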
Analysis shows that in a steady-state LB+-Tree, first-line index entry migration reduces the number of NVM line writes by a factor of at least 1.35 (i.e., the number of NVM lines written by the conventional scheme divided by the number written with first-line index entry migration is at least 1.35).
Fig. 5 is a schematic diagram of the 256B-leaf-node LB+-Tree insertion algorithm of the present invention. As shown in FIG. 5, LBTreeLeafInsert is one implementation of the first-line index entry migration described above. Lines 15-20 handle the case where a free entry is found in the first line, corresponding to the insertions of 6 and 7 in FIG. 4; lines 21-32 handle the case where the first line is fully occupied and another line must be used, migrating the first-line index entries while inserting.
In addition, the algorithm of FIG. 5 has several notable details. The first is how the header is modified: the algorithm does not modify the header in place, but copies it into a temporary variable dword, modifies dword, and then writes dword back to the header. Single-bit and single-byte modifications are thus performed in the temporary variable, and the write-back unit is 8B, which simplifies the algorithm's crash-consistency handling.
The second detail is that the algorithm always modifies free locations first, including the free index entry and the free fingerprint slot, and only then writes the first 8B of the first-line header. Modifications to free locations do not affect crash consistency; once they are complete, the write of the first 8B of the header, which includes the bitmap modification, atomically changes the state of the leaf node and makes the newly written index entry valid.
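The two details above can be sketched together: all free locations are persisted first, and the 8B header is edited only in a temporary dword and written back last. This is a hedged sketch under assumed field offsets; `persist` merely records the order a real implementation would enforce with write + clwb + sfence.

```python
import struct

writes = []                                    # persist order, oldest first

def persist(label, payload):                   # stands in for write+clwb+sfence
    writes.append((label, payload))

def insert_entry(slot_bytes, fp_byte, header_8b, slot_idx):
    persist("entry", slot_bytes)               # 1) free index entry first
    persist("fingerprint", fp_byte)            # 2) free fingerprint slot
    dword = struct.unpack("<Q", header_8b)[0]  # copy header into temp dword
    dword |= 1 << slot_idx                     # flip the bitmap bit in the temp
    new_header = struct.pack("<Q", dword)
    persist("header", new_header)              # 3) single 8B write-back, last
    return new_header                          # entry becomes valid only now
```

A crash before the final header write leaves the old bitmap intact, so the half-written entry is simply invisible, which is exactly why the ordering matters.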
The last detail is the implementation of concurrency control. The algorithm sets the lock bit of the leaf node inside LBTreeInsert and commits the hardware transaction with _xend. This excludes other read and write operations from the current leaf node. A power failure or similar fault may occur while the lock bit is set; on crash recovery, the leaves must be scanned to rebuild the internal nodes, and any lock bits found set are uniformly cleared, ensuring the correctness of further processing.
(2) Zero-log leaf node splitting
Node splitting in existing NVM-oriented B+-Trees must be protected by a write-ahead log, incurring substantial costs in NVM writes, cache line flushes, and sfence/mfence instructions.
Fig. 6 is a schematic diagram of the zero-log leaf node splitting of the present invention. As shown in FIG. 6, the present invention proposes zero-log leaf node splitting, which replaces the log with a pair of sibling pointers switched by one NAW (NVM Atomic Write: an 8B NVM write plus a clwb/clflush instruction plus an sfence/mfence instruction, where "/" means "or"). Specifically, as shown in FIG. 6(a), before splitting the alt bit indicates the valid pointer S0. As shown in FIG. 6(b), the first step of node splitting allocates and writes a new node and sets the other (idle) sibling pointer; neither the new node nor the idle sibling pointer changes the state of the original leaf node, so crash consistency is not affected. As shown in FIG. 6(c), the second step uses one NAW to write the alt bit, swapping the sibling pointers, and, in the same 8B, the bitmap, setting the positions of the index entries moved into the new node to free. The complete switch of the leaf node's state is thus accomplished with one NAW, avoiding the cost of writing a log.
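The two-step split can be sketched as follows. This is a hedged simulation, not the patented algorithm: the alt bit and the occupancy bitmap are modeled as living in the same 8B header word, so the final assignment stands in for the single NAW; the 14-entry capacity and the median pivot are assumptions.

```python
class SplitNode:
    CAP = 14                                   # entries per 256B leaf (assumed)

    def __init__(self, keys, right=None):
        keys = list(keys)
        self.entries = keys + [None] * (self.CAP - len(keys))
        self.bitmap = (1 << len(keys)) - 1     # header: occupancy bits
        self.alt = 0                           # header: selects live sibling
        self.sib = [right, None]               # S0 / S1

def split(old):
    keys = sorted(k for k in old.entries if k is not None)
    pivot = keys[len(keys) // 2]
    move = [i for i, k in enumerate(old.entries)
            if k is not None and k >= pivot]
    # Step 1: build the new node and fill the currently IDLE sibling slot.
    # Nothing visible in old's state changes, so a crash here is harmless.
    new = SplitNode([old.entries[i] for i in move], right=old.sib[old.alt])
    old.sib[1 - old.alt] = new
    # Step 2: ONE 8B NAW flips alt and clears the moved-out bitmap bits.
    bitmap = old.bitmap
    for i in move:
        bitmap &= ~(1 << i)
    old.alt, old.bitmap = 1 - old.alt, bitmap  # single atomic header write
    return new
```

After the step-2 write, readers following `sib[alt]` see the new node and the freed slots simultaneously; before it, they see the untouched pre-split state, so no log is needed either way.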
Fig. 7 is a schematic diagram of the 256B-leaf-node zero-log splitting algorithm of the present invention, showing one implementation of 256B leaf node zero-log splitting; the overall flow follows the example of FIG. 6. The main complexity lies in moving index entries, in particular deciding whether the newly inserted entry goes into the new node or the old node, while applying the first-line index entry migration optimization: in the new node, the first-line index entries are left free as much as possible; in the old node, LBTreeLeafInsert is invoked to complete the insertion of the new entry.
5. Multi-256B leaf node structure
Fig. 8 is a schematic diagram of the multi-256B leaf node structure of the present invention. As shown in FIG. 8, an LB+-Tree leaf node may also be a multi-256B node, composed of a plurality of index units (256B units) of size NVMLineSize. The first index unit (multi-NVMLine) of a multi-256B node has the same structure and function as the index unit NVMLine of a 256B node, except that the alternate control bit (alt bit) in the index header H of the first unit controls both which of the two sibling linked list pointers S0 and S1 in the first unit is currently valid, and which of H0 and H1 in each other index unit (multi-NVMLine') is currently valid. In the aligned 64B index line containing H, all space other than H is used only for index entries, so that the number of index entries in the header's 64B line is as large as possible.
Each other index unit multi-NVMLine' of the multi-256B node contains, in order, a head index header H0 (header0), a number of index entries, and a tail index header H1 (header1), with no sibling linked list pointer. H0 and H1 have the same structure, each comprising a write lock bit', an alternate control bit alt bit', an index occupancy bitmap' and index fingerprint bits F'; however, the write lock bit' and alternate control bit alt bit' of a multi-NVMLine' unit merely reserve the corresponding bits and are not used.
The index occupancy bitmap' records the occupancy state of each index entry of the current multi-NVMLine' unit of the multi-256B leaf node; the index fingerprint bits F' form a fingerprint array recording the fingerprint of each index entry of the current multi-NVMLine' unit, where fingerprints are computed by a hash function and identical index keys have identical fingerprints.
The characteristics and importance of the multi-256B node structure include: (1) Distributed index headers. Existing B+-Tree nodes use a centralized index header, storing the metadata of all index entries at the start of the node. As node size grows, the index entry metadata grows and the space left for index entries in the first index line shrinks, undermining the first-line index entry migration technique. The proposed multi-256B node instead uses distributed index headers: each 256B unit has its own header storing the metadata of the entries within that unit. This design preserves as much entry space as possible in the index line containing the header, so the first-line index entry migration technique remains fully effective. (2) H0 and H1. When a multi-256B leaf node splits, every 256B unit may have entries moved out, so the metadata in every unit's header may need modification, which clearly cannot be done with a single 8B NVM atomic write. Yet zero-log splitting requires that a single NVM atomic write complete the update of the entire node's state, including clearing the bitmap positions of the moved-out entries. H0 and H1 achieve this: of the pair, one is the valid index header and the other invalid, as selected by the alt bit in H. The post-split index entry metadata can thus be prepared in the invalid headers while the pre-split metadata remains in the valid headers; one NVM atomic write to the alt bit then makes the invalid headers valid and the valid headers invalid, switching the node between the pre- and post-split states.
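The H0/H1 mechanism can be sketched as a small model. This is a hedged illustration under assumed metadata contents (just a bitmap per header): every non-first 256B unit keeps two headers, and the alt bit stored in the first unit selects which one is live everywhere, so a split pre-writes all idle headers and then switches every unit with one atomic flip.

```python
class Unit:
    def __init__(self, bitmap):
        self.h = [bitmap, 0]          # h[0] = H0, h[1] = H1 (metadata)

class MultiLeaf:
    def __init__(self, unit_bitmaps):
        self.alt = 0                  # lives in the first unit's header H
        self.units = [Unit(b) for b in unit_bitmaps]

    def live_bitmap(self, i):
        return self.units[i].h[self.alt]   # alt of unit 0 picks every header

    def split_switch(self, post_split_bitmaps):
        for u, b in zip(self.units, post_split_bitmaps):
            u.h[1 - self.alt] = b     # step 1: write idle headers, crash-safe
        self.alt = 1 - self.alt       # step 2: single 8B NAW flips all units
```

If a crash occurs before the flip, the idle headers are garbage but invisible; after the flip, every unit's post-split metadata becomes visible at once.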
6. Multi-256B leaf node LB+ -Tree point query algorithm
The main difference from the 256B-leaf-node LB+-Tree query algorithm is the leaf node search: each 256B unit of the multi-256B leaf node is searched in turn, applying the 256B-leaf-node search algorithm described above to each unit. For units other than the first, the valid header is determined by the alt bit in the first 256B unit.
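The per-unit search relies on the fingerprint array described earlier: compare 1-byte fingerprints first and verify the full key only on hits. The sketch below is hedged; the hash (low byte of the key) and the slot layout are assumptions for illustration only.

```python
def fingerprint(key):
    return key & 0xFF                 # assumed 1-byte hash of the index key

def search_unit(keys, bitmap, fps, probe):
    """Point query within one 256B unit: bitmap filters occupied slots,
    fingerprints prune, the full key comparison confirms."""
    fp = fingerprint(probe)
    for i, k in enumerate(keys):
        if (bitmap >> i) & 1 and fps[i] == fp and k == probe:
            return i                  # fingerprint hit confirmed by full key
    return -1                         # not in this unit
```

Identical keys always have identical fingerprints, so a fingerprint mismatch safely skips the slot; a collision (two keys sharing a low byte) is resolved by the final key comparison.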
7. LB+ -Tree insertion algorithm for Multi-256B leaf node
The LB+-Tree insertion algorithm for multi-256B leaf nodes extends the 256B-leaf-node insertion algorithm. Internal node search, concurrency control and so on are identical; the main difference is the leaf node insertion.
Leaf node insertion visits each 256B unit in order from front to back, always inserting into the first free index entry. Within the 256B unit containing that free entry, the first-line index entry migration technique is applied.
If the leaf node is full, the node is split with an algorithm similar to 256B-leaf-node zero-log splitting. The main difference is that each 256B unit other than the first has two headers (H0 and H1), one in use and one idle, selected by the alt bit of the first unit. In the first step of splitting, besides the zero-log splitting operations described above, the idle headers are written to reflect the post-split bitmap and fingerprint array. The single NAW of the second step then changes the state of the first 256B unit and all other units simultaneously: for the first unit, the NAW modifies the alt bit, swapping the sibling pointers, and the bitmap, deleting the moved-out entries; for the other units of the multi-256B leaf node, the alt bit modification swaps their headers. The state switch of the entire leaf node is thus completed with one NAW, ensuring crash consistency.
8. Deletion algorithm of LB+ -Tree
The LB+-Tree delete algorithm performs no node merging; only the bitmap needs modification (for multi-256B leaf nodes, the bitmaps' of all units other than the first index unit multi-NVMLine as well), setting the deleted position to a free index entry. Like existing NVM-oriented B+-Trees, this completes with one NAW.
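Deletion therefore reduces to one atomic header update. A hedged sketch (slot layout assumed): the entry bytes themselves are never touched; clearing the bitmap bit is what turns the slot into a free index entry.

```python
def delete(bitmap, keys, probe):
    """Delete probe from one unit: one 8B atomic write of the header
    clears its bitmap bit; no node merging is performed."""
    for i, k in enumerate(keys):
        if (bitmap >> i) & 1 and k == probe:
            return bitmap & ~(1 << i)   # single NAW on the header
    return bitmap                        # key absent: nothing to persist
```

Since the stale key bytes are left in place but unmarked, a crash at any point leaves the node in either the before or the after state, never in between.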
9. LB+ -Tree range query algorithm
The LB+-Tree preserves the B+-Tree property that leaf nodes are ordered along the sibling linked list, so range queries are easily supported. Given a key range, the start and end leaf nodes are determined by search, and leaf nodes are then visited sequentially along the sibling list from the start leaf to the end leaf. In the start and end leaves, each index key must be compared in turn to find the entries satisfying the range condition; in the intermediate leaves, all entries satisfy the condition, so no comparison is needed.
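The scan above can be sketched as follows. This is a hedged illustration: entries within a leaf are unordered (as in the described leaf layout), so a leaf whose min and max both fall inside the range is taken wholesale, while boundary leaves filter key by key.

```python
class RLeaf:
    def __init__(self, keys, sib=None):
        self.keys, self.sib = keys, sib    # keys unordered within the leaf

def range_query(start_leaf, lo, hi):
    out, leaf = [], start_leaf
    while leaf is not None:
        kmin, kmax = min(leaf.keys), max(leaf.keys)
        if kmin > hi:
            break                          # past the end leaf: stop
        if lo <= kmin and kmax <= hi:
            out.extend(leaf.keys)          # middle leaf: no per-key compare
        else:                              # boundary leaf: filter each key
            out.extend(k for k in leaf.keys if lo <= k <= hi)
        leaf = leaf.sib                    # follow the sibling linked list
    return sorted(out)
```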
10. Failure recovery of LB+ -Tree
When a power failure, downtime, system crash or similar fault occurs, the LB+-Tree leaf nodes in the NVM remain consistent. Recovery of the LB+-Tree is therefore accomplished by scanning the leaf nodes in the NVM to rebuild the internal nodes. During the scan, any lock bit found set to 1 is cleared and written back.
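Recovery can be sketched as a sibling-list walk. A hedged illustration under an assumed leaf shape: the NVM leaves are authoritative; the scan clears stale lock bits and collects separator keys from which the DRAM-resident internal nodes are rebuilt.

```python
class RecLeaf:
    def __init__(self, keys, lock=0, sib=None):
        self.keys, self.lock, self.sib = keys, lock, sib

def recover(first_leaf):
    separators, leaf = [], first_leaf
    while leaf is not None:
        if leaf.lock:                  # stale lock from a mid-operation crash
            leaf.lock = 0              # clear and write back to NVM
        separators.append(min(leaf.keys))   # key material for internal nodes
        leaf = leaf.sib
    return separators                  # input to internal-node rebuilding
```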
FIG. 9 compares the index insertion performance of the present invention with the prior art. In the experiment, the index is first loaded with 2 billion (8B key, 8B value) index entries so that each node is 70%-100% full, and then either random insertions or dense insertions into the rightmost leaf node are performed. In FIG. 9, the horizontal axis is the number of inserted index entries and the vertical axis the time to complete all insertions, so a lower curve indicates better performance. The experiment compares the LB+-Tree of the present invention with two existing NVM-oriented indexes, the WB+-Tree and the FP-Tree. As shown, in FIG. 9(a) the LB+-Tree has a clear advantage under random insertion. In FIG. 9(b), when nodes are full, insertions trigger a large number of node splits, and the zero-log splitting technique of the LB+-Tree yields a great advantage. In FIG. 9(c), under dense insertion into the rightmost nodes, the first-line index entry migration technique of the LB+-Tree is fully exercised and again yields a great advantage. Further experiments show that the LB+-Tree performs similarly to existing NVM-oriented B+-Tree indexes under query and delete operations. The experiments thus demonstrate the advantages of the LB+-Tree over existing NVM-oriented B+-Tree structures, especially in insertion performance.
FIG. 10 is a schematic diagram of a data processing system of the present invention. As shown in FIG. 10, embodiments of the present invention also provide a computer-readable storage medium and a data processing system. The computer-readable storage medium stores executable instructions that, when executed by a processor of the data processing system, implement the data indexing method described above for the system's main memory, which comprises a DRAM main memory and an NVM main memory arranged in parallel; both are visible to software, which decides which data are placed in volatile DRAM and which in non-volatile NVM. Those of ordinary skill in the art will appreciate that all or part of the steps of the above methods may be performed by a program instructing associated hardware (e.g., a processor, FPGA or ASIC), the program being stored on a readable storage medium such as a read-only memory, magnetic disk or optical disk. All or part of the steps of the above embodiments may also be implemented with one or more integrated circuits. Accordingly, each module in the above embodiments may be implemented in hardware, for example by an integrated circuit, or in software, for example by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific combination of hardware and software.
The present invention solves the problems that the prior art is ill-suited to real 3D XPoint hardware and that node splitting is costly, providing a novel tree index structure (LB+-Tree) for non-volatile main memory.
The above embodiments are only for illustrating the present invention, not for limiting the present invention, and various changes and modifications may be made by one of ordinary skill in the relevant art without departing from the spirit and scope of the present invention, and therefore, all equivalent technical solutions are also within the scope of the present invention, and the scope of the present invention is defined by the claims.

Claims (8)

1. A method for NVM-based data indexing, comprising:
constructing a tree index structure, wherein the root node and internal nodes of the tree index structure are placed in the DRAM main memory of a data processing system and the leaf nodes of the tree index structure are placed in the NVM main memory of the data processing system; a plurality of leaf nodes are connected from left to right in a sibling linked list; each leaf node comprises an A index unit NVMLine comprising M index lines, each line having N data item storage locations; the 1st line is the first index line Lh, the 1st data item storage location of Lh is the index head H, and the remaining data item storage locations of Lh are index entries; the 2nd to (M-1)th lines are intermediate index lines Li, all of whose data item storage locations are index entries; the Mth line is the tail index line Lt, whose Nth data item storage location is the sibling linked list pointer item S and whose remaining data item storage locations are index entries; M and N are positive integers; the index head H comprises a write lock bit (lock bit), an alternate control bit (alt bit), an index occupancy bitmap and index fingerprint bits F; S comprises a pointer S0 and a pointer S1; the lock bit is used for setting the write state of the current leaf node; the alt bit is used for setting, by an NVM atomic write, one of S0 and S1 as the valid pointer and the other as the invalid pointer; the bitmap records the occupancy state of each index entry; F is a fingerprint array recording the fingerprint of each index entry; the valid pointer of S links the sibling linked list;
when newly added data is to be written into a current leaf node, judging whether the A index unit of the current leaf node has a free index entry; if so, performing a data write operation; otherwise, performing and completing a node splitting operation and then performing the data write operation;
wherein the data write operation comprises: if the first index line has a free index entry, writing the newly added data into that free index entry of the first index line; otherwise, writing the newly added data, and migrating the stored data held in the first index line, into free index entries of the intermediate index lines and/or the tail index line;
the sectionThe point splitting operation includes: when the leaf Node n All index items are occupied, and newly added leaf nodes Node are allocated n ',Node n ' have and Node n The same structure; node n Copying part of stored data to Node n ' index item, and modifying Node n ' first index line L h The index of the ' index header H ' occupies the bitmap '; node n The active pointer of the 'sibling linked list pointer item S' points to Node n Right sibling leaf Node of (a) n+1 Node is to n The invalid pointer of the sibling linked list pointer item S points to Node n 'A'; persistent Node in the NVM host n 'A'; setting altbit to point S to Node with NVM atomic write n The pointer of' is set as the effective pointer and S is pointed to Node n+1 Setting the pointer of (2) to be an invalid pointer; the part of the stored data is stored in Node n The index entry of (2) is emptied as the idle index entry and the bitmap is modified by writing with the NVM atom.
2. The data indexing method of claim 1, wherein each leaf node further comprises at least one B index unit NVMLine', each NVMLine' comprising M lines, each line having N data item storage locations;
the 1st data item storage location of the 1st line of NVMLine' is a head index header H0, and the Nth data item storage location of the Mth line of NVMLine' is a tail index header H1;
H0 comprises an index occupancy bitmap' and index fingerprint bits F'; H1 has the same structure as H0; the alternate control bit (alt bit) of the index head H sets, by an NVM atomic write, one of H0 and H1 as the valid index head and the other as the invalid index head; the bitmap' of the valid index head records the occupancy state of each index entry of the current NVMLine'; the F' of the valid index head is a fingerprint array recording the fingerprint of each index entry of the current NVMLine'.
3. The data indexing method of claim 1, wherein the data indexing method further comprises:
the data write operation of the exclusive leaf node is performed only after the write lock bit lockbit is set to the locked state and the concurrently controlled hardware transaction is exited.
4. The data indexing method of claim 1, wherein, upon performing a failure recovery operation, the root node and the internal nodes are rebuilt in the DRAM main memory based on all the leaf nodes to recover the tree index structure; and
if the write lock bit of an index unit NVMLine is in the locked state, the write lock bit is set to the unlocked state and the stored data of that index unit NVMLine is restored.
5. The data indexing method of claim 1, wherein a part or all of the index entries of the NVMLine are set as free index entries by modifying a bitmap.
6. The data indexing method of claim 2, wherein part or all of the index entries of the NVMLine are set as free index entries by modifying the bitmap, and part or all of the index entries of the NVMLine' are set as free index entries by modifying the bitmap'.
7. A computer readable storage medium storing executable instructions for performing the NVM-based data indexing method of any of claims 1-6.
8. A data processing system, comprising:
a processor;
the main memory is connected with the processor and comprises a DRAM main memory and an NVM main memory which are connected in parallel;
the computer readable storage medium of claim 7, the processor retrieving and executing executable instructions in the computer readable storage medium for NVM-based data indexing.
CN202010064770.8A 2020-01-20 2020-01-20 Data indexing method and data processing system based on NVM (non-volatile memory) main memory Active CN111274456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010064770.8A CN111274456B (en) 2020-01-20 2020-01-20 Data indexing method and data processing system based on NVM (non-volatile memory) main memory

Publications (2)

Publication Number Publication Date
CN111274456A CN111274456A (en) 2020-06-12
CN111274456B true CN111274456B (en) 2023-09-12

Family

ID=70998970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010064770.8A Active CN111274456B (en) 2020-01-20 2020-01-20 Data indexing method and data processing system based on NVM (non-volatile memory) main memory

Country Status (1)

Country Link
CN (1) CN111274456B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131050A (en) * 2023-08-28 2023-11-28 中国科学院软件研究所 Spatial index method based on magnetic disk and oriented to workload and query sensitivity

Citations (8)

Publication number Priority date Publication date Assignee Title
US7702640B1 (en) * 2005-12-29 2010-04-20 Amazon Technologies, Inc. Stratified unbalanced trees for indexing of data items within a computer system
CN102843396A (en) * 2011-06-22 2012-12-26 中兴通讯股份有限公司 Data writing and reading method and device in distributed caching system
CN105930280A (en) * 2016-05-27 2016-09-07 诸葛晴凤 Efficient page organization and management method facing NVM (Non-Volatile Memory)
CN107463447A (en) * 2017-08-21 2017-12-12 中国人民解放军国防科技大学 B + tree management method based on remote direct nonvolatile memory access
CN109407978A (en) * 2018-09-27 2019-03-01 清华大学 The design and implementation methods of high concurrent index B+ linked list data structure
CN109407979A (en) * 2018-09-27 2019-03-01 清华大学 Multithreading persistence B+ data tree structure design and implementation methods
CN109857566A (en) * 2019-01-25 2019-06-07 天翼爱动漫文化传媒有限公司 A kind of resource lock algorithm of memory read-write process
CN109976947A (en) * 2019-03-11 2019-07-05 北京大学 A kind of method and system of the power loss recovery towards mixing memory

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US9305112B2 (en) * 2012-09-14 2016-04-05 International Business Machines Corporation Select pages implementing leaf nodes and internal nodes of a data set index for reuse
EP3526680A1 (en) * 2017-06-25 2019-08-21 Korotkov, Alexander Durable multiversion b+-tree
US10599485B2 (en) * 2018-01-31 2020-03-24 Microsoft Technology Licensing, Llc Index structure using atomic multiword update operations

Non-Patent Citations (1)

Title
Shimin Chen. Persistent B+-Trees in Non-Volatile Main Memory. Proceedings of the VLDB Endowment, 2015, Vol. 8, No. 7, full text. *

Also Published As

Publication number Publication date
CN111274456A (en) 2020-06-12

