CN106708442B - Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk - Google Patents

Info

Publication number
CN106708442B
CN106708442B (application CN201611255923.7A)
Authority
CN
China
Prior art keywords
block
blocks
layer
data
memory cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611255923.7A
Other languages
Chinese (zh)
Other versions
CN106708442A (en
Inventor
龚才鑫
龚奕利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hard rock technology (Wuhan) Co., Ltd
Original Assignee
Hard Rock Technology Wuhan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hard Rock Technology Wuhan Co Ltd filed Critical Hard Rock Technology Wuhan Co Ltd
Priority to CN201611255923.7A priority Critical patent/CN106708442B/en
Publication of CN106708442A publication Critical patent/CN106708442A/en
Application granted granted Critical
Publication of CN106708442B publication Critical patent/CN106708442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0685Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

A mass data storage method adapted simultaneously to the read-write characteristics of a disk and a solid state disk changes the complete sorting of records within a block into partial sorting and adds a bloom filter at the tail of each block. The method establishes a Log-Structured Appending Tree (LSA-tree); when the amount of data stored in a block of the tree reaches a threshold, the data in the block is appended directly to the corresponding child blocks, so that the data of a child block consists of several sorted sequences rather than being kept fully sorted by merge sorting; each block in the tree holds a bloom filter. Without sacrificing any other performance, the invention greatly reduces write amplification, greatly increases random write efficiency, and better protects and prolongs the service life of the solid state disk. In a mixed read-write scenario, random read performance is also enhanced, so the method has important market value.

Description

Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk
Technical Field
The invention belongs to the field of mass data storage, and particularly relates to a storage tree.
Background
The index trees commonly used on hard disks include the B-tree, the LSM-tree, the buffer-tree, and so on. The B-tree is the traditional, classic index tree, but in a random-write scenario it inevitably causes random disk writes and performs poorly when storing mass data, so variants of the B-tree are often used instead; for example, BigTable combines the B-tree with the LSM-tree. For mass data storage, an LSM-tree or a buffer-tree (also called a fractal tree) is often used as the index tree. Their common characteristic is that writes of incoming records are postponed and processed in batches once a certain amount has accumulated. This largely avoids the random disk writes that the B-tree incurs in a random-write scenario and greatly improves write throughput.
In a random-read scenario, the LSM-tree and the buffer-tree have many layers, and the blocks in these trees are much larger than those of a B-tree, so read amplification is large and random read performance drops noticeably. To solve this problem, projects such as BigTable/LevelDB store bloom filter information in every node when implementing the LSM-tree, which reduces the read amplification of the LSM-tree well and largely solves the problem of poor random read performance.
However, whether B-tree or LSM-tree/buffer-tree, all of these trees have large write amplification. Because disk throughput is limited, the large write amplification prevents any further substantial improvement of the random write performance of these index trees and severely shortens the lifetime of solid state disks. The large write amplification also consumes most of the disk's throughput, so in a mixed read-write scenario random writes prevent random reads from fully utilizing the disk, and random read performance is reduced to some extent as well.
Disclosure of Invention
The invention aims to solve the problem that the large write amplification of conventional trees makes random writes inefficient and, on a solid state disk, severely shortens the disk's lifetime. The large write amplification consumes most of the throughput of the mechanical disk or solid state disk, so in a mixed read-write scenario random writes prevent random reads from fully utilizing the device, and random read performance is reduced to some extent. To this end, a tree called the Log-Structured Appending Tree (LSA-tree) is designed.
The invention provides a mass data storage method simultaneously adapted to the read-write characteristics of a disk and a solid state disk. It changes the complete sorting of records within a block into partial sorting and adds a bloom filter at the tail of each block. A Log-Structured Appending Tree (LSA-tree) is established and the data on disk is organized with it; the tree is divided into n layers, layer i (1 ≤ i ≤ n-1) holds at least t^i and at most t^i+1 blocks, the parameter t is the multiple between the block-count thresholds of two adjacent layers, and the last layer holds at most t^n blocks. Each block covers a range of keys; when the amount of data stored in a block reaches the corresponding threshold, the data in the block is flushed down into the next-layer blocks whose key ranges overlap it, and the flushed data is appended directly to the corresponding blocks, so that the data of a block consists of several sorted sequences rather than being kept fully sorted by merge sorting; each block in the tree holds a bloom filter;
moreover, the operations performed by the background thread on blocks of the LSA-tree fall into three classes: flush-down, split, and merge; all operations are initiated only on blocks of non-last layers; when a block of the current layer overlaps one or more blocks of the layer below on keys, this is called a parent-child relation, the block of the current layer is called the parent block, and the one or more blocks of the lower layer are called its child blocks;
the flush-down operation moves the data within a block down to the next layer, but the key range of the block is retained and the number of blocks in its layer does not change;
the flush-down operation is triggered when the amount of data stored in the block reaches the storage threshold and the number of child blocks of the block is less than 2t;
after being triggered, the flush-down is performed only when both of the following execution conditions are met,
condition 1, the number of blocks in the lower layer is less than t^(i+1)+1 when i+1 < n, or less than t^n when i+1 = n;
condition 2, if the lower layer is not the last layer, none of the child blocks has reached the storage threshold;
the split operation splits the block into two, so that the two newly generated blocks have an equal number of child blocks;
the split operation is triggered when the amount of data stored in the block reaches the storage threshold and the number of child blocks of the block is greater than 2t;
the execution condition of this operation is that the number of blocks in the layer containing the block is less than t^i+1;
the merge operation moves the data within the block down to the next layer, and after the flush-down the key range of the block is deleted, so the number of blocks in its layer is reduced by 1;
the merge operation is triggered when the number of blocks in the layer containing the block is equal to t^i+1;
this operation is performed only when both of the following execution conditions are met,
condition 1, the number of blocks in the lower layer is less than t^(i+1)+1 when i+1 < n, or less than t^n when i+1 = n;
condition 2, if the lower layer is not the last layer, none of the child blocks has reached the storage threshold;
moreover, when a user thread inserts a record, there are three cases,
1) if the variable memory cache has not reached the capacity threshold, adding the record to a user log and then inserting the record into the variable memory cache;
2) if the variable memory cache has reached the capacity threshold and no invariable memory cache exists, renaming the variable memory cache to the invariable memory cache, then creating a new variable memory cache and inserting the record into it;
3) if the variable memory cache has reached the capacity threshold and an invariable memory cache already exists, destroying the invariable memory cache after the background thread has written it to the disk, with the user thread then processing according to 2);
moreover, based on the LSA-tree, the background thread writing the immutable memory cache to the disk includes the following steps,
step 1.1, if the number of blocks in the last layer is equal to t^n, setting n = n+1 and creating a new layer, which becomes the new last layer;
step 1.2, selecting the task to be processed, wherein each task comprises a block to be processed and the operation to be executed on it, and the invariable memory cache is also regarded as a special block; the selection is governed by three priorities, from high to low,
priority 1: flushing down the invariable memory cache; if the execution condition of this operation is not met, continuing to judge the priority 2 condition;
priority 2: for the non-last layers, starting from the upper layer, judging whether there is a layer whose number of blocks is equal to t^i+1 and whose lower layer has fewer than t^(i+1)+1 blocks when i+1 < n, or fewer than t^n blocks when i+1 = n;
if so, selecting a certain block in the layer to carry out merging operation so as to reduce the number of the blocks of the layer; then selecting an optimal block from the candidate set, and executing merging operation on the selected optimal block;
if no such layer exists, continuing to determine a priority 3 condition;
priority 3: sequentially judging whether blocks with the stored data quantity reaching a storage threshold exist from the upper layer to the lower layer, and if so, selecting a first block encountered in the traversal process; if the number of the child blocks of the block is less than 2t, performing a brushing-down operation on the block;
if the number of the child blocks of the block is more than or equal to 2t, splitting operation is performed on the block;
if the selected block is to be subjected to the brushing-down operation, but the execution condition of the operation is not met because the block has a child block which reaches the storage threshold, selecting the child block to perform the brushing-down or splitting operation instead, and repeating the steps to perform recursive search until the first block meeting the brushing-down or splitting execution condition is finally selected;
if any target block and operation are not selected finally, when the user continues to insert data, the step 1.1 is started to be executed again;
step 1.3, actual disk operations including a brushing operation, a merging operation or a splitting operation are executed according to the obtained tasks;
step 1.4, applying for an exclusive lock, writing the structure information of the tree modified by the actual disk operation into a tree metadata change log after the application is successful, and updating the tree metadata in the memory according to the information;
step 1.5, if the down-shifting operation of the unchangeable memory cache is processed, destroying the unchangeable memory cache; if the user thread is sleeping, the user thread is awakened; all locks acquired by the thread are unlocked, and the thread continues to execute from step 1.1.
Moreover, when the user needs to read the data, the steps are executed as follows:
step 2.1, reading the variable memory cache, and returning if the required record is read;
step 2.2, reading the memory cache of the immutable memory, and returning if reading the required record;
step 2.3, reading layers 1 to n in sequence, returning once the record is found; if it is not found even in the last layer, there is no corresponding record in the database.
Furthermore, in step 1.3, if the task is a brush-down operation, there are 3 cases,
in case 1, if the block to be brushed down does not have child blocks, the step 1.4 is directly entered to modify the meta information of the block to realize downward shifting; the range of the downshifted blocks of the current layer is preserved;
case 2, if the piece to be brushed down has child pieces and the next layer is the last layer,
for a record falling within the range of a certain block in the last layer, directly modifying the block;
for records falling outside the range of all the blocks in the last layer, selecting child blocks with the minimum distance to the keys to be inserted into the records for modification, and modifying the range of the child blocks;
the last-layer child block is modified by an append operation if the data stored in the block has not reached the threshold; if it has, the data to be written and the original data are merge-sorted to generate several new blocks;
case 3, if the piece to be brushed down has child pieces and the next layer is not the last layer,
for a record falling within the range of a block of the next layer, directly appending data to the block;
for records falling outside the range of all the blocks, selecting child blocks with the minimum distance from the keys to be inserted into the records for adding, and changing the key range of the child blocks; the range of the downshifted blocks of the current layer is preserved.
Moreover, if the task is a merge operation, and the data in the block is moved down to the next layer in the same manner as the flush operation, the range of the block is deleted after the flush, so that the number of blocks in the layer where the block is located is reduced by 1.
The data stored in the block includes index data, a bloom filter, and a user record, the index data and the bloom filter are stored at the end of the block, and the user record is stored at the front end of the block.
Moreover, when the free hole in the middle of the block (a hole is logically free address space to which no address of an actual mechanical disk or solid state disk is yet bound) cannot hold all the data of the next write but can hold its index data and bloom filter, the index data and bloom filter are stored at the rear end of the block and the user records are appended to the tail of the block;
when the free hole in the middle of the block cannot even hold the index data and bloom filter of the data to be written next, the data to be written and the original data are merge-sorted to generate new blocks; alternatively, instead of merge sorting, all index data, bloom filters, and user records are appended to the end of the block.
According to the invention, under the condition of not sacrificing any other performance, the write amplification is greatly reduced, and the random write efficiency is greatly increased. In a read-write hybrid scenario, random read performance is also enhanced. The service life of the solid state disk is better protected and prolonged, and the solid state disk has important market value.
Drawings
Fig. 1 is a basic architecture diagram, mainly a structural schematic diagram of an LSA-tree, used in the storage method according to the embodiment of the present invention.
FIG. 2 is a logic diagram of flushing the data in a block down to the last layer during an actual disk operation according to an embodiment of the present invention.
FIG. 3 is a logic diagram of flushing the data in a block down to a non-last layer during an actual disk operation according to an embodiment of the present invention.
FIG. 4 is a diagram of a disk layout of blocks designed in an embodiment of the present invention.
FIG. 5 is a diagram of an alternative disk layout for blocks designed in an embodiment of the present invention.
Detailed Description of the Invention
The core problem to be solved by the invention is as follows: the large write amplification of conventional trees results in poor write performance or poor mixed read-write performance, and in a solid state disk the large write amplification also severely shortens the disk's lifetime. The invention solves this problem by changing the complete sorting of records within a block into partial sorting and adding a bloom filter at the tail of each block, so as to minimize the impact of the scheme on read performance.
Fig. 1 is the basic architecture diagram of the storage method provided by the embodiment of the invention, divided into a memory part and a disk part. The memory contains one variable memory cache, one invariable memory cache, and the metadata information of the tree. The metadata information of the tree describes the metadata of every block in the tree. The metadata of a block includes the key range of the block, the layer it belongs to, the size of the free hole in the middle of the block, the number of times it has been appended to, and so on. The block metadata is grouped by the layer the blocks belong to, and within each group the metadata entries are kept sorted by comparing the key ranges stored in them. The data on the disk is organized with the LSA-tree structure.
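For illustration only, the block metadata described above could be represented roughly as follows (a sketch; the class and field names are assumptions, not taken from the patent):

from dataclasses import dataclass, field
from typing import List

@dataclass
class BlockMeta:
    key_low: int             # lower bound of the block's key range
    key_high: int            # upper bound of the block's key range
    layer: int               # layer the block belongs to (1..n)
    free_hole: int           # size in bytes of the free hole in the middle of the block
    append_count: int = 0    # how many times the block has been appended to

@dataclass
class TreeMeta:
    # block metadata grouped by layer; within a layer, kept sorted by key range
    layers: List[List[BlockMeta]] = field(default_factory=list)

    def blocks_in_layer(self, i: int) -> List[BlockMeta]:
        return sorted(self.layers[i - 1], key=lambda b: (b.key_low, b.key_high))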
Blocks in memory use a fully sorted structure and are divided into the variable memory cache and the invariable memory cache: the former is a block that has not reached the block storage capacity threshold, into which user records can be inserted directly; the latter has reached the threshold size, can only be read, and can no longer be changed. When a user thread inserts a record, there are three cases (a sketch of this insert path is given after the list):
1) if the variable memory cache has not reached the capacity threshold, the record is appended to the user log, inserted into the variable memory cache, and the call returns;
2) if the variable memory cache has reached the capacity threshold and no invariable memory cache exists, the variable memory cache is renamed to the invariable memory cache, a new variable memory cache is created, the record is inserted into it, and the call returns;
3) if the variable memory cache has reached the capacity threshold and an invariable memory cache already exists, the invariable memory cache is destroyed after the background thread writes it to disk (this process is described in detail below), and the user thread then proceeds as in case 2).
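For illustration only, the three insert cases above might be coded roughly as in the following sketch; the names (MemCaches, user_log, wake_background) and the condition-variable synchronization are assumptions rather than details specified by the patent:

import threading

class MemCaches:
    """Illustrative sketch of the three insert cases; all names are assumptions."""
    def __init__(self, capacity_threshold: int):
        self.capacity_threshold = capacity_threshold
        self.mutable = {}        # variable memory cache: accepts inserts
        self.immutable = None    # invariable memory cache: read-only, waiting to be flushed
        self.cond = threading.Condition()

    def insert(self, key, value, user_log, wake_background):
        with self.cond:
            # case 3: variable cache full and an invariable cache still exists ->
            # wait until the background thread has flushed and destroyed it
            while self.immutable is not None and len(self.mutable) >= self.capacity_threshold:
                wake_background()
                self.cond.wait()
            # case 2: variable cache full, no invariable cache -> rename and start a fresh one
            if len(self.mutable) >= self.capacity_threshold:
                self.immutable, self.mutable = self.mutable, {}
                wake_background()
            # case 1: append to the user log first, then insert into the variable cache
            user_log.append((key, value))
            self.mutable[key] = value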
The data on the disk is organized with the LSA-tree structure. The tree is divided into n layers, each layer consists of several blocks, and the number of blocks per layer grows exponentially. Layer i (1 ≤ i ≤ n-1) holds t^i or t^i+1 blocks, and the last layer (layer n) holds at most t^n blocks (t is a positive integer of at least 2, for example 10). From top to bottom in Fig. 1: layer L1 has t^1 blocks, layer L2 has t^2 blocks, ..., layer Ln-1 has t^(n-1) blocks, and layer Ln has x blocks (0 < x ≤ t^n). The parameter t is the multiple between the block-count thresholds of two adjacent layers; in a concrete implementation, a person skilled in the art may preset the number of layers n and the parameter t as needed, for example n = 7 and t = 10. Each block covers a range of keys, and when the amount of data stored in a block reaches the corresponding threshold, the data in the block is flushed down into the next-layer blocks whose key ranges overlap it. In most cases this process appends the flushed data directly to the corresponding blocks (so the data of a block consists of several sorted sequences) rather than merge-sorting, thereby avoiding excessive write amplification. When the block-size threshold in the tree is on the order of tens of megabytes, for example 64MB, even if a block's data is split into several shares written into the blocks of the next layer, the average amount written to each block is still on the order of megabytes, which makes good use of the sequential write performance of disks and solid state disks.
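For illustration only, the per-layer block-count bounds described above can be written down directly (a sketch; the function name is an assumption):

def layer_block_bounds(t: int, n: int, i: int) -> tuple:
    """Block-count bounds implied by the text: layer i (1 <= i <= n-1) holds
    between t**i and t**i + 1 blocks; the last layer holds at most t**n blocks."""
    if i < n:
        return (t ** i, t ** i + 1)
    return (1, t ** n)

# With the example parameters n = 7 and t = 10, layer 2 holds 100 or 101 blocks:
assert layer_block_bounds(10, 7, 2) == (100, 101)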
Every block in the tree stores a bloom filter. When a user reads a record, there is no need to read every sequence in the block; reading only the bloom filter, which occupies little space, is enough to judge whether the queried record can be in a given sequence of the block, so compared with keeping the whole block sorted, the performance of user read operations is hardly affected.
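For illustration only, a point lookup inside a block guided by the bloom filter information might look like the following sketch (the attribute names sequences, bloom, may_contain and binary_search are assumptions):

def lookup_in_block(block, key):
    """Sketch: consult each appended sequence's bloom filter before reading it;
    sequences appended later are checked first so the newest value is found first."""
    for seq in reversed(block.sequences):
        if not seq.bloom.may_contain(key):   # definite miss: skip reading this sequence
            continue
        value = seq.binary_search(key)       # each sequence is internally sorted
        if value is not None:
            return value
    return None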
The operations of the background thread on blocks of the tree fall into three categories: flush-down, split, and merge. All operations are initiated only on blocks of non-last layers; let the block be in layer i, i.e. Li (1 ≤ i ≤ n-1). For convenience of description, when a block of the current layer overlaps one or more blocks of the layer below on keys, this is called a parent-child relationship; the block of the current layer is called the parent block and the one or more blocks of the next layer are called its child blocks.
The flush-down operation moves the data within a block down to the next layer, but the key range of the block remains and the number of blocks in the block's layer does not change. The trigger condition of the flush-down operation is: the amount of data stored in the block reaches the storage threshold and the number of child blocks of the block is less than 2t. The operation is performed only when both of the following execution conditions are met: condition 1, the number of blocks in the lower layer is less than t^(i+1)+1 when i+1 < n (the next layer is not the last layer), or less than t^n when i+1 = n (the next layer is the last layer Ln); condition 2, if the lower layer is not the last layer (i+1 < n), none of the child blocks has reached the storage threshold. The detailed steps of the flush-down operation are given in step 1.3.
The split operation splits the block into two, so that the two newly generated blocks have an equal number of child blocks. The trigger condition of the split operation is: the amount of data stored in the block reaches the storage threshold and the number of child blocks of the block is greater than 2t. The execution condition of the operation is: the number of blocks in the layer containing the block is less than t^i+1. See step 1.3 for the detailed procedure.
The merge operation is similar to the flush-down operation in that it moves the data within the block down to the next layer; the only difference is that after the flush-down the key range of the block is deleted, so the number of blocks in the block's layer is reduced by 1. The trigger condition of the merge operation is: the number of blocks in the layer containing the block is equal to t^i+1. The operation is performed only when both of the following execution conditions are met: condition 1, the number of blocks in the lower layer is less than t^(i+1)+1 when i+1 < n (the next layer is not the last layer), or less than t^n when i+1 = n (the next layer is the last layer Ln); condition 2, if the lower layer is not the last layer, none of the child blocks has reached the storage threshold.
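For illustration only, the trigger and execution conditions above could be checked as in the following sketch, in which layer_counts is assumed to map each layer number (1..n) to its current block count and each block is assumed to expose stored, storage_threshold and children attributes (these names are not from the patent):

def flush_down_allowed(layer_counts, t, n, i, block):
    """Trigger and execution conditions of the flush-down operation on a block of layer i."""
    triggered = block.stored >= block.storage_threshold and len(block.children) < 2 * t
    lower_ok = (layer_counts[i + 1] < t ** (i + 1) + 1) if i + 1 < n else (layer_counts[n] < t ** n)
    children_ok = (i + 1 == n) or all(c.stored < c.storage_threshold for c in block.children)
    return triggered and lower_ok and children_ok

def split_allowed(layer_counts, t, i, block):
    """Trigger and execution condition of the split operation."""
    triggered = block.stored >= block.storage_threshold and len(block.children) > 2 * t
    return triggered and layer_counts[i] < t ** i + 1

def merge_allowed(layer_counts, t, n, i, block):
    """Trigger and execution conditions of the merge operation."""
    triggered = layer_counts[i] == t ** i + 1
    lower_ok = (layer_counts[i + 1] < t ** (i + 1) + 1) if i + 1 < n else (layer_counts[n] < t ** n)
    children_ok = (i + 1 == n) or all(c.stored < c.storage_threshold for c in block.children)
    return triggered and lower_ok and children_ok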
A layer or block that causes an operation not to satisfy an execution condition is called a blocking layer or block, and the progress of the operation is blocked.
Operations on data blocks without logical dependencies can be executed in parallel. Every block on disk, as well as the variable and invariable memory caches in memory, is bound one-to-one with an exclusive lock. When an operation modifies blocks, it must acquire the exclusive locks of the blocks to be modified in order, to prevent data errors caused by several threads modifying the same block simultaneously.
In the embodiment, the specific process by which the invariable memory cache is written into the LSA-tree (i.e. the operation flow executed by the background thread) is as follows:
step 1.1, if the number of blocks of the last layer is equal to tnAnd if n is equal to n +1, newly building a layer, wherein the newly built layer is the last new layer. Go to step 1.2.
Step 1.2, the task to be processed is selected; each task comprises a block to be processed (here the "invariable memory cache" is also regarded as a special block) and the operation to be executed on it. The selection is governed by three priorities (they ensure that the number of blocks in each layer of the tree meets the per-layer requirement and that the tree can efficiently absorb the data flushed from the invariable memory cache), from high to low:
Priority 1: flush the invariable memory cache down; if the execution condition of this flush is not met, continue to judge the priority 2 condition.
Priority 2: for the non-last layers Li (1 ≤ i ≤ n-1), judge from the upper layers downward whether there is a layer whose number of blocks equals t^i+1 and whose lower layer has fewer than t^(i+1)+1 blocks when i+1 < n (the next layer is not the last layer), or fewer than t^n blocks when the next layer is the last layer Ln.
If so, a block in the layer is selected for a merge operation to reduce the number of blocks in the layer. The strategy selected is:
all blocks that satisfy the number of child blocks equal to or less than t are added to the candidate set (at least one such block must be present in the set as is known by the constraint on the number of blocks per layer of the tree). Then, selecting the optimal block in the candidate set, wherein the selection strategy is as follows: the larger the amount of data stored by the block divided by the number of child blocks, the better, and the smaller the number of child blocks of the new range generated after the block is merged with the range of the neighboring block. And executing the merging operation on the selected optimal block. If no such layer exists, the priority 3 condition continues to be determined.
Priority 3: and sequentially judging whether blocks with the stored data quantity reaching a storage threshold exist from the upper layer to the lower layer, and if so, selecting the first block (block blocking the 'invariable memory cache' is preferred) encountered in the traversal process. If the number of the child blocks of the block is less than 2t, performing a brushing-down operation on the block; if the number of child blocks of the block is greater than or equal to 2t, then a split operation will be performed on the block. If the selected block is to be subjected to the brushing-down operation, but the execution condition of the operation is not met because the block has a child block which has already reached the storage threshold, the child block is selected instead to be subjected to the brushing-down or splitting operation, and the recursive search is performed by analogy until the first block meeting the brushing-down or splitting execution condition is finally selected. If no target block and operation is finally selected, it is re-executed starting from step 1.1.
After the block to be processed and the operation to be executed on it are selected, the blocks to be modified are locked in order. If all locks are acquired successfully, the task is acquired; if locking any block fails, all locks are released and execution restarts from step 1.1.
Step 1.1 ensures that the number of blocks in the last layer must be less than t^n when the steps from step 1.2 start. Therefore, in priority 2 of step 1.2, if one or more layers have a block count equal to t^i+1, it must be possible to choose a layer that simultaneously has t^i+1 blocks and a lower layer with fewer than t^(i+1)+1 blocks when i+1 < n (the next layer is not the last layer), or fewer than t^n blocks when the next layer is the last layer Ln, and to perform a merge operation on that layer.
The purpose of priority 2 in step 1.2 is to guarantee that priority 3 always has tasks satisfying the execution conditions (rather than every layer having t^i+1 blocks so that all priority 3 tasks are blocked), so that the tree can always make progress.
The purpose of priority 3 in step 1.2 is to process the blocks in the tree that have reached the storage threshold; these may be blocks that are blocking an upper-layer task, processed so that the tree can continue to store the data flushed from the invariable memory cache, or blocks that are not blocking any upper-layer task, processed to optimize performance.
Step 1.3, the actual disk operation is executed according to the obtained task, as follows (tasks that are logically independent of one another and do not conflict can be executed in parallel):
1) If the operation is a flush-down, there are 3 cases:
Case 1: if the block to be flushed down has no child blocks, step 1.4 is entered directly and the metadata (meta information) of the block is modified to realize the move down. The range of the flushed-down block of the current layer is preserved.
Case 2: the block to be flushed down has child blocks and the next layer is the last layer. A record falling within the range of some last-layer block modifies that block directly; for records falling outside the ranges of all last-layer blocks, the child block whose range is nearest to the key to be inserted is selected for modification, and for these the range of that child block must be extended. The concrete operation of modifying a last-layer child block is: if the data stored in the block has not reached the threshold, an append operation is performed; if it has, the data to be written and the original data are merge-sorted to generate several new blocks (one more than the original number of blocks). The range of the flushed-down block of the current layer is preserved. Referring to Fig. 2, some child blocks are appended to and others are merge-sorted.
Case 3: the block to be flushed down has child blocks and the next layer is not the last layer. When the data is flushed down, a record falling within the range of some next-layer block is appended to that block directly; for records falling outside the ranges of all blocks, the child block whose range is nearest to the key to be inserted is selected for the append, and the key range of that child block is extended. The range of the flushed-down block of the current layer is preserved. Referring to Fig. 3, all child blocks are only appended to.
Compared with the prior art, this operation almost completely avoids merge sorting and replaces it with append operations, which greatly reduces write amplification and improves write performance (a code sketch of this record routing is given below, after the split operation).
2) For merge operation, the specific operation flow is similar to the flush operation, i.e. the data in the block is moved down to the next layer in the same way as the above flush operation, the only difference being that the range of the block is deleted after the flush, so that the number of blocks in the layer where the block is located is reduced by 1.
3) If the operation is splitting, the block is split into two new blocks, and the number of child blocks owned by the two newly generated blocks after splitting is equal.
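For illustration only, the routing of records during a flush-down (cases 2 and 3 above) might look roughly like the sketch below; keys are treated as integers for simplicity, and the helper names (all_records, extend_range, append_sequence, merge_sort_with) are assumptions rather than anything defined by the patent:

def key_distance(key, child):
    # distance from the key to the child's key range (0 if the key is already inside)
    if key < child.key_low:
        return child.key_low - key
    if key > child.key_high:
        return key - child.key_high
    return 0

def flush_down(block, children, last_layer):
    """Partition the flushed records among the child blocks by key range, then
    either append to each child or, for a full last-layer child, merge-sort."""
    batches = {id(c): [] for c in children}
    for key, value in block.all_records():
        target = next((c for c in children if c.key_low <= key <= c.key_high), None)
        if target is None:
            # record falls outside every child's range: pick the nearest child
            target = min(children, key=lambda c: key_distance(key, c))
            target.extend_range(key)          # the child's key range grows to cover the key
        batches[id(target)].append((key, value))
    for child in children:
        batch = batches[id(child)]
        if not batch:
            continue
        if last_layer and child.stored >= child.storage_threshold:
            child.merge_sort_with(batch)      # last layer only: rewritten as several new blocks
        else:
            child.append_sequence(batch)      # common case: append one new sorted sequence
    # the flushed block keeps its key range; only its data has moved down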
Step 1.4, applying for an exclusive lock to ensure that only one background thread can carry out the step within a certain moment; after the application is successful, writing the structural information of the tree modified by the actual disk operation into a tree metadata change log, and updating the tree metadata in the memory according to the information.
Step 1.5, if the downward movement operation of the 'invariable memory cache' is processed, destroying the 'invariable memory cache', and if a user thread is sleeping, awakening the user thread; and unlocking all locks acquired by the thread. The thread continues to execute from step 1.1.
When a user needs to read data, the steps are executed as follows:
step 2.1, reading the variable memory cache, and returning if the required record is read;
step 2.2, reading the 'invariable memory cache', and returning if reading the required record;
step 2.3, reading layers L1 to Ln in sequence, returning once the record is found; if it is not found even in the last layer, there is no corresponding record in the database. During reads, MVCC (multi-version concurrency control) can be used so that no lock is held while the disk is being read (a sketch of this read path follows).
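A rough sketch of this read path, assuming the lookup_in_block helper sketched earlier and illustrative attribute names on a db object (none of these names come from the patent):

def get(db, key):
    """Read path sketch: variable cache, then invariable cache, then layers 1..n."""
    value = db.mutable_cache.get(key)
    if value is not None:
        return value
    if db.immutable_cache is not None:
        value = db.immutable_cache.get(key)
        if value is not None:
            return value
    for layer in range(1, db.n + 1):
        block = db.find_block_covering(layer, key)   # block whose key range covers the key
        if block is None:
            continue
        value = lookup_in_block(block, key)          # bloom-filter-guided search within the block
        if value is not None:
            return value
    return None                                      # not found even in the last layer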
FIG. 4 shows how the method organizes blocks on the disk. As shown in FIG. 4 (left), the data stored in a block includes index data, bloom filters, and user records; the first two are stored at the rear end of the block, and the user records are stored at the front end of the block. When storing data, three situations may occur:
1) when the free hole in the middle of the block can hold all the data of this write, the layout of FIG. 4 (left) is used: the user records of writes 1 through n are stored in order at the front end of the block, and the index data and bloom filters of writes 1 through n are stored in order at the rear end of the block;
2) when the free hole in the middle of the block cannot hold all the data to be written this time but can hold its index data and bloom filter, the layout of FIG. 4 (right) is used: the index data and bloom filter of the (n+1)th write are stored in order at the rear end of the block, and the user records of the (n+1)th write are then appended to the tail of the block;
3) when the free hole in the middle cannot even hold the index data and the bloom filter, the data to be written and the original data are merge-sorted to generate a new block. In practice this can be almost completely avoided by treating the block as having reached its storage threshold once 95% of its capacity is used. Further, the invention proposes a more preferable alternative: instead of the above merge sorting, the scheme of FIG. 5 can be used, in which the index data, bloom filter, and user records are all appended to the end of the block.
In implementation, the blocks may be implemented in files, each having a threshold of 64MB, but may exceed 64MB, such as when stored in the manner of FIG. 4 (right).
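For illustration only, the three layout cases of FIG. 4 and FIG. 5 could be modeled as below; the block attributes (front, back, tail, capacity) and the byte-string arguments are assumptions used only for this sketch:

def write_to_block(block, records, index, bloom, size_of=len):
    """Sketch of the on-disk layouts: FIG. 4 (left), FIG. 4 (right), and the FIG. 5 alternative."""
    hole = block.capacity - size_of(block.front) - size_of(block.back)
    if size_of(records) + size_of(index) + size_of(bloom) <= hole:
        block.front += records                      # FIG. 4 (left): user records grow from the front
        block.back = index + bloom + block.back     # index/bloom grow from the rear end
    elif size_of(index) + size_of(bloom) <= hole:
        block.back = index + bloom + block.back     # FIG. 4 (right): metadata still fits the hole
        block.tail += records                       # user records appended past the block's end
    else:
        block.tail += records + index + bloom       # FIG. 5 alternative: append everything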
Note that when reading the record in the block, the sequence added later in time is read first, and if the desired record is found, other unread sequences do not need to be read, and can be returned directly.
The method described above is only one example of the idea of "changing the full ordering of records in a block to partial ordering (consisting of multiple ordering sequences) to greatly reduce write amplification, and then adding a bloom filter to the index information of each block to minimize the impact of the scheme on read performance". Any modification, improvement or the like made within the spirit and principle of the present invention shall be included in the scope of the present invention, and it shall be within the scope of the present invention to apply the same logic to the blocks in the buffer-tree.

Claims (9)

1. A mass data storage method simultaneously adapting to the read-write characteristics of a disk and a solid state disk is characterized in that: the complete ordering of the records in a block is changed into partial ordering, and a bloom filter is added to the tail of each block, which is realized as follows,
the memory comprises a variable memory cache, an invariable memory cache and the metadata information of a tree, wherein blocks in the memory adopt a fully sorted structure and are divided into the variable memory cache and the invariable memory cache, the variable memory cache being a block that has not reached the block storage capacity threshold, into which user records can be inserted directly, and the invariable memory cache having reached the threshold, being readable only and no longer changeable; the metadata information of the tree describes the metadata of each block in the tree, and the metadata of a block comprises the range of the block, the layer to which the block belongs, the size of the free hole in the middle of the block and the number of times the block has been appended to;
establishing a structure called a Log-Structured Appending Tree (LSA-tree) and organizing the data on disk with it, the tree being divided into n layers, layer i holding at least t^i and at most t^i+1 blocks for 1 ≤ i ≤ n-1, the parameter t being the multiple between the block-count thresholds of two adjacent layers, and the last layer holding at most t^n blocks; each block covers a range of keys, and when the amount of data stored in a block reaches the corresponding threshold, the data in the block is flushed into the next-layer blocks whose ranges overlap it, the flushed data being appended directly to the corresponding blocks, so that the data of a block consists of several sorted sequences rather than being kept fully sorted by merge sorting; each block in the tree holds a bloom filter;
the operations of the background thread on blocks of the Log-Structured Appending Tree fall into three classes: flush-down, split, and merge; all operations are initiated only on blocks of non-last layers; when a block of the current layer overlaps one or more blocks of the layer below on keys, this is called a parent-child relation, the block of the current layer is called the parent block, and the one or more blocks of the lower layer are called its child blocks;
the down-brushing operation is to move down the data in the block to the next layer, but the range of the block is still reserved, and the number of the blocks in the layer where the block is located does not change;
the triggering condition of the brushing-down operation is that the data amount stored in the block reaches a storage threshold value and the number of child blocks of the block is less than 2 t;
after triggering, the lower brush is needed to be performed after the following two execution conditions are both met,
condition 1, the number of blocks in the lower layer is less than t^(i+1)+1 when i+1 < n, or less than t^n when i = n-1;
condition 2, if the lower layer is not the last layer, none of the child blocks has reached the storage threshold;
splitting operation is to split the block into two so as to make the number of child blocks of the two newly generated blocks equal;
the triggering condition of the splitting operation is that the data amount stored in the block reaches a storage threshold value and the number of child blocks of the block is more than 2 t;
the operation needs to satisfy the execution condition that the number of blocks in the layer where the block is located is less than t^i+1;
Merging operation is to move down the data in the block to the next layer, and after the down-brushing, the range of the block is deleted, so that the number of the blocks in the layer where the block is located is reduced by 1;
the merging operation is triggered by the fact that the number of blocks in the layer in which the block is located is equal to t^i+1;
This operation needs to satisfy the following two execution conditions,
condition 1, the number of blocks in the lower layer is less than t^(i+1)+1 when i+1 < n, or less than t^n when i+1 = n;
condition 2, if the lower layer is not the last layer, none of the child blocks has reached the storage threshold.
2. The mass data storage method simultaneously adaptive to the read-write characteristics of the magnetic disk and the solid state disk as claimed in claim 1, wherein: when a user thread inserts a record, there are three cases,
1) if the variable memory cache does not reach the capacity threshold, adding the record into a user log, and then inserting the record into the variable memory cache;
2) if the variable memory cache reaches the capacity threshold and the invariable memory cache does not exist, renaming the variable memory cache to the invariable memory cache, and then newly establishing a variable memory cache insertion record;
3) and if the variable memory cache reaches the capacity threshold and the invariable memory cache exists, the variable memory cache is destroyed after the background thread writes the invariable memory cache into the disk, and the user thread processes according to the step 2).
3. The mass data storage method simultaneously adaptive to the read-write characteristics of the magnetic disk and the solid state disk as claimed in claim 2, wherein: based on the Log-Structured Appending Tree, the background thread writes the immutable memory cache to the disk through the following steps,
step 1.1, if the number of blocks in the last layer is equal to t^n, setting n = n+1 and creating a new layer, which becomes the new last layer;
step 1.2, selecting tasks to be processed, wherein each task comprises a block to be processed and an operation to be executed on the block, and the invariable memory cache is also regarded as a special block; the selection operation is provided with three priorities, from high to low,
priority 1: flushing down the immutable memory cache; if the execution condition of this operation is not met, continuing to judge the priority 2 condition;
priority 2: for the non-last layers, starting from the upper layer, judging whether there is a layer whose number of blocks is equal to t^i+1 and whose lower layer has fewer than t^(i+1)+1 blocks when i+1 < n, or fewer than t^n blocks when i = n-1;
if so, selecting a certain block in the layer to carry out merging operation so as to reduce the number of the blocks of the layer; then selecting an optimal block from the candidate set, and executing merging operation on the selected optimal block;
if no such layer exists, continuing to determine a priority 3 condition;
priority 3: sequentially judging whether blocks with the stored data quantity reaching a storage threshold exist from the upper layer to the lower layer, and if so, selecting a first block encountered in the traversal process; if the number of the child blocks of the block is less than 2t, performing a brushing-down operation on the block; if the number of the child blocks of the block is more than or equal to 2t, splitting the block;
if the selected block is to be subjected to the brushing-down operation, but the execution condition of the operation is not met because the block has a child block which reaches the storage threshold, selecting the child block to perform the brushing-down or splitting operation instead, and repeating the steps to perform recursive search until the first block meeting the brushing-down or splitting execution condition is finally selected;
if any target block and operation are not selected finally, when the user continues to insert data, the step 1.1 is started to be executed again;
step 1.3, actual disk operations including a brushing operation, a merging operation or a splitting operation are executed according to the obtained tasks;
step 1.4, applying for an exclusive lock, writing the structure information of the tree modified by the actual disk operation into a tree metadata change log after the application is successful, and updating the tree metadata in the memory according to the information;
step 1.5, if the down-shifting operation of the unchangeable memory cache is processed, destroying the unchangeable memory cache; if the user thread is sleeping, the user thread is awakened; all locks acquired by the thread are unlocked, and the thread continues to execute from step 1.1.
4. The mass data storage method simultaneously adaptive to the read-write characteristics of the magnetic disk and the solid state disk as claimed in claim 3, wherein: when a user needs to read data, the steps are executed as follows:
step 2.1, reading the variable memory cache, and returning if the required record is read;
step 2.2, reading the memory cache of the immutable memory, and returning if reading the required record;
step 2.3, reading layers 1 to n in sequence, returning once the record is found, and if it is not found in the last layer, indicating that no corresponding record exists in the database.
5. The mass data storage method simultaneously adaptive to the read-write characteristics of the magnetic disk and the solid state disk as claimed in claim 3, wherein: in step 1.3, if the task is a brushing operation, the task is divided into 3 cases,
in case 1, if the block to be brushed down does not have child blocks, the step 1.4 is directly entered to modify the meta information of the block to realize downward shifting; the range of the downshifted blocks of the current layer is preserved;
case 2, if the piece to be brushed down has child pieces and the next layer is the last layer,
for a record falling within the range of a certain block in the last layer, directly modifying the block;
for records falling outside the range of all the blocks in the last layer, selecting child blocks with the minimum distance to the keys to be inserted into the records for modification, and modifying the key range of the child blocks;
the last-layer child block is modified by an append operation if the data stored in the block has not reached the threshold; if it has, the data to be written and the original data are merge-sorted to generate several new blocks;
case 3, if the piece to be brushed down has child pieces and the next layer is not the last layer,
for a record falling within the range of a block of the next layer, directly appending data to the block;
for records falling outside the range of all the blocks, selecting child blocks with the minimum distance from the keys to be inserted into the records for adding, and changing the key range of the child blocks; the range of the downshifted blocks of the current layer is preserved.
6. The mass data storage method simultaneously adaptive to the read-write characteristics of the magnetic disk and the solid state disk as claimed in claim 5, wherein: if the task is a merge operation, the data in the block is moved down to the next layer in the same way as the flush operation, and the range of the block is deleted after the flush operation, so that the number of blocks in the layer where the block is located is reduced by 1.
7. The mass data storage method simultaneously adapting to the read-write characteristics of the magnetic disk and the solid state disk according to claim 1, 2, 3, 4, 5 or 6, characterized in that: the data stored in the block comprises index data, a bloom filter and a user record, wherein the index data and the bloom filter are stored at the tail end of the block, and the user record is stored at the front end of the block.
8. The mass data storage method simultaneously adaptive to the read-write characteristics of the magnetic disk and the solid state disk as claimed in claim 7, wherein: when the free hole in the middle of the block cannot hold all the data of the next write but can hold its index data and bloom filter, the index data and the bloom filter are stored at the rear end of the block, and the user records are appended to the tail of the block.
9. The mass data storage method simultaneously adaptive to the read-write characteristics of the magnetic disk and the solid state disk as claimed in claim 7, wherein: when the free hole in the middle of the block cannot hold the index data and the bloom filter of the data to be written next, the data to be written and the original data are merge-sorted to generate a new block; alternatively, instead of merge sorting, all index data, bloom filters, and user records are appended to the end of the block.
CN201611255923.7A 2016-12-30 2016-12-30 Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk Active CN106708442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611255923.7A CN106708442B (en) 2016-12-30 2016-12-30 Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611255923.7A CN106708442B (en) 2016-12-30 2016-12-30 Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk

Publications (2)

Publication Number Publication Date
CN106708442A CN106708442A (en) 2017-05-24
CN106708442B true CN106708442B (en) 2020-02-14

Family

ID=58905003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611255923.7A Active CN106708442B (en) 2016-12-30 2016-12-30 Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk

Country Status (1)

Country Link
CN (1) CN106708442B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247624B (en) * 2017-06-05 2020-10-13 安徽大学 Key-Value system oriented collaborative optimization method and system
TWI645295B (en) * 2017-06-20 2018-12-21 慧榮科技股份有限公司 Data storage device and data storage method
CN107391088B (en) * 2017-07-24 2021-03-02 苏州浪潮智能科技有限公司 Data information sequencing method, CPU (Central processing Unit) end, FPGA (field programmable Gate array) end and system
CN107515827B (en) * 2017-08-21 2021-07-27 湖南国科微电子股份有限公司 PCIE SSD custom log storage method and device and SSD
CN109508140B (en) * 2017-09-15 2022-04-05 阿里巴巴集团控股有限公司 Storage resource management method and device, electronic equipment and system
CN109033365B (en) * 2018-07-26 2022-03-08 郑州云海信息技术有限公司 Data processing method and related equipment
CN109542339B (en) * 2018-10-23 2021-09-03 拉扎斯网络科技(上海)有限公司 Data layered access method and device, multilayer storage equipment and storage medium
CN109271570A (en) * 2018-10-30 2019-01-25 郑州云海信息技术有限公司 A kind of method of metadata management inquiry
CN109933570B (en) * 2019-03-15 2020-02-07 中山大学 Metadata management method, system and medium
CN110727403B (en) * 2019-09-12 2021-03-30 华为技术有限公司 Metadata management method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103597785A (en) * 2011-06-07 2014-02-19 华为技术有限公司 Method and apparatus for content identifier based radius constrained cache flooding to enable efficient content routing
CN104978239A (en) * 2014-04-08 2015-10-14 重庆邮电大学 Method, device and system for realizing multi-backup-data dynamic updating
CN105117415A (en) * 2015-07-30 2015-12-02 西安交通大学 Optimized SSD data updating method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103597785A (en) * 2011-06-07 2014-02-19 华为技术有限公司 Method and apparatus for content identifier based radius constrained cache flooding to enable efficient content routing
CN104978239A (en) * 2014-04-08 2015-10-14 重庆邮电大学 Method, device and system for realizing multi-backup-data dynamic updating
CN105117415A (en) * 2015-07-30 2015-12-02 西安交通大学 Optimized SSD data updating method

Also Published As

Publication number Publication date
CN106708442A (en) 2017-05-24

Similar Documents

Publication Publication Date Title
CN106708442B (en) Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk
CN110347336B (en) Key value storage system based on NVM (non volatile memory) and SSD (solid State disk) hybrid storage structure
US8412752B2 (en) File system having transaction record coalescing
US7194589B2 (en) Reducing disk IO by full-cache write-merging
US8667029B2 (en) Optimized startup verification of file system integrity
CN111399777A (en) Differentiated key value data storage method based on data value classification
CN109933570A (en) A kind of metadata management method, system and medium
CN103631940A (en) Data writing method and data writing system applied to HBASE database
JP2007012056A (en) File system having authentication of postponed data integrity
CN103279532A (en) Filtering system and filtering method for removing duplication of elements of multiple sets and identifying belonged sets
CN106469120A (en) Scrap cleaning method, device and equipment
CN111857890B (en) Service processing method, system, device and medium
CN112269786A (en) Method for creating KV storage engine index of memory database
CN104077078B (en) Read memory block, update the method and device of memory block
KR20170065374A (en) Method for Hash collision detection that is based on the sorting unit of the bucket
KR100907477B1 (en) Apparatus and method for managing index of data stored in flash memory
CN103425802B (en) Method for quickly retrieving magnetic disk file
US20120317384A1 (en) Data storage method
KR102233880B1 (en) Method and apparatus for storing data based on single-level
CN1464451A (en) A sorting method of data record
KR101861475B1 (en) Database method for PCM+ tree of non-volatile memory
CN115840769A (en) Method and device for processing jump table based on range partition
KR100982591B1 (en) File system, main storage and flash storage for progressive indexing and data management method using the progressive indexing
IL157385A (en) Organising data in a database
US20220245123A1 (en) Fast Skip List Purge

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20191023

Address after: 430075 No. 201-5, floor 2, unit 2, north main building, phase II, National Geospatial Information Industry base, No. 5-2, wudayuan Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant after: Hard rock technology (Wuhan) Co., Ltd

Address before: 430070, No. two, building 2032, capital building, No. 1, National Road, East Lake New Technology Development Zone, Hubei, Wuhan, Optics Valley

Applicant before: Wuhan Safety Technology Co., Ltd.

GR01 Patent grant
GR01 Patent grant