CN106708442A - Massive data storage method simultaneously applicable to disk and solid state disk reading and writing features - Google Patents

Massive data storage method simultaneously applicable to disk and solid state disk reading and writing features

Info

Publication number
CN106708442A
Authority
CN
China
Prior art keywords
block
layer
data
child
record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611255923.7A
Other languages
Chinese (zh)
Other versions
CN106708442B (en)
Inventor
龚才鑫
龚奕利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hard rock technology (Wuhan) Co., Ltd
Original Assignee
Wuhan Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Safety Technology Co Ltd
Priority to CN201611255923.7A
Publication of CN106708442A
Application granted
Publication of CN106708442B
Active (current legal status)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638: Organizing or formatting or addressing of data
    • G06F3/064: Management of blocks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061: Improving I/O performance
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671: In-line storage system
    • G06F3/0683: Plurality of storage devices
    • G06F3/0685: Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Abstract

The invention provides a massive data storage method suited to the read/write characteristics of both magnetic disks and solid-state disks. The full ordering of records within each block is relaxed to partial ordering, and a Bloom filter is added at the tail of each block. A Log-Structured Append-Tree is built: when the amount of data stored in a block of the tree reaches a threshold, the data in that block is appended directly to the corresponding child blocks, so that a child block's data consists of several sorted sequences instead of being kept fully sorted within the block by merge sort. Each block in the tree stores a Bloom filter. Without sacrificing other performance characteristics, the method greatly reduces write amplification and greatly improves random-write efficiency. It also better protects and extends the service life of solid-state disks. In mixed read/write workloads, random-read performance is improved as well, so the method has significant market value.

Description

Massive data storage method simultaneously adapted to the read/write characteristics of magnetic disks and solid-state disks
Technical field
The invention belongs to the field of massive data storage and in particular relates to a storage tree; the method accommodates the read/write characteristics of both magnetic disks and solid-state disks.
Background
Index trees commonly used on existing hard disks include the B-tree, the LSM-tree, and the buffer-tree. The B-tree is the traditional classic, but because random-write workloads inevitably cause random disk writes, its performance is low when storing massive data, so variants are often used instead; BigTable, for example, combines a B-tree variant with an LSM-tree. For massive data storage, an LSM-tree or a buffer-tree (also called a fractal-tree) is often used as the index tree. What the two have in common is that writes of incoming records are deferred and processed in batches once a certain amount has accumulated. This largely solves the random disk write problem that the B-tree suffers under random-write workloads, so write throughput improves considerably.
Under random-write workloads, because LSM-trees and buffer-trees have more levels and their blocks are much larger than B-tree blocks, read amplification is larger and random-read performance drops noticeably. To address this, projects such as BigTable and LevelDB store Bloom filter information in each node when implementing the LSM-tree, which substantially reduces the LSM-tree's read amplification and largely solves the problem of low random-read performance.
However, whether B-tree or LSM-tree/buffer-tree, the write amplification of these trees is large. Because disk throughput is limited, large write amplification prevents any further substantial improvement in these index trees' random-write performance, and it seriously shortens the life of solid-state disks. Large write amplification also consumes most of the disk's throughput, so that in mixed read/write workloads random writes interfere with random reads' use of the disk, and random-read performance declines to some degree as well.
Summary of the invention
The problem to be solved by the invention is as follows. The large write amplification of traditional trees makes random writes inefficient, and on solid-state disks it also seriously shortens device life. Large write amplification consumes most of the throughput of the mechanical disk or solid-state disk, so that in mixed read/write workloads random writes interfere with random reads' use of device performance, and random-read performance declines to some degree. A tree called the Log-Structured Append-Tree (LSA-tree for short) is therefore designed.
The invention provides a massive data storage method suited to the read/write characteristics of both magnetic disks and solid-state disks. The full ordering of records within a block is relaxed to partial ordering, and a Bloom filter is added at the tail of each block. It is implemented as follows. Memory holds a mutable memory cache, an immutable memory cache, and the metadata of the tree; data on disk is organized as an LSA-tree. Let the tree have n layers. Layer i (1 ≤ i ≤ n-1) has at least t^i and at most t^i + 1 blocks, where the parameter t is the ratio between the block-count thresholds of adjacent layers; the last layer has at most t^n blocks. Each block has a key range. When the amount of data stored in a block reaches the corresponding threshold, the block's data is flushed into the blocks of the next layer whose key ranges overlap it. Because the flushed data is appended directly to the corresponding block, that block's data consists of several sorted sequences, rather than being kept fully sorted within the block by merge sort. Each block in the tree stores a Bloom filter.
In addition, the operations that background threads perform on blocks of the LSA-tree fall into three classes: flush, split, and merge. All operations are initiated only on blocks outside the last layer. The key-range overlap between a block in the current layer and one or more blocks in the layer below is called a parent-child relationship: the block in the current layer is the parent block, and the overlapping block or blocks in the next layer are its child blocks.
A flush moves the data in a block down into the next layer, but the block's range is retained, so the number of blocks in its layer does not change.
The trigger condition for a flush is that the block's stored data reaches the storage threshold and the block has fewer than 2t child blocks.
After triggering, the flush is carried out only when both of the following execution conditions are satisfied:
Condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 when i+1 < n, or less than t^n when i+1 = n;
Condition 2: if the lower layer is not the last layer, none of the child blocks has reached the storage threshold.
A split divides a block into two such that the two new blocks have equal numbers of child blocks.
The trigger condition for a split is that the block's stored data reaches the storage threshold and the block has more than 2t child blocks.
The execution condition for a split is that the number of blocks in the block's layer is less than t^i + 1.
A merge moves the data in a block down into the next layer and deletes the block's range after the flush, so that the number of blocks in its layer decreases by 1.
The trigger condition for a merge is that the number of blocks in the block's layer equals t^i + 1.
The merge is carried out only when both of the following execution conditions are satisfied:
Condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 when i+1 < n, or less than t^n when i+1 = n;
Condition 2: if the lower layer is not the last layer, none of the child blocks has reached the storage threshold.
When a user thread inserts a record, there are three cases:
1) If the mutable memory cache has not reached its capacity threshold, the record is appended to the user log and then inserted into the mutable memory cache;
2) If the mutable memory cache has reached its capacity threshold and no immutable memory cache exists, the mutable cache is first renamed to become the immutable memory cache, and a new mutable memory cache is then created into which the record is inserted;
3) If the mutable memory cache has reached its capacity threshold and an immutable memory cache exists, the user thread waits until a background thread has written the immutable memory cache to disk and destroyed it, and then proceeds as in case 2).
Based on the LSA-tree, a background thread writes the immutable memory cache to disk through the following steps:
Step 1.1: if the number of blocks in the last layer equals t^n, set n = n+1 and create a new layer, which becomes the new last layer;
Step 1.2: choose a task to process. Each task consists of a block to be processed and the operation to be performed on it; the immutable memory cache is also treated as a special kind of block. The selection uses three priorities, from high to low:
Priority 1: flush the immutable memory cache. If the execution conditions for the flush are not met, go on to evaluate priority 2;
Priority 2: among the non-last layers, searching from the top layer down, determine whether there is a layer i whose block count equals t^i + 1 and whose lower layer has fewer than t^(i+1) + 1 blocks (when i+1 < n) or fewer than t^n blocks (when i+1 = n);
If such a layer exists, a block in that layer is selected for merging to reduce the layer's block count: the optimal block is chosen from a candidate set, and the merge is performed on it;
If no such layer exists, go on to evaluate priority 3;
Priority 3: from the top layer down, determine whether any block's stored data has reached the storage threshold; if so, choose the first such block encountered in the traversal. If that block has fewer than 2t child blocks, a flush is performed on it;
If that block has 2t or more child blocks, a split is performed on it;
If the chosen block is to be flushed but the flush's execution conditions are not met because the block has a child block that has already reached the storage threshold, that child block is selected instead for a flush or split, and the search proceeds recursively in this way until the first block that meets the flush or split execution conditions is chosen;
If in the end no target block and operation are selected, execution restarts from step 1.1 when the user continues to insert data;
Step 1.3: perform the actual disk operation for the task obtained, which is a flush, a merge, or a split;
Step 1.4: request an exclusive lock; once it is granted, write the tree-structure changes made by the disk operation to the tree metadata change log, and update the tree metadata in memory with the same information;
Step 1.5: if the operation processed was the flush of the immutable memory cache, destroy the immutable memory cache; if a user thread is sleeping, wake it. Release all locks held by this thread, then continue executing from step 1.1.
When the user needs to read data, the following steps are executed:
Step 2.1: read the mutable memory cache; if the required record is found, return it;
Step 2.2: read the immutable memory cache; if the required record is found, return it;
Step 2.3: read layers 1 through n in order, returning as soon as the record is found; if it has not been found by the last layer, the database does not contain the corresponding record.
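The read order in steps 2.1 to 2.3 can be illustrated with a minimal sketch. The dictionary-based caches, the `range` and `records` fields, and the function name `lsa_get` are assumptions made for illustration only; the patent does not prescribe these structures.

```python
from typing import Optional


def lsa_get(key, mutable_cache, immutable_cache, layers) -> Optional[str]:
    """Return the first record found, searching in the order of steps 2.1-2.3."""
    # Step 2.1: look in the mutable memory cache first.
    if key in mutable_cache:
        return mutable_cache[key]
    # Step 2.2: then in the immutable memory cache, if one currently exists.
    if immutable_cache is not None and key in immutable_cache:
        return immutable_cache[key]
    # Step 2.3: then in layers 1..n, top to bottom; the first hit is returned.
    for layer in layers:
        for block in layer:
            low, high = block["range"]
            if low <= key <= high and key in block["records"]:
                return block["records"][key]
    # Not found in any layer: the record does not exist.
    return None
```

Because the search stops at the first hit, a record present in the caches or in an upper layer hides any copy of the same key stored further down.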
In step 1.3, if the task is a flush, there are three cases:
Case 1: if the block to be flushed has no child blocks, go directly to step 1.4 and modify the block's metadata to effect the move. The range of the moved block in the current layer is retained;
Case 2: if the block to be flushed has child blocks and the next layer is the last layer:
For records that fall within the range of some block in the last layer, that block is modified directly;
For records that fall outside the ranges of all blocks in the last layer, the child block nearest to the key of the record being inserted is modified, and that child block's range is changed accordingly;
A last-layer child block is modified as follows: if the data stored in the block has not reached the threshold, the data is appended; if it has, the data to be written is merge-sorted with the existing data to generate several new blocks;
Case 3: if the block to be flushed has child blocks and the next layer is not the last layer:
For records that fall within the range of some block in the next layer, the data is appended directly to that block;
For records that fall outside the ranges of all blocks, the child block nearest to the key of the record being inserted is selected for appending, and that child block's key range is changed accordingly. The range of the moved block in the current layer is retained.
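The routing rule of case 3 can be sketched as follows. This is an illustrative sketch only: it assumes numeric keys, measures "distance" as the gap between the key and the nearest end of a child's range, and represents each child as a dictionary with hypothetical `range` and `runs` fields. Each flush appends one new sorted sequence per child that receives records.

```python
def key_distance(key, key_range):
    """Gap between a key and a key range; 0 when the range covers the key."""
    low, high = key_range
    if key < low:
        return low - key
    if key > high:
        return key - high
    return 0


def flush_into_children(records, children):
    """Append flushed (key, value) records to the covering or nearest child."""
    batches = {}
    for key, value in sorted(records):
        # A covering child has distance 0; otherwise the nearest child wins.
        idx = min(range(len(children)),
                  key=lambda j: key_distance(key, children[j]["range"]))
        low, high = children[idx]["range"]
        # Widen the child's key range when the key fell outside every range.
        children[idx]["range"] = (min(low, key), max(high, key))
        batches.setdefault(idx, []).append((key, value))
    # Each child gets its batch as one additional sorted sequence (append only).
    for idx, batch in batches.items():
        children[idx]["runs"].append(batch)
```

No existing data in a child is rewritten here, which is the point of the append approach: the cost of a flush is proportional to the data flushed, not to the data already stored in the children.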
If the task is a merge, the data in the block is moved down into the next layer in the same way as for a flush, and the block's range is deleted afterwards, so that the number of blocks in its layer decreases by 1.
The data stored in a block comprises index data, Bloom filters, and user records. Index data and Bloom filters are stored at the end of the block, and user records are stored at the front of the block.
When the free hole in the middle of the block (a hole is address space that is logically free but has no actual mechanical-disk or solid-state-disk address bound to it) cannot hold all of the data written this time but can hold the index data and Bloom filter, the index data and Bloom filter are stored at the rear of the block and the user records are appended to the tail of the block;
When the free hole in the middle of the block cannot hold even the index data and Bloom filter of the data written this time, the data to be written is merge-sorted with the existing data to generate a new block; alternatively, instead of merge sorting, the index data, Bloom filter, and user records are all appended to the tail of the block.
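The placement rules above amount to a three-way decision on the size of the free hole. The following sketch is an interpretation, not the patented layout code: the enum names and the byte-count parameters are hypothetical, and the first branch (everything fits in the hole) is inferred from the general layout of records at the front and index data plus Bloom filter at the end.

```python
from enum import Enum


class Placement(Enum):
    ALL_IN_HOLE = "records at the front of the hole, index and filter at its rear"
    META_IN_HOLE = "index and filter at the rear of the block, records appended to the tail"
    MERGE_OR_APPEND_ALL = "merge sort into a new block, or append everything to the tail"


def plan_append(hole_bytes, record_bytes, meta_bytes):
    """Choose where one write (records + index data + Bloom filter) should go."""
    if record_bytes + meta_bytes <= hole_bytes:
        # The hole can hold everything written this time.
        return Placement.ALL_IN_HOLE
    if meta_bytes <= hole_bytes:
        # Only the index data and Bloom filter fit in the hole.
        return Placement.META_IN_HOLE
    # Not even the index data and Bloom filter fit.
    return Placement.MERGE_OR_APPEND_ALL
```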
According to the invention, write amplification is substantially reduced and random-write efficiency greatly increased without sacrificing any other performance. In mixed read/write workloads, random-read performance is also improved. The service life of solid-state disks is better protected and extended, giving the method significant market value.
Brief description of the drawings
Fig. 1 is the basic architecture diagram of the storage method in an embodiment of the invention, mainly a schematic of the LSA-tree structure.
Fig. 2 is a logic diagram of flushing a block's data into the last layer when a disk operation is performed in an embodiment of the invention.
Fig. 3 is a logic diagram of flushing a block's data into a non-last layer when a disk operation is performed in an embodiment of the invention.
Fig. 4 is a schematic of the on-disk layout of a block as designed in an embodiment of the invention.
Fig. 5 is a schematic of an alternative on-disk layout of a block as designed in an embodiment of the invention.
Detailed description of the embodiments
The core problem the invention addresses is that the large write amplification of traditional trees lowers write performance and mixed read/write performance, and on solid-state disks also seriously shortens device life. The invention solves this by relaxing the full ordering of records within a block to partial ordering and adding a Bloom filter at the tail of each block, so that the scheme's impact on read performance is minimized.
Fig. 1 is the basic architecture diagram of the storage method provided by the embodiment, divided into a memory part and a disk part. Memory holds one mutable memory cache and one immutable memory cache, plus the metadata of the tree. The tree metadata describes the metadata of each block in the tree. A block's metadata includes its range, the layer it belongs to, the size of the free hole in the middle of the block, the number of times it has been appended to, and so on. Block metadata is grouped by layer, and within each group it is kept sorted by comparing the block ranges stored in the metadata. Data on disk is organized as an LSA-tree.
Blocks in memory use a fully sorted structure and are of two kinds: the mutable memory cache and the immutable memory cache. The former is a block that has not reached the block storage capacity threshold, into which user records can be inserted directly; the latter has reached the threshold and can only be read, not modified. When a user thread inserts a record, there are three cases:
1) If the mutable memory cache has not reached its capacity threshold, the record is appended to the user log and then inserted into the mutable memory cache, and the call returns;
2) If the mutable memory cache has reached its capacity threshold and no immutable memory cache exists, the mutable cache is first renamed to become the immutable memory cache, a new mutable memory cache is created into which the record is inserted, and the call returns;
3) If the mutable memory cache has reached its capacity threshold and an immutable memory cache exists, the user thread waits until a background thread has written the immutable memory cache to disk and destroyed it (this process is detailed below), and then proceeds as in case 2).
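The three insertion cases can be sketched in a single-threaded form. Several things here are assumptions for illustration: capacity is counted in records, the class and attribute names are hypothetical, and the wait for the background thread in case 3 is replaced by a synchronous callback `flush_immutable`.

```python
class MemoryCaches:
    """Single-threaded sketch of the mutable/immutable cache hand-off."""

    def __init__(self, capacity, flush_immutable):
        self.capacity = capacity        # assumed: capacity counted in records
        self.mutable = {}
        self.immutable = None
        self.user_log = []
        # Hypothetical stand-in for "wait for the background thread to write
        # the immutable cache to disk and destroy it".
        self._flush_immutable = flush_immutable

    def insert(self, key, value):
        if len(self.mutable) >= self.capacity:
            if self.immutable is not None:
                # Case 3: the immutable cache must be written out first.
                self._flush_immutable(self.immutable)
                self.immutable = None
            # Case 2: rename the full mutable cache and start a new one.
            self.immutable = self.mutable
            self.mutable = {}
        # Case 1 (and the tail of cases 2 and 3): log, then insert.
        # Assumption: the record is logged in every case; the text mentions
        # the user log explicitly only in case 1.
        self.user_log.append((key, value))
        self.mutable[key] = value
```

In a real multi-threaded setting the callback would be replaced by the user thread sleeping until the background thread wakes it, as described in step 1.5.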
Data on disk is organized as an LSA-tree. The tree has n layers, each consisting of multiple blocks, and the number of blocks per layer grows exponentially. Layer i (1 ≤ i ≤ n-1) has t^i or t^i + 1 blocks, and the last layer (layer n) has at most t^n blocks, where t is a positive integer greater than or equal to 2, for example 10. As labeled from top to bottom in Fig. 1: layer L1 has t^1 blocks, layer L2 has t^2 blocks, ..., layer L(n-1) has t^(n-1) blocks, and layer Ln has x blocks, where 0 < x ≤ t^n. The parameter t is the ratio between the block-count thresholds of adjacent layers; in a specific implementation, those skilled in the art can preset the number of layers n and the parameter t as needed, for example n = 7 and t = 10. Each block has a key range. When the amount of data stored in a block reaches the corresponding threshold, the block's data is flushed into the blocks of the next layer whose key ranges overlap it. In most cases the flushed data is appended directly to the corresponding block (so the data in the resulting block consists of several sorted sequences) instead of being merged in by merge sort, which avoids excessive write amplification. When the block size threshold in the tree is in the tens of megabytes, for example 64 MB, then even if a block is split into several pieces written to blocks of the next layer, the average amount written to each block still reaches several megabytes, which makes good use of the sequential write performance of magnetic disks and solid-state disks.
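The layer-size constraints and the "several megabytes per child" estimate can be checked with a short calculation. The function names are hypothetical, and the per-child estimate assumes the flushed block has fewer than 2t children (the flush trigger condition) and that its data is spread over all of them.

```python
def layer_block_bounds(i, n, t):
    """Return (min_blocks, max_blocks) allowed in layer i of an n-layer tree."""
    if not 1 <= i <= n:
        raise ValueError("layer index must satisfy 1 <= i <= n")
    if i < n:
        return (t ** i, t ** i + 1)   # non-last layers: t^i or t^i + 1 blocks
    return (1, t ** n)                # last layer: 0 < x <= t^n blocks


def min_average_flush_mb(block_mb, t):
    """Lower bound on the average data per child when a full block is flushed.

    A block is flushed only if it has fewer than 2t children, so its data is
    spread over at most 2t - 1 child blocks.
    """
    return block_mb / (2 * t - 1)


# Example with the parameters mentioned in the text: n = 7, t = 10, 64 MB blocks.
bounds_layer_2 = layer_block_bounds(2, 7, 10)   # (100, 101)
avg_mb = min_average_flush_mb(64, 10)           # a little over 3 MB per child
```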
Each block in the tree stores a Bloom filter. When reading a record, the user does not need to read every sequence in the block; it is enough to read the Bloom filter, which occupies little space, to determine whether the queried record is in a given sequence of the block. The user's read performance is therefore barely affected compared with fully sorted blocks.
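A small sketch shows how a Bloom filter lets a lookup skip sorted sequences that cannot contain the key. The details are assumptions for illustration: one filter per sorted sequence, SHA-256 based hashing, a fixed filter size, and newer sequences being consulted first. The class names `BloomFilter` and `Block` are hypothetical.

```python
import bisect
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class Block:
    """A block holding several sorted sequences, each with its own filter."""

    def __init__(self):
        self.runs = []  # list of (sorted_keys, values, bloom_filter)

    def append_run(self, records):
        items = sorted(records)
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        bloom = BloomFilter()
        for k in keys:
            bloom.add(k)
        self.runs.append((keys, values, bloom))

    def get(self, key):
        # Assumption: later-appended sequences are newer and are checked first.
        for keys, values, bloom in reversed(self.runs):
            if not bloom.might_contain(key):
                continue  # definitely absent: skip reading this sequence
            idx = bisect.bisect_left(keys, key)
            if idx < len(keys) and keys[idx] == key:
                return values[idx]
        return None
```

A Bloom filter can report false positives but never false negatives, so a negative answer safely avoids the binary search, while a positive answer is still confirmed against the sequence itself.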
The operations that background threads perform on blocks of the tree fall into three classes: flush, split, and merge. All operations are initiated only on blocks outside the last layer, denoted layer Li (1 ≤ i ≤ n-1). For convenience of description, the key-range overlap between a block in the current layer and one or more blocks in the layer below is called a parent-child relationship: the block in the current layer is the parent block, and the overlapping block or blocks in the next layer are its child blocks.
A flush moves the data in a block down into the next layer, but the block's range is retained, so the number of blocks in its layer does not change. The trigger condition for a flush is that the block's stored data reaches the storage threshold and the block has fewer than 2t child blocks. The flush can be carried out only when both of the following execution conditions are satisfied. Condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 (when i+1 < n, that is, the next layer is not the last layer) or less than t^n (when i+1 = n, that is, the next layer is the last layer Ln). Condition 2: if the lower layer is not the last layer (i+1 < n), none of the child blocks has reached the storage threshold. For the detailed steps of the flush, see step 1.3.
A split divides a block into two such that the two new blocks have equal numbers of child blocks. The trigger condition for a split is that the block's stored data reaches the storage threshold and the block has more than 2t child blocks. The execution condition is that the number of blocks in the block's layer is less than t^i + 1. For the detailed steps, see step 1.3.
A merge is similar to a flush: it moves the data in a block down into the next layer. The only difference is that the block's range is deleted after the flush, so that the number of blocks in its layer decreases by 1. The trigger condition for a merge is that the number of blocks in the block's layer equals t^i + 1. The merge can be carried out only when both of the following execution conditions are satisfied. Condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 (when i+1 < n, that is, the next layer is not the last layer) or less than t^n (when i+1 = n, that is, the next layer is the last layer Ln). Condition 2: if the lower layer is not the last layer, none of the child blocks has reached the storage threshold.
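The trigger and execution conditions of the three operations can be written as small predicates. This is a sketch under assumptions: the `BlockState` fields and function names are hypothetical, and the split trigger follows the "more than 2t" wording of this passage (the task-selection step elsewhere in the text splits at 2t or more).

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    layer: int            # i, 1-based; operations start only where i <= n - 1
    size: int             # amount of data stored in the block
    num_children: int     # number of child blocks in layer i + 1
    any_child_full: bool  # True if some child has reached the storage threshold


def lower_layer_has_room(i, n, t, lower_count):
    """Condition 1 shared by flush and merge."""
    limit = t ** (i + 1) + 1 if i + 1 < n else t ** n
    return lower_count < limit


def flush_triggered(b, threshold, t):
    return b.size >= threshold and b.num_children < 2 * t


def flush_executable(b, n, t, lower_count):
    if not lower_layer_has_room(b.layer, n, t, lower_count):
        return False
    # Condition 2 applies only when the lower layer is not the last layer.
    if b.layer + 1 < n and b.any_child_full:
        return False
    return True


def split_triggered(b, threshold, t):
    return b.size >= threshold and b.num_children > 2 * t


def split_executable(b, t, layer_count):
    return layer_count < t ** b.layer + 1


def merge_triggered(b, t, layer_count):
    return layer_count == t ** b.layer + 1


# A merge has the same two execution conditions as a flush.
merge_executable = flush_executable
```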
A layer or block that prevents an operation from meeting its execution conditions is called a blocking layer or blocking block, since it blocks the operation from proceeding.
Operations on data blocks that have no logical dependency on each other can run in parallel. Each block on disk, as well as the mutable memory cache and the immutable memory cache in memory, has its own exclusive lock bound to it one-to-one. When an operation modifies blocks, it must take the exclusive locks of the blocks to be modified in order, to prevent a block from being modified by multiple threads at once and corrupting the data.
In the embodiment, the detailed flow for writing the immutable memory cache into the LSA-tree (that is, the flow executed by the background thread) is as follows:
Step 1.1: if the number of blocks in the last layer equals t^n, set n = n+1 and create a new layer, which becomes the new last layer. Go to step 1.2.
Step 1.2: choose a task to process. Each task consists of a block to be processed (the immutable memory cache is also treated here as a special kind of block) and the operation to be performed on it. The selection uses three priorities (these ensure that the block count of every layer meets the per-layer requirements above and that the tree can efficiently store the data flushed down from the immutable memory cache), from high to low:
Priority 1: flush the immutable memory cache. If the execution conditions for the flush are not met, go on to evaluate priority 2.
Priority 2: among the non-last layers Li (1 ≤ i ≤ n-1), searching from the top layer down, determine whether there is a layer whose block count equals t^i + 1 and whose lower layer has fewer than t^(i+1) + 1 blocks (when i+1 < n, that is, the next layer is not the last layer) or fewer than t^n blocks (when the next layer is the last layer Ln).
If such a layer exists, a block in that layer is selected for merging to reduce the layer's block count. The selection strategy is as follows:
All blocks whose number of child blocks is less than or equal to t are added to a candidate set (the tree's constraints on per-layer block counts make it clear that at least one such block must exist). The optimal block is then chosen from the candidate set by the following strategy: the larger the block's stored data volume divided by its number of child blocks, the better; and the smaller the number of child blocks of the new range produced by merging the block's range with that of an adjacent block, the better. The merge is performed on the optimal block selected. If no such layer exists, go on to evaluate priority 3.
Priority 3: from the top layer down, determine whether any block's stored data has reached the storage threshold; if so, choose the first such block encountered in the traversal (blocks that block the immutable memory cache take precedence). If that block has fewer than 2t child blocks, a flush is performed on it; if it has 2t or more child blocks, a split is performed on it. If the chosen block is to be flushed but the flush's execution conditions are not met because the block has a child block that has already reached the storage threshold, that child block is selected instead for a flush or split, and the search proceeds recursively in this way until the first block that meets the flush or split execution conditions is chosen. If in the end no target block and operation are selected, execution restarts from step 1.1.
Once the block to be processed and the operation to be performed on it have been chosen, the blocks to be modified are locked in order. If all locks are acquired, the task has been obtained successfully; if locking any block fails, all locks already taken are released and execution restarts from step 1.1.
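The all-or-nothing locking described above can be sketched with non-blocking lock attempts. The use of Python's `threading.Lock` and the helper name `try_lock_all` are illustrative assumptions; the text states only that a failed lock attempt releases every lock already taken.

```python
import threading


def try_lock_all(locks):
    """Acquire every lock in order, or release all of them and report failure."""
    acquired = []
    for lock in locks:
        if lock.acquire(blocking=False):
            acquired.append(lock)
        else:
            # One block is busy: undo the locks taken so far so that the
            # caller can go back to step 1.1 and choose a task again.
            for held in reversed(acquired):
                held.release()
            return False
    return True
```

Releasing everything on failure, instead of waiting while holding some locks, means a thread never blocks while owning a lock, which avoids deadlock between background threads working on overlapping sets of blocks.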
Step 1.1 ensures that, when the steps from step 1.2 onward are executed, the number of blocks in the last layer is necessarily less than t^n. Consequently, if in priority 2 of step 1.2 there are one or more layers whose block count equals t^i + 1, it is always possible to choose a layer that simultaneously satisfies the following: its block count equals t^i + 1, and the block count of the layer below it is less than t^(i+1) + 1 (when i+1 < n, that is, the next layer is not the last layer) or less than t^n (when the next layer is the last layer Ln). The merge operation is performed on that layer.
The purpose of priority 2 in step 1.2 is to ensure that priority 3 always has a task that meets its execution condition (so that the tasks in priority 3 are not all blocked because the block count of every layer equals t^i + 1), so that the tree can always operate normally.
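The candidate selection of priority 2 can be sketched as follows. The text names two criteria but does not say how they are combined, so this sketch assumes a lexicographic ordering (data volume per child block first, merged-range child count as the tie-breaker). The `Block` fields and the function name `pick_merge_block` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    stored_bytes: int        # data volume currently stored in the block
    child_count: int         # number of child blocks in the next layer
    merged_child_count: int  # child blocks of the range formed by merging
                             # this block's range with an adjacent block

def pick_merge_block(layer_blocks, t):
    """Pick the block to merge from a layer, or None if no candidate exists."""
    # Candidate set: blocks with at most t child blocks.
    candidates = [b for b in layer_blocks if b.child_count <= t]
    if not candidates:
        return None

    def score(b):
        # Larger data volume per child block is better; a block without
        # children is treated as having one to avoid division by zero.
        density = b.stored_bytes / max(b.child_count, 1)
        # Fewer child blocks after the range merge is better, hence the minus.
        return (density, -b.merged_child_count)

    return max(candidates, key=score)
```

A weighted combination of the two criteria would be an equally plausible reading; the lexicographic choice here is only for illustration.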
The purpose of priority 3 in step 1.2 is to process the blocks in the tree that have reached the storage threshold. Such a block may be one that is blocking an upper-layer task, in which case processing it lets the tree continue to store the data flushed down from the immutable memory cache; or it may be one that is not blocking any upper-layer task, in which case processing it optimizes performance.
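The recursive search of priority 3 can be sketched as follows. This is a simplified illustration: it models only the child-block rule (split at 2t or more children, and descend into a full child when a flush-down is blocked) and deliberately ignores the per-layer block-count conditions. The `Block` fields and the function name `choose_task` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    layer: int                 # 1-based layer index
    full: bool = False         # True once the storage threshold is reached
    children: list = field(default_factory=list)

def choose_task(block, t, n):
    """Return (block, operation) for a block that reached its threshold.

    t is the fan-out parameter and n the number of layers in the tree.
    """
    if len(block.children) >= 2 * t:
        return block, "split"
    lower_is_last = block.layer + 1 == n
    if not lower_is_last:
        # A flush-down is blocked while any child block is itself full,
        # so the first full child is handled first, recursively.
        for child in block.children:
            if child.full:
                return choose_task(child, t, n)
    return block, "flush-down"
```

Because the recursion only descends while the lower layer is not the last one, the selected block is never in the last layer, in line with the rule that operations are initiated only on blocks outside the last layer.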
Step 1.3: perform the actual disk operation according to the obtained task. The concrete operations are as follows (tasks that are logically independent and not mutually exclusive can be executed in parallel):
1) If it is a flush-down operation, there are 3 cases:
Case 1: The block to be flushed down has no child blocks. Proceed directly to step 1.4 and change the metadata (meta-information) of the block to realize the move down. The range of the moved-down block is retained in the current layer.
Case 2: The block to be flushed down has child blocks and the next layer is the last layer. For a record that falls within the range of some block in the last layer, that block is modified directly; for a record that falls outside the ranges of all blocks in the last layer, the child block whose distance to the key of the record to be inserted is smallest is chosen for modification, and in this latter case the range of that child block must be changed. The concrete operation for modifying a child block in the last layer is: if the data stored in the block has not reached the threshold, an append operation is performed; if it has, the data to be written and the original data are merge-sorted to generate several new blocks (so that the newly generated blocks outnumber the original blocks by as few as possible). The range of the moved-down block is retained in the current layer. Referring to Fig. 2, there are child blocks that are appended to as well as child blocks that are merge-sorted.
Case 3: The block to be flushed down has child blocks and the next layer is not the last layer. When the data is flushed down, a record that falls within the range of some block in the next layer is appended directly to that block; for a record that falls outside the ranges of all blocks, the child block whose distance to the key of the record to be inserted is smallest is chosen for the append, and the key range of that child block is changed. The range of the moved-down block is retained in the current layer. Referring to Fig. 3, only appended-to child blocks exist.
Compared with the prior art, this operation almost completely avoids merge sorting and replaces it with append operations, which greatly reduces write amplification and improves write performance.
2) If it is a merge operation, the concrete flow is similar to that of the flush-down operation: the data in the block is moved down into the next layer in the same manner as in the flush-down operation above. The only difference is that the range of the block is deleted after the flush-down, so that the number of blocks in the block's layer decreases by 1.
3) If it is a split operation, the block is split into two new blocks such that, after the split, the two newly generated blocks have equal numbers of child blocks.
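The record routing used in cases 2 and 3 of the flush-down operation (and reused by the merge operation) can be sketched as follows. This is a minimal illustration under stated assumptions: keys are integers, the "distance" from a key to a child block is taken as the distance to the nearer end of its range, and child blocks are plain dictionaries with hypothetical fields `lo`, `hi` and `pending`. The patent does not fix any of these details.

```python
def route_records(records, children):
    """Assign each (key, value) record to a child block.

    children is a list of dicts with keys 'lo', 'hi' (inclusive key range)
    and 'pending' (records to append to that child block).
    """
    for key, value in records:
        target = None
        # A record inside some child's range goes to that child.
        for child in children:
            if child["lo"] <= key <= child["hi"]:
                target = child
                break
        if target is None:
            # Otherwise pick the child whose range is nearest to the key
            # and widen its key range so that it covers the key.
            target = min(
                children,
                key=lambda c: min(abs(key - c["lo"]), abs(key - c["hi"])),
            )
            target["lo"] = min(target["lo"], key)
            target["hi"] = max(target["hi"], key)
        target["pending"].append((key, value))
```

The records collected in `pending` would then be appended to the child block as a new sorted sequence, or merge-sorted with its existing data when the child is a last-layer block that has reached its threshold.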
Step 1.4: apply for an exclusive lock, to ensure that only one background thread can perform this step at any given moment. Once the lock is acquired, the structural information of the tree changed by the actual disk operation is written to the "tree metadata change log", and the meta-information of the tree in memory is updated with this information.
Step 1.5: if the operation processed was the move-down of the "immutable memory cache", destroy the "immutable memory cache"; if a user thread is sleeping, wake it up. Release all locks acquired by this thread. This thread then continues execution from step 1.1.
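The log-then-update order of step 1.4 can be sketched as follows. This is only an illustration: the class name `TreeMeta`, the shape of a change record, and the use of an in-memory list as a stand-in for the on-disk "tree metadata change log" are all assumptions made for the example.

```python
import json
import threading

class TreeMeta:
    """In-memory tree metadata guarded by one exclusive lock."""

    def __init__(self):
        self._lock = threading.Lock()  # exclusive lock of step 1.4
        self.log = []                  # stand-in for the on-disk change log
        self.blocks = {}               # block name -> (layer, lo, hi)

    def commit(self, change):
        """Record a structural change, then apply it to the in-memory view."""
        with self._lock:
            # Write the change to the log first ...
            self.log.append(json.dumps(change, sort_keys=True))
            # ... then update the in-memory metadata from the same record.
            if change["op"] == "add":
                self.blocks[change["block"]] = (
                    change["layer"], change["lo"], change["hi"])
            elif change["op"] == "remove":
                self.blocks.pop(change["block"], None)
```

Holding the lock across both actions keeps the log and the in-memory view in the same order even when several background threads finish disk operations at about the same time.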
When a user needs to read data, the following steps are performed:
Step 2.1: read the "mutable memory cache"; if the required record is found, return it;
Step 2.2: read the "immutable memory cache"; if the required record is found, return it;
Step 2.3: read layers L1->Ln in turn and return the record as soon as it is found; if it has still not been found after the last layer, the record does not exist in the database. Thanks to MVCC (multi-version concurrency control), no lock needs to be held while reading from disk.
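The read order of steps 2.1 to 2.3 can be sketched as follows. This is a simplified illustration: the caches are plain dictionaries, each block is a dictionary with hypothetical fields `lo`, `hi` and `data`, and MVCC and Bloom filters are left out.

```python
def get(key, mutable, immutable, layers):
    """Look a key up in the caches first, then in layers L1..Ln in order."""
    # Step 2.1: the mutable memory cache holds the newest data.
    if key in mutable:
        return mutable[key]
    # Step 2.2: the immutable memory cache, if one currently exists.
    if immutable is not None and key in immutable:
        return immutable[key]
    # Step 2.3: upper layers hold newer data, so the first hit wins.
    for layer in layers:
        for block in layer:
            if block["lo"] <= key <= block["hi"] and key in block["data"]:
                return block["data"][key]
    return None  # the record does not exist in the database
```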
Fig. 4 shows how this method organizes a block on disk. As shown in Fig. 4 (left), the data stored in a block comprises index data, a Bloom filter, and user records; the former two are stored at the end of the block, and the user records are stored at the head of the block. When data is stored, three situations may arise:
1) When the free gap in the middle of the block can hold all the data of the current write, the storage layout shown in Fig. 4 (left) is used: the user records of the 1st through n-th writes are stored in order at the front of the block, and the index data and Bloom filters of the 1st through n-th writes are stored in order at the rear of the block;
2) When the free gap in the middle of the block cannot hold all the data of the current write but can hold its index data and Bloom filter, the storage layout shown in Fig. 4 (right) is used: the index data and Bloom filter of the (n+1)-th write are stored in order at the rear of the block, and the user records of the (n+1)-th write are appended to the tail of the block;
3) When the free gap in the middle cannot hold even the index data and Bloom filter, the data to be written and the original data are merge-sorted to generate a new block. In an implementation, this situation can be avoided almost completely by treating a block as having reached its storage threshold once the stored data reaches 95%. Further, the invention proposes that, more preferably, the merge sort can be replaced: instead of performing the above merge, the index data, Bloom filter, and user records are all appended to the tail of the block in the manner shown in Fig. 5.
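The three situations above amount to a size comparison against the free gap, which can be sketched as follows. The function name `choose_layout` and its byte-count parameters are hypothetical, and the sketch only classifies the write; it does not perform any I/O.

```python
def choose_layout(gap_bytes, record_bytes, index_bytes, bloom_bytes):
    """Classify a write into one of the three storage situations."""
    meta_bytes = index_bytes + bloom_bytes
    if record_bytes + meta_bytes <= gap_bytes:
        # Situation 1: everything fits in the gap (Fig. 4, left).
        return "case 1"
    if meta_bytes <= gap_bytes:
        # Situation 2: only index data and Bloom filter fit in the gap;
        # the user records are appended to the tail (Fig. 4, right).
        return "case 2"
    # Situation 3: merge-sort into a new block, or append everything
    # to the tail of the block (Fig. 5).
    return "case 3"
```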
In an implementation, a block can be realized as a file. The threshold of each file is 64 MB, but a file may exceed 64 MB, for example when it is stored in the manner of Fig. 4 (right).
Note that when the records in a block are read, the most recently appended sequences are read first; if the required record is found, the remaining unread sequences need not be read and the result can be returned directly.
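The newest-first lookup inside one block can be sketched as follows. This is a minimal illustration assuming each appended sequence is a sorted list of `(key, value)` tuples and that the sequences are kept oldest first; the function name `find_in_block` is hypothetical, and the per-block Bloom filter and index data are omitted.

```python
from bisect import bisect_left

def find_in_block(sequences, key):
    """Search the sorted sequences of one block, newest sequence first."""
    for seq in reversed(sequences):
        # (key,) sorts before any (key, value) pair, so bisect_left
        # lands on the first entry whose key is >= the searched key.
        i = bisect_left(seq, (key,))
        if i < len(seq) and seq[i][0] == key:
            return seq[i][1]  # newest version found, older ones are skipped
    return None
```

Each sequence is searched by binary search, so the cost of partial sorting on reads grows with the number of sequences in the block rather than with the number of records.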
The method described above is only one example of the idea of "changing the fully sorted order of the records in a block to a partially sorted order (made up of multiple sorted sequences) to greatly reduce write amplification, and then adding a Bloom filter to the index information of each block to minimize the impact of the scheme on read performance". Any modification, improvement, or the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention; for example, applying the same logic to the blocks in a buffer-tree is also within the scope of protection of the invention.

Claims (10)

1. A mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks, characterized in that: the full sorting of the records in a block is changed to partial sorting, and a Bloom filter is added at the tail of each block; the implementation is as follows: memory contains a mutable memory cache, an immutable memory cache, and the metadata information of a tree; a structure called a Log-Structured Append-Tree is established, and the data on disk is organized using the Log-Structured Append-Tree structure; suppose the tree is divided into n layers, where the i-th layer has at least t^i blocks and at most t^i + 1 blocks, 1≤i≤n-1, the parameter t is the ratio between the block-count thresholds of two adjacent layers, and the last layer has at most t^n blocks; each block has a key range; when the data volume stored in a block reaches the corresponding threshold, the data in the block is flushed down into the blocks of the next layer whose ranges overlap it; the flushed data is appended directly to the corresponding block, so that the data of a block consists of several sorted sequences, instead of full sorting within the block being achieved by merge sorting; each block in the tree stores a Bloom filter.
2. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 1, characterized in that: the operations performed by background threads on blocks in the Log-Structured Append-Tree fall into three classes: flush-down, split, and merge; all operations are initiated only on blocks in layers other than the last layer; the overlap in key range between a block of the current layer and one or more blocks of the lower layer is called a parent-child relation, the block of the current layer being called the parent block and the one or more blocks of the next layer being called child blocks;
The flush-down operation moves the data in a block down into the next layer while the range of the block is retained, so the number of blocks in the block's layer does not change;
The trigger condition of the flush-down operation is that the data volume stored in the block reaches the storage threshold and the number of child blocks of the block is less than 2t;
After being triggered, the flush-down is carried out only once both of the following execution conditions are satisfied,
Condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 and i+1 < n, or less than t^n and i = n-1;
Condition 2: if the lower layer is not the last layer, none of the child blocks may have reached the storage threshold;
The split operation splits a block into two, such that the two newly generated blocks have equal numbers of child blocks;
The trigger condition of the split operation is that the data volume stored in the block reaches the storage threshold and the number of child blocks of the block is greater than 2t;
The execution condition that the operation must meet is that the number of blocks in the block's layer is less than t^i + 1;
The merge operation moves the data in a block down into the next layer and deletes the range of the block after the flush-down, so that the number of blocks in the block's layer decreases by 1;
The trigger condition of the merge operation is that the number of blocks in the block's layer equals t^i + 1;
The operation must satisfy both of the following execution conditions,
Condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 and i+1 < n, or less than t^n and i+1 = n;
Condition 2: if the lower layer is not the last layer, none of the child blocks may have reached the storage threshold.
3. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 2, characterized in that: when a user thread inserts a record, there are the following three situations,
1) If the mutable memory cache has not reached its capacity threshold, the record is appended to the user log and then inserted into the mutable memory cache;
2) If the mutable memory cache has reached its capacity threshold and no immutable memory cache exists, the mutable memory cache is first renamed as the immutable memory cache, then a new mutable memory cache is created and the record is inserted into it;
3) If the mutable memory cache has reached its capacity threshold and an immutable memory cache exists, the user thread waits until the background thread has written the immutable memory cache to disk and destroyed it, and then proceeds as in 2).
4. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 3, characterized in that: based on the Log-Structured Append-Tree, the writing of the immutable memory cache to disk by a background thread comprises the following steps,
Step 1.1: if the number of blocks in the last layer equals t^n, let n = n+1 and create a new layer, the new layer becoming the new last layer;
Step 1.2: choose a task to be processed, each task comprising the selected block to be processed and the operation to be performed on that block, the immutable memory cache also being regarded as a special kind of block; this selection has three priorities, from high to low as follows,
Priority 1: the flush-down operation of the immutable memory cache; if the execution condition of the flush-down is not satisfied, continue to judge the condition of priority 2;
Priority 2: for layers other than the last layer, judge, starting from the upper layers, whether there exists a layer whose block count equals t^i + 1 and whose lower layer has a block count less than t^(i+1) + 1 with i+1 < n, or less than t^n with i = n-1;
If such a layer exists, a block in that layer is selected for a merge operation to reduce the number of blocks in the layer; the optimal block is chosen from the candidate set, and the merge operation is performed on the selected optimal block;
If no such layer exists, continue to judge the condition of priority 3;
Priority 3: judge, from the upper layers to the lower layers in turn, whether there is a block whose stored data volume has reached the storage threshold, and if so, choose the first block encountered during the traversal; if the number of child blocks of the block is less than 2t, a flush-down operation will be performed on the block;
If the number of child blocks of the block is greater than or equal to 2t, a split operation will be performed on it;
If the chosen block is to be flushed down but the execution condition of that operation is not satisfied because the block has a child block that has already reached the storage threshold, that child block is selected instead for a flush-down or split operation, and the search continues recursively in this way until the first block that meets the execution condition of a flush-down or split is selected;
If no target block and operation are ultimately selected, then, when the user continues to insert data, execution restarts from step 1.1;
Step 1.3: perform the actual disk operation according to the obtained task, the operation being a flush-down operation, a merge operation, or a split operation;
Step 1.4: apply for an exclusive lock; once the lock is acquired, write the structural information of the tree changed by the actual disk operation to the tree metadata change log, and update the meta-information of the tree in memory with this information;
Step 1.5: if the operation processed was the move-down of the immutable memory cache, destroy the immutable memory cache; if a user thread is sleeping, wake it up; release all locks acquired by this thread, and this thread continues execution from step 1.1.
5. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 4, characterized in that:
When a user needs to read data, the following steps are performed:
Step 2.1: read the mutable memory cache; if the required record is found, return it;
Step 2.2: read the immutable memory cache; if the required record is found, return it;
Step 2.3: read the 1st layer through the n-th layer in turn and return the record as soon as it is found; if it has still not been found after the last layer, the record does not exist in the database.
6. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 4, characterized in that: in step 1.3, if the task is a flush-down operation, there are 3 cases,
Case 1: if the block to be flushed down has no child blocks, proceed directly to step 1.4 and change the meta-information of the block to realize the move down; the range of the moved-down block is retained in the current layer;
Case 2: if the block to be flushed down has child blocks and the next layer is the last layer,
For a record that falls within the range of some block in the last layer, that block is modified directly;
For a record that falls outside the ranges of all blocks in the last layer, the child block whose distance to the key of the record to be inserted is smallest is chosen for modification, and the key range of that child block is changed;
The concrete operation for modifying a child block in the last layer is: if the data stored in the block has not reached the threshold, an append operation is performed; if it has, the data to be written and the original data are merge-sorted to generate several new blocks;
Case 3: if the block to be flushed down has child blocks and the next layer is not the last layer,
For a record that falls within the range of some block in the next layer, the data is appended directly to that block;
For a record that falls outside the ranges of all blocks, the child block whose distance to the key of the record to be inserted is smallest is chosen for the append, and the key range of that child block is changed; the range of the moved-down block is retained in the current layer.
7. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 6, characterized in that: if the task is a merge operation, the data in the block is moved down into the next layer in the same manner as in the flush-down operation, and the range of the block is deleted after the flush-down, so that the number of blocks in the block's layer decreases by 1.
8. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to any one of claims 1 to 7, characterized in that: the data stored in a block comprises index data, a Bloom filter, and user records; the index data and the Bloom filter are stored at the end of the block, and the user records are stored at the front of the block.
9. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 8, characterized in that: when the free gap in the middle of the block cannot hold all the data of the current write but can hold its index data and Bloom filter, the index data and Bloom filter are stored at the rear of the block, and the user records are appended to the tail of the block.
10. The mass data storage method simultaneously adapted to the read-write characteristics of magnetic disks and solid state disks according to claim 8, characterized in that: when the free gap in the middle of the block cannot hold the index data and Bloom filter of the current write, the data to be written and the original data are merge-sorted to generate a new block; or, instead of performing the merge sort, the index data, Bloom filter, and user records are all appended to the tail of the block.
CN201611255923.7A 2016-12-30 2016-12-30 Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk Active CN106708442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611255923.7A CN106708442B (en) 2016-12-30 2016-12-30 Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk

Publications (2)

Publication Number Publication Date
CN106708442A (en) 2017-05-24
CN106708442B (en) 2020-02-14

Family

ID=58905003

Country Status (1)

Country Link
CN (1) CN106708442B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20191023

Address after: 430075 No. 201-5, floor 2, unit 2, north main building, phase II, National Geospatial Information Industry base, No. 5-2, wudayuan Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant after: Hard rock technology (Wuhan) Co., Ltd

Address before: 430070, No. two, building 2032, capital building, No. 1, National Road, East Lake New Technology Development Zone, Hubei, Wuhan, Optics Valley

Applicant before: Wuhan Safety Technology Co., Ltd.

GR01 Patent grant