CN106708442A - Massive data storage method simultaneously applicable to disk and solid state disk reading and writing features - Google Patents
- Publication number
- CN106708442A, CN201611255923.7A
- Authority
- CN
- China
- Prior art keywords
- block
- layer
- data
- child
- record
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
- G06F3/0685—Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays
Abstract
The invention provides a massive data storage method suited to the read and write characteristics of both magnetic disks and solid state disks. The records in each block are kept partially sorted rather than fully sorted, and a Bloom filter is added at the tail of each block. A Log-Structured Append-Tree (LSA-tree) is built: when the amount of data stored in a block of the tree reaches a threshold, the data in the block is appended directly to the corresponding child blocks, so the data of a child block consists of several sorted sequences instead of being fully sorted within the block by merge sort. Each block in the tree stores a Bloom filter. Without sacrificing other performance, the method greatly reduces write amplification and greatly improves random write efficiency. It also better protects and extends the service life of solid state disks. In mixed read-write scenarios, random read performance is improved as well, and the method has important market value.
Description
Technical field
The invention belongs to the field of massive data storage and in particular relates to a storage tree; the method suits the read and write characteristics of both magnetic disks and solid state disks.
Background art
Index trees commonly used on existing hard disks include the B-tree, the LSM-tree and the buffer-tree. The B-tree is the classic structure, but random writes to it unavoidably cause random disk writes, so its performance is low when storing massive data. For that reason its variants are often used for massive data, for example in BigTable, where a B-tree variant is combined with an LSM-tree. For massive data storage, an LSM-tree or a buffer-tree (also called a fractal-tree) is often used as the index tree. What the two have in common is that incoming records are written lazily and processed in batches once a certain amount has accumulated. This largely solves the random disk writes that a B-tree incurs under random write workloads, so write throughput improves considerably.
In random write scenarios, because LSM-trees and buffer-trees have more layers and their blocks are much larger than B-tree blocks, read amplification is larger and random read performance drops noticeably. To address this, projects such as BigTable and LevelDB store Bloom filter information in each node when implementing the LSM-tree, which reduces the read amplification of the LSM-tree and largely solves the problem of low random read performance.
However, whether B-tree or LSM-tree/buffer-tree, the write amplification of these trees is large. Because disk throughput is limited, large write amplification prevents any further substantial improvement in the random write performance of these index trees, and it seriously shortens the life of solid state disks. Large write amplification also consumes most of the disk throughput, so in mixed read-write scenarios random writes reduce the disk throughput available to random reads, and random read performance declines to some degree as well.
Summary of the invention
The problem to be solved by the invention is as follows. The large write amplification of traditional trees makes random writes inefficient, and on solid state disks it also seriously shortens device life. Large write amplification consumes most of the throughput of the magnetic disk or solid state disk, so in mixed read-write scenarios random writes reduce the device throughput available to random reads, and random read performance declines to some degree. A tree called the Log-Structured Append-Tree (LSA-tree) is therefore designed.
The invention provides a massive data storage method suited to the read and write characteristics of both magnetic disks and solid state disks. The full sorting of the records in a block is relaxed to partial sorting, and a Bloom filter is added at the tail of each block. The method is implemented as follows.
Memory holds a mutable memory cache, an immutable memory cache and the metadata of the tree; data on disk is organized as an LSA-tree. Let the tree have n layers. Layer i holds at least t^i and at most t^i + 1 blocks for 1 ≤ i ≤ n-1, where the parameter t is the ratio between the block-count thresholds of two adjacent layers; the last layer holds at most t^n blocks. Each block has a key range. When the amount of data stored in a block reaches the corresponding threshold, the data in the block is flushed into the blocks of the next layer whose ranges cover or overlap its own. The data to be flushed is appended directly to the corresponding blocks, so the data of a block consists of several sorted sequences instead of being fully sorted within the block by merge sort. Each block in the tree stores a Bloom filter.
The operations that background threads perform on the blocks of the LSA-tree fall into three classes: flush, split and merge. All operations are initiated only on blocks that are not in the last layer. When a block of the current layer and one or more blocks of the next layer cover or overlap each other in key range, the relationship is called a parent-child relationship: the block of the current layer is the parent block, and the one or more blocks of the next layer are its child blocks.
A flush moves the data of a block down into the next layer while keeping the block's range, so the number of blocks in the block's layer does not change.
A flush is triggered when the amount of data stored in the block reaches the storage threshold and the block has fewer than 2t child blocks.
Once triggered, a flush is carried out only after both of the following execution conditions are satisfied:
Condition 1: the number of blocks in the next layer is less than t^(i+1) + 1 if i+1 < n, or less than t^n if i+1 = n.
Condition 2: if the next layer is not the last layer, none of the child blocks has reached the storage threshold.
A split divides a block into two so that the two new blocks have equal numbers of child blocks.
A split is triggered when the amount of data stored in the block reaches the storage threshold and the block has more than 2t child blocks.
The execution condition of a split is that the number of blocks in the block's layer is less than t^i + 1.
A merge moves the data of a block down into the next layer and deletes the block's range after the flush, so the number of blocks in the block's layer decreases by 1.
A merge is triggered when the number of blocks in the block's layer equals t^i + 1.
A merge must satisfy the following two execution conditions:
Condition 1: the number of blocks in the next layer is less than t^(i+1) + 1 if i+1 < n, or less than t^n if i+1 = n.
Condition 2: if the next layer is not the last layer, none of the child blocks has reached the storage threshold.
When a user thread inserts a record, there are three cases:
1) If the mutable memory cache has not reached its capacity threshold, the record is appended to the user log and then inserted into the mutable memory cache.
2) If the mutable memory cache has reached its capacity threshold and no immutable memory cache exists, the mutable cache is first renamed to become the immutable memory cache, and a new mutable memory cache is then created to receive the record.
3) If the mutable memory cache has reached its capacity threshold and an immutable memory cache exists, the user thread waits until the background thread has written the immutable memory cache to disk and destroyed it, then proceeds as in case 2).
Based on the LSA-tree, the background thread writes the immutable memory cache to disk in the following steps.
Step 1.1: If the number of blocks in the last layer equals t^n, set n = n+1 and create a new layer, which becomes the new last layer.
Step 1.2: Choose the task to process. Each task consists of a selected block and the operation to perform on it; the immutable memory cache is treated as a special kind of block. The choice follows three priorities, from high to low:
Priority 1: Flush the immutable memory cache. If the execution conditions of the flush are not satisfied, go on to check priority 2.
Priority 2: Among the non-last layers, scanning from the top layer down, look for a layer whose block count equals t^i + 1 and whose next layer holds fewer than t^(i+1) + 1 blocks (if i+1 < n) or fewer than t^n blocks (if i+1 = n).
If such a layer exists, one block in it is chosen for a merge to reduce the layer's block count: the best block is picked from a candidate set, and the merge is performed on it.
If no such layer exists, go on to check priority 3.
Priority 3: Scanning from the top layer down, look for a block whose stored data has reached the storage threshold. If one exists, choose the first such block met during the traversal. If it has fewer than 2t child blocks, a flush is performed on it.
If it has 2t or more child blocks, a split is performed on it.
If the chosen block is to be flushed but the execution condition fails because one of its child blocks has already reached the storage threshold, a child block is selected for a flush or split instead, and the search recurses in this way until the first block that satisfies the flush or split execution conditions is found.
If no target block and operation is finally selected, execution restarts from step 1.1 when the user continues to insert data.
Step 1.3: Carry out the actual disk operation for the obtained task: a flush, a merge or a split.
Step 1.4: Request an exclusive lock. Once it is granted, write the tree-structure changes made by the disk operation to the tree metadata change log, and update the in-memory tree metadata with the same information.
Step 1.5: If the operation processed was the flush of the immutable memory cache, destroy the immutable memory cache. If a user thread is sleeping, wake it. Release all locks held by this thread, and continue from step 1.1.
When a user needs to read data, the following steps are performed:
Step 2.1: Read the mutable memory cache; if the required record is found, return it.
Step 2.2: Read the immutable memory cache; if the required record is found, return it.
Step 2.3: Read layer 1 through layer n in turn and return the record as soon as it is found; if it is not found by the last layer, the database contains no such record.
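The read order above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `lookup` and the dictionary-based block representation (`lo`, `hi`, `records`) are hypothetical and do not come from the patent, and on-disk details such as Bloom filters and sorted sequences are left out.

```python
def lookup(key, mutable, immutable, layers):
    """Return the value for key, or None if no layer holds it.

    mutable / immutable: dicts standing in for the two memory caches
    (immutable may be None). layers: list of layers, top layer first;
    each layer is a list of blocks, each block a dict with a key range
    ("lo", "hi") and its "records". All names are hypothetical.
    """
    # Step 2.1: the mutable memory cache holds the newest data.
    if key in mutable:
        return mutable[key]
    # Step 2.2: then the immutable memory cache, if one exists.
    if immutable is not None and key in immutable:
        return immutable[key]
    # Step 2.3: then layer 1 .. layer n, stopping at the first hit.
    for layer in layers:
        for block in layer:
            if block["lo"] <= key <= block["hi"] and key in block["records"]:
                return block["records"][key]
    return None
```

Because upper layers are searched first, a newer copy of a key in layer 1 shadows an older copy in a lower layer.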
In step 1.3, if the task is a flush, there are three cases.
Case 1: If the block to be flushed has no child blocks, go directly to step 1.4 and modify the block's metadata to accomplish the move. The range of the moved block is retained in the current layer.
Case 2: If the block to be flushed has child blocks and the next layer is the last layer:
Records that fall within the range of some block of the last layer modify that block directly.
Records that fall outside the ranges of all blocks of the last layer modify the child block whose range is closest to the key of the record to be inserted, and that child block's range is changed.
A child block in the last layer is modified as follows: if the data stored in the block has not reached the threshold, the data is appended; if it has, the data to be written is merge-sorted with the existing data to generate several new blocks.
Case 3: If the block to be flushed has child blocks and the next layer is not the last layer:
Records that fall within the range of some block of the next layer are appended directly to that block.
Records that fall outside the ranges of all blocks are appended to the child block whose range is closest to the key of the record to be inserted, and that child block's key range is changed. The range of the moved block is retained in the current layer.
If the task is a merge, the data in the block is moved down into the next layer in the same way as in a flush, and the block's range is deleted after the flush, so that the number of blocks in the block's layer decreases by 1.
The data stored in a block comprises index data, Bloom filters and user records. The index data and Bloom filters are stored at the end of the block, and the user records at the front of the block.
When the free hole in the middle of a block (a hole is address space that is logically free but has no physical address of the magnetic disk or solid state disk bound to it) cannot hold all the data of the current write but can hold its index data and Bloom filter, the index data and Bloom filter are stored at the back end of the block and the user records are appended to the tail of the block.
When the free hole in the middle of a block cannot hold even the index data and Bloom filter of the current write, the data to be written is merge-sorted with the existing data to generate a new block; alternatively, the index data, Bloom filter and user records are all appended to the tail of the block instead of performing the merge sort.
According to the invention, write amplification is greatly reduced and random write efficiency greatly increased without sacrificing any other performance. In mixed read-write scenarios, random read performance is also improved. The service life of solid state disks is better protected and extended, which gives the method important market value.
Brief description of the drawings
Fig. 1 is the basic architecture diagram of the storage method in the embodiment of the invention, mainly showing the structure of the LSA-tree.
Fig. 2 is a logical diagram of flushing the data of a block into the last layer when the embodiment performs a disk operation.
Fig. 3 is a logical diagram of flushing the data of a block into a non-last layer when the embodiment performs a disk operation.
Fig. 4 is a schematic diagram of the on-disk layout of a block designed in the embodiment.
Fig. 5 is a schematic diagram of the optional on-disk layout of a block designed in the embodiment.
Detailed description
The core problem addressed by the invention is that the large write amplification of traditional trees lowers write performance and mixed read-write performance, and on solid state disks it also seriously shortens device life. The invention solves these problems by relaxing the full sorting of the records in a block to partial sorting and adding a Bloom filter at the tail of each block, which keeps the scheme's impact on read performance to a minimum.
Fig. 1 is the basic architecture diagram of the storage method provided by the embodiment; it is divided into a memory part and a disk part. Memory holds one mutable memory cache, one immutable memory cache and the metadata of the tree. The tree metadata describes the metadata of every block in the tree. A block's metadata includes its range, the layer it belongs to, the size of the free hole in the middle of the block, the number of times it has been appended to, and so on. Block metadata is grouped by layer, and within each group the entries are kept in sorted order by comparing the block ranges they record. Data on disk is organized as an LSA-tree.
Blocks in memory are fully sorted and come in two kinds, the mutable memory cache and the immutable memory cache. The former is a block that has not reached the block capacity threshold, so user records can be inserted into it directly; the latter has reached the threshold and can only be read, not modified. When a user thread inserts a record, there are three cases:
1) If the mutable memory cache has not reached its capacity threshold, the record is appended to the user log and then inserted into the mutable memory cache, and the call returns.
2) If the mutable memory cache has reached its capacity threshold and no immutable memory cache exists, the mutable cache is first renamed to become the immutable memory cache, a new mutable memory cache is created to receive the record, and the call returns.
3) If the mutable memory cache has reached its capacity threshold and an immutable memory cache exists, the user thread waits until the background thread has written the immutable memory cache to disk and destroyed it (this process is described in detail below), then proceeds as in case 2).
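The three insertion cases can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `WriteBuffer`, the method `flush_done`, measuring capacity as a record count, and logging the record in every case are all assumptions made here for brevity.

```python
import threading


class WriteBuffer:
    """Hypothetical stand-in for the two memory caches and the user log."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed: threshold counted in records
        self.log = []                 # stand-in for the user log
        self.mutable = {}             # mutable memory cache
        self.immutable = None         # immutable memory cache, if any
        self.cond = threading.Condition()

    def insert(self, key, value):
        with self.cond:
            if len(self.mutable) >= self.capacity:
                # Case 3: wait until the background thread has written the
                # immutable cache to disk and destroyed it.
                while self.immutable is not None:
                    self.cond.wait()
                # Case 2: rename the full mutable cache to immutable and
                # start a fresh mutable cache.
                self.immutable = self.mutable
                self.mutable = {}
            # Case 1 (and the tail of case 2): log first, then insert.
            self.log.append((key, value))
            self.mutable[key] = value

    def flush_done(self):
        """Called by the background thread after step 1.5."""
        with self.cond:
            self.immutable = None
            self.cond.notify_all()
```

A background thread would call `flush_done()` once the immutable cache has reached disk, which wakes any user thread blocked in case 3.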
Data on disk is organized as an LSA-tree. The tree is divided into n layers, each made up of multiple blocks, and the number of blocks per layer grows exponentially. Layer i (1 ≤ i ≤ n-1) holds t^i or t^i + 1 blocks, and the last layer (layer n) holds at most t^n blocks, where t is a positive integer greater than or equal to 2, for example 10. As labeled from top to bottom in Fig. 1: layer L1 has t^1 blocks, layer L2 has t^2 blocks, ..., layer Ln-1 has t^(n-1) blocks, and layer Ln has x blocks (x is greater than 0 and at most t^n). The parameter t is the ratio between the block-count thresholds of two adjacent layers; in a specific implementation, those skilled in the art can preset the layer count n and the parameter t as needed, for example n = 7 and t = 10. Each block has a key range. When the amount of data stored in a block reaches the corresponding threshold, the data in the block is flushed into the blocks of the next layer whose key ranges cover or overlap its own. In most cases the flushed data is appended directly to the corresponding block (so the data in the resulting block consists of several sorted sequences) rather than being merge-sorted into it, which avoids excessive write amplification. When the block size threshold in the tree is in the tens of megabytes, for example 64 MB, then even if a block is split into several pieces written to different blocks of the next layer, each piece still averages several megabytes, which makes good use of the sequential write performance of magnetic disks and solid state disks.
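The per-layer block-count limits can be written down as a small helper. This is a sketch for illustration only; the function name `layer_block_bounds` is hypothetical, and the lower bound of 0 for the last layer is an assumption that a freshly created last layer starts empty.

```python
from typing import Tuple


def layer_block_bounds(i: int, n: int, t: int) -> Tuple[int, int]:
    """Return (min_blocks, max_blocks) for layer i of an n-layer tree.

    Layers 1..n-1 hold t**i or t**i + 1 blocks; the last layer holds at
    most t**n blocks. The lower bound 0 for the last layer is an
    assumption (a newly created last layer is taken to be empty).
    """
    if t < 2:
        raise ValueError("t must be an integer >= 2")
    if not 1 <= i <= n:
        raise ValueError("layer index out of range")
    if i < n:
        return t ** i, t ** i + 1
    return 0, t ** n
```

With the example values n = 7 and t = 10, layer 1 holds 10 or 11 blocks, layer 2 holds 100 or 101, and the last layer holds at most 10,000,000.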
Each block in the tree stores a Bloom filter. When reading a record, the user does not need to read every sequence in the block; it only needs to read the Bloom filter, which takes little space, to determine whether the queried record is in a given sequence of the block. As a result, read performance is barely affected compared with fully sorted blocks.
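To make the role of the Bloom filter concrete, here is a minimal sketch of a block made of several sorted sequences. It assumes one filter per appended sequence, which is one possible reading of the text; the class names, the filter size, and the SHA-256-based hashing are illustrative choices, not details from the patent.

```python
import hashlib
from bisect import bisect_left


class BloomFilter:
    """Tiny Bloom filter; sizes and hashing scheme are illustrative."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Block:
    """A block holding several sorted sequences, oldest first."""

    def __init__(self):
        self.runs = []  # each run: (sorted keys, values, BloomFilter)

    def append_run(self, records):
        items = sorted(records.items())
        bloom = BloomFilter()
        for key, _ in items:
            bloom.add(key)
        self.runs.append(([k for k, _ in items], [v for _, v in items], bloom))

    def get(self, key):
        # Newest run first, so a later append shadows an earlier one.
        for keys, values, bloom in reversed(self.runs):
            if not bloom.might_contain(key):
                continue  # definitely absent: skip reading this sequence
            idx = bisect_left(keys, key)
            if idx < len(keys) and keys[idx] == key:
                return values[idx]
        return None
```

A filter can report a false positive, in which case the binary search simply finds nothing; it never reports a false negative, so skipping a sequence is always safe.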
Background-thread operations on the blocks of the tree fall into three classes: flush, split and merge. All operations are initiated only on blocks of a non-last layer, denoted layer Li (1 ≤ i ≤ n-1). For convenience, when a block of the current layer and one or more blocks of the next layer cover or overlap each other in key range, the relationship is called a parent-child relationship: the block of the current layer is the parent block, and the one or more blocks of the next layer are its child blocks.
A flush moves the data of a block down into the next layer while keeping the block's range, so the number of blocks in the block's layer does not change. A flush is triggered when the amount of data stored in the block reaches the storage threshold and the block has fewer than 2t child blocks. It is carried out only if the following two execution conditions are met. Condition 1: the number of blocks in the next layer is less than t^(i+1) + 1 (i+1 < n, the next layer is not the last layer) or less than t^n (i+1 = n, the next layer is the last layer Ln). Condition 2: if the next layer is not the last layer (i+1 < n), none of the child blocks has reached the storage threshold. The detailed procedure of a flush is given in step 1.3.
A split divides a block into two so that the two new blocks have equal numbers of child blocks. A split is triggered when the amount of data stored in the block reaches the storage threshold and the block has more than 2t child blocks. Its execution condition is that the number of blocks in the block's layer is less than t^i + 1. The detailed procedure is given in step 1.3.
A merge is similar to a flush: the data in the block is moved down into the next layer. The only difference is that the block's range is deleted after the flush, so that the number of blocks in the block's layer decreases by 1. A merge is triggered when the number of blocks in the block's layer equals t^i + 1. It is carried out only if the following two execution conditions are met. Condition 1: the number of blocks in the next layer is less than t^(i+1) + 1 (i+1 < n, the next layer is not the last layer) or less than t^n (i+1 = n, the next layer is the last layer Ln). Condition 2: if the next layer is not the last layer, none of the child blocks has reached the storage threshold.
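The trigger and execution conditions of the three operations can be collected into predicates, as in the sketch below. The names (`BlockState`, `can_flush`, `can_split`, `can_merge`) are hypothetical, and `counts` is an assumed dictionary mapping each layer index to its current block count. The split trigger here uses "2t or more child blocks", following the task-selection rule in step 1.2, so that a full block is always either flushed or split.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BlockState:
    """Hypothetical summary of one block in layer i (1 <= i <= n-1)."""
    layer: int
    is_full: bool                 # stored data reached the storage threshold
    num_children: int
    any_child_full: bool = False  # some child reached the storage threshold


def lower_layer_ok(block: BlockState, counts: Dict[int, int], n: int, t: int) -> bool:
    """Execution conditions 1 and 2 shared by flush and merge."""
    nxt = block.layer + 1
    if nxt < n:
        # Next layer is not the last: it needs room, and no child may be full.
        return counts[nxt] < t ** nxt + 1 and not block.any_child_full
    # Next layer is the last layer: only the block-count limit applies.
    return counts[nxt] < t ** n


def can_flush(block, counts, n, t):
    triggered = block.is_full and block.num_children < 2 * t
    return triggered and lower_layer_ok(block, counts, n, t)


def can_split(block, counts, n, t):
    triggered = block.is_full and block.num_children >= 2 * t
    return triggered and counts[block.layer] < t ** block.layer + 1


def can_merge(block, counts, n, t):
    triggered = counts[block.layer] == t ** block.layer + 1
    return triggered and lower_layer_ok(block, counts, n, t)
```

For example, with t = 2 and n = 3, a full layer-1 block with three children can be flushed while layer 2 holds four blocks, but not once layer 2 has reached its limit of five.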
A layer or block that prevents an operation from meeting its execution conditions is called a blocking layer or blocking block; it blocks that operation from proceeding.
Operations on blocks that have no logical dependency on each other can run in parallel. Every block on disk, as well as the mutable and immutable memory caches in memory, has its own exclusive lock. When an operation modifies blocks, it must take the exclusive locks of the blocks to be modified one after another, so that no block is modified by several threads at once, which would corrupt data.
In the embodiment, the immutable memory cache is written into the LSA-tree by the following procedure, which is the flow executed by the background thread:
Step 1.1: If the number of blocks in the last layer equals t^n, set n = n+1 and create a new layer, which becomes the new last layer. Go to step 1.2.
Step 1.2: Choose the task to process. Each task consists of a selected block (the immutable memory cache is treated here as a special kind of block) and the operation to perform on it. The choice follows three priorities, which ensure that the block count of every layer meets the per-layer requirement stated above and that the tree can efficiently store the data flushed from the immutable memory cache. From high to low they are:
Priority 1: Flush the immutable memory cache. If the execution conditions of the flush are not satisfied, go on to check priority 2.
Priority 2: For the non-last layers Li (1 ≤ i ≤ n-1), scanning from the top layer down, look for a layer whose block count equals t^i + 1 and whose next layer holds fewer than t^(i+1) + 1 blocks (i+1 < n, the next layer is not the last layer) or fewer than t^n blocks (the next layer is the last layer Ln).
If such a layer exists, choose one block in it to merge, reducing the layer's block count. The selection policy is as follows.
Add to the candidate set every block whose number of child blocks is less than or equal to t (given the constraints on the block count of each layer, it is easy to see that the set contains at least one block). Then choose the best block from the candidate set by this policy: the larger the amount of data stored in the block divided by its number of child blocks, the better; and the fewer child blocks the new range has after the block's range is merged with that of an adjacent block, the better. Perform the merge on the best block chosen. If no such layer exists, go on to check priority 3.
Priority 3: Scanning from the top layer down, look for a block whose stored data has reached the storage threshold. If one exists, choose the first such block met during the traversal (blocks that block the immutable memory cache take precedence). If it has fewer than 2t child blocks, a flush is performed on it; if it has 2t or more child blocks, a split is performed on it. If the chosen block is to be flushed but the execution condition fails because one of its child blocks has already reached the storage threshold, a child block is selected for a flush or split instead, and the search recurses in this way until the first block that satisfies the flush or split execution conditions is found. If no target block and operation is finally selected, execution restarts from step 1.1.
After a block and the operation to perform on it have been selected, the blocks to be modified are locked one after another. If all locks are acquired, the task has been obtained successfully; if locking any block fails, all locks already taken are released and execution restarts from step 1.1.
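The all-or-nothing locking described above can be sketched with Python's standard `threading.Lock`. The helper name `try_lock_all` is hypothetical, and the use of non-blocking acquisition is an assumption about what "locking fails" means here.

```python
import threading


def try_lock_all(locks):
    """Try to take every lock in order; on any failure release them all.

    Returns True if all locks are now held by the caller, False if none
    are. Non-blocking acquisition is an assumption of this sketch.
    """
    acquired = []
    for lock in locks:
        if lock.acquire(blocking=False):
            acquired.append(lock)
        else:
            # Roll back so that other tasks are not blocked by a
            # half-acquired set; the caller restarts from step 1.1.
            for held in reversed(acquired):
                held.release()
            return False
    return True
```

A background thread would call this on the locks of the parent block and the affected child blocks, and go back to task selection whenever it returns False.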
Step 1.1 guarantees that when the steps after step 1.2 are executed, the block count of the last layer is strictly less than t^n. Therefore, if in priority 2 of step 1.2 there are one or more layers whose block count equals t^i + 1, it is always possible to choose one whose block count equals t^i + 1 and whose next layer holds fewer than t^(i+1) + 1 blocks (i+1 < n, the next layer is not the last layer) or fewer than t^n blocks (the next layer is the last layer Ln), and to perform a merge on it.
The purpose of priority 2 in step 1.2 is to ensure that priority 3 always contains a task that meets its execution conditions (it cannot happen that all tasks in priority 3 are blocked because every layer's block count equals t^i + 1), so the tree can always operate normally.
The purpose of priority 3 in step 1.2 is to process the blocks in the tree that have reached the storage threshold. Such a block may be one that blocks an upper-layer task, in which case processing it lets the tree go on storing the data flushed from the immutable memory cache, or one that blocks no upper-layer task, in which case processing it optimizes performance.
Step 1.3: Carry out the actual disk operation for the obtained task (tasks that are logically independent and not mutually exclusive can run in parallel). The operations are as follows.
1) For a flush, there are three cases:
Case 1: If the block to be flushed has no child blocks, go directly to step 1.4 and modify the block's metadata to accomplish the move. The range of the moved block is retained in the current layer.
Case 2: The block to be flushed has child blocks and the next layer is the last layer. Records that fall within the range of some block of the last layer modify that block directly. Records that fall outside the ranges of all blocks of the last layer modify the child block whose range is closest to the key of the record to be inserted; in this second situation the range of that child block must also be changed. A child block in the last layer is modified as follows: if the data stored in the block has not reached the threshold, the data is appended; if it has, the data to be written is merge-sorted with the existing data to generate several new blocks (the total number of newly generated blocks is at most one more than the original total). The range of the moved block is retained in the current layer. As shown in Fig. 2, some child blocks are appended to and others are merge-sorted.
Case 3: The block to be flushed has child blocks and the next layer is not the last layer. When flushing, records that fall within the range of some block of the next layer are appended directly to that block. Records that fall outside the ranges of all blocks are appended to the child block whose range is closest to the key of the record to be inserted, and that child block's key range is changed. The range of the moved block is retained in the current layer. As shown in Fig. 3, there are only child blocks that are appended to.
Compared with the prior art, this operation almost completely avoids merge sorting and replaces it with
append operations; it therefore greatly reduces write amplification and improves write performance.
2) For a merge operation, the procedure is similar to the flush-down operation: the data in the block is
moved down into the next layer in the same way as in the flush-down operation above. The only difference
is that the range of the block is deleted after the flush-down, so that the number of blocks in the
layer where the block resides decreases by 1.
3) For a split operation, the block is split into two new blocks such that, after the split, the two
newly generated blocks have equal numbers of child blocks.
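The append-based flush-down of case 3 can be sketched as follows. This is an illustration under simplifying assumptions: keys are integers, the parent's records are sorted once before routing, and the `Block` class, `key_distance` and `flush_down` are hypothetical names. Each child receives one new sorted sequence instead of being merge-sorted.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    lo: int                      # smallest key covered by the block
    hi: int                      # largest key covered by the block
    sequences: list = field(default_factory=list)  # appended sorted runs

def key_distance(block, key):
    """Distance from key to the block's key range (0 if inside it)."""
    if key < block.lo:
        return block.lo - key
    if key > block.hi:
        return key - block.hi
    return 0

def flush_down(parent_records, children):
    """Route each (key, value) record to the covering or nearest child,
    then append one sorted sequence per child and widen its key range."""
    batches = {}
    for key, value in sorted(parent_records, key=lambda r: r[0]):
        idx = min(range(len(children)),
                  key=lambda i: key_distance(children[i], key))
        batches.setdefault(idx, []).append((key, value))
    for idx, batch in batches.items():
        child = children[idx]
        child.lo = min(child.lo, batch[0][0])
        child.hi = max(child.hi, batch[-1][0])
        child.sequences.append(batch)  # append only, no merge sort
```

A merge operation would reuse the same routine and additionally remove the parent's range from its layer.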
Step 1.4: request an exclusive lock, to ensure that only one background thread can perform this step at
any given moment. Once the lock is obtained, write the structural changes that the actual disk operation
made to the tree into the "tree metadata change log", and use the same information to update the tree
metadata held in memory.
Step 1.5: if the operation just processed was the move-down of the "immutable memory cache", destroy the
"immutable memory cache"; if a user thread is sleeping, wake it up. Release all locks acquired by this
thread. The thread then continues execution from step 1.1.
When a user needs to read data, the following steps are executed:
Step 2.1: read the "mutable memory cache"; if the required record is found, return it;
Step 2.2: read the "immutable memory cache"; if the required record is found, return it;
Step 2.3: read the layers L1 -> Ln in order and return the record as soon as it is found. If it has not
been found by the last layer, the record does not exist in the database. Thanks to MVCC (multi-version
concurrency control), no lock needs to be held while reading from disk.
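The lookup order of steps 2.1 to 2.3 can be sketched as follows. This is a simplified illustration in which the caches are plain dictionaries and each block is a dictionary with hypothetical `lo`, `hi` and `records` fields; MVCC and on-disk access are not modelled.

```python
def get(key, mutable_cache, immutable_cache, layers):
    """Look up key in the mutable cache, then the immutable cache,
    then layers L1..Ln from top to bottom. Returns None if absent."""
    for cache in (mutable_cache, immutable_cache):
        if cache is not None and key in cache:
            return cache[key]
    for layer in layers:                      # L1 first, Ln last
        for block in layer:
            in_range = block["lo"] <= key <= block["hi"]
            if in_range and key in block["records"]:
                return block["records"][key]
    return None
```

Because newer data always sits in a higher position (cache before L1, L1 before L2), the first match is the most recent version of the record.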
Fig. 4 shows how this method organizes a block on disk. As shown in Fig. 4 (left), the data stored in a
block consists of index data, Bloom filters and user records; the former two are stored at the end of
the block, and the user records are stored at the head of the block. When data is stored, three cases
may arise:
1) The free gap in the middle of the block can hold all the data of the current write. The layout of
Fig. 4 (left) is used: the user records of the 1st through n-th writes are stored in order at the front
of the block, and the index data and Bloom filters of the 1st through n-th writes are stored in order at
the rear of the block;
2) The free gap in the middle of the block cannot hold all the data of the current write but can hold
its index data and Bloom filter. The layout of Fig. 4 (right) is used: the index data and Bloom filter
of the (n+1)-th write are stored in order at the rear of the block, and the user records of the (n+1)-th
write are appended to the tail of the block;
3) The free gap in the middle cannot hold even the index data and Bloom filter. The data to be written
is then merge-sorted with the original data to generate a new block. In an implementation, treating a
block as having reached the storage threshold once the data to be stored in it reaches 95% almost
completely avoids this case. Further, the invention proposes that, more preferably, merge sorting can
be replaced: in the manner of Fig. 5 the above merging need not be performed, and the index data, Bloom
filter and user records are all appended to the tail of the block.
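The three placement cases and the 95% rule can be sketched as a small decision function. The names, the byte-based interface and the returned labels are hypothetical; the 64 MB default capacity follows the file threshold mentioned in the text.

```python
def choose_placement(free_gap, record_bytes, index_bytes):
    """Pick a layout for one write into a block.

    free_gap     -- bytes still free in the middle of the block
    record_bytes -- size of the user records of this write
    index_bytes  -- size of the index data plus Bloom filter of this write
    """
    if record_bytes + index_bytes <= free_gap:
        return "all_in_gap"                      # case 1, Fig. 4 (left)
    if index_bytes <= free_gap:
        return "index_in_gap_records_at_tail"    # case 2, Fig. 4 (right)
    return "merge_sort_or_append_all_at_tail"    # case 3, or Fig. 5

def reached_threshold(used_bytes, capacity=64 * 2**20, ratio=0.95):
    """Treat a block as full once it holds 95% of its capacity, which
    makes case 3 above very unlikely in practice."""
    return used_bytes >= ratio * capacity
```

With this rule a block is flushed down before its free gap becomes too small to hold the index data of another write.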
In an implementation, a block can be realized as a file. The threshold of each file is 64 MB, but a file
may exceed 64 MB, for example when it is stored in the manner of Fig. 4 (right).
Note that when the records in a block are read, the sequences appended later are read first; if the
required record is found, the remaining unread sequences need not be read and the result can be
returned directly.
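Reading a partially sorted block newest-sequence-first can be sketched as follows. This is an illustration only: it assumes one Bloom filter per appended sequence, uses a toy Bloom filter built on `hashlib.sha256`, and all names (`Bloom`, `make_sequence`, `read_block`) are hypothetical.

```python
import bisect
import hashlib

class Bloom:
    """Tiny Bloom filter: no false negatives, occasional false positives."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key):
        return all((self.field >> p) & 1 for p in self._positions(key))

def make_sequence(records):
    """Build one sorted sequence with its own Bloom filter."""
    records = sorted(records, key=lambda r: r[0])
    bloom = Bloom()
    for key, _ in records:
        bloom.add(key)
    return {"keys": [k for k, _ in records],
            "values": [v for _, v in records],
            "bloom": bloom}

def read_block(sequences, key):
    """Search sequences from the most recently appended to the oldest."""
    for seq in reversed(sequences):
        if not seq["bloom"].might_contain(key):
            continue                      # skip this sequence entirely
        i = bisect.bisect_left(seq["keys"], key)
        if i < len(seq["keys"]) and seq["keys"][i] == key:
            return seq["values"][i]       # newest version found, stop
    return None
```

The Bloom filter lets most sequences be skipped without a binary search, which is how the partially sorted layout keeps its read cost low.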
The method described above is only one example of the idea of "changing the fully sorted organization of
the records in a block into a partially sorted one (made up of multiple sorted sequences) to greatly
reduce write amplification, and adding a Bloom filter to the index information of each block to
minimize the impact of this scheme on read performance". Any modification, improvement or the like made
within the spirit and principle of the invention shall fall within the scope of protection of the
invention; for example, applying the same logic to the blocks of a buffer-tree is also within the scope
of protection of the invention.
Claims (10)
1. A mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks, characterized in that: the full sorting of the records in a block is changed to
partial sorting, and a Bloom filter is added at the tail of each block, implemented as follows:
the memory holds a mutable memory cache, an immutable memory cache and the metadata of the tree; a
structure called the Log-Structured Append-Tree is established, and the data on disk is organized with
the Log-Structured Append-Tree structure; if the tree is divided into n layers, the i-th layer contains
at least t^i blocks and at most t^i + 1 blocks, 1 ≤ i ≤ n-1, where the parameter t is the ratio between
the block count thresholds of two adjacent layers, and the last layer contains at most t^n blocks; each
block has a key range; when the amount of data stored in a block reaches the corresponding threshold,
the data in the block is flushed into the blocks of the next layer whose ranges overlap with it; the
flushed data is appended directly to the corresponding blocks, so that the data of a block consists of
several sorted sequences rather than being kept fully sorted within the block by merge sorting; every
block in the tree stores a Bloom filter.
2. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 1, characterized in that: the operations performed by background
threads on the blocks of the Log-Structured Append-Tree fall into three classes, namely flush-down,
split and merge; all operations are initiated only on blocks of non-final layers; the overlap in key
range between a block of the current layer and one or more blocks of the layer below is called a
parent-child relationship, the block of the current layer being called the parent block and the one or
more blocks of the next layer being called child blocks;
the flush-down operation moves the data in a block down into the next layer while the range of the
block is retained, so that the number of blocks in the layer where the block resides does not change;
the trigger condition of the flush-down operation is that the amount of data stored in the block
reaches the storage threshold and the number of child blocks of the block is less than 2t;
after triggering, the flush-down is performed only when both of the following execution conditions are
satisfied,
condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 and i+1 < n, or less than
t^n and i = n-1;
condition 2: if the lower layer is not the last layer, none of the child blocks has reached the storage
threshold;
the split operation splits a block into two such that the two newly generated blocks have equal
numbers of child blocks;
the trigger condition of the split operation is that the amount of data stored in the block reaches the
storage threshold and the number of child blocks of the block is greater than 2t;
the execution condition that the split operation must satisfy is that the number of blocks in the layer
where the block resides is less than t^i + 1;
the merge operation moves the data in a block down into the next layer and deletes the range of the
block after the flush-down, so that the number of blocks in the layer where the block resides decreases
by 1;
the trigger condition of the merge operation is that the number of blocks in the layer where the block
resides equals t^i + 1;
the merge operation must satisfy the following two execution conditions,
condition 1: the number of blocks in the lower layer is less than t^(i+1) + 1 and i+1 < n, or less than
t^n and i+1 = n;
condition 2: if the lower layer is not the last layer, none of the child blocks has reached the storage
threshold.
3. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 2, characterized in that: when a user thread inserts a record,
there are the following three cases,
1) if the mutable memory cache has not reached its capacity threshold, the record is appended to the
user log and then inserted into the mutable memory cache;
2) if the mutable memory cache has reached its capacity threshold and no immutable memory cache exists,
the mutable memory cache is first renamed to the immutable memory cache, and a new mutable memory cache
is then created into which the record is inserted;
3) if the mutable memory cache has reached its capacity threshold and an immutable memory cache exists,
the user thread waits until a background thread has written the immutable memory cache to disk and
destroyed it, and then proceeds as in 2).
4. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 3, characterized in that: based on the Log-Structured Append-Tree,
a background thread writes the immutable memory cache to disk through the following steps,
step 1.1: if the number of blocks in the last layer equals t^n, let n = n+1 and create a new layer, the
new layer becoming the new last layer;
step 1.2: choose a task to be processed, each task comprising the selected block to be processed and
the operation to be performed on that block, the immutable memory cache also being regarded as a
special kind of block; the selection has three priorities, from high to low as follows,
priority 1: the flush-down of the immutable memory cache; if the execution condition of the flush-down
is not satisfied, continue with the priority 2 condition;
priority 2: for the non-final layers, starting from the upper layers, determine whether there is a layer
whose block count equals t^i + 1 and whose lower layer has a block count less than t^(i+1) + 1 with
i+1 < n, or less than t^n with i = n-1;
if such a layer exists, a block in that layer is selected for a merge operation in order to reduce the
number of blocks in the layer; the optimal block is chosen from the candidate set, and the merge
operation is performed on the chosen block;
if no such layer exists, continue with the priority 3 condition;
priority 3: from the upper layers to the lower layers, determine in turn whether there is a block whose
stored data has reached the storage threshold; if so, choose the first such block encountered during
the traversal; if the number of child blocks of the block is less than 2t, a flush-down operation is
to be performed on the block;
if the number of child blocks of the block is greater than or equal to 2t, a split operation is to be
performed on the block;
if the chosen block is to undergo a flush-down but the execution condition of that operation is not
satisfied because the block has a child block that has already reached the storage threshold, that
child block is selected instead for a flush-down or split operation, and so on recursively, until the
first block that satisfies the execution condition of a flush-down or split is selected;
if no target block and operation is selected in the end, execution restarts from step 1.1 when the
user continues to write data;
step 1.3: perform the actual disk operation for the obtained task, namely a flush-down operation, a
merge operation or a split operation;
step 1.4: request an exclusive lock; once the lock is obtained, write the structural changes that the
actual disk operation made to the tree into the tree metadata change log, and use the same information
to update the tree metadata held in memory;
step 1.5: if the operation processed was the move-down of the immutable memory cache, destroy the
immutable memory cache; if a user thread is sleeping, wake it up; release all locks acquired by this
thread, and continue execution of this thread from step 1.1.
5. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 4, characterized in that:
when a user needs to read data, the following steps are executed:
step 2.1: read the mutable memory cache; if the required record is found, return it;
step 2.2: read the immutable memory cache; if the required record is found, return it;
step 2.3: read the 1st layer through the n-th layer in order and return the record as soon as it is
found; if it has not been found by the last layer, the record does not exist in the database.
6. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 4, characterized in that: in step 1.3, if the task is a flush-down
operation, there are three cases,
case 1: if the block to be flushed down has no child block, proceed directly to step 1.4 and modify the
metadata of the block to realize the move down; the range of the moved block is retained in the current
layer;
case 2: if the block to be flushed down has child blocks and the next layer is the last layer,
for records that fall within the range of some block in the last layer, that block is modified
directly;
for records that fall outside the ranges of all blocks in the last layer, the child block whose
distance to the key of the record to be inserted is smallest is selected and modified, and the key
range of that child block is changed;
the specific operation for modifying a last-layer child block is: if the data stored in the block has
not reached the threshold, an append is performed; if it has, the data to be written and the original
data are merge-sorted to generate several new blocks;
case 3: if the block to be flushed down has child blocks and the next layer is not the last layer,
for records that fall within the range of some block in the next layer, the data is appended directly
to that block;
for records that fall outside the ranges of all blocks, the child block whose distance to the key of
the record to be inserted is smallest is selected and appended to, and the key range of that child
block is changed; the range of the moved block is retained in the current layer.
7. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 6, characterized in that: if the task is a merge operation, the
data in the block is moved down into the next layer in the same way as in the flush-down operation, and
the range of the block is deleted after the flush-down, so that the number of blocks in the layer where
the block resides decreases by 1.
8. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 1 or 2 or 3 or 4 or 5 or 6 or 7, characterized in that: the data
stored in a block consists of index data, a Bloom filter and user records; the index data and the Bloom
filter are stored at the end of the block, and the user records are stored at the front of the block.
9. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 8, characterized in that: when the free gap in the middle of the
block cannot hold all the data of the current write but can hold the index data and Bloom filter, the
index data and Bloom filter are stored at the rear of the block and the user records are appended to
the tail of the block.
10. The mass data storage method adapted to the read/write characteristics of both magnetic disks and
solid-state disks according to claim 8, characterized in that: when the free gap in the middle of the
block cannot hold the index data and Bloom filter of the current write, the data to be written is
merge-sorted with the original data to generate a new block; or, instead of merge sorting, the index
data, Bloom filter and user records are all appended to the tail of the block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611255923.7A CN106708442B (en) | 2016-12-30 | 2016-12-30 | Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106708442A (en) | 2017-05-24 |
CN106708442B (en) | 2020-02-14 |
Family
ID=58905003
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611255923.7A Active CN106708442B (en) | 2016-12-30 | 2016-12-30 | Mass data storage method simultaneously adapting to read-write characteristics of magnetic disk and solid state disk |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106708442B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20191023. Address after: 430075 No. 201-5, floor 2, unit 2, north main building, phase II, National Geospatial Information Industry base, No. 5-2, wudayuan Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province. Applicant after: Hard rock technology (Wuhan) Co., Ltd. Address before: 430070, No. two, building 2032, capital building, No. 1, National Road, East Lake New Technology Development Zone, Hubei, Wuhan, Optics Valley. Applicant before: Wuhan Safety Technology Co., Ltd. |
GR01 | Patent grant | ||