Summary of the invention
The object of the present invention is to provide a Delta compression storage component based on block-level data deduplication, which eliminates byte-level duplicate data and improves the data deduplication ratio and storage space utilization.
To achieve the above object, the present invention adopts the following technical solution: a Delta compression storage component based on block-level data deduplication, comprising a container access module. The container access module runs a container storage algorithm and a container recovery algorithm using three data structures: a similarity index, a similar buffer and a container buffer.
The container storage algorithm receives write-container commands sent by the upper-layer block-level data deduplication storage system, performs Delta compression on the container, and writes the Delta-compressed container into the container storage pool on the disk device. The container recovery algorithm receives read-container commands sent by the upper-layer block-level data deduplication storage system, reads the specified container from the on-disk container storage pool by means of the container index, restores the container that was read, and returns it to the upper-layer block-level data deduplication storage system. The Delta compression storage component also receives read-container-metadata commands sent by the upper-layer block-level data deduplication storage system, reads the metadata of the specified container from the container storage pool on the disk device, and sends that metadata to the upper-layer block-level data deduplication storage system.
The container storage algorithm comprises the following steps in order:
(201) Initialization:
First, the parameters S, R, Sr and L are read from a configuration file. The configuration file resides on the disk device and records the configuration information of the system.
The parameter S is a preset positive integer giving the maximum number of similar containers read from the on-disk container storage pool into memory when Delta compression is performed; similar containers are containers with similar content.
The parameter R is a preset positive number giving the minimum permitted Delta compression ratio; the Delta compression ratio is the ratio of the data block size to the Delta block size after a data block has been Delta-compressed against a reference block to produce a Delta block.
The parameter Sr is a preset positive integer; 1/Sr is the hook signature sampling rate.
The parameter L is a preset positive integer giving the maximum Delta chain length.
Then, it is determined whether the system is in its initial configuration stage. If so, an empty similarity index is created in memory; if not, the backed-up similarity index is read from the configuration file into memory.
Finally, an empty similar buffer is created in memory; an empty container buffer is created in memory to hold containers read into memory from the on-disk container storage pool; and an empty stack, denoted Stack, is created in memory.
(202) Receive container: a write-container command sent by the upper-layer block-level data deduplication storage system is received, and the container to be written is extracted from the command and denoted the upper-layer container. An empty formatted container, denoted the work container, is created in memory and written into the container buffer. The similar buffer is emptied for use. The upper-layer container is the container used by the upper-layer block-level data deduplication storage system; the formatted container is the container used by the container access module of the Delta compression storage component.
(203) Copy fingerprints: the container identifier is read from the metadata area of the upper-layer container and written into the container identifier field of the work container's header. The data block fingerprints are read from the metadata area of the upper-layer container and written, in their original order, into the fingerprint area of the work container.
(204) Compute similar signatures: the similar signature of each data block in the data area of the upper-layer container is computed in turn. A similar signature block is created for each similar signature, and the signature is written into the similar signature field of that block. The similar signature blocks are written into the similar signature area of the work container in the order of their corresponding data blocks.
(205) Extract hook signatures: all similar signatures in the work container are sampled at a rate of 1/Sr; the sampled similar signatures serve as hook signatures and are written in order into the hook signature area of the work container. The smallest of all similar signatures in the work container is taken as the similar signature of the container and written into the container signature field of the work container's header.
(206) Update similarity index:
First, the container identifier of the work container is assigned to the variable cid.
Next, each hook signature in the hook signature area of the work container is processed as follows: the hook signature is assigned to the variable hook, a mapping <hook, cid> is generated, and <hook, cid> is inserted into the similarity index.
(207) Find similar containers: the similarity index is queried for containers that share hook signatures with the work container. If none is found, go to step (228). Otherwise, at most S containers are chosen from those found, in descending order of the number of hook signatures they share with the work container, and these are taken as the similar containers of the work container. The similar containers are looked up in the container buffer, any that are not in the container buffer are read into it from the container storage pool, and the algorithm proceeds to step (208).
(208) Write similar buffer:
The similar signature area of each similar container of the work container in the container buffer is scanned in turn, and all similar signature blocks in it are read. Each similar signature block read is processed as follows: a similar buffer index node, denoted Node, is created in memory; the type field value and offset field value of the similar signature block are copied to the type field and offset field of Node, respectively; the container identifier of the container holding the similar signature block is written into the container identifier field of Node; the similar signature field of the block is read and its value denoted sign; and <sign, Node> is inserted into the similar buffer.
(209) Prepare to process data blocks: a read pointer P1 is set to point to the first similar signature block in the similar signature area of the work container, and a read pointer P2 is set to point to the first data block in the data area of the upper-layer container.
(210) Read data block: the data block pointed to by P2 is read from the data area of the upper-layer container and denoted Dr; the similar signature block pointed to by P1 is read from the similar signature area of the work container and denoted Block; and the value of the similar signature field of Block is read and denoted sign1.
(211) Search similar buffer: the similar buffer is searched for the similar signature node whose similar signature field equals sign1. If none is found, go to step (224). Otherwise, a read pointer P3 is set to point to the first index node of the index node linked list of the similar signature node just found, and the algorithm proceeds to step (212).
(212) Check for data block: if the type field of the index node pointed to by P3 holds the Delta block mark, go to step (214); otherwise it holds the data block mark, and the algorithm proceeds to step (213).
(213) Short-chain Delta operation: the values of the container identifier field and the offset field of the index node pointed to by P3 are assigned to the variables cid0 and offset0, respectively, and a data block, denoted D0, is read from address <cid0, offset0> in the container buffer. A Delta operation is performed on Dr with D0 as the reference block, producing the Delta block Δ0,r. If the Delta compression ratio is greater than or equal to R, compression succeeds; go to step (225). Otherwise compression fails; proceed to the next step.
(214) Skip Delta block: P3 is advanced one step to the next index node. If P3 is not null, go to step (212). Otherwise the tail of the index node linked list has been reached; P3 is reset to the first index node of that linked list, and the algorithm proceeds to the next step.
(215) Check for Delta block: if the type field of the index node pointed to by P3 holds the data block mark, go to step (223); otherwise it holds the Delta block mark; proceed to the next step.
(216) Read Delta block: the values of the container identifier field and the offset field of the index node pointed to by P3 give the address of a Delta block in the container buffer, and the Delta block is read from that address.
(217) Push onto stack: the Delta block is pushed onto Stack. The reference block address is read from the Delta block header of the Delta block and stored in the variable <cid1, offset1>, where cid1 is a container identifier and offset1 is the position of the reference block in the data area of container cid1.
(218) Read reference block: if container cid1 is in the container buffer, the reference block <cid1, offset1> is read from the container buffer; otherwise, container cid1 is first read from the container storage pool into the container buffer, and the reference block <cid1, offset1> is then read.
(219) Check reference block: if the reference block <cid1, offset1> is a Delta block, go to step (217). Otherwise it is a data block: its content is stored in the variable D0, <cid1, offset1> is assigned to the variable <cid0, offset0>, the variable length is set to 1, and the algorithm proceeds to the next step.
(220) Long-chain Delta operation: a Delta operation is performed on Dr with D0 as the reference block, producing the Delta block Δ0,r. If the Delta compression ratio is greater than or equal to R and length is less than or equal to L, compression succeeds; go to step (225). Otherwise compression fails; proceed to the next step.
(221) Check stack: if Stack is empty, go to step (223); otherwise proceed to the next step.
(222) Pop stack: a Delta block, denoted Δ, is popped from Stack, and the address of Δ is stored in the variable <cid0, offset0>. A Delta inverse operation is performed on D0 and Δ, the result is stored in the variable D0, the value of the variable length is increased by 1, and the algorithm goes to step (220).
(223) Skip data block: P3 is advanced one step to the next index node. If P3 is not null, go to step (215). Otherwise the tail of the index node linked list has been reached; proceed to the next step.
(224) Store data block: a data block header is prepended to the data block Dr, and the data block mark and the size of Dr are written into the header. If the data area of the work container is not empty, Dr with its header is appended after the existing data in the data area; otherwise it is written at the start of the data area. The data block mark is written into the type field of Block, and the position of Dr in the data area of the work container is written into the offset field of Block. Go to step (226).
(225) Store Delta block: a Delta block header is prepended to the Delta block Δ0,r, and the Delta block mark, the size of Δ0,r and the reference block address <cid0, offset0> of Δ0,r are written into the header. If the data area of the work container is not empty, Δ0,r with its header is appended after the existing data in the data area; otherwise it is written at the start of the data area. The Delta block mark is written into the type field of Block, and the position of Δ0,r in the data area of the work container is written into the offset field of Block. Stack is emptied.
(226) Update similar buffer: a similar buffer index node, denoted Node1, is created in memory. The type field value and offset field value of Block are copied to the type field and offset field of Node1, respectively, and the container identifier of the work container is written into the container identifier field of Node1. <sign1, Node1> is inserted into the similar buffer.
(227) Check whether all data blocks are processed: P1 is advanced one step to the next similar signature block in the similar signature area of the work container, and P2 is advanced one step to the next data block in the data area of the upper-layer container. If P2 is null, all data blocks in the upper-layer container have been processed; go to step (229). Otherwise go to step (210).
(228) Store original container:
First, starting from the first data block, each data block in the data area of the upper-layer container is processed in turn: a data block header is prepended to the data block, and the data block mark and the size of the data block are written into the header. If the data area of the work container is not empty, the data block with its header is appended after the existing data in the data area; otherwise it is written at the start of the data area. The similar signature block corresponding to the data block in the similar signature area of the work container is then updated: the data block mark is written into its type field, and the position of the data block in the data area of the work container is written into its offset field.
Next, the upper-layer container is discarded. The size of the work container is computed and written into the work container's header. If the container storage pool is not empty, the work container is appended after the existing data in the pool; otherwise it is written at the start of the pool.
Finally, the container identifier of the work container just written and its position in the container storage pool are written into the container index. Go to step (230).
(229) Store new container:
First, the upper-layer container is discarded. The size of the work container is computed and written into the work container's header. If the container storage pool is not empty, the work container is appended after the existing data in the pool; otherwise it is written at the start of the pool.
Next, the container identifier of the work container just written and its position in the container storage pool are written into the container index.
(230) Check for end of run: it is determined whether an end-of-run instruction has been received. If not, go to step (202); if so, proceed to the next step.
(231) Terminate:
First, stop receiving containers sent by the upper-layer block-level data deduplication storage system.
Then, write the in-memory similarity index to the configuration file.
Finally, destroy the similarity index, the container buffer, the similar buffer and the stack Stack in memory, and exit.
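Steps (205) to (207) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, a Python dictionary stands in for the similarity index, and the sampling rule (a signature is a hook when it is divisible by Sr) is an assumption, since the text fixes only the sampling rate 1/Sr. The sketch also excludes the work container itself from the query result, which the text leaves implicit.

```python
from collections import Counter, defaultdict

def extract_hooks(signatures, sr):
    """Step (205): sample similar signatures at a rate of about 1/sr.

    Assumption: a signature is a hook when it is divisible by sr.
    """
    return [s for s in signatures if s % sr == 0]

def container_signature(signatures):
    """Step (205): the container signature is the smallest similar signature."""
    return min(signatures)

def update_index(index, hooks, cid):
    """Step (206): insert a <hook, cid> mapping for every hook signature."""
    for hook in hooks:
        if cid not in index[hook]:
            index[hook].append(cid)

def find_similar(index, hooks, cid, s):
    """Step (207): at most s containers sharing the most hooks with cid."""
    shared = Counter()
    for hook in set(hooks):
        for other in index.get(hook, []):
            if other != cid:          # skip the work container itself
                shared[other] += 1
    return [c for c, _ in shared.most_common(s)]

# Usage with made-up signatures and Sr = 64.
index = defaultdict(list)
update_index(index, [64, 128, 192], 1)
update_index(index, [128, 192, 256], 2)
update_index(index, [192], 3)
work_hooks = extract_hooks([64, 65, 128, 192, 300], 64)
update_index(index, work_hooks, 9)
similar = find_similar(index, work_hooks, 9, 2)
```

With these made-up values, container 1 shares three hooks with the work container and container 2 shares two, so they are the two containers selected when S is 2.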
The container recovery algorithm comprises the following steps in order:
(301) Initialization: an empty container buffer is created in memory to hold containers read into memory from the on-disk container storage pool; an empty stack, denoted Stack, is created in memory.
(302) Receive read command: a read-container command sent by the upper-layer block-level data deduplication storage system is received, and the container identifier is extracted from the command and denoted cid. An empty container in the upper-layer format, denoted the upper-layer container, is created in memory.
(303) Read container: the container whose container identifier is cid is read from the on-disk container storage pool and denoted the work container; the work container is written into the container buffer.
(304) Restore metadata: the metadata required by the type and format of the upper-layer container's metadata is read from the metadata area of the work container and written into the metadata area of the upper-layer container.
(305) Prepare to process data area: a read pointer P1 is set to point to the first object in the data area of the work container.
(306) Check object: if the object pointed to by P1 is a data block, the data block is denoted Dr; go to step (312). Otherwise it is a Delta block; proceed to the next step.
(307) Push onto stack: the Delta block is pushed onto Stack. The reference block address is read from the Delta block header of the Delta block and denoted <cid1, offset1>, where cid1 is a container identifier and offset1 is the position of the reference block in the data area of container cid1.
(308) Read reference block: if container cid1 is in the container buffer, the reference block <cid1, offset1> is read from the container buffer; otherwise, container cid1 is first read from the container storage pool into the container buffer, and the reference block <cid1, offset1> is then read.
(309) Check reference block: if the reference block <cid1, offset1> is a Delta block, go to step (307). Otherwise it is a data block; the data block is stored in the variable D, and the algorithm proceeds to the next step.
(310) Pop stack: a Delta block, denoted Δ, is popped from Stack; a Delta inverse operation is performed on D and Δ, and the result is stored in the variable D.
(311) Check stack: if Stack is not empty, go to step (310). Otherwise the content of the variable D is denoted data block Dr; proceed to the next step.
(312) Copy data block: if the data area of the upper-layer container is not empty, the data block Dr is appended after the existing data in that data area; otherwise Dr is written at the start of the data area of the upper-layer container.
(313) Check data area: the read pointer P1 is advanced one step to the next object in the data area of the work container. If P1 is not null, go to step (306). Otherwise the data area has been fully processed; proceed to the next step.
(314) Check for end of run: the completed upper-layer container is sent to the upper-layer block-level data deduplication storage system. If no end-of-run command has been received, go to step (302); otherwise proceed to the next step.
(315) Terminate: destroy the container buffer and the stack Stack, and exit.
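The core of the recovery algorithm, steps (306) to (311), is the resolution of a Delta chain with a stack. The following sketch illustrates that idea under several stated assumptions: the pool is a plain dictionary that maps a container identifier to a list of objects, the "offset" is simply a list index, the container buffer is omitted, and the Delta format (a list of copy and add operations) is a toy stand-in because the text does not specify the Delta encoding. All names are hypothetical.

```python
def apply_delta(reference, ops):
    """Delta inverse operation: rebuild a block from its reference block.

    Assumed toy format: ("copy", start, end) copies reference[start:end],
    ("add", data) appends literal bytes.
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

def restore_block(pool, cid, offset):
    """Steps (306)-(311): turn one stored object into a plain data block."""
    stack = []
    obj = pool[cid][offset]
    while obj["type"] == "delta":           # steps (307)-(309)
        stack.append(obj)
        ref_cid, ref_offset = obj["ref"]
        obj = pool[ref_cid][ref_offset]
    data = obj["data"]                      # the chain ends at a data block
    while stack:                            # steps (310)-(311)
        data = apply_delta(data, stack.pop()["ops"])
    return data

def restore_container(pool, cid):
    """Steps (305)-(313): restore every object of a container in order."""
    return [restore_block(pool, cid, i) for i in range(len(pool[cid]))]

# Usage: container 3 holds a Delta block whose chain runs 3 -> 2 -> 1.
pool = {
    1: [{"type": "data", "data": b"hello world"}],
    2: [{"type": "delta", "ref": (1, 0),
         "ops": [("copy", 0, 6), ("add", b"there")]}],
    3: [{"type": "data", "data": b"abc"},
        {"type": "delta", "ref": (2, 0),
         "ops": [("copy", 0, 11), ("add", b"!")]}],
}
restored = restore_container(pool, 3)
```

Because Delta blocks are pushed while walking toward the base data block, popping them applies the deltas in the reverse order, from the base outward, which is what rebuilds the requested block.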
The similarity index is an in-memory hash table. The hash table comprises an array of buckets; each bucket in the array has a number, and a hash function establishes the mapping between hook signatures and bucket numbers. A hook signature mapped to a bucket is stored in a hook signature node. Each hook signature node stores one unique hook signature and is associated with a container identifier queue, which stores the identifiers of the containers that contain that hook signature. A hook signature node consists of a hook signature field, an overflow chain pointer field and a container identifier queue pointer field. The hook signature field stores one unique hook signature. The overflow chain pointer field is used to handle hash collisions and stores the address of another hook signature node mapped to the same bucket. The container identifier queue pointer field stores the head address of the container identifier queue associated with the hook signature node.
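The node layout described above can be sketched as a chained hash table. This is only an illustration: the class and method names are hypothetical, Python object references stand in for the pointer fields, a deque stands in for the container identifier queue, and Python's built-in hash is an assumed hash function.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HookNode:
    hook: int                                          # hook signature field
    overflow: Optional["HookNode"] = None              # overflow chain pointer field
    containers: deque = field(default_factory=deque)   # container identifier queue

class SimilarityIndex:
    def __init__(self, n_buckets=1024):
        self.buckets = [None] * n_buckets              # bucket array

    def _bucket(self, hook):
        return hash(hook) % len(self.buckets)          # assumed hash function

    def _find(self, hook):
        node = self.buckets[self._bucket(hook)]
        while node is not None and node.hook != hook:
            node = node.overflow                       # follow the overflow chain
        return node

    def insert(self, hook, cid):
        node = self._find(hook)
        if node is None:                               # new hook: link at chain head
            b = self._bucket(hook)
            node = HookNode(hook, overflow=self.buckets[b])
            self.buckets[b] = node
        if cid not in node.containers:
            node.containers.append(cid)

    def lookup(self, hook):
        node = self._find(hook)
        return list(node.containers) if node else []

# Usage: with 4 buckets, hooks 1 and 5 collide and share one overflow chain.
idx = SimilarityIndex(n_buckets=4)
idx.insert(1, 10)
idx.insert(5, 11)
idx.insert(1, 12)
```

Each hook signature appears in exactly one node, so a lookup costs one bucket access plus a walk along that bucket's overflow chain.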
The similar buffer is an in-memory hash table. The hash table comprises an array of buckets; each bucket in the array has a number, and a hash function establishes the mapping between similar signatures and bucket numbers. A similar signature mapped to a bucket is stored in a similar signature node. Each similar signature node stores one unique similar signature and is associated with an index node linked list; each index node in the list stores the information of one data block or Delta block that has that similar signature. The similar signature of a Delta block is the similar signature of the data block that the Delta block corresponds to.
A similar signature node consists of a similar signature field, an overflow chain pointer field and an index node linked list pointer field. The similar signature field stores one unique similar signature. The overflow chain pointer field is used to handle hash collisions and stores the address of another similar signature node mapped to the same bucket. The index node linked list pointer field stores the head address of the index node linked list associated with the similar signature node.
An index node consists of a type field, a container identifier field, an offset field and a linked list pointer field. The type field stores the data block mark or the Delta block mark. The container identifier field and the offset field give the address of the data block or Delta block. The linked list pointer field stores the address of the next index node in the index node linked list.
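A compact way to picture the similar buffer is a map from a similar signature to a list of index nodes. The sketch below makes that simplification explicit: a Python dictionary replaces the bucket array and overflow chains, a list replaces the index node linked list, and all names are hypothetical. The `candidates` method reflects the order in which the storage algorithm visits a list, data block nodes in the first pass (steps 212 to 214) and Delta block nodes in the second pass (steps 215 to 223).

```python
from collections import defaultdict
from dataclasses import dataclass

DATA_BLOCK, DELTA_BLOCK = "data", "delta"   # stand-ins for the two marks

@dataclass(frozen=True)
class IndexNode:
    kind: str      # type field: DATA_BLOCK or DELTA_BLOCK
    cid: int       # container identifier field
    offset: int    # offset field

class SimilarBuffer:
    def __init__(self):
        self._nodes = defaultdict(list)      # similar signature -> index nodes

    def insert(self, sign, node):
        """Insert a <sign, Node> pair, as in steps (208) and (226)."""
        self._nodes[sign].append(node)

    def candidates(self, sign):
        """Yield data block nodes first, then Delta block nodes."""
        nodes = self._nodes.get(sign, [])
        yield from (n for n in nodes if n.kind == DATA_BLOCK)
        yield from (n for n in nodes if n.kind == DELTA_BLOCK)

    def clear(self):
        """Empty the buffer before a new work container, as in step (202)."""
        self._nodes.clear()

# Usage: two stored blocks share similar signature 7.
buf = SimilarBuffer()
buf.insert(7, IndexNode(DELTA_BLOCK, 1, 0))
buf.insert(7, IndexNode(DATA_BLOCK, 2, 40))
order = list(buf.candidates(7))
```

Trying data block nodes first means a short Delta chain is preferred whenever a plain data block already gives an acceptable compression ratio.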
The container buffer is a conventional in-memory linked list; containers read into the container buffer are linked into this list. When the container buffer is full, containers are deleted from the list using the well-known least-recently-used (LRU) replacement algorithm. The work container and its similar containers stay in the container buffer at all times; only once a new work container and its similar containers have been read into the container buffer may the old work container and its similar containers be selected by the replacement algorithm and deleted from the container buffer.
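The replacement policy described above can be sketched with an ordered dictionary. This is a hedged illustration rather than the component's actual code: `ContainerBuffer`, `pin` and the `load` callback are hypothetical names, capacity is counted in containers rather than bytes, and the buffer is allowed to exceed its capacity temporarily when every other entry is pinned.

```python
from collections import OrderedDict

class ContainerBuffer:
    """LRU container buffer that never evicts pinned containers."""

    def __init__(self, capacity, load):
        self.capacity = capacity      # maximum number of buffered containers
        self.load = load              # hypothetical callable: cid -> container
        self.pinned = set()           # work container and its similar containers
        self._items = OrderedDict()   # least recently used entry first

    def pin(self, cids):
        """Replace the pinned set when a new work container arrives."""
        self.pinned = set(cids)

    def get(self, cid):
        if cid in self._items:
            self._items.move_to_end(cid)          # mark as most recently used
        else:
            self._items[cid] = self.load(cid)     # read from the storage pool
            self._evict(keep=cid)
        return self._items[cid]

    def _evict(self, keep):
        for victim in list(self._items):          # oldest entries first
            if len(self._items) <= self.capacity:
                break
            if victim != keep and victim not in self.pinned:
                del self._items[victim]

    def __contains__(self, cid):
        return cid in self._items

# Usage: container 1 is pinned, so container 2 is evicted instead.
loads = []
buf = ContainerBuffer(2, lambda cid: loads.append(cid) or f"c{cid}")
buf.pin([1])
buf.get(1)
buf.get(2)
buf.get(3)
```

Calling `pin` with the new work container and its similar containers releases the old ones, which then become ordinary LRU candidates.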
The container storage pool resides on the disk device and is used to store containers. The container index also resides on the disk device; it maps each container identifier to the position in the container storage pool of the container that has that identifier.
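A minimal sketch of an append-only pool with its container index follows. It is an illustration under assumptions: an in-memory `io.BytesIO` object stands in for the disk device, the names are hypothetical, and the index here also records the container size for convenience, whereas the text stores the size in the container header and only the position in the index.

```python
import io

class ContainerPool:
    """Append-only container storage pool plus its container index."""

    def __init__(self, storage=None):
        # Assumption: a BytesIO object stands in for the on-disk pool.
        self.storage = storage if storage is not None else io.BytesIO()
        self.index = {}               # container index: cid -> (position, size)

    def append(self, cid, payload):
        """Write a container after the existing data and index its position."""
        position = self.storage.seek(0, io.SEEK_END)
        self.storage.write(payload)
        self.index[cid] = (position, len(payload))
        return position

    def read(self, cid):
        """Look up the position in the container index and read the container."""
        position, size = self.index[cid]
        self.storage.seek(position)
        return self.storage.read(size)

# Usage: two containers are appended one after the other.
pool = ContainerPool()
first = pool.append(1, b"aaaa")
second = pool.append(2, b"bb")
```

Since every write goes to the end of the pool, writes are sequential, and a read needs one index lookup followed by one contiguous read.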
The present invention proposes a Delta compression technique based on block-level data deduplication. Applied in the background of current mainstream block-level data deduplication storage systems, it performs Delta compression on similar data blocks and eliminates byte-level duplicate data, further improving the data deduplication ratio and storage space utilization. The invention identifies similar data blocks by computing and comparing the similar signatures of data blocks: data blocks with the same similar signature are similar data blocks. Processing is performed container by container, and background operations such as container compression, storage and recovery are transparent to the upper-layer system, so the component interfaces seamlessly with it. Data buffering and indexing techniques allow similar data to be found immediately and effectively improve the performance of Delta compression and of container reads and writes, so that the technique can meet the needs of large-scale, high-performance data backup. The specific advantages are as follows:
1. Delta compression is performed on similar data blocks, eliminating byte-level duplicate data and further improving the data deduplication ratio and storage space utilization.
2. The present invention can be used in the background without modifying the existing block-level data deduplication storage system.
3. Processing is performed container by container, which preserves the redundancy locality of the data stream; the similarity index, similar buffer and container buffer effectively improve background data processing performance.
4. Containers are appended sequentially to the on-disk container storage pool, which avoids random small-write disk I/O and gives high data read/write performance.
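To make the first advantage concrete, the sketch below Delta-compresses a data block against a similar reference block and computes the Delta compression ratio that is compared with the parameter R. The codec is a toy built on Python's `difflib.SequenceMatcher`, chosen only for illustration; the text does not specify a Delta encoding, and the per-operation size estimate in `delta_size` is an assumption.

```python
from difflib import SequenceMatcher

def delta_encode(reference, target):
    """Delta operation: describe target as copies from reference plus literals."""
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # bytes shared with the reference
        elif j2 > j1:
            ops.append(("add", target[j1:j2]))     # bytes unique to the target
    return ops

def delta_decode(reference, ops):
    """Delta inverse operation: rebuild the target from the reference block."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

def delta_size(ops):
    """Assumed encoded size: 8 bytes per copy, 4 bytes plus payload per add."""
    return sum(8 if op[0] == "copy" else 4 + len(op[1]) for op in ops)

def compression_ratio(target, ops):
    """Delta compression ratio: data block size divided by Delta block size."""
    return len(target) / max(delta_size(ops), 1)

# Usage: the target differs from the reference by three inserted bytes.
reference = bytes(range(256)) * 2
target = reference[:300] + b"XYZ" + reference[300:]
ops = delta_encode(reference, target)
accepted = compression_ratio(target, ops) >= 2.0   # compare with R = 2
```

Block-level deduplication would store both blocks in full because their fingerprints differ; the Delta block only has to carry the few bytes that are not already present in the reference block.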
Detailed description of the embodiments
The invention discloses a Delta compression storage component based on block-level data deduplication. As shown in Fig. 1, the Delta compression storage component comprises a container access module. The container access module runs a container storage algorithm and a container recovery algorithm using the similarity index, similar buffer and container buffer data structures. The container storage algorithm receives write-container commands sent by the upper-layer block-level data deduplication storage system, performs Delta compression on the container, and writes the Delta-compressed container into the on-disk container storage pool. The container recovery algorithm receives read-container commands sent by the upper-layer block-level data deduplication storage system, reads the specified container from the on-disk container storage pool by means of the container index, restores the container that was read, and returns it to the upper-layer block-level data deduplication storage system.
The Delta compression storage component runs in the background of the block-level data deduplication storage system. It performs Delta compression on the containers sent by that system, further eliminating byte-level duplicate data between similar data blocks and thereby improving the data deduplication ratio and storage space utilization. The invention identifies similar data blocks by computing and comparing the similar signatures of data blocks and processes data container by container; background operations such as container compression, storage and recovery are transparent to the upper-layer system, so the component interfaces seamlessly with it. The similarity index, similar buffer and container buffer allow similar data to be found immediately and effectively improve the performance of Delta compression and of container reads and writes, so that the technique can meet the needs of large-scale, high-performance data backup.
As shown in Fig. 2, the container storage algorithm comprises the following steps in order:
(201) Initialization:
First, the parameters S, R, Sr and L are read from a configuration file. The configuration file resides on the disk device and records the configuration information of the system.
The parameter S is a preset positive integer giving the maximum number of similar containers read from the on-disk container storage pool into memory when Delta compression is performed; similar containers are containers with similar content. If S is set too large, the Delta compression performance for data blocks decreases; if it is set too small, the Delta compression ratio of data blocks decreases. In practice, S may be set to 2, 3 or 4.
The parameter R is a preset positive number giving the minimum permitted Delta compression ratio; the Delta compression ratio is the ratio of the data block size to the Delta block size after a data block has been Delta-compressed against a reference block to produce a Delta block. In practice, R may be set to 2, 2.5 or 3.
The parameter Sr is a preset positive integer; 1/Sr is the hook signature sampling rate. The hook signature sampling rate is an important parameter. If it is too low, too few hook signatures are generated, which reduces the accuracy with which similar data blocks are found; if it is too high, too many hook signatures are generated, which makes the similarity index too large and the memory overhead high. In practice, depending on the scale of the system, Sr may take the value 64 or 32.
The parameter L is a preset positive integer giving the maximum Delta chain length. If L is set too small, the Delta compression effect is reduced; if it is set too large, data read/write performance decreases while the additional compression gain is small. In practice, L may be set to 5, 6 or 7.
Then, it is determined whether the system is in its initial configuration stage. If so, an empty similarity index is created in memory; if not, the backed-up similarity index is read from the configuration file into memory.
The similarity index is an in-memory hash table. As shown in Fig. 8, the hash table comprises an array of buckets; each bucket in the array has a number, and a hash function establishes the mapping between hook signatures and bucket numbers. A hook signature mapped to a bucket is stored in a hook signature node. Each hook signature node stores one unique hook signature and is associated with a container identifier queue, which stores the identifiers of the containers that contain that hook signature. As shown in Fig. 9, a hook signature node consists of a hook signature field, an overflow chain pointer field and a container identifier queue pointer field. The hook signature field stores one unique hook signature. The overflow chain pointer field is used to handle hash collisions and stores the address of another hook signature node mapped to the same bucket. The container identifier queue pointer field stores the head address of the container identifier queue associated with the hook signature node.
The similarity index maps each hook signature to the containers that contain it. More than one container may contain the same hook signature, so containers that share hook signatures can be found quickly by querying the similarity index. The present invention treats containers that share hook signatures as similar containers, that is, containers with similar content.
In this embodiment, the similarity index resides in memory so that it can be queried quickly. The method of determining whether the system is in its initial configuration stage is mature prior art.
Finally, an empty similar buffer is created in memory; an empty container buffer is created in memory to hold containers read into memory from the on-disk container storage pool; and an empty stack, denoted Stack, is created in memory.
The similarity buffer is an in-memory hash table. As shown in Fig. 10, the in-memory hash table comprises a bucket array; each bucket in the array corresponds to a number, and a hash function establishes the mapping between similar signatures and bucket numbers. The similar signatures mapped to a bucket are stored in similar signature nodes. Each similar signature node stores one unique similar signature and is associated with one index node linked list; the list holds index nodes, each of which stores the information of one data block or Delta block having that similar signature. The similar signature of a Delta block is the similar signature of the data block to which the Delta block corresponds.
As shown in Fig. 11, the similar signature node consists of a similar signature field, an overflow chain pointer field and an index node list pointer field. The similar signature field stores one unique similar signature; the overflow chain pointer field handles hash collisions by storing the address of another similar signature node mapped to the same bucket; the index node list pointer field stores the head address of the index node linked list associated with the similar signature node.
As shown in Fig. 12, the index node consists of a type field, a container identifier field, an offset field and a list pointer field. The type field stores a data block flag or a Delta block flag; the container identifier field and the offset field together give the address of the data block or Delta block; the list pointer field stores the address of the next index node in the index node linked list.
In the similarity buffer, each similar signature node is associated with one index node linked list; each index node in the list stores the index information of one data block or Delta block, and all data blocks and Delta blocks in the same list have the same similar signature.
In the present embodiment, the similarity buffer resides in memory so that the reference block of a data block to be compressed can be found quickly during Delta compression.
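A minimal sketch of the similarity buffer, given for illustration only, is shown below in Python. The names `IndexNode`, `SimilarityBuffer` and the flag values are hypothetical, a dict of lists stands in for the hash table and linked lists, and the `candidates` helper merely reflects the order in which steps (212) to (223) examine the nodes (data block nodes first, Delta block nodes afterwards).

```python
from collections import defaultdict
from dataclasses import dataclass

DATA_BLOCK, DELTA_BLOCK = 0, 1  # hypothetical flag values


@dataclass(frozen=True)
class IndexNode:
    kind: int          # DATA_BLOCK or DELTA_BLOCK
    container_id: int  # container that holds the block
    offset: int        # position of the block in that container's data area


class SimilarityBuffer:
    """In-memory map: similar signature -> list of index nodes."""

    def __init__(self):
        self._lists = defaultdict(list)

    def insert(self, sign, node):
        """Insert the mapping <sign, node>."""
        self._lists[sign].append(node)

    def candidates(self, sign):
        """Potential reference blocks: data block nodes, then Delta block nodes."""
        nodes = self._lists.get(sign, [])
        return ([n for n in nodes if n.kind == DATA_BLOCK]
                + [n for n in nodes if n.kind == DELTA_BLOCK])

    def clear(self):
        """Empty the buffer, as done when a new container is received."""
        self._lists.clear()


buf = SimilarityBuffer()
buf.insert(42, IndexNode(DELTA_BLOCK, container_id=3, offset=128))
buf.insert(42, IndexNode(DATA_BLOCK, container_id=5, offset=0))
order = [n.kind for n in buf.candidates(42)]
```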
The container buffer is a well-known in-memory linked list; containers read into the container buffer are linked into this list. When the container buffer is full, the well-known least-recently-used (LRU) replacement algorithm is used to delete a container from the list. The working container and its similar containers remain in the container buffer until a new working container and its similar containers are read in; only then may the old working container and its similar containers be selected by the replacement algorithm and deleted from the container buffer.
The container buffer helps improve read and write performance. Because containers preserve the redundancy locality of the data stream, the blocks similar to the data blocks of one container are very likely to be located together in one container as well. Reading a whole container from disk at a time therefore avoids random small-write disk I/O, raises the buffer hit rate, and reduces the number of disk reads and writes.
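The LRU behaviour of the container buffer can be sketched as follows, under stated assumptions: `ContainerBuffer`, its `capacity` (counted in containers) and the `load_from_pool` callback are hypothetical names, and an `OrderedDict` replaces the linked list. The sketch does not pin the working container and its similar containers; it only shows the replacement policy.

```python
from collections import OrderedDict


class ContainerBuffer:
    """LRU cache of containers keyed by container identifier."""

    def __init__(self, capacity, load_from_pool):
        self.capacity = capacity
        self._load = load_from_pool  # hypothetical callback: cid -> container
        self._items = OrderedDict()  # oldest entry first
        self.disk_reads = 0

    def __contains__(self, cid):
        return cid in self._items

    def put(self, cid, container):
        self._items[cid] = container
        self._items.move_to_end(cid)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def get(self, cid):
        if cid in self._items:
            self._items.move_to_end(cid)  # mark as most recently used
            return self._items[cid]
        container = self._load(cid)       # miss: read the whole container
        self.disk_reads += 1
        self.put(cid, container)
        return container


cache = ContainerBuffer(2, lambda cid: f"container-{cid}")
cache.get(1)
cache.get(2)
cache.get(1)  # hit, container 1 becomes most recently used
cache.get(3)  # miss, evicts container 2
```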
(202) Receive container: receive a write-container command sent by the upper-layer block-level data deduplication storage system, and extract the container to be written from the command, denoted the upper-layer container; generate an empty formatted container in memory, denoted the working container, and write the working container into the container buffer; empty the similarity buffer for later use;
The upper-layer container is the container used by the upper-layer block-level data deduplication storage system; the formatted container is the container used by the container access module of the Delta compression storage assembly.
As shown in Fig. 4, a container consists of a metadata area and a data area. The metadata area is read and written from top to bottom and the data area from bottom to top; once both have been written, they are joined and packed into a container. The metadata area consists of a container header, a fingerprint area, a similar signature area and a hook signature area. The container header consists of a container identifier field, a size field and a container signature field, which store the container's identifier, the container's size and the container's similar signature respectively. The fingerprint area stores data block fingerprints, and the similar signature area stores similar signature blocks. As shown in Fig. 5, a similar signature block consists of a similar signature field, a type field and an offset field: the similar signature field stores the similar signature of the corresponding data block, the type field stores a data block flag or a Delta block flag, and the offset field stores the address of the corresponding data block or Delta block within the data area. The hook signature area stores hook signatures.
The data area stores data blocks and Delta blocks. When a data block is stored in the data area, a data block header is prepended to it; as shown in Fig. 6, the data block header consists of a data block flag and a data block size field. When a Delta block is stored in the data area, a Delta block header is prepended to it; as shown in Fig. 7, the Delta block header consists of a Delta block flag, a Delta block size field and a reference block address field, and the reference block address field consists of a container identifier field and an offset field;
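For illustration, the data block header and Delta block header could be serialized as in the following Python sketch. The field widths, byte order and flag values are assumptions (the embodiment does not fix them), the function names are hypothetical, and for simplicity the blocks are appended in increasing address order rather than from bottom to top.

```python
import struct

DATA_FLAG, DELTA_FLAG = 0x00, 0x01      # hypothetical flag values
DATA_HEADER = struct.Struct("<BI")      # flag, block size
DELTA_HEADER = struct.Struct("<BIIQ")   # flag, size, ref container id, ref offset


def pack_data_block(payload: bytes) -> bytes:
    """Prepend a data block header to a data block."""
    return DATA_HEADER.pack(DATA_FLAG, len(payload)) + payload


def pack_delta_block(delta: bytes, ref_cid: int, ref_offset: int) -> bytes:
    """Prepend a Delta block header carrying the reference block address."""
    return DELTA_HEADER.pack(DELTA_FLAG, len(delta), ref_cid, ref_offset) + delta


def unpack_block(area: bytes, pos: int = 0):
    """Return (kind, payload, reference address or None, next position)."""
    flag = area[pos]
    if flag == DATA_FLAG:
        _, size = DATA_HEADER.unpack_from(area, pos)
        start = pos + DATA_HEADER.size
        return "data", area[start:start + size], None, start + size
    if flag == DELTA_FLAG:
        _, size, cid, offset = DELTA_HEADER.unpack_from(area, pos)
        start = pos + DELTA_HEADER.size
        return "delta", area[start:start + size], (cid, offset), start + size
    raise ValueError(f"unknown block flag {flag}")


data_area = pack_data_block(b"hello") + pack_delta_block(b"xy", 3, 40)
first = unpack_block(data_area)
second = unpack_block(data_area, first[3])
```

Because each header carries the block size, the data area can be walked object by object, which is what the recovery algorithm needs when it moves its read pointer from one object to the next.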
(203) Copy fingerprints: read the container identifier from the metadata area of the upper-layer container and write it into the container identifier field of the working container's header; read the data block fingerprints from the metadata area of the upper-layer container and write them, in their original order, into the fingerprint area of the working container;
(204) Compute similar signatures: compute, in order, the similar signature of each data block in the data area of the upper-layer container; for each similar signature, generate a similar signature block and write the similar signature into its similar signature field; write the similar signature blocks into the similar signature area of the working container in the order of their corresponding data blocks;
The method of computing the similar signature of a data block is mature prior art, as follows: starting from the beginning of the data block, slide a window of fixed size over the data block; each time the window advances by one byte, use the well-known Rabin fingerprint algorithm to compute the Rabin fingerprint of the data fragment that falls within the window; take the smallest Rabin fingerprint over all data fragments as the similar signature of the data block.
In the present embodiment, the window size is a predetermined constant, for example 512 bytes, and the length of the Rabin fingerprint may be 4 bytes.
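The sliding-window computation can be sketched in Python as below. This is not a true Rabin fingerprint: a simple polynomial rolling hash modulo a 32-bit prime is used as a stand-in, and the parameters `base` and `mod`, the function name, and the handling of blocks shorter than the window are all assumptions.

```python
def similar_signature(block: bytes, window: int = 512,
                      base: int = 257, mod: int = (1 << 32) - 5) -> int:
    """Smallest rolling-hash value over all windows of the block."""
    if not block:
        raise ValueError("empty data block")
    w = min(window, len(block))   # assumption: a short block is one window
    top = pow(base, w - 1, mod)   # weight of the byte that leaves the window
    h = 0
    for byte in block[:w]:        # hash of the first window
        h = (h * base + byte) % mod
    smallest = h
    for i in range(w, len(block)):
        # Slide by one byte: drop block[i - w], append block[i].
        h = ((h - block[i - w] * top) * base + block[i]) % mod
        smallest = min(smallest, h)
    return smallest


sample = bytes((i * 37 + 11) % 256 for i in range(300))
signature = similar_signature(sample, window=16)
```

Taking the minimum over all windows means that two blocks sharing most of their content are likely to share the window that yields the minimum, and hence the same similar signature.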
(205) Extract hook signatures: sample all similar signatures contained in the working container at a ratio of 1/Sr, take the sampled similar signatures as hook signatures, and write them in order into the hook signature area of the working container; take the smallest of all similar signatures contained in the working container as the similar signature of the container, and write it into the container signature field of the working container's header;
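Step (205) does not say how the 1/Sr sample is drawn. The following Python sketch assumes a content-based rule (keep the similar signatures divisible by Sr), which is one common way to obtain roughly a 1/Sr sample; the function names are hypothetical and the removal of duplicate hooks is likewise an assumption.

```python
def extract_hooks(similar_signatures, sr):
    """Sample roughly 1/sr of the signatures as hook signatures."""
    hooks, seen = [], set()
    for sig in similar_signatures:
        # Assumed sampling rule: a signature is a hook when sig mod sr == 0.
        if sig % sr == 0 and sig not in seen:
            seen.add(sig)
            hooks.append(sig)
    return hooks


def container_signature(similar_signatures):
    """The smallest similar signature serves as the container's signature."""
    return min(similar_signatures)


signatures = [12, 7, 40, 12, 33, 8]
hooks = extract_hooks(signatures, sr=4)
csig = container_signature(signatures)
```

A content-based rule has the property that a given similar signature is either a hook in every container or in none, so two containers holding the same signature are sampled consistently.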
(206) Update similarity index:
First: assign the container identifier of the working container to variable cid;
Next: for each hook signature in the hook signature area of the working container, do the following: assign the hook signature to variable hook, generate a mapping <hook, cid>, and insert <hook, cid> into the similarity index;
In the present embodiment, inserting <hook, cid> into the similarity index is equivalent to inserting <key, value> into an in-memory hash table, which is mature prior art.
(207) Find similar containers: query the similarity index to find the containers that share hook signatures with the working container; if none is found, go to step (228); otherwise, from the containers found, select at most S containers in descending order of the number of hook signatures they share with the working container, and treat the selected containers as the similar containers of the working container; look up these similar containers in the container buffer, and read those not in the container buffer from the container storage pool into the container buffer;
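The selection in step (207) may be sketched as follows. A plain dict from hook signature to container identifiers stands in for the similarity index, the function name is hypothetical, and excluding the working container's own identifier (`self_cid`) is an assumption made because step (206) has already inserted it into the index.

```python
from collections import Counter


def find_similar_containers(index, hooks, s, self_cid=None):
    """Return at most s container ids, most shared hook signatures first."""
    shared = Counter()
    for hook in hooks:
        for cid in index.get(hook, ()):
            if cid != self_cid:      # assumption: skip the working container
                shared[cid] += 1
    return [cid for cid, _ in shared.most_common(s)]


index = {1: [10, 11], 2: [10, 12], 3: [10, 11], 4: [99]}
similar = find_similar_containers(index, hooks=[1, 2, 3], s=2)
```

An empty result corresponds to the case in which no similar container is found and the algorithm goes to step (228).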
(208) Write similarity buffer:
Scan in turn the similar signature area of each similar container of the working container in the container buffer, and read all similar signature blocks in it; for each similar signature block read, do the following: generate a similarity buffer index node in memory, denoted Node; copy the type field value and offset field value of the similar signature block to the type field and offset field of Node respectively; write the container identifier of the container to which the similar signature block belongs into the container identifier field of Node; read the similar signature field of the similar signature block, denote the similar signature read as sign, and insert <sign, Node> into the similarity buffer;
In the present embodiment, inserting <sign, Node> into the similarity buffer is equivalent to inserting <key, value> into an in-memory hash table, which is mature prior art.
(209) Prepare to process data blocks: set a read pointer P1 pointing to the first similar signature block in the similar signature area of the working container, and set a read pointer P2 pointing to the first data block in the data area of the upper-layer container;
(210) Read data block: read the data block pointed to by P2 from the data area of the upper-layer container, denoted Dr; read the similar signature block pointed to by P1 from the similar signature area of the working container, denoted Block; read the value of the similar signature field of Block, denoted sign1;
(211) Search similarity buffer: search the similarity buffer for the similar signature node whose similar signature field value is sign1; if it is not found, go to step (224); otherwise, set a read pointer P3 pointing to the first index node of the index node linked list of the similar signature node just found, and go to the next step;
Before Delta compression is applied to data block Dr, a data block whose content is similar to Dr must be found to serve as the reference block. In the present embodiment, the similar signature sign1 of Dr is used as the key to look up the corresponding similar signature node in the similarity buffer. If the node is not found, no reference block exists for Dr, and the algorithm goes to step (224) to store Dr as is. If the node is found, its index node linked list holds the information of all potential reference blocks of Dr, and the reference block of Dr is then found by traversing that list.
In steps (212), (213) and (214) below, data block index nodes are examined first; Delta block index nodes are examined only when none of the data blocks pointed to by the data block index nodes is suitable as the reference block of Dr. This improves Delta compression performance, because a data block can be used directly as a reference block, whereas a Delta block must first be converted into a data block before it can serve as a reference block; the conversion involves traversing the Delta chain and performing Delta inverse operations, and its time overhead is large.
(212) Check for data block: if the type field value of the index node pointed to by P3 is the Delta block flag, go to step (214); otherwise it is the data block flag, go to the next step;
(213) Short-chain Delta operation: assign the value of the container identifier field and the value of the offset field of the index node pointed to by P3 to variables cid0 and offset0 respectively, and read a data block from address <cid0, offset0> in the container buffer, denoted D0; perform the Delta operation on Dr with D0 as the reference block, generating Delta block △0,r; if the Delta compression ratio is greater than or equal to R, compression succeeds, go to step (225); otherwise compression fails, go to the next step;
(214) Skip Delta block: advance P3 one step so that it points to the next index node; if P3 is not null, go to step (212); otherwise the tail of the index node linked list has been reached: point P3 again at the first index node of the list and go to the next step;
Delta block index nodes are examined only when none of the data blocks pointed to by the data block index nodes is suitable as the reference block of Dr. Steps (215) to (223) below examine and process each Delta block index node in the index node linked list in turn, until either a suitable reference block is found, Delta compression of Dr succeeds and the algorithm goes to step (225) to store the Delta block of Dr, or no suitable reference block can be found and the algorithm goes to step (224) to store Dr as is.
For any Delta block index node in the index node linked list, examination and processing comprise two procedures: the first traverses the Delta chain, and the second tests reference blocks.
Steps (216) to (219) below are the Delta chain traversal procedure: starting from the Delta block pointed to by the Delta block index node, each Delta block on the Delta chain is read along the direction of the chain until the data block at the end of the chain is reached.
Steps (220) to (222) below are the reference block test procedure. The data block at the end of the Delta chain is first used as the reference block for Delta compression of Dr; if compression succeeds, the algorithm goes to step (225) to store the Delta block of Dr. Otherwise, Delta inverse operations are performed against the direction of the Delta chain, and each data block produced by a Delta inverse operation is used as the reference block for Delta compression of Dr, until either compression succeeds and the algorithm goes to step (225) to store the Delta block of Dr, or all Delta blocks in the chain have been tested without finding a suitable reference block, in which case the algorithm goes to step (223) to examine and process the next Delta block index node in the index node linked list.
(215) Check for Delta block: if the type field value of the index node pointed to by P3 is the data block flag, go to step (223); otherwise it is the Delta block flag, go to the next step;
(216) Read Delta block: the value of the container identifier field and the value of the offset field of the index node pointed to by P3 specify the address of a Delta block in the container buffer; read the Delta block from that address;
(217) Push onto stack: push the Delta block onto Stack; read the reference block address from the Delta block header of the Delta block and store it in variable <cid1, offset1>, where cid1 is a container identifier and offset1 is the position of the reference block in the data area of container cid1;
(218) Read reference block: if container cid1 is in the container buffer, read reference block <cid1, offset1> from the container buffer; otherwise, first read container cid1 from the container storage pool into the container buffer, then read reference block <cid1, offset1>;
(219) Check reference block: if reference block <cid1, offset1> is a Delta block, go to step (217); otherwise it is a data block: store the content of the data block in variable D0, assign <cid1, offset1> to variable <cid0, offset0>, assign 1 to variable length, and go to the next step;
(220) Long-chain Delta operation: perform the Delta operation on Dr with D0 as the reference block, generating Delta block △0,r; if the Delta compression ratio is greater than or equal to R and length is less than or equal to L, compression succeeds, go to step (225); otherwise compression fails, go to the next step;
(221) Check stack: if Stack is empty, go to step (223); otherwise go to the next step;
(222) Pop stack: pop a Delta block from Stack, denoted △, and store the address of △ in variable <cid0, offset0>; perform the Delta inverse operation on D0 and △, store the result in variable D0, increase the value of variable length by 1, and go to step (220);
(223) Skip data block: advance P3 one step so that it points to the next index node; if P3 is not null, go to step (215); otherwise the tail of the index node linked list has been reached, go to the next step;
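The handling of one Delta block index node in steps (216) to (222) can be sketched as follows. Several assumptions apply: zlib with a preset dictionary is used purely as a stand-in for a Delta tool, the compression ratio is taken to be original size divided by Delta size, and `blocks`, `search_chain` and the tuple layout are hypothetical names for what the container buffer would provide.

```python
import random
import zlib


def delta(ref: bytes, target: bytes) -> bytes:
    """Delta operation stand-in: deflate target with ref as preset dictionary."""
    enc = zlib.compressobj(zdict=ref)
    return enc.compress(target) + enc.flush()


def undelta(ref: bytes, patch: bytes) -> bytes:
    """Delta inverse operation: rebuild the target from ref and patch."""
    dec = zlib.decompressobj(zdict=ref)
    return dec.decompress(patch) + dec.flush()


def search_chain(blocks, start, target, R, L):
    """blocks maps an address to ("data", payload) or ("delta", patch, ref_addr).
    Returns (patch, ref_addr) for the first acceptable reference, else None."""
    stack, addr = [], start
    while blocks[addr][0] == "delta":   # steps (216)-(219): walk to the chain end
        stack.append(addr)
        addr = blocks[addr][2]
    d0, ref_addr, length = blocks[addr][1], addr, 1
    while True:
        patch = delta(d0, target)       # step (220)
        # Assumption: compression ratio = original size / Delta size.
        if len(target) / len(patch) >= R and length <= L:
            return patch, ref_addr
        if not stack:                   # step (221)
            return None
        ref_addr = stack.pop()          # step (222)
        d0 = undelta(d0, blocks[ref_addr][1])
        length += 1


rnd = random.Random(7)                  # deterministic, hard-to-compress payloads


def noise(n):
    return bytes(rnd.getrandbits(8) for _ in range(n))


a = noise(2000)
b = a[:200] + noise(1800)               # mostly unrelated to a
t = b[:1000] + b"!" + b[1001:]          # at most one byte away from b
blocks = {(1, 0): ("data", a), (2, 0): ("delta", delta(a, b), (1, 0))}
found = search_chain(blocks, (2, 0), t, R=4, L=2)
```

In this example the data block at the end of the chain is a poor reference for `t`, so the search pops the Delta block, rebuilds `b`, and accepts it as the reference at chain length 2; with L set to 1 the same search is rejected.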
(224) Store data block: prepend a data block header to data block Dr, and write the data block flag and the size of Dr into the header; if the data area of the working container is not empty, append Dr with its header after the existing data in the data area of the working container; otherwise, write Dr with its header at the start position of the data area of the working container; write the data block flag into the type field of Block, write the position of Dr in the data area of the working container into the offset field of Block, and go to step (226);
(225) Store Delta block: prepend a Delta block header to Delta block △0,r, and write into the header the Delta block flag, the size of △0,r and the reference block address <cid0, offset0> of △0,r; if the data area of the working container is not empty, append △0,r with its header after the existing data in the data area of the working container; otherwise, write △0,r with its header at the start position of the data area of the working container; write the Delta block flag into the type field of Block, and write the position of △0,r in the data area of the working container into the offset field of Block; empty Stack;
(226) Update similarity buffer: generate a similarity buffer index node in memory, denoted Node1; copy the type field value and offset field value of Block to the type field and offset field of Node1 respectively; write the container identifier of the working container into the container identifier field of Node1; insert <sign1, Node1> into the similarity buffer;
In the present embodiment, inserting <sign1, Node1> into the similarity buffer is equivalent to inserting <key, value> into an in-memory hash table, which is mature prior art.
(227) Check whether all data blocks are processed: advance P1 one step so that it points to the next similar signature block in the similar signature area of the working container, and advance P2 one step so that it points to the next data block in the data area of the upper-layer container; if P2 is null, all data blocks in the upper-layer container have been processed, go to step (229); otherwise, go to step (210);
(228) Store original container:
First: starting from the first data block, process each data block in the data area of the upper-layer container in turn: prepend a data block header to the data block, and write the data block flag and the size of the data block into the header; if the data area of the working container is not empty, append the data block with its header after the existing data in the data area of the working container; otherwise, write the data block with its header at the start position of the data area of the working container; then process the similar signature block that corresponds to this data block in the similar signature area of the working container: write the data block flag into the type field of the similar signature block, and write the position of the data block in the data area of the working container into the offset field of the similar signature block;
Next: discard the upper-layer container; compute the size of the working container and write it into the working container's header; if the container storage pool is not empty, append the working container after the existing data in the container storage pool; otherwise, write the working container at the start position of the container storage pool;
Finally: write the container identifier of the working container just written to the container storage pool, together with the position of the working container in the container storage pool, into the container index; go to step (230);
(229) Store new container:
First: discard the upper-layer container; compute the size of the working container and write it into the working container's header; if the container storage pool is not empty, append the working container after the existing data in the container storage pool; otherwise, write the working container at the start position of the container storage pool;
Next: write the container identifier of the working container just written to the container storage pool, together with the position of the working container in the container storage pool, into the container index;
The container storage pool resides on the disk device and is used to store containers.
The container index resides on the disk device and establishes the mapping between a container identifier and the position in the container storage pool of the container with that identifier.
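As a minimal sketch of the container storage pool and container index, the following Python code appends containers to a byte stream and records where each one starts. An `io.BytesIO` object stands in for the on-disk pool, the class name is hypothetical, and keeping the container size in the index (rather than reading it from the container header) is a simplifying assumption.

```python
import io


class ContainerPool:
    """Append-only container store with an identifier -> position index."""

    def __init__(self):
        self._pool = io.BytesIO()  # stand-in for the pool on the disk device
        self.index = {}            # container id -> (position, size)

    def append(self, cid, container_bytes):
        """Append a container after the existing data and index it."""
        position = self._pool.seek(0, io.SEEK_END)  # 0 when the pool is empty
        self._pool.write(container_bytes)
        self.index[cid] = (position, len(container_bytes))
        return position

    def read(self, cid):
        """Read a whole container by looking up its position in the index."""
        position, size = self.index[cid]
        self._pool.seek(position)
        return self._pool.read(size)


pool = ContainerPool()
first_pos = pool.append(5, b"AAAA")
second_pos = pool.append(6, b"BBBBBB")
```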
(230) Check for end of run: determine whether an end-of-run instruction has been received; if not, go to step (202); if so, go to the next step;
(231) End:
First: stop receiving containers sent by the upper-layer block-level data deduplication storage system;
Then: write the similarity index in memory to the configuration file;
Finally: destroy the similarity index, the container buffer, the similarity buffer and Stack in memory, then exit.
As shown in Fig. 3, the container recovery algorithm comprises the following steps in order:
(301) Initialize: generate an empty container buffer in memory, used to hold containers read into memory from the container storage pool on disk; generate an empty stack in memory, denoted Stack;
(302) Receive read command: receive a read-container command sent by the upper-layer block-level data deduplication storage system, and extract the container identifier from the command, denoted cid; generate an empty container in the upper-layer format in memory, denoted the upper-layer container;
(303) Read container: read the container whose container identifier is cid from the container storage pool on disk, denote it the working container, and write the working container into the container buffer;
(304) Restore metadata: according to the types and format requirements of the upper-layer container's metadata, read the corresponding metadata from the metadata area of the working container, and write the metadata read into the metadata area of the upper-layer container;
(305) Prepare to process data area: set a read pointer P1 pointing to the first object in the data area of the working container;
(306) Check object: if the object pointed to by P1 is a data block, denote the data block Dr and go to step (312); otherwise it is a Delta block, go to the next step;
Steps (307) to (311) below restore the Delta block pointed to by P1 to a data block. The operation comprises two procedures: the first traverses the Delta chain, and the second performs the inverse operation along the Delta chain.
Steps (307), (308) and (309) below are the Delta chain traversal procedure: starting from the Delta block pointed to by P1, each Delta block on the Delta chain is read along the direction of the chain until the data block at the end of the chain is reached.
Steps (310) and (311) below perform the inverse operation along the Delta chain, that is, Delta inverse operations are carried out against the direction of the Delta chain until the Delta block pointed to by P1 is restored to a data block.
(307) Push onto stack: push the Delta block onto Stack; read the reference block address from the Delta block header of the Delta block, and denote the address read as <cid1, offset1>, where cid1 is a container identifier and offset1 is the position of the reference block in the data area of container cid1;
(308) Read reference block: if container cid1 is in the container buffer, read reference block <cid1, offset1> from the container buffer; otherwise, first read container cid1 from the container storage pool into the container buffer, then read reference block <cid1, offset1>;
(309) Check reference block: if reference block <cid1, offset1> is a Delta block, go to step (307); otherwise it is a data block: store the data block in variable D and go to the next step;
(310) Pop stack: pop a Delta block from Stack, denoted △; perform the Delta inverse operation on D and △, and store the result in variable D;
(311) Check stack: if Stack is not empty, go to step (310); otherwise, denote the content of variable D as data block Dr and go to the next step;
(312) Copy data block: if the data area of the upper-layer container is not empty, append data block Dr after the existing data in the data area of the upper-layer container; otherwise, write data block Dr at the start position of the data area of the upper-layer container;
(313) Check data area: advance read pointer P1 one step so that it points to the next object in the data area of the working container; if P1 is not null, go to step (306); otherwise the data area has been fully processed, go to the next step;
(314) Check for end of run: send the finished upper-layer container to the upper-layer block-level data deduplication storage system; if no end-of-run command has been received, go to step (302); otherwise, go to the next step;
(315) End: destroy the container buffer and Stack, then exit.
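Steps (306) to (311) of the recovery algorithm can be sketched as below, for illustration only. zlib with a preset dictionary is again an assumed stand-in for a Delta tool, and `blocks`, `restore` and the tuple layout are hypothetical names for the objects that would be read from the container buffer.

```python
import zlib


def delta(ref: bytes, target: bytes) -> bytes:
    """Delta operation stand-in: deflate target with ref as preset dictionary."""
    enc = zlib.compressobj(zdict=ref)
    return enc.compress(target) + enc.flush()


def undelta(ref: bytes, patch: bytes) -> bytes:
    """Delta inverse operation: rebuild the target from ref and patch."""
    dec = zlib.decompressobj(zdict=ref)
    return dec.decompress(patch) + dec.flush()


def restore(blocks, addr):
    """blocks maps an address to ("data", payload) or ("delta", patch, ref_addr).
    Returns the data block that the object at addr represents."""
    stack = []
    while blocks[addr][0] == "delta":   # steps (307)-(309): traverse the chain
        stack.append(blocks[addr][1])
        addr = blocks[addr][2]
    d = blocks[addr][1]                 # data block at the end of the chain
    while stack:                        # steps (310)-(311): inverse operations
        d = undelta(d, stack.pop())
    return d


v1 = b"the quick brown fox jumps over the lazy dog. " * 4
v2 = v1.replace(b"quick", b"slow")
v3 = v2 + b"the end."
blocks = {
    (1, 0): ("data", v1),
    (2, 0): ("delta", delta(v1, v2), (1, 0)),
    (3, 0): ("delta", delta(v2, v3), (2, 0)),
}
restored = restore(blocks, (3, 0))
```

The stack reverses the traversal order, so the Delta block nearest the data block is applied first and the Delta block originally pointed to is applied last.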
In implementing the above container storage algorithm and container recovery algorithm, the Delta operation and the Delta inverse operation may use a Delta compression tool such as vdelta, xdelta or zdelta; such Delta compression tools are mature prior art.
In addition to executing the write-container commands and read-container commands sent by the upper-layer block-level data deduplication storage system, in implementation the Delta compression storage assembly also executes the read-container-metadata commands sent by the upper-layer block-level data deduplication storage system, including the read-container-fingerprint command. When executing a read-container-metadata command, the Delta compression storage assembly reads the metadata area of the specified container from the container storage pool on disk and sends the specified metadata to the upper-layer block-level data deduplication storage system.